Design and Implementation of H.264/MPEG-4 AVC Encoder for SDTV/HDTV Application
Date Issued
2004
Author(s)
Chen, Tung-Chien
DOI
en-US
Abstract
The new video coding standard, H.264/AVC, developed by the Joint Video Team (JVT), significantly outperforms previous standards in compression thanks to new features including motion estimation (ME) with variable block sizes and multiple reference frames, intra prediction, context-based adaptive variable length coding (CAVLC), context-based adaptive binary arithmetic coding (CABAC), an in-loop deblocking filter, and more. Compared with MPEG-4, H.263, and MPEG-2, H.264/AVC achieves bit-rate reductions of 39%, 49%, and 64%, respectively. The penalty is huge computational complexity: up to 3.6 tera-instructions per second of computation and 5.6 terabytes per second of memory access are required for baseline profile level 3.1 (one reference frame and H±64/V±32 full search). Hardware acceleration is therefore a must for real-time applications. However, the reference software processes each block in the macroblock (MB) sequentially, which creates data dependencies that hinder parallel processing and MB pipelining. A video coding system with the traditional two-stage MB pipeline, prediction (ME) and block engine (BE = MC + DCT + Q + IQ + IDCT + VLC), cannot be applied to H.264/MPEG-4 AVC efficiently, because the prediction procedures are much more complex and the reconstruction loop should not be separated from prediction.
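To make the cost of full search concrete, the following is a minimal software sketch of exhaustive block matching with the sum of absolute differences (SAD) as the matching cost. It is not taken from the thesis or from the reference software; the function names, the frame representation (lists of rows of integer samples), and the handling of frame borders (out-of-frame candidates are skipped) are assumptions made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, range_h, range_v):
    """Evaluate every candidate in [-range_h, range_h] x [-range_v, range_v]
    and return (best_sad, dx, dy). Candidates that fall outside the
    reference frame are skipped (an assumption of this sketch)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-range_v, range_v + 1):
        for dx in range(-range_h, range_h + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width and
                      0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate position costs n x n absolute differences, and an encoder repeats this for each macroblock, which is why a wide search range dominates the instruction and memory-access budget.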
In this thesis, the first H.264/MPEG-4 AVC VLSI encoding system is proposed. According to our analysis, five major functions, integer motion estimation (IME), fractional motion estimation (FME), intra prediction (INTRA), entropy coding (EC), and deblocking (DB), are mapped into four MB pipeline stages with hardware-oriented algorithms and sophisticated scheduling to enable parallel processing and MB pipelining. The bandwidth requirement is reduced by using shared memories and local data transmission. Compared with the reference software, the improved Lagrangian multiplier enhances the compressed video quality by up to 1.2 dB at high bit rates for large frame sizes with large motion. To support the new features of H.264/MPEG-4 AVC in each MB pipeline stage, several new architectures are proposed. In the IME stage, a parallel array of eight 128-PE SAD trees is designed with a snake-scan data flow to achieve 100% processing element (PE) utilization and low on-chip SRAM bandwidth. Reuse of the overlapped search area saves 87.5% of the off-chip bandwidth. In the FME stage, we analyze the Lagrangian inter-mode decision loops and provide decomposition methodologies to obtain an optimized projection onto hardware. The proposed architecture provides 36-fold parallelism per reference frame and is characterized by regular data flow and high utilization. In the INTRA stage, a reconfigurable intra predictor generator and a parallel multi-transform engine are applied. In addition, an interleaved schedule and the proposed partial distortion elimination (PDE) scheme are used to meet the real-time constraint with only four-fold parallelism. In the DB stage, an interleaved memory organization and an 8x4-pixel array with a reconfigurable data path support the 2-D filter with only one parallel-in parallel-out reconfigurable 1-D filter. Finally, in the EC stage, a highly utilized CAVLC engine is realized with dual scan buffers for 4x4-block-level pipelining, and a 96-bit packer is proposed to support conversion from raw byte sequence payload (RBSP) to encapsulated byte sequence payload (EBSP).
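The RBSP-to-EBSP conversion can be illustrated at the byte level. The sketch below assumes that EBSP denotes the payload after emulation prevention bytes have been inserted, following the rule in the H.264 NAL unit syntax: whenever two consecutive zero bytes are followed by a byte of value 0x03 or less, a 0x03 byte is inserted before it. The function name is hypothetical, and the code says nothing about how the 96-bit hardware packer in the thesis is organized.

```python
def rbsp_to_ebsp(rbsp: bytes) -> bytes:
    """Insert emulation prevention bytes (0x03) so that the byte patterns
    00 00 00, 00 00 01, 00 00 02 and 00 00 03 never appear in the output."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes just written to the output
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # break up the forbidden three-byte pattern
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)


if __name__ == "__main__":
    print(rbsp_to_ebsp(bytes([0x00, 0x00, 0x01])).hex())
```

Because the inserted byte resets the zero counter, a long run of zero bytes receives one 0x03 after every second zero, so a start code prefix can never be emulated inside the payload.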
A prototype chip is implemented using the Artisan 0.18 µm standard CMOS cell library with UMC 0.18 µm 1P6M technology. The total gate count is about 970K, synthesized at 120 MHz. The chip supports H.264/MPEG-4 AVC encoding in baseline profile at level 3.0 with four reference frames at an operating frequency of 81 MHz, and at level 3.1 with one reference frame at an operating frequency of 108 MHz. The maximum processing capability is 108K MBs per second, that is, HDTV 720p (1280x720) 4:2:0 video at 30 Hz. In total, 34.72 Kbytes of on-chip memory and 3.11 Mbytes of off-chip memory are required. The core size is 7.68x4.13 mm^2. The average power dissipation is 635 mW at 120 MHz under a 1.8 V power supply.
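The throughput figures can be checked with a short calculation. The sketch below derives the macroblock rate for 720p at 30 Hz and the resulting cycle budget per macroblock at the stated clock frequencies. The function names are illustrative, and the level 3.0 rate of 40,500 MBs per second is an assumption taken from the level limits of the H.264 standard, not a figure stated in the abstract.

```python
def macroblocks_per_second(width, height, fps, mb_size=16):
    """Macroblock rate for a frame of width x height at fps frames/s.
    Dimensions are rounded up to whole macroblocks."""
    mbs_wide = -(-width // mb_size)   # ceiling division
    mbs_high = -(-height // mb_size)
    return mbs_wide * mbs_high * fps


def cycles_per_macroblock(clock_hz, mb_rate):
    """Average number of clock cycles available to encode one macroblock."""
    return clock_hz / mb_rate


if __name__ == "__main__":
    rate_720p = macroblocks_per_second(1280, 720, 30)
    print(rate_720p)                                      # 108000
    print(cycles_per_macroblock(108_000_000, rate_720p))  # 1000.0
    # Assumed level 3.0 limit of 40,500 MB/s (from the H.264 level table).
    print(cycles_per_macroblock(81_000_000, 40_500))      # 2000.0
```

Under these assumptions, the level 3.1 configuration has about 1000 cycles per macroblock with one reference frame, while the level 3.0 configuration has about 2000 cycles per macroblock to cover four reference frames.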
Subjects
encoder
standard
integrated circuit
video compression
VLSI
JVT
H.264
video
compression
AVC
Type
thesis
File(s)
Name
ntu-93-R91943022-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):9a34a77ccc59d3f180f3438cffba9b88