DESIGN AND IMPLEMENTATION OF AN H.264/MPEG-4 AVC DECODER FOR 2048X1024 30FPS VIDEOS
Date Issued
2005
Date
2005
Author(s)
Chen, To-Wei
Language
zh-TW
Abstract
H.264 is the newest video coding standard developed by the Joint Video Team (JVT). Compared
with MPEG-4, H.263, and MPEG-2, H.264 reduces the bit-rate by 39%, 49%, and 64%,
respectively. Because of its superior performance, H.264 has been widely adopted by commercial
applications including digital TV broadcasting (European DVB-T and Japanese HDTV),
next-generation DVD (Blu-ray DVD and HD-DVD), and network streaming (Apple QuickTime).
The coding efficiency improvement of H.264 comes at the price of very high computational
load and complexity. Our target specification (baseline profile, level 4.1) requires more
than 83 giga-instructions per second of computation and more than 70 gigabytes per second
of bandwidth. Moreover, new functions such as advanced prediction schemes and the
deblocking filter increase the complexity of the system. To meet the requirements of H.264
high-definition applications, an efficient system design is essential.
Traditional video decoding hardware designs are mostly based on a macroblock pipeline.
However, if this design methodology is adopted directly in an H.264 decoder, much on-chip
memory is wasted. The new coding tools also make module-level design very challenging. For
ultra-high-end applications, the entropy decoder becomes the throughput bottleneck, and
straightforward parallel processing techniques cannot be used to speed it up because of its
context-based adaptive nature. Because of variable block sizes and quarter-pixel-precision
motion vectors, the motion-compensated inter prediction module consumes more than three
times the bandwidth of the previous standard, MPEG-4 SP. Frame-based deblocking seriously
degrades hardware utilization, and the deblocking filter must operate in two directions
(horizontal and vertical), which leads to complex data flow and control.
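The bandwidth cost of small blocks with fractional motion vectors can be illustrated with a short calculation. The sketch below is not taken from the thesis; it assumes the standard H.264 6-tap luma half-sample filter, so a block whose motion vector is fractional in both directions needs 5 extra rows and 5 extra columns of reference pixels. The function names are hypothetical, and the figures are a worst case with no reuse between neighbouring windows.

```python
# Reference pixels fetched for one luma block when the motion vector has a
# fractional part in both directions. Assumption: a 6-tap interpolation filter,
# which needs 5 extra rows and 5 extra columns around the block.
FILTER_TAPS = 6


def window_pixels(width: int, height: int) -> int:
    """Size of the interpolation window for a width x height luma block."""
    extra = FILTER_TAPS - 1
    return (width + extra) * (height + extra)


def fetch_overhead(width: int, height: int) -> float:
    """Reference pixels read per predicted pixel."""
    return window_pixels(width, height) / (width * height)


def macroblock_fetch(width: int, height: int) -> int:
    """Pixels read for one 16x16 macroblock split into width x height blocks,
    assuming no reuse between neighbouring windows."""
    blocks = (16 // width) * (16 // height)
    return blocks * window_pixels(width, height)


if __name__ == "__main__":
    for w, h in [(16, 16), (8, 8), (4, 4)]:
        print(f"{w}x{h}: window {window_pixels(w, h)} px, "
              f"overhead {fetch_overhead(w, h):.2f}x, "
              f"per macroblock {macroblock_fetch(w, h)} px")
```

Under these assumptions a macroblock split into 4x4 blocks reads 1296 reference pixels, against 441 for a single 16x16 block, and much of that extra traffic comes from overlapping windows of adjacent blocks.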
We propose a hybrid task pipelining system to address these issues. Balanced pipelining
schedules and appropriate degrees of parallelism deliver the large and complex computation
capability required. Block-level, macroblock-level, and macroblock/frame-level pipelining
schedules are arranged for CAVLD/IQ/IT/INTRA_PRED, INTER_PRED, and DEBLOCK, respectively.
As a result, both the internal pipeline memory and the bandwidth consumption are
significantly reduced. Efficient modules are also provided. The entropy decoder unit
decodes the bitstream into symbols without bubble cycles, so high decoding throughput is
achieved, and the proposed CAVLD unit can be extended to higher parallelism with low area
overhead because only the Level table and the Run table are modified. The proposed memory
access schemes of the motion-compensated inter prediction unit, Interpolation Window Reuse
(IWR) and Interpolation Window Classification (IWC), save 60% of external memory
bandwidth, and the proposed processing order of 4x4 blocks for inter prediction enables
high utilization of the reuse buffer. The DEBLOCK unit breaks the frame-level deblocking
operation into macroblock-level operations so that hardware utilization is greatly
increased. The proposed transpose array combined with a 1-D filter solves the complex data
flow and control problem.
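The idea of pairing a transpose step with a single 1-D filter can be sketched in software. The example below is only an analogy for the hardware transpose array, not the thesis design: the smoothing filter is a toy placeholder rather than the normative H.264 deblocking filter (which is adaptive), and all function names are hypothetical. It shows that one filter working along rows can handle edges in both directions if the block is transposed before and after.

```python
def transpose(block):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*block)]


def smooth_row(row, edge):
    """Toy 1-D filter across the boundary between row[edge-1] and row[edge].
    Placeholder only; not the normative H.264 deblocking filter."""
    out = list(row)
    p1, p0, q0, q1 = row[edge - 2], row[edge - 1], row[edge], row[edge + 1]
    out[edge - 1] = (p1 + 2 * p0 + q0 + 2) // 4
    out[edge] = (p0 + 2 * q0 + q1 + 2) // 4
    return out


def filter_vertical_edge(block, edge):
    """Filter across a vertical edge: the 1-D filter runs along each row."""
    return [smooth_row(row, edge) for row in block]


def filter_horizontal_edge(block, edge):
    """Filter across a horizontal edge with the same row filter by
    transposing, filtering, and transposing back."""
    return transpose(filter_vertical_edge(transpose(block), edge))


if __name__ == "__main__":
    # 8x8 block with a horizontal edge between rows 3 and 4.
    block = [[0] * 8 for _ in range(4)] + [[100] * 8 for _ in range(4)]
    for row in filter_horizontal_edge(block, 4):
        print(row)
```

Because both directions go through `smooth_row`, only the data ordering changes between the two passes, which is the property the transpose approach relies on.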
A prototype chip is implemented using the Artisan standard CMOS cell library with TSMC
0.18 µm 1P6M technology. The total gate count is about 217K, synthesized at 120 MHz. It
supports H.264/MPEG-4 AVC decoding in baseline profile, level 4.1, with five reference
frames. The maximum processing capability is 246K macroblocks per second, or 2048x1024
4:2:0 30 Hz video. In total, about 10 Kbytes of on-chip memory and 16 Mbytes of off-chip
memory are required. The core size is 2.19 x 2.19 mm². The average power dissipation is
186.4 mW when operating at 120 MHz with a 1.8 V power supply. Compared with other H.264
decoder designs, the proposed design requires a lower gate count and less on-chip memory,
which makes it a good candidate for integration into high-definition video decoding
applications. When the specification is reduced to QCIF (176x144) 15 Hz video, the chip
delivers real-time decoding at 725 kHz with a 1.8 V power supply and consumes only
1.18 mW. This low-power operation makes the design also suitable for mobile applications.
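The throughput figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes 16x16 macroblocks and uses only the resolutions, frame rates, and clock frequencies stated in the abstract; the function names are hypothetical.

```python
MB_SIZE = 16  # luma macroblock edge length in pixels


def macroblocks_per_second(width: int, height: int, fps: int) -> int:
    """Macroblock rate for a given frame size and frame rate."""
    mbs_per_frame = (width // MB_SIZE) * (height // MB_SIZE)
    return mbs_per_frame * fps


def cycles_per_macroblock(clock_hz: float, width: int, height: int, fps: int) -> float:
    """Clock cycles available for each macroblock at real-time speed."""
    return clock_hz / macroblocks_per_second(width, height, fps)


if __name__ == "__main__":
    print(macroblocks_per_second(2048, 1024, 30))        # 2048x1024 at 30 Hz
    print(cycles_per_macroblock(120e6, 2048, 1024, 30))  # at 120 MHz
    print(macroblocks_per_second(176, 144, 15))          # QCIF at 15 Hz
    print(cycles_per_macroblock(725e3, 176, 144, 15))    # at 725 kHz
```

By this arithmetic, 2048x1024 at 30 Hz is 245,760 macroblocks per second, which matches the quoted 246K. Both operating points leave a budget of roughly 488 cycles per macroblock, so the two clock frequencies are consistent with scaling the clock in proportion to the macroblock rate.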
Subjects
Multimedia
Decoder
Video
H.264
MPEG-4
Type
thesis
File(s)
Name
ntu-94-R91943116-1.pdf
Size
23.31 KB
Format
Adobe PDF
Checksum
(MD5):d3f566a34180dbc9553d84c27b9fe49a