針對離散小波轉換以及移動補償式時間濾波之積體電路架構設計與分析

陳良基臺灣大學：電子工程學研究所黃朝宗Huang, Chao-TsungChao-TsungHuang2007-11-272018-07-102007-11-272018-07-102005http://ntur.lib.ntu.edu.tw//handle/246246/57535離散小波轉換已為靜態影像以及動態視訊壓縮技術帶來革命性的進展，在本論文中，針對離散小波轉換的積體電路設計以及記憶分析，依三個不同的維度分為三個部份來討論，依次為：一維離散小波轉換、二維離散小波轉換以及從時間軸進行離散小波轉換的移動補償式時間濾波。第一章將介紹離散小波轉換在視訊壓縮演算法上扮演的角色，以及硬體實現上的考量。接著第一部份由第二章介紹一維離散小波轉換演算法和既有硬體架構開始，在第三章裡翻轉式結構將被提出，其可加快提昇式結構的工作頻率，第四章將提出一種全新的設計類別，可提供最少的邏輯閘個數。從第五章開始進入第二部份—二維離散小波轉換，首先是分析不同的影像輸入方法造成效能的差異，第六章將提出可適用於任意離散小波轉換硬體模組的一般型二維條狀式架構，並可最小化內部記憶體，第七章更進一步討論條狀式內部記憶體的實現方法，藉由所提出的多重提昇結構可以減少內部記憶體的面積大小以及存取功率，第八章為討論邊界延伸問題以及形狀適應離散小波轉換的硬體架構。第三部份—移動補償式時間濾波—由第九章開始，在介紹完該演算法後，接著為一階移動補償式時間濾波的記憶體分析以及所提出的資料共享機制，第十章進一步討論多階移動補償式時間濾波的系統分析，最後一個具高度彈性的系統架構將被提出。本論文的主要貢獻與未來方向於第十一章中做結論。Discrete Wavelet Transform (DWT) has led the revolution of block-based image coding and close-loop video coding systems. In this dissertation, VLSI architectures and memory analysis of DWT in three dimensions are discussed in three different parts: One-Dimensional (1-D) DWT, Two-Dimensional (2-D) DWT, and Motion-Compensated Temporal Filtering (MCTF) that performs DWT in the temporal direction. Because 1-D DWT, 2-D DWT, and MCTF are pixel-level, framelevel, and group-of-picture-level operations, the design levels target at processing element, module, and system, respectively. The implementation method of 1-D DWT is highly related to the algorithm view. In Part I of this dissertation, many different algorithm views for DWT are surveyed first: two-channel filter bank, polyphase decomposition, lifting scheme, and B-spline factorization. The previous 1-D DWT architectures can be classified into convolution- and lifting-based. Second, we propose a flipping structure to reduce the critical path of lifting-based DWT architecture without any hardware overhead. The lifting-based architectures are usually adopted because of its fewer computation complexity and in-place implementation. However, the critical path is potentially long owing to the serial connection of triangular matrices. The flipping structure multiplies the inverses on the timing accumulation path for efficiently reducing the critical path, instead of the conventional pipelining technique that introduces many registers. The case studies of JPEG 2000 (9,7) filter and the linear (6,10) filter demonstrate the efficiency of flipping structure. Third, a new category of DWT implementation based on B-spline factorization is proposed, which can use fewer multipliers. For Daubechies wavelets, it can guarantee to reduce about one half of multiplies compared to convolution-based architectures. However, the lifting scheme cannot reduce the computation complexity for even linear DWT filters. By case studies of the (6,10), (10,18), and (9,7) filters, the proposed B-spline-based architecture shows the superior performance in terms of logic gate count. The 2-D DWT belongs to frame-based computations, so the performance of hardware implementation is dominated by external memory bandwidth and internal memory size. In Part II of this dissertation, a detailed survey for different scan methods is first given and classified into five categories. An overlapped stripe-based scan is proposed to provide a better trade-off for memory requirement. Second, generic line-based 2-D DWT architectures are proposed, which can adopt any kind of 1-D DWT modules. For 1-level 2-D line-based architecture, the line buffer is separated into data buffer and temporal buffer. We propose a data flow for data buffer and a mapping method for temporal buffer, which can minimize the line buffer size. Two multi-level 2-D DWT line-based architectures are also proposed, which can minimize the external memory access, with different hardware utilizations. Third, we propose a memory-efficient implementation for line-based 2-D DWT, which is called multiple-lifting scheme. The implementation issues of temporal buffer are first discussed. Then, the proposed multiple-lifting scheme provides a new implementation method for temporal buffer. It can reduce the temporal buffer access frequency to replace the two-port SRAM by single-port SRAM. The reduction of access frequency also decreases the power consumption of temporal buffer proportionally. By evaluating hardware designs for the (9,7) filter with Artisan 0.18um cell library and RAM compiler, the efficiency of area and power reduction is proven. Fourth, an efficient VLSI implementation for 2-D Shape-Adaptive DWT (SA-DWT) is proposed. The SA-DWT requires the capability to process the boundary extension for very short signal segments. It is proposed to be implemented by use of stage-based boundary extension strategy and shape-adaptive boundary handling unit. The SA-DWT with the JPEG 2000 lossy (9,7) filter and the MPEG-4 VTC (9,3) filter are implemented to prove the efficiency. Furthermore, the SA-DWT implementation with (9,7) filter is fabricated with core area 1.68x1.68mm2 in TSMC 0.25um process. This chip has real-time processing capability of 1-level 2-D SA-DWT for HDTV1080p 30fps sequences when working at 50MHz. MCTF is to perform DWT in the temporal direction with Motion Compensation (MC). MCTF has become the core technology in interframe wavelet video coding and the next generation video coding standard MPEG SVC. In Part III of this dissertation, the first research work on VLSI architecture and memory analysis of MCTF are presented. First, memory issues of one-level MCTF are discussed. The 5/3 MCTF consists of prediction and update stages. The former is analyzed in terms of macroblock- and frame-level data reuse schemes separately. After reviewing previous macroblock-level reuse schemes, we propose a new Level C+ scheme to provide a good trade-off between Level C and D schemes. Based on the open-loop prediction nature, we propose three frame-level data reuse schemes: Double Reference Frames ME, Double Current Frames ME, and modified Double Current Frames ME. The analysis of 5/3 MCTF is based on the combination of prediction and update stages, which includes external memory bandwidth and storage size. Second, system issues of multi-level MCTF are discussed, including computation complexity, external memory bandwidth and storage size, and coding delay. The computation complexities are very similar for different MCTF configurations, but other three system issues are quite different. They depend on the adopted macroblock- and frame-level reuse scheme, decomposition level, inter- or intra-coded lowpass frames, and 5/3 or 1/3 MCTF. Based on simulation results, the impact of the latter three parameters on coding performance is evaluated. Accordingly, a flexible and efficient system architecture is proposed for multi-level MCTF. It can adapt the temporal prediction structures to any-level 5/3 MCTF, 1/3 MCTF, or Hierarchial B-frames, and even the close-loop MCP with two reference frames. In summary, this dissertation presents a fast lifting-based architecture, named flipping structure, and a new design category based on B-spline factorization that can provide the smallest gate count for 1-D DWT processing element design. For 2-D DWT, a generic line-based architecture is proposed to minimize the on-chip memory and to be capable of adopting any kind of DWT modules. Furthermore, a memory-efficient implementation for temporal buffer, called multiple-lifting scheme, is presented to reduce the memory area and access power efficiently. Besides, the boundary extension of SA-DWT is addressed by proposed stage-based boundary handling units. As for MCTF, system-level implementation issues are considered. The block-level and frame-level data reuse schemes are both discussed for one-level MCTF. According to analysis results, a flexible and efficient system architecture is proposed for multi-level MCTF, which can support many configurations of MCTF systems.Abstract xix 1 Introduction 1 1.1 Trends in Image Coding Algorithm . . . . . . . . . . . . . . . . . 1 1.2 Trends in Video Coding Algorithm . . . . . . . . . . . . . . . . . 4 1.3 Memory Management of Multimedia System . . . . . . . . . . . 6 1.4 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . 8 I One-Dimensional Discrete Wavelet Transform 11 2 Survey of 1-D DWT Algorithm and Architecture 13 2.1 Mathematical Formulation . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Continuous Wavelet to Discrete Wavelet . . . . . . . . . . 14 2.1.2 Polyphase Decomposition . . . . . . . . . . . . . . . . . 17 2.1.3 Lifting Scheme . . . . . . . . . . . . . . . . . . . . . . . 19 2.1.4 B-spline Formulation . . . . . . . . . . . . . . . . . . . . 20 2.2 Conventional 1-D DWT Architectures . . . . . . . . . . . . . . . 21 2.2.1 Convolution-based Architectures . . . . . . . . . . . . . . 22 2.2.2 Lifting-based Architectures . . . . . . . . . . . . . . . . 23 2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3 Flipping Structure for Lifting-based DWT Hardware Implementation 29 3.1 Flipping Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Precision Issue . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.1 JPEG 2000 Default (9,7) Filter . . . . . . . . . . . . . . . 35 3.3.2 Integer (9,7) Filter . . . . . . . . . . . . . . . . . . . . . 41 3.3.3 Linear (6,10) Filter . . . . . . . . . . . . . . . . . . . . . 42 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4 B-spline Factorized DWT/IDWT Architecture 47 4.1 Proposed B-spline Factorized Architecture . . . . . . . . . . . . . 47 4.1.1 B-spline Factorized Architectures for DWT and IDWT . . 47 4.1.2 Implementation Methods of the B-spline Part . . . . . . . 49 4.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Multipliers and Adders . . . . . . . . . . . . . . . . . . . 51 4.2.2 Critical Path and Registers . . . . . . . . . . . . . . . . . 53 4.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3.1 DWT with JPEG 2000 Default (9,7) Filter . . . . . . . . . 55 4.3.2 DWT with the Linear (6,10) Filter . . . . . . . . . . . . . 58 4.3.3 DWT with the Linear (10,18) Filter . . . . . . . . . . . . 61 4.3.4 IDWT with the Linear (10,18) Filter . . . . . . . . . . . . 62 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 II Two-Dimensional Discrete Wavelet Transform 69 5 Analysis of 2-D DWT Hardware Architectures 71 5.1 Introduction to 2-D DWT . . . . . . . . . . . . . . . . . . . . . . 72 5.2 Classification of Frame Memory Scan Methods . . . . . . . . . . 73 5.2.1 Direct Scan . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.2.2 Line-based Scan . . . . . . . . . . . . . . . . . . . . . . 74 5.2.3 Non-overlapped and Overlapped Block-based Scans . . . 76 5.2.4 Non-overlapped Stripe-based Scan . . . . . . . . . . . . . 79 5.2.5 Proposed Overlapped Stripe-based Scan Method . . . . . 79 5.3 Comparison of Scan Methods for 1-level 2-D DWT Architectures 80 5.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6 Generic Line-based 2-D DWT Architecture 83 6.1 Proposed 1-level Line-based Architecture . . . . . . . . . . . . . 84 6.1.1 Data Flow for Data Buffer of Size 1.5N . . . . . . . . . . 87 6.1.2 Complete Data Flow for Data Buffer of Size (1+(1/2)k)N 89 6.1.3 Numerical Analysis of Data Buffer Size . . . . . . . . . . 91 6.1.4 Integration of Lifting-based DWT Module . . . . . . . . . 93 6.2 Proposed Multi-level Line-based Architectures . . . . . . . . . . 94 6.2.1 Adopting Two 1-D DWT Modules (2DWTM) . . . . . . . 94 6.2.2 Adopting Three 1-D DWT Modules (3DWTM) . . . . . . 95 6.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.3.1 General Case . . . . . . . . . . . . . . . . . . . . . . . . 97 6.3.2 JPEG 2000 Default Lossy (9,7) Filter . . . . . . . . . . . 99 6.3.3 JPEG 2000 Default Lossless (5,3) Filter . . . . . . . . . . 100 6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7 Multiple-Lifting Scheme: Memory-Efficient Implementation for Line-based 2-D DWT 103 7.1 Implementation Issues of Temporal Buffer for Line-based 2-D DWT 104 7.1.1 Lifting-based DWT Module . . . . . . . . . . . . . . . . 104 7.1.2 Convolution-based DWT Module . . . . . . . . . . . . . 105 7.1.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 106 7.2 Proposed Multiple-Lifting Scheme . . . . . . . . . . . . . . . . . 107 7.2.1 Reducing Average Memory Bandwidth using Parallel Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 7.2.2 Proposed Two-lifting Scheme . . . . . . . . . . . . . . . 109 7.2.3 Proposed N-lifting Scheme . . . . . . . . . . . . . . . . . 111 7.3 Proposed M-Scan for Multiple-lifting Scheme . . . . . . . . . . . 111 7.3.1 Proposed M-Scan for Two-lifting Scheme . . . . . . . . . 112 7.3.2 Proposed M-Scan for N-lifting Scheme . . . . . . . . . . 113 7.3.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 114 7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 8 VLSI Architecture for Lifting-based Shape-Adaptive DWT with Odd-Symmetric Filters 117 8.1 Background Introduction . . . . . . . . . . . . . . . . . . . . . . 118 8.2 SA-DWT Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 119 8.3 Previous Boundary Extension Handling Technique for Convolution-based DWT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 8.4 Lifting Scheme of Odd-Symmetric Filters . . . . . . . . . . . . . 125 8.5 Proposed VLSI Architecture Design Method for Lifting-based Odd-Symmetric SA-DWT . . . . . . . . . . . . . . . . . . . . . . . . 126 8.5.1 Stage-based Boundary Extension Strategy . . . . . . . . . 127 8.5.2 Shape-Adaptive Boundary Handling Unit . . . . . . . . . 127 8.6 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 8.6.1 JPEG 2000 Default (9,7) Filter . . . . . . . . . . . . . . . 129 8.6.2 MPEG-4 VTC (9,3) Filter . . . . . . . . . . . . . . . . . 133 8.7 Chip Implementation for SA-DWT with (9,7) Filter . . . . . . . . 137 8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 III Motion-Compensated Temporal Filtering 141 9 Memory Analysis for One-Level MCTF 143 9.1 Introduction to Motion-Compensated Temporal Filtering . . . . . 144 9.2 Macroblock-level Data Reuse for Motion Estimation . . . . . . . 146 9.2.1 Conventional Data Reuse Schemes for FSBMA . . . . . . 148 9.2.2 Proposed Level C+ Scheme for FSBMA . . . . . . . . . . 149 9.3 Memory Analysis for Prediction Stage . . . . . . . . . . . . . . . 150 9.3.1 Redundancy Access for MC in MCTF . . . . . . . . . . . 151 9.3.2 Direct Implementation of Prediction Stage . . . . . . . . . 153 9.3.3 Proposed Double Reference Frames ME . . . . . . . . . . 154 9.3.4 Proposed Double Current Frames ME . . . . . . . . . . . 154 9.3.5 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 155 9.4 Memory Analysis for One-level MCTF . . . . . . . . . . . . . . 156 9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 10 Analysis and Architecture for Multi-level MCTF 161 10.1 Introduction to Multi-level MCTF . . . . . . . . . . . . . . . . . 161 10.1.1 Decomposition Level . . . . . . . . . . . . . . . . . . . . 162 10.1.2 Inter- or Intra-coding for Lowpass Frames . . . . . . . . . 162 10.1.3 Update Stage . . . . . . . . . . . . . . . . . . . . . . . . 163 10.2 Analysis of Multi-level MCTF . . . . . . . . . . . . . . . . . . . 163 10.2.1 Computation Complexity and External Memory Access . 163 10.2.2 External Memory Storage . . . . . . . . . . . . . . . . . 164 10.2.3 Coding Delay . . . . . . . . . . . . . . . . . . . . . . . . 165 10.2.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . 166 10.3 Proposed System Architecture for Multi-level MCTF . . . . . . . 167 10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 11 Conclusion 177 11.1 Principal Contributions . . . . . . . . . . . . . . . . . . . . . . . 177 11.1.1 One-Dimensional Discrete Wavelet Transform . . . . . . 177 11.1.2 Two-Dimensional Discrete Wavelet Transform . . . . . . 179 11.1.3 Motion-Compensated Temporal Filtering . . . . . . . . . 181 11.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 182 11.2.1 Dual-Mode DWT/IDWT for JPEG 2000 Codec . . . . . . 182 11.2.2 Lossy Embedded Compression for Open-loop MCTF . . . 183 11.2.3 Flexible and High Quality SVC Single-Chip Encoder . . . 183 11.2.4 Power-aware SVC Single-Chip Decoder . . . . . . . . . . 1831031950 bytesapplication/pdfen-US超大型積體電路架構移動補償式時間濾波離散小波轉換視訊影像壓縮motion-compensated temporal filteringVLSI architecturediscrete wavelet transformvideo and image compression針對離散小波轉換以及移動補償式時間濾波之積體電路架構設計與分析VLSI Architecture and Analysis of Discrete Wavelet Transform and Motion-Compensated Temporal Filteringthesishttp://ntur.lib.ntu.edu.tw/bitstream/246246/57535/1/ntu-94-F90943002-1.pdf