Design and Implementation of High Performance JPEG 2000 Encoding System (高效能JPEG 2000編碼系統之設計與實現)

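Contributor: 陳良基
Affiliation: National Taiwan University, Graduate Institute of Electronics Engineering (臺灣大學：電子工程學研究所)
Author: 方弘吉 (Fang, Hung-Chi)
Record dates: 2007-11-27, 2018-07-10
Year: 2005
URI: http://ntur.lib.ntu.edu.tw//handle/246246/57418

Abstract (translated from the Chinese):

JPEG 2000 is the latest generation of still image compression standards. It offers the best compression efficiency of all current standards and also provides many useful features. However, its complexity is so high that it has not yet reached the market. In this dissertation, we propose a high-performance JPEG 2000 encoder system in the hope of solving this problem, so that JPEG 2000 can be widely accepted and used.

For the embedded block coder, we propose a parallel architecture that raises processing speed by handling an entire coefficient at a time. The memories for the code-block and the state variables are also no longer needed. Eliminating them greatly reduces the area of the embedded block coder, because in previous architectures these memories account for roughly 80% of the total area. Besides saving area, the architecture is about 6 times faster than the best previous architecture. In short, the proposed parallel block coder architecture is a high-performance, low-cost design.

Rate-distortion optimization is an important feature of JPEG 2000, but it has a critical weakness: it wastes computation and processing time, and it needs a large memory to buffer the bit stream and other data. Because it is a "post-compression optimization" algorithm, that is, one that optimizes only after compression, the computation required is the same as for lossless coding no matter how high the compression ratio is, yet most of the compressed bit stream is never used. The bit stream and other data must also be buffered and can be output only after the optimization finishes. To solve this problem while still providing optimization, we propose a "pre-compression optimization" algorithm, which performs the optimization before the embedded block coder compresses the data. As a result, every bit generated is useful and no computation is wasted, the processing time is shortened because most of the computation is skipped, and the generated bit stream can be output directly without buffering. The proposed pre-compression optimization algorithm is therefore a low-power, high-speed, and low-cost algorithm.

Based on these two new techniques, we implemented a parallel JPEG 2000 encoder system-on-chip that compresses HDTV 720p frames in real time at 30 frames per second. For the discrete wavelet transform, we adopt a multi-level line-based 2-D architecture, which lowers the bandwidth to its theoretical lower bound: each pixel is read only once. The chip was taped out in a TSMC 0.25 µm process. Its core area is 5.5 mm², and it consumes 348 mW at an operating frequency of 81 MHz. The chip has both the smallest area and the fastest processing speed reported in the literature.

Finally, for large tile sizes, we propose a stripe pipeline scheme that makes the on-chip memory requirement of JPEG 2000 proportional to the square root of the tile size. The chip area is therefore much smaller than in previous architectures, whose memory requirement is proportional to the tile size. For example, for a 256×256 tile, the memory requirement drops to 8.5% of that of previous architectures. To support the stripe pipeline, we further propose a level-switch discrete wavelet transform and a code-block-switch embedded block coder. The former is a multi-level block-scan architecture, and the latter can switch among 13 code-blocks concurrently. At a tile size of 256×256, the area of this architecture falls below 30% of that of the parallel JPEG 2000 encoder, and the larger the tile, the greater the saving. We believe that with the encoder system architecture proposed in this dissertation, the area cost of a JPEG 2000 encoder will not differ much from that of a JPEG encoder. Under these conditions JPEG 2000 is very likely to be accepted by the market, because its compression ratio and features are far better than those of JPEG. We therefore believe that JPEG 2000 will begin to replace JPEG as the mainstream still image compression system.

Abstract (English):

JPEG 2000 is a new still image coding standard that offers not only better coding efficiency but also many useful features. However, its high computational complexity and memory requirements have obstructed its entry into the market. In this dissertation, we propose a high performance JPEG 2000 encoding system to solve this problem.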
For the embedded block coding, we propose a parallel architecture that increases the throughput by processing an entire coefficient at a time. As a result, the state variable memory and the code-block memory are eliminated. This greatly reduces the hardware cost, since these memories occupy more than 80% of the area of the embedded block coding engine in conventional architectures. Moreover, the processing speed is more than 6 times that of the best result in the literature. The proposed parallel architecture therefore achieves high performance through both high speed and low cost.

Rate-distortion optimization is an important function of JPEG 2000. However, the post-compression rate-distortion optimization algorithm recommended in the reference software requires that the original image be losslessly coded regardless of the target bit rate. This wastes computational power and time on unnecessary data, and it requires a large memory to buffer the lossless bit stream. To solve this problem, we propose a pre-compression rate-distortion optimization algorithm, which performs the rate-distortion optimization before the embedded block coding. The embedded block coding then only needs to process the necessary data, which greatly reduces its processing time and computational power. Moreover, the bit stream does not need to be buffered. The proposed pre-compression rate-distortion optimization algorithm therefore offers low power, high speed, and low cost.

Based on these two new techniques, a high speed parallel JPEG 2000 encoder chip is implemented. It can encode HDTV 720p video in real time. For the discrete wavelet transform, we adopt the multi-level line-based 2-D architecture. The memory bandwidth requirement of this chip is therefore minimized, i.e. each pixel is read exactly once. The chip is fabricated in TSMC 0.25 µm CMOS technology, and the core area is 5.5 mm². The power consumption is 348 mW at 81 MHz.
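The real-time figure can be put in perspective with a quick cycle budget. The sketch below is illustrative arithmetic only and is not taken from the dissertation. It uses the 81 MHz clock and the rate of 30 frames per second stated in the abstract, and it assumes a 1280×720 frame and the chroma formats listed; the abstract does not say which format the chip handles, and all names in the code are hypothetical.

```python
# Back-of-the-envelope cycle budget for real-time 720p encoding.
# From the abstract: 30 frames/s and an 81 MHz clock.
# Assumptions (not from the abstract): 1280x720 frame size and the
# chroma formats listed below.

WIDTH, HEIGHT = 1280, 720      # assumed 720p frame size
FPS = 30                       # frames per second (from the abstract)
CLOCK_HZ = 81_000_000          # operating frequency (from the abstract)

# Samples per pixel for a few hypothetical chroma formats.
SAMPLES_PER_PIXEL = {"luma only": 1.0, "4:2:0": 1.5, "4:4:4": 3.0}


def cycles_per_sample(samples_per_pixel: float) -> float:
    """Clock cycles available for each image sample at real-time rate."""
    samples_per_second = WIDTH * HEIGHT * FPS * samples_per_pixel
    return CLOCK_HZ / samples_per_second


for name, spp in SAMPLES_PER_PIXEL.items():
    print(f"{name:>9}: {cycles_per_sample(spp):.2f} cycles per sample")
```

Under these assumptions the budget is at most about three clock cycles per sample, which gives a sense of why an architecture that handles an entire coefficient at a time matters for this throughput target.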
This encoder has the highest throughput on the smallest silicon area among all encoders in the literature.

Finally, we propose a stripe pipeline scheme for large tile sizes. With this scheme, the on-chip memory requirement of a JPEG 2000 encoder is proportional to the square root of the tile size, whereas it is proportional to the tile size in previous works. For a tile size of 256×256, the tile memory requirement is reduced to only 8.5% of that of previous works. To realize the stripe pipeline scheme, the level switch discrete wavelet transform and the code-block switch embedded block coding are proposed. The level switch discrete wavelet transform is a multi-level block-based scan architecture, and the code-block switch embedded block coding can process 13 code-blocks concurrently by switching among them. As a result, the hardware cost of this pipeline architecture is about 30% of that of the parallel encoder when the tile size is 256×256, and the area saving grows as the tile size increases.

With the algorithms and architectures proposed in this dissertation, the cost of a JPEG 2000 encoder can be reduced to only a few times that of a JPEG encoder. Moreover, all the features and functionalities of JPEG 2000 are retained. Therefore, we believe that JPEG 2000 will start to take the place of JPEG as the core technology of still image coding systems in the near future.

Table of contents:

Abstract
1 Introduction
  1.1 Design Issues
    1.1.1 Discrete Wavelet Transform
    1.1.2 Embedded Block Coding
    1.1.3 Rate-Distortion Optimization
  1.2 Design Goal
  1.3 Dissertation Organization
2 JPEG 2000 Algorithm
  2.1 Coding System Overview
  2.2 Discrete Wavelet Transform
    2.2.1 Two-Dimensional Decompositions
    2.2.2 One-Dimensional Filter Procedure
  2.3 Embedded Block Coding
    2.3.1 Context Formation
    2.3.2 Arithmetic Encoder
  2.4 Rate-Distortion Optimization
  2.5 Experimental Results
    2.5.1 Environment
    2.5.2 Coding Performance
    2.5.3 Comparisons
  2.6 Summary
3 Parallel EBC Architecture
  3.1 Previous Works
  3.2 Parallel EBC Algorithm
    3.2.1 Coding Pass Classification
    3.2.2 Context Formation
    3.2.3 Arithmetic Encoder
  3.3 Parallel EBC Architecture
    3.3.1 Gobang Register Bank
    3.3.2 Parallel Context Formation
    3.3.3 Re-configurable First-In First-Out
    3.3.4 Folded Arithmetic Encoder
    3.3.5 Code-Block Parallel Processing
  3.4 Experimental Results
    3.4.1 FIFO Length
    3.4.2 FAE Architecture
    3.4.3 Implementation
    3.4.4 Comparisons
  3.5 Summary
4 Pre-Compression RDO Algorithm
  4.1 Preliminary
    4.1.1 JPEG 2000 Coding Hierarchy
    4.1.2 Distortion Constraint RDO
    4.1.3 Pre- and Post-compression Rate-Distortion Optimization
  4.2 R-D Calculation Before Compression
    4.2.1 Image Quality Control
    4.2.2 Rate Estimation
    4.2.3 Distortion Estimation
  4.3 Pre-compression RDO Algorithm
    4.3.1 Distortion Calculation
    4.3.2 Incremental R-D Calculation
    4.3.3 R-D Normalization
    4.3.4 Candidates Increase
    4.3.5 Truncation Point Decision
  4.4 Experimental Results
    4.4.1 Coding Performance
    4.4.2 Computation and Memory Reduction
    4.4.3 Distortion Control Precision
  4.5 Summary
5 Parallel JPEG 2000 Encoder
  5.1 Preliminary
  5.2 System Architecture
    5.2.1 Discrete Wavelet Transform
    5.2.2 Pre-Compression Rate-Distortion Optimization
    5.2.3 Embedded Block Coding
    5.2.4 Bit Stream Formation
    5.2.5 Scheduling
  5.3 Experimental Results
    5.3.1 Performance and Chip Feature
    5.3.2 Comparisons
  5.4 Summary
6 Pipeline JPEG 2000 Encoder
  6.1 Preliminary
  6.2 System Architecture
    6.2.1 Stripe Pipeline Scheme
    6.2.2 Level Switch DWT
    6.2.3 Code-block Switch EBC
  6.3 Experimental Results
    6.3.1 Memory Reduction
    6.3.2 TMC Architecture
    6.3.3 Implementation
    6.3.4 Comparison
  6.4 Summary
7 Conclusion
  7.1 Principal Contributions
  7.2 Future Directions
    7.2.1 Full Featured JPEG 2000 Encoder
    7.2.2 High Performance JPEG 2000 Codec
Bibliography
Index
Resume

File size: 3015172 bytes
Format: application/pdf
Language: en-US
Keywords (translated from the Chinese): digital video, architecture design, optimization, algorithm, digital signal processing, image processing, image coding
Keywords (English): signal processing, algorithm, architecture design, image coding, chip implementation, optimization, JPEG 2000
Title (English): Design and Implementation of High Performance JPEG 2000 Encoding System
Type: thesis
Full text: http://ntur.lib.ntu.edu.tw/bitstream/246246/57418/1/ntu-94-F90943013-1.pdf