Key Technology Design of Audio Codecs for MPEG-2/4 AAC and HE AAC

Contributor: 陳良基 (Chen, Liang-Gee)
Affiliation: National Taiwan University, Graduate Institute of Electrical Engineering
Author: 黃世緯 (Huang, Shih-Way)
Record dates: 2007-11-26, 2018-07-06
Date issued: 2005
URI: http://ntur.lib.ntu.edu.tw//handle/246246/53442

Abstract

Digital audio coding technology plays an important role in entertainment and communication in daily life. This dissertation proposes key technology designs for the state-of-the-art audio coding standards MPEG-2/4 Advanced Audio Coding (AAC) and its extension, MPEG-4 High Efficiency Advanced Audio Coding (HE AAC). To enable low-complexity applications such as portable devices with audio recording and playback, it studies how to reduce the computational complexity of AAC encoders and HE AAC decoders. The dissertation is divided into two parts.

The first part presents a low-computation, low-memory psychoacoustic model (PAM) for MPEG AAC encoding. The PAM is the key technology in the MPEG AAC encoder, and it uses many complicated mathematical functions to model the human auditory system. The challenge is therefore to reduce computation and memory while maintaining sound quality. The main idea is to convert these complicated functions into optimized look-up tables and shared common functions, and to replace unnecessary computation. In addition, the detection and decision method is modified to improve sound quality. The complexity of the proposed PAM is reduced to 12.2% of the original (a reduction of 87.8%). As a result, a real-time MPEG-2/4 AAC Low Complexity profile stereo encoder at 128 kb/s requires fewer than 20 MOPS (million operations per second) while maintaining CD quality.

The second part derives a fast Quadrature Mirror Filterbank (QMF) for the Low Power Spectral Band Replication (SBR) tool of the MPEG HE AAC decoder. The QMF is the key technology of the HE AAC decoder. The main idea is to transform the computation-intensive matrix operations in the QMF into conventional fast Discrete Cosine Transforms (DCT). The computational complexity is thereby reduced to 2.7% of the original multiplications and 7.8% of the original additions.

We expect these audio coding standards to find many applications in the near future, and we hope this study will be of benefit to them.

Contents

Abstract
1. Introduction
  1.1. Application
    1.1.1. Digital Audio Coding
    1.1.2. MPEG AAC and HE AAC
  1.2. Motivation
  1.3. Problem Definition
  1.4. Challenges
  1.5. Contributions
    1.5.1. Low Computation, Low Memory PAM for MPEG AAC Encoding
    1.5.2. Fast Filterbank for HE AAC Decoding
  1.6. Dissertation Organization
2. Background
  2.1. Digital Audio Coding: Why, What, and How
    2.1.1. Why
    2.1.2. What
    2.1.3. How (Principle of Digital Audio Coding)
  2.2. History of MPEG Audio Coding Standards
    2.2.1. MPEG-1
    2.2.2. MPEG-2
    2.2.3. MPEG-4 AAC
    2.2.4. MPEG-4 HE AAC (Bandwidth Extension)
    2.2.5. Comparison between the Audio Codecs
3. Low Computation, Low Memory PAM for the MPEG AAC Encoder
  3.1. Introduction
  3.2. Analyses of AAC Algorithms
  3.3. PAM Algorithms
  3.4. Challenges
    3.4.1. Computation
    3.4.2. Memory
    3.4.3. Quality
  3.5. Previous Works
    3.5.1. MDCT-based PAM
    3.5.2. 32-b logarithmic data format
  3.6. Proposed Design
    3.6.1. Method 1 - Pre-Computed Masking Spreading
    3.6.2. Method 2 - Modified MDCT-based PAM
    3.6.3. Method 3 - Reduced Table of Spreading Function
    3.6.4. Method 4 - Logarithm-based PAM
  3.7. Experiments and Results
    3.7.1. Assessment of Sound Quality
    3.7.2. Profiling the Reduction Rate (Method 1+2)
    3.7.3. Word Length of the Reduced Spreading-Function Table (Method 3)
    3.7.4. Quality Degradation by 16-bit Logarithmic Format (Method 4)
    3.7.5. Encoding Time (Method 1-4)
    3.7.6. Encoding Quality (Method 1-4)
  3.8. Summary
4. Fast Decomposition of Filterbanks for the MPEG-4 HE AAC Decoder
  4.1. Introduction
  4.2. Algorithm Review
  4.3. Profiling
  4.4. Problem Definition
  4.5. Previous Works on Fast Filterbanks
  4.6. Review on Conventional DCT Types
  4.7. Development Methods
    4.7.1. AQMF
    4.7.2. SQMF
    4.7.3. Downsampled SQMF
  4.8. Performance
  4.9. Summary
5. Conclusion
  5.1. Principal contributions
    5.1.1. Low Computation, Low Memory PAM for MPEG AAC Encoding
    5.1.2. Fast Filterbank for HE AAC Decoding
  5.2. Future directions
    5.2.1. Toward applications of lower bit rate audio coding
    5.2.2. Toward applications of scalable audio coding
    5.2.3. Toward applications of high-definition audio coding
Bibliography
Publication

List of Figures

Figure 1-1. The interests of this dissertation.
Figure 1-2. The complexity profiling of the MPEG AAC encoder. PAM is the key technology in the encoder.
Figure 1-3. The complexity profiling of the MPEG HE AAC decoder. The filterbanks in SBR are the key technology in the encoder.
Figure 2-1. Block diagram of the perceptual audio codec.
Figure 2-2. Parametric coding in combination with perceptual coding (core).
Figure 2-3. The compression ratio of the significant MPEG audio coding standards.
Figure 2-4. Results of the AAC stereo verification tests [20]. The horizontal axis shows the audio encoder (with profile, if any) and bitrate. The vertical axis shows the sound quality; 0.0 means the tested signal is indistinguishable from the reference. The smaller the diffscore (difference), the better the quality.
Figure 2-5. AAC quality comparison [20][31]. The horizontal axis shows the bitrate. The vertical axis shows the sound quality; 0.0 means the tested signal is indistinguishable from the reference. The smaller the value, the better the quality.
Figure 2-6. HE AAC verification tests [20]. The horizontal axis shows AAC at 48 kb/s, AAC at 60 kb/s, 3.5 kHz low-pass-filtered Hidden Reference, 7 kHz low-pass-filtered Hidden Reference, HE AAC with High Quality SBR at 32 kb/s, HE AAC with Low Power SBR at 32 kb/s, HE AAC with High Quality SBR at 48 kb/s, and HE AAC with Low Power SBR at 48 kb/s. The vertical axis shows the MUSHRA [30] scores; 100 is the quality of the reference. The higher the score, the better the quality.
Figure 2-7. Sound quality comparison from the European Broadcasting Union testing at 48 kb/s stereo between MP3, AAC, HE AAC (alias aacPlus), and other encoders [35]. The vertical axis shows the MUSHRA [30] scores; 100 is the quality of the reference. The higher the score, the better the quality.
Figure 3-1. Block diagram of an AAC encoder.
Figure 3-2. Block diagram of the PAM in [4].
Figure 3-3. Detailed block diagram of the PAM from the 13 steps in [4].
Figure 3-4. Pre-echoes resulting from processing blocks of 2048 samples [56]. The top figure shows the original signal, the middle shows the re-quantized signal with the pre-echo, and the bottom shows the difference signal between the original and the re-quantized signal.
Figure 3-5. Original PAM, including two sets of FFT and Threshold Generation (TG).
Figure 3-6. MDCT-based PAM, replacing FFT spectra with MDCT spectra.
Figure 3-7. Signal path in the AAC encoder.
Figure 3-8. Spreading function.
Figure 3-9. Pseudo code of the spreading function.
Figure 3-10. Concept of the proposed modified MDCT-based PAM.
Figure 3-11. Block diagram of the proposed modified MDCT-based PAM.
Figure 3-12. Block diagram of MDCT 1 and Threshold Generation 1.
Figure 3-13. Waveform view of a series of castanets. The vertical axis is the amplitude, and the horizontal axis is time. Each surge (attack) is a castanet.
Figure 3-14. PE versus frame number for the sound of Figure 3-13. The vertical axis is the magnitude of PE, and the horizontal axis is the frame number along time. If PE is larger than the threshold, the frame is considered to contain an attack.
Figure 3-15. Waveform view of a pop music excerpt. The vertical axis is the amplitude, and the horizontal axis is time.
Figure 3-16. PE versus frame number for the sound of Figure 3-15. The vertical axis is the magnitude of PE, and the horizontal axis is the frame number along time. Attacks (transients) are falsely detected because those PE values are larger than the threshold.
Figure 3-17. PE versus frame number for the sound of Figure 3-13. The vertical axis is the magnitude of PE, and the horizontal axis is the frame number along time. Each big surge of PE corresponds to a surge (attack) in Figure 3-13.
Figure 3-18. The distribution of zero values and non-zero values.
Figure 3-19. The proposed Method 3, using storage in two arrays.
Figure 3-20. Complex functions in PAM under Method 4 (Logarithm-based PAM). (a) Before Method 4. (b) After Method 4.
Figure 3-21. Energy and threshold stored naturally in logarithmic format.
Figure 3-22. Signal path in the AAC encoder with the proposed PAM.
Figure 3-23. The scales of Objective Difference Grade (ODG).
Figure 3-24. Stereo waveform snapshot of the sound preech01 (at time = 5.82 s). (a) Uncompressed sound. (b) Compressed by the original FFT-based PAM. (c) Compressed without block switching (LONG block only). (d) Compressed by the proposed PAM. Because it lacks block decision, (c) is worse than (b) and (d). The original (b) and the proposed (d) both detect the attacks (transients) correctly.
Figure 3-25. (a), (b) are the left-channel waveform and spectral views encoded by the original FFT-based PAM, and (c), (d) are the left-channel waveform and spectral views encoded by the proposed PAM. They are almost the same.
Figure 4-1. Block diagram of the HE AAC decoder with the Low Power SBR decoder.
Figure 4-2. Profiling of the HE AAC decoder with the Low Power SBR.
Figure 4-3. Profiling of the HE AAC decoder with the Low Power SBR using downsampled SQMF.
Figure 4-4. The decomposition of the matrix operation in AQMF.
Figure 4-5. The decomposition of the matrix operation in SQMF.
Figure 4-6. The decomposition of the matrix operation in downsampled SQMF.
Figure 4-7. Signal flow graph of the proposed decomposition in AQMF.
Figure 4-8. Signal flow graph of the proposed decomposition in AQMF.
Figure 4-9. Signal flow graph of the proposed decomposition in downsampled SQMF.

List of Tables

Table 1-1. Applications of MPEG AAC and HE AAC. ('*' stands for combination with Parametric Stereo (PS) [14].)
Table 1-2. Processing power requirement of MP3 and AAC codecs [15].
Table 1-3. Complexity ratio of HE AAC and plain AAC.
Table 2-1. Various classes of audio.
Table 2-2. Important factors for a given digital audio coder.
Table 3-1. Computational complexity analyses of the AAC LC stereo encoder.
Table 3-2. The functional descriptions of the 13 steps in PAM.
Table 3-3. Comparisons between FFT and MDCT.
Table 3-4. Computational complexity analyses of PAM.
Table 3-5. Comparison of different MDCT-based PAMs.
Table 3-6. Reduction rate of computational complexity.
Table 3-7. The number of zero and non-zero values in the table of the spreading function (sampling rate 44100 Hz).
Table 3-8. The number of values for storage (sampling rate 44100 Hz).
Table 3-9. The reduction of the table's size at different sampling rates.
Table 3-10. The original PAM vs. the proposed PAM (Method 4).
Table 3-11. Comparison of computational complexity and required look-up tables.
Table 3-12. Comparison of data memory storage and bandwidth in Threshold Generation.
Table 3-13. The reduction of the proposed PAM.
Table 3-14. Computational complexity of the AAC encoder after optimization (128 kb/s, LC Stereo).
Table 3-15. Comparison between the proposed PAM and the previous low power work [42].
Table 3-16. Tested audio files and their characteristics.
Table 3-17. Simulated reduction rates by the proposed PAM.
Table 3-18. The word length reduction vs. sound quality degradation (sampling rate 44100 Hz).
Table 3-19. Quality degradation by 16-bit logarithmic format of energy and masking threshold.
Table 3-20. Encoding time of the encoder with the proposed PAM.
Table 3-21. Encoding quality in ODG with or without block switching.
Table 3-22. Encoding quality in ODG with or without block switching.
Table 3-23. Audio excerpts from common audio CD.
Table 3-24. Comparison of encoding quality in ODG.
Table 3-25. Comparison of encoding quality in NMR.
Table 4-1. Reduction of computational complexity in the proposed QMF.
Table 4-2. Reduction of computational complexity in the proposed QMF with downsampled SQMF.

File size: 1805055 bytes
Format: application/pdf
Language: en-US
Keywords: audio codecs (音訊編解碼器), key technology (核心技術), Audio Codecs, MPEG, AAC, HE AAC
Title: 針對MPEG 2/4 AAC和HE AAC音訊編解碼器的核心技術設計 / Key Technology Design of Audio Codecs for MPEG-2/4 AAC and HE AAC
Type: thesis
File: http://ntur.lib.ntu.edu.tw/bitstream/246246/53442/1/ntu-94-D88921030-1.pdf
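
Illustrative note on the look-up-table idea

The abstract describes converting the PAM's complicated functions into look-up tables, and the contents list a pre-computed masking spreading method. The following is a minimal sketch of that general idea, not the dissertation's implementation. It assumes a textbook Schroeder-style spreading curve as a stand-in for whatever function the thesis uses, and the function names and the half-Bark partition spacing are hypothetical.

```python
import math

def spreading_db(dz):
    """Spreading level in dB at Bark distance dz.
    Assumption: a Schroeder-style curve, used only as a stand-in."""
    t = dz + 0.474
    return 15.811389 + 7.5 * t - 17.5 * math.sqrt(1.0 + t * t)

def spread_direct(energy, bark):
    """Baseline: evaluate sqrt and pow for every partition pair on every call."""
    out = []
    for bj in bark:                      # partition receiving the spread energy
        acc = 0.0
        for e, bi in zip(energy, bark):  # partition contributing energy
            acc += e * 10.0 ** (spreading_db(bj - bi) / 10.0)
        out.append(acc)
    return out

def build_spreading_table(bark):
    """Precompute the linear spreading weight for every partition pair once."""
    return [[10.0 ** (spreading_db(bj - bi) / 10.0) for bi in bark] for bj in bark]

def spread_lookup(energy, table):
    """Per-frame work is only multiply-accumulate against the stored weights."""
    return [sum(e * w for e, w in zip(energy, row)) for row in table]

# Hypothetical partitions spaced half a Bark apart, and one frame of energies.
bark = [0.5 * i for i in range(8)]
energy = [1.0, 0.0, 4.0, 0.5, 0.0, 2.0, 0.0, 1.0]
table = build_spreading_table(bark)      # built once, outside the frame loop
spread = spread_lookup(energy, table)    # repeated for every frame
```

Both paths produce the same spread energies. The table version moves the square roots and powers out of the per-frame loop, at the cost of storing one weight per partition pair.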