# An Efficient Linear-Phase FIR Filter Architecture Design for Wireless Embedded System

Shyh Feng Lin, Sheng-Chieh Huang, Feng-Sung Yang, Chung-Wei Ku, and Liang-Gee Chen

DSP/IC Design Lab, Department of Electrical Engineering and Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

Abstract - This paper presents a novel approach to implementing power-efficient finite impulse response (FIR) filters that consume less power than traditional FIR filter implementations in wireless embedded systems. The proposed schemes are applied to the direct form and reduce its power consumption. A novel re-timed structure and a balanced modularized technique are introduced to shorten the critical path and improve hardware efficiency. A novel separated signed processing data flow scheme with a modified CSD (canonical signed digit) representation is also introduced to reduce signal transitions, which are the main source of power consumption. By combining the proposed methods, namely balanced modularization with re-timing and the separated processing data flow with the modified CSD representation, the proposed structures are shown to achieve up to 71% reduction in power consumption with slight area overhead.

## INTRODUCTION

In many wireless hand-held systems, finite impulse response (FIR) filters are indispensable in image/video communication applications, where they reduce noise and enhance specific features. Given a specification, a dedicated filter is designed to fit the application with the least redundancy. However, previous dedicated filter architectures still have weaknesses. Subexpression sharing [1][2] carries the overhead of complicated routing that resembles a chaotic adder tree. To keep the timing correct, substructure sharing also makes the number of registers grow rapidly.
In addition, the folded architecture [3][4] cannot exploit the advantage of fixed coefficients, and therefore loses that benefit in chip area and power consumption. The direct form and the transposed form [5][6] usually represent the filter coefficients in canonical signed digit (CSD) form to reduce the number of nonzero digits in the constant multipliers. FIRGEN [5] and Laskowski [6] also contributed to eliminating the redundancy of MSB sign extension. However, the structural symmetry of a linear-phase frequency response cannot be exploited in transposed-form filter designs.

This paper presents a novel design method for linear-phase FIR architectures. The proposed architectures are discussed in Section II. With the IS-95 WCDMA filter as an example, the chip implementation, design considerations, and evaluation results are shown in Section III. Finally, a brief conclusion is given.

## FILTER ALGORITHM AND PROPOSED ARCHITECTURE

An FIR filter computes the convolution of the input samples with the impulse response of the system. An FIR filter of length N can be described mathematically by equation (1):

$$y_i = \sum_{k=0}^{N-1} c_k x_{i-k}$$ (1)

where $c_k$ are the coefficients and $x_{i-k}$ is the input signal. The CSD representation is derived from the signed power-of-two (SPT) representation, with the restriction that no two consecutive digits are nonzero. In the proposed approach, the direct form with CSD coefficient representation is taken as the example. Four steps are used to reduce the power consumption. The first step is re-timing. Second, the balanced modularized architecture and the carry-save adder tree are used to reduce circuit transitions. Third, an architecture that minimizes the transitions caused by 2's complement coefficients is proposed. Finally, a modification to the CSD representation resolves the imbalance between positive and negative coefficient digits.

#### A. Re-timed Direct Form Architecture (RDFA)

Pipeline partitioning is useful for shortening the accumulation path. To avoid the additional registers that pipelining requires, the re-timed direct form is a good compromise. In Fig. 1, the filter length is 15, and the critical path contains 1 multiplier and 14 adders. After re-timing, the filter is divided into 5 pipeline groups as in Fig. 2, and the critical path is reduced to 1 multiplier and 3 adders.

Fig. 1. Traditional direct form architecture.

However, this idea is not entirely appropriate for dedicated FIR filters, because the partial products of the constant multiplier differ from stage to stage. Since the multipliers in the stages are not identical, the critical path is unbalanced across the pipeline stages. The Balanced Modularized Architecture proposed in the following section solves this problem.

Fig. 2. Re-timed direct form architecture.

#### B. Balanced Modularized Architecture (BMA)

In deep sub-micron fabrication technologies, routing increasingly dominates the speed. Hence, modularizing the re-timed direct form while keeping the latency invariant is a good choice. Each module is assigned the same number of nonzero digits, rather than the same number of coefficients, as in Fig. 3, to keep the adder tree balanced in the partial-product summation. A carry-save adder tree of the same depth is used in every module. Because the Wallace tree uses 3:2 compression, the number of bits in each bit-plane goes through 9, 6, 4, 3, and 2 from level to level, so 9 inputs require four levels. According to the target specification, a carry-save adder tree of depth 4 was chosen, and the corresponding number of nonzero digits is 9. However, only the first module in the direction of the data flow accommodates 9 inputs, while the others accommodate 7. This is because each subsequent module takes the sum and carry of the previous module as two of its inputs.

Fig. 3. Balanced modularized FIR filter architecture.

The comparison of the summation steps is illustrated in Fig. 4. For the same number of transitions in the summation steps, the proposed implementation uses 16 full adders and 2 half adders, fewer than the 16 full adders and 4 half adders used by Yamazaki's method [7].

#### C. Separated Signed Processing Architecture (SSPA)

The 2's complement number representation causes considerable power consumption in VLSI designs. For example, 0 in a 10-bit 2's complement representation is 0000000000, but -1 is 1111111111. A change of sign therefore toggles many bits, and these transitions consume a large amount of power. To avoid positive-to-negative transitions, the Separated Signed Processing Architecture (SSPA) is proposed. It contains two parts: one separates the negative digits of the coefficients from the positive digits, and the other biases the input data to be positive.

Fig. 4. Comparison of summation steps. (a) The proposed method. (b) Yamazaki's method.

Fig. 5. Separated signed data path approach.

The coefficient $c_k$ is decomposed into a positive part $(c_{kp})$ and a negative part $(c_{kn})$ according to its CSD representation, and one accumulating path is used for each sign, as in Fig. 5. For example, $0.0\bar{1}010010\bar{1}$ is decomposed into $0.000100100$ $(c_{kp})$ and $-0.010000001$ $(c_{kn})$. As a result, the design processes the biased input signal X in two separate data paths, one per sign, without any control logic.

To avoid transitions between positive and negative values caused by the input data, the filter input is biased to a positive number instead of being converted to sign-magnitude representation. The bias is removed at the last stage of the accumulation path. For an 8-bit 2's complement input, the bias is 128, and the biased input lies between 0 and 255 instead of between -128 and 127. The bias is easily applied by inverting the most significant bit.

#### D. Modification to the CSD Representation (MCSD)

Separated signed processing can produce unbalanced modules. Positive and negative digits occur with equal probability, but only as an average statistic. In practice, because of modularization, the distributions of positive and negative CSD digits are not uniform locally, so the representation of some coefficients needs to be modified. The modification to the CSD representation is proposed to solve this problem. The concept is to modify the CSD representation so that the positive and negative parts are balanced while the number of nonzero digits stays the same as before. For example, if the number of positive digits is much less than that of negative digits, then $10\bar{1}$ should be changed into $011$ to increase the number of positive digits while decreasing the number of negative digits. Conversely, when the situation is reversed, $\bar{1}01$ can be changed into $0\bar{1}\bar{1}$ to increase the number of negative digits.

The quantized coefficients of the square-root raised-cosine (SRRC) filter for IS-95 WCDMA systems are listed in Table I, where the depth of the carry-save adder tree is 4 and the maximal number of bits in a bit-plane is 9. In this example, module 2 has 7 positive digits but only 1 negative digit. Thus negative module 2 processes only one nonzero digit but still uses pipeline registers to separate itself from negative module 3. With a slight modification to coefficient 22, the result changes dramatically. The original value of coefficient 22 is $1.0\bar{1}001$, and it is modified to $0.11001$, which keeps the same value and the same number of nonzero digits. The number of modules becomes 3 without affecting the timing, as shown in Table II. As a result, the number of pipeline registers is reduced. Evidently, the modified CSD coefficients result in a structure with high utilization of hardware such as the registers and the adders.

## CHIP IMPLEMENTATION

This section presents an example, the IS-95 WCDMA pulse-shaping FIR filter, which has 33 taps.
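Since the evaluation in this section is reported in transition counts, a small software model can make the metric concrete. The sketch below is not the authors' Verilog-XL flow. It merely counts bit toggles on a hypothetical 16-bit product bus when an input that alternates around zero is multiplied by a positive constant, once in plain 2's complement and once with the input biased by inverting its MSB as described in Section II-C. The bus width, the constant 5, and the sample sequence are assumptions chosen for illustration, and the final bias removal is not modeled.

```python
def to_unsigned(value, width):
    """Two's complement bit pattern of `value` on a `width`-bit bus."""
    return value & ((1 << width) - 1)


def bias_input(sample, width=8):
    """Bias a signed sample to a non-negative one by inverting the MSB.

    For width=8 this maps -128..127 onto 0..255 (equivalent to adding 128).
    """
    return to_unsigned(sample, width) ^ (1 << (width - 1))


def count_toggles(words, width):
    """Total number of bit flips between consecutive words on the bus."""
    patterns = [to_unsigned(w, width) for w in words]
    return sum(bin(a ^ b).count("1") for a, b in zip(patterns, patterns[1:]))


# Hypothetical stimulus: small samples alternating in sign, constant multiplier 5.
samples = [1, -1, 2, -2]
coefficient = 5

signed_products = [coefficient * s for s in samples]
biased_products = [coefficient * bias_input(s) for s in samples]

signed_toggles = count_toggles(signed_products, width=16)
biased_toggles = count_toggles(biased_products, width=16)
print(signed_toggles, biased_toggles)  # 42 18
```

With this stimulus the signed products flip 42 bits while the biased products flip 18, because the sign-extension bits of the product no longer switch when the operand stays non-negative.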
The simulations used the Verilog-XL tool with the ISM nonlinear model. Compared with the linear-phase direct form architecture for IS-95 WCDMA filters, the modularization clearly decreases the transition count, as shown in Table III. When the FIR filter is fed a sequence of randomly generated data, the result is similar. For the IS-95 WCDMA pulse-shaping filter, the proposed architecture reduces the number of transitions to 71.4% of the direct-form count (B/Z = 0.714 in Table III). The chip is fabricated in a 0.6 $\mu m$ SPTM CMOS process. The chip specification and the chip photo are shown in Table IV and Fig. 6.

| Index | CSD Coefficient | Module ID | Number of Digits (Pos) | Number of Digits (Neg) |
|-------|-----------------|-----------|------------------------|------------------------|
| 0 | 0.0000101 | 3 | 2 | 6 |
| 1 | 0.00001001 | 3 | | |
| 2 | 0.00001 | 3 | | |
| 3 | 0.000001 | 3 | | |
| 4 | 0.0000101 | 3 | | |
| 5 | 0.0001 | 2 | 7 | 1 |
| 6 | 0.00101 | 2 | | |
| 7 | 0.00010101 | 2 | | |
| 8 | 0.0000101 | 2 | | |
| 9 | 0.0000101 | 1 | 6 | 7 |
| 10 | 0.0001 | 1 | | |
| 11 | 0.00010101 | 1 | | |
| 12 | 0.0000001 | 1 | | |
| 13 | 0.0010101 | 1 | | |
| 14 | 0.001 | 1 | | |
| 15 | 0.00101 | 1 | | |
| 16 | 0.000001 | 0 | 7 | 9 |
| 17 | 0.001001 | 0 | | |
| 18 | 0.0100101 | 0 | | |
| 19 | 0.001001 | 0 | | |
| 20 | 0.00101 | 0 | | |
| 21 | 0.1001 | 0 | | |
| 22 | 1.01001 | 0 | | |
| 23 | 1 | 0 | | |

Table I. The coefficient grouping of the pulse-shaping filter for IS-95 WCDMA. Digit counts are per module and are listed on the first row of each module.
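The sign bookkeeping behind this grouping can be reproduced with a few lines of code. The following sketch uses a hypothetical ASCII notation in which `n` stands for an overbarred (negative) digit, and it assumes that coefficient 22 is $1.0\bar{1}001$ before modification, since that sign assignment makes its value equal to $0.11001$ as Section II-D requires. The function names are illustrative and are not taken from the paper. The code evaluates signed-digit strings exactly, counts positive and negative digits, and splits a coefficient into the $c_{kp}$ and $c_{kn}$ parts of Section II-C.

```python
from fractions import Fraction

# 'n' is an ad hoc stand-in for an overbarred 1, i.e. the digit -1.
DIGIT = {"0": 0, "1": 1, "n": -1}


def digits(text):
    """Return [(exponent, digit)] for a signed-digit string with a binary point."""
    int_part, _, frac_part = text.partition(".")
    pairs = [(i, DIGIT[ch]) for i, ch in enumerate(reversed(int_part))]
    pairs += [(-i, DIGIT[ch]) for i, ch in enumerate(frac_part, start=1)]
    return pairs


def value(text):
    """Exact value of the signed-digit string as a Fraction."""
    return sum(d * Fraction(2) ** e for e, d in digits(text))


def sign_counts(text):
    """Number of (positive, negative) nonzero digits."""
    ds = [d for _, d in digits(text)]
    return ds.count(1), ds.count(-1)


def decompose(text):
    """Split into unsigned strings (c_kp, c_kn) such that c = c_kp - c_kn."""
    positive = text.replace("n", "0")
    negative = text.replace("1", "0").replace("n", "1")
    return positive, negative


# Coefficient 22 before and after the modification (assumed sign pattern).
before, after = "1.0n001", "0.11001"
print(value(before), value(after))              # 25/32 25/32
print(sign_counts(before), sign_counts(after))  # (2, 1) (3, 0)

# The decomposition example from Section II-C.
c_kp, c_kn = decompose("0.0n010010n")
print(c_kp, c_kn)                               # 0.000100100 0.010000001
```

The modification moves one nonzero digit from the negative path to the positive path without changing the coefficient value or the total number of nonzero digits, which is what allows the regrouping into three modules.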
| Index | CSD Coefficient | Module ID | Number of Digits (Pos) | Number of Digits (Neg) |
|-------|-----------------|-----------|------------------------|------------------------|
| 0 | 0.0000101 | 2 | 7 | 7 |
| 1 | 0.00001001 | 2 | | |
| 2 | 0.00001 | 2 | | |
| 3 | 0.000001 | 2 | | |
| 4 | 0.0000101 | 2 | | |
| 5 | 0.0001 | 2 | | |
| 6 | 0.00101 | 2 | | |
| 7 | 0.00010101 | 2 | | |
| 8 | 0.0000101 | 1 | 7 | 6 |
| 9 | 0.0000101 | 1 | | |
| 10 | 0.0001 | 1 | | |
| 11 | 0.00010101 | 1 | | |
| 12 | 0.0000001 | 1 | | |
| 13 | 0.0010101 | 1 | | |
| 14 | 0.001 | 1 | | |
| 15 | 0.00101 | 0 | 9 | 9 |
| 16 | 0.000001 | 0 | | |
| 17 | 0.001001 | 0 | | |
| 18 | 0.0100101 | 0 | | |
| 19 | 0.001001 | 0 | | |
| 20 | 0.00101 | 0 | | |
| 21 | 0.1001 | 0 | | |
| 22 | 0.11001 | 0 | | |
| 23 | 1 | 0 | | |

Table II. The coefficient grouping of the IS-95 WCDMA pulse-shaping filter after the modification to coefficient 22. Digit counts are per module and are listed on the first row of each module.

Table III. Number of transitions of three cases.

| Case | Depth | Direct Form (Z) | RDFA+BMA (A) | RDFA+BMA+SSPA+MCSD (B) | A/Z | B/A | B/Z |
|------|-------|-----------------|--------------|------------------------|-----|-----|-----|
| Simple channel data | 5 | 9084792 | 7967036 | 7379256 | 0.877 | 0.926 | 0.812 |
| Random input | 5 | 23105843 | 20296336 | 18848949 | 0.878 | 0.929 | 0.816 |
| IS-95 pulse-shaping filter | 4 | 47719669 | | | 0.774 | | 0.714 |
| | 5 | | | | 0.858 | | 0.810 |

Table IV. The chip specification.

| Technology | 0.6 μm CMOS SPTM |
|------------|------------------|
| Area | 3.2 x 3.1 mm<sup>2</sup> |
| Transistor Count | 30,106 |
| Max Operating Frequency | 66.7 MHz |
| Power Consumption | 367.5 mW @ 50 MHz |
| Power Efficiency | 7.35 mW/MHz |
| Supply Voltage | 5 V |

Fig. 6. The chip photo.

## CONCLUSIONS

In this paper, a low-power architecture for dedicated linear-phase FIR filters is proposed.
Four schemes are suggested: the re-timed structure, the balanced modularized architecture, the separated signed processing data flow, and the modification to the CSD representation. According to the experimental results, the proposed signal processing schemes reduce the number of transitions in the accumulation path by about ten to thirty percent and make efficient use of the hardware components. Since FIR filters play an important role in DSP and digital communications, the proposed architecture will be very useful for wireless embedded system design, especially for portable information applications.

## REFERENCES

- [1]. G. Wacey and D. R. Bull, "POFGEN: A Design Automation System for VLSI Digital Filters with Invariant Transfer Function," IEEE International Symposium on Circuits and Systems, ISCAS, vol. 1, 1993, pp. 631-634.
- [2]. Mohammed Abo-Zahhad and Sabah Mohamed Ahmed, "Filter Designer: A Complete Design and Synthesis Program for Lumped, Wave-Digital, FIR and IIR Filters," Proceedings of the Thirteenth National Radio Science Conference, March 19-21, 1996, Cairo, Egypt, pp. C24.1-C24.15.
- [3]. Varun Verma and Charles Chien, "A VHDL based Functional Compiler for Optimum Architecture Generation of FIR Filters," IEEE International Symposium on Circuits and Systems, ISCAS 1996, vol. 4, pp. 564-567.
- [4]. Wolfgang Wilhelm and Tobias G. Noll, "A New Mapping Technique for Automated Design of Highly Efficient Multiplexed FIR Digital Filters," Proceedings of 1997 IEEE International Symposium on Circuits and Systems, ISCAS 1997, vol. 4, pp. 2252-2255.
- [5]. Rajeev Jain, Paul T. Yang, and Toshiaki Yoshino, "FIRGEN: A Computer-Aided Design System for High Performance FIR Filter Integrated Circuits," IEEE Transactions on Signal Processing, vol. 39, no. 7, July 1991, pp. 1655-1668.
- [6]. Laskowski, J. and Samueli, H., "A 150-MHz 43-Tap Half-Band FIR Digital Filter in 1.2-μm CMOS Generated by Silicon Compiler," Proceedings of the Custom Integrated Circuits Conference, 1992, pp. 11.4.1-11.4.4.
- [7]. Takao Yamazaki, Yoshihito Kondo, Sayuri Igota, and Seiichiro Iwase, "'FASTOOL' an FIR Filter Compiler Based on the Automatic Design of the Multi-Input-Adder," IEICE Trans. Fundamentals, vol. E78-A, no. 12, December 1995, pp. 1699-1705.