# A Fast and Power-Saving Self-Timed Manchester Carry-Bypass Adder for Booth Multiplier-Accumulator Design

\*I-Chyn Wey, Hwang-Cherng Chow, You-Gang Chen, and \*An-Yeu Wu

\* Graduate Institute of Electronics Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan

Institute of Semiconductor Technology, Chang Gung University, Kwei-Shan, Tao-Yuan 333, Taiwan

Email: archi@access.ee.ntu.edu.tw, hcchow@mail.cgu.edu.tw, m9128201@stmail.cgu.edu.tw, andywu@cc.ee.ntu.edu.tw

## Abstract

In this paper, a fast and power-saving self-timed Manchester carry-bypass adder (MCBA) is proposed, based on an analysis of the properties of the MCBA completion signal. With the new self-timed approach, the MCBA is removed from the critical path of the summation array of a multiplier-accumulator (MAC) without the conventional dual MCBA chain circuit. As a result, the proposed self-timed MCBA is 23.3% faster and consumes 56.8% less power than the self-timed MCBA with completion signal generation. Finally, a 16-bit × 16-bit + 40-bit Booth MAC using the new self-timed MCBA is demonstrated at 2.5 V and 150 MHz in the UMC 0.25 µm process, consuming only 71.28 mW.

## 1. Introduction

A digital MAC is one of the fundamental kernels of digital signal processors (DSPs) [1]. Developing a high-speed MAC is crucial for real-time DSP applications. Moreover, low-power MAC design is even more challenging for portable applications and consumer electronics [2].

To meet the speed requirement of about 100 MHz for a 16-bit × 16-bit + 40-bit MAC in DSP applications, a glitch-free modified Booth encoding scheme [3] is adopted to reduce the number of additions in the multiplication. In addition, the Wallace tree architecture [4] is applied to speed up the summation of the partial product array, and the fast adders are built as MCBAs [5] to speed up carry propagation in the two-operand adders. In a synchronous MAC without pipelining, a self-timed circuit design technique must be adopted for the Manchester carry chain. Self-timed techniques are considered power saving [6]; however, generating the completion signal is very complex and requires a dual-chain design, which leads to area and power penalties [7,9,10]. Therefore, we propose a new self-timed MCBA design without the dual chain, which is about 23.3% faster and saves 56.8% of the power.

## 2. Proposed MAC architecture and circuit design

For the low-power Booth MAC, we adopt a high-speed, glitch-free modified Booth encoding (MBE) scheme [3] for partial product generation to increase speed and avoid glitch power dissipation. The partial product summation array is built as a full-adder-based Wallace tree, as illustrated in Figure 1. In this way, the summation is accelerated by parallel addition, and power is saved because fewer adders are needed than in a carry-save architecture. In a synchronous non-pipelined MAC, the fast adder always lies in the critical path, whether in the partial product summation array, in the final-stage adder, or in the accumulator. To meet the low-power demand, the fast adder is built as a self-timed MCBA instead of a carry-lookahead adder (CLA); this saves considerable power because the parallel implementation of the fast adder is replaced by a serially chained architecture. All 1-bit adder cells are static mirror adders. The proposed low-power MAC has no static power loss, and all sub-cells are designed in a fully hierarchical manner.
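To make the reduction in partial products concrete, the following is a minimal behavioural sketch of radix-4 modified Booth recoding. It is not the glitch-free circuit of [3]; the function names `booth_radix4_digits` and `booth_multiply` are hypothetical and chosen only for illustration. It shows that a 16-bit two's-complement multiplier is recoded into 8 digits in {-2, -1, 0, 1, 2}, so only 8 partial products need to be summed.

```python
def booth_radix4_digits(y: int, bits: int = 16) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits (LSB digit first)."""
    if bits % 2:
        raise ValueError("bits must be even")
    y &= (1 << bits) - 1              # keep the two's-complement bit pattern
    prev = 0                          # implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)   # digit value in {-2, -1, 0, 1, 2}
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 16) -> int:
    """Sum the Booth partial products x * d_k * 4**k (behavioural model only)."""
    return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4_digits(y, bits)))


if __name__ == "__main__":
    print(len(booth_radix4_digits(-567)))   # number of partial products for 16 bits
    print(booth_multiply(1234, -567) == 1234 * -567)
```

In hardware, each digit selects 0, ±X, or ±2X as a partial product, and these rows are then compressed by the Wallace tree.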





## 3. New self-timed Manchester carry-bypass adder

In a synchronous MAC without pipelining, self-timed circuit techniques must be adopted in the design of the Manchester carry chain. Self-timed techniques are considered power saving; however, the conventional clock-delayed self-timed design [7] is slow. Moreover, it needs an extra MCBA-delay circuit, which costs additional power and area. This delay circuit can be built either from a simple inverter-based delay buffer or from a copy of the original MCBA circuit. The former saves power and area, while the latter is more reliable because the MCBA delay varies with process deviation. Another well-known self-timed approach gains speed by generating a completion signal [7,9,10]. In this case, the clock delay of each stage is no longer the worst-case MCBA delay but the average MCBA delay. The self-timed MCBA is therefore faster; however, the critical path remains the same. Moreover, generating the completion signal of a Manchester carry chain is very complex and requires a dual-chain design, which leads to area and power penalties [7,9,10]. An example of a 6-bit dual-chain self-timed Manchester adder is shown in Figure 2. In Figure 2, one signal of every (Ci, CBi) pair must become "1" during the evaluation period; once this holds for all pairs, the "Done" signal is sent to the next stage to tell it to start computing. This method incurs only a small speed penalty; however, the area and power penalties are quite serious.



Figure 2: A 6-bit dual-chain self-timed Manchester adder
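The dual-chain completion detection described above can be sketched behaviourally as follows. This is only an illustrative model under simplifying assumptions, not the circuit of Figure 2: both rails are assumed to be precharged low, one loop iteration stands for one stage of carry ripple, and the names `done` and `dual_rail_steps` are hypothetical.

```python
def done(c, cb):
    """Done is asserted once every (Ci, CBi) pair has one rail at 1."""
    return all(ci or cbi for ci, cbi in zip(c, cb))


def dual_rail_steps(a: int, b: int, bits: int = 6, cin: int = 0):
    """Ripple a true carry rail and a complement rail; return (steps until Done, carries)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(bits)]          # generate
    k = [1 - (((a >> i) | (b >> i)) & 1) for i in range(bits)]  # kill
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(bits)]        # propagate
    c = [0] * bits    # true rail: carry out of bit i, assumed precharged low
    cb = [0] * bits   # complement rail, assumed precharged low
    steps = 0
    while not done(c, cb):
        steps += 1
        prev_c = [cin] + c[:-1]          # carry entering each bit position
        prev_cb = [1 - cin] + cb[:-1]
        c = [g[i] | (p[i] & prev_c[i]) for i in range(bits)]
        cb = [k[i] | (p[i] & prev_cb[i]) for i in range(bits)]
    return steps, c


if __name__ == "__main__":
    print(dual_rail_steps(0b111111, 0b000001))  # long ripple: Done arrives late
    print(dual_rail_steps(0b000000, 0b000000))  # every bit kills: Done arrives early
```

The model makes the trade-off visible: Done follows the actual carry ripple of the operands rather than the worst case, but it needs a second (complement) chain plus the detection logic.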

In this paper, we propose a new self-timed approach to the design of the Manchester carry chain that overcomes the speed bottleneck in the critical path of the summation array and no longer needs the dual chain to generate the completion signal. First, consider the characteristics of dynamic circuits. During the evaluation period of a dynamic circuit whose function is implemented by an NMOS network, an input may change from logic "0" to logic "1", but a change from logic "1" to logic "0" is not allowed. In the Manchester carry chain, each carry-out node either stays at logic "0" or changes from logic "0" to logic "1"; the carry-out signals therefore have the property required of a completion signal sent to the next dynamic stage.

However, in the actual design of the Manchester carry-bypass adder, the carry-out of the present stage is not directly connected to the NMOS gates of the dynamic circuit in the next stage. Hence we must further ensure that the signals sent to the next dynamic stage still either remain stable or change only from logic "0" to logic "1". First, the "generate" signal $g_0$, produced by an AND gate, also maintains this property, since the carry-out signals of the Manchester carry chain arrive at the dynamic gate later and a carry-out signal never stays at logic "1" throughout a whole clock cycle. Second, the "propagate" signal $p_0$ is not connected to the dynamic gate. These observations show that the carry-out signals of the Manchester carry chain can be sent directly to the next stage. A delay module built from a full adder can then generate the asynchronous clock that serves as the completion signal. The operation timing of the proposed self-timed MCBA is illustrated in Figure 3.



Figure 3: The operation timing in our proposed self-timed MCBA
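The input-transition rule for dynamic circuits used in the argument above can be illustrated with a small behavioural model. This is a sketch under idealised assumptions (no leakage or charge sharing); the class `DynamicNmosGate` and the helper `evaluation_is_correct` are hypothetical names introduced only for this example.

```python
class DynamicNmosGate:
    """Idealised precharged dynamic gate with an NMOS pull-down network."""

    def __init__(self, pulldown):
        self.pulldown = pulldown   # function(inputs) -> truthy when the network conducts
        self.node = 1

    def precharge(self):
        self.node = 1              # output node charged high

    def evaluate(self, inputs):
        if self.pulldown(inputs):
            self.node = 0          # discharged; cannot recover until the next precharge
        return self.node


def evaluation_is_correct(gate, input_sequence):
    """Apply inputs in order during one evaluation period and compare the
    final node value with the ideal value for the final inputs."""
    gate.precharge()
    for inputs in input_sequence:
        gate.evaluate(inputs)
    expected = 0 if gate.pulldown(input_sequence[-1]) else 1
    return gate.node == expected


if __name__ == "__main__":
    nand = DynamicNmosGate(lambda x: x[0] and x[1])       # two series NMOS devices
    print(evaluation_is_correct(nand, [(0, 1), (1, 1)]))  # input rises 0 -> 1
    print(evaluation_is_correct(nand, [(1, 1), (0, 1)]))  # input falls 1 -> 0
```

A rising input merely delays the discharge, so the final value is still right; a falling input leaves the node wrongly discharged. This is why only monotonically rising signals, such as the carry-out nodes of the Manchester chain, are safe to pass to the next dynamic stage during evaluation.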

In this way, the MCBA is removed from the critical path of the summation array, and the dual chain used to generate the completion signal is no longer needed. The proposed self-timed MCBA is therefore faster and consumes much less power. Moreover, since the MCBA does not lie in the critical path of the summation array, the transistors of the self-timed MCBA can be sized smaller, which reduces the power consumption further. The complete proposed 8-bit self-timed MCBA is presented in Figure 4. With the proposed self-timed Manchester adder, the critical path in the summation array of the MAC lies in the full adder rather than in the fast adder.



Figure 4: The proposed 8-bit self-timed MCBA
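For readers less familiar with carry-bypass addition, the following functional sketch shows the logic an 8-bit carry-bypass adder computes. It is not the transistor-level design of Figure 4; in particular, the block size of 4 bits and the name `carry_bypass_add` are assumptions made for illustration.

```python
def carry_bypass_add(a: int, b: int, cin: int = 0, bits: int = 8, block: int = 4):
    """Add two `bits`-wide operands block by block; return (sum, carry_out)."""
    total = 0
    carry = cin
    for lo in range(0, bits, block):
        block_cin = carry
        all_propagate = 1
        for i in range(lo, min(lo + block, bits)):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p, g = ai ^ bi, ai & bi
            total |= (p ^ carry) << i      # sum bit
            carry = g | (p & carry)        # carry recurrence of the chain
            all_propagate &= p
        if all_propagate:
            # Bypass: the block carry-in is forwarded directly. Its value equals
            # the rippled carry; in hardware only the delay differs.
            carry = block_cin
    return total, carry


if __name__ == "__main__":
    print(carry_bypass_add(0b00001111, 0b00000001))
    print(carry_bypass_add(0xFF, 0x00, cin=1))   # both blocks bypass the carry-in
```

The bypass never changes the result; it only shortens the path a carry must travel when a whole block propagates.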

## 4. Performance comparisons and simulation results

A detailed comparison of the self-timed approaches is illustrated in Figure 5. In the conventional self-timed approach with a delayed clock, the delay of the asynchronous clock is set to the worst-case MCBA delay to guarantee correct function. The speed is therefore quite low, and the delay of the asynchronous clock varies with process deviation. The power consumption and area also increase because of the extra MCBA-delay circuit. In the conventional self-timed approach with completion signal generation [7,9,10], the asynchronous control signal is produced by the MCBA completion signal generation circuit. The speed is consequently moderate, corresponding to the average delay of the MCBA circuit. With this approach, the power consumption and area increase greatly because a dual MCBA chain is needed to generate the completion signal. Moreover, in both conventional approaches the MCBA lies in the critical path of the summation array. In the proposed self-timed approach, the delay of the asynchronous clock is the worst-case delay of the full adder, which is tolerant to process deviation. The speed is therefore high, and the proposed self-timed MCBA does not lie in the critical path of the summation array; the critical path now lies in the full adder. Moreover, the power consumption and area are greatly reduced because the full adder delay is quite small and the MCBA circuit does not lie in the critical path.



Figure 5: The various self-timed approaches

(a) conventional self-timed MCBA with delayed clock (b) conventional self-timed MCBA with completion signal generation (c) proposed new self-timed approach
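The gap between a worst-case and an average carry-chain delay, which separates approaches (a) and (b), can be illustrated by counting how far a carry actually ripples for each operand pair. The sketch below is only a rough proxy: it models a plain ripple chain with carry-in 0, ignores the bypass path and all circuit-level delays, and uses hypothetical function names.

```python
def longest_carry_chain(a: int, b: int, bits: int = 8) -> int:
    """Longest carry ripple in bit positions: a generate plus the propagates it crosses."""
    longest = 0
    run = 0                               # length of the currently active ripple
    for i in range(bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai & bi:                       # generate: a new carry starts here
            run = 1
        elif ai ^ bi:                     # propagate: extends a ripple only if one arrives
            run = run + 1 if run else 0
        else:                             # kill: an incoming carry stops here
            run = 0
        longest = max(longest, run)
    return longest


def chain_statistics(bits: int = 8):
    """Average and worst-case ripple length over all operand pairs."""
    lengths = [longest_carry_chain(a, b, bits)
               for a in range(1 << bits) for b in range(1 << bits)]
    return sum(lengths) / len(lengths), max(lengths)


if __name__ == "__main__":
    average, worst = chain_statistics(8)
    print(f"average ripple length: {average:.2f} bits, worst case: {worst} bits")
```

A delayed-clock design must always wait for the worst case, whereas a completion-signal design waits only as long as the operands at hand require, which on average is noticeably shorter.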



Figure 6: The operation timing analysis and comparisons of various self-timed approaches

The operation timing of the self-timed approaches is compared in Figure 6. In the proposed self-timed MCBA, the evaluation period of each stage can start as soon as the output signal of the full adder is stable. In the conventional self-timed approach with completion signal generation, the start of the evaluation period is controlled by the MCBA completion signal. In the conventional self-timed approach with a delayed clock, the evaluation period of each stage can start only once the output signal of the MCBA in the previous stage is stable. As illustrated in Figure 6, the proposed design is the fastest, and the conventional self-timed approach with a delayed clock is the slowest.

Transistor-level simulations are performed with HSPICE in the UMC 0.25 µm process at a 2.5 V supply voltage and a 150 MHz clock frequency with 1 ns rise/fall time.

Table 1 compares the high-speed mux-based CLA [8]; the self-timed MCBA with an inverter-based delayed clock [7], denoted [7-1]; the self-timed MCBA with an MCBA-based delayed clock [7], denoted [7-2]; the self-timed MCBA with completion signal generation [7,9,10], denoted [7-3]; and the proposed self-timed MCBA. As Table 1 shows, the new design has the smallest transistor count, only 344, and removing the self-timed MCBA from the speed bottleneck of the partial product summation array yields speed improvements of 13.8%, 21.3%, 29.1%, and 23.3% over [8], [7-1], [7-2], and [7-3], respectively. The proposed self-timed MCBA also has good tolerance to process deviation. In terms of power, the proposed design consumes the least, 1.83 mW, an improvement of 73.2%, 30.4%, 60.4%, and 56.8% over [8], [7-1], [7-2], and [7-3], respectively. The new design also has the smallest power-delay product (PDP), with improvements of 77.1%, 45.5%, 72.1%, and 67.1% over [8], [7-1], [7-2], and [7-3], respectively.

|                             | New   | CLA[8]    | [7-1] | [7-2] | [7-3]     |
|-----------------------------|-------|-----------|-------|-------|-----------|
| Transistor count            | 344   | 416       | 468   | 676   | 650       |
| T_critical (ns)             | 0.56  | 0.65      | 0.71  | 0.79  | 0.73      |
| Speed improvement           |       | 13.8%     | 21.3% | 29.1% | 23.3%     |
| Process deviation tolerance | good  | very good | worst | worse | very good |
| Power (mW)                  | 1.83  | 6.84      | 2.63  | 4.62  | 4.24      |
| Energy (µW/MHz)             | 12.20 | 45.60     | 17.53 | 30.80 | 28.27     |
| Power improvement           |       | 73.2%     | 30.4% | 60.4% | 56.8%     |
| PDP (pJ)                    | 1.02  | 4.45      | 1.87  | 3.65  | 3.10      |
| PDP improvement             |       | 77.1%     | 45.5% | 72.1% | 67.1%     |

Table 1: Comparisons of various 8-bit self-timed fast adders
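The derived rows of Table 1 can be recomputed from the delay and power rows. The short script below is a sketch of that arithmetic, assuming improvement is defined as (reference − new) / reference, energy as power divided by the 150 MHz clock frequency, and PDP as delay times power; the helper names are hypothetical. Because the tabulated values are rounded, a recomputed percentage may differ from the printed one in the last digit.

```python
CLOCK_MHZ = 150.0  # simulation clock frequency stated in the text

# (critical delay in ns, power in mW), copied from Table 1
TABLE1 = {
    "New":    (0.56, 1.83),
    "CLA[8]": (0.65, 6.84),
    "[7-1]":  (0.71, 2.63),
    "[7-2]":  (0.79, 4.62),
    "[7-3]":  (0.73, 4.24),
}


def improvement(reference: float, new: float) -> float:
    """Percentage reduction of `new` relative to `reference`."""
    return 100.0 * (reference - new) / reference


def energy_uw_per_mhz(power_mw: float) -> float:
    """Power normalised to the clock frequency, in uW/MHz."""
    return 1000.0 * power_mw / CLOCK_MHZ


def pdp_pj(delay_ns: float, power_mw: float) -> float:
    """Power-delay product; ns * mW = pJ."""
    return delay_ns * power_mw


if __name__ == "__main__":
    t_new, p_new = TABLE1["New"]
    for name, (t, p) in TABLE1.items():
        line = f"{name:7s} E={energy_uw_per_mhz(p):6.2f} uW/MHz  PDP={pdp_pj(t, p):5.2f} pJ"
        if name != "New":
            line += (f"  speed {improvement(t, t_new):.1f}%"
                     f"  power {improvement(p, p_new):.1f}%")
        print(line)
```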





Figure 7 compares the performance of several MAC designs, where MAC\_P denotes the MAC with the proposed self-timed MCBA, MAC\_CLA the MAC with a carry-lookahead adder, and MAC\_[7-3] the MAC with the self-timed MCBA of [7-3]. In Figure 7, at nearly the same speed, the MAC with the proposed self-timed MCBA consumes the least power, only 71.28 mW, which is 51.12 mW and 25.98 mW less than MAC\_CLA and MAC\_[7-3], respectively.
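As a rough cross-check, and assuming the quoted savings are absolute power differences at the same operating point, the implied power of the two reference designs would be

$$
P_{\mathrm{MAC\_CLA}} \approx 71.28 + 51.12 = 122.40\ \mathrm{mW}, \qquad
P_{\mathrm{MAC\_[7\text{-}3]}} \approx 71.28 + 25.98 = 97.26\ \mathrm{mW},
$$

which corresponds to relative savings of about $51.12 / 122.40 \approx 41.8\%$ and $25.98 / 97.26 \approx 26.7\%$. These totals and percentages are inferred from the stated differences and are not reported values.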



Figure 8: Waveform of the proposed low power Booth MAC at 2.5V, 150MHz

The function of the MAC is verified by gate-level simulation. The timing is further verified by HSPICE simulation in the UMC 0.25 µm process, and the 40-bit MAC output waveforms, denoted macpp<0> to macpp<39>, are illustrated in Figure 8.

Table 2: Summary of performance results

| Parameter           | Value       |
|---------------------|-------------|
| Process             | UMC 0.25 µm |
| Supply voltage      | 2.5 V       |
| Transistor count    | 10468       |
| Operating frequency | 150 MHz     |
| Average power       | 71.28 mW    |

Finally, Table 2 summarizes the performance of the proposed low-power Booth MAC with the new self-timed MCBA. The total transistor count of the proposed MAC is 10468, and the average power consumption is only 71.28 mW when operating at 2.5 V and 150 MHz with an input pattern of 25.78% transition probability.
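For orientation, the figures in Table 2 imply an average energy per clock cycle of

$$
E_{\mathrm{cycle}} = \frac{P_{\mathrm{avg}}}{f_{\mathrm{clk}}} = \frac{71.28\ \mathrm{mW}}{150\ \mathrm{MHz}} \approx 0.475\ \mathrm{nJ} = 475.2\ \mathrm{pJ}.
$$

This is a derived value, assuming the average power is measured at the stated clock frequency and input pattern; it is not a separately reported result.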

## 5. Conclusion

In this paper, a fast and power-saving self-timed MCBA without a dual chain is proposed. The new self-timed MCBA is based on an analysis of the properties of the completion signal and input signals of the MCBA. With the new self-timed approach, the MCBA is removed from the critical path of the summation array of the MAC. As a result, the delay of the proposed self-timed MCBA is only 0.56 ns, a 23.3% improvement, and its power consumption is only 1.83 mW, a 56.8% improvement, compared with the self-timed MCBA with completion signal generation. Finally, a 16-bit × 16-bit + 40-bit Booth MAC using the new self-timed MCBA is demonstrated at 2.5 V and 150 MHz in the UMC 0.25 µm process, consuming only 71.28 mW. The proposed self-timed design for the Booth MAC is therefore well suited to low-power, high-speed DSP applications.

## References

- [1] D. A. Pucknell and K. Eshraghian, "Basic VLSI Design," Prentice Hall, pp. 240-253, 1994.
- [2] A. M. Shams, W. M. Badawy, and M. A. Bayoumi, "An Enhanced Low-Power Computational Kernel for a Pipelined Multiplier-Accumulator Unit," Proc. of the 10th International Conference on Microelectronics, pp. 33-36, 1998.
- [3] H. C. Chow and I. C. Wey, "A 3.3V 1GHz high speed pipelined Booth multiplier," Proc. IEEE International Symp. on Circuits and Systems, vol. 1, pp. 457-460, 2002.
- [4] C. S. Wallace, "A Suggestion for Fast Multiplier," IEEE Trans. on Electronic Computers, vol. EC-13, pp. 14-17, February 1964.
- [5] H. C. Chow and I. C. Wey, "A 3.3V 1GHz Low-Latency Pipelined Booth Multiplier with New Manchester Carry-Bypass Adder," Proc. IEEE International Symp. on Circuits and Systems, vol. 5, pp. 121-124, 2003.
- [6] C. J. Jou and I. Y. Chuang, "Low-Power Globally Asynchronous Locally Synchronous Design Using Self-Timed Circuit Technology," Proc. IEEE International Symp. on Circuits and Systems, vol. 3, pp. 1808-1811, 1997.
- [7] J. M. Rabaey, "Digital Integrated Circuits," Prentice Hall, second edition, 2003.
- [8] K. K. Parhi, "VLSI Digital Signal Processing Systems: Design and Implementation," Wiley-Interscience, 1999.
- [9] J. H. Won and K. Choi, "Self-timed statistical carry lookahead adder using multiple-output DCVSL," Proc. of the 6th International Conference on VLSI and CAD, pp. 560-563, 1999.
- [10] G. A. Ruiz and M.A. Manzano, "Compact 32-bit CMOS Adder in multiple-output DCVS logic for self-timed circuits," IEE Proc. Circuits, Devices and Systems, pp. 183-188, 1999.