66M/70mW HS and Ultra-Low Power 16X16 MAC Design Using TG for Web-Based Multimedia System
Seung-Min Lee, Jin-Hong Chung, Hyung-Seok Yoon and Mike Myung-Ok Lee
Department of Information and Communication Engineering, Dongshin University, 252 Daeho-Dong, Naju, Chonnam, 520-714, Korea
(Tel) +82-613-330-3195 (Fax) +82-613-330-2909 (E-mail) {mikelee,idec_du}@dongshinu.ac.kr

Abstract
This paper presents a study of a High Speed (HS), 79mW Low Power (LP) 16X16 MAC built from XOR-based circuits using transmission gate (TG) logic and implemented in 0.6um CMOS DLP/DLM technology. The proposed MAC performs better than other published MACs for two reasons: the absence of DC leakage currents lowers power, and latches placed before and after the multiplier bypass unnecessary switching activity, which raises speed.

A 4:2 CSA (Carry Select Adder) is used in the Wallace tree to generate HS carries and sums [8]. Fig. 2 shows the block diagram of the designed circuit, with its designated multiplicands and multipliers and the HS 16-bit CSA blocks that execute the arithmetic additions and multiplications.
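The carry-select idea behind the 16-bit adder blocks can be illustrated with a small behavioral model. The sketch below is not the authors' circuit: it assumes the 16-bit adder is split into two 8-bit halves, that the upper half is computed for both possible carry-in values, and that a 2:1 multiplexer picks the correct result. The function names `ripple_add8` and `carry_select_add16` are hypothetical.

```python
MASK8 = 0xFF
MASK16 = 0xFFFF


def ripple_add8(a, b, cin):
    """Add two 8-bit values plus a carry-in; return (8-bit sum, carry-out)."""
    total = (a & MASK8) + (b & MASK8) + (cin & 1)
    return total & MASK8, total >> 8


def carry_select_add16(a, b, cin=0):
    """Behavioral model of a 16-bit carry-select adder (assumed 8+8 split)."""
    a &= MASK16
    b &= MASK16
    # The lower half is added directly with the real carry-in.
    lo_sum, lo_carry = ripple_add8(a, b, cin)
    # The upper half is computed twice, once for each possible carry-in,
    # so it does not have to wait for the lower half to finish.
    hi0, c0 = ripple_add8(a >> 8, b >> 8, 0)
    hi1, c1 = ripple_add8(a >> 8, b >> 8, 1)
    # 2:1 multiplexer: the lower-half carry selects the valid upper result.
    hi_sum, cout = (hi1, c1) if lo_carry else (hi0, c0)
    return (hi_sum << 8) | lo_sum, cout


if __name__ == "__main__":
    result, carry = carry_select_add16(0x00FF, 0x0001)
    print(hex(result), carry)
```

In hardware the speed gain comes from evaluating both upper-half candidates in parallel with the lower half; the Python model only reproduces the selection logic, not the timing.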
[Figure input labels: Multiplicand, Multiplier; In1(15:8), In1(7:0), In2(15:8), In2(7:0)]
I. INTRODUCTION
Real-time multimedia systems for speech, video, and image processing have become indispensable with the increasing use of portable systems such as cellular phones, personal communications services, and notebook computers. The IC cores most frequently used in multimedia communication systems are the RISC core and the DSP core. Embedded higher-bit microprocessors have been widely used in conventional computer systems [1], but they cannot satisfy new application areas such as real-time MPEG and Web-based Internet applications. Likewise, conventional 16-bit or 24-bit fixed-point DSPs were developed for signal processing only; they are inadequate for image processing and cannot operate in real time because of the large quantities of data that image and multimedia processing require [2]. An HS MAC for the multimedia DSP core or RISC core must therefore be developed. The MAC should provide both fast multiplication and area-efficient hardware. In this study, the multiplier mainly uses the radix-4 Booth algorithm, and XOR-based circuits using TG [3][4] are partially used for the ALU, adder, Booth encoder, mux, and CSA to meet the HS and LP targets [3][4][5][6].
[Figure block labels: CS 16-bit adders, 8-bit CS Adder1, 8-bit CS Adder2, 2X1 multiplexer, multiplier output Out(16:8) and Out(7:0); panels (a) and (b)]
In this study, the fundamental design procedure is as follows. For the front-end design, Verilog HDL code is written, synthesized and logic-simulated with Synopsys, and timing-simulated with Cadence Verilog-XL. For the back-end design, Cadence Composer and the Virtuoso layout tool are used. After gate-level synthesis from the high-level behavioral and/or structural RTL HDL code, the basic schematics are optimized according to our algorithmic design approach. The overall design flow with the corresponding CAD tools is shown in Fig. 3 [9], and Hyundai Electronics was chosen as the final fabrication site.
[Figure block labels: latches on the 16-bit data paths, modified Booth encoder, Booth selector with 8 partial products (16 bits), multiplier; panels (a) and (b)]
Fig. 1. (a) HS/LP MAC diagram, (b) 16X16 radix-4 Booth multiplier used in the MAC.
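To make the radix-4 Booth multiplier of Fig. 1(b) more concrete, the following sketch recodes a 16-bit two's-complement multiplier into 8 signed digits and sums the resulting 8 partial products. It is a behavioral illustration under stated assumptions, not the authors' encoder or selector circuit: operands are assumed to be signed 16-bit values, partial products are kept as unbounded Python integers rather than fixed-width words, and the function names are hypothetical.

```python
def to_signed(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def booth_radix4_digits(multiplier, bits=16):
    """Recode a `bits`-wide multiplier into bits/2 digits in {-2,-1,0,1,2}."""
    y = (multiplier & ((1 << bits) - 1)) << 1  # append the implicit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triplet = (y >> i) & 0b111          # overlapping group of three bits
        b_prev = triplet & 1                # y[i-1]
        b0 = (triplet >> 1) & 1             # y[i]
        b1 = (triplet >> 2) & 1             # y[i+1]
        digits.append(-2 * b1 + b0 + b_prev)
    return digits


def booth_multiply(multiplicand, multiplier, bits=16):
    """Return (product, number of partial products) for signed operands."""
    x = to_signed(multiplicand, bits)
    digits = booth_radix4_digits(multiplier, bits)
    # Each digit selects 0, +/-x or +/-2x, weighted by 4**k.
    partial_products = [(d * x) << (2 * k) for k, d in enumerate(digits)]
    return sum(partial_products), len(partial_products)


if __name__ == "__main__":
    product, count = booth_multiply(3, 5)
    print(product, count)
```

Because each digit covers two multiplier bits, a 16-bit multiplier yields 8 partial products instead of 16, which is what shortens the adder tree that follows the Booth selector.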
As mentioned above, the TG-based circuits in the MAC, such as the adder and the other blocks, were all laid out and verified, as shown in Fig. 4.
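As a rough illustration of how an XOR-based adder can be built from transmission gates, the sketch below models a TG as a 2:1 selector and composes a one-bit full adder from two XORs and a carry multiplexer. This is a commonly used TG adder formulation and is assumed here for illustration only; it is not claimed to match the schematic in Fig. 4(b). The names `tg_mux`, `tg_xor` and `tg_full_adder` are hypothetical.

```python
def tg_mux(select, when0, when1):
    """Model a pair of transmission gates as a 2:1 selector."""
    return when1 if select else when0


def tg_xor(a, b):
    """XOR as a TG selector: pass a when b is 0, pass NOT a when b is 1."""
    return tg_mux(b, a, 1 - a)


def tg_full_adder(a, b, cin):
    """One-bit full adder built from two XORs and one carry selector."""
    p = tg_xor(a, b)            # propagate signal
    s = tg_xor(p, cin)          # sum bit
    cout = tg_mux(p, a, cin)    # p = 1: carry follows cin; p = 0: carry is a
    return s, cout


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, tg_full_adder(a, b, cin))
```

When the propagate signal is 0 the two inputs are equal, so either one can be passed to the carry output; this is why a single selector is enough for the carry.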
[Fig. 6, top: performance comparison table. Recoverable labels: Process, This work, DSP Core, Ref. [1], Operations, # of Tr., Freq [MHz], Remark. Recoverable values: 18.8K, 28.5K, 66, 80, 50, 2.35; cell assignment not recoverable.]
Fig. 6. Top: performance comparison with our work; bottom: timing simulation results for the MAC using the EPIC tool.
Fig. 4. (a) Various TG layouts for the MAC, (b) an example HS/LP adder schematic using the TGs, (c) adder layout with extracted view (see the bottom for the detailed MOSFETs).
Fig. 5. Synthesized result for the MAC using Synopsys.

V. CONCLUSION
A 16X16 MAC built from XOR-based circuits using transmission gate (TG) logic and implemented in 0.6um CMOS DLP/DLM technology achieves High Speed (HS) operation at a maximum of 100MHz and Low Power (LP) consumption of 79mW at 3.3V. The proposed MAC performs better than other published MACs because the absence of DC leakage currents lowers power and because latches before and after the multiplier bypass unnecessary switching activity, which raises speed.
REFERENCES
[1] H. Murakami and Naoka Yano, "A Multiplier-Accumulator Macro for a 45 MIPS Embedded RISC Processor," IEEE JSSC, Vol. 31, No. 7, pp. 1067-1071, 1996.
[2] S.H. Yoon and M.H. Sunwoo, "Design and Implementation of a DSP Chip for Portable Multimedia Applications," IEEK, Vol. 35C, No. 12, pp. 31-39, 1998.
[3] Y. Ye, K. Roy and R. Drechsler, "Power Consumption in XOR-Based Circuits," ASP-DAC'99, pp. 299-302, 1999.
[4] Wang, S. Fang and W. Feng, "New Efficient Designs for XOR and XNOR Functions on the Transistor Level," IEEE Journal of Solid-State Circuits, Vol. 29, No. 7, July 1994.
[5] Y.S. Kwon, et al., "A New Single-Clock Flip-Flop for Half-Swing Clocking," ASP-DAC'99, pp. 117-120, 1999.
[6] A. Chandrakasan, Design, Kluwer Academic Pub., 1995.
[7] M.S. Kim and T.W. Cho, "Design for Low Power High Speed 8-bit ELM Adder using Hybrid Logic," 4th MPW, IDEC, Daejon, Jan. 1999.
[8] C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. Electronic Computers, Vol. EC-13, pp. 14-17, Feb. 1964.
[9] Cadence, EPIC, Synopsys, HSPICE Users Manuals, Technical Reports, 1998.
Table 1. Implemented 34-type primitive cells and 5 PAD cells

Group          Function          Count
Combinational  NMOS              1
               PMOS              1
               NAND              4
               NOR               4
               XOR/XNOR          2
               AOI               1
               OAI               1
               Inverter          1
               Buffer            1
               Tri-State Buf.    1
Sequential     D-F/F             1

[Remaining entries as extracted; function names not recoverable: groups Sequential, Special, 5-type; counts 1 1 1 1 1 1 1 9 5 39]