66M/70mW HS and Ultra-Low Power 16X16 MAC Design Using TG for Web-Based Multimedia System



Seung-Min Lee, Jin-Hong Chung, Hyung-Seok Yoon and Mike Myung-Ok Lee
Department of Information and Communication Engineering, Dongshin University, 252 Daeho-Dong, Naju, Chonnam, 520-714, Korea
(Tel) +82-613-330-3195 (Fax) +82-613-330-2909 (E-mail) {mikelee,idec_du}@dongshinu.ac.kr

Abstract
This paper presents a study of a high-speed (HS), 79 mW low-power (LP) 16X16 MAC built from XOR-based circuits using transmission gate (TG) logic and implemented in a 0.6 um CMOS DLP/DLM technology. The proposed MAC performs better than other published MACs for two reasons: the absence of DC leakage currents keeps the power low, and latches placed before and after the multiplier bypass unnecessary switching activity for high speed.

The 4:2 CSA (Carry Select Adder) used in the Wallace tree is implemented to generate HS carries and sums [8]. The block diagram of the designed circuit, with the designated multiplicands and multipliers and the HS 16-bit CSA blocks that execute the arithmetic additions and multiplications, is shown in Fig. 2.
[Figure labels: Multiplicand, Multiplier; In1(15:8), In1(7:0), In2(15:8), In2(7:0); Radix-4 Booth's Encoder]

I. INTRODUCTION
Real-time multimedia systems for speech, video, and image processing have become a necessity with the increasing use of portable systems such as cellular phones, personal communications services, and notebook computers. The IC cores most frequently used in multimedia communication systems are RISC cores and DSP cores. Embedded higher-bit microprocessors have been widely used in conventional computer systems [1], but they cannot satisfy new application areas such as real-time MPEG and/or Web-based Internet applications. Likewise, conventional 16-bit or 24-bit fixed-point DSPs were developed for signal processing only; they are inadequate for image processing and cannot operate in real time because of the large quantities of data required for image and multimedia processing [2]. An HS MAC for the multimedia DSP core or RISC core must therefore be developed. The MAC should provide both fast multiplication and area-efficient hardware. In this study, the Radix-4 Booth's algorithm is mainly used for the multiplier, and XOR-based circuits using TG [3][4] are partially used for the ALU, adder, Booth encoder, mux and CSA to meet the HS and LP targets [3][4][5][6].
[Figure labels: (a) CS 16bit Adder blocks, Multiplier Output; (b) 8Bit CS Adder1, 8Bit CS Adder2, 8Bit CS Adder3 (+1), 2X1 Multiplex, CS bit (CS Adder1(8)), Out(16:8), Out(7:0)]

Fig. 2. (a) Detailed Wallace tree structure, (b) HS CSA block diagram.
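The carry-select structure labelled in Fig. 2(b) can be illustrated with a short behavioral sketch in Python. It is a minimal model and not the authors' transistor-level circuit. It assumes that Adder1 adds the low bytes, that Adder2 and Adder3 (+1) add the high bytes with carry-in 0 and 1 respectively, and that the carry out of Adder1 (the CS bit) drives the 2X1 multiplexer. The names add8 and carry_select_add16 are hypothetical.

```python
MASK8 = 0xFF

def add8(a, b, carry_in):
    """8-bit adder block: returns (8-bit sum, carry out)."""
    total = (a & MASK8) + (b & MASK8) + carry_in
    return total & MASK8, total >> 8

def carry_select_add16(in1, in2):
    """Add two unsigned 16-bit values; returns the 17-bit result Out(16:0)."""
    # Adder1: low bytes with carry-in 0. Its carry out is the CS (select) bit.
    low_sum, cs_bit = add8(in1, in2, 0)
    # Adder2 and Adder3 (+1): both candidate high-byte sums are computed
    # at the same time as Adder1 (assumed mapping of the figure labels).
    high0, cout0 = add8(in1 >> 8, in2 >> 8, 0)
    high1, cout1 = add8(in1 >> 8, in2 >> 8, 1)
    # 2X1 multiplexer: the CS bit picks the precomputed high half.
    high_sum, carry_out = (high1, cout1) if cs_bit else (high0, cout0)
    # Out(16:8) is the carry plus the high byte, Out(7:0) is the low byte.
    return (carry_out << 16) | (high_sum << 8) | low_sum

print(hex(carry_select_add16(0x12FF, 0x0001)))  # 0x1300
```

The high half never waits for the low-byte carry to ripple through; it only waits for the multiplexer, which is the source of the speed advantage of this adder style.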

In this study, the fundamental design procedure is to write Verilog HDL code for the front-end design (synthesis and logic simulation using Synopsys, timing simulation using Cadence Verilog-XL) and to use the Cadence Composer and Virtuoso layout tools for the back end. After gate-level synthesis from high-level behavioral and/or structural RTL HDL code, the basic schematics are optimized according to our algorithmic approach. The overall design flow with the corresponding CAD tools is shown in Fig. 3 [9], and Hyundai Elec. was chosen as the final fab site.

II. DESIGN OF HS/LP MAC


The basic configuration of the HS MAC, together with the detailed Radix-4 Booth multiplier, is shown in Fig. 1 [7].
Fig. 3. Design methodology for the full-custom MAC.

[Figure labels: Multiplicand X [16:0], Multiplier Y [16:0], Latch blocks on 16-bit buses, Modified Booth Encoder, Booth Selector, 8 partial products (16 bits), Wallace tree Adder, Multiplier]

Fig. 1. (a) HS/LP MAC diagram, (b) 16X16 Radix-4 Booth multiplier used in the MAC.
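To show why a 16-bit multiplier yields the 8 partial products noted in Fig. 1(b), the following Python sketch recodes the multiplier with the standard Radix-4 (modified) Booth rule. It is a behavioral illustration only: it assumes signed two's-complement operands, and the function names booth_radix4_digits and booth_multiply are hypothetical, not taken from the paper.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a signed multiplier into bits/2 digits, each in {-2,-1,0,1,2}."""
    y &= (1 << bits) - 1      # two's-complement bit pattern of y
    padded = y << 1           # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triplet = (padded >> i) & 0b111        # bits y[i+1], y[i], y[i-1]
        b_hi = (triplet >> 2) & 1
        b_mid = (triplet >> 1) & 1
        b_lo = triplet & 1
        digits.append(b_mid + b_lo - 2 * b_hi)  # Booth digit for this group
    return digits

def booth_multiply(x, y, bits=16):
    """Signed product as the sum of the recoded partial products."""
    # Each digit selects 0, +-x or +-2x (the Booth selector's job),
    # weighted by 4**k for digit position k.
    partial_products = [d * x * 4 ** k
                        for k, d in enumerate(booth_radix4_digits(y, bits))]
    return sum(partial_products)

print(len(booth_radix4_digits(12345)))  # 8
print(booth_multiply(-1234, 5678) == -1234 * 5678)  # True
```

In hardware the final sum is produced by the adder tree rather than by a sequential loop; the sketch only checks that the recoding preserves the product.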

As mentioned above, the TG-based circuits in the MAC, such as the adder and the other blocks, have all been laid out and verified as shown in Fig. 4.
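As a logic-level illustration of how an XOR-based adder can be composed from transmission gates, the Python sketch below models a complementary TG pair as a 2:1 selector and builds a full adder from it. This is the generic textbook TG full adder, given here as an assumption; the actual schematic of Fig. 4(b) is not reproduced, and the names tg_mux, tg_xor and tg_full_adder are hypothetical.

```python
def tg_mux(select, when_0, when_1):
    """Pair of complementary transmission gates acting as a 2:1 selector."""
    return when_1 if select else when_0

def tg_xor(a, b):
    """XOR: b passes either a or its inverse to the output."""
    return tg_mux(b, a, 1 - a)

def tg_full_adder(a, b, cin):
    """1-bit full adder built from the XOR and selector above."""
    propagate = tg_xor(a, b)
    total = tg_xor(propagate, cin)
    # If a != b the carry-in propagates; otherwise the carry equals a.
    cout = tg_mux(propagate, a, cin)
    return total, cout

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, tg_full_adder(a, b, cin))
```

The model captures only the logic function; electrical properties such as leakage and switching power are outside its scope.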
Design         Process        VDD [V]  Operations (bits)  # of Tr.  Area [mm2]  Freq [MHz]  Power [mW]  Remark
This work      HD 0.6u SPTM   3.3      16X16+40           18.8K     2.25        66          79          Post Sim.
OK A DSP Core  0.35u TLM      3.3      16X16+36           -         -           80          82.5        Post Sim.
Ref. [1]       0.4u SPDM      3.3      32X32+63           28.5K     2.35        50          330         Post Sim.

Fig. 6. Top: performance comparison with our work; bottom: timing simulation results for the MAC using the EPIC tool.
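The "16X16+40" operation listed for this work can be read as a 16X16 multiplication whose product is added to a 40-bit accumulator. The Python sketch below models that reading; it is an assumption, as are the signed operands and the wrap-around (non-saturating) overflow behavior. The names ACC_BITS, to_signed, mac_step and dot_product are hypothetical.

```python
ACC_BITS = 40  # accumulator width assumed from the "16X16+40" entry

def to_signed(value, bits):
    """Interpret the low `bits` of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac_step(acc, x, y):
    """One multiply-accumulate: acc + x*y, wrapped to ACC_BITS bits."""
    return to_signed(acc + x * y, ACC_BITS)

def dot_product(xs, ys):
    """Typical MAC use: accumulate a sum of products."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac_step(acc, x, y)
    return acc

print(dot_product([1, 2, 3], [4, 5, 6]))  # 32
```

With signed 16-bit operands each product needs at most 32 bits, so a 40-bit accumulator leaves 8 guard bits before a running sum can overflow.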

III. VERIFICATION AND IMPLEMENTATION

Synthesis of the MAC by Synopsys (see Fig. 5), logic and timing verification by Cadence Verilog-XL, and speed/power estimation by EPIC have been completed as shown in Fig. 6, and the final layout, free of DRC and LVS problems, is shown in Fig. 7. In this study we made 34 types of primitive cells and 5 types of I/O PAD cells, as listed in Table 1, which will be used as IPs for full-custom design in the future.

Fig. 4. (a) Various TG layouts for the MAC, (b) an example of an HS/LP adder schematic using the TGs, (c) adder layout with extracted view (see the bottom for detailed MOSFETs).

IV. RESULTS AND DISCUSSION

As shown in Fig. 5, every combinational and sequential block of the MAC is designed at the primitive transistor level based on the 0.6 um design rule, along with the PAD layouts, using Cadence; the transistor count is about 20,000. The worst-case multiplications and additions are simulated after extracting all netlists, including parasitic effects, from the layout. HSPICE simulation gives outputs similar to those of EPIC, whose simulation time is quite short because it simulates at a blocked digital level for better convergence. As Fig. 7 indicates, the final verification results must be free of errors and warnings before the design enters the fab site. The power consumption is 79 mW at 66 MHz and 3.3 V [8], and a comparison with other published data is shown in Fig. 6. As listed in Table 1, the designed cells will be used for future full-custom designs, and we are now moving to a 0.25 um CMOS technology to design FFT, MPEG and ADSL circuits.

Fig. 7. Final microphotograph layout of the full-custom MAC.
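The reported operating point of 79 mW at 66 MHz can be turned into an average energy per clock cycle with E = P / f. The Python sketch below does this arithmetic; the function name energy_per_cycle_nj is hypothetical, and reading the result as energy per MAC operation would additionally assume one MAC operation per clock cycle, which the text does not state.

```python
def energy_per_cycle_nj(power_mw, freq_mhz):
    """Average energy per clock cycle in nanojoules (mW / MHz = nJ)."""
    return power_mw / freq_mhz

this_work = energy_per_cycle_nj(79, 66)
print(f"{this_work:.3f} nJ per cycle")  # 1.197 nJ per cycle
```

Normalizing power by clock frequency in this way makes designs running at different frequencies easier to compare than raw milliwatt figures.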

Fig. 5. Synthesized result for the MAC using Synopsys.

V. CONCLUSION

A 16X16 MAC built from XOR-based circuits using transmission gate (TG) logic and implemented in a 0.6 um CMOS DLP/DLM technology achieves a maximum of 100 MHz high-speed (HS) and 79 mW low-power (LP) operation at 3.3 V. The proposed MAC performs better than other published MACs because the absence of DC leakage currents keeps the power low and because latches before and after the multiplier bypass unnecessary switching activity for high speed.

REFERENCES
[1] H. Murakami and Naoka Yano, "A Multiplier-Accumulator Macro for a 45 MIPS Embedded RISC Processor," IEEE JSSC, Vol. 31, No. 7, pp. 1067-1071, 1996.
[2] S.H. Yoon and M.H. Sunwoo, "Design and Implementation of a DSP Chip for Portable Multimedia Applications," IEEK, Vol. 35C, No. 12, pp. 31-39, 1998.
[3] Y. Ye, K. Roy and R. Drechsler, "Power Consumption in XOR-Based Circuits," ASP-DAC99, pp. 299-302, 1999.
[4] Wang, S. Fang and W. Feng, "New Efficient Designs for XOR and XNOR Functions on the Transistor Level," IEEE Journal of Solid-State Circuits, Vol. 29, No. 7, July 1994.
[5] Y.S. Kwon, et al., "A New Single-Clock Flip-Flop for Half-Swing Clocking," ASP-DAC99, pp. 117-120, 1999.
[6] A. Chandrakasan, Design, Kluwer Academic Pub., 1995.
[7] M.S. Kim and T.W. Cho, "Design for Low Power High Speed 8-bit ELM Adder using Hybrid Logic," 4th MPW, IDEC, Daejon, Jan. 1999.
[8] C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. Electronic Computers, Vol. EC-13, pp. 14-17, Feb. 1964.
[9] Cadence, EPIC, Synopsys, HSPICE Users Manuals, Technical Reports, 1998.
Table 1. Implemented 34 types of primitive cells and 5 types of PAD cells

Group          Function        Count
Combinational  NMOS            1
               PMOS            1
               NAND            4
               NOR             4
               XOR/XNOR        2
               AOI             1
               OAI             1
               Inverter        1
               Buffer          1
               Tri-State Buf.  1
Sequential     D-F/F           1
               JK-F/F          1
               Muxed F/F       1
               T-F/F           1
               Latch           1
Special        Full-Adder      1
               Multiplexer     1
               Decoder         1
               CLU/BCLU        9
I/O Pad        5-type          5
Total                          39
