Architecture of TMS320C6x: functional units, fetch and execute, pipelining, registers, addressing modes, instruction sets, timers, interrupts, serial ports, DMA, memory
Introduction to DSP
A digital signal processor (DSP) is a type of microprocessor that is optimized for digital signal processing. It integrates system control and math-intensive functions. Its advantages are speed, cost, and energy efficiency.
Alternatives
FPGA
Field-programmable gate arrays can be reconfigured within a system, but they are more expensive and have high power dissipation.
ASIC - Application-Specific Integrated Circuits
ASICs can perform specific functions extremely well and can be made quite power efficient. But since ASICs are not field-programmable, their functionality cannot be iteratively changed or updated during product development.
Why go digital?
Digital signal processing techniques are now so powerful that sometimes it is extremely difficult, if not impossible, for analogue signal processing to achieve similar performance. Examples:
FIR filter with linear phase. Adaptive filters.
Why do we need DSP processors? Use a DSP processor when the following are required:
Cost saving. Smaller size. Low power consumption. Processing of many high frequency signals in real-time.
Applications
[Figure: DSP system block diagram with central processing unit, internal buses, peripherals, and external memory]
Classification of DSP
Von Neumann architecture, Harvard architecture, Super Harvard architecture
Von Neumann architecture: one shared memory for instructions (program) and data, with one data bus and one address bus between processor and memory. Instructions and data have to be fetched in sequential order (known as the Von Neumann bottleneck), which limits the operation bandwidth.
Its design is simple. It is mostly used to interface to external memory.
HARVARD ARCHITECTURE
Uses physically separate memories for instructions and data, requiring dedicated buses for each of them. Instructions and operands can therefore be fetched simultaneously. Different program and data bus widths are possible, allowing program and data memory to be better optimized to the architectural requirements. E.g., if the instruction format requires 14 bits, then the program bus and memory can be made 14 bits wide, while the data bus and data memory remain 8 bits wide.
Classification of DSP processors
Fixed point: performs integer operations
Floating point: performs both integer and floating-point operations
It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost. For educational purposes, use the floating-point device as it can support both fixed and floating point operations.
Fixed point: TMS320C1x, C2x, C5x, ...
Floating point: TMS320C3x, C4x, C67x, ...
C / Assembly?
Programs in assembly often have better performance: they run faster and use less memory, resulting in lower cost.
Motorola (www.mot.com)
- DSP561xx: 16-bit, fixed point
- DSP560xx: 24-bit, fixed point
- DSP96002: 32-bit, floating point
Texas Instruments (www.ti.com)
- TMS320Cxx: 16-bit, fixed point
- TMS320Cxx: 32-bit, floating point
TMS320 Family
C2000 (lowest cost): control systems, motor control, storage, digital control systems
C5000 (efficiency, best MIPS): wireless phones, Internet audio players, digital still cameras, modems, telephony, VoIP
C6000: multichannel and multifunction applications, communications infrastructure, wireless base stations, audio and speech processing, imaging, multimedia servers, video
C6000 Roadmap
[Figure: C6000 roadmap, performance versus time, for fixed-point and floating-point devices. Devices shown: C6201, C6202, C6203, C6204, C6211, C6701, C6711, C6411, C6414, C6415, C6416. Other labels in the figure: 1st generation, multi-core, media gateway, 3G wireless infrastructure]
VLIW architectures execute multiple instructions per cycle and use simple, regular instruction sets
- More parallelism, higher performance
- Better compiler targets
Disadvantages of VLIW architectures
- New kinds of programmer/compiler complexity
- Programmer (or code-generation tool) must keep track of instruction scheduling
- Deep pipelines and long latencies can be confusing and may make peak performance elusive
- Increased memory use
- High program memory bandwidth requirements
- High power consumption
- Misleading MIPS ratings
VelociTI
TI's modification of VLIW is called VelociTI. It reduces code size and increases performance when instructions reside off-chip.
The C6x architecture is based on the high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture developed by Texas Instruments (TI).
This makes it an excellent choice for multichannel and multifunction applications (several instructions are captured and processed simultaneously).
- Unlimited Internet bandwidth
- Universal wireless communication
- New telephony features
- Remote medical diagnostics
- Automated cruise control
- Personal home base station
- Personalized home security
TMS320C6711
A floating-point processor with VLIW architecture
Internal memory includes a two-level cache architecture
- 4 KB of level 1 program cache (L1P)
- 4 KB of level 1 data cache (L1D)
- 64 KB of RAM / level 2 cache for data/program (L2)
Has a direct interface to both synchronous memories (SDRAM and SBSRAM) and asynchronous memories (SRAM and EPROM)
With a 32-bit address bus, the total memory space is 2^32 bytes = 4 GB
Requires 3.3 V for I/O and 1.8 V for the core
Operates at 150 MHz and performs 900 million floating-point operations per second (MFLOPS), which corresponds to 1200 million instructions per second (MIPS)
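The quoted peak figures can be checked with a little arithmetic. The sketch below is only an illustration in Python, and it assumes that the peak rate is one instruction per functional unit per clock cycle (8 units in total, 6 of them able to execute floating-point instructions, as described for the C67x CPU). The function names are hypothetical.

```python
# Figures from the slide: 150 MHz clock, 32-bit address bus.
# Assumption: peak rate = one instruction per functional unit per cycle.
CLOCK_MHZ = 150
FUNCTIONAL_UNITS = 8        # .L1 .S1 .M1 .D1 .L2 .S2 .M2 .D2
FLOATING_POINT_UNITS = 6    # all units except .D1 and .D2
ADDRESS_BITS = 32


def peak_rate_millions(clock_mhz, units):
    """Peak millions of operations per second for `units` parallel units."""
    return clock_mhz * units


def address_space_bytes(address_bits):
    """Number of distinct byte addresses reachable with `address_bits` bits."""
    return 2 ** address_bits


if __name__ == "__main__":
    print("Peak MIPS:  ", peak_rate_millions(CLOCK_MHZ, FUNCTIONAL_UNITS))
    print("Peak MFLOPS:", peak_rate_millions(CLOCK_MHZ, FLOATING_POINT_UNITS))
    print("Address space in GB:", address_space_bytes(ADDRESS_BITS) // 1024 ** 3)
```

These are peak numbers only; as noted under the VLIW disadvantages, sustained performance is usually lower.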
DSK Contents
- 16M SDRAM
- 128K FLASH
- Parallel port I/F
- TMS320C6711 DSP
- Power jack
- Power LED
- Daughter card I/F (peripheral connector)
- User DIP switches
- Reset
- Emulation JTAG header
- Three user LEDs
- 16-bit codec (A/D & D/A)
- Line-level input (microphone)
- Line-level output (speakers)
Block diagram
CPU
There are two sets of functional units, A and B. Each set contains four units and a register file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units .D2, .M2, .S2, and .L2.
.M unit: multiplication operations
.L unit: logical and arithmetic operations
.S unit: branch, bit manipulation, and arithmetic operations
.D unit: load/store and arithmetic operations
The C67x CPU executes all C62x instructions. In addition to the C62x fixed-point instructions, six of the eight functional units (.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64 bits per CPU side for a total of 128 bits per cycle.
TMS320C6711 Memory
External Memory
- Synchronous Memory (SDRAM, SBSRAM)
- Asynchronous Memory (SRAM, EPROM)
Internal Memory
- Program - Data
Registers:
The two register files each contain 16 32-bit registers, for a total of 32 general-purpose registers (A0-A15, B0-B15).
Interaction with the CPU must be done through these registers.
The four functional units on each side of the CPU can freely share the 16 registers belonging to that side.
Two cross paths, 1x and 2x, let the functional units on one side access data from the register file on the opposite side.
When functional units access registers on their own side of the CPU, the register file can service all the units in a single clock cycle. Register access across the CPU supports one read and one write per cycle.
Each functional unit has read/write ports. Units on data path 1 (2) read and write the A (B) registers. Data path 2 (1) can read one A (B) register per cycle.
Two simultaneous memory accesses cannot use registers of the same register file as address pointers.
- 32-bit program address bus, 256-bit program data bus
- Two 32-bit data address buses (DA1, DA2)
- Two 32-bit (64-bit for the floating-point version) load data buses (LD1, LD2)
- Two 32-bit (64-bit for the floating-point version) store data buses (ST1, ST2)
- Two 32-bit DMA data buses, two 32-bit DMA address buses
- Off-chip or external memory is accessed through a 22-bit address bus and a 32-bit data bus
'C6x Peripherals
[Figure: C6x peripherals around the CPU: EMIF (to external memory), McBSP, DMA, Boot, PLL, HPI/XB, Timer]
EMIF
External Memory Interface. A 32-bit bus to which external memories and other devices can be connected. It includes features like internal wait-state generation and SDRAM control. The EMIF can interface to both synchronous and asynchronous memories.
McBSP
Two multichannel buffered serial ports (McBSPs). Each McBSP can be used for high-speed serial data transmission with external devices or reprogrammed as general-purpose I/O. McBSP1 is used to transmit and receive audio data from the AIC23 stereo codec. McBSP0 is used to control the codec through its serial control port.
On-chip PLL: generates the processor clock rate from a slower external clock reference.
Timers: generate periodic timer events as a function of the processor clock. Used by DSP/BIOS to create time slices for multitasking.
Power-down units: save power for durations when the CPU is inactive.
EDMA controller: the enhanced DMA controller allows high-speed data transfers without intervention from the DSP.
Boot
- Boot from 4M external block
- Boot from HPI/XB
CPU operations
- Fetch instruction from memory (DSP program memory)
- Decode instruction
- Execute instruction, including reading data values
[Figure: program fetch between the C6x and memory in phases PG, PS, PW, and PR, followed by decode phases DP and DC; 32-bit instruction words shown with bits 31 to 0]
Pipelining
Pipelining is a key feature in a DSP for getting parallel instructions to work properly.
Non-pipelined scalar architecture: a processor that executes every instruction one after the other. It may use processor resources inefficiently, potentially leading to poor performance.
Basic Ideas
Parallel processing (time runs left to right)
P1:  a1  a2  a3  a4
P2:  b1  b2  b3  b4
P3:  c1  c2  c3  c4
P4:  d1  d2  d3  d4
Pipelined processing (time runs left to right)
P1:  a1  b1  c1  d1
P2:      a2  b2  c2  d2
P3:          a3  b3  c3  d3
P4:              a4  b4  c4  d4
Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. If the stages are perfectly balanced, then the time per instruction on the pipelined machine is equal to
(time per instruction on nonpipelined machine) / (number of pipe stages)
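The relation above can be tried out numerically. The following Python sketch is illustrative only: it assumes perfectly balanced stages, no stalls, and one instruction entering the pipeline per cycle. The function names are hypothetical.

```python
def pipelined_time_per_instruction(nonpipelined_time, stages):
    """Ideal time per instruction once the pipeline is full."""
    return nonpipelined_time / stages


def total_cycles(num_instructions, stages):
    """Cycles needed to push `num_instructions` through an ideal pipeline.

    The first instruction needs `stages` cycles to exit; every later
    instruction exits one cycle after the one before it.
    """
    return stages + (num_instructions - 1)


if __name__ == "__main__":
    # A 12 ns instruction split over 4 balanced stages completes,
    # on average, once every 3 ns when the pipeline is full.
    print(pipelined_time_per_instruction(12.0, 4))
    # Four instructions through four stages take 7 cycles instead of 16.
    print(total_cycles(4, 4))
```

Note that the latency of a single instruction is unchanged; only the rate at which instructions complete improves.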
There are 3 stages of pipelining:
Program fetch, composed of 4 phases
- PG: program address generate, to fetch an address
- PS: program address send, to send the address
- PW: program address ready wait, to wait for data
- PR: program fetch packet receive, to read the opcode from memory
Decode stage, composed of 2 phases
- DP: dispatch all the instructions within a fetch packet (FP) to the appropriate functional units
- DC: instruction decode
Execute stage, composed of 6 (fixed point) to 10 (floating point) phases
a) a multiplication instruction consists of 2 phases due to 1 delay
b) a load instruction consists of 5 phases due to 4 delays
c) a branch instruction consists of 6 phases due to 5 delays
Pipeline phases
Program fetch: PG PS PW PR
Decode: DP DC
Execute: E1-E6 (E1-E10 for double precision)
Pipelining effects
Clock cycle   1   2   3   4   5   6   7   8   9   10
FP1           PG  PS  PW  PR  DP  DC  E1  E2  E3  E4
FP2               PG  PS  PW  PR  DP  DC  E1  E2  E3
FP3                   PG  PS  PW  PR  DP  DC  E1  E2
FP4                       PG  PS  PW  PR  DP  DC  E1
FP5                           PG  PS  PW  PR  DP  DC
FP6                               PG  PS  PW  PR  DP
FP7                                   PG  PS  PW  PR
Each row represents a fetch packet (FP). PG of the first FP starts in cycle 1, PG of the second FP starts in cycle 2, and so on. Each FP has 4 phases for fetch and 2 phases for decode, and execution can take from 1 to 10 phases. At cycle 7, the instructions in the first FP are in the first execution phase (E1), the instructions in the second FP are in the decode phase (DC), the instructions in the third FP are in the dispatch phase (DP), and so on. All the instructions are proceeding through the various phases, so the pipeline is full.
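The staggered pattern of fetch packets can be reproduced with a short Python sketch. It is a simplified model, not the actual hardware behavior: it assumes one fetch packet enters the pipeline per cycle, no stalls, and that every packet walks through all ten execute phases. The names `PHASES`, `phase_of`, and `pipeline_table` are hypothetical.

```python
# Phase sequence: 4 fetch phases, 2 decode phases, up to 10 execute phases.
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC"] + ["E%d" % i for i in range(1, 11)]


def phase_of(fetch_packet, cycle):
    """Phase occupied by a fetch packet (1-based) at a clock cycle (1-based).

    Returns None when the packet has not entered, or has left, the pipeline.
    """
    index = cycle - fetch_packet
    if 0 <= index < len(PHASES):
        return PHASES[index]
    return None


def pipeline_table(num_packets, num_cycles):
    """One row per fetch packet, one column per clock cycle."""
    return [
        [phase_of(fp, cycle) for cycle in range(1, num_cycles + 1)]
        for fp in range(1, num_packets + 1)
    ]


if __name__ == "__main__":
    for fp, row in enumerate(pipeline_table(7, 10), start=1):
        cells = " ".join((phase or "").ljust(3) for phase in row)
        print("FP%d  %s" % (fp, cells))
```

Reading column 7 of the printed table gives E1, DC, DP, PR, PW, PS, PG from top to bottom, which is the full-pipeline situation described above.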
Most instructions have 1 execute phase
- Multiply (MPY) has 2
- Load (LDH/LDW) has 5
- Branch (B) has 6 phases
Additional execute phases are associated with floating-point and double-precision instructions (up to 10 phases), e.g., MPYDP has 9 delay slots and a total of 10 phases
Functional unit latency: the number of cycles that an instruction ties up a functional unit, during which no other instruction can use that unit. It is 1 for all instructions except double-precision instructions. It is different from the number of delay slots, e.g., MPYDP has a functional unit latency of 4 but 9 delay slots.
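The figures above follow a simple rule: the number of delay slots is the number of execute phases minus one. A minimal Python sketch of that bookkeeping is given below; the table only contains the instructions named on this slide plus ADD as a representative single-phase instruction, and the names are hypothetical.

```python
# Execute phases per instruction, as listed on the slide.
# ADD stands in for "most instructions", which have a single execute phase.
EXECUTE_PHASES = {
    "ADD": 1,
    "MPY": 2,
    "LDH": 5,
    "LDW": 5,
    "B": 6,
    "MPYDP": 10,
}


def delay_slots(mnemonic):
    """Cycles to wait after issue before the result (or branch) takes effect."""
    return EXECUTE_PHASES[mnemonic] - 1


if __name__ == "__main__":
    for name in EXECUTE_PHASES:
        print("%-6s execute phases: %2d  delay slots: %d"
              % (name, EXECUTE_PHASES[name], delay_slots(name)))
```

In hand-written assembly these delay slots are typically filled with useful independent instructions or with NOPs.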
Delay slot: some instructions that physically follow an instruction are executed as if they were located before it. Classic examples are branch and call instructions, which often execute the following instruction before the branch or call is performed.
Instruction Set
Assembly code format:
Label   ||   [ ]   Instruction   Unit   Operands   ; comments
Label: refers to the location of an instruction or data (a label must start in the first column)
Parallel bars (||) are used if the instruction is executed in parallel with the previous instruction
The [ ] field is optional and makes the associated instruction conditional
- 5 registers are used as conditional registers
- [A2] specifies that the associated instruction executes if A2 is not zero
- [!A2] specifies that the associated instruction executes if A2 is zero
The instruction field can be an assembler directive or a mnemonic
- an assembler directive is a command for the assembler
  .short : initialize a 16-bit integer
  .int : initialize a 32-bit integer
  .float : initialize a 32-bit IEEE single-precision constant
- a mnemonic is an actual instruction that executes at run time
The unit field can be any one of the 8 functional units (optional)
Comments starting in column 1 begin with an asterisk or a semicolon, whereas comments starting in any other column must begin with a semicolon
Eg:
       ADD    .L1   A3,A7,A7    ; add A3+A7 -> A7
       MPY    .M2   A7,B7,B6    ; multiply 16 LSBs of A7,B7 -> B6
||     MPYH   .M1   A7,B7,A6    ; multiply 16 MSBs of A7,B7 -> A6
Instruction set
DSP instruction sets are designed to make maximum use of the processor's resources and at the same time minimize the memory space required to store the instructions. Minimizing the storage space keeps the overall system cost-effective. To make maximum use of the DSP hardware, the instructions are designed to perform several parallel operations in a single instruction, typically including fetching data in parallel with the main arithmetic operation.
Instructions are kept short by restricting which registers can be used with which operations and which operations can be combined in an instruction. Some of the latest processors use VLIW architectures, wherein multiple instructions are issued and executed per cycle. In such architectures each instruction is short and designed to perform much less work, so it requires less memory, and speed increases because several instructions execute in parallel.
By category:
Logical: AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Data Mgmt: LDB/H/W, MV, MVC, MVK, MVKL, MVKH, MVKLH, STB/H/W
Program Ctrl: B, IDLE, NOP
Bit Mgmt: CLR, EXT, LMBD, NORM, SET
By functional unit:
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.D Unit: STB/H/W, SUB, SUBA, ZERO
.M Unit: MPY, MPYH, SMPY, SMPYH
Other: NOP, IDLE
C67x-specific instructions by functional unit:
.L Unit: ADDDP, ADDSP, DPINT, DPSP, INTDP, INTDPU, INTSP, INTSPU, SPINT, SPTRUNC, SUBSP, SUBDP
.D Unit: ADDAD, LDDW
.M Unit: MPYSP, MPYDP, MPYI, MPYID
The interrupt flag register (IFR)
- contains the status of the INT4-INT15 and NMI interrupts.
- Each corresponding bit in the IFR is set to 1 when that interrupt occurs; otherwise, the bits are cleared to 0.
- If you want to check the status of interrupts, use the MVC instruction to read the IFR.
The interrupt return pointer register (IRP)
- contains the return pointer that directs the CPU to the proper location to continue program execution after processing a maskable interrupt.
- A branch using the address in IRP (B IRP) in your interrupt service routine returns to the program flow when interrupt servicing is complete.
The interrupt set register (ISR)
- allows you to manually set the maskable interrupts (INT15-INT4) in the interrupt flag register (IFR).
- Writing a 1 to any of the bits in ISR causes the corresponding interrupt flag (IFn) to be set in IFR.
- Writing a 0 to any bit in ISR has no effect.
- You cannot set any bit in ISR to affect NMI or reset.
The interrupt service table pointer register (ISTP)
- is used to locate the interrupt service routine (ISR).
The NMI return pointer register (NRP)
- contains the return pointer that directs the CPU to the proper location to continue program execution after NMI processing.
- A branch using the address in NRP (B NRP) in your interrupt service routine returns to the program flow when NMI servicing is complete.
The E1 phase program counter (PCE1)
- contains the 32-bit address of the fetch packet in the E1 pipeline phase.
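The relationship between the interrupt set register and the interrupt flag register can be modeled in a few lines. The Python sketch below is a toy model, not the real register interface; it assumes that bit n of each register corresponds to INTn, and the class and method names are hypothetical.

```python
# Assumption: bit n of IFR/ISR corresponds to interrupt INTn (n = 4..15).
MASKABLE_BITS = range(4, 16)


class InterruptRegisters:
    """Toy model of how writes to ISR show up in IFR."""

    def __init__(self):
        self.ifr = 0

    def write_isr(self, value):
        """Writing 1 sets the matching flag in IFR; writing 0 has no effect.

        Bits outside INT4-INT15 (for example NMI or reset) are ignored.
        """
        for n in MASKABLE_BITS:
            if value & (1 << n):
                self.ifr |= 1 << n

    def read_ifr(self):
        """What a read of IFR (done with MVC on the real device) would return."""
        return self.ifr

    def pending(self):
        """Interrupt numbers whose flag is currently set."""
        return [n for n in MASKABLE_BITS if self.ifr & (1 << n)]


if __name__ == "__main__":
    regs = InterruptRegisters()
    regs.write_isr((1 << 4) | (1 << 14))
    print(hex(regs.read_ifr()), regs.pending())
```

Clearing flags is outside the scope of this sketch, since the slide describes only the set path.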
Addressing modes
Addressing modes determine how memory is accessed.
Addressing refers to the means of specifying the location of operands for instructions
- the types of addressing are called addressing modes
- operands may be input operands for the operation as well as results of the operation
Addressing modes supported by the TMS320C67x include register-indirect, indexed register-indirect, and modulo addressing (circular addressing). Immediate data is also supported. The TMS320C67x does not support modulo addressing for 64-bit data.
Immediate: the operand is part of the instruction
Register: the operand is specified in a register (implied)
Direct: the address of the operand is part of the instruction (added to imply memory page); not supported
Indirect: the address of the operand is stored in a register
Register-Indirect Addressing
The operand is located at a memory address stored in a register
A special group of registers can be used to store addresses (address registers)
Most important addressing mode in DSPs
Efficient from the instruction set point of view: few bits are needed to indicate the address of the operand
Can be used with or without displacement
32 registers (A0-A15, B0-B15) are used as pointers
Indirect addressing uses * in conjunction with one of the 32 registers
1. *R
- register R contains the address of a memory location where a data value is stored
2. *R++(d)
- register R contains the memory address
- after the memory address is used, R is postincremented, such that the new address is R+d (R+1 if d=1)
- a double minus (--) instead postdecrements the address to R-d
3. *++R(d)
- the address is preincremented or offset by d, such that the current address is R+d
- a double minus (--) predecrements the address, such that the current address is R-d
4. *+R(d)
- the address is offset by d, such that the current address is R+d
- unlike the previous case, R is not updated or modified
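The difference between the four forms is easiest to see by tracking two things for each access: the address actually used and the value left in the pointer register. The Python sketch below does only that. It is a simplification that counts addresses in units of d, as the list above does, and ignores scaling by data size; the function name and mode strings are hypothetical.

```python
def resolve(mode, r, d=1):
    """Return (address_used, new_register_value) for pointer value r.

    `mode` is one of the indirect forms from the list above, written
    without the displacement, e.g. "*R++" for *R++(d).
    """
    if mode == "*R":
        return r, r            # use R, leave it unchanged
    if mode == "*R++":
        return r, r + d        # use R, then postincrement
    if mode == "*R--":
        return r, r - d        # use R, then postdecrement
    if mode == "*++R":
        return r + d, r + d    # preincrement, R is updated
    if mode == "*--R":
        return r - d, r - d    # predecrement, R is updated
    if mode == "*+R":
        return r + d, r        # offset only, R is not modified
    if mode == "*-R":
        return r - d, r        # negative offset, R is not modified
    raise ValueError("unknown addressing mode: %s" % mode)


if __name__ == "__main__":
    for mode in ["*R", "*R++", "*R--", "*++R", "*--R", "*+R", "*-R"]:
        used, updated = resolve(mode, r=100, d=2)
        print("%-5s uses address %3d, register becomes %3d" % (mode, used, updated))
```

Forms 3 and 4 use the same address; they differ only in whether the register keeps the new value afterwards.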
Circular addressing
Circular addressing is used to create a circular buffer
- The buffer is created in hardware and is very useful for applications like digital filtering
- Used with a circular buffer, this addressing mode updates samples without the overhead of shifting the data directly
- When the pointer reaches the bottom location and is incremented, it automatically wraps around to the top location
- Two independent buffers are available using BK0 and BK1 within the AMR register
- Registers A4-A7 and B4-B7, in conjunction with the .D units, can be used as pointers
- MVC (move between the control file and the register file) is the only instruction that can access AMR and the other control registers
Circular Buffer
At the beginning of each sample period, a new sample is read into the circular buffer, overwriting the oldest sample. The newest sample x(n) is stored at the memory location pointed at by auxiliary register AR(i).
The need to process digital signals in real time gives rise to the concept of circular buffering. Circular buffers are used to store the most recent values of a continually updated signal. Circular buffering allows processors to access a block of data sequentially and then automatically wrap around to the beginning address, which is exactly the pattern used to access coefficients in an FIR filter.
Circular buffering is also very helpful in implementing first-in, first-out buffers, commonly used for I/O and for FIR delay lines.
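The wrap-around behavior can be illustrated in software. The Python sketch below emulates with a modulo operation what the hardware does automatically; it is a minimal illustration of an FIR delay line, and the class and function names are hypothetical.

```python
class CircularBuffer:
    """Fixed-size delay line in which the newest sample overwrites the oldest."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.newest = 0  # index of the most recently written sample

    def push(self, sample):
        # Advance the pointer and wrap around at the end of the buffer.
        self.newest = (self.newest + 1) % len(self.data)
        self.data[self.newest] = sample

    def recent(self, k):
        """Return x(n-k): the sample written k pushes ago (k = 0 is the newest)."""
        return self.data[(self.newest - k) % len(self.data)]


def fir_output(buffer, coefficients):
    """y(n) = sum over k of h(k) * x(n-k), read straight from the delay line."""
    return sum(h * buffer.recent(k) for k, h in enumerate(coefficients))


if __name__ == "__main__":
    delay_line = CircularBuffer(3)
    for x in [1, 2, 3, 4]:
        delay_line.push(x)
    # The buffer now holds the three most recent samples: 4, 3, 2.
    print(fir_output(delay_line, [1, 2, 3]))
```

No sample is ever moved inside the buffer; only the pointer changes, which is the overhead saving mentioned above.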
With circular addressing, the mode field in AMR also specifies which BK (block size) field to use for a circular buffer. In addition, the buffer must be aligned on a byte boundary equal to the block size.
Eg: MVK
The value 0x0004 = (0100) in binary, moved into the 16 LSBs of AMR, sets bit 2 (the third bit) to 1 and all other bits to zero. This sets the mode to 01 and selects register A5 as the pointer to a buffer using BK0.
The value 0x0005 = (0101) in binary, moved into the 16 MSBs of AMR, sets bits 16 and 18 to 1. This corresponds to the value of N (here N = 5) used to select the size of the buffer, 2^(N+1) = 64 bytes, using BK0.
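The two 16-bit values can be combined and checked with a small calculation. The Python sketch below builds the 32-bit AMR value from its fields. The positions of the A5 mode field (bit 2) and of BK0 (starting at bit 16) come from the text above; the rest of the layout (two mode bits per pointer in the order A4-A7, B4-B7, and BK1 starting at bit 21) is an assumption added here, and the function names are hypothetical.

```python
# Assumed field layout: 2 mode bits per pointer register, starting at bit 0.
POINTERS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]
LINEAR = 0b00
CIRCULAR_BK0 = 0b01
CIRCULAR_BK1 = 0b10
BK0_SHIFT = 16
BK1_SHIFT = 21  # assumption, not stated on the slide


def amr_value(modes, bk0_n=0, bk1_n=0):
    """Build the AMR value from per-pointer modes and block-size fields."""
    value = 0
    for name, mode in modes.items():
        value |= mode << (2 * POINTERS.index(name))
    value |= bk0_n << BK0_SHIFT
    value |= bk1_n << BK1_SHIFT
    return value


def block_size_bytes(n):
    """Circular buffer size selected by a block-size field holding N."""
    return 2 ** (n + 1)


if __name__ == "__main__":
    amr = amr_value({"A5": CIRCULAR_BK0}, bk0_n=5)
    print(hex(amr), block_size_bytes(5))
```

The low half of the result is 0x0004 and the high half is 0x0005, matching the two values loaded in the example.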
Interrupts
The C6711 device supports 16 prioritized interrupts
Types of interrupts: reset, maskable, non-maskable
Reset (RESET)
Reset is the highest-priority interrupt and is used to halt the CPU and return it to a known state. The reset interrupt is unique in a number of ways:
- RESET is an active-low signal. All other interrupts are active-high signals.
- RESET must be held low for 10 clock cycles before it goes high again to reinitialize the CPU properly.
- The instruction execution in progress is aborted and all registers are returned to their default states.
- RESET is not affected by branches.
Maskable interrupts (INT4-INT15)
- These have lower priority than the NMI and reset interrupts.
- These interrupts can be associated with external devices, on-chip peripherals, software control, etc.
The interrupt source for interrupts 4-15 can be programmed by modifying the selector value (binary value) in the corresponding fields of the Interrupt Selector Control registers: MUXH (address 0x019C0000) and MUXL (address 0x019C0004).
Interrupt priority (highest first)
Interrupt name   Type           Default source
RESET            Non-maskable
NMI              Non-maskable
Reserved
Reserved
INT4             Maskable       EXT_INT4
INT5             Maskable       EXT_INT5
INT6             Maskable       EXT_INT6
INT7             Maskable       EXT_INT7
INT8             Maskable       DMA_INT0
INT9             Maskable       DMA_INT1
INT10            Maskable       SD_INT
INT11            Maskable       DMA_INT2
INT12            Maskable       DMA_INT3
INT13            Maskable       DSPINT
INT14            Maskable       TINT0
INT15            Maskable       TINT1
Multichannel Buffered Serial Port (McBSP)
The standard serial port interface provides:
- Full-duplex communication
- Double-buffered data registers, which allow a continuous data stream
- Independent framing and clocking for reception and transmission
- Direct interface to industry-standard codecs, analog interface chips (AICs), and other serially connected A/D and D/A devices
- Multichannel transmission and reception of up to 128 channels
The McBSP consists of a data path and a control path that connect to external devices. Separate pins for transmission and reception communicate data to these external devices. Four other pins communicate control information (clocking and frame synchronization). The device communicates to the McBSP using 32-bit-wide control and data registers accessible via the internal peripheral bus.
Pin     Description
CLKR    Receive clock
CLKX    Transmit clock
CLKS    External clock
DR      Received serial data
DX      Transmitted serial data
FSR     Receive frame synchronization
FSX     Transmit frame synchronization
The CPU or DMA writes the data to be transmitted to the data transmit register (DXR), from which it is shifted out to DX via the transmit shift register (XSR). Similarly, receive data on the DR pin is shifted into the receive shift register (RSR) and copied into the receive buffer register (RBR). RBR is then copied to the data receive register (DRR), which can be read by the CPU or the DMA controller. This allows internal data movement and external data communications simultaneously.
The following control registers are used in multichannel operation:
- The multichannel control register (MCR)
- The transmit channel enable register (XCER)
- The receive channel enable register (RCER)
Other registers for clock generation, frame synchronization, and control are:
- Serial port control register (SPCR)
- Receive control register (RCR)
- Transmit control register (XCR)
- Pin control register (PCR)
- Sample rate generator register (SRGR)
DMA
Direct memory access (DMA) transfers data to or from the processor's memory without the involvement of the processor itself. DMA is commonly used to provide improved performance with input/output devices. Rather than have the processor read data from an I/O device and copy the data into memory or vice versa, a separate DMA controller can handle such transfers in parallel. The processor loads the DMA controller with control information including the starting address for the transfer, the number of words to be transferred, the source, and the destination.
The DMA controller uses the bus request pin to notify the DSP core that it is ready to make a transfer to or from external memory. The DSP core completes its current instruction, releases control of external memory, and signals the DMA controller via the bus grant pin that the DMA transfer can proceed. The DMA controller then transfers the specified number of data words and optionally signals completion through an interrupt. Some processors also have multiple DMA channels that manage DMA transfers in parallel.
Timer
The C67x has two 32-bit general-purpose timers that can be used to:
- Time events
- Count events
- Generate pulses
The timer works in one of two signaling modes, depending on whether it is clocked by an internal or an external source. The timer has an input pin (TINP) and an output pin (TOUT). The TINP pin can be used as a general-purpose input, and the TOUT pin can be used as a general-purpose output. When an internal clock is provided, the timer generates timing sequences to trigger peripheral or external devices, such as the DMA controller or an A/D converter, respectively. When an external clock is provided, the timer can count external events and interrupt the CPU after a specified number of events.
Load/Store Options
In the 'C6x, the instruction set supports several types of load/store instructions:
- loads 16 bits (half word) into B7 from the memory address specified by B2
- loads into A7 the content of the memory location specified by A7
STW .D2 A1,*+A4[20]
- stores the 32-bit word A1 into the memory location whose address is specified by A4 offset by 20 words (32 bits each), or 80 bytes
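The 20-word offset turning into 80 bytes is a matter of scaling the bracketed index by the size of the data type. The Python sketch below shows that calculation under the assumption that the index in square brackets is scaled by the element size of the load/store instruction (1, 2, or 4 bytes for B, H, and W). The base address and function names are hypothetical.

```python
# Element size in bytes for the byte (B), half-word (H) and word (W) forms.
ELEMENT_BYTES = {"B": 1, "H": 2, "W": 4}


def byte_offset(index, size):
    """Byte offset for a bracketed index, scaled by the element size."""
    return index * ELEMENT_BYTES[size]


def effective_address(base, index, size):
    """Address accessed by a form like *+R[index] when R holds `base`."""
    return base + byte_offset(index, size)


if __name__ == "__main__":
    # STW .D2 A1,*+A4[20] with A4 holding a hypothetical base of 0x1000.
    print(byte_offset(20, "W"))
    print(hex(effective_address(0x1000, 20, "W")))
```

With a half-word store the same index of 20 would move the address by only 40 bytes.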