
Syllabus
Architecture of TMS320C6x: functional units, fetch and execute, pipelining, registers, addressing modes, instruction sets, timers, interrupts, serial ports, DMA, memory
Introduction to DSP
A digital signal processor (DSP) is a type of microprocessor that is optimized for digital signal processing. It integrates system control and math-intensive functions. Its advantages are speed, cost, and energy efficiency.
It is a key component in many communication, medical, military and industrial products.
Alternatives
FPGA
Field-programmable gate arrays are reconfigurable within a system, but they are more expensive and have high power dissipation.
ASIC - Application-Specific Integrated Circuits
ASICs can perform specific functions extremely well and can be made quite power efficient. But since ASICs are not field-programmable, their functionality cannot be iteratively changed or updated during product development.
Why go digital?
Digital signal processing techniques are now so powerful that sometimes it is extremely difficult, if not impossible, for analogue signal processing to achieve similar performance. Examples:
FIR filter with linear phase. Adaptive filters.
With DSP it is easy to:

Change applications. Correct applications. Update applications.
Additionally DSP reduces:

Noise susceptibility. Chip count. Development time. Cost. Power consumption.
Why do we need DSP processors? Use a DSP processor when the following are required:
Cost saving. Smaller size. Low power consumption. Processing of many high frequency signals in real-time.
Applications
General DSP System Block Diagram
[Figure: block diagram of a general DSP system showing the central processing unit, internal memory, internal buses, external memory, and peripherals.]
Classification of DSP
Von Neumann architecture, Harvard architecture, Super Harvard architecture
VON NEUMANN'S ARCHITECTURE
One shared memory holds both instructions (program) and data, with one data bus and one address bus between processor and memory. Instructions and data have to be fetched in sequential order (known as the von Neumann bottleneck), limiting the operation bandwidth.
Its design is simple. It is mostly used to interface to external memory.
HARVARD ARCHITECTURE
Uses physically separate memories for instructions and data, requiring dedicated buses for each of them. Instructions and operands can therefore be fetched simultaneously. Different program and data bus widths are possible, allowing program and data memory to be better optimized to the architectural requirements. Eg: if the instruction format requires 14 bits, then the program bus and memory can be made 14 bits wide, while the data bus and data memory remain 8 bits wide.
Efficient Memory Access
[Figure: memory-access bus arrangements compared for general-purpose processors, early DSP processors, and more optimized DSP processors.]
Classification of DSP
Fixed-point processors perform integer operations. Floating-point processors perform both integer and floating-point operations.
It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost. For educational purposes, use the floating-point device as it can support both fixed and floating point operations.
Fixed point: TMS320C1x, C2x, C5x, ...
Floating point: TMS320C3x, C4x, C67x, ...
C versus Assembly language

Programs in C are more flexible and quicker to develop.
Programs in assembly often have better performance; they run faster and use less memory, resulting in lower cost.
C / Assembly ?
How complicated is the program?

If it is large and intricate, you will probably want to use C. If it is small and simple, assembly may be a good choice.
Are you pushing the maximum speed of the DSP?

If so, assembly will give you the last drop of performance from the device. For less demanding applications, you should consider using C.
How many programmers will be working together?

If the project is large enough for more than one programmer, lean toward C; use in-line assembly only for time-critical segments.
Which is more important, product cost or development cost?

If it is product cost, choose assembly; if it is development cost, choose C.
What is your background?

If you are experienced in assembly (on other microprocessors), choose assembly for your DSP. If your previous work is in C, choose C for your DSP.
The Digital Signal Processor Market
The digital signal processor market is dominated by 4 companies:

Analog Devices (www.analog.com/dsp): ADSP-21xx, 16-bit fixed point; ADSP-21xxx, 32-bit floating and fixed point
Lucent Technologies (www.lucent.com): DSP16xxx, 16-bit fixed point; DSP32xx, 32-bit floating point
Motorola (www.mot.com): DSP561xx, 16-bit fixed point; DSP560xx, 24-bit fixed point; DSP96002, 32-bit floating point
Texas Instruments (www.ti.com): TMS320Cxx, 16-bit fixed point; TMS320Cxx, 32-bit floating point
TMS320 Family
C2000: Lowest cost. Control systems: motor control, storage, digital control systems.
C5000: Efficiency, best MIPS. Wireless phones, Internet audio players, digital still cameras, modems, telephony, VoIP.
C6000: Best performance and ease of use. Multichannel and multifunction applications, communication infrastructure, wireless base stations, audio and speech processing, imaging, multimedia servers, video.
C6000 Roadmap
[Figure: C6000 roadmap, performance versus time. 1st generation: fixed-point C62x (C6201, C6202, C6203, C6204, C6205, C6211) and floating-point C67x (C6701, C6711, C6712, C6713). 2nd generation (fixed point): C64x DSP (C6411, C6414, C6415, C6416) for general-purpose, media gateway, and 3G wireless infrastructure use, followed by the C64x DSP at 1.1 GHz and multi-core devices.]
Feature of the TMS320C6x

The Texas Instruments TMS320C6x family of microprocessors is one of the largest VLIW success stories to date. This family of processors is built to deliver speed. Family members differ in size, cost, memory, peripherals, and power consumption specifications.
Eg:
Fixed-point C6201 version: 5-ns instruction cycle time, 200-MHz clock rate, eight 32-bit instructions per cycle, performance of up to 1600 MIPS
Floating-point C6701 version: can operate at 167 MHz, 6-ns instruction cycle time, 1 giga floating-point operations per second (GFLOPS)
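The peak figures quoted above follow from the clock rate and the number of operations issued per cycle. The sketch below reproduces that arithmetic; the function names are illustrative, and it assumes one floating-point operation per cycle on each of the six floating-point-capable units of the C67x.

```python
def peak_rate_millions(clock_mhz, ops_per_cycle):
    """Peak operations per second, in millions, for a clock in MHz."""
    return clock_mhz * ops_per_cycle


def cycle_time_ns(clock_mhz):
    """Instruction cycle time in nanoseconds for a clock in MHz."""
    return 1000.0 / clock_mhz


# C6201: 200 MHz clock, eight 32-bit instructions per cycle.
c6201_mips = peak_rate_millions(200, 8)      # 1600 MIPS
# C6701: 167 MHz clock; assumption: six units each complete one
# floating-point operation per cycle.
c6701_mflops = peak_rate_millions(167, 6)    # 1002 MFLOPS, about 1 GFLOPS

print(c6201_mips, cycle_time_ns(200))
print(c6701_mflops, round(cycle_time_ns(167), 2))
```

These are peak ratings only; sustained throughput depends on how many functional units the compiler can keep busy in each cycle.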
Very Long Instruction Word (VLIW )

VLIW refers to a CPU architecture designed to take advantage of instruction-level parallelism. It executes operations in parallel based on a fixed schedule determined when programs are compiled. The order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, hence the processor does not need the scheduling hardware.
VLIW CPUs offer significant computational power with less hardware complexity but greater compiler complexity.
VLIW architectures execute multiple instructions per cycle and use simple, regular instruction sets. Advantages: more parallelism, higher performance, better compiler targets.
Disadvantages of VLIW architectures:
- New kinds of programmer/compiler complexity: the programmer (or code-generation tool) must keep track of instruction scheduling
- Deep pipelines and long latencies can be confusing and may make peak performance elusive
- Increased memory use
- High program memory bandwidth requirements
- High power consumption
- Misleading MIPS ratings
VelociTI
The VLIW modification done by TI is called VelociTI. It reduces code size and increases performance when instructions reside off-chip.
The C6x architecture is based on the high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture developed by Texas Instruments (TI).
It is an excellent choice for multichannel and multifunction applications (several instructions are captured and processed simultaneously).
TMS320C6x with VelociTI enables cost-effective solutions for emerging applications:
Unlimited Internet bandwidth, universal wireless communication, new telephony features, remote medical diagnostics, automated cruise control, personal home base station, personalized home security.
TMS320C6000 DSP Device Nomenclature
TMS320C6711
- A floating-point processor with VLIW architecture
- Internal memory includes a two-level cache architecture: 4 KB of level 1 program cache (L1P), 4 KB of level 1 data cache (L1D), and 64 KB of RAM / level 2 cache for data/program (L2)
- Has a direct interface to both synchronous memories (SDRAM and SBSRAM) and asynchronous memories (SRAM and EPROM)
- With a 32-bit address bus, the total memory space is 2^32 bytes = 4 GB
- Requires 3.3 V for I/O and 1.8 V for the core
- Operating at 150 MHz, it performs 900 million floating-point operations per second (MFLOPS), which translates to 1200 million instructions per second (MIPS)
DSK Contents
TMS320C6711 DSP
16M SDRAM
128K FLASH
16-bit codec (A/D and D/A) with line-level input (microphone) and line-level output (speakers)
1.8V and 3.3V power supplies, power jack, power LED
Parallel port interface
Daughter card interfaces (EMIF connector and peripheral connector)
Emulation and JTAG headers
User DIP switches, three user LEDs, reset
Block diagram
CPU
There are two sets of functional units, A and B. Each set contains four units and a register file. One set contains functional units .L1, .S1, .M1, and .D1; the other set contains units .D2, .M2, .S2, and .L2.
.M unit: multiplication operations
.L unit: logical and arithmetic operations
.S unit: branch, bit manipulation, and arithmetic operations
.D unit: load/store and arithmetic operations
The C67x CPU executes all C62x instructions. In addition to C62x fixed-point instructions, six of the eight functional units (.L1, .S1, .M1, .M2, .S2, and .L2) also execute floating-point instructions. The remaining two functional units (.D1 and .D2) also execute the new LDDW instruction, which loads 64 bits per CPU side for a total of 128 bits per cycle.
TMS320C6711 Memory
Three access levels of the memory map

1. L1 memory
- Cache-based architecture
- Program cache and data cache
- Size: program cache 4 Kbyte, data cache 4 Kbyte
2. L2 memory
- Size: 64 Kbyte
- Program and data
3. L3 memory
- External memory
External Memory
- Synchronous memory (SDRAM, SBSRAM)
- Asynchronous memory (SRAM, EPROM)
Internal Memory
- Program
- Data
Registers:
The two register files each contain sixteen 32-bit registers, for a total of 32 general-purpose registers (A0-A15, B0-B15). Interaction with the CPU must be done through these registers.
The four functional units on each side of the CPU can freely share the 16 registers belonging to that side. Two cross paths, 1X and 2X, let the functional units on one side access data from the register file on the opposite side.
If register access is by functional units on the same side of the CPU, the register file can service all the units in a single clock cycle. Register access across the CPU, using the opposite register file, supports one read and one write per cycle.
Restrictions on Register Accesses

Registers A1, A2, B0, B1, and B2 are used as conditional registers
Registers A4-A7 and B4-B7 are used for circular addressing

Registers A0-A9 and B0-B9 (except B3) are temporary registers. Any of the registers A10-A15 and B10-B15 that are used are saved and later restored before returning from a subroutine.
Each functional unit has read/write ports. Data path 1 (2) units read/write A (B) registers. Data path 2 (1) can read one A (B) register per cycle.
40-bit words are stored in an adjacent register pair

Used in extended-precision accumulation. The 32 LSBs are stored in the even register (eg: A2) and the remaining 8 bits are stored in the 8 LSBs of the next upper (odd) register (A3).

64-bit values are stored in a similar fashion
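The register-pair layout for 40-bit values can be illustrated with a small model. This is a sketch only: the function names are hypothetical, and the registers are modelled as plain unsigned integers.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit register
MASK8 = 0xFF         # the 8 bits kept in the odd register


def split_40bit(value):
    """Split a 40-bit value into (even_register, odd_register) contents."""
    value &= (1 << 40) - 1
    even = value & MASK32          # 32 LSBs, e.g. A2
    odd = (value >> 32) & MASK8    # remaining 8 bits, e.g. A3
    return even, odd


def join_40bit(even, odd):
    """Rebuild the 40-bit value from the register pair."""
    return ((odd & MASK8) << 32) | (even & MASK32)


even, odd = split_40bit(0x123456789A)
print(hex(even), hex(odd))  # 0x3456789a 0x12
```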
Two simultaneous memory accesses cannot use registers of same register file as address pointers
C6x internal buses
- 32-bit program address bus, 256-bit program data bus
- Two 32-bit data address buses (DA1, DA2)
- Two 32-bit (64-bit for the floating-point version) load data buses (LD1, LD2)
- Two 32-bit (64-bit for the floating-point version) store data buses (ST1, ST2)
- Two 32-bit DMA data buses, two 32-bit DMA address buses
- Off-chip or external memory is accessed through a 22-bit address bus and a 32-bit data bus
'C6x Peripherals
[Figure: 'C6x peripherals around the CPU: EMIF (to external memory), McBSP, DMA, boot, PLL, HPI/XB, and timer.]
EMIF
External Memory Interface. A 32-bit bus on which external memories and other devices can be connected. It includes features like internal wait state generation and SDRAM control. The EMIF can interface to both synchronous and asynchronous memories.
McBSP
Two multichannel buffered serial ports (McBSPs). Each McBSP can be used for high-speed serial data transmission with external devices or reprogrammed as general-purpose I/O. McBSP1 is used to transmit and receive audio data from the AIC23 stereo codec. McBSP0 is used to control the codec through its serial control port.
PLL
The on-chip PLL generates the processor clock rate from a slower external clock reference.
Timers
Timers generate periodic timer events as a function of the processor clock. Used by DSP/BIOS to create time slices for multitasking.
Power-down units
Save power for durations when the CPU is inactive.
EDMA controller
The enhanced DMA controller allows high-speed data transfers without intervention from the DSP.
Boot
- Boot from 4M external block
- Boot from HPI/XB
SBSRAM: Synchronous Burst Static Random Access Memory
Host Port Interface (HPI)

The host port interface (HPI) is a parallel port through which a host processor can directly access the CPU's memory space. The host device is the master of the interface, which increases its ease of access. The host and the CPU can exchange information via internal or external memory. In addition, the host has direct access to memory-mapped peripherals. Connectivity to the CPU's memory space is provided through the DMA controller.
The expansion bus (XB) is a replacement for the HPI, as well as an expansion of the EMIF. The expansion provides two distinct areas of functionality (host port and I/O port), which can co-exist in a system.
CPU operations
- Fetch instruction from memory (DSP program memory)
- Decode instruction
- Execute instruction, including reading data values
Program Fetch (F)

Program fetching consists of 4 phases:
- generate fetch address (PG)
- send address to memory (PS)
- wait for data ready (PW)
- read opcode (PR)
Decode Stage (D)

The decode stage consists of two phases:
- dispatch instruction to functional unit (DP)
- instruction decoded at functional unit (DC)
Execute Stage (E)

An execute packet (EP) consists of a group of instructions that can be executed in parallel within the same cycle.
The number of EPs within a fetch packet can vary from one (with 8 parallel instructions) to 8 (with no parallel instructions).
Bit 0 (LSB) of every 32-bit instruction determines whether the next instruction belongs to the same EP: if it is 1, the next instruction is in the same EP; if it is 0, the next instruction is part of the next EP.
FETCH and EXECUTION PACKETS (Fetch packet consists of 8 32-bit instructions)

Consider an FP with three EPs:
    Instruction A
 || Instruction B
    Instruction C
 || Instruction D
 || Instruction E
    Instruction F
 || Instruction G
 || Instruction H
In this fetch packet, EP1 contains 2 parallel instructions, and EP2 and EP3 each contain 3 parallel instructions.
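The rule that bit 0 of each instruction links it to the next one can be sketched as a grouping routine. The helper names and the opcode bits used to build the example words are hypothetical; only bit 0 matters here. The sketch also assumes an execute packet never continues past the end of its fetch packet.

```python
def split_into_execute_packets(fetch_packet):
    """Group the 32-bit words of one fetch packet into execute packets.

    Bit 0 of a word set to 1 means the next word is in the same execute
    packet; 0 means the next word starts a new one.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if (word & 1) == 0:
            packets.append(current)
            current = []
    if current:
        # Assumption: close any open packet at the fetch packet boundary.
        packets.append(current)
    return packets


def make_word(upper_bits, p_bit):
    """Build a stand-in instruction word (hypothetical upper bits)."""
    return (upper_bits << 1) | p_bit


# Bit 0 pattern for the packet A||B, C||D||E, F||G||H described above.
P_BITS = [1, 0, 1, 1, 0, 1, 1, 0]
fetch_packet = [make_word(i + 1, p) for i, p in enumerate(P_BITS)]
sizes = [len(ep) for ep in split_into_execute_packets(fetch_packet)]
print(sizes)  # [2, 3, 3]
```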
Pipelining
It is a key feature in DSP to get parallel instructions working properly
Requires careful timing
Overlap operations to increase performance
Pipeline CPU operations to increase clock speed over a sequential implementation

Separate parallel functional units
Peripheral interfaces for I/O do not burden CPU
Non-pipelined scalar architecture: a processor that executes every instruction one after the other. It may use processor resources inefficiently, potentially leading to poor performance.
Pipelining: executing different sub-steps of sequential instructions simultaneously.
Superscalar architectures: executing multiple instructions entirely simultaneously.
Basic Ideas
Parallel processing: less inter-processor communication, complicated processor hardware
Pipelined processing: more inter-processor communication, simpler processor hardware
[Figure: timing diagrams for processors P1 to P4 under parallel and pipelined processing. a, b, c, d are the different data streams processed; colors mark the different types of operations performed.]
Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. If the stages are perfectly balanced, then the time per instruction on the pipelined machine is equal to

$$\text{Time per instruction (pipelined)} = \frac{\text{Time per instruction on nonpipelined machine}}{\text{Number of pipe stages}}$$
There are 3 stages of pipelining:
Program fetch, composed of 4 phases
- PG: program address generate, to generate the fetch address
- PS: program address send, to send the address to memory
- PW: program access ready wait, to wait for data
- PR: program fetch packet receive, to read the opcode from memory
Decode, composed of 2 phases
- DP: dispatch all the instructions within an FP to the appropriate functional units
- DC: instruction decode
Execute, composed of 6 (fixed point) to 10 (floating point) phases
a) a multiplication instruction consists of 2 phases due to 1 delay
b) a load instruction consists of 5 phases due to 4 delays
c) a branch instruction consists of 6 phases due to 5 delays
Pipeline phases
Program fetch: PG PS PW PR
Decode: DP DC
Execute: E1-E6 (E1-E10 for double precision)
Pipelining effects
Clock cycle  1   2   3   4   5   6   7   8   9   10
FP1          PG  PS  PW  PR  DP  DC  E1  E2  E3  E4
FP2              PG  PS  PW  PR  DP  DC  E1  E2  E3
FP3                  PG  PS  PW  PR  DP  DC  E1  E2
FP4                      PG  PS  PW  PR  DP  DC  E1
FP5                          PG  PS  PW  PR  DP  DC
FP6                              PG  PS  PW  PR  DP
FP7                                  PG  PS  PW  PR
Each row represents an FP. PG of the first FP starts in cycle 1, PG of the second FP starts in cycle 2, and so on. Each FP has 4 phases for fetch and 2 phases for decode, and execution can take from 1 to 10 phases.
At cycle 7, the instructions in the first FP are in the first execution phase (E1), the instructions in the second FP are in the decode phase (DC), the instructions in the third FP are in the dispatch phase (DP), and so on. Instructions are proceeding through every phase, therefore the pipeline is FULL.
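The pipeline-fill pattern described above can be reproduced with a few lines of code. This is an idealized sketch with hypothetical names: it assumes one fetch packet enters the pipeline per cycle, no stalls, and each fetch packet holding a single execute packet.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1", "E2", "E3", "E4"]


def phase_at(fp_index, cycle):
    """Phase of fetch packet fp_index (1-based) in a clock cycle (1-based).

    Returns None when the packet is not in any of the modelled phases.
    """
    step = cycle - fp_index  # fetch packet n enters PG in cycle n
    if 0 <= step < len(PHASES):
        return PHASES[step]
    return None


def snapshot(cycle, num_fps=7):
    """Phase of every fetch packet in the given cycle."""
    return {f"FP{n}": phase_at(n, cycle) for n in range(1, num_fps + 1)}


print(snapshot(7))
# {'FP1': 'E1', 'FP2': 'DC', 'FP3': 'DP', 'FP4': 'PR', 'FP5': 'PW', 'FP6': 'PS', 'FP7': 'PG'}
```

At cycle 7 every phase from PG to E1 is occupied by a different fetch packet, which is what is meant by a full pipeline.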
Most instructions have 1 execute phase. Multiply (MPY) has 2, load (LDH/LDW) has 5, and branch (B) has 6 phases.
Additional execute phases are associated with floating-point and double-precision instructions (up to 10 phases). Eg: MPYDP has 9 delay slots and a total of 10 phases.
Functional unit latency: the number of cycles that an instruction ties up a functional unit, during which no other instruction can use that unit. It is 1 for all instructions except double-precision instructions. It is different from the number of delay slots. Eg: MPYDP has a functional unit latency of 4 but 9 delay slots.
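The relation between delay slots and execute phases can be tabulated. The sketch below uses only the delay-slot counts quoted in these notes; the function names are illustrative, and "E1 cycle" means the cycle in which the instruction enters its first execute phase.

```python
# Delay slots quoted in these notes (ADD stands in for single-phase instructions).
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDW": 4, "B": 5, "MPYDP": 9}


def execute_phases(mnemonic):
    """Number of execute phases: one more than the delay slots."""
    return DELAY_SLOTS[mnemonic] + 1


def result_ready_cycle(mnemonic, e1_cycle):
    """First cycle in which a later instruction can use the result."""
    return e1_cycle + DELAY_SLOTS[mnemonic] + 1


for name in DELAY_SLOTS:
    print(name, execute_phases(name), result_ready_cycle(name, 1))
```

For example, a load that enters E1 in cycle 1 has its data available to an instruction executing in cycle 6, so four cycles of other work (or NOPs) must separate the two.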
delay slot: some instructions that are physically after the instruction are executed as if they were located before it. Classic examples are branch and call instructions, which often execute the following instruction before the branch or call is performed.
Instruction Set
Assembly code format:
Label || [ ] Instruction Unit Operands ; comments
- A label represents a specific address/memory location that contains an instruction or data (the label must start in the first column)
- Parallel bars (||) are used if the instruction is executed in parallel with the previous instruction
- The [ ] field is optional and makes the associated instruction conditional
  - 5 registers are used as conditional registers
  - [A2] specifies that the associated instruction executes if A2 is not zero
  - [!A2] specifies that the associated instruction executes if A2 is zero
- The instruction field can be an assembler directive or a mnemonic
  - An assembler directive is a command for the assembler
    .short : initialize 16-bit integer
    .int : initialize 32-bit integer
    .float : initialize 32-bit IEEE single-precision constant
  - A mnemonic is an actual instruction that executes at run time
- The unit field can be any one of the 8 functional units (optional)
- Comments starting in column 1 begin with an asterisk or a semicolon, whereas comments starting in any other column must begin with a semicolon
Eg:
    ADD  .L1 A3,A7,A7 ; add A3 + A7, result in A7

    MPY  .M2 A7,B7,B6 ; multiply 16 LSBs of A7,B7, result in B6
 || MPYH .M1 A7,B7,A6 ; multiply 16 MSBs of A7,B7, result in A6
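What MPY and MPYH compute in the example above can be modelled on plain integers. This is a sketch with hypothetical function names; registers are modelled as unsigned 32-bit values and both instructions are treated as signed 16 x 16 multiplies, with the product returned as an ordinary Python integer.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a signed two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mpy(src1, src2):
    """Model of MPY: signed product of the 16 LSBs of both operands."""
    return to_signed16(src1) * to_signed16(src2)


def mpyh(src1, src2):
    """Model of MPYH: signed product of the 16 MSBs of both operands."""
    return to_signed16(src1 >> 16) * to_signed16(src2 >> 16)


a7 = 0x00030002  # upper half 3, lower half 2 (example values)
b7 = 0xFFFF0005  # upper half -1, lower half 5
print(mpy(a7, b7), mpyh(a7, b7))  # 10 -3
```

Because the two instructions use different halves of the same registers and run on different .M units, they can be issued in parallel, as the assembly example does.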
Instruction set
Instruction sets are designed to make maximum use of the processor's resources and at the same time minimize the memory space required to store the instructions. Minimizing the storage space helps keep the overall system cost-effective. To make maximum use of the DSP hardware, the instructions are designed to perform several parallel operations in a single instruction, typically including fetching data in parallel with the main arithmetic operation.
Instructions are kept short by restricting which registers can be used with which operations and which operations can be combined in an instruction. Some of the latest processors use VLIW architectures, wherein multiple instructions are issued and executed per cycle. In such architectures each instruction is short and performs much less work, so it needs less memory, and the speed comes from the VLIW architecture issuing several instructions in parallel.
'C6x Instruction Set (by category)

Arithmetic
ABS ADD ADDA ADDK ADD2 MPY MPYH NEG SMPY SMPYH SADD SAT SSUB SUB SUBA SUBC SUB2 ZERO
Logical
AND CMPEQ CMPGT CMPLT NOT OR SHL SHR SSHL XOR
Data Mgmt
LDB/H/W MV MVC MVK MVKL MVKH MVKLH STB/H/W
Program Ctrl
B IDLE NOP
Bit Mgmt
CLR EXT LMBD NORM SET
'C6x Instruction Set (by unit)

.S Unit
ADD ADDK ADD2 AND B CLR EXT MV MVC MVK MVKL MVKH MVKLH NEG NOT OR SET SHL SHR SSHL SUB SUB2 XOR ZERO
.L Unit
ABS ADD AND CMPEQ CMPGT CMPLT LMBD MV NEG NORM NOT OR SADD SAT SSUB SUB SUBC XOR ZERO
.D Unit
ADD ADDA LDB/H/W MV NEG STB/H/W SUB SUBA ZERO
.M Unit
MPY MPYH SMPY SMPYH
Other
NOP IDLE
C67x Addl Instructions (by unit)

.S Unit
ABSSP ABSDP CMPGTSP CMPEQSP CMPLTSP CMPGTDP CMPEQDP CMPLTDP RCPSP RCPDP RSQRSP RSQRDP SPDP
.L Unit
ADDDP ADDSP DPINT DPSP INTDP INTDPU INTSP INTSPU SPINT SPTRUNC SUBSP SUBDP
.D Unit
ADDAD LDDW
.M Unit
MPYSP MPYDP MPYI MPYID
Control Register File
Addressing mode register (AMR)
- specifies the addressing mode.
Control status register (CSR)
- contains control and status bits.
Interrupt clear register (ICR)
- allows you to manually clear the maskable interrupts (INT15-INT4) in the interrupt flag register (IFR).
- Writing a 1 to any of the bits in ICR causes the corresponding interrupt flag (IFn) to be cleared in IFR.
- Writing a 0 to any bit in ICR has no effect.
- You cannot set any bit in ICR to affect NMI or reset.
Interrupt enable register (IER)
- enables and disables individual interrupts.
Interrupt flag register (IFR)
- contains the status of INT4-INT15 and the NMI interrupt.
- Each corresponding bit in the IFR is set to 1 when that interrupt occurs; otherwise, the bits are cleared to 0.
- If you want to check the status of interrupts, use the MVC instruction to read the IFR.
Interrupt return pointer register (IRP)
- contains the return pointer that directs the CPU to the proper location to continue program execution after processing a maskable interrupt.
- A branch using the address in IRP (B IRP) in your interrupt service routine returns to the program flow when interrupt servicing is complete.
Interrupt set register (ISR)
- allows you to manually set the maskable interrupts (INT15-INT4) in the interrupt flag register (IFR).
- Writing a 1 to any of the bits in ISR causes the corresponding interrupt flag (IFn) to be set in IFR.
- Writing a 0 to any bit in ISR has no effect.
- You cannot set any bit in ISR to affect NMI or reset.
Interrupt service table pointer register (ISTP)
- is used to locate the interrupt service routine (ISR).
NMI return pointer register (NRP)
- contains the return pointer that directs the CPU to the proper location to continue program execution after NMI processing.
- A branch using the address in NRP (B NRP) in your interrupt service routine returns to the program flow when NMI servicing is complete.
E1 phase program counter (PCE1)
- contains the 32-bit address of the fetch packet in the E1 pipeline phase.
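How writes to the interrupt set and clear registers affect the interrupt flag register can be modelled in a few lines. This is a behavioral sketch only: the class and method names are hypothetical, and it assumes that bit n of each register corresponds to INTn.

```python
MASKABLE = 0xFFF0  # bits 4..15, assumed to correspond to INT4..INT15


class InterruptFlags:
    """Toy model of IFR as modified through ISR and ICR writes."""

    def __init__(self):
        self.ifr = 0

    def write_isr(self, value):
        """Writing 1s sets the matching maskable flags; 0s have no effect."""
        self.ifr |= value & MASKABLE

    def write_icr(self, value):
        """Writing 1s clears the matching maskable flags; 0s have no effect."""
        self.ifr &= ~(value & MASKABLE)

    def pending(self, n):
        """True when the flag for interrupt n is set."""
        return bool((self.ifr >> n) & 1)


flags = InterruptFlags()
flags.write_isr((1 << 4) | (1 << 14) | (1 << 1))  # bit 1 is ignored (not maskable)
flags.write_icr(1 << 4)
print(hex(flags.ifr))  # 0x4000
```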
Addressing modes
The addressing mode determines how memory is accessed.
Addressing refers to the means of specifying the location of operands for instructions
- the types of addressing are called addressing modes
- operands may be input operands for the operation as well as results of the operation
Addressing modes supported by the TMS320C67x include register-indirect, indexed register-indirect, and modulo addressing (circular addressing). Immediate data is also supported. The TMS320C67x does not support modulo addressing for 64-bit data.
Immediate: the operand is part of the instruction. Eg: ADD .L1 -13,A1,A6
Register: the operand is specified in a register (implied). Eg: ADD .L1 A7,A6,A7
Direct: the address of the operand is part of the instruction (added to imply memory page). Not supported.
Indirect: the address of the operand is stored in a register. Eg: LDW .D1 *A5++[8],A1
Register-Indirect Addressing
- The operand is located at the memory address stored in a register
- A special group of registers can be used to store addresses (address registers)
- Most important addressing mode in DSPs
- Efficient from the instruction set point of view: few bits are needed to indicate the address of the operand
- Can be used with or without displacement
- 32 registers (A0-A15, B0-B15) are used as pointers
- Indirect addressing uses * in conjunction with one of the 32 registers
1. *R
- register R contains the address of a memory location where a data value is stored
2. *R++(d)
- register R contains the memory address
- after the memory address is used, R is postincremented by d, such that the new address is R+1 if d=1
- a double minus (--) instead postdecrements the address, so that the new address is R-d
3. *++R(d)
- the address is preincremented or offset by d, such that the current address is R+d
- a double minus (--) predecrements the address, such that the current address is R-d
4. *+R(d)
- the address is offset by d, such that the current address is R+d
- unlike the previous case, R itself is not updated or modified
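The four pointer forms listed above differ only in which address is used and whether the register is updated. The sketch below models this on a simple pointer object; all names are hypothetical, and the displacement d is applied as given, without the element-size scaling a real load or store would apply.

```python
class Pointer:
    """Stand-in for an address register such as A4."""

    def __init__(self, address):
        self.address = address


def plain(r):
    """*R: use the address in R; R is unchanged."""
    return r.address


def post_increment(r, d=1):
    """*R++(d): use the address in R, then add d to R."""
    used = r.address
    r.address += d
    return used


def post_decrement(r, d=1):
    """*R--(d): use the address in R, then subtract d from R."""
    used = r.address
    r.address -= d
    return used


def pre_increment(r, d=1):
    """*++R(d): add d to R first, then use the updated address."""
    r.address += d
    return r.address


def pre_offset(r, d=1):
    """*+R(d): use R + d as the address; R itself is not modified."""
    return r.address + d


r = Pointer(100)
print(post_increment(r, 2), r.address)  # 100 102
print(pre_increment(r, 4), r.address)   # 106 106
print(pre_offset(r, 8), r.address)      # 114 106
```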
Circular addressing
- Circular addressing is used to create a circular buffer
- The buffer is created in hardware and is very useful for applications like digital filtering
- This addressing mode, in conjunction with a circular buffer, updates samples without the overhead of shifting the data directly
- When the pointer reaches the bottom location and is incremented, it automatically wraps around to the top location
- Two independent buffers are available using BK0 and BK1 within the AMR register
- Registers A4-A7 and B4-B7, in conjunction with the .D units, can be used as pointers
- MVC (move between the control file and the register file) is the only instruction that can access AMR and the other control registers
Circular Buffer
At the beginning of each sample period, a new sample is read into the circular buffer, overwriting the oldest sample. The newest sample x(n) is stored at the memory location pointed at by auxiliary register AR(i).
The need to process digital signals in real time motivates the concept of circular buffering. Circular buffers are used to store the most recent values of a continually updated signal. Circular buffering allows processors to access a block of data sequentially and then automatically wrap around to the beginning address, which is exactly the pattern used to access coefficients in an FIR filter.
Circular buffering is also very helpful in implementing first-in, first-out buffers, commonly used for I/O and for FIR delay lines.
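A software model makes the wrap-around behavior concrete. The sketch below is not how the hardware implements it; it is a minimal illustration with hypothetical names, using a modulo index in place of the hardware pointer, and it assumes the buffer is at least as long as the coefficient list.

```python
class CircularBuffer:
    """Fixed-size delay line where each new sample overwrites the oldest."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.newest = size - 1  # so that the first push lands at index 0

    def push(self, sample):
        """Advance the pointer with wrap-around and store the sample."""
        self.newest = (self.newest + 1) % len(self.data)
        self.data[self.newest] = sample

    def recent(self, k):
        """Return x(n-k); k = 0 is the newest sample."""
        return self.data[(self.newest - k) % len(self.data)]


def fir_step(buf, coeffs, sample):
    """Push one input sample and return one FIR output sample."""
    buf.push(sample)
    return sum(c * buf.recent(k) for k, c in enumerate(coeffs))


buf = CircularBuffer(2)
coeffs = [0.5, 0.5]  # two-tap moving average
outputs = [fir_step(buf, coeffs, x) for x in (1.0, 3.0, 5.0)]
print(outputs)  # [0.5, 2.0, 4.0]
```

No sample is ever moved in memory; only the index changes, which is the saving that hardware circular addressing provides.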
Addressing Mode Register (AMR)

For each of the eight registers (A4-A7, B4-B7) that can perform linear or circular addressing, the addressing mode register (AMR) specifies the addressing mode.
A 2-bit field for each register selects the address modification mode: linear (the default) or circular mode.
With circular addressing, the field also specifies which BK (block size) field to use for a circular buffer. In addition, the buffer must be aligned on a byte boundary equal to the block size.
AMR mode and description

Mode  Description
00    linear addressing
01    circular addressing using BK0
10    circular addressing using BK1
11    reserved
Block size = 2^(N+1) bytes
Eg:
MVK   .S2 0x0004,B2 ; lower 16 bits to B2
MVKLH .S2 0x0005,B2 ; upper 16 bits to B2
The value 0x0004 = (0100)b in the 16 LSBs of AMR sets bit 2 (the third bit) to 1 and all other bits to zero. This sets the mode to 01 and selects register A5 as the pointer to a circular buffer using BK0.
The value 0x0005 = (0101)b in the 16 MSBs of AMR sets bits 16 and 18 to 1. This corresponds to the value of N used to select the size of the buffer: N = 5 gives 2^(N+1) = 64 bytes using BK0.
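The AMR value in the example above can be rebuilt from its fields. The sketch below uses hypothetical helper names. The mode-field positions (two bits per register, starting with A4 at bits 0-1) and BK0 at bits 16-20 follow from the example; placing BK1 at bits 21-25 is an assumption based on the usual AMR layout and should be checked against the CPU reference guide.

```python
POINTERS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]
MODE_LINEAR, MODE_BK0, MODE_BK1 = 0b00, 0b01, 0b10


def amr_value(modes, bk0_n=0, bk1_n=0):
    """Build a 32-bit AMR value from per-register modes and block sizes."""
    value = 0
    for reg, mode in modes.items():
        value |= mode << (2 * POINTERS.index(reg))  # 2-bit mode field
    value |= (bk0_n & 0x1F) << 16  # BK0 field
    value |= (bk1_n & 0x1F) << 21  # BK1 field (assumed position)
    return value


def block_size_bytes(n):
    """Circular buffer size selected by a BK field value N."""
    return 2 ** (n + 1)


amr = amr_value({"A5": MODE_BK0}, bk0_n=5)
print(hex(amr), block_size_bytes(5))  # 0x50004 64
```

The lower half (0x0004) and upper half (0x0005) of this value are the two constants loaded by MVK and MVKLH in the example.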
Interrupts
The C6711 device supports 16 prioritized interrupts.
Types of interrupts: reset, maskable, nonmaskable
Reset (RESET)
Reset is the highest-priority interrupt and is used to halt the CPU and return it to a known state. The reset interrupt is unique in a number of ways:
- RESET is an active-low signal. All other interrupts are active-high signals.
- RESET must be held low for 10 clock cycles before it goes high again to reinitialize the CPU properly.
- The instruction execution in progress is aborted and all registers are returned to their default states.
- RESET is not affected by branches.
Nonmaskable Interrupt (NMI)
- NMI is the second-highest priority interrupt.
- It is generally used to alert the CPU of a serious hardware problem such as imminent power failure.
- For NMI processing to occur, the nonmaskable interrupt enable (NMIE) bit in the interrupt enable register must be set to 1.
Maskable Interrupts (INT4-INT15)
- These have lower priority than the NMI and reset interrupts.
- These interrupts can be associated with external devices, on-chip peripherals, software control, etc.
- The interrupt source for interrupts 4-15 can be programmed by modifying the selector value (binary value) in the corresponding fields of the interrupt selector control registers: MUXH (address 0x019C0000) and MUXL (address 0x019C0004).
Interrupt Priority
Interrupt Name  Type          Default Source
RESET           Non maskable
NMI             Non maskable
Reserved
Reserved
INT4            Maskable      EXT_INT4
INT5            Maskable      EXT_INT5
INT6            Maskable      EXT_INT6
INT7            Maskable      EXT_INT7
INT8            Maskable      DMA_INT0
INT9            Maskable      DMA_INT1
INT10           Maskable      SD_INT
INT11           Maskable      DMA_INT2
INT12           Maskable      DMA_INT3
INT13           Maskable      DSPINT
INT14           Maskable      TINT0
INT15           Maskable      TINT1
Multichannel Buffered Serial Port (McBSP)
The standard serial port interface provides:
- Full-duplex communication
- Double-buffered data registers, which allow a continuous data stream
- Independent framing and clocking for reception and transmission
- Direct interface to industry-standard codecs, analog interface chips (AICs), and other serially connected A/D and D/A devices
- Multichannel transmission and reception of up to 128 channels
- Element sizes of 8, 12, 16, 20, 24, or 32 bits
- 8-bit data transfers with LSB or MSB first
The McBSP consists of a data path and a control path that connect to external devices. Separate pins for transmission and reception communicate data to these external devices. Four other pins communicate control information (clocking and frame synchronization). The device communicates with the McBSP using 32-bit-wide control and data registers accessible via the internal peripheral bus.
Pin   Description
CLKR  Receive clock
CLKX  Transmit clock
CLKS  External clock
DR    Received serial data
DX    Transmitted serial data
FSR   Receive frame synchronization
FSX   Transmit frame synchronization
The CPU or DMA writes the data to be transmitted to the data transmit register (DXR), from which it is shifted out to DX via the transmit shift register (XSR). Similarly, receive data on the DR pin is shifted into the receive shift register (RSR) and copied into the receive buffer register (RBR). RBR is then copied to the data receive register (DRR), which can be read by the CPU or the DMA controller. This allows internal data movement and external data communications simultaneously.
The following control registers are used in multichannel operation:
- multichannel control register (MCR)
- transmit channel enable register (XCER)
- receive channel enable register (RCER)
Other registers for clock generation, frame synchronization, and control are:
- serial port control register (SPCR)
- receive control register (RCR)
- transmit control register (XCR)
- pin control register (PCR)
- sample rate generator register (SRGR)
DMA
Direct Memory Access transfers data to or from the processors memory without the involvement of the processor itself. DMA is commonly used to provide improved performance with input/output devices. Rather than have the processor read data from an I/O device and copy the data into memory or vice versa, a separate DMA controller can handle such transfers in parallel. The processor loads the DMA controller with control information including the starting address for the transfer, the number of words to be transferred, the source and the destination.
The DMA controller uses the bus request pin to notify the DSP core that it is ready to make a transfer to or from external memory. The DSP core completes its current instruction, releases control of external memory and signals the DMA controller via the bus grant pin that the DMA transfer can proceed. The DMA controller then transfers the specified number of data words and optionally signals completion through an interrupt. Some processors also provide multiple DMA channels, which can manage several DMA transfers in parallel.
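The control information that the processor hands to the DMA controller can be pictured as a small descriptor. The following is a minimal sketch in C, not a driver for any real controller: the type DmaDescriptor and the functions dma_run and dma_set_flag are hypothetical names, and the copy runs sequentially here, whereas real DMA hardware works in parallel with the CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer descriptor: the control information the processor
   loads into the DMA controller. Field names are illustrative only. */
typedef struct {
    const uint32_t *src;                 /* starting source address */
    uint32_t       *dst;                 /* starting destination address */
    size_t          count;               /* number of 32-bit words to move */
    void          (*on_done)(void *ctx); /* optional completion "interrupt" */
    void           *ctx;                 /* argument passed to on_done */
} DmaDescriptor;

/* Software stand-in for the controller: copies count words, then signals
   completion if a handler was supplied. Returns the number of words moved. */
size_t dma_run(const DmaDescriptor *d)
{
    assert(d != NULL && d->src != NULL && d->dst != NULL);
    for (size_t i = 0; i < d->count; i++)
        d->dst[i] = d->src[i];
    if (d->on_done != NULL)
        d->on_done(d->ctx);
    return d->count;
}

/* Example completion handler: sets a flag that the main program can poll. */
void dma_set_flag(void *ctx)
{
    *(int *)ctx = 1;
}
```

A caller would fill in a DmaDescriptor with the source, destination and word count, pass dma_set_flag with the address of an int flag, and test the flag after dma_run returns.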
Timer
The C67x has two 32-bit general-purpose timers that can be used to:
Time events
Count events
Generate pulses
Interrupt the CPU
Send synchronization events to the DMA controller
The timer works in one of two signaling modes, depending on whether it is clocked by an internal or an external source. The timer has an input pin (TINP) and an output pin (TOUT). The TINP pin can be used as a general-purpose input, and the TOUT pin can be used as a general-purpose output. When an internal clock is used, the timer generates timing sequences to trigger peripherals or external devices, such as the DMA controller or an A/D converter respectively. When an external clock is used, the timer can count external events and interrupt the CPU after a specified number of events.
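The event-counting mode can be sketched in a few lines of C. This is a hypothetical model, assuming the timer raises an interrupt each time a programmed number of events has been counted. The names EventTimer, timer_init and timer_event are illustrative and do not mirror the C67x timer registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit timer counting external events.
   Names are illustrative and do not mirror the C67x register map. */
typedef struct {
    uint32_t count;     /* events seen since the last match */
    uint32_t period;    /* number of events per interrupt (must be > 0) */
    uint32_t irq_count; /* how many times the "interrupt" has fired */
} EventTimer;

void timer_init(EventTimer *t, uint32_t period)
{
    assert(t != NULL && period > 0);
    t->count = 0;
    t->period = period;
    t->irq_count = 0;
}

/* Call once per external event (for example an edge on the input pin).
   Returns true when the period is reached, i.e. when the CPU would be
   interrupted; the counter then restarts from zero. */
bool timer_event(EventTimer *t)
{
    assert(t != NULL);
    if (++t->count == t->period) {
        t->count = 0;
        t->irq_count++;
        return true;
    }
    return false;
}
```

With a period of 3, every third call to timer_event returns true, which corresponds to interrupting the CPU after a specified number of external events.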
Load/Store Options
The 'C6x instruction set supports several types of load/store instructions:
Four load instructions:
LDDW   Load 64-bit double word (C67x only)
LDW    Load 32-bit word
LDH    Load 16-bit half-word (short)
LDB    Load 8-bit byte
Three store instructions:
STW    Store 32-bit word
STH    Store 16-bit half-word
STB    Store 8-bit byte
LDH .D2 *B2++,B7 || LDH .D1 *A2++,A7
loads into B7 the 16-bit half-word whose memory address is specified by B2 and, in parallel, loads into A7 the half-word whose memory address is specified by A2. Both address registers are post-incremented.
STW .D2 A1,*+A4[20]
stores the 32-bit word in A1 into memory at the address specified by A4 offset by 20 words (32 bits each), or 80 bytes. A4 itself is not modified.
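The address arithmetic in these two examples can be checked with a small C model. This is a sketch under stated assumptions: memory is a plain byte array, the helper names ldh_postinc and stw_preoffset are hypothetical, and values are stored in the byte order of the host machine running the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of LDH *Ax++,Ay: load the 16-bit half-word at the byte address in
   *addr, sign-extend it to 32 bits, then post-increment the address by one
   half-word (2 bytes). Host byte order is assumed. */
int32_t ldh_postinc(const uint8_t *mem, uint32_t *addr)
{
    int16_t half;
    assert(mem != NULL && addr != NULL);
    memcpy(&half, mem + *addr, sizeof half);
    *addr += 2;
    return (int32_t)half;
}

/* Model of STW Asrc,*+Abase[offset]: the bracketed offset is scaled by the
   word size, so the effective byte address is base + 4 * offset. The base
   register is left unchanged. Returns the effective address. */
uint32_t stw_preoffset(uint8_t *mem, uint32_t base, uint32_t offset,
                       uint32_t value)
{
    uint32_t ea = base + 4u * offset;
    assert(mem != NULL);
    memcpy(mem + ea, &value, sizeof value);
    return ea;
}
```

For a base address of 16 and an offset of 20, stw_preoffset writes to byte address 16 + 80 = 96, which matches the 80-byte displacement described above.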
Load and Store Paths

The C67x DSP has two 32-bit paths for loading data from memory to the register file: LD1 for register file A, and LD2 for register file B. The C67x DSP also has a second 32-bit load path for each of the register files A and B. This allows the LDDW instruction to simultaneously load two 32-bit values into register file A and two 32-bit values into register file B. For side A, LD1a is the load path for the 32 LSBs and LD1b is the load path for the 32 MSBs. For side B, LD2a is the load path for the 32 LSBs and LD2b is the load path for the 32 MSBs. There are also two 32-bit paths, ST1 and ST2, for storing register values to memory from each register file.
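How a 64-bit double word divides across the two load paths of one side can be shown with a short C sketch. The type RegPair and the function names are hypothetical. The model only illustrates the split into the 32 LSBs and the 32 MSBs, not the hardware timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a register pair filled by a 64-bit load. */
typedef struct {
    uint32_t lo; /* 32 LSBs, carried on LD1a (side A) or LD2a (side B) */
    uint32_t hi; /* 32 MSBs, carried on LD1b (side A) or LD2b (side B) */
} RegPair;

/* Split a 64-bit double word into the two halves of a register pair. */
void lddw_model(uint64_t dword, RegPair *pair)
{
    assert(pair != NULL);
    pair->lo = (uint32_t)(dword & 0xFFFFFFFFu);
    pair->hi = (uint32_t)(dword >> 32);
}

/* Reassemble the 64-bit value from the pair. */
uint64_t pair_to_u64(const RegPair *pair)
{
    assert(pair != NULL);
    return ((uint64_t)pair->hi << 32) | pair->lo;
}
```

Loading 0x0123456789ABCDEF gives lo = 0x89ABCDEF and hi = 0x01234567, and pair_to_u64 returns the original value.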
