DR Tahir Zaidi: Targets For Algorithms
Lecture 1
Introduction
Texas Instruments Approach to DSP
Dr Tahir Zaidi
Targets For Algorithms
Processor
ASIC
FPGA
CPLD
DSP
7/7/2011
Design Options for Digital Systems
Special Purpose Hardware
Custom ICs / ASICs
Software Programmable Processor
Pentium, PowerPC, etc
FPGA (possibly with embedded general
purpose microprocessor)
Xilinx, Altera, etc
DSP
TI, ADSP, etc
Comparison of Options
Specific HW vs. General Purpose HW
NRE/Dev Cost
Speed
Flexibility
Time to Market
Production Cost
Learning Objectives
Why process signals digitally?
Definition of a real-time application.
Why use Digital Signal Processing processors?
What are the typical DSP algorithms?
Parameters to consider when choosing a DSP
processor.
Programmable vs ASIC DSP.
Texas Instruments TMS320 family.
Why go digital?
Digital signal processing techniques are
now so powerful that sometimes it is
extremely difficult, if not impossible, for
analogue signal processing to achieve
similar performance.
Examples:
FIR filter with linear phase.
Adaptive filters.
Why go digital?
Analogue signal processing is achieved by using
analogue components such as:
Resistors.
Capacitors.
Inductors.
The inherent tolerances associated with these
components, temperature, voltage changes and
mechanical vibrations can dramatically affect the
effectiveness of the analogue circuitry.
Why go digital?
With DSP it is easy to:
Change applications.
Correct applications.
Update applications.
Additionally DSP reduces:
Noise susceptibility.
Chip count.
Development time.
Cost.
Power consumption.
Why NOT go digital?
Some high-frequency signals cannot be processed
digitally, for two reasons:
Analog-to-digital converters (ADCs) cannot
sample fast enough.
The application can be too complex to be
performed in real time.
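The second limitation can be made concrete with a cycle budget: the work done for each sample has to fit in the time between two samples. The sketch below is only an illustration. The clock rate, the two sampling rates, and the figure of 1000 cycles per sample are hypothetical values, not taken from the slides.

```python
def cycles_per_sample(clock_hz: float, sample_rate_hz: float) -> float:
    """Processor cycles available between two consecutive input samples."""
    return clock_hz / sample_rate_hz


def meets_real_time(cycles_needed: float, clock_hz: float, sample_rate_hz: float) -> bool:
    """True if the per-sample work fits inside one sample period."""
    return cycles_needed <= cycles_per_sample(clock_hz, sample_rate_hz)


# Hypothetical figures, chosen only for illustration.
CLOCK_HZ = 150e6        # processor clock
AUDIO_RATE_HZ = 48e3    # an audio-rate signal
FAST_RATE_HZ = 100e6    # a much faster signal

audio_budget = cycles_per_sample(CLOCK_HZ, AUDIO_RATE_HZ)  # 3125 cycles per sample
fast_budget = cycles_per_sample(CLOCK_HZ, FAST_RATE_HZ)    # 1.5 cycles per sample

# An algorithm assumed to need 1000 cycles per sample fits the audio case only.
audio_ok = meets_real_time(1000, CLOCK_HZ, AUDIO_RATE_HZ)
fast_ok = meets_real_time(1000, CLOCK_HZ, FAST_RATE_HZ)
```

As the sampling rate rises, the budget shrinks until even a simple algorithm no longer fits.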
Microprocessor
Any CPU that is contained on a single chip.
This small chip is the heart of a computer and is often
referred to simply as the processor.
Performs all computations such as adding,
subtracting, multiplying, and dividing.
In PCs, the most popular is the Intel Pentium chip.
In Macs, the PowerPC chip (Motorola,
IBM, and Apple).
Digital Signal Processor
A DSP is a general purpose processor
with features specifically designed to make
Signal processing applications fast and
efficient
DSP, RISC, CISC Processor
A processor is frequently categorized based on:
Width of its buses (4, 8, 16, 32, 64 bits)
Clock rate (i.e. the rate at which the
processor executes instructions)
Complexity of instruction set
CISC : Complex Instruction Set Computer
RISC : Reduced Instruction Set Computer
Why do we need DSP processors?
Why not use a General Purpose Processor
(GPP) such as a Pentium instead of a DSP
processor?
What is the power consumption of a Pentium
and a DSP processor?
What is the cost of a Pentium and a DSP
processor?
Why do we need DSP processors?
Use a DSP processor when the following are required:
Cost saving.
Smaller size.
Low power consumption.
Processing of many high frequency signals in real-time.
Use a GPP processor when the following are required:
Large memory.
Advanced operating systems.
What are the typical DSP algorithms?
The Sum of Products (SOP) is the key
element in most DSP algorithms:
Algorithm: Equation
Finite Impulse Response Filter: $y(n) = \sum_{k=0}^{M} a_k \, x(n-k)$
Infinite Impulse Response Filter: $y(n) = \sum_{k=0}^{M} a_k \, x(n-k) + \sum_{k=1}^{N} b_k \, y(n-k)$
Convolution: $y(n) = \sum_{k=0}^{N} x(k) \, h(n-k)$
Discrete Fourier Transform: $X(k) = \sum_{n=0}^{N-1} x(n) \exp\left[-j (2\pi/N) \, nk\right]$
Discrete Cosine Transform: $F(u) = \sum_{x=0}^{N-1} c(u) \, f(x) \cos\left[\frac{\pi}{2N} \, u (2x+1)\right]$
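As an illustration of the sum of products, the FIR filter can be sketched as a multiply-accumulate loop. This is a minimal sketch, not an optimised DSP implementation. The function names, the two-tap averaging coefficients, and the input samples are hypothetical, and samples before index 0 are assumed to be zero.

```python
def fir_output(coeffs, samples, n):
    """Compute y(n) = sum over k of a_k * x(n - k).

    Samples outside the input range are treated as zero.
    """
    acc = 0.0
    for k, a_k in enumerate(coeffs):
        if 0 <= n - k < len(samples):
            acc += a_k * samples[n - k]  # one multiply-accumulate per tap
    return acc


def fir_filter(coeffs, samples):
    """Apply the FIR filter to every sample index of the input."""
    return [fir_output(coeffs, samples, n) for n in range(len(samples))]


# Hypothetical two-tap moving average and input signal.
coeffs = [0.5, 0.5]
samples = [1.0, 3.0, 5.0, 7.0]
filtered = fir_filter(coeffs, samples)
```

Each output sample costs one multiplication and one addition per coefficient, which is why DSP processors are built around a fast multiply-accumulate path.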
Hardware Vs Microcode Multiplication
DSP processors are optimised to perform
multiplication and addition operations.
Multiplication and addition are done in
hardware and in one cycle.
Example: 4-bit multiply (unsigned).
Hardware:
    1011
  x 1110
--------
10011010    Cycle 1
Microcode:
    1011
  x 1110
--------
    0000    Cycle 1
   1011.    Cycle 2
  1011..    Cycle 3
 1011...    Cycle 4
--------
10011010    Cycle 5
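The microcode column corresponds to a shift-and-add procedure: one partial product per multiplier bit, followed by a final sum. A minimal sketch is shown below. The function name is hypothetical, and the loop models only the arithmetic, not the actual cycle timing of any processor.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, bits: int = 4):
    """Unsigned multiply built from shifted partial products.

    Returns the product and the list of partial products,
    least significant multiplier bit first.
    """
    product = 0
    partials = []
    for i in range(bits):
        bit = (multiplier >> i) & 1
        # A set multiplier bit contributes the multiplicand shifted left by i.
        partial = (multiplicand << i) if bit else 0
        partials.append(partial)
        product += partial
    return product, partials


product, partials = shift_add_multiply(0b1011, 0b1110)
product_bits = format(product, "08b")
```

For the operands on the slide, the partial products are 0000, 1011., 1011.. and 1011..., and their sum is 10011010 (11 x 14 = 154).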
Parameters to consider when
choosing a DSP processor
Parameter                          TMS320C6211 (@150MHz)                 TMS320C6711 (@150MHz)
Arithmetic format                  32-bit                                32-bit
Extended floating point            N/A                                   64-bit
Extended arithmetic                40-bit                                40-bit
Performance (peak)                 1200 MIPS                             1200 MFLOPS
Number of hardware multipliers     2 (16 x 16-bit) with 32-bit result    2 (32 x 32-bit) with 32 or 64-bit result
Number of registers                32                                    32
Internal L1 program memory cache   32K                                   32K
Internal L1 data memory cache      32K                                   32K
Internal L2 cache                  512K                                  512K
Parameters to consider when
choosing a DSP processor
Parameter                                    TMS320C6211 (@150MHz)    TMS320C6711 (@150MHz)
I/O bandwidth: Serial Ports (number/speed)   2 x 75Mbps               2 x 75Mbps
DMA channels                                 16                       16
Multiprocessor support                       Not inherent             Not inherent
Supply voltage                               3.3V I/O, 1.8V Core      3.3V I/O, 1.8V Core
Power management                             Yes                      Yes
On-chip timers (number/width)                2 x 32-bit               2 x 32-bit
Cost                                         US$ 21.54                US$ 21.54
Package                                      256 Pin BGA              256 Pin BGA
External memory interface controller         Yes                      Yes
JTAG                                         Yes                      Yes
Floating Vs Fixed Point Processors
Applications that require the following need a
floating-point processor:
High precision.
Wide dynamic range.
High signal-to-noise ratio.
Ease of use.
Drawback of floating point processors:
Higher power consumption.
Can be higher cost.
Can be slower than fixed-point counterparts
and larger in size.
Floating Vs Fixed Point Processors
It is the application that dictates which device
and platform to use in order to achieve optimum
performance at a low cost.
For educational purposes, use the floating-point
device (C6711/C6713) as it can support both
fixed and floating point operations.
General Purpose DSP Vs DSP in ASIC
Application Specific Integrated Circuits
(ASICs) are semiconductors designed for
dedicated functions.
The advantages and disadvantages of
using ASICs are listed below:
Advantages
High throughput
Lower silicon area
Lower power consumption
Improved reliability
Reduction in system noise
Low overall system cost
Disadvantages
High investment cost
Less flexibility
Long time from design to
market
Embedded Systems Characteristics
Real-Time
Real, defined timing requirements for particular actions to
be accomplished
Event Driven
Actions of the system are in response to events, not a
predefined sequence.
Resource constrained
Memory Size, speed, power constrained
Special purpose
Device must only perform certain well defined tasks
Embedded System Example
Events :
Button Press
Knob Turned
New Sample needed by
D/A converter
Data block available
from CD drive
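The event-driven characteristic can be sketched as a dispatcher that looks up a handler for whichever event arrives next, instead of running a fixed sequence. This is a hypothetical sketch: the event names mirror the list above, but the handler actions and the function name are invented for illustration.

```python
from collections import deque


def dispatch_events(events, handlers):
    """Process events in arrival order and return the actions taken."""
    pending = deque(events)
    actions = []
    while pending:
        event = pending.popleft()
        handler = handlers.get(event)
        if handler is None:
            continue  # events without a registered handler are ignored
        actions.append(handler())
    return actions


# Hypothetical handlers for the events listed on the slide.
HANDLERS = {
    "button_press": lambda: "change playback state",
    "knob_turned": lambda: "adjust volume",
    "dac_needs_sample": lambda: "write next sample to D/A converter",
    "cd_block_ready": lambda: "copy data block into buffer",
}

actions = dispatch_events(
    ["cd_block_ready", "dac_needs_sample", "unknown", "button_press"],
    HANDLERS,
)
```

In a real embedded system the events would typically arrive as interrupts, and the handler for the D/A converter would carry the tightest deadline.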
Embedded SW Design Flow
Develop code for a target processor.
Since the target is minimal (not much memory,
I/O, etc.), code development is done on a
separate machine (e.g. a PC):
Cross Compiler / Assembler
Simulator
Code is then run in the target system and
observed. Debug support is programmed into
the software.
Emulation / Debugging
In-Circuit Emulator
Debug Kernel BIOS
JTAG Emulation
Interactively Run Code
Breakpoints
Single Step
Watch Variables
Observe interaction with rest of system
Development environment is frequently
processor specific
TMS320C6000
Architectural Overview
Learning Objectives
Describe C6000 CPU architecture.
Introduce some basic instructions.
Describe the C6000 memory map.
Provide an overview of the peripherals.
General DSP System Block Diagram
Peripherals
Central Processing Unit
Internal Memory
Internal Buses
External Memory
TMS320 Family Overview
The TMS320 DSP family consists of fixed-
point, floating-point, and multiprocessor
digital signal processors (DSPs). TMS320
DSPs have an architecture designed
specifically for real-time signal processing.
TMS320 Family Overview
First member of the TMS320 family: the TMS32010, in 1982
C1x, C2x, C2xx, C5x, and C54x fixed-point
DSPs
C3x and C4x floating-point DSPs, and
C8x multiprocessor DSPs.
TMS320C6x New generation DSPs
TMS320C6000 DSP Platform
TMS320C62x Fixed Point DSP generation
TMS320C64x Fixed Point DSP generation
TMS320C67x Floating Point DSP generation
C62x code compatible with C64x and C67x
Uses the VelociTI architecture, a high-performance
VLIW architecture that is excellent for multichannel,
multifunction applications
C6000 VelociTI architecture first to use
advanced VLIW
High performance through increased
instruction-level parallelism
VelociTI is a highly deterministic
architecture, having few restrictions on how
or when instructions are fetched, executed,
or stored
Architectural flexibility is key to the breakthrough
efficiency levels of the TMS320C6000 optimizing
C compiler
TMS320C6000 DSP Platform
C62x/C67x processor consists of three main parts
CPU (or the core)
Peripherals
Memory
8 functional units operate in parallel, 2 sets of 4 functional units
Units communicate using cross path between two register files,
each of which contains 16 32-bit registers
Program parallelism defined at compile time. No data
dependency checking done in hardware at run time
256-bit program memory fetches eight 32-bit instructions every cycle
On-chip program and data memory. Configurable as Cache
Peripherals include DMA controller, power-down logic, EMIF,
serial port(s), expansion bus or host port, and timer(s)
TMS320C6000 DSP Platform
CPU
Program fetch unit
Instruction dispatch unit
Instruction decode unit
32 32-bit registers
Two data paths, each with four functional units
Control registers
Control logic
Test, emulation, and interrupt logic
TMS320C6000 DSP Platform
Functional Unit: Fixed-Point Operations / Floating-Point Operations
.L unit (.L1, .L2)
Fixed-point:
32/40-bit arithmetic and compare operations
Leftmost 1 or 0 bit counting for 32 bits
Normalization count for 32 and 40 bits
32-bit logical operations
Floating-point:
Arithmetic operations
Conversion operations: DP → SP, INT → DP, INT → SP
.S unit (.S1, .S2)
Fixed-point:
32-bit arithmetic operations
32/40-bit shifts and 32-bit bit-field operations
32-bit logical operations
Branches
Constant generation
Register transfers to/from the control register file (.S2 only)
Floating-point:
Compare
Reciprocal and reciprocal square-root operations
Absolute value operations
SP → DP conversion operations
.M unit (.M1, .M2)
Fixed-point:
16 x 16-bit multiply operations
Floating-point:
32 x 32-bit multiply operations
Floating-point multiply operations
.D unit (.D1, .D2)
Fixed-point:
32-bit add, subtract, linear and circular address calculation
Loads / stores with a 5-bit constant offset
Loads / stores with a 15-bit constant offset (.D2 only)
Floating-point:
Load double word with a 5-bit constant offset
TMS320C6713 DSP Platform
TI TMS320C6713 DSP
TI TMS320C6713 DSP Features
DMA Controller
Serial Ports (I/O)
Multiple Computation Units
Cache
On-chip PLL
Host Port Interface
Timers
Floating Point Units
TMS320C62x/C67x Control Register Files
Register   Name                                  Description
AMR        Addressing mode register              Specifies linear or circular addressing for 8 registers; contains sizes for circular addressing
CSR        Control status register               Contains the global interrupt enable bit, cache control bits, and other miscellaneous control and status bits
IFR        Interrupt flag register               Displays status of interrupts
ISR        Interrupt set register                To set pending interrupts manually
ICR        Interrupt clear register              To clear pending interrupts manually
IER        Interrupt enable register             Allows enabling/disabling of individual interrupts
ISTP       Interrupt service table pointer       Points to beginning of interrupt service table
IRP        Interrupt return pointer              Contains maskable interrupt return address
NRP        Nonmaskable interrupt return pointer  Contains nonmaskable interrupt return address
PCE1       Program counter, E1 phase             Contains the address of the fetch packet (FP) that contains the execute packet (EP) in the E1 pipeline stage
Interrupts
14 interrupts: RESET, NMI, and INT4–INT15
IACK to acknowledge interrupt request
INUM0–INUM3 indicate the interrupt vector being serviced
Interrupt vectors relocatable
Interrupt vectors consist of one fetch packet which provides
for quick servicing
These signals may be tied directly to pins on the device,
connected to on-chip peripherals, or may be disabled
permanently by being tied inactive on chip
C6x Processor Memory Map
FETCH AND EXECUTE PACKETS
FP with three EPs: EP1, with two
parallel instructions, and EP2 and
EP3, each with three parallel
instructions
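The grouping of a fetch packet (FP) into execute packets (EPs) can be sketched in a few lines. The sketch assumes the p-bit convention from TI's C6000 CPU documentation, which is not stated on the slide: a p-bit of 1 means the next instruction executes in parallel with the current one, and a p-bit of 0 ends the execute packet. The function name and the p-bit pattern are hypothetical, chosen to reproduce the example above.

```python
def split_into_execute_packets(p_bits):
    """Group the instruction slots of one fetch packet into execute packets.

    p_bits[i] == 1: instruction i + 1 runs in parallel with instruction i.
    p_bits[i] == 0: instruction i is the last one in its execute packet.
    """
    packets = []
    current = []
    for index, p in enumerate(p_bits):
        current.append(index)
        if p == 0:
            packets.append(current)
            current = []
    if current:  # defensive: close a packet left open by a trailing 1
        packets.append(current)
    return packets


# Assumed p-bit pattern for the FP above: EP1 with two instructions,
# EP2 and EP3 with three instructions each.
p_bits = [1, 0, 1, 1, 0, 1, 1, 0]
packets = split_into_execute_packets(p_bits)
packet_sizes = [len(p) for p in packets]
```

The three execute packets together account for all eight instruction slots of the fetch packet.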
PIPELINING
Program Fetch
PG: program address generate (in the CPU) to fetch an address
PS: program address send (to memory) to send the address
PW: program address ready wait (memory read) to wait for data
PR: program fetch packet receive (at the CPU) to read opcode
Decode Stage
DP: dispatch all instructions within FP to appropriate units
DC: instruction decode
Execute Stage
Multiply instruction: two execute phases, because of one delay slot
Load instruction: five execute phases, because of four delay slots
Branch instruction: six execute phases, because of five delay slots
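The relation between delays and execute phases is simple arithmetic: an instruction with d delays occupies d + 1 execute phases. The sketch below uses the delay counts from the list above. The function names and the cycle-numbering convention (the instruction enters its first execute phase in the issue cycle) are assumptions made for illustration.

```python
# Delay counts taken from the slide.
DELAY_SLOTS = {"multiply": 1, "load": 4, "branch": 5}


def execute_phases(kind: str) -> int:
    """Number of execute phases: one phase plus one per delay."""
    return DELAY_SLOTS[kind] + 1


def result_ready_cycle(issue_cycle: int, kind: str) -> int:
    """First cycle in which a later instruction can use the result.

    Assumes the instruction enters its first execute phase in issue_cycle.
    """
    return issue_cycle + DELAY_SLOTS[kind] + 1


phases = {kind: execute_phases(kind) for kind in DELAY_SLOTS}
load_ready = result_ready_cycle(1, "load")
```

Under this convention, a load issued in cycle 1 delivers its data for use in cycle 6, so the four cycles in between must be filled with independent instructions or NOPs.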
Basic Numbering Formats
Three main numbering formats:
unsigned representation
2's complement representation (signed)
floating point representations
Fixed point representations of fractions
Saturating arithmetic
Multiplication of fractions
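These three topics can be illustrated together using the Q15 format, a common 16-bit fixed-point representation of fractions in the range [-1, 1). The slides do not name a specific format, so Q15 is an assumption here, and the helper names are hypothetical.

```python
Q15_MAX = 32767    # largest Q15 value, just below +1.0
Q15_MIN = -32768   # smallest Q15 value, exactly -1.0


def saturate_q15(value: int) -> int:
    """Clamp to the 16-bit range instead of wrapping around."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x: float) -> int:
    """Convert a fraction to Q15 (scale by 2**15), saturating at the limits."""
    return saturate_q15(int(round(x * 32768)))


def q15_multiply(a: int, b: int) -> int:
    """Multiply two Q15 fractions.

    The raw product has 30 fractional bits, so it is shifted
    right by 15 to return to Q15, then saturated.
    """
    return saturate_q15((a * b) >> 15)


def q15_add(a: int, b: int) -> int:
    """Saturating addition of two Q15 values."""
    return saturate_q15(a + b)


half = to_q15(0.5)
quarter = q15_multiply(half, half)            # 0.5 * 0.5 = 0.25
clipped_sum = q15_add(30000, 10000)           # overflows, clamps to Q15_MAX
clipped_product = q15_multiply(Q15_MIN, Q15_MIN)  # (-1) * (-1) cannot be represented
```

The last case shows why saturation matters: (-1) x (-1) = +1 lies just outside the Q15 range, so the result is clamped to the largest representable value rather than wrapping to a negative number.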
Coding Format
Assembly code
Assembler Directives
Project Files
GEL Files
GEL File Example
Linker Command File (.cmd)
Library Files
Installing and Running CCS