DSP SHARC Processors PART1

The key takeaways are that SHARC processors are floating-point digital signal processors based on a Super Harvard Architecture. They are used in applications such as audio processing, imaging, and motor control.

SHARC processors are high-performance floating-point processors used in applications such as home theater audio, professional audio, automotive audio, and broad industrial applications. The family ranges from entry-level parts priced at less than $10 to parts running at up to 400 MHz.

SHARC processors have a Super Harvard architecture with separate program and data memory buses. They also have on-chip peripherals, caches, and multiple internal buses to achieve high performance.

1 INTRODUCTION TO SHARC PROCESSORS
This chapter briefly describes the SHARC processor’s architecture and key
features and compares available models.
Topics include:
• “What are SHARC Processors?”
• “Three Generations of SHARC Processors”
What are SHARC Processors?
SHARC is the name of a family of high-performance 32-bit floating-point
processors based on a Super Harvard Architecture. SHARC processors
dominate the floating-point digital signal processing market, delivering
exceptional core and memory performance complemented by outstanding
I/O throughput. The industry standard SHARC family makes floating-
point processing economical for applications where performance and
dynamic range are key considerations such as home, professional, and
automotive audio, medical, and industrial and instrumentation products.
The SHARC processor portfolio currently consists of three generations of
products providing code-compatible solutions, ranging from entry-level
products priced at less than $10 to the highest performance products
offering fixed- and floating-point computational power to 400 MHz/2400
MFLOPS. Regardless of the specific product choice, all SHARC processors
provide a common set of features and functionality usable across many
signal processing markets and applications. This baseline functionality
enables the SHARC user to leverage legacy code and design experience,
while transitioning to higher-performance, more highly integrated
SHARC products.
By integrating on-chip, single-instruction, multiple-data (SIMD) processing
elements, SDRAM, and I/O peripherals, SHARC processors deliver
breakthrough signal processing performance.

SHARC Applications
The combination of a high performance core surrounded by appropriate
peripherals, a large software library, and award-winning development tools
makes SHARC processors the ideal choice for audio and broad market
processor applications. Here are some applications:
• Home theater/digital home applications. The ADSP-21266,
ADSP-21365/6, and ADSP-21367 processors permit highly efficient
software implementations of audio decode and post processing
algorithms, such as Dolby Digital, Dolby Digital EX, DTS-ES Discrete
6.1, DTS-ES Matrix 6.1, DTS 96/24™ 5.1, MPEG-2 AAC
LC, MPEG-2 BC 2ch, Dolby Pro Logic II, Dolby Pro Logic 2x,
DTS Neo:6, and WMA Pro. Libraries of all standard (and many
proprietary) audio algorithms reside in on-chip ROM, eliminating
the need for external ROM.
• Professional audio applications. A number of the third-generation
SHARC processors are well-suited for professional audio applications
requiring high processing power and advanced on-chip
peripherals such as sample rate conversion and an S/PDIF
transmitter/receiver. They are offered in BGA and LQFP packages.
• Automotive audio applications. The ADSP-2136x, with integration
of sample-rate conversion, DTCP cipher, precision clock
generators, and serial ports, is an ideal choice for new multichannel
automotive audio designs.
• Broad market use. SHARC processors are available in commercial,
industrial, and automotive temperature grade packages. They are
used in a wide variety of signal processing applications, providing
up to 400 MHz performance in a single-instruction, multiple-data
(SIMD) architecture. Applications include imaging, medical
devices, communications, military, test equipment, 3-D graphics,
speech recognition, and motor control.

Architecture Overview
This section describes architectural features of the SHARC processor.
Super Harvard Architecture
The 32-bit floating-point SHARC processors from Analog Devices are
based on a Super Harvard architecture that balances exceptional core
and memory performance with outstanding I/O throughput capabilities.
This architecture extends the original concepts of separate program and
data memory busses by adding an I/O processor with its associated
dedicated busses.
In addition to satisfying the demands of the most computationally
intensive, real-time signal processing applications, SHARC processors
integrate large memory arrays and application-specific peripherals
designed to simplify product development and reduce time to market.

Common Architectural Features


SHARC processors share the following architectural features.
• 32/40-bit IEEE floating-point math
• 32-bit fixed-point multipliers with 64-bit product and 80-bit
accumulation
• No arithmetic pipeline. All computations are single-cycle.
• Circular buffer addressing supported in hardware
• Sixteen address pointers support 16 circular buffers.
• Six nested levels of zero-overhead looping in hardware
• Rich algebraic assembly language syntax
• Conditional arithmetic, bit manipulation, divide and square root,
bit field deposit and extract supported by instruction set
• Zero-overhead background transfers at full clock rate without
processor intervention
In the core, every instruction can execute in a single cycle. The buses
and instruction cache provide rapid unimpeded data flow to the core to
maintain the execution rate.
Figure 1-1 on page 1-6 shows a detailed block diagram of a single core
SHARC 32-bit processor and the I/O processor (IOP). It illustrates the
following architectural features:
• Two processing elements (PEx and PEy), each containing 32-bit
IEEE floating-point computation units–multiplier, arithmetic
logic unit (ALU), shifter, and data register file
• Program sequencer with related instruction cache, interval timer,
and data address generators (DAG1 and DAG2)
• An SDRAM controller that provides an interface to as many as four
separate banks of industry-standard SDRAM devices
• Up to a maximum of 4 Mbits of on-chip SRAM and 6 Mbits of
on-chip, mask-programmable ROM
• Input/output processor (IOP) with integrated direct memory
access (DMA) controller, serial peripheral interface (SPI) compatible
port, and serial ports (SPORTs) for point-to-point
multiprocessor communications
• A variety of audio-centric peripheral modules including a
Sony/Philips digital interface (S/PDIF), sample rate converter
(SRC), and pulse width modulation (PWM). Table 1-1 on
page 1-6 provides details on these and other features for the current
members of the ADSP-2136x processor generation.
• JTAG test access port for emulation


Figure 1-1 also shows the three on-chip buses of the ADSP-21367/8/9
processors: the PM bus, DM bus, and I/O bus. The PM bus provides
access to instructions or data. During a single cycle, these buses let
the processor access two data operands from memory, access an
instruction (from cache), and perform a DMA transfer. In addition,
Figure 1-1 shows the asynchronous memory interface available on the
ADSP-21368 processor.

Three Generations of SHARC Processors


The SHARC architecture has a long history in the floating-point
processor market. While architectural enhancements have been made with
each successive processor generation, the common traits of exceptional
floating-point performance, matched to high-bandwidth memory and I/O
transfers, remain. All three generations of SHARC processors are still
in production, offering a variety of code-compatible options to meet a
wide array of price, performance, and footprint requirements.
Figure: ADSP-21368 Block Diagram

First-generation SHARC products offer performance of up to
66 MHz/198 MFLOPS and form the cornerstone of the SHARC processor
family. Their easy-to-use instruction set architecture, which supports
both 32-bit fixed-point and 32/40-bit floating-point data formats,
combined with large memory arrays and sophisticated communications
ports, makes them suitable for a wide array of parallel processing
applications including consumer audio, medical imaging, military,
industrial, and instrumentation.
Second-generation products contain dual multipliers, ALUs, shifters,
and data register files, significantly increasing overall system
performance in a variety of applications. This capability is especially
relevant in consumer, automotive, and professional audio where the
algorithms related to stereo channel processing can effectively utilize
the SIMD architecture.
Third-generation SHARC products employ an enhanced SIMD architecture
that extends CPU performance to 400 MHz/2400 MFLOPS. These
products also integrate a variety of ROM configurations and
audio-centric peripherals designed to decrease time to market and
reduce the overall bill of materials costs. This increased level of
performance and peripheral integration allows third-generation SHARC
processors to be considered as single chip solutions for a variety of
audio markets.
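The clock and MFLOPS figures quoted for the first and third generations are consistent with a simple product of clock rate, number of processing elements, and floating-point operations per element per cycle. The sketch below is only an illustration of that arithmetic. It assumes three floating-point operations per element per cycle (a multiply, an add, and a subtract in one instruction), which this chapter does not state; the function name is hypothetical.

```python
def peak_mflops(clock_mhz, processing_elements, flops_per_element=3):
    """Peak MFLOPS = clock (MHz) x elements x floating-point ops per cycle.

    flops_per_element=3 is an assumption (multiply + add + subtract in a
    single instruction); it is not taken from the text of this chapter.
    """
    return clock_mhz * processing_elements * flops_per_element


# First generation: a single set of computation units at 66 MHz.
first_generation = peak_mflops(66, processing_elements=1)

# Third generation: two processing elements (SIMD) at 400 MHz.
third_generation = peak_mflops(400, processing_elements=2)

print(first_generation, third_generation)
```

Under these assumptions the two results match the 198 MFLOPS and 2400 MFLOPS figures given above.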
Each SHARC processor provides unique capabilities, while being
pin-compatible with other SHARC devices. Table 1-1 on page 1-8 lists
key third generation SHARC processor specifications. For more
information, view the SHARC processor selection table online at the
Analog Devices Web site at:
http://www.analog.com/sharc

Processor Peripherals and Performance


SHARC processors represent a class of devices that combine an
extremely capable single-instruction, multiple-data (SIMD) processor
engine with features like core timers, general purpose timers, UARTs,
and SPI ports.
In addition to advanced peripherals, SHARC processors use a software
programmable, on-chip phase-locked loop (PLL) that lets software
control the core and peripheral clocks at runtime.
Performance
Real-time signal processing tasks are I/O and computationally
intensive. In addition to high-speed math units and single-cycle
instruction execution (including single-cycle multiply accumulates
[MACs]), SHARC processors are designed for maximum I/O and memory
access bandwidth. This balance of core speed, memory integration, and
I/O bandwidth achieves the sustained performance critical to real-time
applications.
Figure: ADSP-21262 EZ-KIT Lite Evaluation System

Processor Core
The processor core consists of two processing elements (each with
three computation units and data register file), a program sequencer,
two DAGs, a timer, and an instruction cache. All processing occurs in
the processor core.
Processing Elements
The processor core contains two processing elements: PEx and PEy.
Each element contains a data register file and three independent
computation units: an arithmetic logic unit (ALU), a multiplier with an
80-bit fixed-point accumulator, and a shifter. For meeting a wide
variety of processing needs, the computation units process data in
three formats: 32-bit fixed-point, 32-bit floating-point, and 40-bit
floating-point. The floating-point operations are single-precision
IEEE-compatible. The 32-bit floating-point format is the standard IEEE
format, whereas the 40-bit extended-precision format has eight
additional least significant bits (LSBs) of mantissa for greater
accuracy.
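The effect of the eight extra mantissa LSBs can be estimated on a host machine. The sketch below is an illustration only: it assumes 23 stored fraction bits for the 32-bit format (standard IEEE single precision) and therefore 31 for the 40-bit format, and it models each format by truncating the fraction of a Python double. It ignores exponent range and rounding mode, and the helper name is hypothetical.

```python
import struct


def truncate_fraction(value, fraction_bits):
    """Keep only the top `fraction_bits` of a double's 52-bit fraction.

    This truncates (does not round) and leaves the exponent untouched, so
    it only models mantissa width, not the full number format.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    drop = 52 - fraction_bits
    bits &= ~((1 << drop) - 1)
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


x = 1.0 / 3.0
err32 = abs(x - truncate_fraction(x, 23))  # 32-bit format: 23 fraction bits
err40 = abs(x - truncate_fraction(x, 31))  # 40-bit format: 8 more LSBs

print(f"32-bit error: {err32:.3e}")
print(f"40-bit error: {err40:.3e}")
print(f"ratio: {err32 / err40:.1f}")
```

For this value the truncation error shrinks by a factor of roughly 2^8, which is what eight additional mantissa bits would suggest.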
The ALU performs a set of arithmetic and logic operations on both
fixed-point and floating-point formats. The multiplier performs
floating-point or fixed-point multiplication and fixed-point
multiply/accumulate or multiply/cumulative-subtract operations. The
shifter performs logical and arithmetic shifts, bit manipulation,
bit-wise field deposit and extraction, and exponent derivation
operations on 32-bit operands. These computation units complete all
operations in a single cycle; there is no computation pipeline. The
output of any unit may serve as the input of any unit on the next
cycle. All units are connected in parallel, rather than serially. In a
multifunction computation, the ALU and multiplier perform independent,
simultaneous operations.
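The headroom that an 80-bit accumulator gives to fixed-point multiply/accumulate loops can be worked out with integer arithmetic. The following is a minimal host-side sketch, not a model of the actual hardware: it assumes signed 32-bit operands and two's-complement wraparound on overflow (the real overflow behavior is not described here), and all names are hypothetical.

```python
ACC_BITS = 80


def wrap_signed(value, bits):
    """Wrap an integer into a two's-complement field of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, x, y):
    """Multiply/accumulate: acc + x*y held in an 80-bit accumulator."""
    return wrap_signed(acc + x * y, ACC_BITS)


def msub(acc, x, y):
    """Multiply/cumulative-subtract: acc - x*y in an 80-bit accumulator."""
    return wrap_signed(acc - x * y, ACC_BITS)


def worst_case_headroom(acc_bits=ACC_BITS, operand_bits=32):
    """How many largest-magnitude products fit before the accumulator overflows."""
    largest_product = (1 << (operand_bits - 1)) ** 2   # (-2^31) * (-2^31) = 2^62
    largest_acc = (1 << (acc_bits - 1)) - 1            # 2^79 - 1
    return largest_acc // largest_product


print(worst_case_headroom())             # 80-bit accumulator
print(worst_case_headroom(acc_bits=64))  # a 64-bit accumulator, for comparison
```

Under these assumptions an 80-bit accumulator absorbs 2^17 - 1 worst-case products, while a 64-bit accumulator would overflow on the second one.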
Each processing element has a general-purpose data register file that
transfers data between the computation units and the data buses and
stores intermediate results. A register file has two sets (primary and
secondary) of 16 general-purpose registers each for fast context
switching. All of the registers are 40 bits wide. The register file,
combined with the core processor’s Super Harvard Architecture, allows
unconstrained data flow between computation units and internal memory.
Primary processing element (PEx). PEx processes all computational
instructions whether the processor is in single-instruction, single-data
(SISD) or single-instruction, multiple-data (SIMD) mode. This element
corresponds to the computational units and register file in previous
ADSP-21000 family processors.
Secondary processing element (PEy). PEy processes each computational
instruction in lock-step with PEx, but only processes these instructions
when the processor is in SIMD mode. Because many operations are
influenced by this mode, more information on SIMD is available in
multiple locations:
• For information on PEy operations, see “Processing Elements” on
page 2-1.
• For information on data addressing in SIMD mode, see “Addressing
in SISD and SIMD Modes” on page 4-20.
• For information on data accesses in SIMD mode, see “SISD,
SIMD, and Broadcast Load Modes” on page 5-37.
• For information on SIMD programming, see “Instruction Set” in
Chapter 8, Instruction Set, and “Computations Reference” in
Chapter 9, Computations Reference.
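The lock-step relationship between PEx and PEy described above can be pictured with a small host-side model. This is a sketch for illustration only, assuming a single computational instruction applied to one set of operands per element; the function names and the stereo gain example are hypothetical and do not correspond to SHARC instructions.

```python
def execute(operation, pex_operands, pey_operands, simd):
    """Apply one computational instruction to PEx, and to PEy only in SIMD mode."""
    result_x = operation(*pex_operands)
    result_y = operation(*pey_operands) if simd else None
    return result_x, result_y


def apply_gain(sample, gain):
    """A stand-in for a single multiply instruction."""
    return sample * gain


left, right = 0.5, -0.25

# SIMD mode: one instruction processes the left and right channels together.
print(execute(apply_gain, (left, 2.0), (right, 2.0), simd=True))

# SISD mode: only PEx executes; PEy stays idle.
print(execute(apply_gain, (left, 2.0), (right, 2.0), simd=False))
```

The stereo case is the one the text singles out: the same operation is applied to two channels, so one instruction stream can drive both elements.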
Program Sequence Control
Internal controls for program execution come from four functional
blocks: program sequencer, data address generators, core timer, and
instruction cache. Two dedicated address generators and a program
sequencer supply addresses for memory accesses. Together the sequencer
and data address generators allow computational operations to execute
with maximum efficiency since the computation units can be devoted
exclusively to processing data. With their instruction cache, the
ADSP-2136x processors can simultaneously fetch an instruction from the
cache and access two data operands from memory. The DAGs also provide
built-in support for zero-overhead circular buffering.
Program sequencer. The program sequencer supplies instruction
addresses to program memory. It controls loop iterations and evaluates
conditional instructions. With an internal loop counter and loop stack,
the processors execute looped code with zero overhead. No explicit
jump instructions are required to loop or to decrement and test the
counter. To achieve a high execution rate while maintaining a simple
programming model, the processor employs a five-stage pipeline to
process instructions: fetch1, fetch2, decode, address, and execute. For
more information, see
“Instruction Pipeline” on page 3-2.
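To see how a five-stage pipeline sustains one instruction per cycle, it can help to lay out which instruction occupies which stage on each cycle. The sketch below is an idealized model: it assumes no stalls, branches, or interrupts, and the function name is hypothetical. Only the stage names come from the text.

```python
STAGES = ("fetch1", "fetch2", "decode", "address", "execute")


def pipeline_schedule(instruction_count):
    """Return {cycle: {stage: instruction index}} for an ideal, stall-free pipeline."""
    schedule = {}
    for instruction in range(instruction_count):
        for depth, stage in enumerate(STAGES):
            # Instruction i enters fetch1 on cycle i and advances one stage per cycle.
            schedule.setdefault(instruction + depth, {})[stage] = instruction
    return schedule


schedule = pipeline_schedule(10)
print(len(schedule))   # total cycles for 10 instructions
print(schedule[4])     # first cycle in which every stage is occupied
```

In this idealized model, N instructions complete in N + 4 cycles: four cycles to fill the pipeline, then one instruction finishing per cycle.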
Data address generators. The DAGs provide memory addresses when data
is transferred between memory and registers. Dual data address
generators enable the processor to output simultaneous addresses for
two operand reads or writes. DAG1 supplies 32-bit addresses for
accesses using the DM bus. DAG2 supplies 32-bit addresses for memory
accesses over the PM bus.
Each DAG keeps track of up to eight address pointers, eight address
modifiers, and for circular buffering eight base-address registers and
eight buffer-length registers. A pointer used for indirect addressing
can be modified by a value in a specified register, either before
(pre-modify) or after (post-modify) the access. A length value may be
associated with each pointer to perform automatic modulo addressing for
circular data buffers. The circular buffers can be located at arbitrary
boundaries in memory. Each DAG register has a secondary register that
can be activated for fast context switching.
Circular buffers allow efficient implementation of delay lines and other
data structures required in digital signal processing. They are also
commonly used in digital filters and Fourier transforms. The DAGs
automatically handle address pointer wraparound, reducing overhead,
increasing performance, and simplifying implementation.
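The pointer wraparound that the DAGs perform in hardware can be sketched in software as modulo arithmetic on a base address and a buffer length. The model below is an illustration only: it covers post-modify addressing, the class and method names are hypothetical, and it makes no claim about how the hardware computes the wrap.

```python
class CircularPointer:
    """Software model of one address pointer with a base and length."""

    def __init__(self, base, length):
        self.base = base      # first address of the buffer (any boundary)
        self.length = length  # number of locations in the buffer
        self.index = base     # current address pointer

    def post_modify(self, modifier):
        """Return the current address, then advance the pointer with wraparound."""
        address = self.index
        offset = (self.index - self.base + modifier) % self.length
        self.index = self.base + offset
        return address


# A 4-word delay line placed at an arbitrary boundary (address 5).
memory = [0] * 16
pointer = CircularPointer(base=5, length=4)
for sample in range(6):
    memory[pointer.post_modify(1)] = sample

print(memory[5:9])   # the two oldest samples have been overwritten
```

In application code this bookkeeping would cost extra instructions per access; the point of the hardware support described above is that it costs none.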
Interrupts. The ADSP-2136x processors have three external hardware
interrupts. The processor also provides three general-purpose
interrupts, and a special interrupt for reset. The processor has
internally-generated interrupts for the timer, DMA controller
operations, circular buffer overflow, stack overflows, arithmetic
exceptions, and user-defined software interrupts.
For the general-purpose interrupts and the internal timer interrupt, the
processor automatically stacks the arithmetic status (ASTATx) register
and mode (MODE1) register in parallel with the interrupt servicing,
allowing 15 nesting levels of very fast service for these interrupts.
Context switch. Many of the processor’s registers have secondary
registers that can be activated during interrupt servicing for a fast
context switch. The data registers in the register file, the DAG
registers, and the multiplier result register all have secondary
registers. The primary registers are active at reset, while the
secondary registers are activated by control bits in a mode control
register.
Timer. The core’s programmable interval timer provides periodic
interrupt generation. When enabled, the timer decrements a 32-bit count
register every cycle. When this count register reaches zero, the
ADSP-2136x processors generate an interrupt and assert their timer
expired output. The count register is automatically reloaded from a
32-bit period register and the countdown resumes immediately.
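The decrement, interrupt, and reload sequence can be traced with a few lines of simulation. The sketch below follows the description literally and is an assumption about cycle-level behavior: it treats the reload as taking no extra cycle, so interrupts are spaced exactly `period` cycles apart. Whether the real spacing is the period value or one cycle more should be checked against the hardware reference. The function name is hypothetical.

```python
def timer_interrupt_cycles(period, cycles):
    """Return the cycle numbers on which the timer interrupt fires.

    Assumes period >= 1, a count register that starts loaded with `period`,
    and a reload that costs no additional cycle (an assumption).
    """
    count = period
    fired = []
    for cycle in range(1, cycles + 1):
        count -= 1                 # the count register decrements every cycle
        if count == 0:
            fired.append(cycle)    # interrupt generated, timer expired asserted
            count = period         # automatic reload from the period register
    return fired


print(timer_interrupt_cycles(period=4, cycles=12))
```

With a period of 4, this model fires on every fourth cycle with no software involvement between interrupts.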
Instruction cache. The program sequencer includes a 32-word
instruction cache that effectively provides three-bus operation for
fetching an instruction and two data values. The cache is selective;
only instructions whose fetches conflict with data accesses using the
PM bus are cached. This caching allows full speed execution of core,
looped operations such as digital filter multiply-accumulates, and FFT
butterfly processing. For more
information on the cache, refer to “Using the Cache” on page 3-8.
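The benefit of a selective cache for looped code can be estimated with a simple cycle count. The model below is a rough sketch under stated assumptions, not a description of the actual cache: it assumes one cycle per instruction, a one-cycle penalty whenever an instruction fetch conflicts with a PM bus data access and misses the cache, and no eviction. The function and parameter names are hypothetical.

```python
def loop_cycles(body_length, conflicting_fetches, iterations, cache_words=32):
    """Count cycles for a loop whose listed fetch addresses conflict with PM data accesses."""
    cache = set()
    cycles = 0
    for _ in range(iterations):
        for address in range(body_length):
            cycles += 1                      # one cycle per instruction
            if address not in conflicting_fetches:
                continue                     # no conflict: fetched over the PM bus
            if address in cache:
                continue                     # hit: instruction comes from the cache
            cycles += 1                      # miss: the fetch waits for the PM bus
            if len(cache) < cache_words:
                cache.add(address)           # only conflicting fetches are cached
    return cycles


print(loop_cycles(4, {2}, 100))                 # with the cache
print(loop_cycles(4, {2}, 100, cache_words=0))  # same loop with no cache
```

In this model the conflict costs one extra cycle on the first pass only; without the cache it would cost one extra cycle on every iteration.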
Processor Internal Buses
The processor core has six buses: PM address, PM data, DM address,
DM data, I/O address, and I/O data. The PM bus is used to fetch
instructions from memory, but may also be used to fetch data. The DM
bus can only be used to fetch data from memory. The I/O bus is used
solely by the IOP to facilitate DMA transfers. In conjunction with the
cache, this Super Harvard Architecture allows the core to fetch an
instruction and two pieces of data in the same cycle that a data word
is moved between memory and a peripheral. This architecture allows dual
data fetches, when the instruction is supplied by the cache.
Bus capacities. The PM and DM address buses are both 32 bits wide,
while the PM and DM data buses are both 64 bits wide.
The PM and DM data buses provide a path for the contents of any
register in the processor to be transferred to any other register or to
any data memory location in a single cycle. When fetching data over the
PM or DM bus, the address comes from one of two sources: an absolute
value specified in the instruction (direct addressing) or the output of
a data address generator (indirect addressing). The PM and DM buses
share the same port of the memory.
Each memory block also has a dedicated I/O address bus and I/O data
bus to let the I/O processor access internal memory for DMA without
delaying the processor core (in the absence of memory block conflict).
The I/O address bus is 18 bits wide, and the I/O data bus is 32 bits
wide.
Data transfers. Nearly every register in the processor core is
classified as a universal register (Ureg). Instructions allow the
transfer of data between any two universal registers or between a
universal register and memory. This support includes transfers between
control registers, status registers, and data registers in the register
file. The PM bus connect (PX) registers permit data to be passed
between the 64-bit PM data bus and the 64-bit DM data bus, or between
the 40-bit register file and the PM data bus. These registers contain
hardware to handle the data width difference. For more information, see
“Processing Element Registers” on page B-22.

A INSTRUCTION SET QUICK REFERENCE
This instruction set summary provides a syntax summary for each
instruction and includes a cross reference to each instruction’s
reference page.
Chapter Overview
The following summary topics appear in this chapter.
• “Compute and Move/Modify Summary”
• “Program Flow Control Summary”
• “Immediate Move Summary”
• “Miscellaneous Operations Summary”
• “Register Types Summary”
• “Memory Addressing Summary”
• “Instruction Set Notation Summary”
• “Conditional Execution Codes Summary”
• “SISD/SIMD Conditional Testing Summary”
• “Instruction Opcode Acronym Summary”
• “Universal Register Codes”
• “ADSP-2136x Instruction Opcode Map”
