Tutorial On TI C6678

Texas Instruments

TMS320C6678 (Shannon)
DSP Training

Brighton Feng
November, 2010

Copyright © 2010 Texas Instruments. All rights reserved.


Outline
• C6678 DSP Overview
• Multi-core DSP programming
• Interconnection and resource sharing
• Peripherals overview


Shannon Functional Diagram
• Multi-Core SoC
• Fixed/Floating C66x™ Core
  – Eight cores @ 1.0 GHz, 0.5 MB local L2 per core
  – 4.0 MB shared memory
  – 256 GMAC, 128 GFLOP
• Navigator
  – Multicore eco system
• Packet Infrastructure
• Network Coprocessor
  – IP Network solution for IP v4/6
  – 1.5M packets per sec (1Gb Ethernet wire-rate)
  – IPsec, SRTP, Air Interface Encryption fully offloaded
• 3-port GigE Switch (Layer 2)
• Low Power Consumption
  – Adaptive Voltage Scaling (SmartReflex™)
• Hyperlink 50
  – 50G Expansion port
  – Transparent to Software
• Multicore Debugging

[Figure: C6678 (Shannon) block diagram. System elements: Power Mgt, SysMon, Debug, EDMA. Eight C66x cores, each with L1 D, L1 P and L2 memory. Navigator. Peripherals and I/O: sRIO, TSIP, Flash, PCIe, UART, SPI, I2C. TeraNet 2 and Hyperlink50. Multicore memory system: Multicore Memory Controller, Shared Memory, 64b DDR-3. Crypto/IPSec CoProcessor, Packet CoProcessor, Enet Switch with two SGMII ports.]


Enhanced DSP core
C66x (performance improvement)
• 100% backward object code compatible
• Increased fixed- and floating-point capability
• Improved support for complex arithmetic and matrix computation

C67x+ (floating-point value)
• 2x registers
• Enhanced floating-point add capabilities

C64x+ (fixed-point value)
• SPLOOP and 16-bit instructions for smaller code size
• Flexible level one memory architecture
• iDMA for rapid data transfers between local memories

C67x (floating-point value)
• Native instructions for IEEE 754, SP & DP
• Advanced VLIW architecture

C64x (fixed-point value)
• Advanced fixed-point instructions
• Four 16-bit or eight 8-bit MACs
• Two-level cache


C66x core block diagram
[Figure: C66x core. A 256-bit path feeds Instruction Fetch, Instruction Dispatch (with the SPLOOP Buffer) and Instruction Decode, alongside the Control Registers, Interrupt Control and In-Circuit Emulation blocks. Data Path 1 holds the A register file (A0 – A31) and functional units L1, S1, M1, D1. Data Path 2 holds the B register file (B0 – B31) and functional units D2, M2, S2, L2. The data paths connect over 2x64 bits.]


Key Improvements of C66x
• 4x multiply-accumulate improvement
• Enhanced complex arithmetic and matrix operations
• 2x arithmetic and logical operations improvement
• Floating-point arithmetic support: single-precision floating-point throughput matches 32-bit fixed-point throughput
• Division and square root are supported by floating-point instructions


C64x+ → C66x Comparison

Operation    Precision                    Operations per   Operations per   Function unit
                                          cycle on C64x+   cycle on C66x
MAC          Real 8 x 8                   2 x 4 = 8        2 x 8 = 16       M1, M2
             Real 16 x 16                 2 x 2 = 4        2 x 8 = 16       M1, M2
             Real 32 x 32                 2 x 1 = 2        2 x 4 = 8        M1, M2
             Complex (16,16) x (16,16)    2 x 1 = 2        2 x 4 = 8        M1, M2
             Complex (32,32) x (32,32)    N/A              2 x 1 = 2        M1, M2
Arithmetic/  8 bit                        4 x 4 = 16       4 x 8 = 32       L1, L2, S1, S2
Logical      16 bit                       4 x 2 = 8        4 x 4 = 16       L1, L2, S1, S2
             32 bit                       4 x 1 = 4        4 x 2 = 8        L1, L2, S1, S2
Memory       8 bit, 16 bit, 32 bit,       2 x 1 = 2        2 x 1 = 2        D1, D2
Access       64 bit


Outline
• C6678 DSP Overview
• Multi-core DSP programming
  – Memory Architecture Overview
  – Shannon Memory Architecture Improvement
  – Programming model
• Interconnection and resource sharing
• Peripherals overview


TCI6486 Memory Architecture
[Figure: TCI6486 memory architecture. Core 0 to Core N each have internal L2 RAM. A Shared L2 Control block fronts the shared L2 RAM and shared L2 ROM over a 256-bit path at (core speed)/2. External memory is reached through the DMA SCR over a 128-bit path at (core speed)/3. EDMA is a master on the SCR.]


Shannon Memory Architecture
[Figure: Shannon memory architecture. Core 0 to Core N each have internal L2 RAM. A Shared Memory Control block fronts both the shared L2 RAM and external memory, and the cores reach it over a 256-bit path at (core speed)/2. The DMA SCR is 128 bits wide at (core speed)/3, with the EDMA controllers as masters.]


Outline: Multi-core DSP programming / Shannon Memory Architecture Improvement


Addition of XMC
• Brings over the existing EMC MDMA path
• Fat pipe to external (and internal) shared memory
  – Bus width: 256 instead of 128 bits
  – Clock rate: CPUCLK/2 instead of CPUCLK/3
• Optimizes requests for MSMC / DDR3 memory
  – L2 line allocations and evictions are split into sub-lines of 64 bytes
• Memory Protection and Address Extension (MPAX) support
  – 16 segments of programmable size (powers of 2: 4KB to 4GB)
  – Each segment maps a 32-bit address to a 36-bit address.
  – Each segment controls access: supervisor/user, R/W/X, (non-)secure
  – Memory protection for shared internal MSMC memory and external DDR3 memory
• Multi-stream prefetch support
  – Program prefetch buffer up to 128 bytes
  – Data prefetch buffer up to 8 x 128 bytes
  – Prefetch enabled/disabled on 16MB ranges defined in MAR
  – Manual flush for coherence purposes
• Note: no IDMA path
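
A rough sketch of what the wider, faster XMC path means for peak bandwidth, taking the 1.0 GHz core clock from the overview slide as CPUCLK. The function name is hypothetical, and the result is a theoretical peak (one bus word per bus clock), not a measured figure.

```c
#include <assert.h>
#include <stdint.h>

/* Theoretical peak bytes per second of a bus that is bus_bits wide and
 * clocked at cpu_hz / divider, assuming one bus word moves per bus clock. */
uint64_t peak_bytes_per_sec(uint32_t bus_bits, uint64_t cpu_hz, uint32_t divider)
{
    assert(divider != 0 && bus_bits % 8 == 0);
    return (uint64_t)(bus_bits / 8) * cpu_hz / divider;
}
```

With CPUCLK = 1 GHz this gives 16 GB/s for the 256-bit CPUCLK/2 path against about 5.3 GB/s for a 128-bit CPUCLK/3 path, a 3x difference.
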


MAR Register Extension

• The L2 memory controller extends the MAR registers with a “PFX” field. It uses this bit to tell the XMC whether a given address range is prefetchable.


MSMC Block Diagram
[Figure: MSMC block diagram. N CGEM slave ports (256 bits each, one per CGEM core) and two system slave ports on VBUSM 256, one for shared SRAM (SMS) and one for external memory (SES), each behind a Memory Protection and Extension Unit (MPAX), feed the MSMC datapath. The datapath arbitrates access to the RAM banks (256 bits per bank, with EDC for SRAM), the MSMC system master port (VBUSM 256, to the SCR) and the MSMC EMIF master port (VBUSM 256, to the 64-bit DDR3 EMIF). The MSMC core also raises events.]
• One slave interface per C66x Megamodule (256 bits @ CPUCLK/2)
  – Uses a 36 bit address extended inside a C66x Megamodule core
• Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters
  – SMS interface for accesses to MSMC SRAM space
  – SES interface for accesses to DDR3 space
  – Both interfaces support memory protection and address extension
• One master interface (256 bits @ CPUCLK/2) for access to the DDR3 EMIF
• One master interface (256 bits @ CPUCLK/2) for access to system slaves


MSMC Shared Memory
• 4 banks x 2 sub-banks; each sub-bank is 256 bits wide.
  – Reduces conflicts between C66x Megamodule cores and system masters
• Features a dynamic fair-share bank arbitration for each transfer
• Supports bandwidth management: avoids indefinite starvation of lower priority requests by higher priority requests
• Features not supported
  – Cache coherency between L1/L2 caches in C66x Megamodule cores and MSMC memory
  – Cache coherency between XMC prefetch buffers and MSMC memory
  – C66x Megamodule to C66x Megamodule cache coherency via MSMC


MPAX Units
• MPAX stands for “Memory Protection and Address Extension”
• There are N+2 MPAX units in a system with N C66x Megamodules
  – N MPAX units for all requests from N C66x Megamodules to internal shared memory, external shared memory or any system slave
  – 1 MPAX unit for all requests from any system master to internal shared memory
  – 1 MPAX unit for all requests from any system master to external shared memory
• Each MPAX unit operates on a number of segments of programmable size
  – Each segment maps a 32-bit address to a 36-bit address.
  – Each segment controls access.


Number of Segments
• Each C66x Megamodule has 16 segments which control direct (load/store) requests to internal shared memory, external shared memory and any other system slave.
• Any master identified by a privilege ID has
  – 8 segments for requests to internal shared memory
  – 8 segments for requests to external shared memory.
• Some masters work on behalf of other masters. They inherit the privilege ID of their commanding master. As such, each C66x Megamodule also has
  – 8 segments for indirect (DMA) requests to internal shared memory
  – 8 segments for indirect (DMA) requests to external shared memory


Segment Definition
• Each segment is defined by a base address and a size
  – The segment size can be set to any power of 2 from 4KB to 4GB
  – The segment base address must be aligned to a power-of-2 boundary equal to the segment size.
• One would expect that each request should find one matching segment, however ...
  – a request may find two or more matching segments, in which case segments with higher ID take priority over segments with lower ID. This allows
    · creating non-power-of-2 segments
    · creating 3 segments with just 2 segment definitions
    · ...
  – a request may find no matching segment, in which case an error is reported in the memory protection fault reporting registers (XMPFAR, XMPFSR)


XMC Segment Registers
XMPAXH/XMPAXL[15-0]



MPAX Default Memory Map
[Figure: the CGEM logical 32-bit memory map (0000_0000 to FFFF_FFFF, with 0000_0000 to 0BFF_FFFF not remappable) is mapped by the MPAX segments onto the system physical 36-bit memory map: the lower 4GB (0:0000_0000 to 0:FFFF_FFFF) and the upper 60GB (1:0000_0000 to F:FFFF_FFFF). Logical 8000_0000 to FFFF_FFFF lands on 8:0000_0000 to 8:7FFF_FFFF.]

Segments 15 to 2   Disabled
Segment 1          BADDR = 80000h; RADDR = 800000h; Size = 2GB
Segment 0          BADDR = 00000h; RADDR = 000000h; Size = 2GB

• XMC configures MPAX segments 0 and 1 so that the C66x Megamodule can access system memory.
• The power-up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in the C66x Megamodule’s address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.
  – This corresponds to the first 2GB of address space dedicated to EMIF by the MSMC controller.


MPAX MSMC Aliasing Example
[Figure: three ranges of the CGEM 32-bit memory map (0Cxx_xxxx, 20xx_xxxx and 21xx_xxxx) all map to the 2MB of MSMC RAM at 0:0C00_0000 to 0:0C1F_FFFF. Logical 0000_0000 to 0BFF_FFFF is not remappable.]

BADDR = 21000h; RADDR = 00C000h; Size = 2MB   (21xx_xxxx, MSMC RAM Alias 2)
BADDR = 20000h; RADDR = 00C000h; Size = 2MB   (20xx_xxxx, MSMC RAM Alias 1)
BADDR = 0C000h; RADDR = 00C000h; Size = 2MB   (0Cxx_xxxx, “Fast” MSMC RAM)

• Example shows 3 segments that map the MSMC RAM address space into the C66x Megamodule’s address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.
• Accesses to MSMC RAM via the aliases do not use the “fast RAM” path and incur additional cycles of latency.


MPAX Overlayed Segments Example
[Figure: the CGEM 32-bit memory map (0000_0000 to FFFF_FFFF, with 0000_0000 to 0BFF_FFFF not remappable) mapped onto the system 36-bit memory map (lower 4GB and upper 60GB). Logical C000_7xxx is carved out of segment 1 and lands on 0:5004_2xxx.]

Segments 15 to 3   Disabled
Segment 2          BADDR = C0007h; RADDR = 050042h; Size = 4K
Segment 1          BADDR = 80000h; RADDR = 080000h; Size = 2GB
Segment 0          BADDR = 00000h; RADDR = 000000h; Size = 2GB

• Segment 1 matches 8000_0000 through FFFF_FFFF, and segment 2 matches C000_7000 through C000_7FFF.
• Because segment 2 is higher priority than segment 1, its settings take priority, effectively carving a 4K hole in segment 1’s 2GB address space.
• Furthermore, it maps this 4K space to 0:5004_2000 - 0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now accessible by two logical address ranges.
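
The segment matching and address extension described above can be sketched as a small software model. This is an illustration only: the struct is hypothetical and is not the XMPAXH/XMPAXL register layout, BADDR and RADDR are assumed to hold the address bits above bit 11 (as their values in the examples suggest), and the non-remappable range below 0C00_0000 is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define MPAX_SEGMENTS 16

/* Hypothetical model of one MPAX segment (not the register layout). */
typedef struct {
    int      enabled;
    uint32_t baddr;  /* logical base address, bits 31..12 */
    uint32_t raddr;  /* replacement address, bits 35..12 */
    uint64_t size;   /* bytes, power of 2 from 4KB to 4GB */
} mpax_seg_t;

/* Translates a 32-bit logical address to a 36-bit physical address.
 * Returns the ID of the matching segment, or -1 if no segment matches
 * (the case that the hardware reports as a fault). */
int mpax_translate(const mpax_seg_t seg[MPAX_SEGMENTS],
                   uint32_t logical, uint64_t *physical)
{
    assert(physical != 0);
    for (int id = MPAX_SEGMENTS - 1; id >= 0; id--) {  /* higher ID wins */
        const mpax_seg_t *s = &seg[id];
        if (!s->enabled)
            continue;
        uint64_t offset_mask = s->size - 1;
        uint64_t base = (uint64_t)s->baddr << 12;
        if (((uint64_t)logical & ~offset_mask) != (base & ~offset_mask))
            continue;
        *physical = (((uint64_t)s->raddr << 12) & ~offset_mask)
                  | ((uint64_t)logical & offset_mask);
        return id;
    }
    return -1;
}
```

Loaded with the three segments of the overlay example, the model sends C000_7123 to 0:5004_2123 through segment 2 and 9000_0000 to 0:9000_0000 through segment 1.
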


Outline: Multi-core DSP programming / Programming model


Single program image
[Figure: C6000 cores 0, 1 and 2, each with L1 Data, L1 Prog and L2 memory. Either each core's L2 memory holds App.out (code and read/write data) with its own Data 0, Data 1 or Data 2, or App.out (shared code and read-only data) sits in shared L2 or DDR memory next to Data 0, Data 1 and Data 2.]
• Same image on each DSP core
• Aliased addressing used for DSP core to access local L2
• DNUM DSP core register for:
  – Global addressing when programming EDMA3, SRIO, …
  – Separate buffer per DSP core in DDR: dp = bufBase + BUF_SIZE*DNUM
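
A minimal sketch of the two DNUM uses listed above. The address constants are assumptions taken from the usual C6678 memory map (core-local aliases below 0100_0000, core 0 global alias at 1000_0000, one 16MB window per core) and should be checked against the device data manual. On the device the core number would come from the DNUM register; here it is passed in as a parameter so the sketch also runs on a host.

```c
#include <assert.h>
#include <stdint.h>

#define LOCAL_ALIAS_LIMIT  0x01000000u  /* assumption: local aliases sit below 16MB */
#define GLOBAL_CORE0_BASE  0x10000000u  /* assumption: global alias base of core 0 */
#define GLOBAL_CORE_STRIDE 0x01000000u  /* assumption: one 16MB window per core */
#define NUM_CORES          8u

/* Converts a core-local (aliased) address into the global address that an
 * EDMA3 or SRIO descriptor needs. Global addresses pass through unchanged. */
uint32_t local_to_global(uint32_t addr, uint32_t core)
{
    assert(core < NUM_CORES);
    if (addr >= LOCAL_ALIAS_LIMIT)
        return addr;
    return GLOBAL_CORE0_BASE + core * GLOBAL_CORE_STRIDE + addr;
}

/* Per-core buffer in shared memory: bufBase + BUF_SIZE * DNUM. */
uint32_t per_core_buffer(uint32_t buf_base, uint32_t buf_size, uint32_t core)
{
    assert(core < NUM_CORES);
    return buf_base + buf_size * core;
}
```

Because every core runs the same image, both helpers give a different result on each core from identical code.
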


Shannon MPAX enables easy single program image
[Figure: the virtual (CGEM) address spaces of cores 1 to n all show the same layout: internal, code1 and data2 in MSMC RAM, and code2 and data3 in external memory. MPAX maps each of them onto the SoC address space, which holds one copy of code1 and code2 and separate copies of data2 and data3.]


Multiple program image
[Figure: C6000 cores 0, 1 and 2, each with L1 Data, L1 Prog and L2 memory. Each core runs its own image (App0.out, App1.out, App2.out) with its own data (Data 0, Data 1, Data 2), held either in that core's L2 memory or in shared L2 or DDR memory.]
• Each DSP core has its own image
• Static split of DDR memory per DSP core
• Global or local addressing used for L2 addressing


Shannon Software
• Flexible development environment for the customer.
• Customer can choose to develop their application using all or any one of the software layers.
• Will contain the following software layers
  – BIOS and Linux Operating System support
  – Chip Support Library
  – Platform Development Kit
  – Inter Core Communication
  – Optimized DSP functions library
  – Optimized Audio, Video and Speech codecs
  – Voice Gateway Demonstration Kit
  – Video Transcoding Demonstration Kit
  – Demonstration applications

[Figure: C6678 software stack, top to bottom. Customer Application. Voice Gateway Demonstration Kit, Video Transcoding Demonstration Kit, Demo App. Speech Codec, Audio Codec, Video Codec, DSPLIB. Operating System w/ Boot Loader (BIOS, Linux). Multi-core Entitlement: Inter Core Communication. Full Silicon Entitlement: Platform Development Kit, Chip Support Library.]


Shannon Debug
Best Multicore Debug and Visualization
Debug enabled Multicore SoC
[Figure: the C6678 (Shannon) functional diagram with the debug features highlighted: ETB, TRACE and Data Visualization.]
Debug visibility at core, across multicore and for SoC


Outline
• C6678 DSP Overview
• Multi-core DSP programming
• Interconnection and resource sharing
  – Interconnection Architecture
  – Shannon Hardware queue
  – Inter-core communication
  – Shared Resource Management
• Peripherals overview


Shannon Switch Fabric
[Figure: Shannon switch fabric. A 256-bit VBUSM SCR at CPU/2 connects HyperLink (VUSR, master and slave) and EDMA_0 (16-channel DMA, TC0 and TC1) to MSMC_SS, which fronts the shared L2 RAM and DDR3 and is also reached by the cores through XMC. A 128-bit VBUSM SCR at CPU/3 connects EDMA_1 and EDMA_2 (64-channel DMA each, TC2 to TC5 and TC6 to TC9), the GEM cores (as masters and slaves), SRIO, PA_SS, QM_SS, PCIe, TSIP 0,1 and EMIF16. A 32-bit VBUSP CONFIG SCR runs at CPU/3.]


Outline: Interconnection and resource sharing / Shannon Hardware queue


Hardware Queue Architecture
• Packetized data transfer architecture designed to minimize DSP core interaction while maximizing memory and bus efficiency
• The key communication platform for TI’s future infrastructure DSPs
• Used by the following peripherals in Shannon:
  – Serial RapidIO, Packet Accelerator
  – Each module contains its own DMA to transfer the data associated with the ‘jobs’; no CPU resources are involved


Hardware Queue
[Figure: producers (CPU1, CPU2, CPU3, Packet Acc., RapidIO, …) send a ‘job’ to the Queue Manager (Queue 1..x); consumers (CPUx, Acc 1, Acc 2, RapidIO, peripheral DMA, …) retrieve a ‘job’.]
• Producer writes ‘jobs’ into a Queue.
• Consumer reads ‘jobs’ from the Queue
• Supports Multiple In – Multiple Out
  – Multiple Producers can write to the same Queue
    · Used to share common hardware
  – Multiple Consumers can read from the same Queue
    · Used for load balancing
• Abstracts the Consumer
  – Consumer can be a hardware IP (accelerator, peripheral) or software (i.e. a CPU core)
  – Transparent for the Producer
  – → ‘Easy’ to upgrade to new hardware. The ‘job gets done’.
  – → Minimizes changes to Host software, easy maintenance


Packet Queuing Data Structure Diagram



Hardware Queue Operation
• Push to a queue
  – Host writes the pointer of the new descriptor to a queue register.
  – Queue manager links the new descriptor to the tail (or head) of the queue by modifying the link RAM.
  – Tail (or head) pointer points to the new descriptor.
• Pop from a queue
  – Host reads a descriptor pointer from a queue register.
  – Queue manager returns the descriptor pointed to by the head pointer.
  – Head pointer points to the next descriptor.
• Monitor queue
  – Queue manager generates events when the queue changes: not empty, entry count, exceed threshold, starvation…
• Queue diversion
  – Entire queue contents can be cleared or moved to another queue destination using a single register write


Shannon Hardware queue architecture
[Figure: the Queue Manager Subsystem holds the Queue Manager with queue interfaces Q0, Q1 … Qx, the descriptor RAM, the link RAM, an internal Packet DMA with internal buffer memory, and two APDSPs that raise queue interrupts. Over the VBUS it serves the DSP cores (through accumulation buffers and queue interrupts) and the Packet DMAs of SRIO and PA (through queue events).]


Queue Manager Subsystem
• Supports 8192 queues
• HW queues are multi-core safe without mutual exclusion; multiple senders can use a destination queue without restrictions
• Can notify the Packet DMA when a transfer is pending
• Can notify a DSP core when a packet is pending; can copy descriptor pointers of transferred data to the destination core’s local memory to reduce access latency
• Internal Packet DMA
  – Transfers packets from one queue to another queue. Good for core-to-core data transfer.


Descriptor RAM
• Data elements (buffers) to be passed on queues are first described to a descriptor region manager built into the QM.
• Although technically called descriptors, these memory elements can hold any arbitrary data.
• The size of the data elements must be a power of 2, from 32 bytes to 8192 bytes in length.
• 20 configurable memory regions (for descriptor storage)
• The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.

[Figure: memory and the Memory Descriptor Region Registers. Region 0 starts at 0x1000 and holds 32-byte buffers with indexes 0 to 15. Region 1 starts at 0x2000 and holds 256-byte buffers with indexes 16 to 19. The regions run up to Region 19.]
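
One way to picture how a descriptor index relates to a memory address, using the two regions as read from the figure (region 0: base 0x1000, 16 elements of 32 bytes from index 0; region 1: base 0x2000, 4 elements of 256 bytes from index 16). The struct and function are hypothetical and do not reflect the actual QM register layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one descriptor region (not the QM registers). */
typedef struct {
    uint32_t base;         /* address of the first element in the region */
    uint32_t start_index;  /* descriptor index of the first element */
    uint32_t count;        /* number of elements in the region */
    uint32_t elem_size;    /* element size in bytes, a power of 2 */
} desc_region_t;

/* Address of the element that carries the given descriptor index. */
uint32_t desc_address(const desc_region_t *r, uint32_t index)
{
    assert(r != 0);
    assert(index >= r->start_index && index - r->start_index < r->count);
    return r->base + (index - r->start_index) * r->elem_size;
}
```

Under these assumptions index 5 sits at 0x10A0 in region 0 and index 17 at 0x2100 in region 1.
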


Linking RAM
• Linking RAM contains 1 entry for each descriptor. A Linking RAM entry is effectively an extension of the descriptor.
• Linking RAM stores the forward data pointer that is critical for the PUSH / POP operations performed by the Queue Manager
• Linkage between the physical address of a descriptor and the physical address of its Linking RAM entry is performed inside the QM using information provided in the Queue Management configuration registers
• Linking RAM is typically placed in local memory for speed. This allows data elements to be linked and unlinked in a queue very quickly, even though the buffers themselves may be in external memory
• There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.
• 2 configurable Linking RAM regions

Linking RAM, Forward Pointer Table (entries 0 to 19, four per row; x marks the end of a queue):
17   -   -   -
 -   x   -   -
 -   -   -   -
 -   -   -   -
 -   5  19   x

Queue contents:
Queue 0: 0, 17, 5
Queue 1: 18, 19
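
A toy software model of how push and pop walk the forward pointers. It is a sketch only: the real Queue Manager does this in hardware, and every name here is hypothetical. The table size of 20 matches the figure above.

```c
#include <assert.h>

#define NUM_DESC 20   /* one Linking RAM entry per descriptor, as in the figure */
#define NIL      (-1) /* end-of-queue marker (the 'x' in the figure) */

/* Hypothetical queue state: head and tail descriptor indexes. */
typedef struct {
    int head;
    int tail;
    int count;
} hwq_t;

static int link_ram[NUM_DESC];  /* forward pointer for each descriptor index */

void hwq_init(hwq_t *q)
{
    q->head = NIL;
    q->tail = NIL;
    q->count = 0;
}

/* Push to tail: the old tail's forward pointer is set to the new descriptor. */
void hwq_push(hwq_t *q, int desc)
{
    assert(desc >= 0 && desc < NUM_DESC);
    link_ram[desc] = NIL;
    if (q->tail == NIL)
        q->head = desc;
    else
        link_ram[q->tail] = desc;
    q->tail = desc;
    q->count++;
}

/* Pop from head: returns the descriptor index, or NIL if the queue is empty. */
int hwq_pop(hwq_t *q)
{
    int desc = q->head;
    if (desc == NIL)
        return NIL;
    q->head = link_ram[desc];
    if (q->head == NIL)
        q->tail = NIL;
    q->count--;
    return desc;
}
```

Pushing 0, 17, 5 to one queue and 18, 19 to another leaves the forward pointers 0 → 17, 17 → 5 and 18 → 19, the same links as the table. Queue length is bounded only by the number of Linking RAM entries, which mirrors the "no limit to the length of a single queue" point.
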


Queue Data Flow Example, Transmit
[Figure: Host Processor and Interrupt Generator above the Queue Manager (Free Descriptor Queue, Tx Queue, Rx Queue), with the Tx Port and Rx Port below.]
INIT: Host allocates Rx Free Descriptors and initializes queues
TX 1: Processor fetches a descriptor to fill with the data to transmit
TX 2: Processor queues a packet to a Tx Queue
TX 3: Port transmits the buffer being pointed to by the descriptor
TX 4: Port posts Packet Descriptor to return Queue


Queue Data Flow Example, Receive
[Figure: Host Processor and Interrupt Generator above the Queue Manager (Free Descriptor Queue, Tx Queue, Rx Queue), with the Tx Port and Rx Port below.]
INIT: Host allocates Rx Free Descriptors and initializes queues
RX 1: Port fetches a Free Descriptor and transfers the data to the buffer pointed to by the descriptor
RX 2: Port posts Packet to Rx Queue
RX 3: Not Empty Level Status is signalled to the Interrupt Generator
RX 4: Interrupt according to pacing rules, or poll. Optionally prefetches the descriptor to L2 prior to interrupting


Accumulator (A Programmable DSP)
• The Accumulator helps a DSP core efficiently POP descriptor pointers from a queue.
• The Accumulator pops descriptor pointers from the queue and writes them to accumulation memory (normally in DSP local memory).
• The Accumulator generates an interrupt to the DSP core according to the interrupt pacing configuration.
• Two Accumulators (PDSP)
  – One generates 32 interrupts, each for one queue.
  – The other generates 16 interrupts, each a combined event for 32 queues. In total it monitors 16x32 queues.

[Figure: the Queue Manager with its Descriptor RAM sends Queue Events to the APDSP, which monitors queue changes, uses a timer for interrupt pacing, fills the Accumulation Memory (Descriptor Pointer Array) and raises Queue Interrupts to the DSP core.]


Hardware queue Performance Consideration
• Push operation
  – 1~4 word write. Since it is a posted operation, it normally does not stall the DSP core.
• Pop operation
  – 1~4 word read. Stalls the DSP core for about 80~100 cycles.
  – The Accumulator (PDSP) can pop the descriptors to DSP local memory, which saves DSP cycles dramatically.
• Descriptor access
  – Writing or reading a full descriptor may consume many cycles.
  – For most applications, the DSP core can initialize all descriptors during initialization and only write or read a few fields of the descriptor at run time.


Outline: Interconnection and resource sharing / Inter-core communication


Shared Data in the L2 SRAM of transmitter
[Figure: DSP Core X and DSP Core Y, each with L1 cache and L2 RAM / L2 cache, connected through the Data Switch Fabric Center to external DDR SDRAM. The shared data sits in the L2 RAM of the transmitter, Core X.]
• If cache is enabled, Core Y needs to invalidate its cache before it reads


Shared Data in the L2 SRAM of receiver
[Figure: DSP Core X and DSP Core Y, each with L1 cache and L2 RAM / L2 cache, connected through the Data Switch Fabric Center to external DDR SDRAM. The shared data sits in the L2 RAM of the receiver, Core Y.]
• If cache is enabled, Core X needs to write back its cache after it writes


Shared Data in the shared memory
[Figure: DSP Core X and DSP Core Y, each with L1 cache and L2 RAM / L2 cache, connected through the Data Switch Fabric Center to shared L2 or DDR, where the shared data sits.]
• If cache is enabled, Core X needs to write back its cache after it writes; Core Y needs to invalidate its cache before it reads
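
Why both steps are needed can be seen in a toy model: one word of shared memory and a private, one-line write-back cache per core. This is not device code. All names are hypothetical, and on the device the write-back and invalidate would be done with the cache control operations of the chip support software.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a private one-line write-back cache in front of one shared word. */
typedef struct {
    int      valid;
    int      dirty;
    uint32_t data;
} cache_line_t;

static uint32_t shared_word;

/* A read hits the private copy if it is valid, else fetches from shared memory. */
uint32_t core_read(cache_line_t *c)
{
    assert(c != 0);
    if (!c->valid) {
        c->data  = shared_word;
        c->valid = 1;
        c->dirty = 0;
    }
    return c->data;
}

/* A write only updates the private copy. */
void core_write(cache_line_t *c, uint32_t value)
{
    assert(c != 0);
    c->data  = value;
    c->valid = 1;
    c->dirty = 1;
}

/* Writer side: push dirty data out to shared memory. */
void cache_writeback(cache_line_t *c)
{
    if (c->valid && c->dirty) {
        shared_word = c->data;
        c->dirty = 0;
    }
}

/* Reader side: drop the private copy so that the next read refetches. */
void cache_invalidate(cache_line_t *c)
{
    c->valid = 0;
    c->dirty = 0;
}
```

If Core X writes without a write-back, or Core Y reads without an invalidate, Core Y keeps returning its stale private copy. Only write-back on X followed by invalidate on Y makes the new value visible.
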


Use IPC register for inter-core communication
[Figure: DSP Core X writes an IPC register through the Configuration Switch Fabric; the IPC logic signals DSP Core Y. Each core has L1 cache and L2 RAM / L2 cache.]
• Interrupt is generated for Core Y
• No cache coherency issue


Inter-core Data Block exchange with EDMA
[Figure: EDMA copies a data block from the L2 RAM of DSP Core X to the L2 RAM of DSP Core Y through the Data Switch Fabric Center. Each core has L1 cache and L2 RAM / L2 cache.]
• Interrupt is generated for Core Y
• No cache coherency issue


Inter-core data exchange through hardware queue (Packet DMA copy)
[Figure: the Packet DMA moves data from a source queue (Src Que) fed by DSP Core X to a destination queue (Dst Que) read by DSP Core Y, through the Data Switch Fabric Center. Each core has L1 cache and L2 RAM / L2 cache.]
• Core X simply pushes data to the Source Queue
• Packet DMA transfers the data to the Dest Queue
• Core Y simply pops data from the Dest Queue
• If the queue buffers are in L2 RAM, software on both cores does not need to maintain cache coherency.


Inter-core data exchange through hardware queue (Zero Copy)
[Figure: DSP Core X and DSP Core Y exchange data through a Shared Queue held by the Queue Manager, connected through the Data Switch Fabric Center. Each core has L1 cache and L2 RAM / L2 cache.]
• Core X pushes data to the Shared Queue; Core Y pops data from the Shared Queue
• Multiple cores can access the Shared Queue simultaneously without mutual exclusion
• Software needs to maintain cache coherency.


Outline: Interconnection and resource sharing / Shared Resource Management


Shared resources
• Internal shared L2 and external shared memory (DDR)
  – Each core accesses shared memory independently. Arbitration is handled by the switch fabric and end-point arbiters.
• Shared on-chip peripherals
  – Configuration: typically done at startup to set the operating mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.
  – Use:
    · Peripherals with a hardware queue: each core accesses the hardware queue independently. Arbitration is handled by the queue manager.
      Examples: Ethernet, SRIO on Shannon…
    · Multi-channel peripherals can be split amongst the cores for concurrent, orthogonal control.
      Examples: EDMA, TSIP, Timer…
    · Single-channel peripherals can be controlled by a single master that services the other cores if needed, or used mutually exclusively by multiple masters through a semaphore.
      Examples: I2C, SPI…


System-level prioritization for arbitration
• A user-specified priority may be assigned to:
  – Any DSP core accesses
  – Any EDMA, sRIO, Ethernet, … on-chip transfers
• Each master port is assigned a configurable priority (8 levels)


Hardware Semaphores on Shannon for atomic accesses
• What function does the Semaphore module provide?
  – A method to control who accesses a shared resource
  – Provides accesses for shared resources in an atomic manner
    · Read-modify-write sequence is not broken
• Features of the Semaphore module
  – Binary semaphore
  – Contains 64 semaphores to be used within the system
  – Two methods of accessing a semaphore resource
    · Direct access
      A core directly accesses a semaphore resource. If it is free, the semaphore is granted. If not, the semaphore is not granted.
      Useful if the system can afford to poll for the semaphore.
    · Indirect access
      A core indirectly accesses a semaphore resource by writing to it. Once the semaphore is free, an interrupt notifies the DSP core that it is available.
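
The direct access method can be sketched as a small single-threaded model. The function names and the owner table are hypothetical. In the hardware the test-and-grant happens atomically inside the Semaphore module, which is what keeps the read-modify-write sequence unbroken.

```c
#include <assert.h>

#define NUM_SEMAPHORES 64
#define SEM_FREE       (-1)

static int sem_owner[NUM_SEMAPHORES];  /* owning core, or SEM_FREE */

void sem_init(void)
{
    for (int i = 0; i < NUM_SEMAPHORES; i++)
        sem_owner[i] = SEM_FREE;
}

/* Direct access: returns 1 if the semaphore was free and is now granted
 * to this core, 0 if it is already held. */
int sem_try_acquire(int sem, int core)
{
    assert(sem >= 0 && sem < NUM_SEMAPHORES);
    if (sem_owner[sem] != SEM_FREE)
        return 0;
    sem_owner[sem] = core;
    return 1;
}

/* Only the owning core can free the semaphore. */
void sem_release(int sem, int core)
{
    assert(sem >= 0 && sem < NUM_SEMAPHORES);
    if (sem_owner[sem] == core)
        sem_owner[sem] = SEM_FREE;
}
```

A core that can afford to poll simply loops on `sem_try_acquire` until it returns 1, touches the shared resource, then calls `sem_release`.
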


Outline: Peripherals overview


Shannon RapidIO Gen 2 Features and Enhancements
• 4 lanes – options include 2x
• Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane
• DeviceID support
  – 16 Local DeviceIDs (up from 1)
  – 8 Multicast IDs (up from 3)
• 24 Interrupt outputs (up from 8)
• Messaging
  – Type 9 Packets Support (Data Streaming)
  – Type 11 Message – classification improvements
• DirectIO
  – 8 Load/Store (DirectIO) Units (up from 4)
  – Shadow register sets for LSUs to simplify management and minimize overhead
  – Provide up to 1MB block transfers (up from 4KB)
• Packet Forwarding with Reset Isolation


RapidIO – Topology Examples
[Figure: C6678 DSPs connected over SRIO in four example topologies: Ring, Switch (DSPs attached to SRIO switches), Chain and Mesh.]


Packet Accelerator Subsystem On Shannon
• 3 Port Ethernet Switch
  – Port 0: Internal hardware queue port
  – Port 1: SGMII 0 Port, 1Gbps
  – Port 2: SGMII 1 Port, 1Gbps
• Packet Accelerator (PA)
  – L2, L3, and L4 packet processing
  – 1.5M packets per sec
• Security Accelerator (SA)
  – Encryption/Decryption
  – IPSEC ESP
  – IPSEC AH
  – SRTP
  – 3GPP


IEEE 1588 support
Device A is the master device
Device B is the slave device

Message B is used to send the actual transmit time (tA) of Message A
Message D is used to send the actual receive time (rC) of Message C

Wire time in one direction = ((rC - tA) - (tC - rA)) / 2
(rA is the receive time of Message A and tC the transmit time of Message C, both taken at Device B.)

• EMAC hardware classifies ingress and egress frames at the physical level as timing synchronization frames and records their timestamps.
• A software algorithm running on a DSP core then calculates the delay and adjusts local time accordingly.
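
The software step can be sketched directly from the formula above. `wire_delay` is the expression from the slide. `slave_offset` is not on the slide: it is the standard companion expression for the difference between the slave and master clocks, added here as an assumption about what "adjust local time" would use. Timestamps are in arbitrary integer units, such as nanoseconds.

```c
#include <assert.h>
#include <stdint.h>

/* One-way wire time: ((rC - tA) - (tC - rA)) / 2.
 * tA and rC are read on the master clock, rA and tC on the slave clock. */
int64_t wire_delay(int64_t tA, int64_t rA, int64_t tC, int64_t rC)
{
    assert(rC >= tA && tC >= rA);  /* each pair shares one clock */
    return ((rC - tA) - (tC - rA)) / 2;
}

/* Assumed companion formula: how far the slave clock runs ahead of the
 * master clock. The slave subtracts this to align its local time. */
int64_t slave_offset(int64_t tA, int64_t rA, int64_t tC, int64_t rC)
{
    return ((rA - tA) - (rC - tC)) / 2;
}
```

For a link with a true delay of 10 and a slave clock 100 ahead, the exchange tA = 1000, rA = 1110, tC = 1200, rC = 1110 yields a delay of 10 and an offset of 100. Both formulas assume the wire time is the same in each direction.
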


TSIP Overview
• 1024 8-bit timeslots, receive and transmit.
  – 8 links of 128 timeslots at 8.192 Mbps.
  – 4 links of 256 timeslots at 16.384 Mbps.
  – 2 links of 512 timeslots at 32.768 Mbps.
• Two clock and frame sync inputs.
  – Independent clocking – 1 receive clock and 1 transmit clock.
  – Redundant/common clocking – 1 receive/transmit clock with second clock as backup.
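
The three link rates follow from the timeslot counts if each frame repeats 8000 times per second. That 8 kHz frame rate is an assumption (the usual TDM rate), not a number stated on the slide, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_TIMESLOT 8u
#define FRAMES_PER_SEC    8000u  /* assumption: standard 8 kHz TDM frame rate */
#define TOTAL_TIMESLOTS   1024u

/* Serial bit rate of one TSIP link carrying the given number of timeslots. */
uint32_t tsip_link_bits_per_sec(uint32_t timeslots_per_link)
{
    assert(timeslots_per_link > 0 && timeslots_per_link <= TOTAL_TIMESLOTS);
    return timeslots_per_link * BITS_PER_TIMESLOT * FRAMES_PER_SEC;
}
```

128, 256 and 512 timeslots give 8.192, 16.384 and 32.768 Mbps, and in every configuration links x timeslots comes to the same 1024 timeslots.
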


Shannon PCIe Interface
Nyquist/Shannon incorporates a PCIe interface with the following characteristics:
• Two SERDES lanes running at 5 GBaud/2.5 GBaud
• Gen2 compliant
• Three different operational modes (default defined by pin inputs at power up; can be overwritten by software):
  – Root Complex (RC)
  – End Point (EP)
  – Legacy End Point
• Single Virtual Channel (VC)
• Single Traffic Class (TC)
• Maximum payloads
  – Egress – 128 bytes
  – Ingress – 256 bytes
• Configurable BAR filtering, IO filtering and configuration filtering


Remaining Peripherals & System Elements (1/2)
• EMIF16
  – Supports NAND flash memory, up to 256MB
  – Supports NOR flash up to 16MB
  – Supports asynchronous SRAM mode, up to 1MB
  – Used for booting, logging, announcement, etc.
• 64-Bit Timers
  – Total of 16 64-bit timers
    · One 64-bit timer per core is dedicated to serve as a watchdog (or may be used as a general purpose timer)
    · Eight 64-bit timers are shared for general purpose timers
  – Each 64-bit timer can be configured as two individual 32-bit timers
  – Timer Input/Output pins
    · Two timer Input pins
    · Two timer Output pins
    · Timer input pins can be used as GPI
    · Timer output pins can be used as GPO


Remaining Peripherals & System Elements (2/2)
• UART Interface – Operates at up to 128,000 baud
• I2C Interface
  – Supports 400Kbps throughput
  – Supports full 7-bit address field
  – Supports EEPROM size of 4 Mbit
• SPI Interface
  – Operates at up to 66MHz
  – Supports two chip selects
  – Supports master mode
• GPIO Interface
  – 16 GPIO pins
  – Can be configured as interrupt pins
  – Interrupt can select either rising edge or falling edge


Q&A
