Tutorial On TI C6678

Texas Instruments
TMS320C6678 (Shannon)
DSP Training
Brighton Feng
November, 2010
Copyright © 2010 Texas Instruments. All rights reserved.

Outline
 C6678 DSP Overview
 Multi-core DSP programming
 Interconnection and resource sharing
 Peripherals overview

Shannon Functional Diagram
• Multi-Core SoC C6678 (Shannon)
• Fixed/Floating C66x™ Core
– Eight cores @ 1.0 GHz, 0.5 MB local L2 per core
– 4.0 MB shared memory
– 256 GMAC, 128 GFLOP
• Navigator
– Multicore ecosystem
• Packet Infrastructure
• Network Coprocessor
– IP network solution for IPv4/IPv6
– 1.5M packets per sec (1Gb Ethernet wire-rate)
– IPsec, SRTP, Air Interface Encryption fully offloaded
• 3-port GigE Switch (Layer 2)
• Low Power Consumption
– Adaptive Voltage Scaling (SmartReflex™)
• Hyperlink 50
– 50G Expansion port
– Transparent to Software
• Multicore Debugging
[Figure: C6678 (Shannon) block diagram. Eight C66x cores, each with L1D, L1P and L2 memory; Multicore Shared Memory Controller with shared memory and 64-bit DDR-3; Navigator; TeraNet 2; Network Coprocessor (Packet CoProcessor, Crypto/IPSec CoProcessor, Ethernet switch with two SGMII ports); system elements (Power Mgt, SysMon, Debug, EDMA); peripherals and I/O (sRIO, TSIP, Flash, PCIe, UART, SPI, I2C, Hyperlink50).]
Enhanced DSP core
[Figure: performance improvement across core generations, with a floating-point branch (C67x, C67x+) and a fixed-point branch (C64x, C64x+) merging into C66x.]
C66x
• 100% backward object code compatible
• Increased fixed- and floating-point capability
• Improved support for complex arithmetic and matrix computation
C64x+
• SPLOOP and 16-bit instructions for smaller code size
• Flexible level one memory architecture
• iDMA for rapid data transfers between local memories
C67x+
• 2x registers
• Enhanced floating-point add capabilities
C64x (fixed-point value)
• Advanced fixed-point instructions
• Four 16-bit or eight 8-bit MACs
• Two-level cache
C67x (floating-point value)
• Native instructions for IEEE 754, SP & DP
• Advanced VLIW architecture

C66x core block diagram
[Figure: C66x core. A 256-bit path feeds Instruction Fetch, followed by Instruction Dispatch (with SPLOOP buffer) and Instruction Decode. Data Path 1 holds the A register file (A0 – A31) with functional units L1, S1, M1, D1; Data Path 2 holds the B register file (B0 – B31) with functional units D2, M2, S2, L2. Supporting blocks: Control Registers, Interrupt Control, In-Circuit Emulation. Data access width: 2 x 64 bits.]

Key Improvements of C66x
 4x multiply-accumulate improvement
 Enhanced complex arithmetic and matrix operations
 2x arithmetic and logical operations improvement
 Floating-point arithmetic support: single-precision floating-point operation capability is the same as 32-bit fixed-point operation capability
 Division and square root are supported by floating-point instructions

C64x+ to C66x Comparison
Operation            Precision                       Ops/cycle on C64x+   Ops/cycle on C66x   Function units
MAC                  Real 8 x 8                      2 x 4 = 8            2 x 8 = 16          M1, M2
MAC                  Real 16 x 16                    2 x 2 = 4            2 x 8 = 16          M1, M2
MAC                  Real 32 x 32                    2 x 1 = 2            2 x 4 = 8           M1, M2
MAC                  Complex (16,16) x (16,16)       2 x 1 = 2            2 x 4 = 8           M1, M2
MAC                  Complex (32,32) x (32,32)       N/A                  2 x 1 = 2           M1, M2
Arithmetic/Logical   8 bit                           4 x 4 = 16           4 x 8 = 32          L1, L2, S1, S2
Arithmetic/Logical   16 bit                          4 x 2 = 8            4 x 4 = 16          L1, L2, S1, S2
Arithmetic/Logical   32 bit                          4 x 1 = 4            4 x 2 = 8           L1, L2, S1, S2
Memory Access        8 bit, 16 bit, 32 bit, 64 bit   2 x 1 = 2            2 x 1 = 2           D1, D2

Outline
 Memory Architecture Overview
 Shannon Memory Architecture Improvement
 Programming model

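TCI6486 Memory Architecture
[Figure: TCI6486 memory architecture. Each core (Core 0 … Core N) has internal L2 RAM with a slave (S) and a master (M) port. The Shared L2 Control block connects the cores to the shared L2 RAM and shared L2 ROM over a 256-bit path at (core speed)/2. External memory is reached through the DMA SCR over a 128-bit path at (core speed)/3. EDMA is a master (M) on the SCR.]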
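Shannon Memory Architecture
[Figure: Shannon memory architecture. Each core (Core 0 … Core N) has internal L2 RAM with a slave (S) port. The Shared Memory Control block connects the cores to both the shared L2 RAM and external memory over a 256-bit path at (core speed)/2. The DMA SCR is 128 bits wide at (core speed)/3, with the EDMAs as masters (M).]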


Addition of XMC
 Bring over existing EMC MDMA path
 Fat pipe to external (and internal) shared memory
 Bus width: 256 instead of 128 bits
 Clock rate: CPUCLK/2 instead of CPUCLK/3
 Optimize requests for MSMC / DDR3 memory
 L2 line allocations and evictions are split into sub-lines of 64 bytes
 Memory Protection and Address Extension (MPAX) support

 16 segments of programmable size (powers of 2: 4KB to 4GB)
 Each segment maps a 32-bit address to a 36-bit address.
 Each segment controls access: supervisor/user, R/W/X, (non-)secure
 Memory protection for shared internal MSMC memory and external DDR3
memory
 Multi-stream Prefetch support
 Program prefetch buffer up to 128 bytes
 Data prefetch buffer up to 8 x 128 bytes
 Prefetch enabled/disabled on 16MB ranges defined in MAR
 Manual flush for coherence purposes
 Note: no IDMA path

MAR Register Extension
• The L2 memory controller extends the MAR registers by adding the “PFX” field. The L2 memory controller uses this bit to tell the XMC whether a given address range is prefetchable.
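
Since prefetch is enabled per 16MB range, the MAR that governs an address can be found from its top 8 bits. The sketch below is only an illustration of that arithmetic: the bit positions `MAR_PC` (bit 0) and `MAR_PFX` (bit 3) and the helper names are assumptions not given on the slide and should be checked against the CorePac documentation.

```python
MAR_RANGE_BYTES = 16 * 1024 * 1024  # each MAR covers one 16MB range

# Assumed bit positions (not stated on the slide): PC = bit 0, PFX = bit 3.
MAR_PC = 1 << 0
MAR_PFX = 1 << 3


def mar_index(addr):
    """Return the index of the MAR covering a 32-bit logical address."""
    return (addr & 0xFFFFFFFF) >> 24


def mar_range(index):
    """Return the (first, last) logical address covered by MAR[index]."""
    start = index * MAR_RANGE_BYTES
    return start, start + MAR_RANGE_BYTES - 1


def is_prefetchable(mar_value):
    """True if the (assumed) PFX bit is set in a MAR register value."""
    return bool(mar_value & MAR_PFX)
```

For example, an address at the start of the DDR3 logical range, 8000_0000, falls under MAR 128, which covers 8000_0000 through 80FF_FFFF.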

MSMC Block Diagram
[Figure: MSMC block diagram. N CGEM slave ports (256 bits each, one per CGEM core); a system slave port for shared SRAM (SMS) and a system slave port for external memory (SES), each a 256-bit VBUSM connection from the SCR with its own Memory Protection and Extension Unit (MPAX); the MSMC datapath with arbitration; RAM banks, 256 bits per bank, with EDC for SRAM; the MSMC system master port (256-bit VBUSM to the SCR); the MSMC EMIF master port (256-bit VBUSM to the 64-bit DDR3 EMIF); events.]
 One slave interface per C66x Megamodule (256 bits @ CPUCLK/2)
 Uses a 36-bit address extended inside a C66x Megamodule core
 Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters
 SMS interface for accesses to MSMC SRAM space
 SES interface for accesses to DDR3 space
 Both interfaces support memory protection and address extension
 One master interface (256 bits @ CPUCLK/2) for access to the DDR3 EMIF
 One master interface (256 bits @ CPUCLK/2) for access to system slaves

MSMC Shared Memory
 4 banks x 2 sub-banks; each sub-bank is 256 bits wide.
 Reduces conflicts between C66x Megamodule cores and system masters
 Features dynamic fair-share bank arbitration for each transfer
 Supports bandwidth management, which avoids indefinite starvation of lower-priority requests by higher-priority requests
 Features Not Supported
 Cache coherency between L1/L2 caches in C66x
Megamodule cores and MSMC memory
 Cache coherency between XMC prefetch buffers and
MSMC memory
 C66x Megamodule to C66x Megamodule cache
coherency via MSMC

MPAX Units
 MPAX stands for “Memory Protection and
Address Extension”
 There are N+2 MPAX units in a system with N
C66x Megamodules
 N MPAX units for all requests from N C66x
Megamodules to internal shared memory, external
shared memory or any system slave
 1 MPAX unit for all requests from any system master
to internal shared memory
 1 MPAX unit for all requests from any system master
to external shared memory
 Each MPAX unit operates on a number of
segments of programmable size
 Each segment maps a 32-bit address to a 36-bit
address.
 Each segment controls access.
Number of Segments
 Each C66x Megamodule has 16 segments which
control direct (load/store) requests to internal
shared memory, external shared memory and
any other system slave.
 Any master identified by a privilege ID has
 8 segments for requests to internal shared memory
 8 segments for requests to external shared memory.
 Some masters work on behalf of other masters.
They will inherit the privilege ID of their
commanding master. As such, each C66x
Megamodule also has
 8 segments for indirect (DMA) requests to internal
shared memory
 8 segments for indirect (DMA) requests to external
shared memory

Segment Definition
 Each segment is defined by a base address and a size
 The segment size can be set to any power of 2 from 4K to 4GB
 The segment base address must be aligned to a power-of-2 boundary equal to the segment size.
 One would expect that each request should find one matching segment, however ...
 a request may find two or more matching segments, in which case segments with higher ID take priority over segments with lower ID. This allows
 creating non-power-of-2 segments
 creating 3 segments with just 2 segment definitions
 ...
 a request may find no matching segment, in which case an error is reported in the memory protection fault reporting registers (XMPFAR, XMPFSR)
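
The matching rules above can be sketched as a small software model. This is only an illustration, not the hardware implementation: the `Segment` class, `mpax_translate` and `MpaxFault` are hypothetical names, BADDR is assumed to hold the upper 20 bits of the 32-bit logical address, RADDR the upper 24 bits of the 36-bit physical address, and permission checks and the non-remappable low range are ignored.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    baddr: int            # assumed: upper 20 bits of the 32-bit logical base
    raddr: int            # assumed: upper 24 bits of the 36-bit replacement base
    size: int             # bytes, power of 2 from 4K to 4GB
    enabled: bool = True

    def matches(self, logical):
        # Compare only the address bits above the segment size.
        mask = ~(self.size - 1)
        return self.enabled and (logical & mask) == ((self.baddr << 12) & mask)

    def translate(self, logical):
        # Replace the upper bits, keep the offset within the segment.
        mask = ~(self.size - 1)
        return ((self.raddr << 12) & mask) | (logical & (self.size - 1))


class MpaxFault(Exception):
    """Model of the 'no matching segment' error case."""


def mpax_translate(segments, logical):
    """Map a 32-bit logical address to a 36-bit address; highest ID wins."""
    logical &= 0xFFFFFFFF
    for seg_id in range(len(segments) - 1, -1, -1):
        if segments[seg_id].matches(logical):
            return segments[seg_id].translate(logical)
    raise MpaxFault(hex(logical))


GB = 1 << 30
# Segment 0, segment 1 and a small 4K segment 2 overlaid on segment 1.
overlay = [
    Segment(0x00000, 0x000000, 2 * GB),
    Segment(0x80000, 0x080000, 2 * GB),
    Segment(0xC0007, 0x050042, 4096),
]
```

With this table, logical address C000_7123 matches both segment 1 and segment 2; segment 2 has the higher ID, so the model returns 0:5004_2123, while the neighbouring address C000_8000 falls back to segment 1.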

XMC Segment Registers
XMPAXH/XMPAXL[15-0]

MPAX Default Memory Map
[Figure: CGEM logical 32-bit memory map (0000_0000 – FFFF_FFFF) mapped onto the system physical 36-bit memory map (0:0000_0000 – F:FFFF_FFFF, shown as lower 4GB and upper 60GB). 0000_0000 – 0BFF_FFFF is not remappable.]
Segment 0: BADDR = 00000h; RADDR = 000000h; Size = 2GB
Segment 1: BADDR = 80000h; RADDR = 800000h; Size = 2GB
Segments 2 – 15: Disabled
 XMC configures MPAX segments 0 and 1 so that the C66x Megamodule can access system memory.
 The power-up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in the C66x Megamodule’s address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.
 This corresponds to the first 2GB of address space dedicated to EMIF by the MSMC controller.

MPAX MSMC Aliasing Example
[Figure: CGEM 32-bit memory map with three 2MB windows onto the 2MB MSMC RAM at 0:0C00_0000 – 0:0C1F_FFFF. 0000_0000 – 0BFF_FFFF is not remappable.]
0Cxx_xxxx “Fast” MSMC RAM: BADDR = 0C000h; RADDR = 00C000h; Size = 2MB
20xx_xxxx MSMC RAM Alias 1: BADDR = 20000h; RADDR = 00C000h; Size = 2MB
21xx_xxxx MSMC RAM Alias 2: BADDR = 21000h; RADDR = 00C000h; Size = 2MB
 The example shows 3 segments that map the MSMC RAM address space into the C66x Megamodule’s address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.
 Accesses to MSMC RAM via the aliases do not use the “fast RAM” path and incur additional cycles of latency.

MPAX Overlayed Segments Example
[Figure: CGEM 32-bit memory map mapped onto the system 36-bit memory map (lower 4GB and upper 60GB). The 4K range at C000_7xxx maps to 0:5004_2xxx; the rest of 8000_0000 – FFFF_FFFF maps to 0:8000_0000 – 0:FFFF_FFFF. 0000_0000 – 0BFF_FFFF is not remappable.]
Segment 0: BADDR = 00000h; RADDR = 000000h; Size = 2GB
Segment 1: BADDR = 80000h; RADDR = 080000h; Size = 2GB
Segment 2: BADDR = C0007h; RADDR = 050042h; Size = 4K
Segments 3 – 15: Disabled
 Segment 1 matches 8000_0000 through FFFF_FFFF, and segment 2 matches C000_7000 through C000_7FFF.
 Because segment 2 is higher priority than segment 1, its settings take priority, effectively carving a 4K hole in segment 1’s 2GB address space.
 Furthermore, it maps this 4K space to 0:5004_2000 - 0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now accessible by two logical address ranges.

Single program image
[Figure: C6000 Cores 0, 1 and 2, each with L1 Data, L1 Prog and L2 memory. Each core's L2 memory holds the code and read/write data of the same App.out plus that core's own data (Data 0, Data 1, Data 2). Shared L2 or DDR memory holds the shared code and read-only data of App.out plus Data 0, Data 1 and Data 2.]
 Same image on each DSP core
 Aliased addressing used for DSP core to access local L2
 DNUM DSP core register used for:
 Global addressing when programming EDMA3, SRIO, …
 Separate buffer per DSP core in DDR: dp = bufBase + BUF_SIZE*DNUM
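
The two uses of DNUM listed above come down to simple address arithmetic, sketched here. The constants are assumptions based on the usual C6678 memory map (local CorePac memory aliased at 0080_0000 – 00FF_FFFF, with the global view of core n at 1n00_0000 plus the local address); the function names are hypothetical.

```python
# Assumed C6678 memory-map constants (not given on the slide).
LOCAL_BASE = 0x00800000    # start of the aliased local L2/L1 range
LOCAL_END = 0x00FFFFFF     # end of the aliased local range
GLOBAL_BASE = 0x10000000   # global view of core 0's local memory
NUM_CORES = 8


def local_to_global(addr, dnum):
    """Convert an aliased local address to the global address a DMA master needs."""
    if not 0 <= dnum < NUM_CORES:
        raise ValueError("DNUM out of range")
    if LOCAL_BASE <= addr <= LOCAL_END:
        return GLOBAL_BASE + (dnum << 24) + addr
    return addr  # already global (MSMC, DDR, ...)


def per_core_buffer(buf_base, buf_size, dnum):
    """dp = bufBase + BUF_SIZE*DNUM: a separate DDR buffer per core."""
    return buf_base + buf_size * dnum
```

Under these assumptions, core 3 passes 1380_0100 to EDMA3 for a buffer it sees locally at 0080_0100, and with a 64KB BUF_SIZE at 8000_0000 core 2 works on 8002_0000.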

Shannon MPAX enables easy single program image
[Figure: CGEM virtual address spaces (1) … (n) on either side of the SoC address space. Each virtual address space contains MSMC RAM (internal, code1, data2) and external memory (code2, data3). In the SoC address space, code1 and code2 appear once, while data2 and data3 appear once per core; MPAX maps each core's data2 and data3 to its own copy.]

Multiple program image
[Figure: C6000 Cores 0, 1 and 2, each with L1 Data, L1 Prog and L2 memory. Each core's L2 memory holds its own image and data (App0.out and Data 0, App1.out and Data 1, App2.out and Data 2). Shared L2 or DDR memory holds App0.out, App1.out, App2.out and Data 0, Data 1, Data 2.]
 Each DSP core has its own image
 Static split of DDR per DSP core
 Global or local addressing used for L2 addressing

Shannon Software
• Flexible development environment for the customer.
• Customer can choose to develop their application using all or any one of the software layers.
• Will contain the following software layers
– BIOS and Linux Operating System support
– Chip Support Library
– Platform Development Kit
– Inter Core Communication
– Optimized DSP functions library
– Optimized Audio, Video and Speech codecs
– Voice Gateway Demonstration Kit
– Video Transcoding Demonstration Kit
– Demonstration applications
[Figure: C6678 software stack, top to bottom: Customer Application; Voice Gateway Demonstration Kit, Video Transcoding Demonstration Kit, Demo App; DSPLIB, Speech Codec, Audio Codec, Video Codec; Operating System w/ Boot Loader (BIOS, Linux); Multi-core Entitlement (Inter Core Communication); Full Silicon Entitlement (Platform Development Kit, Chip Support Library).]

Shannon Debug
Best Multicore Debug and Visualization
Debug-enabled Multicore SoC
[Figure: C6678 (Shannon) block diagram as in the functional diagram, with the debug additions highlighted: ETB, TRACE paths and data visualization.]
Debug visibility at core, across multicore and for SoC

Outline
 Interconnection Architecture
 Shannon Hardware queue
 Inter-core communication
 Shared Resource Management

Shannon Switch Fabric
[Figure: Shannon switch fabric. A 256-bit VBUSM SCR running at CPU/2 connects EDMA_0 (16-channel DMA, TC0 – TC1), VUSR and MSMC_SS (shared L2 RAM, DDR3); the GEM cores reach MSMC_SS through XMC. A 128-bit VBUSM SCR running at CPU/3 connects EDMA_1,2 (64-channel DMA each, TC2 – TC5 and TC6 – TC9), the GEM master and slave ports, SRIO, PA_SS, QM_SS, PCIe, TSIP 0,1 and EMIF16. A 32-bit VBUSP CONFIG SCR runs at CPU/3.]

Hardware Queue Architecture
 Packetized data transfer architecture designed to minimize DSP core interaction while maximizing memory and bus efficiency
 The key communication platform for TI’s future infrastructure DSPs
 Used by the following peripherals in Shannon:
 Serial RapidIO, Packet Accelerator
 Each module contains its own DMA to transfer the data associated with the ‘jobs’; no CPU resources are involved

Hardware Queue
[Figure: Producers (CPU1, CPU2, CPU3, packet and RapidIO hardware, …) send a ‘job’ to Queue 1..x in the Queue Manager; consumers (CPUx, Acc 1, Acc 2, RapidIO, Peri x, …) retrieve a ‘job’.]
 Producer writes ‘jobs’ into a Queue.
 Consumer reads ‘jobs’ from the Queue
 Supports Multiple In – Multiple Out
 Multiple Producers can write to the same Queue
 Used to share common hardware
 Multiple Consumers can read from the same Queue
 Used for load balancing
 Abstracts the Consumer
 Consumer can be a hardware IP (accelerator, peripheral) or software (i.e. a CPU core)
 Transparent for the Producer
 ‘Easy’ to upgrade to new hardware. The ‘job gets done’.
 Minimizes changes to Host software; easy maintenance

Packet Queuing Data Structure Diagram

Hardware Queue Operation
 Push to a queue
 Host writes the pointer of a new descriptor to a queue register.
 Queue manager links the new descriptor to the tail (or head) of the queue by modifying the linking RAM.
 The tail (or head) pointer then points to the new descriptor.
 Pop from a queue
 Host reads a descriptor pointer from a queue register.
 Queue manager returns the descriptor pointed to by the head pointer.
 The head pointer then points to the next descriptor.
 Monitor queue
 Queue manager generates events when a queue changes: not empty, entry count, threshold exceeded, starvation…
 Queue diversion
 Entire queue contents can be cleared or moved to another destination queue using a single register write
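
The push, pop and diversion steps above can be modelled in a few lines. This is a software sketch of the linked-list bookkeeping only, not the hardware interface: `HwQueueModel` and its methods are hypothetical names, descriptors are represented by their index, and an empty pop returns `None` here.

```python
class HwQueueModel:
    """Toy model of one hardware queue built on a shared linking RAM."""

    def __init__(self, link_ram):
        self.link = link_ram   # forward-pointer table, one entry per descriptor
        self.head = None
        self.tail = None
        self.count = 0

    def push(self, desc, to_head=False):
        """Link descriptor index `desc` to the tail (default) or head."""
        if self.count == 0:
            self.link[desc] = None
            self.head = self.tail = desc
        elif to_head:
            self.link[desc] = self.head
            self.head = desc
        else:
            self.link[desc] = None
            self.link[self.tail] = desc
            self.tail = desc
        self.count += 1

    def pop(self):
        """Unlink and return the head descriptor, or None if the queue is empty."""
        if self.count == 0:
            return None
        desc = self.head
        self.head = self.link[desc]
        self.link[desc] = None
        self.count -= 1
        if self.count == 0:
            self.tail = None
        return desc

    def divert_to(self, other):
        """Move the whole queue contents to another queue, preserving order."""
        desc = self.pop()
        while desc is not None:
            other.push(desc)
            desc = self.pop()
```

Pushing descriptors 0, 17 and 5 onto one queue leaves the forward pointers 0 → 17 → 5 in the shared table, and popping returns them in the same order.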

Shannon Hardware queue architecture
[Figure: Queue Manager Subsystem containing the Queue Manager, descriptor RAM, link RAM, (internal) buffer memory, two APDSPs for queue interrupt accumulation, and an internal Packet DMA. Over the VBUS, queue interfaces Q0, Q1 … Qx are accessed by the DSP cores, the Packet DMA in SRIO and the Packet DMA in PA. The Queue Manager sends queue events to the Packet DMAs and, through the APDSPs, queue interrupts to the DSP cores.]

Queue Manager Subsystem
 Supports 8192 queues
 HW queues are multi-core safe without mutual exclusion; multiple senders can use a destination queue without restrictions
 Can notify the Packet DMA when a transfer is pending
 Can notify a DSP core when a packet is pending, and can copy descriptor pointers of transferred data to the destination core’s local memory to reduce access latency
 Internal Packet DMA
 Transfers packets from one queue to another queue. Good for core-to-core data transfer.

Descriptor RAM
[Figure: Memory Descriptor Region Registers, Region 0 to Region 19. As drawn, Region 0 at 0x1000 holds 32-byte buffers starting at index 0, and Region 1 at 0x2000 holds 256-byte buffers starting at index 16.]
 Data elements (buffers) to be passed on queues are first described to a descriptor region manager built into the QM.
 Although technically called descriptors, these memory elements can hold any arbitrary data.
 The size of the data elements must be a power of 2, from 32 bytes to 8192 bytes in length.
 20 configurable memory regions (for descriptor storage)
 The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.
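
The region constraints above are easy to check in software before programming the region registers. The sketch below applies the limits exactly as stated on the slide; the function names and the idea of deriving a descriptor index from its address are illustrative assumptions.

```python
def is_pow2(n):
    """True for positive powers of two."""
    return n > 0 and (n & (n - 1)) == 0


def validate_region(desc_size, desc_count):
    """Check a region against the limits quoted on the slide."""
    return (is_pow2(desc_size) and 32 <= desc_size <= 8192
            and is_pow2(desc_count) and 32 <= desc_count <= 4096)


def descriptor_index(addr, region_base, desc_size, start_index):
    """Index of the descriptor at `addr` within a region (hypothetical helper)."""
    offset = addr - region_base
    if offset < 0 or offset % desc_size:
        raise ValueError("address is not a descriptor boundary in this region")
    return start_index + offset // desc_size
```

For a region of 256-byte descriptors based at 0x2000 with start index 16, the descriptor at 0x2100 has index 17.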

Linking RAM
 Linking RAM contains 1 entry for each descriptor. A Linking RAM entry is effectively an extension of the descriptor.
 Linking RAM stores the forward pointer that is critical for the PUSH / POP operations performed by the Queue Manager
 Linkage between the physical address of a descriptor and the physical address of its Linking RAM entry is performed inside the QM using information provided in the Queue Management configuration registers
 Linking RAM is typically placed in local memory for speed. This allows data elements to be linked and unlinked in a queue very quickly, even though the buffers themselves may be in external memory
 There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.
 2 configurable Linking RAM regions
[Figure: Linking RAM forward pointer table and queue contents. Queue 0 holds descriptors 0, 17, 5; Queue 1 holds descriptors 18, 19. Table entries: entry 0 = 17, entry 17 = 5, entry 5 = end of queue (x), entry 18 = 19, entry 19 = end of queue (x).]

Queue Data Flow Example, Transmit
[Figure: Host Processor, Interrupt Generator, Queue Manager (Free Descriptor Queue, Tx Queue, Rx Queue), Tx Port and Rx Port.]
INIT: Host allocates Rx free descriptors and initializes queues
TX 1: Processor fetches a descriptor to fill with the data to transmit
TX 2: Processor queues a packet to a Tx Queue
TX 3: Port transmits the buffer pointed to by the descriptor
TX 4: Port posts the packet descriptor to the return queue

Queue Data Flow Example, Receive
[Figure: Host Processor, Interrupt Generator, Queue Manager (Free Descriptor Queue, Tx Queue, Rx Queue), Tx Port and Rx Port.]
INIT: Host allocates Rx free descriptors and initializes queues
RX 1: Port fetches a free descriptor and transfers the data to the buffer pointed to by the descriptor
RX 2: Port posts the packet to the Rx Queue
RX 3: Not-empty level status reaches the Interrupt Generator, which optionally prefetches the descriptor to L2 prior to interrupting
RX 4: Host processor is interrupted according to pacing rules, or polls

Accumulator (A Programmable DSP)
[Figure: The Queue Manager (with descriptor RAM) sends queue events to the APDSP, which monitors queue changes, writes to accumulation memory (a descriptor pointer array) and raises queue interrupts to the DSP core, using a timer for interrupt pacing.]
 Accumulator is used to help the DSP core efficiently POP descriptor pointers from a queue.
 Accumulator pops descriptor pointers from the queue and writes them to accumulation memory (normally in DSP local memory).
 Accumulator generates interrupts to the DSP core according to the interrupt pacing configuration.
 Two Accumulators (PDSP)
 One generates 32 interrupts, each for one queue.
 The other generates 16 interrupts, each a combined event for 32 queues, monitoring 16 x 32 queues in total.

Hardware queue Performance Consideration
 Push Operation
 1~4 word writes. Because the writes are posted, a push normally does not stall the DSP core.
 Pop Operation
 1~4 word reads. A pop stalls the DSP core for about 80~100 cycles.
 Accumulator (PDSP) can pop the descriptors to DSP local memory, which saves DSP cycles dramatically.
 Descriptor Access
 Writing or reading a full descriptor may consume many cycles.
 For most applications, the DSP core can initialize all descriptors during initialization and write or read only a few fields of each descriptor at run time.

Shared Data in the L2 SRAM of transmitter
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 RAM and L2 cache, connected through the switch fabric center to DDR SDRAM. The shared data resides in Core X's L2 RAM.]
 If cache is enabled, Core Y needs to invalidate its cache before reading

Shared Data in the L2 SRAM of receiver
[Figure: DSP Core X and DSP Core Y, each with L1 cache and L2 RAM, connected through the switch fabric center to DDR SDRAM. The shared data resides in Core Y's L2 RAM.]
 If cache is enabled, Core X needs to write back its cache after writing

Shared Data in the shared memory
[Figure: DSP Core X and DSP Core Y, each with L1 cache and L2 RAM, connected through the switch fabric center to shared L2 or DDR, where the shared data resides.]
 If cache is enabled, Core X needs to write back its cache after writing; Core Y needs to invalidate its cache before reading
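
Why both steps are needed can be seen with a toy model of two write-back caches in front of one shared memory. This is a conceptual sketch only: `ToyCache` and `exchange` are hypothetical names, one address stands in for a cache line, and none of this corresponds to the actual cache control API.

```python
class ToyCache:
    """Write-back cache with no hardware coherency, in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory   # dict shared by all cores
        self.lines = {}        # address -> cached value
        self.dirty = set()     # addresses written but not yet written back

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from shared memory
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]               # hit: may be stale

    def write(self, addr, value):
        self.lines[addr] = value              # stays in the cache until written back
        self.dirty.add(addr)

    def writeback(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

    def invalidate(self, addr):
        self.lines.pop(addr, None)
        self.dirty.discard(addr)


def exchange(do_writeback, do_invalidate):
    """Core X writes 42 to shared memory; return the value Core Y then reads."""
    shared = {0x100: 0}
    core_x, core_y = ToyCache(shared), ToyCache(shared)
    core_y.read(0x100)                        # Core Y already holds an old copy
    core_x.write(0x100, 42)
    if do_writeback:
        core_x.writeback(0x100)
    if do_invalidate:
        core_y.invalidate(0x100)
    return core_y.read(0x100)
```

In this model Core Y reads the new value only when Core X writes back and Core Y invalidates; skipping either step leaves Core Y with the old value.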

Use IPC register for inter-core communication
[Figure: DSP Core X and DSP Core Y, each with L1 cache, L2 RAM and L2 cache. Core X reaches the IPC register through the configuration switch fabric to signal Core Y.]
 Interrupt is generated for Core Y
 No cache coherency issue

Inter-core Data Block exchange with EDMA
[Figure: EDMA copies a data block from Core X's L2 RAM to Core Y's L2 RAM through the switch fabric center.]
 Interrupt is generated for Core Y
 No cache coherency issue

Inter-core data exchange through hardware queue (Packet DMA copy)
[Figure: Packet DMA moves data from a source queue (Src Que) fed by Core X to a destination queue (Dst Que) read by Core Y, through the switch fabric center.]
 Core X simply pushes data to the source queue
 Packet DMA transfers the data to the destination queue
 Core Y simply pops data from the destination queue
 If the queue buffers are in L2 RAM, software on both cores does not need to maintain cache coherency.

Inter-core data exchange through hardware queue (Zero Copy)
[Figure: Core X and Core Y share one queue in the Queue Manager, reached through the switch fabric center.]
 Core X pushes data to the shared queue; Core Y pops data from the shared queue
 Multiple cores can access the shared queue simultaneously without mutual exclusion
 Software needs to maintain cache coherency.

Shared resources
 Internal shared L2 and external shared memory (DDR)
 Each core accesses shared memory independently. Arbitration is handled by the switch fabric and end-point arbiters.
 Shared on-chip peripherals
 Configuration: typically done at startup to set the operating mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.
 Use:
 Peripherals with a hardware queue: each core accesses the hardware queue independently. Arbitration is handled by the queue manager.
 Ethernet, SRIO on Shannon…
 Multi-channel peripherals can be split amongst the cores for concurrent, orthogonal control
 EDMA, TSIP, Timer…
 Single-channel peripherals can be controlled by a single master that services the other cores if needed, or shared mutually exclusively by multiple masters through a semaphore.
 I2C, SPI…

System-level prioritization for arbitration
 A user-specified priority may be assigned to:
 Any DSP core accesses
 Any EDMA, sRIO, Ethernet, … on-chip transfers
 Each master port is assigned a configurable priority (8 levels)

Hardware Semaphores on Shannon for atomic accesses
 What function does the Semaphore module provide?
 A method to control who accesses a shared resource
 Provides access to shared resources in an atomic manner
 Read-modify-write sequence is not broken
 Features of the Semaphore module
 Binary semaphores
 Contains 64 semaphores to be used within the system
 Two methods of accessing a semaphore resource
 Direct access
 A core directly accesses a semaphore resource. If it is free, the semaphore is granted. If not, the semaphore is not granted.
 Useful if the system can afford to poll for the semaphore
 Indirect access
 A core indirectly accesses a semaphore resource by writing to it. Once the semaphore is free, an interrupt notifies the DSP core that it is available.
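
The difference between the two access methods can be sketched with a small behavioural model. This is not the register-level interface of the Semaphore module: `SemaphoreModel`, its method names and the first-come-first-served waiting order are assumptions made for illustration.

```python
class SemaphoreModel:
    """Toy model of 64 binary semaphores with direct and indirect requests."""

    NUM_SEMAPHORES = 64

    def __init__(self):
        self.owner = [None] * self.NUM_SEMAPHORES
        self.waiting = [[] for _ in range(self.NUM_SEMAPHORES)]
        self.interrupts = []   # (core, semaphore) grant notifications

    def direct_request(self, sem, core):
        """Poll-style request: granted immediately if free, otherwise refused."""
        if self.owner[sem] is None:
            self.owner[sem] = core
            return True
        return False

    def indirect_request(self, sem, core):
        """Queued request: the core is interrupted once the semaphore is its own."""
        if self.owner[sem] is None:
            self.owner[sem] = core
            self.interrupts.append((core, sem))
        else:
            self.waiting[sem].append(core)   # assumed FIFO order

    def release(self, sem, core):
        """Free the semaphore and hand it to the next waiting core, if any."""
        if self.owner[sem] != core:
            raise ValueError("only the owner can release a semaphore")
        if self.waiting[sem]:
            nxt = self.waiting[sem].pop(0)
            self.owner[sem] = nxt
            self.interrupts.append((nxt, sem))
        else:
            self.owner[sem] = None
```

A core that can afford to spin calls `direct_request` in a loop; a core that cannot issues one `indirect_request` and continues with other work until the notification arrives.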

Shannon RapidIO Gen 2 Features and Enhancements
 4 lanes – options include 2x
 Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane
 DeviceID Support
 16 Local DeviceIDs (up from 1)
 8 Multicast IDs (up from 3)
 24 Interrupt outputs (up from 8)
 Messaging
 Type 9 Packets Support (Data Streaming)
 Type 11 Message – classification improvements
 DirectIO
 8 Load/Store (DirectIO) Units (up from 4)
 Shadow register sets for LSUs to simplify management and minimize overhead
 Provide up to 1MB block transfers (up from 4KB)
 Packet Forwarding with Reset Isolation

RapidIO – Topology Examples
[Figure: C6678 DSPs connected over SRIO in four example topologies: ring, switch-based (through SRIO switches), chain and mesh.]

Packet Accelerator Subsystem On Shannon
 3 Port Ethernet Switch
 Port 0: Internal hardware queue port
 Port 1: SGMII 0 Port, 1Gbps
 Port 2: SGMII 1 Port, 1Gbps
 Packet Accelerator (PA)
 L2, L3, and L4 packet processing
 1.5M packets per sec
 Security Accelerator (SA)
 Encryption/Decryption
 IPSEC ESP
 IPSEC AH
 SRTP
 3GPP
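
The 1.5M packets per second figure lines up with 1 Gb Ethernet wire rate for minimum-size frames. The short calculation below assumes standard Ethernet framing overheads (64-byte minimum frame, 8-byte preamble and start delimiter, 12-byte inter-frame gap), which the slide does not spell out; the function name is hypothetical.

```python
def ethernet_max_pps(link_bps=1_000_000_000, frame_bytes=64,
                     preamble_bytes=8, ifg_bytes=12):
    """Maximum whole frames per second on a link at the given frame size."""
    bits_per_frame = (frame_bytes + preamble_bytes + ifg_bytes) * 8
    return link_bps // bits_per_frame
```

With the defaults this gives 1,488,095 frames per second, which rounds to the quoted 1.5M.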

IEEE 1588 support
Device A is the master device
Device B is the slave device
Message B is used to send the actual transmit time (tA) of Message A
Message D is used to send the actual receive time (rC) of Message C
Wire time in one direction: ((rC - tA) - (tC - rA))/2
 EMAC hardware supports classifying, at the physical level, ingress and egress frames as timing synchronization frames, and the timestamp is recorded.
 Software running on a DSP core then runs the algorithm to calculate the delay and adjust local time accordingly.
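
The delay formula above can be checked with a small worked example. The sketch assumes a symmetric path, that tA and rC are read from the master clock and rA and tC from the slave clock, and that Message A travels from master to slave and Message C from slave to master; `slave_offset` is an added helper that follows from the same four timestamps and is not on the slide.

```python
def wire_delay(tA, rA, tC, rC):
    """One-way wire time: ((rC - tA) - (tC - rA)) / 2."""
    return ((rC - tA) - (tC - rA)) / 2


def slave_offset(tA, rA, tC, rC):
    """How far the slave clock is ahead of the master clock (assumed helper)."""
    return ((rA - tA) - (rC - tC)) / 2


# Example: true wire time 5 units, slave clock 100 units ahead of the master.
tA = 1000          # master sends Message A (master clock)
rA = 1105          # slave receives it (slave clock: 1000 + 5 + 100)
tC = 1200          # slave sends Message C (slave clock, master time 1100)
rC = 1105          # master receives it (master clock: 1100 + 5)
```

Plugging these timestamps in recovers a wire time of 5 and an offset of 100, which the slave would subtract from its local time.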

TSIP Overview
 1024 8-bit timeslots receive and transmit.
 8 links of 128 timeslots at 8.192 Mbps.
 Two clock and frame sync inputs.
 Independent clocking – 1 receive clock and 1 transmit
clock.
 Redundant/common clocking – 1 receive/transmit clock
with second clock as backup.
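
The timeslot figures above are consistent with each other, as the short check below shows. It assumes the standard 8 kHz (125 µs) TDM frame rate, which the slide does not state; the function names are hypothetical.

```python
FRAME_RATE_HZ = 8000   # assumption: one TDM frame every 125 microseconds


def link_rate_bps(timeslots_per_link=128, bits_per_timeslot=8,
                  frame_rate_hz=FRAME_RATE_HZ):
    """Serial bit rate of one TSIP link."""
    return timeslots_per_link * bits_per_timeslot * frame_rate_hz


def total_timeslots(links=8, timeslots_per_link=128):
    """Timeslots available across all links, per direction."""
    return links * timeslots_per_link
```

128 timeslots of 8 bits at 8000 frames per second give 8.192 Mbps per link, and 8 such links give the 1024 timeslots quoted.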

Shannon PCIe Interface
Nyquist/Shannon incorporates a PCIe interface with the following characteristics:
 Two SERDES lanes running at 5 GBaud/2.5 GBaud
 Gen2 compliant
 Three different operational modes (default defined by pin inputs at power up; can be overridden by software):
 Root Complex (RC)
 End Point (EP)
 Legacy End Point
 Single Virtual Channel (VC)
 Single Traffic Class (TC)
 Maximum Payloads
 Egress – 128 bytes
 Ingress – 256 bytes
 Configurable BAR filtering, IO filtering and configuration
filtering

Remaining Peripherals & System Elements (1/2)
 EMIF16
 Supports NAND flash memory, up to 256MB
 Supports NOR flash up to 16MB
 Supports asynchronous SRAM mode, up to 1MB
 Used for booting, logging, announcement, etc.
 64-Bit Timers
 Total of 16 64-bit timers
 One 64-bit timer per core is dedicated to serve as a watchdog (or may be used
as a general purpose timer)
 Eight 64-bit timers are shared for general purpose timers
 Each 64-bit timer can be configured as two individual 32-bit timers
 Timer Input/Output pins
 Two timer Input pins
 Two timer Output pins
 Timer input pins can be used as GPI
 Timer output pins can be used as GPO

Remaining Peripherals & System Elements (2/2)
 UART Interface – Operates at up to 128,000 baud
 I2C Interface
 Supports 400Kbps throughput
 Supports full 7-bit address field
 Supports EEPROM size of 4 Mbit
 SPI Interface
 Operates at up to 66MHz
 Supports two chip selects
 Supports master mode
 GPIO Interface
 16 GPIO pins
 Can be configured as interrupt pins
 Interrupt can select either rising edge or falling edge

Q&A
