543 MP Paper
1. Introduction
In embedded systems, Real-Time Analysis (RTA) can require dedicated hardware and
software together with an end-to-end methodology that transfers data between the host
and the target in a lossless and reliable manner. Specifically, the Real-Time Analysis
encompassed by this methodology consists of capturing data from a target application
using dedicated hardware, transferring it through layers of software that create a
real-time path, and making it available to a host application for analysis. Analysis
includes determining whether applications meet both timing and logical correctness
requirements.
Scan-based emulation is a pervasive method that is deployed to debug, develop and
analyze real-time applications running on DSPs. The JTAG boundary scan specification
permits the connecting of multiple devices in a serial daisy-chained arrangement. Section
2 covers scan-based emulation in detail and sets the stage for RTA.
The RTA hardware and software architecture that our methodology relies upon is
presented in Section 3. It also includes an application that illustrates the necessity of
RTA in a multiprocessor environment.
2. Scan-Based Emulation
With a traditional emulator, the CPU to be emulated is usually removed from its socket
and replaced with an emulator pod. The emulator pod typically has a replacement CPU,
plus various amounts of random logic and memory to monitor what is happening on the
CPU pins.
With modern CPUs such as DSPs, the traditional approach has several problems. The
first problem is the speed of newer DSP chips. Bus cycle times can be 25 nanoseconds or
shorter, and all instructions typically execute in a single cycle. This makes it difficult for
a traditional emulator to allow emulation at full speed. The number of pins to monitor can
be staggering, with chips having multiple 32-bit address and data buses, making a
traditional emulator expensive. The second, and more serious, problem is that DSPs often
have on-chip caches, pipelines, memory and peripherals. Sometimes a whole algorithm
can execute without any activity on the CPU pins.
The solution to these problems is scan-based emulation. With scan-based emulation, the
CPU is never removed from the socket; in fact it can be soldered directly onto the board.
Instead, the CPU has a serial scan interface, allowing the emulator to scan the internals of
the device through a standard connector. The pinout for this connector is defined by a
standards committee, making it possible to support several devices with a single
emulator [1]. Scan-based emulation offers the following advantages:
• Emulation at full device speed - Since there is no logic needed to monitor what
happens on the CPU bus, the emulator can allow the device to execute programs
at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the CPU
bus is not affected at all by the emulation process. The emulator will not affect the
operation of the bus, as is so often the case with traditional emulators.
• In-circuit emulation - The CPU can be soldered to the board while emulating.
This makes denser packaging possible, and also makes the emulator a
manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete
state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory that
the CPU can access in the system can also be accessed through the scan interface.
The emulator can look at the system "through the eyes of the CPU". This makes it
possible to debug and diagnose a system where nothing is working except the
CPU itself.
The Joint Test Action Group (JTAG) defines an interface called the JTAG interface for
testing individual devices on printed circuit boards, without the need to remove the
devices from the board. This is accomplished by a method called boundary scan, whereby
the state of each pin of each device (with some special logic on the device) is serially
scanned out from the device. Multiple devices can be daisy chained, and an entire PC
board can therefore be scanned in a single scan chain.
It is possible to use the same method to scan out not only the state of a device's pins,
but also any internal information from the device, such as register values and memory
locations. Scan-based emulation was born as a consequence.
The JTAG specification does not include the pinout for the JTAG connector. An
extension to JTAG defines a 14-pin, 2-row JTAG connector header with 0.1" spacing,
whose pinout and physical dimensions are common to all JTAG-capable DSPs involved in
this methodology [2].
During JTAG emulation, the emulator supplies the clock that scans the device. This
means that the target clock speed is completely independent of the emulation clock, and
the emulator can support targets running at any clock speed.
The JTAG device architecture is based on the IEEE 1149.1 architecture. This
specification defines four dedicated pins, plus one optional pin, collectively known as
the Test Access Port (TAP). They are:
• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)
A boundary scan cell is connected to each boundary scan register on each device that is
being scanned. The architecture further specifies a finite state machine, the TAP
controller, with inputs TMS and TCK. There is an Instruction Register (IR) holding the
current instruction, a bypass register, and an optional 32-bit identification register
for permanent identification.
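The TAP controller can be sketched as a small transition table. The sixteen state names below are those of IEEE 1149.1; the function names `tap_step` and `tap_run` are hypothetical and exist only for this illustration. The sketch shows that the next state depends only on the current state and the value of TMS at each TCK.

```python
# Sketch of the IEEE 1149.1 TAP controller state machine.
# Each entry maps a state to (next state if TMS=0, next state if TMS=1).
TAP_TRANSITIONS = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def tap_step(state, tms):
    """Advance the controller by one TCK cycle with the given TMS value (0 or 1)."""
    return TAP_TRANSITIONS[state][tms]


def tap_run(state, tms_bits):
    """Apply a sequence of TMS values, one per TCK cycle, and return the final state."""
    for tms in tms_bits:
        state = tap_step(state, tms)
    return state
```

Holding TMS high for five clocks returns the controller to Test-Logic-Reset from any state, which is why an emulator can always recover a known starting point even when the optional reset pin is absent.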
Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel
load operations cause signals from the core logic to be loaded into the output cells.
Parallel unload operations cause the signals to be loaded from the input cells to the core
logic. Data is shifted in serial mode by daisy chaining devices. The figure below shows
the TDO of each device connected to the TDI of the next device in the scan chain.
[Figure 1: Devices daisy chained in a scan chain, with the chain connected to the host
computer.]
Typically, the system architect is responsible for determining the type of arrangement
of devices (homogeneous or heterogeneous), their order in the scan chain and whether
they will be placed in bypass.
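The effect of daisy chaining and bypass can be illustrated with a small simulation. This is a minimal sketch, not emulator code: the class `ScanDevice` and the functions `shift_chain` and `chain_length` are hypothetical names, and each device is modeled only as a shift register between its TDI and TDO. A bypassed device contributes a single bit to the chain.

```python
class ScanDevice:
    """One device in the chain, modeled as a shift register between TDI and TDO."""

    def __init__(self, name, data_bits, bypassed=False):
        self.name = name
        # A bypassed device presents its 1-bit bypass register instead of its data register.
        # Index 0 is the bit nearest TDI; the last index is the bit nearest TDO.
        self.register = [0] if bypassed else list(data_bits)

    def clock(self, tdi):
        """Shift one bit in at TDI and return the bit that leaves at TDO."""
        tdo = self.register[-1]
        self.register = [tdi] + self.register[:-1]
        return tdo


def shift_chain(devices, tdi_bits):
    """Clock bits through the whole chain; each device's TDO feeds the next device's TDI."""
    scanned_out = []
    for bit in tdi_bits:
        for device in devices:
            bit = device.clock(bit)
        scanned_out.append(bit)
    return scanned_out


def chain_length(devices):
    """Number of TCK cycles needed to scan the complete chain once."""
    return sum(len(device.register) for device in devices)
```

With a 4-bit device, a bypassed device and a 3-bit device in that order, the chain is 8 bits long, and the bits of the last device in the chain are the first to reach the host.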
3. Real-Time Analysis (RTA)
The following application in the domain of high energy physics illustrates the necessity
for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron
Collider generates 15 million particle collisions per second. These particle collisions
result in the creation of subatomic particles that travel through a spectrometer. The data
output from the spectrometer is on the order of terabytes per second and must be analyzed
in real time. The analysis engine comprises a massively parallel arrangement of
heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of
applying algorithms that reconstruct and filter the collision data. The result is a select set
of interesting collisions from which physicists can study some of the remaining mysteries
of matter and antimatter in the universe [3].
Historically, several different methods have been employed to debug and analyze real-
time embedded applications. Traditional debuggers were used to set breakpoints that
stop the target application so that the memory state could be examined. This method has
proven to be inadequate for most real-time applications because setting breakpoints halts
the application and therefore interferes with the timing constraints of the system. The
memory state is not guaranteed to contain reliable results.
Logic analyzers have been used for many years to clamp onto the data busses of the
target and monitor the data flow of the application in order to analyze application
behavior. Aside from the fact that logic analyzers are expensive ($15K to $60K for a
DSP), the increase in system-level integration over the years has resulted in fewer
exposed data paths for the logic analyzer to monitor.
In some cases, pre-production versions of the chips containing in-circuit emulation (ICE)
structures were manufactured. These could be used to debug real-time applications.
However, since the debugging environment is not equivalent to the final production
environment, the application’s performance cannot be guaranteed to remain the same
from the ICE version to the final chip.
Most modern microprocessors are architected with specialized hardware counters that can
be programmed for the purpose of tracing applications. Traditionally these registers have
been used to guide the design of microarchitectural features such as caches and TLBs.
Although these registers can trace the behavior of applications at a very fine level of
granularity, they cannot easily be used as a RTA mechanism. An ancillary yet significant
issue is that the user needs advanced knowledge of the target microarchitecture in order
to interpret the data. Finally, tracing supports data transfer only from target to host
and not from host to target.
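[Figure not reproduced. Blocks shown, from host to target: host applications App1 … AppM,
RTA host software library, emulation software, emulator, JTAG link, emulation hardware,
RTA target software library, real-time target application.]
Figure 2
Single processor RTA based upon JTAG emulation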
An alternative real-time analysis solution based upon JTAG emulation is presented here.
This hardware and software architecture for a single processor is shown in Figure 2. The
JTAG interface that connects the on-chip emulation logic to the host-based emulator
provides the physical mechanism on which to transport data from the target to the host
and vice versa. The target application is the subject matter to be analyzed; it is the source
of data to be sent to the host and the sink for data received from the host. Therefore, a
RTA target software library exists to bridge the gap from the target application to the on-
chip emulation hardware. Good software engineering practices dictate that an API exist
for this software library. On the host, the data is to be analyzed by a host application.
This host application may also input data to the target application. We must therefore
bridge the gap from the emulator to the host application. An emulation software driver
controls the scanning of data to/from the target via the emulator. It is the first piece of
host software to receive data from the target and the last piece of host software to handle
data heading to the target. A RTA host software library funnels the data between the
emulation software driver and the host application. Again, an API exists for the RTA
host software library. It should be noted that multiple host applications may be run
concurrently.
Data flow in this architecture is bi-directional: data flows from the target application to
the host application for analysis, and data may flow from the host application to the target
application for supplying input parameters. Such input parameters may be used for fine
tuning performance, for supplying test data, etc. Refer to Figure 3.
For target-to-host data transfer, there are two distinct parts of the data flow path. The
first part extends from the target application to the RTA host software library. This is the
real-time transportation leg. Since our target application has real-time constraints, data
must be off-loaded from the target to the host at a certain rate. The RTA host software
library is the first piece of software on the host that realizes it has received real-time data
for analysis. (The emulation software is agnostic to what type of data it is scanning.) The
RTA host software can record the data to disk and be done with it, or buffer it internally.
The second part of the target-to-host data flow path extends from the RTA host software
library to the host application. The data is analyzed by the host application. If the data
has been recorded in persistent storage, then the data can be played back at any time. If
the data is not in persistent storage, then it must be analyzed by the host application as it
is produced; that is, the data must be drained from the RTA host software buffers as they
fill.
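The two options described above, recording to persistent storage or buffering for live analysis, can be sketched as follows. This is an illustration under stated assumptions, not the RTA host software library itself: the class `RtaHostBuffer` and its methods are hypothetical, a Python list stands in for persistent storage, and returning `False` stands in for whatever mechanism the real library uses when the host application is not draining its buffers quickly enough.

```python
from collections import deque


class RtaHostBuffer:
    """Hypothetical stand-in for the buffering inside the RTA host software library."""

    def __init__(self, capacity, log=None):
        self.capacity = capacity      # maximum number of blocks held for live analysis
        self.pending = deque()        # blocks waiting for the host application
        self.log = log                # list standing in for persistent storage, or None

    def on_target_data(self, block):
        """Accept one block scanned from the target. Returns False if it cannot be held."""
        if self.log is not None:
            self.log.append(block)    # recorded data can be played back at any time
            return True
        if len(self.pending) >= self.capacity:
            return False              # the host application must drain before more arrives
        self.pending.append(block)
        return True

    def drain(self):
        """Hand all buffered blocks to the host application, oldest first."""
        blocks = list(self.pending)
        self.pending.clear()
        return blocks

    def playback(self):
        """Return a copy of everything recorded so far."""
        return list(self.log or [])
```

In the unrecorded case the host application has to call `drain` at least as often as the buffers fill, which is the real-time obligation the paragraph above describes.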
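[Figure not reproduced. Same blocks as Figure 2, from host applications App1 … AppM
through the RTA host software library, emulation software, emulator, JTAG link, emulation
hardware and RTA target software library to the real-time target application, with an
additional input path from host to target.]
Figure 3
Data flow in single processor RTA based upon JTAG emulation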
The above RTA architecture for a single embedded processor is easily extended to a
multiprocessor environment. This is shown in Figure 4. A RTA target software library
must exist on each embedded target to connect the target application to the emulation
logic on that target. For multi-core architectures, a RTA target software library will exist
for each core. The data from each processor is scanned up to the host via the JTAG
interface ring as described in Section 2 (Scan-Based Emulation). On the host, there exists
an emulation software driver corresponding to each target in the system. Each emulation
software driver receives the data from its corresponding target and delivers the data to the
one RTA host software library.
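[Figure not reproduced. The host (host applications up to AppM and emulation software)
connects to a chain of target processors up to Pn, linked TDI to TDO. Each target
processor contains emulation hardware, RTA target software and a real-time application.]
Figure 4
Multiprocessor RTA via Scan-Based Emulation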
This architecture for multiprocessor real-time analysis via scan-based emulation provides
the basis for the methodology.
4. Methodology
The figures of merit used to determine the success of this methodology are performance,
scalability, ease of use and reliability. We discuss these criteria within the scope of both
hardware capabilities and the RTA software architecture in a multiprocessor
environment.
4.1 Performance
4.1.1 Dedicated Emulation Hardware
Data can be streamed between target and host by using peripherals such as DMA and by
performing real-time memory write operations.
[Figure not reproduced. Virtual Path 1 through Virtual Path N connect host and target.]
Figure 5
Virtual Data Paths
A RTA solution for a multiprocessor environment must be able to identify the processor
from which data originates. This introduces the need to mark the data with a processor
identifier. The decision then becomes where in the system to do this. If we examine the
host, we see that there is a one-to-one correspondence between the emulation software
drivers and the processors in the system. Since there is an emulation software driver for
each target in the system, these drivers can stamp the data with a processor identifier.
Note that from a performance perspective, it is better to mark the data on the host side
than on the target side. If a unique processor identifier were sent down to the target
and the data were tagged there, then more data would be sent from the target to the host
and would consume precious bandwidth.
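Host-side tagging can be sketched as follows. The names `TaggedBlock`, `EmulationDriver` and `bytes_scanned` are hypothetical, and the 4-byte tag and 64-byte payload used in the usage note are assumed figures chosen only to make the bandwidth argument concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBlock:
    """A block of target data together with the identifier of its source processor."""
    processor_id: int
    payload: bytes


class EmulationDriver:
    """Hypothetical per-target driver; one instance exists for each processor."""

    def __init__(self, processor_id):
        self.processor_id = processor_id

    def receive(self, scanned_payload):
        # The tag is attached on the host, so it never crosses the scan link.
        return TaggedBlock(self.processor_id, scanned_payload)


def bytes_scanned(payload_len, blocks, tag_len=0):
    """Bytes crossing the scan link for `blocks` blocks, with an optional target-side tag."""
    return blocks * (payload_len + tag_len)
```

For 1000 blocks of 64 bytes, host-side tagging scans 64000 bytes, whereas tagging each block on the target with an assumed 4-byte identifier would scan 68000 bytes for the same payload.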
4.2 Scalability
A key aspect of this methodology is scalability. This issue is addressed in both hardware
and software.
The JTAG specification permits the daisy chaining of hardware. The limit on the number
of devices that can be daisy chained is set by signal strength limitations rather than
by the JTAG specification.
In software, data from each target is tagged with a unique identifier (as described in
Section 4.1.2) so that data being transferred between host and target can be attributed
to the processor to which it belongs.
Further, the RTA architecture is software scalable; writing the target application is not
dependent upon the number of processors, and the application does not need to be altered
if processors are added to or removed from the system configuration. There is no
requirement that the target application have any knowledge of the type or number of
processors in a scan chain at the time of development.
The emulation drivers and the RTA host software are architected to manage the data from
the different processors.
The host application should be able to select from which processor to send or receive
data. This is accomplished by incorporating this functionality into the host API. This
proves to be very favorable with respect to scalability. By allowing the host application
to select the processor, the same target application can be replicated without change on
multiple processors to exploit parallel computing power.
For example, let’s assume that we have a target application that performs a series of
transformations on a given vector, and then transfers the resulting vector to the host for
analysis. Let’s further assume that there are many vectors that must be transformed and
that we choose to deploy the same target application on as many processors as there are
vectors to achieve maximum computing parallelization. We can design a host application
that sends a different vector to each processor and then collects the resulting vectors from
each processor, respectively, for analysis. See Figure 6.
[Figure not reproduced. Each of Processor1, Processor2 and Processor3 runs the same
Transform. The host application performs the following steps:]
• Select Processor1, send vector1
• Select Processor2, send vector2
• Select Processor3, send vector3
• Select Processor1, get resultant vector
• Select Processor2, get resultant vector
• Select Processor3, get resultant vector
Figure 6
Processor Selection
Note that this example still holds if the target application sends its data via a virtual data
path. Since the target application is replicated unchanged, each processor would be
sending data on the same virtual path. However, since the host application selects data
on the basis of both processor identifier and virtual data path identifier, all data is
uniquely identifiable.
This example illustrates that host-application control over processor selection results in a
scalable multiprocessor methodology.
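The vector example can be sketched in code. This is a minimal simulation, not the actual host API: the class `HostRta`, its methods `select_processor`, `send` and `get`, and the `transform` function are hypothetical names, and the target is modeled as a plain function that runs as soon as data is sent. Results are keyed by both processor identifier and virtual path identifier, so replicated applications using the same path remain distinguishable.

```python
class HostRta:
    """Hypothetical host API: select a processor, then send to it or get from it."""

    def __init__(self, targets):
        self.targets = targets    # processor id -> target application (a callable here)
        self.selected = None
        self.results = {}         # (processor id, virtual path id) -> data

    def select_processor(self, processor_id):
        if processor_id not in self.targets:
            raise KeyError(processor_id)
        self.selected = processor_id

    def send(self, vector, path_id=0):
        # In this sketch the target runs immediately and replies on the same path.
        result = self.targets[self.selected](vector)
        self.results[(self.selected, path_id)] = result

    def get(self, path_id=0):
        return self.results.pop((self.selected, path_id))


def transform(vector):
    """The same unchanged target application, replicated on every processor."""
    return [2 * x + 1 for x in vector]


def run_all(vectors):
    """Send one vector to each processor, then collect one result from each."""
    host = HostRta({pid: transform for pid in vectors})
    for pid, vector in vectors.items():
        host.select_processor(pid)
        host.send(vector)
    collected = {}
    for pid in vectors:
        host.select_processor(pid)
        collected[pid] = host.get()
    return collected
```

Adding a fourth processor only requires another entry in the dictionary passed to `run_all`; the `transform` function is not touched, which mirrors the scalability argument made above.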
4.3 Ease of Use
Ease of use is an important figure of merit, but one that is often difficult to sustain.
A software debugging environment is provided that permits the user to easily configure
the hardware in the system.
A trend in DSP emulation hardware is to support device registers that are mapped at fixed
addresses. This permits the source code porting of applications. Further, a trend in more
contemporary DSP emulation logic is to replicate the logic on all DSPs. This further
simplifies the deployment of RTA tools.
At setup, the user selects the type of target and loads the system with an emulation driver
for that target. The user also specifies the number of targets of each type and their
position in the scan chain. Without this capability users would have to add code in their
host applications that performed the same function, resulting in messy and unnecessarily
complex code.
The debugging support software permits devices on a scan chain to be placed in bypass.
In the absence of this support, the application might have to disassemble unwanted
scans itself.
Host-side support is provided in the way of object-oriented interfaces based on the
Component Object Model (COM) [4], which is a de facto industry standard. This permits the
host application developer to write client programs that are not tightly coupled to a
specific DSP.
4.4 Reliability
The JTAG specification is long established as a reliable standard, and it has been
adopted and extended. An extensive set of target libraries has been developed for
various flavors of DSPs based on boundary scan. Reliability is achieved through reuse of
the same register set in different versions of emulation hardware, both across ISAs and
within an ISA.
The use of unidirectional virtual paths for both target-to-host and host-to-target data
transfers assists in ensuring that there is no data corruption. Further, host applications
synchronize on data buffers connected to virtual paths and so there is no data loss.
Buffer management is precise and is architected to ensure no data loss on both target and
host sides.
Another feature of the RTA architecture is congestion control. With this capability
buffers are guaranteed not to overflow.
During host-to-target data transfer, the RTA architecture signals the end of data transfer
through a virtual path using callbacks. Callbacks are used to notify target applications
that data sent by the host has to be read. The virtual paths through which data is passed
cannot be reused unless previously written data has been consumed.
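The reuse rule and the callback can be sketched together. The class `HostToTargetPath` and its methods are hypothetical, and the path is assumed to hold a single unread message; the real protocol may buffer differently.

```python
class HostToTargetPath:
    """Hypothetical one-way virtual path holding at most one unread message."""

    def __init__(self, on_data_ready):
        self.on_data_ready = on_data_ready   # callback registered by the target application
        self.slot = None                     # the unread message, if any

    def host_write(self, data):
        """Write one message. Returns False if the previous message is still unread."""
        if self.slot is not None:
            return False                     # path is busy: the writer must retry later
        self.slot = bytes(data)
        self.on_data_ready(self)             # signal the end of the data transfer
        return True

    def target_read(self):
        """Consume the message, which frees the path for the next write."""
        data, self.slot = self.slot, None
        return data
```

A second write is refused until the target application has read the first message, so previously written data is never overwritten in transit.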
Data is copied from the target application into buffers in the RTA target software library.
This supports reliability by ensuring that the target application does not accidentally
overwrite data.
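The benefit of copying can be shown in a few lines. The class `TargetSendBuffer` is a hypothetical stand-in for the buffering in the RTA target software library.

```python
class TargetSendBuffer:
    """Hypothetical stand-in for the send buffers in the RTA target software library."""

    def __init__(self):
        self.queued = []

    def send(self, data):
        # bytes(...) takes a private copy, so later writes to the caller's
        # buffer cannot change what is waiting to be scanned out.
        self.queued.append(bytes(data))
```

If the application reuses its working buffer immediately after calling `send`, the queued copy still holds the values that were present at the time of the call.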
5. Challenges
Each family of DSP has its particular variant of emulation hardware. This has an impact
on the RTA protocol that is used to transfer data between host and target. For instance,
some of the emulation capabilities on some DSPs use interrupts to signal the flow of data
between host and target. In the absence of emulation interrupt support, the application
must poll the emulation hardware for the presence of data.
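The difference between the two cases can be sketched as follows. `EmulationPort` is a hypothetical model of the receive side of the emulation hardware and `RtaTargetReceiver` a hypothetical piece of the target library; neither reflects the register interface of a real DSP.

```python
class EmulationPort:
    """Hypothetical model of the receive side of the on-chip emulation hardware."""

    def __init__(self, supports_interrupts):
        self.supports_interrupts = supports_interrupts
        self.handler = None       # interrupt handler, if one is installed
        self.inbox = []           # words scanned in by the host and not yet read

    def deliver(self, word):
        """The host scans one word into the target."""
        self.inbox.append(word)
        if self.supports_interrupts and self.handler is not None:
            self.handler(self.inbox.pop(0))

    def poll(self):
        """Return the next unread word, or None if nothing has arrived."""
        return self.inbox.pop(0) if self.inbox else None


class RtaTargetReceiver:
    """Receives host data by interrupt where available, otherwise by polling."""

    def __init__(self, port):
        self.port = port
        self.received = []
        if port.supports_interrupts:
            port.handler = self.received.append

    def service(self):
        """Must be called regularly by the application when polling is in use."""
        if self.port.supports_interrupts:
            return
        word = self.port.poll()
        while word is not None:
            self.received.append(word)
            word = self.port.poll()
```

With interrupts, data reaches the receiver as soon as it is delivered. Without them, nothing is received until the application calls `service`, so the polling rate becomes part of the application's timing budget.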
Another issue is the support for DSPs with varying word sizes (16 bit and 32 bit).
These issues have been addressed on the target side by developing the RTA target
software libraries that get linked in with the application. These target libraries comprise
the software that is responsible for programming the emulation and peripheral device
registers and for effecting data transfer.
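One way a library layer can hide the word-size difference is to convert between a byte stream and target words. The functions `bytes_to_words` and `words_to_bytes` are hypothetical, and little-endian byte order with zero padding is an assumption made for this sketch.

```python
def bytes_to_words(payload, word_bits):
    """Split a byte string into target words of 16 or 32 bits, zero padded at the end."""
    if word_bits not in (16, 32):
        raise ValueError("word_bits must be 16 or 32")
    size = word_bits // 8
    padded = payload + b"\x00" * (-len(payload) % size)
    return [
        int.from_bytes(padded[i:i + size], "little")   # assumed byte order
        for i in range(0, len(padded), size)
    ]


def words_to_bytes(words, word_bits, length):
    """Rebuild the original byte string; `length` removes the padding."""
    size = word_bits // 8
    raw = b"".join(word.to_bytes(size, "little") for word in words)
    return raw[:length]
```

The same five-byte payload becomes three words on a 16-bit target and two words on a 32-bit target, and in both cases the original bytes are recovered when the payload length is carried alongside the words.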
On the host side, the RTA host software is a target independent layer that can filter data
in a multiprocessor environment to send and receive the data from a particular target
unambiguously.
6. Conclusion
The RTA methodology presented in this paper is extensively used and widely accepted.
The problem illustrated by the high energy physics application presented in Section 3
is not limited to that domain. Our experience to date has shown that other domains such as wireless
and mobile computing require the processing of RTA data where both DSPs and
microcontrollers are on the same scan chain.
The development of this RTA capability has been predicated on the JTAG specification
and the adherence to this standard in the emulation hardware that has been designed into
the DSP core.
The software that has been developed is able to differentiate between the various DSPs.
The virtual paths in the RTA architecture guarantee data integrity. The COM interfaces
permit the analysis of data via Commercial Off-The-Shelf (COTS) tools such as
MATLAB® and LabVIEW™.
Biographies
Debbie Keil is a member of Texas Instruments’ technical ladder, a distinction held by the
top twenty percent of TI’s technical staff worldwide. She is the co-architect of TI’s Real-
Time Analysis technology and the technical lead of the software engineering team that
develops real-time analysis solutions for TI DSPs. Debbie has extensive industry
experience in compiler development and embedded systems programming. She has
published in the EE Times and has patents pending. She is a member of the International
WHO’S WHO of Information Technology. Debbie holds a Master’s degree in Computer
Science and a Bachelor’s degree in Computer Science and Math from the University of
Pittsburgh.
References
[2] Texas Instruments JTAG Extensions http://www.ti.com
[3] Gottschalk, E.E., et al., "The BTeV DAQ and Trigger System – Some Throughput,
Usability, and Fault Tolerance Aspects," Proceedings of the Computing in High Energy
and Nuclear Physics Conference (CHEP 2001), p. 628, Beijing, China, September 2001.