Simics Unleashed - Applications of Virtual Platforms 2013
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying,
recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written
permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive,
Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Publisher, Intel Press, Intel
Corporation, 2111 NE 25th Avenue, JF3-330, Hillsboro, OR 97124-5961. E-Mail: [email protected].
This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold with the understanding that the publisher
is not engaged in professional services. If professional advice or other expert assistance is required, the services of a competent professional person should be sought.
Intel Corporation may have patents or pending patent applications, trademarks, copyrights, or other intellectual property rights that relate to the presented subject
matter. The furnishing of documents and other materials and information does not provide any license, express or implied, by estoppel or otherwise, to any such patents,
trademarks, copyrights, or other intellectual property rights.
Intel may make changes to specifications, product descriptions, and plans at any time, without notice.
Fictitious names of companies, products, people, characters, and/or data mentioned herein are not intended to represent any real individual, company, product, or event.
Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. Intel, the Intel
logo, Intel Atom, Intel AVX, Intel Battery Life Analyzer, Intel Compiler, Intel Core i3, Intel Core i5, Intel Core i7, Intel DPST, Intel Energy Checker, Intel
Mobile Platform SDK, Intel Intelligent Power Node Manager, Intel QuickPath Interconnect, Intel Rapid Memory Power Management (Intel RMPM), Intel
VTune Amplifier, and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results
to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Other names and brands may be claimed as the property of others.
This book is printed on acid-free paper.
Printed in China
10 9 8 7 6 5 4 3 2 1
ALL INFORMATION PROVIDED WITHIN OR OTHERWISE ASSOCIATED WITH THIS PUBLICATION INCLUDING, INTER ALIA, ALL SOFTWARE
CODE, IS PROVIDED AS IS, AND FOR EDUCATIONAL PURPOSES ONLY. INTEL RETAINS ALL OWNERSHIP INTEREST IN ANY INTELLECTUAL
PROPERTY RIGHTS ASSOCIATED WITH THIS INFORMATION AND NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE,
TO ANY INTELLECTUAL PROPERTY RIGHT IS GRANTED BY THIS PUBLICATION OR AS A RESULT OF YOUR PURCHASE THEREOF. INTEL
ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO THIS INFORMATION
INCLUDING, BY WAY OF EXAMPLE AND NOT LIMITATION, LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR THE INFRINGEMENT OF ANY INTELLECTUAL PROPERTY RIGHT ANYWHERE IN THE WORLD.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations
not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S
TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A Mission Critical Application is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE
OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS, COSTS,
DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY,
PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS
SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or
instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from
future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current
characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or
go to: http://www.intel.com/design/literature.htm
Articles
Simics* Overview..................................................................................................................................................... 8
Simics*-SystemC* Integration.............................................................................................................................. 54
Sim-O/C: An Observable and Controllable Testing Framework for Elusive Faults.............................................. 178
Peter S. Magnusson,
Engineering Director, Google, Inc.
When I interviewed for an internship at the Swedish Institute of Computer Science (SICS) in 1991, the project was to write
a parallel-computer simulator for the Data Diffusion Machine (DDM) research effort being led by Seif Haridi and Erik
Hagersten. I was hired by Andrzej Ciepielewski and Torbjörn Granlund and the project was supposed to take six weeks. It took a
little longer than that.
Back in the 1980s, it was common for computer architecture research to be entirely based on simulations running computationally
intensive workloads, that is, traditional high performance computing. The best practice in the field was summarized by the release of the
Stanford Parallel Applications for Shared Memory (SPLASH) benchmark suite, which coincidentally also occurred in 1991.
However, at the time (late 1980s, early 1990s), a number of research groups recognized that much of parallel-computer usage
was not compute-intensive as much as it was data-intensive: for example, transactional workloads, better represented by
benchmarks like TPC-C. But these workloads were generally commercial software, large parts of which were only available in
binary form. They also relied heavily on the underlying operating system.
So a project was conceived to develop a simulator to both support the computer architecture work around the DDM project and
also support porting an operating system to the prototype. An existing, groundbreaking simulation environment developed by
Robert Bedichek at the University of Washington was extended to support a multiprocessor system and to mimic real devices.
(As a curious aside to the reader, Robert's work on simulation began during his time at Intel in the late 1980s, so Simics now being an
Intel product closes the loop.)
The six weeks grew. Some six calendar years, twenty man years, and several hundred thousand lines of code later, in 1997, the
simulation group in the Computer and Network Architectures (CNA) group at SICS finally succeeded in the original goal:
booting a commercial operating system (Solaris* 2.6) on a simulated Sun Microsystems server (sun4m architecture). This
was the first known occasion of an academic group running an unmodified commercial operating system in a fully simulated
environment. The full system simulator was born.
The simulation group at SICS eventually grew to five people, all of whom became founding employees of Virtutech in 1998:
Magnus Christensson, Fredrik Larsson, Peter Magnusson, Andreas Moestedt, and Bengt Werner. Our first customers were Sun
Microsystems, Ericsson, and HP. To the original SPARC* V8 architecture, we added SPARC V9, x86, x86-64, Power, ARM,
Itanium, and so on. We invented a number of new technologies and tools along the way, making Simics by far the most capable
tool in its field.
With the launch of Simics 3.0 and the Hindsight* technology in 2005, all the core elements that I remember scoping out on a
whiteboard around 1993 were in place, and several I hadn't imagined. So in some sense, it became a software project that literally
took over 100 times longer than originally planned.
In the process I became convinced (and still am) that this is by far the best way forward to improve software development
environments, since, once inside a deterministic simulator, you can do some very interesting things.
Simics* Overview
Contributors
Daniel Aarno, Software and Services Group, Intel Corporation
Jakob Engblom, Wind River

This article provides an overview of Wind River Simics*, a full-system simulation framework jointly developed by Intel and Wind River. Simics technology has been used to help develop complex software and hardware systems for more than two decades. This technical overview describes what Simics is, its main design goals and principles, and how it works. The article also describes the overall simulation landscape, and how Simics fits into the big picture.
Introduction
A full-system simulator (FSS) like Simics[7] is a model of a digital system that is complete enough to run the real target's software stack and fast enough to be useful for software developers. The speed and full-system simulation capabilities of Simics differentiate it from most simulation tools provided by the electronic design automation (EDA) industry[8], which are typically extremely accurate from a hardware perspective, but too slow to be practical for operating system (OS), application, or systems software development.
In an FSS, there are models of processors, memories, peripheral devices,
networks, and so on, making up a model of the target machine. The key goal of
the simulation is that as far as the software running on the target is concerned,
it could just as well be running on physical hardware. Often this means that
the simulation solution includes more than just the computer components.
The simulation also integrates various simulators for the external environment
that the computer system is operating in.
The main users of Simics are software and systems developers, and their main problem is how to develop complex systems involving both hardware and software. Virtual machine monitors (VMMs) like VMware* or VirtualBox* also run complete software stacks, but for a runtime use case, not for the complete product lifecycle. In addition, a VMM only simulates a generic, simplified hardware platform, whereas Simics can ensure binary compatibility with an actual real-world machine, such as a specific Intel chipset (PCH) and processor variant.
Target Systems
The target systems simulated with Simics range from single-processor aerospace
boards to large shared-memory multiprocessor servers and rack-based
telecommunications, data communications, and server systems containing
thousands of processors across hundreds of boards. The systems are often
heterogeneous, containing processors with different word-length, endianness,
and clock frequency. For example, there can be 64-bit Intel Architecture
Often, target systems are networked. There can be networks of distinct systems and
networks internal to a system (such as VME, I2C, PCIe, and Ethernet-based rack
backplanes). Multiple networks and multiple levels of networks are common.
Simulation runs can cover many hours or days of target time and involve multiple loads of software and reboots of all or part of the system. Even a simple task such as booting Linux and loading a small test program on an eight-processor SoC can take over 30 billion instructions. Profiling and instrumentation runs can take tens of billions of instructions.
Simics can be used to model future processors and chipsets well in advance of hardware availability. Such early hardware deployment of Simics allows BIOS, OS, and application software development to be performed long before even prototype silicon is available.

Simics is often used with models of hardware that are also available in silicon. Some models started life as early hardware models, and others have been created after the hardware was commercially available in order to directly support the main software and system development effort.
used from the software stack. FSS is used to develop system software, including debug and test. The software development schedule can be decoupled from the availability of hardware when using FSS, and it improves software development productivity by providing a better environment than hardware.
Figure 2 illustrates the concept of shift-left, where software, driver, and BIOS development, integration, and test efforts are performed much earlier in the development process. This not only reduces a product's time to market, it also reduces the cost of fixing defects, since they are discovered earlier in the product lifecycle, and it increases product quality and customer satisfaction.
(Figure 2: The shift-left concept: engineering resources, costs, and risks are reduced, while product quality, time to market, and revenue are improved.)
The article "Using Virtual Platforms for BIOS Development and Validation" by Steve Carbonari describes the development of BIOS code on Simics models in advance of hardware availability, as well as how Simics is being used after silicon becomes available.

The article "Post-Silicon Impact: Simics Helps the Next Generation of Network Transformation and Migration to Software Defined Networks (SDNs)" by Tian Tian describes a high-level view of how Simics has been used for early hardware access for Intel communications chips, developing software stacks and drivers.

The article "Early Hardware Register Validation with Simics" by Alexey Veselyi and John Ayers describes a lower-level use case, where Simics is used to validate the register design of hardware very early in the design process.
As a system matures and the next generation begins development, Simics can be used to smoothly move from the current generation to the next generation. By setting up a model containing a mix of old and new hardware components (such as different generations of boards in a rack-based system), software can gradually be updated to match the next hardware generation. As part of this process, new boards can be tested in a system containing existing legacy boards. This is represented by the arrow back to the start in Figure 1.
Running real unmodified software stacks has many benefits. Since Simics is
primarily used for software development, running the actual software that
is being developed makes eminent sense. The software is compiled using the
same tools and compilers that are used with the hardware target, avoiding
inconsistencies and deviations introduced by host compilation or other
approximations or variant builds for simulation and quick tests.
Unmodified software also means unmodified build systems, and thus there is
no need for users to set up special builds or build targets for creating software
to run on Simics. There may be portions of the system where only machine
code is available, such as proprietary libraries, drivers, or operating systems, and
in such cases running the real binary code is the only way to get a complete
software system running.
Using unmodified software also means that software can be managed and
loaded in the same way as on a real system, making Simics useful for testing
operations and maintenance of the target system.
The article "Using Virtual Platforms for BIOS Development and Validation" mentioned earlier describes how Simics is used to develop, test, and debug low-level BIOS code, which is probably the most difficult type of software to run on a simulator.
(Figure 3: Simics architecture. The Simics core provides configuration management, core services, the Simics API, the event queue and time, and multithreading and scaling. Modules such as components, devices, networks, CPU cores, memories, and features model the target machine(s), while connections such as the Eclipse GUI, Ethernet, serial, keyboard, mouse, and debuggers link the simulation to the external world.)
Simics models can be distributed as binary-only modules, with no need to supply source code to the users. Binary distribution simplifies the installation for end users, as they do not have to compile any code or set up build environments. It also offers a level of protection for intellectual property when different companies exchange models. By obfuscating the names of hardware registers and limiting the amount of metadata included in the modules, it is possible to safely distribute models of very sensitive future hardware designs to external users. It makes it possible to limit the information disclosed by the model to precisely that of the documentation provided, even if the model itself needs to contain undocumented and secret registers to make BIOS and low-level firmware code work correctly.
Simics modularity enables short rebuild times for large systems, as only
the modules that are actually changed have to be recompiled. The rest of
the simulation is unaffected, and each Simics module can be updated and
upgraded independently.
Simics uses the C-level ABI and host operating system dynamic loading
facilities. The C++ ABI varies between compiler versions and compiler vendors,
and is thus not usable in the interface between modules, even though C++
can be used internally in modules. The Simics framework provides bindings to
write Simics modules using DML (see below), Python, C, C++, and SystemC,
but users can actually use any language they like as long as they can link to C
code. For example, a complete JVM has been integrated into Simics, running
modules written in Java.[1]
Scalability
As discussed above, Simics target systems can potentially be very large.
To efficiently simulate such large systems, Simics makes use of several
techniques, which are described in more detail in the section "Simics
Performance Techniques." Scalability has been an important attribute of
Simics since the very first commercial deployments, originally relying on
distributed simulation[7], and evolving into a multithreaded (and distributed)
implementation.[8]
Simics can also be run from a normal command-line shell, on both Linux and
Windows hosts. This makes it possible to run Simics without invoking the
Eclipse GUI and is useful when it comes to automating Simics runs from other
tools. Simics behaves just like any other UNIX-style command-line application
when needed.
As illustrated in Figure 3, the Simics architecture separates the function of the target hardware system from the connections to the outside world. The target consoles shown in Figure 4 are not part of the device models of the serial ports and graphics processor unit, but rather provided as generic functions by the Simics framework. This means that all consoles behave in the same way and provide support for command-line scripting, record and replay of inputs, and reverse execution.
(Figure 5: Simics virtual networking. Virtual boards with Ethernet devices and PHYs connect to a virtual network offering traffic generation, inspection, and fault injection, a service node (DNS, DHCP, FTP, NFS, ...), and a real-network connection bridging to a physical network.)
As well as passively observing the state of the target system, Simics users can
change it. This is used for fault injection or to quickly set up a system to make
software run without necessarily having all boot code in place.
Scripting
Simics scripts work the same way in a Simics simulation started from Eclipse, in an interactive command-line session, and in an automated batch run on a remote compute server. Basic scripts are written in the Simics CLI command-line language, and for more complex tasks there is a full Python environment embedded in Simics. The Python engine has access to all parts of the simulated system and can interact with all Simics API calls. CLI and Python scripts can exchange data and
variables with each other, and it is common to find snippets of Python embedded inside of Simics scripts. Users can create their own custom CLI commands in order to automate or simplify common tasks peculiar to their system or environment.
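As an illustration of such a user-defined command, the sketch below registers a trivial CLI command from the embedded Python environment. It assumes the Simics cli module interface (new_command, arg, str_t); the command name and handler are invented for illustration, and the code runs inside Simics rather than standalone.

    # Hypothetical sketch: registering a custom CLI command from Python.
    # Assumes the Simics cli module (new_command, arg, str_t); the command
    # name "my-hello" and its handler are invented for illustration.
    import cli  # available inside Simics' embedded Python, not standalone

    def hello_cmd(name):
        print("Hello from %s" % name)

    cli.new_command("my-hello", hello_cmd,
                    args=[cli.arg(cli.str_t, "name")],
                    short="print a greeting",
                    doc="Example user-defined command printing a greeting.")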
A typical Simics scripting example is shown in Code 1. It is a script that opens a
Simics checkpoint and then runs a command on the target. The parameters to the
command are sent in as Simics CLI variables to this script, but are also provided
with default values in case nothing is provided. The script branch at the end is a
construct that lets script code run in parallel to the target system and react to events
in the simulation. This makes it very easy to automate and script systems containing
many different parts where the global order of scripted events is unknown before
the simulation starts. Separate scripts can be attached to the different parts.
## Parameters to run:
if not defined opmode        { $opmode = "software_byte" }
if not defined generations   { $generations = 100 }
if not defined packet_length { $packet_length = 1000 }
if not defined packet_count  { $packet_count = 1000 }
if not defined thread_count  { $thread_count = 4 }
if not defined output_level  { $output_level = 0 }

$system = viper
$con = $system.console.con

# Script branch that will run the program and wait
# for it to complete by watching the target serial console
$prog_name = "/mnt/rule30_threaded.elf"
$cmd = ("%s %s %d %d %d %d %d\n" % [$prog_name,
        $opmode, $packet_count, $generations,
        $packet_length, $output_level, $thread_count])
script-branch {
    local $system = $system
    local $con = $con
    local $cmd = $cmd
    local $prompt = "~]# "
    add-session-comment "Starting run"
    $con.input $cmd
    $con.wait-for-string $prompt
    add-session-comment "Run finished"
    stop
}
Code 1. Example Simics Target Automation CLI Script
Source: Wind River, 2013
Using Simics scripts, it is easy to automate and replicate the setup of even the most complex target systems. Multiple machines, boards, and networks can all be set up, configured, and reliably reproduced. Compared to configuring hardware lab setups for even small networks, Simics can save hours and days of setup time.

The article "Using Simics in Education" mentioned earlier describes how network topologies are automatically generated in order to support networking training, providing a typical example of the power of Simics scripting to automate system setups.
The Simics debugger obviously supports reverse debugging, as well as user operations that arbitrarily change the target's state and time. Simics has the ability to trace or put breakpoints on aspects of the target that are inaccessible on the hardware, such as hardware interrupts, processor exceptions, writes to control registers, device accesses, arbitrary memory accesses, software task switches, and log messages from device models. In Simics it is possible to single-step interrupt handling code and to stop an entire system, consisting of multiple networked machines, synchronously.
As Simics models the actual hardware and runs the OS code just like the physical hardware would, it does not directly know anything about the OS. Indeed, it is not necessary to run an OS on a Simics model; bare-metal code is commonly used for low-level tasks. Thus, for Simics to be able to provide advanced features based on the OS running on the target, a feature known as OS awareness is necessary. OS awareness provides the user with a full software perspective of the system, in addition to the hardware perspective. OS awareness allows Simics to investigate the state of the target system and resolve the current set of executing threads and processes. The OS awareness module for a particular OS knows the layout and structure of things like process descriptor tables and run queues, and can provide the debugger with information about the currently running processes and threads. OS awareness lets the Simics debugger, scripts, and extensions act when programs and processes are started, terminated, or switched in and out.
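To make the mechanism concrete, here is a toy sketch in plain Python (not the Simics OS-awareness API) of the kind of work an OS-awareness module performs: walking a task list in target memory using a known descriptor layout. The struct offsets and the read_mem helper are invented for illustration.

    # Toy OS-awareness sketch (not the Simics API): recover running tasks
    # by walking a linked list of task descriptors with an assumed layout.
    PID_OFFSET  = 0x0   # assumed offset of the pid field
    NAME_OFFSET = 0x8   # assumed offset of a 16-byte name field
    NEXT_OFFSET = 0x18  # assumed offset of the next-task pointer

    def list_tasks(read_mem, head):
        """read_mem(addr, size) -> bytes is supplied by the simulator."""
        tasks, addr = [], head
        while addr:
            pid = int.from_bytes(read_mem(addr + PID_OFFSET, 4), "little")
            name = read_mem(addr + NAME_OFFSET, 16).split(b"\0")[0].decode()
            tasks.append((pid, name))
            addr = int.from_bytes(read_mem(addr + NEXT_OFFSET, 8), "little")
        return tasks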
The debugger leverages OS awareness to allow debugging of individual applications and threads, as well as stepping up and down through the software stack layers. Symbolic debug information can be attached to processors for bare-metal debug and to software stack contexts (like a kernel or user application) for debugging only a certain part of the software system.

The Simics debugger is a full-system debugger, meaning that it connects to the entire target system and not just a single processor or board. In Figure 7, we see two target machines inside a single debug session (server_p and
Checkpointing
Simics has been designed from the ground up to support checkpointing of the simulation state. This gives Simics the ability to save the complete state of a simulation to disk and later bring the saved state back and continue the simulation without any logical interruption from the perspective of the hardware model and the target software.
Checkpoints contain the state of both the hardware and the software (which is implicit in the hardware state, as it is described by the contents of memory, disks, CPU registers, and device registers). Based on our experience, Simics checkpoints are portable across time and space, and let users do things like the following:

- Restore the simulation state from a previous run for the same user on the same machine as the checkpoints were taken. This helps an individual user work more efficiently.
- Restore on a different host machine. This means that checkpoints can be shared between users, enabling all kinds of collaboration.
- Restore into an updated version of the same simulation model. This makes it possible to use checkpoints taken with older versions of a model, making them portable across time.
- Restore into a completely different simulation model that uses the same architectural state. For example, a detailed clock-cycle driven model can be initialized from a fast Simics run.
- Replay a particular sequence of inputs captured in one simulation session into a second simulation session.
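The reason checkpoints can capture everything is that a deterministic simulator's state is fully explicit. The toy model below, a sketch of the concept rather than the Simics checkpoint format, serializes the complete machine state and restores it, after which execution continues with no logical interruption.

    # Toy checkpoint sketch (not the Simics format): explicit state plus a
    # deterministic step function means save/restore is exact.
    import json

    class ToyMachine:
        def __init__(self):
            self.state = {"pc": 0, "regs": [0] * 8}

        def step(self, inp):
            # Same state + same input always yields the same next state.
            self.state["regs"][0] = (self.state["regs"][0] + inp) & 0xFF
            self.state["pc"] += 1

        def save(self, path):
            with open(path, "w") as f:
                json.dump(self.state, f)

        def restore(self, path):
            with open(path) as f:
                self.state = json.load(f)

    m = ToyMachine()
    for i in range(10):
        m.step(i)
    m.save("ckpt.json")
    m2 = ToyMachine()
    m2.restore("ckpt.json")  # continues with no logical interruption
    assert m2.state == m.state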
Checkpointing can be used to support workflow optimization, such as a
nightly boot setup where target system configurations are booted as part of a
nightly build, and checkpoints saved. During the workday, software developers
simply pick up checkpoints of the relevant target states, with no need to boot
the target machines themselves.
Another important use of checkpointing is to package bugs and communicate them between testing and engineering, between companies, and across the world. Simics checkpoints make the reproduction of the bug and the environment needed to reproduce the bug trivial.[2]
Such changes can be made at will from scripts, the Simics command line, and the System Editor in Eclipse. This dynamic nature of a system is necessary to support system-level development work and to support the dynamic nature of the target systems discussed in the introduction.
The most basic connections are the serial and graphics consoles provided with
Simics to allow a user to interact with the simulated computer system. It is also
quite common to connect Simics-simulated machines to the real world via Ethernet
networks and serial ports, using various real-network systems to bridge between the
physical world and the virtual system (as illustrated in Figure 5). In this way, Simics
target systems have been used in hardware-in-the-loop simulations.
Today, it is very common to use simulation of the physical world during the design of products like vehicles, spacecraft, and engines. Such environment simulators can be integrated with Simics, creating holistic models that encompass all aspects of the target system. Essentially, hardware-in-the-loop is replaced by simulation-in-the-loop, making it possible for any developer to have a complete cyber-physical system on their desk for software testing and debugging.

Figure 8 shows how such a setup is achieved. On the Simics computer side, there need to be models of the actual devices the computer uses to sense and control its environment. Then, the environment model is either run inside of Simics, or (more commonly) in a separate process communicating with a proxy module in Simics over a network socket or other inter-process communication mechanism.
(Figure 8: Environment simulation setup. The application and target OS run on the simulated target inside Simics, with an environment simulation module feeding sensor input; everything runs on the host OS and host hardware.)
The Simics memory model is similar to the SystemC TLM 2.0 loosely timed (LT) model[5] in that a memory access is a blocking call. However, the Simics model is a special case of the LT model with a zero time delay; this is sometimes referred to as software-timed (ST) or programmer's view (PV). The different common timing models are illustrated in Figure 9.
(Figure 9: Common timing models. In a bus cycle accurate (BCA) model, requests and data transfers are tied to individual clock cycles.)
For Intel architecture targets, Simics also takes advantage of the Intel
Virtualization Technology (Intel VT) found in Intel processors to run the
target code directly on the host. This makes it possible to achieve performance
close to native speeds.
Multithreaded Simulation
Simics makes use of multiple host processor cores to simulate the target system. When running in multithreaded mode, Simics still implements precisely the same target semantics and behavior as when running single-threaded. This means that the simulation behavior is independent of the host, and that simulation repeatability is maintained as checkpoints and setups are communicated between Simics users.
When the processor power (or memory) of a single host is insufficient to run a
large system, Simics can also use distributed simulation. In such a setup, multiple
Simics processes running on different hosts are connected into a single coherent
and time-synchronized simulation system. Multithreading lets Simics take
advantage of scale-up of individual hosts, and distribution takes advantage of
scale-out as more simulation hosts are added. See the article by Rechistov for an
example of scaling up and scaling out Simics to run a very large target system.
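The determinism claim rests on synchronizing processor models at fixed points in virtual time rather than in host time. The toy quantum scheduler below illustrates the idea; it is a sketch of the principle, not the Simics scheduler, and the quantum length is an arbitrary assumption.

    # Toy quantum-based scheduling sketch (not the Simics implementation).
    # Each CPU advances one quantum of virtual time, then all CPUs
    # synchronize, so results do not depend on host thread timing.
    QUANTUM = 1000  # virtual cycles per quantum (arbitrary)

    class ToyCPU:
        def __init__(self, name):
            self.name, self.vtime = name, 0
        def run(self, cycles):
            self.vtime += cycles  # stand-in for executing target code

    def simulate(cpus, total_cycles):
        for _ in range(0, total_cycles, QUANTUM):
            for cpu in cpus:       # could be farmed out to host threads;
                cpu.run(QUANTUM)   # the barrier below keeps it deterministic
            assert len({c.vtime for c in cpus}) == 1  # sync barrier

    simulate([ToyCPU("cpu0"), ToyCPU("cpu1")], 10000)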
System Modeling
Getting a model in place for a relevant target system is a prerequisite to
using Simics. Without some virtual hardware to run on, software will not be
particularly interesting.
For other users, a standard system might be sufficient. Simics ships with Quick
Start Platforms (QSP), which provide a simple idealized multicore system that
runs Wind River Linux and VxWorks by using customized BSPs. The QSP
provides serial ports, Ethernet ports, timers, and disks. The QSP will not run
the same bootrom and OS image as a real board, but it will in general run user-
level application binaries compiled for the real system. In this way, they provide
a system that gets a user started quickly and that is entirely sufficient for using
Simics features to debug, test, and analyze software applications.
Device Modeling
If a model does not exist, Simics provides the tools necessary to quickly and efficiently develop new models that can be easily integrated into existing targets. The core of building models of new hardware in Simics is the modeling of the devices found in the new hardware. As mentioned above, Simics provides a C API and ABI, meaning that models written in almost any language can be integrated into Simics. However, Simics also provides its own domain-specific language, the Device Modeling Language (DML), which is specifically developed to allow rapid development of robust device models for Simics. Besides DML, the other most commonly used languages for creating device models are C, C++ (including SystemC), and Python.
The article "Device Driver Synthesis" by Mona Vij et al. describes a creative use of the device models created for Simics. They use DML models of hardware, created as part of the hardware design process, as an input to a tool that creates device drivers.
the Sample Device browser below it. The Eclipse New Sample Device
wizard creates new devices, and other modules, based on the examples
provided with Simics. The Sample Device view lets you look at the
example code without necessarily creating a new device model in your own
workspace. This is quite convenient when borrowing functionality from an
example.
Component System
To aid configuration and management of a simulation setup, Simics has the
concept of components. Components describe the aggregations of models
that make up the units of a system, such as disks, SoCs, platform controller
hubs, memory DIMMs, PCI Express* cards, Ethernet switches, and similar
familiar units. They carry significant metadata and provide a natural map of
a system.
Figure 12 shows a stylized example of a component hierarchy, where a system is built from two boards connected by an Ethernet network. At each level of the hierarchy, device models can be present, as well as components. The component system provides both structure to the models and a hierarchical namespace for the running simulation, making it easy to reuse components and devices without any risk of names clashing.
(Figure 12: A stylized component hierarchy, with a system containing components such as an SoC, DDR RAM, an FPGA, and flash.)
Components usually have parameters like the number of processor cores to use in a processor, the size of memories, the clock frequency of a core, or the MAC addresses of an Ethernet controller. Components provide the standard way to create Simics simulation setups, and a normal Simics setup script simply creates a number of components and connects them together.
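The sketch below illustrates the component idea in plain Python (not the Simics component API): a few parameters expand into a hierarchy of named sub-objects, which is all a setup script needs to touch. The class and parameter names are invented.

    # Illustrative component sketch (not the Simics component API):
    # parameters expand into a hierarchical namespace of device objects.
    class SoCComponent:
        def __init__(self, name, num_cores=4, freq_mhz=1000, mem_mb=2048):
            self.name = name
            self.objects = {}
            for i in range(num_cores):
                self.objects["%s.core[%d]" % (name, i)] = ("cpu", freq_mhz)
            self.objects["%s.ram" % name] = ("memory", mem_mb)

    # A "setup script" simply creates components and connects them:
    soc0 = SoCComponent("board0.soc", num_cores=8, freq_mhz=1200)
    soc1 = SoCComponent("board1.soc", num_cores=2, mem_mb=512)
    # Hierarchical names keep soc0 and soc1 objects from clashing.
    assert not set(soc0.objects) & set(soc1.objects)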
Summary
In this introductory article, we have presented the Simics technology along
with some of its use cases and features. The following articles in this issue of the
Intel Technology Journal will describe particular ways in which Simics has been
used at Intel, Wind River, and in academia.
References
[1] "Introspection of a Java Virtual Machine under Simulation," Tech. Rep. SMLI TR-2006-159, Sun Labs, 2006.
[7] Magnusson, P., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., Werner, B.: "Simics: A full system simulation platform." Computer 35(2), 50-58 (2002).

[8] Jakob Engblom, Daniel Aarno, and Bengt Werner, 2010. "Full-System Simulation from Embedded to High-Performance Systems." In Processor and System-on-Chip Simulation, Rainer Leupers and Olivier Temam (eds.), Springer New York Dordrecht Heidelberg London, ISBN 978-1-4419-6174-7, DOI 10.1007/978-1-4419-6175-4.
Author Biographies
Daniel Aarno is an engineering manager in Intel's Software and Services Group, where he leads a team working on the Simics full-system simulation product. Daniel joined Intel in 2010 through the acquisition of Virtutech.
Using Virtual Platforms for BIOS Development and Validation

Contributor
Steve Carbonari, Intel Corporation

Creating a simulation environment for the purpose of BIOS debugging and validation requires in-depth simulation models. A separation of initialization versus runtime models is required to optimize performance, improve the accuracy of simulated initialization, and improve the debugging environment. The usage model for BIOS debugging and validation can be broken up into two categories: pre-silicon (before initial hardware is available) and post-silicon (after initial hardware is available). Specific debugging features are required to debug BIOS programs due to the large volume of interaction with the system hardware, its impact on the simulation environment, and a desire to replicate the power-on environment interfaces. Specific attention should be given to signaling a software programming error as soon as possible. In addition, specific simulation techniques need to be applied for BIOS memory reference code support. Lastly, using the simulation for validation requires configuration flexibility and fault injection to fully validate all paths within the BIOS being validated. This article describes the high-level concepts and additional depth of modeling used to approach debugging and validating BIOS with simulation tools. Although the context of the article is BIOS development and validation, the concepts can be applied to simulation for any firmware project.
Introduction
BIOS (Basic Input/Output System) refers to the software that runs after initial power is applied to the computer platform. The primary function of BIOS is to initialize all hardware components to enable the computer platform to run higher-level software (such as an operating system). Developing, debugging, and validating BIOS using software simulation demands additional functionality and simulation depth. Particular attention needs to be given to signaling errors at the time of register write, platform configurability, and mechanisms for fault injection. This article describes the basics of separating simulation runtime versus initialization, pre-silicon versus post-silicon usage models, development and debugging in the context of BIOS (specifically memory and processor interconnect initialization code), and BIOS validation requirements.
The initial state of platform hardware prior to BIOS execution will be referred to as the initialization state. The BIOS executes to initialize the rest of the hardware in preparation for the operating system to run. The state entered after BIOS has executed and before the operating system is running will be referred to as the runtime state. Refer to Figure 1.
(Figure 1: Simulation scope across the boot flow from BIOS to operating system. A function-based simulator covers CPU initialization, while platform-based simulation covers inter-processor interconnect initialization, memory-mapped routing, and memory initialization.)
In the initialization state only minimal hardware components are initialized. The goal is to enable the hardware components required to enable fetching and executing BIOS, such as:

- CPU cores are initialized
- A path to the BIOS flash is established
- Processor interconnects are initialized and available in slow-speed mode with minimal routing
In the runtime state the hardware is fully functional for use by an operating system. All hardware components are discovered and configured. In this state several interfaces are transparent to the operating system and hidden by the system memory map:

- Socket interconnects
- Memory channels and interleaving
- Memory type and speed
- Hardware testing interfaces
Many software simulation tools can provide a platform-level simulation interface
to support operating system and driver development. However, to support
BIOS development the simulation tool must simulate the initialization state to a
sufficient depth to support the configuration of a variety of hardware components
that are not required (transparent) during the runtime state. In addition,
to support a seamless boot with optimal performance, the transition from
initialization state to runtime state must be transparent to software and BIOS.
The Wind River Simics* implementation of memory spaces[1] provides a mechanism that supports the transition of components from initialization to runtime. When instructions are executed they access addresses. These addresses are resolved using Simics memory spaces. If an address does not exist in the Simics memory space infrastructure, an error is reported. The dynamic nature of memory spaces allows them to be added and removed during the simulation run. This provides the ability to add memory spaces only after the underlying components have been fully initialized. More detail on how this mechanism is used to support BIOS development is described in the section "Development and Debugging BIOS."
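A toy sketch of this mechanism in plain Python (not the Simics memory-space API): mappings are installed only after the underlying component is initialized, and any access to an unmapped address is reported as an error, which is what catches BIOS code touching hardware it has not yet set up.

    # Toy memory-space sketch (not the Simics API): mappings can be added
    # and removed at runtime; unmapped accesses are reported as errors.
    class ToyMemorySpace:
        def __init__(self):
            self.maps = []  # (base, size, device) tuples

        def add_map(self, base, size, device):
            self.maps.append((base, size, device))

        def read(self, addr):
            for base, size, device in self.maps:
                if base <= addr < base + size:
                    return device.read(addr - base)
            raise RuntimeError("access to unmapped address 0x%x" % addr)

    class ToyDRAM:
        def __init__(self, size):
            self.data = bytearray(size)
        def read(self, offset):
            return self.data[offset]

    space = ToyMemorySpace()
    dram = ToyDRAM(0x1000)
    # space.read(0x100) here would raise: decoder not yet programmed.
    space.add_map(0x0, 0x1000, dram)  # added only after init is verified
    assert space.read(0x100) == 0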
Usage Models
The two primary usage models for BIOS development are pre-silicon and post-
silicon. Both environments present unique requirements on the simulation tool
used.
(Figure 2: Pre-silicon BIOS development flow, iterating as registers or specifications change.)

feature set development and timelines is required (see Figure 2 for the pre-silicon BIOS development flow). In this stage:
Register Modeling
BIOS development starts when initial specifications and register definitions are available. The initial register definitions and specifications are incomplete and change frequently during the pre-silicon phase. Consequently, to avoid delays in BIOS development and minimize the frequency of simulator releases, specific aspects of register modeling need to be user configurable, such as:

- Default values
- Attributes (Read/Write, Read Only, Sticky, and so on)
- Field definition
- Offset (where in PCI space the register resides)
Memory register sets can be very large and complex. In addition, due to the relatively fast pace of technology changes, the register interface changes frequently. BIOS software must adapt to the register changes during the hardware design. As the design changes, giving the BIOS engineer the ability to change register values or get an updated release with new registers in a timely manner is critical. Ideally the simulator should provide a user-configurable mechanism to change all the register values prior to starting the simulation. For example, a user-editable file that contains register definitions is read and configured at simulation startup.
In addition, to aid the BIOS with discovering register issues, the simulator tool
must provide flexible logging of all register accesses.
In Simics, default values are made user configurable via the attribute mechanism, and a variety of register logging mechanisms exist. However, the other register requirements are not user configurable. To mitigate this for BIOS development, scripting has been developed to take register definition files in XML format and convert them to Device Modeling Language (DML) register definition code[2]. This enables a very quick turnaround and re-release of the simulator for register updates.
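A minimal sketch of such a conversion script is shown below. The XML schema and the exact DML emitted are illustrative assumptions, not the production tool or a complete DML device.

    # Sketch of an XML-to-DML register generator; the XML schema and the
    # emitted DML are simplified assumptions, not the production tool.
    import xml.etree.ElementTree as ET

    XML = """<bank name="regs">
      <register name="ctrl"   offset="0x00" size="4" default="0x0"/>
      <register name="status" offset="0x04" size="4" default="0x1"/>
    </bank>"""

    def xml_to_dml(xml_text):
        bank = ET.fromstring(xml_text)
        lines = ["bank %s {" % bank.get("name")]
        for reg in bank.findall("register"):
            lines.append(
                "    register %s size %s @ %s "
                "{ parameter hard_reset_value = %s; }"
                % (reg.get("name"), reg.get("size"),
                   reg.get("offset"), reg.get("default")))
        lines.append("}")
        return "\n".join(lines)

    print(xml_to_dml(XML))  # regenerate and rebuild on each register update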
- Register side-effect code is updated to check the values of the register write to confirm the BIOS values are correct.
- Memory spaces are not added until it can be verified that the underlying simulated hardware components have been initialized properly.
- Interfaces between the register side effects and internal component modules are defined so the register side effects can query the state of the underlying component.
Several key enhancements to a simulation model's SAD register write side effect can be made to specifically support BIOS debug:

- Simulation memory spaces are mapped (enabled) only when BIOS writes the SAD register and sets the enable bit.
- The values written to DRAM decoders by BIOS are checked to confirm the underlying memory has been initialized. The simulator has knowledge of the amount of memory available, so when the BIOS programs a SAD for greater than the amount of memory available, the simulator signals an error at the time of the register write to the SAD. Note that it is perfectly legal to program the SAD for less memory than what is available.
- Processor interconnect routes from the source processor's SAD being programmed to the destination processor are verified using a query interface to the processor interconnect simulation module. This mechanism enables processor isolation and bring-up of complex multiprocessor topologies. (A sketch of such a write side effect follows this list.)
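The sketch below illustrates such a write side effect in plain Python (a DML implementation would be analogous); the decoder layout, limits, and helper names are invented for illustration.

    # Illustrative SAD write side effect (names and layout invented):
    # validate BIOS-programmed values at the time of the register write.
    class ToySAD:
        def __init__(self, installed_bytes, route_ok):
            self.installed = installed_bytes
            self.route_ok = route_ok  # query into the interconnect model
            self.enabled = False

        def write(self, limit_bytes, target_socket, enable):
            if limit_bytes > self.installed:
                raise RuntimeError(
                    "SAD programmed for 0x%x bytes, only 0x%x installed"
                    % (limit_bytes, self.installed))
            if not self.route_ok(target_socket):
                raise RuntimeError(
                    "no interconnect route to socket %d" % target_socket)
            if enable:
                self.enabled = True  # only now map the memory space

    sad = ToySAD(installed_bytes=2**31, route_ok=lambda s: s in (0, 1))
    sad.write(limit_bytes=2**30, target_socket=1, enable=True)  # legal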
Modifying the simulator SAD implementation provides the following key benefits to BIOS development and debug:
that changes from one platform to the next. The initialization of memory encompasses several stages, from reading basic DIMM data over SMBus and DDR initialization, to initializing clocks and timing parameters, to programming the SAD.
The initialization of memory must be done with specific steps in a specific order. This ordering can be at the DIMM level, channel level, and memory controller level. Consequently, in-depth state machines are required by the simulation to enforce any required memory initialization ordering. In addition, the state machines may cross component boundaries. Applying power may require interfacing with a power control unit on the board instead of the memory controller directly. Simulation of the following is required for BIOS memory initialization:
(Figure: DDR DRAM state machine, covering power application, reset procedure, initialization, write leveling and Vref/DQ training, ZQ calibration, idle, refreshing and self-refresh, activating, bank active, reading and writing, precharging, and power-down states, with automatic and command sequences.)
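The toy state machine below (plain Python, with states and ordering heavily simplified from the figure) shows how a simulator can enforce initialization ordering: each step is legal only from specific predecessor states, so BIOS code that skips a step triggers an error at the offending access.

    # Toy DRAM-init ordering check (heavily simplified; real ordering
    # follows the DDR specification and may span components).
    LEGAL = {
        "off":         {"apply_power"},
        "powered":     {"reset"},
        "reset_done":  {"init"},
        "initialized": {"zq_calibration"},
        "calibrated":  {"idle"},
    }
    NEXT = {"apply_power": "powered", "reset": "reset_done",
            "init": "initialized", "zq_calibration": "calibrated",
            "idle": "idle"}

    class ToyDIMM:
        def __init__(self):
            self.state = "off"
        def do(self, step):
            if step not in LEGAL.get(self.state, set()):
                raise RuntimeError("BIOS error: '%s' illegal in state '%s'"
                                   % (step, self.state))
            self.state = NEXT[step]

    d = ToyDIMM()
    for step in ["apply_power", "reset", "init", "zq_calibration", "idle"]:
        d.do(step)  # skipping "reset" would raise at that point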
The analog results for DDRIO are usually provided in results registers. The simulator must provide support for the BIOS engineer to customize the results of the registers holding the analog data. A key advantage of using a simulation tool over hardware is the ability to change hardware return values. In the case of BIOS memory code, the ability to provide user-customizable data to the simulation to return invalid and boundary-condition values for analog data provides an effective means to test the BIOS memory training code in ways that cannot be done on hardware. See the sections "Configuration Flexibility" and "Fault Injection." The interface should be easy to use and understand due to the complexity inherent in the several stages of analog training the BIOS memory reference code (MRC) performs.

The extent of simulator support related to BIOS MRC validation and timing is localized to platform registers that return timing information for the purposes of hardware initialization.
(Figure: Supported multiprocessor topologies, ranging from 1S and 2S up to 6S and 8S configurations.)
Debugging Tools
The primary tool used by BIOS engineers for platform debugging after hardware power-on (post-silicon) is the ITP (In-Target Probe). An ITP is a tool used to control the target hardware at the register level. The ITP tool allows full control of the target hardware with access to all chipset registers, processor

The key requirements of debugging tools used for BIOS debugging are:
Multiprocessor Support
In large multiprocessor systems there are many components that are initialized
in parallel. A simulation tool must provide multiprocessor support to enable
testing of parallel flows.
Validation of BIOS
Validating BIOS using simulation tools in both pre-silicon and post-silicon
environments requires specific features to be provided by the simulation tool.
The goals of the features are to provide mechanisms to maximize BIOS code coverage and to automate back-end testing.
Configuration Flexibility
To support the validation of BIOS, all possible supported configurations must be supported. In addition, an efficient, simple interface must be provided so configurations can quickly be changed during development and testing.

A Python-based tool was developed using the Simics Extension Builder package to provide a graphical user interface to easily configure Simics based on parameters in a platform-specific configuration file. The tool creates a customized Simics session script based on the parameters set by the user. Options displayed are based on the platform-specific configuration file to ensure only supported configurations are configured. The tool supports changing:
- Platform
- Number of processors
- Number of cores
- Number of threads
- Memory topologies
- Memory DIMM types, sizes, and so on
- Processor interconnect topology
- Specialized modes (such as manufacturing mode)
The tool updates available options based on other selections. For example,
depending on the platform selection, the available options for processor, cores,
and so on will automatically be updated. See Figure 7.
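A sketch of the session-script generation step is shown below, assuming a hypothetical parameter table and emitted CLI variable names; the real tool is GUI-driven and reads a platform-specific configuration file.

    # Sketch of generating a Simics session script from chosen parameters.
    # The platform table and emitted CLI variables are illustrative.
    ALLOWED = {"platform_x": {"processors": [1, 2, 4], "cores": [2, 4, 8]}}

    def make_session_script(platform, processors, cores, memory_gb):
        allowed = ALLOWED[platform]
        if processors not in allowed["processors"] \
                or cores not in allowed["cores"]:
            raise ValueError("unsupported configuration for %s" % platform)
        return "\n".join([
            '$platform = "%s"' % platform,
            "$num_processors = %d" % processors,
            "$num_cores = %d" % cores,
            "$memory_gb = %d" % memory_gb,
            'run-command-file "%s.simics"' % platform,
        ])

    print(make_session_script("platform_x", 2, 8, 64))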
Fault Injection
Validation requires the ability to inject faults into the system as part of BIOS
regression testing. These faults can range from PCIe parity errors, network
errors, and memory errors, to disk failures, processor interconnect link failures,
and power and thermal issues.
The Simics memory space infrastructure can be used to create a fault injection module without affecting runtime performance under normal circumstances.[8] In this case, a private memory space is created for the specific device into which you wish to inject the error, and is attached to the main memory space. Accesses through the private memory space go through a fault injector module prior to issuing the request. Scripting can be used to map or de-map the private memory space. The injection module can be implemented with scripts or as a DML module. A DML module is preferred since it will support checkpointing and reverse execution.
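A toy sketch of the pattern in plain Python (not the Simics memory-space API): a private space wraps the device behind a fault injector, and mapping the wrapper in or out at runtime turns injection on and off with no overhead in the normal path.

    # Toy fault-injection proxy (illustrative, not the Simics API).
    class Device:
        def read(self, offset):
            return 0x42

    class FaultInjector:
        """Private space wrapping one device; flips a bit on bad offsets."""
        def __init__(self, device, bad_offsets):
            self.device, self.bad = device, set(bad_offsets)
        def read(self, offset):
            value = self.device.read(offset)
            return value ^ 0x1 if offset in self.bad else value

    main_space = {0x1000: Device()}                  # normal mapping
    main_space[0x1000] = FaultInjector(main_space[0x1000], {0x0})
    assert main_space[0x1000].read(0x0) == 0x43      # injected error
    main_space[0x1000] = main_space[0x1000].device   # de-map: normal again
    assert main_space[0x1000].read(0x0) == 0x42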
Automation
The simulation tool must support the ability to configure and run a variety of configurations in an automated fashion. This enables automated regression tests to be run as part of the BIOS nightly builds and supports validation teams that are creating validation suites for hardware.
To improve the overall BIOS stability and enable the detection of defects
earlier in the BIOS development schedule, a server farm can be used to
automate simulator testing on every BIOS check-in. The implementation
incorporates a source control system, build infrastructure, database subsystem,
test launch and monitoring infrastructure, and a simulator server farm running
on multiple virtual machines (VMs). See Figure 8.
(Figure 8: Automated test infrastructure. A test server drives test clients running Simics in multiple virtual machines under a Xen manager, supported by a build server, a database subsystem, a file server, and Active Directory (DHCP, DNS) services.)
6. The simulator performs all the tasks of the specified job request initiated
from the test server.
7. After the job is complete, the simulator pushes the logs to the file server for
examination by the test server.
The infrastructure can use the package update utility tool described in the
section Register Modeling and the configuration utility described in the
section Configuration Flexibility to update packages and create session scripts
for automated Simics runs.
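A test client in such a farm reduces to a small polling loop. The sketch below is illustrative only: the job-queue client, the simics command line, and the log-server layout are all assumptions, not the infrastructure described here:

# Hypothetical test-client loop for the automated BIOS regression farm.
import shutil
import subprocess
import time

def serve(queue, log_root):
    while True:
        job = queue.fetch()              # assumed job-queue client API
        if job is None:
            time.sleep(60)               # no pending BIOS check-ins
            continue
        # Run the generated session script headlessly for this job.
        result = subprocess.run(["simics", "-batch-mode", job.script],
                                capture_output=True, text=True,
                                timeout=job.timeout_seconds)
        log_path = "%s/%s.log" % (log_root, job.job_id)
        with open(log_path, "w") as f:
            f.write(result.stdout)       # push logs for the test server
        queue.report(job, result.returncode)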
A server test farm is critical in both pre- and post-silicon development to ensure a high-quality BIOS. In addition, in a post-silicon environment it can expose differences between the simulation and the initial power-on hardware that may be due to simulation bugs, discrepancies in specifications, BIOS bugs, or hardware bugs.
Summary
Creating a simulation tool to support development, debugging, and validation
of BIOS presents specific requirements. BIOS requirements change depending
on whether development is in the pre-silicon or the post-silicon environment.
The initialization state that the BIOS runs in requires more features, depth, and debugging aids than runtime-state-based software such as an operating system.
References
[1] Virtutech. Modeling Your System in Simics. Revision 3004, 2009. http://www.virtutech.com/files/manuals/modeling-your-system-in-simics_0.pdf
Author Biography
Steve Carbonari is a senior software engineer at Intel. He is currently working
on architecting Simics simulation environments for BIOS development and
debugging. He has ten years' experience in UNIX kernel development and over ten years' experience in system software and architecture. He holds an MS degree
in Computer Science and a BS degree in Mathematics/Computer Science.
Simics* SystemC* Integration
Contributors
Asad Khan, Data Center Group, Intel Corporation
Chris Wolf, Data Center Group, Intel Corporation

In this article we discuss the integration of SystemC* architecture models with Intel's Simics Virtual Platform technology. Using a proof of concept, integration and synchronization steps and the corresponding performance challenges of the integrated platform are highlighted and solutions developed. A logging and a save/restore SystemC API are added for seamless integration into Simics and to take advantage of Simics checkpointing. Second-level integration optimizations are implemented in the form of temporal decoupling. A complete software stack is ported to the integrated platform for system validation and software use cases, including BIOS and OS boot, firmware and driver development, and application stack execution, demonstrating early software development with virtual platform (VP) methodologies.
Introduction
Simics[3] is the tool of choice for virtual platform modeling for many of the
ongoing projects within Intel to enable software shift-left initiatives. Simics
supports functional modeling at speeds of the order of 10-100 million
instructions per second (MIPS) for fast OS, firmware boot, and software
development. While Simics provides virtual platform models for many of the
mainstream Intel architecture core/uncore based subsystems, and board level
platform models, there are still many internal and external intellectual property
components that need to be developed using SystemC[4], for the reasons of
modeling fidelity and the use of standardized modeling environments. IEEE 1666 SystemC (Accellera) is the de facto industry standard for both functional and performance modeling at the system level.[5][6] Many teams within Intel
are doing their model development using SystemC, while the same is true
for intellectual property models developed in the industry by big and small
intellectual property houses alike.[7] Besides standardization, SystemC also has the advantage that, through the use of advanced modeling artifacts, models can be developed to serve both the functional and performance needs of designers and software architects. These models can then be integrated with Simics
to enable a complete platform for full software stack debug and development.
There are different ways in which a SystemC model can be simulated with
Simics for an integrated platform. Either the SystemC kernel can be controlled
by Simics, or run independently of Simics. This article addresses Simics
controlling the SystemC kernel.
The article also deals with temporal decoupling between the two simulation engines and how temporal decoupling impacts performance. The co-simulation still runs as a single thread, with Simics controlling the SystemC engine, but SystemC is set up as a master in itself by assigning a time slice to it and is scheduled through Simics. The two simulation environments are temporally decoupled by running SystemC for a fraction of the duration of the time slice. Performance improvements achieved through temporal decoupling are presented, along with why temporal decoupling makes sense. Case studies are used to corroborate the different methodologies and corresponding performance improvements.
Figure: The integrated platform. A Simics platform model (IA core, north bridge, bus fabric, PCH, system DRAM) connects through the Simics/SystemC bridge, over IO/memory and PCI interfaces, to a SystemC device model: a PCIe-endpoint acceleration complex. Interrupts and system-memory accesses flow back from the SystemC side. (Source: Intel Corporation, 2013)
The SystemC device model in our example system encapsulates PCIe endpoints
incorporating PCIe configuration registers, along with the device functional
model. The device is accessed through its memory or I/O space registered using
the PCIe BAR configuration registers. Any upstream transactions from the
device are sent to the Simics platform through the TLM-2.0 interfaces.
deadlocks and starvation. While the integration methodology discussed here poses
no restriction on SystemC modeling for integration with Simics, it does present
different challenges to the overall platform based on the type of SystemC models
integrated. Simics, being a functional simulator, prefers the SystemC models to be functional; however, it does not impose this restriction. For this discussion we will assume a modeling semantic based only on events, removing the performance penalty and redundancy imposed by clocked models. An event-based methodology can capture every event of significance for both functionality and performance. In the following discussion we describe the performance, the corresponding optimizations, and their side effects for SystemC models.
Performance Optimizations
Simics VP functional simulations run at speeds of the order of 10–100 MIPS. Simics employs a non-preemptive time-slice model for any master modules, where instructions of the order of 100,000 are assigned to a given master per time slice before control is passed to another master. Another factor contributing to speed is zero-delay blocking transactions with no side effects. On the other hand, cycle- or clock-based simulation, as in the case of the SystemC device model, achieves speeds only of the order of hundreds of kilocycles per second (KCPS).
The SystemC model was a complete subsystem with its own firmware and hardware
and not a simple memory-mapped device. While SystemC models with timing
granularity of the order of clock periods are to blame, the end result is a direct
consequence of the need to context-switch between Simics and SystemC at
every event that needs to be triggered. This slowdown is worse for clock-based
models, and is improved but not eliminated using an event-based methodology
as discussed in the section Temporally Decoupled VP Performance.
There are side effects to slowing down the SystemC model compared to overall
platform speed. One obvious side effect is the slowdown of any stack that
uses the SystemC models, because it takes longer to run. However, this cost is bearable because the platform boots much faster compared to a non-scaled model. There are other side effects as well. Since the SystemC model is running slower, any timeouts for code running on the platform may need to be increased corresponding to the clock scaling factor, to prevent premature timeout expiration. Another side effect relates to any tight polling loops in code running on the core monitoring status registers on the SystemC side. Polling leads to a context switch, slowing down the simulation, especially when there is little useful work done by SystemC. The polling frequency of these status registers had to be reduced because of the slowdown of the SystemC models. This was accomplished by adding stall delays in the Simics/SystemC bridge whenever SystemC status registers were polled.
Performance Metrics
Performance optimizations discussed so far are used to determine the overall
performance gains. Base results are obtained using clock scaling, and any
gains over and above that are highlighted for polling and SystemC process
optimizations. Table 1, Performance Optimizations in Numbers, represents one set of data for a set of software tests running on the platform and actively exercising the Simics/SystemC interface. Other results would vary depending on the type of the models and the frequency of Simics/SystemC interaction. However, the numbers represent a general trend, highlighting that the performance optimizations implemented make significant gains in overall Simics/SystemC VP performance.
Table 1: Performance Optimizations in Numbers (data sets include interrupt mode and driver). (Source: Intel Corporation, 2013)
The matrix represents two sets of data: setup time for the SystemC device model, and the test execution time. Columns represent the performance comparisons for the two sets with and without stall cycles. Rows represent the additional SystemC process optimizations using code refactoring. The following performance improvements were observed:
A gain of 30–60 percent using stall cycles when polling status registers.
A gain of 3–15 percent through replacement of SC_THREAD() with SC_METHOD() processes.
Temporal decoupling takes advantage of the lack of timing interdependency between the two simulators by letting them go out of synch with each other. This is done by making SystemC a simulation master, assigning it an execution time slice, and placing it on the Simics event calendar. When SystemC gets scheduled by Simics, it is run for a fraction of the time slice, a number that can be dynamically changed during simulation through a Simics attribute. The idea is to let SystemC get more execution cycles during busy periods, and only short durations during idle periods. No SystemC events are posted on the Simics event queue, leaving only time slicing to schedule SystemC.
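In Simics scripting terms, the scheduling loop reduces to something like the following sketch. The event-posting calls follow the Simics API, but the bridge object, its run_fraction attribute, and its run_systemc() entry point are hypothetical stand-ins for the real bridge:

# Hypothetical sketch of temporally decoupled SystemC scheduling.
TIME_SLICE = 0.001   # SystemC is a master with a 1 ms time slice

def run_systemc_slice(bridge, data):
    # Run the SystemC kernel for only a fraction of the slice; the
    # fraction is an attribute that can be changed during simulation
    # (larger in busy periods, smaller in idle ones).
    bridge.iface.sc_bridge.run_systemc(TIME_SLICE * bridge.run_fraction)
    # Re-post ourselves: time slicing alone schedules SystemC, and no
    # SystemC events are placed on the Simics event queue.
    SIM_event_post_time(conf.cpu0, slice_ev, bridge, TIME_SLICE, None)

slice_ev = SIM_register_event("sc-time-slice", None, Sim_EC_Notsaved,
                              run_systemc_slice, None, None, None, None)
SIM_event_post_time(conf.cpu0, slice_ev, conf.sc_bridge, TIME_SLICE, None)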
A side effect of temporal decoupling is that Simics time runs ahead of SystemC
time, and the aggregate time difference between the two schedulers keeps
increasing as the simulation progresses, except for the case when SystemC runs
for the entire time slice duration. Another consequence is that SystemC device
interrupts encounter a scheduling delay up to a maximum of the time slice
duration. This is not a problem for most functional VPs, except when there are
real-time performance requirements.
SystemC is allocated a fixed time slice of 1 msec, while the run duration of
the time slice is changed as shown along with the scale factor. It is pertinent
to note that a temporally coupled simulation took 19 minutes to boot. A
temporally decoupled model with no scaling (scale factor of 1) achieves the
same state in between 23 and 24 minutes. These numbers should be used only as a reference for the trends, and not as a benchmark of overall simulation speed, since speeds also depend on the host OS, the host machine, and the complexity of the models.
The Simics checkpoint API is a complete API for saving/restoring the state
of a model. However, it is intrusive if added to SystemC in that the models
become dependent on the Simics headers and lose their ability to be compiled
and run as standalone models. The objective of the SaveRestore API is to
make SystemC independent of the Simics checkpoint API for compiling and
running standalone. At the same time, when running as a VP with Simics, this
API ties SystemC state to the Simics checkpoint database.
The ScdSaveRestore class ties the SystemC model to the Simics database for checkpointing in a VP. This class includes Simics API headers linking the Simics
VP with the SystemC model. It derives from the SaveRestore class and overrides the base functions in SaveRestore for registering the attribute with Simics. In a Simics SystemC VP, the Simics/SystemC bridge module[1] instantiates the ScdSaveRestore class and registers it with a handler interface. This handle to the ScdSaveRestore class is passed to the SystemC model, tying the SystemC checkpoint state to the Simics database. In this setup, any SystemC model attribute registers with Simics along with its Save and Restore functions.
This setup is enough to save the steady-state behavior of the SystemC model, such as the OS boot and device initialization state. For dynamic SystemC checkpointing, the API has to be extended to include TLM-2.0 generic payloads and the payload event queue (PEQ) semantics of SystemC. Work has been done on this front[2], but it is outside the scope of this article.
When updating SystemC time this way, a significant performance drag was noticed at checkpoint restore; its duration varied based on the time for which SystemC had to be run to bring it in sync with Simics time.
The logging API (as well as the SaveRestore API) is based on a handler mechanism. A user can redefine the logging handler to intercept logging calls and redirect them to their own API. For example, in model X, the user would get a log handler object using its name and type.
AcLog = HandlerInterface<VpLog>::GetHandler("main");
This would get a pointer to the main handler. Using this handler, log
statements are implemented as follows.
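A log statement using this handler might then look like the following line; the Log() method name, severity value, and message are invented for illustration and continue the C++ sample above:

AcLog->Log(VpLog::Info, "acceleration complex: DMA transfer complete");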
The main logging handler is defined and registered with Simics within the Simics/SystemC bridge module, tying it to the Simics logging API. This API also allows us to debug SystemC models outside of Simics using the same logging calls.
Summary
In this article, the integration of a complete system-level SystemC model with a Simics VP is presented. Integration steps are highlighted, along with the performance challenges due to the detailed nature of the SystemC model, with its own firmware and functionality, and the different nature of the two simulation environments. Solutions to the performance problems of the integrated platform are proposed, leading to a working VP onto which complete software stacks can be ported for debug and development. A set of APIs is developed for checkpointing and logging of the SystemC models, capitalizing on the Simics checkpoint and logging APIs for an integrated solution. Studies show that performance of the integrated platform falls somewhere between the standalone Simics VP and a standalone SystemC platform.
Complete References
[1] Wind River Simics, Wind River Simics SystemC Bridge Programming Guide, 2010.
Author Biographies
Asad Khan is a senior staff engineer at Intel Corporation. He received his
PhD in Electrical Engineering from Northwestern University in 1996. Asad
has almost 20 years of experience working in the electronic system level
design space at Cadence, ARM, Marvell, and Intel Corporation. His current
focus is on virtual platform methodologies for system validation, software
development, and power analysis. Asad is based in Chandler, Arizona.
Chris Wolf is the Virtual Platform Enabling Group (VPEG) lead in Intel's Intelligent Systems Group's Platform Enabling and Development. He has 9 years of experience in system validation and emulation and in virtual platform methodologies. He has made significant contributions to the shift-left strategy on Rangeley, Coleto Creek, Island Cove, and Bell Creek, with a focus on virtual platform and hybrid virtual platform enablement for validation and software teams. He continues to be a strong advocate for VP usage within Intel.
Post-Silicon Impact: Simics* Helps the Next Generation of Network Transformation and Migration to a Software-Defined Network (SDN)

Contributors
Tian Tian, Datacenter and Connected Systems Group, Intel Corporation

This article focuses on the post-silicon usage of Simics, and discusses the huge potential for this technology to impact the ongoing network transformation. Today the speed of growth and the demand for infrastructure require that infrastructure vendors refresh designs and innovate at a much faster pace than before. This drives a huge trend towards the software-defined network (SDN), Network Functions Virtualization (NFV), and the Intel Open Network Platform (Intel ONP). The recent movement builds on the strength of x86 common communication platforms, as well as OS solutions such as Wind River Open Network Software*, and dramatically reduces time to market for infrastructure vendors through reference platforms and reference software stacks. The biggest challenge in this transformation is on the software side, because customers need to migrate legacy solutions to the new paradigm. The effort required to debug, test, and maintain such a solution moving forward can be quite daunting. Simics is an essential technology in this ongoing transformation and is nicely positioned to make an impact, with support for security, the Intel Data Plane Development Kit (Intel DPDK), SDN, and Intel ONP, as well as a long-term roadmap for future products and technologies.
Before we start, we need to briefly touch on a few terms used here as we will
use the shorter versions rather than referencing the full platform names:
Once the issue was successfully recreated in Simics, the debug process became
quite easy. Simics is a run-to-run repeatable environment, which makes it very
easy to reproduce errors. We were able to step through the instructions and
isolate the issue to a small module inside the test BIOS (Figure 3). Because in
Simics one can save system states at the crime scene as a checkpoint, we provided the files to team members in a different geographic location and invited them to look into the issue in parallel. They were able to recreate the error signature within minutes of getting the checkpoint and continued to debug.
By debugging the issue in Simics first, we saved precious time in our schedule. It was also a great experience in which team members in different locations were able to contribute to the same debug process in real time.
Figure 4: Simics virtual platform running virtual debug agent for Intel XDP tool
(Source: Intel Corporation, 2013)
develop as an important piece of SDN and Intel ONP. Shown here is the modeling approach for an Intel QuickAssist hardware component (Figure 5). In this approach, Intel QuickAssist hardware devices are modeled using Simics, and the whole of the device functionality is modeled. The model simulates the functional behavior of Intel QuickAssist and interacts with the software layers above. For software, things behave exactly the same from a functionality point of view whether it is running on hardware or running on Simics.
Figure 5: Simics modeling approach for an Intel QuickAssist hardware accelerator. (Source: Intel Corporation, 2013)
With the Simics file system feature, users can move software packages easily
from a local PC to Simics virtual platforms. The example (Figure 6) shows that
moving software packages from a local drive and installing them onto a Simics
virtual platform is as easy as operating on local directories.
Developers use exactly the same steps needed on real hardware to load and run
the Intel QuickAssist test suite in Simics (Figure 7):
modprobe icp_qa_al
/lib/firmware/adf_ctl up
insmod build.ko
Figure 6: Using the Simics file system to move production software onto the virtual platform (Source: Intel Corporation, 2013)
Within a few minutes, users can run an Intel QuickAssist test (Figure 8) and start getting familiar with the usage of the hardware functionality.
However, just getting the Intel DPDK software does not mean the job is done on the application side. The hard work tends to be integrating the key optimization
Simics can reduce the learning process for engineers on a new software stack such as Intel DPDK. As shown in Figure 9, the key bottleneck of the process tends to be access to real reference platforms and the test harness. The process can take days or even weeks. Even when a board is available, access can be limited. For example, for a team of engineers, sharing access to one or very few boards means only a small amount of time is available to each individual, and this may translate into an impact on productivity. With Simics, not only is the waiting eliminated, but all developers can have their own virtual boards. They can quickly launch Simics and begin learning the software, debugging, building test cases, and exploring new ways to
do things. For example, with help from Simics, they can get an Intel DPDK demo
traffic test going very quickly on their own PCs (Figure 10).
The actual customer usage of a reference software stack such as Intel DPDK requires a lot of customization. The fact that Simics can run this type of software application proves that it is a viable solution for helping customers study the reference stacks and begin integrating their solutions on their virtual targets. This potentially brings significant savings in terms of schedule and time to market.
Figure 11: Block diagram of a reference board (labels include Niantic, Cave Creek, Barnesville, Gladden, MAC/PHY, IPMC, clock chip, memory, USB, flash, HDD, CPLD, and reset/power-down sequencing). (Source: Intel Corporation, 2013)
Figure 12: Recent Simics models for communication and storage platforms: Simics Romley, Grantley, Next-gen Comms, Crystal Forest Gladden, and Rangeley. (Source: Intel Corporation, 2013)

Figure 13: Intel Atom SoC (codenamed Rangeley) block diagram: CPU cores, Intel QuickAssist Technology, GbE MACs, legacy IO (LPC/SPI, SMBus, timers, UART, GPIO), SATA, USB, and PCIe root ports. (Source: Intel Corporation, 2013)
It is very rare for any given engineer to have that kind of unlimited access to a collection of systems anytime it is needed.
We have developed Simics solutions for all our critical communication and storage platforms. A few recent models are shown in Figure 12, ranging from server and mobile to system-on-chip (SoC). SoC solutions such as the Intel Atom C2000 processor (codenamed Rangeley) are gaining a lot of traction, in particular in the low-power and low-cost arena. These Intel Atom C2000 processor solutions share many common building blocks with their Intel Xeon counterparts.
As more and more products and platforms begin to have Simics models,
another benefit starts to become significant: software reuse. On the Intel Atom C2000 processor (codenamed Rangeley, see Figure 13), the Intel QuickAssist hardware module is part of the SoC, while in Intel Xeon families it is typically part of the south bridge. From a Simics modeling point of view, the existing solution from the Crystal Forest platform is drop-in compatible. This kind of software reuse further extends the lead time software work now has ahead of hardware availability. As a result, both product solutions and Simics solutions benefit from this consistent approach across multiple generations and families. Developers working on future-generation solutions can get virtual hardware up and running a lot quicker because of the effort invested in previous generations.
Simics solves the issue by giving access to anyone that needs it. Since it is
software, it can be installed on a laptop or desktop; it can be carried around
instead of locked in the labs. By loading different scripts and installing
different packages, users have access to all kinds of platforms and can build and
instantiate as many systems as needed for the purpose of the simulation.
Shown in the example (Figure 14) is a multi-board test that involves two Crystal Forest Server (ATCA) boards. Each board has identical settings (of course, one can easily create a network of different devices). The dual-ATCA test later boots into BusyBox (Figure 14), and users can set up a network between the two.
Simics can simulate Crystal Forest hardware and sits nicely inside the overall
Intel ONP architecture (Figure 16) and can interact with software stacks in
upper layers (such as an OEM app) as well as initiate networking with other Intel ONP or non-Intel ONP devices.
The usage model here is that Simics can support SDN migration, moving control and data plane operations to Intel ONP platforms. Simics can also help repartition hardware and software functionality to support Network Functions Virtualization (NFV), where network functions (L3 forwarding, management plane, security) are shifting from hardware to software applications running in a virtualized environment. If customers already have Simics models for their current solutions, they can begin that migration process to Intel ONP immediately, without waiting for their own hardware availability. Even if they do not yet have their Simics device models, they can begin using Simics Crystal Forest building blocks and studying Intel ONP and SDN migration paths.
Conclusions
In this article, we walked through several examples of Simics post-silicon usage and demonstrated its positive impact on the product lifecycle. During our work on the Intel Next Generation Communications Chipset (codenamed Crystal Forest), we successfully used Simics to help debug system issues after hardware became available and proved various usage models for the post-silicon phase.
There are several aspects that stand out from our experience:
Author Biography
Tian Tian is a seasoned embedded system designer with over 12 years of product design, application, and market development experience. He developed a key voice component for Intel's IXP425 network processor family and architected the Access Software OS library for XScale-based systems. In recent years he has been managing various x86 product families in the embedded segment, including Intel Centrino, Intel Core Duo, and Intel Xeon, with a focus on the communication industry. Tian began using Simics as part of the design process in 2010, was involved with the Simics Crystal Forest external release in 2011, and is leading the technical marketing effort for Simics usage on next-generation platforms. Tian holds a Master of Electrical Engineering degree from Arizona State University and has published many technical white papers and articles.
Contributors
Ben Blum, Department of Computer Science, Carnegie Mellon University
David A. Eckhardt, Department of Computer Science, Carnegie Mellon University
Garth Gibson, Department of Computer Science, Carnegie Mellon University

Landslide is a Simics module designed for finding concurrency bugs in operating system kernels, with a focus on Pebbles. Pebbles is a UNIX-like kernel specification used in course 15-410, the undergraduate operating systems class at Carnegie Mellon University, in which students implement such a kernel in six weeks from the ground up. Landslide's mechanism, called systematic testing, involves deterministically executing every possible interleaving of thread transitions in a given test case and identifying which ones expose bugs. In this article we explain the testing environment (the course, 15-410, and the kernel, Pebbles) and the testing technique; describe how Landslide takes advantage of certain features that Simics provides that other testing environments (such as virtualization) do not; outline Landslide's design, implementation, and user interface; present some results from a preliminary evaluation of Landslide; and discuss potential directions for future work.
Introduction
Race conditions are notoriously difficult to debug. Because of their
nondeterministic nature, they frequently do not manifest at all during testing,
and when they do manifest, it can be difficult to reproduce them reliably
enough to collect enough information to help debugging.
Many techniques exist for dynamic testing of concurrent systems for race
conditions. Systematic exploration, the strategy we focus on in this work,
involves making educated guesses as to what points during execution a
preemption would be most likely to expose a bug, enumerating the different
possibilities for interleaving threads around these points, and forcing the
system to execute all such interleavings to check if any of them results in
incorrect behavior.[1] Systematic exploration provides a better alternative to
conventional long-running stress tests, because it is less likely to overlook
buggy execution patterns, and it enables a testing framework to report more
thorough debugging information. Compared to other dynamic analyses, such
as data race detection[2], systematic exploration is able to find a wider range of types of concurrency errors because of its ability to manipulate the execution of the system under test.

In this article, we present Landslide, a Simics module that provides a
framework for performing systematic testing on kernel-level code.[3] Landslide
is designed with a focus on the testing environment used by students in
course 15-410, the undergraduate operating systems class at Carnegie Mellon
University (CMU). In 15-410, students implement a fully preemptible, UNIX-
like kernel from the ground up over the course of a six-week project.[4] They
use the Simics simulator as their primary testing and development platform,
although they must rely on conventional stress-testing techniques to find and
track down concurrency bugs in their code. Landslide is an effort to improve
this situation by making the more sophisticated technique of systematic testing
accessible to developers of kernel code.
The course has many learning objectives, ranging from acquiring detailed factual
knowledge about hardware features through practicing advanced cognitive
processes such as open-ended design. Students study high-level concepts such
as protection (least privilege, access control lists vs. capabilities), file-system
internals, and log-based storage. We place emphasis on acquiring information
from primary sources, including both manufacturer-provided hardware
documentation and a non-textbook technical-literature reading assignment.
Students begin with a blank slate rather than a kernel-source template or an
existing operating system, so they must synthesize design requirements from multiple sources and must choose their own module boundaries and inter-module conventions. Due to the foundational nature of kernel code, the assignment design and grading encourage students to think about corner cases, including resource exhaustion, instead of being satisfied by "the right basic idea" implementations that handle only auspicious situations. Finally, most relevant to
this work, students gain substantial experience in analyzing and writing lock-
based multi-threaded code and thread-synchronization objects. They practice
detecting and documenting deadlock and race conditions, including both thread/
thread concurrency and thread/interrupt concurrency.
Project Overview
In the course of a semester, students work on five programming assignments; the
first two are individual, and the remaining three, including the kernel project itself,
are the products of two-person teams. Here we are primarily concerned with the
kernel project, though we will also briefly describe the others.
Introductory Projects
The first project is a stack crawler: when invoked by a client program, it
displays the program's stack symbolically, rendering saved program-counter
values as function names and printing function parameters in accordance
with their types. This project enables students to review key process-model
and language-runtime concepts from the prerequisite course[5]; it introduces
students to our expectations about design, analysis, and making choices; finally,
because C pointers are unsafe, it requires students to consider robustness.
The second project is a simple game, such as Hangman, which runs without an
underlying operating system. The project requires students to implement a device
driver library consisting of console output, keyboard input, and a hardware timer
handler. This project and the remaining ones are written in C with some x86-32
assembly code, which is then compiled and linked into an ELF executable, stored
into a 1.44-megabyte 3.5-inch floppy-disk image, and booted via GRUB. If
the image is copied to a real floppy or embedded into an El Torito bootable
compact disc image, it can be booted on standard PC hardware; however,
students most often use Simics, to take advantage of its debugging facilities.
The third project is a 1:1 thread library for user-space programs, essentially a
stripped-down version of POSIX Pthreads. Students begin by designing mutexes
using any x86-32 atomic instructions they choose. They then write other thread-
synchronization primitives (condition variables, semaphores, and reader/writer
locks), infrastructure components (stack allocation/recycling and a thread registry),
and low-level code to launch and shut down threads. Student library code is linked
with small test programs provided by the course staff. The test programs run on
a reference kernel written by the course staff and provided in binary form, the
behavior of which is specified in a twelve-page document. In addition to providing
a reliable execution substrate, the reference kernel schedules the execution of user-
space threads created by student code according to a variety of interleaving policies.
for setting up and tearing down threads and processes (they reuse their game-
project device drivers). We briefly describe each of the 25 system calls in the
Pebbles specification in Table 1.
For most students in the class, this is the largest and most complicated software artifact they have produced. Because the test suite and the grading criteria emphasize robustness and preemptibility of kernel code, there are many cross-cutting concerns. As students are responsible for ensuring the runtime
invariants underlying all compiler-generated code in the system (kernel and
user-space), they gain experience with debugging at both the algorithm level
and the register/bit-field level.
Students who complete the kernel project on time then work on a kernel-
extension project, with varying content depending on the semester. Past
projects have included writing a sound card driver, a file system, hibernation
(suspend to disk), kernel profiling, and an in-kernel debugger. Two recent,
more aggressive, projects have been adding paravirtualization so that their
kernels can host guest kernels and adding multiprocessor support to their
single-processor kernels.
Use of Simics
Simics serves as the main execution and debugging platform in 15-410. Unlike some emulators, which focus on fast execution of correct code, Simics provides very faithful bit-level support not only for code that behaves correctly but also for kernels that accidentally abuse hardware. Unlike hardware virtualization environments, Simics contains substantial debugger support: single-stepping, printing of source-level symbolic expressions, stack tracing, display of TLB entries, and even summaries of x86 hardware-defined descriptor tables. All of these features make Simics a helpful platform for students to test their code.
A major advantage of using Simics over the QEMU emulator in particular
is that QEMU issues timer interrupts only at basic-block boundaries, which
would dramatically undermine our goal of teaching students that threads can
interleave with each other at any time.[6]
Systematic Testing
The underlying idea of systematic testing is to view the set of all possible execution
sequences, which can change due to concurrency nondeterminism, as an execution
tree. The root of this tree denotes the start of the test case, each branch represents
one execution sequence, and nodes in the tree are decision points: time points
during the execution where Landslide should attempt to force a different thread
to run, thereby making progress through the state space.
Example
Consider the example code in Code 1, which demonstrates how the thread_fork()
system call might be implemented. If a timer interrupt occurs at line 4, the
child thread can run, exit, and free its state, causing the access on line 5 to be
a use-after-free. Here, the necessary decision point for finding the bug is at
line 4. Landslide will know that there should be a decision point here because
it automatically interprets new threads becoming runnable as important
concurrency events. Other decision points may also exist, for example, during
the construction of the new thread_t struct, or during the new thread's execution.
Together, the set of decision points defines an execution tree that contains this
bug, depicted in Figure 1.
1 int thread_fork() {
2     thread_t *child = construct_new_thread();
3     add_to_runqueue(child);
4     // note: at this point child may run and exit
5     return child->tid;
6 }
Code 1. Example implementation of the thread_fork() system call. This
example contains a race condition, described in the comment on line 4.
Source: Landslide: Systematic dynamic race detection in kernel space, 2011.[3]
Figure 1: The execution tree for Code 1: after add_to_runqueue(), either the parent's access of child->tid or the child's vanish() runs first. (Source: Landslide: Systematic dynamic race detection in kernel space, 2011.[3])
Challenges
In any systematic testing tool, there is an inherent tradeoff when defining the set of decision points: searching with few decision points results in coarser-grained interleavings and faster test completion, but less likelihood of finding unexpected bugs, whereas searching with more decision points results in the opposite. Accordingly, Landslide provides an interface for adjusting the set of decision points.

Combining the technique of systematic testing with a kernel-space execution environment presents some additional challenges. First, a testing tool must control all sources of nondeterministic input to the system, and account for all the scheduling options by each such source of input at each decision point. In the
Pebbles environment, the only sources of nondeterminism are timer interrupts
and keyboard input. With Landslide, we focus exclusively on timer interrupts, as
they can be used to directly control the kernels context switching.
Figure 2: Visual representation of Landslide's architecture and its interface with the kernel under test. Landslide runs inside Simics alongside the guest kernel and comprises kernel instrumentation, a scheduler with mirrored runqueues, memory tracking, test lifecycle management, and a tree explorer operating on the decision tree. (Source: Landslide: Systematic dynamic race detection in kernel space, 2011.[3])
Thread Scheduler
The Landslide scheduler is responsible for keeping track of which threads exist in
the guest kernel: which are runnable at any given time, and when they are created
and destroyed. It maintains a mirror image of the guest kernel's scheduler state
in the form of three queues, a pointer to the currently-running thread, and a
pointer to the previously-running thread. The queues are the runqueue, containing
the runnable threads, the sleep queue, containing threads which become runnable
after a certain number of timer ticks, and the deschedule queue, which might not
correspond to a data structure in the guest kernel, but contains all other threads
that exist on the system that are not runnable for whatever reason.
It may sometimes be necessary to inject several timer interrupts to force the desired thread to run; for example, if the kernel scheduler uses a round-robin policy and has a runqueue of thread IDs 1, 2, and 3 (with thread ID 1 currently running), and the Landslide scheduler desires to run thread 3, it will take two interrupts before thread 3 begins running.
Landslide also maintains a set of shared memory accesses made since the last
decision point, for use with the Partial Order Reduction state space technique
(which we describe in the next section). This set of accesses allows Landslide to
determine when certain actions of different threads may conflict with, or are
independent from, each other. Landslide ignores shared memory accesses from
the kernel's dynamic allocator itself, and it also ignores shared memory accesses from the components of the kernel's scheduler that run on every transition.
The explorer also identifies points during execution that should count as
decision points. The selection is mainly controlled by the user, during the
annotation and configuration process. However, the explorer also automatically
identifies voluntary reschedules: points at which the kernel explicitly invokes a context switch of its own accord (for example, in yield()). These comprise the minimal necessary set of decision points.
During the backtracking stage, the explorer applies a state-space reduction technique called Dynamic Partial Order Reduction (DPOR). Briefly, DPOR analyzes the memory accesses in a just-finished execution to identify a set of candidate branches to explore next. These branches represent reorderings of state transitions that conflicted with each other, with reorderings of independent transitions pruned out. For example, Figure 3 depicts a subset of a possible execution tree in which the highlighted transitions of threads 1 and 2 are independent from each other (that is, if they were reordered, the resulting kernel state would be identical).
Figure 3: An example part of an execution tree that could be pruned using DPOR. The
highlighted transitions of threads 1 and 2 are independent, meaning that to achieve full
coverage, Landslide needs to explore only one of the two subtrees.
(Source: Landslide: Systematic dynamic race detection in kernel space, 2011.[3])
Additionally, Landslide can heuristically detect infinite loops by comparing the current execution of the test case against previous executions under different thread interleavings. If the current execution has lasted a certain proportion longer than the average of all previous executions, as visualized in Figure 4, Landslide assumes the deviation represents a nondeterministic infinite loop.
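The heuristic amounts to a single comparison against the running average; a minimal sketch, with a threshold constant and function names that are ours rather than Landslide's:

# Minimal sketch of the nondeterministic-infinite-loop heuristic.
THRESHOLD = 2.0   # assumed: flag executions lasting 2x the average

def looks_like_infinite_loop(current_length, previous_lengths):
    if not previous_lengths:
        return False   # nothing to compare against yet
    average = sum(previous_lengths) / len(previous_lengths)
    return current_length > THRESHOLD * average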
Use of Simics Features
This section discusses how Landslide and Simics fit together, and highlights some
Simics features that Landslide makes heavy use of to enable systematic testing.
Landslide's control over the system consists of two parts. Together, these parts
enable it to steer the kernel through the different branches of the execution
tree, testing for bugs in each branch until the tree is exhausted.
[...]
The first part is causing a timer interrupt to occur at a given point during
the kernels execution. Landslide achieves this by manipulating the CPUs
pending interrupt vector. When Landslide wishes to cause a particular thread
to preempt another thread at a given decision point, it injects a timer interrupt
before the pending instruction. In response, the kernel triggers a context-switch
to the next thread on its scheduler run-queue. If that thread is not the desired one, Landslide repeats the process, injecting more timer interrupts until the desired thread begins running.

The second part of Landslide's control is backtracking. At the end of each branch of the decision tree, if Landslide wishes to explore a different interleaving
at a particular decision point, it must reset the system state to the past state at
that point. Fortunately, Simics provides a facility for reverse-execution in the
form of the set-bookmark BOOKMARK-NAME and skip-to BOOKMARK-
NAME commands. At each decision point during execution, Landslide uses set-
bookmark to ask Simics to set a bookmark. Then, when the current execution of
the test case completes, Landslide uses skip-to to reverse-execute to the bookmark
associated with the desired decision point, at which point exploration resumes.
Because Landslide places itself outside the scope of the Simics reverse-execution system, although the entire simulated machine state is reset to the earlier point, Landslide's memory of the entire state-space tree persists.
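In CLI terms, each decision point costs one bookmark during the forward pass, and backtracking to it is a single command (the bookmark name here is arbitrary):

set-bookmark dp17
# ... execution continues to the end of the current branch ...
skip-to dp17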
User Interface
Instrumenting and testing a kernel with Landslide involves three stages of
effort. These are required annotations, configuring decision points for a more
efficient search, and interpreting the resulting traces Landslide emits when it
finds a bug. This section gives a brief overview of each.
Required Annotations
Users annotate their kernels to inform Landslide of certain important concurrency
events during execution. We provide a set of annotation functions, named with the
prefix tell_landslide, for this purpose. The annotations denote when a thread runs
fork(), sleep(), or vanish(), when a thread is added to or removed from the run-
queue, and when a thread becomes blocked on a mutex. The annotation is placed
just before the actual action being annotated. Code 2 shows an annotated sample
of the code from the example in the Systematic Testing section.
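As an illustration of the placement rule, an annotated thread_fork() might look like the following; the specific annotation names are plausible examples consistent with the tell_landslide prefix, not necessarily Landslide's exact API:

int thread_fork() {
    thread_t *child = construct_new_thread();
    tell_landslide_forking();                 /* just before the fork action */
    tell_landslide_thread_on_rq(child->tid);  /* just before enqueueing */
    add_to_runqueue(child);
    return child->tid;
}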
Finally, there are two short (nominally two-line) functions used within Landslide
itself that the user must implement. These are predicates on the kernel's scheduler state and express potentially nontrivial conditions: whether the current thread is runnable but not on the run-queue, and whether preemption is disabled while interrupts are on. This logic executes within Landslide, inside of Simics, rather than as part of the simulated kernel's execution.
Decision Traces
When Landslide identifies a bug, it outputs a decision trace. This trace reports
what kind of bug was detected, and also reports each decision point in the
current interleaving: which thread was running, a trace of its stack when it
was switched away from, and the thread that Landslide caused to preempt it.
With this trace, the user can better understand the concurrent execution that exposed the bug. In Code 3 we show an example decision trace, which depicts a sequence of thread interleavings that can expose the bug in the example from the Systematic Testing section.
Results
We evaluated Landslide in two ways: first, by instrumenting two prior-semester
student kernels to measure the exploration time needed to find different races,
Future Work
There are several promising future work directions for Landslide that we
would like to explore. These include incorporating new testing techniques,
such as parallelized search, state space estimation, and new state space
reduction techniques. They also include extending Landslide to support more
complicated kernel features, such as symmetric multiprocessing and device
driver nondeterminism.
Ongoing research exists in several other techniques for coping with the exponential
nature of the state spaces associated with systematic testing. Among these are
parallelized dynamic partial order reduction[8] and dynamic interface reduction[9].
the concurrency model of the kernel under test. Chief among these are the assumptions that the kernel schedules threads on only one processor at a time, and that the timer interrupt is the kernel's only source of nondeterminism.
References
[1] Patrice Godefroid. VeriSoft: A Tool for the Automatic Analysis of Concurrent Reactive Software. In Proceedings of the 9th International Conference on Computer Aided Verification, CAV '97, pages 476–479, London, UK, 1997. Springer-Verlag.
[8] Jiri Simsa, Randy Bryant, Garth Gibson, and Jason Hickey. Scalable Dynamic Partial Order Reduction. Third International Conference on Runtime Verification (RV2012), 25–28 September 2012, Istanbul, Turkey.
[9] Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. Practical Software Model Checking via Dynamic Interface Reduction. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11). ACM, New York, NY, USA, 2011, pages 265–278.
Author Biographies
Ben Blum is a PhD candidate in the Computer Science Department at
Carnegie Mellon University. He first implemented Landslide as the research
topic for his master's degree at CMU and is continuing the work during his
PhD studies. Ben has additionally served as a teaching assistant for 15-410 for
three semesters. His web site is at http://www.cs.cmu.edu/~bblum.
Contributors
Alexey Veselyi, Intel Corporation
John Ayers, Intel Corporation

This article describes the use of Wind River Simics*, a full-platform functional simulator, for early validation of hardware register specifications. The Simics model becomes one of the first consumers of the registers, and can find several types of errors earlier, and sometimes with a wider scope, than hardware-based validation. The article is based on an actual experience of collaboration between Simics model developers and hardware architects within Intel during development of an Intel Xeon chip. Simics proved valuable as a validation tool and contributed to shift-left (reducing time to market) for hardware development.
Introduction
As the complexity of Intel hardware is increasing, the hardware register count
in the new platforms is growing rapidly. Increasing amounts of control and
status states are becoming architecturally visible to manufacturers, forming a
key part of competitive advantage and a liability for customer-visible bugs. At
the same time, hardware architects are being faced with the need for shorter
project development cycles and a need for earlier (shift-left) engagement with
teams that are using register specifications in their development. To handle these conflicting requirements (more complexity, arriving earlier), architects must use automated techniques to validate register specifications. Early incorporation of validation into the register architecture definition process is vital for the early appearance of mature specifications.
Register specifications lack an early target for validation in the project feature
specification phase and design execution phase. They also lack a sense of
delivery urgency until later in the project. Specifications are therefore very likely to contain errors that impede the initial bring-up of the RTL validation environment.
This article describes the use of Wind River Simics, a full-platform functional
simulator, for addressing the need for early register validation during the
development of an Intel Xeon chip. Simics is being used more and more for pre-
silicon software validation by multiple groups at Intel. This validation approach
proved very promising, and the intention is to adapt it for broader use.
The process of collaboration with the hardware design team is set up as follows.
Model developers start implementation of the hardware model using early
register specifications. The Simics model becomes one of the first consumers of the register specification and is able to find several types of specification errors significantly earlier than other RTL-based validation options, and in some cases with a perspective not available to RTL-based options. The approach detects not only register construction errors but also allows for validation of key architectural register specifications against legacy-derived behavior assumptions. Additionally, Simics creates the possibility to define high-level functional tests for the new platform. These tests can cover most of the platform's functionality, with the exception of new or heavily redefined features. The team of Simics model engineers provides feedback to hardware architects for every register definition drop.
This article discusses in depth the scope of register validation using Simics and the ways to perform it. It is based on an actual experience of such collaboration between Simics model developers and hardware architects within Intel during the development of an Intel Xeon chip. Simics proved to be a valuable tool for finding bugs at the pre-software stage, thus speeding up hardware development and promoting the shift-left paradigm (reducing product time to market).
Figure 1: Early validation in the virtual platform: register specifications (the logical register address map of MSR, MMIO, and B:D:F resources, plus the external specification, IPIX) feed the Simics platform for legacy BIOS/SW/driver and OS boot validation, shifting left the delivery of specifications for external consumption (Source: Intel Corporation, 2013)
Simics offers an opportunity for earlier detection of register construction errors:
Register address errors: overlaps, erroneous placement in the PCI header address space, and so on.
PCI device definition errors: bus-device-function allocations contradicting the system address plan, headers not compliant with the PCI standard, erroneous PCI class codes, incorrect device header-type values.
Database checks find some register address errors; however, it is very difficult to manage the list of heuristic-based checks without some segmenting of the registers according to their impact. Simics offers a focus point for reviewing the construction of the most architecturally significant registers and ensures that legacy definitions take precedence.
BIOS routines are each associated with a list of registers to be accessed. The architectural registers that support legacy features are extracted from the database based on these lists. BIOS routines are selected from key reference BIOS releases, such as the prior product in the product segment. For example, as the DDRIO in successive projects moves from two channels to three, there is significant value to both BIOS development and hardware design in having the DDR training or configuration programming scale up in count in expected ways and in validating fundamental behaviors.
After validation of legacy architectural functionality, Simics, consuming early register definitions, provides a very effective launch point for early BIOS programming targets, possibly ahead of related RTL development. Examples are MMCFG/SNC/SAD/TAD/MMIO translation tables, routing tables, range settings, and the PCI enumeration flow.
Additionally, Simics creates the possibility to define high-level functional hardware tests for the new platform. These tests can cover most of the platform's functionality, with the exception of new or heavily redefined features.
The Simics team provides feedback to the hardware architects for every register
definitions drop.
All of the observed issues are promptly shared with the hardware architects, who take the report into consideration for the next register definitions release. Sometimes a reported bug in a single register can be an indication to the hardware team of a bigger problem in the specifications: for instance, a common error in register access types or a misplaced register bank.
Generic Errors
In the first stage of the described development cycle, validation is performed in the following way. The Simics team receives register definitions (containing register offsets, access types, default values, and fields) as an XML database, ConfigDB, which is one of the standard formats for register definition exchange. The Simics engineers have developed tools to automatically process and convert the XML into a format that the Simics framework can understand (Device Modeling Language, DML). Another commonly used register specification format, CRIF, can also be used during this process.
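To make the conversion step concrete, the following is a minimal sketch of the idea in Python. The real ConfigDB schema and the Simics import tools are internal, so the element names, attributes, and the exact emitted DML here are illustrative assumptions only:

import xml.etree.ElementTree as ET

# Hypothetical ConfigDB-like input; the real schema is internal.
SAMPLE = """
<bank name="example">
  <register name="dev_ctl" offset="0x04" size="2" access="RW" default="0x0000"/>
  <register name="dev_sts" offset="0x06" size="2" access="RO" default="0x0010"/>
</bank>
"""

def bank_to_dml(xml_text):
    bank = ET.fromstring(xml_text)
    lines = ["bank %s {" % bank.get("name")]
    for reg in bank.findall("register"):
        # Emit a DML 1.2-style register: name, byte size, offset, reset value.
        lines.append("    register %s size %s @ %s "
                     "{ parameter hard_reset_value = %s; }"
                     % (reg.get("name"), reg.get("size"),
                        reg.get("offset"), reg.get("default")))
    lines.append("}")
    return "\n".join(lines)

print(bank_to_dml(SAMPLE))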
During this stage, generic errors can be detected, such as overlapping registers, invalid register and field sizes, invalid placement of custom registers into the standard PCI header block, and so on. Also at this stage, Simics detects invalid class-code and header-type values that mismatch the rest of the PCI header definition, which happens to be a frequent bug in register specifications.
Here we should note that some of these errors can be found automatically by
other validation tools, so this scope is not fully limited to Simics.
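The construction checks themselves are straightforward once the definitions are in machine-readable form. The sketch below, over invented data, shows the two checks named above; the pci_ naming convention used to recognize legitimate header registers is an assumption of this sketch:

def check_registers(regs):
    # regs: list of (name, offset, size) tuples from the XML import.
    errors = []
    by_offset = sorted(regs, key=lambda r: r[1])
    for (n1, o1, s1), (n2, o2, s2) in zip(by_offset, by_offset[1:]):
        if o1 + s1 > o2:                       # overlapping registers
            errors.append("overlap: %s and %s" % (n1, n2))
    for name, off, size in regs:
        if size not in (1, 2, 4, 8):           # invalid register size
            errors.append("bad size: %s (%d bytes)" % (name, size))
        if off < 0x40 and not name.startswith("pci_"):
            # Custom register placed inside the standard PCI header block;
            # the pci_ prefix convention is an assumption of this sketch.
            errors.append("custom register in PCI header space: %s" % name)
    return errors

print(check_registers([("pci_vendor_id", 0x0, 2), ("my_ctl", 0x10, 4),
                       ("my_sts", 0x12, 4)]))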
The Simics API allows for side effects of registers to be described separately
from register definitions. The Simics team takes advantage of this capability
and always makes the effort not to mix register definition with register
implementation. This makes it possible to port side effects from a previous
model to the new one, after filtering out the side effects that the new platform
should not have. Custom-register side effects constitute the major part of the
overall model implementation.
As a first approximation of the new functionality, an attempt is made to combine old side effects, which are already implemented for a previous platform and thoroughly tested, with new register definitions. Then the platform is set up and compiled using the Simics framework. For registers that match the old definition, the amount of attention required from a developer is minimal. For registers that are different, the developer should consult the platform documentation and understand the nature of such differences. The Simics framework will point to every one of these registers during platform setup. Developers then have to look through every flagged register and make a decision about the extent of the possible reuse of the old implementation.
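The reuse decision can be mechanized to a first approximation. The following sketch, with invented register layouts, classifies each register in a new drop against the previous platform's definitions, mirroring the triage described above:

old_regs = {"dev_ctl": (0x04, 2), "dev_sts": (0x06, 2), "dma_base": (0x10, 4)}
new_regs = {"dev_ctl": (0x04, 2), "dev_sts": (0x08, 2)}   # moved; dma_base gone

for name, shape in sorted(new_regs.items()):
    if old_regs.get(name) == shape:
        print("reuse old side effects as-is:", name)
    elif name in old_regs:
        print("changed (consult spec, possibly report):", name)
for name in sorted(set(old_regs) - set(new_regs)):
    print("missing in new drop (feedback to architects):", name)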
For registers that changed their names, sizes, sets of fields, or locations, the developer usually has to make some trivial changes after consulting the platform specification. But if registers are missing or contain changes that are incompatible with previous specifications, these observations should be delivered as feedback to the hardware architects. Some of the changes are purposeful and should be taken into account by the model developers, while others are bugs in the register specification that should be resolved in future iterations. The Simics model is not blocked by these bugs; some functionality of the model is just temporarily disabled (or legacy definitions are used) until the specifications are correct.
At this stage the developer also pays close attention to the rearranged bus/
device/function (BDF) map of the new platform and captures possible
discrepancies between the map and the register definition. This is achieved
by first creating the skeleton of the platform (containing dummy devices)
in conformance with the BDF map, and only then applying the register
definitions. The overall flow of platform development is illustrated in Figure 2.
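As a rough illustration of the skeleton-first approach, the sketch below builds dummy devices from an invented BDF map and flags register banks whose bus:device.function has no place in it:

bdf_map = {(0, 0, 0), (0, 5, 0), (0, 5, 1)}          # from the system address plan

register_banks = {
    (0, 0, 0): ["host_bridge"],
    (0, 5, 2): ["ntb_primary"],                      # absent from the map: report
}

skeleton = {bdf: "dummy_device" for bdf in bdf_map}  # dummy devices come first
for bdf, banks in register_banks.items():
    if bdf not in skeleton:
        print("BDF %d:%d.%d in register definitions but not in BDF map: %s"
              % (bdf + (banks,)))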
Running Workloads
As soon as the Simics engineering team is done with triage and gathering low-level register errors, they can proceed to the functional level of validation against legacy expectations. At this stage the platform is up and running and is register-accurate relative to the latest register specifications. Now Simics can attempt to run actual workloads to see how the platform operates as a whole.
The first workload run is SeaBIOS, a GPL implementation of a 16-bit x86 BIOS developed primarily for running on emulators. The BIOS requires minor modifications to run on new platforms; they are usually limited to changing device IDs in the source code, and sometimes include some shuffling of device or register locations. Any other discrepancy should be, once again, compared to the specifications and reported to the hardware architects. The BIOS boot validates the operation of PCI configuration space mapping, PCI enumeration, and some other features, and also opens the path to running actual operating systems on the early model.
Figure 2: Model development flow: device skeletons built from the HAS and register definitions from ConfigDB XMLs are combined, with side effects filtered from the predecessor model, into register-accurate devices running BIOS, OS, and functional tests; inconsistencies with the BDF map, register collisions, PCI configuration errors, and bad class-code/header-type values are fed back to the hardware architects (Source: Intel Corporation, 2013)
When the specifications are mature enough for SeaBIOS to boot successfully, the model attempts to boot operating systems. The Simics team has a set of disk images with installed systems that are used by their customers on various platforms. This includes different versions of Linux, Enterprise Linux, and SVOS, and desktop and server versions of Windows. The operating system boot, which also includes driver initialization, can validate a lot of device-specific functionality: MMIO, interrupts, and so on.
High-Level Behavior
The next important stage of the development cycle is the implementation of tests for specific high-level functionality. This includes PCIe port operation, legacy and non-legacy interrupts, NTB, networking, different types of reset, and so on. The Simics team is working on increasing their pool of tests that can later be used for the validation of future platforms. Using these tests makes it possible to keep track of major functional areas of the register specification. In case of any issues with the tests, the root causes are determined by the Simics team, and the register bugs are reported to hardware architects.
After all the stages are complete and the hardware architects have received
feedback, they can begin work on fixing the errors that were found. At the
same time, they are implementing features that were previously missing in
the new register definitions. So, when a new drop of the registers is ready, a
new iteration of the cycle can begin.
The mentioned tests were traditionally used by the Simics team for the
validation of their own models. What we are proposing now is the use of the
same (or similar) tests in early stages of hardware development for validation
of specifications. This is made possible by the availability of many platform
models previously implemented by the Simics team with a high level of detail.
Required Resources
With a substantial bank of mature models already developed at Intel, an early platform model can be assembled within approximately a week of work. On the following hardware specification iterations, the turnaround time is usually also under a week.
The model that appears during the early specifications period is only an outline
of the future model, although some parts that can be taken from another
platform can already be fully working at this stage. As the specifications
mature, we approach the BIOS development stage with an already existing
functional model. That means the work that was put into model development is
not lost, but is passed on to the later stages to be utilized for the more standard
use cases: pre-silicon software and hardware codevelopment, and, subsequently,
post-silicon validation.
Results
The results of using the Simics platform in the project for early register
validation reflect that the initial implementation has focused more on
construction validation.
We provide some examples of actual bug discoveries that were made for the
register specifications using the described process:
Registers placed in Header Offset space (0x0–0x3F) that are not legitimate PCI Header registers. Since hardware redirects accesses differently based on offset, there would have been failed accesses and potentially bad read/write effects. Found during platform setup.
Arrayed registers that became noncontiguous, while being contiguous in a previous platform. While this is not, strictly speaking, an error, a future BIOS could assume they are contiguous and implement iterative increment addressing of this register set. So it is useful to point out this issue to architects to see if breaking such logic was intentional. Found in the process of porting legacy register side effects.
Register offsets in the extended offset range (0x100 and higher) when BIOS expects the CFG accesses to be in legacy mode using CF8/CFC port-in/port-out, with only 8-bit offsets. This error makes the register unreachable in the early reset timeframe. Found during BIOS boot.
Incorrect setting of a Function 0 header_type.multi_function_device field, which did not reflect the addition of Functions 2 and 3 in the function allocations. This error would have caused a failure to PCI-enumerate those two new functions for subsequent accesses. Found during PCIe port functional testing.
Each of these discoveries significantly preceded the readiness for coverage by
hardware-based validation.
Summary
Wind River Simics, a functional simulator, is being used for validation of hardware specifications and software at the pre-silicon and post-silicon stages. We have shown how its use as a validation tool can be extended to cover very early hardware architecture specifications.
Overall, we have achieved very positive results. Hardware architects' feedback states that validation with Simics significantly helped shift-left the delivery of architecture specifications. However, this is still an early solution, and due to time and resource constraints its potential was not fully realized in the given project. Validation helped find many errors in the register construction area, so this type of validation can be considered well established. As for validation against legacy BIOS assumptions, only a few errors were found, although this approach looks very promising.
The net result was that while the theoretical value is high, in practice the impact was positive, but limited.
Author Biographies
Alexey Veselyi is a Simics Intel architecture model engineer, and has been with
Intel for two years. He can be reached at alexey.s.veselyi at intel.com.
John Ayers, HPC server architect, has been with Intel for 15 years. He can be
reached at john.r.ayers at intel.com.
Contributors
Parth Malani, Software and Services Group, Intel Corporation
Mangesh Tamhankar, Software and Services Group, Intel Corporation

Early estimation of driving forces like power and performance for future hardware platforms using Virtual Platform (VP) tools such as Simics can greatly improve the product design cycle, although even a small gap between simulated performance and its actual value can adversely affect other simulation derivatives such as power. In this article, we mitigate any such gap through a streamlined system tuning methodology to achieve a high degree of performance correlation on Simics, within 2 percent of actual hardware performance. Using the tuned performance as a foundation, we build a power model on top of Simics that provides accuracy within 5 percent for various multicore software compute workloads. The benefits offered by this experiment are twofold. First, it can help system designers working on architecture exploration by providing insights into how to properly model and tune the Simics system to reflect crucial design details. Second, platform architects and application developers can take advantage of this accurately tuned system for early estimation and exploration of power and thermal behavior, which are directly dependent on simulated performance. In a broader scope, the beneficiaries of this work may include application/driver developers, system designers and architects, marketing professionals, process engineers, and so on.
Introduction
In the traditional product design flow, software design and exploration happen only after hardware is physically available. The software shift-left phenomenon has created an interesting space in the hardware-software co-design domain wherein the software design phase is shifted ahead in the overall product design cycle, taking place in parallel with hardware design. Such a shift in design cadence shortens product time to market and enhances design quality. Application developers can explore and optimize the power and performance of their software code without having to wait for silicon prototypes to be available. Pre-silicon platform-level power and performance simulation through VP-based tools can help improve a variety of design steps ranging from architecture and power management exploration to power, cost, and area budgeting, as well as time to market. One of the major limitations of many current simulation methods is the reduced scope of simulation and the lack of system-level details such as OS involvement. Today's dynamic applications execute beyond CPU boundaries by using other system components such as the GPU and ASICs. Traditional simulation methods cannot model the interaction between these system components. Virtual Platform (VP) based simulators such as Simics[1] offer an attractive solution to model system interaction.
We envisage VPMON to be used as a power projection tool wherein a user can configure Simics with performance models of future hardware and model their power consumption. It can also be utilized as a power simulator, providing means for workload power exploration on current and future hardware. For example, application programmers may want to evaluate the impact of their software code changes on power consumption. In many cases the exact accuracy matters less than the polarity of the power impact. Apart from these usages, VPMON can be extended to study the thermal behavior of the system. Transforming performance counters or power consumption to temperature has been explored before by Chung et al.[9] and Bellosa et al.[5] respectively, and is proven to be useful for system thermal management.
To the best of our knowledge, the only work targeting Simics-based power modeling has been proposed by Bartolini et al.[10], which adds power and thermal modeling capabilities to the Simics framework by integrating multiple external tools such as Matlab and the GEMS memory simulator. That power model relies on performance counters modeled in GEMS as well as Simics internal registers for the Intel Core architecture, which is a platform-specific feature. Our approach relies purely on the instruction stream and is thus highly portable within different Intel architectures. The authors tested their technique with synthetic workloads stressing various levels of cycles per instruction (CPI). We have tested the proposed method with real multicore multithreaded software kernels. It should be noted that the work in Bartolini et al.[10] also models Dynamic Voltage and Frequency Scaling (DVFS), which we have not targeted here.
Achieving the best tradeoff between accuracy and speed of power simulation is particularly challenging when modeled power depends on simulated performance. The correlation experiment is divided into two phases based on this fact. The two phases are outlined below:
Performance Correlation
A possible gap between software performance on a Simics-based system
and actual hardware is first assessed. We then employ a streamlined tuning
process to enforce a high degree of performance accuracy. Simics exhibits very
attractive characteristics such as modularity, configurability, programmable
APIs, OS awareness, and dynamic tuning of system parameters. Our
performance correlation methodology is composed of two steps: 1) system
configuration and 2) performance tuning and correlation. We first focus on
configuring the Simics system as per the existing hardware specifications. It
should be noted that a system may include diverse hardware components and
devices such as CPU, GPU, memory, network card, and disks. Depending on
the nature of the application, many of these devices may not get utilized at
all. Removing such devices from simulation flow or using ad-hoc performance
models can speed up the simulation.
We target a server system with Intel Xeon processors for the correlation experiment. Simics supports Xeon processor models and a multicore/multithreaded execution environment through its OS involvement feature.
Platform Configuration
We were able to run industry-standard multicore compute workloads such as Linpack as-is, in their binary forms, on the Simics system. The performance reported by the workload on the Simics system differed from its hardware counterpart because of the default Simics system configuration. However, we were able to narrow this gap through tuning of various system parameters, such as the number of threads, frequency, and instructions per cycle (IPC), and by adding performance models of crucial components such as instruction and data caches. The main idea we follow is to apply and limit the tuning to the components that fall in the workload's critical execution path. This is important because adding additional simulation models can reduce simulation speed significantly.
Table 1 shows the detailed platform configuration we used for the performance correlation experiment. We used Linpack and DGEMM (double-precision matrix multiplication) multithreaded workloads to represent compute-intensive user application programs. The host system on which Simics runs contains an Intel Core i5 dual-core processor running at 3.3 GHz with 8 GB of RAM. We stuck to 2 GB of memory for the Simics target system to avoid simulation overhead on the host system. This did not affect the correlation because Linpack and DGEMM are compute-bound workloads.
It can be inferred that the base Simics configuration differs from the reference
platform in many aspects and their performance outputs thus will not be equal.
Tuning Methodology
We compared the actual GFLOPS performance reported by the workloads
on the Simics system to the reference hardware and the correlation error
is plotted in the chart in Figure 1. Data points, from left to right, on the solid lines pertaining to each workload indicate progressively applied tuning and the corresponding error in accuracy. As shown, the base Simics configuration stood about -43 percent and -72 percent off from the hardware for Linpack and DGEMM respectively. We tuned the frequency to match the reference CPU and the gap was reduced, because the default Simics frequency was lower than the hardware's, as shown in Table 1. Simics supports a fixed IPC (a finite number of simulation steps per cycle) because it is not cycle accurate. We tuned the IPC based on its architectural peak value, which significantly increased the performance, pushing the error rate above 0 percent. Both frequency and IPC can be set per logical processor at runtime through attributes under the Simics class hierarchy Romley.mb.cpu0.core[i][j], where i and j reflect the physical and logical processor index respectively.
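A runtime tuning pass might look like the sketch below, executed in Simics' embedded Python. The freq_mhz and step_rate attribute names are typical Simics processor attributes rather than something confirmed by this article, and the step_rate triple format should be verified against the Simics reference for the model in use:

import simics

CORES, THREADS = 6, 2
for i in range(CORES):
    for j in range(THREADS):
        cpu = simics.SIM_get_object("Romley.mb.cpu0.core[%d][%d]" % (i, j))
        cpu.freq_mhz = 3300          # match the reference CPU frequency
        cpu.step_rate = [4, 1, 0]    # ~4 steps per cycle, near peak IPC (assumed format)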
Figure 1: Performance correlation error relative to hardware for Linpack and DGEMM as tuning is progressively applied (base configuration, frequency tuning, IPC tuning, cache models); the base configuration is off by about -43 percent (Linpack) and -72 percent (DGEMM) (Source: Intel Corporation, 2013)
along with its access latency. All the caches use a random replacement policy by default. We used a latency of 200 cycles for any accesses going to main memory.
It is interesting to note that the tuning input does not affect the high-level behavior or functionality of the workload. However, it has a direct impact on the simulated performance.
Figure: VPMON setup on Simics: a multicore tracing module attached through the Simics debug API and debug engine feeds virtual performance models (functional) and the power model; for a compute-intensive workload, the cache model is attached only around the compute region of the workload execution timeline and detached afterward, with no memory or cache model elsewhere (Source: Intel Corporation, 2013)
During the training process, the power model correlates each performance counter to the measured workload Activity Factors (AF) and outputs a single set of coefficients modeling the power contribution of each counter. We derive the AF for each workload on an instrumented hardware system from Table 1 by measuring core power (Pdyn), Cdyn, voltage, and frequency. We used a synthetic power virus workload to calculate Cdyn. To make a single entry for each workload in the model, we take the average value of the sampled counters (and AF) over its entire compute loop. To summarize, the training set comprises pairs of VPMON counters and measured AF pertaining to each workload. The training process is performed offline, and the modeled coefficients and constants are incorporated into the VPMON multicore tracer to simulate runtime power.
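As an illustration of the training step, the sketch below derives the measured AF from the standard CMOS dynamic-power relation Pdyn = AF * Cdyn * V^2 * f and fits one coefficient per counter with ordinary least squares; all numbers are invented:

import numpy as np

CDYN, VOLT, FREQ = 15.0e-9, 1.0, 3.3e9         # farads, volts, hertz (invented)

def activity_factor(pdyn_watts):
    # Standard dynamic power: Pdyn = AF * Cdyn * V^2 * f
    return pdyn_watts / (CDYN * VOLT**2 * FREQ)

# One row of averaged VPMON counters per training workload (invented rates).
counters = np.array([[0.90, 0.30, 0.10],
                     [0.70, 0.50, 0.20],
                     [0.20, 0.80, 0.30],
                     [0.95, 0.40, 0.05],
                     [0.50, 0.50, 0.50]])
af = np.array([activity_factor(p) for p in (40.0, 35.0, 25.0, 45.0, 30.0)])

coeffs, *_ = np.linalg.lstsq(counters, af, rcond=None)
print("per-counter coefficients:", coeffs)
print("modeled AF:", counters @ coeffs)         # applied at runtime by the tracer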
To evaluate the power model we used a testing set that has additional workloads
apart from the ones used in training. We compared the average simulated power
within the compute loop of each workload to its hardware counterpart. Figure 4
shows the relative comparison of measured versus modeled power on Simics. Measured values are fixed at 1 and VPMON power is shown relative to them for nine different workloads. The workloads execute different numbers of threads, ranging from six to twelve. SGEMM and DGEMM are matrix multiplication workloads that are compute bound in nature, similar to Linpack. Stencil2D is a highly cache-bound workload. HiPwrWkld is a synthetic power virus workload used to stress the silicon Thermal Design Power (TDP). We also used these five workloads for the training process mentioned before. The remaining four workloads are fast Fourier transform (FFT) single- and double-precision kernels. For all workloads, the accuracy of the Simics power model is within 10 percent of the measured value in the worst possible scenario. The average modeling (training) and projection (testing) accuracies are both within 5 percent of hardware measurements.
Modeling average power may not be sufficient for workloads exhibiting highly dynamic power behavior. We also compared the instantaneous simulated power to measured power over the workload execution timeline.
Figure 4: VPMON modeled versus measured power, normalized to the measured value, for HiPwrWkld, DGEMM, Linpack, Stencil, SGEMM, FFT SP, FFT SP Peak, FFT DP, FFT DP Peak, and their average (Source: Intel Corporation, 2013)
Figure: Instantaneous VPMON versus measured CPU power (W) over a 3000-millisecond workload run (Source: Intel Corporation, 2013)
It should be noted that for many usage scenarios such as peak power study
for TDP analysis, the timing accuracy of power is less important. Instead it is
crucial to have accurate peak power and its sustained duration in such use cases.
The power model was validated on existing hardware and can be extended to project application power for future platforms. Once the whole flow of correlating Simics performance and power against existing hardware is completed, proper knowledge of design parameters (frequency, voltage) and architecture as well as process scaling (Cdyn) can be applied along with Simics functional models of future hardware to simulate future system power consumption. VPMON is currently portable across multiple Simics instances as well as different Intel Xeon CPU SKUs. Other system components such as external memory and the GPU can be included for platform-level power modeling; VPMON can have a library of multiple power models in this case. However, it relies on the functional models of system components available on Simics, and thus its accuracy and efficiency are bounded by these factors.
Complete References
[1] Engblom, J., Aarno, D., and Werner, B., "Full-System Simulation from Embedded to High-Performance Systems," in Processor and System-on-Chip Simulation, Leupers, Rainer and Temam, Olivier (eds.), pp. 25–45, Springer Verlag, 2010.
[7] Sinha, A. et al., "Jouletrack: A web based tool for software energy profiling," in Design Automation Conference (DAC), June 2001.
[9] Chung, S.-W. et al., "Using On-Chip Event Counters For High-Resolution, Real-Time Temperature Measurement," in Thermal and Thermomechanical Phenomena in Electronics Systems, IEEE Computer Society, May 2006.
Author Biographies
Parth Malani is an engineer at Intel working on power and performance
modeling for multi- and many-core platforms. He holds a PhD and MS in
Electrical and Computer Engineering from the State University of New York
at Binghamton and a BE in Electronics and Communications from Gujarat
University, India.
Contributors
Grigory Rechistov, Systems Simulation Center, Intel Corporation

Simulation of large computing systems is a challenging task, mainly because a single host may not be able to accommodate a full model. Therefore, the simulation itself has to be distributed across several systems. Simics provides such functionality, with its individual parts communicating over a network transparently to the target systems. Still, the task of running Simics distributed is not trivial; its challenges include maintaining simulation scalability, speed, and manageability. This article describes one practical case of simulating a large distributed cluster system with more than a thousand target cores using Simics.
Introduction
This article describes our experience with creating and running a model of a large computing cluster system using Wind River Simics. The scale and resource requirements of the workloads in this study made it necessary to run the simulation on top of a distributed multi-host system, resulting in a virtual computer cluster being simulated on a smaller physical cluster. In the course of this work, we adapted Simics to be executed as a job of a cluster resource management application. In this article we present the instrumentation technique that was used to capture parallel application behavior. We present our observations of the simulation scalability that was reached and outline the limitations we discovered during this study.
The host system was a cluster itself, though of a smaller scale. Its configuration is outlined in Table 2.

Parameter                  Value
Number of nodes            16
Processors                 Intel Xeon 5580 (Westmere), 3.33 GHz
Number of cores per CPU    6 (12 logical with Intel Hyper-Threading Technology)
Number of CPUs             2
Disk storage               3 TB
The host system nodes ran the same version of Debian GNU/Linux 6.0 x86_64 as the targets. The cluster consisted of a single head node that served as an NFS server and several compute nodes that shared the Simics installation and all required files on a network share. The network topology for both host and target
systems was a star (Figure 1). There were two separate networks: the first one (Gigabit Ethernet) was dedicated to service traffic (NFS, SSH, and so on), and the second, high-speed InfiniBand, was used for application traffic.
Figure 1: Star topology of the host system: a head node connected to compute nodes node01 through node16 (Source: Intel Corporation, 2013)
as thread-safe. That is, some care should be taken when writing Simics modules if they are supposed to be used in multithreaded environments. When loading a new module, Simics checks that it is marked as thread-safe and disables the feature globally if it is not. The majority of Simics modules are already thread-safe, so this is rarely a concern.
Parts of a simulation that are to be run in separate threads must be loosely coupled; that is, the frequency of communication between them should be significantly lower than the average frequency of their internal communication. For this study, this meant that an SMP system had to be simulated in one thread, and it could contact other simulated SMP domains running in
different threads via the simulated network. This is because LAN messages can tolerate relatively long delivery latencies.
Figure 2: Synchronization of simulation domains: each domain runs independently up to a barrier, at which point pending inter-domain messages are delivered (Source: Intel Corporation, 2013)
In this scheme, each domain runs its own part of the simulation for a predefined amount of simulated time (called a quota) without any communication with other domains. Then barrier synchronization is used, at which point all pending inter-domain messages are delivered.
For the distributed simulation to work, TCP/IP sockets are used. Each
participating Simics process should be configured to use the same host:port
pair, which indicates where a top-level synchronization domain is executing.
To give a clear view of the placement of and interaction between all parts, the whole simulation setup is shown in Figure 3. Each host's logical processor core serves one target machine with all its simulated cores. To enable transparent interaction of target machines placed on different host systems (and also for global simulator time synchronization), the host local network is used to encapsulate and transfer packets of the simulated network, which is also isolated from the host LAN to prevent nondeterministic influences of the real world.
Figure 3: Placement of the distributed simulation: simulated cores are grouped into target machines (Target 1 through Target 192), whose target NICs connect to a target network that is encapsulated and carried over the host LAN (Source: Intel Corporation, 2013)
Figure 4: Launching the distributed simulation: the user logs in to the login node over SSH, and Simics processes are spawned on the computing nodes granted by SLURM (Source: Intel Corporation, 2013)
A wrapper to the simics program, called simics-slurm, was written to automate the tasks of host resource allocation and simulation distribution. This script is executed on a login node by the user. It accepts all regular Simics command-line arguments, such as a target script name, but additionally performs the following steps:
simics-slurm asks the SLURM service to allocate the desired number of host nodes.
SLURM checks whether enough cluster nodes are unoccupied. If not, it stalls until some of the already active jobs finish and release enough resources. It then allocates nodes for exclusive use and passes a list of their host names back to simics-slurm. The list of nodes is then fixed, and they are guaranteed to be free from other users' tasks until either the nodes are returned to the free pool or the granted time period runs out.
The script then opens an SSH session to the first host node granted by SLURM and spawns a master Simics process on it. This step is done to relieve the original login node from running resource-intensive programs, as it only serves as an entry point for all users of the cluster and is not supposed to host their tasks.
The master Simics process runs the global_distrib.py script, which spawns additional Simics slave processes through SSH on the remaining allocated nodes with relevant command-line arguments. It also chooses a random TCP port number to be used for domain synchronization. Finally, it creates a target machine called master0 that serves as the head node of the simulated cluster and as an NFS server inside the simulation.
Each of the Simics slave processes spawns one or more target machines, each of which is given a simulation-unique name nodeN, where N varies from 001 to 112, and connects to the given host:port. All information necessary to establish connections is passed via SSH command-line arguments.
To have the minimal required control over the distributed simulation, we needed the ability to pass messages between separate Simics instances and to write handlers for them. It turned out that some form of this is already present: there is an undocumented Simics function, VT_global_message(), that can pass an arbitrary text string to be caught by a user-defined hap handler. These messages are always global, that is, broadcast to all processes; therefore, for peer-to-peer communication, every Simics instance had to filter out messages not targeted at that particular process.
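A filtering scheme along those lines could look like the sketch below. VT_global_message() is the undocumented call named above; the hap name and callback signature used here are assumptions for illustration only:

import simics

MY_NAME = "node042"

def on_global_message(user_data, obj, text):
    dest, _, payload = text.partition(":")
    if dest not in (MY_NAME, "all"):
        return                      # not addressed to this instance
    print("handling:", payload)

# Hypothetical hap name; consult the Simics reference for the real one.
simics.SIM_hap_add_callback("Core_Global_Message", on_global_message, None)

def send(dest, payload):
    # Broadcast to every Simics process; receivers filter by the prefix.
    simics.VT_global_message("%s:%s" % (dest, payload))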
Experiment Flow
The combination of Simics distribution scripts and the implementation of
global commands allowed us to organize our experiments as described below:
The master script spawns all slave Simics processes and waits until all of them report ready by sending global messages.
A global start-of-simulation command is sent from the master process to the rest of the Simics instances. All target machines start to run. It should be noted that the master target is allowed to pass the GRUB bootloader stage earlier than the remaining targets because it has to bring up the NFS server in advance so that it can be used inside the simulation.
A script branch is created to wait for all simulated machines to report that
they have booted.
After a target machine has reached the Linux login prompt, a login/password pair is entered through the keyboard model. Then the Simics process that contains that machine sends a global message to notify the master process that one more target machine is ready.
After all machines have booted, the keyboard model on the master target node is used to enter the commands that start an MPI application.
After a predefined delay that is meant to allow the application to start up, a global command is issued that instructs all Simics instances to activate mpi-tracer and start recording all MPI activity.
After another predefined interval of simulated time, another global command is broadcast to stop capturing the MPI call trace and to save the already obtained results to disk. Shortly after that, a global shutdown command is sent for all Simics copies to quit.
It should be noted that the target application was not aware of any of the described activity; for it, executing CPUID was just like completing a regular instruction.
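On the Simics side, the tracing callback can be as simple as the following sketch. SIM_hap_add_callback() and the Core_Magic_Instruction hap are standard Simics mechanisms (the hap is named in the figure below), but the trace record layout here is invented:

import struct
import simics

trace = open("mpi_trace.bin", "wb")

def on_magic(user_data, cpu, magic_number):
    # One fixed-size record per event: simulated time (seconds) plus the
    # MPI call id that the PMPI wrapper encoded into the magic argument.
    trace.write(struct.pack("<dq", simics.SIM_time(cpu), magic_number))

simics.SIM_hap_add_callback("Core_Magic_Instruction", on_magic, None)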
Figure 5: MPI call interception: an MPI_Send call enters the PMPI_Send wrapper, which executes a magic instruction that raises the Core_Magic_Instruction hap inside Simics (Source: Intel Corporation, 2013)
The analysis of the collected data was done offline with a second group of Python scripts created to extract useful information, such as MPI call frequency, distribution, and length, from the collected binary traces. As an example of the results that could be obtained with the system created, a characterization of the occurrence distribution for all MPI functions observed for mdrun (the most important application of the Gromacs suite) is shown in Figure 6. The number of target machines participating in this series of experiments varied from 2 through 64.
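A sketch of such an analysis pass is shown below, assuming the simple (time, call-id) record layout from the tracer sketch earlier; the real mpi-tracer format and call-id mapping are internal:

import struct
from collections import Counter

CALL_NAMES = {1: "MPI_Send", 2: "MPI_Recv", 3: "MPI_Sendrecv",
              4: "MPI_Isend", 5: "MPI_Irecv", 6: "MPI_Waitall"}  # invented ids

def call_profile(path):
    counts = Counter()
    record = struct.Struct("<dq")
    with open(path, "rb") as f:
        while chunk := f.read(record.size):
            _, call_id = record.unpack(chunk)
            counts[CALL_NAMES.get(call_id, "Others")] += 1
    total = sum(counts.values()) or 1
    return {name: n / total for name, n in counts.items()}  # probability of call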
Figure 6: Probability of call for the MPI functions observed for mdrun (including MPI_Sendrecv, MPI_Isend, MPI_Irecv, MPI_Waitall, MPI_Alltoall, and others), for target machine counts N = 2, 4, 6, 8, 10, 12, 16, 20, 32, 48, and 64 (Source: Intel Corporation, 2013)
From the data and additional statistics collected, a conclusion was made that this application spent most of its communication time in two data exchange calls, namely peer-to-peer MPI_Sendrecv and collective MPI_Waitall. Further analysis demonstrated that another frequent pair of routines, MPI_Isend and MPI_Irecv, does not actually introduce significant delay into the application's operation. Therefore, in order to optimize a cluster configuration for this application, a system designer's attention should be focused on optimizing for MPI_Sendrecv and, to a lesser extent, for MPI_Waitall.
Scalability Results
In this section we describe the largest simulation that we were able to carry out and what it took in terms of time and space.
We were able to simulate up to 1792 cores, which constitute the target system of interest, on about 150 physical cores. For such a large simulation, 12 host cluster nodes had to be exclusively allocated for Simics. The collection of MPI traces for 200 simulated seconds of an application run took about two days of simulation. A single trace file from each host node contained about 2 GB of binary data, totaling more than 20 GB of logs for the whole simulation. The typical processing time to extract the MPI call profile data from the raw logs was about one hour.
Slowdown
The initial booting process consisted of several phases: SeaBIOS boot, the GRUB bootloader, the Linux kernel, and userland startup up to the shell login. The slowdown for this part of the simulation varied from 20 to 50 times. The value was relatively low because this part of the simulation was executed with Simics VMP mode enabled.
For the MPI tracing part of the simulation, the observed slowdown varied from 800 to 2000. In this phase the simulation was interrupted often (at every MPI call) to process a magic instruction callback. As a result, a lot of data was dumped to disk during this phase, adding to the resulting slowdown. Also, VMP was mostly turned off.
Scalability Limitations
The main concern and limiting factor of this study was the memory requirement of the target applications. Each booted target machine required about 2 GB of host memory to work. This limited the number of target nodes on a single host node to 14. This limitation is hard to circumvent, as the memory requirements are essentially incompressible. Swap might help in this case, and Simics supports automatic offloading of its images to swap files; still, it is possible that this would result in a catastrophic simulation slowdown.
Another limiting factor was the MTBF (mean time between failures) value of the host hardware. A process that takes too long to complete will be interrupted by a hardware/software failure with high probability. A possible mitigation is to use simulation state checkpointing.
Conclusions
In this article, Simics' ability to handle large multi-machine scenarios was demonstrated. When the resources of a single host are not enough, the simulation can be distributed across several systems. It was shown that Simics is capable of simulating thousands of processor cores distributed across hundreds of target systems connected within a network, and that such a large simulation can be carried out with one tenth the physical resources while maintaining acceptable slowdown.
Acknowledgments
The author would like to thank the following people who participated in the joint
MIPT-Intel research and helped to develop tools and methodology described in
the article: Evgeny Yulyugin, Artem Abdukhalikov, and Pavel Shishpor.
References
[1] D. Van Der Spoel, E. Lindahl, B. Hess, et al., "GROMACS: Fast, Flexible, and Free," Journal of Computational Chemistry, 2005, V. 26, No. 16, pp. 1701–1718.
Author Biographies
Grigory Rechistov is a software engineer in the Systems Simulation Center at
Intel. He joined Intel in 2007 and his career has since been devoted to creating
software models of upcoming Intel CPUs. He holds a BS and MS in applied
mathematics and physics from the Moscow Institute of Physics and Technology
and recently defended his PhD thesis in computer science. His email is grigory.rechistov at intel.com.
Contributors
Mona Vij, Intel Labs, Intel Corporation
John Keys, Intel Labs, Intel Corporation
Arun Raghunath, Intel Labs, Intel Corporation
Scott Hahn, Intel Labs, Intel Corporation
Vincent Zimmer, Software and Solutions Group, Intel Corporation
Leonid Ryzhyk, University of Toronto
Adam Walker, NICTA
Alexander Legg, NICTA

Automatic Device Driver Synthesis is a research collaboration project between Intel and National Information Communications Technology Australia (NICTA) that aims to synthesize device drivers automatically from formal OS and device specifications. We have built a tool chain that uses Simics* DML device model sources as an input to driver synthesis. The tool chain has a frontend compiler that extracts the device behavior from the Device Modeling Language (DML) model and outputs a formal representation of that behavior, which we refer to as a device specification. The driver synthesis tool combines this specification with a similar OS specification and applies the principles of game theory to compute a winning strategy on behalf of the driver, eventually converting it into driver C code. This approach aims to use existing device models for producing device drivers, resulting in highly reliable drivers and faster time to market. We have synthesized a number of drivers using our tool chain; examples include a legacy IDE controller, a UART, an SDHCI controller, and a minimal Ethernet adapter.

Introduction
A device driver is the part of the operating system (OS) that is responsible for controlling an input/output (I/O) device. There is a wealth of research[1][2] showing that drivers are a primary source of bugs, and driver development is a major bottleneck for platform validation and time to market. Figure 1 shows the conventional driver development process, where a driver writer uses two informal documents, the OS and device specifications, to convert a series of OS requests to device commands. The process of device driver creation can be error prone and tedious. One of the main reasons is that the driver writer uses informal documents that are susceptible to misinterpretation. In addition, the driver writer has to have domain knowledge of both the OS and the device. In many cases driver writers also reuse existing driver code to write a new driver, inheriting any existing bugs in the process.
Figure 1: Conventional driver development, where a driver writer turns informal device and OS specification documents into a driver implementation, versus driver synthesis, where a synthesis tool derives the driver implementation from formal device and OS specifications (Source: Intel Corporation, 2013)
Our approach, in contrast, applies game theory and synthesizes the driver code from formal specifications. This approach improves driver reliability by reducing manual intervention, avoiding misinterpretation of device documents by driver writers. Moreover, given a device specification, drivers can be generated automatically for all supported operating systems, thereby eliminating the costs associated with porting drivers. With this approach to driver development, DML device models are used not only for simulation, but for driver generation as well. The driver synthesis tool chain also provides some additional capabilities, like a state space explorer that aids in DML device model debugging. Overall this approach results in correct drivers and improves time to market by moving development earlier in the design cycle, leading to cost reduction.
In the long run we plan to support large classes of devices with this tool, from very simple to complex devices, as long as their behavior can be represented as a state machine. We can't synthesize drivers that perform complex computation and are difficult to represent as a state machine. In addition, we don't plan to support drivers for devices that are based on programmable cores, such as high-end graphics or network processors.
Figure: Driver synthesis tool chain: the OS specification, device-class specification, and device specification are inputs to the synthesis tool, and a state space explorer produces counter-examples for debugging (Source: Intel Corporation, 2013)
The synthesis problem is modeled as a game. The driver assumes the role of the first player, and the environment (OS, media, and so on) describes the moves of the opponent. In the context of the game, modeling the environment as an opponent puts more emphasis on the environmental events that lead to failure than on those that are benign. The environment begins all games with moves that represent OS-to-driver requests. In response to these moves, the driver must try to make moves (that is, send commands to the device) that push the device to a winning state, corresponding to a correct device response for the given OS request. The moves chosen by the driver should be such that no matter what external event occurs, the device and driver can either correctly service the OS request or fail gracefully and continue to operate correctly in the future. Effectively, the tool constructs a driver algorithm that guarantees the driver is able to correctly satisfy all OS requests given any feasible environment behavior. We call such an algorithm a winning strategy on behalf of the driver.
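On finite, explicit state spaces, the winning region for such a game can be computed with a classic backward fixpoint. The toy sketch below is only meant to convey the idea; the actual tool works on symbolic TSL state and is far more involved:

def winning_region(driver_states, env_states, moves, goal):
    # moves: state -> list of successor states; goal: target winning states.
    win = set(goal)
    changed = True
    while changed:
        changed = False
        for s in list(driver_states) + list(env_states):
            if s in win or not moves.get(s):
                continue
            succs = moves[s]
            if s in driver_states:
                ok = any(t in win for t in succs)   # driver picks one good move
            else:
                ok = all(t in win for t in succs)   # must survive every env move
            if ok:
                win.add(s)
                changed = True
    return win

# Toy device: after a command is sent, the environment either completes it
# or raises an error, from which the driver can recover.
driver = {"idle", "cmd_sent", "error_seen"}
env = {"wait_irq"}
moves = {"idle": ["cmd_sent"], "cmd_sent": ["wait_irq"],
         "wait_irq": ["done", "error_seen"], "error_seen": ["done"]}
print(sorted(winning_region(driver, env, moves, {"done"})))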
Tool Inputs
The tool takes multiple formal specifications as input, as described in the
following subsections.
Device-Class Specification
Device-class specifications need only be written once per device class and can be used with different OS specifications and with devices of the same class from different vendors. We believe a model similar to USB's Device Working Group (DWG) would work best for establishing industrywide device class specifications. In this model, classes of devices are identified and a working group (WG) is established for each class, drawing WG membership from interested parties who tend to be the leaders and experts in a specific device class. The WG then develops a class specification by consensus, with the result typically being subject to approval of the parent organization.
OS Interface Specification
The OS interface specification describes legal sequences of interactions between the driver and the OS, as well as the expected device response on completion of each OS request. It models when the events defined in the device-class specification must be raised in response to OS requests. This specification does not specify how the events in the device-class specification are generated, since that is captured by the device specification.
Device Specification
Device specifications are device-specific instantiations of device-class specifications. They model the device behavior and the externally visible artifacts of the device. In particular, they model externally visible registers and the device operations that result from reading or writing those registers. The device response depends on the register values and the device's internal state, for example, whether the device is initialized or waiting for a request to complete. These responses include, but are not limited to, updating register values, generating interrupts, triggering one or more external events, and interacting with other subsystems. These specifications are written at a high level of abstraction and ignore detailed internal architecture and timing.
Tool Outputs
The tool processes the input specifications and applies the principles of game
theory to produce driver code.
Driver Code
The tool produces C code when it finds a successful strategy. In some cases driver writers will need to develop manual wrappers to integrate the code with the OS.
One of the goals of the project is to not modify the actual device models, since we do not want our use of the models to impact their original use in virtual platforms, and we do not want to force a fork of the models, which might lead to issues with bug-fix propagation. We have built a DML compiler that tries to handle the DML-to-TSL conversion automatically, but in some cases we do need to modify the model. Currently we modify the model directly, but all of the modifications we currently make to the actual model could instead be kept in a separate annotations file, thereby leaving the model pristine. This support will be added in future versions of the tool.
We begin the extraction process by collecting the model variables that will
become the TSL state variables. All data objects and attributes are added to the
collection as they are encountered. Fields are added only if their alloc parameter
is true (that is, model space is allocated for their contents). Registers are added
only if they do not contain fields and their alloc parameter is true.
The transition entry points include the access methods for all register banks present. We also add transitions for each DML event object and each after keyword encountered in the model, along with a 1-bit guard variable for each event or after transition.
After identifying the entry points, we can begin extracting the transitions. This is done by first copying the method containing the entry point, then replacing each call or inline statement with the body of the target method. This is repeated recursively until no call or inline statements remain and we are left with a full code trace through all branches of the call. As an optimization, we concurrently evaluate if-statement conditions to prune branches that will never be taken because they will always be false.
Besides state variables, TSL allows for temporary variables. These are global
in scope but do not retain values across transitions. TSL has no notion of
transition-local variables. As part of the transition extraction, we must convert
all local variables found in DML methods to TSL temporary variables. Because
of TSL's global scoping, some amount of variable name mangling is required
to ensure unique variable names.
TSL restricts transitions from modifying a variable more than once per
transition. This requires us to analyze each extracted transition and introduce
new temporary variables and assignments when violations are identified.
TSL also requires that any single transition update all state variables.
To meet this requirement, we analyze each branch in the transition for
assignment statements. For each variable assigned, we add an identity
assignment (state = state;) to the corresponding branch. We complete this
requirement by adding identity assignments to the end of the transition for
all remaining unassigned state variables.
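The completion rule is mechanical; a minimal Python sketch (the variable names are invented):

def identity_assignments(state_vars, assigned_in_branch):
    # every state variable not written in this branch gets "v = v;"
    return ["%s = %s;" % (v, v) for v in state_vars
            if v not in assigned_in_branch]

state_vars = ["status", "irq", "count"]
print(identity_assignments(state_vars, {"status"}))
# ['irq = irq;', 'count = count;']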
The following subsections describe how our frontend DML compiler deals
with the conversion from DML to TSL, the intermediate representation into
which the tool converts models and which is amenable to analysis and synthesis.
DML Templates
Development of the compiler caused us to study several of the import files in
great detail, specifically dml-builtins.dml and utility.dml, leading us to realize
the power of well-planned template and parameter use. This in turn allowed us
to write extensions in DML itself, rather than extending the language.
The file dml-builtins.dml provides the glue that ties banks, registers, and fields synthesis
together, as well as providing default methods and parameters for most types of
DML objects. Unfortunately, it is so closely tied to the Simics DML compiler,
dmlc, that we could not use it without porting it. Our first porting task was
to create our own versions of the methods that are intercepted by the DML
compiler. These methods are involved in the read/write access fan-out from
bank objects to registers and fields.
For Simics device I/O, the bank method access() serves as the primary entry
point for the I/O-memory interface (register read/write operations). Instead
of a single method that takes direction and size as parameters, TSL uses a set
of size-specific access methods, as reflected in the mapped_regs parameters below:

template bank {
    ...
    // extensions for tsl
    parameter mapped_regs32 default undefined;
    parameter mapped_regs16 default undefined;
    parameter mapped_regs8 default undefined;
    if ($this.emit_accessors == true) {
    ...

template event {
    ...
    // variable to track posted state
    data uint1 _posted_;
    ...
Unused Code
There is some code in DML device models that exists for DML infrastructure
rather than for device operations. Our tool has no need for such code, and we
needed a way to eliminate it without modifying the models. We have defined
a few annotations for use in the models. They all begin with the
sequence //@ and so are transparent to the Simics DML compiler. We use the
pair //@ignore and //@resume to hide portions of DML from our DML tool.
We have used these to some extent in the models but mostly use them in our
copies of the system import files, the DML equivalent of user/include/*.h.
Width Conversion
TSL performs strict type checking and does not support type promotion or
casting, so our DML compiler must coalesce types by rewriting expressions in
the emitted TSL, performing a significant amount of rewriting to provide
explicit width conversions. Width conversion to a wider type requires the
original assignment to be converted to a conjunction of two assignments: the
original assignment and a second assignment to the extra bits. For example,
assuming a 32-bit variable named foo and a 16-bit variable named bar, the
statement:
foo = bar;
becomes, in effect, the pair:
foo[15:0] = bar;
foo[31:16] = 0;
where the second assignment covers the extra bits (zero-extension assumed;
the notation here is illustrative).
In some cases, the format of a DML expression may prevent our tool from
being able to make this modification. For instance, the DML expression:
if (somevar == 0)
foo = bar;
else
foo = 0;
Arithmetic Operations
The current version of TSL does not support arithmetic operations (such as +,
-, *, /, or modulo) or magnitude comparison operations (such as <, <=, >, or >=).
At this point this is simply a limitation of our tool, and we plan to add this
support soon. For now, our tool detects cases where power-of-2 techniques can
be used instead and performs the conversion automatically. The detection
depends on one of the operands being a constant power-of-2 value. In cases
where this is not obvious, we have to modify the model by hand.
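The rewriting itself is straightforward once a constant power-of-2 operand is found; a small Python sketch of the idea (our tool operates on DML expressions, but the arithmetic is the same):

def rewrite_pow2(op, lhs, const):
    if const <= 0 or const & (const - 1) != 0:
        return None                  # not a power of two: modify model by hand
    k = const.bit_length() - 1       # const == 2**k
    if op == "*":
        return "%s << %d" % (lhs, k)
    if op == "/":
        return "%s >> %d" % (lhs, k)
    if op == "%":
        return "%s & %d" % (lhs, const - 1)
    return None

print(rewrite_pow2("%", "addr", 8))   # addr & 7
print(rewrite_pow2("*", "idx", 4))    # idx << 2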
For some of the devices for which the hardware is available, we also tested the
driver on actual hardware.
While the state space explorer is a critical component of a tool chain that
synthesizes driver code, it also offers capabilities that can be quite useful to a
DML model developer.
The GUI allows a model user to inspect the values of any device internal
variable in a given state by simply clicking on the node in the graph
representing the state. A pane on the left lists all the device internal variables,
and clicking on a particular state node causes this list to be updated with the
values of each variable in that state.
Further, from a given state, the GUI allows a user to pick the next transition
which would move the device state machine to another state. While this
feature is somewhat similar to the step or next operation in a traditional
software debugger, the event-driven nature of a DML model requires the
tool to provide more flexibility. The events triggering state transitions are
broadly classified into events that can be controlled by software and those
that depend on the environment (like platform hardware interrupt, line
unplugged, and so on) and therefore cannot be controlled by the device
or software. The tool allows a user to choose which event occurs next in a
given device state. The choice includes both controllable and uncontrollable
events. In the case of software-controlled actions, the user can also specify the
parameters of the action.
Figure 4: State space explorer GUI. The right pane shows the device model as a directed graph. The left pane shows device
internal variable values.
(Source: Intel Corporation, 2013)
The capabilities described above (inspecting device variable values and directing
the state machine by choosing the next transition via the GUI) allow the model
writer to use the state space explorer as a debugging aid, examining the effect of
events on the device.
Counterexample Generation
The state space explorer also provides a counterexample when no winning
driver strategy exists. The primary challenge in exploring the state space of a
hardware device model is its huge size, which would quickly make visualization
incomprehensible and state management cumbersome. The GUI explorer utility
in the synthesis tool chain employs numerous techniques, built on a foundation
of formal methods and symbolic execution, to address this issue. These
techniques include:
aggregating states that have the same properties with respect to the DML
model code into a set of states and displaying the entire set as one node, and
symbolic representation of the model code, which allows abstracting
the model variables (which can have a massive number of values).
Scenario Replication
Device programming sequences typically involve massaging of OS input
parameters and a long series of register reads/writes, and require specific
environment conditions (such as network connectivity for a successful packet
transmission) to hold. In order to assist the tool user in efficiently exploring the
device state space and quickly repeating long repetitive action sequences, the
GUI allows saving traces of action sequences, also known as state transitions,
from any given state. In any subsequent run of the tool, as long as the model
remains unmodified, the same scenario can be replicated by bringing the model
to the same start state and then loading the saved trace.
Input Specifications
Driver synthesis requires three input specifications for the device. This section
describes the steps involved in acquiring or developing the three input specifications.
Device Specification
We used an existing SD host controller DML device model from the Simics
team as our device specification. As we began to examine the model to
determine where the device-class-related annotations should be placed, we
noticed that, unlike the other DML models we had worked with, this model
did not account for in-flight data transfer times. All data transfers to or
from the card model happened instantaneously. Drawing on our past
experience, we rewrote the model to account for the in-flight times and
validated the changes using a stock Linux image with the Linux SD Host
driver, running on the Simics Framework. We submitted patches for these
changes to Simics.
We then began the task of annotating the model with Device Class events and
attempting synthesis. As this model was the most complex model we had tried to
date, we immediately ran into problems. The complexity of the model resulted in
an output TSL file with 6.8 Kb of state space (global variables), another 12.3 Kb
for temporary variables, and 45 separate transitions. This extreme size resulted
in tool-chain execution times in excess of 4 hours. As we were still trying to
determine the correct locations for annotations, the extreme execution time was a
significant hindrance to forward progress.
Since the model is a full model, it contains transfer modes and registers that
would not be used in our project. In an attempt to reduce the overall size and
complexity, we tweaked the model to hide the unused transfer modes and
registers. This reduced model has 2.5 Kb of global variable space, 1.5 Kb of
temporary variable space, and 14 separate transitions. This reduced tool-chain
execution time to tens of minutes.
We also had to make a few changes to the model for TSL compatibility
issues. These changes included rewriting arrayed register definitions without
arrays, statement adjustments to allow width conversions, and elimination of
arithmetic operations.
Class Specification
We needed to define this specification from scratch, as it does not exist today.
Normally we would expect it to be published with the device's industry standard
specification. This specification defines all the interfaces supported by the SD
host controller device that are expected to be supported by all drivers. We
started with the SD host controller standard specification[6] and defined the
class interfaces, captured in a Word document. The class interfaces are the
points of synchronization between the OS and device specifications. We will use
these interfaces to annotate both the OS and device specifications.
OS Specification
We chose to synthesize the SD host controller driver for UEFI (Unified
Extensible Firmware Interface). We used the UEFI documentation[5] to define
this specification. The SD host controller driver is the lowest level driver in the
layered driver stack. The OS specification for this driver was motivated by the
interfaces expected by the media layer driver.
UEFI defines a stylized model of system booting that includes interfaces between
several different executable entities, including UEFI drivers, as shown in Figure 5.
Figure 5: The UEFI boot phases, from Security (SEC) through Pre-EFI
Initialization (PEI), the Driver Execution Environment (DXE), Boot Device
Select (BDS), and Transient System Load (TSL) to Run Time (RT) and After
Life (AL), showing the executable entities (verifier, CPU/chipset/board
initialization, the EFI driver dispatcher and boot manager, UEFI drivers, and
OS-absent/OS-present applications) that run in each phase.
(Source: Intel Corporation, 2013)
These interfaces are codified by the main UEFI specification and expose
abstractions such as block device access, such as the EFI_BLOCK_IO_
PROTOCOL. The generic services in the EFI_BLOCK_IO_PROTOCOL,
such as ReadBlocks(), WriteBlocks(), and Reset(), need to be refined to an
implementation that meets the requirements of the underlying hardware
controllers. Today the requirements of the UEFI specification and its associated
driver model, along with the semantics of the hardware, are all managed by
the developer as part of the code creation process. This process is fraught with
error, and most developers typically take an existing driver source and port it to
the requirements of the new hardware. As such, there is no guarantee of
correctness, with flawed existing sources being evolved via this porting process.
The strong assurance guarantees needed for firmware, along with the extensive
specifications available in UEFI, make EFI drivers an ideal target for synthesis.
Instead, with driver synthesis, a single instance of an OS specification for a class
of devices can be married to a specific device specification, such as the DML for
the hardware, to derive the source. This removes the errant human interpretation
of the UEFI specification and the hardware host controller interface definition.
This is an important issue in that the UEFI firmware on the system board is
considered hardware by many end users of the platform. And with the trust
guarantees around platforms based upon UEFI Secure Boot[7], assurance
considerations, such as correctness of the implementation, gain even more
importance as all of the UEFI drivers and components are in the same trusted
computing base.
Specification Synchronization
We used the class specification as the synchronization mechanism between the
OS and device specifications. This involved using the class interfaces in the OS
specification at the synchronization points. Finding the correct synchronization
points involved studying the DML device model; finding the correct place to
annotate the device model depends on the way the model is written. It was a
fairly simple process to annotate the SD host controller and EFI OS
specifications.
Integration
Once we had the three inputs ready, it was an iterative process to feed them
through our tool chain to synthesize the driver. We did not synthesize the
configuration interfaces for this device, but we did synthesize the main function
to send a command to the card. At the end of this step we were able to
synthesize the device driver strategy for this driver.
Code Generation
Code generation proved much more tedious than anticipated. At the time of
writing, our synthesis tool does not support fully automatic code generation.
Instead, it allows the user to interactively construct driver source code by
selecting one of several possible actions proposed by the winning strategy in
each state. Ongoing research on this problem is focusing on techniques for
fully automatic code generation as well as on improved methods for interactive
user-guided code generation (see the section User-Guided Synthesis). The
synthesized code was tested in the Simics simulator with an Intel Core i7 based
platform model.
With the platform model extended, the next step was to validate the extended
model. This was done using the Linux image supplied with the platform model. We
booted the image in Simics and recompiled the kernel to create a loadable Linux
SDHCI driver. We updated the Linux image to retain the new driver modules. We
were then able to load the SDHCI driver and validate our SDHCI-MMC card
model combination using Linux file-system commands targeted to the MMC card.
Our next step was to establish an EFI baseline image. To achieve this goal,
we built an EFI image with an existing SD host controller driver and tested
that simulation environment. We then integrated our driver with the EFI
code base, replacing the existing driver. We needed to develop some wrapper
code to integrate it into the EFI environment. We then built and tested this driver on
the Simics simulator and successfully brought up the SD host controller and
performed read/write operations to the SD card.
User-Guided Synthesis
Our initial approach with this project was completely automatic synthesis,
where, once the specifications are available, a push-button approach results in
a driver. In practice we realized that users want much more control over the
structure of the driver code; user-guided synthesis gives a driver writer this
fine-grained control over the synthesis process. In addition, in some cases
synthesis gets stuck, and having users provide some simple hints can make the
job of the synthesis tool much easier. Given these findings we decided to make
a shift toward user-guided synthesis, as illustrated in Figure 6.
Figure 6: The spectrum of user-guided synthesis: from an empty driver
template, via a synthesized driver, a modified driver template, and a newly
synthesized driver, to a manually developed driver.
(Source: Intel Corporation, 2013)
To this end we plan on using driver templates that specify the driver
structure. The user can add additional constraints on the synthesized driver
by defining a device-specific driver template that can include some hints,
or anything that is specific to a device. We plan on supporting a complete
spectrum from fully automatic synthesis, where the device-specific template
is empty, to the other extreme, where the user manually writes the complete
driver in the device-specific template and our tool then acts as a verifier,
checking the driver against the input specifications. We think the sweet spot is
somewhere in the middle, where the user specifies some code structure and
constraints in the device-specific template and generates more usable and
readable code (see Figure 7).
We are also working on an interactive code generation GUI that gives the user
the flexibility to add code at any code generation step. Any code manually
added this way is saved by adding it back to the template and will automatically
be available at the next iteration. Using a combination of templates and the
code generation GUI, our tool chain will provide user control over generated
code at all stages of synthesis. Even though the user gets complete control, our
tool chain will validate that the user has added correct code. Any errors caused
by the user will result in synthesis failure, not an incorrect driver.
Figure 7: The driver template and the device-specific driver template together
yield the driver implementation.
(Source: Intel Corporation, 2013)
Author Biographies
Mona Vij is a researcher in Intel Labs. She has been a security and operating
systems researcher for over 20 years. She has a Master's in Computer Science
from the University of Delhi, India, and a Bachelor of Science in Mathematics
from St. Stephen's College, Delhi.
John Keys is a Staff Engineer in Intel Labs. He has been developing low-level
software for over 25 years, for both PCs and embedded platforms. He has
experience with a wide range of hardware devices, CPUs, operating systems,
processor architectures, and platforms from bare-metal to PC to satellites
and tunnel boring machines. He has made significant contributions to the
development of PCMCIA and USB technologies and standards. Through this
leading edge work, he also became an expert in hacking an existing platform
to add new capabilities, beginning with plug-and-play support for MS-DOS 3.2.
John has been with Intel for 14 years in a variety of positions. Prior
to joining Intel, he was the VP of Software for MCCI in Ithaca, NY.
He has been a Systems software researcher at Intel for the last 14 years. He has
authored 5 conference papers, 1 book chapter and holds 8 patents in the areas
of High performance computer networking, Operating Systems, Compilers
and multi-core parallelization.
Washington, Seattle, WA. He has published three books, two book chapters,
one IETF RFC, ten publications and over 270 US patents.
Adam Walker is a PhD student at the University of New South Wales, Sydney,
Australia. He obtained his Bachelors degree from the University of Auckland,
New Zealand in 2008.
Contributor
Robert Guenzel
Technical Content Engineering, Wind River

Wind River Education Services provides user training for a variety of topics,
including Wind River operating systems and tools, as well as more general
topics like networking. Training always includes hands-on labs, which can
complicate logistics for training sessions. Shipping boards and configuring
networks is time consuming and error prone. For that reason, Education
Services is using Simics as an alternative to physical hardware to streamline
training logistics and provide new ways to do training.
Introduction
By just looking at the title of the article, it remains unclear what kind of
education this article refers to. A more precise but woefully long title would
be: Using Simics* for educating people on various embedded system topics,
such as debugging tools, operating systems, device driver and application
development, networking, and security. Notably, Simics is not amongst the
topics people are educated on. So this article is about using Simics as a training
tool. It is not about training people on Simics. In fact, the training examples
in question try to hide Simics as much as possible, because the students must
not get the feeling that they are being trained on something they did not want
to be trained on, or, which would be even worse, get the feeling that they need
Simics in order to use the tools or software they are being trained on.
Using Simics in this way reduces Simics (more precisely, the target machine
that it simulates) to a mere hardware replacement, thereby throwing away a
lot of its unique features. To understand the reasons for this, a closer look at
hands-on labs is required.
Amount of equipment: There are literally hundreds of pieces of equipment
per class (a laptop, evaluation board, Ethernet switch, hardware debugger,
and cables per student). The overhead for making sure all pieces are there,
in good shape, and recollected after training is significant.
Installation: Creating networks, connecting the debuggers, probes, and so
on at the customer site is very time consuming and error prone.
Insight into the system: Real hardware does not allow full system time
freeze and insight into the complete system state.
Flexibility: The same training sessions need to be delivered on different
target machines such as PowerPC, ARM, or x86.
Scalability: The number of CPU cores or nodes in a network needs to
be changeable on the fly to show effects of concepts introduced within
lectures.
All of the above points have a significant impact on training maintenance and
training delivery, and, at the end of the day, training costs. With Simics, a lot
of the above problems can be mitigated:
There is no need to ship equipment other than normal laptops.
Broken laptops are easy to replace, because they are off the shelf.
The required equipment is independent of the training that is delivered. It
is always one laptop per student.
Simics gives full insight into the system and allows full system time freeze.
For a number of training sessions a change of the used target is simple. For
example, migrating from the Simics PowerPC* Quick Start Platform to the
Simics ARM* Quick Start Platform.
System installation and bring-up is reduced to powering on the laptop and
starting Simics with a well-tested Simics script.
The following sections show the application of Simics in a number of training
examples, showing the real-world problems and how they were solved.
A third option is to use Simics, which also allows designing an artificial device.
Having it included in one of the Quick Start Platforms (PowerPC or ARM)
would automatically make it available in the other, because both platforms are,
apart from the processor used, identical. With the way Simics models memory-
mapped busses, it is also not difficult to put the device into an x86 platform.
dml 1.2;
device myDevice;
parameter desc = "LED Controller";
parameter documentation = "LED Controller with integrated timer";
import "utility.dml";
import "simics/devs/signal.dml";
import "simics/devs/memory-space.dml";

connect phys_mem {
    parameter documentation = "Connect memory space for DMA operations.";
    parameter configuration = "optional";
    interface memory_space;
}
connect irq_dev {
    parameter documentation = "Connect IRQ receiver.";
    parameter configuration = "optional";
    interface signal;
}
connect led[32] {
    parameter documentation = "Connect LEDs.";
    parameter configuration = "required";
    interface signal;
}
bank regs {
    parameter register_size = 4;
    register MCR @ 0x00 is (read_write) "Main Control Reg" {
        field DMAADR [31:2] "DMA source address";
        field MODE [1] "Operation mode of device";
        field ENABLE [0] "Enable/Disable device";
    }
connect irq_dev {
    method set() {
        $irq_state = true;
        if ($this.obj)
            $irq_dev.signal.signal_raise();
    }
    method reset() {
        $irq_state = false;
        if ($this.obj)
            $irq_dev.signal.signal_lower();
    }
}
event lifetime_event {
    parameter timebase = "cycles";
    parameter desc = "lifetime expiry event";
    method event(void *param) {
        inline $irq_dev.set();
        $regs.PLT = 0;
    }
}
bank regs {
    register PLT {
        method write(val) {
            if ($this == 0 && val != 0) {
                $this = val;
                if ($irq_state == false && $MCR.ENABLE == 1) {
                    inline $lifetime_event.post(val, NULL);
                }
            }
        }
    }
    [...] // part omitted
}
The DML code is simple (myDevice_impl.dml has around 150 lines of code)
and can be maintained without exhaustive DML knowledge. Integrating
the device into the Quick Start Platform has been done at the scripting level
as shown in Code 4, because this allows us to have a single script for both
targets. Also note that the script handles the chosen OS, so there is only one
script for four different training variants.
if not defined target_arch { $target_arch = "ppc" }
if not defined target_os { $target_os = "vxworks" }
if not defined target_image { $target_image = "someKernel" }
$vxworks_image = $target_image
$kernel_image = $target_image
add-directory "%script%"
run-command-file ("%simics%/targets/qsp-" + $target_arch
                  + "/qsp-" + $target_os + ".simics")
@device = pre_conf_object(simenv.system + ".dut", "myDevice")
@device.led = [None]*32
@device.queue = SIM_get_object(simenv.system).cpu[0]
@device.phys_mem = SIM_get_object(simenv.system).phys_mem
The nice thing here is that the above code works with minor changes for an x86-
based target as well. The difference is that the phys_mem memory space of an x86
target is not a direct child of the top-level component. That means that, for all
targets, the same code base can be used for the device and the Simics machine
script. Only the above code needs to be maintained in order to enable device
driver labs for VxWorks* and Linux* on all three target architectures. So for this
training example, the advantage of using Simics lies on the side of the training
developers and maintainers.
The students are mostly unaware of Simics, although the training can be
extended to show how Simics can help in device driver development and
debugging, for example by using the device register view (see Figure 2) to
inspect the current device state.
Figure 2: The Simics 4.8 register view showing the registers of the training
device
(Source: Wind River, 2013)
Networking Training
Networking labs have quite different requirements from the device driver
development labs. Here, the actual target machines used are of secondary
interest; what is more important are the requirements on the network itself,
such as flexible topologies.
Figure 3: An example lab network topology connecting nodes Node 1
through Node 4.
(Source: Wind River, 2013)
In this case the setup requires five machines, two switches or hubs, and six
Ethernet cables. This renders per-student lab setups impossible (imagine a
class of 10 or more students). A per-class setup is feasible, but still subject
to reliability issues. Taking the above requirements into account (especially
the flexible topologies), using real hardware is no option.
With this script in the Simics project, various network topologies can be
created by starting the generic script and setting the configuration file variable
to the configuration file for the various labs (Figure 4).
For example, the scripts shown in Code 6 and Code 7 generate the
topologies shown in Figure 5 and Figure 6, respectively. The script
in Code 6 resembles the topology shown in Figure 3, while the script
in Code 7 shows a more complex topology as well as connected
Wiresharks[1] and a real network connection. Note that to keep the
example code short, targets and routers that use the same architecture
and OS do actually use the same set of binaries. In actual labs, the routers
have different kernels or root file systems in order to properly execute
their router roles.
Figure 5: The generated network topology for the script shown in Code 6 as a graph, and as shown by
the target info and system editor views.
(Source: Wind River, 2013)
# define targets
T1 = target(*vxWorksPPCArgs)
T2 = target(*vxWorksARMArgs)
T3 = target(*LinuxPPCArgs)
T4 = target(*LinuxARMArgs)
# define a router
R1 = router(*routerArgs)
T1 = target(*vxWorksPPCArgs)   # target_0
T2 = target(*vxWorksARMArgs)   # target_1
T3 = target(*LinuxARMArgs)     # target_2
T4 = target(*LinuxPPCArgs)     # target_3
NW1 = network("10.10.0.20")
NW1.connect(T1, macAddress="f6:9b:54:32:42:42")
NW1.connect(T2)
NW2 = network("11.10.0.20")
NW2.connect(T3, macAddress="f6:9b:54:32:42:43")
NW2.connect(T4, macAddress="f6:9b:54:32:42:44")
NW3 = network("12.10.0.20")
NW3.connect_to_tap(TAP)
R1 = router(*routerArgs)       # router_4
R2 = router(*routerArgs)
R3 = router(*routerArgs)
R4 = router(*routerArgs)
R1.plug(NW1)
R1.plug(R2, "13.10.0.20")
R1.plug(R3, "14.10.0.20")
R4.plug(NW2)
R4.plug(R2, "15.10.0.20")
R4.plug(R3, "16.10.0.20")
R4.plug(NW3)
wireshark("13.10.0.20", "255.255.255.0")
wireshark("11.10.0.20", "255.255.255.0")
Code 7: A more complex topology configuration script (argument setup
omitted for spatial constraints and because it is identical to the simple script)
Source: Wind River, 2013
Simulating a network with a dozen nodes on a single simulation host sounds
like a big challenge for the host, but in fact, once all targets have booted,
most nodes are idle, meaning the idle nodes add only minimally to the overall
simulation effort. For example, on a two-core Intel Core i7 CPU with
2.6 GHz and 2 GB of RAM with a 64-bit Linux host OS, booting all eight
machines in the network shown in Figure 6 takes 15 seconds. Afterwards the
simulation is able to progress by 4 virtual seconds per wall clock second,
because when sending packets through the system to observe the behavior of
routing protocols or firewalls, most of the machines are idling.
Note that the frame captured on the software side is four bytes shorter, because
it lacks the real FCS. Also note that the capturing tool falsely interpreted the
final four bytes of the frame (which actually belong to the trailer) as the FCS,
which leads to a false check failure.
Combining the simple configuration format with the shown ability of sniffing
on the virtual hardware side, and with the ability of doing packet modification
using Simics Ethernet probes, quick and reliable creation of any kind of
network is feasible. Inspection, injection, or modification of traffic at any
position in the network is also possible. This enables the construction of labs to
show routing and communication protocols in action, their reaction to broken
connections, and their reaction to packet loss or packet corruption in simple
and complex network topologies.
In this training example, both the training developers and the students benefit
from using Simics, because it not only simplifies the training development but
also enables the use of complex and varying topologies and inspection of
complete packets everywhere, which improves the learning experience.
Figure 6: The generated network topology for the script shown in Code 7 as a graph, and as shown by
the target info and system editor views.
(Source: Wind River, 2013)
how to best exploit the number of available cores. A machine used for such a
training session needs to fulfill a number of requirements.
Figure: Speedup (from 1.0 to 1.5) over the number of cores (1 to 8).
(Source: Wind River, 2013)
import pyobj
import simics
from simics import *    # makes SIM_* calls available unqualified, as used below

class lmem(pyobj.ConfObject):    # enclosing ConfObject class; name illustrative
    def _initialize(self):
        pyobj.ConfObject._initialize(self)

    class accessDelay(pyobj.SimpleAttribute(
            None, type="i", attr=simics.Sim_Attr_Required)):
        """Delay for transactions"""
        pass

    class timing_model(pyobj.Interface):
        def operate(self, mem_space, map_list, mem_op):
            delayValue = self._up.accessDelay.val
Figure 10: Original and modified memory space hierarchy in the Quick Start
Platform to support NUMA setups
(Source: Wind River, 2013)
            SIM_mem_op_ensure_future_visibility(mem_op)
            mem_op.reissue = 0
            SIM_log_info(2, self._up.obj, 0,
                         "Delaying mem op for %d cycles." % delayValue)
            return delayValue
Code 8: Timing model implementation.
Source: Wind River, 2013
The Simics script that modifies the memory space hierarchy is shown in Code
9. Note that the timing models are not attached by default, but only when
calling the Python function enable_delays. This allows booting the OS much
quicker. The timing models are attached only when executing the applications
that make use of NUMA and that need to be analyzed.
add-directory "%script%"
run-command-file "%simics%/targets/qsp-ppc/qsp-vxworks.simics"
lmem.queue = cpu
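A minimal sketch of what enable_delays could look like, assuming the per-CPU memory spaces of Figure 10 are reachable as board.mem[i] (the object names and the loop bound are illustrative; timing_model is the memory-space attribute that timing models attach to):

import simics

def enable_delays():
    lmem = simics.SIM_get_object("lmem")
    for i in range(4):
        mem = simics.SIM_get_object("board.mem[%d]" % i)  # hypothetical names
        mem.timing_model = lmem      # subsequent accesses are now delayed

def disable_delays():
    for i in range(4):
        simics.SIM_get_object("board.mem[%d]" % i).timing_model = None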
Figure 11 shows the setup in action. Initially, the timing models were not
attached and hence steps and cycles are in sync (due to using a steps-per-cycle
ratio of one to one). Afterwards the delays and logging output from the timing
models were enabled, and then a check is performed that the next instructions
will execute memory accesses (r13 points to global RAM, while r14 points
to local RAM). Executing the instructions shows that the timing models are
active, and that afterwards steps and cycles are out of sync, because the global
memory access lasted for 26 cycles (1 cycle execution + 25 cycles transaction
delay), and the local access only for 6 cycles.
This enables the creation of labs that can show how NUMA can speed up
applications, provided most of the data accesses of each core remain core-local
and only a few accesses need to go to the global RAM. A good example of an
algorithm that greatly benefits from NUMA is a parallel quicksort.
There is, however, the drawback that, with Simics, showing cache-coherence
protocols in action or cache coherency issues is not trivially possible, because,
by default, Simics abstracts from caching for the sake of simulation speed.
Then again, cache coherence is generally handled by hardware and is invisible
to the software running on the CPUs. Since the training sessions focus on
embedded software topics and tools, not being able to actually show cache
coherence is no major concern. It should be noted, however, that Simics allows
the implementation of cache models and their addition into the existing target
system. However, this would be too much effort for the above-mentioned training
example for showing something that is not of major interest to the students.
Reference
[1] Combs, Gerald, et al. 2013. Wireshark. www.wireshark.org
Author Biography
Robert Guenzel holds a PhD in computer science from the Technical
University of Braunschweig. He joined Wind River in 2011. He is responsible
for Simics training development and supports Simics integration into other
training programs. Internet address: http://education.windriver.com
Contributors
Tingting Yu, University of Nebraska-Lincoln
Witawas Srisa-an, University of Nebraska-Lincoln
Gregg Rothermel, University of Nebraska-Lincoln

As modern software systems become more complex, they increasingly contain
classes of faults that are not effectively detected by existing testing techniques.
For example, concurrency faults such as data races tend to be intermittent and
require precise execution interleavings or specific events to occur. A common
approach to test for data races involves repeatedly executing a program in the
hope of exercising an execution interleaving that causes races. Unfortunately,
occurrences of races do not always lead to erroneous outputs. As such, they
often elude traditional testing approaches that rely on output-based oracles
for fault detection. To effectively test for these elusive faults, engineers need
to be able to observe the occurrences of faults by precisely controlling various
execution events that can cause such faults, and properly monitoring outputs.
Introduction
The basic characteristics of modern computer systems are that they utilize
multiple CPUs, connect to a large array of peripheral devices, and sense their
surroundings through various sensors and actuators. Such complexity makes
software development challenging, as developers must rely on concurrent
programming and various combinations of interrupts, system calls, and polling
to fully utilize processing power and to sense their environments in a timely
manner. It is therefore not surprising that software running on these systems
can suffer from classes of elusive faults, including various forms of concurrency
faults that are difficult to detect, isolate, and correct.
Concurrency faults are difficult to detect because they occur intermittently.
To combat these faults, engineers require observability and controllability.
Observability is needed to detect when violations occur and assess their
implications for correctness. Existing techniques for testing for concurrency
It is worth noting that to date, there have been many techniques proposed
for detecting thread-level concurrency faults with the aid of observability[3]
and controllability[4]. However, these approaches have rarely been applied to
scenarios in which concurrency faults occur due to asynchronous events (such
as interrupts and software signals) and interleavings at the process-level. It is
unclear whether these approaches can work in such scenarios for two reasons.
First, controlling these events requires fine-grained execution control; that is, it
must be possible to control execution at the machine code level or kernel level
and not at the user-mode application level, which is the granularity at which
many existing techniques operate. Second, occurrences of events like interrupts
are highly dependent on hardware states; that is, interrupts can occur only
when hardware components are in certain states. Existing techniques are often
not cognizant of hardware states.[5]
virtualized system, to monitor function calls, variable values, and system states,
and to manipulate memory and buses directly to force events such as interrupts
and traps to occur. As such, Sim-O/C is able to stop execution at a point of
interest and force a traditionally nondeterministic event to occur. Sim-O/C
then monitors the effects of the event on the system and determines whether
any anomalies result.
Many existing approaches for detecting concurrency faults are not widely used
because they require significant deployment effort. We have designed Sim-O/C
to overcome deployment obstacles by implementing it on a commercial virtual
platform, Simics.[6] We chose Simics for several reasons. First, similar to other
full-system simulators, Simics provides functional and behavioral characteristics
similar to those of target hardware systems, enabling software components to
be developed, verified, and tested as if they were executing on the actual systems.
Second, through a rich set of Simics APIs, software engineers have the ability
to unobtrusively observe and control various system behaviors without needing
the source code. Third, due to its powerful device modeling infrastructure,
Simics already plays a critical role in hardware/software co-design; therefore,
adding the proposed capabilities to Simics enables adoption without requiring
undue effort. Thus, we envision that Sim-O/C will allow several aspects of
product integration testing to be moved up to the co-design phase of system
development. Fourth, licensing of Simics is free for academic institutions,
making it a good platform for research.
A Motivating Example
In early releases of version 2.6 of the Linux kernel, there is a particular data
race that occurs between the serial8250_start_tx routine and the UART
serial8250_interrupt interrupt service routine (ISR) in the UART driver
program.[7] This particular fault existed in the source code for over three years
before being eradicated. We provide the code fragments that illustrate the
error in Code 1.
static int serial8250_startup(struct uart_port *port)
{
    up = (struct uart_8250_port *)port;
    ...
    /* Do a quick test to see if we receive an
     * interrupt when we enable the TX irq. */
    serial_outp(up, UART_IER, UART_IER_THRI);
Code 1. Faulty code that can cause data races in the UART driver in Linux.
Source: University of Nebraska-Lincoln, 2013
Routine serial8250_startup is responsible for testing and initializing the UART
port and assigning the ISR. This routine is called before the UART port is ready
to transmit or receive data. Routine serial8250_start_tx is used to initialize data
transmission, and is called when data is ready to transmit via the UART port.
Routine serial8250_interrupt is the actual ISR, and is called to perform data
transmission when: (1) the data is ready to be transmitted; and (2) the interrupt
enable register (IER) is enabled by the serial8250_start_tx routine.
A race can occur when: (1) the device driver program is preempted by the ISR
after a shared memory access, before it can proceed to the next instruction;
and (2) the ISR manipulates the content of this shared memory.
Higashi et al.[8] introduce an approach to test for this fault by controlling
invocations of interrupts. In that work, they used an ARM-based processor
simulator and a modified version of uCLinux with the same fault that could
run on that simulator. Their modifications included porting the code from
PPC to ARM and removal of irrelevant code to reduce the simulation time.
Their methodology involves invoking an interrupt at every memory read and
write operation. We recreated a similar testing system based on Sim-O/C with
two additional optimization techniques beyond those used in the Higashi
approach. In the first optimization technique, we apply static program analysis
to detect the resources that can be affected by the UART driver and the ISR.
With this optimization, we invoke interrupts only when these shared resources
are accessed, resulting in a smaller number of interrupts. Second, we also check
system states at runtime to ensure that it is possible to invoke interrupts when
those resources are accessed, resulting in more realistic interrupt invocations.
These two optimizations should significantly reduce the time required to
conduct testing.
Overview of Sim-O/C
Currently, Sim-O/C is implemented for applications running on x86/Linux
environments. The framework includes four major components, which interact
with Simics as user-developed tools: the Configuration script, the Execution
Observer, the Execution Controller, and the Test Oracles (see Figure 1).
To date, it is quite common for certain types of faults to be detected using code-
level instrumentation. Such faults include thread-level concurrency faults, memory
leaks, and buffer overflows. However, code-level instrumentation has been known
to introduce probe effects. Furthermore, faults that occur across applications in a
system can make code-level instrumentation infeasible (for example, if only binary
code is available or instrumentation tools are not compatible with development
languages). Sim-O/C operates at the binary level, so it is language independent
and does not rely on external tools to perform instrumentation. It also minimizes
probe effects, as instrumentation is performed outside of virtual execution states.
Controllability is achieved by allowing engineers to initiate the desired events at
particular execution points during testing.
In addition, Sim-O/C is capable of detecting three classes of faults that cannot
be easily detected without the support of virtual platforms. As such, the
ultimate benefit of Sim-O/C is its ability to test for classes of concurrency
faults that cannot be effectively detected by existing approaches. In the future,
we envision that Sim-O/C can be used to test for other elusive faults, including
temporal violations that occur due to interrupts, and improper resource usage.
For example, Sim-O/C can be used to generate complex interrupt sequences
to test for violations of expected worst-case interrupt latencies.
In the next section, we describe each component of Sim-O/C and illustrate
how Sim-O/C can be used to test for data races that occur due to improper
arrivals of hardware interrupts. This type of fault has been identified as one of
the nastiest faults to test for in embedded software.[9] When we conduct
testing for data races, the components under test include the main application,
device drivers, and the ISRs that are associated with the device drivers. The
focus of our illustration is testing for races that occur when the application,
coupled with the device drivers, interacts with an ISR.
Implementation Details
Figure 1 depicts the overall architecture of the Sim-O/C framework. There are
four major components in the framework in addition to Simics itself. As stated
earlier, Simics provides APIs that can be accessed via Python scripts; thus, all
components except the test oracles are Python scripts.
Figure 1: The overall architecture of the Sim-O/C framework. A Configuration
script supplies breakpoints and events to monitor to Simics; the Execution
Observer records execution logs used by the Test Oracles, and the Execution
Controller steers the execution.
(Source: University of Nebraska-Lincoln, 2013)
Test Configuration
The first component in our framework is the Configuration script, the content
of which includes information such as the locations at which to set execution
breakpoints on the points of interest. We can identify such points through
static or dynamic analysis (for example, we can set breakpoints at instructions
that access shared resources).
To test for data races in our example, the Configuration script sets breakpoints
on the events of interest so that the Execution Observer and Execution
Controller modules can use the monitored runtime information. In this
example, the information generated by the Execution Observer includes:
information on when functions of the PuT (program under test) and the ISR
execute and when they return, and
information on when SVs (shared variables) are accessed by the PuT and
written by the ISR.
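A minimal sketch of such a configuration step, using the Simics breakpoint API and a hap callback (the context object, the address list, and the callback body are placeholders, not the actual BasicConfig implementation):

import simics

def on_breakpoint(user_data, obj, bp_id, memop):
    # hand the hit over to the Execution Observer / Execution Controller
    print("breakpoint %d hit" % bp_id)

def configure(context, points_of_interest):
    for addr in points_of_interest:   # e.g., instructions accessing SVs
        simics.SIM_breakpoint(context, simics.Sim_Break_Virtual,
                              simics.Sim_Access_Execute, addr, 1, 0)
    simics.SIM_hap_add_callback("Core_Breakpoint_Memop",
                                on_breakpoint, None)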
procedure BasicConfig()
1: begin
   ...
4:   endfor
   ...
7: end
Algorithm 1
Execution Observer
The Execution Observer monitors and generates information that is used
by the Execution Controller and Test Oracles. The generated information
can either be recorded in a log file (to support offline analysis) or taken
directly from the observer (to support online analysis). For offline analysis, the
callback functions log events into a runtime trace. This runtime trace serves
two purposes. First, it provides information for the Execution Controller: the
Execution Controller can determine the locations at which events should occur
in the following test executions. Second, Test Oracles can use the log file to
determine whether a fault occurs. For online analysis, the Execution Observer
notifies the Execution Controller as to when or where to cause an event to
occur in the current run. As in the offline analysis, runtime information can be
used directly by the oracle components for dynamic fault detection.
An important feature of the Execution Observer is that it can ignore
irrelevant events. Typically, engineers are interested only in the program
under test and not other programs running in a system. When using the
raw callback information, the Execution Observer cannot distinguish
between different execution units such as processes or threads (for example,
a shared memory address specified in the BasicConfig can be accessed
by other programs not under test). In most cases, we can overcome this
ambiguity by using a process tracker provided by Simics. A process tracker
can distinguish each user-space process; as such, engineers are able to focus
on the program under test.
We use the RaceObserver procedure (Algorithm 2) to isolate the PuT and ISR
from other applications in the system or other ISR invocations. Using a static
analysis tool, we identify all function names in the PuT and their entry
addresses. By parsing the symbol tables, we can identify the entry address of
the ISR. Furthermore, we monitor the function return instruction (ret in x86)
to determine whether a function or the ISR has returned, and we monitor the
interrupt return instruction (iretd in x86) to determine whether the PuT has
recovered from the interrupt context.
procedure RaceObserver()
1:  switch (breakpoint)
2:    case function_addr:
3:      func_list.push({func_addr, ebp, esp})
4:      if func_addr == ISR_entry
5:        log ISR entry
6:        is_ISR = true
7:      endif
8:    case ret:
9:      if esp == func_list.top[index_esp]
10:       if func_list.top[index_func] == ISR_entry
11:         is_ISR = false
12:         log ISR exit
13:       endif
14:       func_list.pop()
15:     endif
Algorithm 2
In summary, events logged for testing race conditions include: (1) read/write
accesses to shared variables (an SV access by the PuT); (2) entry to the ISR; (3)
a write to an SV by the ISR; (4) return from the ISR; and (5) context switches
from the ISR to the PuT. Code 2 illustrates a sample of trace information
recording these events for this example.
...
PuT, $xmit->tail$, read, pc1
ISR entry
ISR, $xmit->tail$, write
ISR exit
IRETD
pc1+1
...
Code 2.
Note that there is a race in the trace given in Code 2. By observing the
program counter when an SV is accessed by the PuT and the interrupt recovery
point, we can determine that an interrupt occurs right after xmit->tail is read
by the PuT. Later on, this race can result in a program failure when the PuT
reads the wrong value of xmit->tail.
Execution Controller
The third component in our framework is the Execution Controller. This
module specifies certain events that should be invoked at particular points
during executions of the PuT. First, the Execution Controller can artificially
create I/O interrupts simply by writing data to the I/O bus or memory
locations that have been mapped to hardware devices. Simics allows us to issue
an interrupt on a specific IRQ line from the simulator itself. The interrupt will
happen before the subsequent instruction. Second, the Execution Controller
can force the system to execute a particular exception handling routine
by artificially creating that exception, such as by configuring the system
environment into an error state. Third, the Execution Controller can force
the system to execute a particular path by specifically setting a desired branch
condition in a hardware register or a memory location. The logical address of
a global variable can be obtained by parsing the symbol table. Thus engineers
are able to artificially inject values into the memory address to set the value
of this variable. Note that in order to write to program variables, the logical
virtual address must be converted to a physical address. As for the hardware
register, the cpu.iface.int_register.write API is used to manually set the content
of a register. Finally, the Execution Controller can cause an interrupt and use
the interrupt handler to control kernel-level scheduling. This is an important
feature of the Execution Controller when events need to be precisely controlled
at the kernel level. For example, to test for concurrency faults among multiple
user processes, Sim-O/C first observes the correct execution of two processes
and identifies system calls that can potentially race with each other (for
example, two processes that both write to a file using the write system call).
Next, Sim-O/C controls the scheduling of the two processes and attempts to
switch the order of the system calls in the original execution. If such a switch
causes a failure, a concurrency fault is detected.
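A minimal sketch of these injection mechanisms in Simics Python (the register name, physical address, interrupt controller object, and IRQ line are illustrative placeholders; simple_interrupt is one interface a PIC model may expose for interrupt injection):

import simics

def force_branch(cpu, reg_name, value):
    # steer a branch by writing the desired condition into a CPU register
    reg_id = cpu.iface.int_register.get_number(reg_name)
    cpu.iface.int_register.write(reg_id, value)

def inject_variable(cpu, paddr, value):
    # set a global variable, after its logical address has been
    # translated to a physical one
    simics.SIM_write_phys_memory(cpu, paddr, value, 4)

def raise_irq(pic, line):
    # issue an interrupt on a specific IRQ line from the simulator itself
    pic.iface.simple_interrupt.interrupt(line)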
Note that existing approaches would likely use functions such as yield or sleep
to achieve some form of execution control in user-level applications. However,
such approaches are not precise, as these functions do not guarantee when
the suspended process will be executed again. In addition, they do not work
for kernel-level programs such as device drivers. Instead, we use the Device
Modeling Language (DML)[11] to achieve more precise execution control. Our
approach creates a dummy device for the platform running on Simics, and
installs its associated IRQ line in the ISA bus. In this way, whenever a process
is ready to be stopped or resumed, the interrupt of this device is invoked. The
associated interrupt handler serves as an event handler. The handler first checks
the state of the process (suspending, active, and so on) using the task_struct
data structure. Next, the handler controls the process scheduling using process
scheduling APIs (such as wake_up_process()). This type of controllability can
also be used to control software signals.
In the example that we are considering, when engineers enable the controller
module RaceController, a controlled interrupt is invoked immediately after a
shared variable access by the PuT.
Note that in the foregoing discussion we have assumed that interrupts are not
nested, but our algorithms do also support nested interrupts. Also note that in
our discussion we have considered only the presence of a single ISR, but our
algorithm can be generically applied to handle multiple ISRs.
Test Driver
In addition to the components illustrated in Figure 1, a test driver is also
needed to automate the testing process. Typically, engineers conduct testing by
running test programs on a system. A test driver is a program that automates
the process of running test programs in a suite, managing the runtime violation
detection, and analyzing generated log files.

For example, to test for data races we need to provide two components to the
test script: the test input and conditions governing when to invoke interrupts
from within the system. In this case, a test case is used as the test input for
the PuT (which includes the application and any device driver running under
non-interrupt service routine context that is called by the application). Note
that test cases for the PuT can be generated based on various criteria. We
discuss the criteria we use in the evaluation section. Next, we need to describe
each interrupt condition (IC). We express an IC as a tuple: <loc, pin>. The
first element, loc, specifies a code location at which to invoke an interrupt.
The second element, pin, specifies an Interrupt Request (IRQ) line number at
which to invoke the interrupt. This is needed because typically, an interrupt
service routine can be associated with multiple IRQ lines. ICs are used only
when the controllability module is enabled.
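For illustration, ICs could be encoded in a test script as plain tuples; the
code addresses and IRQ numbers below are hypothetical.

# Each IC is a <loc, pin> pair: a code location at which to fire an
# interrupt and the IRQ line on which to fire it.
interrupt_conditions = [
    (0xC02A4F10, 4),  # after one shared-variable access, UART IRQ 4
    (0xC02A5234, 3),  # the same ISR reached through a second IRQ line
]

def ics_at(pc):
    # Return the ICs whose code location matches the current PC.
    return [ic for ic in interrupt_conditions if ic[0] == pc]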
Our system invokes only one controlled interrupt per test run. This is done to
avoid fault masking effects, which may occur in cases where multiple interrupts
fire and cause a failure that would be evident in the presence of a single
interrupt to be masked by the presence of the second.[14] Thus, our system needs to first
check a flag to determine whether a controlled interrupt has already been invoked
in the current run. If it has, the test system does not monitor any further events
in this run. Once it has been determined that there has not been any invocation
of a controlled interrupt in this run, the system then checks to see whether the
last accessed SV has already been tested in prior runs. If it has not, the system
enables the control register for UART and then calls the simple interrupt API.
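A minimal sketch of this gating logic follows; the flag and SV bookkeeping are
shown as plain Python state, and fire_uart_interrupt is a placeholder for
enabling the UART control register and calling the simple interrupt API.

interrupt_fired = False  # reset at the start of each test run
tested_svs = set()       # SVs already exercised in prior runs

def on_sv_access(sv, fire_uart_interrupt):
    global interrupt_fired
    if interrupt_fired:
        return               # stop monitoring events for this run
    if sv in tested_svs:
        return               # this SV was covered in a prior run
    tested_svs.add(sv)
    interrupt_fired = True
    fire_uart_interrupt()    # at most one controlled interrupt per run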
Note that, given the above approach, with the controller module disabled, the
PuT runs |tc| times during a testing process, where |tc| is the number of test cases.
However, with the controller module enabled, there can be multiple runs of
each test case. This is because each test case execution may encounter different
numbers of SVs. Therefore, the number of runs depends on the number of SVs
that must be tested, and is given by the equation

|tc| × (|int| + 1),

where |tc| is the number of test cases and |int| is the number of controlled
interrupts issued per test case. We also need to run the PuT one additional
time for each test case to determine whether all SVs have been accessed; this
extra run accounts for the + 1 term.
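For instance, if each of 12 test cases triggers seven controlled-interrupt
runs, testing requires 12 × (7 + 1) = 96 runs in total, which is consistent
with the figures reported in the Evaluation section.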
Evaluation
To evaluate Sim-O/C for race detection related to interrupts, we applied it to
the UART device driver on a preemptive kernel version of Fedora Core 2.6.15.
The driver includes two files, serial_core.c and 8250.c, containing 1896 and
1445 lines of non-comment code, respectively. The main application transmits
character strings to and receives character strings from the console via the
UART port. In this article we apply our testing process only to the UART
driver; however, the same process is also applicable to other types of device
drivers.
Our approach requires the use of existing test cases, so we generated test cases
for the system based on a code-coverage-based test adequacy criterion. After
SVs were identified (see the Test Configuration section), we generated a set
of test cases that cover the feasible SVs (SVs accessed by at least one
possible execution of the program) in the PuT. This process
produced 12 test cases.
To better assess the cost and effectiveness of our approach, we considered both
our approach and two alternative baseline approaches. In the discussion that
follows, we refer to our approach as the conditional controllability approach,
because it involves issuing controlled interrupts under certain conditions.
The second approach that we considered, no controllability, involves testing
the program without any controlled interrupts; this is the approach that test
engineers normally use. The third approach that we considered, random
controllability, involves issuing controlled interrupts at random program
locations after shared variable accesses and without checking interrupt
conditions.
applied, and for each test case, one extra run was needed to determine whether
all shared variables had been accessed. Thus, 96 test runs were required to finish
testing the target program with an average execution time of 77.91 seconds per
test run. Including self-generated interrupts, the number of interrupts generated
for the target program was 352. In the course of applying the approach, we
detected a race in function uart_write_room of /linux/drivers/serial/serial_core.c,
which we later determined had been corrected in subsequent versions of the
system. By running the system with observability turned off, we determined that
this fault can be detected only with observability enabled; in other words, it is
a fault that did not propagate to output on our particular test inputs, but may
propagate under a different test input.

We next tested our target program with no controllability. In this case, the only
interrupts that occur are self-generated interrupts. Because runs of each given
test can conceivably differ, we ran each test on the program 500 times. The total
number of interrupts observed was 16,500. Over the 6000 total test runs, average
execution time was 74.08 seconds per test, only 3.83 seconds less than with
controllability added. None of these test runs detected the race condition detected
by our first approach, however, either with observability enabled or disabled.
Finally, we tested our target program using the random controllability approach.
For each test case, we performed three times as many runs as under the
conditional controllability approach, on each run
generating an interrupt at a randomly selected program location. The total
number of test runs was 288 and the number of interrupts generated was 1044.
In this case the average execution time per test case was 75.17 seconds, only
2.74 seconds less than with controllability added. Again, the race was not
detected, either with observability enabled or disabled.
A second characteristic of our technique is that interrupts are issued only after
shared memory accesses, and this can be much less expensive than issuing
interrupts after each memory access, which is the approach used by Higashi
et al.[8] For our target program, there are 94,941 data accesses made in the
course of running the 12 test cases. If an interrupt were issued after each data
access, we would need 82.6 days to finish testing the target program.
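(At the observed average of roughly 75 seconds per run, one run per data
access works out to 94,941 × 75.17 s ≈ 7.1 million seconds, or about
82.6 days.)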
Given the faults thus seeded, we ran our test cases on the faulty systems using
conditional and random controllability, and in the case of race detection, with
observability enabled and disabled. For the race condition detection approach,
conditional controllability detected two of the five faults. One of these faults was
detected both with and without observability. The same fault was also detected
with random controllability, but only with observability enabled because in this
case the fault does not propagate to output. This occurred because interrupts
issued by conditional controllability visited more of the unprotected shared
variables that can cause incorrect output; these shared variables were not
visited by random controllability.
The second fault in our race detection trial was revealed not through
observability but through output, for both conditional controllability and
random controllability. This occurred because the fault was not actually
caused by our defined race condition but by another type of violation, an
atomicity violation, which we did not specify but can easily detect. In
particular, in this case, a read-write SV pair in the main program is supposed
to be atomic, but the ISR read this SV before it was updated in the main
program. This outcome shows that, while our approach does not specifically
target other types of faults, it may catch them as byproducts.

We also inspected the three potential race condition faults that were not
detected by any technique. We determined that the reason for their omission
was that the interrupt handler in each of the versions does not share variables
with the main program, so they are not races. This does not mean that the code
regions involved do not need to be protected, because other ISRs may share
memory locations, or programmers may intentionally cause the regions to
execute without interruption.
Further Discussion
Our observer module considers one type of definition of a race condition. In
practice, testers can adopt different definitions because there is not a single
general definition for the class of race conditions that occur between an ISR
and a PuT. According to our definition, we consider the case in which the
PuT first reads from or writes to an SV, and the ISR modifies this SV during
the next access to it. However, our definition does not capture the case in
which the ISR reads from the SV. As noted above, for the four faulty versions
in which the ISR does not modify the SV, we still found one fault with
controllability enabled. This fault is related to an atomicity violation: a
code region in the main program is supposed to execute atomically, but an
interrupt occurs before a shared variable is updated in the main program, and
the ISR reads the wrong data.
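As a sketch only, the definition our observer checks can be expressed as a
predicate over two consecutive accesses to the same SV; the event encoding as
(source, operation, SV) tuples is hypothetical.

def is_race(put_access, next_access):
    # True when the PuT reads or writes an SV and the ISR modifies
    # the same SV before the PuT's next access to it.
    src1, op1, sv1 = put_access
    src2, op2, sv2 = next_access
    return (src1 == "PuT" and src2 == "ISR"
            and sv1 == sv2 and op2 == "write")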
Our approach injects data into device ports and forces an interrupt handler to
execute one path. The data we inject is the same as the test input given to the
program. For example, if an application sends the string hello to the UART
console through the UART transmitter buffer, the controllability module would
inject world into the UART transmitter buffer to force an interrupt to occur
after a certain access. It is also possible for shared variables to be
reachable along multiple paths in interrupt handlers. Testers can extend our
method by forcing interrupt handlers to execute different paths, which may
increase the probability of revealing faults. However, in our target program
no faults are left undetected due to missing shared variables or spin locks in
the other paths of the ISR. It is also possible to force an interrupt handler
to execute only the paths that have definition-use relationships with the main
program. This may further reduce the number of controlled interrupts and test
runs. To do this, the value schedule approach proposed by Chen et al.[13]
could be adapted.
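A sketch of the injection step under these assumptions follows; write_uart_thr
is a placeholder for however the transmitter buffer register is reached in a
given model (device attribute, port I/O, or memory-mapped write).

def inject_and_interrupt(write_uart_thr, fire_uart_interrupt):
    # Deposit "world" in the UART transmitter buffer, then raise the
    # UART interrupt so the ISR runs right after the chosen access.
    for ch in "world":
        write_uart_thr(ord(ch))  # one byte per buffer slot
    fire_uart_interrupt()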
Our approach is not designed to test the entire system at once. Instead, it is
more suitable for component testing.
In our current work, the test generation process was performed manually,
which is currently the norm in practice. Our study considers a test input to
include input values and interrupt scheduling. However, there is no reason
the approach could not also utilize input values created using existing test
case generation approaches (such as dynamic symbolic execution[13]). A problem
with such approaches by themselves is that they generate large numbers of test
cases with no methodology for judging system correctness beyond looking for
crashes. Our approach provides more powerful, automated oracles, and thus
should ultimately facilitate the use of larger numbers of automatically
generated test cases.
Conclusion
We have introduced Sim-O/C, a framework that provides test engineers with
the ability to precisely control execution events and observe runtime context
at critical code locations. The framework is built on a commercial virtual
platform that is commonly used as part of the hardware/software co-design
process. The main benefit of using virtual platforms for testing is the
ability to interrupt execution without affecting the states of the virtualized
system. Moreover, precise process scheduling can be implemented by installing
a dummy device, which is a significant benefit of using virtual platforms.
Furthermore, we can monitor function calls, variable values, and system
states, as well as manipulate memory and buses directly to make typically
nondeterministic execution events more deterministic.
References
[1] Apache Software Foundation. Datarace on org.apache.
https://issues.apache.org/bugzilla/show_bug.cgi?id=37458.
[7] I. Jackson. IRQ handling race and spurious IIR read in 8250.c.
https://lkml.org/lkml/2009/3/12/379.
Author Biographies
Tingting Yu received her MS in Computer Science from University of
Nebraska-Lincoln in 2010, and a BE in Software Engineering from Sichuan
University in 2008. She is currently pursuing her PhD at the University of
Nebraska-Lincoln.
Witawas Srisa-an received his MS and PhD in Computer Science from Illinois
Institute of Technology. He joined University of Nebraska-Lincoln (UNL) in
2002 and is currently an Associate Professor in the Department of Computer
Science and Engineering. Prior to joining UNL, he was a researcher at Iowa
State University. His research interests include programming languages,
operating systems, virtual machines, and cyber security. His research projects
have been supported by NSF, AFOSR, Microsoft, and DARPA.