CPEN 311: Digital Systems Design Slide Set 19: High-Level Synthesis


CPEN 311: Digital Systems Design

Slide Set 19: High-Level Synthesis:

Forget VHDL and Verilog: Design Hardware by Writing Software!

2016/2017 Term 1

Instructor: Steve Wilton


[email protected]

Slide Set 19, Page 1


Learning Objectives
Learning objectives for this slide set:
1. To understand how high-level languages can be used to improve
programmer productivity
2. To understand the limitations and advantages of high-level synthesis
3. To have a high-level understanding of where this technology is today.
4. To understand some of the challenges that exist using HLS tools
5. To understand, at a high level, how high-level synthesis tools convert C to
hardware

The intention is not to give you enough information to design something using HLS

But, when you get out in the real world, you might have to make the call: for a given
project, should your company stick with current hardware-design tools, or should you
try this “new” approach? This slide set will give you a start at gathering information to
make this decision.

Slide Set 19, Page 2


Verilog / VHDL is difficult, and design is time-consuming.

Wouldn’t it be great if you could automatically convert software to hardware?

Slide Set 19, Page 3


High-Level Synthesis to the rescue!

[Figure: Software -> HLS -> Hardware (FPGA)]

Automatic software -> hardware compilation
- Converts C code to Verilog code
- Verilog can then be compiled using a normal synthesis tool (e.g. Quartus II)

HLS isn’t new, but now it’s in the spotlight


– Gives software developers access to hardware
– FPGA vendors are now heavily invested

Slide Set 19, Page 4


                     Software                     Hardware
Execution            Slower                       Faster
Power                Higher                       Lower
Development process  Simple                       Complex
                     • Sequential model           • Complete parallelism
                     • Abstracted memory          • FSMs, memories, etc.
                     • Timing abstracted away     • Cycle-by-cycle operation
                     • Hardware abstracted away   • Timing considerations
Development time     Rapid development            Long development time
Slide Set 19, Page 5


High-Level Synthesis: Why?
1. Software designers outnumber hardware designers by 10x
2. Allows hardware designers to be more productive
3. Simplifies mixed hardware/software design exploration

Ideally, software designers could design hardware without knowing anything about the hardware.

Challenges for software designers:
- Debugging
- Optimizing
- Interfacing with the real world

Today, it is typically sold as a productivity enhancer for hardware designers

Slide Set 19, Page 6


High-Level Synthesis (HLS): History

HLS has been used for a few decades:
1. Proprietary formats, used internally at IBM, NEC…
2. Release of general-purpose HLS tools:
   Impulse Accelerated Technologies (2002), Cadence C-to-Silicon (2008), Altera C2H (2009),
   Y Explorations eXCite (2002), Synopsys Synphony C, Behavioral Compiler…
3. FPGA vendors: Altera OpenCL, Xilinx Vivado HLS

“Xilinx removes the difference in programming models between a processor and an FPGA.”
- Xilinx Vivado User Guide

Slide Set 19, Page 7


An important distinction… these all kind of look like C

Verilog (Register-Transfer Level) Design
    Verilog kind of looks like C, but you need to think carefully about timing.
    - Leads to strict “synthesizable code” rules

vs.

Writing C software on an embedded processor (e.g. Lab 6)
    This truly is software: suffers the speed and power overhead of fetching
    instructions and a fixed datapath

vs.

High-Level Synthesis: Creating hardware from C
    Specification is C, but ideally, you get the benefits of a custom
    hardware design (speed, power)

Slide Set 19, Page 8

FPGA High-level synthesis: Current state of the art
Xilinx and Altera both offer HLS tools:
    Xilinx Vivado HLS
    Altera OpenCL SDK

Both built using the LLVM compiler framework
- Likely many of the same optimizations used in both tools

LLVM: Many people think this stands for “Low Level Virtual Machine”, but the designers
insist it is not an acronym. Used by many companies, including Apple, Nvidia, etc.

Slide Set 19, Page 9


Common problem with all HLS tools: straightforward sequential
software specifications don’t expose parallelism, and may not reflect
the best way to implement in hardware.

Ideally, an HLS tool would be able to mimic a hardware designer and restructure the code
(sometimes extensively) in order to build an optimum hardware architecture.

This is very difficult, and to get good results today, the designer has to either use a
predetermined framework to specify parallelism (e.g. Altera’s OpenCL compiler) or give
“hints” (e.g. Xilinx’s pragmas).
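
To make the point about restructuring concrete, here is a minimal sketch in plain C (not taken from the slides). The function names and the LANES constant are hypothetical, chosen only for illustration. The first function is the "straightforward sequential" form; the second is the kind of rewrite a hardware designer might do by hand to expose independent operations.

```c
#include <stddef.h>

/* Straightforward version: every addition depends on the previous one,
 * so the additions form one long dependency chain. */
int sum_sequential(const int *data, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* LANES is a hypothetical parameter chosen for this sketch. */
#define LANES 4

/* Restructured version: four independent partial sums.
 * The four additions in the inner loop do not depend on each other,
 * so hardware could perform them side by side. */
int sum_four_lanes(const int *data, size_t n)
{
    int partial[LANES] = {0, 0, 0, 0};
    size_t i = 0;

    for (; i + LANES <= n; i += LANES)
        for (size_t lane = 0; lane < LANES; lane++)
            partial[lane] += data[i + lane];

    /* Leftover elements when n is not a multiple of LANES. */
    for (; i < n; i++)
        partial[0] += data[i];

    /* Combine the partial sums as a small adder tree. */
    return (partial[0] + partial[1]) + (partial[2] + partial[3]);
}
```

Both functions compute the same result for integer data; only the dependency structure differs. Whether a given HLS tool would find the second form on its own is exactly the difficulty described above.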

Slide Set 19, Page 10


Xilinx Vivado HLS
• Spawned from research at UCLA, acquired by Xilinx in 2011
• Language support: C, C++, SystemC
• “Traditional HLS”: Generate hardware for one or more C functions and
their descendants
• User must integrate HLS-generated core into system

Slide Set 19, Page 11


Controlling HLS Algorithms

Use pragmas embedded within the code

Example: to turn on loop-pipelining:

// matrix multiplication: out_c = in_a * in_b
a_row_loop: for (int i = 0; i < A_ROWS; i++) {
  b_col_loop: for (int j = 0; j < B_COLS; j++) {
    sum_mult = 0;
    a_col_loop: for (int k = 0; k < A_COLS; k++) {
#pragma HLS pipeline II=1
      sum_mult += in_a[i][k] * in_b[k][j];
    }
    out_c[i][j] = sum_mult;
  }
}
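
The II in the pragma is the initiation interval: the number of clock cycles between the starts of successive loop iterations. A rough feel for what pipelining buys can be had from a simplified cycle-count model, sketched below. The function names and the notion of a fixed per-iteration "depth" are assumptions for illustration; the model ignores loop entry/exit overhead, memory port limits, and dependencies that may prevent the tool from reaching the requested II.

```c
/* Cycles for a loop that is NOT pipelined: each iteration runs to
 * completion (depth cycles) before the next one starts. */
unsigned long unpipelined_cycles(unsigned long trip_count,
                                 unsigned long depth)
{
    return trip_count * depth;
}

/* Cycles for a pipelined loop: a new iteration starts every ii cycles,
 * and the last iteration still needs depth cycles to drain. */
unsigned long pipelined_cycles(unsigned long trip_count,
                               unsigned long ii,
                               unsigned long depth)
{
    if (trip_count == 0)
        return 0;
    return depth + ii * (trip_count - 1);
}
```

Under this model, a 100-iteration loop whose body takes 4 cycles needs 400 cycles unpipelined and 103 cycles when pipelined with II=1.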

Slide Set 19, Page 12


Other Pragma Examples:

PIPELINE: pipeline a loop
UNROLL: unroll a loop
ARRAY_PARTITION: partition an array into multiple arrays for parallel access
ARRAY_MAP: map multiple arrays into a single array
INLINE: inline a function
LATENCY: set the scheduling latency
ALLOCATION: set the number of hardware instances of something
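
As an illustration of how two of these pragmas are typically combined, here is a minimal sketch (not from the slides). The function name dot_product, the array size N, and the factor of 4 are assumptions chosen for the example. An ordinary C compiler ignores the HLS pragmas (possibly with a warning), so the function can still be tested as software.

```c
#define N 8  /* hypothetical array length for this sketch */

int dot_product(const int a[N], const int b[N])
{
/* Split each array into 4 banks so that 4 elements can be
 * read in the same cycle. */
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4 dim=1
    int sum = 0;

    for (int i = 0; i < N; i++) {
/* Ask the tool for 4 copies of the loop body per iteration. */
#pragma HLS UNROLL factor=4
        sum += a[i] * b[i];
    }
    return sum;
}
```

The pairing matters: unrolling alone creates four multipliers, but without partitioning they would likely contend for the same memory ports. How the tool actually schedules this depends on the tool version and target device.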

Slide Set 19, Page 13


Interfaces to Synthesized Hardware:
To connect inputs and outputs to FIFOs:

void fir( volatile ap_uint<10> * x,
          volatile ap_uint<10> * y)
{
#pragma HLS INTERFACE ap_fifo depth=19 port=x
#pragma HLS INTERFACE ap_fifo depth=19 port=y
    // ... (function body not shown on the slide)
}

Can also specify that some parameters are to follow the AXI standard:
- Lite, Stream, Master

Can control block-level interfaces:

int adders(int in1, int in2, int in3) {
    return (in1+in2+in3);
}
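
The slide does not show which block-level pragma was intended, so the following is only a sketch under an assumption: it applies the Vivado HLS ap_ctrl_hs protocol to the function return, which requests start/done/idle/ready handshake ports on the generated block. Another option, ap_ctrl_none, requests no block-level handshake at all.

```c
int adders(int in1, int in2, int in3)
{
/* Block-level protocol is attached to the function return.
 * ap_ctrl_hs is an assumption here; the slide does not say
 * which protocol was used. */
#pragma HLS INTERFACE ap_ctrl_hs port=return
    return in1 + in2 + in3;
}
```

As software, the pragma has no effect and the function simply returns the sum of its arguments.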

Slide Set 19, Page 14


Altera OpenCL SDK
OpenCL – Open Computing Language
-- Widely used for GPU programming
-- Allows Altera to compete directly with GPUs as compute accelerators

Host program runs on an x86 processor, invokes “kernels” on compute devices
- Kernel: a C function describing work to be accelerated

Key concept: Each piece of work is “indexed” in an N-dimensional space
- Each point in space is called a “work item”
- e.g. vec2[i] = vec2[i] + vec1[i] // each add is a work item
- Kernel specifies computation for one work item

Slide Set 19, Page 15


Altera OpenCL SDK
General Approach:
1. Host allocates memory on device for kernel input/output
2. Transfers input data to device
3. Invokes kernel
4. Transfers output data back from device

__kernel void sum(const int size, __global float * vec1,
                  __global float * vec2) {
    int ii = get_global_id(0);
    if (ii < size)
        vec2[ii] += vec1[ii]; // handle one work-item
}
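
The four host steps and the work-item idea can be emulated in plain C without an OpenCL runtime. This is only a sketch: malloc and memcpy stand in for device allocation and transfers, a simple loop stands in for launching one work-item per index, and the names sum_work_item and run_sum_on_device are hypothetical. The comments name the OpenCL host calls that would normally play each role.

```c
#include <stdlib.h>
#include <string.h>

/* Body of the kernel for ONE work-item; gid stands in for
 * get_global_id(0). */
static void sum_work_item(size_t gid, int size,
                          const float *vec1, float *vec2)
{
    if ((int)gid < size)
        vec2[gid] += vec1[gid];
}

/* Emulated host flow. Returns 0 on success, -1 on allocation failure. */
int run_sum_on_device(int size, const float *host_vec1, float *host_vec2)
{
    if (size <= 0)
        return 0;

    size_t bytes = (size_t)size * sizeof(float);

    /* 1. Allocate "device" memory (real host code: clCreateBuffer). */
    float *dev_vec1 = malloc(bytes);
    float *dev_vec2 = malloc(bytes);
    if (dev_vec1 == NULL || dev_vec2 == NULL) {
        free(dev_vec1);
        free(dev_vec2);
        return -1;
    }

    /* 2. Transfer input data (clEnqueueWriteBuffer). */
    memcpy(dev_vec1, host_vec1, bytes);
    memcpy(dev_vec2, host_vec2, bytes);

    /* 3. Invoke the kernel, one call per work-item
     *    (clEnqueueNDRangeKernel with a 1-D global size). */
    for (size_t gid = 0; gid < (size_t)size; gid++)
        sum_work_item(gid, size, dev_vec1, dev_vec2);

    /* 4. Transfer output data back (clEnqueueReadBuffer). */
    memcpy(host_vec2, dev_vec2, bytes);

    free(dev_vec1);
    free(dev_vec2);
    return 0;
}
```

On a real device the work-items in step 3 are not run one after another by a loop; the runtime is free to execute them in parallel or stream them through a pipeline, which is why the kernel is written for a single index.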

Slide Set 19, Page 16


Kernel Implementation
Focus is on heavy pipelining and keeping hardware busy at all times

Pragmas to control:
Loop unrolling
Loop pipelining
Managing the amount of hardware resources used
Vectorize code (SIMD parallelism)
Optimize DRAM accesses

Slide Set 19, Page 17


Kernel-to-Kernel Computation
Consider a “streaming” scenario:
data input -> kernel1 -> kernel2 -> output

Normally, in OpenCL, host processor must manage the data transfers
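
A minimal plain-C sketch of the host-managed case is shown here. The two stages (scale by 2, then add 1) and all function names are hypothetical, and a caller-supplied scratch array stands in for the intermediate buffer that the host would have to allocate and pass between the two kernel launches.

```c
#include <stddef.h>

/* Hypothetical first stage: scale each sample by 2. */
static void kernel1(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2;
}

/* Hypothetical second stage: add an offset of 1. */
static void kernel2(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + 1;
}

/* Host-managed pipeline: kernel1 must finish the whole batch and
 * write it to scratch before the host can launch kernel2 on it. */
void run_pipeline_host_managed(const int *input, int *scratch,
                               int *output, size_t n)
{
    kernel1(input, scratch, n);
    kernel2(scratch, output, n);
}
```

The cost in this arrangement is the round trip through the intermediate buffer and the fact that the second stage sits idle until the first has finished the batch.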

Altera provides more efficient handling for this:

Slide Set 19, Page 18


Does this sound like something people who have never taken
CPEN 211 / CPEN 311 could really do?

Slide Set 19, Page 19


LegUp HLS
HLS framework that takes a C program as input, and compiles to
– 1) hardware alone, or…
– 2) a hybrid processor/accelerator system

– License for non-commercial research purposes
– Being commercialized by “LegUp Computing Inc.”

http://legup.eecg.toronto.edu

Supported             Unsupported
Functions             Dynamic Memory
Arrays, Structs       Recursion
Global Variables
Pointer Arithmetic
Floating Point
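
In practice, the "Unsupported" column means some C code has to be rewritten before it can be synthesized. As a small illustration (not from the slides, with hypothetical function names), a recursive function can often be turned into a loop that uses only the supported features:

```c
/* Recursive form: per the table above, recursion is unsupported. */
unsigned long factorial_recursive(unsigned int n)
{
    return (n <= 1) ? 1UL : n * factorial_recursive(n - 1);
}

/* Iterative rewrite: a plain loop with no recursion and no
 * dynamic memory. */
unsigned long factorial_iterative(unsigned int n)
{
    unsigned long result = 1;
    for (unsigned int i = 2; i <= n; i++)
        result *= i;
    return result;
}
```

The same idea applies to dynamic memory: buffers obtained from malloc would typically be replaced by fixed-size arrays whose bounds are known at compile time.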

Slide Set 19, Page 21


Support for Debugging
High-level synthesis sounds great but…
how does a software designer debug and optimize?

This is challenging:
1. Circuit looks nothing like the original software
   • Software is fundamentally different from hardware
   • Many transformations and optimizations performed during HLS
2. Debugging hardware is difficult
   • Limited observability within a chip

Without a suitable ecosystem for debug, hardware acceleration using FPGAs is doomed.

Slide Set 19, Page 22


Bugs in HLS systems

[Figure: design flow from software (C code) through HLS to simulation (HLS-generated RTL)
to hardware (HLS-generated hardware on the FPGA, alongside I/O devices and other hardware)]

1. Kernel-level bugs (Software)
   • Self-contained
   • Debug in isolation
   • Easy to reproduce
   -> Debug C code on workstation (gdb).

2. RTL-level bugs (Simulation)
   • C/RTL mismatch
   • HLS tool errors or usage errors
   -> Run C/RTL co-simulation on workstation.

3. System-level bugs (Hardware)
   • Bugs in interfaces
   • Dependent on:
     • I/O data patterns
     • Interaction timing
   • Hard to reproduce, or require long run times
   -> Debug on FPGA (requires observing internals of FPGA).
   These are the difficult bugs.

Slide Set 19, Page 23


FPGA vendors have provided tools that address the first two categories
of bugs (Xilinx has a framework called SDAccel).

Debugging System-Level bugs is still challenging

Slide Set 19, Page 24


Can We Use Hardware Debug Tools?
Embedded Logic Analyzer (ELA):

[Figure: RTL circuit -> your debug tool -> run]
Your debug tool:
- Chooses signals to trace
- Debug circuitry added

Designer is forced to debug using the hardware
1. Understanding the hardware is time-consuming
   • HLS optimizations mean the hardware looks nothing like the C code
2. Beyond the expertise of software developers

Slide Set 19, Page 25


On-Going Research in this area: Project at UBC
Framework for source-level, in-system HLS debugging
– Source-level: Using the original software.
– In-system: Circuit needs to execute in place, at speed.

[Figure: HLS-generated circuit with on-chip memory]
1. Execute and record (into on-chip memory)
2. Stop and retrieve
3. Debug using the recorded data
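
The record-then-retrieve idea can be sketched in software as a small circular trace buffer. This is a generic illustration, not the UBC framework itself: the struct layout, the function names, and the depth of 4 entries are all assumptions. Because on-chip memory is limited, only the most recent entries survive, which is why the debugger sees a window of execution rather than the whole run.

```c
#include <stddef.h>

#define TRACE_DEPTH 4  /* hypothetical on-chip memory depth */

struct trace_entry {
    int line;   /* source line that produced the update */
    int value;  /* value written at that point */
};

struct trace_buffer {
    struct trace_entry entries[TRACE_DEPTH];
    size_t next;   /* slot the next record will overwrite */
    size_t count;  /* number of valid entries, at most TRACE_DEPTH */
};

void trace_init(struct trace_buffer *tb)
{
    tb->next = 0;
    tb->count = 0;
}

/* 1. Execute and record: overwrite the oldest entry when full. */
void trace_record(struct trace_buffer *tb, int line, int value)
{
    tb->entries[tb->next].line = line;
    tb->entries[tb->next].value = value;
    tb->next = (tb->next + 1) % TRACE_DEPTH;
    if (tb->count < TRACE_DEPTH)
        tb->count++;
}

/* 2. Stop and retrieve: copy entries oldest-first into out,
 * which must hold TRACE_DEPTH entries. Returns the number copied. */
size_t trace_retrieve(const struct trace_buffer *tb,
                      struct trace_entry *out)
{
    size_t start = (tb->next + TRACE_DEPTH - tb->count) % TRACE_DEPTH;
    for (size_t i = 0; i < tb->count; i++)
        out[i] = tb->entries[(start + i) % TRACE_DEPTH];
    return tb->count;
}
```

Step 3, debugging with the recorded data, then amounts to replaying the retrieved entries against the original source lines.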

Slide Set 19, Page 26


[Figure: debugger screenshot showing the captured execution window, stepping through
source code, breakpoints, assembly view, instruction scheduling, and variable inspection]

Summary: High-Level Synthesis
Provides a mechanism for benefiting from hardware acceleration without
having to design hardware.

Are we there yet?
- Still need to design the entire system, including interfaces with legacy blocks
- Correct pragma settings need careful attention
- Often out of the comfort zone of software designers
- There is work on using machine learning to find the best pragma settings

Today, high-level synthesis may be best suited for increasing the productivity of hardware designers.

Slide Set 19, Page 28


Learning Objectives
Learning objectives for this slide set:
1. To understand how high-level languages can be used to improve
programmer productivity
2. To understand the limitations and advantages of high-level synthesis
3. To have a high-level understanding of where this technology is today.
4. To understand some of the challenges that exist using HLS tools
5. To understand, at a high level, how high-level synthesis tools convert C to
hardware

The intention is not to give you enough information to design something using HLS

But, when you get out in the real world, you might have to make the call: for a given
project, should your company stick with current hardware-design tools, or should you
try this “new” approach? This slide set will give you a start at gathering information to
make this decision.

Slide Set 19, Page 29
