
Embedded Deep Neural Network Processing
Algorithmic and processor techniques bring deep learning to IoT and edge devices

Marian Verhelst and Bert Moons

Deep learning has recently become immensely popular for image recognition, as well as for other recognition and pattern-matching tasks in, e.g., speech processing and natural language processing. The online evaluation of deep neural networks, however, comes with significant computational complexity, making it, until recently, feasible only on power-hungry server platforms in the cloud. In recent years, we see an emerging trend toward embedded processing of deep learning networks in edge devices: mobiles, wearables, and Internet of Things (IoT) nodes. This would enable us to analyze data locally in real time, which is not only favorable in terms of latency but also mitigates privacy issues. Yet evaluating the powerful but large deep neural networks with power budgets in the milliwatt or even microwatt range requires a significant improvement in processing energy efficiency.

To enable such efficient evaluation of deep neural networks, optimizations at both the algorithmic and hardware level are required. This article surveys such tightly interwoven hardware-software processing techniques for energy efficiency and shows how implementation-driven algorithmic innovations, together with customized yet flexible processing architectures, can be true game changers. To help readers fully understand the implementation challenges as well as the opportunities for deep neural network algorithms, we start by briefly summarizing the basic concept of deep neural networks.

The Birth of Deep Learning
Deep learning [1] can be traced back to neural networks, which have been around for many decades and were already gaining popularity in the early 1960s. A neural network is a brain-inspired computing system, typically trained through supervised learning, whereby a machine learns a generalized model from many training examples, enabling it to classify new items.
The trained classification model in such neural networks consists of several layers of neurons, wherein each neuron of one layer connects to each neuron of the next layer, as illustrated in Figure 1. The output of the network indicates the probability that a certain object class is observed at the network's input. In such a network, every individual neuron creates one output o, which is a weighted sum of its inputs i. For the nth neuron of layer l, this can be formalized as

    o_n^l = \sigma\Big( \sum_m w_{mn}^l \, i_m^l + b_n^l \Big).   (1)

The weights w_{mn}^l and biases b_n^l are the flexible parameters of the network that enable it to represent a particular desired input/output mapping for the targeted classification. They are trained with supervised training examples in an initial offline training phase, after which the network can classify new examples presented to its inputs, a process typically referred to as inference.

Figure 1: A traditional fully connected neural network is made up of layers of neurons. Every neuron makes a weighted sum of all its inputs, followed by a nonlinear transformation.
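As a concrete illustration of (1), the following C sketch evaluates one fully connected layer during inference. It is a minimal example under assumed conventions (row-major weight storage, a sigmoid as the nonlinearity σ, and illustrative identifier names), not the authors' implementation.

    #include <math.h>

    /* Sigmoid nonlinearity, used here as the activation sigma of (1). */
    static double sigmoid(double u) { return 1.0 / (1.0 + exp(-u)); }

    /* One fully connected layer: o[n] = sigma( sum_m w[n][m]*i[m] + b[n] ).
     * n_in inputs i[], n_out neurons; w is stored row-major (n_out x n_in). */
    void fc_layer(int n_in, int n_out, const double *i,
                  const double *w, const double *b, double *o)
    {
        for (int n = 0; n < n_out; n++) {
            double acc = b[n];                      /* start from the bias b[n]   */
            for (int m = 0; m < n_in; m++)
                acc += w[n * n_in + m] * i[m];      /* weighted sum of the inputs */
            o[n] = sigmoid(acc);                    /* nonlinear activation       */
        }
    }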
Such neural networks have been used for decades in several application domains. In a classical pattern-recognition pipeline [Figure 2(a)], features are generated from an input image by an application-specific feature extractor, hand-designed by an expert engineer. This preliminary feature extraction step was necessary because, at that time, one could use only small neural networks with a limited number of layers, which did not have the modeling capacity required for complex feature extraction from raw data. Larger neural networks were impossible to train due to nonconvergence issues, lack of sufficiently large data sets, and insufficient compute power.

Yet, after a long winter for neural networks in the 1970s and 1980s, they regained momentum in the 1990s and again in the 2010s. The increasing availability of powerful compute servers and graphics processing units (GPUs), the abundance of digital data sources, and innovations in training mechanisms allowed training deeper and deeper networks, with many layers of neurons. This meant the start of a new era for classification, as it allowed training networks with enough modeling capacity to operate directly on raw data [Figure 2(b)]. Such "deep learning networks" thus fulfilled the role of both feature extractor and classifier.

Figure 2: (a) Traditionally, machine learning classifiers were trained and applied on hand-crafted features. (b) The advent of deep learning allowed the network to learn and extract the optimal feature sets. (c) Such a network trains itself to extract very coarse, low-level features in its first layers, then finer, higher-level features in its intermediate layers, and, finally, targets full objects in the last layers. HOG: histogram of oriented gradients.


A deeper network can automatically learn the best possible features during its training phase, instead of relying on features hand-crafted by humans. When inspecting trained networks, one can see that a deep neural network trains itself to extract very coarse, low-level features in its first layers and finer, higher-level features in its intermediate layers and then targets full objects in the last layers [Figure 2(c)].

A network's ability to learn the most optimal features significantly boosted the classification accuracy of such networks, resulting in their true breakthrough: deep learning was born. Over the last decade, deep learning has, as such, been able to move to deeper and deeper network architectures, enabling tremendous improvements in achievable classification accuracy, as illustrated by the results from the yearly ImageNet challenge (Figure 3) [2].

Figure 3: The classification results of the ImageNet challenge have seen enormous boosts in accuracy since the appearance of deep learning submissions. (Data from [2].) ILSVRC: ImageNet Large-Scale Visual Recognition Challenge; AlexNet: a CNN named for Alex Krizhevsky; VGG: a network from the Visual Geometry Group at Oxford University; ResNet: Residual Net.
Deep Neural Network Topologies
Another crucial factor in the breakthrough of deep learning technology is the advent of new network topologies. Classical neural networks, which rely on so-called fully connected layers, with each neuron of one layer connected to each neuron of the next layer (Figure 1), suffer from a very large number of training parameters. For a network with L layers of N neurons each, L·(N² + N) parameters must be trained. Knowing that N can easily reach the order of a million (e.g., for images with a million pixels), this large parameter set becomes impractical and untrainable.

For many tasks (mainly in image processing and computer vision), convolutional neural networks (CNNs) are more efficient. These CNNs, inspired by visual neuroscience, organize the data in every network layer as three-dimensional (3-D) tensors. The first part of the network consists of a sequence of convolutional layers and pooling layers, replacing the traditional fully connected layers. A convolutional layer transforms a 3-D input tensor I (of size H × H × C) into a 3-D output tensor O (of size M × M × F).

Figure 4: The topology and pseudocode of one layer of a typical CNN. The pseudocode covers one layer of the network. MACs: multiply-accumulate operations.

    for (int f = 0; f < F; f++)
      for (int mx = 0; mx < M; mx++)
        for (int my = 0; my < M; my++)
          for (int c = 0; c < C; c++)
            for (int kx = 0; kx < K; kx++)
              for (int ky = 0; ky < K; ky++)
                o[f][mx][my] += w[f][c][kx][ky] * i[c][mx + kx][my + ky];

    Per output pixel of a layer: load C·K² weights, load C·K² inputs, do C·K² MACs, and store one output. Repeat F·M² times per layer.


As illustrated in Figure 4, each element of the output tensor O does not need all elements of the input tensor I to be computed. Instead, it is connected only locally to a patch of the input tensor of size K × K × C through a trainable 3-D kernel W (of size K × K × C) and a bias B. A formal mathematical description to compute the outputs of a convolutional layer l is given as

    O_{fxy}^l = \sum_{c=0}^{C} \sum_{i=0}^{K} \sum_{j=0}^{K} I_{c(x+i)(y+j)}^l \, W_{fcij}^l + B_f^l .

The result of the local sum computed in this filter bank is then passed through a nonlinearity layer, typically a rectified linear unit (ReLU), using the nonlinear activation function f(u) = max(0, u). This output can finally be processed by a max-pooling layer, which outputs only the maximum of a local patch (typically 2 × 2 or 3 × 3) of output units to the next layer. This thereby reduces the dimension of the feature representation and creates invariance to small shifts and distortions in the inputs.
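To make the convolutional layer above concrete, the following C sketch computes one layer as the triple sum of the equation, followed by the ReLU nonlinearity. The flat array layout, the absence of padding and striding, and all identifier names are illustrative assumptions of this sketch, not the implementation of any particular accelerator.

    /* One convolutional layer: input I of size C x H x H, F kernels W of size
     * C x K x K with biases B, output O of size F x M x M, with M = H - K + 1
     * (no padding, unit stride). ReLU is applied to every output element. */
    void conv_layer(int C, int H, int K, int F,
                    const float *I,   /* I[c][y][x],      size C*H*H   */
                    const float *W,   /* W[f][c][ky][kx], size F*C*K*K */
                    const float *B,   /* B[f],            size F       */
                    float *O)         /* O[f][y][x],      size F*M*M   */
    {
        int M = H - K + 1;
        for (int f = 0; f < F; f++)
            for (int y = 0; y < M; y++)
                for (int x = 0; x < M; x++) {
                    float acc = B[f];
                    for (int c = 0; c < C; c++)
                        for (int ky = 0; ky < K; ky++)
                            for (int kx = 0; kx < K; kx++)
                                acc += I[(c*H + (y+ky))*H + (x+kx)]
                                     * W[((f*C + c)*K + ky)*K + kx];
                    /* ReLU nonlinearity f(u) = max(0, u) */
                    O[(f*M + y)*M + x] = acc > 0.0f ? acc : 0.0f;
                }
    }

A subsequent 2 × 2 max-pooling stage would then simply replace each nonoverlapping 2 × 2 patch of O by its maximum.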
A modern CNN consists of tens [3] to hundreds [4] of such alternating convolutional and max-pooling layers, typically followed by one to three classification layers, implemented using the traditional fully connected neurons (Figure 4).

It is important to note that the same convolution kernel W and bias B are used to compute all M × M outputs of one slice in the output tensor. As such, every layer of the network needs only F × (K × K × C + 1) parameters. With K typically ranging between one and seven and F and C on the order of tens or hundreds, this method allows the creation of very large networks while keeping the number of trainable parameters under control, all of which gave deep learning its significant boost.

The majority of recent state-of-the-art deep learning networks rely on such CNNs. The optimal network architecture, characterized by the number of cascading stages [3] and the values of the model parameters F, H, C, K, and M, varies for each specific application. Over the last few years, various alterations have been proposed to this standard topology, such as, e.g., introducing feed-through connections in ResNets [4], concatenating very small convolutions in inception networks [5], stacking depthwise and pointwise convolutions in Xception networks [6], extracting full-image dense multiscale features using DenseNets [7], or adding recurrent connections in RNNs or long short-term memories [8]. These, however, lie beyond the scope of this tutorial.
Challenges for Embedded Deep Inference
Both the training of a deep network and its own inferences to perform new classifications are now typically executed on power-hungry servers and GPUs [Figure 5(a)]. There is, however, a strong demand to move the inference step, in particular, out of the cloud and into mobiles and wearables to improve latency and privacy issues [Figure 5(b)]. However, current devices lack the capabilities to enable deep inferences for real-life applications.

Figure 5: Concerns regarding user privacy, recognition latency, and energy wasted on raw data transmission push deep learning inferences from (a) the cloud to (b) the embedded device. Tx/Rx: transmitter/receiver; uP: microprocessor.


Recent neural networks for image or speech processing easily require more than 100 giga-operations (GOP)/s to 1 tera-operation (TOP)/s, as well as the ability to fetch millions of network parameters (kernel weights and biases) per network evaluation. The energy consumed in these numerous operations and data fetches is the main bottleneck for embedded inference in energy-scarce milliwatt or microwatt devices. Currently, microcontrollers and embedded GPUs are limited to efficiencies of a few tens to hundreds of GOP/W, while embedded inference will only be fully enabled with efficiencies well beyond 1 TOP/W. Overcoming this bottleneck is possible yet requires a tight interplay between algorithmic optimization (modifying the network topology) and hardware optimization (modifying the processing architectures).
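As a rough, purely illustrative calculation (the layer dimensions below are hypothetical and not taken from a specific network): following the pseudocode of Figure 4, one convolutional layer performs F·M²·C·K² multiply-accumulate operations. For F = C = 128, M = 56, and K = 3, that is 128 · 56² · 128 · 9 ≈ 0.46 billion MACs, or roughly 0.9 GOP, for a single layer. A network with tens of such layers evaluated at tens of frames per second therefore quickly lands in the 100-GOP/s to 1-TOP/s range quoted above.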
The following section elaborates on the most promising optimizations currently being explored toward energy-efficient, embedded deep inference. The focus here is on the energy-efficient execution of convolutional layers, which form the bulk of the workload during inference. However, several techniques can also be applied to fully connected layers.

Algorithmic and Architectural Techniques for Energy Efficiency
GPUs and central processing units (CPUs) are extremely flexible, general-purpose machines. While this makes them widely deployable and easy to use and program, it also limits their efficiency because they cannot exploit several computational aspects of deep inference networks, resulting in both a memory bottleneck and a computational bottleneck. More specifically, deep inference networks have three typical characteristics that can be exploited, or further enhanced, to improve execution energy efficiency:
1) Deep learning networks exhibit a very particular data flow with a large amount of potential parallelism and data reuse. This can, moreover, be manipulated during network training by playing with the F, H, C, K, and M parameters of the network.
2) Deep learning networks prove to be quite robust to approximations or fault introductions. This is exploited in various reduced-precision hardware implementations. Also, this characteristic can be manipulated when training the network, allowing it to find the best tradeoff between a low complexity and a robust network.
3) Deep learning networks demonstrate large sparsity. Many parameters become very small, even equal to zero, after network training. Also, many data values propagated through the network during evaluation become zero. This can be exploited to reduce operations and memory fetches in hardware yet can also be stimulated further with innovative training techniques.
We will show how, for each of these three aspects, hardware can benefit from the network's characteristics but also how, during the algorithmic training phase of the network, it is possible to additionally optimize the particular characteristic to reach even greater efficiency gains. As such, it is clear that the hardware and algorithmic levels need to closely cooperate not only to exploit but also to enhance the network's characteristics toward the most efficient hardware-software realization. All of the techniques highlighted in this article are summarized in Figure 6.

Figure 6: An overview of the algorithmic and processor architecture techniques discussed to increase efficiency and enable the inference of deep neural networks in embedded devices. The tightly linked algorithmic and processor-architecture techniques are grouped as follows. (A) Enhancing and exploiting network structure: spatial data reuse and hierarchical memories exploiting data locality (memory bottleneck); highly parallel architectures and distributed processing (computational bottleneck). (B) Enhancing and exploiting fault tolerance: quantized training and stochastic memories; (dynamic) fixed point and analog and statistical processing. (C) Enhancing and exploiting network sparsity: network pruning and network compression with weight sharing; memory and computational gating and compressed computing.

Enhancing and Exploiting Network Structure
In many application areas, designers have improved the energy efficiency of embedded network evaluation by moving away from general-purpose processors and developing customized hardware accelerators. Such accelerators can exploit the known data flows within the algorithm to 1) enhance the parallel execution of the algorithm as well as 2) minimize the number of data movements (Figure 7). Descriptions of several application-specific integrated circuits targeting the efficient execution of convolutional and fully connected layers have recently been published. All solutions exhibit a very large degree of parallelization, far beyond CPU parallelism. This easily demonstrates itself in a data path containing a few hundred to thousands of multiply accumulators (MACs), with Google's recent tensor processing unit as an extreme example (64,000 MACs) [9].


Providing data to all these functional units in parallel would be nearly impossible if the temporal and spatial locality of the data were not exploited. Indeed, many computations within one network layer share common inputs. More specifically, as highlighted in the pseudocode shown in Figure 4, every weight parameter is reused approximately M² times across multiple convolutions of the same slice in the output tensor, and every input data point is reused across F different slices of the output tensor. Moreover, the intermediate accumulation results o have to be accumulated C·K² times. This can, in a custom accelerator, be exploited in several ways to further boost efficiencies beyond the highly parallel, yet not data-flow-optimized, GPUs.

Figure 7: Custom deep neural network processors gain efficiency by minimizing data movements and maximizing parallelism. Still, it is crucial not to lose all flexibility in mapping a wide variety of networks. FSM: finite state machine.

Data reuse can be exploited by reusing the same data across multiple parallel execution units or, equivalently, across multiple time steps on the same execution unit. In this topology, three extreme cases can be distinguished, as shown in Figure 8. The first multiplies the same input data value with several weights of a layer's different output channels. This is also called weight parallel or input stationary. In this implementation, every input will ideally be loaded into the system only once. This, however, has negative repercussions on the weight memory bandwidth, as the weights must be reloaded frequently (every time a new input is applied). Moreover, the accumulation of the output o cannot be performed across different clock cycles, requiring intermediate accumulation results o to be pushed into memory and refetched later, strongly impacting the input/output memory bandwidth. A similar scheme fetches every weight once and multiplies it with many input values. This "weight stationarity" or "input parallelism" improves the weight memory bandwidth, yet at the expense of the input memory bandwidth. Finally, the output stationary scheme reloads new weights and inputs every single clock cycle and yet is able to accumulate the intermediate results locally within the MAC unit across different clock cycles, to the benefit of the output memory bandwidth.
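In software terms, these cases correspond to different orderings of the loop nest of Figure 4. The C sketch below contrasts an output-stationary ordering (the accumulator stays local while weights and inputs stream by) with a weight-stationary ordering (each weight is fetched once and applied to every output it contributes to). The index macros, the flat array layout, and the omission of biases are illustrative assumptions of this sketch.

    /* Index helpers; they expand against the parameters (w, in, out, C, H, K, M)
     * of the enclosing functions. The layouts are illustrative. */
    #define IW(f,c,ky,kx)  w[(((f)*C + (c))*K + (ky))*K + (kx)]
    #define II(c,y,x)      in[((c)*H + (y))*H + (x)]
    #define IO(f,y,x)      out[((f)*M + (y))*M + (x)]

    /* Output stationary: the partial sum of one output pixel lives in a local
     * accumulator (a register in hardware) until it is complete; weights and
     * inputs are refetched every iteration, but each output is stored once. */
    void conv_output_stationary(int F, int C, int H, int K, int M,
                                const float *in, const float *w, float *out)
    {
        for (int f = 0; f < F; f++)
          for (int y = 0; y < M; y++)
            for (int x = 0; x < M; x++) {
              float acc = 0.0f;                               /* stays local          */
              for (int c = 0; c < C; c++)
                for (int ky = 0; ky < K; ky++)
                  for (int kx = 0; kx < K; kx++)
                    acc += IW(f,c,ky,kx) * II(c, y+ky, x+kx);
              IO(f,y,x) = acc;                                /* one store per output */
            }
    }

    /* Weight stationary (input parallel): each weight is loaded exactly once and
     * applied to all outputs it touches; partial sums are read, updated, and
     * written back, loading the output memory bandwidth instead. */
    void conv_weight_stationary(int F, int C, int H, int K, int M,
                                const float *in, const float *w, float *out)
    {
        for (int k = 0; k < F*M*M; k++) out[k] = 0.0f;        /* clear partial sums   */
        for (int f = 0; f < F; f++)
          for (int c = 0; c < C; c++)
            for (int ky = 0; ky < K; ky++)
              for (int kx = 0; kx < K; kx++) {
                float wf = IW(f,c,ky,kx);                     /* fetched once         */
                for (int y = 0; y < M; y++)
                  for (int x = 0; x < M; x++)
                    IO(f,y,x) += wf * II(c, y+ky, x+kx);      /* read-modify-write    */
              }
    }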
Figure 8: Different architectural topologies allow data reuse to be maximized, reusing either inputs, weights, intermediate results, or a combination of the three. BW: bandwidth.

                 Input stationary     Weight stationary    Output        Hybrids
                 (weight parallel)    (input parallel)     stationary
    Input BW     Low                  High                 High          Medium
    Weight BW    High                 Low                  High          Medium
    Output BW    High                 High                 Low           Medium

All these optimizations can be seen as a reshuffling of the nested loops in the pseudocode of Figure 4. Of course, in practice, most realizations implement a hybrid form of the three presented extreme cases. Examples include [23] and [24], where a two-dimensional (2-D) data path multiplies every input with several weights, while every weight is also multiplied with several inputs, and [10], where the input and output stationarities are optimized to minimize the chip input/output bandwidth. Which parallelization scheme is optimal depends strongly on the network's dimensions; the parameters F, H, C, K, and M, which allow cooptimization of the hardware; and the network itself. A more elaborate overview of the different parallelization schemes can be found in [11] and [12], along with an assessment of their merits.


A complementary way to reduce the energy burden of continuous data fetches is not to minimize the number of data fetches but rather to reduce the energy cost of every data fetch by exploiting temporal data locality. Most realistic deep networks require so much weight and input/output memory (megabytes to gigabytes) that it is impossible to fit them in on-chip memory, thus requiring fetches from energy-costly external dynamic random-access memory (DRAM). Similar to traditional processors, this can, however, be mitigated by a memory hierarchy having one or more levels of on-chip static RAM (SRAM) or register files. Frequently accessed data can, as such, be stored locally to reduce its fetching cost (Figure 9).

Figure 9: A well-designed memory hierarchy avoids drawing all weights and input data from the costly DRAM interface and stores frequently accessed data locally. pJ: picojoule.

An important difference with general-purpose solutions, however, is that the sizes of the memories in the hierarchy can be optimized toward the network's structure, e.g., foreseeing a local memory capable of caching exactly one weight tensor or a slice of it [11]. Even more importantly, the networks can be trained with the processor's memory hierarchy in mind. As such, networks have, e.g., been explicitly trained to completely fit in on-chip memory. This optimization is, of course, highly interwoven with the parallelization scheme. By jointly optimizing these, one can adjust the degree of parallelization to the memory hierarchy and minimize the product of the number of memory accesses with the cost of every memory access [13].
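A minimal sketch of such a hierarchy-aware mapping, under assumed buffer sizes and names: the C·K·K weights of one output channel are copied once into a small local buffer (standing in for an on-chip SRAM) and then reused for all M² output pixels, so the costly external memory is touched once per weight instead of once per use.

    #include <string.h>

    #define K_MAX 7
    #define C_MAX 512   /* the sketch assumes C <= C_MAX and K <= K_MAX */

    /* Convolution with a per-output-channel weight tile held in a local buffer. */
    void conv_layer_tiled(int F, int C, int H, int K, int M,
                          const float *in,      /* in[c][y][x] in external DRAM     */
                          const float *w_dram,  /* w[f][c][ky][kx] in external DRAM */
                          float *out)           /* out[f][y][x]                     */
    {
        static float w_tile[C_MAX * K_MAX * K_MAX];  /* "on-chip" weight buffer */

        for (int f = 0; f < F; f++) {
            /* One costly transfer of this output channel's C*K*K weights. */
            memcpy(w_tile, &w_dram[f * C * K * K], sizeof(float) * C * K * K);

            for (int y = 0; y < M; y++)
                for (int x = 0; x < M; x++) {
                    float acc = 0.0f;
                    for (int c = 0; c < C; c++)
                        for (int ky = 0; ky < K; ky++)
                            for (int kx = 0; kx < K; kx++)
                                acc += w_tile[(c*K + ky)*K + kx]
                                     * in[(c*H + (y+ky))*H + (x+kx)];
                    out[(f*M + y)*M + x] = acc;
                }
        }
    }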
Distributed and systolic processing can be seen as an extreme type of such hierarchical memories. In the systolic processing concept, a 2-D array of functional units processes data locally and passes inputs and intermediate results from unit to unit instead of to/from global memory. These functional units are each equipped with a very small SRAM (as in [14]) or even just registers (as in [9]) to store data locally and maximize data reuse within the array. Processing happens as a systolic wavefront through the array, wherein weight coefficients can be kept stationary in the functional units, input data are shifted in one direction through the array, and output data accumulate in the orthogonal direction. This allows a very large number of computations for convolution or matrix multiplication to be performed in parallel, keeping all systolic elements busy without burdening the memory bandwidth. Interested readers are pointed to [15] and [9] for more details.

Such systolic operation opens the door to in-memory computing, where the computation is integrated inside the memory array. While this is also pursued in traditional memory architectures, the results look especially promising for emerging nonvolatile memory arrays. For example, in resistive memory technologies, a multiplication can be implemented by exploiting the memory cell's conductance as the kernel weight, while accumulating current from different elements implements the convolution's accumulation operation [16]. However, this technology currently still suffers from large variability, limiting applications to very low-resolution operations with very limited kernel and network sizes.

While all the aforementioned techniques can dramatically boost the system's throughput and energy efficiency, it is important to keep an eye on their impact on the design's programmability and flexibility. Especially in the fast-paced area of deep learning, it is of the utmost importance to maintain sufficient flexibility toward alternative network dimensions and novel network topologies. Most accelerators succeed in this by enabling the acceleration of matrix multiplications (for the fully connected layers) and convolutions (for the convolutional layers) of any size, yet with maximal efficiency for only a subset of sizes.

Enhancing and Exploiting Fault Tolerance
A second important aspect of deep neural networks that can be exploited in custom processor designs is their fault tolerance. Many studies observe the robustness of CNNs and other networks to perturbations on their weight parameters and intermediate computational results [17], [18]. This can be exploited both at the hardware and at the algorithmic level in several ways.


A straightforward way to benefit from the network's fault tolerance is to perform the computations at reduced computational accuracy with limited recognition loss. Typical benchmarks can be run at a 1–9-b fixed point rather than a 32-b floating point at lower than 1% accuracy loss [18]. This is possible by quantizing all weights of a floating-point-trained network before execution. Improved results can be obtained when introducing quantization during the training step itself [19], [38], resulting in smaller or lower-precision networks for the same application accuracy. As an extreme example, networks have been specifically trained to operate with only 1-b representations of the weights alone [20] as well as of both weights and activations [20], [21], wherein all multiplications can be replaced by efficient XNOR operations [22]. In [20], a binary-weight network trained on ImageNet is only 2.9% less accurate (in top-1 accuracy) than the full-precision AlexNet [3].
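For the fully binarized case, the following C sketch shows how a dot product over 32 binarized values (+1/−1, encoded as bits 1/0) reduces to an XNOR followed by a population count. The packing convention and the compiler builtin used for the popcount are assumptions of this sketch.

    #include <stdint.h>

    /* Dot product of two vectors of 32 binarized values (+1/-1), each packed into
     * one 32-b word with bit = 1 meaning +1 and bit = 0 meaning -1. Matching bits
     * contribute +1 and differing bits -1, so
     * dot = 32 - 2 * popcount(a XOR b), i.e., an XNOR plus a popcount. */
    static inline int binary_dot32(uint32_t w_bits, uint32_t x_bits)
    {
        uint32_t differ = w_bits ^ x_bits;                /* complement of the XNOR */
        return 32 - 2 * (int)__builtin_popcount(differ);  /* GCC/Clang builtin      */
    }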
This observation can lead to major energy savings, as current CPU and GPU architectures operate using 32-b or 16-b floating-point number formats. Reducing precision from 32-b floating point to low precision not only reduces computational energy but also minimizes the storage and data-fetching cost needed for network weights and intermediate results. Moreover, for very low bit widths, this even allows the multipliers that combine several data values with a common weight factor to be replaced by preloaded lookup tables [10]. As a result, all custom CNN accelerators operate in fixed point. While most processors operate at constant 16-, 12-, or 8-b word lengths, some recent implementations support variable word-length computations, wherein the processor can change the used computational precision from operation to operation [23], [10], [24]. This accommodates the observation that the optimal word length for a deep network strongly varies from application to application and is even shown to differ across the various layers of a single deep network [18] [Figure 10(a)].
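A minimal sketch of such post-training quantization and the resulting integer arithmetic, using a simple symmetric 8-b scheme with one scale factor per layer; the rounding rule, the clipping range, and the assumption that the bias is pre-quantized at the product scale are illustrative choices, not the scheme of any specific chip.

    #include <stdint.h>
    #include <math.h>

    /* Symmetric linear quantization of a trained floating-point weight to 8 b:
     * q = round(w / scale), clipped to [-127, 127]; one scale per layer. */
    static int8_t quantize_int8(float w, float scale)
    {
        long q = lrintf(w / scale);
        if (q >  127) q =  127;
        if (q < -127) q = -127;
        return (int8_t)q;
    }

    /* Weighted sum of (1) evaluated entirely in integer arithmetic. The 32-b
     * accumulator absorbs the growth of the 8-b x 8-b products; bias_q is assumed
     * to be pre-quantized at scale w_scale * x_scale. The result is rescaled to a
     * real value only at the end (or kept in fixed point for the next layer). */
    static float fixed_point_neuron(int n_in, const int8_t *wq, const int8_t *xq,
                                    int32_t bias_q, float w_scale, float x_scale)
    {
        int32_t acc = bias_q;
        for (int m = 0; m < n_in; m++)
            acc += (int32_t)wq[m] * (int32_t)xq[m];   /* 8-b x 8-b -> 32-b MAC */
        return (float)acc * w_scale * x_scale;
    }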

Energy-efficient variable-resolution processors have been realized using a technique termed dynamic voltage-accuracy-frequency scaling [25] to jointly reduce the switching activity, supply voltage, and parallelization scheme when the computational resolution drops [Figure 10(c)]. This results in a scaling of the system's energy consumption that is superlinear with the computational resolution [Figure 10(b)], thus allowing every network layer to run at its own minimal energy point. Reduced bit-width implementations all exploit the deep network's tolerance to faults in a deterministic way.

Figure 10: (a) When quantizing all weight and data values in a floating-point AlexNet uniformly, the network can run at 9-b precision. Lower precision can be achieved without significant classification accuracy loss by running every layer at its own optimal precision. This allows (b) saving power as a function of computational precision and (c) building multipliers whose energy consumption scales drastically with computational precision, through a reduced activity factor and critical path length.

Another school of thought targets energy savings through tolerating nondeterministic statistical errors. This can be accomplished by executing the convolutional kernels in the noisy analog domain [26]. Alternatively, in the digital domain, stochastic fault tolerance can be exploited by operating the circuits [27] and/or memory [28] in the energy-efficient near-threshold regions. In this region, circuit delays as well as memory failures suffer from large variation. Yet the networks can tolerate such stochastic behavior up to a certain limit. Such circuits are combined with circuit monitors that constantly assess and control the circuit's fault rate [28].


Finally, the operational circumstances can strongly influence the network's tolerance to approximations. In a given classification application, the quality of the inputs might change dynamically, or some classes might be easier to observe than others. If one tries to train one common network that performs acceptably under all possible circumstances and classes, a large, complex, energy-hungry network topology would be needed. Recent work, however, promotes the training of hierarchical or staged networks [29] that perform classifications in several optional stages. At each stage, only a few layers of the network are executed, after which a classification layer tries to guess the class from the current outputs. Additional network layers and classifiers are run only if the obtained probabilities are not outspoken enough, until a classification with distinct probabilities is obtained. Such dynamic evaluations can be performed on any hardware platform but, again, benefit significantly from implementation-aware training techniques or topology-optimized implementations. Inference on the ImageNet data set [29] required up to 2.6 times fewer operations than state-of-the-art networks at equivalent accuracy.
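In software terms, such a staged evaluation is a confidence test between stages. The sketch below shows only the control flow; run_stage and top_probability are hypothetical placeholders for the per-stage network evaluation and its softmax output, and the threshold is an assumed tuning parameter.

    /* Hypothetical per-stage hooks: run a few more layers plus an auxiliary
     * classifier and return its best class; fill class_probs with its softmax. */
    int   run_stage(int stage, const float *image, float *class_probs);
    float top_probability(const float *class_probs);

    /* Staged (early-exit) inference: stop as soon as one auxiliary classifier is
     * confident enough, so easy inputs never pay for the full network. */
    int classify_staged(const float *image, int num_stages, float confidence_threshold)
    {
        float probs[1000];      /* sized for up to 1,000 classes, as in ImageNet */
        int best = -1;
        for (int s = 0; s < num_stages; s++) {
            best = run_stage(s, image, probs);
            if (top_probability(probs) >= confidence_threshold)
                break;          /* probabilities distinct enough: exit early */
        }
        return best;            /* class decision of the last stage that ran */
    }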
Enhancing and Exploiting Sparsity
Deep neural networks exhibit extreme sparsity, i.e., many of the weight values, as well as intermediate data values, are zero. Figure 11(a) shows the sparsity of an AlexNet as a function of the fixed-point word length used within the network. As can be seen, even for large word lengths, more than 70% of the activations are zero. At reduced bit-width computations, many weight values are also quantized to zero. This opens up many opportunities.

On the hardware side, this can be exploited by preventing any MAC with a zero-valued input [see Figure 11(b)], by not even fetching zero-valued data from memory, and by strongly compressing the on-/off-chip data stream using, e.g., Huffman or other types of encoding. Several hardware implementations exploit these CNN characteristics. The authors of [24] and [11] skip all unnecessary sparse operations by gating the inputs to their arithmetic units if the input data is zero, as a multiply-accumulate with zero does not change the internal accumulation result. Both implementations also compress off-chip data streams, either through run-length encoding [14] or through a simplified Huffman scheme [23]. The architectures presented in [30] and [31], on the other hand, allow speeding up sparse network evaluations by only scheduling nonzero operations for execution, improving computational throughput by up to 1.52 and 5.2 times, respectively.
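The gating idea translates directly into a guarded multiply-accumulate, and the off-chip compression into a simple zero run-length code. The C sketch below shows both; the encoding format is chosen purely for illustration and does not follow any of the cited chips.

    #include <stdint.h>
    #include <stddef.h>

    /* Zero-guarded MAC: a multiply-accumulate with a zero operand cannot change
     * the accumulator, so it is skipped (in hardware, the multiplier inputs are
     * gated and the corresponding weight fetch is suppressed). */
    static inline float mac_skip_zero(float acc, float in, float w)
    {
        if (in == 0.0f || w == 0.0f)
            return acc;
        return acc + in * w;
    }

    /* Toy zero run-length encoder for a sparse activation stream: each stored
     * value is preceded by the number of zeros skipped before it (runs longer
     * than 255 are broken by emitting an explicit zero value). */
    size_t rle_encode(const float *x, size_t n, uint8_t *zero_runs, float *values)
    {
        size_t out = 0;
        uint8_t run = 0;
        for (size_t k = 0; k < n; k++) {
            if (x[k] == 0.0f && run < 255) { run++; continue; }
            zero_runs[out] = run;
            values[out] = x[k];
            out++;
            run = 0;
        }
        return out;   /* number of (zero run, value) pairs produced */
    }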

Figure 11: (a) The sparsity of the input and weight values of a typical network as a function of the computational precision at which the network is evaluated. (b) This sparsity allows energy to be saved in the processor's input/output interface, on-chip memories, and data path.

More powerful opportunities arise, again, when the hardware and algorithmic planes are jointly involved. Deep network training algorithms can be modified to enhance the network's sparsity by iteratively pruning the smallest weight values (quantizing them to zero) and retraining the network [32]. Going one step further, energy-aware pruning techniques even take the energy consumption model of the hardware into account and start pruning the layers that consume the most energy, to maximize pruning efficacy [33]. This easily allows the pruning of 70–90% of the weights and saves up to 70% of energy consumption.
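A minimal sketch of one magnitude-based pruning pass, as used in the iterative prune-and-retrain loop described above; the threshold choice and the in-place zeroing are illustrative simplifications, and the retraining step itself is not shown.

    #include <math.h>
    #include <stddef.h>

    /* One pruning pass: set every weight whose magnitude falls below the given
     * threshold to exactly zero and report the resulting sparsity. The network
     * is then retrained with the zeroed weights kept frozen at zero. */
    float prune_by_magnitude(float *w, size_t n, float threshold)
    {
        size_t zeros = 0;
        for (size_t k = 0; k < n; k++) {
            if (fabsf(w[k]) < threshold)
                w[k] = 0.0f;
            if (w[k] == 0.0f)
                zeros++;
        }
        return (float)zeros / (float)n;   /* fraction of pruned (zero) weights */
    }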


Interestingly enough, networks have more compression capabilities beyond simply that of pruning low-valued weights. After pruning and quantizing a network, it turns out that the resulting weight values are highly clustered. This allows, e.g., the clustering of 8-b weights into only 16 (2^4) different weight clusters, each of which can share a common weight value expressed by a 4-b label. For every weight, only the 4-b label is stored, and these labels are expanded online to their original 8-b values using a small embedded lookup table.
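A minimal sketch of that lookup-based expansion: 4-b cluster labels, packed two per byte, are decoded on the fly into their shared 8-b weight values through a 16-entry table. The packing order and the signed 8-b codebook are illustrative assumptions.

    #include <stdint.h>
    #include <stddef.h>

    /* Expand packed 4-b cluster labels into 8-b weights through a 16-entry
     * codebook: labels[k/2] holds label k in its low nibble for even k and in
     * its high nibble for odd k. */
    void expand_shared_weights(const uint8_t *labels,       /* packed 4-b labels  */
                               const int8_t codebook[16],   /* 16 shared weights  */
                               int8_t *weights,             /* decoded 8-b output */
                               size_t n_weights)
    {
        for (size_t k = 0; k < n_weights; k++) {
            uint8_t byte  = labels[k / 2];
            uint8_t label = (k & 1) ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
            weights[k] = codebook[label];   /* small embedded lookup table */
        }
    }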
Recent work has shown that the combination of pruning, weight sharing, and Huffman compression compresses state-of-the-art networks by 50 times in memory size (deep compression [34]). Traditional accelerators can benefit from such compression, but only in terms of a reduction in memory size and the amount of memory accessed. To execute convolutional operations, they must still decompress the data and, at best, remain idle during zero-valued operations. The efficient-inference engine [35], however, demonstrates that it is also possible and highly beneficial to operate directly on the compressed data by adapting the data path and memory interface to the compressed data format.

A network compression technique that does enable straightforward network execution in the compressed domain without any hardware adaptation uses singular value decomposition (SVD) [36]. By performing SVD on a sparse weight matrix of a fully connected network layer, the matrix can be decomposed into two matrices, the rows and columns of which are ordered as a function of the most significant network parameters. By simply removing the nonsignificant sections of these matrices, one is left with a strongly compressed representation of the original network layer. The result can be executed on any regular neural network accelerator, as it is identical to the execution of two (much smaller) fully connected layers. While this method is more straightforward from a hardware point of view, it offers only limited compression capabilities, ranging typically up to only five times compression [36].
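In terms of execution, the SVD-compressed layer is simply two chained, much smaller fully connected layers: an n_out × n_in weight matrix truncated to rank r is stored as an n_out × r and an r × n_in factor, reducing both the weights and the MACs from n_out·n_in to r·(n_out + n_in). The C sketch below shows only this inference-side structure; the factors are assumed to come from an offline SVD, which is not shown, and biases and nonlinearities are omitted.

    /* Fully connected layer y = W x with W (n_out x n_in) replaced by its rank-r
     * factorization W ~= U * V, with U (n_out x r) and V (r x n_in), row-major. */
    void fc_low_rank(int n_in, int n_out, int r,
                     const float *U, const float *V,
                     const float *x, float *y, float *tmp /* scratch, length r */)
    {
        for (int k = 0; k < r; k++) {            /* first small layer: tmp = V x  */
            float acc = 0.0f;
            for (int m = 0; m < n_in; m++)
                acc += V[k * n_in + m] * x[m];
            tmp[k] = acc;
        }
        for (int n = 0; n < n_out; n++) {        /* second small layer: y = U tmp */
            float acc = 0.0f;
            for (int k = 0; k < r; k++)
                acc += U[n * r + k] * tmp[k];
            y[n] = acc;
        }
    }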

Outlook
In this short tutorial, we have presented a selection of very promising hardware and algorithmic techniques from the rapidly expanding field of deep learning. Each exploits and/or enhances the unique features of deep networks to improve the energy efficiency of their execution. Together, they have allowed the achievement of tremendous energy savings compared to traditional CPU- and GPU-based compute platforms. As can be seen in Figure 12 [37], this recent wave of innovations breaks the barrier for embedded deep inference in mobile devices. Implementations far surpassing efficiencies of 1 TOP/W have recently been demonstrated, while computational throughput is boosted to several hundreds of GOP/s.

Figure 12: An overview of the reported performance of the deep neural network processors published at the International Solid-State Circuits Conference in 2016 and 2017. Performances beyond 100 GOP/s and 1 TOP/W will be a game changer for deep inference in embedded devices.

Still, challenges remain to effectively bring deep learning to IoT and edge devices. First, few (if any) complete end-to-end solutions have been demonstrated. Doing so involves integrating the deep-inference chips in complete vision-processing pipelines mapping real-life applications. This requires not only an efficient execution of the inference kernel itself but also efficient image slicing, data transfer, and results interpretation.

A second interesting challenge lies in the learning process. So far, most chips focus on the inference part, where pretrained models are efficiently executed on-chip. In the future, however, the desire for more privacy and user customization will stimulate chips capable of executing the training phase as well. This, however, comes with new computational challenges and the need for a careful algorithm-architecture cooptimization.

It is, thus, very clear that, more than ever, the hardware and algorithmic layers must be optimized jointly, grasping the various cross-layer opportunities of deep neural networks. This is also apparent from the interest of many traditionally software-oriented companies (like Google, Amazon, and Microsoft) in the development of new proprietary hardware for deep learning.

This field is so vibrant that every single week new ideas pop up. Of course, space does not allow us to cover all of the exciting ideas going around in the embedded deep learning space at the moment. Yet we hope that we were able to spark readers' interest and stimulate further exploration of this lively field.


References
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[3] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Conf. Neural Information Processing Systems, 2012, pp. 1097–1105.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv Preprint, arXiv:1512.03385, 2015.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[6] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv Preprint, arXiv:1610.02357, 2016.
[7] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "Densenet: Implementing efficient convnet descriptor pyramids," arXiv Preprint, arXiv:1404.1869, 2014.
[8] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Neural Comput., vol. 12, no. 10, pp. 2451–2471, 2000.
[9] N. P. Jouppi, et al., "In-datacenter performance analysis of a tensor processing unit," arXiv Preprint, arXiv:1704.04760, 2017.
[10] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, "DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks," in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 240–241.
[11] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. IEEE Annu. Int. Symp. Computer Architecture, 2016, pp. 367–379.
[12] M. Peemen, et al., "Memory-centric accelerator design for convolutional neural networks," in Proc. IEEE 31st Int. Conf. Computer Design, 2013, pp. 13–19.
[13] L. Cecconi, S. Smets, L. Benini, and M. Verhelst, "Optimal tiling strategy for memory bandwidth reduction for CNNs: Advanced concepts for intelligent vision systems," Ph.D. dissertation, Univ. Bologna, 2017.
[14] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," in Proc. IEEE Int. Solid-State Circuits Conf., 2016, pp. 262–263.
[15] H. T. Kung, "Systolic algorithms for the CMU WARP processor," Research Showcase @ CMU, 1984.
[16] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. 43rd Int. Symp. Computer Architecture, 2016, pp. 14–26.
[17] P. Gysel, M. Motamedi, and S. Ghiasi, "Hardware-oriented approximation of convolutional neural networks," in Proc. Workshop Contribution to Int. Conf. Learning Representations, 2016.
[18] B. Moons, B. De Brabandere, L. Van Gool, and M. Verhelst, "Energy-efficient ConvNets through approximate computing," in Proc. IEEE Winter Conf. Applications of Computer Vision, 2016, pp. 1–8.
[19] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," arXiv Preprint, arXiv:1609.07061, 2016.
[20] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. European Conf. Computer Vision, 2016, pp. 525–542.
[21] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Assoc., Inc., 2016, pp. 4107–4115.
[22] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An ultra-low power convolutional neural network accelerator based on binary weights," in Proc. IEEE Computer Society Annu. Symp. VLSI, July 2016, pp. 236–241.
[23] B. Moons and M. Verhelst, "A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets," in Proc. IEEE Symp. VLSI Circuits, 2016, pp. 1–2.
[24] B. Moons, et al., "Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI," in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 246–257.
[25] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "DVAFS: Trading computational accuracy for energy through dynamic-voltage-accuracy-frequency-scaling," in Proc. Conf. Design, Automation and Test in Europe, Lausanne, 2017, pp. 488–493.
[26] L. Fick, D. Blaauw, D. Sylvester, S. Skrzyniarz, M. Parikh, and D. Fick, "Analog in-memory subthreshold deep neural network accelerator," in Proc. IEEE Custom Integrated Circuits Conf., Austin, TX, 2017, pp. 1–4.
[27] Y. Lin, S. Zhang, and N. R. Shanbhag, "Variation-tolerant architectures for convolutional neural networks in the near threshold voltage regime," in Proc. IEEE Int. Workshop Signal Processing Systems, 2016, pp. 17–22.
[28] P. Whatmough, S. Kyu Lee, H. Lee, S. Rama, D. Brooks, and G.-Y. Wei, "A 28nm SoC with a 1.2GHz 568nJ/pred sparse deep neural network engine with >0.1 timing error rate tolerance for IoT applications," in Proc. IEEE Int. Solid-State Circuits Conf., 2017, pp. 242–243.
[29] G. Huang, et al., "Multi-scale dense convolutional networks for efficient prediction," arXiv Preprint, arXiv:1703.09844, 2017.
[30] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Computer Architecture, June 2016, pp. 1–13.
[31] D. Kim, J. Ahn, and S. Yoo, "A novel zero weight/activation-aware hardware architecture of convolutional neural network," in Proc. IEEE Design, Automation & Test in Europe Conf. & Exhibition, 2017, pp. 1462–1467.
[32] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[33] V. Sze, T.-J. Yang, and Y.-H. Chen, "Designing energy-efficient convolutional neural networks using energy-aware pruning," in Proc. Conf. Computer Vision and Pattern Recognition, Honolulu, HI, July 21–26, 2017, pp. 5687–5695.
[34] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv Preprint, arXiv:1510.00149, 2015.
[35] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," arXiv Preprint, arXiv:1602.01528, 2016.
[36] J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Proc. Interspeech Conf., 2013, pp. 2365–2369.
[37] M. Verhelst. (2017). Deep learning processor survey. [Online]. Available: http://www.esat.kuleuven.be/~mverhels/DLIC-survey.html
[38] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv Preprint, arXiv:1606.06160.

About the Authors
Marian Verhelst (marian.verhelst@kuleuven.be) has been an assistant professor at the Micro-Electronics and Sensors Laboratories of the Electrical Engineering Department at KU Leuven, Belgium, since 2012. Her research focuses on self-adaptive circuits and systems, embedded machine learning, and low-power sensing and processing for the Internet of Things. She received a Ph.D. degree from KU Leuven (cum ultima laude) in 2008. She was a visiting scholar at the Berkeley Wireless Research Center of the University of California, Berkeley, in 2005. From 2008 to 2011, she worked in the Radio Integration Research Lab of Intel Laboratories, Hillsboro, Oregon. She is an IEEE Solid-State Circuits Society Distinguished Lecturer and a member of the Young Academy of Belgium and has published over 100 papers in conferences and journals. She is a member of the International Solid-State Circuits Conference (ISSCC) Technical Program Committee and the Design, Automation, and Test in Europe (DATE) and ISSCC Executive Committees. She was an associate editor for IEEE Transactions on Circuits and Systems II and currently serves in the same capacity for IEEE Journal of Solid-State Circuits.

Bert Moons received his B.S. and M.S. degrees in electrical engineering from KU Leuven, Belgium, in 2011 and 2013, respectively. In 2013, he joined the Micro-Electronics and Sensors Laboratories of KU Leuven as a research assistant, funded through an individual grant from the Research Foundation of Flanders. In 2016, he was a visiting research student at Stanford University, California, in the Murmann Mixed-Signal Group. Currently, he is working toward the Ph.D. degree on energy-scalable and run-time adaptable digital circuits for embedded deep learning applications.
