Embedded Deep Neural Network Processing
Algorithmic and processor techniques bring deep learning to IoT and edge devices
Digital Object Identifier 10.1109/MSSC.2017.2745818
Date of publication: 16 November 2017

Deep learning has recently become immensely popular for image recognition, as well as for other recognition and pattern-matching tasks in, e.g., speech processing, natural language processing, and so forth. The online evaluation of deep neural networks, however, comes with significant computational complexity, making it, until recently, feasible only on power-hungry server platforms in the cloud. In recent years, we see an emerging trend toward embedded processing of deep learning networks in edge devices: mobiles, wearables, and Internet of Things (IoT) nodes. This would enable us to analyze data locally in real time, which is not only favorable in terms of latency but also mitigates privacy issues. Yet evaluating the powerful but large deep neural networks with power budgets in the milliwatt or even microwatt range requires a significant improvement in processing energy efficiency.

To enable such efficient evaluation of deep neural networks, optimizations at both the algorithmic and hardware level are required. This article surveys such tightly interwoven hardware-software processing techniques for energy efficiency and shows how implementation-driven algorithmic innovations, together with customized yet flexible processing architectures, can be true game changers. To help readers fully understand the implementation challenges as well as opportunities for deep neural network algorithms, we start by briefly summarizing the basic concept of deep neural networks.

The Birth of Deep Learning

Deep learning [1] can be traced back to neural networks, which have been around for many decades and were already gaining popularity in the early 1960s. A neural network is a brain-inspired computing system,
typically trained through supervised learning, whereby a machine learns a generalized model from many training examples, enabling it to classify new items.

The trained classification model in such neural networks consists of several layers of neurons, wherein each neuron of one layer connects to each neuron of the next layer, as illustrated in Figure 1. The output of the network indicates the probability that a certain object class is observed at the network's input. In such a network, every individual neuron creates one output o, which is a weighted sum of its inputs i. For the $n$th neuron of layer $l$, this can be formalized as

$o_n^l = \sigma\Big(\sum_m w_{mn}^l \cdot i_m^l + b_n^l\Big)$.   (1)

The weights $w_{mn}^l$ and biases $b_n^l$ are the flexible parameters of the network that enable it to represent a particular desired input/output mapping for the targeted classification. They are trained with supervised training examples in an initial offline training phase, after which the network can classify new examples presented to its inputs, a process typically referred to as inference.

Figure 1: A traditional fully connected neural network is made up of layers of neurons. Every neuron makes a weighted sum of all its inputs, followed by a nonlinear transformation.
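As a concrete reading of equation (1), the following C sketch evaluates one fully connected layer. The flat array layout and the choice of a sigmoid for the nonlinearity σ are illustrative assumptions, not details taken from the article.

    #include <math.h>

    /* Minimal sketch of equation (1) for one fully connected layer.
     *   i : n_in inputs of layer l
     *   w : n_out x n_in weight matrix; row n holds the weights of neuron n
     *   b : n_out biases
     *   o : n_out outputs
     */
    static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

    void fc_layer(const float *i, const float *w, const float *b,
                  float *o, int n_in, int n_out)
    {
        for (int n = 0; n < n_out; n++) {      /* one output neuron at a time */
            float acc = b[n];                  /* start from the bias b_n^l   */
            for (int m = 0; m < n_in; m++)     /* weighted sum over inputs    */
                acc += w[n * n_in + m] * i[m];
            o[n] = sigmoid(acc);               /* nonlinearity sigma          */
        }
    }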
Such neural networks have been used for decades in several application domains. In a classical pattern-recognition pipeline [Figure 2(a)], features are generated from an input image by an application-specific feature extractor, hand-designed by an expert engineer. This preliminary feature extraction step was necessary because, at that time, one could use only small neural networks with a limited number of layers that did not have the modeling capacity required for complex feature extraction from raw data. Larger neural networks were impossible to train due to nonconvergence issues, lack of sufficiently large data sets, and insufficient compute power.

Figure 2: (a) Traditionally, machine learning classifiers were trained and applied on hand-crafted features (such as edges, gradients, corners, and HOG descriptors) feeding a trained classifier, e.g., a neural network. (b) The advent of deep learning allowed the network to learn and extract the optimal feature sets. (c) Such a network trains itself to extract very coarse, low-level features in its first layers, then finer, higher-level features in its intermediate layers, and, finally, targets full objects in the last layers. HOG: histogram of oriented gradients.

Yet, after a long winter for neural networks in the 1970s and 1980s, they regained momentum in the 1990s and again in the 2010s. The increasing availability of powerful compute servers and graphics processing units (GPUs), the abundance of digital data sources, and innovations in training mechanisms allowed training deeper and deeper networks, with many layers of neurons. This meant the start of a new era for classification, as it allowed training networks with enough
modeling capacity, as is apparent from the yearly ImageNet challenge (Figure 3) [2].

Figure 3: The classification results of the ImageNet challenge have seen enormous boosts in accuracy since the appearance of deep learning submissions, from AlexNet (ILSVRC'12) over VGG and GoogLeNet (ILSVRC'14) to ResNet (ILSVRC'15). (Data from [2].) ILSVRC: ImageNet Large-Scale Visual Recognition Challenge; AlexNet: a CNN named for Alex Krizhevsky; VGG: a network from the Visual Geometry Group at Oxford University; ResNet: Residual Net.
Deep Neural Network Topologies

Another crucial factor in the breakthrough of deep learning technology is the advent of new network topologies. Classical neural networks, which rely on so-called fully connected layers with each neuron of one layer connected to each neuron of the next layer (Figure 1), suffer from a very large number of training parameters. For a network with L layers of N neurons each, L·(N² + N) parameters must be trained. Knowing that N can easily reach the order of a million (e.g., for images with a million pixels), this large parameter set becomes impractical and untrainable.

For many tasks (mainly in image processing and computer vision), convolutional neural networks (CNNs) are more efficient. These CNNs, inspired by visual neuroscience, organize the data in every network layer as three-dimensional (3-D) tensors.

Figure 4: The topology and pseudocode of one layer of a typical CNN: trained feature extraction through convolutional, ReLU, and max-pooling layers, followed by a fully connected classification stage. The pseudocode is for one layer of the network, with C input channels, F output feature maps of M × M pixels, and K × K filter kernels. MACs: multiply accumulations.

    for (int f = 0; f < F; f++)
      for (int mx = 0; mx < M; mx++)
        for (int my = 0; my < M; my++)
          for (int c = 0; c < C; c++)
            for (int kx = 0; kx < K; kx++)
              for (int ky = 0; ky < K; ky++)
                o[f][mx][my] += w[f][c][kx][ky] * i[c][mx + kx][my + ky];

Per output pixel of a layer: load C·K² weights, load C·K² inputs, do C·K² MACs, and store one output; this repeats F·M² times per layer.
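To put these counts side by side, the short C sketch below evaluates the F·M²·C·K² MACs and F·C·K² weights of one convolutional layer next to the N² + N parameters of a single fully connected layer. The layer dimensions are assumptions chosen only for illustration.

    #include <stdio.h>

    /* Illustrative back-of-the-envelope workload counts; the layer
     * dimensions below are assumptions, not values from the article. */
    int main(void)
    {
        /* One convolutional layer (Figure 4 notation):
         * C input channels, F output maps of M x M pixels, K x K kernels. */
        long long C = 64, F = 128, M = 56, K = 3;
        long long conv_weights = F * C * K * K;          /* filter parameters */
        long long conv_macs    = F * M * M * C * K * K;  /* C*K^2 MACs per
                                                            output pixel,
                                                            F*M^2 pixels      */

        /* One fully connected layer with N inputs and N outputs:
         * N^2 weights plus N biases (one layer of the L*(N^2+N) count). */
        long long N = 1000000;                           /* ~1-Mpixel input   */
        long long fc_params = N * N + N;

        printf("conv layer: %lld weights, %lld MACs\n", conv_weights, conv_macs);
        printf("fc layer:   %lld parameters\n", fc_params);
        return 0;
    }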
Figure 8: Different architectural topologies allow data reuse to be maximized, reusing either inputs, weights, intermediate results, or a combination of the three. BW: bandwidth.
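In software terms, the data-reuse choices of Figure 8 correspond to different orderings of the Figure 4 loop nest. Below is a minimal weight-stationary sketch, in which each weight is fetched once and reused across all output pixels; the dimensions, flat array layout, and zero-initialized output buffer are illustrative assumptions rather than a specific architecture from the article.

    /* Weight-stationary reordering of the Figure 4 loop nest: each weight
     * w[f][c][kx][ky] is read once and reused across all M*M output pixels,
     * trading weight bandwidth for repeated input/output accesses.
     * The output array o is assumed to be zero-initialized. */
    void conv_weight_stationary(const float *w, const float *i, float *o,
                                int F, int C, int M, int K)
    {
        int W = M + K - 1;                               /* input width/height */
        for (int f = 0; f < F; f++)
          for (int c = 0; c < C; c++)
            for (int kx = 0; kx < K; kx++)
              for (int ky = 0; ky < K; ky++) {
                float wv = w[((f * C + c) * K + kx) * K + ky];  /* held locally */
                for (int mx = 0; mx < M; mx++)
                  for (int my = 0; my < M; my++)
                    o[(f * M + mx) * M + my] +=
                        wv * i[(c * W + mx + kx) * W + my + ky];
              }
    }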
Systolic processing can be seen as an extreme type of such hierarchical memories. In the systolic processing concept, a 2-D array of functional units processes data locally and passes inputs and intermediate results from unit to unit instead of to/from global memory. These functional units are each equipped with a very small SRAM (as

Figure 9: A well-designed memory hierarchy avoids drawing all weights and input data from the costly DRAM interface and stores frequently accessed data locally: off-chip DRAM holds gigabytes at hundreds of pJ/word, on-chip SRAM holds megabytes at tens of pJ/word, local SRAM holds kilobytes at around a pJ/word, and storage next to the arithmetic units holds bytes at less than a pJ/word. pJ: picojoule.
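For a rough feel of why such a hierarchy matters, the sketch below tallies weight-fetch energy for one convolutional layer under per-word access costs in the spirit of Figure 9. The energy values, layer dimensions, and the two fetch scenarios compared are illustrative assumptions, not measurements.

    #include <stdio.h>

    /* Illustrative energy estimate: fetching every weight of a conv layer
     * from DRAM on each use, versus staging each weight once in a local
     * SRAM and reusing it from there. Per-word energies are rough orders
     * of magnitude only. */
    int main(void)
    {
        const double E_DRAM  = 200e-12;   /* ~hundreds of pJ per word */
        const double E_LOCAL = 1e-12;     /* ~a pJ per word           */

        /* Example layer dimensions (assumptions). */
        long long C = 64, F = 128, M = 56, K = 3;
        long long weight_uses = F * M * M * C * K * K;   /* one weight word per MAC */
        long long weights     = F * C * K * K;

        double e_all_dram  = weight_uses * E_DRAM;
        double e_hierarchy = weights * E_DRAM + weight_uses * E_LOCAL;

        printf("all weight fetches from DRAM : %.3f mJ\n", e_all_dram * 1e3);
        printf("staged once, reused locally  : %.3f mJ\n", e_hierarchy * 1e3);
        return 0;
    }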
Figure 10: (a) When quantizing all weight and data values in a floating-point AlexNet uniformly, the network can run at 9-b precision. Lower precision can be achieved without significant classification accuracy loss by running every layer at its own optimal precision. This allows (b) saving power as a function of computational precision and (c) building multipliers whose energy consumption scales drastically with computational precision, through reduced activity factor and critical path length. Panel (a) plots the per-layer quantization (in bits) of AlexNet on ImageNet for uniform quantization at 100% relative accuracy and nonuniform quantization at 99%; panel (b) plots relative power against computational precision, showing up to a 33× gain.
Figure 11: (a) The sparsity of the input and weight values of a typical network as a function of the fixed-point precision (in bits) at which the network is evaluated; layer inputs are markedly sparser than weights. (b) This sparsity allows energy to be saved in the processor's input/output interface, on-chip memories, and data path, by compressing off-chip communication, preventing the fetching of zero-valued data, and preventing the execution of zero-input MACs.
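A minimal software analogue of the savings in Figure 11(b) is to guard each MAC on its operands, so that zero-valued inputs or weights trigger neither a multiply nor, in a real design, the corresponding fetch. The dot-product framing below is an illustrative assumption.

    /* Illustrative zero-skipping dot product: MACs whose input or weight
     * is zero are skipped, mirroring how sparsity can gate memory fetches
     * and data-path activity in hardware. */
    float sparse_dot(const float *i, const float *w, int n)
    {
        float acc = 0.0f;
        for (int k = 0; k < n; k++) {
            if (i[k] == 0.0f || w[k] == 0.0f)
                continue;                 /* prevent executing a zero-input MAC */
            acc += i[k] * w[k];           /* only nonzero contributions remain  */
        }
        return acc;
    }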
References
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.