
HAQ: Hardware-Aware Automated Quantization with Mixed Precision

Kuan Wang∗, Zhijian Liu∗, Yujun Lin∗, Ji Lin, and Song Han
{kuanwang, zhijian, yujunlin, jilin, songhan}@mit.edu
Massachusetts Institute of Technology

∗ indicates equal contributions.

arXiv:1811.08886v3 [cs.CV] 6 Apr 2019

Abstract

Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space, trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. There is plenty of specialized hardware for neural networks, but little research has been done on specializing neural network optimization for a particular hardware architecture. Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduces the latency by 1.4-1.95× and the energy consumption by 1.9× with negligible loss of accuracy compared with fixed-bitwidth (8 bits) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of the different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.

Figure 1: We need mixed precision for different layers. We quantize MobileNets [12] to different numbers of bits (both weights and activations), and the result lies on a better Pareto curve (yellow) than fixed-bit quantization (blue). The reason is that different layers have different redundancy and different arithmetic intensity (OPs/byte) on the hardware, which advocates for using mixed precision for different layers. (Plot: Top-1 accuracy (%) vs. latency (ms); marker size indicates model size, 1-3 MB; blue: fixed 8-bit quantization, yellow: our flexible-bit quantization.)

1. Introduction

In many real-time machine learning applications (such as robotics, autonomous driving, and mobile VR/AR), deep neural networks are strictly constrained by latency, energy, and model size. In order to improve the hardware efficiency, many researchers have proposed to quantize the weights and activations to low precision [8, 18, 34].

Conventional quantization methods use the same number of bits for all layers [2, 14], but as different layers have different redundancy and behave differently on the hardware (computation bounded or memory bounded), it is necessary to use mixed precision for different layers (as shown in Figure 1). This flexibility was originally not supported by chip vendors, until recently hardware manufacturers started to implement it: Apple released the A12 Bionic chip that supports mixed precision for neural network inference [6]; NVIDIA recently introduced the Turing GPU architecture that supports 1-bit, 4-bit, 8-bit and 16-bit arithmetic operations [21]; Imagination launched a flexible neural network IP that supports per-layer bitwidth adjustment for both weights and activations [13].
Besides industry, academia has also recently worked on bit-level flexible hardware design: BISMO [26] proposed a bit-serial multiplier to support multiplications of 1 to 8 bits; BitFusion [25] supports multiplications of 2, 4, 8 and 16 bits in a spatial manner.

However, a crucial missing piece is how to determine the bitwidth of both weights and activations for each layer on different hardware accelerators. This is a vast design space: with M different neural network models, each with N layers, on H different hardware platforms, there are in total O(H × M × 8^{2N}) possible solutions (assuming the bitwidth is 1 to 8 for both weights and activations). For a widely used ResNet-50 [9] model, the size of the search space is about 8^{100}, which is even larger than the number of particles in the universe. Conventional methods require domain experts (with knowledge of both machine learning and hardware architecture) to explore the huge design space smartly with rule-based heuristics, such as: we should retain more bits in the first layer, which extracts low-level features, and in the last layer, which computes the final outputs; also, we should use more bits in the convolution layers than in the fully-connected layers because, empirically, the convolution layers are more sensitive. As the neural network becomes deeper, the search space increases exponentially, which makes it infeasible to rely on hand-crafted strategies. Therefore, these rule-based quantization policies are usually sub-optimal, and they cannot generalize from one model to another. In this paper, we would like to automate this exploration process with a learning-based framework.
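As a quick sanity check on these numbers, the short sketch below (our own illustration, not from the paper) counts the per-layer choices for a hypothetical 50-layer model and compares the result against the commonly cited estimate of roughly 10^80 particles in the observable universe:

```python
N_LAYERS = 50                 # e.g., ResNet-50 has roughly 50 quantizable layers
CHOICES_PER_LAYER = 8 * 8     # 8 weight bitwidths x 8 activation bitwidths (1-8 bits)

search_space = CHOICES_PER_LAYER ** N_LAYERS   # = 8 ** (2 * N_LAYERS) = 8 ** 100
print(len(str(search_space)) - 1)              # ~90, i.e. the space is about 10^90
print(search_space > 10 ** 80)                 # True: larger than ~10^80 particles
```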
Another challenge is how to optimize the latency and the energy consumption of a given model on the hardware. A widely adopted approach is to rely on proxy signals (e.g., FLOPs, number of memory references) [12, 24]. However, as different hardware behaves very differently, the performance of a model on the hardware cannot always be accurately reflected by these proxy signals. Therefore, it is important to directly involve the hardware architecture's performance feedback in the design loop. Also, as demonstrated in Table 1, the quantization solution optimized on one hardware might not be optimal on another, which raises the demand for specialized policies for different hardware architectures.

                              HW1          HW2          HW3
  Best Q. policy for HW1    16.29 ms     85.24 ms    117.44 ms
  Best Q. policy for HW2    19.95 ms     64.29 ms    108.64 ms
  Best Q. policy for HW3    19.94 ms     66.15 ms     99.68 ms

Table 1: Inference latency of MobileNet-V1 [12] on three hardware architectures under different quantization policies. The quantization policy that is optimized for one hardware is not optimal for another. This suggests that we need a specialized quantization solution for different hardware architectures. (HW1: BitFusion [25], HW2: BISMO [26] edge accelerator, HW3: BISMO cloud accelerator, batch = 16.)

To this end, we propose the Hardware-Aware Automated Quantization (HAQ) framework that leverages reinforcement learning to automatically predict the quantization policy given the hardware's feedback. The RL agent decides the bitwidth of a given neural network in a layer-wise manner. For each layer, the agent receives the layer configuration and statistics as the observation, and it then outputs the action, which is the bitwidth of weights and activations. We then leverage the hardware accelerator as the environment to obtain direct feedback from the hardware to guide the RL agent to satisfy the resource constraints. After all layers are quantized, we finetune the quantized model for one more epoch, and feed the validation accuracy after this short-term retraining as the reward signal to our RL agent. During the exploration, we leverage the deep deterministic policy gradient (DDPG) [17] to supervise our RL agent. We also study the quantization policy on multiple hardware architectures: both cloud and edge neural network accelerators, with spatial or temporal multi-precision designs.

The contribution of this paper has four aspects:

1. Automation: We propose an automated framework for quantization, which does not require domain experts and rule-based heuristics. It frees human labor from exploring the vast search space of choosing bitwidths.

2. Hardware-Aware: Our framework involves the hardware architecture in the loop so that it can directly reduce the latency, energy and storage on the target hardware instead of relying on proxy signals.

3. Specialization: For different hardware architectures, our framework can offer a specialized quantization policy that is exactly tailored for the target hardware architecture to optimize latency and energy.

4. Design Insights: We interpret the different quantization policies learned for different hardware architectures. Taking both computation and memory access into account, the interpretation offers insights on both neural network architecture and hardware architecture design.

2. Related Work

Quantization. There have been extensive explorations on compressing and accelerating deep neural networks using quantization. Han et al. [8] quantized the network weights to reduce the model size with rule-based strategies: e.g., they used human heuristics to determine the bitwidths for convolution and fully-connected layers.
Courbariaux et al. [4] binarized the network weights into {-1, +1}; Rastegari et al. [23] and Zhou et al. [32] binarized each convolution filter into {-w, +w}; Zhu et al. [34] mapped the network weights into {-w_N, 0, +w_P} using two bits; Zhou et al. [33] used one bit for network weights and two bits for activations; Jacob et al. [14] made use of 8-bit integers for both weights and activations. We refer the reader to the survey paper by Krishnamoorthi [16] for a more detailed overview. These conventional quantization methods either simply assign the same number of bits to all layers or require domain experts to determine the bitwidths for different layers, while our framework automates this design process, and our learning-based policy outperforms rule-based strategies.

AutoML. Many researchers have aimed to improve the performance of deep neural networks by searching the network architectures: Zoph et al. [35] proposed the Neural Architecture Search (NAS) to explore and design the transformable network building blocks, and their network architecture outperforms several human-designed networks; Liu et al. [19] introduced the Progressive NAS to accelerate the architecture search by 5× using sequential model-based optimization; Pham et al. [22] introduced the Efficient NAS to speed up the exploration by 1000× using parameter sharing; Cai et al. [1] introduced the path-level network transformation to effectively search the tree-structured architecture space. Motivated by these AutoML frameworks, He et al. [10] leveraged reinforcement learning to automatically prune the convolution channels. Our framework further explores automated quantization for network weights and activations, and it takes the hardware architectures into consideration.

Efficient Models. To facilitate efficient deployment, researchers have designed hardware-friendly approaches to slim neural network models. For instance, the coarse-grained channel pruning methods [11, 20] prune away entire channels of convolution kernels to achieve speedup. Recently, researchers have explicitly optimized for various aspects of hardware properties, including the inference latency and energy: Yang et al. [30] proposed energy-aware pruning to directly optimize the energy consumption of neural networks; Yang et al. [31] reduced the inference time of neural networks on mobile devices through a lookup table. Nevertheless, these methods are still rule-based and mostly focus on pruning. Our framework automates the quantization process by taking hardware-specific metrics as direct rewards using a learning-based method.

Figure 2: An overview of our Hardware-Aware Automated Quantization (HAQ) framework. We leverage reinforcement learning to automatically search over the huge quantization design space with hardware in the loop. The agent proposes an optimal bitwidth allocation policy given the amount of computation resources (i.e., latency, power, and model size). Our RL agent integrates the hardware accelerator into the exploration loop so that it can obtain direct feedback from the hardware, instead of relying on indirect proxy signals. (Diagram: a DDPG actor-critic agent assigns per-layer weight/activation bitwidths, e.g., 3-bit weights / 5-bit activations; the quantized model is mapped onto the hardware accelerator, such as BitFusion on the edge or BISMO on the edge/cloud, which returns direct feedback used as the state and reward.)

3. Approach

We model the quantization task as a reinforcement learning problem (Figure 2). We use the actor-critic model with a DDPG agent to give the action: bits for each layer. We collect hardware counters as constraints, together with accuracy as the reward, to search for the optimal quantization policy. We have three hardware environments that cover edge and cloud, spatial and temporal architectures for mixed-precision accelerators. Below we describe the details of the RL formulation.

3.1. Observation (State Space)

Our agent processes the neural network in a layer-wise manner. For each layer, our agent takes two steps: one for weights, and one for activations.
In this paper, we introduce a ten-dimensional feature vector O_k as our observation.

If the k-th layer is a convolution layer, the state O_k is

O_k = (k, c_in, c_out, s_kernel, s_stride, s_feat, n_params, i_dw, i_{w/a}, a_{k-1}),    (1)

where k is the layer index, c_in is #input channels, c_out is #output channels, s_kernel is the kernel size, s_stride is the stride, s_feat is the input feature map size, n_params is #parameters, i_dw is a binary indicator for depthwise convolution, i_{w/a} is a binary indicator for weight/activation, and a_{k-1} is the action from the last time step.

If the k-th layer is a fully-connected layer, the state O_k is

O_k = (k, h_in, h_out, 1, 0, s_feat, n_params, 0, i_{w/a}, a_{k-1}),    (2)

where k is the layer index, h_in is #input hidden units, h_out is #output hidden units, s_feat is the size of the input feature vector, n_params is #parameters, i_{w/a} is a binary indicator for weight/activation, and a_{k-1} is the action from the last time step.

For each dimension in the observation vector O_k, we normalize it into [0, 1] to bring all dimensions to the same scale.
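To make the state construction concrete, here is a minimal sketch (our own illustration, not the authors' released code) of how the ten-dimensional observation of Eq. (1)/(2) could be assembled and min-max normalized; the function and field names are assumptions:

```python
import numpy as np

def layer_observation(k, c_in, c_out, kernel, stride, feat, n_params,
                      is_depthwise, is_weight_step, prev_action):
    """Ten-dimensional state for layer k, following Eq. (1)/(2).
    For a fully-connected layer, pass kernel=1, stride=0, is_depthwise=0."""
    return np.array([k, c_in, c_out, kernel, stride, feat, n_params,
                     float(is_depthwise), float(is_weight_step), prev_action],
                    dtype=np.float32)

def normalize_states(states):
    """Min-max normalize each dimension across all layers into [0, 1]."""
    states = np.stack(states)
    lo, hi = states.min(axis=0), states.max(axis=0)
    return (states - lo) / np.maximum(hi - lo, 1e-8)
```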
3.2. Action Space

We use a continuous action space to determine the bitwidth. The reason we do not use a discrete action space is that it loses the relative order: e.g., 2-bit quantization is more aggressive than 4-bit and even more so than 8-bit. At the k-th time step, we take the continuous action a_k (which is in the range [0, 1]) and round it into the discrete bitwidth value b_k:

b_k = round(b_min - 0.5 + a_k × (b_max - b_min + 1)),    (3)

where b_min and b_max denote the minimum and maximum bitwidth (in our experiments, we set b_min to 2 and b_max to 8).
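As an illustration of Eq. (3), the following sketch (written for this text, assuming b_min = 2 and b_max = 8 as stated above) maps the continuous action to a bitwidth:

```python
def action_to_bitwidth(a_k, b_min=2, b_max=8):
    """Map a continuous action a_k in [0, 1] to a discrete bitwidth (Eq. 3)."""
    b_k = round(b_min - 0.5 + a_k * (b_max - b_min + 1))
    return int(min(max(b_k, b_min), b_max))   # clip guards the a_k = 1.0 edge case

# a_k = 0.0 -> 2 bits, a_k = 0.5 -> 5 bits, a_k = 1.0 -> 8 bits
```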
Resource Constraints. In real-world applications, we have limited computation budgets (i.e., latency, energy, and model size). We would like to find the quantization policy with the best performance given the constraint.

We encourage our agent to meet the computation budget by limiting the action space. After our RL agent gives actions {a_k} to all layers, we measure the amount of resources that will be used by the quantized model. The feedback is directly obtained from the hardware accelerator, which we will discuss in Section 3.3. If the current policy exceeds our resource budget (on latency, energy or model size), we sequentially decrease the bitwidth of each layer until the constraint is finally satisfied.
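This constraint-enforcement step can be sketched as the simple loop below (our own reading of the procedure; measure_resource stands in for the hardware/simulator query and is an assumed interface):

```python
def enforce_budget(bitwidths, budget, measure_resource, min_bits=2):
    """Sequentially decrease per-layer bitwidths until the measured
    resource (latency, energy, or model size) fits the budget."""
    bitwidths = list(bitwidths)
    while measure_resource(bitwidths) > budget:
        reduced = False
        for i in range(len(bitwidths)):          # sweep the layers in order
            if bitwidths[i] > min_bits:
                bitwidths[i] -= 1                # take one bit away
                reduced = True
                if measure_resource(bitwidths) <= budget:
                    return bitwidths
        if not reduced:                          # every layer already at min_bits
            break
    return bitwidths
```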
3.3. Direct Feedback from Hardware Accelerators

An intuitive feedback signal for our RL agent could be FLOPs or the model size. However, these proxy signals are indirect and not equal to the actual performance (i.e., latency and energy consumption) on the hardware: cache locality, the number of kernel calls, and memory bandwidth all matter. Proxy feedback cannot model these hardware characteristics well enough to find the specialized strategies (see Table 1).

Instead, we use direct latency and energy feedback from the hardware accelerator as resource constraints, which enables our RL agent to determine the bitwidth allocation policy from the subtle differences between different layers: e.g., vanilla convolution has more data reuse and better locality, while depthwise convolution [3] has less reuse and worse locality, which makes it memory bounded. Such differences impact the optimal quantization policy.

3.4. Quantization

We linearly quantize the weights and activations of each layer using the action a_k given by our agent, as a linearly quantized model only needs a fixed-point arithmetic unit, which is more efficient to implement on the hardware.

Specifically, for each weight value w in the k-th layer, we first truncate it into the range [-c, c], and we then quantize it linearly into a_k bits:

quantize(w, a_k, c) = round(clamp(w, c) / s) × s,    (4)

where clamp(·, x) truncates the values into [-x, x], and the scaling factor s is defined as s = c / (2^{a_k - 1} - 1). In this paper, we choose the value of c by finding the optimal value x that minimizes the KL-divergence between the original weight distribution W_k and the quantized weight distribution quantize(W_k, a_k, x):

c = arg min_x D_KL(W_k || quantize(W_k, a_k, x)),    (5)

where D_KL(· || ·) is the KL-divergence that characterizes the distance between two distributions. As for the activations, we quantize the values similarly, except that we truncate them into the range [0, c] rather than [-c, c], since the activation values (which are the outputs of ReLU layers) are non-negative.
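A minimal sketch of this linear quantizer with KL-based clipping calibration follows (our own illustration, assuming a simple grid search over candidate clipping values and a histogram-based KL estimate; not the authors' exact implementation):

```python
import numpy as np

def linear_quantize(w, n_bits, c):
    """Eq. (4): clamp to [-c, c], then quantize to n_bits with scale s."""
    s = c / (2 ** (n_bits - 1) - 1)
    return np.round(np.clip(w, -c, c) / s) * s

def kl_divergence(p, q, eps=1e-8):
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def calibrate_clip(w, n_bits, n_candidates=64, n_bins=2048):
    """Eq. (5): pick the clipping value c that minimizes the KL-divergence
    between the original and the quantized weight distributions."""
    w = w.ravel()
    hist_ref, edges = np.histogram(w, bins=n_bins)
    best_c, best_kl = None, float("inf")
    w_max = np.abs(w).max()
    for c in np.linspace(w_max / n_candidates, w_max, n_candidates):
        hist_q, _ = np.histogram(linear_quantize(w, n_bits, c), bins=edges)
        kl = kl_divergence(hist_ref.astype(np.float64), hist_q.astype(np.float64))
        if kl < best_kl:
            best_c, best_kl = c, kl
    return best_c
```

For activations, the same idea applies with the clipping range [0, c] instead of [-c, c].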
3.5. Reward Function

After quantization, we retrain the quantized model for one more epoch to recover the performance. As we have already imposed the resource constraints (latency, energy) by limiting the action space (Section 3.2), we define our reward function R to depend only on the accuracy:

R = λ × (acc_quant - acc_origin),    (6)

where acc_origin is the top-1 classification accuracy of the full-precision model on the training set, acc_quant is the accuracy of the quantized model after finetuning, and λ is a scaling factor, which is set to 0.1 in our experiments.
3.6. Agent

For the RL agent, we leverage the deep deterministic policy gradient (DDPG) [17], an off-policy actor-critic algorithm for continuous control problems. In our environment, one step means that our agent makes an action to decide the number of bits assigned to the weights or activations of a specific layer, while one episode is composed of multiple steps, in which our RL agent makes actions for all layers. We apply a variant form of the Bellman equation, where each transition in an episode is defined as T_k = (O_k, a_k, R, O_{k+1}). During exploration, the Q-function target is computed as

Q̂_k = R_k - B + γ × Q(O_{k+1}, w(O_{k+1}) | θ^Q),    (7)

and the loss function can be approximated by

L = (1 / N_s) Σ_{k=1}^{N_s} (Q̂_k - Q(O_k, a_k | θ^Q))^2,    (8)

where N_s denotes the number of steps in this episode, and the baseline B is defined as an exponential moving average of all previous rewards in order to reduce the variance of the gradient estimation. The discount factor γ is set to 1 since we assume that the action made for each layer should contribute equally to the final result. Moreover, as the number of steps is always finite (bounded by the number of layers), the sum of the rewards will not explode.
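A compact sketch of this target computation and critic loss (our own PyTorch-style illustration; it assumes the usual DDPG target copies of the actor and critic, which the paper does not spell out, and is not the released implementation):

```python
import torch

def critic_loss(critic, critic_target, actor_target, transitions,
                moving_avg_reward, gamma=1.0):
    """DDPG critic loss with the moving-average reward baseline B (Eqs. 7-8).
    `transitions` is a list of (obs, action, reward, next_obs) float tensors."""
    obs      = torch.stack([t[0] for t in transitions])
    act      = torch.stack([t[1] for t in transitions])
    rew      = torch.stack([t[2] for t in transitions])
    next_obs = torch.stack([t[3] for t in transitions])

    with torch.no_grad():
        next_act = actor_target(next_obs)                       # w(O_{k+1})
        q_next   = critic_target(next_obs, next_act).squeeze(-1)
        q_hat    = rew - moving_avg_reward + gamma * q_next     # Eq. (7)

    q = critic(obs, act).squeeze(-1)
    return ((q_hat - q) ** 2).mean()                            # Eq. (8)
```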
3.7. Implementation Details

In this section, we present the implementation details of the RL exploration and of finetuning the quantized models.

Agent. The DDPG agent consists of an actor network and a critic network. Both use the same network architecture: they take the state vector and the action from the last time step as inputs and feed them into two separate fully-connected layers with a hidden size of 400. After that, we add the two hidden vectors together and pass them through another two fully-connected layers with hidden sizes of {300, 1}. For the actor network, we use an additional sigmoid function to project the output into the range [0, 1].
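Under our reading of this description, the shared trunk could look like the sketch below (hypothetical PyTorch code; class and layer names and the exact placement of the non-linearities are assumptions):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """400-d state branch + 400-d action branch, summed, then a 300 -> 1 head.
    The actor adds a sigmoid so its output lands in [0, 1]."""
    def __init__(self, state_dim=10, action_dim=1, is_actor=False):
        super().__init__()
        self.fc_state  = nn.Linear(state_dim, 400)
        self.fc_action = nn.Linear(action_dim, 400)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(400, 300),
                                  nn.ReLU(), nn.Linear(300, 1))
        self.is_actor = is_actor

    def forward(self, state, action):
        x = self.fc_state(state) + self.fc_action(action)
        out = self.head(x)
        return torch.sigmoid(out) if self.is_actor else out
```

The critic instance would be evaluated as critic(state, action) to produce the Q-value, while the actor instance (is_actor=True) outputs the continuous action in [0, 1].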
Exploration. Optimization of the DDPG agent is carried out using Adam [15] with β1 = 0.9 and β2 = 0.999. We use a fixed learning rate of 10^-4 for the actor network and 10^-3 for the critic network. During exploration, we employ the following stochastic noise process:

w'(O_k) ~ N_trunc(w(O_k | θ^w_k), σ^2, 0, 1),    (9)

where N_trunc(μ, σ, a, b) is the truncated normal distribution and w is the model weights. The noise σ is initialized to 0.5, and after each episode it is decayed exponentially with a decay rate of 0.99.
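For exploration, the actor's output can be perturbed with truncated-normal noise as sketched below (our illustration using SciPy's truncnorm; the σ initialization and decay follow the numbers above, while the episode count is an illustrative placeholder):

```python
from scipy.stats import truncnorm

def noisy_action(mean_action, sigma):
    """Sample an exploration action from a normal distribution centered at the
    actor's output, truncated to [0, 1] (Eq. 9)."""
    a, b = (0.0 - mean_action) / sigma, (1.0 - mean_action) / sigma
    return float(truncnorm.rvs(a, b, loc=mean_action, scale=sigma))

sigma = 0.5                  # initial noise (per the paper)
num_episodes = 100           # illustrative value, not from the paper
for episode in range(num_episodes):
    # ... run one episode, calling noisy_action(actor_output, sigma) per step ...
    sigma *= 0.99            # exponential decay after each episode
```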

           Hardware     Batch   PE Array   AXI port    Block RAM
  Edge     Zynq-7020      1       8×8       4×64b      140×36Kb
  Cloud    VU9P          16      16×16      4×256b     2160×36Kb

Table 2: The configurations of the edge and cloud accelerators.

Finetuning. During exploration, we finetune the quantized model for one epoch to help recover the performance (using SGD with a fixed learning rate of 10^-3 and momentum of 0.9). We randomly select 100 categories from ImageNet [5] to accelerate the model finetuning during exploration. After exploration, we quantize the model with our best policy and finetune it on the full dataset.

4. Experiments

We conduct extensive experiments to demonstrate the consistent effectiveness of our framework for multiple objectives: latency, energy, and model size.

Datasets and Models. Our experiments are performed on the ImageNet [5] dataset. As our focus is on more efficient models, we extensively study the quantization of MobileNet-V1 [12] and MobileNet-V2 [24]. Both MobileNets are inspired by depthwise separable convolutions [3] and replace the regular convolutions with pointwise and depthwise convolutions: MobileNet-V1 stacks multiple "depthwise - pointwise" blocks repeatedly, while MobileNet-V2 uses "pointwise - depthwise - pointwise" blocks as its basic building primitives.

4.1. Latency-Constrained Quantization

We first evaluate our framework under latency constraints on two representative hardware architectures: spatial and temporal architectures for multi-precision CNNs. We show that it is beneficial to have specialized quantization policies for different hardware architectures, and we systematically interpret the policies given by the agent to guide future human designs.

Temporal Architecture. The Bit-Serial Matrix Multiplication Overlay (BISMO) proposed by Umuroglu et al. [26] is a classic temporal design of a neural network accelerator on FPGA. It introduces bit-serial multipliers which are fed with one-bit digits from 256 weights and corresponding activations in parallel at one time, and which accumulate their partial products by shifting over time.

Spatial Architecture. The BitFusion architecture proposed by Sharma et al. [25] is a state-of-the-art spatial ASIC design for neural network acceleration. It employs a 2D systolic array of Fusion Units which spatially sum the shifted partial products of two-bit elements from weights and activations.
4.1.1 Quantization policy for BISMO Architecture

Inferencing neural networks on edge devices and on cloud servers can be quite different: batch size, memory bandwidth, peak FLOPs, etc. We use a Xilinx Zynq-7020 FPGA [29] as our edge device and a Xilinx VU9P [28] as our cloud device. Table 2 shows our experiment configurations on these two platforms along with their available resources.

For comparison, we adopt PACT [2] as our baseline, which uses the same number of bits for all layers except for the first layer, which extracts the low-level features: for it, 8 bits are used for both weights and activations, as it has fewer parameters and is very sensitive to errors. We follow a similar setup for the first layer (8 bits) and explore the bitwidth allocation policy for all the other layers. Under the same latency, HAQ consistently achieves better accuracy than the baseline on both the cloud and the edge (Table 3). With similar accuracy, HAQ can reduce the latency by 1.4× to 1.95× compared with the baseline.

                          Edge Accelerator                                          Cloud Accelerator
                          MobileNet-V1               MobileNet-V2                   MobileNet-V1               MobileNet-V2
            Bitwidths     Acc.-1  Acc.-5  Latency    Acc.-1  Acc.-5  Latency        Acc.-1  Acc.-5  Latency    Acc.-1  Acc.-5  Latency
  PACT [2]  4 bits        62.44   84.19   45.45 ms   61.39   83.72    52.15 ms      62.44   84.19   57.49 ms   61.39   83.72    74.46 ms
  Ours      flexible      67.40   87.90   45.51 ms   66.99   87.33    52.12 ms      65.33   86.60   57.40 ms   67.01   87.46    73.97 ms
  PACT [2]  5 bits        67.00   87.65   57.75 ms   68.84   88.58    66.94 ms      67.00   87.65   77.52 ms   68.84   88.58    99.43 ms
  Ours      flexible      70.58   89.77   57.70 ms   70.90   89.91    66.92 ms      69.97   89.37   77.49 ms   69.45   88.94    99.07 ms
  PACT [2]  6 bits        70.46   89.59   70.67 ms   71.25   90.00    82.49 ms      70.46   89.59   99.86 ms   71.25   90.00   127.07 ms
  Ours      flexible      71.20   90.19   70.35 ms   71.89   90.36    82.34 ms      71.20   90.08   99.66 ms   71.85   90.24   127.03 ms
  Original  8 bits        70.82   89.85   96.20 ms   71.81   90.25   115.84 ms      70.82   89.85  151.09 ms   71.81   90.25   189.82 ms

Table 3: Latency-constrained quantization on BISMO (edge accelerator and cloud accelerator) on ImageNet. Our framework can reduce the latency by 1.4× to 1.95× with negligible loss of accuracy compared with fixed-bitwidth (8 bits) quantization.

Interpreting the quantization policy. Our agent gives quite different quantization policies for the edge and cloud accelerators (Figure 3). For the activations, the depthwise convolution layers are assigned fewer bits than the pointwise layers on the edge, while on the cloud device the bitwidths of these two types of layers are similar. For the weights, the bitwidths of these types of layers are nearly the same on the edge, while on the cloud the depthwise convolution layers get more bits than the pointwise convolution layers.

Figure 3: Quantization policy under latency constraints for MobileNet-V1 (per-layer weight/activation bitwidths and OPs-per-byte for depthwise and pointwise layers, on the edge and cloud accelerators). On the edge accelerator, our RL agent allocates fewer activation bits to the depthwise convolutions, which echoes the fact that the depthwise convolutions are memory bounded and the activations dominate the memory access. On the cloud accelerator, our agent allocates more bits to the depthwise convolutions and fewer bits to the pointwise convolutions; as the cloud device has more memory bandwidth and higher parallelism, the network appears to be computation bounded.

We explain the difference in quantization policy between edge and cloud with the roofline model [27]. Many previous works use FLOPs or BitOPs as metrics to measure computation complexity. However, they are not able to directly reflect the latency, since there are many other factors influencing the hardware performance, such as memory access cost and degree of parallelism [24, 20]. Taking both computation and memory access into account, the roofline model assumes that applications are either computation-bound or memory bandwidth-bound (if they do not fit in on-chip caches), depending on their operation intensity. Operation intensity is measured as operations (MACs in neural networks) per byte accessed. A lower operation intensity indicates suffering more from memory access.
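To make operation intensity concrete, the sketch below (our own worked example, not from the paper) estimates MACs per byte accessed for a pointwise and a depthwise convolution under a simple model that counts reading inputs and weights and writing outputs once each; the byte widths are assumed parameters:

```python
def conv_op_intensity(h, w, c_in, c_out, k, depthwise=False,
                      act_bytes=1, weight_bytes=1):
    """Rough MACs-per-byte estimate for one conv layer (stride 1, 'same' padding),
    counting each input/output activation and each weight moved once."""
    if depthwise:
        c_out = c_in                                  # one filter per input channel
        macs = h * w * c_in * k * k
        n_weights = c_in * k * k
    else:
        macs = h * w * c_in * c_out * k * k
        n_weights = c_in * c_out * k * k
    bytes_moved = (h * w * c_in + h * w * c_out) * act_bytes + n_weights * weight_bytes
    return macs / bytes_moved

# A MobileNet-V1-style block with a 14x14 feature map and 512 channels:
print(conv_op_intensity(14, 14, 512, 512, 1))                    # pointwise 1x1: ~111 MACs/byte
print(conv_op_intensity(14, 14, 512, 512, 3, depthwise=True))    # depthwise 3x3: ~4 MACs/byte
```

The two-orders-of-magnitude gap is exactly why the depthwise layers sit far below the memory-bandwidth roof while the pointwise layers are compute bound.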
The bottom of Figure 3 shows the operation intensities (OPs per byte) of the convolution layers in MobileNet-V1. Depthwise convolution is memory bounded, and pointwise convolution is computation bounded. Our experiments show that when running MobileNet-V1 on the edge device with a small batch size, its latency is dominated by the depthwise convolution layers. Since the feature maps take a major proportion of the memory in depthwise convolution layers, our agent gives their activations fewer bits. In contrast, when running MobileNet-V1 on the cloud with a large batch size, our agent increases the bitwidth of the depthwise convolutions to preserve the accuracy at low memory overhead, since depthwise convolutions only take a small proportion of the total weights. A similar phenomenon can be observed in Figure 4 for MobileNet-V2. Moreover, as the activation size gets smaller in deeper layers, those layers get assigned more bits.

Figure 4: Quantization policy under latency constraints for MobileNet-V2 on BISMO (per-layer weight and activation bitwidths on the edge and cloud accelerators). Similar to Figure 3, the depthwise layers are assigned fewer bits on the edge accelerator, while the pointwise layers are assigned fewer bits on the cloud accelerator.

4.1.2 Quantization policy for BitFusion Architecture

In order to demonstrate the effectiveness of our framework on different hardware architectures, we further compare our framework with PACT [2] under latency constraints on the BitFusion [25] architecture (Table 4). Our framework performs much better than the hand-crafted policy at the same latency. It can achieve almost no degradation of accuracy with only half of the latency used by the original MobileNet-V1 model (from 20.08 to 11.09 ms). Therefore, our framework is flexible enough to provide specialized quantization policies for different hardware platforms.

              Weights    Activations   Acc.-1   Acc.-5   Latency
  PACT [2]    4 bits     4 bits        62.44    84.19     7.86 ms
  Ours        flexible   flexible      67.45    87.85     7.86 ms
  PACT [2]    6 bits     4 bits        67.51    87.84    11.10 ms
  Ours        flexible   flexible      70.40    89.69    11.09 ms
  PACT [2]    6 bits     6 bits        70.46    89.59    19.99 ms
  Ours        flexible   flexible      70.90    89.95    19.98 ms
  Original    8 bits     8 bits        70.82    89.85    20.08 ms

Table 4: Latency-constrained quantization on BitFusion (MobileNet-V1 on ImageNet). Our framework can reduce the latency by 2× with almost no loss of accuracy compared with fixed-bitwidth (8 bits) quantization.

4.2. Energy-Constrained Quantization

We then evaluate our framework under energy constraints. Similar to the latency-constrained experiments, we compare our framework with PACT [2], which uses a fixed number of bits without hardware feedback. From Table 5, we can clearly see that our framework outperforms the rule-based baseline: it achieves much better performance while consuming a similar amount of energy. In particular, our framework is able to achieve almost no loss of accuracy with nearly half of the energy consumption of the original MobileNet-V1 model (from 31.03 to 16.57 mJ), which suggests that mixed precision with a hardware-aware, specialized quantization policy can indeed help reduce the energy consumption.

              Weights    Activations   Acc.-1   Acc.-5   Energy
  PACT [2]    4 bits     4 bits        62.44    84.19    13.47 mJ
  Ours        flexible   flexible      64.78    85.85    13.69 mJ
  PACT [2]    6 bits     4 bits        67.51    87.84    16.57 mJ
  Ours        flexible   flexible      70.37    89.40    16.30 mJ
  PACT [2]    6 bits     6 bits        70.46    89.59    26.80 mJ
  Ours        flexible   flexible      70.90    89.73    26.67 mJ
  Original    8 bits     8 bits        70.82    89.95    31.03 mJ

Table 5: Energy-constrained quantization on BitFusion (MobileNet-V1 on ImageNet). Our framework reduces the power consumption by 2× with nearly no loss of accuracy compared with fixed-bitwidth quantization.
4.3. Model Size-Constrained Quantization

Finally, we evaluate our framework under model size constraints. Following Han et al. [8], we employ the k-means algorithm to quantize the values into k different centroids instead of using linear quantization for compression, since k-means quantization can be more effective in reducing the model size.
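A minimal sketch of such k-means weight quantization (our own illustration built on scikit-learn, assuming that b-bit quantization uses k = 2^b shared centroids per layer; not the paper's exact codebook training procedure):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, n_bits):
    """Replace each weight with the nearest of 2**n_bits shared centroids."""
    k = 2 ** n_bits
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(flat)
    centroids = km.cluster_centers_.ravel()
    return centroids[km.labels_].reshape(weights.shape)

# Example: quantize one layer's weights to 3 bits (8 shared centroids).
w = np.random.randn(64, 32).astype(np.float32)
w_q = kmeans_quantize(w, n_bits=3)
print(np.unique(w_q).size)   # <= 8 distinct values
```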
We compare our framework with Deep Compression [8] on MobileNets and ResNet-50. From Table 6, we can see that our framework performs much better than Deep Compression: it achieves higher accuracy with the same model size. For compact models like MobileNets, Deep Compression significantly degrades the performance, especially under aggressive quantization, while our framework can preserve the accuracy much better. For instance, when Deep Compression quantizes the weights of MobileNet-V1 to 2 bits, the accuracy drops significantly from 70.90 to 37.62; our framework can still achieve 57.14 accuracy with the same model size. The reason is that our framework makes full use of the mixed precision by systematically searching for the optimal quantization policy.

                            MobileNet-V1                     MobileNet-V2                     ResNet-50
                 Weights    Acc.-1  Acc.-5  Model Size       Acc.-1  Acc.-5  Model Size       Acc.-1  Acc.-5  Model Size
  Han et al. [8]  2 bits     37.62   64.31    1.09 MB         58.07   81.24    0.96 MB         68.95   88.68    6.32 MB
  Ours            flexible   57.14   81.87    1.09 MB         66.75   87.32    0.95 MB         70.63   89.93    6.30 MB
  Han et al. [8]  3 bits     65.93   86.85    1.60 MB         68.00   87.96    1.38 MB         75.10   92.33    9.36 MB
  Ours            flexible   67.66   88.21    1.58 MB         70.90   89.76    1.38 MB         75.30   92.45    9.22 MB
  Han et al. [8]  4 bits     71.14   89.84    2.10 MB         71.24   89.93    1.79 MB         76.15   92.88   12.40 MB
  Ours            flexible   71.74   90.36    2.07 MB         71.47   90.23    1.79 MB         76.14   92.89   12.14 MB
  Original        32 bits    70.90   89.90   16.14 MB         71.87   90.32   13.37 MB         76.15   92.86   97.49 MB

Table 6: Model size-constrained quantization on ImageNet. Compared with Deep Compression [7], our framework achieves higher accuracy under similar model size (especially under a high compression ratio).

Figure 5: Quantization policy under model size constraints for MobileNet-V2 (per-layer weight bitwidths and parameter counts for depthwise and pointwise layers). Our RL agent allocates more bits to the depthwise convolutions, since depthwise convolutions have fewer parameters.

Discussions. In Figure 5, we visualize the bitwidth allocation strategy for MobileNet-V2. From this figure, we can observe that our framework assigns more bits to the weights in depthwise convolution layers than in pointwise convolution layers. Intuitively, this is because the number of parameters in the former is much smaller than in the latter. Comparing Figure 4 and Figure 5, the policies are drastically different under different optimization objectives (fewer bits for depthwise convolutions under latency optimization, more bits for depthwise convolutions under model size optimization). Our framework succeeds in learning to adjust its bitwidth policy under different constraints.

5. Conclusion

In this paper, we propose Hardware-Aware Automated Quantization (HAQ), an automated framework for quantization which does not require any domain experts or rule-based heuristics. We provide a learning-based method that can search for the quantization policy with hardware feedback. Compared with indirect proxy signals, our framework can offer a specialized quantization solution for different hardware platforms. Extensive experiments demonstrate that our framework performs better than conventional rule-based approaches for multiple objectives: latency, energy and model size. Our framework reveals that the optimal policies on different hardware architectures are drastically different, and we interpret the implications of those policies. We believe these insights will inspire future software and hardware co-design for efficient deployment of deep neural networks.
Acknowledgements. We thank MIT Quest for Intelligence, MIT-IBM Watson AI Lab, Xilinx, Samsung, Intel, ARM, Qualcomm, and SONY for supporting this research. We thank Google Cloud and AWS Machine Learning Research Awards for providing the computation resource.

References

[1] Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. Path-Level Network Transformation for Efficient Architecture Search. In ICML, 2018.
[2] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv, 2018.
[3] François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. In CVPR, 2017.
[4] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv, 2016.
[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[6] EENews. Apple describes 7nm A12 Bionic chips, 2018.
[7] Song Han. Efficient Methods and Hardware for Deep Learning. PhD thesis, 2017.
[8] Song Han, Huizi Mao, and William Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[10] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In ECCV, 2018.
[11] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[12] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017.
[13] Imagination. PowerVR neural network accelerator, 2018.
[14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In CVPR, 2018.
[15] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[16] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv, 2018.
[17] Timothy Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
[18] Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime Neural Pruning. In NIPS, 2017.
[19] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive Neural Architecture Search. In ECCV, 2018.
[20] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
[21] NVIDIA. NVIDIA tensor cores, 2018.
[22] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient Neural Architecture Search via Parameter Sharing. In ICML, 2018.
[23] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
[24] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR, 2018.
[25] Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks. In ISCA, 2018.
[26] Yaman Umuroglu, Lahiru Rasnayake, and Magnus Sjalander. BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing. In FPL, 2018.
[27] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65-76, 2009.
[28] Xilinx. UltraScale architecture and product data sheet: Overview, 2018.
[29] Xilinx. Zynq-7000 SoC data sheet: Overview, 2018.
[30] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. arXiv, 2016.
[31] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In ECCV, 2018.
[32] Aojun Zhou, Anbang Yao, Kuan Wang, and Yurong Chen. Explicit loss-error-aware quantization for low-bit deep neural networks. In CVPR, 2018.
[33] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv, 2016.
[34] Chenzhuo Zhu, Song Han, Huizi Mao, and William Dally. Trained Ternary Quantization. In ICLR, 2017.
[35] Barret Zoph and Quoc V Le. Neural Architecture Search with Reinforcement Learning. In ICLR, 2017.
