Quantifying The Knowledge in A DNN To Explain Knowledge Distillation For Classification
Abstract—Compared to traditional learning from scratch, knowledge distillation sometimes makes the DNN achieve superior performance.
In this paper, we provide a new perspective based on information theory to explain the success of knowledge distillation, i.e., quantifying
knowledge points encoded in intermediate layers of a DNN for classification. To this end, we consider the signal processing in a DNN as a
layer-wise process of discarding information. A knowledge point is referred to as an input unit, the information of which is discarded much
less than that of other input units. Thus, we propose three hypotheses for knowledge distillation based on the quantification of knowledge
points. 1. The DNN learning from knowledge distillation encodes more knowledge points than the DNN learning from scratch. 2. Knowledge
distillation makes the DNN more likely to learn different knowledge points simultaneously. In comparison, the DNN learning from scratch
tends to encode various knowledge points sequentially. 3. The DNN learning from knowledge distillation is often more stably optimized than
the DNN learning from scratch. To verify the above hypotheses, we design three types of metrics with annotations of foreground objects to
analyze feature representations of the DNN, i.e., the quantity and the quality of knowledge points, the learning speed of different knowledge
points, and the stability of optimization directions. In experiments, we diagnosed various DNNs on different classification tasks, including
image classification, 3D point cloud classification, binary sentiment classification, and question answering, which verified the above
hypotheses.
Fig. 1. (a) Visualization of the information-discarding process through different layers of VGG-16 on the CUB200-2011 dataset. Input units with darker
colors discard less information. (b) Comparison of the coherency of different explanation methods. Our method coherently measures how input infor-
mation is gradually discarded through layers, and enables fair comparisons of knowledge representations between different layers. In comparison,
CAM [79] and gradient explanations [59] cannot generate coherent magnitudes of importance values through different layers for fair comparisons.
More analysis is presented in Section 3.2.
image. Then, the information of some units (e.g., information of pixels in the background) is significantly discarded. In comparison, information of other input units (e.g., information of pixels in the foreground) is discarded less.

In this way, we consider input units without significant information discarding as knowledge points, such as input units of the bird in Fig. 1a. These knowledge points usually encode discriminative information for inferences, e.g., the wings of birds in Fig. 1a are discriminative for the fine-grained bird classification. In comparison, input units with significant information discarding usually do not contain sufficient discriminative information for prediction. Thus, measuring the amount of information of input units discarded by the DNN enables us to quantify knowledge points.

Aside from the quantity of knowledge points, we also measure the quality of knowledge points with the help of annotations of foreground objects. To this end, we classify all knowledge points into two types, including those related and unrelated to the task. In this way, we can compute the ratio of knowledge points encoded in a DNN that are relevant to the task as a metric to evaluate the quality of knowledge points. A high ratio of relevant knowledge points indicates that the feature representation of a DNN is reliable. Fig. 2 illustrates that DNNs encoding a larger ratio of task-related knowledge points usually achieve better performance (please see Section 4.2 for further analysis).

Fig. 2. Positive correlation between the ratio of foreground knowledge points and the classification accuracy of DNNs on the CUB200-2011 dataset.

Based on the quantification of knowledge points, we propose the following three hypotheses as to why knowledge distillation outperforms traditional learning from scratch. Furthermore, we design three types of metrics to verify these hypotheses.

Hypothesis 1: Knowledge Distillation Makes the DNN Learn More Knowledge Points Than the DNN Learning From Scratch. The information-bottleneck theory [58], [68] claimed that a DNN tended to preserve features relevant to the task and neglected those features irrelevant to the task for inference. We assume that a teacher network is supposed to achieve superior classification performance, because the teacher network usually has a more complex network architecture or is often learned using more training data. In this scenario, the teacher network is considered to encode more knowledge points related to the task and fewer knowledge points unrelated to the task than the student network. In this way, knowledge distillation forces the student network to mimic the logic of the teacher network, which ensures that the student network encodes more knowledge points relevant to the task and fewer knowledge points irrelevant to the task than the DNN learning from scratch.

Hypothesis 2: Knowledge distillation makes the DNN more likely to learn different knowledge points simultaneously. In comparison, the DNN learning from scratch tends to encode various knowledge points sequentially through different epochs. This is the case because knowledge distillation forces the student network to mimic all knowledge points of the teacher network simultaneously. In comparison, the DNN learning from scratch tends to focus on different knowledge points in different epochs sequentially.

Hypothesis 3: Knowledge Distillation Usually Yields More Stable Optimization Directions Than Learning From Scratch. The DNN learning from scratch usually keeps trying different knowledge points in early epochs and then discarding unreliable ones in later epochs, rather than modeling target knowledge points directly. Thus, the optimization directions of the DNN learning from scratch are inconsistent. For convenience, in this study, the phenomenon of inconsistent optimization directions through different epochs is termed as "detours."1

1. "Detours" refers to the phenomenon that a DNN tends to encode various knowledge points in early epochs and discard non-discriminative ones later.

In contrast to learning from scratch, knowledge distillation usually forces the student network to mimic the teacher network directly. The student network straightforwardly encodes target knowledge points without temporarily modeling and discarding other knowledge points. Thus, the student network is optimized without significant detours. For a better understanding of Hypothesis 3, let us take the fine-grained bird classification task as an example. With the guidance of the teacher network, the student network directly learns discriminative features from heads or tails of
birds without significant detours. In comparison, the DNN learning from scratch usually extracts features from the head, belly, tail, and tree-branch in early epochs, and later neglects features from the tree branch. In this case, the optimization direction of the DNN learning from scratch is more unstable.

Note that previous studies on explanations of the optimization stability and the generalization power of the DNN were mainly conducted from the perspective of the parameter space. [12], [31], [61] demonstrated that the flatness of minima of the loss function resulted in good generalization. [1], [43], [44] illustrated that the generalization behavior of a network depended on the norm of the weights. In contrast, this study explains the efficiency of optimization from the perspective of knowledge representations, i.e., analyzing whether or not the DNN encodes different knowledge points simultaneously and whether it does so without significant detours.

Connection to the Information-Bottleneck Theory. We roughly consider that the information-bottleneck theory [58], [68] can also be used to explain the change of knowledge points in a DNN. It showed that a DNN usually extracted a few discriminative features for inferences, and neglected massive features irrelevant to inferences during the training process. The information-bottleneck theory measures the entire discrimination power of each sample. In comparison, our method quantifies the discrimination power of each input unit in a more detailed manner, and measures pixel-wise information discarded through forward propagation.

Methods. To verify the above hypotheses, we design three types of metrics to quantify knowledge points encoded in intermediate layers of a DNN, and analyze how different knowledge points are modeled during the training procedure.

The first type of metrics measures the quantity and quality of knowledge points. We consider that a well-trained DNN with superior classification performance is supposed to encode a massive amount of knowledge points, most of which are discriminative for the classification of target categories.

The second type of metrics measures whether a DNN encodes various knowledge points simultaneously or sequentially, i.e., whether different knowledge points are learned at similar speeds.

The third metric evaluates whether a DNN is optimized with or without significant detours. In other words, this metric measures whether or not a DNN directly learns target knowledge points without temporarily modeling and discarding other knowledge points. To this end, this metric is designed as an overlap between knowledge points encoded by the DNN in early epochs and the final knowledge points modeled by the DNN. If this overlap is large, then we consider that this DNN is optimized without significant detours; otherwise, with detours. This is the case because a small overlap indicates that the DNN first encodes a large number of knowledge points in early epochs, but many of these knowledge points are discarded in later epochs.

In summary, we use these metrics to compare knowledge distillation and learning from scratch, in order to verify Hypotheses 1-3.

Whether to Use the Annotated Knowledge Points. In particular, we should define and quantify knowledge points without any human annotations. This is because we do not want the analysis of the DNN to be influenced by manual annotations with significant subjective bias, which would negatively impact the trustworthiness of this research. Besides, annotating all knowledge points is also typically expensive to the point of being impractical or impossible. However, previous studies often define visual concepts (knowledge points) by human annotations, and these concepts usually have specific semantic meanings. For example, Bau et al. [2] manually annotated six types of semantic visual concepts (objects, parts, textures, scenes, materials, and colors) to explain feature representations of DNNs. In this way, these methods cannot fairly quantify the actual amount of knowledge encoded by a DNN, because most visual concepts are not countable and cannot be defined with human annotations. In contrast, in this study, we use the information in each input unit discarded by the DNN to define and quantify knowledge points without human annotations. Such knowledge points are usually referred to as "Dark Matters" [71]. Compared to traditionally labeled visual concepts, we consider that these dark-matter knowledge points can provide a more general and trustworthy way to explain the feature representations of DNNs.

Scope of Explanation. A trustworthy explanation of knowledge distillation usually involves two requirements. First, the teacher network should be well optimized. Otherwise, if the teacher network does not converge or is optimized for other tasks, then the teacher network is not qualified to perform distillation. Second, knowledge distillation often has two fully different utilities. (1) Distilling from high-dimensional features usually forces the student network to mimic all sophisticated feature representations in the teacher network. (2) Distilling from the low-dimensional network output mainly selectively uses confident training samples for learning, with very little information on how the teacher network encodes detailed features. In this case, the distillation selectively emphasizes the training of simple samples and ignores the training of hard samples.2 Therefore, in this study, we mainly explain the first utility of knowledge distillation, i.e., forcing the student network to mimic the knowledge of the teacher network. This utility is mainly exhibited by distilling from high-dimensional intermediate-layer features.

2. Confident (simple) samples usually generate more salient output signals for knowledge distillation than unconfident (hard) samples.

Contributions of this study are summarized as follows.

1. The quantification of knowledge points encoded in intermediate layers of DNNs can be used as a new perspective to analyze DNNs.

2. Based on the quantification of knowledge points, we design three types of metrics to explain the mechanism of knowledge distillation.

3. Three hypotheses about knowledge distillation are proposed and verified in different classification applications, including image classification, 3D point cloud classification, binary sentiment classification, and question answering, which shed new light on the interpretation of knowledge distillation.

A preliminary version of this work appeared in [10].
the first type of utility, i.e., forcing the student network to mimic knowledge points of the teacher network. In contrast, distilling from relatively low-dimensional network output often exhibits the second type of utility, i.e., selecting confident samples for learning and ignoring unconfident samples. In this paper, we mainly explain the first utility of knowledge distillation, which is mainly exhibited by distilling from high-dimensional intermediate-layer features.

3.1 Quantifying Information Discarding

Before defining and quantifying knowledge points encoded in DNNs, we first introduce how to measure the pixel-wise information discarding of the input sample in this section.

According to the information-bottleneck theory [58], [68], we consider the signal processing of a forward propagation as the layer-wise discarding of input information. In low layers, most input information is used to compute features. In contrast, in high layers, information in input units unrelated to the inference is gradually discarded, and only related pixel-wise information is maintained in the computation of network features. In this way, we consider that features of the highest layer mainly encode the information that is highly relevant to the inferences.

Thus, we propose methods to quantify the input information encoded in a specific intermediate layer of a DNN [23], [40], i.e., measuring how much input information is discarded when the DNN extracts the feature of this layer. To this end, the information in each input unit discarded by the DNN is formulated as the entropy, given a feature of a certain network layer.

Specifically, given a trained DNN for classification and an object instance x ∈ R^n with n input units, let f = f(x) denote the feature of a network intermediate layer, and let y denote the network prediction w.r.t. x. The input unit is referred to as a variable (or a set of variables) in the input sample. For example, each pixel (or pixels within a small local region) of an input image can be taken as an input unit, and the embedding of each word in an input sentence also can be regarded as an input unit.

In our previous works [23], [40], we consider that the DNN satisfies the Lipschitz constraint ‖y′ − y‖ ≤ κ‖f(x′) − f‖, where κ is the Lipschitz constant, and y′ corresponds to the network prediction of the sample x′. The Lipschitz constant κ can be computed as the maximum norm of the gradient within a small range of features, i.e., κ = sup_{f(x′): ‖f(x′)−f‖² ≤ τ} ‖∂y′/∂f(x′)‖. This indicates that if we weakly perturb the low-dimensional manifold of features w.r.t. the input x within a small range {f(x′) | ‖f(x′) − f‖² ≤ τ}, then the output of the DNN is also perturbed within a small range ‖y′ − y‖ ≤ κτ, where τ is a positive scalar. In other words, all such weakly perturbed features usually represent the same object instance.

However, determining the explicit low-dimensional manifold of features w.r.t. the input x is very difficult. As an expediency, we add perturbations Δx to the input sample x to approximate the manifold of the feature, i.e., generating samples x′ = x + Δx, subject to ‖f(x′) − f‖² ≤ τ. In this way, the entropy H(X′) measures the uncertainty of the input when the feature represents the same object instance, ‖y′ − y‖ ≤ κτ, as mentioned above. In other words, H(X′) quantifies how much input information can be discarded without affecting the inference of the object:

    \max H(X') \quad \text{s.t.} \quad \forall x' \in X', \; \|f(x') - f\|^2 \le \tau. \qquad (1)

X′ denotes a set of inputs corresponding to the concept of a specific object instance. We assume that Δx is an i.i.d. Gaussian noise, i.e., x′ = x + Δx ∼ N(x, Σ(σ) = diag(σ₁², …, σₙ²)). Here, σᵢ indicates the variance of the perturbation w.r.t. the i-th input unit. The assumption of the Gaussian distribution ensures that the entropy H(X′) of the input can be decomposed into pixel-level entropies {Hᵢ}:

    H(X') = \sum_{i=1}^{n} H_i, \qquad (2)

where Hᵢ = log σᵢ + (1/2) log(2πe) measures the pixel-wise information discarding. A large value of Hᵢ usually indicates that the information of the i-th unit is significantly discarded during the forward propagation. For example, as shown in Fig. 1a, information encoded in background pixels is significantly discarded in the fine-grained bird classification.

Thus, we learn the optimal Σ(σ) = diag(σ₁², …, σₙ²) via the maximum-entropy principle to compute the pixel-wise information discarding Hᵢ = log σᵢ + (1/2) log(2πe). The objective function is formulated in Equation (1), which maximizes H(X′) subject to constraining features within the scope of a specific object instance, ‖f(x′) − f‖² ≤ τ. That is, we enumerate all perturbation directions in the input sample x within a small variance of the feature f, in order to approximate the low-dimensional manifold of f(x′). We use the Lagrange multiplier to approximate Equation (1) as the following loss:

    \min_{\sigma} Loss(\sigma),
    Loss(\sigma) = \frac{1}{d_f^2} \, \mathbb{E}_{x' \sim \mathcal{N}(x, \Sigma(\sigma))} \left[ \|f(x') - f\|^2 \right] - \alpha \sum_{i=1}^{n} H_i \qquad (3)
                 = \frac{1}{d_f^2} \, \mathbb{E}_{x' = x + \sigma \circ \delta, \; \delta \sim \mathcal{N}(0, I)} \left[ \|f(x') - f\|^2 \right] - \alpha \sum_{i=1}^{n} H_i. \qquad (4)

Considering that Equation (3) is intractable, we use x′ = x + σ ∘ δ, δ ∼ N(0, I) to simplify the computation of Loss(σ), where ∘ denotes the element-wise multiplication. σ = [σ₁, …, σₙ]ᵀ represents the range within which the input sample can change. In this way, Loss(σ) in Equation (4) is tractable, and we can learn σ via gradient descent. d_f² = lim_{τ→0⁺} E_{x′∼N(x, τ²I)}[‖f(x′) − f‖²] denotes the inherent variance of intermediate-layer features subject to small input noises with a fixed magnitude τ. d_f² is used for normalization, and a positive scalar α is used to balance the two loss terms.
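To make the optimization of Equation (4) concrete, the following is a minimal PyTorch-style sketch (not the paper's released implementation) of how the pixel-wise information discarding could be estimated: σ is parameterized per spatial unit, perturbed samples are drawn as x′ = x + σ ∘ δ, and Hᵢ = log σᵢ + (1/2) log(2πe) is reported after gradient descent. The function name, the Monte-Carlo sample count, the optimizer settings, and the crude stand-in for the normalization d_f² are all illustrative assumptions.

    import math
    import torch

    def compute_information_discarding(f, x, alpha=0.1, n_steps=200, lr=0.01, n_samples=8):
        """Sketch of Eqs. (1)-(4): learn a per-unit sigma so that perturbed inputs
        keep the intermediate-layer feature close to f(x), then report
        H_i = log(sigma_i) + 0.5 * log(2*pi*e) as the pixel-wise information discarding.

        f: callable mapping an input batch to intermediate-layer features.
        x: input image tensor of shape (1, C, H, W)."""
        with torch.no_grad():
            feat = f(x)                                   # reference feature f = f(x)
            d_f2 = feat.pow(2).mean().clamp_min(1e-8)     # crude stand-in for the normalization d_f^2

        # One sigma per spatial unit (shared over channels), parameterized via log sigma.
        log_sigma = torch.zeros(1, 1, x.shape[2], x.shape[3], device=x.device, requires_grad=True)
        optimizer = torch.optim.SGD([log_sigma], lr=lr)

        for _ in range(n_steps):
            sigma = log_sigma.exp()
            delta = torch.randn((n_samples,) + x.shape[1:], device=x.device)
            x_perturbed = x + sigma * delta               # x' = x + sigma ∘ delta, delta ~ N(0, I)
            feat_diff = (f(x_perturbed) - feat).pow(2).flatten(1).sum(dim=1).mean()
            entropy = (log_sigma + 0.5 * math.log(2 * math.pi * math.e)).sum()  # sum_i H_i
            loss = feat_diff / d_f2 - alpha * entropy     # Monte-Carlo estimate of Eq. (4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Per-unit entropies H_i = log sigma_i + 0.5 * log(2*pi*e), shape (1, 1, H, W).
        return log_sigma.detach() + 0.5 * math.log(2 * math.pi * math.e)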
3.2 Quantification of Knowledge Points

Hypothesis 1: Knowledge distillation makes the DNN encode more knowledge points than learning from scratch.

The basic idea is to measure the quantity and quality of knowledge points encoded in the intermediate layers of a DNN.
Fig. 3. Visualization of knowledge points. The dark color indicates a low entropy value Hi. Image regions with low pixel-wise entropies {Hi} are considered as knowledge points. In a real implementation, to improve the stability and efficiency of the computation, we divide the image into 16 × 16 grids, and each grid is taken as a "pixel" to compute the entropy Hi.
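The grid-based implementation mentioned in the caption above can be realized by sharing one σ value within each grid. Below is a minimal sketch of this idea, assuming the per-grid parameterization is simply upsampled to pixel resolution before perturbing the image; the 16 × 16 grid constant follows the caption, while the function name, shapes, and interpolation mode are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def expand_grid_sigma(log_sigma_grid, image_size):
        """Share one sigma per grid cell: a (1, 1, 16, 16) grid of log-sigma values
        is upsampled to the full image resolution, so every pixel inside a grid
        uses the same perturbation variance (and thus the same entropy H_i)."""
        return F.interpolate(log_sigma_grid, size=image_size, mode="nearest")

    # Usage sketch: optimize a 16 x 16 grid of log-sigma values instead of a per-pixel map.
    log_sigma_grid = torch.zeros(1, 1, 16, 16, requires_grad=True)
    log_sigma_pixels = expand_grid_sigma(log_sigma_grid, image_size=(224, 224))
    noise = log_sigma_pixels.exp() * torch.randn(1, 3, 224, 224)   # broadcast over RGB channels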
The Quantity of Knowledge Points. In this paper, we use the pixel-wise information of each input unit discarded by the DNN to define and quantify knowledge points. Given a trained DNN and an input sample x ∈ X, let us consider the pixel-wise information discarding Hᵢ w.r.t. the feature of a certain network layer f = f(x). X represents a set of input samples.

According to Equation (2), a low pixel-wise entropy Hᵢ represents that the information of this input unit is less discarded. In other words, the DNN tends to use these input units with low entropies to compute the feature f and make inferences. Thus, a knowledge point is defined as an input unit with a low pixel-wise entropy Hᵢ, which encodes discriminative information for prediction. For example, the heads of birds in Fig. 3 are referred to as knowledge points, which are useful for the fine-grained bird classification.

To this end, we use the average entropy H̄ of all background units Λ_bg as a baseline to determine knowledge points, i.e., H̄ = E_{i∈Λ_bg}[Hᵢ]. This is because information encoded in background units is usually supposed to be significantly discarded and irrelevant to the inference. If the entropy of a unit is significantly lower than the baseline entropy, H̄ − Hᵢ > b, then this input unit can be considered as a valid knowledge point. Here, b is a threshold to determine the knowledge points, which is introduced in Section 4.1. In this way, we can quantify the number of knowledge points in the DNN.

The Quality of Knowledge Points. In addition to the quantity of knowledge points, we also consider the quality of knowledge points. Generally speaking, information encoded in foreground units is usually supposed to be crucial to the inferences. In comparison, information encoded in background units is often supposed to exhibit negligible effects on the prediction. Therefore, we design a metric λ to evaluate the quality of knowledge points by examining whether or not most knowledge points are localized in the foreground:

    \lambda = \mathbb{E}_{x \in X} \left[ N^{fg}_{point}(x) \, / \, \big( N^{fg}_{point}(x) + N^{bg}_{point}(x) \big) \right], \qquad (5)

    N^{bg}_{point}(x) = \sum_{i \in \Lambda_{bg} \; w.r.t. \; x} \mathbb{1}\big( \bar{H} - H_i > b \big), \qquad (6)

    N^{fg}_{point}(x) = \sum_{i \in \Lambda_{fg} \; w.r.t. \; x} \mathbb{1}\big( \bar{H} - H_i > b \big). \qquad (7)

N^bg_point(x) and N^fg_point(x) denote the number of knowledge points encoded in the background and the foreground, respectively. Λ_bg and Λ_fg are sets of input units in the background and the foreground w.r.t. the input sample x, respectively. 1(·) refers to the indicator function. If the condition inside is valid, then 1(·) returns 1; otherwise, it returns 0. A large value of λ represents that the feature representation of the DNN is reliable; otherwise, unreliable.

Thus, knowledge points provide a new way to explain the mechanism of knowledge distillation, i.e., checking whether knowledge distillation makes the DNN encode a large amount of knowledge points, and whether most knowledge points are precisely localized in the foreground.
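As a concrete illustration of Equations (5)-(7), the sketch below counts foreground and background knowledge points from a map of pixel-wise entropies and a foreground mask, and then computes the ratio λ; the threshold value and all tensor shapes are placeholders chosen for the example, not values prescribed by the paper.

    import torch

    def count_knowledge_points(H, fg_mask, b=0.2):
        """H: tensor of pixel-wise (or grid-wise) entropies H_i for one sample.
        fg_mask: boolean tensor of the same shape, True for foreground units.
        Returns (N_fg, N_bg): numbers of foreground / background knowledge points."""
        H_bar = H[~fg_mask].mean()                 # baseline: average entropy of background units
        is_point = (H_bar - H) > b                 # a unit is a knowledge point if H_bar - H_i > b
        n_fg = (is_point & fg_mask).sum().item()   # Eq. (7)
        n_bg = (is_point & ~fg_mask).sum().item()  # Eq. (6)
        return n_fg, n_bg

    def ratio_lambda(samples, b=0.2):
        """samples: iterable of (H, fg_mask) pairs; returns the ratio in Eq. (5)."""
        ratios = []
        for H, fg_mask in samples:
            n_fg, n_bg = count_knowledge_points(H, fg_mask, b)
            if n_fg + n_bg > 0:                    # skip samples without any knowledge point
                ratios.append(n_fg / (n_fg + n_bg))
        return sum(ratios) / max(len(ratios), 1)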
Coherency and Generality. As discussed in [23], [40], a trustworthy explanation method to evaluate feature representations of the DNN should satisfy the criteria of the coherency and generality. From this perspective, we analyze the trustworthiness of knowledge points.

Coherency indicates that an explanation method should reflect the essence of the encoded knowledge representations in DNNs, which are invariant to network settings. In this study, we use pixel-wise information discarding to define knowledge points without strong assumptions on feature representations or network architectures. In this way, knowledge points provide a coherent evaluation of feature representations. However, most existing explanation methods shown in Table 1 usually fail to meet this criterion due to their biased assumptions. For example, the CAM [79] and gradient explanations [59] do not generate coherent explanations, because the gradient map ∂Loss/∂f is usually not coherent through different layers.

To this end, we compare our method with CAM and gradient explanations. Theoretically, we can easily construct two DNNs to represent exactly the same knowledge but with different magnitudes of gradients, as follows. A VGG-16 model [60] was learned using the CUB200-2011 dataset [65] for fine-grained bird classification. Given this pre-trained DNN, we slightly revised the magnitude of parameters in every pair of neighboring convolutional layers y = x ⊗ w + b to examine the coherency. For the Lth and (L+1)th layers, parameters were revised as w(L) ← 4w(L), w(L+1) ← w(L+1)/4, b(L) ← 4b(L), b(L+1) ← b(L+1)/4. Such revisions did not change knowledge representations or the network output, but did alter the gradient magnitude.

TABLE 1
Comparisons of Different Explanation Methods in Terms of the Coherency

Method                                 | Fair layer-wise comparisons | Fair comparisons between different networks
Gradient-based [41], [59], [74], [79]  | No                          | No
Perturbation-based [17], [33]          | No                          | No
Inversion-based [13]                   | No                          | No
Entropy-based                          | Yes                         | Yes

The coherency of the entropy-based method enables fair layer-wise comparisons within a network and between different networks.
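The parameter-rescaling construction described above can be reproduced in a few lines. The sketch below rescales a pair of neighboring convolutional layers of a torchvision VGG-16 and checks that the network output stays numerically unchanged, which illustrates why gradient magnitudes alone are not a coherent measure. It uses a slightly simplified variant of the revision (only the first layer's weight and bias and the second layer's weight are rescaled), for which the positive homogeneity of ReLU guarantees an exactly unchanged output; the chosen layer indices, the factor of 4, and the use of a recent torchvision build are illustrative assumptions.

    import torch
    import torchvision

    model = torchvision.models.vgg16(weights=None).eval()  # architecture only; random weights suffice here

    # Two neighboring conv layers separated only by ReLU (features[0] and features[2] in VGG-16).
    conv_a, conv_b = model.features[0], model.features[2]

    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        y_before = model(x)

        # Simplified revision: w(L) <- 4 w(L), b(L) <- 4 b(L), w(L+1) <- w(L+1) / 4.
        conv_a.weight *= 4.0
        conv_a.bias *= 4.0
        conv_b.weight /= 4.0

        y_after = model(x)

    # Output is preserved (up to floating-point error), yet gradients w.r.t. the first
    # layer's activations now differ by a factor of 4, so gradient-based maps change.
    print(torch.allclose(y_before, y_after, atol=1e-4))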
TABLE 2
Comparisons Between the Student Network and the Baseline Network (B) for the Image Classification on the CUB200-2011 Dataset, Where the Student Network Distilled From the FC1 Layer of a Larger Teacher Network

Target network | Learning methods        | N^fg_point ↑ | N^bg_point ↓ | λ ↑  | D_mean ↓ | D_var ↓ | ρ ↑
VGG-11         | Distilling from VGG-16  | 37.11        | 11.50        | 0.78 | 0.48     | 0.06    | 0.58
               | Learning from scratch   | 25.50        | 11.50        | 0.70 | 2.14     | 3.29    | 0.56
               | Distilling from VGG-19  | 41.44        | 8.67         | 0.83 | 0.55     | 0.10    | 0.63
               | Learning from scratch   | 25.50        | 11.50        | 0.70 | 2.14     | 3.29    | 0.56
AlexNet        | Distilling from AlexNet | 35.29        | 4.07         | 0.90 | 4.54     | 8.43    | 0.57
               | Learning from scratch   | 24.00        | 5.90         | 0.80 | 2.80     | 5.32    | 0.53
               | Distilling from VGG-16  | 39.14        | 4.45         | 0.90 | 0.78     | 0.05    | 0.67
               | Learning from scratch   | 24.00        | 5.90         | 0.80 | 2.80     | 5.32    | 0.53
               | Distilling from VGG-19  | 38.36        | 4.14         | 0.90 | 0.92     | 0.11    | 0.65
               | Learning from scratch   | 24.00        | 5.90         | 0.80 | 2.80     | 5.32    | 0.53

Fig. 5. Detours of learning knowledge points. We visualize sets of foreground knowledge points learned after different epochs. The green box indicates the union of knowledge points learned during all epochs. The (1 − ρ) value denotes the ratio of knowledge points that are discarded during the learning process to the union set of all learned knowledge points. A larger ρ value indicates that the DNN is optimized with fewer detours.

learning from scratch cannot directly encode target knowledge points for inference. Thus, the optimization direction of this DNN is inconsistent and unstable in early and late epochs, i.e., with significant detours. In contrast to learning from scratch, knowledge distillation forces the student network to directly mimic the well-trained teacher network. With the guidance of the teacher network, knowledge distillation makes the student network straightforwardly model target knowledge points without temporarily modeling and discarding other knowledge points. Thus, the DNN learning via knowledge distillation tends to be optimized in a stable direction, i.e., without significant detours.

Metric. To verify Hypothesis 3, we design a metric to evaluate whether a DNN is optimized in a stable and consistent direction. Let S_j(x) = {i ∈ x | H̄ − Hᵢ > b} denote the set of foreground knowledge points in the input sample x encoded by the DNN learned after the j-th epoch, where j ∈ {1, 2, …, M}. Each knowledge point a ∈ S_j(x) is referred to as a specific unit i on the foreground of the input sample x, which satisfies H̄ − Hᵢ > b. Then, we use the metric ρ to measure the stability of optimization directions:

    \rho = \frac{|S_M(x)|}{\big| \bigcup_{j=1}^{M} S_j(x) \big|}. \qquad (9)

ρ represents the ratio of finally encoded knowledge points to all temporarily attempted knowledge points in intermediate epochs, where |·| denotes the cardinality of the set. More specifically, the numerator of Equation (9) reflects the number of foreground knowledge points which have been ultimately chosen for inference. For example, knowledge points shown in the black box of Fig. 5 are encoded as final knowledge points for fine-grained bird classification. The denominator represents the number of all knowledge points temporarily learned during the entire training process, which is shown as the green box in Fig. 5. Moreover, (⋃_{j=1}^{M} S_j(x)) \ S_M(x) indicates the set of knowledge points which have been tried, but finally are discarded by the DNN. The set of those discarded knowledge points is shown as the brown box in Fig. 5. Thus, a high value of ρ indicates that the DNN is optimized without significant detours and more stably, and vice versa.
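A minimal sketch of how the stability metric ρ in Equation (9) could be computed from per-epoch entropy maps is given below; it reuses the same thresholding idea as the earlier counting sketch, and the number of recorded epochs, the threshold, and the data layout are illustrative assumptions rather than the paper's exact protocol.

    import torch

    def foreground_point_set(H, fg_mask, b=0.2):
        """S_j(x): indices of foreground units whose entropy satisfies H_bar - H_i > b."""
        H_bar = H[~fg_mask].mean()
        selected = ((H_bar - H) > b) & fg_mask
        idx = selected.flatten().nonzero(as_tuple=True)[0]
        return set(idx.tolist())

    def stability_rho(entropy_maps_per_epoch, fg_mask, b=0.2):
        """Eq. (9): rho = |S_M(x)| / |union_j S_j(x)| over the M recorded epochs."""
        point_sets = [foreground_point_set(H, fg_mask, b) for H in entropy_maps_per_epoch]
        union = set().union(*point_sets)
        final = point_sets[-1]
        return len(final) / max(len(union), 1)

    # Usage sketch: entropy maps recorded after each epoch for one sample.
    fg_mask = torch.zeros(16, 16, dtype=torch.bool)
    fg_mask[4:12, 4:12] = True
    maps = [torch.rand(16, 16) for _ in range(5)]   # stand-ins for per-epoch {H_i}
    print(stability_rho(maps, fg_mask))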
4 EXPERIMENT

In this section, we conducted experiments on image classification, 3D point cloud classification, and natural language processing (NLP) tasks. Experimental results verified the proposed hypotheses.

4.1 Implementation Details

Given a teacher network, we distilled knowledge from the teacher network to the student network. In order to remove the effects of network architectures, the student network had the same architecture as the teacher network. For fair comparisons, the DNN learning from scratch was also required to have the same architecture as the teacher network. For convenience, we named the DNN learning from scratch the baseline network hereinafter.

Additionally, we conducted experiments to check whether the size of the teacher network affected the verification of the three hypotheses. Table 2 and Table 9 show that when the student network distilled from a larger teacher network, Hypotheses 1-3 were still verified. This phenomenon indicates that variations in the size of the teacher network did not hurt the verification of our conclusions. To simplify the story, we set student networks to have the same architectures as teacher networks in the following experiments.

Datasets & DNNs. For the image classification task, we conducted experiments based on AlexNet [34], VGG-11, VGG-16, VGG-19 [60], ResNet-50, ResNet-101, and ResNet-152 [29]. We trained these DNNs based on the ILSVRC-2013 DET dataset [56], the CUB200-2011 dataset [65], and the Pascal VOC 2012 dataset [14] for object classification, respectively. Considering the high computational burden of training on the entire ILSVRC-2013 DET dataset, we conducted the classification of terrestrial mammal categories for comparative experiments. Note that all teacher networks in Sections 4.3, 4.4 were pre-trained on the ImageNet dataset [53], and then fine-tuned using the aforementioned dataset. In comparison, all baseline networks were trained from scratch. Moreover, data augmentation [28] was applied to
both the student network and the baseline network, when the DNNs were trained using the ILSVRC-2013 DET dataset or the Pascal VOC 2012 dataset.

We used object images cropped by object bounding boxes to train the aforementioned DNNs on each of the above datasets, respectively. In particular, to achieve stable results, images in the Pascal VOC 2012 dataset were cropped by using 1.2 width × 1.2 height of the original object bounding box. For the ILSVRC-2013 DET dataset, we cropped each image by using 1.5 width × 1.5 height of the original object bounding box. This was performed because no ground-truth annotations of object segmentation were available for the ILSVRC-2013 DET dataset. We used the object bounding box to separate foreground regions and background regions of images in the ILSVRC-2013 DET dataset. In this way, pixels within the object bounding box were regarded as the foreground Λ_fg, and pixels outside the object bounding box were referred to as the background Λ_bg.

For natural language processing, we fine-tuned a pre-trained BERT model [11] using the SQuAD dataset [49] as the teacher network towards the question-answering task. Besides, we also fine-tuned another pre-trained BERT model on the SST-2 dataset [62] as the teacher network for binary sentiment classification. Baseline networks and student networks were simply trained using samples in the SST-2 dataset or the SQuAD dataset. Moreover, for both the SQuAD dataset and the SST-2 dataset, we annotated input units irrelevant to the prediction as the background Λ_bg, according to human cognition. We annotated units relevant to the inference as the foreground Λ_fg. For example, words related to answers were labeled as foregrounds in the SQuAD dataset. Words containing sentiments were annotated as foregrounds in the SST-2 dataset.

For the 3D point cloud classification task, we conducted comparative experiments on the PointNet [47], DGCNN [66], PointConv [69], and PointNet++ [48] models. Considering that most widely used benchmark datasets for point cloud classification only contained foreground objects, we constructed a new dataset containing both foreground objects and background objects based on the ModelNet40 dataset [70], as follows. For each sample (i.e., foreground object) in the ModelNet40 dataset, we first randomly sampled five point clouds, the labels of which differed from the foreground object. Then, we randomly selected 500 points from the sampled five point clouds, and attached these points to the foreground object as background points. Moreover, the teacher network was trained using all training data in this generated dataset for classification. In comparison, we only randomly sampled 10% of the training data in this generated dataset to learn the baseline and student networks, respectively. Thus, the teacher network was guaranteed to encode better knowledge representations.
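A rough sketch of the background-augmentation procedure described above is given below; it assumes a ModelNet40-style loader that returns (points, label) pairs with points as an (N, 3) array, and the helper name and random-seed handling are illustrative assumptions, while the five sampled clouds and the 500 background points follow the text.

    import random
    import numpy as np

    def add_background_points(sample_idx, dataset, n_other=5, n_bg_points=500):
        """Attach background points to one ModelNet40-style sample.

        dataset: a sequence of (points, label) pairs, points of shape (N, 3).
        Returns (augmented_points, label), where the background consists of 500
        points drawn from five objects of different categories."""
        points, label = dataset[sample_idx]

        # Randomly pick five samples whose labels differ from the foreground object.
        other_indices = [i for i, (_, other_label) in enumerate(dataset)
                         if other_label != label]
        chosen = random.sample(other_indices, n_other)

        # Pool the points of the five sampled clouds and keep 500 of them as background.
        pool = np.concatenate([dataset[i][0] for i in chosen], axis=0)
        bg = pool[np.random.choice(len(pool), size=n_bg_points, replace=False)]

        return np.concatenate([points, bg], axis=0), label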
Distillation. Given a well-trained teacher network and an input sample x, we selected a convolutional layer or a fully-connected (FC) layer l as the target layer to perform knowledge distillation. We used the distillation loss ‖f_T(x) − f_S(x)‖² to force the student network to mimic the feature of the teacher network. Here, f_T(x) denoted the feature of the l-th layer in the teacher network, and f_S(x) indicated the feature of the l-th layer in its corresponding student network.

In order to simplify the story, we used the distillation loss to learn the student network, rather than the classification loss. Given the distilled features, we further trained parameters above the target layer l merely using the classification loss. In this way, we were able to ensure that all knowledge points of the student network were exclusively learned via distillation. Otherwise, if we trained the student network using both the distillation loss and the classification loss, distinguishing whether the classification loss or the distillation loss was responsible for the learning of each knowledge point would be difficult.

Moreover, we employed two schemes of knowledge distillation in this study, including learning from intermediate-layer features and learning from network output. Specifically, we distilled features in the top convolutional layer (Conv) or in each of three FC layers (namely FC1, FC2, and FC3) of the teacher network to train the student network, respectively. In this way, knowledge distillation from the FC3 layer corresponded to the scheme of learning from output, and distillation from other layers corresponded to the scheme of learning from intermediate-layer features. In spite of these two schemes, we mainly explained knowledge distillation from intermediate-layer features, which forced the student network to mimic the knowledge of the teacher network, as discussed above.
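The two-stage training scheme described above (first matching the teacher's intermediate-layer feature with an L2 distillation loss, then training the layers above the target layer with only the classification loss) could look roughly like the following sketch; the split of the student into student_lower / student_upper, the optimizer settings, and the data-loader name are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def distill_then_classify(teacher_lower, student_lower, student_upper,
                              loader, epochs_distill=50, epochs_cls=50, lr=0.01):
        """Stage 1: learn the student's lower layers with || f_T(x) - f_S(x) ||^2 only.
        Stage 2: freeze them and train the upper layers with the classification loss."""
        teacher_lower.eval()

        opt1 = torch.optim.SGD(student_lower.parameters(), lr=lr)
        for _ in range(epochs_distill):
            for x, _ in loader:
                with torch.no_grad():
                    f_t = teacher_lower(x)                  # teacher feature at the target layer l
                diff = student_lower(x) - f_t
                loss = diff.pow(2).flatten(1).sum(dim=1).mean()
                opt1.zero_grad()
                loss.backward()
                opt1.step()

        for p in student_lower.parameters():                # keep distilled features fixed
            p.requires_grad_(False)

        opt2 = torch.optim.SGD(student_upper.parameters(), lr=lr)
        for _ in range(epochs_cls):
            for x, y in loader:
                logits = student_upper(student_lower(x))
                loss = F.cross_entropy(logits, y)
                opt2.zero_grad()
                loss.backward()
                opt2.step()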
Layers Used to Quantify Knowledge Points. For each pair of the student and baseline networks, we measured knowledge points on the top convolutional layer and all FC layers. Particularly, for DNNs trained on the ILSVRC-2013 DET dataset and the Pascal VOC 2012 dataset, we quantified knowledge points only on the FC1 and FC2 layers, because the dimension of features in the output layer FC3 was much lower than that of intermediate-layer features in the FC1 and FC2 layers. Similarly, for DNNs trained on the ModelNet40 dataset, we also used the FC1 and FC2 layers to quantify knowledge points. For the BERT model, we measured knowledge points in its last hidden layer. Considering that ResNets usually only had a single FC layer, to obtain rich knowledge representations, we added two Conv layers and two FC layers before the final FC layer. In this way, we quantified knowledge points on these three FC layers.

Hyper-Parameter Settings. We used the SGD optimizer, and set the batch size as 64 to train DNNs. We set the grid size to 16 × 16, and settings of the threshold b were as follows. For DNNs towards image classification, b was set to 0.2, except that we set b to 0.25 for AlexNet. For BERT towards NLP tasks, b was set to 0.2. For DNNs towards 3D point cloud classification, b was set to 0.25, except that b was set to 0.5 for PointNet.

Additionally, we also tested the influence of the grid size and the threshold b on the verification of the three hypotheses. To this end, we conducted experiments with different settings of the grid size and different settings of the threshold b. Table 3 shows that Hypotheses 1-3 were still validated when knowledge points were calculated under different settings of the grid size and different settings of the threshold b.

4.2 Verification of Hypothesis 1

According to Hypothesis 1, a well-trained teacher network was supposed to encode more foreground knowledge points than the student network and the baseline network. Consequently, a distilled student network was supposed to
model more foreground knowledge points than the baseline network. This was the case because the teacher network was usually trained using a large amount of training data, and achieved superior performance to the baseline network. In this way, this well-trained teacher network was supposed to encode more foreground knowledge points than the baseline network. Knowledge distillation forced the student network to mimic the teacher network. Hence, the student network was supposed to model more foreground knowledge points than the baseline network.

TABLE 3
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification on the CUB200-2011 Dataset, Respectively

Quantification of Knowledge Points in the Teacher Network, the Student Network, and the Baseline Network. Here, we compared knowledge points encoded in the teacher network, the student network, and the baseline network, and verified the above hypothesis. We trained a VGG-16 model from scratch as the teacher network using the ILSVRC-2013 DET dataset or the CUB200-2011 dataset for image classification. Data augmentation [28] was used to boost the performance of the teacher network. Table 4 compares the number of foreground knowledge points N^fg_point learned by the teacher network, the baseline network, and the student network. For fair comparisons, the student network had the same architecture as the teacher network and the baseline network according to Section 4.1.

TABLE 4
Comparisons of Knowledge Points Encoded in the Teacher Network (T), the Student Network (S) and the Baseline Network (B) for Image Classification

Table 4 shows that the teacher network usually encoded more foreground knowledge points N^fg_point and a higher ratio λ than the student network. Meanwhile, the student network often obtained larger values of N^fg_point and λ than the baseline network. In this way, the above hypothesis was verified. There was an exception when N^fg_point was measured at the FC2 layer of the VGG-16 model on the ILSVRC-2013 DET dataset. The teacher network encoded fewer foreground knowledge points than the student network, because this teacher network had a larger average background entropy value H̄ than the student network.

Comparing Metrics of N^fg_point, N^bg_point, and λ to Verify Hypothesis 1. In contrast to the previous subsection learning the teacher network from scratch, here, we compared N^fg_point, N^bg_point, and λ between the student network and the baseline network when using a teacher network trained in a more sophisticated manner. In other words, this teacher network was pre-trained on the ImageNet dataset, and then fine-tuned using the ILSVRC-2013 DET dataset, the CUB200-2011 dataset, or the Pascal VOC 2012 dataset. This is a more common case in real applications.

Based on the information discarding {Hi}, we visualized image regions in the foreground that corresponded to foreground knowledge points, as well as image regions in the background that corresponded to background knowledge points in Fig. 6, where these knowledge points were measured at the FC1 layer of the VGG-11 model. In this scenario, Hypothesis 1 was also successfully verified. Moreover, for tasks of image classification, natural language processing, and 3D point cloud classification, Table 5, Table 6, and
Fig. 6. Visualization of knowledge points encoded in the FC1 layer of VGG-11. Generally, the student network exhibited a larger N^fg_point value and a smaller N^bg_point value than the baseline network.
TABLE 5
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification
Table 7 show that Hypothesis 1 was validated. Very few student networks in Table 5, Table 6, and Table 7 encoded more background knowledge points N^bg_point than the baseline network. This occurred because teacher networks in this subsection, Section 4.3, and Section 4.4 were either pre-trained or trained using more samples, which encoded far more knowledge points than necessary. In this way, student networks distilled from these well-trained teacher networks learned more unnecessary knowledge for inference, thereby leading to a larger N^bg_point value than the baseline network.

Note that as discussed in Section 3, knowledge distillation usually involved two different utilities. The utility of distilling from the low-dimensional network output was mainly to select confident training samples for learning, with very little information on how the teacher network encoded detailed features. In contrast, the utility of distilling from high-dimensional intermediate-layer features was mainly to force the student network to mimic knowledge points of the teacher network. Hence, in this study, we mainly explained knowledge distillation from high-dimensional intermediate-layer features.
TABLE 6
Comparisons Between the Student Network (S) and the Baseline Network (B) for the Question-Answering and the Sentiment Classification Tasks, Respectively

Dataset | Network |   | N^fg_point ↑ | N^bg_point ↓ | λ ↑  | D_mean ↓ | D_var ↓ | ρ ↑
SQuAD   | BERT    | S | 8.46         | 10.24        | 0.48 | 13.81    | 104.94  | 0.56
        |         | B | 3.58         | 10.55        | 0.29 | 23.32    | 206.23  | 0.25
SST-2   | BERT    | S | 1.31         | 1.13         | 0.53 | 1.08     | 0.68    | 0.80
        |         | B | 0.68         | 0.79         | 0.39 | 1.29     | 1.00    | 0.51

Metrics were evaluated at the last hidden layer of the BERT model, which verified Hypotheses 1-3.

TABLE 7
Comparisons Between the Student Network (S) and the Baseline Network (B) for the Classification of 3D Point Clouds

TABLE 8
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification on the MSCOCO Dataset

TABLE 9
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification on the Tiny-ImageNet Dataset

Fig. 7. Comparison of N^fg_point, N^bg_point, and all knowledge points N^all_point = N^fg_point + N^bg_point between fine-tuning a pre-trained VGG-16 network and learning a VGG-16 network from scratch on the Tiny-ImageNet dataset. Metrics were measured at the highest convolutional layer. The fine-tuning process made the DNN encode new foreground knowledge points more quickly and enabled it to be optimized with fewer detours than learning from scratch.
knowledge points usually achieved better classification performance, as discussed in Section 3.2. To this end, we used the aforementioned student network, which was distilled from the FC1 layer of the teacher network on the CUB200-2011 dataset for object classification, to describe how the ratio λ changes in different epochs of the training process. Fig. 2 shows that DNNs with a larger ratio of task-related knowledge points often exhibited better performance.

4.3 Verification of Hypothesis 2

For Hypothesis 2, we assumed that knowledge distillation made the student network learn different knowledge points simultaneously. To this end, we used the metrics D_mean and D_std to verify this hypothesis.

For image classification, Table 5 shows that values of D_mean and D_std were usually smaller than those of the baseline network, which verified Hypothesis 2. Of note, a very small number of failure cases occurred, for example, when D_mean and D_std were measured at the FC1 layer of AlexNet or at the Conv layer of VGG-11. This is because both the AlexNet model and the VGG-11 model had relatively shallow network architectures. When learning from raw data, DNNs with shallow architectures would more easily learn more knowledge points with less overfitting.

For NLP tasks (question-answering and sentiment classification) and the classification of 3D point clouds, Table 6 and Table 7 show that the student network had smaller values of D_mean and D_std than the baseline network, respectively, which successfully verified Hypothesis 2.

Specifically, we conducted experiments to explore how the fine-tuning process guided the DNN to encode new knowledge points, in which we fine-tuned a pre-trained VGG-16 network on the Tiny-ImageNet dataset. Besides, we also trained another VGG-16 network from scratch on the same dataset for comparison. Fig. 7 shows that the fine-tuning process made the DNN learn new knowledge points in the foreground more quickly and be optimized with fewer detours than learning from scratch, i.e., the fine-tuned DNN discarded fewer temporarily learned knowledge points.

5 CONCLUSION AND DISCUSSIONS

In this study, we provide a new perspective to explain knowledge distillation from intermediate-layer features, i.e., quantifying knowledge points encoded in the intermediate layers of a DNN. We propose three hypotheses regarding the mechanism of knowledge distillation, and design three types of metrics to verify these hypotheses for different classification tasks. Compared to learning from scratch, knowledge distillation ensures that the DNN encodes more knowledge points, learns different knowledge points simultaneously, and optimizes with fewer detours. Note that the learning procedure of DNNs cannot be precisely divided into a learning phase and a discarding phase. In each epoch, the DNN may simultaneously learn new knowledge points and discard old knowledge points irrelevant to the task. Thus, the target epoch m̂ in Fig. 4 is simply a rough estimation of the division of the two learning phases.
REFERENCES

[7] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, "This looks like that: Deep learning for interpretable image recognition," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8928–8939.
[8] R. T. Chen, J. Behrmann, D. K. Duvenaud, and J.-H. Jacobsen, "Residual flows for invertible generative modeling," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 9916–9926.
[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2172–2180.
[10] X. Cheng, Z. Rao, Y. Chen, and Q. Zhang, "Explaining knowledge distillation by quantifying the knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12925–12935.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[12] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, "Sharp minima can generalize for deep nets," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1019–1028.
[13] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4829–4837.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[15] S. Flennerhag, P. G. Moreno, N. D. Lawrence, and A. Damianou, "Transferring knowledge across learning processes," 2018, arXiv:1812.01054.
[16] R. Fong and A. Vedaldi, "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8730–8738.
[17] R. C. Fong and A. Vedaldi, "Interpretable explanations of black boxes by meaningful perturbation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3449–3457.
[18] S. Fort, P. K. Nowak, and S. Narayanan, "Stiffness: A new perspective on generalization in neural networks," 2019, arXiv:1901.09491.
[19] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, "Born again neural networks," 2018, arXiv:1805.04770.
[20] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, "Loss surfaces, mode connectivity, and fast ensembling of DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8789–8798.
[21] Z. Goldfeld et al., "Estimating information flow in deep neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2299–2308.
[22] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree, "Regularisation of neural networks by enforcing Lipschitz continuity," Mach. Learn., vol. 110, no. 2, pp. 393–416, 2021.
[23] C. Guan, X. Wang, Q. Zhang, R. Chen, D. He, and X. Xie, "Towards a deep and unified understanding of deep neural models in NLP," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2454–2463.
[24] I. Higgins et al., "beta-VAE: Learning basic visual concepts with a constrained variational framework," in Proc. Int. Conf. Learn. Representations, 2017, Art. no. 6.
[25] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[26] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with EM routing," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–15.
[27] S. Hooker, D. Erhan, P.-J. Kindermans, and B. Kim, "A benchmark for interpretability methods in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 9737–9748.
[28] J.-H. Jacobsen, A. Smeulders, and E. Oyallon, "i-RevNet: Deep invertible networks," 2018, arXiv:1802.07088.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[30] A. Kapishnikov, T. Bolukbasi, F. Viegas, and M. Terry, "XRAI: Better attributions through regions," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 4947–4956.
[31] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," 2016, arXiv:1609.04836.
[32] B. Kim et al., "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," 2017, arXiv:1711.11279.
[33] P.-J. Kindermans et al., "Learning how to explain neural networks: PatternNet and PatternAttribution," 2017, arXiv:1705.05598.
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[35] Y. Le and X. Yang, "Tiny ImageNet visual recognition challenge," CS 231N, vol. 7, no. 7, 2015, Art. no. 3.
[36] T. Li, J. Li, Z. Liu, and C. Zhang, "Few sample knowledge distillation for efficient network compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 14627–14635.
[37] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[38] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, "Structured knowledge distillation for semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2599–2608.
[39] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik, "Unifying distillation and privileged information," in Proc. Int. Conf. Learn. Representations, 2016, pp. 1–10.
[40] H. Ma, Y. Zhang, F. Zhou, and Q. Zhang, "Quantifying layerwise information discarding of neural networks," 2019, arXiv:1906.04109.
[41] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5188–5196.
[42] A. K. Menon, A. S. Rawat, S. J. Reddi, S. Kim, and S. Kumar, "Why distillation helps: A statistical perspective," 2020, arXiv:2005.10419.
[43] B. Neyshabur, S. Bhojanapalli, and N. Srebro, "A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–9.
[44] B. Neyshabur, R. Tomioka, and N. Srebro, "Norm-based capacity control in neural networks," in Proc. Conf. Learn. Theory, 2015, pp. 1376–1401.
[45] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in Proc. IEEE Symp. Secur. Privacy, 2016, pp. 582–597.
[46] M. Phuong and C. Lampert, "Towards understanding knowledge distillation," in Proc. Int. Conf. Mach. Learn., 2019, pp. 5142–5151.
[47] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 77–85.
[48] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," 2017, arXiv:1706.02413.
[49] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proc. Conf. Empir. Methods Natural Lang. Process., 2016, pp. 2383–2392.
[50] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 1135–1144.
[51] M. T. Ribeiro, S. Singh, and C. Guestrin, "Anchors: High-precision model-agnostic explanations," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 1527–1535.
[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," 2014, arXiv:1412.6550.
[53] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[54] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3856–3866.
[55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[56] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," 2013, arXiv:1312.6229.
[57] W. Shen et al., "Interpretable compositional convolutional neural networks," 2021, arXiv:2107.04474.
[58] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," 2017, arXiv:1703.00810.
[59] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2017, arXiv:1312.6034.
[60] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015, pp. 1–14.
[61] U. Simsekli, L. Sagun, and M. Gurbuzbalaban, "A tail-index analysis of stochastic gradient noise in deep neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 5827–5837.
[62] R. Socher et al., "Recursive deep models for semantic compositionality over a sentiment treebank," in Proc. Conf. Empir. Methods Natural Lang. Process., 2013, pp. 1631–1642.
[63] J. Tang et al., "Understanding and improving knowledge distillation," 2020, arXiv:2002.03532.
[64] J. Uijlings, S. Popov, and V. Ferrari, "Revisiting knowledge transfer for training object class detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1101–1110.
[65] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[66] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, "Dynamic graph CNN for learning on point clouds," ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
[67] T.-W. Weng et al., "Evaluating the robustness of neural networks: An extreme value theory approach," 2018, arXiv:1801.10578.
[68] N. Wolchover, "New theory cracks open the black box of deep learning," Quanta Mag., 2017. [Online]. Available: https://www.quantamagazine.org/new-theory-cracks-open-the-black-box-of-deep-learning-20170921/
[69] W. Wu, Z. Qi, and L. Fuxin, "PointConv: Deep convolutional networks on 3D point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9613–9622.
[70] Z. Wu et al., "3D ShapeNets: A deep representation for volumetric shapes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
[71] D. Xie, T. Shu, S. Todorovic, and S.-C. Zhu, "Learning and inferring "dark matter" and predicting human intents and trajectories in videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 7, pp. 1639–1652, Jul. 2018.
[72] A. Xu and M. Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 2524–2533.
[73] J. Yim, D. Joo, J. Bae, and J. Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7130–7138.
[74] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," 2015, arXiv:1506.06579.
[75] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, "Revisiting knowledge distillation via label smoothing regularization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3902–3910.
[76] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
[77] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," 2016, arXiv:1611.03530.
[78] Q. Zhang, X. Wang, Y. N. Wu, H. Zhou, and S.-C. Zhu, "Interpretable CNNs for object classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3416–3431, Oct. 2021.
[79] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929.

Quanshi Zhang (Member, IEEE) received the PhD degree from the University of Tokyo in 2014. He is currently an Associate Professor with Shanghai Jiao Tong University, China. From 2014 to 2018, he was a post-doctoral researcher with the University of California, Los Angeles. His research interests include machine learning and computer vision. In particular, he has made influential research in explainable AI (XAI). He was the co-chair of the workshops towards XAI at ICML 2021, AAAI 2019, and CVPR 2019. He is the speaker of the tutorials on XAI at IJCAI 2020 and IJCAI 2021. He won the ACM China Rising Star Award at ACM TURC 2021.

Xu Cheng (Member, IEEE) is working toward the PhD degree with Shanghai Jiao Tong University. Her research interests include computer vision and machine learning.

Yilan Chen (Member, IEEE) is working toward the master's degree with the University of California San Diego. His research interests include machine learning.

Zhefan Rao (Member, IEEE) received the graduate degree from the Huazhong University of Science & Technology. His research interests include machine learning.