Quantifying The Knowledge in A DNN To Explain Knowledge Distillation For Classification
Abstract—Compared to traditional learning from scratch, knowledge distillation sometimes makes the DNN achieve superior performance.
In this paper, we provide a new perspective based on information theory to explain the success of knowledge distillation, i.e., quantifying
knowledge points encoded in intermediate layers of a DNN for classification. To this end, we consider the signal processing in a DNN as a
layer-wise process of discarding information. A knowledge point is referred to as an input unit, the information of which is discarded much
less than that of other input units. Thus, we propose three hypotheses for knowledge distillation based on the quantification of knowledge
points. 1. The DNN learning from knowledge distillation encodes more knowledge points than the DNN learning from scratch. 2. Knowledge
distillation makes the DNN more likely to learn different knowledge points simultaneously. In comparison, the DNN learning from scratch
tends to encode various knowledge points sequentially. 3. The DNN learning from knowledge distillation is often more stably optimized than
the DNN learning from scratch. To verify the above hypotheses, we design three types of metrics with annotations of foreground objects to
analyze feature representations of the DNN, i.e., the quantity and the quality of knowledge points, the learning speed of different knowledge
points, and the stability of optimization directions. In experiments, we diagnosed various DNNs on different classification tasks, including
image classification, 3D point cloud classification, binary sentiment classification, and question answering, which verified the above
hypotheses.
Fig. 1. (a) Visualization of the information-discarding process through different layers of VGG-16 on the CUB200-2011 dataset. Input units with darker
colors discard less information. (b) Comparison of the coherency of different explanation methods. Our method coherently measures how input infor-
mation is gradually discarded through layers, and enables fair comparisons of knowledge representations between different layers. In comparison,
CAM [79] and gradient explanations [59] cannot generate coherent magnitudes of importance values through different layers for fair comparisons.
More analysis is presented in Section 3.2.
image. Then, the information of some units (e.g., information of pixels in the background) is significantly discarded. In comparison, information of other input units (e.g., information of pixels in the foreground) is discarded less.

In this way, we consider input units without significant information discarding as knowledge points, such as input units of the bird in Fig. 1a. These knowledge points usually encode discriminative information for inferences, e.g., the wings of birds in Fig. 1a are discriminative for the fine-grained bird classification. In comparison, input units with significant information discarding usually do not contain sufficient discriminative information for prediction. Thus, measuring the amount of information of input units discarded by the DNN enables us to quantify knowledge points.

Aside from the quantity of knowledge points, we also measure the quality of knowledge points with the help of annotations of foreground objects. To this end, we classify all knowledge points into two types, including those related and unrelated to the task. In this way, we can compute the ratio of knowledge points encoded in a DNN that are relevant to the task as a metric to evaluate the quality of knowledge points. A high ratio of relevant knowledge points indicates that the feature representation of a DNN is reliable. Fig. 2 illustrates that DNNs encoding a larger ratio of task-related knowledge points usually achieve better performance (please see Section 4.2 for further analysis).

Fig. 2. Positive correlation between the ratio of foreground knowledge points and the classification accuracy of DNNs on the CUB200-2011 dataset.

Based on the quantification of knowledge points, we propose the following three hypotheses as to why knowledge distillation outperforms traditional learning from scratch. Furthermore, we design three types of metrics to verify these hypotheses.

Hypothesis 1: Knowledge Distillation Makes the DNN Learn More Knowledge Points Than the DNN Learning From Scratch. The information-bottleneck theory [58], [68] claimed that a DNN tended to preserve features relevant to the task and neglected those features irrelevant to the task for inference. We assume that a teacher network is supposed to achieve superior classification performance, because the teacher network usually has a more complex network architecture or is often learned using more training data. In this scenario, the teacher network is considered to encode more knowledge points related to the task and fewer knowledge points unrelated to the task than the student network. In this way, knowledge distillation forces the student network to mimic the logic of the teacher network, which ensures that the student network encodes more knowledge points relevant to the task and fewer knowledge points irrelevant to the task than the DNN learning from scratch.

Hypothesis 2: Knowledge distillation makes the DNN more likely to learn different knowledge points simultaneously. In comparison, the DNN learning from scratch tends to encode various knowledge points sequentially through different epochs. This is the case because knowledge distillation forces the student network to mimic all knowledge points of the teacher network simultaneously. In comparison, the DNN learning from scratch tends to focus on different knowledge points in different epochs sequentially.

Hypothesis 3: Knowledge Distillation Usually Yields More Stable Optimization Directions Than Learning From Scratch. The DNN learning from scratch usually keeps trying different knowledge points in early epochs and then discarding unreliable ones in later epochs, rather than modeling target knowledge points directly. Thus, the optimization directions of the DNN learning from scratch are inconsistent. For convenience, in this study, the phenomenon of inconsistent optimization directions through different epochs is termed as "detours."1

1. "Detours" refers to the phenomenon that a DNN tends to encode various knowledge points in early epochs and discard non-discriminative ones later.

In contrast to learning from scratch, knowledge distillation usually forces the student network to mimic the teacher network directly. The student network straightforwardly encodes target knowledge points without temporarily modeling and discarding other knowledge points. Thus, the student network is optimized without significant detours. For a better understanding of Hypothesis 3, let us take the fine-grained bird classification task as an example. With the guidance of the teacher network, the student network directly learns discriminative features from heads or tails of
birds without significant detours. In comparison, the DNN learning from scratch usually extracts features from the head, belly, tail, and tree-branch in early epochs, and later neglects features from the tree branch. In this case, the optimization direction of the DNN learning from scratch is more unstable.

Note that previous studies on explanations of the optimization stability and the generalization power of the DNN were mainly conducted from the perspective of the parameter space. [12], [31], [61] demonstrated that the flatness of minima of the loss function resulted in good generalization. [1], [43], [44] illustrated that the generalization behavior of a network depended on the norm of the weights. In contrast, this study explains the efficiency of optimization from the perspective of knowledge representations, i.e., analyzing whether or not the DNN encodes different knowledge points simultaneously and whether it does so without significant detours.

Connection to the Information-Bottleneck Theory. We roughly consider that the information-bottleneck theory [58], [68] can also be used to explain the change of knowledge points in a DNN. It showed that a DNN usually extracted a few discriminative features for inferences, and neglected massive features irrelevant to inferences during the training process. The information-bottleneck theory measures the entire discrimination power of each sample. In comparison, our method quantifies the discrimination power of each input unit in a more detailed manner, and measures pixel-wise information discarded through forward propagation.

Methods. To verify the above hypotheses, we design three types of metrics to quantify knowledge points encoded in intermediate layers of a DNN, and analyze how different knowledge points are modeled during the training procedure.

The first type of metrics measures the quantity and quality of knowledge points. We consider that a well-trained DNN with superior classification performance is supposed to encode a massive amount of knowledge points, most of which are discriminative for the classification of target categories.

The second type of metrics measures whether a DNN encodes various knowledge points simultaneously or sequentially, i.e., whether different knowledge points are learned at similar speeds.

The third metric evaluates whether a DNN is optimized with or without significant detours. In other words, this metric measures whether or not a DNN directly learns target knowledge points without temporarily modeling and discarding other knowledge points. To this end, this metric is designed as an overlap between knowledge points encoded by the DNN in early epochs and the final knowledge points modeled by the DNN. If this overlap is large, then we consider that this DNN is optimized without significant detours; otherwise, with detours. This is the case because a small overlap indicates that the DNN first encodes a large number of knowledge points in early epochs, but many of these knowledge points are discarded in later epochs.

In summary, we use these metrics to compare knowledge distillation and learning from scratch, in order to verify Hypotheses 1-3.

Whether to Use the Annotated Knowledge Points. In particular, we should define and quantify knowledge points without any human annotations. This is because we do not want the analysis of the DNN to be influenced by manual annotations with significant subjective bias, which would negatively impact the trustworthiness of this research. Besides, annotating all knowledge points is also typically expensive to the point of being impractical or impossible. However, previous studies often define visual concepts (knowledge points) by human annotations, and these concepts usually have specific semantic meanings. For example, Bau et al. [2] manually annotated six types of semantic visual concepts (objects, parts, textures, scenes, materials, and colors) to explain feature representations of DNNs. In this way, these methods cannot fairly quantify the actual amount of knowledge encoded by a DNN, because most visual concepts are not countable and cannot be defined with human annotations. In contrast, in this study, we use the information in each input unit discarded by the DNN to define and quantify knowledge points without human annotations. Such knowledge points are usually referred to as "Dark Matters" [71]. Compared to traditionally labeled visual concepts, we consider that these dark-matter knowledge points can provide a more general and trustworthy way to explain the feature representations of DNNs.

Scope of Explanation. A trustworthy explanation of knowledge distillation usually involves two requirements. First, the teacher network should be well optimized. Otherwise, if the teacher network does not converge or is optimized for other tasks, then the teacher network is not qualified to perform distillation. Second, knowledge distillation often has two fully different utilities. (1) Distilling from high-dimensional features usually forces the student network to mimic all sophisticated feature representations in the teacher network. (2) Distilling from the low-dimensional network output mainly selectively uses confident training samples for learning, with very little information on how the teacher network encodes detailed features. In this case, the distillation selectively emphasizes the training of simple samples and ignores the training of hard samples.2 Therefore, in this study, we mainly explain the first utility of knowledge distillation, i.e., forcing the student network to mimic the knowledge of the teacher network. This utility is mainly exhibited by distilling from high-dimensional intermediate-layer features.

2. Confident (simple) samples usually generate more salient output signals for knowledge distillation than unconfident (hard) samples.

Contributions of this study are summarized as follows.

1. The quantification of knowledge points encoded in intermediate layers of DNNs can be used as a new perspective to analyze DNNs.

2. Based on the quantification of knowledge points, we design three types of metrics to explain the mechanism of knowledge distillation.

3. Three hypotheses about knowledge distillation are proposed and verified in different classification applications, including image classification, 3D point cloud classification, binary sentiment classification, and question answering, which shed new light on the interpretation of knowledge distillation.

A preliminary version of this work appeared in [10].
the first type of utility, i.e., forcing the student network to mimic knowledge points of the teacher network. In contrast, distilling from relatively low-dimensional network output often exhibits the second type of utility, i.e., selecting confident samples for learning and ignoring unconfident samples. In this paper, we mainly explain the first utility of knowledge distillation, which is mainly exhibited by distilling from high-dimensional intermediate-layer features.

3.1 Quantifying Information Discarding

Before defining and quantifying knowledge points encoded in DNNs, we first introduce how to measure the pixel-wise information discarding of the input sample in this section.

According to the information-bottleneck theory [58], [68], we consider the signal processing of a forward propagation as the layer-wise discarding of input information. In low layers, most input information is used to compute features. In contrast, in high layers, information in input units unrelated to the inference is gradually discarded, and only related pixel-wise information is maintained in the computation of network features. In this way, we consider that features of the highest layer mainly encode the information that is highly relevant to the inferences.

Thus, we propose methods to quantify the input information encoded in a specific intermediate layer of a DNN [23], [40], i.e., measuring how much input information is discarded when the DNN extracts the feature of this layer. To this end, the information in each input unit discarded by the DNN is formulated as the entropy, given a feature of a certain network layer.

Specifically, given a trained DNN for classification and an object instance x ∈ R^n with n input units, let f = f(x) denote the feature of a network intermediate layer, and let y denote the network prediction w.r.t. x. The input unit is referred to as a variable (or a set of variables) in the input sample. For example, each pixel (or pixels within a small local region) of an input image can be taken as an input unit, and the embedding of each word in an input sentence also can be regarded as an input unit.

In our previous works [23], [40], we consider that the DNN satisfies the Lipschitz constraint ‖y′ − y‖ ≤ κ‖f(x′) − f‖, where κ is the Lipschitz constant, and y′ corresponds to the network prediction of the sample x′. The Lipschitz constant κ can be computed as the maximum norm of the gradient within a small range of features, i.e., κ = sup_{f(x′): ‖f(x′)−f‖² ≤ τ} ‖∂y′/∂f(x′)‖. This indicates that if we weakly perturb the low-dimensional manifold of features w.r.t. the input x within a small range {f(x′) | ‖f(x′) − f‖² ≤ τ}, then the output of the DNN is also perturbed within a small range ‖y′ − y‖ ≤ κτ, where τ is a positive scalar. In other words, all such weakly perturbed features usually represent the same object instance.

However, determining the explicit low-dimensional manifold of features w.r.t. the input x is very difficult. As an expediency, we add perturbations Δx to the input sample x to approximate the manifold of the feature, i.e., generating samples x′ = x + Δx, subject to ‖f(x′) − f‖² ≤ τ. In this way, the entropy H(X′) measures the uncertainty of the input when the feature represents the same object instance, ‖y′ − y‖ ≤ κτ, as mentioned above. In other words, H(X′) quantifies how much input information can be discarded without affecting the inference of the object:

    \max H(X') \quad \text{s.t.} \quad \forall x' \in X', \; \|f(x') - f\|^2 \le \tau. \qquad (1)

X′ denotes a set of inputs corresponding to the concept of a specific object instance. We assume that Δx is an i.i.d. Gaussian noise, i.e., x′ = x + Δx ∼ N(x, Σ(σ) = diag(σ₁², …, σₙ²)). Here, σᵢ indicates the variance of the perturbation w.r.t. the i-th input unit. The assumption of the Gaussian distribution ensures that the entropy H(X′) of the input can be decomposed into pixel-level entropies {Hᵢ}:

    H(X') = \sum_{i=1}^{n} H_i, \qquad (2)

where Hᵢ = log σᵢ + (1/2) log(2πe) measures the pixel-wise information discarding. A large value of Hᵢ usually indicates that the information of the i-th unit is significantly discarded during the forward propagation. For example, as shown in Fig. 1a, information encoded in background pixels is significantly discarded in the fine-grained bird classification.

Thus, we learn the optimal Σ(σ) = diag(σ₁², …, σₙ²) via the maximum-entropy principle to compute the pixel-wise information discarding Hᵢ = log σᵢ + (1/2) log(2πe). The objective function is formulated in Equation (1), which maximizes H(X′) subject to constraining features within the scope of a specific object instance, ‖f(x′) − f‖² ≤ τ. That is, we enumerate all perturbation directions in the input sample x within a small variance of the feature f, in order to approximate the low-dimensional manifold of f(x′). We use the Lagrange multiplier to approximate Equation (1) as the following loss:

    \min_{\sigma} Loss(\sigma),
    Loss(\sigma) = \frac{1}{d_f^2} \, \mathbb{E}_{x' \sim \mathcal{N}(x, \Sigma(\sigma))} \left[ \|f(x') - f\|^2 \right] - \alpha \sum_{i=1}^{n} H_i \qquad (3)
                 = \frac{1}{d_f^2} \, \mathbb{E}_{x' = x + \sigma \circ \delta, \; \delta \sim \mathcal{N}(0, I)} \left[ \|f(x') - f\|^2 \right] - \alpha \sum_{i=1}^{n} H_i. \qquad (4)

Considering that Equation (3) is intractable, we use x′ = x + σ ∘ δ, δ ∼ N(0, I) to simplify the computation of Loss(σ), where ∘ denotes the element-wise multiplication. σ = [σ₁, …, σₙ]ᵀ represents the range within which the input sample can change. In this way, Loss(σ) in Equation (4) is tractable, and we can learn σ via gradient descent. d_f² = lim_{τ→0⁺} E_{x′∼N(x, τ²I)}[‖f(x′) − f‖²] denotes the inherent variance of intermediate-layer features subject to small input noises with a fixed magnitude τ. d_f² is used for normalization, and a positive scalar α is used to balance the two loss terms.
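To make the optimization of Equation (4) concrete, the following is a minimal PyTorch-style sketch (not the paper's released implementation) of how the pixel-wise information discarding could be estimated: σ is parameterized per spatial unit, perturbed samples are drawn as x′ = x + σ ∘ δ, and Hᵢ = log σᵢ + (1/2) log(2πe) is reported after gradient descent. The function name, the Monte-Carlo sample count, the optimizer settings, and the crude stand-in for the normalization d_f² are all illustrative assumptions.

    import math
    import torch

    def compute_information_discarding(f, x, alpha=0.1, n_steps=200, lr=0.01, n_samples=8):
        """Sketch of Eqs. (1)-(4): learn a per-unit sigma so that perturbed inputs
        keep the intermediate-layer feature close to f(x), then report
        H_i = log(sigma_i) + 0.5 * log(2*pi*e) as the pixel-wise information discarding.

        f: callable mapping an input batch to intermediate-layer features.
        x: input image tensor of shape (1, C, H, W)."""
        with torch.no_grad():
            feat = f(x)                                   # reference feature f = f(x)
            d_f2 = feat.pow(2).mean().clamp_min(1e-8)     # crude stand-in for the normalization d_f^2

        # One sigma per spatial unit (shared over channels), parameterized via log sigma.
        log_sigma = torch.zeros(1, 1, x.shape[2], x.shape[3], device=x.device, requires_grad=True)
        optimizer = torch.optim.SGD([log_sigma], lr=lr)

        for _ in range(n_steps):
            sigma = log_sigma.exp()
            delta = torch.randn((n_samples,) + x.shape[1:], device=x.device)
            x_perturbed = x + sigma * delta               # x' = x + sigma ∘ delta, delta ~ N(0, I)
            feat_diff = (f(x_perturbed) - feat).pow(2).flatten(1).sum(dim=1).mean()
            entropy = (log_sigma + 0.5 * math.log(2 * math.pi * math.e)).sum()  # sum_i H_i
            loss = feat_diff / d_f2 - alpha * entropy     # Monte-Carlo estimate of Eq. (4)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Per-unit entropies H_i = log sigma_i + 0.5 * log(2*pi*e), shape (1, 1, H, W).
        return log_sigma.detach() + 0.5 * math.log(2 * math.pi * math.e)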
3.2 Quantification of Knowledge Points

Hypothesis 1: Knowledge distillation makes the DNN encode more knowledge points than learning from scratch.

The basic idea is to measure the quantity and quality of knowledge points encoded in the intermediate layers of a DNN.
Fig. 3. Visualization of knowledge points. The dark color indicates a low entropy value Hi. Image regions with low pixel-wise entropies {Hi} are considered as knowledge points. In a real implementation, to improve the stability and efficiency of the computation, we divide the image into 16 × 16 grids, and each grid is taken as a "pixel" to compute the entropy Hi.
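The grid-based implementation mentioned in the caption above can be realized by sharing one σ value within each grid. Below is a minimal sketch of this idea, assuming the per-grid parameterization is simply upsampled to pixel resolution before perturbing the image; the 16 × 16 grid constant follows the caption, while the function name, shapes, and interpolation mode are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def expand_grid_sigma(log_sigma_grid, image_size):
        """Share one sigma per grid cell: a (1, 1, 16, 16) grid of log-sigma values
        is upsampled to the full image resolution, so every pixel inside a grid
        uses the same perturbation variance (and thus the same entropy H_i)."""
        return F.interpolate(log_sigma_grid, size=image_size, mode="nearest")

    # Usage sketch: optimize a 16 x 16 grid of log-sigma values instead of a per-pixel map.
    log_sigma_grid = torch.zeros(1, 1, 16, 16, requires_grad=True)
    log_sigma_pixels = expand_grid_sigma(log_sigma_grid, image_size=(224, 224))
    noise = log_sigma_pixels.exp() * torch.randn(1, 3, 224, 224)   # broadcast over RGB channels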
The Quantity of Knowledge Points. In this paper, we use the pixel-wise information of each input unit discarded by the DNN to define and quantify knowledge points. Given a trained DNN and an input sample x ∈ X, let us consider the pixel-wise information discarding Hᵢ w.r.t. the feature of a certain network layer f = f(x). X represents a set of input samples.

According to Equation (2), a low pixel-wise entropy Hᵢ represents that the information of this input unit is less discarded. In other words, the DNN tends to use these input units with low entropies to compute the feature f and make inferences. Thus, a knowledge point is defined as an input unit with a low pixel-wise entropy Hᵢ, which encodes discriminative information for prediction. For example, the heads of birds in Fig. 3 are referred to as knowledge points, which are useful for the fine-grained bird classification.

To this end, we use the average entropy H̄ of all background units Λ_bg as a baseline to determine knowledge points, i.e., H̄ = E_{i∈Λ_bg}[Hᵢ]. This is because information encoded in background units is usually supposed to be significantly discarded and irrelevant to the inference. If the entropy of a unit is significantly lower than the baseline entropy, H̄ − Hᵢ > b, then this input unit can be considered as a valid knowledge point. Here, b is a threshold to determine the knowledge points, which is introduced in Section 4.1. In this way, we can quantify the number of knowledge points in the DNN.

The Quality of Knowledge Points. In addition to the quantity of knowledge points, we also consider the quality of knowledge points. Generally speaking, information encoded in foreground units is usually supposed to be crucial to the inferences. In comparison, information encoded in background units is often supposed to exhibit negligible effects on the prediction. Therefore, we design a metric λ to evaluate the quality of knowledge points by examining whether or not most knowledge points are localized in the foreground:

    \lambda = \mathbb{E}_{x \in X} \left[ N^{fg}_{point}(x) \, / \, \big( N^{fg}_{point}(x) + N^{bg}_{point}(x) \big) \right], \qquad (5)

    N^{bg}_{point}(x) = \sum_{i \in \Lambda_{bg} \; w.r.t. \; x} \mathbb{1}\big( \bar{H} - H_i > b \big), \qquad (6)

    N^{fg}_{point}(x) = \sum_{i \in \Lambda_{fg} \; w.r.t. \; x} \mathbb{1}\big( \bar{H} - H_i > b \big). \qquad (7)

N^bg_point(x) and N^fg_point(x) denote the number of knowledge points encoded in the background and the foreground, respectively. Λ_bg and Λ_fg are sets of input units in the background and the foreground w.r.t. the input sample x, respectively. 1(·) refers to the indicator function. If the condition inside is valid, then 1(·) returns 1; otherwise, it returns 0. A large value of λ represents that the feature representation of the DNN is reliable; otherwise, unreliable.

Thus, knowledge points provide a new way to explain the mechanism of knowledge distillation, i.e., checking whether knowledge distillation makes the DNN encode a large amount of knowledge points, and whether most knowledge points are precisely localized in the foreground.
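As a concrete illustration of Equations (5)-(7), the sketch below counts foreground and background knowledge points from a map of pixel-wise entropies and a foreground mask, and then computes the ratio λ; the threshold value and all tensor shapes are placeholders chosen for the example, not values prescribed by the paper.

    import torch

    def count_knowledge_points(H, fg_mask, b=0.2):
        """H: tensor of pixel-wise (or grid-wise) entropies H_i for one sample.
        fg_mask: boolean tensor of the same shape, True for foreground units.
        Returns (N_fg, N_bg): numbers of foreground / background knowledge points."""
        H_bar = H[~fg_mask].mean()                 # baseline: average entropy of background units
        is_point = (H_bar - H) > b                 # a unit is a knowledge point if H_bar - H_i > b
        n_fg = (is_point & fg_mask).sum().item()   # Eq. (7)
        n_bg = (is_point & ~fg_mask).sum().item()  # Eq. (6)
        return n_fg, n_bg

    def ratio_lambda(samples, b=0.2):
        """samples: iterable of (H, fg_mask) pairs; returns the ratio in Eq. (5)."""
        ratios = []
        for H, fg_mask in samples:
            n_fg, n_bg = count_knowledge_points(H, fg_mask, b)
            if n_fg + n_bg > 0:                    # skip samples without any knowledge point
                ratios.append(n_fg / (n_fg + n_bg))
        return sum(ratios) / max(len(ratios), 1)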
Coherency and Generality. As discussed in [23], [40], a trustworthy explanation method to evaluate feature representations of the DNN should satisfy the criteria of the coherency and generality. From this perspective, we analyze the trustworthiness of knowledge points.

Coherency indicates that an explanation method should reflect the essence of the encoded knowledge representations in DNNs, which are invariant to network settings. In this study, we use pixel-wise information discarding to define knowledge points without strong assumptions on feature representations or network architectures. In this way, knowledge points provide a coherent evaluation of feature representations. However, most existing explanation methods shown in Table 1 usually fail to meet this criterion due to their biased assumptions. For example, the CAM [79] and gradient explanations [59] do not generate coherent explanations, because the gradient map ∂Loss/∂f is usually not coherent through different layers.

To this end, we compare our method with CAM and gradient explanations. Theoretically, we can easily construct two DNNs to represent exactly the same knowledge but with different magnitudes of gradients, as follows. A VGG-16 model [60] was learned using the CUB200-2011 dataset [65] for fine-grained bird classification. Given this pre-trained DNN, we slightly revised the magnitude of parameters in every pair of neighboring convolutional layers y = x ⊗ w + b to examine the coherency. For the Lth and (L+1)th layers, parameters were revised as w(L) ← 4w(L), w(L+1) ← w(L+1)/4, b(L) ← 4b(L), b(L+1) ← b(L+1)/4. Such revisions did not change knowledge representations or the network output, but did alter the gradient magnitude.

TABLE 1
Comparisons of Different Explanation Methods in Terms of the Coherency

Method                                 | Fair layer-wise comparisons | Fair comparisons between different networks
Gradient-based [41], [59], [74], [79]  | No                          | No
Perturbation-based [17], [33]          | No                          | No
Inversion-based [13]                   | No                          | No
Entropy-based                          | Yes                         | Yes

The coherency of the entropy-based method enables fair layer-wise comparisons within a network and between different networks.
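The parameter-rescaling construction described above can be reproduced in a few lines. The sketch below rescales a pair of neighboring convolutional layers of a torchvision VGG-16 and checks that the network output stays numerically unchanged, which illustrates why gradient magnitudes alone are not a coherent measure. It uses a slightly simplified variant of the revision (only the first layer's weight and bias and the second layer's weight are rescaled), for which the positive homogeneity of ReLU guarantees an exactly unchanged output; the chosen layer indices, the factor of 4, and the use of a recent torchvision build are illustrative assumptions.

    import torch
    import torchvision

    model = torchvision.models.vgg16(weights=None).eval()  # architecture only; random weights suffice here

    # Two neighboring conv layers separated only by ReLU (features[0] and features[2] in VGG-16).
    conv_a, conv_b = model.features[0], model.features[2]

    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        y_before = model(x)

        # Simplified revision: w(L) <- 4 w(L), b(L) <- 4 b(L), w(L+1) <- w(L+1) / 4.
        conv_a.weight *= 4.0
        conv_a.bias *= 4.0
        conv_b.weight /= 4.0

        y_after = model(x)

    # Output is preserved (up to floating-point error), yet gradients w.r.t. the first
    # layer's activations now differ by a factor of 4, so gradient-based maps change.
    print(torch.allclose(y_before, y_after, atol=1e-4))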
TABLE 2
Comparisons Between the Student Network and the Baseline Network (B) for the Image Classification on the CUB200-2011 Dataset, Where the Student Network Distilled From the FC1 Layer of a Larger Teacher Network

Target network | Learning methods        | N^fg_point ↑ | N^bg_point ↓ | λ ↑  | D_mean ↓ | D_var ↓ | ρ ↑
VGG-11         | Distilling from VGG-16  | 37.11        | 11.50        | 0.78 | 0.48     | 0.06    | 0.58
               | Learning from scratch   | 25.50        | 11.50        | 0.70 | 2.14     | 3.29    | 0.56
               | Distilling from VGG-19  | 41.44        | 8.67         | 0.83 | 0.55     | 0.10    | 0.63
               | Learning from scratch   | 25.50        | 11.50        | 0.70 | 2.14     | 3.29    | 0.56
AlexNet        | Distilling from AlexNet | 35.29        | 4.07         | 0.90 | 4.54     | 8.43    | 0.57
               | Learning from scratch   | 24.00        | 5.90         | 0.80 | 2.80     | 5.32    | 0.53
               | Distilling from VGG-16  | 39.14        | 4.45         | 0.90 | 0.78     | 0.05    | 0.67
               | Learning from scratch   | 24.00        | 5.90         | 0.80 | 2.80     | 5.32    | 0.53
               | Distilling from VGG-19  | 38.36        | 4.14         | 0.90 | 0.92     | 0.11    | 0.65
               | Learning from scratch   | 24.00        | 5.90         | 0.80 | 2.80     | 5.32    | 0.53

Fig. 5. Detours of learning knowledge points. We visualize sets of foreground knowledge points learned after different epochs. The green box indicates the union of knowledge points learned during all epochs. The (1 − ρ) value denotes the ratio of knowledge points that are discarded during the learning process to the union set of all learned knowledge points. A larger ρ value indicates that the DNN is optimized with fewer detours.

learning from scratch cannot directly encode target knowledge points for inference. Thus, the optimization direction of this DNN is inconsistent and unstable in early and late epochs, i.e., with significant detours. In contrast to learning from scratch, knowledge distillation forces the student network to directly mimic the well-trained teacher network. With the guidance of the teacher network, knowledge distillation makes the student network straightforwardly model target knowledge points without temporarily modeling and discarding other knowledge points. Thus, the DNN learning via knowledge distillation tends to be optimized in a stable direction, i.e., without significant detours.

Metric. To verify Hypothesis 3, we design a metric to evaluate whether a DNN is optimized in a stable and consistent direction. Let S_j(x) = {i ∈ x | H̄ − Hᵢ > b} denote the set of foreground knowledge points in the input sample x encoded by the DNN learned after the j-th epoch, where j ∈ {1, 2, …, M}. Each knowledge point a ∈ S_j(x) is referred to as a specific unit i on the foreground of the input sample x, which satisfies H̄ − Hᵢ > b. Then, we use the metric ρ to measure the stability of optimization directions:

    \rho = \frac{|S_M(x)|}{\big| \bigcup_{j=1}^{M} S_j(x) \big|}. \qquad (9)

ρ represents the ratio of finally encoded knowledge points to all temporarily attempted knowledge points in intermediate epochs, where |·| denotes the cardinality of the set. More specifically, the numerator of Equation (9) reflects the number of foreground knowledge points which have been ultimately chosen for inference. For example, knowledge points shown in the black box of Fig. 5 are encoded as final knowledge points for fine-grained bird classification. The denominator represents the number of all knowledge points temporarily learned during the entire training process, which is shown as the green box in Fig. 5. Moreover, (⋃_{j=1}^{M} S_j(x)) \ S_M(x) indicates the set of knowledge points which have been tried, but finally are discarded by the DNN. The set of those discarded knowledge points is shown as the brown box in Fig. 5. Thus, a high value of ρ indicates that the DNN is optimized without significant detours and more stably, and vice versa.
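A minimal sketch of how the stability metric ρ in Equation (9) could be computed from per-epoch entropy maps is given below; it reuses the same thresholding idea as the earlier counting sketch, and the number of recorded epochs, the threshold, and the data layout are illustrative assumptions rather than the paper's exact protocol.

    import torch

    def foreground_point_set(H, fg_mask, b=0.2):
        """S_j(x): indices of foreground units whose entropy satisfies H_bar - H_i > b."""
        H_bar = H[~fg_mask].mean()
        selected = ((H_bar - H) > b) & fg_mask
        idx = selected.flatten().nonzero(as_tuple=True)[0]
        return set(idx.tolist())

    def stability_rho(entropy_maps_per_epoch, fg_mask, b=0.2):
        """Eq. (9): rho = |S_M(x)| / |union_j S_j(x)| over the M recorded epochs."""
        point_sets = [foreground_point_set(H, fg_mask, b) for H in entropy_maps_per_epoch]
        union = set().union(*point_sets)
        final = point_sets[-1]
        return len(final) / max(len(union), 1)

    # Usage sketch: entropy maps recorded after each epoch for one sample.
    fg_mask = torch.zeros(16, 16, dtype=torch.bool)
    fg_mask[4:12, 4:12] = True
    maps = [torch.rand(16, 16) for _ in range(5)]   # stand-ins for per-epoch {H_i}
    print(stability_rho(maps, fg_mask))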
4 EXPERIMENT

In this section, we conducted experiments on image classification, 3D point cloud classification, and natural language processing (NLP) tasks. Experimental results verified the proposed hypotheses.

4.1 Implementation Details

Given a teacher network, we distilled knowledge from the teacher network to the student network. In order to remove the effects of network architectures, the student network had the same architecture as the teacher network. For fair comparisons, the DNN learning from scratch was also required to have the same architecture as the teacher network. For convenience, we named the DNN learning from scratch the baseline network hereinafter.

Additionally, we conducted experiments to check whether the size of the teacher network affected the verification of the three hypotheses. Table 2 and Table 9 show that when the student network distilled from a larger teacher network, Hypotheses 1-3 were still verified. This phenomenon indicates that variations in the size of the teacher network did not hurt the verification of our conclusions. To simplify the story, we set student networks to have the same architectures as teacher networks in the following experiments.

Datasets & DNNs. For the image classification task, we conducted experiments based on AlexNet [34], VGG-11, VGG-16, VGG-19 [60], ResNet-50, ResNet-101, and ResNet-152 [29]. We trained these DNNs based on the ILSVRC-2013 DET dataset [56], the CUB200-2011 dataset [65], and the Pascal VOC 2012 dataset [14] for object classification, respectively. Considering the high computational burden of training on the entire ILSVRC-2013 DET dataset, we conducted the classification of terrestrial mammal categories for comparative experiments. Note that all teacher networks in Sections 4.3, 4.4 were pre-trained on the ImageNet dataset [53], and then fine-tuned using the aforementioned dataset. In comparison, all baseline networks were trained from scratch. Moreover, data augmentation [28] was applied to
both the student network and the baseline network, when the DNNs were trained using the ILSVRC-2013 DET dataset or the Pascal VOC 2012 dataset.

We used object images cropped by object bounding boxes to train the aforementioned DNNs on each of the above datasets, respectively. In particular, to achieve stable results, images in the Pascal VOC 2012 dataset were cropped by using 1.2 width × 1.2 height of the original object bounding box. For the ILSVRC-2013 DET dataset, we cropped each image by using 1.5 width × 1.5 height of the original object bounding box. This was performed because no ground-truth annotations of object segmentation were available for the ILSVRC-2013 DET dataset. We used the object bounding box to separate foreground regions and background regions of images in the ILSVRC-2013 DET dataset. In this way, pixels within the object bounding box were regarded as the foreground Λ_fg, and pixels outside the object bounding box were referred to as the background Λ_bg.

For natural language processing, we fine-tuned a pre-trained BERT model [11] using the SQuAD dataset [49] as the teacher network towards the question-answering task. Besides, we also fine-tuned another pre-trained BERT model on the SST-2 dataset [62] as the teacher network for binary sentiment classification. Baseline networks and student networks were simply trained using samples in the SST-2 dataset or the SQuAD dataset. Moreover, for both the SQuAD dataset and the SST-2 dataset, we annotated input units irrelevant to the prediction as the background Λ_bg, according to human cognition. We annotated units relevant to the inference as the foreground Λ_fg. For example, words related to answers were labeled as foregrounds in the SQuAD dataset. Words containing sentiments were annotated as foregrounds in the SST-2 dataset.

For the 3D point cloud classification task, we conducted comparative experiments on the PointNet [47], DGCNN [66], PointConv [69], and PointNet++ [48] models. Considering that most widely used benchmark datasets for point cloud classification only contained foreground objects, we constructed a new dataset containing both foreground objects and background objects based on the ModelNet40 dataset [70], as follows. For each sample (i.e., foreground object) in the ModelNet40 dataset, we first randomly sampled five point clouds, the labels of which differed from the foreground object. Then, we randomly selected 500 points from the sampled five point clouds, and attached these points to the foreground object as background points. Moreover, the teacher network was trained using all training data in this generated dataset for classification. In comparison, we only randomly sampled 10% of the training data in this generated dataset to learn the baseline and student networks, respectively. Thus, the teacher network was guaranteed to encode better knowledge representations.
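A rough sketch of the background-augmentation procedure described above is given below; it assumes a ModelNet40-style loader that returns (points, label) pairs with points as an (N, 3) array, and the helper name and random-seed handling are illustrative assumptions, while the five sampled clouds and the 500 background points follow the text.

    import random
    import numpy as np

    def add_background_points(sample_idx, dataset, n_other=5, n_bg_points=500):
        """Attach background points to one ModelNet40-style sample.

        dataset: a sequence of (points, label) pairs, points of shape (N, 3).
        Returns (augmented_points, label), where the background consists of 500
        points drawn from five objects of different categories."""
        points, label = dataset[sample_idx]

        # Randomly pick five samples whose labels differ from the foreground object.
        other_indices = [i for i, (_, other_label) in enumerate(dataset)
                         if other_label != label]
        chosen = random.sample(other_indices, n_other)

        # Pool the points of the five sampled clouds and keep 500 of them as background.
        pool = np.concatenate([dataset[i][0] for i in chosen], axis=0)
        bg = pool[np.random.choice(len(pool), size=n_bg_points, replace=False)]

        return np.concatenate([points, bg], axis=0), label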
Distillation. Given a well-trained teacher network and an input sample x, we selected a convolutional layer or a fully-connected (FC) layer l as the target layer to perform knowledge distillation. We used the distillation loss ‖f_T(x) − f_S(x)‖² to force the student network to mimic the feature of the teacher network. Here, f_T(x) denoted the feature of the l-th layer in the teacher network, and f_S(x) indicated the feature of the l-th layer in its corresponding student network.

In order to simplify the story, we used the distillation loss to learn the student network, rather than the classification loss. Given the distilled features, we further trained parameters above the target layer l merely using the classification loss. In this way, we were able to ensure that all knowledge points of the student network were exclusively learned via distillation. Otherwise, if we trained the student network using both the distillation loss and the classification loss, distinguishing whether the classification loss or the distillation loss was responsible for the learning of each knowledge point would be difficult.

Moreover, we employed two schemes of knowledge distillation in this study, including learning from intermediate-layer features and learning from network output. Specifically, we distilled features in the top convolutional layer (Conv) or in each of three FC layers (namely FC1, FC2, and FC3) of the teacher network to train the student network, respectively. In this way, knowledge distillation from the FC3 layer corresponded to the scheme of learning from output, and distillation from other layers corresponded to the scheme of learning from intermediate-layer features. In spite of these two schemes, we mainly explained knowledge distillation from intermediate-layer features, which forced the student network to mimic the knowledge of the teacher network, as discussed above.
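The two-stage training scheme described above (first matching the teacher's intermediate-layer feature with an L2 distillation loss, then training the layers above the target layer with only the classification loss) could look roughly like the following sketch; the split of the student into student_lower / student_upper, the optimizer settings, and the data-loader name are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def distill_then_classify(teacher_lower, student_lower, student_upper,
                              loader, epochs_distill=50, epochs_cls=50, lr=0.01):
        """Stage 1: learn the student's lower layers with || f_T(x) - f_S(x) ||^2 only.
        Stage 2: freeze them and train the upper layers with the classification loss."""
        teacher_lower.eval()

        opt1 = torch.optim.SGD(student_lower.parameters(), lr=lr)
        for _ in range(epochs_distill):
            for x, _ in loader:
                with torch.no_grad():
                    f_t = teacher_lower(x)                  # teacher feature at the target layer l
                diff = student_lower(x) - f_t
                loss = diff.pow(2).flatten(1).sum(dim=1).mean()
                opt1.zero_grad()
                loss.backward()
                opt1.step()

        for p in student_lower.parameters():                # keep distilled features fixed
            p.requires_grad_(False)

        opt2 = torch.optim.SGD(student_upper.parameters(), lr=lr)
        for _ in range(epochs_cls):
            for x, y in loader:
                logits = student_upper(student_lower(x))
                loss = F.cross_entropy(logits, y)
                opt2.zero_grad()
                loss.backward()
                opt2.step()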
Layers Used to Quantify Knowledge Points. For each pair of the student and baseline networks, we measured knowledge points on the top convolutional layer and all FC layers. Particularly, for DNNs trained on the ILSVRC-2013 DET dataset and the Pascal VOC 2012 dataset, we quantified knowledge points only on the FC1 and FC2 layers, because the dimension of features in the output layer FC3 was much lower than that of intermediate-layer features in the FC1 and FC2 layers. Similarly, for DNNs trained on the ModelNet40 dataset, we also used the FC1 and FC2 layers to quantify knowledge points. For the BERT model, we measured knowledge points in its last hidden layer. Considering that ResNets usually only had a single FC layer, to obtain rich knowledge representations, we added two Conv layers and two FC layers before the final FC layer. In this way, we quantified knowledge points on these three FC layers.

Hyper-Parameter Settings. We used the SGD optimizer, and set the batch size as 64 to train DNNs. We set the grid size to 16 × 16, and settings of the threshold b were as follows. For DNNs towards image classification, b was set to 0.2, except that we set b to 0.25 for AlexNet. For BERT towards NLP tasks, b was set to 0.2. For DNNs towards 3D point cloud classification, b was set to 0.25, except that b was set to 0.5 for PointNet.

Additionally, we also tested the influence of the grid size and the threshold b on the verification of the three hypotheses. To this end, we conducted experiments with different settings of the grid size and different settings of the threshold b. Table 3 shows that Hypotheses 1-3 were still validated when knowledge points were calculated under different settings of the grid size and different settings of the threshold b.

4.2 Verification of Hypothesis 1

According to Hypothesis 1, a well-trained teacher network was supposed to encode more foreground knowledge points than the student network and the baseline network. Consequently, a distilled student network was supposed to
model more foreground knowledge points than the baseline network. This was the case because the teacher network was usually trained using a large amount of training data, and achieved superior performance to the baseline network. In this way, this well-trained teacher network was supposed to encode more foreground knowledge points than the baseline network. Knowledge distillation forced the student network to mimic the teacher network. Hence, the student network was supposed to model more foreground knowledge points than the baseline network.

TABLE 3
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification on the CUB200-2011 Dataset, Respectively

Quantification of Knowledge Points in the Teacher Network, the Student Network, and the Baseline Network. Here, we compared knowledge points encoded in the teacher network, the student network, and the baseline network, and verified the above hypothesis. We trained a VGG-16 model from scratch as the teacher network using the ILSVRC-2013 DET dataset or the CUB200-2011 dataset for image classification. Data augmentation [28] was used to boost the performance of the teacher network. Table 4 compares the number of foreground knowledge points N^fg_point learned by the teacher network, the baseline network, and the student network. For fair comparisons, the student network had the same architecture as the teacher network and the baseline network according to Section 4.1.

TABLE 4
Comparisons of Knowledge Points Encoded in the Teacher Network (T), the Student Network (S) and the Baseline Network (B) for Image Classification

Table 4 shows that the teacher network usually encoded more foreground knowledge points N^fg_point and a higher ratio λ than the student network. Meanwhile, the student network often obtained larger values of N^fg_point and λ than the baseline network. In this way, the above hypothesis was verified. There was an exception when N^fg_point was measured at the FC2 layer of the VGG-16 model on the ILSVRC-2013 DET dataset. The teacher network encoded fewer foreground knowledge points than the student network, because this teacher network had a larger average background entropy value H̄ than the student network.

Comparing Metrics of N^fg_point, N^bg_point, and λ to Verify Hypothesis 1. In contrast to the previous subsection learning the teacher network from scratch, here, we compared N^fg_point, N^bg_point, and λ between the student network and the baseline network when using a teacher network trained in a more sophisticated manner. In other words, this teacher network was pre-trained on the ImageNet dataset, and then fine-tuned using the ILSVRC-2013 DET dataset, the CUB200-2011 dataset, or the Pascal VOC 2012 dataset. This is a more common case in real applications.

Based on the information discarding {Hi}, we visualized image regions in the foreground that corresponded to foreground knowledge points, as well as image regions in the background that corresponded to background knowledge points in Fig. 6, where these knowledge points were measured at the FC1 layer of the VGG-11 model. In this scenario, Hypothesis 1 was also successfully verified. Moreover, for tasks of image classification, natural language processing, and 3D point cloud classification, Table 5, Table 6, and
Fig. 6. Visualization of knowledge points encoded in the FC1 layer of VGG-11. Generally, the student network exhibited a larger N^fg_point value and a smaller N^bg_point value than the baseline network.
TABLE 5
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification
Table 7 show that Hypothesis 1 was validated. Very few student networks in Table 5, Table 6, and Table 7 encoded more background knowledge points N^bg_point than the baseline network. This occurred because teacher networks in this subsection, Section 4.3, and Section 4.4 were either pre-trained or trained using more samples, which encoded far more knowledge points than necessary. In this way, student networks distilled from these well-trained teacher networks learned more unnecessary knowledge for inference, thereby leading to a larger N^bg_point value than the baseline network.

Note that as discussed in Section 3, knowledge distillation usually involved two different utilities. The utility of distilling from the low-dimensional network output was mainly to select confident training samples for learning, with very little information on how the teacher network encoded detailed features. In contrast, the utility of distilling from high-dimensional intermediate-layer features was mainly to force the student network to mimic knowledge points of the teacher network. Hence, in this study, we mainly explained knowledge distillation from high-dimensional intermediate-layer features.
TABLE 6
Comparisons Between the Student Network (S) and the Baseline Network (B) for the Question-Answering and the Sentiment Classification Tasks, Respectively

Dataset | Network |   | N^fg_point ↑ | N^bg_point ↓ | λ ↑  | D_mean ↓ | D_var ↓ | ρ ↑
SQuAD   | BERT    | S | 8.46         | 10.24        | 0.48 | 13.81    | 104.94  | 0.56
        |         | B | 3.58         | 10.55        | 0.29 | 23.32    | 206.23  | 0.25
SST-2   | BERT    | S | 1.31         | 1.13         | 0.53 | 1.08     | 0.68    | 0.80
        |         | B | 0.68         | 0.79         | 0.39 | 1.29     | 1.00    | 0.51

Metrics were evaluated at the last hidden layer of the BERT model, which verified Hypotheses 1-3.

TABLE 7
Comparisons Between the Student Network (S) and the Baseline Network (B) for the Classification of 3D Point Clouds

TABLE 8
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification on the MSCOCO Dataset

TABLE 9
Comparisons Between the Student Network (S) and the Baseline Network (B) for Image Classification on the Tiny-ImageNet Dataset

Fig. 7. Comparison of N^fg_point, N^bg_point, and all knowledge points N^all_point = N^fg_point + N^bg_point between fine-tuning a pre-trained VGG-16 network and learning a VGG-16 network from scratch on the Tiny-ImageNet dataset. Metrics were measured at the highest convolutional layer. The fine-tuning process made the DNN encode new foreground knowledge points more quickly and enabled it to be optimized with fewer detours than learning from scratch.
knowledge points usually achieved better classification performance, as discussed in Section 3.2. To this end, we used the aforementioned student network, which was distilled from the FC1 layer of the teacher network on the CUB200-2011 dataset for object classification, to describe how the ratio λ changes in different epochs of the training process. Fig. 2 shows that DNNs with a larger ratio of task-related knowledge points often exhibited better performance.

4.3 Verification of Hypothesis 2

For Hypothesis 2, we assumed that knowledge distillation made the student network learn different knowledge points simultaneously. To this end, we used the metrics D_mean and D_std to verify this hypothesis.

For image classification, Table 5 shows that values of D_mean and D_std were usually smaller than those of the baseline network, which verified Hypothesis 2. Of note, a very small number of failure cases occurred, for example, when D_mean and D_std were measured at the FC1 layer of AlexNet or at the Conv layer of VGG-11. This is because both the AlexNet model and the VGG-11 model had relatively shallow network architectures. When learning from raw data, DNNs with shallow architectures would more easily learn more knowledge points with less overfitting.

For NLP tasks (question-answering and sentiment classification) and the classification of 3D point clouds, Table 6 and Table 7 show that the student network had smaller values of D_mean and D_std than the baseline network, respectively, which successfully verified Hypothesis 2.

Specifically, we conducted experiments to explore how the fine-tuning process guided the DNN to encode new knowledge points, in which we fine-tuned a pre-trained VGG-16 network on the Tiny-ImageNet dataset. Besides, we also trained another VGG-16 network from scratch on the same dataset for comparison. Fig. 7 shows that the fine-tuning process made the DNN learn new knowledge points in the foreground more quickly and be optimized with fewer detours than learning from scratch, i.e., the fine-tuned DNN discarded fewer temporarily learned knowledge points.

5 CONCLUSION AND DISCUSSIONS

In this study, we provide a new perspective to explain knowledge distillation from intermediate-layer features, i.e., quantifying knowledge points encoded in the intermediate layers of a DNN. We propose three hypotheses regarding the mechanism of knowledge distillation, and design three types of metrics to verify these hypotheses for different classification tasks. Compared to learning from scratch, knowledge distillation ensures that the DNN encodes more knowledge points, learns different knowledge points simultaneously, and optimizes with fewer detours. Note that the learning procedure of DNNs cannot be precisely divided into a learning phase and a discarding phase. In each epoch, the DNN may simultaneously learn new knowledge points and discard old knowledge points irrelevant to the task. Thus, the target epoch m̂ in Fig. 4 is simply a rough estimation of the division of the two learning phases.
REFERENCES

[7] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su, "This looks like that: Deep learning for interpretable image recognition," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8928–8939.
[8] R. T. Chen, J. Behrmann, D. K. Duvenaud, and J.-H. Jacobsen, "Residual flows for invertible generative modeling," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 9916–9926.
[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2172–2180.
[10] X. Cheng, Z. Rao, Y. Chen, and Q. Zhang, "Explaining knowledge distillation by quantifying the knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12925–12935.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805.
[12] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, "Sharp minima can generalize for deep nets," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1019–1028.
[13] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4829–4837.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[15] S. Flennerhag, P. G. Moreno, N. D. Lawrence, and A. Damianou, "Transferring knowledge across learning processes," 2018, arXiv:1812.01054.
[16] R. Fong and A. Vedaldi, "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8730–8738.
[17] R. C. Fong and A. Vedaldi, "Interpretable explanations of black boxes by meaningful perturbation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3449–3457.
[18] S. Fort, P. K. Nowak, and S. Narayanan, "Stiffness: A new perspective on generalization in neural networks," 2019, arXiv:1901.09491.
[19] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, "Born again neural networks," 2018, arXiv:1805.04770.
[20] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, "Loss surfaces, mode connectivity, and fast ensembling of DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8789–8798.
[21] Z. Goldfeld et al., "Estimating information flow in deep neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2299–2308.
[22] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree, "Regularisation of neural networks by enforcing Lipschitz continuity," Mach. Learn., vol. 110, no. 2, pp. 393–416, 2021.
[23] C. Guan, X. Wang, Q. Zhang, R. Chen, D. He, and X. Xie, "Towards a deep and unified understanding of deep neural models in NLP," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2454–2463.
[24] I. Higgins et al., "beta-VAE: Learning basic visual concepts with a constrained variational framework," in Proc. Int. Conf. Learn. Representations, 2017, Art. no. 6.
[25] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[26] G. E. Hinton, S. Sabour, and N. Frosst, "Matrix capsules with EM routing," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–15.
[27] S. Hooker, D. Erhan, P.-J. Kindermans, and B. Kim, "A benchmark for interpretability methods in deep neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 9737–9748.
[28] J.-H. Jacobsen, A. Smeulders, and E. Oyallon, "i-RevNet: Deep invertible networks," 2018, arXiv:1802.07088.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[30] A. Kapishnikov, T. Bolukbasi, F. Viegas, and M. Terry, "XRAI: Better attributions through regions," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 4947–4956.
[31] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," 2016, arXiv:1609.04836.
[32] B. Kim et al., "Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)," 2017, arXiv:1711.11279.
[33] P.-J. Kindermans et al., "Learning how to explain neural networks: PatternNet and PatternAttribution," 2017, arXiv:1705.05598.
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[35] Y. Le and X. Yang, "Tiny ImageNet visual recognition challenge," CS 231N, vol. 7, no. 7, 2015, Art. no. 3.
[36] T. Li, J. Li, Z. Liu, and C. Zhang, "Few sample knowledge distillation for efficient network compression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 14627–14635.
[37] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[38] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang, "Structured knowledge distillation for semantic segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2599–2608.
[39] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik, "Unifying distillation and privileged information," in Proc. Int. Conf. Learn. Representations, 2016, pp. 1–10.
[40] H. Ma, Y. Zhang, F. Zhou, and Q. Zhang, "Quantifying layerwise information discarding of neural networks," 2019, arXiv:1906.04109.
[41] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5188–5196.
[42] A. K. Menon, A. S. Rawat, S. J. Reddi, S. Kim, and S. Kumar, "Why distillation helps: A statistical perspective," 2020, arXiv:2005.10419.
[43] B. Neyshabur, S. Bhojanapalli, and N. Srebro, "A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks," in Proc. Int. Conf. Learn. Representations, 2018, pp. 1–9.
[44] B. Neyshabur, R. Tomioka, and N. Srebro, "Norm-based capacity control in neural networks," in Proc. Conf. Learn. Theory, 2015, pp. 1376–1401.
[45] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in Proc. IEEE Symp. Secur. Privacy, 2016, pp. 582–597.
[46] M. Phuong and C. Lampert, "Towards understanding knowledge distillation," in Proc. Int. Conf. Mach. Learn., 2019, pp. 5142–5151.
[47] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 77–85.
[48] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," 2017, arXiv:1706.02413.
[49] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proc. Conf. Empir. Methods Natural Lang. Process., 2016, pp. 2383–2392.
[50] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2016, pp. 1135–1144.
[51] M. T. Ribeiro, S. Singh, and C. Guestrin, "Anchors: High-precision model-agnostic explanations," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 1527–1535.
[52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," 2014, arXiv:1412.6550.
[53] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[54] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3856–3866.
[55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[56] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," 2013, arXiv:1312.6229.
[57] W. Shen et al., "Interpretable compositional convolutional neural networks," 2021, arXiv:2107.04474.
[58] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," 2017, arXiv:1703.00810.
[59] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," 2017, arXiv:1312.6034.
[60] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015, pp. 1–14.
[61] U. Simsekli, L. Sagun, and M. Gurbuzbalaban, "A tail-index analysis of stochastic gradient noise in deep neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 5827–5837.
[62] R. Socher et al., "Recursive deep models for semantic compositionality over a sentiment treebank," in Proc. Conf. Empir. Methods Natural Lang. Process., 2013, pp. 1631–1642.
[63] J. Tang et al., "Understanding and improving knowledge distillation," 2020, arXiv:2002.03532.
[64] J. Uijlings, S. Popov, and V. Ferrari, "Revisiting knowledge transfer for training object class detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1101–1110.
[65] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. CNS-TR-2011-001, 2011.
[66] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, "Dynamic graph CNN for learning on point clouds," ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
[67] T.-W. Weng et al., "Evaluating the robustness of neural networks: An extreme value theory approach," 2018, arXiv:1801.10578.
[68] N. Wolchover, "New theory cracks open the black box of deep learning," Quanta Mag., 2017. [Online]. Available: https://www.quantamagazine.org/new-theory-cracks-open-the-black-box-of-deep-learning-20170921/
[69] W. Wu, Z. Qi, and L. Fuxin, "PointConv: Deep convolutional networks on 3D point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9613–9622.
[70] Z. Wu et al., "3D ShapeNets: A deep representation for volumetric shapes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
[71] D. Xie, T. Shu, S. Todorovic, and S.-C. Zhu, "Learning and inferring "dark matter" and predicting human intents and trajectories in videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 7, pp. 1639–1652, Jul. 2018.
[72] A. Xu and M. Raginsky, "Information-theoretic analysis of generalization capability of learning algorithms," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 2524–2533.
[73] J. Yim, D. Joo, J. Bae, and J. Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7130–7138.
[74] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, "Understanding neural networks through deep visualization," 2015, arXiv:1506.06579.
[75] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, "Revisiting knowledge distillation via label smoothing regularization," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3902–3910.
[76] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
[77] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," 2016, arXiv:1611.03530.
[78] Q. Zhang, X. Wang, Y. N. Wu, H. Zhou, and S.-C. Zhu, "Interpretable CNNs for object classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 10, pp. 3416–3431, Oct. 2021.
[79] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2921–2929.

Quanshi Zhang (Member, IEEE) received the PhD degree from the University of Tokyo in 2014. He is currently an Associate Professor with Shanghai Jiao Tong University, China. From 2014 to 2018, he was a post-doctoral researcher with the University of California, Los Angeles. His research interests include machine learning and computer vision. In particular, he has made influential research in explainable AI (XAI). He was the co-chair of the workshops towards XAI at ICML 2021, AAAI 2019, and CVPR 2019. He is the speaker of the tutorials on XAI at IJCAI 2020 and IJCAI 2021. He won the ACM China Rising Star Award at ACM TURC 2021.

Xu Cheng (Member, IEEE) is working toward the PhD degree with Shanghai Jiao Tong University. Her research interests include computer vision and machine learning.

Yilan Chen (Member, IEEE) is working toward the master's degree with the University of California San Diego. His research interests include machine learning.

Zhefan Rao (Member, IEEE) received the graduate degree from the Huazhong University of Science & Technology. His research interests include machine learning.