Abstract: This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher–student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.
Section 1: Introduction
Techniques for making models lightweight:
1. Designing efficient deep models
2. Pruning: parameter pruning and sharing
3. Low-rank factorization
4. Transferred compact convolutional filters (I'm not too familiar with this one)
5. Knowledge distillation
Main idea of knowledge distillation: the student model mimics the teacher model to obtain competitive or even superior performance.
Key problem of knowledge distillation: how to represent and transfer knowledge from a large teacher model to a small student model.
Three components of a knowledge distillation system: knowledge, distillation algorithm, teacher-student architecture.
Section 2: Knowledge
Knowledge in knowledge distillation comes in several forms. The most basic kind uses the teacher model's predicted logits; in addition, the features of the teacher's intermediate layers can serve as representational knowledge to guide the student network. The relational information carried by different neurons and different feature layers of the teacher network, as well as the teacher model's parameters, also contains knowledge. The survey groups knowledge into three categories: response-based knowledge, feature-based knowledge, and relation-based knowledge.
Response-based knowledge: the class logits output by the teacher model.
Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model, which is simple yet effective. The idea of response-based knowledge is straightforward and easy to understand, especially in the context of "dark knowledge".
The most typical response-based method is the temperature-softened soft-target approach proposed by Hinton et al. in 2015.
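As a concrete reference, here is a minimal PyTorch-style sketch of this soft-target loss: the KL divergence between temperature-softened teacher and student outputs is combined with the usual cross-entropy on ground-truth labels. The temperature `T`, weight `alpha`, and the toy tensors are illustrative choices, not values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: KL between softened teacher/student distributions
    plus the usual cross-entropy with the ground-truth labels."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, as in Hinton et al. (2015)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a batch of 8 samples with 10 classes and random logits.
loss = soft_target_kd_loss(torch.randn(8, 10), torch.randn(8, 10),
                           torch.randint(0, 10, (8,)))
```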
However, response-based knowledge usually relies on the output of the last layer, e.g., soft targets, and thus fails to address the intermediate-level supervision from the teacher model, which turns out to be very important for representation learning with very deep neural networks.
Feature-based knowledge
A typical method is the hint approach from FitNets: the output of a teacher's hidden layer supervises the student's learning, i.e., the teacher's feature maps are used as the knowledge.
The main idea is to directly match the feature activations of the teacher and the student. Inspired by this, a variety of other methods have been proposed to match the features indirectly. Indirect approaches include deriving attention maps from the feature maps to represent knowledge, matching probability distributions in feature space, using the activation boundaries of hidden neurons for knowledge transfer, and cross-layer knowledge distillation, which adaptively assigns proper teacher layers to each student layer via attention allocation.
Loss functions used to supervise feature-based knowledge distillation include: L1 loss, L2 loss, cross-entropy (CE) loss, and maximum mean discrepancy (MMD) loss.
Difficulty of feature-based knowledge distillation: though feature-based knowledge transfer provides favorable information for the learning of the student model, how to effectively choose the hint layers from the teacher model and the guided layers from the student model remains to be further investigated (Romero et al. 2015). Since the sizes of the hint and guided layers can differ significantly, how to properly match the feature representations of teacher and student also needs to be explored (see the sketch below).
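A minimal sketch of a FitNets-style hint loss, assuming the hint and guided layers differ in channel count (and possibly spatial size), so a 1x1-convolution regressor, an illustrative choice here, maps the student feature to the teacher's shape before an L2 match.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Feature-based KD: match a student guided layer to a teacher hint layer."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv regressor bridges the channel mismatch between the two layers.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (N, Cs, Hs, Ws), teacher_feat: (N, Ct, Ht, Wt)
        projected = self.regressor(student_feat)
        if projected.shape[-2:] != teacher_feat.shape[-2:]:
            # Resize spatially when the two feature maps disagree in H and W.
            projected = F.interpolate(projected, size=teacher_feat.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return F.mse_loss(projected, teacher_feat)

# Toy usage: a 32-channel student feature matched against a 64-channel teacher feature.
hint_loss = HintLoss(student_channels=32, teacher_channels=64)
loss = hint_loss(torch.randn(4, 32, 16, 16), torch.randn(4, 64, 8, 8))
```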
Relation-based knowledge: further explores the relationships between different layers or between data samples of the teacher and student models.
Common representations include the FSP matrix (flow of solution process), SVD (singular value decomposition) based representations, graph representations, mutual information flow, and instance relationship graphs.
Different kinds of relation-based knowledge:
relation-based distillation loss based on relations between feature maps
relation-based distillation loss based on relations between instances
What is the difference between relations of feature maps and relations of instances? Roughly, feature-map relations capture relationships between different layers/feature maps within a model (e.g., the FSP matrix computed between pairs of layers), whereas instance relations capture relationships between data samples (e.g., pairwise similarities or distances of their representations).
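To make the instance-relation idea concrete, here is a sketch of one similarity-preserving style loss: the knowledge transferred is the pairwise similarity structure among the samples in a batch rather than the individual activations. This is just one instantiation of relation-based distillation chosen for illustration; FSP-style losses instead relate pairs of layers within a network.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity(features):
    """Batch-wise Gram matrix of L2-normalized embeddings: (N, D) -> (N, N)."""
    flat = F.normalize(features.flatten(start_dim=1), dim=1)
    return flat @ flat.t()

def instance_relation_loss(student_feat, teacher_feat):
    """Relation-based KD: match how samples relate to each other,
    not the individual feature values themselves."""
    return F.mse_loss(pairwise_similarity(student_feat),
                      pairwise_similarity(teacher_feat))

# Toy usage: embeddings of different widths still yield comparable N x N relation matrices.
loss = instance_relation_loss(torch.randn(8, 128), torch.randn(8, 512))
```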
Section 3: Distillation schemes: offline distillation, online distillation, self-distillation
Offline distillation: the teacher model is trained first and then frozen; the frozen teacher supervises the training of the student model.
Online distillation: both the teacher model and the student model are updated simultaneously, and the whole knowledge distillation framework is end-to-end trainable.
Self-distillation: only one model is involved; knowledge from the deeper sections of the network is distilled into its shallower sections.
Besides, offline, online and self-distillation can also be intuitively understood from the perspective of human teacher-student learning. Offline distillation means a knowledgeable teacher teaches a student; online distillation means teacher and student study together; self-distillation means the student learns by itself. Moreover, just like human learning, these three kinds of distillation can be combined to complement one another thanks to their respective advantages.
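For the online case, a minimal sketch in the spirit of deep mutual learning: two peer networks are updated in the same iteration, each using the other's softened predictions as an extra target. The models, optimizers, temperature and toy data below are placeholders, not a prescription from the survey.

```python
import torch
import torch.nn.functional as F

def mutual_step(model_a, model_b, opt_a, opt_b, x, y, T=2.0):
    """One online-distillation step: both peers are trained simultaneously."""
    logits_a, logits_b = model_a(x), model_b(x)

    # Peer A mimics peer B's softened output (detached so gradients stay local to A).
    loss_a = F.cross_entropy(logits_a, y) + F.kl_div(
        F.log_softmax(logits_a / T, dim=1),
        F.softmax(logits_b.detach() / T, dim=1),
        reduction="batchmean") * T * T
    # Peer B mimics peer A symmetrically.
    loss_b = F.cross_entropy(logits_b, y) + F.kl_div(
        F.log_softmax(logits_b / T, dim=1),
        F.softmax(logits_a.detach() / T, dim=1),
        reduction="batchmean") * T * T

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()

# Toy usage with two small peer classifiers.
a, b = torch.nn.Linear(20, 5), torch.nn.Linear(20, 5)
opt_a = torch.optim.SGD(a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(b.parameters(), lr=0.1)
mutual_step(a, b, opt_a, opt_b, torch.randn(16, 20), torch.randint(0, 5, (16,)))
```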
Section 4: Teacher-Student Architecture
The complexity of deep neural networks mainly comes from two dimensions: depth and width. It is usually required to transfer knowledge from deeper and wider neural networks to shallower and thinner ones.
The student network is usually chosen to be:
1) a simplified version of the teacher network with fewer layers and fewer channels in each layer;
2) a quantized version of the teacher network in which the structure of the network is preserved;
3) a small network with efficient basic operations;
4) a small network with an optimized global network structure;
5) the same network as the teacher.
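To make option 1) concrete, a tiny sketch of a teacher and a student that share the same overall structure but differ in depth and width; the layer sizes are arbitrary illustrative choices.

```python
import torch.nn as nn

def make_mlp(widths):
    """Simple MLP builder: widths = [input, hidden..., output]."""
    layers = []
    for i in range(len(widths) - 1):
        layers.append(nn.Linear(widths[i], widths[i + 1]))
        if i < len(widths) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Teacher: deeper and wider. Student: a simplified version with fewer layers
# and fewer units per layer (option 1 above).
teacher = make_mlp([784, 1200, 1200, 1200, 10])
student = make_mlp([784, 300, 10])
```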
Section 5: Distillation Algorithms
1. Adversarial Distillation: Inspired by GANs, many adversarial knowledge distillation methods have been proposed to enable the teacher and student networks to have a better understanding of the true data distribution.
As shown in the survey's figure, adversarial distillation methods can be divided into three categories.
I don't fully understand adversarial distillation yet (a rough sketch of one variant follows).
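A rough, hedged sketch of one common variant: a discriminator is trained to tell teacher outputs from student outputs, and the student is additionally trained to fool it, which pushes the student's output distribution toward the teacher's. The architecture, losses and shapes below are illustrative assumptions rather than the survey's definition; other variants instead use a generator to synthesize training data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: 10-class softmax outputs; the discriminator scores how
# "teacher-like" a class distribution looks.
discriminator = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def adversarial_kd_step(student_logits, teacher_logits, labels):
    """One step of discriminator-based adversarial distillation (sketch)."""
    s_prob = F.softmax(student_logits, dim=1)
    t_prob = F.softmax(teacher_logits, dim=1)

    # 1) Train the discriminator: teacher outputs are "real", student outputs "fake".
    d_real = discriminator(t_prob.detach())
    d_fake = discriminator(s_prob.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

    # 2) Student objective: fool the discriminator + ordinary supervised loss;
    #    the caller backpropagates this through the student network.
    g_fake = discriminator(s_prob)
    adv_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return F.cross_entropy(student_logits, labels) + adv_loss

# Toy usage with random logits standing in for student/teacher forward passes.
student_loss = adversarial_kd_step(torch.randn(8, 10, requires_grad=True),
                                   torch.randn(8, 10), torch.randint(0, 10, (8,)))
```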
2. Multi-Teacher Distillation: Different teacher architectures can provide their own useful knowledge for a student network.
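One simple multi-teacher strategy, sketched here as an illustration: average the softened class distributions from several teachers into a single ensemble target for the student. Averaging is only one possible aggregation; other work weights or selects teachers per sample or per layer.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, T=4.0):
    """Distill from the averaged softened predictions of several teachers."""
    target = torch.stack(
        [F.softmax(t / T, dim=1) for t in teacher_logits_list], dim=0).mean(dim=0)
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    target, reduction="batchmean") * T * T

# Toy usage: three teachers (possibly different architectures) over the same 10 classes.
teachers = [torch.randn(8, 10) for _ in range(3)]
loss = multi_teacher_kd_loss(torch.randn(8, 10), teachers)
```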
3. Cross-Modal Distillation: knowledge learned by a teacher in one modality (e.g., RGB images) is transferred to a student operating on a different modality (e.g., depth, optical flow or audio), typically through paired samples across the modalities.
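A hedged sketch of the basic supervision-transfer recipe used in several of the works listed below: a teacher trained on one modality (say RGB) produces feature and/or soft-label targets on paired data, and a student for another modality (say depth) learns to match them. The equal feature dimensions, the loss mix and the temperature are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(student_depth_feat, student_logits,
                        teacher_rgb_feat, teacher_logits, T=2.0):
    """Cross-modal KD on paired (RGB, depth) samples: the depth student matches the
    RGB teacher's features and softened predictions, so no depth labels are needed."""
    feat_loss = F.mse_loss(student_depth_feat, teacher_rgb_feat.detach())
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                         F.softmax(teacher_logits.detach() / T, dim=1),
                         reduction="batchmean") * T * T
    return feat_loss + soft_loss

# Toy usage: paired 256-d features (same dimension assumed here) and 10-class logits.
loss = cross_modal_kd_loss(torch.randn(4, 256), torch.randn(4, 10),
                           torch.randn(4, 256), torch.randn(4, 10))
```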
Related work:
[1] CVPR 2016: Cross modal distillation for supervision transfer.
[2] ECCV 2018: Modality distillation with multiple stream networks for action recognition
[3] CVPR 2018: Through-wall human pose estimation using radio signals.
[4] ICASSP 2018: Cross-modality distillation: A case for conditional generative adversarial networks
[5] CVPR 2020: Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge
[6] ACM MM 2018: Emotion recognition in speech using cross-modal transfer in the wild
[7] ICIP 2019: Cross-modal knowledge distillation for action recognition
[8] ICLR 2020: Contrastive representation distillation
[9] ICCV 2019: Compact trilinear interaction for visual question answering
[10] ECCV 2018: Learning deep representations with probabilistic knowledge transfer.
[11] CVPR 2016: Learning with side information through modality hallucination
[12] CVPR 2019: UM-Adapt: Unsupervised multi-task adaptation using adversarial cross-task distillation
[13] CVPR 2019: CrDoCo: Pixel-level domain transfer with cross-domain consistency
[14] BMVC 2017: Adapting models to signal degradation using distillation
[15] ECCV 2018: Graph distillation for action detection with privileged modalities
[16] PR 2019: Spatiotemporal distilled dense-connectivity network for video action recognition
[17] ICASSP 2019: Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks
[18] AAAI 2020: Knowledge integration networks for action recognition