Source: methods collected in https://github.com/open-mmlab/MMRazor
ICCV 2021:Channel-Wise Knowledge Distillation for Dense Prediction
First, a 1x1 convolution layer is employed to upsample the student's channels so that the channel dimensions of the teacher's and the student's intermediate features are aligned, if their channel numbers do not match.
Then, a softmax normalization is applied over the WxH spatial locations of each channel.
Finally, a KL-divergence loss is used to minimize the distance between the student's and the teacher's channel-wise distributions.
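A minimal PyTorch sketch of this channel-wise distillation loss (the module name, the temperature `tau`, and the `batchmean` reduction are my assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseKDLoss(nn.Module):
    """Channel-wise KD: softmax over each channel's spatial map, then KL divergence."""

    def __init__(self, student_channels, teacher_channels, tau=4.0):
        super().__init__()
        self.tau = tau
        # 1x1 conv aligns the student's channel number to the teacher's.
        self.align = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                      if student_channels != teacher_channels else nn.Identity())

    def forward(self, feat_s, feat_t):
        feat_s = self.align(feat_s)                      # (N, C, H, W)
        n, c, h, w = feat_t.shape
        # Softmax over the spatial locations of every channel.
        p_t = F.softmax(feat_t.view(n, c, -1) / self.tau, dim=-1)
        log_p_s = F.log_softmax(feat_s.view(n, c, -1) / self.tau, dim=-1)
        # KL(teacher || student); 'batchmean' sums the KL and divides by the batch size.
        return F.kl_div(log_p_s, p_t, reduction="batchmean") * (self.tau ** 2)
```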
ICLR 2018 Workshop:Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks
Adversarial learning for knowledge distillation can first be found in (Ref. ICLR 2018: Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks).
Liu et al. (Ref. CVPR 2019: Structured Knowledge Distillation for Semantic Segmentation) share a similar idea for semantic segmentation, named holistic distillation. We also leverage adversarial learning performed in the output space.
Main idea: We propose to use conditional adversarial networks to learn the loss function to transfer knowledge from teacher to student.
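A hedged sketch of what such an output-space adversarial loss can look like; the conditional discriminator, the least-squares GAN objective, and all names below are illustrative assumptions rather than either paper's exact formulation (it also assumes the logits share the image's spatial size):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Toy conditional discriminator scoring (input image, output logits) pairs."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + num_classes, 64, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, stride=2, padding=1),
        )

    def forward(self, image, logits):
        # Condition on the input image by channel-wise concatenation.
        return self.net(torch.cat([image, logits], dim=1)).mean(dim=(1, 2, 3))

def adversarial_kd_losses(disc, image, student_logits, teacher_logits):
    """Least-squares GAN losses: teacher outputs are 'real', student outputs are 'fake'."""
    d_real = disc(image, teacher_logits.detach())
    d_fake = disc(image, student_logits.detach())
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    # Generator (student) term: make the student's output look like the teacher's.
    g_loss = ((disc(image, student_logits) - 1) ** 2).mean()
    return d_loss, g_loss
```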
CVPR 2019:Structured Knowledge Distillation for Semantic Segmentation
Proposes distilling structured knowledge from a large network into a small one.
Two kinds of structured knowledge are investigated: (1) pair-wise similarity, (2) GAN-based holistic distillation.
Pixel-wise distillation: KL divergence on the per-class probability logits of the semantic-segmentation prediction.
Pair-wise distillation: MSE loss on similarity matrices computed with cosine similarity.
Holistic distillation: conditional generative adversarial learning.
Total loss: the multi-class cross-entropy loss ℓmc(S) (the cross-entropy between the student's prediction and the ground-truth labels), combined with the distillation terms above.
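A rough sketch of the pixel-wise and pair-wise terms under the usual formulations (the temperature, tensor shapes, and function names are assumptions; the paper may compute the similarity over pooled locations):

```python
import torch
import torch.nn.functional as F

def pixel_wise_loss(student_logits, teacher_logits, tau=1.0):
    """Per-pixel KL divergence between teacher and student class distributions."""
    # logits: (N, num_classes, H, W); every pixel is treated as one sample.
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1).mean()

def pair_wise_loss(feat_s, feat_t):
    """MSE between cosine-similarity matrices over spatial locations."""
    def similarity(feat):                                    # feat: (N, C, H, W)
        n, c, h, w = feat.shape
        feat = F.normalize(feat.view(n, c, h * w), dim=1)    # unit-norm per location
        return torch.bmm(feat.transpose(1, 2), feat)         # (N, HW, HW) cosine sims
    return F.mse_loss(similarity(feat_s), similarity(feat_t))
```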
CVPR 2019:Knowledge Adaptation for Efficient Semantic Segmentation
(1) The first part is designed for translating the knowledge from the teacher network to a compressed space that is more informative. The translator is achieved by training an autoencoder to compress the knowledge to a compact format that is easier to be learned by the student network, otherwise much harder due to the inherent structure differences.
(2) The second part is designed to capture long-range dependencies from the teacher network, which is difficult to be learned for small models due to the limited receptive field and abstracting capability.
Training procedure, step 1: train the teacher network and the teacher's auto-encoder;
Training procedure, step 2: train the student network.
feature adapter loss
Cf refers to the features from the student network's adapter, and E refers to the features output by the encoder of the teacher's auto-encoder.
Affinity loss: the authors claim the affinity can capture long-range dependencies.
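A hedged sketch of the two terms in these notes' notation (Cf as the student adapter output, E as the teacher encoder output); the cosine-affinity definition below is a common pairwise-similarity formulation, not necessarily the paper's exact one:

```python
import torch
import torch.nn.functional as F

def feature_adapter_loss(c_f, e):
    """Match the student's adapted features Cf to the teacher's encoded features E."""
    return F.mse_loss(c_f, e)

def affinity_matrix(feat):
    """Pairwise cosine affinity between all spatial locations of a feature map."""
    n, c, h, w = feat.shape
    feat = F.normalize(feat.view(n, c, h * w), dim=1)
    return torch.bmm(feat.transpose(1, 2), feat)             # (N, HW, HW)

def affinity_loss(feat_s, feat_t):
    """L2 distance between student and teacher affinity matrices,
    intended to transfer long-range dependencies."""
    return F.mse_loss(affinity_matrix(feat_s), affinity_matrix(feat_t))
```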
arXiv 2020:Channel Distillation: Channel-Wise Attention for Knowledge Distillation
The core idea has two parts. The first transfer strategy is based on channel-wise attention, called Channel Distillation (CD).
Compute an attention weight for each channel.
Apply a squared loss between the teacher's and the student's weights over the C channels of each of the n stages.
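A minimal sketch of the CD term, assuming the per-channel weight is obtained by global average pooling (a common choice; the paper's exact attention and stage weighting may differ):

```python
import torch
import torch.nn.functional as F

def channel_weights(feat):
    """Per-channel weight via global average pooling: (N, C, H, W) -> (N, C)."""
    return feat.mean(dim=(2, 3))

def channel_distillation_loss(student_feats, teacher_feats):
    """Squared loss between student and teacher channel weights, summed over the n stages."""
    loss = 0.0
    for f_s, f_t in zip(student_feats, teacher_feats):   # one feature pair per stage
        loss = loss + F.mse_loss(channel_weights(f_s), channel_weights(f_t))
    return loss
```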
The second is Guided Knowledge Distillation (GKD). GKD only enables the student to mimic the correct outputs of the teacher.
The traditional logits-based KL-divergence distillation loss.
A binary weight, indicating whether the teacher's prediction agrees with the ground-truth label, is applied to the traditional logits-based KL-divergence distillation loss.
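A small sketch of GKD under this description; the temperature and the normalization by the number of correctly predicted samples are assumptions:

```python
import torch
import torch.nn.functional as F

def gkd_loss(student_logits, teacher_logits, labels, tau=4.0):
    """Logits KD, gated so only samples the teacher classifies correctly contribute."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * (tau ** 2)  # per-sample KD
    correct = (teacher_logits.argmax(dim=1) == labels).float()             # 0/1 weight
    return (kd * correct).sum() / correct.sum().clamp(min=1)
```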
ICLR 2021:Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective
From a statistical-learning perspective, regularization aims to reduce variance; however, it is unclear how bias and variance change when training with soft labels. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that the bias-variance tradeoff varies from sample to sample during training. Furthermore, under the same distillation temperature setting, we observe that distillation performance is negatively correlated with the number of some specific samples, which are termed regularization samples since they lead to larger bias and smaller variance. Nevertheless, we empirically find that completely removing the regularization samples also degrades distillation performance.
These findings motivate us to propose novel weighted soft labels that help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method.
Weighted soft labels are proposed: only a single weight is added per instance to rectify the standard logits-based knowledge distillation process.
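A sketch of per-instance weighted logits distillation; `sample_weight` is deliberately left as an input because the paper's actual weighting rule (derived from its bias-variance analysis) is not reproduced here:

```python
import torch
import torch.nn.functional as F

def weighted_kd_loss(student_logits, teacher_logits, sample_weight, tau=4.0):
    """Standard logits KD with one scalar weight per instance, shape (N,)."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * (tau ** 2)  # (N,)
    return (sample_weight * kd).mean()
```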
ICCV 2019:Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation
Proposes a self-distillation method with three loss sources (a combined sketch follows the list below).
Loss Source 1: Cross-entropy loss from labels, applied not only to the deepest classifier but also to all the shallow classifiers. This resembles deep supervision in semantic segmentation: every classifier's prediction is supervised by the ground truth.
Loss Source 2: KL (Kullback-Leibler) divergence loss under the teacher's guidance, computed between the softmax outputs of the deepest classifier (acting as the teacher) and the softmax outputs of each shallow classifier.
Loss Source 3: L2 loss from hints, computed between the feature maps of the deepest classifier (acting as the teacher) and the feature maps of each shallow classifier.
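A combined sketch of the three loss sources; it assumes each shallow feature has already been projected to the same shape as the deepest feature, and the weights `alpha`/`beta` and temperature `tau` are illustrative, not the paper's values:

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(shallow_logits, shallow_feats, deep_logits, deep_feat,
                           labels, tau=3.0, alpha=0.3, beta=0.03):
    """Self-distillation: deepest classifier teaches all shallow classifiers."""
    deep_p = F.softmax(deep_logits.detach() / tau, dim=1)     # teacher soft targets
    loss = F.cross_entropy(deep_logits, labels)               # source 1, deepest classifier
    for logits, feat in zip(shallow_logits, shallow_feats):
        loss = loss + F.cross_entropy(logits, labels)         # source 1, shallow classifier
        log_p = F.log_softmax(logits / tau, dim=1)
        loss = loss + alpha * F.kl_div(log_p, deep_p, reduction="batchmean") * tau ** 2  # source 2
        loss = loss + beta * F.mse_loss(feat, deep_feat.detach())                        # source 3
    return loss
```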
Some data-free knowledge distillation methods, skipped for now:
arXiv 2019: Data-Free Adversarial Distillation
ICCV 2019: Data-Free Learning of Student Networks
CVPR 2021: Learning Student Networks in the Wild
ICCV 2019: A Comprehensive Overhaul of Feature Distillation
We investigate the design aspects of feature distillation methods for network compression and propose a novel feature distillation method.
In this method, the distillation loss is designed to make the following aspects work in synergy: the teacher transform Tt, the student transform Ts, the distillation feature position, and the distance function d.
The proposed distillation loss includes a feature transform with a newly designed margin ReLU, a new distillation feature position, and a partial L2 distance function that skips redundant information harmful to compressing the student.
Different methods are compared in terms of the teacher transform Tt, the student transform Ts, and the distance d.
Using a transform compresses the features and thus loses information; placing the distillation position after the activation function filters out the negative responses and thus loses information; a plain L2 distance transfers this erroneous (negative-region) information, so a partial L2 that skips distillation of the negative-region information is proposed.
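A sketch of the margin ReLU teacher transform and the partial L2 distance as described above; how the per-channel margin is estimated is omitted, and shapes and names are assumptions:

```python
import torch

def margin_relu(feat_t, margin):
    """Teacher transform: clip teacher activations from below by a (negative) per-channel margin.
    margin: (C,) tensor, typically estimated from the teacher's negative pre-ReLU responses."""
    return torch.max(feat_t, margin.view(1, -1, 1, 1))

def partial_l2(feat_s, feat_t):
    """Partial L2: skip positions where the teacher target is negative and the student
    value is already below it, i.e., where there is nothing useful to transfer."""
    diff = feat_s - feat_t
    skip = (feat_s <= feat_t) & (feat_t <= 0)      # positions excluded from distillation
    return (diff.pow(2) * (~skip).float()).sum() / feat_s.numel()
```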