Source: methods collected in https://github.com/open-mmlab/MMRazor
ICCV 2021:Channel-Wise Knowledge Distillation for Dense Prediction
First, a 1x1 convolution layer is employed to upsample the student's channels so that the channel dimensions of the teacher's and the student's intermediate features are aligned, if their channel numbers do not match.
Then, a softmax normalization is applied over the WxH spatial locations of each channel.
Finally, a KL-divergence loss is used to minimize the distance between the student's and the teacher's channel-wise distributions.
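A minimal PyTorch sketch of this channel-wise distillation loss (the module name, the temperature `tau`, and the `batchmean` reduction are my assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelWiseKDLoss(nn.Module):
    """Channel-wise KD: softmax over each channel's spatial map, then KL divergence."""

    def __init__(self, student_channels, teacher_channels, tau=4.0):
        super().__init__()
        self.tau = tau
        # 1x1 conv aligns the student's channel number to the teacher's.
        self.align = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                      if student_channels != teacher_channels else nn.Identity())

    def forward(self, feat_s, feat_t):
        feat_s = self.align(feat_s)                      # (N, C, H, W)
        n, c, h, w = feat_t.shape
        # Softmax over the spatial locations of every channel.
        p_t = F.softmax(feat_t.view(n, c, -1) / self.tau, dim=-1)
        log_p_s = F.log_softmax(feat_s.view(n, c, -1) / self.tau, dim=-1)
        # KL(teacher || student); 'batchmean' sums the KL and divides by the batch size.
        return F.kl_div(log_p_s, p_t, reduction="batchmean") * (self.tau ** 2)
```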
ICLR 2018 Workshop:Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks
Adversarial learning for knowledge distillation can first be found in (Ref. ICLR 2018: Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks).
Liu et al. (Ref. CVPR 2019: Structured Knowledge Distillation for Semantic Segmentation) share a similar idea for semantic segmentation, named holistic distillation. We also leverage adversarial learning performed in the output space.
Main idea: We propose to use conditional adversarial networks to learn the loss function to transfer knowledge from teacher to student.
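A hedged sketch of what such an output-space adversarial loss can look like; the conditional discriminator, the least-squares GAN objective, and all names below are illustrative assumptions rather than either paper's exact formulation (it also assumes the logits share the image's spatial size):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Toy conditional discriminator scoring (input image, output logits) pairs."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + num_classes, 64, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 3, stride=2, padding=1),
        )

    def forward(self, image, logits):
        # Condition on the input image by channel-wise concatenation.
        return self.net(torch.cat([image, logits], dim=1)).mean(dim=(1, 2, 3))

def adversarial_kd_losses(disc, image, student_logits, teacher_logits):
    """Least-squares GAN losses: teacher outputs are 'real', student outputs are 'fake'."""
    d_real = disc(image, teacher_logits.detach())
    d_fake = disc(image, student_logits.detach())
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    # Generator (student) term: make the student's output look like the teacher's.
    g_loss = ((disc(image, student_logits) - 1) ** 2).mean()
    return d_loss, g_loss
```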
CVPR 2019:Structured Knowledge Distillation for Semantic Segmentation
Proposes distilling structured knowledge from a large network into a small one.
Two kinds of structured knowledge are investigated: (1) pair-wise similarity, (2) GAN-based holistic distillation.
Pixel-wise distillation: KL divergence on the per-class probability logits of the semantic-segmentation prediction.
Pair-wise distillation: MSE loss on similarity matrices computed with cosine similarity.
Holistic distillation: conditional generative adversarial learning.
Total loss: the multi-class cross-entropy loss ℓmc(S) (the cross-entropy between the student's prediction and the ground-truth labels), combined with the distillation terms above.
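A rough sketch of the pixel-wise and pair-wise terms under the usual formulations (the temperature, tensor shapes, and function names are assumptions; the paper may compute the similarity over pooled locations):

```python
import torch
import torch.nn.functional as F

def pixel_wise_loss(student_logits, teacher_logits, tau=1.0):
    """Per-pixel KL divergence between teacher and student class distributions."""
    # logits: (N, num_classes, H, W); every pixel is treated as one sample.
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1).mean()

def pair_wise_loss(feat_s, feat_t):
    """MSE between cosine-similarity matrices over spatial locations."""
    def similarity(feat):                                    # feat: (N, C, H, W)
        n, c, h, w = feat.shape
        feat = F.normalize(feat.view(n, c, h * w), dim=1)    # unit-norm per location
        return torch.bmm(feat.transpose(1, 2), feat)         # (N, HW, HW) cosine sims
    return F.mse_loss(similarity(feat_s), similarity(feat_t))
```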
CVPR 2019:Knowledge Adaptation for Efficient Semantic Segmentation
(1) The first part is designed for translating the knowledge from the teacher network to a compressed space that is more informative. The translator is achieved by training an autoencoder to compress the knowledge to a compact format that is easier to be learned by the student network, otherwise much harder due to the inherent structure differences.
(2) The second part is designed to capture long-range dependencies from the teacher network, which is difficult to be learned for small models due to the limited receptive field and abstracting capability.
Training procedure, step 1: train the teacher network and the teacher's auto-encoder;
Training procedure, step 2: train the student network.
feature adapter loss
Cf refers to the features from the student network's adapter, and E refers to the features output by the encoder of the teacher's auto-encoder.
Affinity loss: the authors claim the affinity can capture long-range dependencies.
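A hedged sketch of the two terms in these notes' notation (Cf as the student adapter output, E as the teacher encoder output); the cosine-affinity definition below is a common pairwise-similarity formulation, not necessarily the paper's exact one:

```python
import torch
import torch.nn.functional as F

def feature_adapter_loss(c_f, e):
    """Match the student's adapted features Cf to the teacher's encoded features E."""
    return F.mse_loss(c_f, e)

def affinity_matrix(feat):
    """Pairwise cosine affinity between all spatial locations of a feature map."""
    n, c, h, w = feat.shape
    feat = F.normalize(feat.view(n, c, h * w), dim=1)
    return torch.bmm(feat.transpose(1, 2), feat)             # (N, HW, HW)

def affinity_loss(feat_s, feat_t):
    """L2 distance between student and teacher affinity matrices,
    intended to transfer long-range dependencies."""
    return F.mse_loss(affinity_matrix(feat_s), affinity_matrix(feat_t))
```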
arXiv 2020:Channel Distillation: Channel-Wise Attention for Knowledge Distillation
The core idea has two parts. The first transfer strategy is based on channel-wise attention, called Channel Distillation (CD).
Compute an attention weight for each channel.
Apply a squared loss between the teacher's and the student's weights over the C channels of each of the n stages.
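A minimal sketch of the CD term, assuming the per-channel weight is obtained by global average pooling (a common choice; the paper's exact attention and stage weighting may differ):

```python
import torch
import torch.nn.functional as F

def channel_weights(feat):
    """Per-channel weight via global average pooling: (N, C, H, W) -> (N, C)."""
    return feat.mean(dim=(2, 3))

def channel_distillation_loss(student_feats, teacher_feats):
    """Squared loss between student and teacher channel weights, summed over the n stages."""
    loss = 0.0
    for f_s, f_t in zip(student_feats, teacher_feats):   # one feature pair per stage
        loss = loss + F.mse_loss(channel_weights(f_s), channel_weights(f_t))
    return loss
```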
The second is Guided Knowledge Distillation (GKD). GKD only enables the student to mimic the correct outputs of the teacher.
The traditional logits-based KL-divergence distillation loss.
A binary weight, indicating whether the teacher's prediction agrees with the ground-truth label, is applied to the traditional logits-based KL-divergence distillation loss.
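A small sketch of GKD under this description; the temperature and the normalization by the number of correctly predicted samples are assumptions:

```python
import torch
import torch.nn.functional as F

def gkd_loss(student_logits, teacher_logits, labels, tau=4.0):
    """Logits KD, gated so only samples the teacher classifies correctly contribute."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * (tau ** 2)  # per-sample KD
    correct = (teacher_logits.argmax(dim=1) == labels).float()             # 0/1 weight
    return (kd * correct).sum() / correct.sum().clamp(min=1)
```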
ICLR 2021:Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective
From a statistical-learning perspective, regularization aims to reduce variance; however, it is unclear how bias and variance change when training with soft labels. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that the bias-variance tradeoff varies from sample to sample during training. Furthermore, under the same distillation temperature setting, we observe that distillation performance is negatively correlated with the number of some specific samples, which are termed regularization samples since they lead to larger bias and smaller variance. Nevertheless, we empirically find that completely removing the regularization samples also degrades distillation performance.
These findings motivate us to propose novel weighted soft labels that help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method.
Weighted soft labels are proposed: only a single weight is added per instance to rectify the standard logits-based knowledge distillation process.
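A sketch of per-instance weighted logits distillation; `sample_weight` is deliberately left as an input because the paper's actual weighting rule (derived from its bias-variance analysis) is not reproduced here:

```python
import torch
import torch.nn.functional as F

def weighted_kd_loss(student_logits, teacher_logits, sample_weight, tau=4.0):
    """Standard logits KD with one scalar weight per instance, shape (N,)."""
    p_t = F.softmax(teacher_logits / tau, dim=1)
    log_p_s = F.log_softmax(student_logits / tau, dim=1)
    kd = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1) * (tau ** 2)  # (N,)
    return (sample_weight * kd).mean()
```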
ICCV 2019:Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation
Proposes a self-distillation method with three loss sources (a combined sketch follows the list below).
Loss Source 1: Cross-entropy loss from labels, applied not only to the deepest classifier but also to all the shallow classifiers. This resembles deep supervision in semantic segmentation: every classifier's prediction is supervised by the ground truth.
Loss Source 2: KL (Kullback-Leibler) divergence loss under the teacher's guidance, computed between the softmax outputs of the deepest classifier (acting as the teacher) and the softmax outputs of each shallow classifier.
Loss Source 3: L2 loss from hints, computed between the feature maps of the deepest classifier (acting as the teacher) and the feature maps of each shallow classifier.
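A combined sketch of the three loss sources; it assumes each shallow feature has already been projected to the same shape as the deepest feature, and the weights `alpha`/`beta` and temperature `tau` are illustrative, not the paper's values:

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(shallow_logits, shallow_feats, deep_logits, deep_feat,
                           labels, tau=3.0, alpha=0.3, beta=0.03):
    """Self-distillation: deepest classifier teaches all shallow classifiers."""
    deep_p = F.softmax(deep_logits.detach() / tau, dim=1)     # teacher soft targets
    loss = F.cross_entropy(deep_logits, labels)               # source 1, deepest classifier
    for logits, feat in zip(shallow_logits, shallow_feats):
        loss = loss + F.cross_entropy(logits, labels)         # source 1, shallow classifier
        log_p = F.log_softmax(logits / tau, dim=1)
        loss = loss + alpha * F.kl_div(log_p, deep_p, reduction="batchmean") * tau ** 2  # source 2
        loss = loss + beta * F.mse_loss(feat, deep_feat.detach())                        # source 3
    return loss
```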
Some data-free knowledge distillation methods, skipped for now:
arXiv 2019: Data-Free Adversarial Distillation
ICCV 2019: Data-Free Learning of Student Networks
CVPR 2021: Learning Student Networks in the Wild
ICCV 2019: A Comprehensive Overhaul of Feature Distillation
We investigate the design aspects of feature distillation methods for network compression and propose a novel feature distillation method.
In this method, the distillation loss is designed to make the following aspects work in synergy: the teacher transform Tt, the student transform Ts, the distillation feature position, and the distance function d.
The proposed distillation loss includes a feature transform with a newly designed margin ReLU, a new distillation feature position, and a partial L2 distance function that skips redundant information harmful to compressing the student.
Different methods are compared in terms of the teacher transform Tt, the student transform Ts, and the distance d.
Using a transform compresses the features and thus loses information; placing the distillation position after the activation function filters out the negative responses and thus loses information; a plain L2 distance transfers this erroneous (negative-region) information, so a partial L2 that skips distillation of the negative-region information is proposed.
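A sketch of the margin ReLU teacher transform and the partial L2 distance as described above; how the per-channel margin is estimated is omitted, and shapes and names are assumptions:

```python
import torch

def margin_relu(feat_t, margin):
    """Teacher transform: clip teacher activations from below by a (negative) per-channel margin.
    margin: (C,) tensor, typically estimated from the teacher's negative pre-ReLU responses."""
    return torch.max(feat_t, margin.view(1, -1, 1, 1))

def partial_l2(feat_s, feat_t):
    """Partial L2: skip positions where the teacher target is negative and the student
    value is already below it, i.e., where there is nothing useful to transfer."""
    diff = feat_s - feat_t
    skip = (feat_s <= feat_t) & (feat_t <= 0)      # positions excluded from distillation
    return (diff.pow(2) * (~skip).float()).sum() / feat_s.numel()
```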