Paper download
PyTorch reproduction code
bib:
@INPROCEEDINGS{Hinton2014Distilling,
  title     = {Distilling the Knowledge in a Neural Network},
  author    = {Geoffrey E. Hinton and Oriol Vinyals and Jeffrey Dean},
  booktitle = {NIPS},
  year      = {2014},
  pages     = {1--9}
}
1. Abstract
- A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions.
- Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets.
- Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy, and we develop this approach further using a different compression technique.
- We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model.
- We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
- Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
Notes:
- The first sentence introduces ensembling, then points out its drawback (it is too cumbersome to deploy), and then proposes model compression as the remedy. This narrative flow is smooth and worth learning from.
- The paper's contribution: building on earlier model-compression work, it proposes a different compression technique and also introduces a new way of constructing ensembles.
2. Algorithm Description
2.1. Soft targets
$$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \tag{1}$$
When $T = 1$, Eq. (1) is just the softmax function, returning the probability of each class. The soft targets proposed in this paper are the probabilities computed with a larger value of $T$. The original paper argues that such soft targets carry more inter-class information. An intuitive example: for a picture of a cat, the model might output 80% cat, 19% dog, and 1% person, from which we can tell that cats are similar to dogs and not very similar to people.
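The temperature-softmax of Eq. (1) can be sketched in a few lines of NumPy (the function name and example logits are my own, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Eq. (1): softmax over logits z_i divided by temperature T.
    T = 1 recovers the ordinary softmax; larger T softens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical cat/dog/person logits: at T = 1 the output is sharp;
# at T = 5 the dog probability grows, exposing inter-class similarity.
p_sharp = softmax_with_temperature([5.0, 3.0, -2.0], T=1.0)
p_soft  = softmax_with_temperature([5.0, 3.0, -2.0], T=5.0)
```

Raising $T$ compresses the gaps between logits, so the smaller probabilities grow and reveal which wrong classes the model considers plausible.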
2.2. simplest form of distillation
In this form, the hard targets from the dataset are not used at all.
$$\mathcal{L}_{\text{soft}} = -\sum_i^K p_i \log(q_i) \tag{2}$$
- $K$ is the total number of classes;
- $p_i$ is the teacher model's soft-target probability, and $q_i$ is the student model's soft-target probability.
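As a sketch, Eq. (2) is simply a cross-entropy between the teacher's and the student's soft distributions (NumPy; the names and probability values are illustrative):

```python
import numpy as np

def soft_cross_entropy(p, q, eps=1e-12):
    """Eq. (2): cross-entropy of student soft probabilities q
    against teacher soft targets p, summed over the K classes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))   # eps guards against log(0)

teacher = np.array([0.80, 0.19, 0.01])    # hypothetical cat/dog/person targets
student = np.array([0.70, 0.25, 0.05])
loss = soft_cross_entropy(teacher, student)
```

By Gibbs' inequality this loss is minimized when the student's distribution matches the teacher's exactly.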
2.3. better way
$$\mathcal{L} = (1-\alpha)\mathcal{L}_{\text{soft}} + \alpha \mathcal{L}_{\text{hard}} \tag{3}$$
where
$$\mathcal{L}_{\text{hard}} = -\sum_i^K y_i \log(q_i) \tag{4}$$
- $y_i$ is the ground-truth label of the corresponding sample.
Notice:
- In the original paper, $\alpha$ is set to a small value.
- Because computing $\mathcal{L}_{\text{soft}}$ divides the logits by $T$, the gradients contributed by the soft targets are scaled down by a factor of $T^2$. $\mathcal{L}_{\text{soft}}$ therefore has to be multiplied by $T^2$ before backpropagation, i.e. Eq. (2) should actually be
$$\mathcal{L}_{\text{soft}} = -T^2 \sum_i^K p_i \log(q_i) \tag{5}$$
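Putting Eqs. (3)–(5) together, the full distillation loss can be sketched as follows (NumPy; the function names, logits, and hyperparameter values are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, y_onehot,
                      T=4.0, alpha=0.1, eps=1e-12):
    """Eq. (3) combining the T^2-scaled soft term of Eq. (5)
    with the hard term of Eq. (4); alpha is kept small so the
    soft targets dominate."""
    p   = softmax(teacher_logits, T)        # teacher soft targets
    q_T = softmax(student_logits, T)        # student at the same temperature
    q_1 = softmax(student_logits, 1.0)      # student at T = 1 for the hard loss
    l_soft = -T**2 * np.sum(p * np.log(q_T + eps))              # Eq. (5)
    l_hard = -np.sum(np.asarray(y_onehot) * np.log(q_1 + eps))  # Eq. (4)
    return (1 - alpha) * l_soft + alpha * l_hard

teacher_z = np.array([5.0, 3.0, -2.0])
student_z = np.array([4.0, 3.5, -1.0])
y = np.array([1.0, 0.0, 0.0])
total = distillation_loss(student_z, teacher_z, y)
```

The $T^2$ factor keeps the relative magnitudes of the soft and hard gradients roughly constant as the temperature is varied.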
The paper proposes a method for compressing the knowledge of multiple models into a single model via knowledge distillation, improving predictive performance while addressing the complexity and computational cost of deploying large neural networks. The experiments achieve significant results on MNIST and improve the acoustic model of a heavily used commercial system. In addition, the paper introduces a new ensemble structure that combines full models with specialist models for fine-grained classification; the specialists can be trained quickly and in parallel.