This paper applies Bayesian optimization (BO) to model compression, proposing an optimization framework in which the concrete compression method can be plugged in flexibly. The hyperparameter $\theta$ controls how small the final network is: for example, when pruning is used it is a threshold, and when SVD is used it is a rank.
The question the paper addresses is how to choose the compression hyperparameters $\theta$. Following the BO recipe, we need to specify an objective function and an acquisition function.
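To make the setup concrete, here is a minimal BO loop in the spirit of the framework. It is a sketch, not the paper's implementation: the helper `evaluate_J(theta)` (compress the network with $\theta$ and return the objective value) is hypothetical, the search space is assumed to be $[0,1]^d$, and a GP surrogate with expected improvement is used as a common default rather than the paper's exact acquisition choice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(candidates, gp, best_y):
    # EI for maximizing J(theta) under the GP surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt_compression(evaluate_J, dim, n_init=5, n_iter=20, seed=0):
    """evaluate_J(theta) -> scalar objective to maximize; theta assumed in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_init, dim))           # initial random hyperparameters
    y = np.array([evaluate_J(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                              # refit the surrogate on all evaluations so far
        cand = rng.uniform(size=(1024, dim))      # random candidate pool
        ei = expected_improvement(cand, gp, y.max())
        x_next = cand[np.argmax(ei)]              # most promising theta under the acquisition
        y_next = evaluate_J(x_next)               # expensive step: compress + evaluate
        X = np.vstack([X, x_next])
        y = np.append(y, y_next)
    return X[np.argmax(y)], y.max()
```

The objective to plug into `evaluate_J` is defined next, and the acquisition function is what selects the next $\theta$ to try.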
Objective function
The objective function has to account for two things: 1. the quality of the compressed network, measured either by its performance $\mathcal{Q}(\tilde{f}_\theta)$ or by its fidelity to the original network $\mathcal{L}(\tilde{f}_\theta, f^*)$; 2. the size of the obtained network, captured by the compression ratio $R(\tilde{f}_\theta, f^*)$. The optimization problem can then be written as:
$$\arg\max_\theta \underbrace{\big(\gamma\,\mathcal{Q}(\tilde{f}_\theta) + R(\tilde{f}_\theta, f^*)\big)^{-1}}_{J_Q(\theta)} \qquad \text{or} \qquad \arg\min_\theta \underbrace{\big(\kappa\,\mathcal{L}(\tilde{f}_\theta, f^*) + R(\tilde{f}_\theta, f^*)\big)}_{J_L(\theta)}$$
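For instance, $J_L(\theta)$ could be evaluated roughly as below. This is a sketch under stated assumptions: `compress` and `distillation_loss` are hypothetical helpers (the latter being the fidelity term defined next), and $R$ is taken here as the ratio of compressed to original parameter counts, which is one common choice rather than necessarily the paper's exact definition.

```python
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def J_L(theta, teacher: nn.Module, compress, distillation_loss, kappa: float = 1.0):
    """Smaller is better: kappa * L(f_tilde, f*) + R(f_tilde, f*)."""
    student = compress(teacher, theta)               # apply pruning / SVD / etc. with hyperparameters theta
    L = distillation_loss(student, teacher)          # fidelity term, e.g. the L2 function norm below
    R = param_count(student) / param_count(teacher)  # size term: fraction of parameters kept
    return kappa * L + R
```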
Here the paper uses the knowledge-distillation objective:
$$\mathcal{L}(\tilde{f}_\theta, f^*) := \mathbb{E}_{x \sim P}\!\left(\|\tilde{f}_\theta(x) - f^*(x)\|_2^2\right) = \|f^* - \tilde{f}_\theta\|_2^2$$
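In practice the expectation over $P$ has to be approximated from data; a minimal Monte-Carlo estimate over unlabeled inputs might look like the following sketch (assuming a loader that yields input tensors only, since no labels are needed):

```python
import torch

@torch.no_grad()
def function_norm_sq(student, teacher, unlabeled_loader, device="cpu"):
    """Estimate E_{x~P} ||f_tilde(x) - f*(x)||_2^2 from unlabeled data."""
    student.eval(); teacher.eval()
    total, n = 0.0, 0
    for x in unlabeled_loader:                  # unlabeled batches: inputs only
        x = x.to(device)
        diff = student(x) - teacher(x)          # difference of output vectors (e.g. logits)
        total += diff.pow(2).sum(dim=1).sum().item()
        n += x.size(0)
    return total / n                            # average squared L2 distance per example
```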
Acquisition function
Experiments
Comparison of different model selection methods on Resnet18
Knowledge distillation as a proxy for risk
A natural question is whether the knowledge-distillation objective (the $L_2$ function norm above) is actually a good surrogate for network compression. The experiments show that using the function norm performs comparably to using the top-1 error rate.
Compression of VGG-16
In this section, we demonstrate that our method finds compression parameters that compare favorably to state-of-the-art compression results reported on VGG-16 [10]. We first apply our method to compress convolutional layers of VGG-16 using tensor decomposition, which has 13 parameters. After that, we fine-tune the compressed model for 5 epochs, using Stochastic Gradient Descent (SGD) with momentum 0.9 and learning rate 1e-4, decreased by a factor of 10 every epoch. Second, we apply another pass of our algorithm to compress the fully-connected layers of the fine-tuned model using SVD, which has 3 parameters. A single optimization takes approximately 10 minutes. Again, after the compression, we fine-tune the compressed model…
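As an illustration of the second pass, a truncated-SVD factorization of one fully-connected layer and the fine-tuning schedule described above (SGD, momentum 0.9, lr 1e-4, decayed by 10x every epoch) could be set up as follows. The function names are illustrative, not the authors' code, and the rank `k` plays the role of one of the 3 SVD hyperparameters.

```python
import torch
import torch.nn as nn

def svd_compress_linear(layer: nn.Linear, k: int) -> nn.Sequential:
    """Replace a Linear layer by a rank-k factorization W ~= (U_k S_k) V_k^T."""
    W = layer.weight.data                            # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, k, bias=False)
    second = nn.Linear(k, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:k, :]                    # (k, in_features)
    second.weight.data = U[:, :k] * S[:k]            # (out_features, k), columns scaled by singular values
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

def make_finetune_optimizer(model: nn.Module):
    # Matches the schedule in the text: momentum 0.9, lr 1e-4, x0.1 every epoch.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    return optimizer, scheduler
```

Replacing a layer with the returned two-layer factorization keeps its input/output shapes, so the rest of the network is untouched; only the parameter count changes with `k`.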
Conclusion
In this work, we have developed a principled, fast, and flexible framework for optimizing neural network compression parameters…