This paper applies Bayesian optimization (BO) to model compression, proposing an optimization framework in which the concrete compression method can be plugged in flexibly. The hyperparameter $\theta$ controls how small the final network is: for example, when pruning is used it is a threshold, and when SVD is used it is a rank.
The question the paper addresses is how to choose the compression hyperparameters $\theta$. Following the BO recipe, we need to specify an objective function and an acquisition function.
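To make the setup concrete, here is a minimal BO loop in the spirit of the framework. It is a sketch, not the paper's implementation: the helper `evaluate_J(theta)` (compress the network with $\theta$ and return the objective value) is hypothetical, the search space is assumed to be $[0,1]^d$, and a GP surrogate with expected improvement is used as a common default rather than the paper's exact acquisition choice.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(candidates, gp, best_y):
    # EI for maximizing J(theta) under the GP surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt_compression(evaluate_J, dim, n_init=5, n_iter=20, seed=0):
    """evaluate_J(theta) -> scalar objective to maximize; theta assumed in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_init, dim))           # initial random hyperparameters
    y = np.array([evaluate_J(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                              # refit the surrogate on all evaluations so far
        cand = rng.uniform(size=(1024, dim))      # random candidate pool
        ei = expected_improvement(cand, gp, y.max())
        x_next = cand[np.argmax(ei)]              # most promising theta under the acquisition
        y_next = evaluate_J(x_next)               # expensive step: compress + evaluate
        X = np.vstack([X, x_next])
        y = np.append(y, y_next)
    return X[np.argmax(y)], y.max()
```

The objective to plug into `evaluate_J` is defined next, and the acquisition function is what selects the next $\theta$ to try.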
Objective function
The objective function has to account for two things: 1. the quality of the compressed network, measured either by its performance $\mathcal{Q}(\tilde{f}_\theta)$ or by its fidelity to the original network $\mathcal{L}(\tilde{f}_\theta, f^*)$; 2. the size of the obtained network, captured by the compression ratio $R(\tilde{f}_\theta, f^*)$. The optimization problem can then be written as:
$$\arg\max_\theta \underbrace{\big(\gamma\,\mathcal{Q}(\tilde{f}_\theta) + R(\tilde{f}_\theta, f^*)\big)^{-1}}_{J_Q(\theta)} \qquad \text{or} \qquad \arg\min_\theta \underbrace{\big(\kappa\,\mathcal{L}(\tilde{f}_\theta, f^*) + R(\tilde{f}_\theta, f^*)\big)}_{J_L(\theta)}$$
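For instance, $J_L(\theta)$ could be evaluated roughly as below. This is a sketch under stated assumptions: `compress` and `distillation_loss` are hypothetical helpers (the latter being the fidelity term defined next), and $R$ is taken here as the ratio of compressed to original parameter counts, which is one common choice rather than necessarily the paper's exact definition.

```python
import torch.nn as nn

def param_count(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def J_L(theta, teacher: nn.Module, compress, distillation_loss, kappa: float = 1.0):
    """Smaller is better: kappa * L(f_tilde, f*) + R(f_tilde, f*)."""
    student = compress(teacher, theta)               # apply pruning / SVD / etc. with hyperparameters theta
    L = distillation_loss(student, teacher)          # fidelity term, e.g. the L2 function norm below
    R = param_count(student) / param_count(teacher)  # size term: fraction of parameters kept
    return kappa * L + R
```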
Here the paper uses the knowledge-distillation objective:
$$\mathcal{L}(\tilde{f}_\theta, f^*) := \mathbb{E}_{x \sim P}\!\left(\|\tilde{f}_\theta(x) - f^*(x)\|_2^2\right) = \|f^* - \tilde{f}_\theta\|_2^2$$
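In practice the expectation over $P$ has to be approximated from data; a minimal Monte-Carlo estimate over unlabeled inputs might look like the following sketch (assuming a loader that yields input tensors only, since no labels are needed):

```python
import torch

@torch.no_grad()
def function_norm_sq(student, teacher, unlabeled_loader, device="cpu"):
    """Estimate E_{x~P} ||f_tilde(x) - f*(x)||_2^2 from unlabeled data."""
    student.eval(); teacher.eval()
    total, n = 0.0, 0
    for x in unlabeled_loader:                  # unlabeled batches: inputs only
        x = x.to(device)
        diff = student(x) - teacher(x)          # difference of output vectors (e.g. logits)
        total += diff.pow(2).sum(dim=1).sum().item()
        n += x.size(0)
    return total / n                            # average squared L2 distance per example
```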
Acquisition function
Experiments
Comparison of different model selection methods on Resnet18
Knowledge distillation as a proxy for risk
A natural question is whether the knowledge-distillation objective (the $L_2$ function norm above) is actually a good surrogate for network compression. The experiments show that using the function norm performs comparably to using the top-1 error rate.
Compression of VGG-16
In this section, we demonstrate that our method finds compression parameters that compare favorably to state-of-the-art compression results reported on VGG-16 [10]. We first apply our method to compress convolutional layers of VGG-16 using tensor decomposition, which has 13 parameters. After that, we fine-tune the compressed model for 5 epochs, using Stochastic Gradient Descent (SGD) with momentum 0.9 and learning rate 1e-4, decreased by a factor of 10 every epoch. Second, we apply another pass of our algorithm to compress the fully-connected layers of the fine-tuned model using SVD, which has 3 parameters. A single optimization takes approximately 10 minutes. Again, after the compression, we fine-tune the compressed model…
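As an illustration of the second pass, a truncated-SVD factorization of one fully-connected layer and the fine-tuning schedule described above (SGD, momentum 0.9, lr 1e-4, decayed by 10x every epoch) could be set up as follows. The function names are illustrative, not the authors' code, and the rank `k` plays the role of one of the 3 SVD hyperparameters.

```python
import torch
import torch.nn as nn

def svd_compress_linear(layer: nn.Linear, k: int) -> nn.Sequential:
    """Replace a Linear layer by a rank-k factorization W ~= (U_k S_k) V_k^T."""
    W = layer.weight.data                            # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(layer.in_features, k, bias=False)
    second = nn.Linear(k, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh[:k, :]                    # (k, in_features)
    second.weight.data = U[:, :k] * S[:k]            # (out_features, k), columns scaled by singular values
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

def make_finetune_optimizer(model: nn.Module):
    # Matches the schedule in the text: momentum 0.9, lr 1e-4, x0.1 every epoch.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
    return optimizer, scheduler
```

Replacing a layer with the returned two-layer factorization keeps its input/output shapes, so the rest of the network is untouched; only the parameter count changes with `k`.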
Conclusion
In this work, we have developed a principled, fast, and flexible framework for optimizing neural network compression parameters…