Paper link: https://arxiv.org/pdf/1707.01209.pdf
Part I: general framework.
We give a general formulation of model compression as constrained optimization.
Related work.
Four categories of model compression.
- Direct learning: $\min_\Theta L(h(x; \Theta))$: find the small model with the best loss, regardless of the reference.
- Direct compression (DC): $\min_\Theta \|w - \Delta(\Theta)\|^2$: find the closest approximation to the parameters of the reference model.
- Model compression as constrained optimization: $\min_{w,\Theta} L(w)$ s.t. $w = \Delta(\Theta)$: it forces $h$ and $f$ to be models of the same type, by constraining the weights $w$ to be constructed from a low-dimensional parameterization $w = \Delta(\Theta)$, while still requiring the loss $L$ to be optimized (a sketch contrasting this with DC follows the list).
- Teacher-student: $\min_\Theta \int_{\mathcal{X}} p(x) \|f(x; w) - h(x; \Theta)\|^2 \, dx$: find the closest approximation $h$ to the reference function $f$, in some norm.
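To make the contrast concrete, here is a minimal numpy sketch (not from the paper) using a low-rank decomposition $\Delta(\Theta) = U V^T$ as the compression. Direct compression only projects the reference weights onto the low-rank set via a truncated SVD and never consults the loss; the constrained formulation would instead keep optimizing $L$ while enforcing $w = \Delta(\Theta)$. The reference matrix `W_ref` and rank `r` are illustrative assumptions.

```python
import numpy as np

def direct_compression_lowrank(W_ref, r):
    """Direct compression (DC): argmin_{U,V} ||W_ref - U @ V.T||_F^2,
    solved in closed form by the rank-r truncated SVD of W_ref.
    The task loss L is never consulted, only closeness in weight space."""
    U, s, Vt = np.linalg.svd(W_ref, full_matrices=False)
    U_r = U[:, :r] * s[:r]   # absorb singular values into U
    V_r = Vt[:r, :].T
    return U_r, V_r

# Illustrative reference weights (stand-in for a trained layer).
rng = np.random.default_rng(0)
W_ref = rng.standard_normal((256, 128))

U_r, V_r = direct_compression_lowrank(W_ref, r=16)
W_dc = U_r @ V_r.T
print("DC approximation error:", np.linalg.norm(W_ref - W_dc))

# The constrained-optimization view instead asks for
#   min_{U,V} L(h(x; U @ V.T)),
# i.e. the decompressed weights must also minimize the task loss,
# not merely stay close to W_ref.
```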
A constrained optimization formulation.

Types of compression
A “Learning-Compression” (LC) algorithm
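The heading refers to the paper's alternating scheme: an L (learning) step that trains the reference-type weights with a quadratic penalty pulling them toward the current compressed weights, and a C (compression) step that projects the current weights onto the feasible set, while the penalty parameter $\mu$ is driven up along a path. Below is a hedged, toy sketch of the quadratic-penalty version; the least-squares loss, low-rank compression, and hyperparameters are illustrative stand-ins, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task: least-squares loss L(W) = ||Y - X @ W||_F^2
# (illustrative stand-in for a neural-net loss).
X = rng.standard_normal((500, 64))
W_true = rng.standard_normal((64, 32))
Y = X @ W_true

def compress(W, r):
    """C step: Theta = argmin_Theta ||W - Delta(Theta)||^2.
    With Delta(Theta) = U @ V.T of rank r, the projection is a truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :].T

def l_step(W, W_compressed, mu, lr=5e-4, iters=300):
    """L step: approximately minimize L(W) + (mu/2) * ||W - Delta(Theta)||^2
    by gradient descent on the penalized loss."""
    for _ in range(iters):
        grad = 2 * X.T @ (X @ W - Y) + mu * (W - W_compressed)
        W = W - lr * grad
    return W

# LC algorithm: alternate L and C steps while increasing mu, following a path
# from the direct-compression solution (mu -> 0+) toward weights that satisfy
# w = Delta(Theta) (mu -> infinity).
W = np.linalg.lstsq(X, Y, rcond=None)[0]   # well-trained reference model
r = 4
U_r, V_r = compress(W, r)                  # mu -> 0+: direct compression
for mu in [1e-2, 1e-1, 1, 10, 100]:
    W = l_step(W, U_r @ V_r.T, mu)         # L step
    U_r, V_r = compress(W, r)              # C step
print("final loss:", np.linalg.norm(Y - X @ (U_r @ V_r.T)) ** 2)
```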
Direct compression (DC) and the beginning of the path
Compression, generalization and model selection
Compression
Compression can also be seen as a way to prevent overfitting, since it aims at obtaining a smaller model with a similar loss to that of a well-trained reference model.
Generalization
If the reference model was not trained well enough, the continued training that happens while compressing can further reduce the error, so compression may even improve generalization.
Model selection
A good approximate strategy for model selection in neural nets is to train a large enough reference model and compress it as much as possible.