1 Motivations
[Dropout]
• During training, discard each input and hidden unit with probability 0.5.
• During testing, divide weights by 2.
• Approximates the geometric mean of the predictions of an ensemble of models trained with bagging under parameter sharing, where each sub-model is trained for only one step.
• Improves the generalization of most neural networks by about 10%.
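A minimal sketch of the two rules above (train-time masking, test-time weight scaling) for a single fully connected layer; the function name, shapes, and batch size are illustrative choices, not from the paper.

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, b, train=True, p=0.5):
    if train:
        # Training: drop each input unit independently with probability p.
        mask = rng.random(x.shape) >= p
        return (mask * x) @ W + b
    # Testing: keep all units but scale the weights by the keep probability
    # (for p = 0.5 this is the "divide weights by 2" rule).
    return x @ ((1.0 - p) * W) + b

x = rng.normal(size=(8, 32))                 # batch of 8 examples, 32 inputs
W, b = rng.normal(size=(32, 16)), np.zeros(16)
train_out = dropout_layer(x, W, b, train=True)
test_out = dropout_layer(x, W, b, train=False)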
[Leveraging Dropout]
1. Can we enhance the bagging-like nature of dropout training?
• The ideal operating regime for dropout is when the overall training procedure resembles training an ensemble with bagging under parameter sharing constraints.
• Dropout is most effective when taking relatively large steps in parameter space.
• In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set to fit the current input well.
• This differs radically from the ideal SGD in which a single model makes steady progress via small steps.
2. Can we make the approximate model averaging more accurate?
• Dropout model averaging (the weight-scaling rule) is exact for a single-layer softmax model, but it is only an approximation when applied to deep models.
2 Maxout
Maxout networks learn not just the relationship between hidden units, but also the activation function of each hidden unit.
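Concretely, each maxout hidden unit outputs the maximum of k learned affine functions of its input. A minimal numpy sketch of the forward pass; the function name and all shapes are illustrative, not taken from the paper's code.

import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b):
    # x: (batch, d), W: (d, h, k), b: (h, k) -> output: (batch, h)
    z = np.einsum('nd,dhk->nhk', x, W) + b   # k affine maps per hidden unit
    return z.max(axis=-1)                    # learned piecewise linear activation

x = rng.normal(size=(8, 32))                 # batch of 8, 32 inputs
W = rng.normal(size=(32, 16, 4))             # 16 maxout units, k = 4 pieces each
b = rng.normal(size=(16, 4))
h = maxout_layer(x, W, b)                    # shape (8, 16)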
A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function.
An MLP with two maxout hidden units can approximate any continuous function arbitrarily well, since the difference of two piecewise linear convex functions can approximate any continuous function arbitrarily well.
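A toy illustration of the convex-approximation claim (my own example, not the paper's): a single maxout unit whose k pieces are tangent lines of f(x) = x^2 gives a piecewise linear approximation that tightens as k grows.

import numpy as np

def maxout_unit(x, ws, bs):
    # max over k affine pieces w_j * x + b_j
    return np.max(np.outer(x, ws) + bs, axis=1)

xs = np.linspace(-1.0, 1.0, 201)
for k in (2, 4, 16):
    anchors = np.linspace(-1.0, 1.0, k)      # tangent points
    ws, bs = 2.0 * anchors, -anchors ** 2    # tangent of x^2 at a: 2a*x - a^2
    err = np.max(np.abs(maxout_unit(xs, ws, bs) - xs ** 2))
    print(f"k={k:2d}  max |error| = {err:.4f}")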
Changing the dropout mask could frequently change which piece of the piecewise linear function an input is mapped to. In practice, however, maxout trained with dropout changes the identity of the maximal filter in each unit relatively rarely as the dropout mask changes, so the network behaves close to linearly across masks, which keeps the approximate model averaging accurate.
3 Analysis
[More Accurate Approximate Model Averaging]
The W/2 trick is more accurate for maxout than for tanh, since maxout is piecewise linear whereas tanh has significant curvature.
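One toy way to probe this claim numerically (my own sketch, not the paper's experiment): enumerate every dropout mask on the inputs of a tiny one-hidden-layer classifier, form the renormalized geometric mean of the sub-models' softmax outputs, and compare it with the cheap prediction obtained by halving the first-layer weights (implemented here by scaling the input by 1/2), for a tanh layer versus a maxout layer. All sizes and weights are arbitrary.

import itertools
import numpy as np

rng = np.random.default_rng(0)
d, h, k, c = 6, 4, 3, 3            # inputs, hidden units, maxout pieces, classes

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tanh_net(x, W, b, V, vb):
    return softmax(V @ np.tanh(W @ x + b) + vb)

def maxout_net(x, Ws, bs, V, vb):
    # Ws: (k, h, d), bs: (k, h); each hidden unit takes the max of k affine maps.
    z = np.einsum('khd,d->kh', Ws, x) + bs
    return softmax(V @ z.max(axis=0) + vb)

x = rng.normal(size=d)
W, b = rng.normal(size=(h, d)), rng.normal(size=h)
Ws, bs = rng.normal(size=(k, h, d)), rng.normal(size=(k, h))
V, vb = rng.normal(size=(c, h)), rng.normal(size=c)

def geometric_mean_prediction(net):
    # Renormalized geometric mean over all 2^d input dropout masks.
    masks = list(itertools.product([0.0, 1.0], repeat=d))
    log_sum = np.zeros(c)
    for m in masks:
        log_sum += np.log(net(np.array(m) * x))
    g = np.exp(log_sum / len(masks))
    return g / g.sum()

for name, net in [("tanh", lambda v: tanh_net(v, W, b, V, vb)),
                  ("maxout", lambda v: maxout_net(v, Ws, bs, V, vb))]:
    exact = geometric_mean_prediction(net)
    approx = net(0.5 * x)            # the W/2 (weight-scaling) prediction
    print(name, np.abs(exact - approx).max())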
[Optimization Works Better When Using Dropout]
• SGD + rectifier non-linearity: about 5% of the activations are 0.
• Dropout + rectifier non-linearity: about 60% of the activations are 0, and roughly 40% of the filters per layer go unused.
• Dropout + maxout: 99.9% of the filters are used.
• Maxout units can never die: the gradient always flows through every maxout unit (see the sketch below).
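A small illustration of the last point (my own sketch, not from the paper): the max operation routes a gradient of 1 to exactly one filter per unit for every example, and which filter wins varies across examples, whereas a rectifier unit whose pre-activation is negative passes no gradient at all and can stay dead.

import numpy as np

rng = np.random.default_rng(0)

z = rng.normal(size=(1000, 5))                  # 1000 examples, k = 5 pieces
winner = z.argmax(axis=1)                       # which piece gets the gradient
print("fraction of examples routing gradient to each piece:",
      np.bincount(winner, minlength=5) / len(z))

relu_pre = rng.normal(loc=-2.0, size=1000)      # a badly initialized rectifier
print("fraction of examples where the rectifier passes any gradient:",
      np.mean(relu_pre > 0))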