1 Motivations
[Dropout]
• During training, discard each input and hidden unit with probability 0.5.
• During testing, divide weights by 2.
• Approximates the geometric mean of the predictions of an ensemble of models trained with bagging under parameter sharing, where each sub-model is trained for only one step.
• Improves the generalization of most neural networks by about 10%.
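A minimal sketch of the two rules above (train-time masking, test-time weight scaling) for a single fully connected layer; the function name, shapes, and batch size are illustrative choices, not from the paper.

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, b, train=True, p=0.5):
    if train:
        # Training: drop each input unit independently with probability p.
        mask = rng.random(x.shape) >= p
        return (mask * x) @ W + b
    # Testing: keep all units but scale the weights by the keep probability
    # (for p = 0.5 this is the "divide weights by 2" rule).
    return x @ ((1.0 - p) * W) + b

x = rng.normal(size=(8, 32))                 # batch of 8 examples, 32 inputs
W, b = rng.normal(size=(32, 16)), np.zeros(16)
train_out = dropout_layer(x, W, b, train=True)
test_out = dropout_layer(x, W, b, train=False)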
[Leveraging Dropout]
1. Can we enhance the bagging-like nature of dropout training?
• The ideal operating regime for dropout is when the overall training procedure resembles training an ensemble with bagging under parameter sharing constraints.
• Dropout is most effective when taking relatively large steps in parameter space.
• In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set to fit the current input well.
• This differs radically from the ideal SGD in which a single model makes steady progress via small steps.
2. Can we make the approximate model averaging more accurate?
• Dropout model averaging (the weight-scaling rule) is exact for a single-layer softmax model, but it is only an approximation when applied to deep models.
2 Maxout
Maxout networks learn not just the relationship between hidden units, but also the activation function of each hidden unit.
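Concretely, each maxout hidden unit outputs the maximum of k learned affine functions of its input. A minimal numpy sketch of the forward pass; the function name and all shapes are illustrative, not taken from the paper's code.

import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(x, W, b):
    # x: (batch, d), W: (d, h, k), b: (h, k) -> output: (batch, h)
    z = np.einsum('nd,dhk->nhk', x, W) + b   # k affine maps per hidden unit
    return z.max(axis=-1)                    # learned piecewise linear activation

x = rng.normal(size=(8, 32))                 # batch of 8, 32 inputs
W = rng.normal(size=(32, 16, 4))             # 16 maxout units, k = 4 pieces each
b = rng.normal(size=(16, 4))
h = maxout_layer(x, W, b)                    # shape (8, 16)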
A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function.
An MLP with two maxout hidden units can approximate any continuous function arbitrarily well, since the difference of two piecewise linear convex functions can approximate any continuous function arbitrarily well.
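A toy illustration of the convex-approximation claim (my own example, not the paper's): a single maxout unit whose k pieces are tangent lines of f(x) = x^2 gives a piecewise linear approximation that tightens as k grows.

import numpy as np

def maxout_unit(x, ws, bs):
    # max over k affine pieces w_j * x + b_j
    return np.max(np.outer(x, ws) + bs, axis=1)

xs = np.linspace(-1.0, 1.0, 201)
for k in (2, 4, 16):
    anchors = np.linspace(-1.0, 1.0, k)      # tangent points
    ws, bs = 2.0 * anchors, -anchors ** 2    # tangent of x^2 at a: 2a*x - a^2
    err = np.max(np.abs(maxout_unit(xs, ws, bs) - xs ** 2))
    print(f"k={k:2d}  max |error| = {err:.4f}")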
Changing the dropout mask could frequently change which piece of the piecewise linear function an input is mapped to. In practice, however, maxout trained with dropout changes the identity of the maximal filter in each unit relatively rarely as the dropout mask changes, so the network behaves close to linearly across masks, which keeps the approximate model averaging accurate.
3 Analysis
[More Accurate Approximate Model Averaging]
The W/2 trick is more accurate for maxout than for tanh, since maxout is piecewise linear whereas tanh has significant curvature.
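One toy way to probe this claim numerically (my own sketch, not the paper's experiment): enumerate every dropout mask on the inputs of a tiny one-hidden-layer classifier, form the renormalized geometric mean of the sub-models' softmax outputs, and compare it with the cheap prediction obtained by halving the first-layer weights (implemented here by scaling the input by 1/2), for a tanh layer versus a maxout layer. All sizes and weights are arbitrary.

import itertools
import numpy as np

rng = np.random.default_rng(0)
d, h, k, c = 6, 4, 3, 3            # inputs, hidden units, maxout pieces, classes

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def tanh_net(x, W, b, V, vb):
    return softmax(V @ np.tanh(W @ x + b) + vb)

def maxout_net(x, Ws, bs, V, vb):
    # Ws: (k, h, d), bs: (k, h); each hidden unit takes the max of k affine maps.
    z = np.einsum('khd,d->kh', Ws, x) + bs
    return softmax(V @ z.max(axis=0) + vb)

x = rng.normal(size=d)
W, b = rng.normal(size=(h, d)), rng.normal(size=h)
Ws, bs = rng.normal(size=(k, h, d)), rng.normal(size=(k, h))
V, vb = rng.normal(size=(c, h)), rng.normal(size=c)

def geometric_mean_prediction(net):
    # Renormalized geometric mean over all 2^d input dropout masks.
    masks = list(itertools.product([0.0, 1.0], repeat=d))
    log_sum = np.zeros(c)
    for m in masks:
        log_sum += np.log(net(np.array(m) * x))
    g = np.exp(log_sum / len(masks))
    return g / g.sum()

for name, net in [("tanh", lambda v: tanh_net(v, W, b, V, vb)),
                  ("maxout", lambda v: maxout_net(v, Ws, bs, V, vb))]:
    exact = geometric_mean_prediction(net)
    approx = net(0.5 * x)            # the W/2 (weight-scaling) prediction
    print(name, np.abs(exact - approx).max())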
[Optimization Works Better When Using Dropout]
• SGD + rectifier non-linearity: about 5% of the activations are 0.
• Dropout + rectifier non-linearity: about 60% of the activations are 0, and roughly 40% of the filters per layer go unused.
• Dropout + maxout: 99.9% of the filters are used.
• Maxout units can never die: the gradient always flows through every maxout unit (see the sketch below).
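A small illustration of the last point (my own sketch, not from the paper): the max operation routes a gradient of 1 to exactly one filter per unit for every example, and which filter wins varies across examples, whereas a rectifier unit whose pre-activation is negative passes no gradient at all and can stay dead.

import numpy as np

rng = np.random.default_rng(0)

z = rng.normal(size=(1000, 5))                  # 1000 examples, k = 5 pieces
winner = z.argmax(axis=1)                       # which piece gets the gradient
print("fraction of examples routing gradient to each piece:",
      np.bincount(winner, minlength=5) / len(z))

relu_pre = rng.normal(loc=-2.0, size=1000)      # a badly initialized rectifier
print("fraction of examples where the rectifier passes any gradient:",
      np.mean(relu_pre > 0))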