[Deep Learning Paper Notes][Image Classification] Maxout Networks

Maxout networks learn not only the relationships between hidden units but also the activation function of each hidden unit. When training with maxout units, almost all of the filters are used, maxout units never die, and the gradient always flows through every maxout unit. Combined with dropout, this improves the model's generalization ability.


Goodfellow, Ian J., et al. "Maxout networks." ICML (3) 28 (2013): 1319-1327. (Citations: 547).


1 Motivations

[Dropout]
• During training, discard each input and hidden unit with probability 0.5.
• During testing, divide weights by 2 (a minimal sketch of both steps follows this list).
• Approximation to geometric mean of predictions of an ensemble of different models trained with bagging, where each model is trained with only one iteration.

• Improves the generalization ability of most neural networks by about 10%.
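
The following is a minimal NumPy sketch of the train/test procedure described above; the layer sizes, the ReLU non-linearity, and the concrete tensors are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 8 inputs -> 4 hidden units (sizes chosen only for illustration).
W = rng.normal(size=(8, 4))
b = np.zeros(4)
x = rng.normal(size=8)

# Training-time forward pass: drop each input unit with probability 0.5.
p = 0.5
mask = rng.random(8) >= p                       # keep a unit where mask is True
h_train = np.maximum(0.0, (x * mask) @ W + b)

# Test-time forward pass: keep all units but divide the weights by 2.
h_test = np.maximum(0.0, x @ (W / 2.0) + b)
```

The test-time halving is motivated by the fact that, with a keep probability of 0.5, the expected value of the masked input x * mask is x / 2, so a single forward pass with W / 2 matches the average pre-activation seen during training.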


[Leveraging Dropout] 

1. Can we enhance the bagging-like nature of dropout training?

• The ideal operating regime for dropout is when the overall training procedure resembles training an ensemble with bagging under parameter sharing constraints.
• Dropout is most effective when taking relatively large steps in parameter space.
• In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set to fit the current input well.
• This differs radically from the ideal SGD in which a single model makes steady progress via small steps.
2. Can we make the approximate model averaging more accurate?
• Dropout model averaging is only an approximation when applied to deep models; for a single softmax layer it is exact, as the check below illustrates.
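
For a single softmax layer, the weight-scaling rule is not an approximation at all: the renormalized geometric mean of the predictions of all 2^d dropout sub-models equals one forward pass with W/2. The small enumeration below checks this numerically; the dimensions are toy values chosen only so that all 2^d masks can be listed exhaustively.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
d, n_classes = 6, 3                      # small enough to enumerate all 2^d masks
W = rng.normal(size=(d, n_classes))
b = rng.normal(size=n_classes)
x = rng.normal(size=d)

# Geometric mean of the predictions of all 2^d sub-models (one per dropout mask).
log_probs = []
for bits in itertools.product([0.0, 1.0], repeat=d):
    m = np.array(bits)
    log_probs.append(np.log(softmax((x * m) @ W + b)))
geo = np.exp(np.mean(log_probs, axis=0))
geo /= geo.sum()                         # renormalize the geometric mean

# Weight-scaling prediction: a single forward pass with W / 2.
scaled = softmax(x @ (W / 2.0) + b)

print(np.max(np.abs(geo - scaled)))      # ~1e-16: exact for a single softmax layer
```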


2 Maxout

Maxout networks learn not just the relationship between hidden units, but also the activation function of each hidden unit. 
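
In the paper, a maxout hidden unit computes h_i(x) = max_{j <= k} z_ij with z_ij = x^T W[:, i, j] + b[i, j], so the learned "activation function" is whatever the max over the k affine filters turns out to be. Below is a minimal NumPy forward pass; the layer sizes are illustrative assumptions.

```python
import numpy as np

def maxout_forward(x, W, b):
    """Maxout layer: h_i(x) = max_j z_ij with z_ij = x^T W[:, i, j] + b[i, j].

    x: (d_in,)        input vector
    W: (d_in, m, k)   k affine filters for each of the m maxout units
    b: (m, k)
    returns h: (m,)   one activation per maxout unit
    """
    z = np.einsum('d,dmk->mk', x, W) + b   # pre-activations of all m*k filters
    return z.max(axis=1)                   # max over the k filters of each unit

rng = np.random.default_rng(2)
d_in, m, k = 10, 4, 5                      # illustrative sizes
h = maxout_forward(rng.normal(size=d_in),
                   rng.normal(size=(d_in, m, k)),
                   rng.normal(size=(m, k)))
print(h.shape)                             # (4,)
```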


A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function.
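
One way to see this: if the k affine filters of a unit are chosen as tangent lines of a convex target, the unit's max is a piecewise linear lower bound that tightens as k grows. Below is a toy check against f(x) = x^2; the target function, the tangent points, and k are my own illustrative choices.

```python
import numpy as np

# Tangent line of x**2 at point a: y = 2*a*x - a**2 (slope 2a, intercept -a**2).
anchors = np.linspace(-2.0, 2.0, 5)            # tangent points, k = 5
w = 2.0 * anchors                              # slope of each affine piece
b = -anchors ** 2                              # intercept of each affine piece

x = np.linspace(-2.0, 2.0, 9)
maxout = np.max(np.outer(x, w) + b, axis=1)    # max over the 5 affine pieces
print(np.max(np.abs(maxout - x ** 2)))         # 0.25 here; shrinks as k grows
```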


An MLP with two hidden maxout units can approximate any continuous function, since the difference of two arbitrary convex functions can approximate any continuous function.
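
A worked toy example of this difference-of-convex argument (my own construction, not from the paper): the hat function max(0, 1 - |x|) is continuous and non-convex, yet it is exactly the difference of two maxout units, h1(x) = max(0, x + 1, 2x) and h2(x) = max(0, 2x).

```python
import numpy as np

# Each maxout unit is a max over a few affine pieces, hence convex and
# piecewise linear; their difference reproduces the non-convex hat function.
x = np.linspace(-2.0, 2.0, 401)
h1 = np.max(np.stack([np.zeros_like(x), x + 1.0, 2.0 * x]), axis=0)
h2 = np.maximum(0.0, 2.0 * x)
hat = np.maximum(0.0, 1.0 - np.abs(x))
print(np.max(np.abs((h1 - h2) - hat)))   # ~0 up to rounding: the match is exact
```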


Changing the dropout mask could, in principle, frequently change which piece of the piecewise linear function an input is mapped to. However, maxout trained with dropout tends to change the identity of the maximal filter in each unit only relatively rarely as the dropout mask changes, so the model stays locally linear across masks and the approximate model averaging becomes more accurate.

3 Analysis

[More Accurate Approximate Model Averaging] The W/2 trick is more accurate for maxout than for tanh, since maxout is piecewise linear while tanh has significant curvature.


[Optimization Works Better When Using Dropout]
• SGD + non-linearity: 5% of the non-linearity's activations are 0.
• Dropout + non-linearity: 60% of the non-linearity's activations are 0, and 40% of the filters in each layer go unused.
• Dropout + maxout: 99.9% of the filters are used.
• Maxout units can never die; the gradient always flows through every maxout unit.

4 References

[1] ICML talk: http://techtalks.tv/talks/maxout-networks/58135/
