Paper reading: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Paper link: https://arxiv.org/pdf/1706.02677
Background
1) Larger networks and larger datasets require longer training time.
Solution: distributed synchronous SGD, which divides SGD minibatches over a pool of parallel workers.
2) The main issue with large minibatches is optimization difficulty rather than poor generalization.
Solution: the authors present strategies for using large minibatches in place of small ones while maintaining training and generalization accuracy. Because large minibatches make optimization harder, the paper offers several tips to address this.
Large minibatch SGD
First, recall the SGD formulas:
1) Loss function:
$$L(w) = \frac{1}{|X|} \sum_{x \in X} l(x, w)$$
where $X$ is the full training set, $w$ are the network weights, and $l(x, w)$ is the loss for a single sample $x$.
2) Weight update:
$$w_{t+1} = w_t - \eta \frac{1}{n} \sum_{x \in B} \nabla l(x, w_t)$$
where $B$ is a minibatch sampled from $X$, $n = |B|$ is the minibatch size, $\eta$ is the learning rate, and $t$ is the iteration index.
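To make the update rule concrete, here is a minimal NumPy sketch of one SGD iteration; `grad_loss(x, w)` is a hypothetical helper for the per-sample gradient $\nabla l(x, w)$ and is not from the paper.

```python
import numpy as np

def sgd_step(w, minibatch, grad_loss, eta):
    """One SGD iteration: w <- w - eta * (1/n) * sum of per-sample gradients."""
    n = len(minibatch)
    grad = np.zeros_like(w)
    for x in minibatch:
        grad += grad_loss(x, w)  # assumed to return the gradient of l(x, w) w.r.t. w
    return w - eta * grad / n
```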
Large minibatch
Linear Scaling Rule: when the minibatch size is multiplied by $k$, multiply the learning rate by $k$.
Warmup: start from the small-minibatch learning rate and ramp it up to the scaled rate over the first few epochs to avoid instability early in training (see the sketch after this list).
BN: Batch Normalization statistics are computed per worker over its local minibatch of size $n$, not over the full minibatch of size $kn$.
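As an illustration of how the linear scaling rule and gradual warmup could fit together, the sketch below computes a per-epoch learning rate (before any step decay). The defaults (base learning rate 0.1 for a 256-image minibatch, 5 warmup epochs) follow the paper's ResNet-50 setup, but the `learning_rate` function itself is an assumed helper, not the authors' code.

```python
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Learning rate for a given epoch, before any step decay.

    Linear scaling rule: the target lr is base_lr * (batch_size / base_batch).
    Gradual warmup: ramp linearly from base_lr up to the target lr over the
    first `warmup_epochs` epochs when the minibatch is scaled up.
    """
    k = batch_size / base_batch      # minibatch scaling factor
    target_lr = base_lr * k          # lr scaled linearly with minibatch size
    if k > 1 and epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr
```

For example, with an 8192-image minibatch (k = 32), `learning_rate(0, 8192)` returns 0.1 and `learning_rate(5, 8192)` returns 3.2, i.e. the scaled rate $k\eta$.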
Tips

Communication
Each parameter's gradient is aggregated with an allreduce operation: before the allreduce, every GPU computes its own local gradient; after the allreduce, every GPU holds the sum of all gradients.
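A minimal PyTorch-style sketch of this aggregation step, assuming `torch.distributed` has already been initialized with a process group; this is an illustration, not the paper's Caffe2 implementation.

```python
import torch.distributed as dist

def allreduce_gradients(model):
    """Call after loss.backward() on each GPU: sum each parameter's gradient
    across all workers so that every GPU ends up with the same summed gradient."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
```

If each worker normalizes its loss by the total minibatch size $kn$ rather than its local $n$ (as the paper does for gradient aggregation), the summed result is already the correctly averaged gradient for the update rule.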
Recommended reading:
https://www.zhihu.com/question/60874090
https://www.jianshu.com/p/738ff3628543