Optimizers
SGD
$W_{new}=W_{old}-\alpha\frac{\partial Loss}{\partial W_{old}}$
$\alpha$: learning rate
Drawback: easily gets stuck in local minima.
Adding momentum helps the update escape local minima.
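A minimal NumPy sketch of the plain SGD step above; the toy quadratic loss, the `sgd_step` helper, and the hyperparameter values are illustrative assumptions rather than part of the original notes:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Vanilla SGD: w_new = w_old - lr * dLoss/dw."""
    return w - lr * grad

# Toy example: minimize Loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
for _ in range(100):
    grad = 2 * (w - 3)
    w = sgd_step(w, grad, lr=0.1)
print(w)  # converges toward 3
```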
SGD+Momentum
Momentum update: $V_{new}=\eta V_{old}+\alpha\frac{\partial Loss}{\partial W_{old}}$
Weight update: $W_{new}=W_{old}-V_{new}$
$\alpha$: learning rate
$\eta$: momentum coefficient
Advantages: helps escape local minima; because the velocity accumulates past gradients, convergence is faster.
Drawback: prone to oscillation around the minimum.
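A minimal NumPy sketch of SGD with momentum as written above; the toy loss and the chosen `lr`/`eta` values are illustrative assumptions:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, eta=0.9):
    """Momentum update: v_new = eta * v_old + lr * grad; w_new = w_old - v_new."""
    v_new = eta * v + lr * grad
    w_new = w - v_new
    return w_new, v_new

# Toy example on Loss(w) = (w - 3)^2
w, v = np.array([0.0]), np.zeros(1)
for _ in range(200):
    grad = 2 * (w - 3)
    w, v = momentum_step(w, v, grad, lr=0.05, eta=0.9)
print(w)  # approaches 3, oscillating along the way
```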
NAG (Nesterov Accelerated Gradient)
Fully expanding the momentum update: $W_{new}=W_{old}-\eta V_{old}-\alpha\frac{\partial Loss}{\partial W_{old}}$
Since $\alpha\frac{\partial Loss}{\partial W_{old}}$ is a small term, the look-ahead ("future") weights are approximately:
$W_{future}=W_{old}-\eta V_{old}$
Nesterov momentum update: $V_{new}=\eta V_{old}+\alpha\frac{\partial Loss}{\partial W_{future}}$
Weight update: $W_{new}=W_{old}-V_{new}$
- Gradient update rule (equivalent form in $\theta$ notation):
$v_t=\gamma v_{t-1}+\eta\nabla_{\theta}Loss(\theta-\gamma v_{t-1})$
$\theta=\theta-v_t$
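A minimal NumPy sketch of the Nesterov step: the only difference from plain momentum is that the gradient is evaluated at the look-ahead point. The `grad_fn` callable, toy loss, and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def nag_step(w, v, grad_fn, lr=0.01, eta=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead point w - eta * v."""
    w_future = w - eta * v                  # approximate "future" position
    v_new = eta * v + lr * grad_fn(w_future)
    w_new = w - v_new
    return w_new, v_new

# Toy example on Loss(w) = (w - 3)^2
grad_fn = lambda w: 2 * (w - 3)
w, v = np.array([0.0]), np.zeros(1)
for _ in range(200):
    w, v = nag_step(w, v, grad_fn, lr=0.05, eta=0.9)
print(w)  # approaches 3 with less overshoot than plain momentum
```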
Adagrad
Gradient cache update: $Cache_{new}=Cache_{old}+\left(\frac{\partial Loss}{\partial W_{old}}\right)^2$
Weight update: $W_{new}=W_{old}-\frac{\alpha}{\sqrt{Cache_{new}+\epsilon}}\frac{\partial Loss}{\partial W_{old}}$
Drawback: the cache only grows, so the effective learning rate eventually becomes so small that training can no longer make progress and effectively stops early.
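A minimal NumPy sketch of the Adagrad step above; the toy loss and the chosen learning rate are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.1, eps=1e-8):
    """Adagrad: the per-parameter step shrinks as squared gradients accumulate."""
    cache_new = cache + grad ** 2
    w_new = w - lr / np.sqrt(cache_new + eps) * grad
    return w_new, cache_new

# Toy example on Loss(w) = (w - 3)^2
w, cache = np.array([0.0]), np.zeros(1)
for _ in range(500):
    grad = 2 * (w - 3)
    w, cache = adagrad_step(w, cache, grad, lr=0.5)
print(w)  # close to 3, but progress slows as the cache grows
```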
RMSProp
Cache update: $Cache_{new}=\gamma Cache_{old}+(1-\gamma)\left(\frac{\partial Loss}{\partial W_{old}}\right)^2$
Weight update: $W_{new}=W_{old}-\frac{\alpha}{\sqrt{Cache_{new}+\epsilon}}\frac{\partial Loss}{\partial W_{old}}$
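A minimal NumPy sketch of RMSProp as written above; the toy loss and the `lr`/`gamma` values are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(w, cache, grad, lr=0.01, gamma=0.9, eps=1e-8):
    """RMSProp: an exponentially decaying average of squared gradients keeps the step size from vanishing."""
    cache_new = gamma * cache + (1 - gamma) * grad ** 2
    w_new = w - lr / np.sqrt(cache_new + eps) * grad
    return w_new, cache_new

# Toy example on Loss(w) = (w - 3)^2
w, cache = np.array([0.0]), np.zeros(1)
for _ in range(500):
    grad = 2 * (w - 3)
    w, cache = rmsprop_step(w, cache, grad, lr=0.05)
print(w)  # hovers near 3; unlike Adagrad, the cache does not keep growing
```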
Adam
Momentum (first moment) update: $V_{new}=\beta_1 V_{old}+(1-\beta_1)\frac{\partial Loss}{\partial W_{old}}$
Cache (second moment) update: $Cache_{new}=\beta_2 Cache_{old}+(1-\beta_2)\left(\frac{\partial Loss}{\partial W_{old}}\right)^2$
Adam weight update: $W_{new}=W_{old}-\frac{\alpha}{\sqrt{Cache_{new}+\epsilon}}V_{new}$
Typical defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$
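A minimal NumPy sketch of the Adam step exactly as written above; the toy loss and the chosen learning rate are illustrative assumptions, and the bias correction used in the original Adam paper is intentionally omitted because these notes do not include it:

```python
import numpy as np

def adam_step(w, v, cache, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step matching the formulas above (no bias correction of v and cache)."""
    v_new = beta1 * v + (1 - beta1) * grad
    cache_new = beta2 * cache + (1 - beta2) * grad ** 2
    w_new = w - lr / np.sqrt(cache_new + eps) * v_new
    return w_new, v_new, cache_new

# Toy example on Loss(w) = (w - 3)^2
w, v, cache = np.array([0.0]), np.zeros(1), np.zeros(1)
for _ in range(500):
    grad = 2 * (w - 3)
    w, v, cache = adam_step(w, v, cache, grad, lr=0.05)
print(w)  # moves toward 3 and settles nearby
```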