Gradient descent optimization algorithms:
SGD -> SGDM -> NAG -> AdaGrad -> AdaDelta -> RMSprop -> Adam -> AdaMax -> Nadam -> AMSGrad
At each time step t:
1) compute the gradient of the objective function with respect to the current parameters;
2) compute the first moment and the second moment from the historical gradients;
3) compute the descent step for the current time step;
4) update the parameters with that step;
The algorithms are essentially identical in steps 3 and 4; the main differences lie in steps 1 and 2 (see the generic sketch below);
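A minimal numpy sketch of this shared 4-step template, with step 2 filled in as plain SGD; the function name, the toy quadratic objective, and the hyperparameters are illustrative assumptions, not from any library:

```python
import numpy as np

def optimize(grad_fn, w, lr=0.1, steps=100):
    # Generic template: each algorithm differs mainly in how step 2 fills m and v.
    for t in range(1, steps + 1):
        g = grad_fn(w)                  # 1) gradient at the current parameters
        m, v = g, np.ones_like(w)       # 2) plain SGD: no momentum, unit second moment
        step = lr * m / np.sqrt(v)      # 3) descent step for this time step
        w = w - step                    # 4) parameter update
    return w

# Usage: minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
print(optimize(lambda w: w, np.array([3.0, -4.0])))   # -> close to [0, 0]
```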
Optimization algorithms can be split into fixed-learning-rate and adaptive-learning-rate methods;
Fixed-learning-rate strategies: SGD, SGDM, NAG; adaptive-learning-rate strategies: AdaGrad, AdaDelta, RMSprop, Adam, Nadam;
Classical momentum (CM) was proposed in 1964 (Polyak's paper "Some methods of speeding up the convergence of iteration methods");
SGDM is SGD with momentum, i.e. it adds a momentum term on top of SGD (sketch below);
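A minimal sketch of the momentum update on the same toy quadratic; the accumulation form `beta * m + g` is one common convention (some libraries scale the gradient by `1 - beta` instead), and all names and values here are illustrative:

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    m = np.zeros_like(w)          # velocity: exponential accumulation of past gradients
    for _ in range(steps):
        g = grad_fn(w)
        m = beta * m + g          # first moment; some variants use beta*m + (1-beta)*g
        w = w - lr * m
    return w

print(sgd_momentum(lambda w: w, np.array([3.0, -4.0])))   # -> close to [0, 0]
```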
NAG stands for Nesterov's Accelerated Gradient;
NAG adds a correction to the update by evaluating the gradient at a look-ahead position, which keeps the step from running too far ahead and makes the method more responsive (sketch below);
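A sketch of NAG under the same toy setup; the only change from the momentum version above is where the gradient is evaluated. This is one common parameterization (libraries differ in where the learning rate enters), so treat the exact form as an assumption:

```python
import numpy as np

def nag(grad_fn, w, lr=0.1, beta=0.9, steps=200):
    m = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w - lr * beta * m)   # gradient at the look-ahead point, not at w
        m = beta * m + g
        w = w - lr * m
    return w

print(nag(lambda w: w, np.array([3.0, -4.0])))   # -> close to [0, 0]
```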
RMSprop is an adaptive-learning-rate method proposed by Geoff Hinton; it alleviates AdaGrad's problem of the learning rate shrinking too quickly (sketch below);
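A sketch of the RMSprop update on the same toy problem; the hyperparameter values are illustrative, not Hinton's exact suggestions:

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.01, rho=0.9, eps=1e-8, steps=2000):
    v = np.zeros_like(w)                   # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        v = rho * v + (1 - rho) * g * g    # exponential decay, unlike AdaGrad's full sum
        w = w - lr * g / (np.sqrt(v) + eps)
    return w

print(rmsprop(lambda w: w, np.array([3.0, -4.0])))   # -> close to [0, 0]
```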
Adam is another adaptive-learning-rate method, essentially Momentum plus an RMSprop/AdaGrad-style second moment, and it is often the go-to default optimizer; it uses both the first moment and the second moment (sketch below);
(it dynamically adjusts each parameter's learning rate using first- and second-moment estimates of the gradients;)
(AdaMax is a variant of Adam based on the infinity norm;)
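A sketch of the Adam update following the form in the Adam paper; the learning rate and step count here are chosen for the toy problem, not as recommended defaults:

```python
import numpy as np

def adam(grad_fn, w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m = np.zeros_like(w)    # first moment estimate (the Momentum part)
    v = np.zeros_like(w)    # second moment estimate (the RMSprop part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)        # bias correction for the first moment
        v_hat = v / (1 - beta2 ** t)        # bias correction for the second moment
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

print(adam(lambda w: w, np.array([3.0, -4.0])))   # -> close to [0, 0]
```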
Nesterov(NAG) + Adam = Nadam
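A sketch of the Nadam update as summarized in the ruder.io overview linked below; the exact look-ahead form is an assumption worth checking against the original Nadam report, and all hyperparameters are illustrative:

```python
import numpy as np

def nadam(grad_fn, w, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Nesterov-style look-ahead applied to the first moment
        m_bar = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        w = w - lr * m_bar / (np.sqrt(v_hat) + eps)
    return w

print(nadam(lambda w: w, np.array([3.0, -4.0])))   # -> close to [0, 0]
```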
AdaGrad and AdaDelta add a second moment on top of SGD (sketches below);
(AdaDelta is an extension of AdaGrad that fixes AdaGrad's overly aggressive, monotonically shrinking learning rate;)
(with AdaDelta you may not even need to set an initial learning rate;)
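Sketches of AdaGrad and AdaDelta under the same toy setup, contrasting AdaGrad's ever-growing accumulator with AdaDelta's two decaying averages and its lack of a global learning rate; names and hyperparameters are illustrative:

```python
import numpy as np

def adagrad(grad_fn, w, lr=0.5, eps=1e-8, steps=2000):
    G = np.zeros_like(w)                   # sum of ALL past squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        G = G + g * g                      # only grows, so the effective lr only shrinks
        w = w - lr * g / (np.sqrt(G) + eps)
    return w

def adadelta(grad_fn, w, rho=0.95, eps=1e-6, steps=5000):
    Eg2 = np.zeros_like(w)                 # running average of squared gradients
    Edx2 = np.zeros_like(w)                # running average of squared updates
    for _ in range(steps):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # note: no global learning rate
        Edx2 = rho * Edx2 + (1 - rho) * dx * dx
        w = w + dx
    return w

print(adagrad(lambda w: w, np.array([3.0, -4.0])))    # both drive w toward [0, 0]
print(adadelta(lambda w: w, np.array([3.0, -4.0])))
```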
SGD, BGD and MBGD stand for Stochastic gradient descent, Batch gradient descent and Mini-batch gradient descent; they differ only in how many samples are used per update (example below);
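A toy least-squares example showing that the three variants differ only in the batch size fed to each update; the data, function name, and hyperparameters are illustrative assumptions:

```python
import numpy as np

# Toy least-squares problem: y = X @ w_true with w_true = [1, -2, 0.5].
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
y = X @ np.array([1.0, -2.0, 0.5])

def mbgd(X, y, batch_size, lr=0.05, epochs=200):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            g = X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient on this (mini-)batch
            w = w - lr * g
    return w

print(mbgd(X, y, batch_size=1))       # SGD: one sample per update
print(mbgd(X, y, batch_size=len(X)))  # BGD: the full data set per update
print(mbgd(X, y, batch_size=32))      # MBGD: the usual compromise
# All three recover roughly [1.0, -2.0, 0.5].
```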
Second-order methods include Newton's method, quasi-Newton methods, L-BFGS, etc.:
Newton's method (which uses the Hessian matrix) takes too much memory and computation (toy example below);
One reason second-order methods such as Newton's method have not replaced gradient methods is that they are easily attracted to saddle points, which motivated the saddle-free Newton method;
L-BFGS is a limited-memory quasi-Newton method; it is rarely used in deep learning;
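A toy illustration of a single Newton step on a 2-D quadratic, to make the Hessian cost concrete; the matrix and vector values are illustrative:

```python
import numpy as np

# One Newton step on f(w) = 0.5 * w^T A w - b^T w: it lands on the exact minimizer,
# but forming and solving with the d x d Hessian costs O(d^2) memory and O(d^3) time,
# which is what rules it out at deep-learning scale.
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # Hessian of f
b = np.array([1.0, -1.0])

w = np.zeros(2)
grad = A @ w - b                          # gradient of f at w
w = w - np.linalg.solve(A, grad)          # Newton step: w <- w - H^{-1} * grad
print(np.allclose(A @ w, b))              # True: the minimizer satisfies A w = b
```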
Related papers:
On the importance of initialization and momentum in deep learning
An overview of gradient descent optimization algorithms
Adam: A Method for Stochastic Optimization
http://ruder.io/optimizing-gradient-descent/index.html !!!
https://blog.youkuaiyun.com/google19890102/article/details/69942970 !!!
https://blog.youkuaiyun.com/heyongluoyao8/article/details/52478715 !!!
https://blog.youkuaiyun.com/u014595019/article/details/52989301 !!!
https://www.cnblogs.com/ranjiewen/p/5938944.html !!!
https://blog.youkuaiyun.com/weixin_38582851/article/details/80555145 !!!
http://wemedia.ifeng.com/69799959/wemedia.shtml
https://zhuanlan.zhihu.com/p/32230623