[Repost] Overview of gradient descent algorithms

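This article takes an in-depth look at how popular gradient-based optimization algorithms such as Momentum, AdaGrad, RMSprop, and Adam work, with the aim of giving an intuitive understanding of these algorithms, which are commonly used to optimize neural networks.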


An overview of gradient descent optimization algorithms

Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as Momentum, Adagrad, and Adam actually work.

  • Sebastian Ruder


19 JAN 2016 • 28 MIN READ

 


Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.

Update 20.03.2020: Added a note on recent optimizers.

Update 09.02.2018: Added AMSGrad.

Update 24.11.2017: Most of the content in this article is now also available as slides.

Update 15.06.2017: Added derivations of AdaMax and Nadam.

Update 21.06.16: This post was posted to Hacker News. The discussion provides some interesting pointers to related work and other techniques.


Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation). These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting. Finally, we will consider additional strategies that are helpful for optimizing gradient descent.

Gradient descent is a way to minimize an objective function $J(\theta)$ parameterized by a model's parameters $\theta \in \mathbb{R}^d$ by updating the parameters in the opposite direction of the gradient of the objective function $\nabla_\theta J(\theta)$ w.r.t. the parameters. The learning rate $\eta$ determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. If you are unfamiliar with gradient descent, you can find a good introduction on optimizing neural networks here.
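The update described above, moving the parameters against the gradient by a step scaled by the learning rate, can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `gradient_descent`, the toy quadratic objective, and the values chosen for the learning rate and step count are assumptions made for the example and do not come from the original post.

```python
from typing import Callable, List

Vector = List[float]


def gradient_descent(
    grad: Callable[[Vector], Vector],
    theta0: Vector,
    eta: float = 0.1,
    n_steps: int = 100,
) -> Vector:
    """Repeatedly apply theta <- theta - eta * grad(theta)."""
    theta = list(theta0)
    for _ in range(n_steps):
        g = grad(theta)
        # Step in the direction opposite to the gradient, scaled by eta.
        theta = [t - eta * g_i for t, g_i in zip(theta, g)]
    return theta


# Hypothetical objective J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2.
# Its gradient is 2 * (theta - [3, -1]) and its minimum lies at [3, -1].
def grad_J(theta: Vector) -> Vector:
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_min = gradient_descent(grad_J, theta0=[0.0, 0.0], eta=0.1, n_steps=100)
```

On this toy objective each step shrinks the distance to the minimum by a constant factor, so `theta_min` ends up very close to `[3, -1]`. A learning rate that is too large for the objective would make the same loop overshoot or diverge.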

For the full text, please visit the original post: Overview of gradient descent algorithms

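Overview of gradient descent optimization algorithms

Gradient descent is a widely used optimization method for finding the parameters that minimize an objective function (or, by following the gradient uphill, maximize it). As machine learning and deep learning have developed, many variants of gradient descent have emerged. Some commonly used ones are summarized below:

1. Batch gradient descent: each iteration uses the gradient computed over all training samples to update the model parameters. Suited to small training sets and models with few parameters.
2. Stochastic gradient descent (SGD): each iteration updates the model parameters using a single sample. Suited to large training sets and models with many parameters.
3. Mini-batch gradient descent: a compromise between batch and stochastic gradient descent. Each iteration uses the gradient of a small subset of the samples. Suited to large training sets.
4. Momentum: adds a notion of "inertia" that can speed up convergence. Each iteration adds a fraction of the previous update to the current gradient step.
5. Adaptive gradient methods: adapt the learning rate of each model parameter individually in order to converge faster. Adagrad, for example, adjusts the learning rate separately for every parameter.
6. Adaptive Moment Estimation (Adam): combines the ideas of Momentum and adaptive learning rates. It keeps a running estimate of the first moment of the gradients, which plays the role of momentum, and of the second moment, which scales each parameter's learning rate.

Each algorithm has situations it suits best, so the choice should depend on the nature of the problem.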
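To make the differences between Momentum, Adagrad, and Adam more concrete, the sketch below writes out one common formulation of each update rule in plain Python. The summary above only names these algorithms, so the exact formulas, the function names, the toy objective, and all hyperparameter values here are assumptions chosen for illustration rather than anything prescribed by the text.

```python
import math
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def momentum(grad: GradFn, theta0: Vector, eta: float = 0.01,
             gamma: float = 0.9, n_steps: int = 500) -> Vector:
    """Momentum: v <- gamma * v + eta * g, then theta <- theta - v."""
    theta, v = list(theta0), [0.0] * len(theta0)
    for _ in range(n_steps):
        g = grad(theta)
        # Keep a fraction gamma of the previous update and add the new step.
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


def adagrad(grad: GradFn, theta0: Vector, eta: float = 0.5,
            eps: float = 1e-8, n_steps: int = 500) -> Vector:
    """Adagrad: scale each parameter's step by its summed squared gradients."""
    theta, G = list(theta0), [0.0] * len(theta0)
    for _ in range(n_steps):
        g = grad(theta)
        # G accumulates squared gradients separately for every parameter.
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        theta = [ti - eta / math.sqrt(Gi + eps) * gi
                 for ti, Gi, gi in zip(theta, G, g)]
    return theta


def adam(grad: GradFn, theta0: Vector, eta: float = 0.1, beta1: float = 0.9,
         beta2: float = 0.999, eps: float = 1e-8, n_steps: int = 500) -> Vector:
    """Adam: bias-corrected running averages of the gradient and its square."""
    theta = list(theta0)
    m = [0.0] * len(theta0)  # first-moment estimate (acts like momentum)
    v = [0.0] * len(theta0)  # second-moment estimate (scales the step)
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Correct the bias caused by initializing m and v at zero.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [ti - eta * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


# Hypothetical objective J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2,
# with its minimum at [3, -1].
def grad_J(theta: Vector) -> Vector:
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


start = [0.0, 0.0]
theta_momentum = momentum(grad_J, start)
theta_adagrad = adagrad(grad_J, start)
theta_adam = adam(grad_J, start)
```

All three runs should end near `[3, -1]` on this simple objective. The example is deliberately full-batch and deterministic. In practice each `grad` call would typically be evaluated on a mini-batch, and the hyperparameters would need tuning for the problem at hand.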