Deep learning optimization methods (unfinished draft)

This post looks at several variants of gradient descent, namely stochastic gradient descent (SGD), averaged stochastic gradient descent (ASGD), conjugate gradient (CG), and limited-memory BFGS (LBFGS), and compares their efficiency and effectiveness on large datasets. SGD updates the weights from a single example at a time, which makes each iteration cheap and speeds up training; ASGD further refines the update rule by averaging the iterates; CG is particularly effective for solving large linear systems, and LBFGS for large-scale smooth optimization under tight memory constraints.


SGD(Stochastic Gradient Descent)

ASGD(Averaged Stochastic Gradient Descent)

CG(Conjugate Gradient)

LBFGS(Limited-memory Broyden-Fletcher-Goldfarb-Shanno)



SGD (Stochastic Gradient Descent)

(ref:https://en.wikipedia.org/wiki/Stochastic_gradient_descent)

SGD addresses two problems of batch gradient descent: slow convergence and the risk of getting trapped in local optima. The only modification is a slightly different weight-update rule.

Stochastic gradient descent (often shortened to SGD) is a stochastic approximation of the gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
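Written out, the objective is a sum over the training examples, and a single SGD step uses the gradient of one randomly chosen summand in place of the gradient of the full sum:

  • Q(w) = \sum_{i=1}^n Q_i(w), \qquad w := w - \eta \, \nabla Q_i(w) \text{ for a randomly chosen } i.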


pseudocode

  • Choose an initial vector of parameters w and learning rate \eta.

  • Repeat until an approximate minimum is obtained:

    • Randomly shuffle examples in the training set.

    • For i = 1, 2, ..., n, do:

      • w := w - \eta \nabla Q_i(w).

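A minimal Python sketch of this loop, assuming the per-example gradient is available as a function grad_Qi(w, i) (the names here are illustrative, not from the original post):

import random

def sgd(grad_Qi, w, n, eta=0.01, epochs=100):
    """Plain SGD: one weight update per training example.

    grad_Qi(w, i) should return the gradient of Q_i at the current w as a list.
    """
    for _ in range(epochs):                    # repeat until "good enough"
        indices = list(range(n))
        random.shuffle(indices)                # randomly shuffle the training set
        for i in indices:                      # one pass over the shuffled examples
            g = grad_Qi(w, i)
            w = [wj - eta * gj for wj, gj in zip(w, g)]   # w := w - eta * grad Q_i(w)
    return w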

example

Let's suppose we want to fit a straight line y = w_1 + w_2 x to a training set of two-dimensional points (x_1, y_1), \ldots, (x_n, y_n) using least squares. The objective function to be minimized is:

Q(w) = \sum_{i=1}^n Q_i(w) = \sum_{i=1}^n \left(w_1 + w_2 x_i - y_i\right)^2.

The last line in the above pseudocode for this specific problem will become:

\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2 (w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.
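For this line-fitting problem the update can be written out directly. A small self-contained sketch (the data and hyperparameters below are made up for illustration):

import random

def fit_line_sgd(xs, ys, eta=0.01, epochs=200):
    """Fit y = w1 + w2*x by SGD on the squared errors (w1 + w2*x_i - y_i)^2."""
    w1, w2 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        order = list(range(n))
        random.shuffle(order)
        for i in order:
            err = w1 + w2 * xs[i] - ys[i]        # residual on example i
            w1 -= eta * 2 * err                  # d/dw1 of (w1 + w2*x_i - y_i)^2
            w2 -= eta * 2 * xs[i] * err          # d/dw2 of the same term
    return w1, w2

# Toy data from the line y = 1 + 2x with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
print(fit_line_sgd(xs, ys))   # roughly (1, 2)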


Key differences between standard gradient descent and stochastic gradient descent

– Standard gradient descent sums the error over all examples before updating the weights, whereas stochastic gradient descent updates the weights after examining a single training example.

– Each update step of standard gradient descent sums over many examples and therefore requires more computation.

– Because standard gradient descent uses the true gradient, it is often run with a larger step size per weight update than stochastic gradient descent.

– If the error surface has multiple local minima, stochastic gradient descent can sometimes avoid falling into them.


One iteration of full gradient descent must plug in all m samples, so its cost is on the order of m*n^2.
Stochastic gradient descent uses only one sample per update, so one iteration costs on the order of n^2; when m is large, one SGD iteration is far cheaper than one full gradient descent iteration. The contrast is spelled out in the sketch below.
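A side-by-side sketch of the two update rules for the least-squares line fit above (function names are my own, for illustration):

def batch_gd_step(w1, w2, xs, ys, eta):
    """One full gradient-descent step: the gradient is summed over all m examples."""
    g1 = sum(2 * (w1 + w2 * x - y) for x, y in zip(xs, ys))
    g2 = sum(2 * x * (w1 + w2 * x - y) for x, y in zip(xs, ys))
    return w1 - eta * g1, w2 - eta * g2

def sgd_step(w1, w2, x, y, eta):
    """One stochastic step: the gradient of a single example's squared error."""
    err = w1 + w2 * x - y
    return w1 - eta * 2 * err, w2 - eta * 2 * x * err

One batch step touches every example while one stochastic step touches a single one, which is exactly where the m-fold gap in per-iteration cost comes from.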


ASGD (Averaged Stochastic Gradient Descent)

(ref:https://www.quora.com/How-does-Averaged-Stochastic-Gradient-Decent-ASGD-work)

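In short, ASGD runs ordinary SGD but additionally keeps a running average of the iterates visited so far (Polyak-Ruppert averaging) and returns that average as the final answer; the averaging damps the noise of the individual SGD steps. A minimal sketch, with illustrative names only:

import random

def asgd(grad_Qi, w, n, eta=0.01, epochs=100):
    """Averaged SGD: ordinary SGD iterates plus a running average of them."""
    w_avg = list(w)
    t = 0
    for _ in range(epochs):
        order = list(range(n))
        random.shuffle(order)
        for i in order:
            g = grad_Qi(w, i)
            w = [wj - eta * gj for wj, gj in zip(w, g)]      # plain SGD step
            t += 1
            # incremental running mean: w_avg += (w - w_avg) / t
            w_avg = [aj + (wj - aj) / t for aj, wj in zip(w_avg, w)]
    return w_avg    # return the averaged iterate, not the last one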

CG

(ref:https://en.wikipedia.org/wiki/Conjugate_gradient_method)

If we choose the conjugate vectors pk carefully, then we may not need all of them to obtain a good approximation to the solution x. So, we want to regard the conjugate gradient method as an iterative method. This also allows us to approximately solve systems where n is so large that the direct method would take too much time.

We denote the initial guess for x by x0. We can assume without loss of generality that x0 = 0 (otherwise, consider the system Az = b − Ax0 instead). Starting with x0 we search for the solution and in each iteration we need a metric to tell us whether we are closer to the solution x (that is unknown to us). This metric comes from the fact that the solution x is also the unique minimizer of the following quadratic function; so if f(x) becomes smaller in an iteration it means that we are closer to x.

  • f(\mathbf{x}) = \tfrac12 \mathbf{x}^\mathsf{T} \mathbf{A}\mathbf{x} - \mathbf{x}^\mathsf{T} \mathbf{b}, \qquad \mathbf{x}\in\mathbf{R}^n.

This suggests taking the first basis vector p0 to be the negative of the gradient of f at x = x0. The gradient of f equals Ax − b. Starting with a "guessed solution" x0 (we can always guess x0 = 0 if we have no reason to guess for anything else), this means we take p0 = b − Ax0. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.

Let rk be the residual at the kth step:

  • \mathbf{r}_k = \mathbf{b} - \mathbf{Ax}_k.

Note that rk is the negative gradient of f at x = xk, so the gradient descent method would be to move in the direction rk. Here, we insist that the directions pk be conjugate to each other. We also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice.

The conjugation constraint is an orthonormal-type constraint and hence the algorithm bears resemblance to Gram-Schmidt orthonormalization.

This gives the following expression:

  • \mathbf{p}_{k} = \mathbf{r}_{k} - \sum_{i < k}\frac{\mathbf{p}_i^\mathsf{T} \mathbf{A} \mathbf{r}_{k}}{\mathbf{p}_i^\mathsf{T}\mathbf{A} \mathbf{p}_i} \mathbf{p}_i

Following this direction, the next optimal location is given by

  • \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k

with

  • \alpha_{k} = \frac{\mathbf{p}_k^\mathsf{T} \mathbf{b}}{\mathbf{p}_k^\mathsf{T} \mathbf{A} \mathbf{p}_k} = \frac{\mathbf{p}_k^\mathsf{T} (\mathbf{r}_{k-1}+\mathbf{Ax}_{k-1})}{\mathbf{p}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_{k}} = \frac{\mathbf{p}_{k}^\mathsf{T} \mathbf{r}_{k-1}}{\mathbf{p}_{k}^\mathsf{T} \mathbf{A} \mathbf{p}_{k}},

where the last equality holds because pk and xk-1 are conjugate.
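Putting the derivation together, and using the standard simplification that (in exact arithmetic) only the most recent direction has to be re-orthogonalized, the method reduces to a short recurrence. A minimal sketch in plain Python for a symmetric positive-definite A (in practice one would use NumPy or a sparse solver):

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A with the conjugate gradient method.

    A is a list of lists (n x n); b and x0 are lists of length n.
    """
    n = len(b)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]

    x = list(x0) if x0 is not None else [0.0] * n
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]    # r_0 = b - A x_0
    p = list(r)                                         # p_0 = r_0, the negative gradient of f
    rs_old = dot(r, r)
    for _ in range(max_iter or n):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                     # step length along p_k
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # x_{k+1} = x_k + alpha_k p_k
        r = [ri - alpha * api for ri, api in zip(r, Ap)]    # r_{k+1} = r_k - alpha_k A p_k
        rs_new = dot(r, r)
        if rs_new ** 0.5 < tol:                         # residual small enough: done
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]   # next conjugate direction
        rs_old = rs_new
    return x

# Small example: a 2x2 symmetric positive-definite system.
print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))    # approx [0.0909, 0.6364]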



LBFGS
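LBFGS approximates Newton's method without ever forming the Hessian: it stores only the last m pairs s_k = x_{k+1} - x_k and y_k = g_{k+1} - g_k and reconstructs an approximate inverse-Hessian-vector product with the standard two-loop recursion. A minimal sketch of that recursion (names are illustrative; a complete optimizer would wrap this in a line search):

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate -H^{-1} * grad from the last m (s, y) pairs.

    grad and every s_list[i], y_list[i] are lists of floats of the same length.
    """
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    q = list(grad)
    alphas = []
    # First loop: walk the history from newest pair to oldest.
    for s, y in reversed(list(zip(s_list, y_list))):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append((a, rho, s, y))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Scale by gamma = s^T y / y^T y from the most recent pair (initial inverse Hessian).
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: walk the history from oldest pair back to newest.
    for a, rho, s, y in reversed(alphas):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]    # descent direction: the negative of H^{-1} g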




Reposted from: https://my.oschina.net/kathy00/blog/660087
