[First order method] Gradient descent

This post explores the core concepts of the gradient descent algorithm: its use on smooth convex optimization problems, its interpretation via Newton's method and via quadratic approximation, how to choose the step size, convergence analysis, and convergence rates under different assumptions. It details how fixed step sizes and backtracking line search are used to make the algorithm effective, and gives proofs of the relevant theorems.


1. Gradient descent

1.1 Model to consider

Consider unconstrained, smooth convex optimization

$$\min_{x} f(x)$$

i.e., $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$. Denote the optimal criterion value by $f^\star = \min_x f(x)$, and a solution by $x^\star$.

Gradient descent: choose an initial point $x^{(0)}$, repeat:

$$x^{(k)} = x^{(k-1)} - t_k\,\nabla f(x^{(k-1)}), \qquad k = 1, 2, 3, \ldots$$

Stop at some point.
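To make the iteration concrete, here is a minimal sketch in Python; the quadratic test problem, step size, and stopping rule are assumptions chosen for illustration, not part of the notes above.

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_k = x_{k-1} - t * grad_f(x_{k-1}) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # "stop at some point"
            break
        x = x - t * g
    return x

# Assumed test problem: f(x) = 0.5 * x^T A x - b^T x, so grad f(x) = A x - b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = gradient_descent(lambda x: A @ x - b, x0=np.zeros(2))
print(x_hat, np.linalg.solve(A, b))  # the two should agree
```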

1.2 Interpretation

1.2.1 Interpretation via Newton's method

Since for a smooth convex function $f(x)$, the minimizer $x^\star$ satisfies the condition

$$\nabla f(x^\star) = 0$$

  • If the equation $\nabla f(x) = 0$ can be solved easily, we get $x^\star$ directly by solving this equality.
  • If $\nabla f(x) = 0$ is difficult to solve, we can use a linear approximation to $\nabla f(x)$ at $x^{(0)}$:
    $$\ell(x) = \nabla f(x^{(0)}) + \nabla^2 f(x^{(0)})\,(x - x^{(0)})$$

    and by setting this linear approximation to zero, we get
    $$x^{(\mathrm{new})} = x^{(0)} - \nabla^2 f(x^{(0)})^{-1}\,\nabla f(x^{(0)})$$

But for many functions, computing the Hessian $\nabla^2 f(x)$ is difficult. So we can replace $\nabla^2 f(x)$ by $\frac{1}{t} I$:

$$x^{(\mathrm{new})} = x^{(0)} - t\,\nabla f(x^{(0)})$$

The core idea behind gradient descent is to use a linear approximation of $\nabla f(x)$ to find a root of $\nabla f(x)$, which is exactly the Newton–Raphson method: a method for finding successively better approximations to the roots (or zeroes) of a real-valued function.
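To make the connection concrete, here is a small sketch of Newton–Raphson applied to $g(x) = f'(x)$ in one dimension; the test function $f(x) = x^2 + e^x$ and the starting point are assumptions for illustration.

```python
import numpy as np

def newton_raphson(g, g_prime, x0, tol=1e-10, max_iter=50):
    """Find a root of g via the update x_new = x - g(x) / g'(x)."""
    x = x0
    for _ in range(max_iter):
        x_new = x - g(x) / g_prime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Minimize f(x) = x^2 + exp(x) by finding the root of g(x) = f'(x) = 2x + exp(x),
# using g'(x) = f''(x) = 2 + exp(x).
x_min = newton_raphson(lambda x: 2 * x + np.exp(x),
                       lambda x: 2 + np.exp(x), x0=0.0)
print(x_min)  # ~ -0.3517, where f'(x) = 0
```

Replacing $g'(x)$ here by the constant $1/t$ recovers exactly the gradient descent update $x - t\,f'(x)$.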

1.2.2 Interpretation via quadratic approximation of original function

The linear approximation of $\nabla f(x)$ can be regarded as coming from a quadratic approximation of the original function $f(x)$:

$$q(x) = f(x^{(0)}) + \nabla f(x^{(0)})^\top (x - x^{(0)}) + \frac{1}{2}(x - x^{(0)})^\top \nabla^2 f(x^{(0)})\,(x - x^{(0)})$$

which satisfies:
$$\nabla q(x) = \ell(x)$$

We can also use $\frac{1}{t} I$ to replace $\nabla^2 f(x^{(0)})$:

$$\tilde f(x) \triangleq f(x^{(0)}) + \nabla f(x^{(0)})^\top (x - x^{(0)}) + \frac{1}{2t}(x - x^{(0)})^\top (x - x^{(0)})$$

Then setting the gradient of $\tilde f(x)$ to zero:

$$\nabla \tilde f(x) = \nabla f(x^{(0)}) + \frac{1}{t}(x - x^{(0)}) = 0$$

we get
$$x^{(\mathrm{new})} = x^{(0)} - t\,\nabla f(x^{(0)})$$
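A quick numerical check of this derivation (on an assumed example quadratic): at the gradient step, the gradient of the surrogate $\tilde f$ vanishes, confirming that $x^{(0)} - t\,\nabla f(x^{(0)})$ is its minimizer.

```python
import numpy as np

# Assumed example objective: f(x) = 0.5 * x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b

x0, t = np.array([1.0, 1.0]), 0.1
# Gradient of the surrogate: grad f(x0) + (x - x0) / t
grad_f_tilde = lambda x: grad_f(x0) + (x - x0) / t

x_new = x0 - t * grad_f(x0)                   # the gradient descent step
print(np.allclose(grad_f_tilde(x_new), 0.0))  # True: x_new minimizes f_tilde
```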

1.3 How to choose the step size $t_k$

If $t$ is too large, the algorithm may not converge; if it is too small, the algorithm converges too slowly. So how do we choose a suitable $t$?

  • Fixed $t$.
  • Exact line search can be used if $f(x)$ is simple enough (e.g., for a quadratic $f$ the minimizing $t$ has a closed form):
    $$t = \operatorname*{argmin}_{s \ge 0} f\bigl(x - s\,\nabla f(x)\bigr)$$
    • But in most cases, we use backtracking line search:

(i) Fix $\beta \in (0,1)$ and $\alpha \in (0, 1/2]$ (in practice, choose $\alpha = 1/2$).
(ii) At each iteration, start with $t = 1$, and while

$$f\bigl(x - t\,\nabla f(x)\bigr) > f(x) - \alpha t\,\|\nabla f(x)\|_2^2$$

shrink $t = \beta t$. Once the condition holds, perform the gradient descent update:
$$x^+ = x - t\,\nabla f(x)$$

From the backtracking exit condition, we can see that

$$f\bigl(x - t\,\nabla f(x)\bigr) \le f(x) - \alpha t\,\|\nabla f(x)\|_2^2 \le f(x)$$

which makes sure that every accepted step moves in a descent direction, strictly decreasing $f$ whenever $\nabla f(x) \ne 0$ (see the sketch below).
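A minimal sketch of this procedure in Python, reusing the assumed quadratic test problem from above; the choices $\alpha = 1/2$ and $\beta = 0.8$ are illustrative.

```python
import numpy as np

def backtracking_gd(f, grad_f, x0, alpha=0.5, beta=0.8, tol=1e-8, max_iter=10_000):
    """Gradient descent with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        # Shrink t until the sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
print(backtracking_gd(f, lambda x: A @ x - b, np.zeros(2)))
```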

1.4 Convergence analysis

Theorem [Lipschitz gradient, fixed step size]: If $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with fixed step size $t \le 1/L$ satisfies

$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$

proof:

  • From the Lipschitz property of $\nabla f(x)$, we have
    $$\|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2 \quad\Longrightarrow\quad f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|_2^2$$
  • From the definition of the gradient descent method, we know
    $$x^+ = x - t\,\nabla f(x)$$
  • From the convexity of $f(x)$, we have
    $$f(x^\star) \ge f(x) + \nabla f(x)^\top (x^\star - x)$$

    which can be written as
    $$f(x) \le f(x^\star) + \nabla f(x)^\top (x - x^\star)$$

Combining these three together, we have

$$
\begin{aligned}
f(x^+) = f\bigl(x - t\,\nabla f(x)\bigr) &\le f(x) - t\|\nabla f(x)\|_2^2 + \frac{Lt^2}{2}\|\nabla f(x)\|_2^2 \\
&= f(x) - \Bigl(1 - \frac{Lt}{2}\Bigr)\,t\,\|\nabla f(x)\|_2^2 \\
&\le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2 \qquad \text{(since } t \le 1/L\text{)} \\
&\le f(x^\star) + \nabla f(x)^\top (x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2 \\
&= f(x^\star) + \frac{1}{2t}\Bigl(2\,(x - x^+)^\top (x - x^\star) - \|x - x^+\|_2^2\Bigr) \qquad \text{(using } x - x^+ = t\,\nabla f(x)\text{)} \\
&= f(x^\star) + \frac{1}{2t}\Bigl(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\Bigr)
\end{aligned}
$$

So, applying this bound at every iteration, we have

$$f(x^{(i)}) \le f(x^\star) + \frac{1}{2t}\Bigl(\|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2\Bigr), \qquad i = 1, \ldots, k$$

Summing all of these inequalities, the distance terms telescope:

$$\sum_{i=1}^{k} f(x^{(i)}) \le k\,f(x^\star) + \frac{1}{2t}\Bigl(\|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\Bigr) \le k\,f(x^\star) + \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2$$

Then, since each iteration does not increase $f$, $f(x^{(k)})$ is the smallest of these $k$ values, so

$$f(x^{(k)}) \le \frac{1}{k}\sum_{i=1}^{k} f(x^{(i)}) \le f(x^\star) + \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$
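As a sanity check on the $O(1/k)$ bound, the snippet below runs fixed-step gradient descent on an assumed example quadratic and compares $f(x^{(k)}) - f^\star$ against $\|x^{(0)} - x^\star\|_2^2 / (2tk)$.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

L = np.linalg.eigvalsh(A).max()   # Lipschitz constant of grad f
t = 1.0 / L
x_star = np.linalg.solve(A, b)
f_star = f(x_star)

x = np.zeros(2)
r0 = np.linalg.norm(x - x_star) ** 2
for k in range(1, 101):
    x = x - t * grad_f(x)
    if k % 25 == 0:
        gap, bound = f(x) - f_star, r0 / (2 * t * k)
        print(f"k={k:3d}  gap={gap:.3e}  bound={bound:.3e}")  # gap <= bound
```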

Theorem [Lipschitz gradient, backtracking]: If $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz continuous, then gradient descent with backtracking line search satisfies

$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2\,t_{\min}\,k}$$

where $t_{\min} = \min\{1, \beta/L\}$.

proof:
The argument is the same as in the fixed-step case; we only need a lower bound $t_{\min}$ on the accepted step size. The backtracking exit condition (with $\alpha = 1/2$) guarantees that the accepted step satisfies

$$f\bigl(x - t\,\nabla f(x)\bigr) \le f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2$$

From the inequality in the last theorem's proof,

$$f(x^+) \le f(x) - \Bigl(1 - \frac{Lt}{2}\Bigr)\,t\,\|\nabla f(x)\|_2^2$$

this exit condition is guaranteed to hold whenever $1 - Lt/2 \ge 1/2$, i.e., for every $t \le t_0 = 1/L$. Backtracking starts at $t = 1$ and shrinks by the factor $\beta$ until the condition holds, so the accepted step is either $t = 1$ (accepted immediately, when $t_0 \ge 1$) or some $t > \beta t_0 = \beta/L$. In either case

$$t \ge t_{\min} = \min\{1, \beta/L\}$$

Theorem [Lipschitz gradient and strong convexity]: If $f(x)$ is $m$-strongly convex and $\nabla f(x)$ is $L$-Lipschitz continuous, then gradient descent with fixed step size $t \le \frac{2}{m+L}$ or with backtracking line search satisfies

$$f(x^{(k)}) - f^\star \le c^k\,\frac{L}{2}\|x^{(0)} - x^\star\|_2^2$$

with $0 < c < 1$.

proof:

$$
\begin{aligned}
\|x^+ - x^\star\|_2^2 &= \|x - t\,\nabla f(x) - x^\star\|_2^2 \\
&= \|x - x^\star\|_2^2 + t^2\|\nabla f(x)\|_2^2 - 2t\,\nabla f(x)^\top (x - x^\star) \\
&\le \|x - x^\star\|_2^2 + t^2\|\nabla f(x)\|_2^2 - 2t\Bigl\{\frac{mL}{m+L}\|x - x^\star\|_2^2 + \frac{1}{m+L}\|\nabla f(x)\|_2^2\Bigr\} \\
&= \Bigl(1 - \frac{2tmL}{m+L}\Bigr)\|x - x^\star\|_2^2 + \Bigl(t^2 - \frac{2t}{m+L}\Bigr)\|\nabla f(x)\|_2^2 \\
&\le \Bigl(1 - \frac{2tmL}{m+L}\Bigr)\|x - x^\star\|_2^2 \qquad \text{(since } t \le 2/(m+L)\text{)}
\end{aligned}
$$

Here the first inequality is the standard coercivity bound for an $m$-strongly convex function with $L$-Lipschitz gradient, $\nabla f(x)^\top (x - x^\star) \ge \frac{mL}{m+L}\|x - x^\star\|_2^2 + \frac{1}{m+L}\|\nabla f(x)\|_2^2$, using $\nabla f(x^\star) = 0$.

Iterating this contraction with $c = 1 - \frac{2tmL}{m+L} \in (0,1)$, we have
$$\|x^{(k)} - x^\star\|_2^2 \le c^k\,\|x^{(0)} - x^\star\|_2^2$$

and from the Lipschitz property of $\nabla f(x)$ (the quadratic upper bound), together with $\nabla f(x^\star) = 0$:
$$f(x^{(k)}) - f(x^\star) \le \nabla f(x^\star)^\top (x^{(k)} - x^\star) + \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 = \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 \le c^k\,\frac{L}{2}\|x^{(0)} - x^\star\|_2^2$$
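To see the geometric rate in practice, the snippet below (same assumed quadratic example; $m$ and $L$ are its extreme eigenvalues) prints the squared-error ratio between successive iterates, which settles near a constant $c < 1$.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_f = lambda x: A @ x - b

eigs = np.linalg.eigvalsh(A)
m, L = eigs.min(), eigs.max()    # strong convexity and Lipschitz constants
t = 2.0 / (m + L)
x_star = np.linalg.solve(A, b)

x = np.zeros(2)
prev = np.linalg.norm(x - x_star) ** 2
for k in range(1, 11):
    x = x - t * grad_f(x)
    err = np.linalg.norm(x - x_star) ** 2
    print(f"k={k:2d}  error={err:.3e}  ratio={err / prev:.4f}")  # ratio ~ c < 1
    prev = err
```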

1.5 Summary

  • For the Lipschitz-gradient setting, gradient descent has convergence rate $O(1/k)$,
    i.e., to get $f(x^{(k)}) - f^\star \le \epsilon$, we need $O(1/\epsilon)$ iterations (see the worked comparison after this list).
  • For the Lipschitz-gradient and strongly convex setting, gradient descent has a geometric rate $O(c^k)$ (called "linear convergence" in the optimization literature), so only $O(\log(1/\epsilon))$ iterations are needed.
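A rough worked comparison of the two bounds, with assumed illustrative values $\|x^{(0)} - x^\star\|_2^2 = 1$, $t = 0.1$, $c = 0.5$, and target accuracy $\epsilon = 10^{-6}$ (the constant in front of the geometric bound is simplified to 1):

```python
import math

R2, t, c, eps = 1.0, 0.1, 0.5, 1e-6   # assumed illustrative values

# Sublinear bound: R2 / (2 t k) <= eps  =>  k >= R2 / (2 t eps)
k_sublinear = math.ceil(R2 / (2 * t * eps))

# Geometric bound: c^k * R2 <= eps  =>  k >= log(R2 / eps) / log(1 / c)
k_geometric = math.ceil(math.log(R2 / eps) / math.log(1 / c))

print(k_sublinear, "vs", k_geometric)  # 5000000 vs 20 iterations
```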

Reference:
http://www.stat.cmu.edu/~ryantibs/convexopt/lectures/05-grad-descent.pdf
http://www.seas.ucla.edu/~vandenbe/236C/lectures/gradient.pdf
http://www.stat.cmu.edu/~ryantibs/convexopt/scribes/05-grad-descent-scribed.pdf
