Reference article: An overview of gradient descent optimization algorithms
Content overview (Nano Banana Pro):

Overview
Gradient Descent
To minimize an objective function $L(\theta)$ parameterized by $\theta \in \mathbb{R}^d$, gradient descent updates the parameters in the direction opposite to the gradient of the objective, $\nabla_{\theta}L(\theta)$; the learning rate $\eta$ determines the size of the steps taken toward a minimum.
In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
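As a toy illustration (not from the article), the following minimal sketch runs this update rule, $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$, on the simple quadratic $L(\theta) = (\theta - 3)^2$ using PyTorch autograd; the objective, starting point and learning rate are arbitrary choices for the example.
import torch

theta = torch.tensor([0.0], requires_grad=True)
eta = 0.1
for _ in range(100):
    loss = (theta - 3.0) ** 2           # objective L(θ)
    loss.backward()                     # compute ∇_θ L(θ)
    with torch.no_grad():
        theta -= eta * theta.grad       # θ ← θ - η ∇_θ L(θ)
    theta.grad.zero_()
print(theta.item())                     # ≈ 3.0, the minimizer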
Gradient Descent Variants
Vanilla Gradient Descent
Also called batch gradient descent: it computes the gradient of the cost function using the entire training set for each parameter update:
$$\theta=\theta-\eta \cdot \nabla_\theta L(\theta)$$
Take a PyTorch MLP on MNIST as an example (50,000 training images): each parameter update uses the whole training set to compute the overall average loss
$$L(\theta)=\frac{1}{N} \sum_{n=1}^N \ell\left(f_\theta\left(x_n\right), y_n\right)$$
and then takes the gradient of this loss to update $\theta$. Concretely:
- Training set $\mathcal{D}=\left\{\left(x_n, y_n\right)\right\}_{n=1}^N$, where $x_n \in \mathbb{R}^{784}$ (a flattened $28 \times 28$ image) and $y_n \in \{0, \ldots, 9\}$. For sample $n$, the MLP $f_\theta(x)$ outputs $z_n=f_\theta\left(x_n\right) \in \mathbb{R}^{10}$.
- The MLP can be defined as:
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )
    def forward(self, x):
        return self.net(x)
For this network, the forward pass first computes the hidden layer with a ReLU activation, $\text{ReLU}(x)=\max(0,x)$: $h=\text{ReLU}\left(W_1 x+b_1\right) \in \mathbb{R}^{256}$, and then applies a second linear transformation: $z=W_2 h+b_2 \in \mathbb{R}^{10}$ (a vector of unnormalized scores, the logits, which is later passed through softmax to obtain normalized class probabilities). Larger entries of $z$ push the prediction toward the corresponding class, while smaller (more negative) entries push it away.
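As a sanity check, here is a minimal sketch (assuming the MLP class above has been defined; the model instantiation and random input are only for the check) that writes the same forward pass out by hand with the layer weights, to show it matches the nn.Sequential computation:
import torch

model = MLP()                                    # the MLP defined above
x = torch.randn(784)                             # one flattened 28x28 "image"
W1, b1 = model.net[0].weight, model.net[0].bias  # shapes [256, 784], [256]
W2, b2 = model.net[2].weight, model.net[2].bias  # shapes [10, 256], [10]

h = torch.relu(W1 @ x + b1)                      # hidden layer, shape [256]
z = W2 @ h + b2                                  # logits, shape [10]
assert torch.allclose(z, model(x))               # same result as the nn.Sequential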
- For the logits $z_n$, softmax converts them into probabilities; for element $k$,
$$p_{n,k}=\frac{e^{z_{n,k}}}{\sum_{j=1}^{10} e^{z_{n,j}}}, \quad \text{with} \quad \sum_k p_{n,k}=1,$$
i.e. the predicted probability of class $k$. In PyTorch, if the input is the full batch with shape (50000, 784), the corresponding output logits have shape (50000, 10). PyTorch's loss function $\ell$, nn.CrossEntropyLoss, works as follows:
For a single logits vector $z_n$, it computes the negative log-likelihood (NLL; PyTorch uses the natural logarithm), i.e. only the negative log of the probability $p_{n,y}$ of the true class $y$:
$$\ell(z_n, y)=-\log p_{n,y}=-\log \left(\operatorname{softmax}(z_n)_y\right)=-\log\left(\frac{e^{z_{n,y}}}{\sum_j e^{z_{n,j}}}\right)=-z_{n,y}+\log \left(\sum_{j} e^{z_{n,j}}\right)$$
For example, if $z_n=[5,2,1,1,0,0,0,0,0,0]$, then $\text{softmax}(z_n) \approx [0.887, 0.044, 0.016, 0.016, 0.006, \ldots, 0.006]$, and $\ell(z_n, y=0)=-5+\ln\left(e^5+e^2+2e+6\right) \approx 0.12$.
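A minimal sketch (toy values only) that checks this worked example against PyTorch's nn.CrossEntropyLoss:
import torch
import torch.nn as nn

z = torch.tensor([[5., 2., 1., 1., 0., 0., 0., 0., 0., 0.]])  # logits, shape [1, 10]
y = torch.tensor([0])                                          # true class

manual  = -z[0, 0] + torch.logsumexp(z[0], dim=0)              # -z_y + log Σ_j e^{z_j}
builtin = nn.CrossEntropyLoss()(z, y)                          # PyTorch's cross-entropy
print(manual.item(), builtin.item())                           # both ≈ 0.12
assert torch.allclose(manual, builtin)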
When processing the full batch of $B=50000$ samples, the logits are $Z \in \mathbb{R}^{50000 \times 10}$ and the loss becomes
$$L(\theta) = \frac{1}{50000}\sum_{n=1}^{50000}\left(-z_{n,y_n}+\log\left(\sum_j e^{z_{n,j}}\right)\right)$$
- When using a full-gradient update (i.e. the entire MNIST training set fits in GPU memory):
- pack the whole training set into one big batch;
- forward pass: compute the logits of all samples at once;
- compute the average loss over the whole dataset: $L(\theta) = \frac{1}{N} \sum_{n=1}^N \ell\left(f_\theta\left(x_n\right), y_n\right)$;
- loss.backward() produces the gradients of all parameters: $\nabla_\theta L(\theta)=\frac{1}{N} \sum_{n=1}^N \nabla_\theta \ell\left(f_\theta\left(x_n\right), y_n\right)$;
- optimizer.step() performs one parameter update.
# Generated by ChatGPT
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# 1. Prepare the data (load everything at once)
transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
X_all = train_dataset.data.view(-1, 28*28).float()  # [N, 784]
y_all = train_dataset.targets                        # [N]

# 2. Define a simple MLP
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )
    def forward(self, x):
        return self.net(x)

model = MLP()
criterion = nn.CrossEntropyLoss(reduction='mean')  # 'mean' is the default
optimizer = optim.SGD(model.parameters(), lr=0.1)

# 3. One iteration of "full gradient descent"
model.train()
optimizer.zero_grad()               # clear gradients
logits = model(X_all)               # forward pass over the entire dataset at once
loss = criterion(logits, y_all)     # average loss over the whole dataset, i.e. L(θ)
loss.backward()                     # autograd accumulates ∇_θ L(θ) into each param.grad
optimizer.step()                    # parameter update: θ ← θ - lr * ∇_θ L(θ)
- When GPU memory is insufficient, another way to implement a full-gradient update is to split the data into several mini-batches, accumulate gradients with repeated loss.backward() calls, and only then update the parameters:
In PyTorch, every call to loss.backward() adds the gradients onto param.grad instead of overwriting them (which is also why optimizer.zero_grad() is needed between updates), so we can iterate over all batches (50,000 samples in total) before calling optimizer.step(); the gradient obtained in this single iteration is then the accumulated result over all samples. Concretely:
1. Use reduction='sum' so that each batch's loss is the sum of the losses within that batch;
2. Call loss.backward() once per batch, so that param.grad accumulates $\sum_n \nabla_\theta \ell_n$ over the whole dataset;
3. After all batches have been processed, divide each param.grad by $N$ to obtain the average gradient;
4. Finally call optimizer.step().
# Generated by ChatGPT
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
model = MLP()
criterion = nn.CrossEntropyLoss(reduction='sum')  # note: sum, not mean
optimizer = optim.SGD(model.parameters(), lr=0.1)
num_epochs = 10                                   # assumed value for this example

for epoch in range(num_epochs):
    model.train()
    N = len(train_dataset)
    optimizer.zero_grad()                            # clear gradients before this full-batch iteration
    for X_batch, y_batch in train_loader:
        X_batch = X_batch.view(X_batch.size(0), -1)  # [B, 784]
        logits = model(X_batch)
        batch_loss = criterion(logits, y_batch)      # summed loss of this batch
        # after backward, param.grad accumulates ∂(batch_loss)/∂θ
        batch_loss.backward()
    # at this point, param.grad = Σ_{all samples} ∇θ ℓ_n (gradient of the sum)
    # for the average gradient ∇θ L(θ) = (1/N) Σ_n ∇θ ℓ_n, divide by N
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad /= N
    optimizer.step()                                 # one update with the average gradient
- Meaning of the cross-entropy loss, from a probabilistic viewpoint: given an input $x$, the model outputs a predicted class distribution $p(k \mid x)$, while the true label corresponds to the one-hot distribution $q(k)=\mathbf{1}[k=y]$. The cross-entropy between them is $H(q,p)=-\sum_k q(k)\log p(k)=-\log p(y)$ (where $p(y)$ is obtained by converting the logits into probabilities with softmax). Minimizing this loss is therefore equivalent to maximizing the probability $p(y)$ of the true class.
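A minimal sketch (toy values only) showing that with a one-hot target the full cross-entropy sum collapses to $-\log p(y)$:
import torch

z = torch.tensor([5., 2., 1., 1., 0., 0., 0., 0., 0., 0.])  # logits
p = torch.softmax(z, dim=0)                                  # predicted distribution p(k|x)
y = 0
q = torch.zeros(10); q[y] = 1.0                              # one-hot true distribution

H = -(q * torch.log(p)).sum()                                # full cross-entropy sum H(q, p)
print(H.item(), -torch.log(p[y]).item())                     # identical values: -log p(y)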
- Backpropagation: recall that
$$\ell(z_n, y)=-\log p_{n,y}=-\log \left(\operatorname{softmax}(z_n)_y\right)=-\log\left(\frac{e^{z_{n,y}}}{\sum_j e^{z_{n,j}}}\right)=-z_{n,y}+\log \left(\sum_{j} e^{z_{n,j}}\right).$$
Dropping the sample index $n$, we differentiate with respect to each logit $z_k$:
Term 1: derivative of $-z_y$ with respect to $z_k$:
when $k=y$: $\frac{\partial\left(-z_y\right)}{\partial z_k}=-1$
when $k \neq y$: $\frac{\partial\left(-z_y\right)}{\partial z_k}=0$
so $\frac{\partial\left(-z_y\right)}{\partial z_k}=-\mathbf{1}[k=y]$.
Term 2: derivative of $\log \left(\sum_j e^{z_j}\right)$ with respect to $z_k$. Let $S=\sum_{j=1}^C e^{z_j}$; then
$$\frac{\partial}{\partial z_k} \log S=\frac{1}{S} \cdot \frac{\partial S}{\partial z_k}=\frac{1}{\sum_j e^{z_j}} \cdot e^{z_k}=\frac{e^{z_k}}{\sum_j e^{z_j}}=p_k.$$
Combining the two terms:
$$\frac{\partial \ell}{\partial z_k}=\frac{\partial}{\partial z_k}\left(-z_y\right)+\frac{\partial}{\partial z_k} \log \left(\sum_j e^{z_j}\right)=-\mathbf{1}[k=y]+p_k.$$
Assume the last layer is linear, with input feature vector $h \in \mathbb{R}^d$ ($h$ is the output of the previous MLP layer), weight matrix $W \in \mathbb{R}^{C \times d}$ and bias $b \in \mathbb{R}^C$ (for MNIST, $C=10$). The output logits are
$$z=W h+b, \quad z_k=w_k^{\top} h+b_k,$$
where $w_k^{\top}$ is the $k$-th row of $W$. By the chain rule, the gradient with respect to the weights $W$ is
$$\frac{\partial \ell}{\partial w_k}=\frac{\partial \ell}{\partial z_k} \cdot \frac{\partial z_k}{\partial w_k}=\left(p_k-\mathbf{1}[k=y]\right) h.$$
In matrix form, letting $\delta=p-y^{\text{one-hot}} \in \mathbb{R}^C$, this is $\frac{\partial \ell}{\partial W}=\delta h^{\top}$.
Correspondingly, the gradient with respect to the bias $b$ is
$$\frac{\partial \ell}{\partial b}=\delta.$$
Likewise, the gradient with respect to the previous layer's output $h$ follows from the chain rule:
$$\frac{\partial \ell}{\partial h}=\sum_{k=1}^C \frac{\partial \ell}{\partial z_k} \cdot \frac{\partial z_k}{\partial h}=\sum_{k=1}^C\left(p_k-\mathbf{1}[k=y]\right) w_k=W^{\top} \delta,$$
which is the gradient passed back to the previous layer. (These analytic gradients are checked numerically in the sketch below.)
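A minimal sketch (random toy values and an arbitrary class label, chosen only for the check) that compares the analytic gradients above with what autograd computes:
import torch
import torch.nn.functional as F

C, d = 10, 256
h = torch.randn(d, requires_grad=True)           # features from the previous layer
W = torch.randn(C, d, requires_grad=True)
b = torch.randn(C, requires_grad=True)
y = torch.tensor(3)                              # arbitrary true class

z = W @ h + b                                    # logits
z.retain_grad()                                  # keep the gradient of this non-leaf tensor
loss = F.cross_entropy(z.unsqueeze(0), y.unsqueeze(0))
loss.backward()

p = torch.softmax(z, dim=0)
delta = p.detach() - F.one_hot(y, C).float()     # δ = p - y_onehot
assert torch.allclose(z.grad, delta, atol=1e-6)                        # ∂ℓ/∂z = p - onehot
assert torch.allclose(W.grad, torch.outer(delta, h.detach()), atol=1e-6)  # ∂ℓ/∂W = δ h^T
assert torch.allclose(b.grad, delta, atol=1e-6)                        # ∂ℓ/∂b = δ
assert torch.allclose(h.grad, W.detach().T @ delta, atol=1e-6)         # ∂ℓ/∂h = W^T δ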
- Structure of DataLoader():
# Generated by ChatGPT
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.ToTensor()
train_dataset = datasets.MNIST(root="./data",
                               train=True,
                               download=True,
                               transform=transform)
train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          shuffle=True)

img0, label0 = train_dataset[0]
print(img0.shape)   # torch.Size([1, 28, 28])
print(label0)       # e.g., 5

for X_batch, y_batch in train_loader:
    print(X_batch.shape, y_batch.shape)
    break
# X_batch.shape = torch.Size([64, 1, 28, 28])
– datasets.MNIST(…) inherits from torch.utils.data.Dataset; for a single sample it returns (image, label).
– train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True): at the start of each epoch the sample indices $[0, 1, \ldots, N-1]$ are shuffled; the loader then takes batch_size=64 samples (each of shape $1 \times 28 \times 28$) and stacks them along a new dimension into a tensor $X_{\text{batch}} \in \mathbb{R}^{B \times 1 \times 28 \times 28}$, together with $y_{\text{batch}}$ of shape $[64]$.
for X_batch, y_batch in train_loader:
    X_batch = X_batch.view(X_batch.size(0), -1)  # [B, 784]
    logits = model(X_batch)
    batch_loss = criterion(logits, y_batch)      # summed loss of this batch
    # after backward, param.grad accumulates ∂(batch_loss)/∂θ
    batch_loss.backward()
– Note X_batch = X_batch.view(X_batch.size(0), -1): since $X_{\text{batch}} \in \mathbb{R}^{B \times 1 \times 28 \times 28}$, X_batch.size(0) equals $B=64$. view does not touch the underlying memory; it merely reinterprets the same contiguous block with a new shape. .view(64, -1) therefore means: fix the first dimension to $B$; the second dimension, given as -1, is inferred automatically as the total number of elements divided by the product of the other dimensions, $(64 \times 1 \times 28 \times 28)/64=784$. The new shape is [B, 784], where each row is a $28 \times 28$ image flattened into a 784-dimensional vector.
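A minimal sketch (toy tensor with an assumed batch size of 2) of what .view does: the same underlying memory is reinterpreted with a new shape, and -1 is inferred from the total element count.
import torch

X = torch.arange(2 * 1 * 28 * 28).reshape(2, 1, 28, 28)  # pretend batch of B=2 "images"
X_flat = X.view(X.size(0), -1)                            # -1 inferred as 1*28*28 = 784
print(X_flat.shape)                                       # torch.Size([2, 784])
print(X_flat.data_ptr() == X.data_ptr())                  # True: no copy, same memory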
– For the code above (the multi-batch case used when GPU memory is insufficient): in each epoch, the DataLoader takes the indices $[0, 1, \ldots, N-1]$ of train_dataset, shuffles them (shuffle=True), and splits them into blocks of batch_size=64: the 1st batch is the first 64 shuffled samples, the 2nd batch the next 64, and so on. Within each epoch the data are thus traversed in shuffled order, and every sample appears exactly once (sampling without replacement).
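A minimal sketch (toy dataset of an assumed size N=1000, where each "sample" is just its own index) checking this behaviour: concatenating the samples of all batches in one epoch yields a permutation of the full index set.
import torch
from torch.utils.data import DataLoader, TensorDataset

N = 1000
toy_dataset = TensorDataset(torch.arange(N), torch.zeros(N))  # "data" is just the index
loader = DataLoader(toy_dataset, batch_size=64, shuffle=True)

seen = torch.cat([idx for idx, _ in loader])              # gather all indices of one epoch
assert seen.shape[0] == N                                 # every sample appears...
assert torch.equal(seen.sort().values, torch.arange(N))   # ...exactly once (a permutation)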
Stochastic Gradient Descent
A parameter update is performed for each individual training example $(x_n, y_n)$:
$$\theta=\theta-\eta \cdot \nabla_\theta L\left(\theta ; x_n ; y_n\right)$$
By performing only one update at a time, SGD avoids the redundant gradient computations that batch gradient descent performs on similar examples, and it performs frequent updates with high variance.
When the learning rate is slowly decreased, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local minimum for non-convex optimization and to the global minimum for convex optimization.
# Generated by ChatGPT
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        X_batch = X_batch.view(X_batch.size(0), -1)
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = criterion(logits, y_batch)  # loss of a single sample
        loss.backward()
        optimizer.step()
Mini-batch Gradient Descent
A parameter update is performed using a mini-batch of $B$ training examples (in deep learning this variant is also commonly just called SGD); each update is:
$$\theta=\theta-\eta \cdot \nabla_\theta L\left(\theta ; x_{(n: n+B)} ; y_{(n: n+B)}\right)$$
# Generated by ChatGPT
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        X_batch = X_batch.view(X_batch.size(0), -1)
        optimizer.zero_grad()
        logits = model(X_batch)
        loss = criterion(logits, y_batch)  # average loss over the mini-batch
        loss.backward()
        optimizer.step()
– With shuffle=True, the data indices are reshuffled at the start of every epoch, so the sample order within each epoch is random.
– A DataLoader essentially stores "dataset + sampler type + batch_size":
for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        ...
'''
When `for X_batch, y_batch in train_loader:` is executed,
iter(train_loader) is called, creating an iterator object;
internally, this iterator rebuilds the sampling order (shuffle=True);
each call to __next__() then takes B samples in that random order and forms one batch.
'''
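A minimal sketch (assuming the train_loader defined above) of doing the same thing manually with iter() and next():
batch_iter = iter(train_loader)       # fresh iterator, fresh shuffle order
X_batch, y_batch = next(batch_iter)   # first batch of this (re-shuffled) pass
print(X_batch.shape, y_batch.shape)   # torch.Size([64, 1, 28, 28]) torch.Size([64])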
Momentum
Momentum accelerates convergence and dampens oscillations (written here in PyTorch's formulation, with momentum coefficient $\mu$ and current gradient $g_t$):
$$\begin{aligned} v_{t+1}&=\mu v_t+g_t, \\ \theta_{t+1}&=\theta_t-\eta v_{t+1}. \end{aligned}$$
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Nesterov:
# optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
For each param, the optimizer keeps a momentum_buffer of the same shape in its per-parameter state, and at each update it does:
# Pseudocode (state is the optimizer's per-parameter state dict)
for param in model.parameters():
    if param.grad is None:
        continue
    d_p = param.grad                                            # current gradient g_t
    if 'momentum_buffer' not in state:
        buf = state['momentum_buffer'] = torch.clone(d_p).detach()  # first step: v_1 = g_0
    else:
        buf = state['momentum_buffer']
        buf.mul_(momentum).add_(d_p)                            # v_{t+1} = μ v_t + g_t
    param.add_(buf, alpha=-lr)                                  # θ ← θ - η v_{t+1}
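A minimal sketch (toy one-parameter problem with arbitrary values) checking that optim.SGD with momentum matches the manual update $v \leftarrow \mu v + g$, $\theta \leftarrow \theta - \eta v$:
import torch
import torch.optim as optim

lr, mu = 0.1, 0.9
theta = torch.tensor([1.0], requires_grad=True)
opt = optim.SGD([theta], lr=lr, momentum=mu)

theta_manual, v = theta.detach().clone(), torch.zeros(1)
for _ in range(5):
    opt.zero_grad()
    loss = (theta ** 2).sum()            # L(θ) = θ², so g = 2θ
    loss.backward()
    g = 2 * theta_manual                 # the same gradient, computed by hand
    v = mu * v + g                       # v_{t+1} = μ v_t + g_t
    theta_manual = theta_manual - lr * v # θ ← θ - η v_{t+1}
    opt.step()
    assert torch.allclose(theta, theta_manual, atol=1e-6)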
Nesterov
Nesterov momentum "takes a step first, then corrects" (again in PyTorch's formulation):
$$\begin{aligned} v_{t+1} & =\mu v_t+g_t \\ \theta_{t+1} & =\theta_t-\eta\left(g_t+\mu v_{t+1}\right) \end{aligned}$$
# Nesterov:
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Pseudocode (buf already holds v_{t+1} = μ v_t + g_t, as above)
if nesterov:
    update = d_p + momentum * buf   # use g_t + μ v_{t+1}
else:
    update = buf                    # use v_{t+1}
param.add_(update, alpha=-lr)
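And a minimal sketch (same toy one-parameter problem as above) checking that nesterov=True matches the manual rule $\theta \leftarrow \theta - \eta(g + \mu v)$:
import torch
import torch.optim as optim

lr, mu = 0.1, 0.9
theta = torch.tensor([1.0], requires_grad=True)
opt = optim.SGD([theta], lr=lr, momentum=mu, nesterov=True)

theta_manual, v = theta.detach().clone(), torch.zeros(1)
for _ in range(5):
    opt.zero_grad()
    loss = (theta ** 2).sum()                        # g = 2θ
    loss.backward()
    g = 2 * theta_manual
    v = mu * v + g                                   # v_{t+1} = μ v_t + g_t
    theta_manual = theta_manual - lr * (g + mu * v)  # θ ← θ - η (g_t + μ v_{t+1})
    opt.step()
    assert torch.allclose(theta, theta_manual, atol=1e-6)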