Stochastic Gradient Descent (SGD)

This article gives a detailed introduction to stochastic gradient descent (SGD) and its use in neural networks. It compares SGD with batch gradient descent and discusses learning-rate schedules, the role of the momentum term, and the importance of the order in which the data is presented.


Optimization: Stochastic Gradient Descent


Overview

Batch methods, such as limited-memory BFGS, which use the full training set to compute the next parameter update at each iteration, tend to converge very well to local optima. They are also straightforward to get working given a good off-the-shelf implementation (e.g. minFunc), because they have very few hyper-parameters to tune. However, in practice computing the cost and gradient for the entire training set can be very slow, and sometimes intractable on a single machine if the dataset is too big to fit in main memory. Another issue with batch optimization methods is that they do not give an easy way to incorporate new data in an 'online' setting. Stochastic Gradient Descent (SGD) addresses both of these issues by following the negative gradient of the objective after seeing only a single or a few training examples. The use of SGD in the neural network setting is motivated by the high cost of running backpropagation over the full training set. SGD can overcome this cost and still lead to fast convergence.

Stochastic Gradient Descent

The standard gradient descent algorithm updates the parameters θ of the objective J(θ) as,

$$\theta = \theta - \alpha \nabla_\theta E[J(\theta)]$$

where the expectation in the above equation is approximated by evaluating the cost and gradient over the full training set. Stochastic Gradient Descent (SGD) simply does away with the expectation in the update and computes the gradient of the parameters using only a single or a few training examples. The new update is given by,

$$\theta = \theta - \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$$

with a pair $(x^{(i)}, y^{(i)})$ from the training set.
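As a rough sketch of this update (the `grad_J` callable and the least-squares objective below are illustrative assumptions, not part of the tutorial), a single SGD step might look like:

```python
import numpy as np

def sgd_step(theta, x_i, y_i, alpha, grad_J):
    """One SGD update: theta <- theta - alpha * grad_J(theta; x_i, y_i).

    grad_J is assumed to return the gradient of the objective with respect
    to theta, evaluated on the single training example (x_i, y_i).
    """
    return theta - alpha * grad_J(theta, x_i, y_i)

# Illustrative objective: J(theta) = 0.5 * (theta . x - y)^2
grad_lsq = lambda theta, x, y: (theta @ x - y) * x

theta = np.zeros(3)
x_i, y_i = np.array([1.0, 2.0, 3.0]), 4.0
theta = sgd_step(theta, x_i, y_i, alpha=0.01, grad_J=grad_lsq)
```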

Generally each parameter update in SGD is computed w.r.t. a few training examples or a minibatch as opposed to a single example. The reason for this is twofold: first, it reduces the variance in the parameter update and can lead to more stable convergence; second, it allows the computation to take advantage of the highly optimized matrix operations that should be used in a well-vectorized computation of the cost and gradient. A typical minibatch size is 256, although the optimal size of the minibatch can vary for different applications and architectures.
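A minimal NumPy sketch of one minibatch epoch, again assuming a linear least-squares objective purely for illustration, could look like the following; the point is that the gradient for each minibatch comes from a single vectorized matrix product rather than a loop over individual examples:

```python
import numpy as np

def minibatch_sgd_epoch(theta, X, y, alpha, batch_size=256):
    """One epoch of minibatch SGD on an illustrative least-squares objective."""
    n = X.shape[0]
    for start in range(0, n, batch_size):
        Xb = X[start:start + batch_size]      # (B, d) minibatch of inputs
        yb = y[start:start + batch_size]      # (B,)   minibatch of targets
        residual = Xb @ theta - yb            # vectorized forward pass
        grad = Xb.T @ residual / Xb.shape[0]  # average gradient over the batch
        theta = theta - alpha * grad
    return theta
```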

In SGD the learning rate α is typically much smaller than the corresponding learning rate in batch gradient descent because there is much more variance in the update. Choosing the proper learning rate and schedule (i.e. changing the value of the learning rate as learning progresses) can be fairly difficult. One standard method that works well in practice is to use a constant learning rate small enough to give stable convergence in the initial epoch (full pass through the training set) or two of training, and then halve the learning rate as convergence slows down. An even better approach is to evaluate a held-out set after each epoch and anneal the learning rate when the change in objective between epochs falls below a small threshold. This tends to give good convergence to a local optimum. Another commonly used schedule is to anneal the learning rate at each iteration $t$ as $\frac{a}{b+t}$, where $a$ and $b$ dictate the initial learning rate and when the annealing begins, respectively. More sophisticated methods include using a backtracking line search to find the optimal update.
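A sketch of the two schedules mentioned above (the function names and the threshold value are assumptions chosen for illustration):

```python
def annealed_lr(a, b, t):
    """The a / (b + t) schedule: a and b set the initial learning rate
    and how quickly the annealing kicks in."""
    return a / (b + t)

def halve_on_plateau(alpha, prev_objective, curr_objective, threshold=1e-4):
    """Halve the learning rate when the held-out objective improves by less
    than a small threshold between epochs."""
    if prev_objective - curr_objective < threshold:
        alpha *= 0.5
    return alpha
```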

One final but important point regarding SGD is the order in which we present the data to the algorithm. If the data is given in some meaningful order, this can bias the gradient and lead to poor convergence. Generally a good method to avoid this is to randomly shuffle the data prior to each epoch of training.
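For example, a small helper (hypothetical, not from the tutorial) that reshuffles the training set before every epoch:

```python
import numpy as np

def shuffled_epochs(X, y, num_epochs, seed=0):
    """Yield a freshly shuffled copy of the training data for each epoch."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(num_epochs):
        perm = rng.permutation(n)   # a new random order before every epoch
        yield X[perm], y[perm]
```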

Momentum

If the objective has the form of a long shallow ravine leading to the optimum and steep walls on the sides, standard SGD will tend to oscillate across the narrow ravine since the negative gradient will point down one of the steep sides rather than along the ravine towards the optimum. The objectives of deep architectures have this form near local optima and thus standard SGD can lead to very slow convergence particularly after the initial steep gains. Momentum is one method for pushing the objective more quickly along the shallow ravine. The momentum update is given by,

$$\begin{aligned} v &= \gamma v + \alpha \nabla_\theta J(\theta; x^{(i)}, y^{(i)}) \\ \theta &= \theta - v \end{aligned}$$

In the above equation v is the current velocity vector, which is of the same dimension as the parameter vector θ. The learning rate α is as described above, although when using momentum α may need to be smaller since the magnitude of the gradient will be larger. Finally, γ ∈ (0,1] determines for how many iterations the previous gradients are incorporated into the current update. Generally γ is set to 0.5 until the initial learning stabilizes and then is increased to 0.9 or higher.
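A sketch of the momentum update in the same style as the earlier SGD step (again, `grad_J` is an assumed callable used only for illustration):

```python
def momentum_step(theta, v, x_i, y_i, alpha, gamma, grad_J):
    """One SGD-with-momentum update:
        v     <- gamma * v + alpha * grad_J(theta; x_i, y_i)
        theta <- theta - v
    gamma is typically 0.5 until learning stabilizes, then 0.9 or higher.
    """
    v = gamma * v + alpha * grad_J(theta, x_i, y_i)
    theta = theta - v
    return theta, v
```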

### Overview of Stochastic Gradient Descent (SGD)

Stochastic gradient descent is a widely used optimization algorithm in machine learning and deep learning. By updating the model parameters with only a single example or a small batch of examples at each iteration, it significantly reduces the computational cost and improves training efficiency[^4].

#### Basic idea

The core idea of SGD is to adjust the model weights at each iteration using the gradient computed on a single data sample. Compared with traditional batch gradient descent, this can converge to a local optimum faster, especially when working with large datasets[^3].

#### Implementation details

In practice, the Torch framework provides `optim.sgd` as one standard implementation of stochastic gradient descent. It lets the user configure hyper-parameters such as the learning rate, momentum term, and weight decay coefficient, all of which matter for model performance[^2].

#### Typical use cases

Because of its efficiency and flexibility, SGD is used in a wide range of settings:

- **Training on large datasets**: when the data is massive, traditional batch gradient descent may be impossible to run due to its memory requirements; using SGD effectively alleviates this problem[^5].
- **Online learning**: in real-time streaming analysis, new data keeps arriving and the model must adapt quickly to changing trends without reloading the full history, which is exactly where SGD excels[^1].

Below is a simple SGD example based on PyTorch:

```python
import torch
from torch import nn, optim

# Define a simple linear regression model
model = nn.Linear(in_features=10, out_features=1)

# Create the loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Simulated input and target data
inputs = torch.randn(100, 10)
targets = torch.randn(100, 1)

for epoch in range(10):                        # run several epochs
    total_loss = 0
    for i in range(len(inputs)):
        input_data = inputs[i].unsqueeze(0)    # take a single example
        target_data = targets[i].unsqueeze(0)

        optimizer.zero_grad()                  # clear previous gradients
        output = model(input_data)             # compute the prediction
        loss = criterion(output, target_data)  # error on this example
        loss.backward()                        # backpropagate to get gradients
        optimizer.step()                       # update the parameters

        total_loss += loss.item()

    print(f'Epoch {epoch}, Loss: {total_loss / len(inputs)}')
```