Common causes of nans during training

This article covers several common causes of NaNs appearing during training, including gradient blow-up, a bad learning-rate policy, a faulty loss function, problematic input data, and badly configured layers, along with the corresponding remedies.


Original: https://stackoverflow.com/questions/33962226/common-causes-of-nans-during-training


Good question.
I came across this phenomenon several times. Here are my observations:


Gradient blow up

Reason: large gradients throw the learning process off-track.

What you should expect: Looking at the runtime log, you should watch the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration; eventually the loss will be too large to be represented by a floating-point variable and it will become nan.

What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
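For illustration, here is a hedged sketch of the two adjustments (the concrete numbers and layer names are made up for this example, not part of the original answer):

# solver.prototxt (excerpt)
# base_lr: 0.01      # rate at which the gradients blew up
base_lr: 0.001       # reduced by an order of magnitude

# train_val.prototxt (excerpt) -- down-weight only the offending loss layer
layer {
  name: "aux_loss"               # hypothetical layer identified from the log
  type: "SoftmaxWithLoss"
  bottom: "aux_fc"
  bottom: "label"
  top: "aux_loss"
  loss_weight: 0.1               # instead of the default 1.0
}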


Bad learning rate policy and params

Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead; this invalid rate multiplies all updates and thus invalidates all parameters.

What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:

... sgd_solver.cpp:106] Iteration 0, lr = -nan

What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define the max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
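For example, a minimal solver.prototxt sketch for the "poly" policy (the numbers are illustrative only):

# solver.prototxt (excerpt)
base_lr: 0.01
lr_policy: "poly"
power: 0.9
max_iter: 100000     # "poly" computes lr from iter/max_iter; leaving this out yields nan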


Faulty Loss function

Reason: Sometimes the computation of the loss in the loss layers causes nans to appear, for example feeding an InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.

What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: See if you can reproduce the error, add printouts to the loss layer, and debug the error.

For example: I once used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all, the computed loss produced nans. In that case, working with batches that were large enough (with respect to the number of labels in the set) was enough to avoid this error.
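To make that failure mode concrete, here is a toy NumPy sketch (not the actual Caffe layer) of such a frequency-normalized loss; when a label is missing from the batch the normalization divides 0 by 0 and the total loss becomes nan:

import numpy as np

def frequency_normalized_loss(per_sample_loss, labels, num_labels):
    # Sum the per-sample losses for each label, then normalize by how
    # often that label occurs in the batch.
    counts = np.bincount(labels, minlength=num_labels).astype(np.float64)
    sums = np.zeros(num_labels)
    np.add.at(sums, labels, per_sample_loss)
    # A label that never appears has counts == 0, so sums/counts is 0/0 = nan,
    # which then poisons the summed loss.
    return np.sum(sums / counts)

labels = np.array([0, 0, 1, 1])              # label 2 never occurs in this batch
losses = np.array([0.3, 0.1, 0.4, 0.2])
print(frequency_normalized_loss(losses, labels, num_labels=3))   # -> nan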


Faulty input

Reason: you have an input with nan in it!

What you should expect: once the learning process "hits" this faulty input, the output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.

What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it, and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
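As a quick sanity check, here is a minimal Python sketch (assuming an HDF5 input file readable with h5py; the file name and dataset layout are hypothetical) that scans every top-level dataset for non-finite values:

import h5py
import numpy as np

# "train_data.h5" is a placeholder -- point this at your own HDF5 input file.
with h5py.File("train_data.h5", "r") as f:
    for name, node in f.items():
        if not isinstance(node, h5py.Dataset):
            continue                      # skip groups, only check datasets
        data = np.asarray(node)
        bad = ~np.isfinite(data)          # flags nan and +/- inf entries
        if bad.any():
            print(name, ":", int(bad.sum()), "non-finite values, first at index",
                  tuple(np.argwhere(bad)[0]))

The same idea applies to lmdb/leveldb sources: decode each record and check np.isfinite on it before training.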


stride larger than kernel size in "Pooling" layer

For some reason, choosing stride > kernel_size for pooling may result in nans. For example:

layer {
  name: "faulty_pooling"
  type: "Pooling"
  bottom: "x"
  top: "y"
  pooling_param {
    pool: AVE
    stride: 5
    kernel_size: 3
  }
}

results in nans in y.


Instabilities in "BatchNorm"

It was reported that under some settings the "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in bvlc/caffe, and PR #5136 attempts to fix it.


Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' will make caffe print more debug information (including gradient magnitudes and activation values) to the log during training. This information can help in spotting gradient blow-ups and other problems in the training process.
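The flag is a single line in the solver definition:

# solver.prototxt (excerpt)
debug_info: true     # log per-layer activation and gradient magnitudes each iteration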

