Fixing NaN loss values during model training

豆豆大王呀

Published 2025-02-24 22:57:00

Tags: deep learning, python, pytorch

Article link: https://blog.youkuaiyun.com/weixin_43900395/article/details/145840467

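The advice you find online is all much the same and boils down to these cases:
1. Learning rate too large. If NaN still appears after setting the learning rate to 0, this cause can be ruled out.
2. Exploding gradients. Use gradient clipping to limit the gradient magnitude.
3. Poor weight initialization. Check the initialization scheme and switch to a standard one such as Xavier or Kaiming initialization.
4. Unstable model. Try adding regularization, and so on.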
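For reference, item 2 can be sketched roughly as below. The model, data, and `max_norm=1.0` are hypothetical stand-ins chosen for illustration, not values from my own training script.

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(4, 1)                      # hypothetical stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 4) * 1e3             # deliberately large inputs -> large gradients
targets = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Rescale all gradients so that their combined L2 norm is at most max_norm.
# The return value is the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the total gradient norm after clipping, for comparison.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(norm_before.item(), norm_after.item())
```

The clipping call goes between `loss.backward()` and `optimizer.step()`, so the optimizer only ever sees the rescaled gradients.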
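But tracking the problem down this way is tedious, and you can spend ages without finding where it comes from. I recommend using this call directly: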

import torch

# Enable anomaly detection before training starts
torch.autograd.set_detect_anomaly(True)
# Training loop
for epoch in range(num_epochs):
    ...  # training step omitted

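Put this call at the very start of training, for example just outside the training loop. With anomaly detection enabled, autograd checks for NaN values during the backward pass and reports the traceback of the forward operation whose backward function failed. Anomaly detection slows training down, so turn it off again once the problem has been located.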
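To see what the check does in isolation, here is a minimal sketch that is not taken from my training code. It uses `torch.autograd.detect_anomaly()`, the context-manager form of the same switch, and a toy expression whose forward value is finite but whose gradient is 0/0.

```python
import torch

x = torch.zeros(1, requires_grad=True)
error_message = None

with torch.autograd.detect_anomaly():
    # Forward pass is fine: sqrt(0) * 0 = 0.
    y = (torch.sqrt(x) * 0.0).sum()
    try:
        # Backward of sqrt computes grad / (2 * sqrt(x)) = 0 / 0 = NaN,
        # which anomaly detection turns into a RuntimeError.
        y.backward()
    except RuntimeError as err:
        error_message = str(err)

print(error_message)
```

The error names the backward function that returned NaN, and the accompanying warning carries the traceback of the forward line that created it.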
For example, my error output was:

sc/anaconda3/envs/yjhtorch16/lib/python3.8/site-packages/torch/autograd/__init__.py:173: 
UserWarning: Error detected in MulBackward0. 
Traceback of forward call that caused the error:
  File "main_wide.py", line 583, in <module>
    main()
  File "main_wide.py", line 199, in main
    train_loss, train_loss_x, train_loss_u, train_loss_sim = train(labeled_trainloader, unlabeled_trainloader, model, optimizer, ema_optimizer, criterion, criterion_simclr, threshold, epoch, use_cuda)
  File "main_wide.py", line 302, in train
    Ls = criterion_simclr(features_prob)
  File "/home/user-wsc/anaconda3/envs/yjhtorch16/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user-wsc/code/Code_LEAF/losses.py", line 86, in forward
    exp_logits = torch.exp(logits) * logits_mask
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

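The error was detected in MulBackward0, and the forward traceback points to the line exp_logits = torch.exp(logits) * logits_mask. This indicates numerical instability during gradient computation, originating from torch.exp(logits). Problems like this typically show up around exponentials: if the input to exp is very large, the result overflows, and subsequent operations can produce NaN.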
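The overflow is easy to reproduce on its own. The values below are made up for illustration, and only the variable names mirror the failing line in losses.py.

```python
import torch

logits = torch.tensor([10.0, 100.0])       # float32 by default
logits_mask = torch.tensor([1.0, 0.0])     # hypothetical mask values

exp_logits = torch.exp(logits)             # exp(100) exceeds the float32 range -> inf
masked = exp_logits * logits_mask          # inf * 0 -> nan

print(exp_logits)
print(masked)
```

Note that the mask does not rescue the overflowed entry: multiplying inf by 0 gives NaN rather than 0.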
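The traceback leads back to Ls = criterion_simclr(features_prob), which means the values in features_prob were very large. In other words, the features I had the model return were very large. I therefore added feature normalization just before the model output, and the problem was solved.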

import torch.nn.functional as F
feature = F.normalize(feature, dim=1)  # L2-normalize each feature vector
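Why this helps can be sketched as follows, assuming the loss builds its logits from dot products between feature vectors (an assumption on my part for this illustration, since the relevant part of losses.py is not shown above). The feature values are made up.

```python
import torch
import torch.nn.functional as F

feature = torch.tensor([[300.0, 400.0],
                        [3.0, 4.0]])        # hypothetical unnormalized features

raw_sim = feature @ feature.T               # dot products grow with feature magnitude
normed = F.normalize(feature, dim=1)        # each row now has unit L2 norm
norm_sim = normed @ normed.T                # cosine similarities, bounded by [-1, 1]

print(torch.exp(raw_sim))                   # overflows to inf for the large entries
print(torch.exp(norm_sim))                  # stays finite
```

After normalization the similarities cannot exceed 1 in magnitude, so the exponential stays well within the float32 range.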