[Program Error: Gradient Computation Error] RuntimeError: one of the variables needed for gradient computation has been modified by

HUI 别摸鱼了

Last modified on 2024-07-13 18:51:31


First published on 2024-04-29 22:47:28

Article link: https://blog.youkuaiyun.com/weixin_47244593/article/details/138327847


1. Problem

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

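While computing gradients, autograd found that a tensor it needs for the backward pass was modified by an in-place operation: the tensor is now at version 4, but the graph recorded it at version 3.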

2. Cause

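By default, PyTorch records the operations applied to tensors so that it can compute gradients. Many operations save their inputs or outputs for use in the backward pass. An in-place operation overwrites those saved values, so autograd can no longer compute a correct gradient and raises this error instead.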
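To make the cause concrete, here is a minimal sketch that reproduces the error on the CPU. The tensor names are made up for illustration and are not from my training code; `_version` is the internal counter that the "is at version 4; expected version 3" part of the message refers to.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.exp()                  # exp() saves its output y for the backward pass
version_before = y._version  # autograd's per-tensor modification counter
y += 1                       # in-place, equivalent to y.add_(1)
version_after = y._version   # the counter was bumped by the in-place op

message = ""
try:
    y.sum().backward()       # exp's backward needs the original y
except RuntimeError as err:
    message = str(err)

print(version_before, version_after)
print(message)
```

The backward pass of exp() needs the value of y as it was when it was computed, so the version mismatch is reported instead of silently returning a wrong gradient.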

3. Commonly suggested fixes

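Note: first try to rule out the usual causes with the following checks:

Make sure activations do not run in place: change nn.ReLU(inplace=True) to nn.ReLU(inplace=False);
Rewrite in-place statements such as x += y as x = x + y; (my problem ultimately turned out to be of this kind, and finding it takes a very careful check of every variable name)
Downgrade PyTorch to version 1.4 or earlier;
Move the parameter-update steps (the optimizer.step() calls) later so that they run together, after all backward passes;
Change loss.backward() to loss.backward(retain_graph=True);
In PyTorch, the in-place operation can also come from two activation functions chained back to back; remove one of them.
None of these resolved the error for me.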
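The point about moving the update steps is easier to see in code. The sketch below is only an illustration with a made-up two-layer model and two made-up losses (body, head, loss_a and loss_b are hypothetical names, not from my project). optimizer.step() updates the weights in place, so calling it between two backward passes over the same graph invalidates the tensors that the second backward pass still needs.

```python
import torch
from torch import nn

torch.manual_seed(0)
body = nn.Linear(4, 4)   # hypothetical shared layers
head = nn.Linear(4, 1)
params = list(body.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.1)
x = torch.randn(8, 4)

# Problematic order: step() runs between the two backward passes.
out = head(body(x))
loss_a = out.mean()
loss_b = out.pow(2).mean()
opt.zero_grad()
loss_a.backward(retain_graph=True)
opt.step()                         # modifies the weights in place
try:
    loss_b.backward()              # still needs the old head.weight
    step_between_failed = False
except RuntimeError:
    step_between_failed = True

# Reordered: finish every backward pass, then update once.
out = head(body(x))
loss_a = out.mean()
loss_b = out.pow(2).mean()
opt.zero_grad()
loss_a.backward(retain_graph=True)
loss_b.backward()
opt.step()
updated_ok = True
```

Summing the losses and calling backward() once on the sum would work as well and avoids retain_graph=True altogether.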

4. Fixing the bug

Add the following line at the start of the training code:

torch.autograd.set_detect_anomaly(True)

The error message then points more specifically to the part of the network whose gradient computation failed.
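As a small sketch of what this looks like in practice, the snippet below uses the context-manager form torch.autograd.detect_anomaly() so that the slowdown from anomaly detection is limited to one block. The tensors are toy values chosen for illustration. With anomaly detection on, PyTorch also prints the traceback of the forward call that created the failing node (here the sigmoid), which is the line to inspect.

```python
import torch

x = torch.ones(3, requires_grad=True)
message = ""

with torch.autograd.detect_anomaly():
    y = torch.sigmoid(x)     # sigmoid saves its output for backward
    y.mul_(2)                # in-place op that overwrites that output
    try:
        y.sum().backward()
    except RuntimeError as err:
        message = str(err)

print(message)
```

Anomaly detection slows training down noticeably, so it is best switched off again once the offending line has been found.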

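Solutions:
- Use torch.autograd.Variable: wrap the tensor to be modified in torch.autograd.Variable, then modify the underlying tensor through variable.data. Be careful with this one: Variable has been deprecated since PyTorch 0.4, where it was merged into Tensor, and changes made through .data are invisible to autograd, so this can silence the error while producing wrong gradients.
- Use torch.Tensor.detach(): detach the tensor from the computation graph so that operations on it are no longer tracked, then apply the in-place operation. The detached tensor still shares memory with the original, so if the original is needed for the backward pass, use detach().clone() instead.
- Avoid in-place operations: prefer creating a new tensor for the result, for example with torch.Tensor.clone(), instead of modifying the existing one.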
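The difference between detach() and detach().clone() can be checked with a short sketch. The helper backward_ok below is a hypothetical name introduced only for this illustration: it builds a tiny graph, applies a modification to the intermediate tensor, and reports whether the backward pass still succeeds.

```python
import torch

def backward_ok(modify):
    """Return True if backward() succeeds after modify(y) has run."""
    x = torch.ones(3, requires_grad=True)
    y = torch.sigmoid(x)          # output is saved for the backward pass
    modify(y)
    try:
        y.sum().backward()
        return True
    except RuntimeError:
        return False

# detach() alone returns a tensor that shares memory with y,
# so the in-place zero_() is still seen by autograd.
alias_ok = backward_ok(lambda y: y.detach().zero_())

# detach().clone() is an independent copy; y is left untouched.
copy_ok = backward_ok(lambda y: y.detach().clone().zero_())

print(alias_ok, copy_ok)
```

In other words, detach() by itself is only safe when the original tensor is not needed for the backward pass anymore.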

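# Copy the original data instead of sharing its memory
# Before: plain assignment is an alias, and slicing returns a view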
residual1 = new_points1
new_points1_tem = new_points1[0:1, 0:1, :, 0:1]

# After: clone() gives each tensor its own memory
residual1 = new_points1.clone()
new_points1_tem = new_points1[:, :, :1, :].clone()
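Why the clone() calls matter can be seen in a small stand-alone sketch. The shape (2, 3, 4, 5) and the values are assumptions made for the demonstration, since the real new_points1 comes from the network. Plain assignment and slicing both share memory with the source tensor, so an in-place change through either of them also changes new_points1.

```python
import torch

# Hypothetical stand-in for the real feature tensor (all values non-zero).
new_points1 = torch.arange(1.0, 121.0).reshape(2, 3, 4, 5)

alias = new_points1                        # same tensor object
view = new_points1[:, :, :1, :]            # view into the same memory
copy = new_points1[:, :, :1, :].clone()    # independent copy

shares_memory = view.data_ptr() == new_points1.data_ptr()
copy_is_separate = copy.data_ptr() != new_points1.data_ptr()

view.zero_()                               # in-place change through the view

source_changed = bool((new_points1[:, :, :1, :] == 0).all())
copy_unchanged = bool((copy != 0).all())

print(alias is new_points1, shares_memory, copy_is_separate)
print(source_changed, copy_unchanged)
```

When new_points1 is part of a computation graph, that hidden change is exactly the kind of in-place modification that autograd reports.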