Fixing "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation"
While converting my code from single-machine single-GPU to single-machine multi-GPU training, this error suddenly appeared:
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.cuda.FloatTensor [2500,
256]], which is output 0 of AsStridedBackward0, is at version 1;
expected version 0 instead. Hint: the backtrace further above shows
the operation that failed to compute its gradient. The variable in
question was changed in there or anywhere later. Good luck!
I searched online and found many suggestions. The consensus is that the code performs an in-place operation on a tensor that autograd still needs, something that may pass silently on a single GPU but surfaces once you switch to multi-GPU training. Common fixes include:
Changing x += y to x = x + y
Changing nn.ReLU(inplace=True) to nn.ReLU(inplace=False), and so on
But none of these worked for me!!
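For reference, here is a minimal sketch (not from my original code) that reproduces this class of error: sigmoid's backward pass needs its own output, so mutating that output in place breaks autograd.

import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)    # autograd saves y to compute sigmoid's gradient
y += 1                  # in-place add bumps y's version counter
# y.sum().backward()    # would raise: "... modified by an inplace operation"

# The out-of-place rewrite allocates a new tensor instead of mutating y:
y = torch.sigmoid(x)
z = y + 1
z.sum().backward()      # works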
In the end, I added this line before my train module runs:
torch.autograd.set_detect_anomaly(True)
With this enabled, the traceback prints the exact line that caused the failure, instead of only the vague error at loss.backward().
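To show where the flag goes, here is a minimal sketch; model, criterion, optimizer, and loader are placeholder names, not from my actual code:

import torch

torch.autograd.set_detect_anomaly(True)  # enable once, before training starts

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()  # the traceback now names the forward-pass line
                     # that produced the faulty tensor
    optimizer.step()

Note that anomaly detection adds noticeable overhead, so remove it once you have found the offending line.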
Once you know the exact offending line, look at the variable used there and add .clone() after it. For example, my offending line was:
k = self.k(key).reshape(B, L,
so I changed it to:
k = self.k(key.clone()).reshape(B, L,
Problem solved!
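.clone() works here because it returns a copy with its own memory while staying on the autograd graph, so gradients still flow through it, and later in-place writes to the original tensor can no longer invalidate what autograd saved. A rough sketch of what such a projection might look like; since my line above is truncated, the module structure, names, and shapes below are assumptions for illustration only:

import torch
import torch.nn as nn

class KeyProjection(nn.Module):  # hypothetical module, for illustration
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.k = nn.Linear(dim, dim)

    def forward(self, key):
        B, L, C = key.shape
        # .clone() hands the projection its own copy of key, so any later
        # in-place modification of key cannot corrupt saved tensors
        k = self.k(key.clone()).reshape(B, L, self.num_heads, C // self.num_heads)
        return k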