PyTorch multi-GPU usage: RuntimeError: all tensors must be on devices[0]

License: CC 4.0 BY-SA


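Problem:
While training on multiple GPUs I ran into RuntimeError: all tensors must be on devices[0]. Searching around did not solve it, and in the end I got it working by trial and error.
The main part of the code: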

os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
net.cuda()
net = torch.nn.DataParallel(net)

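The first training run used args.gpu="0,2" and worked. The second run, with the same configuration, raised RuntimeError: all tensors must be on devices[0]. After some googling I found several suggested fixes:
Fix 1:
The GPU ids in CUDA_VISIBLE_DEVICES are not the same as the device_ids passed to DataParallel. If you use: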

CUDA_VISIBLE_DEVICES = 3,1,2

then:

net = torch.nn.DataParallel(net, device_ids=[0,1,2])

This means that device_ids=[0,1,2] here refers to the three physical GPUs 3, 1, and 2, in that order. It did not work for me, though.
Reference link
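The renumbering described in Fix 1 can be sketched in plain Python. This is only an illustration: `logical_to_physical` is a hypothetical helper, not a PyTorch function, and it assumes that CUDA_VISIBLE_DEVICES holds comma-separated integer indices (not GPU UUIDs) and that every listed GPU actually exists.

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map the ids a process sees (0, 1, 2, ...) to physical GPU indices.

    Hypothetical helper for illustration; assumes integer indices only.
    """
    physical = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    # Inside the process, visible GPUs are renumbered from 0 in listed order.
    return {logical: phys for logical, phys in enumerate(physical)}


mapping = logical_to_physical("3,1,2")
# The keys are what DataParallel's device_ids should contain.
device_ids = list(mapping)
print(mapping)      # {0: 3, 1: 1, 2: 2}
print(device_ids)   # [0, 1, 2]
```

With the "0,2" setting from the main code, the same helper gives {0: 0, 1: 2}, so physical GPU 2 is addressed as device 1 inside the process.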
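Fix 2:
Use net.cuda(0) or with torch.cuda.device(0): to pin the GPU that the model and data are loaded onto. This did not work for me either.
Reference link
Fix 3:
Specify only one GPU in device_ids. When I ran the code, it still used both GPUs.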

os.environ['CUDA_VISIBLE_DEVICES'] = "0,2"
net.cuda()
net = torch.nn.DataParallel(net, device_ids=[0])
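For context, the PyTorch documentation states that a module wrapped in DataParallel must have its parameters and buffers on device_ids[0] before it runs, which is the condition the error message refers to. Below is a minimal sketch of one way to satisfy it, not a confirmed fix for the case above. `wrap_data_parallel` is a hypothetical helper; the environment variable is set before importing torch so that it takes effect before CUDA is initialized, and the code falls back to the plain module when no GPU is visible.

```python
import os

# Set before importing torch so it takes effect before CUDA is initialized.
# "0,2" is the value used in this post; setdefault keeps an existing value.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,2")

import torch
import torch.nn as nn


def wrap_data_parallel(net):
    """Hypothetical helper: move net to logical GPU 0, then wrap it."""
    n_visible = torch.cuda.device_count()
    if n_visible == 0:
        # No GPU visible: return the module unchanged so the sketch runs on CPU.
        return net, []
    # Logical ids always start at 0, whatever the physical ids are.
    device_ids = list(range(n_visible))
    primary = torch.device("cuda", device_ids[0])
    net = net.to(primary)
    # Sanity check: every parameter must live on device_ids[0].
    stray = [name for name, p in net.named_parameters() if p.device != primary]
    if stray:
        raise RuntimeError(f"parameters not on {primary}: {stray}")
    return nn.DataParallel(net, device_ids=device_ids), device_ids


# nn.Linear(3, 2) is a stand-in for the real network.
model, device_ids = wrap_data_parallel(nn.Linear(3, 2))
```

The input batch should also be placed on the first device, for example `x.to(next(model.parameters()).device)`; DataParallel then scatters it across the listed GPUs.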

Here is how I run on multiple GPUs myself.

The core of it is:
net.cuda()
net = torch.nn.DataParallel(net)
Before this code, you access the layers of net directly:
for param in net.features.parameters():
    param.requires_grad = True
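After the model has been wrapped for parallel execution, layers must be accessed through module. Pay particular attention to adding module when saving the model:
# parameter group passed to the optimizer
{'params':net.module.out1.parameters()}
# entry in the dict that is saved as the checkpoint
'net': net.module,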
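The fragments above can be put together into one small runnable sketch. `TinyNet`, its layer sizes, and the learning rate are hypothetical stand-ins; only the attribute names features and out1 come from the post. The sketch stores net.module.state_dict() rather than the net.module object used above, which is a common variation and not necessarily what the original code did.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in with the layer names used in the post."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(4, 8)
        self.out1 = nn.Linear(8, 2)

    def forward(self, x):
        return self.out1(torch.relu(self.features(x)))


net = TinyNet()

# Before wrapping, layers are reached directly on net.
for param in net.features.parameters():
    param.requires_grad = True

if torch.cuda.is_available():
    net.cuda()
net = nn.DataParallel(net)

# After wrapping, the original model lives under net.module.
optimizer = torch.optim.SGD(
    [{'params': net.module.out1.parameters()}],
    lr=0.01,  # hypothetical value
)

# Keys of the wrapper carry a "module." prefix; keys of net.module do not.
wrapped_keys = list(net.state_dict().keys())
plain_keys = list(net.module.state_dict().keys())

# Saving through net.module keeps the checkpoint loadable without the wrapper.
checkpoint = {'net': net.module.state_dict()}
```

A checkpoint saved this way can be loaded into an unwrapped TinyNet with load_state_dict, whereas one saved from the wrapper itself would have every key prefixed with "module.".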