Training a neural network on CUDA (GPU) fails? Switch to the CPU and run it once to see what went wrong

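When a neural network model fails on the GPU, the real error can be buried under CUDA's own error output, which makes the actual problem hard to locate. This post compares the error messages produced on the GPU and on the CPU, and explains how running the model single-threaded on the CPU yields a clearer message that lets you find and fix the problem.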

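A model running on the GPU is accelerated by many threads working in parallel, so when something goes wrong, the real error message is easily drowned out by CUDA's error output. The fix is to run the same code once on the CPU, single-threaded, and the real error message will show up.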
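As a minimal sketch of that switch, assuming a PyTorch script that places its model and data with `.to(device)` or `device=`: the helper `pick_device`, the flag `DEBUG_ON_CPU`, and the tiny `nn.Linear` model below are illustrative stand-ins, not part of the original code.

```python
import torch
from torch import nn


def pick_device(force_cpu: bool = False) -> torch.device:
    """Return the CPU when debugging, otherwise the GPU if one is available."""
    if force_cpu or not torch.cuda.is_available():
        return torch.device("cpu")
    return torch.device("cuda")


# Hypothetical debug switch: set it to True to reproduce a failure on the CPU.
DEBUG_ON_CPU = True
device = pick_device(force_cpu=DEBUG_ON_CPU)

# Optional: keep the CPU run single-threaded while debugging.
torch.set_num_threads(1)

# Stand-ins for the real model and batch; both must live on the same device.
model = nn.Linear(4, 2).to(device)
data = torch.randn(8, 4, device=device)

out = model(data)
print(out.device, tuple(out.shape))
```

Keeping the device choice in one place means the rest of the training script does not change between the debugging run and the normal GPU run.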

For example, the error message on the GPU:

/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [91,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [92,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [93,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [94,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [95,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 79, in <module>
    out = model(data)
  File "/home/wxd/anaconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wxd/workspace/SAGPool/networks.py", line 45, in forward
    x, edge_index, _, batch, _ = self.pool3(x, edge_index, None, batch)
  File "/home/wxd/anaconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wxd/workspace/SAGPool/layers.py", line 27, in forward
    edge_index, edge_attr, perm, num_nodes=score.size(0))
  File "/home/wxd/anaconda3/envs/torch/lib/python3.6/site-packages/torch_geometric/nn/pool/topk_pool.py", line 50, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: device-side assert triggered

The error message on the CPU:

IndexError: index 224 is out of bounds for dimension 0 with size 156
(raised on the line "x = x[perm] * torch.tanh(score[perm]).view(-1, 1)")

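The error message from the CPU run is much clearer: it names the offending index (224), the size of the dimension (156), and the line that failed.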
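To illustrate this kind of failure, here is a small sketch that reproduces an out-of-bounds index on the CPU and lists the offending values before indexing. The tensors `x` and `perm` and the helper `out_of_range` are hypothetical; only the size 156 and the index 224 are taken from the message above.

```python
import torch


def out_of_range(index: torch.Tensor, size: int) -> torch.Tensor:
    """Return the entries of `index` that are invalid for a dimension of length `size`."""
    # Same condition as the CUDA assertion: valid means -size <= index < size.
    invalid = (index < -size) | (index >= size)
    return index[invalid]


x = torch.randn(156, 8)            # 156 rows, matching the size in the error message
perm = torch.tensor([0, 10, 224])  # hypothetical index tensor; 224 is past the end

bad = out_of_range(perm, x.size(0))
print("invalid indices:", bad.tolist())

try:
    x[perm]  # on the CPU the exception is raised right here
    message = None
except IndexError as err:
    message = str(err)
print(message)
```

A check like `out_of_range` can be placed just before the indexing line named in the traceback to see which values are invalid, and then traced back to where the index tensor was built.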
