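When a neural network model runs on the GPU, its operations execute in parallel across many threads. If something goes wrong, the real error message is easily buried under CUDA error messages. Running the model once on the CPU, in a single thread, makes the real error message show up.
For example, the error output on the GPU: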
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [91,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [92,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [93,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [94,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [124,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "main.py", line 79, in <module>
    out = model(data)
  File "/home/wxd/anaconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wxd/workspace/SAGPool/networks.py", line 45, in forward
    x, edge_index, _, batch, _ = self.pool3(x, edge_index, None, batch)
  File "/home/wxd/anaconda3/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wxd/workspace/SAGPool/layers.py", line 27, in forward
    edge_index, edge_attr, perm, num_nodes=score.size(0))
  File "/home/wxd/anaconda3/envs/torch/lib/python3.6/site-packages/torch_geometric/nn/pool/topk_pool.py", line 50, in filter_adj
    row, col = row[mask], col[mask]
RuntimeError: copy_if failed to synchronize: device-side assert triggered
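The error output on the CPU:
IndexError: index 224 is out of bounds for dimension 0 with size 156
raised on the line:
x = x[perm] * torch.tanh(score[perm]).view(-1, 1)
The error message from the CPU run is much clearer: it names the offending index, the size of the dimension, and the line that failed.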
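
As a rough illustration of the CPU check described above, here is a minimal sketch. It reuses the indexing expression from the failing line, but the function name forward_step, the tensor shapes, and the values in perm are all made up for this example and are not taken from the SAGPool code. The idea is only to keep every tensor on the CPU and catch the IndexError so its message can be read directly.

```python
import torch

# Keep everything on the CPU so the failing operation raises a normal
# Python exception instead of a device-side assert.
device = torch.device("cpu")


def forward_step(x, score, perm):
    """Hypothetical stand-in for the failing line: select nodes by index."""
    return x[perm] * torch.tanh(score[perm]).view(-1, 1)


# Made-up data: 156 nodes with 4 features each, and one invalid index (224).
x = torch.randn(156, 4, device=device)
score = torch.randn(156, device=device)
perm = torch.tensor([0, 5, 224], device=device)

try:
    forward_step(x, score, perm)
    message = None
except IndexError as err:
    # On the CPU the message names the bad index and the dimension size.
    message = str(err)

print(message)
```

In a real training script the same effect usually comes from moving the model and the batch to the CPU (for example with .to(device)) before calling the model, then switching back to the GPU once the bug is fixed.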