mmsegmentation error

幸福回头

Categories: Deep Learning, Image Processing. Tags: python, artificial intelligence

Article link: https://blog.youkuaiyun.com/zt1091574181/article/details/124302527

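Training on the VOC dataset with mmseg failed with a CUDA memory error. The cause was that num_class had not been adjusted, which led to an illegal memory access. After checking the config and setting num_class correctly, training ran normally.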

Describe the bug
2022-03-01 07:52:37,584 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
2022-03-01 07:52:37,585 - mmseg - INFO - Checkpoints will be saved to /data/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug by HardDiskBackend.
/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
  File "tools/train.py", line 234, in <module>
    main()
  File "tools/train.py", line 223, in main
    train_segmentor(
  File "/data/mmsegmentation/mmseg/apis/train.py", line 174, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/data/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
    losses = self(**data_batch)
  File "/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
    return old_func(*args, **kwargs)
  File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
    loss_decode = self._decode_head_forward_train(x, img_metas,
  File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
    loss_decode = self.decode_head.forward_train(x, img_metas,
  File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
    losses = self.losses(seg_logits, gt_semantic_seg)
  File "/data/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
    return old_func(*args, **kwargs)
  File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
    loss['acc_seg'] = accuracy(
  File "/data/mmsegmentation/mmseg/models/losses/accuracy.py", line 47, in accuracy
    correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f57296eea22 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10983 (0x7f572994f983 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f5729951027 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f57296d85a4 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa27e1a (0x7f56d4a16e1a in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa27eb1 (0x7f56d4a16eb1 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xe7 (0x7f573a7ffb97 in /lib/x86_64-linux-gnu/libc.so.6)

While training on the VOC dataset with the mmseg project, I got the error above:

RuntimeError: CUDA error: an illegal memory access was encountered

correct = correct[:, target != ignore_index]
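As the log itself warns, CUDA kernel errors can be reported asynchronously, so the Python line shown in the traceback is not necessarily the one that failed. The log suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of setting it from Python follows; the helper name enable_blocking_launches is hypothetical and not part of mmseg or PyTorch, and the variable has to be set before CUDA is initialised.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given mapping (os.environ by default).

    Hypothetical helper for illustration only. With this variable set, CUDA
    kernel launches run synchronously, so an error surfaces at the call that
    caused it instead of at some later, unrelated line.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Must run before CUDA is initialised, i.e. before the training code touches the GPU.
enable_blocking_launches()
# import torch  # import and use torch only after the variable is set
```

The same effect can be had from the shell by prefixing the training command with CUDA_LAUNCH_BLOCKING=1. Training will be slower in this mode, so it is only meant for locating the failing call.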

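To check that line, I tested it outside the project in plain Python: I built fresh True/False tensors and applied the same condition, and it ran without any problem, so the indexing itself and the CPU/GPU placement were not at fault. Going back to the GitHub thread, I found that other people had hit the same error, and that the cause was num_class (the num_classes field in the mmseg config) not having been changed to match the dataset. After setting num_class to the correct value, model training started normally.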
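The fix can be sketched as follows, together with a quick sanity check that the label values in the annotation masks actually fit the configured class count. Everything here is an illustrative assumption rather than the author's actual config: 21 is the usual class count for VOC (20 object classes plus background), 255 is the default ignore_index of mmseg decode heads, and invalid_labels and toy_mask are hypothetical names. The check uses plain Python lists so it runs without any dependencies; real masks would be loaded from the annotation PNG files.

```python
# Illustrative values, not taken from the original post.
NUM_CLASSES = 21      # VOC: 20 object classes + background
IGNORE_INDEX = 255    # default ignore label in mmseg decode heads

# Config override in mmseg's dict style. The FCN model in the log has both a
# decode head and an auxiliary head, and each carries its own num_classes.
model = dict(
    decode_head=dict(num_classes=NUM_CLASSES),
    auxiliary_head=dict(num_classes=NUM_CLASSES),
)


def invalid_labels(mask_values, num_classes=NUM_CLASSES, ignore_index=IGNORE_INDEX):
    """Return sorted label ids that are neither in [0, num_classes) nor ignore_index."""
    return sorted(
        {int(v) for v in mask_values
         if v != ignore_index and not 0 <= v < num_classes}
    )


# Flattened pixel values of a toy mask (hypothetical data).
toy_mask = [0, 1, 20, 255, 21, 30, 21]
print(invalid_labels(toy_mask))  # → [21, 30]
```

An empty result means every label is either a valid class id or the ignore label. Any value that shows up here would index past the end of the class dimension on the GPU, which is a plausible source of the illegal memory access.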