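mmsegmentation error

While training on the VOC dataset with mmseg, I ran into a CUDA memory error. The cause was that `num_classes` had not been adjusted, which led to an illegal memory access. After checking the config and setting `num_classes` correctly, training ran without the error.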


Describe the bug
2022-03-01 07:52:37,584 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
2022-03-01 07:52:37,585 - mmseg - INFO - Checkpoints will be saved to /data/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug by HardDiskBackend.
/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
File "tools/train.py", line 234, in <module>
main()
File "tools/train.py", line 223, in main
train_segmentor(
File "/data/mmsegmentation/mmseg/apis/train.py", line 174, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/data/mmcv/mmcv/runner/iter_based_runner.py", line 61, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/data/mmcv/mmcv/parallel/data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 138, in train_step
losses = self(**data_batch)
File "/opt/conda/envs/seg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/data/mmcv/mmcv/runner/fp16_utils.py", line 109, in new_func
return old_func(*args, **kwargs)
File "/data/mmsegmentation/mmseg/models/segmentors/base.py", line 108, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 143, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/data/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 86, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 204, in forward_train
losses = self.losses(seg_logits, gt_semantic_seg)
File "/data/mmcv/mmcv/runner/fp16_utils.py", line 197, in new_func
return old_func(*args, **kwargs)
File "/data/mmsegmentation/mmseg/models/decode_heads/decode_head.py", line 264, in losses
loss['acc_seg'] = accuracy(
File "/data/mmsegmentation/mmseg/models/losses/accuracy.py", line 47, in accuracy
correct = correct[:, target != ignore_index]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f57296eea22 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x10983 (0x7f572994f983 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f5729951027 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f57296d85a4 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0xa27e1a (0x7f56d4a16e1a in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0xa27eb1 (0x7f56d4a16eb1 in /opt/conda/envs/seg/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #23: __libc_start_main + 0xe7 (0x7f573a7ffb97 in /lib/x86_64-linux-gnu/libc.so.6)

When training on the VOC dataset with the mmseg project, I got the error shown above:

RuntimeError: CUDA error: an illegal memory access was encountered

correct = correct[:, target != ignore_index]

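To rule out the indexing operation itself, I tested the same logic in a standalone Python session: I regenerated tensors of True and False values and applied the same condition, and nothing went wrong, so this was not a CPU-versus-GPU problem with that line. Going back to the GitHub issue, I found that someone else had hit the same error, and there it was caused by `num_classes` not having been changed. After setting `num_classes` to the correct value, model training starts normally.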
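The fix and a quick sanity check can be sketched as below. This is a minimal sketch, not the exact change made in the original project: it assumes the mmseg 0.x dict-style config and the standard PASCAL VOC label set (20 object classes plus background, so 21). `find_invalid_labels` and `mask_values` are hypothetical names introduced for illustration and are not part of mmsegmentation. A likely mechanism for the crash is that a label ID with no matching output channel makes a GPU kernel index past the class dimension, which then surfaces asynchronously at a later line such as the one in `accuracy()`.

```python
# Config override in the mmseg 0.x dict style. 21 is an assumption based on
# the standard PASCAL VOC label set (20 object classes + background).
NUM_CLASSES = 21
model = dict(
    decode_head=dict(num_classes=NUM_CLASSES),
    auxiliary_head=dict(num_classes=NUM_CLASSES),
)


def find_invalid_labels(label_values, num_classes, ignore_index=255):
    """Return the sorted label IDs that fall outside [0, num_classes).

    `label_values` is any iterable of integer label IDs, for example the
    unique pixel values collected from the annotation masks.
    `ignore_index` is skipped because it is excluded from the loss.
    """
    return sorted(
        {v for v in label_values
         if v != ignore_index and not 0 <= v < num_classes}
    )


# Hypothetical unique values gathered from a set of VOC-style masks.
mask_values = [0, 1, 15, 20, 255]

# With a 19-class head (e.g. a config copied from a 19-class dataset),
# label 20 has no matching output channel.
print(find_invalid_labels(mask_values, num_classes=19))  # prints [20]

# With the override above, every label is covered.
print(find_invalid_labels(mask_values, model['decode_head']['num_classes']))  # prints []
```

In practice the label values would come from the annotation PNGs themselves (for example, the unique values of each mask array). Any ID reported by the check means either the masks or `num_classes` needs to change before training on the GPU.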

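### Resolving MMSegmentation inference errors

When `mmsegmentation` reports an error during inference, the following areas are usually worth checking.

#### Error analysis and common causes

1. **Module not found in the model registry**
   An error such as `"xxxxx is not in the model registry"` means the class or function being used was never imported into the model registry. This can happen when the path to a custom config file is wrong or a required initialization step is missing[^1].

2. **Missing dataset metadata**
   Exceptions such as `KeyError: 'classes'` usually mean the dataset attributes (for example, the class names) were not set before the pretrained weights were loaded. Make sure `dataset_meta['classes']` is defined and that its length matches the actual number of classes[^2].

3. **Dependency version conflicts**
   Version compatibility also matters. Confirm that the installed library versions meet the project's requirements, especially PyTorch and its extension packages, OpenCV, and other third-party tools. As a reference point for adjusting a local environment, one recorded setup used TorchVision 0.17.2+cu118, OpenCV 4.9.0, MMEngine 0.10.3, and mmsegmentation 1.2.2[^2].

#### Worked example

Suppose you are training a new semantic segmentation network with `MMSegmentation` and hit some of the typical problems above. Depending on the situation:

- If the error occurs while creating a new dataset adapter, check that you followed the official documentation: a new Python script describing the target data structure (for example `cag.py`) should exist under `mmseg/datasets/` and implement all required methods[^3].

  ```python
  # Note: these import paths are from mmsegmentation 0.x;
  # 1.x moved the registries to mmseg.registry.
  from mmseg.datasets.builder import DATASETS
  from mmseg.datasets.custom import CustomDataset


  @DATASETS.register_module()
  class CAGDataset(CustomDataset):
      # Class names and the RGB color used to draw each class.
      CLASSES = ('background', 'object')
      PALETTE = [[0, 0, 0], [255, 255, 255]]

      def __init__(self, split, **kwargs):
          super().__init__(
              img_suffix='.png',
              seg_map_suffix='.png',
              split=split,
              reduce_zero_label=False,
              **kwargs)
  ```

- If an inference run fails, first verify that the input image size is what the model expects, then check the config file for missing parameters, and finally read the log output for further clues to the root cause.

#### Additional suggestions

To make later maintenance easier and help others follow the workflow, keep good coding habits: comment each part of the implementation clearly and follow the PEP 8 style guide.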
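The dataset metadata check described above (a defined `classes` entry whose length matches the class count) could be sketched as follows. This is an illustrative helper under stated assumptions, not an mmsegmentation API: `check_dataset_meta` is a hypothetical name, and `dataset_meta` is assumed to be a plain dict.

```python
def check_dataset_meta(dataset_meta, num_classes):
    """Raise a descriptive error if class metadata is missing or mismatched.

    `dataset_meta` is assumed to be a plain dict with a 'classes' entry;
    `num_classes` is the value configured on the segmentation head.
    """
    if 'classes' not in dataset_meta:
        raise KeyError("dataset_meta has no 'classes' entry")
    classes = dataset_meta['classes']
    if len(classes) != num_classes:
        raise ValueError(
            f"{len(classes)} class names but num_classes={num_classes}")
    return True


# Hypothetical metadata matching a two-class (background/object) dataset.
meta = {'classes': ('background', 'object')}
check_dataset_meta(meta, num_classes=2)
```

Running a check like this before building the model turns a late `KeyError` or a GPU-side failure into an early, readable error message.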