1. NCCL unhandled cuda error
Problem:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1565272271120/work/torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled cuda error
Traceback (most recent call last):
…
subprocess.CalledProcessError: Command '['/home/user3/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/test.py', '--local_rank=2', 'configs/collin/dcn/faster_rcnn_dconv_c3-c5_r50_fpn_1x--hrrsd.py', 'work_dirs/faster_rcnn_dconv_c3-c5_r50_fpn_1x--hrrsd/epoch_12.pth', '--launcher', 'pytorch', '--out', 'work_dirs/faster_rcnn_dconv_c3-c5_r50_fpn_1x--hrrsd/results.pkl', '--show']' returned non-zero exit status 1.
Solution:
Change which GPUs are visible to the job, and make sure that no other program is running on any of the GPUs you select.
export CUDA_VISIBLE_DEVICES=0,5,6
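
To find out which GPUs are actually free before setting the variable, something like the following may help. It is only a rough sketch: the helper names (parse_idle_gpus, query_gpu_memory, export_line) and the 50 MiB threshold are assumptions made for illustration, and "low memory use" is just a heuristic for "no other program is running". It relies on the nvidia-smi query `--query-gpu=index,memory.used --format=csv,noheader,nounits`.

```python
import subprocess


def parse_idle_gpus(csv_text, max_used_mib=50):
    """Return indices of GPUs whose used memory is at most max_used_mib.

    csv_text is expected to hold one "index, memory.used" pair per line.
    The 50 MiB default is an assumed threshold, not an official value.
    """
    idle = []
    for line in csv_text.strip().splitlines():
        index, used = [field.strip() for field in line.split(",")]
        if not used.isdigit():
            # Skip entries such as "[N/A]" that cannot be compared.
            continue
        if int(used) <= max_used_mib:
            idle.append(int(index))
    return idle


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use in MiB (requires an NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )


def export_line(gpu_indices):
    """Build the shell line that restricts the job to the given GPUs."""
    return "export CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in gpu_indices)
```

On a machine with the NVIDIA driver installed, `print(export_line(parse_idle_gpus(query_gpu_memory())))` prints an export line that can be pasted into the shell before launching the test script again. It is still worth checking the plain `nvidia-smi` output by hand, since a process can hold a GPU while using very little memory.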




