mmcv multi-GPU training hangs
Symptom
Training code built on mmcv hangs at the point shown below. GPU utilization sits at 100% and no error is raised. Single-GPU training works fine; the problem only appears with multi-GPU training.
2021-11-17 17:31:12,287 - mmdet - INFO - initialize RPNHead with init_cfg {'type': 'Normal', 'layer': 'Conv2d', 'std': 0.01}
2021-11-17 17:31:12,290 - mmdet - INFO - initialize Shared2FCBBoxHead with init_cfg [{'type': 'Normal', 'std': 0.01, 'override': {'name': 'fc_cls'}}, {'type': 'Normal', 'std': 0.001, 'override': {'name': 'fc_reg'}}, {'type': 'Xavier', 'override': [{'name': 'shared_fcs'}, {'name': 'cls_fcs'}, {'name': 'reg_fcs'}]}]
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
loading annotations into memory...
Done (t=0.06s)
creating index...
index created!
2021-11-17 17:31:13,772 - mmdet - INFO - load checkpoint from http path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
Solutions
1. Edit ~/.bashrc (this worked for me)
These two variables disable NCCL's GPU peer-to-peer and InfiniBand transports, which are a common cause of multi-GPU hangs on machines where P2P communication is broken. Open your ~/.bashrc (no sudo needed for your own file):
vim ~/.bashrc
Append the following lines at the bottom:
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"
Reload so the changes take effect:
source ~/.bashrc
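If you prefer not to change ~/.bashrc globally, the same variables can be set for a single training session instead. A minimal sketch (the dist_train.sh launcher path is the standard mmdetection one; adjust the config and GPU count to your setup):

```shell
# Per-run alternative to editing ~/.bashrc: export the NCCL variables
# only in the shell that launches training.
export NCCL_P2P_DISABLE=1   # disable peer-to-peer (NVLink/PCIe P2P) transport
export NCCL_IB_DISABLE=1    # disable InfiniBand transport
export NCCL_DEBUG=INFO      # log which transport NCCL actually selects

# then launch as usual, e.g.:
# bash tools/dist_train.sh configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py 2
```

NCCL_DEBUG=INFO is optional but useful: if the hang is transport-related, the log shows which transport each rank picked before training stalls.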
2. Switch the communication backend from nccl to gloo
In configs/_base_/default_runtime.py, change
dist_params = dict(backend='nccl')
to
dist_params = dict(backend='gloo')
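The change amounts to a one-line edit in the runtime config. A sketch of the relevant fragment (the rest of default_runtime.py stays unchanged):

```python
# configs/_base_/default_runtime.py
# gloo runs collectives over TCP sockets instead of NCCL's GPU transports,
# sidestepping P2P/InfiniBand issues at some cost in throughput.
dist_params = dict(backend='gloo')
```

Expect gloo to be slower than nccl for GPU tensors, so treat this as a workaround or a way to confirm the hang is NCCL-specific rather than a permanent setting.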