Training ImageNet with pytorch/examples: RuntimeError: NCCL error

This post records in detail an error hit while training on the ImageNet dataset with PyTorch and how it was resolved. The error was raised by NCCL during DistributedDataParallel model initialization and was fixed by adding the --world-size argument.

Reference code:
pytorch/examples/imagenet

Error 1:

```bash
# (py36_pytorch)
python main.py \
>     -a resnet18 \
>     --lr 0.1 \
>     --dist-url 'tcp://127.0.0.1:23456' \
>     --dist-backend 'nccl' \
>     --multiprocessing-distributed \
>     --rank 0 \
>     /DATA/disk1/zhangxin/imagenet
Use GPU: 1 for training
Use GPU: 2 for training
Use GPU: 0 for training
=> creating model 'resnet18'
Use GPU: 3 for training
=> creating model 'resnet18'
Use GPU: 7 for training
=> creating model 'resnet18'
Use GPU: 4 for training
=> creating model 'resnet18'
Use GPU: 6 for training
=> creating model 'resnet18'
Use GPU: 5 for training
=> creating model 'resnet18'
=> creating model 'resnet18'
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 398, in <module>
    main()
  File "main.py", line 110, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/work/anaconda3/envs/py36_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/work/anaconda3/envs/py36_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/home/work/anaconda3/envs/py36_pytorch/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/DATA/disk1/zhangxin/github/examples/imagenet/main.py", line 151, in main_worker
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
  File "/home/work/anaconda3/envs/py36_pytorch/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 215, in __init__
    self.broadcast_bucket_size)
  File "/home/work/anaconda3/envs/py36_pytorch/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 377, in _dist_broadcast_coalesced
    dist._dist_broadcast_coalesced(self.process_group, tensors, buffer_size, False)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1544174967633/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:39, invalid argument
```

Solution:

Add the --world-size argument. In the version of main.py used here, --world-size defaults to -1 (it is meant to be the number of nodes), and with --multiprocessing-distributed the script computes world_size = ngpus_per_node * args.world_size before calling dist.init_process_group. Leaving the default in place therefore hands an invalid (negative) world size to the process group, which surfaces as the NCCL "invalid argument" error once DistributedDataParallel is constructed. For a single node, pass --world-size 1; the corrected launch command is shown below.
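This is the command from the failing run above with only --world-size 1 added (single node, all visible GPUs); the dataset path and port are the ones used in this post, so adjust them for your own environment:

```bash
# Same invocation as before, plus --world-size 1 so that
# world_size = ngpus_per_node * 1 is computed correctly inside main.py
python main.py \
    -a resnet18 \
    --lr 0.1 \
    --dist-url 'tcp://127.0.0.1:23456' \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --world-size 1 \
    --rank 0 \
    /DATA/disk1/zhangxin/imagenet
```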
### Resolving the NCCL `RuntimeError`

#### 1. Understanding NCCL and common causes of the error

NCCL (NVIDIA Collective Communications Library) is the library used for efficient communication between GPUs in a cluster. An error such as `RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer` usually means there is a connectivity problem between the participating processes or nodes [^1].

#### 2. Concrete fixes

##### 2.1 Check the network configuration

Make sure all machines taking part in training can reach one another and that the firewall allows the required ports. A stable internal network is essential for a distributed training environment.

##### 2.2 Adjust DeepSpeed or PyTorch launch settings

When training large models with DeepSpeed on a cluster, you can try to work around the problem by making the rendezvous settings explicit:

```bash
export MASTER_ADDR=localhost
export MASTER_PORT=12355
```

Launching through torchrun with the master address and port spelled out on the command line also helps pin down where the rendezvous fails:

```bash
torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=$SLURM_PROCID --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT train.py
```

##### 2.3 Try single-GPU training

Multi-GPU parallel training can sometimes trigger compatibility problems with particular hardware or driver versions. Falling back to a single GPU is a quick way to verify whether the failure is caused by the distributed resource setup rather than by the model itself [^2]:

```python
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```

##### 2.4 Enable CUDA debugging options

To catch potential asynchronous errors closer to where they occur, set the following environment variables before running:

```bash
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1
```

These settings serialize kernel execution, so any kernel failure is reported immediately instead of being deferred [^3].
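To tie this back to the original error: the spawn-one-process-per-GPU pattern used by the ImageNet example boils down to the sketch below. This is a simplified illustration rather than the example's actual code; the address, port, and toy model are placeholders. The point is that the world_size passed to init_process_group must equal the total number of processes (here, the GPU count on a single node), never the -1 placeholder default.

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(gpu, ngpus_per_node, world_size):
    # On a single node, the global rank is just the local GPU index.
    rank = gpu
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # placeholder address/port
        world_size=world_size,                # total number of processes, never -1
        rank=rank,
    )
    torch.cuda.set_device(gpu)

    # Placeholder model; the real script builds a ResNet here.
    model = torch.nn.Linear(10, 10).cuda(gpu)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    # ... training loop would go here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    ngpus_per_node = torch.cuda.device_count()
    # One process per GPU on a single node, so world_size == ngpus_per_node.
    mp.spawn(worker, nprocs=ngpus_per_node,
             args=(ngpus_per_node, ngpus_per_node))
```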