Problem

Running the training script through the deprecated launcher:

    NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES="0,1" python -m torch.distributed.launch --nproc_per_node=2 --master_port 29502 train.py -p 29502 -d 0,1

fails with:

    FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ['LOCAL_RANK']` instead.
    08 13:14:46 PyTorch Version 1.13.1+cu117
    usage: train.py [-h] [-d DEVICES] [-c FILE] [--local-rank LOCAL_RANK] [-p PORT] [--dataset_name DATASET_NAME]
    train.py: error: unrecognized arguments: --local_rank=0
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 35585) of binary: /home/lvying/anaconda3/envs/sigma/bin/python

Cause: train.py defines its option as --local-rank (hyphen), but torch.distributed.launch passes --local_rank=0 (underscore), which argparse rejects as an unrecognized argument. Since PyTorch 1.9 the recommended launcher is torchrun, which does not pass a --local_rank flag at all; it exports the rank via the LOCAL_RANK environment variable instead.
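If the script must keep working with both launchers, argparse can register both spellings as aliases of one option. This is a minimal sketch of a hypothetical train.py argument setup, not the actual script:

```python
import argparse

parser = argparse.ArgumentParser()
# Accept both "--local-rank" and "--local_rank" so either launcher works;
# dest="local_rank" stores the value under one attribute name.
parser.add_argument("--local-rank", "--local_rank",
                    type=int, default=0, dest="local_rank")

# The old launcher passes the underscore form, which now parses cleanly:
args = parser.parse_args(["--local_rank=3"])
print(args.local_rank)  # 3
```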
Solution

Since PyTorch 1.9, use torchrun instead of torch.distributed.launch:

    CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --master_port 29502 train.py -p 29502 -d 0,1
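With torchrun, the script should read the rank from the environment rather than from a CLI flag, as the warning message says. A minimal sketch (get_local_rank is an illustrative helper name, not part of any library):

```python
import os

def get_local_rank() -> int:
    # torchrun exports LOCAL_RANK (along with RANK and WORLD_SIZE) as
    # environment variables; fall back to 0 for single-process runs.
    return int(os.environ.get("LOCAL_RANK", "0"))
```

This value is typically used to pick the GPU for the process, e.g. torch.cuda.set_device(get_local_rank()).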