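torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 141: fixing a YOLOv8 dual-GPU training error

This article discusses the torch.distributed exception encountered when training a YOLOv8 model on two GPUs, covering process management, environment configuration, and restart strategies.

When training a model with Ultralytics' open-source YOLOv8, I used the following command for dual-GPU training: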

yolo train data=D:/YOLO_V8/ultralytics-main/ultralytics-main/ultralytics/cfg/datasets/mydata.yaml model=yolov8n.pt epochs=650 imgsz=640 batch=256 workers=0 patience=200 device=0,1

It threw the following exception:

torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 141
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 340) of binary: /root/miniconda3/envs/llama/bin/python
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
subprocess.CalledProcessError: Command '['D:\\Anaconda\\envs\\YOLO8\\python.exe', '-m', 'torch.distributed.run', '--nproc_per_node', '2', '--master_port', '58127', 'C:\\Users\\amax\\AppData\\Roaming\\Ultralytics\\DDP\\_temp_8gd8 22v32514268826352.py']' returned non-zero exit status 1.
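The `exitcode` field in these messages follows the POSIX convention: a negative value -N means the worker was killed by signal N rather than exiting on its own. A minimal sketch for decoding it from a pasted log is shown below. The helper name `explain_exitcodes` and the small signal table are illustrative assumptions; the numbers follow the common Linux x86/ARM numbering and can differ on other platforms.

```python
import re

# Signal numbers as used on common Linux x86/ARM systems (assumption:
# other platforms may number some of these differently).
LINUX_SIGNALS = {6: "SIGABRT", 7: "SIGBUS", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

# Matches both "exitcode: -7" and "exitcode : -11" as they appear in elastic logs.
_EXITCODE = re.compile(r"exitcode\s*:\s*(-?\d+)")


def explain_exitcodes(log_text):
    """Return (exitcode, explanation) pairs for every exitcode found in log_text."""
    results = []
    for match in _EXITCODE.finditer(log_text):
        code = int(match.group(1))
        if code < 0:
            # Negative code: the process was terminated by signal number -code.
            name = LINUX_SIGNALS.get(-code, f"signal {-code}")
            results.append((code, f"killed by {name}"))
        else:
            results.append((code, "process exited on its own with this status"))
    return results
```

Applied to the ERROR line above, this reports that the worker with `local_rank: 0` was killed by a signal, which points to a crash inside the process rather than a Python exception raised by the training script.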

Using a single GPU, however, does not throw the exception:

yolo train data=D:/YOLO_V8/ultralytics-main/ultralytics-main/ultralytics/cfg/datasets/mydata.yaml model=yolov8n.pt epochs=650 imgsz=640 batch=256 workers=0 patience=200 device=0

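The cause was that the previous dual-GPU run had been stopped by pressing Ctrl+C directly in PyCharm's terminal. That may have left the training processes only partly killed, so their resources were never released. Restarting the computer clears them.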

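Another workaround is to run a different Python AI training program in the same IDE (PyCharm or VS Code), then close its terminal to kill that unrelated training process. After that, reopening this training run will, with some probability, let Multi-GPU Training on both cards proceed again.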
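Before rebooting, it may be worth checking whether worker processes from the interrupted run are still alive. The sketch below is one possible way to do that, not a confirmed fix: it scans process command lines for the markers visible in the error above (`torch.distributed.run` and the `Ultralytics\DDP` temp-script directory). The function names are hypothetical, and `list_processes` assumes the third-party `psutil` package is installed.

```python
import os

# Substrings taken from the command line shown in the CalledProcessError above.
DDP_MARKERS = ("torch.distributed.run", "Ultralytics\\DDP", "Ultralytics/DDP")


def find_stale_ddp(processes, markers=DDP_MARKERS, own_pid=None):
    """Return PIDs whose command line contains any DDP marker.

    processes: iterable of (pid, command_line) pairs.
    own_pid:   PID to skip, so the checker never reports itself.
    """
    stale = []
    for pid, command_line in processes:
        if pid == own_pid:
            continue
        if any(marker in command_line for marker in markers):
            stale.append(pid)
    return stale


def list_processes():
    """Collect (pid, command_line) pairs via psutil (assumed to be installed)."""
    import psutil  # third-party; imported lazily so the filter works without it

    pairs = []
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = proc.info["cmdline"] or []  # None when access is denied
        pairs.append((proc.info["pid"], " ".join(cmdline)))
    return pairs


def report():
    """Print leftover DDP-looking processes; killing them is left to the user."""
    for pid in find_stale_ddp(list_processes(), own_pid=os.getpid()):
        print(f"possible leftover DDP process: PID {pid}")
```

Calling `report()` only lists candidates. Any PID it prints should be inspected before being terminated (for example from Task Manager), since the markers can also match a training run that is still legitimately in progress.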

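If restarting the computer does not help either, suspect the environment: the CUDA and torch versions must match each other. I have no better solution for this at the moment. If the error occurs again and I find a better fix, I will revisit it.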
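A quick way to inspect the environment is to ask torch which CUDA version it was built for and how many GPUs it can see. The following is a minimal sketch under the assumption that it runs in the same interpreter used for training. `describe_env` and `problems` are hypothetical helper names, and the checks are basic sanity checks rather than a full compatibility test.

```python
def describe_env():
    """Collect basic torch/CUDA facts from the current interpreter."""
    info = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return info
    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    info["built_for_cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


def problems(info, wanted_gpus=2):
    """Return human-readable issues found in a describe_env() result."""
    if not info.get("torch_installed"):
        return ["torch is not installed in this interpreter"]
    issues = []
    if info.get("built_for_cuda") is None:
        issues.append("torch is a CPU-only build")
    if not info.get("cuda_available"):
        issues.append("torch cannot use CUDA (driver/toolkit mismatch or no GPU)")
    if info.get("gpu_count", 0) < wanted_gpus:
        issues.append(f"fewer than {wanted_gpus} GPUs are visible to torch")
    return issues
```

Running `problems(describe_env())` in the training environment returns an empty list when torch reports a CUDA build and sees both cards. It does not prove the versions are fully compatible, only that the most obvious mismatches are absent.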

Related question (03-11):

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 97 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 98) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1+cu116', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
tools/train.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-09_20:49:13
  host      : yons-MS-7E06
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 98)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 98
====================================================

What is this problem, and how can it be solved?