Traceback (most recent call last):
  File "distribute_prune_erfnet_cluster.py", line 643, in <module>
    daemon=False, )
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/workspace/geniii-trainingcode-lane/lanenet/distribute_prune_erfnet_cluster.py", line 601, in _distributed_worker
    distributed.init_process_group(backend="gloo", init_method=dist_url, world_size=world_size, rank=global_rank)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 432, in init_process_group
    timeout=timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 503, in _new_process_group_helper
    timeout=timeout)
RuntimeError: Socket Timeout
RuntimeError: Socket Timeout raised during multi-node, multi-GPU distributed training
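The traceback shows the timeout being raised inside `init_process_group` with the `gloo` backend, which is the point where each process tries to reach the rendezvous address given by `dist_url`. One possible cause (an assumption, since the post does not state the root cause) is that a worker node cannot open a TCP connection to that address. The sketch below is a minimal reachability check using only the standard library. It assumes `dist_url` has the form `tcp://host:port`; the helper names `parse_tcp_url` and `can_connect` are hypothetical and not part of PyTorch.

```python
import socket
from urllib.parse import urlparse


def parse_tcp_url(dist_url):
    """Split a 'tcp://host:port' rendezvous URL into (host, port).

    Assumption: dist_url uses the tcp:// scheme; other init methods
    (env://, file://) are rejected here.
    """
    parsed = urlparse(dist_url)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {dist_url!r}")
    return parsed.hostname, parsed.port


def can_connect(host, port, timeout_s=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False
```

This check is only meaningful when run from a non-zero-rank node after the rank 0 process has started and is listening on the port. If `can_connect` returns `False` there, likely things to inspect are the address and port in `dist_url`, firewall rules between the nodes, and which network interface Gloo binds to (the `GLOO_SOCKET_IFNAME` environment variable). If the connection succeeds but some ranks simply start late, passing a larger `timeout` (a `datetime.timedelta`) to `init_process_group` may be enough.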