When running two PyTorch DDP programs at the same time, the following error occurs:
Traceback (most recent call last):
File "train_tasks.py", line 471, in <module>
main()
File "train_tasks.py", line 211, in main
torch.distributed.init_process_group(backend="nccl")
File ".../anaconda3/envs/vilbert/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 406, in init_process_group
store, rank, world_size = next(rendezvous(url))
File ".../anaconda3/envs/vilbert/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon)
RuntimeError: Address already in use
Solution:
Pass an unused port to python -m torch.distributed.launch via --master_port, e.g. --master_port 9999. Both jobs otherwise fall back to the same default rendezvous port (29500), so the second TCPStore cannot bind and fails with "Address already in use".
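If you would rather not pick a port by hand, you can ask the OS for a free ephemeral port and forward it to the launcher. This is a minimal sketch; the helper name find_free_port is illustrative, not part of PyTorch:

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 makes the OS assign an unused ephemeral port;
    # we read it back and release the socket immediately.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

if __name__ == "__main__":
    port = find_free_port()
    # Then launch with, for example:
    #   python -m torch.distributed.launch --master_port <port> train_tasks.py
    print(port)
```

Note there is a small race: the port is released before the launcher rebinds it, so another process could grab it in between. In practice this is rare, and it is still far more robust than letting two jobs collide on the default port.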