[rank1]:[E830 06:13:19.780493643 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=146806, OpType=ALLREDUCE, NumelIn=3675597, NumelOut=3675597, Timeout(ms)=600000) ran for 600472 milliseconds before timing out.
████████████████ | 519/804 [18:24<09:52, 2.08s/it, loss=0.0163]
[rank1]:[E830 06:13:57.611628340 ProcessGroupNCCL.cpp:2271] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 146806 PG status: last enqueued work: 146806, last completed work: 146804
[rank1]:[E830 06:13:57.634486146 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E830 06:14:02.704255369 ProcessGroupNCCL.cpp:2106] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank0]:[E830 06:14:02.725916508 ProcessGroupNCCL.cpp:1685] [PG ID 0 PG GUID 0(default_pg) Rank 0] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E830 06:14:02.768576450 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from rank 1 and we will try our best to dump the debug info. Last enqueued NCCL work: 146806, last completed NCCL work: 146804.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E830 06:14:02.768580472 ProcessGroupNCCL.cpp:1746] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from rank 1 and we will try our best to dump the debug info. Last enqueued NCCL work: 146804, last completed NCCL work: 146804.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E830 06:14:08.324151488 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E830 06:14:08.324155500 ProcessGroupNCCL.cpp:1536] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E830 06:16:17.610834920 ProcessGroupNCCL.cpp:684] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E830 06:16:17.667404073 ProcessGroupNCCL.cpp:698] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E830 06:16:45.611443648 ProcessGroupNCCL.cpp:1899] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=146806, OpType=ALLREDUCE, NumelIn=3675597, NumelOut=3675597, Timeout(ms)=600000) ran for 600472 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7a36a8d785e8 in /root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x23d (0x7a36533e2a6d in /root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0xc80 (0x7a36533e47f0 in /root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a36533e5efd in /root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7a3642edbbf4 in /root/anaconda3/envs/torch/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7a36aa694ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7a36aa726850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[F830 06:22:40.572585407 ProcessGroupNCCL.cpp:1557] [PG ID 0 PG GUID 0(default_pg) Rank 0] [PG ID 0 PG GUID 0(default_pg) Rank 0] Terminating the process after attempting to dump debug info, due to collective timeout or exception.
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/anaconda3/envs/torch/lib/python3.12/multiprocessing/resource_sharer.py", line 139, in _serve
[rank1]: msg = conn.recv()
[rank1]: ^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/torch/lib/python3.12/multiprocessing/connection.py", line 249, in recv
[rank1]: buf = self._recv_bytes()
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/torch/lib/python3.12/multiprocessing/connection.py", line 413, in _recv_bytes
[rank1]: buf = self._recv(4)
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/root/anaconda3/envs/torch/lib/python3.12/multiprocessing/connection.py", line 382, in _recv
[rank1]: raise EOFError
[rank1]: EOFError
W0830 06:40:48.263000 3372 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3385 closing signal SIGTERM
E0830 06:40:48.433000 3372 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 1 (pid: 3386) of binary: /root/anaconda3/envs/torch/bin/python3.12
Traceback (most recent call last):
File "/root/anaconda3/envs/torch/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/torch/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-30_06:40:48
host : ub-MS-7A93
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 3386)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3386
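The watchdog output above points at two knobs: the flight recorder hint on the ProcessGroupNCCL.cpp:670 line (TORCH_NCCL_TRACE_BUFFER_SIZE) and the 600000 ms collective timeout that was exceeded. Below is a minimal sketch of using both, assuming a single-node run of train.py with two GPUs; the --nproc_per_node value, the buffer size, and the 30-minute timeout are illustrative choices, not values taken from this log.

```python
# Enable the NCCL flight recorder when launching, so a later timeout can dump
# per-collective traces instead of the "FlightRecorder is disabled" message:
#   TORCH_NCCL_TRACE_BUFFER_SIZE=2000 torchrun --nproc_per_node=2 train.py
#
# Inside train.py, the per-collective timeout (the 600000 ms / 10 minutes seen
# above is the NCCL default) can be raised when the process group is created:
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # illustrative value, not a recommendation
)
```

A longer timeout only helps if the stalled rank eventually reaches the pending ALLREDUCE (e.g., one rank is briefly slowed by data loading or checkpointing). If the ranks have actually diverged, as the log's own diagnosis suggests (mismatched sizes, different collective order, or a collective that never ran), the flight recorder dump is the more useful of the two.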