[2025-09-03 19:55:59] 140706823c44:16583:16896 [1] transport/net_ib.cc:2458 NCCL WARN NET/IB: Got completion from peer 127.0.0.1<19868> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::5200:e6ff:feef:3805 remoteGidsfe80::5200:e6ff:feef:3835 hca ibp0p1
140706823c44:16583:16896 [1] NCCL INFO transport/net.cc:1393 -> 6
[2025-09-03 19:55:59] 140706823c44:16584:16900 [2] transport/net_ib.cc:2458 NCCL WARN NET/IB: Got completion from peer 127.0.0.1<39536> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::5200:e6ff:feef:48b4 remoteGidsfe80::5200:e6ff:feef:3804 hca ibp3p0
140706823c44:16584:16900 [2] NCCL INFO transport/net.cc:1393 -> 6
[2025-09-03 19:55:59] 140706823c44:16582:16894 [0] transport/net_ib.cc:2458 NCCL WARN NET/IB: Got completion from peer 127.0.0.1<39536> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::5200:e6ff:feef:3837 remoteGidsfe80::5200:e6ff:feef:4887 hca ibp1p3
140706823c44:16582:16894 [0] NCCL INFO transport/net.cc:1393 -> 6
[2025-09-03 19:55:59] 140706823c44:16585:16898 [3] transport/net_ib.cc:2458 NCCL WARN NET/IB: Got completion from peer 127.0.0.1<55900> with status=12 opcode=0 len=0 vendor err 129 (Recv) localGid fe80::5200:e6ff:feef:4886 remoteGidsfe80::5200:e6ff:feef:48b6 hca ibp2p2
140706823c44:16585:16898 [3] NCCL INFO transport/net.cc:1393 -> 6
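The NET/IB warnings above are the root symptom: a completion with status=12 on a Recv corresponds to IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), meaning the RDMA path between the listed local and remote GIDs went dead and every error that follows is fallout. A minimal sketch for narrowing this down is below; the variables are standard NCCL environment knobs, but the specific values, and whether a temporary TCP fallback is acceptable on this cluster, are assumptions.

```python
# Sketch: turn up NCCL's network logging and optionally bypass InfiniBand to
# see whether the fabric itself is at fault. These can equivalently be
# exported in the torchrun launch environment; they must be in place before
# the first collective creates the NCCL communicator.
import os

os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
os.environ["NCCL_DEBUG_SUBSYS"] = "NET"    # focus on the transport layer
# Uncomment to test with plain TCP sockets instead of the ibp* HCAs
# (slow, but isolates whether the RDMA fabric is the culprit):
# os.environ["NCCL_IB_DISABLE"] = "1"
# If the links are just flaky, larger IB timeouts/retries sometimes help:
# os.environ["NCCL_IB_TIMEOUT"] = "22"
# os.environ["NCCL_IB_RETRY_CNT"] = "13"
```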
[rank1]:[E903 20:05:00.083473453 ProcessGroupNCCL.cpp:685] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank1]:[E903 20:05:00.083725262 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 1] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
[rank1]:[E903 20:05:00.083739310 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E903 20:05:00.083812782 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank0]:[E903 20:05:00.155314876 ProcessGroupNCCL.cpp:685] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank0]:[E903 20:05:00.155505084 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 0] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
[rank0]:[E903 20:05:00.155519676 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank0]:[E903 20:05:00.155597596 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 0] First PG on this rank to signal dumping.
[rank3]:[E903 20:05:00.160329608 ProcessGroupNCCL.cpp:685] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank3]:[E903 20:05:00.160487240 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
[rank3]:[E903 20:05:00.160497256 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E903 20:05:00.160547272 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 3] First PG on this rank to signal dumping.
[rank2]:[E903 20:05:00.249766401 ProcessGroupNCCL.cpp:685] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600069 milliseconds before timing out.
[rank2]:[E903 20:05:00.249872834 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 1 PG status: last enqueued work: 1, last completed work: -1
[rank2]:[E903 20:05:00.249881826 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E903 20:05:00.249934338 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 2] First PG on this rank to signal dumping.
[rank3]:[E903 20:05:00.459968512 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 3] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E903 20:05:00.459959936 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank2]:[E903 20:05:00.459993568 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 2] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank3]:[E903 20:05:00.460112000 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E903 20:05:00.460113952 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank2]:[E903 20:05:00.460117472 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E903 20:05:00.463863369 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank1]:[E903 20:05:00.464040010 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
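Each rank also reports that the stack trace of the failed collective is unavailable because FlightRecorder is disabled, and the log names the switch itself: TORCH_NCCL_TRACE_BUFFER_SIZE. A hedged sketch of enabling it is below; the buffer size and the companion dump-on-timeout variable are assumptions about a recent PyTorch build.

```python
# Sketch: enable the NCCL Flight Recorder referenced in the warning above so
# the next watchdog timeout dumps per-collective trace records.
import os

# Named explicitly in the log; any non-zero value enables the recorder.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
# May also be available in this PyTorch build (assumption): write the dump
# automatically when the watchdog fires.
# os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"
# Both must be set before torch.distributed.init_process_group() constructs
# the NCCL process group, or exported before launching torchrun.
```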
140706823c44:16583:17177 [1] NCCL INFO misc/socket.cc:64 -> 3
140706823c44:16583:17177 [1] NCCL INFO misc/socket.cc:81 -> 3
140706823c44:16583:17177 [1] NCCL INFO misc/socket.cc:864 -> 3
140706823c44:16583:16885 [1] NCCL INFO misc/socket.cc:916 -> 3
Creation of the build directory /opt/Megatron-LM/megatron/legacy/fused_kernels/build failed
140706823c44:16582:17179 [0] NCCL INFO misc/socket.cc:64 -> 3
140706823c44:16582:17179 [0] NCCL INFO misc/socket.cc:81 -> 3
140706823c44:16582:17179 [0] NCCL INFO misc/socket.cc:864 -> 3
140706823c44:16582:16886 [0] NCCL INFO misc/socket.cc:916 -> 3
140706823c44:16585:17180 [3] NCCL INFO misc/socket.cc:64 -> 3
140706823c44:16585:17180 [3] NCCL INFO misc/socket.cc:81 -> 3
140706823c44:16585:17180 [3] NCCL INFO misc/socket.cc:864 -> 3
140706823c44:16585:16890 [3] NCCL INFO misc/socket.cc:916 -> 3
Creation of the build directory /opt/Megatron-LM/megatron/legacy/fused_kernels/build failed
140706823c44:16584:17182 [2] NCCL INFO misc/socket.cc:64 -> 3
140706823c44:16584:17182 [2] NCCL INFO misc/socket.cc:81 -> 3
140706823c44:16584:17182 [2] NCCL INFO misc/socket.cc:864 -> 3
140706823c44:16584:16889 [2] NCCL INFO misc/socket.cc:916 -> 3
Creation of the build directory /opt/Megatron-LM/megatron/legacy/fused_kernels/build failed
140706823c44:16583:17177 [1] NCCL INFO comm 0xafcbcf8345e0 rank 1 nranks 4 cudaDev 1 busId 906000 - Abort COMPLETE
[rank1]:[E903 20:06:01.727205717 ProcessGroupNCCL.cpp:746] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E903 20:06:01.727244053 ProcessGroupNCCL.cpp:760] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E903 20:06:01.727875735 ProcessGroupNCCL.cpp:2068] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
Exception raised from checkTimeout at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xfdcd77ee4180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22c (0xfdcd78e09b7c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xdfc (0xfdcd78e1038c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xcc (0xfdcd78e118ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdaa9c (0xfdcd7797aa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x7d5b8 (0xfdcdb9a6d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xe5edc (0xfdcdb9ad5edc in /lib/aarch64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/Megatron-LM/pretrain_gpt.py", line 245, in <module>
[rank1]: pretrain(
[rank1]: File "/opt/Megatron-LM/megatron/training/training.py", line 193, in pretrain
[rank1]: initialize_megatron(extra_args_provider=extra_args_provider,
[rank1]: File "/opt/Megatron-LM/megatron/training/initialize.py", line 100, in initialize_megatron
[rank1]: _compile_dependencies()
[rank1]: File "/opt/Megatron-LM/megatron/training/initialize.py", line 173, in _compile_dependencies
[rank1]: torch.distributed.barrier()
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4811, in barrier
[rank1]: work = group.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
Exception raised from checkTimeout at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xfdcd77ee4180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22c (0xfdcd78e09b7c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xdfc (0xfdcd78e1038c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xcc (0xfdcd78e118ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdaa9c (0xfdcd7797aa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x7d5b8 (0xfdcdb9a6d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xe5edc (0xfdcdb9ad5edc in /lib/aarch64-linux-gnu/libc.so.6)
Exception raised from run at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xfdcd77ee4180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11b88e0 (0xfdcd78dc88e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::run() + 0x45c (0xfdcd78e11c3c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdaa9c (0xfdcd7797aa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x7d5b8 (0xfdcdb9a6d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0xe5edc (0xfdcdb9ad5edc in /lib/aarch64-linux-gnu/libc.so.6)
140706823c44:16585:17180 [3] NCCL INFO comm 0xba60cf534e60 rank 3 nranks 4 cudaDev 3 busId 1906000 - Abort COMPLETE
[rank3]:[E903 20:06:01.856367983 ProcessGroupNCCL.cpp:746] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E903 20:06:01.856392463 ProcessGroupNCCL.cpp:760] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
140706823c44:16582:17179 [0] NCCL INFO comm 0xbdefb65320e0 rank 0 nranks 4 cudaDev 0 busId 806000 - Abort COMPLETE
[rank3]:[E903 20:06:01.856965041 ProcessGroupNCCL.cpp:2068] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xe7e01d684180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22c (0xe7e01e5a9b7c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xdfc (0xe7e01e5b038c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xcc (0xe7e01e5b18ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdaa9c (0xe7e01d11aa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x7d5b8 (0xe7e05f20d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xe5edc (0xe7e05f275edc in /lib/aarch64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]:[E903 20:06:01.857166705 ProcessGroupNCCL.cpp:746] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E903 20:06:01.857203857 ProcessGroupNCCL.cpp:760] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
Exception raised from checkTimeout at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xe7e01d684180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22c (0xe7e01e5a9b7c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xdfc (0xe7e01e5b038c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xcc (0xe7e01e5b18ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdaa9c (0xe7e01d11aa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x7d5b8 (0xe7e05f20d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xe5edc (0xe7e05f275edc in /lib/aarch64-linux-gnu/libc.so.6)
Exception raised from run at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xe7e01d684180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11b88e0 (0xe7e01e5688e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::run() + 0x45c (0xe7e01e5b1c3c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdaa9c (0xe7e01d11aa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x7d5b8 (0xe7e05f20d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0xe5edc (0xe7e05f275edc in /lib/aarch64-linux-gnu/libc.so.6)
[rank0]:[E903 20:06:01.857829171 ProcessGroupNCCL.cpp:2068] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xf144dda24180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22c (0xf144de949b7c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xdfc (0xf144de95038c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xcc (0xf144de9518ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdaa9c (0xf144dd4baa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x7d5b8 (0xf1451f5ad5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xe5edc (0xf1451f615edc in /lib/aarch64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/Megatron-LM/pretrain_gpt.py", line 245, in <module>
[rank0]: pretrain(
[rank0]: File "/opt/Megatron-LM/megatron/training/training.py", line 193, in pretrain
[rank0]: initialize_megatron(extra_args_provider=extra_args_provider,
[rank0]: File "/opt/Megatron-LM/megatron/training/initialize.py", line 100, in initialize_megatron
[rank0]: _compile_dependencies()
[rank0]: File "/opt/Megatron-LM/megatron/training/initialize.py", line 173, in _compile_dependencies
[rank0]: torch.distributed.barrier()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 4811, in barrier
[rank0]: work = group.barrier(opts=opts)
[rank0]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 0.
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xf144dda24180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x22c (0xf144de949b7c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0xdfc (0xf144de95038c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xcc (0xf144de9518ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdaa9c (0xf144dd4baa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x7d5b8 (0xf1451f5ad5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0xe5edc (0xf1451f615edc in /lib/aarch64-linux-gnu/libc.so.6)
Exception raised from run at /build/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd0 (0xf144dda24180 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11b88e0 (0xf144de9088e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::run() + 0x45c (0xf144de951c3c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdaa9c (0xf144dd4baa9c in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x7d5b8 (0xf1451f5ad5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0xe5edc (0xf1451f615edc in /lib/aarch64-linux-gnu/libc.so.6)
W0903 20:06:01.224000 16517 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 16582 closing signal SIGTERM
W0903 20:06:01.225000 16517 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 16584 closing signal SIGTERM
W0903 20:06:01.226000 16517 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 16585 closing signal SIGTERM
E0903 20:06:01.856000 16517 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 1 (pid: 16583) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 7, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
pretrain_gpt.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-09-03_20:06:01
host : 140706823c44
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 16583)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 16583
======================================================
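The Python tracebacks show the hang happens at the very first collective: torch.distributed.barrier() inside initialize_megatron's _compile_dependencies, an ALLREDUCE of a single element that never completes. That makes it easy to take Megatron-LM out of the equation with a standalone script. A minimal sketch follows, assuming four local GPUs and a torchrun launch; the file name nccl_sanity.py and the 120-second timeout are illustrative, not part of the original run.

```python
# nccl_sanity.py -- hypothetical minimal repro, launched with e.g.:
#   torchrun --nproc_per_node=4 nccl_sanity.py
# If this also stalls and dies with the same NET/IB warnings, the problem is
# the NCCL/fabric setup, not Megatron-LM.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Short timeout so a broken link fails fast instead of after 600 s.
    dist.init_process_group(backend="nccl", timeout=timedelta(seconds=120))

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)   # same shape of work as the SeqNum=1 ALLREDUCE above
    dist.barrier()       # same call that timed out in initialize_megatron
    if dist.get_rank() == 0:
        print("all_reduce result:", x.item(), "world_size:", dist.get_world_size())
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```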