NCCL Collective Communication: Collective Operations

In-depth look at NVIDIA collective operations: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter explained
This article describes the NCCL collective communication operations AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter, covering their semantics, examples, and ring-based implementation strategies such as Ring AllReduce, Ring ReduceScatter, and Ring AllGather. These primitives are fundamental to deep learning and distributed computing.

Collective Operations

AllReduce

(Figure: AllReduce)

ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)
  • An AllReduce involves N devices and has no root. Initially, each of the N devices has a send buffer of length count (the size_t count parameter); their contents are denoted in_0, in_1, …, in_N-1. After the operation completes, the contents of all send buffers have been combined with op, and each of the N devices holds an identical copy of the result, out, of length count, in its recv buffer.
  • Taking op = ncclSum as an example, out[i] is the sum of in_k[i] over all send buffers, i.e. out[i] = sum(in_k[i]) (see the sketch below).
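A minimal per-rank sketch of the call, assuming comm and stream were already created elsewhere (e.g. via ncclCommInitRank and cudaStreamCreate) and that sendbuff/recvbuff are device pointers holding count floats; names are illustrative, not NCCL source:

  #include <nccl.h>
  #include <cuda_runtime.h>

  // Every rank calls this with the same count. Afterwards, recvbuff[i] on every
  // rank equals the sum over all ranks of that rank's sendbuff[i].
  void allreduce_sum(const float* sendbuff, float* recvbuff, size_t count,
                     ncclComm_t comm, cudaStream_t stream) {
    ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
    cudaStreamSynchronize(stream);  // the collective is asynchronous; wait for it here
  }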

Broadcast

(Figure: Broadcast with root = 2)

ncclResult_t ncclBroadcast(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream)

A broadcast involves N devices; root is the int root parameter (2 in the figure above). Initially, the device whose rank equals root (2 in the figure) has a send buffer of length count. After the operation completes, each of the N participating devices has a recv buffer of length count, and the data in every rank's recv buffer is a copy of the root device's send buffer.
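The corresponding call, sketched under the same assumptions as above (comm/stream already created, device buffers of length count, illustrative names):

  // Every rank must pass the same root (2 here, matching the figure).
  // Only the root's sendbuff is read; every rank's recvbuff ends up as a copy of it.
  ncclBroadcast(sendbuff, recvbuff, count, ncclFloat, /*root=*/2, comm, stream);
  cudaStreamSynchronize(stream);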

Reduce

(Figure: Reduce with root = 2)

ncclResult_t ncclReduce(const void* sendbuff, void* recvbuff, size_t count, ncclDataType_t datatype, ncclRedOp_t op, int root, ncclComm_t comm, cudaStream_t stream)
  • A reduce involves N devices; root is the int root parameter (2 in the figure above). Initially, each of the N devices has a send buffer of length count, denoted in_0, in_1, …, in_N-1. After the operation completes, the device whose rank equals root (2 in the figure) has a recv buffer of length count, denoted out, whose contents are the result of combining the N send buffers with op.
  • Taking op = ncclSum as an example, out[i] is the sum of in_X[i] over the send buffers, i.e. out[i] = sum(in_X[i]) (see the sketch below).
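A sketch of the call under the same assumptions as before (illustrative buffer names, root fixed to 2 as in the figure):

  // Every rank contributes count elements from sendbuff; after the call only the
  // root (rank 2) holds the element-wise sum in its recvbuff.
  ncclReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, /*root=*/2, comm, stream);
  cudaStreamSynchronize(stream);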

AllGather

(Figure: AllGather)

ncclResult_t ncclAllGather(const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream)

An AllGather involves N devices and has no root. Initially, each of the N devices has a send buffer of length sendcount, denoted in_0, in_1, …, in_N-1. After the operation completes, all N devices hold identical recv buffers of length N*sendcount, denoted out. In each device's recv buffer, out[Y*sendcount + i] comes from in_Y[i] in the send buffer of the device with rank Y, i.e. out[Y*sendcount + i] = in_Y[i].
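A sketch highlighting the buffer sizes (same assumptions; sendbuff holds sendcount elements on each rank, recvbuff must hold nranks * sendcount elements):

  // After the call, on every rank: recvbuff[Y * sendcount + i] == rank Y's sendbuff[i].
  ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);
  cudaStreamSynchronize(stream);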

ReduceScatter

(Figure: ReduceScatter)

ncclResult_t ncclReduceScatter(const void* sendbuff, void* recvbuff, size_t recvcount, ncclDataType_t datatype, ncclRedOp_t op, ncclComm_t comm, cudaStream_t stream)
  • A ReduceScatter involves N devices and has no root. Initially, each of the N devices has a send buffer of length N*recvcount. After the operation completes, each of the N devices has a recv buffer of length recvcount, denoted out_0, out_1, …, out_N-1. The N send buffers are combined with op into a reduced result, which is scattered across the N devices so that each device holds only one slice of it.
  • Taking op = ncclSum as an example, on the device with rank Y, out_Y[i] is the element at position Y*recvcount + i of the reduced result, i.e. out_Y[i] = sum(in_X[Y*recvcount + i]) (see the sketch below).
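A sketch highlighting the buffer sizes (same assumptions; sendbuff holds nranks * recvcount elements, recvbuff holds recvcount elements):

  // After the call, on rank Y:
  // recvbuff[i] == sum over ranks X of X's sendbuff[Y * recvcount + i].
  ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);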

Implementation of the Collectives

Ring AllReduce


The devices are chained head to tail into a unidirectional ring. Each ring processes one block of the data; when NCCL launches the kernel it determines the number of blocks, and one block corresponds to one ring.

How AllReduce proceeds on one ring:

  • There are k devices in total; the data on each device is partitioned into k equal-sized chunks.
    (Figure: each device's data partitioned into k chunks)

  • Step 0: the current device, with rank ringIx, sends chunk number (ringIx + k - 1) mod k to the next device.

(Figure: step 0, each device pushes one chunk to its successor)

 // step 0: push data to next GPU
      chunk = modRanks(ringIx + nranks-1);
      offset = calcOffset(chunk);
      nelem = min(realChunkSize, size-offset);
      prims.send(offset, nelem);
  • Then perform k - 2 iterations, with j running from 2 to k - 1. In each iteration the device receives the chunk with id (ringIx + k - j) mod k from its predecessor, reduces it with its local chunk, and passes the result to the next device; the index arithmetic is illustrated in the sketch after the code excerpt below.

(Figure: the k-2 receive-reduce-send steps)

 // k-2 steps: reduce and copy to next GPU
      for (int j=2; j<nranks; ++j) {
        chunk = modRanks(ringIx + nranks-j);
        offset = calcOffset(chunk);
        nelem = min(realChunkSize, size-offset);
        prims.recvReduceSend(offset, nelem);
      }
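To make the index arithmetic concrete, the chunk schedule of this reduce phase can be simulated on the host. The following is a standalone illustrative sketch (plain C, not NCCL code) for k = 4 ranks, with mod_ranks playing the role of modRanks:

  #include <stdio.h>

  #define K 4                                    // number of ranks = chunks per buffer

  static int mod_ranks(int x) { return x % K; }  // wraparound, like modRanks() above

  int main(void) {
    // Reduce phase of ring AllReduce: at step j, rank r pushes chunk (r + K - j) mod K
    // to rank (r + 1) mod K. j = 1 is the plain send (step 0 in the excerpt);
    // j = 2 .. K-1 are the receive-reduce-send steps.
    for (int j = 1; j <= K - 1; ++j) {
      printf("step %d:", j);
      for (int r = 0; r < K; ++r)
        printf("  rank %d -> chunk %d", r, mod_ranks(r + K - j));
      printf("\n");
    }
    return 0;
  }

Running it shows that each chunk visits every rank exactly once during these k - 1 steps, so afterwards each rank holds the fully reduced version of one chunk; the subsequent allgather phase (not shown in the excerpt above) then circulates those reduced chunks around the ring.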