Copy + Paste -> Cluster Ready | SLURM HPC

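A Quick-Start Guide to Building a SLURM HPC Cluster
This article describes a way to build a SLURM HPC cluster that suits people who are not professional Linux administrators: no advanced skills are required, only copying and pasting commands. It works on CentOS and Ubuntu and relies on the hpc4you toolkit to make cluster setup zero-configuration, zero-administration, and zero-maintenance, which makes it a particularly good fit for research groups that need a computing platform quickly.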

See here: http://t.csdn.cn/KXzn5

Before We Start

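This guide assumes that your computer skills are at the level most résumés describe: "proficient in Microsoft Word, PowerPoint, and Excel; able to plot with Origin; able to do simple retouching in Photoshop." It does assume, however, that you can tell the letter keys, number keys, and arrow keys apart on a keyboard, and that you understand you still have to press Enter after pasting a command.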

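This approach brings the cluster up semi-automatically.
  • If you can get by in vi, you only need to edit one file with it. (You do not have to know vi; anyone who truly knows vi is impressive. Being able to add a few lines with vi and knowing vi are two different things.)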
Using nodes: slurm-gb200-217-[027,047]
# nThread 1 nGpus 1 minBytes 536870912 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 234738 on slurm-gb200-217-027 device 0 [0x01] NVIDIA GB200
# Rank 1 Group 0 Pid 234739 on slurm-gb200-217-027 device 1 [0x01] NVIDIA GB200
# Rank 2 Group 0 Pid 234740 on slurm-gb200-217-027 device 2 [0x01] NVIDIA GB200
# Rank 3 Group 0 Pid 234741 on slurm-gb200-217-027 device 3 [0x01] NVIDIA GB200
# Rank 4 Group 0 Pid 237289 on slurm-gb200-217-047 device 0 [0x01] NVIDIA GB200
# Rank 5 Group 0 Pid 237290 on slurm-gb200-217-047 device 1 [0x01] NVIDIA GB200
# Rank 6 Group 0 Pid 237291 on slurm-gb200-217-047 device 2 [0x01] NVIDIA GB200
# Rank 7 Group 0 Pid 237292 on slurm-gb200-217-047 device 3 [0x01] NVIDIA GB200
slurm-gb200-217-027:234738:234738 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234738:234738 [0] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234738:234738 [0] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234738:234738 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234740:234740 [2] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234740:234740 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234740:234740 [2] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234740:234740 [2] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234741:234741 [3] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234741:234741 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234741:234741 [3] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234741:234741 [3] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234739:234739 [1] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234739:234739 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234739:234739 [1] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234739:234739 [1] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-047:237289:237289 [0] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-047:237289:237289 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237289:237289 [0] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0>
slurm-gb200-217-047:237289:237289 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-047:237292:237292 [3] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-047:237292:237292 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237292:237292 [3] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0>
slurm-gb200-217-047:237292:237292 [3] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-047:237290:237290 [1] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-047:237290:237290 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237290:237290 [1] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0>
slurm-gb200-217-047:237290:237290 [1] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-027:234738:235129 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-027:234738:235129 [0] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237291:237291 [2] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-047:237291:237291 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237291:237291 [2] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0>
slurm-gb200-217-047:237291:237291 [2] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0>
slurm-gb200-217-027:234738:235129 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO Using network IBext_v8
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-027:234740:235130 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-027:234740:235130 [2] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-027:234741:235131 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-027:234741:235131 [3] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-027:234739:235132 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-027:234739:235132 [1] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0>
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0>
slurm-gb200-217-027:234741:235131 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO Using network IBext_v8
slurm-gb200-217-027:234740:235130 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-027:234740:235130 [2] NCCL INFO Using network IBext_v8
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0>
slurm-gb200-217-027:234739:235132 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-027:234739:235132 [1] NCCL INFO Using network IBext_v8
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-047:237289:237675 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-047:237289:237675 [0] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-047:237292:237676 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-047:237292:237676 [3] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-047:237290:237677 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-047:237290:237677 [1] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234738:235129 [0] NCCL INFO DMA-BUF is available on GPU device 0
slurm-gb200-217-027:234738:235129 [0] NCCL INFO ncclCommInitRank comm 0xc4b2133f8d20 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init START
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0>
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0>
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0>
slurm-gb200-217-047:237289:237675 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-047:237289:237675 [0] NCCL INFO Using network IBext_v8
slurm-gb200-217-047:237292:237676 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO Using network IBext_v8
slurm-gb200-217-047:237290:237677 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO Using network IBext_v8
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8)
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8)
slurm-gb200-217-047:237291:237678 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
slurm-gb200-217-047:237291:237678 [2] NCCL INFO P2P plugin v8 IBext_v8
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0>
slurm-gb200-217-047:237291:237678 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO Using network IBext_v8 slurm-gb200-217-027:234741:235131 [3] NCCL INFO DMA-BUF is available on GPU device 3 slurm-gb200-217-027:234740:235130 [2] NCCL INFO DMA-BUF is available on GPU device 2 slurm-gb200-217-027:234741:235131 [3] NCCL INFO ncclCommInitRank comm 0xb4ab52367d50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234739:235132 [1] NCCL INFO DMA-BUF is available on GPU device 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO ncclCommInitRank comm 0xabb25a0df930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234739:235132 [1] NCCL INFO ncclCommInitRank comm 0xbf66c0b18250 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234739:235132 [1] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-027:234740:235130 [2] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237290:237677 [1] NCCL INFO DMA-BUF is available on GPU device 1 slurm-gb200-217-047:237289:237675 [0] NCCL INFO DMA-BUF is available on GPU device 0 slurm-gb200-217-047:237290:237677 [1] NCCL INFO ncclCommInitRank comm 0xb82cf5a83890 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-047:237289:237675 [0] NCCL INFO ncclCommInitRank comm 0xbdcbb5305fc0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234741:235131 [3] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237289:237675 [0] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237292:237676 [3] NCCL INFO DMA-BUF is available on GPU device 3 slurm-gb200-217-047:237291:237678 [2] NCCL INFO DMA-BUF is available on GPU device 2 slurm-gb200-217-047:237292:237676 [3] NCCL INFO ncclCommInitRank comm 0xbcbd5fdbded0 rank 7 nranks 8 
cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-047:237291:237678 [2] NCCL INFO ncclCommInitRank comm 0xc73c5bd41e40 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234738:235129 [0] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237290:237677 [1] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237291:237678 [2] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237292:237676 [3] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237291:237678 [2] NCCL INFO Bootstrap timings total 0.001874 (create 0.000049, send 0.000236, recv 0.000857, ring 0.000277, delay 0.000001) slurm-gb200-217-047:237292:237676 [3] NCCL INFO Bootstrap timings total 0.001987 (create 0.000058, send 0.000308, recv 0.000486, ring 0.000298, delay 0.000001) slurm-gb200-217-027:234738:235129 [0] NCCL INFO Bootstrap timings total 0.150981 (create 0.000092, send 0.000167, recv 0.099449, ring 0.000395, delay 0.000001) slurm-gb200-217-027:234739:235132 [1] NCCL INFO Bootstrap timings total 0.051531 (create 0.000051, send 0.000094, recv 0.000164, ring 0.050388, delay 0.000001) slurm-gb200-217-027:234740:235130 [2] NCCL INFO Bootstrap timings total 0.051642 (create 0.000049, send 0.000100, recv 0.000104, ring 0.050388, delay 0.000001) slurm-gb200-217-027:234741:235131 [3] NCCL INFO Bootstrap timings total 0.052601 (create 0.000062, send 0.000132, recv 0.037772, ring 0.013873, delay 0.000001) slurm-gb200-217-047:237290:237677 [1] NCCL INFO Bootstrap timings total 0.018151 (create 0.000077, send 0.000318, recv 0.016776, ring 0.000449, delay 0.000001) slurm-gb200-217-047:237289:237675 [0] NCCL INFO Bootstrap timings total 0.015634 (create 0.000060, send 0.000705, recv 0.000611, ring 0.013624, delay 0.000001) slurm-gb200-217-047:237292:237676 [3] NCCL INFO MNNVL busId 0x1901000 fabric UUID 
254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237289:237675 [0] NCCL INFO MNNVL busId 0x801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234741:235131 [3] NCCL INFO MNNVL busId 0x1901000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234740:235130 [2] NCCL INFO MNNVL busId 0x1801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234738:235129 [0] NCCL INFO MNNVL busId 0x801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237290:237677 [1] NCCL INFO MNNVL busId 0x901000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234739:235132 [1] NCCL INFO MNNVL busId 0x901000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237291:237678 [2] NCCL INFO MNNVL busId 0x1801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237291:237678 [2] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 6 slurm-gb200-217-047:237289:237675 [0] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 4 slurm-gb200-217-047:237292:237676 [3] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 7 slurm-gb200-217-047:237290:237677 [1] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 5 slurm-gb200-217-027:234741:235131 [3] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 3 slurm-gb200-217-027:234739:235132 [1] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 2 slurm-gb200-217-027:234738:235129 [0] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 0 slurm-gb200-217-047:237292:237676 [3] NCCL INFO Setting affinity for 
GPU 3 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237291:237678 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-027:234740:235130 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-027:234741:235131 [3] NCCL INFO Setting affinity for GPU 3 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237290:237677 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffffffff slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-027:234739:235132 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffffffff slurm-gb200-217-027:234738:235129 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237289:237675 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. 
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237292:237676 [3] NCCL INFO comm 0xbcbd5fdbded0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO comm 0xabb25a0df930 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 1 slurm-gb200-217-027:234741:235131 [3] NCCL INFO comm 0xb4ab52367d50 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 1 slurm-gb200-217-047:237291:237678 [2] NCCL INFO comm 0xc73c5bd41e40 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 1 slurm-gb200-217-047:237292:237676 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 [24] -1/-1/-1->7->6 [25] -1/-1/-1->7->6 [26] -1/-1/-1->7->6 [27] -1/-1/-1->7->6 [28] -1/-1/-1->7->6 [29] -1/-1/-1->7->6 [30] -1/-1/-1->7->6 [31] -1/-1/-1->7->6 slurm-gb200-217-047:237292:237676 [3] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO comm 0xc4b2133f8d20 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 
3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 [24] 3/-1/-1->2->1 [25] 3/-1/-1->2->1 [26] 3/-1/-1->2->1 [27] 3/-1/-1->2->1 [28] 3/-1/-1->2->1 [29] 3/-1/-1->2->1 [30] 3/-1/-1->2->1 [31] 3/-1/-1->2->1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234741:235131 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 [24] 4/-1/-1->3->2 [25] 4/-1/-1->3->2 [26] 4/-1/-1->3->2 [27] 4/-1/-1->3->2 [28] 4/-1/-1->3->2 [29] 4/-1/-1->3->2 [30] 4/-1/-1->3->2 [31] 4/-1/-1->3->2 slurm-gb200-217-027:234741:235131 [3] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 00/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 01/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237291:237678 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 [24] 7/-1/-1->6->5 [25] 7/-1/-1->6->5 [26] 7/-1/-1->6->5 [27] 7/-1/-1->6->5 [28] 7/-1/-1->6->5 [29] 7/-1/-1->6->5 [30] 7/-1/-1->6->5 [31] 7/-1/-1->6->5 slurm-gb200-217-047:237291:237678 [2] NCCL INFO P2P Chunksize set 
to 524288 slurm-gb200-217-027:234739:235132 [1] NCCL INFO comm 0xbf66c0b18250 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 1 slurm-gb200-217-047:237290:237677 [1] NCCL INFO comm 0xb82cf5a83890 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 1 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 02/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 03/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 04/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 05/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 06/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 07/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237290:237677 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 [24] 6/-1/-1->5->4 [25] 6/-1/-1->5->4 [26] 6/-1/-1->5->4 [27] 6/-1/-1->5->4 [28] 6/-1/-1->5->4 [29] 6/-1/-1->5->4 [30] 6/-1/-1->5->4 [31] 6/-1/-1->5->4 slurm-gb200-217-047:237290:237677 [1] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 08/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 09/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 10/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 11/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 12/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 13/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237289:237675 [0] NCCL INFO 
comm 0xbdcbb5305fc0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 1 slurm-gb200-217-027:234739:235132 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 [24] 2/-1/-1->1->0 [25] 2/-1/-1->1->0 [26] 2/-1/-1->1->0 [27] 2/-1/-1->1->0 [28] 2/-1/-1->1->0 [29] 2/-1/-1->1->0 [30] 2/-1/-1->1->0 [31] 2/-1/-1->1->0 slurm-gb200-217-027:234739:235132 [1] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-047:237289:237675 [0] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 [24] 5/-1/-1->4->3 [25] 5/-1/-1->4->3 [26] 5/-1/-1->4->3 [27] 5/-1/-1->4->3 [28] 5/-1/-1->4->3 [29] 5/-1/-1->4->3 [30] 5/-1/-1->4->3 [31] 5/-1/-1->4->3 slurm-gb200-217-047:237289:237675 [0] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 14/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 15/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 16/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 17/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 18/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 19/32 : 0 
1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 20/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 21/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 22/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237291:237700 [2] NCCL INFO [Proxy Service] Device 2 CPU core 140 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 23/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 24/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 25/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 26/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 27/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 28/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237292:237699 [3] NCCL INFO [Proxy Service] Device 3 CPU core 79 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 29/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 30/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 31/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237291:237702 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 73 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 [24] 1/-1/-1->0->-1 [25] 1/-1/-1->0->-1 [26] 1/-1/-1->0->-1 [27] 1/-1/-1->0->-1 [28] 1/-1/-1->0->-1 [29] 1/-1/-1->0->-1 [30] 1/-1/-1->0->-1 [31] 1/-1/-1->0->-1 slurm-gb200-217-027:234738:235129 [0] NCCL INFO P2P Chunksize set to 524288 
slurm-gb200-217-047:237292:237701 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 80 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0 slurm-gb200-217-027:234740:235153 [2] NCCL INFO [Proxy Service] Device 2 CPU core 112 slurm-gb200-217-027:234741:235154 [3] NCCL INFO [Proxy Service] Device 3 CPU core 126 slurm-gb200-217-027:234741:235155 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 129 slurm-gb200-217-027:234740:235156 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 114 slurm-gb200-217-047:237289:237704 [0] NCCL INFO [Proxy Service] Device 0 CPU core 4 slurm-gb200-217-047:237290:237706 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 60 slurm-gb200-217-027:234739:235157 [1] NCCL INFO [Proxy Service] Device 1 CPU core 8 slurm-gb200-217-047:237290:237703 [1] NCCL INFO [Proxy Service] Device 1 CPU core 2 slurm-gb200-217-027:234738:235158 [0] NCCL INFO [Proxy Service] Device 0 CPU core 8 slurm-gb200-217-047:237289:237705 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 6 slurm-gb200-217-027:234739:235159 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 45 slurm-gb200-217-027:234738:235160 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 45 slurm-gb200-217-027:234741:235131 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234741:235131 [3] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237291:237678 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237291:237678 [2] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234739:235132 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234739:235132 [1] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237292:237676 [3] NCCL INFO 
threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237292:237676 [3] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234740:235130 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234740:235130 [2] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237290:237677 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237290:237677 [1] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237289:237675 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237289:237675 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234738:235129 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234738:235129 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234738:235129 [0] NCCL INFO CC Off, workFifoBytes 1048576 slurm-gb200-217-027:234740:235130 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-027:234740:235130 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-027:234740:235130 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. 
slurm-gb200-217-027:234740:235130 [2] NCCL INFO ncclCommInitRank comm 0xabb25a0df930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-027:234740:235130 [2] NCCL INFO Init timings - ncclCommInitRank: rank 2 nranks 8 total 0.94 (kernels 0.09, alloc 0.16, bootstrap 0.05, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01) slurm-gb200-217-027:234741:235131 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-027:234741:235131 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-027:234741:235131 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. slurm-gb200-217-027:234741:235131 [3] NCCL INFO ncclCommInitRank comm 0xb4ab52367d50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-027:234741:235131 [3] NCCL INFO Init timings - ncclCommInitRank: rank 3 nranks 8 total 0.93 (kernels 0.09, alloc 0.16, bootstrap 0.05, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.08, rest 0.01) slurm-gb200-217-047:237291:237678 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-047:237291:237678 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-047:237292:237676 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-047:237292:237676 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-047:237292:237676 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. slurm-gb200-217-047:237291:237678 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. 
slurm-gb200-217-047:237291:237678 [2] NCCL INFO ncclCommInitRank comm 0xc73c5bd41e40 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-047:237291:237678 [2] NCCL INFO Init timings - ncclCommInitRank: rank 6 nranks 8 total 0.88 (kernels 0.10, alloc 0.14, bootstrap 0.00, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01) slurm-gb200-217-027:234739:235132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-027:234739:235132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-027:234739:235132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. slurm-gb200-217-047:237292:237676 [3] NCCL INFO ncclCommInitRank comm 0xbcbd5fdbded0 rank 7 nranks 8 cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-047:237292:237676 [3] NCCL INFO Init timings - ncclCommInitRank: rank 7 nranks 8 total 0.88 (kernels 0.09, alloc 0.16, bootstrap 0.00, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01) slurm-gb200-217-027:234739:235132 [1] NCCL INFO ncclCommInitRank comm 0xbf66c0b18250 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-027:234739:235132 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 8 total 0.93 (kernels 0.09, alloc 0.15, bootstrap 0.05, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01) slurm-gb200-217-047:237289:237675 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-027:234738:235129 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-027:234738:235129 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-027:234738:235129 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. 
slurm-gb200-217-047:237290:237677 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-047:237290:237677 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-027:234738:235129 [0] NCCL INFO ncclCommInitRank comm 0xc4b2133f8d20 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-047:237289:237675 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-047:237289:237675 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. slurm-gb200-217-047:237289:237675 [0] NCCL INFO ncclCommInitRank comm 0xbdcbb5305fc0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-027:234738:235129 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 8 total 0.97 (kernels 0.09, alloc 0.10, bootstrap 0.15, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.00) slurm-gb200-217-047:237290:237677 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. 
slurm-gb200-217-047:237290:237677 [1] NCCL INFO ncclCommInitRank comm 0xb82cf5a83890 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init COMPLETE slurm-gb200-217-047:237290:237677 [1] NCCL INFO Init timings - ncclCommInitRank: rank 5 nranks 8 total 0.88 (kernels 0.09, alloc 0.14, bootstrap 0.02, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01) slurm-gb200-217-047:237289:237675 [0] NCCL INFO Init timings - ncclCommInitRank: rank 4 nranks 8 total 0.89 (kernels 0.09, alloc 0.15, bootstrap 0.02, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.00) # # out-of-place in-place # size count type redop root time algbw busbw #wrong time algbw busbw #wrong # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO 
Channel 06/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 17/0 : 2[2] 
-> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 24/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 24/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 25/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 25/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 26/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 26/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 27/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 27/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 28/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 28/0 : 3[3] -> 4[0] via 
P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 29/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 29/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 30/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 30/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 31/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 31/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 00/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 00/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 00/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 01/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 01/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 01/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 02/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 02/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 03/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 03/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 02/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 00/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 01/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 04/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 04/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 03/0 : 5[1] -> 6[2] via P2P/MNNVL 
slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 02/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 05/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 04/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 05/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 03/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 05/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 06/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 06/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 04/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 07/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 07/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 06/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 05/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 07/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 08/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 08/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 06/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 09/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 09/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 07/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 08/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 10/0 : 4[0] -> 5[1] via P2P/MNNVL 
slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 09/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 08/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 10/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 10/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 09/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 11/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 11/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 12/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 10/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 11/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 12/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 11/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 13/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 12/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 12/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 13/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 14/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 13/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 15/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 14/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 14/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 13/0 : 6[2] -> 7[3] via P2P/MNNVL 
slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 15/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 16/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 14/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 15/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 16/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 17/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 16/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 15/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 17/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 17/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 18/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 16/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 18/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 18/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 19/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 19/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 17/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 19/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 18/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 20/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 20/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 20/0 : 5[1] -> 6[2] via P2P/MNNVL 
slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 21/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 19/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 21/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 21/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 20/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 22/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 22/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 23/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 22/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 21/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 23/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 23/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 24/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 22/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 24/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 25/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 24/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 23/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 25/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 24/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 25/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 26/0 : 7[3] -> 0[0] via P2P/MNNVL 
slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 26/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 25/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 26/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 27/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 27/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 26/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 27/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 28/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 27/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 28/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 29/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 28/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 29/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 28/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 29/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 30/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 29/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 30/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 30/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 30/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 31/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 31/0 : 5[1] -> 6[2] via P2P/MNNVL 
slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 31/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 31/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/MNNVL 
slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/MNNVL 
slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 24/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 25/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 26/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 27/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 24/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 28/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 25/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 29/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 26/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 27/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 30/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 28/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 31/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 29/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 30/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 31/0 : 1[1] -> 2[2] via P2P/MNNVL 
slurm-gb200-217-027:234741:235161 [3] transport/p2p.cc:277 NCCL WARN Cuda failure 400 'invalid resource handle'
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport/p2p.cc:352 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport/p2p.cc:487 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport.cc:194 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport/generic.cc:19 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO group.cc:148 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO group.cc:75 -> 1 [Async thread]
slurm-gb200-217-027:234741:234741 [3] NCCL INFO group.cc:454 -> 1
slurm-gb200-217-027:234741:234741 [3] NCCL INFO group.cc:573 -> 1
slurm-gb200-217-027:234741:234741 [3] NCCL INFO enqueue.cc:2229 -> 1
slurm-gb200-217-027: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:377
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:584
 .. slurm-gb200-217-027 pid 234741: Test failure all_reduce.cu:90
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:613
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:1016
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:842
srun: error: slurm-gb200-217-027: task 3: Exited with exit code 3
srun: Terminating StepId=1742.0
slurmstepd: error: *** STEP 1742.0 ON slurm-gb200-217-027 CANCELLED AT 2025-07-14T20:26:11 ***
slurmstepd: error: mpi/pmix_v4: _errhandler: slurm-gb200-217-027 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.1742.0:3]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: mpi/pmix_v4: _errhandler: slurm-gb200-217-047 [1]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.1742.0:6]
srun: error: slurm-gb200-217-027: tasks 0-2: Terminated
srun: error: slurm-gb200-217-047: tasks 4-7: Terminated
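To find the first failing process in a saved `NCCL_DEBUG=INFO` log like the one above, a small shell helper can pull out the `NCCL WARN` records. This is only a minimal sketch: the function names `nccl_first_warn` and `nccl_failed_procs` are hypothetical, and it assumes the log file holds one record per line, each starting with a `host:pid:tid` prefix.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for triaging a saved NCCL debug log.
# Assumption: one log record per line, prefixed with "host:pid:tid".

# Print the first record that contains an NCCL warning.
nccl_first_warn() {
  grep 'NCCL WARN' "$1" | head -n 1
}

# Print the unique "host:pid" pairs that emitted at least one warning.
nccl_failed_procs() {
  grep 'NCCL WARN' "$1" | awk '{print $1}' | cut -d: -f1,2 | sort -u
}
```

After sourcing the file, `nccl_failed_procs job.log` (where `job.log` is a placeholder name) lists the processes to look at first. In the log above, the only warning comes from `slurm-gb200-217-027:234741`, the same process that `srun` reports as task 3 exiting with code 3.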
Latest post
07-16
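When running multi-node GPU communication under SLURM, the NCCL error `Cuda failure 400 'invalid resource handle'` usually means that a GPU resource handle is invalid or was accessed incorrectly. Possible causes include a misconfigured CUDA environment, conflicting GPU allocations, an NCCL build that does not match the CUDA version, and a SLURM configuration that does not bind GPU resources correctly.

Possible causes and fixes:

- **Invalid resource handle**: CUDA error code `400` is "invalid resource handle". It means the CUDA resource being accessed (an event, stream, memory handle, and so on) does not exist or has already been released. Check whether the program releases a resource twice, or incorrectly shares a handle across threads or processes [^2].
- **SLURM GPU binding**: Make sure the job requests the right number of GPUs with `--gpus-per-node` or `--gres=gpu:N`, and that `CUDA_VISIBLE_DEVICES` lists the correct device indices for the program. For example:

  ```bash
  # Request four GPUs per node; the type name "gb200" must match the site's GRES configuration
  #SBATCH --gres=gpu:gb200:4
  # Expose the four allocated GPUs to the program
  export CUDA_VISIBLE_DEVICES=0,1,2,3
  ```

  If the visible devices are not restricted correctly, a process may try to access a GPU that does not belong to the current task [^1].
- **NCCL and CUDA version compatibility**: Make sure the NCCL version is compatible with the CUDA toolkit version; NVIDIA's official documentation lists the supported combinations. A mismatch can cause resource-management errors. When building NCCL from source, specify the architecture flags that match the GPU. GB200 is a Blackwell part with compute capability 10.0, so the target is `sm_100` (`sm_90` is Hopper):

  ```bash
  # Build NCCL for Blackwell (compute capability 10.0)
  export NVCC_GENCODE="-gencode=arch=compute_100,code=sm_100"
  make -j CUDA_HOME=/usr/local/cuda
  ```
- **Multi-node communication fails to initialize**: In distributed runs, NCCL relies on the launcher (for example `srun` with PMIx, `mpiexec`, or `torchrun`) to start the ranks and exchange its bootstrap information across nodes. Make sure the network between the nodes is reachable and that the launcher is configured correctly when starting multi-node jobs such as `nccl-tests`. Running a simple `all_reduce_perf` test first is a good way to verify basic communication [^1].
- **Abnormal driver or runtime state**: Use `nvidia-smi` to check that all target GPUs are healthy, with no leaked GPU memory and no other processes occupying them. Stale processes can be cleaned up with `kill -9 <pid>` [^3].

### Example command for testing NCCL multi-node communication

```bash
srun -N2 --ntasks-per-node=4 --gpus-per-node=4 --gpu-bind=closest ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1
```

This command starts four tasks on each of two nodes (eight ranks in total), with one GPU per task (`-g 1`), runs the AllReduce performance test, and binds each task to its closest GPU.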
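Because several of the causes above come down to which GPUs each task can actually see, it can be useful to print that information from inside the job step before running the real test. The following is a sketch under stated assumptions: the function names `count_visible_gpus` and `describe_task` and the file name `gpu_check.sh` are hypothetical, and it relies on `SLURM_PROCID` and `CUDA_VISIBLE_DEVICES` being set by Slurm for each task, which depends on the site's GPU GRES configuration.

```shell
#!/usr/bin/env bash
# Sketch: report which GPUs each Slurm task can see.
# Assumption: Slurm exports SLURM_PROCID and CUDA_VISIBLE_DEVICES per task.

# Count the comma-separated entries in a CUDA_VISIBLE_DEVICES-style string.
count_visible_gpus() {
  local list="$1"
  if [ -z "$list" ]; then
    echo 0
    return
  fi
  local IFS=','
  # Intentionally unquoted so the string is split on commas.
  set -- $list
  echo "$#"
}

# Print one summary line for the current task.
describe_task() {
  local devices="${CUDA_VISIBLE_DEVICES-}"
  printf 'host=%s rank=%s visible=%s count=%s\n' \
    "$(hostname)" "${SLURM_PROCID:-unset}" "${devices:-unset}" \
    "$(count_visible_gpus "$devices")"
}
```

One way to use it is `srun -N2 --ntasks-per-node=4 --gpus-per-node=4 bash -c 'source ./gpu_check.sh; describe_task'`, adding the same `--gpu-bind` option as the real job. Comparing the `count` column with the value passed to `-g` shows quickly whether each rank sees the devices it is expected to use.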