Copy + Paste -> Cluster Ready | SLURM HPC

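This article presents a way to build a SLURM HPC cluster that suits people who are not professional Linux administrators. No advanced skills are required, only copying and pasting commands. It works on CentOS and Ubuntu and uses the hpc4you toolkit to make cluster setup zero-configuration, zero-administration, and zero-maintenance, which makes it particularly suitable for research groups that need a computing platform quickly.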

See here:

http://t.csdn.cn/KXzn5

Before You Begin

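This guide assumes that your computer skills are at the level most résumés describe as "proficient in Microsoft Word, PowerPoint, and Excel; able to plot with Origin and do simple retouching in PhotoShop." It does assume, however, that you can tell apart the letter keys, number keys, and arrow keys on the keyboard, and that you understand you must press Enter after pasting a command.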

This approach configures the cluster semi-automatically.
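  • If you can get by in vi, you only need to edit one file with it. (Mastery is not required; anyone who has truly mastered vi is impressive indeed. Being able to add a few lines with vi and knowing vi are two different things.)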
Using nodes: slurm-gb200-217-[027,047]
# nThread 1 nGpus 1 minBytes 536870912 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 234738 on slurm-gb200-217-027 device 0 [0x01] NVIDIA GB200
# Rank 1 Group 0 Pid 234739 on slurm-gb200-217-027 device 1 [0x01] NVIDIA GB200
# Rank 2 Group 0 Pid 234740 on slurm-gb200-217-027 device 2 [0x01] NVIDIA GB200
# Rank 3 Group 0 Pid 234741 on slurm-gb200-217-027 device 3 [0x01] NVIDIA GB200
# Rank 4 Group 0 Pid 237289 on slurm-gb200-217-047 device 0 [0x01] NVIDIA GB200
# Rank 5 Group 0 Pid 237290 on slurm-gb200-217-047 device 1 [0x01] NVIDIA GB200
# Rank 6 Group 0 Pid 237291 on slurm-gb200-217-047 device 2 [0x01] NVIDIA GB200
# Rank 7 Group 0 Pid 237292 on slurm-gb200-217-047 device 3 [0x01] NVIDIA GB200
slurm-gb200-217-027:234738:234738 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234738:234738 [0] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234738:234738 [0] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234738:234738 [0] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234740:234740 [2] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234740:234740 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234740:234740 [2] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234740:234740 [2] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234741:234741 [3] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234741:234741 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234741:234741 [3] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0>
slurm-gb200-217-027:234741:234741 [3] NCCL INFO NCCL version 2.25.1+cuda12.8
slurm-gb200-217-027:234739:234739 [1] NCCL INFO cudaDriverVersion 12080
slurm-gb200-217-027:234739:234739 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
slurm-gb200-217-027:234739:234739 [1] NCCL INFO Bootstrap: Using eth0:10.0.4.227<0> slurm-gb200-217-027:234739:234739 [1] NCCL INFO NCCL version 2.25.1+cuda12.8 slurm-gb200-217-047:237289:237289 [0] NCCL INFO cudaDriverVersion 12080 slurm-gb200-217-047:237289:237289 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237289:237289 [0] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0> slurm-gb200-217-047:237289:237289 [0] NCCL INFO NCCL version 2.25.1+cuda12.8 slurm-gb200-217-047:237292:237292 [3] NCCL INFO cudaDriverVersion 12080 slurm-gb200-217-047:237292:237292 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237292:237292 [3] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0> slurm-gb200-217-047:237292:237292 [3] NCCL INFO NCCL version 2.25.1+cuda12.8 slurm-gb200-217-047:237290:237290 [1] NCCL INFO cudaDriverVersion 12080 slurm-gb200-217-047:237290:237290 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237290:237290 [1] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0> slurm-gb200-217-047:237290:237290 [1] NCCL INFO NCCL version 2.25.1+cuda12.8 slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. 
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-027:234738:235129 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-027:234738:235129 [0] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237291:237291 [2] NCCL INFO cudaDriverVersion 12080 slurm-gb200-217-047:237291:237291 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237291:237291 [2] NCCL INFO Bootstrap: Using eth0:10.0.5.220<0> slurm-gb200-217-047:237291:237291 [2] NCCL INFO NCCL version 2.25.1+cuda12.8 slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-027:234738:235129 [0] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0> slurm-gb200-217-027:234738:235129 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. slurm-gb200-217-027:234738:235129 [0] NCCL INFO Using network IBext_v8 slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. 
slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-027:234740:235130 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-027:234740:235130 [2] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-027:234741:235131 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-027:234741:235131 [3] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-027:234739:235132 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-027:234739:235132 [1] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-027:234741:235131 [3] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0> slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-027:234740:235130 [2] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0> slurm-gb200-217-027:234741:235131 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
slurm-gb200-217-027:234741:235131 [3] NCCL INFO Using network IBext_v8 slurm-gb200-217-027:234740:235130 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. slurm-gb200-217-027:234740:235130 [2] NCCL INFO Using network IBext_v8 slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-027:234739:235132 [1] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.4.227<0> slurm-gb200-217-027:234739:235132 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. slurm-gb200-217-027:234739:235132 [1] NCCL INFO Using network IBext_v8 slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-047:237289:237675 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-047:237289:237675 [0] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. 
slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-047:237292:237676 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-047:237292:237676 [3] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-047:237290:237677 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-047:237290:237677 [1] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-027:234738:235129 [0] NCCL INFO DMA-BUF is available on GPU device 0 slurm-gb200-217-027:234738:235129 [0] NCCL INFO ncclCommInitRank comm 0xc4b2133f8d20 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0> slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-047:237292:237676 [3] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0> slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. 
slurm-gb200-217-047:237290:237677 [1] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0> slurm-gb200-217-047:237289:237675 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. slurm-gb200-217-047:237289:237675 [0] NCCL INFO Using network IBext_v8 slurm-gb200-217-047:237292:237676 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. slurm-gb200-217-047:237292:237676 [3] NCCL INFO Using network IBext_v8 slurm-gb200-217-047:237290:237677 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. slurm-gb200-217-047:237290:237677 [1] NCCL INFO Using network IBext_v8 slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v9 symbol. slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v8 (v8) slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol. slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v8) slurm-gb200-217-047:237291:237678 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so slurm-gb200-217-047:237291:237678 [2] NCCL INFO P2P plugin v8 IBext_v8 slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1. slurm-gb200-217-047:237291:237678 [2] NCCL INFO NET/IB : Using [0]ibp0:1/IB/SHARP [1]ibp1:1/IB/SHARP [2]ibp2:1/IB/SHARP [3]ibp3:1/IB/SHARP [RO]; OOB eth0:10.0.5.220<0> slurm-gb200-217-047:237291:237678 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
slurm-gb200-217-047:237291:237678 [2] NCCL INFO Using network IBext_v8 slurm-gb200-217-027:234741:235131 [3] NCCL INFO DMA-BUF is available on GPU device 3 slurm-gb200-217-027:234740:235130 [2] NCCL INFO DMA-BUF is available on GPU device 2 slurm-gb200-217-027:234741:235131 [3] NCCL INFO ncclCommInitRank comm 0xb4ab52367d50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234739:235132 [1] NCCL INFO DMA-BUF is available on GPU device 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO ncclCommInitRank comm 0xabb25a0df930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234739:235132 [1] NCCL INFO ncclCommInitRank comm 0xbf66c0b18250 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234739:235132 [1] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-027:234740:235130 [2] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237290:237677 [1] NCCL INFO DMA-BUF is available on GPU device 1 slurm-gb200-217-047:237289:237675 [0] NCCL INFO DMA-BUF is available on GPU device 0 slurm-gb200-217-047:237290:237677 [1] NCCL INFO ncclCommInitRank comm 0xb82cf5a83890 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-047:237289:237675 [0] NCCL INFO ncclCommInitRank comm 0xbdcbb5305fc0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234741:235131 [3] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237289:237675 [0] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237292:237676 [3] NCCL INFO DMA-BUF is available on GPU device 3 slurm-gb200-217-047:237291:237678 [2] NCCL INFO DMA-BUF is available on GPU device 2 slurm-gb200-217-047:237292:237676 [3] NCCL INFO ncclCommInitRank comm 0xbcbd5fdbded0 rank 7 nranks 8 
cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-047:237291:237678 [2] NCCL INFO ncclCommInitRank comm 0xc73c5bd41e40 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init START slurm-gb200-217-027:234738:235129 [0] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237290:237677 [1] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237291:237678 [2] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237292:237676 [3] NCCL INFO RAS client listening socket at ::1<28028> slurm-gb200-217-047:237291:237678 [2] NCCL INFO Bootstrap timings total 0.001874 (create 0.000049, send 0.000236, recv 0.000857, ring 0.000277, delay 0.000001) slurm-gb200-217-047:237292:237676 [3] NCCL INFO Bootstrap timings total 0.001987 (create 0.000058, send 0.000308, recv 0.000486, ring 0.000298, delay 0.000001) slurm-gb200-217-027:234738:235129 [0] NCCL INFO Bootstrap timings total 0.150981 (create 0.000092, send 0.000167, recv 0.099449, ring 0.000395, delay 0.000001) slurm-gb200-217-027:234739:235132 [1] NCCL INFO Bootstrap timings total 0.051531 (create 0.000051, send 0.000094, recv 0.000164, ring 0.050388, delay 0.000001) slurm-gb200-217-027:234740:235130 [2] NCCL INFO Bootstrap timings total 0.051642 (create 0.000049, send 0.000100, recv 0.000104, ring 0.050388, delay 0.000001) slurm-gb200-217-027:234741:235131 [3] NCCL INFO Bootstrap timings total 0.052601 (create 0.000062, send 0.000132, recv 0.037772, ring 0.013873, delay 0.000001) slurm-gb200-217-047:237290:237677 [1] NCCL INFO Bootstrap timings total 0.018151 (create 0.000077, send 0.000318, recv 0.016776, ring 0.000449, delay 0.000001) slurm-gb200-217-047:237289:237675 [0] NCCL INFO Bootstrap timings total 0.015634 (create 0.000060, send 0.000705, recv 0.000611, ring 0.013624, delay 0.000001) slurm-gb200-217-047:237292:237676 [3] NCCL INFO MNNVL busId 0x1901000 fabric UUID 
254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237289:237675 [0] NCCL INFO MNNVL busId 0x801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234741:235131 [3] NCCL INFO MNNVL busId 0x1901000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234740:235130 [2] NCCL INFO MNNVL busId 0x1801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234738:235129 [0] NCCL INFO MNNVL busId 0x801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237290:237677 [1] NCCL INFO MNNVL busId 0x901000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-027:234739:235132 [1] NCCL INFO MNNVL busId 0x901000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237291:237678 [2] NCCL INFO MNNVL busId 0x1801000 fabric UUID 254db2329a6da67a.4641fa3fdb4d7484 cliqueId 0x7ffe state 3 healthMask 0xaa slurm-gb200-217-047:237291:237678 [2] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 6 slurm-gb200-217-047:237289:237675 [0] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 4 slurm-gb200-217-047:237292:237676 [3] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 7 slurm-gb200-217-047:237290:237677 [1] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 5 slurm-gb200-217-027:234741:235131 [3] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 3 slurm-gb200-217-027:234739:235132 [1] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 2 slurm-gb200-217-027:234738:235129 [0] NCCL INFO MNNVL 1 cliqueId 7ffe cliqueSize 8 cliqueRank 0 slurm-gb200-217-047:237292:237676 [3] NCCL INFO Setting affinity for 
GPU 3 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237292:237676 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237291:237678 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237291:237678 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-027:234740:235130 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-027:234740:235130 [2] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-027:234741:235131 [3] NCCL INFO Setting affinity for GPU 3 to ffff,ffffffff,ffffff00,00000000,00000000 slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237290:237677 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffffffff slurm-gb200-217-027:234741:235131 [3] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237290:237677 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-027:234739:235132 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffffffff slurm-gb200-217-027:234738:235129 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-027:234739:235132 [1] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237289:237675 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffffffff,ffffffff slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. 
slurm-gb200-217-027:234738:235129 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_COLLNET_ENABLE set by environment to 0. slurm-gb200-217-047:237289:237675 [0] NCCL INFO NCCL_NVLS_ENABLE set by environment to 0. slurm-gb200-217-047:237292:237676 [3] NCCL INFO comm 0xbcbd5fdbded0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO comm 0xabb25a0df930 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 1 slurm-gb200-217-027:234741:235131 [3] NCCL INFO comm 0xb4ab52367d50 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 1 slurm-gb200-217-047:237291:237678 [2] NCCL INFO comm 0xc73c5bd41e40 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 1 slurm-gb200-217-047:237292:237676 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6 [24] -1/-1/-1->7->6 [25] -1/-1/-1->7->6 [26] -1/-1/-1->7->6 [27] -1/-1/-1->7->6 [28] -1/-1/-1->7->6 [29] -1/-1/-1->7->6 [30] -1/-1/-1->7->6 [31] -1/-1/-1->7->6 slurm-gb200-217-047:237292:237676 [3] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO comm 0xc4b2133f8d20 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 
3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1 [24] 3/-1/-1->2->1 [25] 3/-1/-1->2->1 [26] 3/-1/-1->2->1 [27] 3/-1/-1->2->1 [28] 3/-1/-1->2->1 [29] 3/-1/-1->2->1 [30] 3/-1/-1->2->1 [31] 3/-1/-1->2->1 slurm-gb200-217-027:234740:235130 [2] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234741:235131 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2 [24] 4/-1/-1->3->2 [25] 4/-1/-1->3->2 [26] 4/-1/-1->3->2 [27] 4/-1/-1->3->2 [28] 4/-1/-1->3->2 [29] 4/-1/-1->3->2 [30] 4/-1/-1->3->2 [31] 4/-1/-1->3->2 slurm-gb200-217-027:234741:235131 [3] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 00/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 01/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237291:237678 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5 [24] 7/-1/-1->6->5 [25] 7/-1/-1->6->5 [26] 7/-1/-1->6->5 [27] 7/-1/-1->6->5 [28] 7/-1/-1->6->5 [29] 7/-1/-1->6->5 [30] 7/-1/-1->6->5 [31] 7/-1/-1->6->5 slurm-gb200-217-047:237291:237678 [2] NCCL INFO P2P Chunksize set 
to 524288 slurm-gb200-217-027:234739:235132 [1] NCCL INFO comm 0xbf66c0b18250 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 1 slurm-gb200-217-047:237290:237677 [1] NCCL INFO comm 0xb82cf5a83890 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 1 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 02/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 03/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 04/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 05/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 06/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 07/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237290:237677 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4 [22] 6/-1/-1->5->4 [23] 6/-1/-1->5->4 [24] 6/-1/-1->5->4 [25] 6/-1/-1->5->4 [26] 6/-1/-1->5->4 [27] 6/-1/-1->5->4 [28] 6/-1/-1->5->4 [29] 6/-1/-1->5->4 [30] 6/-1/-1->5->4 [31] 6/-1/-1->5->4 slurm-gb200-217-047:237290:237677 [1] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 08/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 09/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 10/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 11/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 12/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 13/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237289:237675 [0] NCCL INFO 
comm 0xbdcbb5305fc0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 1 slurm-gb200-217-027:234739:235132 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0 [24] 2/-1/-1->1->0 [25] 2/-1/-1->1->0 [26] 2/-1/-1->1->0 [27] 2/-1/-1->1->0 [28] 2/-1/-1->1->0 [29] 2/-1/-1->1->0 [30] 2/-1/-1->1->0 [31] 2/-1/-1->1->0 slurm-gb200-217-027:234739:235132 [1] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-047:237289:237675 [0] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 [24] 5/-1/-1->4->3 [25] 5/-1/-1->4->3 [26] 5/-1/-1->4->3 [27] 5/-1/-1->4->3 [28] 5/-1/-1->4->3 [29] 5/-1/-1->4->3 [30] 5/-1/-1->4->3 [31] 5/-1/-1->4->3 slurm-gb200-217-047:237289:237675 [0] NCCL INFO P2P Chunksize set to 524288 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 14/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 15/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 16/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 17/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 18/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 19/32 : 0 
1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 20/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 21/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 22/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237291:237700 [2] NCCL INFO [Proxy Service] Device 2 CPU core 140 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 23/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 24/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 25/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 26/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 27/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 28/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237292:237699 [3] NCCL INFO [Proxy Service] Device 3 CPU core 79 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 29/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 30/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Channel 31/32 : 0 1 2 3 4 5 6 7 slurm-gb200-217-047:237291:237702 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 73 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1 [24] 1/-1/-1->0->-1 [25] 1/-1/-1->0->-1 [26] 1/-1/-1->0->-1 [27] 1/-1/-1->0->-1 [28] 1/-1/-1->0->-1 [29] 1/-1/-1->0->-1 [30] 1/-1/-1->0->-1 [31] 1/-1/-1->0->-1 slurm-gb200-217-027:234738:235129 [0] NCCL INFO P2P Chunksize set to 524288 
slurm-gb200-217-047:237292:237701 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 80 slurm-gb200-217-027:234738:235129 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0 slurm-gb200-217-027:234740:235153 [2] NCCL INFO [Proxy Service] Device 2 CPU core 112 slurm-gb200-217-027:234741:235154 [3] NCCL INFO [Proxy Service] Device 3 CPU core 126 slurm-gb200-217-027:234741:235155 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 129 slurm-gb200-217-027:234740:235156 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 114 slurm-gb200-217-047:237289:237704 [0] NCCL INFO [Proxy Service] Device 0 CPU core 4 slurm-gb200-217-047:237290:237706 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 60 slurm-gb200-217-027:234739:235157 [1] NCCL INFO [Proxy Service] Device 1 CPU core 8 slurm-gb200-217-047:237290:237703 [1] NCCL INFO [Proxy Service] Device 1 CPU core 2 slurm-gb200-217-027:234738:235158 [0] NCCL INFO [Proxy Service] Device 0 CPU core 8 slurm-gb200-217-047:237289:237705 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 6 slurm-gb200-217-027:234739:235159 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 45 slurm-gb200-217-027:234738:235160 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 45 slurm-gb200-217-027:234741:235131 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234741:235131 [3] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237291:237678 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237291:237678 [2] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234739:235132 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234739:235132 [1] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237292:237676 [3] NCCL INFO 
threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237292:237676 [3] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234740:235130 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234740:235130 [2] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237290:237677 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237290:237677 [1] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-047:237289:237675 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-047:237289:237675 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234738:235129 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 slurm-gb200-217-027:234738:235129 [0] NCCL INFO 32 coll channels, 32 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer slurm-gb200-217-027:234738:235129 [0] NCCL INFO CC Off, workFifoBytes 1048576 slurm-gb200-217-027:234740:235130 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol. slurm-gb200-217-027:234740:235130 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol. slurm-gb200-217-027:234740:235130 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead. 
slurm-gb200-217-027:234740:235130 [2] NCCL INFO ncclCommInitRank comm 0xabb25a0df930 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-027:234740:235130 [2] NCCL INFO Init timings - ncclCommInitRank: rank 2 nranks 8 total 0.94 (kernels 0.09, alloc 0.16, bootstrap 0.05, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01)
slurm-gb200-217-027:234741:235131 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-027:234741:235131 [3] NCCL INFO ncclCommInitRank comm 0xb4ab52367d50 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-027:234741:235131 [3] NCCL INFO Init timings - ncclCommInitRank: rank 3 nranks 8 total 0.93 (kernels 0.09, alloc 0.16, bootstrap 0.05, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.08, rest 0.01)
slurm-gb200-217-047:237291:237678 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-047:237291:237678 [2] NCCL INFO ncclCommInitRank comm 0xc73c5bd41e40 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId 1801000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-047:237291:237678 [2] NCCL INFO Init timings - ncclCommInitRank: rank 6 nranks 8 total 0.88 (kernels 0.10, alloc 0.14, bootstrap 0.00, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01)
slurm-gb200-217-027:234739:235132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-027:234739:235132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-027:234739:235132 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-047:237292:237676 [3] NCCL INFO ncclCommInitRank comm 0xbcbd5fdbded0 rank 7 nranks 8 cudaDev 3 nvmlDev 3 busId 1901000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-047:237292:237676 [3] NCCL INFO Init timings - ncclCommInitRank: rank 7 nranks 8 total 0.88 (kernels 0.09, alloc 0.16, bootstrap 0.00, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01)
slurm-gb200-217-027:234739:235132 [1] NCCL INFO ncclCommInitRank comm 0xbf66c0b18250 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-027:234739:235132 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 8 total 0.93 (kernels 0.09, alloc 0.15, bootstrap 0.05, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01)
slurm-gb200-217-047:237289:237675 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-027:234738:235129 [0] NCCL INFO ncclCommInitRank comm 0xc4b2133f8d20 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-047:237289:237675 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
slurm-gb200-217-047:237289:237675 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-047:237289:237675 [0] NCCL INFO ncclCommInitRank comm 0xbdcbb5305fc0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 801000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-027:234738:235129 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 8 total 0.97 (kernels 0.09, alloc 0.10, bootstrap 0.15, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.00)
slurm-gb200-217-047:237290:237677 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
slurm-gb200-217-047:237290:237677 [1] NCCL INFO ncclCommInitRank comm 0xb82cf5a83890 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 901000 commId 0xee13c4f1e7e030dc - Init COMPLETE
slurm-gb200-217-047:237290:237677 [1] NCCL INFO Init timings - ncclCommInitRank: rank 5 nranks 8 total 0.88 (kernels 0.09, alloc 0.14, bootstrap 0.02, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.01)
slurm-gb200-217-047:237289:237675 [0] NCCL INFO Init timings - ncclCommInitRank: rank 4 nranks 8 total 0.89 (kernels 0.09, alloc 0.15, bootstrap 0.02, allgathers 0.00, topo 0.52, graphs 0.01, connections 0.09, rest 0.00)
#
#                                                      out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/MNNVL
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/MNNVL
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/MNNVL
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/MNNVL
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/MNNVL
slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[0] via P2P/MNNVL
slurm-gb200-217-027:234740:235162 [2] NCCL INFO 
Channel 06/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 16/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 16/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 17/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 17/0 : 2[2] 
-> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 18/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 18/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 19/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 19/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 20/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 20/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 21/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 21/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 22/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 22/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 23/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 23/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 24/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 24/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 25/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 25/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 26/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 26/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 27/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 27/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 28/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 28/0 : 3[3] -> 4[0] via 
P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 29/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 29/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 30/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 30/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-027:234740:235162 [2] NCCL INFO Channel 31/0 : 2[2] -> 3[3] via P2P/MNNVL slurm-gb200-217-027:234741:235161 [3] NCCL INFO Channel 31/0 : 3[3] -> 4[0] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 00/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 00/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 00/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 01/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 01/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 01/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 02/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 02/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 03/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 03/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 02/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 00/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 01/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 04/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 04/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 03/0 : 5[1] -> 6[2] via P2P/MNNVL 
slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 02/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 05/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 04/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 05/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 03/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 05/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 06/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 06/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 04/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 07/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 07/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 06/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 05/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 07/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 08/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 08/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 06/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 09/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 09/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 07/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 08/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 10/0 : 4[0] -> 5[1] via P2P/MNNVL 
slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 09/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 08/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 10/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 10/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 09/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 11/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 11/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 12/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 10/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 11/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 12/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 11/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 13/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 12/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 12/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 13/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 14/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 13/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 15/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 14/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 14/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 13/0 : 6[2] -> 7[3] via P2P/MNNVL 
slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 15/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 16/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 14/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 15/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 16/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 17/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 16/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 15/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 17/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 17/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 18/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 16/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 18/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 18/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 19/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 19/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 17/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 19/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 18/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 20/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 20/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 20/0 : 5[1] -> 6[2] via P2P/MNNVL 
slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 21/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 19/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 21/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 21/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 20/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 22/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 22/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 23/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 22/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 21/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 23/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 23/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 24/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 22/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 24/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 25/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 24/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 23/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 25/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 24/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 25/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 26/0 : 7[3] -> 0[0] via P2P/MNNVL 
slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 26/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 25/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 26/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 27/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 27/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 26/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 27/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 28/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 27/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 28/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 29/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 28/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 29/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 28/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 29/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 30/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 29/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 30/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 30/0 : 5[1] -> 6[2] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 30/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-047:237292:237707 [3] NCCL INFO Channel 31/0 : 7[3] -> 0[0] via P2P/MNNVL slurm-gb200-217-047:237290:237710 [1] NCCL INFO Channel 31/0 : 5[1] -> 6[2] via P2P/MNNVL 
slurm-gb200-217-047:237289:237709 [0] NCCL INFO Channel 31/0 : 4[0] -> 5[1] via P2P/MNNVL slurm-gb200-217-047:237291:237708 [2] NCCL INFO Channel 31/0 : 6[2] -> 7[3] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/MNNVL 
slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 16/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 17/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 18/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/MNNVL 
slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 19/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 20/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 24/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 21/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 25/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 22/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 26/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 23/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 27/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 24/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 28/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 25/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 29/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 26/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 27/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 30/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 28/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234738:235163 [0] NCCL INFO Channel 31/0 : 0[0] -> 1[1] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 29/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 30/0 : 1[1] -> 2[2] via P2P/MNNVL slurm-gb200-217-027:234739:235164 [1] NCCL INFO Channel 31/0 : 1[1] -> 2[2] via P2P/MNNVL 
slurm-gb200-217-027:234741:235161 [3] transport/p2p.cc:277 NCCL WARN Cuda failure 400 'invalid resource handle'
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport/p2p.cc:352 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport/p2p.cc:487 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport.cc:194 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO transport/generic.cc:19 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO group.cc:148 -> 1
slurm-gb200-217-027:234741:235161 [3] NCCL INFO group.cc:75 -> 1 [Async thread]
slurm-gb200-217-027:234741:234741 [3] NCCL INFO group.cc:454 -> 1
slurm-gb200-217-027:234741:234741 [3] NCCL INFO group.cc:573 -> 1
slurm-gb200-217-027:234741:234741 [3] NCCL INFO enqueue.cc:2229 -> 1
slurm-gb200-217-027: Test NCCL failure all_reduce.cu:44 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:377
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:584
 .. slurm-gb200-217-027 pid 234741: Test failure all_reduce.cu:90
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:613
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:1016
 .. slurm-gb200-217-027 pid 234741: Test failure common.cu:842
srun: error: slurm-gb200-217-027: task 3: Exited with exit code 3
srun: Terminating StepId=1742.0
slurmstepd: error: *** STEP 1742.0 ON slurm-gb200-217-027 CANCELLED AT 2025-07-14T20:26:11 ***
slurmstepd: error: mpi/pmix_v4: _errhandler: slurm-gb200-217-027 [0]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.1742.0:3]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: mpi/pmix_v4: _errhandler: slurm-gb200-217-047 [1]: pmixp_client_v2.c:211: Error handler invoked: status = -61, source = [slurm.pmix.1742.0:6]
srun: error: slurm-gb200-217-027: tasks 0-2: Terminated
srun: error: slurm-gb200-217-047: tasks 4-7: Terminated
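In a log this long, the one line that matters (the `NCCL WARN` entry) is easy to miss among hundreds of `Channel` messages. The following is a minimal sketch, not part of the original post, of how the warning and the failing Slurm task could be pulled out of the captured output. The function name `summarize` and both regular expressions are hypothetical; the patterns are inferred only from the entries shown above and are not a documented NCCL or Slurm log format.

```python
import re

# Assumed shape of an NCCL warning entry, inferred from the log above:
#   "<host>:<pid>:<tid> [<dev>] <file>:<line> NCCL WARN <message>"
WARN_RE = re.compile(
    r"(?P<host>[\w.-]+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN "
    # The message stops at the next "<host>:<pid>:<tid> [<dev>] " prefix
    # (for flattened logs) or at the end of the line.
    r"(?P<msg>.+?)(?=\s+\S+:\d+:\d+ \[\d+\] |\s*$)",
    re.MULTILINE,
)

# Assumed shape of the srun exit report, inferred from the log above:
#   "srun: error: <host>: task 3: Exited with exit code 3"
EXIT_RE = re.compile(
    r"srun: error: (?P<host>[\w.-]+): tasks? (?P<tasks>[\d,-]+): "
    r"Exited with exit code (?P<code>\d+)"
)


def summarize(log_text):
    """Return (warnings, exits) as lists of dicts extracted from log_text."""
    warnings = [m.groupdict() for m in WARN_RE.finditer(log_text)]
    exits = [m.groupdict() for m in EXIT_RE.finditer(log_text)]
    return warnings, exits


if __name__ == "__main__":
    import sys

    # Usage: python summarize_nccl_log.py < job_output.log
    found_warnings, found_exits = summarize(sys.stdin.read())
    for w in found_warnings:
        print(f"WARN on {w['host']} device {w['dev']} at {w['src']}: {w['msg']}")
    for e in found_exits:
        print(f"task(s) {e['tasks']} on {e['host']} exited with code {e['code']}")
```

Applied to the output above, this would point at device 3 on slurm-gb200-217-027 and `transport/p2p.cc:277`, which is the entry to investigate first; the `Terminated` lines for the other tasks are a consequence of srun cancelling the step, not separate failures.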