Using nodes: slurm-gb200-218-[145,147,149,151,253,255],slurm-gb200-219-[001,003]
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e
[1752160265.696087] [slurm-gb200-219-003:95563:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[1752160265.696023] [slurm-gb200-218-255:103525:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[1752160265.696113] [slurm-gb200-219-003:95561:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-003:95561] pml_ucx.c:424 Error: ucp_ep_create(proc=12) failed: Destination is unreachable
[slurm-gb200-219-003:95561] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 12
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.696047] [slurm-gb200-218-255:103523:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-255:103523] pml_ucx.c:424 Error: ucp_ep_create(proc=4) failed: Destination is unreachable
[slurm-gb200-218-255:103523] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 4
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.696571] [slurm-gb200-218-253:89460:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-253:89460] pml_ucx.c:424 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[slurm-gb200-218-253:89460] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 0
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.696861] [slurm-gb200-219-001:99789:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-001:99789] pml_ucx.c:424 Error: ucp_ep_create(proc=8) failed: Destination is unreachable
[slurm-gb200-219-001:99789] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 8
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.696119] [slurm-gb200-219-003:95562:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-003:95562] pml_ucx.c:424 Error: ucp_ep_create(proc=13) failed: Destination is unreachable
[slurm-gb200-219-003:95562] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 13
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[1752160265.697575] [slurm-gb200-218-253:89463:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-253:89463] pml_ucx.c:424 Error: ucp_ep_create(proc=3) failed: Destination is unreachable
[slurm-gb200-218-253:89463] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 3
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[1752160265.697815] [slurm-gb200-219-001:99792:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-003:95563] pml_ucx.c:424 Error: ucp_ep_create(proc=14) failed: Destination is unreachable
[slurm-gb200-219-003:95563] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 14
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.696210] [slurm-gb200-219-003:95564:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-255:103524] pml_ucx.c:424 Error: ucp_ep_create(proc=5) failed: Destination is unreachable
[slurm-gb200-218-255:103524] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 5
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.696171] [slurm-gb200-218-255:103526:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-001:99792] pml_ucx.c:424 Error: ucp_ep_create(proc=11) failed: Destination is unreachable
[slurm-gb200-219-001:99792] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 11
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.698460] [slurm-gb200-219-001:99791:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.698161] [slurm-gb200-218-253:89462:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-003:95564] pml_ucx.c:424 Error: ucp_ep_create(proc=15) failed: Destination is unreachable
[slurm-gb200-219-003:95564] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 15
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.698521] [slurm-gb200-219-001:99790:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-255:103526] pml_ucx.c:424 Error: ucp_ep_create(proc=7) failed: Destination is unreachable
[slurm-gb200-218-255:103526] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 7
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[slurm-gb200-219-001:99790] pml_ucx.c:424 Error: ucp_ep_create(proc=9) failed: Destination is unreachable
[slurm-gb200-219-001:99790] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 9
[slurm-gb200-218-253:89461] pml_ucx.c:424 Error: ucp_ep_create(proc=1) failed: Destination is unreachable
[slurm-gb200-218-253:89461] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 1
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[slurm-gb200-218-255:103525] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-001:99792] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-003:95561] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-255:103526] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-001:99790] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-253:89461] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-253:89462] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-003:95562] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-255:103523] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-253:89460] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-253:89463] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-255:103524] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-001:99791] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-001:99789] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-003:95563] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-219-003:95564] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[1752160265.743560] [slurm-gb200-218-145:100876:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[1752160265.743544] [slurm-gb200-218-145:100878:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[1752160265.743013] [slurm-gb200-218-151:110182:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-145:100876] pml_ucx.c:424 Error: ucp_ep_create(proc=17) failed: Destination is unreachable
[slurm-gb200-218-145:100876] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 17
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.743059] [slurm-gb200-218-151:110181:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-145:100878] pml_ucx.c:424 Error: ucp_ep_create(proc=19) failed: Destination is unreachable
[slurm-gb200-218-145:100878] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 19
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.744269] [slurm-gb200-218-149:114475:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-151:110182] pml_ucx.c:424 Error: ucp_ep_create(proc=31) failed: Destination is unreachable
[slurm-gb200-218-151:110182] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 31
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.743380] [slurm-gb200-218-147:116799:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-149:114477] pml_ucx.c:424 Error: ucp_ep_create(proc=27) failed: Destination is unreachable
[slurm-gb200-218-149:114477] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 27
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.743747] [slurm-gb200-218-145:100875:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-149:114475] pml_ucx.c:424 Error: ucp_ep_create(proc=25) failed: Destination is unreachable
[slurm-gb200-218-149:114475] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 25
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.743469] [slurm-gb200-218-147:116796:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-145:100875] pml_ucx.c:424 Error: ucp_ep_create(proc=16) failed: Destination is unreachable
[slurm-gb200-218-145:100875] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 16
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[1752160265.743100] [slurm-gb200-218-151:110179:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-151:110179] pml_ucx.c:424 Error: ucp_ep_create(proc=28) failed: Destination is unreachable
[slurm-gb200-218-151:110179] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 28
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.744293] [slurm-gb200-218-149:114476:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-147:116797] pml_ucx.c:424 Error: ucp_ep_create(proc=21) failed: Destination is unreachable
[slurm-gb200-218-147:116797] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 21
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.743444] [slurm-gb200-218-147:116798:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-151:110181] pml_ucx.c:424 Error: ucp_ep_create(proc=30) failed: Destination is unreachable
[slurm-gb200-218-151:110181] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 30
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[1752160265.744394] [slurm-gb200-218-149:114474:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-149:114476] pml_ucx.c:424 Error: ucp_ep_create(proc=26) failed: Destination is unreachable
[slurm-gb200-218-149:114476] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 26
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[slurm-gb200-218-147:116798] pml_ucx.c:424 Error: ucp_ep_create(proc=22) failed: Destination is unreachable
[slurm-gb200-218-147:116798] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 22
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[slurm-gb200-218-149:114474] pml_ucx.c:424 Error: ucp_ep_create(proc=24) failed: Destination is unreachable
[slurm-gb200-218-149:114474] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 24
[slurm-gb200-218-147:116799] pml_ucx.c:424 Error: ucp_ep_create(proc=23) failed: Destination is unreachable
[slurm-gb200-218-147:116799] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 23
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[slurm-gb200-218-147:116796] pml_ucx.c:424 Error: ucp_ep_create(proc=20) failed: Destination is unreachable
[slurm-gb200-218-147:116796] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 20
[LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4
[LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed
[slurm-gb200-218-147:116797] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-149:114476] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-147:116796] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-151:110180] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-151:110182] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-149:114477] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-145:100877] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-145:100878] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-147:116799] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-149:114474] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-151:110181] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-151:110179] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-147:116798] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-149:114475] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-145:100876] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[slurm-gb200-218-145:100875] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed
[1752160265.930628] [slurm-gb200-219-003:95564:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-219-003:95564] pml_ucx.c:424 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[slurm-gb200-219-003:95564] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 0
[slurm-gb200-219-003:95564:0:95564] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7b)
[1752160265.974232] [slurm-gb200-218-151:110182:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De
[slurm-gb200-218-151:110182] pml_ucx.c:424 Error: ucp_ep_create(proc=16) failed: Destination is unreachable
[slurm-gb200-218-151:110182] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 16
[slurm-gb200-218-151:110182:0:110182] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7b)
==== backtrace (tid: 95564) ====
0 0x000000000004b8d8 ompi_request_default_test_all() /opt/hpcx/sources/openmpi-gitclone/ompi/request/req_test.c:184
1 0x0000000000002610 oob_allgather_test() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:182
2 0x000000000000cd10 ucc_core_addr_exchange() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:461
3 0x000000000000d8f4 ucc_context_create_proc_info() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:723
4 0x00000000000028f8 mca_coll_ucc_init_ctx() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:302
5 0x000000000000429c mca_coll_ucc_comm_query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:488
6 0x000000000008038c query_2_0_0() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:540
7 0x000000000008038c query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:523
8 0x000000000008038c check_one_component() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:486
9 0x000000000008038c check_components() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:406
10 0x0000000000080744 mca_coll_base_comm_select() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:114
11 0x00000000000c5750 ompi_mpi_init() /opt/hpcx/sources/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
12 0x000000000006eae8 PMPI_Init() /opt/hpcx/sources/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:67
13 0x0000000000003104 main() /opt/nccl-tests/src/common.cu:840
14 0x00000000000273fc __libc_init_first() ???:0
15 0x00000000000274cc __libc_start_main() ???:0
16 0x0000000000005b70 _start() ???:0
=================================
[slurm-gb200-219-003:95564] *** Process received signal ***
[slurm-gb200-219-003:95564] Signal: Segmentation fault (11)
[slurm-gb200-219-003:95564] Signal code: (-6)
[slurm-gb200-219-003:95564] Failing at address: 0x7e90001754c
[slurm-gb200-219-003:95564] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xf3ede1d709d0]
[slurm-gb200-219-003:95564] [ 1] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48)[0xf3ede1bab8d8]
[slurm-gb200-219-003:95564] [ 2] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x2610)[0xf3edc68f2610]
[slurm-gb200-219-003:95564] [ 3] /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x5c)[0xf3edc68acd10]
[slurm-gb200-219-003:95564] [ 4] /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7f0)[0xf3edc68ad8f4]
[slurm-gb200-219-003:95564] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x28f8)[0xf3edc68f28f8]
[slurm-gb200-219-003:95564] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x5c)[0xf3edc68f429c]
[slurm-gb200-219-003:95564] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(+0x8038c)[0xf3ede1be038c]
[slurm-gb200-219-003:95564] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x64)[0xf3ede1be0744]
[slurm-gb200-219-003:95564] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0x1190)[0xf3ede1c25750]
[slurm-gb200-219-003:95564] [10] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Init+0x78)[0xf3ede1bceae8]
[slurm-gb200-219-003:95564] [11] /opt/nccl_tests/build/all_reduce_perf(+0x3104)[0xacaa30543104]
[slurm-gb200-219-003:95564] [12] /usr/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xf3edd25073fc]
[slurm-gb200-219-003:95564] [13] /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xf3edd25074cc]
[slurm-gb200-219-003:95564] [14] /opt/nccl_tests/build/all_reduce_perf(+0x5b70)[0xacaa30545b70]
[slurm-gb200-219-003:95564] *** End of error message ***
==== backtrace (tid: 110182) ====
0 0x000000000004b8d8 ompi_request_default_test_all() /opt/hpcx/sources/openmpi-gitclone/ompi/request/req_test.c:184
1 0x0000000000002610 oob_allgather_test() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:182
2 0x000000000000cd10 ucc_core_addr_exchange() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:461
3 0x000000000000d8f4 ucc_context_create_proc_info() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:723
4 0x00000000000028f8 mca_coll_ucc_init_ctx() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:302
5 0x000000000000429c mca_coll_ucc_comm_query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:488
6 0x000000000008038c query_2_0_0() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:540
7 0x000000000008038c query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:523
8 0x000000000008038c check_one_component() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:486
9 0x000000000008038c check_components() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:406
10 0x0000000000080744 mca_coll_base_comm_select() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:114
11 0x00000000000c5750 ompi_mpi_init() /opt/hpcx/sources/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
12 0x000000000006eae8 PMPI_Init() /opt/hpcx/sources/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:67
13 0x0000000000003104 main() /opt/nccl-tests/src/common.cu:840
14 0x00000000000273fc __libc_init_first() ???:0
15 0x00000000000274cc __libc_start_main() ???:0
16 0x0000000000005b70 _start() ???:0
=================================
[slurm-gb200-218-151:110182] *** Process received signal ***
[slurm-gb200-218-151:110182] Signal: Segmentation fault (11)
[slurm-gb200-218-151:110182] Signal code: (-6)
[slurm-gb200-218-151:110182] Failing at address: 0x7e90001ae66
[slurm-gb200-218-151:110182] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xf7d7771209d0]
[slurm-gb200-218-151:110182] [ 1] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48)[0xf7d776f5b8d8]
[slurm-gb200-218-151:110182] [ 2] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x2610)[0xf7d7538a2610]
[slurm-gb200-218-151:110182] [ 3] /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x5c)[0xf7d75385cd10]
[slurm-gb200-218-151:110182] [ 4] /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7f0)[0xf7d75385d8f4]
[slurm-gb200-218-151:110182] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x28f8)[0xf7d7538a28f8]
[slurm-gb200-218-151:110182] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x5c)[0xf7d7538a429c]
[slurm-gb200-218-151:110182] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(+0x8038c)[0xf7d776f9038c]
[slurm-gb200-218-151:110182] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x64)[0xf7d776f90744]
[slurm-gb200-218-151:110182] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0x1190)[0xf7d776fd5750]
[slurm-gb200-218-151:110182] [10] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Init+0x78)[0xf7d776f7eae8]
[slurm-gb200-218-151:110182] [11] /opt/nccl_tests/build/all_reduce_perf(+0x3104)[0xbd73c6e53104]
[slurm-gb200-218-151:110182] [12] /usr/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xf7d7678b73fc]
[slurm-gb200-218-151:110182] [13] /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xf7d7678b74cc]
[slurm-gb200-218-151:110182] [14] /opt/nccl_tests/build/all_reduce_perf(+0x5b70)[0xbd73c6e55b70]
[slurm-gb200-218-151:110182] *** End of error message ***
srun: error: slurm-gb200-219-003: task 31: Segmentation fault
srun: error: slurm-gb200-218-151: task 15: Segmentation fault