SLURM and OpenMPI

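This article describes how the SLURM cluster management system works together with Open MPI for efficient job scheduling and resource allocation. It explains how to set the default MPI implementation in the slurm.conf configuration file and how to reserve communication ports when launching tasks with the srun command.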

1) The MpiDefault configuration parameter in slurm.conf establishes the system default MPI to be supported. 

The srun option --mpi= (or the equivalent environment variable SLURM_MPI_TYPE) can be used to select a different MPI implementation for an individual job.
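The selection order described above can be sketched in a few lines of Python. This is only an illustration, not SLURM code: the function name effective_mpi_type and the plugin names used in the usage lines are hypothetical placeholders, and the precedence shown (command-line option first, then the environment variable, then MpiDefault) is the usual srun convention assumed here.

```python
import os


def effective_mpi_type(cli_value=None, environ=None, conf_default="none"):
    """Return the MPI type a job would use under the assumed precedence.

    cli_value    -- value given to `srun --mpi=<type>`, or None if absent
    environ      -- mapping to read SLURM_MPI_TYPE from (defaults to os.environ)
    conf_default -- the MpiDefault value from slurm.conf
    """
    environ = os.environ if environ is None else environ
    if cli_value:
        # An explicit command-line option wins.
        return cli_value
    env_value = environ.get("SLURM_MPI_TYPE")
    if env_value:
        # Otherwise fall back to the per-user environment variable.
        return env_value
    # Finally use the system-wide default from slurm.conf.
    return conf_default


if __name__ == "__main__":
    print(effective_mpi_type("openmpi", {"SLURM_MPI_TYPE": "lam"}, "none"))
    print(effective_mpi_type(None, {"SLURM_MPI_TYPE": "lam"}, "none"))
    print(effective_mpi_type(None, {}, "none"))
```

The set of accepted type names depends on the SLURM build; check the local installation rather than relying on the placeholder names above.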



2) SLURM creates a resource allocation for the job and then mpirun launches tasks using SLURM's infrastructure (OpenMPI, LAM/MPI and HP-MPI).


3) The current versions of SLURM and Open MPI support task launch using the srun command.

This relies on SLURM version 2.0 (or higher) managing reservations of communication ports for use by Open MPI version 1.5 (or higher). The system administrator must specify the range of ports to be reserved in the slurm.conf file using the MpiParams parameter. For example:

MpiParams=ports=12000-12999
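To check which range is actually configured, the MpiParams line can be read back from slurm.conf. The following is a minimal sketch, assuming the line has the ports=LOW-HIGH form shown above, possibly alongside other comma-separated items; the function name parse_reserved_port_range is made up for this example and is not part of SLURM.

```python
import re


def parse_reserved_port_range(conf_text):
    """Return (low, high) from an `MpiParams=ports=LOW-HIGH` line, or None."""
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        match = re.match(r"^MpiParams\s*=\s*(.*)$", line, flags=re.IGNORECASE)
        if not match:
            continue
        # MpiParams may carry several comma-separated items (assumption).
        for item in match.group(1).split(","):
            key, _, value = item.strip().partition("=")
            if key.lower() == "ports":
                low, _, high = value.partition("-")
                return int(low), int(high or low)
    return None


if __name__ == "__main__":
    sample = "# MPI settings\nMpiDefault=none\nMpiParams=ports=12000-12999\n"
    print(parse_reserved_port_range(sample))
```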


Launch tasks using the srun command plus the option --resv-ports. The ports reserved on every allocated node will be identified in an environment variable available to the tasks as shown here: 
SLURM_STEP_RESV_PORTS=12000-12015
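A task that wants to inspect its reserved ports can expand this variable into individual port numbers. The sketch below assumes the value is a single LOW-HIGH range as in the example above, and also tolerates a comma-separated list of ranges or single ports; that wider format is an assumption, and the helper name expand_reserved_ports is hypothetical.

```python
import os


def expand_reserved_ports(value):
    """Expand a string such as '12000-12015' into a list of port numbers."""
    ports = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        low, sep, high = part.partition("-")
        first = int(low)
        # A bare number (no dash) is treated as a one-port range.
        last = int(high) if sep else first
        ports.extend(range(first, last + 1))
    return ports


if __name__ == "__main__":
    # Inside a job step the variable is set by SLURM; the default here is
    # only so the example can be run outside a job.
    value = os.environ.get("SLURM_STEP_RESV_PORTS", "12000-12015")
    ports = expand_reserved_ports(value)
    print(len(ports), ports[0], ports[-1])
```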


If the ports reserved for a job step are found by the Open MPI library to be in use, a message of this form will be printed and the job step will be re-launched:
srun: error: sun000: task 0 unble to claim reserved port, retrying
After three failed attempts, the job step will be aborted. Repeated failures should be reported to your system administrator in order to rectify the problem by cancelling the processes holding those ports.
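When such failures repeat, it can help to find out which ports in the reserved range are already taken on a node. The following diagnostic sketch simply tries to bind each port; it is not how SLURM or Open MPI perform their own check, and the function names port_is_free and busy_ports are invented for this example.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": some process holds the port.
            return False
    return True


def busy_ports(first, last, host=""):
    """List the ports in [first, last] that cannot currently be bound."""
    return [p for p in range(first, last + 1) if not port_is_free(p, host)]


if __name__ == "__main__":
    # Range taken from the MpiParams example in this article.
    print(busy_ports(12000, 12015))
```

Run on the affected node, this lists the ports to look up with the usual system tools before asking the administrator to cancel the processes holding them.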


Note: Older releases


Older versions of Open MPI and SLURM rely upon SLURM to allocate resources for the job and then mpirun to initiate the tasks. For example:


$ salloc -n4 sh    # allocates 4 processors and spawns shell for job
> mpirun a.out
> exit             # exits shell spawned by initial salloc command