Slurm cluster jobs failing with RuntimeError: No CUDA GPUs are available, CUDA_LAUNCH_BLOCKING=1, and related problems.

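This article describes a problem where CUDA could not find a GPU when a .sh script was submitted with sbatch. The problem was eventually resolved by checking the code, updating environment variables, fixing the format of the .sh file, and reinstalling a PyTorch build that matches the CUDA version. The key steps were setting the SBATCH parameters, installing a specific PyTorch version with conda, and setting CUDA_LAUNCH_BLOCKING.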

Problem description

After writing the .sh script and submitting it with sbatch, the job kept failing with RuntimeError: No CUDA GPUs are available.
Looking for a fix, I added two lines to main.py:

print(torch.cuda.device_count())
print(torch.cuda.is_available())

Result:

0
False
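The two print calls above can be extended into a slightly fuller diagnostic. The following is a minimal sketch, not the author's code: the helper name gpu_diagnostics is hypothetical, and it assumes that Slurm exports CUDA_VISIBLE_DEVICES and SLURM_JOB_ID to the job, which is the usual behavior but depends on the cluster configuration.

```python
import os
import socket


def gpu_diagnostics():
    """Collect facts that usually explain 'No CUDA GPUs are available'.

    Hypothetical helper for illustration; call it at the top of main.py.
    """
    info = {
        "host": socket.gethostname(),
        # Assumption: Slurm sets this for jobs that requested GPUs.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "SLURM_JOB_ID": os.environ.get("SLURM_JOB_ID"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only PyTorch builds.
    info["torch_cuda_build"] = torch.version.cuda
    info["device_count"] = torch.cuda.device_count()
    info["is_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print(f"{key}: {value}")
```

If torch_cuda_build prints None, the installed PyTorch is a CPU-only build; if CUDA_VISIBLE_DEVICES is missing or empty, the job probably did not receive a GPU allocation.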

squeue -l showed that the job was already running on compute node comput8. Could it be that comput8 has no GPU?
I checked the resources on comput8 with sinfo -o "%all" -N -n comput8.

The node clearly does have GPUs.
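The same node check can be scripted. This is a sketch under stated assumptions: the helper names count_gpus and node_gres are hypothetical, the sinfo format field %G (generic resources) is used instead of %all to keep the output short, and the GRES strings handled here (such as gpu:2 or gpu:v100:4(S:0-1)) are common shapes rather than an exhaustive list.

```python
import re
import subprocess


def count_gpus(gres):
    """Count GPUs in a Slurm GRES string such as 'gpu:v100:4(S:0-1)'."""
    # Drop parenthesised parts: "(null)" -> "", "gpu:2(S:0)" -> "gpu:2".
    cleaned = re.sub(r"\([^)]*\)", "", gres)
    total = 0
    for entry in cleaned.split(","):
        fields = entry.strip().split(":")
        if fields[0] != "gpu":
            continue
        # A bare "gpu" or "gpu:type" without a count is treated as one device.
        total += int(fields[-1]) if fields[-1].isdigit() else 1
    return total


def node_gres(node):
    """Ask sinfo for the GRES string of one node (requires a Slurm login node)."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-n", node, "-o", "%G"],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.strip().splitlines()
    # A node that belongs to several partitions is listed once per partition.
    return lines[0].strip() if lines else ""


if __name__ == "__main__":
    gres = node_gres("comput8")
    print(gres, "->", count_gpus(gres), "GPU(s)")
```

A nonzero count only shows that the node has GPUs configured; whether the job was actually allocated one depends on the request in the batch script.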
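In the end I edited the .sh script and saved it again. Perhaps the earlier version of the script had not been read in correctly? After the change, at least the .cuda() calls in the Python code no longer raised an error.
The .sh file now reads: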

#!/bin/bash
#SBATCH -J MICRO # job name
#SBATCH -N 1 # number of nodes requested
#SBATCH --output=/public/home/robertchen/ylzhang20215227085/MICRO/codes/slurm/slurm.out # path of the slurm.out log file
#SBATCH --gres=gpu:1 # request 1 GPU device
#SBATCH --ntasks-per-node=8

source activate ylzhang20215227085-MICRO
nvidia-smi
python ./main.py
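Since the fix came from rewriting the script, it can help to check a batch script for the mistakes that most often keep its GPU request from taking effect. The sketch below is a heuristic, not a Slurm tool: the function name lint_sbatch_script is hypothetical, and it only looks for three things, namely Windows line endings, a missing interpreter line, and #SBATCH directives placed after the first command (sbatch ignores directives that follow the first non-comment, non-blank line).

```python
def lint_sbatch_script(text):
    """Return a list of warnings for a batch script given as a string."""
    warnings = []
    if "\r\n" in text:
        warnings.append("script has Windows (CRLF) line endings")
    lines = text.splitlines()
    if not lines or not lines[0].startswith("#!"):
        warnings.append("first line is not a #! interpreter line")

    seen_command = False      # becomes True at the first real command
    has_gpu_request = False   # True once an effective GPU request is seen
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            if seen_command:
                warnings.append(
                    f"line {number}: #SBATCH after the first command is ignored"
                )
            elif "--gres=gpu" in stripped or "--gpus" in stripped:
                has_gpu_request = True
        elif stripped and not stripped.startswith("#"):
            seen_command = True
    if not has_gpu_request:
        warnings.append("no effective GPU request (e.g. --gres=gpu:1) found")
    return warnings


if __name__ == "__main__":
    with open("job.sh", newline="") as handle:  # newline="" keeps any CRLF
        for warning in lint_sbatch_script(handle.read()):
            print("warning:", warning)
```

The file name job.sh is a placeholder. An empty result does not prove the script is correct; it only rules out these particular problems.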

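Another possible cause is that the code picks a GPU id itself with os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id). This can also produce the No CUDA GPUs are available error, presumably because it overrides the value Slurm set for the job's allocation.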
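One way to avoid that conflict is to set CUDA_VISIBLE_DEVICES only when nothing has set it already. This is a minimal sketch rather than the author's fix: the helper name choose_visible_devices is hypothetical, and it assumes the call happens before the first CUDA call in the process, because the variable is read when CUDA is initialized.

```python
import os


def choose_visible_devices(gpu_id, env=None):
    """Set CUDA_VISIBLE_DEVICES from gpu_id unless it is already set.

    Returns the value that is in effect afterwards.
    """
    env = os.environ if env is None else env
    current = env.get("CUDA_VISIBLE_DEVICES")
    if current:
        # Already set to a non-empty value (for example by Slurm): keep it.
        return current
    # Unset or empty: fall back to the id given on the command line.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    print(choose_visible_devices(0))
```

In main.py this would replace the direct assignment, for example choose_visible_devices(args.gpu_id), placed before any .cuda() call.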

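After that, the earlier No CUDA GPUs are available error from .cuda() no longer appeared, and the output shows that torch now detects the GPU (the 1 and True in the result come from the print calls added to main.py earlier, which print the torch.cuda information):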

Mon Mar  6 14:36:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        P
Using nodes: slurm-gb200-218-[145,147,149,151,253,255],slurm-gb200-219-[001,003] pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e pyxis: imported docker image: ghcr.io#coreweave/nccl-tests:12.8.1-devel-ubuntu22.04-nccl2.26.2-1-0708d2e [1752160265.696087] [slurm-gb200-219-003:95563:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [1752160265.696023] [slurm-gb200-218-255:103525:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [1752160265.696113] [slurm-gb200-219-003:95561:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - 
Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-219-003:95561] pml_ucx.c:424 Error: ucp_ep_create(proc=12) failed: Destination is unreachable [slurm-gb200-219-003:95561] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 12 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.696047] [slurm-gb200-218-255:103523:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-255:103523] pml_ucx.c:424 Error: ucp_ep_create(proc=4) failed: Destination is unreachable [slurm-gb200-218-255:103523] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 4 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.696571] [slurm-gb200-218-253:89460:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-253:89460] pml_ucx.c:424 Error: ucp_ep_create(proc=0) failed: Destination is unreachable [slurm-gb200-218-253:89460] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 0 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.696861] [slurm-gb200-219-001:99789:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory 
- Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-219-001:99789] pml_ucx.c:424 Error: ucp_ep_create(proc=8) failed: Destination is unreachable [slurm-gb200-219-001:99789] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 8 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.696119] [slurm-gb200-219-003:95562:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-219-003:95562] pml_ucx.c:424 Error: ucp_ep_create(proc=13) failed: Destination is unreachable [slurm-gb200-219-003:95562] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 13 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [1752160265.697575] [slurm-gb200-218-253:89463:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-253:89463] pml_ucx.c:424 Error: ucp_ep_create(proc=3) failed: Destination is unreachable [slurm-gb200-218-253:89463] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 3 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [1752160265.697815] [slurm-gb200-219-001:99792:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, 
sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-219-003:95563] pml_ucx.c:424 Error: ucp_ep_create(proc=14) failed: Destination is unreachable [slurm-gb200-219-003:95563] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 14 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.696210] [slurm-gb200-219-003:95564:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-255:103524] pml_ucx.c:424 Error: ucp_ep_create(proc=5) failed: Destination is unreachable [slurm-gb200-218-255:103524] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 5 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.696171] [slurm-gb200-218-255:103526:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-219-001:99792] pml_ucx.c:424 Error: ucp_ep_create(proc=11) failed: Destination is unreachable [slurm-gb200-219-001:99792] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 11 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed 
[1752160265.698460] [slurm-gb200-219-001:99791:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.698161] [slurm-gb200-218-253:89462:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-219-003:95564] pml_ucx.c:424 Error: ucp_ep_create(proc=15) failed: Destination is unreachable [slurm-gb200-219-003:95564] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 15 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.698521] [slurm-gb200-219-001:99790:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-255:103526] pml_ucx.c:424 Error: ucp_ep_create(proc=7) failed: Destination is unreachable [slurm-gb200-218-255:103526] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 7 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [slurm-gb200-219-001:99790] pml_ucx.c:424 Error: 
ucp_ep_create(proc=9) failed: Destination is unreachable [slurm-gb200-219-001:99790] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 9 [slurm-gb200-218-253:89461] pml_ucx.c:424 Error: ucp_ep_create(proc=1) failed: Destination is unreachable [slurm-gb200-218-253:89461] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 1 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [slurm-gb200-218-255:103525] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-001:99792] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-003:95561] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-255:103526] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-001:99790] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-253:89461] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-253:89462] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-003:95562] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-255:103523] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-253:89460] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-253:89463] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-255:103524] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init 
failed [slurm-gb200-219-001:99791] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-001:99789] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-003:95563] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-219-003:95564] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [1752160265.743560] [slurm-gb200-218-145:100876:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [1752160265.743544] [slurm-gb200-218-145:100878:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [1752160265.743013] [slurm-gb200-218-151:110182:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-145:100876] pml_ucx.c:424 Error: ucp_ep_create(proc=17) failed: Destination is unreachable [slurm-gb200-218-145:100876] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 17 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.743059] [slurm-gb200-218-151:110181:0] select.c:644 
UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-145:100878] pml_ucx.c:424 Error: ucp_ep_create(proc=19) failed: Destination is unreachable [slurm-gb200-218-145:100878] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 19 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.744269] [slurm-gb200-218-149:114475:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-151:110182] pml_ucx.c:424 Error: ucp_ep_create(proc=31) failed: Destination is unreachable [slurm-gb200-218-151:110182] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 31 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.743380] [slurm-gb200-218-147:116799:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-149:114477] pml_ucx.c:424 Error: ucp_ep_create(proc=27) failed: Destination is unreachable [slurm-gb200-218-149:114477] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 27 [LOG_CAT_COMMPATTERNS] isend failed in 
comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.743747] [slurm-gb200-218-145:100875:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-149:114475] pml_ucx.c:424 Error: ucp_ep_create(proc=25) failed: Destination is unreachable [slurm-gb200-218-149:114475] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 25 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.743469] [slurm-gb200-218-147:116796:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-145:100875] pml_ucx.c:424 Error: ucp_ep_create(proc=16) failed: Destination is unreachable [slurm-gb200-218-145:100875] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 16 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [1752160265.743100] [slurm-gb200-218-151:110179:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-151:110179] pml_ucx.c:424 Error: ucp_ep_create(proc=28) failed: Destination is unreachable [slurm-gb200-218-151:110179] 
pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 28 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.744293] [slurm-gb200-218-149:114476:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-147:116797] pml_ucx.c:424 Error: ucp_ep_create(proc=21) failed: Destination is unreachable [slurm-gb200-218-147:116797] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 21 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.743444] [slurm-gb200-218-147:116798:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-151:110181] pml_ucx.c:424 Error: ucp_ep_create(proc=30) failed: Destination is unreachable [slurm-gb200-218-151:110181] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 30 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [1752160265.744394] [slurm-gb200-218-149:114474:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is 
unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-149:114476] pml_ucx.c:424 Error: ucp_ep_create(proc=26) failed: Destination is unreachable [slurm-gb200-218-149:114476] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 26 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [slurm-gb200-218-147:116798] pml_ucx.c:424 Error: ucp_ep_create(proc=22) failed: Destination is unreachable [slurm-gb200-218-147:116798] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 22 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [slurm-gb200-218-149:114474] pml_ucx.c:424 Error: ucp_ep_create(proc=24) failed: Destination is unreachable [slurm-gb200-218-149:114474] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 24 [slurm-gb200-218-147:116799] pml_ucx.c:424 Error: ucp_ep_create(proc=23) failed: Destination is unreachable [slurm-gb200-218-147:116799] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 23 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [slurm-gb200-218-147:116796] pml_ucx.c:424 Error: ucp_ep_create(proc=20) failed: Destination is unreachable [slurm-gb200-218-147:116796] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 20 [LOG_CAT_COMMPATTERNS] isend failed in comm_allreduce_pml at iterations 4 [LOG_CAT_P2P] hmca_bcol_ucx_p2p address preexchange allreduce failed [slurm-gb200-218-147:116797] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-149:114476] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed 
[slurm-gb200-218-147:116796] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-151:110180] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-151:110182] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-149:114477] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-145:100877] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-145:100878] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-147:116799] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-149:114474] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-151:110181] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-151:110179] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-147:116798] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-149:114475] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-145:100876] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [slurm-gb200-218-145:100875] Error: coll_hcoll_module.c:310 - mca_coll_hcoll_comm_query() Hcol library init failed [1752160265.930628] [slurm-gb200-219-003:95564:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De 
[slurm-gb200-219-003:95564] pml_ucx.c:424 Error: ucp_ep_create(proc=0) failed: Destination is unreachable [slurm-gb200-219-003:95564] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 0 [slurm-gb200-219-003:95564:0:95564] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7b) [1752160265.974232] [slurm-gb200-218-151:110182:0] select.c:644 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable, sysv/memory - Destination is unreachable, posix/memory - Destination is unreachable, cuda_copy/cuda - no am bcopy, cuda_ipc/cuda - no am bcopy, rc_verbs/ibp0:1 - Destination is unreachable, ud_verbs/ibp0:1 - De [slurm-gb200-218-151:110182] pml_ucx.c:424 Error: ucp_ep_create(proc=16) failed: Destination is unreachable [slurm-gb200-218-151:110182] pml_ucx.c:477 Error: Failed to resolve UCX endpoint for rank 16 [slurm-gb200-218-151:110182:0:110182] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7b) ==== backtrace (tid: 95564) ==== 0 0x000000000004b8d8 ompi_request_default_test_all() /opt/hpcx/sources/openmpi-gitclone/ompi/request/req_test.c:184 1 0x0000000000002610 oob_allgather_test() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:182 2 0x000000000000cd10 ucc_core_addr_exchange() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:461 3 0x000000000000d8f4 ucc_context_create_proc_info() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:723 4 0x00000000000028f8 mca_coll_ucc_init_ctx() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:302 5 0x000000000000429c mca_coll_ucc_comm_query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:488 6 0x000000000008038c query_2_0_0() 
/opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:540 7 0x000000000008038c query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:523 8 0x000000000008038c check_one_component() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:486 9 0x000000000008038c check_components() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:406 10 0x0000000000080744 mca_coll_base_comm_select() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:114 11 0x00000000000c5750 ompi_mpi_init() /opt/hpcx/sources/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958 12 0x000000000006eae8 PMPI_Init() /opt/hpcx/sources/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:67 13 0x0000000000003104 main() /opt/nccl-tests/src/common.cu:840 14 0x00000000000273fc __libc_init_first() ???:0 15 0x00000000000274cc __libc_start_main() ???:0 16 0x0000000000005b70 _start() ???:0 ==== backtrace (tid: 95564) ==== 0 0x000000000004b8d8 ompi_request_default_test_all() /opt/hpcx/sources/openmpi-gitclone/ompi/request/req_test.c:184 1 0x0000000000002610 oob_allgather_test() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:182 2 0x000000000000cd10 ucc_core_addr_exchange() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:461 3 0x000000000000d8f4 ucc_context_create_proc_info() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:723 4 0x00000000000028f8 mca_coll_ucc_init_ctx() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:302 5 0x000000000000429c mca_coll_ucc_comm_query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:488 6 0x000000000008038c query_2_0_0() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:540 7 
0x000000000008038c query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:523
8 0x000000000008038c check_one_component() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:486
9 0x000000000008038c check_components() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:406
10 0x0000000000080744 mca_coll_base_comm_select() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:114
11 0x00000000000c5750 ompi_mpi_init() /opt/hpcx/sources/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
12 0x000000000006eae8 PMPI_Init() /opt/hpcx/sources/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:67
13 0x0000000000003104 main() /opt/nccl-tests/src/common.cu:840
14 0x00000000000273fc __libc_init_first() ???:0
15 0x00000000000274cc __libc_start_main() ???:0
16 0x0000000000005b70 _start() ???:0
=================================
[slurm-gb200-219-003:95564] *** Process received signal ***
[slurm-gb200-219-003:95564] Signal: Segmentation fault (11)
[slurm-gb200-219-003:95564] Signal code: (-6)
[slurm-gb200-219-003:95564] Failing at address: 0x7e90001754c
[slurm-gb200-219-003:95564] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xf3ede1d709d0]
[slurm-gb200-219-003:95564] [ 1] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48)[0xf3ede1bab8d8]
[slurm-gb200-219-003:95564] [ 2] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x2610)[0xf3edc68f2610]
[slurm-gb200-219-003:95564] [ 3] /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x5c)[0xf3edc68acd10]
[slurm-gb200-219-003:95564] [ 4] /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7f0)[0xf3edc68ad8f4]
[slurm-gb200-219-003:95564] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x28f8)[0xf3edc68f28f8]
[slurm-gb200-219-003:95564] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x5c)[0xf3edc68f429c]
[slurm-gb200-219-003:95564] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(+0x8038c)[0xf3ede1be038c]
[slurm-gb200-219-003:95564] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x64)[0xf3ede1be0744]
[slurm-gb200-219-003:95564] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0x1190)[0xf3ede1c25750]
[slurm-gb200-219-003:95564] [10] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Init+0x78)[0xf3ede1bceae8]
[slurm-gb200-219-003:95564] [11] /opt/nccl_tests/build/all_reduce_perf(+0x3104)[0xacaa30543104]
[slurm-gb200-219-003:95564] [12] /usr/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xf3edd25073fc]
[slurm-gb200-219-003:95564] [13] /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xf3edd25074cc]
[slurm-gb200-219-003:95564] [14] /opt/nccl_tests/build/all_reduce_perf(+0x5b70)[0xacaa30545b70]
[slurm-gb200-219-003:95564] *** End of error message ***
==== backtrace (tid: 110182) ====
0 0x000000000004b8d8 ompi_request_default_test_all() /opt/hpcx/sources/openmpi-gitclone/ompi/request/req_test.c:184
1 0x0000000000002610 oob_allgather_test() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:182
2 0x000000000000cd10 ucc_core_addr_exchange() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:461
3 0x000000000000d8f4 ucc_context_create_proc_info() /build-result/src/hpcx-v2.22.1-gcc-doca_ofed-ubuntu22.04-cuda12-aarch64/ucc-4abdb985e3ff05922d8fd175ec1ad099a80e6514/src/core/ucc_context.c:723
4 0x00000000000028f8 mca_coll_ucc_init_ctx() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:302
5 0x000000000000429c mca_coll_ucc_comm_query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/ucc/coll_ucc_module.c:488
6 0x000000000008038c query_2_0_0() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:540
7 0x000000000008038c query() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:523
8 0x000000000008038c check_one_component() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:486
9 0x000000000008038c check_components() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:406
10 0x0000000000080744 mca_coll_base_comm_select() /opt/hpcx/sources/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:114
11 0x00000000000c5750 ompi_mpi_init() /opt/hpcx/sources/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
12 0x000000000006eae8 PMPI_Init() /opt/hpcx/sources/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:67
13 0x0000000000003104 main() /opt/nccl-tests/src/common.cu:840
14 0x00000000000273fc __libc_init_first() ???:0
15 0x00000000000274cc __libc_start_main() ???:0
16 0x0000000000005b70 _start() ???:0
=================================
[slurm-gb200-218-151:110182] *** Process received signal ***
[slurm-gb200-218-151:110182] Signal: Segmentation fault (11)
[slurm-gb200-218-151:110182] Signal code: (-6)
[slurm-gb200-218-151:110182] Failing at address: 0x7e90001ae66
[slurm-gb200-218-151:110182] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xf7d7771209d0]
[slurm-gb200-218-151:110182] [ 1] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_test_all+0x48)[0xf7d776f5b8d8]
[slurm-gb200-218-151:110182] [ 2] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x2610)[0xf7d7538a2610]
[slurm-gb200-218-151:110182] [ 3] /opt/hpcx/ucc/lib/libucc.so.1(ucc_core_addr_exchange+0x5c)[0xf7d75385cd10]
[slurm-gb200-218-151:110182] [ 4] /opt/hpcx/ucc/lib/libucc.so.1(ucc_context_create_proc_info+0x7f0)[0xf7d75385d8f4]
[slurm-gb200-218-151:110182] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(+0x28f8)[0xf7d7538a28f8]
[slurm-gb200-218-151:110182] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_coll_ucc.so(mca_coll_ucc_comm_query+0x5c)[0xf7d7538a429c]
[slurm-gb200-218-151:110182] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(+0x8038c)[0xf7d776f9038c]
[slurm-gb200-218-151:110182] [ 8] /opt/hpcx/ompi/lib/libmpi.so.40(mca_coll_base_comm_select+0x64)[0xf7d776f90744]
[slurm-gb200-218-151:110182] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_mpi_init+0x1190)[0xf7d776fd5750]
[slurm-gb200-218-151:110182] [10] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Init+0x78)[0xf7d776f7eae8]
[slurm-gb200-218-151:110182] [11] /opt/nccl_tests/build/all_reduce_perf(+0x3104)[0xbd73c6e53104]
[slurm-gb200-218-151:110182] [12] /usr/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xf7d7678b73fc]
[slurm-gb200-218-151:110182] [13] /usr/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xf7d7678b74cc]
[slurm-gb200-218-151:110182] [14] /opt/nccl_tests/build/all_reduce_perf(+0x5b70)[0xbd73c6e55b70]
[slurm-gb200-218-151:110182] *** End of error message ***
srun: error: slurm-gb200-219-003: task 31: Segmentation fault
srun: error: slurm-gb200-218-151: task 15: Segmentation fault
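Reading the frames in this log from the bottom up, both crashing processes appear to fail inside MPI_Init, on the path mca_coll_base_comm_select -> mca_coll_ucc_comm_query -> ucc_context_create_proc_info -> ucc_core_addr_exchange. A log this long is easier to triage with a small script. The following is only a minimal sketch, not part of the original job: the names `summarize`, `SRUN_ERROR`, and `FRAME` are hypothetical, and the regular expressions assume the exact line shapes seen in the output above (`srun: error: <node>: task <n>: <reason>` and `[<host>:<pid>] [<idx>] <library>(...)`). It scans the whole text rather than line by line, so it should also work when the log has been flattened onto a single line.

```python
import re
from collections import Counter

# Assumed shape: "srun: error: <node>: task <n>: <reason>"
SRUN_ERROR = re.compile(
    r"srun: error: (?P<node>[\w.-]+): task (?P<task>\d+): "
    r"(?P<reason>[A-Za-z ]+?)(?=\s+srun:|\s*$)",
    re.MULTILINE,
)

# Assumed shape: "[<host>:<pid>] [<idx>] <library path>(<symbol+offset>)[<addr>]"
FRAME = re.compile(
    r"\[(?P<host>[\w.-]+):(?P<pid>\d+)\] \[\s*(?P<idx>\d+)\] (?P<lib>[^\s(]+)\("
)


def summarize(log_text):
    """Return (failed tasks, per-library frame counts) found in log_text."""
    failures = [
        (m["node"], int(m["task"]), m["reason"])
        for m in SRUN_ERROR.finditer(log_text)
    ]
    # Count frames per shared library, keyed by file name only.
    libs = Counter(
        m["lib"].rsplit("/", 1)[-1] for m in FRAME.finditer(log_text)
    )
    return failures, libs


if __name__ == "__main__":
    # A few entries copied from the log above, joined on one line on purpose.
    sample = (
        "[slurm-gb200-219-003:95564] [ 1] /opt/hpcx/ompi/lib/libmpi.so.40"
        "(ompi_request_default_test_all+0x48)[0xf3ede1bab8d8] "
        "[slurm-gb200-219-003:95564] [ 2] /opt/hpcx/ompi/lib/openmpi/"
        "mca_coll_ucc.so(+0x2610)[0xf3edc68f2610] "
        "srun: error: slurm-gb200-219-003: task 31: Segmentation fault "
        "srun: error: slurm-gb200-218-151: task 15: Segmentation fault"
    )
    failures, libs = summarize(sample)
    for node, task, reason in failures:
        print(f"{node} task {task}: {reason}")
    for lib, count in libs.most_common():
        print(f"{lib}: {count} frame(s)")
```

Running it over the full slurm output lists which nodes and task ranks crashed and which libraries dominate the native frames, which helps narrow down where to look next. It does not diagnose the root cause.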