Fixing NCCL WARN Cuda failure 'invalid device function' (unhandled cuda error, NCCL version 2.4.8)

The error message may look like this: unhandled cuda error, NCCL version 2.4.8

Set the following environment variables to see the NCCL error log:

export NCCL_SOCKET_IFNAME=enp6s0

export NCCL_IB_DISABLE=1

export NCCL_DEBUG=INFO

Note: the enp6s0 in export NCCL_SOCKET_IFNAME=enp6s0 above is the name of your local network interface. Get it with ifconfig (or ip addr).
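If the job is launched from Python, the same settings can be applied from the script itself. The following is a minimal sketch, not part of the original setup: `list_interfaces` and `nccl_debug_env` are hypothetical helper names, and `enp6s0` is just the interface name used in this post. The variables are set before any distributed initialization so that NCCL sees them when it starts.

```python
import os
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this process."""
    return [name for _, name in socket.if_nameindex()]


def nccl_debug_env(ifname):
    """Build the NCCL debugging variables used in this post for one interface."""
    return {
        "NCCL_SOCKET_IFNAME": ifname,  # interface NCCL should use for sockets
        "NCCL_IB_DISABLE": "1",        # disable the InfiniBand transport
        "NCCL_DEBUG": "INFO",          # print NCCL INFO/WARN lines
    }


if __name__ == "__main__":
    print("interfaces:", list_interfaces())
    # Assumption: enp6s0 is the interface on this machine; replace with yours.
    os.environ.update(nccl_debug_env("enp6s0"))
```

Printing the interface list first makes it easy to spot a wrong interface name before starting the training job.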

When the CUDA versions do not match, the log looks like this:

znsoft-virtual-machine:102553:102553 [0] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
NCCL version 2.4.8+cuda10.2
znsoft-virtual-machine:102620:102620 [1] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Channel 00 :    0   1
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
znsoft-virtual-machine:102620:102695 [1] NCCL INFO comm 0x7f0438002580 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
znsoft-virtual-machine:102553:102694 [0] NCCL INFO comm 0x7fbb600025a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE

znsoft-virtual-machine:102620:102620 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
znsoft-virtual-machine:102620:102620 [1] NCCL INFO misc/group.cc:148 -> 1
znsoft-virtual-machine:102553:102553 [0] NCCL INFO Launch mode Parallel

znsoft-virtual-machine:102553:102553 [0] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'

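Note the last line: enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'

This is caused by PyTorch having been built with a CUDA version that differs from the CUDA installed on the machine. The log itself shows the mismatch: NCCL reports 2.4.8+cuda10.2, while the GPUs report Min Comp Cap 8, and compute capability 8.x GPUs are only supported from CUDA 11 onward, so a CUDA 10.2 build most likely contains no kernels for them.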
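The two values to compare can be pulled out of the NCCL INFO output automatically. This is a rough sketch under assumptions: `parse_nccl_log` and `build_too_old_for_gpu` are hypothetical helpers, the regular expressions only target the two line formats seen in the log above, and the rule encoded is just "compute capability 8 or higher needs a CUDA 11+ build".

```python
import re


def parse_nccl_log(text):
    """Return (cuda_build, min_comp_cap) found in NCCL INFO output.

    cuda_build is a (major, minor) tuple; either value is None if not found.
    """
    cuda = re.search(r"NCCL version [\d.]+\+cuda(\d+)\.(\d+)", text)
    cap = re.search(r"Min Comp Cap (\d+)", text)
    cuda_build = (int(cuda.group(1)), int(cuda.group(2))) if cuda else None
    min_cap = int(cap.group(1)) if cap else None
    return cuda_build, min_cap


def build_too_old_for_gpu(cuda_build, min_cap):
    """True when a pre-11 CUDA build is paired with a compute capability 8+ GPU."""
    if cuda_build is None or min_cap is None:
        return False  # not enough information to decide
    return min_cap >= 8 and cuda_build[0] < 11


if __name__ == "__main__":
    sample = (
        "NCCL version 2.4.8+cuda10.2\n"
        "host:1:2 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled\n"
    )
    build, cap = parse_nccl_log(sample)
    print(build, cap, build_too_old_for_gpu(build, cap))
```

A `True` result only flags this particular combination; a `False` result does not prove that the build and the GPU are compatible.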

Note that the NCCL package must be installed. I built it with the following commands:

git clone https://github.com/NVIDIA/nccl.git

cd nccl 

export NVCC_GENCODE=-gencode=arch=compute_80,code=compute_80

make CUDA_HOME=/usr/local/cuda  

make install

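Solution:

Install a PyTorch build whose CUDA version matches the one on the machine.

The version shown by nvidia-smi should match the CUDA version of the PyTorch build. Mine is: CUDA Version: 11.7. (This value is the highest CUDA version the installed driver supports.)

Install a PyTorch build for a close CUDA version, such as 11.6 or 11.7.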
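After reinstalling, it may be worth confirming that the new PyTorch build actually targets the GPU. The sketch below is an illustration, not the author's procedure: `arch_supported` is a hypothetical helper that applies the usual CUDA compatibility rules (an `sm_XY` binary runs on devices of the same major version with an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for equal or newer devices). It uses `torch.version.cuda`, `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`, and skips the check if PyTorch is not installed.

```python
import re


def arch_supported(capability, arch_list):
    """Check whether a (major, minor) compute capability is covered by arch_list."""
    major, minor = capability
    for arch in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if not m:
            continue  # skip entries this sketch does not understand
        kind, a, b = m.group(1), int(m.group(2)), int(m.group(3))
        if kind == "sm" and a == major and b <= minor:
            return True  # binary code for the same major architecture
        if kind == "compute" and (a, b) <= (major, minor):
            return True  # PTX that can be JIT-compiled for this device
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("PyTorch built with CUDA:", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print("GPU 0 capability:", cap)
            print("architectures in this build:", archs)
            print("GPU covered by build:", arch_supported(cap, archs))
```

If the last line prints `False`, the installed build probably still lacks kernels for the GPU, which would match the 'invalid device function' failure shown earlier.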
