The error message may look like this: unhandled cuda error, NCCL version 2.4.8
Set the following environment variables to see the NCCL error log:
export NCCL_SOCKET_IFNAME=enp6s0
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO
Note: enp6s0 in export NCCL_SOCKET_IFNAME=enp6s0 above is the name of your local network interface. Replace it with your own, which you can find with ifconfig.
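On systems where ifconfig is not installed, the interface names can also be read from sysfs. The following is a minimal sketch, assuming a Linux machine that exposes /sys/class/net. The helper name list_ifaces is hypothetical and not part of NCCL or any tool.

```shell
# List network interface names, skipping the loopback device "lo".
# The optional argument overrides the directory to scan (default: /sys/class/net).
list_ifaces() {
    net_dir=${1:-/sys/class/net}
    for path in "$net_dir"/*; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -e "$path" ] || continue
        name=${path##*/}
        if [ "$name" != "lo" ]; then
            printf '%s\n' "$name"
        fi
    done
}

list_ifaces
```

Pick the name of the interface that carries your training traffic and use it as the value of NCCL_SOCKET_IFNAME.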
If the CUDA versions do not match, the log looks like this:
znsoft-virtual-machine:102553:102553 [0] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102553:102553 [0] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
NCCL version 2.4.8+cuda10.2
znsoft-virtual-machine:102620:102620 [1] NCCL INFO Bootstrap : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
znsoft-virtual-machine:102620:102620 [1] NCCL INFO NET/Socket : Using [0]enp6s0:192.168.1.113<0>
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Channel 00 : 0 1
znsoft-virtual-machine:102620:102695 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
znsoft-virtual-machine:102553:102694 [0] NCCL INFO Using 256 threads, Min Comp Cap 8, Trees disabled
znsoft-virtual-machine:102620:102695 [1] NCCL INFO comm 0x7f0438002580 rank 1 nranks 2 cudaDev 1 nvmlDev 1 - Init COMPLETE
znsoft-virtual-machine:102553:102694 [0] NCCL INFO comm 0x7fbb600025a0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 - Init COMPLETE
znsoft-virtual-machine:102620:102620 [1] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
znsoft-virtual-machine:102620:102620 [1] NCCL INFO misc/group.cc:148 -> 1
znsoft-virtual-machine:102553:102553 [0] NCCL INFO Launch mode Parallel
znsoft-virtual-machine:102553:102553 [0] enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
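Note the last line: enqueue.cc:197 NCCL WARN Cuda failure 'invalid device function'
This happens when the CUDA version PyTorch was built with differs from the CUDA version installed on the machine. In the log above, NCCL was built with CUDA 10.2 (NCCL version 2.4.8+cuda10.2) while the GPUs report compute capability 8 (Min Comp Cap 8), which CUDA 10.2 does not support.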
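With NCCL_DEBUG enabled the log gets long, so it can help to filter it down to the build line and the warnings. This is a small sketch, assuming you saved the training output to a file. The file name nccl.log and the helper name nccl_summary are hypothetical.

```shell
# Print the NCCL build line and every NCCL WARN line from a saved log file.
# Usage: nccl_summary nccl.log
nccl_summary() {
    grep -E 'NCCL version|NCCL WARN' "$1"
}
```

For the log shown above, this would keep only the "NCCL version 2.4.8+cuda10.2" line and the two "Cuda failure 'invalid device function'" warnings.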
Note that the NCCL package must be installed. I built it from source with the following commands:
git clone https://github.com/NVIDIA/nccl.git
cd nccl
export NVCC_GENCODE=-gencode=arch=compute_80,code=compute_80
make CUDA_HOME=/usr/local/cuda
make install
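Solution:
Install a PyTorch build whose CUDA version matches the one on your machine:
The version reported by nvidia-smi should match the CUDA version of the PyTorch build you install. Mine is: CUDA Version: 11.7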
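The comparison can be scripted roughly as follows. This is a sketch under some assumptions: python3 and nvidia-smi are on the PATH, and "close enough" is taken to mean the same major version, which is this example's own simplification of the 11.6/11.7 advice. The helper names cuda_major and same_cuda_major are hypothetical.

```shell
# Extract the major component of a CUDA version string such as "11.7".
cuda_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeed when two CUDA version strings share the same major version.
# Assumption of this sketch: same major version counts as "close enough".
same_cuda_major() {
    [ "$(cuda_major "$1")" = "$(cuda_major "$2")" ]
}

# CUDA version the installed PyTorch was built with (empty if unavailable).
torch_cuda=$(python3 -c 'import torch; print(torch.version.cuda or "")' 2>/dev/null || true)

# CUDA version shown in the nvidia-smi banner, which is the highest
# version the driver supports (empty if nvidia-smi is unavailable).
driver_cuda=$(nvidia-smi 2>/dev/null | sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1) || true

if [ -n "$torch_cuda" ] && [ -n "$driver_cuda" ]; then
    if same_cuda_major "$torch_cuda" "$driver_cuda"; then
        echo "OK: PyTorch CUDA $torch_cuda, nvidia-smi CUDA $driver_cuda"
    else
        echo "Mismatch: PyTorch CUDA $torch_cuda, nvidia-smi CUDA $driver_cuda"
    fi
else
    echo "Could not determine both versions (torch='$torch_cuda', nvidia-smi='$driver_cuda')"
fi
```

In the failing setup above, this would report a mismatch between 10.2 and 11.7.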
Install PyTorch built for a nearby CUDA version, such as 11.6 or 11.7: