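The Docker container was created with the GPU device attached, but after installing CUDA inside the container the GPU could not be used. Even the simplest command, nvidia-smi, failed with: Failed to initialize NVML: Driver/library version mismatch
Checking the NVIDIA driver and the NVIDIA libraries from inside the container showed the following: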
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 465.19.01 Fri Mar 19 07:44:41 UTC 2021
cat /var/log/dpkg.log|grep nvidia
2022-08-14 14:52:45 install libnvidia-cfg1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status half-installed libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:45 status unpacked libnvidia-cfg1-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-common-470:all <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status unpacked libnvidia-common-470:all 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 install libnvidia-compute-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:46 status half-installed libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-compute-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-decode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-decode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 install libnvidia-encode-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status half-installed libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:47 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-encode-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-extra-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status unpacked libnvidia-extra-470:amd64 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 install libnvidia-fbc1-470:amd64 <none> 470.141.03-0ubuntu0.18.04.1
2022-08-14 14:52:48 status half-installed libnvidia-fbc1-470:amd64 470.141.03-0ubuntu0.18.04.1
...
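This problem usually arises because the CUDA version installed in the container is newer than the driver and does not match it: the container uses the driver installed on the host, not the one pulled in when CUDA was installed inside the container. Here the loaded kernel module is 465.19.01, while the libnvidia-* libraries installed in the container are 470.141.03.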
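The comparison described above can be scripted. The following is a minimal sketch, not a definitive tool: the function names (kernel_module_version, library_version, versions_match) are made up for illustration, and it assumes that /proc/driver/nvidia/version starts with the NVRM line shown earlier and that the libraries were installed through dpkg as the libnvidia-compute-* package.

```bash
#!/usr/bin/env bash
# Sketch: compare the loaded NVIDIA kernel module version with the
# user-space library version recorded in dpkg.log.
# All function names here are illustrative, not part of any NVIDIA tool.

# Read the contents of /proc/driver/nvidia/version on stdin and print
# the first dotted version number found (e.g. 465.19.01).
kernel_module_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Read dpkg.log lines on stdin and print the upstream version of the most
# recently installed libnvidia-compute-* package (e.g. 470.141.03).
library_version() {
    awk '$3 == "install" && $4 ~ /^libnvidia-compute-/ {
             split($6, v, "-"); ver = v[1]
         }
         END { if (ver != "") print ver }'
}

# Succeed only when both version strings are identical.
versions_match() {
    [ "$1" = "$2" ]
}

# Run the check only when both files are readable.
if [ -r /proc/driver/nvidia/version ] && [ -r /var/log/dpkg.log ]; then
    kmod=$(kernel_module_version < /proc/driver/nvidia/version)
    lib=$(library_version < /var/log/dpkg.log)
    if [ -z "$lib" ]; then
        echo "no libnvidia-compute install record found in dpkg.log"
    elif versions_match "$kmod" "$lib"; then
        echo "OK: kernel module and libraries are both $kmod"
    else
        echo "MISMATCH: kernel module $kmod, libraries $lib"
    fi
fi
```

With the values from this post, the script would report a mismatch between 465.19.01 and 470.141.03.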
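The fix is simple: upgrade the NVIDIA driver on the host to a version no lower than that of the NVIDIA libraries in the container (here the 470 series, matching the libnvidia-*-470 packages), for example: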
sudo apt install nvidia-driver-470
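Then reboot. Skipping the reboot does not work: cat /proc/driver/nvidia/version still reports driver 465 rather than the newly installed 470, because a newly installed driver takes effect only after a restart.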
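To check on the host whether a reboot is still pending, the loaded module version can be compared with the version of the module file on disk. This is only a sketch under assumptions: reboot_needed is a hypothetical helper name, the kernel module is assumed to be named nvidia, and modinfo -F version is assumed to report the newly installed driver.

```bash
#!/usr/bin/env bash
# Sketch: detect that the installed driver differs from the loaded one.
# reboot_needed is an illustrative helper, not part of any NVIDIA tool.

# Succeed when an on-disk version is known and differs from the loaded one.
reboot_needed() {
    local loaded=$1 on_disk=$2
    [ -n "$on_disk" ] && [ "$loaded" != "$on_disk" ]
}

# Run the check only when modinfo exists and the driver is loaded.
if command -v modinfo >/dev/null 2>&1 && [ -r /proc/driver/nvidia/version ]; then
    # Version of the module currently loaded in the kernel.
    loaded=$(grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' /proc/driver/nvidia/version | head -n 1)
    # Version of the module file installed on disk (assumed name: nvidia).
    on_disk=$(modinfo -F version nvidia 2>/dev/null || true)
    if [ -z "$on_disk" ]; then
        echo "could not determine the installed module version"
    elif reboot_needed "$loaded" "$on_disk"; then
        echo "loaded $loaded, installed $on_disk: reboot required"
    else
        echo "loaded module $loaded matches the installed driver"
    fi
fi
```

Once the reboot has completed, both values should agree, and nvidia-smi in the container can be tried again.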