Handling Installation Errors When Upgrading to CUDA 10.2

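This post describes the driver installation errors encountered while upgrading to CUDA 10.2 and how to resolve them. It covers the X server error, the 'nvidia-uvm' error, and the 'nvidia' error, and shows how to check CUDA once the installation has finished.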

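When you upgrade CUDA on a machine that already has an older version installed, by running the downloaded runfile (local) installer, the driver installation may fail if programs on the machine are using the GPU or calling into the GPU driver. A typical message is "ERROR: An NVIDIA kernel module 'nvidia' appears ... ". Many suggested fixes are to uninstall the GPU driver completely and reboot the system to release whatever is holding it. That is simple and direct, but it destroys the existing lower-version environment. The alternative is to find and stop the programs that are using the GPU.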

Contents

1. Download and run the runfile

2. Handling driver installation errors

3. Finishing the installation and testing

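First, check the environment on the machine to be upgraded, including the kernel, GCC, the nouveau driver blacklist, NVCC, and the current NVIDIA driver: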

Welcome to Ubuntu 16.04.7 LTS (GNU/Linux 4.4.0-98-generic x86_64)

$ uname -a
Linux p 4.4.0-98-generic #121-Ubuntu SMP Tue Oct 10 14:24:03 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

$ /usr/bin/gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

$ nvidia-smi
Tue Jan  4 14:34:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|

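The mapping between CUDA Toolkit versions and NVIDIA driver versions is listed in NVIDIA's CUDA Toolkit release notes. According to the official documentation, a newer GPU driver remains backward compatible with older CUDA versions.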

Table 3. CUDA Toolkit and Corresponding Driver Versions

| CUDA Toolkit | Linux x86_64 Driver Version | Windows x86_64 Driver Version |
|---|---|---|
| CUDA 11.5 GA | >= 495.29.05 | >= 496.04 |
| CUDA 11.4 Update 3 | >= 470.82.01 | >= 472.50 |
| CUDA 11.4 Update 2 | >= 470.57.02 | >= 471.41 |
| CUDA 11.4 Update 1 | >= 470.57.02 | >= 471.41 |
| CUDA 11.4.0 GA | >= 470.42.01 | >= 471.11 |
| CUDA 11.3.1 Update 1 | >= 465.19.01 | >= 465.89 |
| CUDA 11.3.0 GA | >= 465.19.01 | >= 465.89 |
| CUDA 11.2.2 Update 2 | >= 460.32.03 | >= 461.33 |
| CUDA 11.2.1 Update 1 | >= 460.32.03 | >= 461.09 |
| CUDA 11.2.0 GA | >= 460.27.03 | >= 460.82 |
| CUDA 11.1.1 Update 1 | >= 455.32 | >= 456.81 |
| CUDA 11.1 GA | >= 455.23 | >= 456.38 |
| CUDA 11.0.3 Update 1 | >= 450.51.06 | >= 451.82 |
| CUDA 11.0.2 GA | >= 450.51.05 | >= 451.48 |
| CUDA 11.0.1 RC | >= 450.36.06 | >= 451.22 |
| CUDA 10.2.89 | >= 440.33 | >= 441.22 |
| CUDA 10.1 (10.1.105 general release, and updates) | >= 418.39 | >= 418.96 |
| CUDA 10.0.130 | >= 410.48 | >= 411.31 |
| CUDA 9.2 (9.2.148 Update 1) | >= 396.37 | >= 398.26 |
| CUDA 9.2 (9.2.88) | >= 396.26 | >= 397.44 |
| CUDA 9.1 (9.1.85) | >= 390.46 | >= 391.29 |
| CUDA 9.0 (9.0.76) | >= 384.81 | >= 385.54 |
| CUDA 8.0 (8.0.61 GA2) | >= 375.26 | >= 376.51 |
| CUDA 8.0 (8.0.44) | >= 367.48 | >= 369.30 |
| CUDA 7.5 (7.5.16) | >= 352.31 | >= 353.66 |
| CUDA 7.0 (7.0.28) | >= 346.46 | >= 347.62 |
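
To compare an installed driver against a minimum from the table, a small version comparison helper can be handy. The following is a minimal sketch, assuming `sort -V` (version sort, as in GNU coreutils) is available. The function name `version_ge` is hypothetical, and the two version strings are the ones from this post: 410.48 currently installed, 440.33 required by CUDA 10.2.89.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# Assumption: the host's sort supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="440.33"   # minimum Linux driver for CUDA 10.2.89 (from the table)
current="410.48"    # driver reported by nvidia-smi before the upgrade
# On a live system the current version could instead be read with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader

if version_ge "$current" "$required"; then
    echo "driver $current satisfies >= $required"
else
    echo "driver $current is older than $required; a driver upgrade is needed"
fi
```

With the values from this post the check takes the second branch, which is why the runfile's bundled driver has to be installed as well.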

1. Download and run the runfile

$ wget https://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run
$ chmod +x cuda_10.2.89_440.33.01_linux.run
$ sudo sh cuda_10.2.89_440.33.01_linux.run

Select Install and press Enter:

┌──────────────────────────────────────────────────────────────────────────────┐
│ CUDA Installer                                                               │
│ - [X] Driver                                                                 │
│      [X] 440.33.01                                                           │
│ + [X] CUDA Toolkit 10.2                                                      │
│   [X] CUDA Samples 10.2                                                      │
│   [X] CUDA Demo Suite 10.2                                                   │
│   [X] CUDA Documentation 10.2                                                │
│   Options                                                                    │
│   Install                                                                    │
│                                                                              │
│ Up/Down: Move | Left/Right: Expand | 'Enter': Select | 'A': Advanced options │
└──────────────────────────────────────────────────────────────────────────────┘

If an error occurs during installation, for example:

[ERROR]: Install of 440.33.01 failed

the installer exits immediately. The detailed error messages are in the following log files:

$ vi /var/log/cuda-installer.log
$ vi /var/log/nvidia-installer.log
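
Rather than paging through both logs in an editor, the error lines can be pulled out directly. This is a rough sketch: `show_installer_errors` is a hypothetical helper name, and matching on the word "error" is an assumption about what is worth looking at first, not a complete diagnosis.

```shell
#!/bin/sh
# Print every line mentioning "error" (case-insensitive, with line numbers)
# from each log file given as an argument.
show_installer_errors() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            echo "== $log =="
            grep -n -i 'error' "$log" || echo "(no error lines found)"
        else
            echo "== $log == (missing or not readable; try as root)"
        fi
    done
}

# The two log files named by the installer in this post.
show_installer_errors /var/log/cuda-installer.log /var/log/nvidia-installer.log
```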

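2. Handling driver installation errors

By default, the CUDA Toolkit runfile also installs the NVIDIA driver bundled with it (440.33.01 here, the minimum version that CUDA 10.2 requires). As a result, if other programs on the machine are using the GPU or calling into the GPU driver, the driver installation may fail.

Many guides recommend uninstalling the GPU driver completely and then rebooting. That settles the matter once and for all, but if the new driver or CUDA then fails to install, rolling back to the old version is painful, so this approach is not recommended. Another suggestion is to skip the driver step in the installer, but the GPU driver still has to be upgraded manually before the new CUDA version can be used.

In fact, it is enough to temporarily stop the programs that are using the GPU or the driver, and no reboot is needed. Solutions for several specific errors follow.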

(1) X server error (You appear to be running an X server; please exit X before installing)

Stop the lightdm or gdm service:

# lightdm
$ sudo /etc/init.d/lightdm status
$ sudo /etc/init.d/lightdm stop

# gdm
$ sudo /etc/init.d/gdm status
$ sudo /etc/init.d/gdm stop
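
Which display manager is actually running depends on the desktop setup. As a rough sketch, the helper below reports which of a list of process names are currently running by reading the Linux /proc filesystem. `running_processes` is a hypothetical name, and the candidate names (lightdm, gdm, gdm3, Xorg) are assumptions about common setups rather than an exhaustive list.

```shell
#!/bin/sh
# Print each given process name that is currently running.
# Reads /proc/<pid>/comm, so this only works on Linux.
running_processes() {
    for name in "$@"; do
        # -x: whole-line match, -F: literal string, -q: quiet
        if grep -qxF "$name" /proc/[0-9]*/comm 2>/dev/null; then
            echo "$name"
        fi
    done
}

# Candidate display manager / X server process names (assumed, not exhaustive).
running_processes lightdm gdm gdm3 Xorg
```

If nothing is printed, none of the listed processes is running and the X server error most likely has another cause.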

(2) 'nvidia-uvm' error (ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.)

The full error message is:

-> Detected 24 CPUs online; setting concurrency level to 24.
ERROR: An NVIDIA kernel module 'nvidia-uvm' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

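The cause is that other running programs are using the nvidia-uvm module. Any process that nvidia-smi lists as running on the GPU holds this module, and such processes can simply be killed. If you know what a process is, though, it is better to stop it gracefully, for example with docker stop.

$ sudo lsof /dev/nvidia-uvm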
$ sudo kill -9 [PID]
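
A middle ground between a graceful stop and `kill -9` is to send SIGTERM first and fall back to SIGKILL only if the process does not exit in time. The following is a minimal sketch of that idea. `stop_gracefully` is a hypothetical helper, the 10-second default is an arbitrary assumption, and it must run with enough privileges (for example as root) to signal the target process.

```shell
#!/bin/sh
# stop_gracefully PID [TIMEOUT_SECONDS]
# Send SIGTERM, wait up to TIMEOUT seconds for the process to exit,
# then send SIGKILL if it is still there.
stop_gracefully() {
    pid="$1"
    timeout="${2:-10}"
    kill -TERM "$pid" 2>/dev/null || return 0   # already gone (or not permitted)
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0  # process has exited
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Usage (PID taken from the lsof output):
#   stop_gracefully 12345 10
```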

If Docker containers and nvidia-docker-plugin are in use, the containers and the nvidia-docker service must be stopped as well. To be safe, the Docker service itself can also be stopped.

$ sudo docker stop [CONTAINER_ID]

$ sudo systemctl status nvidia-docker
$ sudo systemctl stop nvidia-docker

$ sudo systemctl stop docker
$ sudo systemctl stop docker.socket

(3) 'nvidia' error (ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.)

The cause and the fix are the same as for error (2); see the related Stack Overflow discussion.

$ sudo lsof /dev/nvidia*
$ sudo kill -9 [PID]

Similar errors include the 'nvidia-drm' error, among others.
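
Before rerunning the installer, it can help to confirm that no NVIDIA kernel module is still in use. The following is a minimal sketch, assuming a Linux /proc/modules in its usual format (module name in the first column, use count in the third). `nvidia_module_usage` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the use count of every loaded kernel module whose name starts
# with "nvidia". Reads /proc/modules by default; another file in the
# same format can be passed as the first argument.
nvidia_module_usage() {
    modules_file="${1:-/proc/modules}"
    [ -r "$modules_file" ] || return 0
    awk '$1 ~ /^nvidia/ { printf "%s use_count=%s\n", $1, $3 }' "$modules_file"
}

nvidia_module_usage
```

A non-zero use count suggests that something is still holding the module, in which case the lsof step above is worth repeating.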

3. Finishing the installation and testing

After a successful installation, the following summary is shown:

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.2/
Samples:  Installed in /home/user/, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.2/lib64, or, add /usr/local/cuda-10.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.2/doc/pdf for detailed information on setting up CUDA.
Logfile is /var/log/cuda-installer.log
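
The summary asks for PATH and LD_LIBRARY_PATH to include the new toolkit directories. One way to do that, for example from ~/.bashrc, is sketched below. The variable name `CUDA_HOME` is a convention assumed here, not something the installer sets, and the guards only keep the entries from being added twice.

```shell
#!/bin/sh
# Toolkit location reported in the installer summary.
CUDA_HOME=/usr/local/cuda-10.2
export CUDA_HOME

# Prepend the bin directory to PATH unless it is already present.
case ":$PATH:" in
    *":$CUDA_HOME/bin:"*) ;;
    *) export PATH="$CUDA_HOME/bin:$PATH" ;;
esac

# Same for the library directory; LD_LIBRARY_PATH may be unset.
case ":${LD_LIBRARY_PATH:-}:" in
    *":$CUDA_HOME/lib64:"*) ;;
    *) export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac
```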

Check the GPU and CUDA information:

$ nvidia-smi

Thu Jan  6 16:44:55 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
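
nvidia-smi shows the driver side. The toolkit side can be checked with nvcc -V, as was done before the upgrade. The sketch below extracts just the release number from that output. `parse_nvcc_release` is a hypothetical helper, and its pattern is based on the nvcc -V output format shown earlier in this post.

```shell
#!/bin/sh
# Read `nvcc -V` output on stdin and print the release number,
# e.g. "Cuda compilation tools, release 9.0, V9.0.176" gives 9.0.
parse_nvcc_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
    nvcc -V | parse_nvcc_release
else
    echo "nvcc not found on PATH"
fi
```

If this still reports the old release, PATH is probably still pointing at the previous toolkit's bin directory.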

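Finally, because the GPU driver and CUDA version have been upgraded, Docker containers that relied on the old versions may fail to start with errors like the following. In that case, nvidia-docker-plugin needs to be reconfigured.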

$ sudo docker logs -n 100 ddac544b963c

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0106 07:15:55.663282    11 common.cpp:114] Cannot create Cublas handle. Cublas won't be available.
E0106 07:15:55.664846    11 common.cpp:121] Cannot create Curand generator. Curand won't be available.
F0106 07:15:55.666266    11 common.cpp:152] Check failed: error == cudaSuccess (30 vs. 0)  unknown error
*** Check failure stack trace: ***
Aborted (core dumped)

 
