[Troubleshooting Notes] nvidia-smi sees the GPUs but torch.cuda.is_available() fails: the fix that finally worked

Latest recommended article published on 2025-05-30 20:49:47


License: CC 4.0 BY-SA

Tags:

#python #artificial-intelligence #deep-learning


Problem description

On a multi-GPU server, I ran the following command:

nvidia-smi

GPU0–GPU2 are listed normally, and GPU3 shows up in the output too, although only as an error line rather than a table row:

Unable to determine the device handle for GPU3: 0000:E3:00.0: Unknown Error
Sat May 24 17:55:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:17:00.0 Off |                  N/A |
| 41%   33C    P8             33W /  350W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off |   00000000:65:00.0 Off |                  N/A |
| 48%   31C    P8             25W /  350W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off |   00000000:CA:00.0 Off |                  N/A |
| 42%   34C    P8             28W /  350W |      15MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2406      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A            2406      G   /usr/lib/xorg/Xorg                        4MiB |
|    2   N/A  N/A            2406      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

But running the following in Python:

import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

produces a warning from `is_available()` followed by an exception from `get_device_name(0)`:

UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0

RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
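
The snippet above stops at the first exception, which makes it awkward to use while debugging. A minimal sketch of a more forgiving probe is shown below. The helper name `probe_cuda` and the keys of the report dictionary are my own choices for illustration, not part of PyTorch; only `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()` and `torch.cuda.get_device_name()` come from the library.

```python
def probe_cuda():
    """Collect CUDA status into a dict instead of raising (hypothetical helper)."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_with_cuda": None,
        "available": False,
        "devices": [],
        "error": None,
    }
    try:
        import torch
    except Exception as exc:  # ImportError, or a loader error from a broken install
        report["error"] = f"import failed: {exc}"
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None on CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda

    try:
        report["available"] = bool(torch.cuda.is_available())
        if report["available"]:
            report["devices"] = [
                torch.cuda.get_device_name(i)
                for i in range(torch.cuda.device_count())
            ]
    except RuntimeError as exc:
        # Keep the message so it can be compared with the error shown above.
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in probe_cuda().items():
        print(f"{key}: {value}")
```

Running it before and after a fix gives a compact before/after comparison without a traceback in the way.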

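Initial troubleshooting

✅ The GPUs are physically present and nvidia-smi runs

✅ PyTorch is a GPU build, not +cpu. The same code had run successfully in this environment before

❌ torch.cuda initialization fails

My suspicion: one GPU (GPU3 here) is in an abnormal hardware or driver state, and that causes initialization of the whole CUDA driver to fail.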
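
To check that suspicion without reading the whole table by eye, the error line can be pulled out of the nvidia-smi text. The sketch below is only an illustration: the function name `find_failed_gpus` is hypothetical, and the regular expression is written against the single error line shown in this post, so other driver versions may word it differently.

```python
import re
import shutil
import subprocess

# Written against this line from the post (an assumption about the format):
# Unable to determine the device handle for GPU3: 0000:E3:00.0: Unknown Error
HANDLE_ERROR = re.compile(
    r"Unable to determine the device handle for GPU(\d+): "
    r"([0-9A-Fa-f:.]+): (.+)"
)


def find_failed_gpus(smi_output):
    """Return (gpu_index, bus_id, reason) for every handle error in the text."""
    failures = []
    for line in smi_output.splitlines():
        match = HANDLE_ERROR.search(line)
        if match:
            index, bus_id, reason = match.groups()
            failures.append((int(index), bus_id, reason.strip()))
    return failures


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        # Scan both streams, since the error line may go to either one.
        print(find_failed_gpus(result.stdout + result.stderr))
    else:
        print("nvidia-smi not found on PATH")
```

An empty list means no handle error was found in the output; it does not prove that every GPU is healthy.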

Final solution: repair the CUDA driver state without rebooting

Install `nvidia-modprobe` and run it:

sudo apt install nvidia-modprobe
sudo nvidia-modprobe
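
`nvidia-modprobe` is the helper that loads the NVIDIA kernel module and creates the `/dev/nvidia*` device files. As a rough check of whether those files are in place, here is a small sketch. The helper names are hypothetical, and the assumption that the per-GPU nodes are numbered `/dev/nvidia0` to `/dev/nvidia{N-1}` holds on typical setups but is not guaranteed. I did not verify that missing nodes were the cause on this machine.

```python
import os


def expected_device_nodes(gpu_count):
    """Device files a working setup usually has (assumed naming, see lead-in)."""
    nodes = ["/dev/nvidiactl", "/dev/nvidia-uvm"]
    nodes += [f"/dev/nvidia{i}" for i in range(gpu_count)]
    return nodes


def missing_device_nodes(gpu_count, exists=os.path.exists):
    """Return the expected device files that are not present.

    `exists` is injectable so the function can be exercised without real hardware.
    """
    return [path for path in expected_device_nodes(gpu_count) if not exists(path)]


if __name__ == "__main__":
    # 4 matches the server in this post; change it for other machines.
    print(missing_device_nodes(4))
```

An empty list only says the files exist; the driver behind them can still be in a bad state.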

Unload and reload the NVIDIA `nvidia_uvm` kernel module:

# Unload the module
sudo rmmod nvidia_uvm

# Load it again
sudo modprobe nvidia_uvm
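
`rmmod` refuses to unload a module that is still in use, so it can help to look at the use count first. The sketch below parses text in the `/proc/modules` format (name, size, use count, ...). The function name and the sample text are made up for illustration and are not taken from the server in this post.

```python
import os


def parse_nvidia_modules(proc_modules_text):
    """Map each loaded nvidia* module to its use count (None if not reported)."""
    modules = {}
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        name, _size, refcount = fields[:3]
        if name.startswith("nvidia"):
            # Kernels built without module unloading print "-" here.
            modules[name] = int(refcount) if refcount.isdigit() else None
    return modules


# Made-up sample in /proc/modules format, for illustration only.
SAMPLE = """\
nvidia_uvm 1540096 2 - Live 0x0000000000000000 (OE)
nvidia_drm 77824 4 - Live 0x0000000000000000 (OE)
nvidia 56827904 93 nvidia_uvm,nvidia_modeset, Live 0x0000000000000000 (OE)
snd_hda_intel 53248 0 - Live 0x0000000000000000
"""

if __name__ == "__main__":
    if os.path.exists("/proc/modules"):
        with open("/proc/modules") as handle:
            print(parse_nvidia_modules(handle.read()))
    else:
        print(parse_nvidia_modules(SAMPLE))
```

If `nvidia_uvm` shows a non-zero use count, the processes holding the GPU most likely need to be stopped before `rmmod nvidia_uvm` will succeed.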

Verification succeeded 🎉

import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

True
NVIDIA GeForce RTX 3090

