Installing Fabric Manager for NVLink NVIDIA A100 GPUs on Linux
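Problem: on H100/A100, Error 802: system not yet initialized (CUDA initialization: Unexpected error from cudaGetDeviceCount()).
Install fabricmanager
Problem: print(torch.cuda.is_available()) fails even though CUDA and cuDNN are both installed and their versions match. The message is: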
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
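Explanation: NVLink versions of the NVIDIA A100 GPU need the nvidia-fabricmanager service, in the version matching the driver, installed in addition to the driver so that the GPUs can interconnect through NVSwitch. If only the NVIDIA GPU driver is installed, the GPUs cannot be used normally. The installation steps are as follows: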
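Download the fabricmanager package that matches your driver version from the website: Index of /compute/cuda/repos/ubuntu2204/x86_64 (nvidia.cn)
Download link: nvidia-fabricmanager
The package version number is the same as the driver version.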
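Since the package version follows the driver version, the file name to look for can be derived from the version string that nvidia-smi reports. The following is only a minimal sketch: the helper name `fabricmanager_deb_name` is hypothetical, and the naming pattern (driver branch, full version, revision `1`, architecture `amd64`) is assumed from the single file name used in this post, so other releases may differ.

```python
import re


def fabricmanager_deb_name(driver_version: str, revision: str = "1", arch: str = "amd64") -> str:
    """Build the expected .deb file name for a driver version.

    Hypothetical helper: the pattern is inferred from the example file
    nvidia-fabricmanager-535_535.104.05-1_amd64.deb, not from NVIDIA docs.
    """
    version = driver_version.strip()
    # Driver versions look like "535.104.05"; the first group is the branch.
    match = re.fullmatch(r"(\d+)\.\d+(?:\.\d+)?", version)
    if match is None:
        raise ValueError(f"unexpected driver version: {driver_version!r}")
    branch = match.group(1)
    return f"nvidia-fabricmanager-{branch}_{version}-{revision}_{arch}.deb"


print(fabricmanager_deb_name("535.104.05"))
# prints nvidia-fabricmanager-535_535.104.05-1_amd64.deb
```

Pass in the driver version exactly as shown in the nvidia-smi header, then look for the resulting file name in the repository index.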
- Check the driver version
nvidia-smi
- Install the downloaded package manually
sudo apt-get install ./nvidia-fabricmanager-535_535.104.05-1_amd64.deb
- Enable the service
sudo systemctl enable nvidia-fabricmanager
- Restart the service
sudo systemctl restart nvidia-fabricmanager
- Check the service status
sudo systemctl status nvidia-fabricmanager
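The status check can also be scripted, for example before launching a training job. This is a rough sketch rather than an official tool: it relies on `systemctl is-active`, which prints a single state word such as `active` or `inactive`, and the function names `parse_is_active` and `fabricmanager_is_active` are hypothetical.

```python
import subprocess

SERVICE = "nvidia-fabricmanager"  # service name used in the steps above


def parse_is_active(stdout: str) -> bool:
    """Interpret `systemctl is-active` output; only "active" means running."""
    return stdout.strip() == "active"


def fabricmanager_is_active(service: str = SERVICE) -> bool:
    """Ask systemd whether the service is running.

    Returns False when systemctl is not available on this machine.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", service],
            capture_output=True,
            text=True,
            check=False,  # a non-zero exit code just means "not active"
        )
    except FileNotFoundError:
        return False
    return parse_is_active(result.stdout)


if __name__ == "__main__":
    print(f"{SERVICE} active: {fabricmanager_is_active()}")
```

If the service is reported as active, run print(torch.cuda.is_available()) again to see whether the Error 802 warning is gone.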
Problem: couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)
Solution:
sudo chown _apt /var/lib/update-notifier/package-data-downloads/partial/