Installing a PaddlePaddle Environment as a Regular User on a Multi-GPU Ubuntu Server

Original article · first published 2023-05-05 12:01:20 · last modified 2023-05-05 13:36:07

License: CC 4.0 BY-SA

Tags: #ubuntu #server #paddlepaddle

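This article explains in detail how a regular user on a multi-GPU Ubuntu server can create a conda virtual environment and install paddlepaddle-gpu without permission to modify system dependencies, including choosing a suitable CUDA version, resolving the CUDA and NCCL dependency problems, and setting an environment variable so that it does not have to be configured by hand each time.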

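Installing the Paddle environment on my local Ubuntu machine went smoothly, but on a multi-GPU server I ran into quite a few problems. The main causes: the server already has CUDA and related software installed, a regular user has no permission to modify system dependencies, and a multi-GPU setup differs somewhat from a single-GPU one.
The main reference is the installation guide in the official Paddle documentation.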

1. Create a conda virtual environment

Create a new virtual environment:

conda create -n paddle_env python=3.9

Activate the environment:

conda activate paddle_env

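2. Install the GPU version of paddlepaddle

One point to stress: be sure to install with conda. A conda install places its own copy of CUDA and related dependencies directly inside the current environment, so they do not conflict with the dependencies preinstalled on the system. Installing with pip is more troublesome.

2.1 Choose a CUDA version

Run nvidia-smi in a terminal to check the system's CUDA Version (the highest CUDA version the installed driver supports), then choose a CUDA toolkit version that does not exceed it. For example, the CUDA Version on my machine is 11.5, so I chose to install CUDA 11.2.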
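The version choice described above can be sketched as a small shell helper. This is only an illustration, not part of the official instructions: `pick_cuda_version` is a hypothetical helper name, the candidate list `10.2 11.2 11.6 11.7` is an assumption (check the Paddle installation page for the versions actually offered), and the comparison relies on GNU `sort -V`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest candidate toolkit version that does
# not exceed the driver's CUDA version (the "CUDA Version" from nvidia-smi).
# Usage: pick_cuda_version DRIVER_VERSION CANDIDATE...
pick_cuda_version() {
    local driver="$1" best="" v
    shift
    for v in "$@"; do
        # v <= driver when v sorts first (or is equal) in version order.
        if [ "$(printf '%s\n%s\n' "$v" "$driver" | sort -V | head -n1)" = "$v" ]; then
            # Keep the largest candidate seen so far.
            if [ -z "$best" ] || [ "$(printf '%s\n%s\n' "$best" "$v" | sort -V | tail -n1)" = "$v" ]; then
                best="$v"
            fi
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}

# Example with the driver version from this article; the candidates are assumed.
pick_cuda_version 11.5 10.2 11.2 11.6 11.7   # prints 11.2

# On a machine with an NVIDIA driver, read the version from nvidia-smi instead.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver_cuda="$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n1)"
    if [ -n "$driver_cuda" ]; then
        pick_cuda_version "$driver_cuda" 10.2 11.2 11.6 11.7 || echo "no candidate fits driver CUDA $driver_cuda"
    fi
fi
```

The printed version is what would go into the `cudatoolkit=` argument of the install command.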

2.2 Install Paddle

conda install paddlepaddle-gpu==2.4.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge

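3. Verification and troubleshooting

3.1 How to verify

After installation, start the Python interpreter with python3, enter import paddle, then enter paddle.utils.run_check()

If the message PaddlePaddle is installed successfully! appears, the installation succeeded.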
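The same check can also be run without opening the interpreter. The wrapper below is a minimal sketch: `check_paddle` is a hypothetical helper name, and the optional argument (the Python command to use) is an assumption added so that a different interpreter can be passed in.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the verification step described above.
# Usage: check_paddle [PYTHON_COMMAND]   (defaults to python3)
check_paddle() {
    local py="${1:-python3}"
    # First make sure the package can be imported at all.
    if ! "$py" -c "import paddle" >/dev/null 2>&1; then
        echo "paddle is not importable with $py" >&2
        return 2
    fi
    # Then run Paddle's own installation self-check.
    "$py" -c "import paddle; paddle.utils.run_check()"
}

# Run inside the activated environment; report rather than abort on failure.
check_paddle || echo "self-check did not pass"
```

Any warnings Paddle prints, such as the ones in the following sections, still appear in the terminal output.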

3.2 First error: CUDA libraries not found

W0505 03:08:12.283917 3969672 dynamic_loader.cc:307] The third-party dynamic library (libcudnn.so) that Paddle depends on is not configured correctly. (error code is /usr/local/cuda/lib64/libcudnn.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX;

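Solution
The environment's install directory in fact already contains the CUDA-related libraries. Paddle is still searching the system directory, however, so it only needs to be pointed at the environment directory. Run this command in the terminal:

export LD_LIBRARY_PATH=[install_path]/miniconda3/envs/paddle_env/lib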

Verify again, and the earlier error is gone.
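As a variation on the command above, the path can be taken from `CONDA_PREFIX`, which conda sets to the active environment's root, instead of being typed out. This is a sketch under that assumption: `prepend_ld_path` is a hypothetical helper name, and the library name patterns in the `grep` are only examples of what to look for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: put a directory at the front of LD_LIBRARY_PATH
# unless it is already listed.
prepend_ld_path() {
    local dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
}

# Only meaningful while a conda environment is active.
if [ -n "${CONDA_PREFIX:-}" ]; then
    # Show which CUDA-related libraries conda placed in the environment.
    ls "$CONDA_PREFIX/lib" 2>/dev/null | grep -E 'libcudnn|libcudart|libcublas' \
        || echo "no CUDA libraries found under $CONDA_PREFIX/lib"
    prepend_ld_path "$CONDA_PREFIX/lib"
fi
```

Prepending keeps the environment's libraries ahead of any system directories that are already on the path.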

3.3 Second error: NCCL not found (multi-GPU)

W0505 03:22:18.677640 3977430 dynamic_loader.cc:278] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-downloadbefore install PaddlePaddle.
[2023-05-05 03:22:18,678] [ WARNING] install_check.py:281 - PaddlePaddle meets some problem with 4 GPUs. This may be caused by:
 1. There is not enough GPUs visible on your system
 2. Some GPUs are occupied by other process now
 3. NVIDIA-NCCL2 is not installed correctly on your system. Please follow instruction on https://github.com/NVIDIA/nccl-tests 
 to test your NCCL, or reinstall it following https://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
[2023-05-05 03:22:18,679] [ WARNING] install_check.py:289 - 
 Original Error is: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
  Suggestions:
  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  - Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
  - Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/phi/backends/dynload/dynamic_loader.cc:305)

PaddlePaddle is installed successfully ONLY for single GPU! Let's start deep learning with PaddlePaddle now.

Solution
Download and install NCCL. It has to be downloaded from NVIDIA's NCCL download page.

After downloading, extract the archive:

tar xvf nccl_2.17.1-1+cuda11.0_x86_64.txz

After extraction, the libraries can be copied directly into the environment's install directory.
Verification now passes.
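The copy step might look like the sketch below. It assumes the archive extracts to a directory named after the file (`nccl_2.17.1-1+cuda11.0_x86_64`) with the shared libraries under `lib/`, and that the environment is active so `CONDA_PREFIX` points at it. `install_nccl_libs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the NCCL shared libraries from an extracted
# archive into a target library directory.
# Usage: install_nccl_libs EXTRACTED_DIR TARGET_LIB_DIR
install_nccl_libs() {
    local src="$1" dst="$2"
    if ! ls "$src"/lib/libnccl* >/dev/null 2>&1; then
        echo "no libnccl files under $src/lib" >&2
        return 1
    fi
    mkdir -p "$dst"
    # -P copies symbolic links (such as libnccl.so) as links, not as file copies.
    cp -P "$src"/lib/libnccl* "$dst"/
}

# Assumed directory name; adjust to whatever the archive actually extracted to.
nccl_dir="nccl_2.17.1-1+cuda11.0_x86_64"
if [ -n "${CONDA_PREFIX:-}" ] && [ -d "$nccl_dir" ]; then
    install_nccl_libs "$nccl_dir" "$CONDA_PREFIX/lib"
fi
```

No root permission is needed, since everything is written inside the user's own conda environment.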

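4. Set the environment variable permanently to avoid configuring the library directory each time

Using the Paddle environment requires this environment variable to be set:

export LD_LIBRARY_PATH=[install_path]/miniconda3/envs/paddle_env/lib

To have it set automatically every time a terminal is opened, edit ~/.bashrc: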

vim ~/.bashrc

Add the following at the very bottom:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[install_path]/miniconda3/envs/paddle_env/lib

Save and exit. The setting takes effect the next time a terminal is opened.
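The same change can be made without an editor. The sketch below appends a line to a file only when that exact line is missing, so running it twice does not duplicate the export. `append_once` is a hypothetical helper name, and the `[install_path]` placeholder still has to be replaced with the real location.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE unless FILE already contains
# exactly that line.
# Usage: append_once LINE FILE
append_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example call, left commented out because it edits ~/.bashrc:
# append_once 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[install_path]/miniconda3/envs/paddle_env/lib' ~/.bashrc
```

The single quotes keep `$LD_LIBRARY_PATH` unexpanded, so it is evaluated when the shell starts rather than when the line is written.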
