Installing GPU TensorFlow on CentOS 7: A Tutorial (Part 1)

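This article describes in detail how to install the GPU build of TensorFlow on Linux, covering the key steps: turning off UEFI Secure Boot, confirming that the graphics card supports CUDA, installing the NVIDIA driver and CUDA, and configuring environment variables.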

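Installation steps:
1. Turn off UEFI Secure Boot
Check in the BIOS whether UEFI, and specifically its Secure Boot option, is enabled. If it is, turn it off right away. This is very important: leaving it on is likely to make the kernel module installation fail (the 'nvidia-drm' error described in step 9), a pitfall that cost the author a lot of time. The exact procedure is not covered here, because the BIOS differs from one machine model to another.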
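
If the machine already runs Linux, the boot mode can also be checked from the running system before going into the BIOS. The following is a minimal sketch, not part of the original procedure. It assumes the mokutil tool is installed and that `mokutil --sb-state` prints a line such as "SecureBoot enabled"; the function names are made up for illustration.

```python
import os


def booted_in_uefi_mode(efi_dir="/sys/firmware/efi"):
    """Return True if the running system was booted through UEFI.

    The Linux kernel only exposes /sys/firmware/efi on UEFI boots.
    """
    return os.path.isdir(efi_dir)


def secure_boot_enabled(mokutil_output):
    """Interpret the text printed by `mokutil --sb-state`.

    Returns True or False, or None when the text is not recognised
    (for example on a machine that was not booted through UEFI).
    """
    text = mokutil_output.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None


print("Booted through UEFI: {0}".format(booted_in_uefi_mode()))
print("Secure Boot on: {0}".format(secure_boot_enabled("SecureBoot enabled")))
```

On a real machine, pass in the actual output, for example `subprocess.check_output(["mokutil", "--sb-state"], universal_newlines=True)`.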

2. Confirm that your graphics card supports CUDA

[littlebei@localhost ~]$ lspci | grep -i nvidia  
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 745] (rev a2)  
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)  

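If output like the above appears, the machine has an NVIDIA GPU. Confirm that the model shown (here a GeForce GTX 745) is on NVIDIA's list of CUDA-capable GPUs.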
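
The same check can be scripted. The sketch below parses lspci text and returns the names of the NVIDIA display devices. The sample text is the output quoted above; the function name and the decision to accept the "VGA compatible controller" and "3D controller" classes are assumptions made for illustration.

```python
import re

# The lspci output quoted in this step.
SAMPLE = """01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 745] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)"""


def nvidia_display_devices(lspci_output):
    """Return the names of NVIDIA display devices found in lspci text."""
    devices = []
    for line in lspci_output.splitlines():
        if "nvidia" not in line.lower():
            continue
        # Each line looks like "<slot> <device class>: <vendor and device>".
        match = re.match(r"^(\S+)\s+(.+?):\s+(.*)$", line.strip())
        if match is None:
            continue
        _slot, device_class, description = match.groups()
        # Skip the GPU's audio function and other non-display classes.
        if device_class not in ("VGA compatible controller", "3D controller"):
            continue
        # Prefer the product name in square brackets when there is one.
        name = re.search(r"\[(.+?)\]", description)
        devices.append(name.group(1) if name else description)
    return devices


print(nvidia_display_devices(SAMPLE))
```

To run it against a real machine, pass in the text from `subprocess.check_output(["lspci"], universal_newlines=True)`. The result still has to be compared with NVIDIA's list by hand.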

3. Confirm that your Linux version supports CUDA

[littlebei@localhost ~]$ uname -m && cat /etc/*release  

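The first line of output should read x86_64, and the release information should name a distribution version that your CUDA release lists as supported.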
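
As a rough illustration of what to look for in that output, the sketch below parses os-release text (one of the files matched by /etc/*release on CentOS 7) and compares it with an expected distribution. The defaults of "centos" and "7" simply mirror this tutorial and the sample text is illustrative; the authoritative list of supported systems is the one in the CUDA installation guide.

```python
# Illustrative os-release content for a CentOS 7 machine.
SAMPLE_OS_RELEASE = '''NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
VERSION_ID="7"
'''


def parse_os_release(text):
    """Turn KEY=value lines into a dict, dropping surrounding quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        info[key] = value.strip().strip('"')
    return info


def looks_supported(machine, os_release_text, wanted_id="centos", wanted_major="7"):
    """Check for a 64-bit x86 machine running the expected distribution."""
    info = parse_os_release(os_release_text)
    major = info.get("VERSION_ID", "").split(".")[0]
    return machine == "x86_64" and info.get("ID") == wanted_id and major == wanted_major


print(looks_supported("x86_64", SAMPLE_OS_RELEASE))
```

On a real system, `platform.machine()` gives the architecture and `open("/etc/os-release").read()` gives the text.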

4. Check whether gcc is installed

[littlebei@localhost ~]$ gcc --version  
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)  
Copyright (C) 2015 Free Software Foundation, Inc.  
This is free software; see the source for copying conditions.  There is NO  
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  

If the output above appears, gcc is already installed.

If it is not installed, you can install it with the following command:

[littlebei@localhost ~]$ sudo yum install gcc gcc-c++  

5. Install kernel-devel and kernel-headers

$ sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)  

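Here $(uname -r) is shell command substitution: it is replaced by the output of uname -r, the release string of the running kernel, so the packages installed match that kernel.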
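
On CentOS 7 the kernel-devel package places its files under /usr/src/kernels/ in a directory named after the kernel release, which is the directory step 9 passes to --kernel-source-path. The sketch below builds that path and checks that it exists. The function names are made up for illustration, and the release string is the one shown later in this article.

```python
import os


def kernel_source_dir(release, base="/usr/src/kernels"):
    """Return the directory kernel-devel uses for the given kernel release."""
    return os.path.join(base, release)


def headers_match_running_kernel(release, base="/usr/src/kernels"):
    """True when a kernel source directory exists for this exact release."""
    return os.path.isdir(kernel_source_dir(release, base))


# Release string taken from the uname -r output shown in step 9.
print(kernel_source_dir("3.10.0-693.2.2.el7.x86_64"))
```

On the target machine, `os.uname()[2]` returns the running kernel's release, so `headers_match_running_kernel(os.uname()[2])` tells you whether the installed kernel-devel package matches the kernel you booted.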

6. Stop the X server

$ sudo systemctl stop gdm.service

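7. Disable nouveau (the open-source NVIDIA graphics driver that most Linux distributions ship by default; it conflicts with the NVIDIA driver, so it must be disabled)
(1) Blacklist the nouveau driver:
Add the line blacklist nouveau to /usr/lib/modprobe.d/dist-blacklist.conf (this location applies to CentOS 7 only; on other Linux distributions, consult their documentation).
(2) Back up the initramfs file: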

$ sudo mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak  

(3) Rebuild the initramfs file:

$ sudo dracut -v /boot/initramfs-$(uname -r).img $(uname -r)  

8. Reboot the machine
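
After the reboot it is worth confirming that nouveau is really gone before running the NVIDIA installer. A minimal sketch, assuming the usual /proc/modules layout in which the module name is the first field of each line. The sample text is invented for illustration.

```python
# Invented sample in the /proc/modules layout: name, size, use count, users, state, address.
SAMPLE_PROC_MODULES = """nouveau 1622010 1 - Live 0xffffffffc0a00000
ttm 93441 1 nouveau, Live 0xffffffffc09e0000
"""


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules text."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def nouveau_is_loaded(proc_modules_text):
    """True when nouveau itself is loaded, not merely named as a user of another module."""
    return "nouveau" in loaded_modules(proc_modules_text)


print(nouveau_is_loaded(SAMPLE_PROC_MODULES))
```

On the real machine, call `nouveau_is_loaded(open("/proc/modules").read())`. A result of False means the blacklist and the rebuilt initramfs took effect.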

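9. Install the NVIDIA driver

Installing the NVIDIA driver is the critical step; once it succeeds, the rest is mostly smooth sailing.
(1) Use the method from step 2 to identify your GPU model, then find and download the matching driver from the NVIDIA website.
(2) Install it with the following command: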

$ sudo sh NVIDIAxxx --kernel-source-path=/usr/src/kernels/x.xx.x-xxxxx  

Here NVIDIAxxx is the NVIDIA driver installer script and x.xx.x-xxxxx is the kernel version, which you can find with the following command:

[littlebei@localhost ~]$ uname -r  
3.10.0-693.2.2.el7.x86_64  

During installation, either of the following two errors may appear.
First:

The driver installation is unable to locate the kernel source. Please make sure that the kernel source packages are
installed and set up correctly.
If you know that the kernel source packages are installed and set up correctly, you may pass the location of the kernel source with the '--kernel-source-path' flag.

Solution:

$ sudo yum install epel-release  
$ sudo yum install --enablerepo=epel dkms  

Second:

ERROR: Unable to load the 'nvidia-drm' kernel module.  

Solution:

One probable cause is that the system boots through UEFI with the Secure Boot option turned on in the BIOS settings. Turn Secure Boot off and the problem should go away.

This is why step 1 asks you to turn it off.

(3) Choices during installation
On the license page choose Accept, on the 32-bit compatibility page choose No, and on the X configuration page choose Yes.

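10. Install CUDA
Download the CUDA release that matches your system from the NVIDIA CUDA download page. Avoid picking the newest CUDA release: it may well be newer than what your TensorFlow version supports, another pitfall the author ran into.
Installation command: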

$ sudo sh cuda_8.0.61_375.26_linux.run  

The installer then asks the following questions:

# accept  
-------------------------------------------------------------   
Do you accept the previously read EULA?accept/decline/quit: accept  
# no  
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 375.26?(y)es/(n)o/(q)uit: n  
-------------------------------------------------------------  
# answer yes or accept the default for everything after this
Do you want to install the OpenGL libraries?  
(y)es/(n)o/(q)uit [ default is yes ]:   
Do you want to run nvidia-xconfig?  
This will update the system X configuration file so that the NVIDIA X driver is used.   
The pre-existing X configuration file will be backed up.  
This option should not be used on systems that require a custom X configuration,   
such as systems with multiple GPU vendors.  
(y)es/(n)o/(q)uit [ default is no ]: y  
Install the CUDA 8.0 Toolkit?  
(y)es/(n)o/(q)uit: y  
  
  
Enter Toolkit Location [ default is /usr/local/cuda-8.0 ]:  
  
  
Do you want to install a symbolic link at /usr/local/cuda?  
(y)es/(n)o/(q)uit: y  
  
  
Install the CUDA 8.0 Samples?  
(y)es/(n)o/(q)uit: y  
  
  
Enter CUDA Samples Location  
[ default is /root ]:   
  
  
Installing the NVIDIA display driver...  

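Output like the following means the toolkit was installed. In the summary, look for "Toolkit: Installed". "Driver: Not Selected" and the incomplete-installation warning refer to the driver bundled with this installer, which was declined because the driver was already installed separately in step 9.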

The driver installation has failed due to an unknown error. Please consult the driver   
installation log located at /var/log/nvidia-installer.log.  
  
  
===========  
= Summary =  
===========  
  
  
Driver: Not Selected  
Toolkit: Installed in /usr/local/cuda-8.0  
Samples: Installed in /root, but missing recommended libraries  
  
  
Please make sure that  
  - PATH includes /usr/local/cuda-8.0/bin   
  - LD_LIBRARY_PATH includes /usr/local/cuda-8.0/lib64, or,   
  add /usr/local/cuda-8.0/lib64 to /etc/ld.so.conf and run ldconfig as root  
  
  
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-8.0/bin  
  
  
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-8.0/doc/pdf for detailed  
information on setting up CUDA.  
  
  
***WARNING: Incomplete installation! This installation did not install the CUDA Driver.   
A driver of version at least 361.00 is required for CUDA 8.0 functionality to work.  
To install the driver using this installer, run the following command,   
replacing <CudaInstaller> with the name of this run file:  
     sudo <CudaInstaller>.run -silent -driver  
  
  
Logfile is /tmp/cuda_install_192.log  
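
One way to double-check the toolkit afterwards is to ask nvcc for its version. The sketch below extracts the release number from that output. It assumes the version line has the form shown in the sample, and the function name is made up for illustration.

```python
import re

# Version line in the form printed by nvcc --version for CUDA 8.0.61.
SAMPLE_LINE = "Cuda compilation tools, release 8.0, V8.0.61"


def cuda_release(nvcc_output):
    """Return (major, minor) parsed from nvcc --version text, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(cuda_release(SAMPLE_LINE))
```

Until /usr/local/cuda-8.0/bin is on PATH, nvcc has to be called by its full path, for example `subprocess.check_output(["/usr/local/cuda-8.0/bin/nvcc", "--version"], universal_newlines=True)`.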

11. Configure the CUDA environment variables
Edit the ~/.bashrc file:

$ vim ~/.bashrc

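Add the following lines (the PATH entry is the one the installer summary in step 10 asks for), then run source ~/.bashrc or open a new shell so they take effect:

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda-8.0/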
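
A quick way to confirm that a new shell picked the settings up is to look for the CUDA directories in LD_LIBRARY_PATH. A minimal sketch: the helper name is made up, and the required list only repeats the two directories used in this step.

```python
import os

# The two library directories this step adds to LD_LIBRARY_PATH.
REQUIRED_DIRS = [
    "/usr/local/cuda-8.0/lib64",
    "/usr/local/cuda-8.0/extras/CUPTI/lib64",
]


def missing_entries(path_value, required):
    """Return the required directories absent from a colon-separated path list."""
    # Ignore empty entries and trailing slashes when comparing.
    present = [entry.rstrip("/") for entry in path_value.split(":") if entry]
    return [item for item in required if item.rstrip("/") not in present]


# An empty list means both directories are already configured.
print(missing_entries(os.environ.get("LD_LIBRARY_PATH", ""), REQUIRED_DIRS))
```

The same helper works for PATH by passing `os.environ.get("PATH", "")` and `["/usr/local/cuda-8.0/bin"]`.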

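12. Install cuDNN
Download the cuDNN package from the NVIDIA website (make sure its version matches your CUDA and TensorFlow versions).
After the download finishes, run the following: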

$ tar -xvzf cudnn-8.0-linux-x64-v6.0.tgz
# the archive unpacks into a cuda/ directory; /usr/local/cuda is owned by root, hence sudo
$ sudo cp cuda/include/* /usr/local/cuda/include
$ sudo cp -P cuda/lib64/* /usr/local/cuda/lib64
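
To confirm which cuDNN version ended up in place, the version macros in cudnn.h can be read back. The sketch below assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integers, as the cuDNN 6 header does. The sample text and its patch level are illustrative only.

```python
import re

# Illustrative excerpt in the style of the cuDNN 6 header.
SAMPLE_HEADER = """#define CUDNN_MAJOR 6
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 21
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""


def cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h text, or None if a macro is missing."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+{0}\s+(\d+)".format(macro)
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


print(cudnn_version(SAMPLE_HEADER))
```

On the machine itself, pass in `open("/usr/local/cuda/include/cudnn.h").read()`.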

13. Install the GPU build of TensorFlow

$ sudo pip install tensorflow-gpu  

This installs TensorFlow directly with pip. If pip is not installed on your machine, see the author's separate post, which covers installing pip.
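
Because an unpinned pip install fetches whatever release is current, the version-matching warning from step 10 applies here as well. The sketch below looks up which TensorFlow release series fit a given CUDA and cuDNN pair. The table is a small partial excerpt, recalled from TensorFlow's published tested build configurations rather than taken from this article, so treat it as an assumption and check the official table before relying on it.

```python
# Partial excerpt: release series -> (CUDA version, cuDNN major version).
# Assumption: recalled from TensorFlow's tested build configurations.
TESTED_GPU_BUILDS = {
    "1.3": ("8.0", "6"),
    "1.4": ("8.0", "6"),
    "1.5": ("9.0", "7"),
}


def compatible_series(cuda_version, cudnn_major):
    """Return the release series in the table built against this CUDA/cuDNN pair."""
    wanted = (cuda_version, cudnn_major)
    return sorted(series for series, pair in TESTED_GPU_BUILDS.items() if pair == wanted)


# CUDA 8.0 with cuDNN 6 is the combination installed in this tutorial.
print(compatible_series("8.0", "6"))
```

With the setup from this tutorial, that suggests pinning the install, for example `sudo pip install tensorflow-gpu==1.4.0`, rather than taking the latest release.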

14. Test TensorFlow
After all the bumps along the way, we finally reach the test step, which should feel good. Start python and run the following:

Python 2.7.5 (default, Jun 17 2014, 18:11:42)  
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2  
Type "help", "copyright", "credits" or "license" for more information.  
>>> import tensorflow as tf  
>>> hello = tf.constant('Hello, TensorFlow!')  
>>> sess = tf.Session()  
2017-06-28 16:42:53.518877: W tensorflow/core/platform/cpu_feature_guard.cc:45]   
The TensorFlow library wasn't compiled to use SSE4.1 instructions,   
but these are available on your machine and could speed up CPU computations.  
2017-06-28 16:42:53.518906: W tensorflow/core/platform/cpu_feature_guard.cc:45]   
The TensorFlow library wasn't compiled to use SSE4.2 instructions,   
but these are available on your machine and could speed up CPU computations.  
2017-06-28 16:42:53.518914: W tensorflow/core/platform/cpu_feature_guard.cc:45]   
The TensorFlow library wasn't compiled to use AVX instructions,   
but these are available on your machine and could speed up CPU computations.  
2017-06-28 16:42:53.518921: W tensorflow/core/platform/cpu_feature_guard.cc:45]   
The TensorFlow library wasn't compiled to use AVX2 instructions,   
but these are available on your machine and could speed up CPU computations.  
2017-06-28 16:42:53.518929: W tensorflow/core/platform/cpu_feature_guard.cc:45]   
The TensorFlow library wasn't compiled to use FMA instructions,   
but these are available on your machine and could speed up CPU computations.  
2017-06-28 16:42:54.099744: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901]   
successful NUMA node read from SysFS had negative value (-1),   
but there must be at least one NUMA node, so returning NUMA node zero  
2017-06-28 16:42:54.100218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887]   
Found device 0 with properties:  
name: Tesla M60  
major: 5 minor: 2 memoryClockRate (GHz) 1.1775  
pciBusID 0000:00:02.0  
Total memory: 7.93GiB  
Free memory: 7.86GiB  
2017-06-28 16:42:54.100243: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0  
2017-06-28 16:42:54.100251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y  
2017-06-28 16:42:54.100266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]   
Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M60, pci bus id: 0000:00:02.0)  
>>> print(sess.run(hello))  
Hello, TensorFlow!
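
The log above already shows the GPU being found ("Creating TensorFlow device (/gpu:0)"). For a check that does not depend on reading log lines, TensorFlow 1.x can list its devices directly. A minimal sketch: the helper names and the stand-in device objects are made up for illustration, and local_gpu_names() only works on a machine where TensorFlow is installed.

```python
from collections import namedtuple


def gpu_names(devices):
    """Keep the names of the entries whose device_type is "GPU"."""
    return [device.name for device in devices if device.device_type == "GPU"]


def local_gpu_names():
    """Ask TensorFlow 1.x which GPUs it can see on this machine."""
    # Imported here so gpu_names() stays usable without TensorFlow installed.
    from tensorflow.python.client import device_lib
    return gpu_names(device_lib.list_local_devices())


# Stand-in objects carrying the two attributes used above, for illustration only.
FakeDevice = namedtuple("FakeDevice", ["name", "device_type"])
SAMPLE_DEVICES = [FakeDevice("/cpu:0", "CPU"), FakeDevice("/gpu:0", "GPU")]
print(gpu_names(SAMPLE_DEVICES))
```

An empty list from `local_gpu_names()` would mean TensorFlow fell back to the CPU, which usually points back to a driver, CUDA, or cuDNN mismatch in the earlier steps.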