Compiling TensorFlow-GPU from Source

This post records in detail the process of building and installing TensorFlow from source on different Ubuntu releases, including environment setup, common problems, and the specific errors hit during compilation. It covers build experience from TensorFlow 1.6 through 2.7.



These steps no longer apply to the latest TensorFlow. Jump to the divider below for the current TensorFlow build process.

Compiling and Installing TensorFlow on Ubuntu 16.04

I have run into many problems building TensorFlow from source on different machines and failed several times. This time, on a freshly reinstalled Ubuntu 16.04 system, it somehow succeeded, so I am recording the process here.

This machine has JDK 1.8, the latest Bazel, Clang 5.0.1, CUDA 9.1, and cuDNN 7 installed.
The installation steps are as follows:

  • Clone the TensorFlow repository: https://github.com/tensorflow/tensorflow.git
  • Run export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64. According to the TensorFlow instructions you should first run sudo apt-get install cuda-command-line-tools, but apt reported that no such package exists, so I skipped it.
  • Run ./configure in the TensorFlow source tree. My configuration is shown below (the triSYCL repository mentioned in it was cloned into /usr/local/).
 WARNING: Running Bazel server needs to be killed, because the startup options are different.
You have bazel 0.10.1- (@non-git) installed.
Please specify the location of python. [Default is /home/liushuai/anaconda3/bin/python]: 


Found possible Python library paths:
  /home/liushuai/anaconda3/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/home/liushuai/anaconda3/lib/python3.6/site-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: y
Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: y
Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: y
Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: y
VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: y
OpenCL SYCL support will be enabled for TensorFlow.

Please specify which C++ compiler should be used as the host C++ compiler. [Default is /usr/bin/g++]: 


Please specify which C compiler should be used as the host C compiler. [Default is /usr/bin/gcc]: 


Do you wish to build TensorFlow with ComputeCPP support? [Y/n]: n
No ComputeCPP support will be enabled for TensorFlow.

Please specify the location of the triSYCL include directory. (Use --config=sycl_trisycl when building with Bazel) [Default is /usr/local/triSYCL/include]: 


Do you wish to build TensorFlow with CUDA support? [y/N]: n
No CUDA support will be enabled for TensorFlow.

Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


The WORKSPACE file has at least one of ["android_sdk_repository", "android_ndk_repository"] already set. Will not ask to help configure the WORKSPACE. Please delete the existing rules to activate the helper.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
Configuration finished
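Aside: if you would rather not answer these prompts interactively, configure also reads its answers from environment variables, so the same setup can be scripted. A minimal sketch, assuming the variable names used by configure.py of that era (verify them against your checkout):

# Variable names assumed from configure.py; check your TensorFlow version
export PYTHON_BIN_PATH=/home/liushuai/anaconda3/bin/python
export TF_NEED_CUDA=0    # this run answered "n" to CUDA at the prompt
export TF_ENABLE_XLA=1   # XLA JIT support, answered "y" above
./configure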
  • Build with bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package until the build finishes with messages like the following:
INFO: From Compiling tensorflow/stream_executor/cuda/cuda_blas.cc:
tensorflow/stream_executor/cuda/cuda_blas.cc: In function 'cudaDataType_t perftools::gputools::cuda::{anonymous}::CUDAComputationType(perftools::gputools::blas::ComputationType)':
tensorflow/stream_executor/cuda/cuda_blas.cc:604:1: warning: control reaches end of non-void function [-Wreturn-type]
 }
 ^
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 2236.355s, Critical Path: 150.89s
INFO: Build completed successfully, 4272 total actions

  • Generate the wheel: bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    The wheel is written to /tmp/tensorflow_pkg/tensorflow-1.6.0rc1-cp36-cp36m-linux_x86_64.whl
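The wheel can then be installed with pip (the exact filename depends on your Python and TensorFlow versions):

pip install /tmp/tensorflow_pkg/tensorflow-1.6.0rc1-cp36-cp36m-linux_x86_64.whl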

Enabling VERBS and GDR in TensorFlow

I had previously built the latest TensorFlow on my laptop, but without VERBS and GDR enabled, because turning them on later fails with errors about missing headers such as dma.h; installing the corresponding package fixes that: sudo apt install librdmacm-dev. On top of that, Caffe only supports CUDA 8 and kept throwing build errors, so I additionally installed CUDA 9.1 and cuDNN 7. Since the /usr/local/cuda symlink had already been created by CUDA 8, I answered no when the CUDA 9.1 installer offered to create the /usr/local/cuda symlink, chose /usr/cuda-9.1 as the install location, and set the environment variables in my local bash or zsh config (a sketch of those variables follows below).
So I rebuilt on the server with VERBS and GDR enabled, i.e. answered y at the VERBS and GDR prompts.
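A minimal sketch of those environment variables, assuming the /usr/cuda-9.1 prefix mentioned above (append to ~/.bashrc or ~/.zshrc):

# Point the toolchain at CUDA 9.1 installed under /usr/cuda-9.1 instead of /usr/local/cuda
export PATH=/usr/cuda-9.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/cuda-9.1/lib64:/usr/cuda-9.1/extras/CUPTI/lib64:$LD_LIBRARY_PATH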
The server environment is as follows:

  • CPU: E5-2620
  • GPU: GTX 1080 ×2
  • Ubuntu 16.04 64-bit
    Pre-installed software: clang, cmake, bazel, Java 1.8. To enable MKL acceleration I also installed MKL (MKL requires --config=mkl); the build command is otherwise the same as above.
    Enabling MPI requires installing OpenMPI (I kept the default prefix /usr/local).
    Prebuilt tensorflow-gpu packages with different feature combinations are available for download from my netdisk if you need them; see readme.md for details. Netdisk password: ez2b

Compiling TensorFlow 2.0 Alpha on Ubuntu 18.04

Install or update the following packages:

pip install -U --user pip six numpy wheel mock
pip install -U --user keras_applications==1.0.6 --no-deps
pip install -U --user keras_preprocessing==1.0.5 --no-deps

As with the 16.04 build, my build options enable TensorRT and CUDA support (sketched below).
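For reference, a sketch of enabling those two options non-interactively (variable names assumed from configure.py, and the TensorRT path is only an example; adjust it to where your TensorRT actually lives):

# Assumed configure variables; the TensorRT path below is a placeholder
export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=1
export TENSORRT_INSTALL_PATH=/usr/lib/x86_64-linux-gnu
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package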
If the following error appears:

ERROR: Analysis of target '//tensorflow/tools/pip_package:build_pip_package' failed; build aborted: no such package '@llvm//': java.io.IOException: Error downloading [https://mirror.bazel.build/github.com/llvm-mirror/llvm/archive/b4ec9009a62f5966fd723d9dc89eac2216bf122c.tar.gz, https://github.com/llvm-mirror/llvm/archive/b4ec9009a62f5966fd723d9dc89eac2216bf122c.tar.gz] to /home/amax/.cache/bazel/_bazel_amax/72a114ee0020c76796715c8bc7376219/external/llvm/b4ec9009a62f5966fd723d9dc89eac2216bf122c.tar.gz: Premature EOF
INFO: Elapsed time: 854.354s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (398 packages loaded, 18468 targets configured)
    Fetching @llvm; fetching 750s
    Fetching @icu; fetching 750s

run rm -rf ~/.cache/bazel and then rebuild. A prebuilt TensorFlow 2.0 wheel is available here, extraction code: raxg (built with the environment below).

  • CUDA 10
  • cuDNN 7.5
  • Python 3.7
  • TensorRT 5.1

Building the graph transform tool (transform_graph):

bazel build tensorflow/tools/graph_transforms:transform_graph

Building summarize_graph:

bazel build tensorflow/tools/graph_transforms:summarize_graph
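Usage sketch for both tools, with a hypothetical frozen_graph.pb and placeholder input/output node names (flags as documented in tensorflow/tools/graph_transforms/README.md):

# Inspect the inputs, outputs and op counts of a frozen graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=frozen_graph.pb

# Apply a few common transforms; node names here are placeholders
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=frozen_graph.pb \
  --out_graph=optimized_graph.pb \
  --inputs=input \
  --outputs=output \
  --transforms='strip_unused_nodes fold_constants(ignore_errors=true) fold_batch_norms'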

Build errors with newer TensorFlow releases

Building a release with Bazel 0.19.2 fails; you need to import tools/bazel.rc in the .bazelrc file in the TensorFlow root directory, with a line like:
import /home/amax/tensorflow/tools/bazel.rc

Building from source on Manjaro

Same as on Ubuntu, except that packages installed via pacman live in different paths, so the locations of the CUDA, cuDNN, and TensorRT libraries must be specified manually. The environment here is as follows (as of 2020-10-20, the supported Bazel version is 3.1):

  • CUDA 10.2
  • cuDNN 8
  • TensorRT 7
    Path configuration: /opt/cuda, /local/TensorRT-7.1.3.4/include, /opt/cuda/extras/CUPTI/include, /opt/cuda/targets/x86_64-linux/include, /local/TensorRT-7.1.3.4/targets/x86_64-linux-gnu/lib (handed to configure as in the sketch below)
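A sketch of passing those non-standard locations to ./configure. Recent TensorFlow versions read a TF_CUDA_PATHS variable (assumed available here); otherwise, paste the same values at the "comma-separated list of base paths" prompt:

# TF_CUDA_PATHS assumed supported by this TensorFlow version; adjust the paths to your install
export TF_CUDA_PATHS=/opt/cuda,/opt/cuda/targets/x86_64-linux,/opt/cuda/extras/CUPTI,/local/TensorRT-7.1.3.4
export TF_NEED_TENSORRT=1
./configure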

Divider

Ubuntu 20.04 + RTX 3090

I recently bought an RTX 3090 card. To get better performance (the official prebuilt TensorFlow packages are poor; look at PyTorch, where a single command sets up CUDA and everything else automatically), I had no choice but to build from source. The current environment:

  • GPU: RTX 3090
  • CPU: E5
  • OS: Ubuntu 20.04
  • Driver: 455.23.04
  • CUDA: 11.1 (as of 2020-11-27 every package here is the latest available; the driver is the latest long-lived release, because the latest short-lived release caused erratic screen refreshing under Ubuntu 20.04, though that may be specific to my machine)
  • cuDNN: 8.0
  • TensorRT: 7.2
  • Bazel: 3.4
    Build process: building TensorFlow is fairly straightforward these days; apart from a few dependencies that networks inside China may not be able to download, most packages download normally.
  1. bazel build --config=opt --config=cuda --config=mkl //tensorflow/tools/pip_package:build_pip_package
  2. ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

(To be safe, I have already mirrored all of TensorFlow's dependency packages on my VPS, which basically solves any missing-package problem caused by the network.) Four reminders:
3. If the build fails with a docker_rule error pointing at ~/.cache/bazel/_bazel_liushuai/aa8adcb58cbbd827a6e521843f232b82/external/bazel_toolchains/repositories/repositories.bzl, change the git URL inside that file, because that package cannot be fetched over its http git URL. My workaround was to download it with my own VPS, serve it from a git server there, and clone it from my own git.
4. If the LLVM package cannot be downloaded, find the workspace.bzl file in the TensorFlow source and replace the download URL with your own VPS download address.
5. The RTX 3090's compute capability (cc) is 8.6, and configuring 8.6 directly does not compile (I tried many times and even opened an issue on TensorFlow; in the end I solved it myself). Setting cc to 7.5 (the cc of the previous-generation Turing cards) does compile, but training then fails with an error roughly saying that xx does not match the driver. I never worked out whether the cause was the cc setting or the driver; later I got a new SSD and reinstalled the system, so I stopped chasing it and went with the latest driver, the latest CUDA, and cc 8.0 at build time (see the sketch after this list). That currently works fine.
6. Use Bazel 3.4 (the TensorFlow release nominally wants 3.1.0, but a higher version also works; I had 3.4 downloaded and did not feel like switching to 3.1, so I just edited the version check in a config file. I no longer remember which file it was, but the error you hit during the build points you to it.)
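The compute-capability setting from point 5, sketched via the TF_CUDA_COMPUTE_CAPABILITIES variable that configure reads; 8.0 is what was ultimately used here, even though the 3090 is natively 8.6:

# Build for the RTX 3090 with the compute capability pinned to 8.0
export TF_NEED_CUDA=1
export TF_NEED_TENSORRT=1
export TF_CUDA_COMPUTE_CAPABILITIES=8.0
./configure
bazel build --config=opt --config=cuda --config=mkl //tensorflow/tools/pip_package:build_pip_package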
Here is a download of a package I built myself, to save you from building it again:

Problems encountered during the build

  1. As of 2021-07-28, the latest TensorFlow (version 2.7) throws:
RuntimeError                              Traceback (most recent call last)
RuntimeError: module compiled against API version 0xe but this version of numpy is 0x

Updating numpy to the latest version fixes it:

pip install -U numpy 

Prebuilt package downloads

  • cuda_11.1.0_455.23.05_linux.run + cudnn-11.1-linux-x64-v8.0.5.39 + TensorRT-7.2.1.6.Ubuntu-18.04.x86_64-gnu.cuda-11.1.cudnn8.0 + driver 455.23.04: package link: Tensorflow 2.5, extraction code: miu8
  • This version was built under CUDA 11 with cc 7.5; the other components are the same as above: tensorflow 2.5, extraction code: zjed. If you have an RTX 30-series card, please help me verify whether installing this build on a machine whose driver targets CUDA 11 hits the "xx is too old and does not match the driver" problem.
  • This one was built for a Tesla T4 about three months ago (based on CUDA 10.2, with cuDNN 8 and TensorRT 7.x; I do not remember the exact version, so if a library is missing, download the matching TensorRT): tensorflow 2.4, extraction code: 2w8r.

If a link is dead, leave me a comment.
