From: http://biophysics.med.jhmi.edu/~yliu120/tensorflow.html
TensorFlow offers an excellent framework for executing mathematical operations. Equipped with TensorFlow, many complicated machine learning models, as well as general mathematical problems, can be programmed easily and launched on hierarchical and efficient architectures (multi-CPU and multi-GPU). However, TensorFlow is quite new and has not been open-sourced for long, and building it on a "non-standard" platform has proven to be a difficult task. This page provides all the necessary modifications to the current CROSSTOOL settings, plus workarounds, for building TensorFlow on MARCC. Even though MARCC has a specific architecture, many of these settings and workarounds can shed light on other environments as well.
No extra dependencies (GCC libs, libstdc++, CUDA, cuDNN, etc.) need to be compiled in this protocol. We only use the existing MARCC libraries. Here are the libraries we use:
- GCC 4.9.2
- libstdc++ (comes with GCC 4.9.2)
- CUDA 7.5
- CUDNN 5.0
- Python 2.7.10b
- Binutils 2.25
- Java 1.8.0_112
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load cudnn/5.0
module load python/2.7.10b
module load java/1.8.0_112
This protocol is nothing but passing the correct environment variables to Bazel and TensorFlow; all of the code builds successfully once the correct ENVs are in place. The following statements are both false:
- TensorFlow won't be built on CentOS 6
- TensorFlow won't be built on current arch of MARCC
Build Bazel
TensorFlow must be built with Bazel. In other words, Bazel is the only build tool supported by TensorFlow; this is true for all Linux/Darwin platforms. We therefore need to build Bazel prior to building TensorFlow.
Essentially, Bazel is the open-source version of Google's internal build tool. Bazel is a rather new project as well, so its cross-platform support is not yet elegant. However, Bazel is a really good build tool thanks to its scalability.
Download the latest Bazel and Uncompress it
Go to Bazel's GitHub releases page and check out the latest release (version 0.4.2 at the time of writing).
wget https://github.com/bazelbuild/bazel/releases/download/0.4.2/bazel-0.4.2-dist.zip
mkdir -p bazel-0.4.2
cd bazel-0.4.2 && unzip ../bazel-0.4.2-dist.zip
Note: please use the "dist.zip" archive for a release build. After version 0.3.2, Bazel's source-code release only builds a developer version; a few related issues are discussed on Bazel's GitHub issue tracker.
Be aware of our environment settings
In the current Bazel release, the C++ compiler toolchain is hard-coded in the code base. This should be improved in a future Bazel release, but for now we must change the cc toolchain rules in Bazel's CROSSTOOL files to provide the correct paths (ENVs).
To build Bazel, load the modules listed above (cuDNN is not needed here):
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load python/2.7.10b
module load java/1.8.0_112
After loading those modules, we can take a look at our current ENVs:
[MyUserName@login-node04 xxx]$ which gcc
/cm/shared/apps/gcc/4.9.2/bin/gcc
[MyUserName@login-node04 xxx]$ which ld
/cm/shared/apps/binutils/2.25/src/bin/ld
[MyUserName@login-node04 xxx]$ which nm
/cm/shared/apps/binutils/2.25/src/bin/nm
[MyUserName@login-node04 xxx]$ which ar
/cm/shared/apps/binutils/2.25/src/bin/ar
...
You don't have to do this; the above is shown only to illustrate the paths of our gcc and GNU binutils.
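If you want to record these locations for later (when editing the CROSSTOOL files), a small helper loop works; this snippet is a convenience sketch, not part of the original protocol:

```shell
# Print and save the resolved path of every build tool the modules put
# on PATH; a missing tool simply reports "not found".
for tool in gcc cpp ld nm ar as; do
  printf '%s -> %s\n' "$tool" "$(command -v "$tool" 2>/dev/null || echo 'not found')"
done | tee /tmp/toolpaths.txt
```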
Modify the hardcoded CROSSTOOL files
Let's modify the hardcoded CROSSTOOL files at tools/cpp/CROSSTOOL and tools/cpp/cc_configure.bzl.
In the CROSSTOOL file, you only have to modify the "local_linux" toolchain section. (Search for "toolchain_identifier" and you will see a toolchain block associated with "local_linux".) This is because we are building Bazel on a Linux system. Make the following modifications step by step:
- Change all binutils, gcc, and cpp tool paths to our paths listed above. For instance, change
tool_path { name: "ar" path: "/usr/bin/ar" }
to
tool_path { name: "ar" path: "/cm/shared/apps/binutils/2.25/src/bin/ar" }
and change
tool_path { name: "gcc" path: "/usr/bin/gcc" }
to
tool_path { name: "gcc" path: "/cm/shared/apps/gcc/4.9.2/bin/gcc" }
- Add a tool_path for "as" (the GNU assembler) to the tool_path bundle, to prevent assembler problems:
tool_path { name: "as" path: "/cm/shared/apps/binutils/2.25/src/bin/as" }
- Change only the linker flag lines right under the gcc tool_path (comment out the originals) to:
linker_flag: "-lstdc++"
linker_flag: "-Wl,-rpath,/cm/shared/apps/gcc/4.9.2/lib64"
- Change the cxx_builtin_include_directory entries right under the gcc tool_path to:
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/include/c++/4.9.2"
In the cc_configure.bzl file, make the following modification:
- Replace all occurrences of
"-B/usr/bin"
with
"-B/cm/shared/apps/binutils/2.25/src/bin/"
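If you prefer not to edit by hand, both files can be patched with sed. The sketch below is hedged: it runs on a scratch copy (/tmp/crosstool.demo) so it can be tried safely anywhere; in practice you would apply the same expressions to tools/cpp/CROSSTOOL and tools/cpp/cc_configure.bzl, substituting your own site's paths for the MARCC ones.

```shell
# Scratch copy mimicking the stock entries we need to change.
cat > /tmp/crosstool.demo <<'EOF'
tool_path { name: "ar" path: "/usr/bin/ar" }
tool_path { name: "gcc" path: "/usr/bin/gcc" }
"-B/usr/bin"
EOF
# Rewrite the stock /usr/bin paths to the MARCC module paths.
sed -i \
  -e 's|/usr/bin/ar|/cm/shared/apps/binutils/2.25/src/bin/ar|' \
  -e 's|/usr/bin/gcc|/cm/shared/apps/gcc/4.9.2/bin/gcc|' \
  -e 's|-B/usr/bin|-B/cm/shared/apps/binutils/2.25/src/bin/|' \
  /tmp/crosstool.demo
cat /tmp/crosstool.demo
```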
Build Bazel
Run the following:
export EXTRA_BAZEL_ARGS='-s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 24'
./compile.sh
Run this inside an interactive job (request one node for a fast build with
interact -n 24 -p parallel
). When entering a new node, don't forget to reload all the modules.
Some suggestions:
- In the EXTRA_BAZEL_ARGS env we use "--jobs 24"; with more than 24 jobs the compilation may run out of memory. You can raise Java's memory limit and use a higher job count, but I suggest no more than 50.
- If you follow the exact protocol steps and use the same version of Bazel, you do NOT need "-s --verbose_failures" in the EXTRA_BAZEL_ARGS env, and the build may go faster without them.
A successful build should end with final output like this:
Target //src:bazel up-to-date:
bazel-bin/src/bazel
INFO: Elapsed time: 63.711s, Critical Path: 49.16s
WARNING: /tmp/bazel_t7vQ9Fsh/out/external/bazel_tools/WORKSPACE:1: Workspace name in /tmp/bazel_t7vQ9Fsh/out/external/bazel_tools/WORKSPACE (@io_bazel) does not match the name given in the repository's definition (@bazel_tools); this will cause a build error in future versions.
Build successful! Binary is here: /home-4/MyUserName/compare/output/bazel
Bazel is a fully statically linked binary, and a rather large one (~100 MB). It is quite portable, so you just need to copy it to a directory on your local default PATH, for example $HOME/opt/bin.
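A minimal install sketch (the directory name is just an example; the cp is commented out so the snippet is safe to run before the build finishes):

```shell
# Create a local bin directory for the bazel binary.
BIN_DIR="$HOME/opt/bin"
mkdir -p "$BIN_DIR"
# cp output/bazel "$BIN_DIR/"     # run from the bazel-0.4.2 directory
# export PATH="$BIN_DIR:$PATH"    # add to ~/.bashrc to make it permanent
echo "install target: $BIN_DIR"
```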
Build Tensorflow
Download TensorFlow from Github Master Branch
Run this command in any directory you prefer.
git clone https://github.com/tensorflow/tensorflow.git && cd tensorflow
Load all necessary modules
Note that you need to load these modules each time you log in to a new machine/node.
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load cudnn/5.0
module load python/2.7.10b
module load java/1.8.0_112
Modify the CROSSTOOL file.
Again, this is the same problem as when building Bazel: we must modify TensorFlow's CROSSTOOL file to pass the correct ENVs to the compiler toolchain. Since we are building TensorFlow with GPU support, the relevant CROSSTOOL file is in third_party/gpus/crosstool.
Modifications in third_party/gpus/crosstool/CROSSTOOL.tpl:
- Again, look for the toolchain block marked as toolchain_identifier: "local_linux".
- Replace the paths for cpp and binutils (same as for Bazel), but do NOT replace the path for gcc.
- Change the linker flags as follows.
- Change cxx_builtin_include_directory as follows.
tool_path { name: "ar" path: "/cm/shared/apps/binutils/2.25/src/bin/ar" }
tool_path { name: "compat-ld" path: "/cm/shared/apps/binutils/2.25/src/bin/ld" }
tool_path { name: "cpp" path: "/cm/shared/apps/gcc/4.9.2/bin/cpp" }
tool_path { name: "dwp" path: "/usr/bin/dwp" }
# As part of the TensorFlow release, we place some cuda-related compilation
# files in @local_config_cuda//crosstool/clang/bin, and this relative
# path, combined with the rest of our Bazel configuration causes our
# compilation to use those files.
tool_path { name: "gcc" path: "clang/bin/crosstool_wrapper_driver_is_not_gcc" }
# Use "-std=c++11" for nvcc. For consistency, force both the host compiler
# and the device compiler to use "-std=c++11".
cxx_flag: "-std=c++11"
linker_flag: "-L/cm/shared/apps/gcc/4.9.2/lib64"
linker_flag: "-Wl,-no-as-needed"
linker_flag: "-lstdc++"
linker_flag: "-Wl,-rpath,/cm/shared/apps/gcc/4.9.2/lib64"
# linker_flag: "-B/usr/bin/"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/include/c++/4.9.2"
Modifications in third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl. This file is essential, since it generates the compiler and linker flags for the toolchain defined in CROSSTOOL.tpl and for all compiling rules.
- Modify line 53 and line 54 to give Bazel the absolute paths of the NVCC and GCC compilers. These two lines should become:
NVCC_PATH = '/cm/shared/apps/cuda/7.5/bin/nvcc'
LLVM_HOST_COMPILER_PATH = ('/cm/shared/apps/gcc/4.9.2/bin/gcc')
- Comment out line 232:
cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd
This line creates an "as" (GNU assembler) linking problem (it links to the wrong GNU assembler), so comment it out.
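These wrapper edits can also be scripted. The sed sketch below runs against a scratch file whose three lines stand in for the real ones, since the exact original contents and line numbers vary between TensorFlow revisions; to apply it for real, point it at third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl instead.

```shell
# Scratch stand-in for the three wrapper lines we need to change.
cat > /tmp/wrapper.demo <<'EOF'
NVCC_PATH = 'placeholder'
LLVM_HOST_COMPILER_PATH = ('placeholder')
  cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd
EOF
# Point NVCC_PATH/LLVM_HOST_COMPILER_PATH at the absolute compiler
# paths and comment out the PATH-prefixing line.
sed -i \
  -e "s|^NVCC_PATH = .*|NVCC_PATH = '/cm/shared/apps/cuda/7.5/bin/nvcc'|" \
  -e "s|^LLVM_HOST_COMPILER_PATH = .*|LLVM_HOST_COMPILER_PATH = ('/cm/shared/apps/gcc/4.9.2/bin/gcc')|" \
  -e "s|^\(  *\)cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd|\1# cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd|" \
  /tmp/wrapper.demo
cat /tmp/wrapper.demo
```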
Configuration
Create a file, namely env.sh, containing only these lines:
export TF_NEED_CUDA=1
export GCC_HOST_COMPILER_PATH=/cm/shared/apps/gcc/4.9.2/bin/gcc
export CUDA_TOOLKIT_PATH=/cm/shared/apps/cuda/7.5
export TF_CUDA_VERSION="7.5"
export TF_CUDNN_VERSION=
export CUDNN_INSTALL_PATH=/cm/shared/apps/cudnn/5.0
export TF_CUDA_COMPUTE_CAPABILITIES="3.7"
Change line 25 in ./configure (or search for "bazel clean --expunge"). Change
bazel clean --expunge
to
bazel clean --expunge_async
because bazel clean --expunge no longer works with the latest version of Bazel.
Then run
source env.sh && ./configure
(source the file rather than running it with bash, so the exported variables are set in the shell that runs configure). When running ./configure, use all default options; in other words, just press Enter through to the end.
The reason we use an env.sh is that the configure program otherwise asks us to enter the gcc/nvcc/cudnn paths. By pre-defining those environment variables before running configure, it will not ask us to input anything. During configuration, Bazel fetches all external dependencies in the last step. Finally you will get some output like:
WARNING: Output base '/home-4/MyUserName/.cache/bazel/_bazel_yliu120@jhu.edu/ab212480fad2cec733167496f42a4173' is on NFS. This may lead to surprising failures and undetermined behavior.
INFO: All external dependencies fetched successfully.
Configuration finished
Change Protobuf.bzl
We make this change because of a glitch in google/protobuf (see that repository). We have to fix it before building the whole of TensorFlow. The detailed problem is described in the pull request I submitted.
Do the following steps:
- Find the file
/home-4/MyUserName/.cache/bazel/_bazel_yliu120@jhu.edu/HashCodeShownAsAbove/external/protobuf/protobuf.bzl
- Search for
ctx.action
in that file.
- Add a line
use_default_shell_env=True
in that block to make it look like this:
ctx.action(
    inputs=inputs,
    outputs=ctx.outputs.outs,
    arguments=args + import_flags + [s.path for s in srcs],
    executable=ctx.executable.protoc,
    mnemonic="ProtoCompile",
    use_default_shell_env=True,
)
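For reference, the insertion can also be done with sed. The sketch below operates on a scratch copy of the relevant lines so it runs anywhere; to apply it for real, substitute the protobuf.bzl path found above for the demo file:

```shell
# Scratch copy of the tail of the ctx.action(...) block.
cat > /tmp/protobuf.bzl.demo <<'EOF'
  ctx.action(
      executable=ctx.executable.protoc,
      mnemonic="ProtoCompile",
  )
EOF
# Append use_default_shell_env=True right after the mnemonic line.
sed -i '/mnemonic="ProtoCompile",/a\      use_default_shell_env=True,' /tmp/protobuf.bzl.demo
cat /tmp/protobuf.bzl.demo
```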
Compile TensorFlow on one of the GPU nodes
Run
bazel build -c opt --config=cuda -s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 24 --linkopt '-lrt -lm' //tensorflow/tools/pip_package:build_pip_package
Do NOT add any flags like
--copt="-DGPR_BACKWARDS_COMPATIBILITY_MODE" --conlyopt="-std=c99"
since some packages will not compile under the old C99 standard. Also, if you do exactly as above, you can delete
-s --verbose_failures
to get a faster, non-verbose compilation. Lots of warnings show up during the compilation; this doesn't matter, since the official builds (see Jenkins on GitHub/TensorFlow) print many warnings in their console logs as well.
I built TensorFlow on gpu072. Finally I got:
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 662.709s, Critical Path: 419.05s
Post Build - Install TensorFlow's python binding to local
Simply do the following steps:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
pip install --upgrade --user pip
pip install --user ~/tensorflow_pkg/*
We pass
--user
to pip because we want to install this Python package to a local directory (pip's user scheme, typically $HOME/.local). We upgrade pip before installing the TensorFlow package because the pip version on MARCC is not the latest; if your local pip is already the latest version (9.0.1), you do not need that step.