Build TensorFlow from source

A guide to building TensorFlow
This article details how to build a TensorFlow environment from scratch on a specific platform, including key steps such as configuring the necessary packages and modifying the build settings.

From: http://biophysics.med.jhmi.edu/~yliu120/tensorflow.html



TensorFlow offers an excellent framework for executing mathematical operations. Equipped with TensorFlow, many complicated machine learning models, as well as general mathematical problems, can be programmed easily and launched on hierarchical and efficient architectures (multiple CPUs and GPUs). However, TensorFlow is quite new and has not been open-sourced for long, and building it on a "non-standard" platform has proven to be a difficult task. This page provides all the necessary modifications to the current CROSSTOOL settings, plus workarounds, for building TensorFlow on MARCC. Even though MARCC has a specific architecture, many of these settings and workarounds can shed light on other environments as well.

No extra dependencies (GCC libs, libstdc++, CUDA, cuDNN, etc.) need to be compiled in this protocol; we only use the existing MARCC libraries. Here are the libraries we use:

  • GCC 4.9.2
  • libstdc++ (comes with GCC 4.9.2)
  • CUDA 7.5
  • CUDNN 5.0
  • Python 2.7.10b
  • Binutils 2.25
  • Java 1.8.0_112
Translated to MARCC module commands:
											
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load cudnn/5.0
module load python/2.7.10b
module load java/1.8.0_112
											
										

This protocol is nothing but passing the correct environment variables to Bazel and TensorFlow; all the code builds successfully once the correct ENVs are in place. The following statements are both false:

  • TensorFlow cannot be built on CentOS 6
  • TensorFlow cannot be built on the current architecture of MARCC
Build Bazel

TensorFlow should be built with Bazel; in other words, Bazel is the only build tool supported by TensorFlow. This is true for all Linux/Darwin platforms. We need to build Bazel prior to building TensorFlow.

Essentially, Bazel is the open-source version of Google's internal build tool. Bazel is a rather new project as well, so its cross-platform support is not yet elegant. However, Bazel is a really good build tool because of its scalability.

Download the latest Bazel and uncompress it

Go to Bazel's GitHub releases page and check out the latest release (version 0.4.2 at the time of writing).

										
wget https://github.com/bazelbuild/bazel/releases/download/0.4.2/bazel-0.4.2-dist.zip
mkdir -p bazel-0.4.2
cd bazel-0.4.2 && unzip ../bazel-0.4.2-dist.zip
										
									
Note: please check out the "dist.zip" archive for a release build. After version 0.3.2, Bazel's plain source-code release only builds a developer version; there are a few issues about this discussed on Bazel's GitHub Issues.
Be aware of our environment settings

In the current Bazel release, the C++ compiler toolchain is hard-coded in the code base. This should be improved in a future Bazel release, but for now we must change the cc_tool_chain rule in Bazel's CROSSTOOL files to provide the correct paths (ENVs).

To build Bazel, you should load the modules above (cuDNN is not needed here):

											
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load python/2.7.10b
module load java/1.8.0_112
											
										
After loading those modules, let's take a look at all our current ENVs:
											
[MyUserName@login-node04 xxx]$ which gcc
/cm/shared/apps/gcc/4.9.2/bin/gcc
[MyUserName@login-node04 xxx]$ which ld
/cm/shared/apps/binutils/2.25/src/bin/ld
[MyUserName@login-node04 xxx]$ which nm
/cm/shared/apps/binutils/2.25/src/bin/nm
[MyUserName@login-node04 xxx]$ which ar
/cm/shared/apps/binutils/2.25/src/bin/ar
...
											
										
You don't have to run these commands; they are shown only to illustrate the paths of our gcc and GNU binutils.
Modify the hardcoded CROSSTOOL files

Let's modify the hardcoded CROSSTOOL files at tools/cpp/CROSSTOOL and tools/cpp/cc_configure.bzl.

In the CROSSTOOL file, you only have to modify the "local_linux" toolchain section. (Search for "toolchain_identifier" and you will see a toolchain block associated with "local_linux".) This is because we are building Bazel on a Linux system. Make the following modifications step by step:

  • Change all binutils, gcc, and cpp tool paths to our paths listed above. For instance, change tool_path { name: "ar" path: "/usr/bin/ar" } to tool_path { name: "ar" path: "/cm/shared/apps/binutils/2.25/src/bin/ar" }, and change tool_path { name: "gcc" path: "/usr/bin/gcc" } to tool_path { name: "gcc" path: "/cm/shared/apps/gcc/4.9.2/bin/gcc" }.
  • Add a tool_path entry for "as" (the GNU assembler) to the tool_path bundle to prevent assembler problems: tool_path { name: "as" path: "/cm/shared/apps/binutils/2.25/src/bin/as" }
  • Change the linker_flag lines immediately below gcc's tool_path to the following (comment out the originals):
    										    		
    linker_flag: "-lstdc++, -Wl"
    										    		
    										    	
  • Change the cxx_builtin_include_directory lines right under gcc's tool_path to:
    										    		
    cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
    cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed"
    cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/include/c++/4.9.2"
    										    		
    										    	

In the cc_configure.bzl file, do the following modification:

  • Replace all occurrences of "-B/usr/bin" with "-B/cm/shared/apps/binutils/2.25/src/bin/"
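This replacement can be scripted with sed (GNU sed assumed). The sketch below demonstrates the substitution on a throwaway stand-in file rather than the real cc_configure.bzl, so the paths and file name here are illustrative only:

```shell
# Stand-in file with two occurrences of the flag being replaced:
printf 'flags = ["-B/usr/bin", "-B/usr/bin"]\n' > /tmp/cc_configure_demo.bzl
# Swap in the MARCC binutils path everywhere (GNU sed in-place edit;
# '|' is used as the delimiter because the replacement contains slashes):
sed -i 's|-B/usr/bin|-B/cm/shared/apps/binutils/2.25/src/bin/|g' /tmp/cc_configure_demo.bzl
cat /tmp/cc_configure_demo.bzl
```

Run the same sed command against tools/cpp/cc_configure.bzl in the Bazel source tree to apply the change.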
Build Bazel

Run the following:

											
export EXTRA_BAZEL_ARGS='-s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 24'
./compile.sh
											
										
Run this in an interactive job (request a full node for a fast build with interact -n 24 -p parallel). When you land on a new node, don't forget to reload all the modules.

Some suggestions:
  1. In the EXTRA_BAZEL_ARGS env we use "--jobs 24"; with more than 24 jobs the compilation may run out of memory. You can raise Java's memory limit and use a higher job count, but I suggest no more than 50.
  2. If you follow these exact protocol steps with the same version of Bazel, you do NOT need "-s --verbose_failures" in the EXTRA_BAZEL_ARGS env, and the build may go faster.

A successful build should end with output like this:

											
Target //src:bazel up-to-date:
  bazel-bin/src/bazel
INFO: Elapsed time: 63.711s, Critical Path: 49.16s
WARNING: /tmp/bazel_t7vQ9Fsh/out/external/bazel_tools/WORKSPACE:1: Workspace name in /tmp/bazel_t7vQ9Fsh/out/external/bazel_tools/WORKSPACE (@io_bazel) does not match the name given in the repository's definition (@bazel_tools); this will cause a build error in future versions.

Build successful! Binary is here: /home-4/MyUserName/compare/output/bazel
											
										
Bazel is a fully statically linked binary, and it is large (~100 MB). It is quite portable, so you just need to copy it into a directory on your default PATH, for example $HOME/opt/bin.
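The copy-to-PATH step can be sketched as below. To keep the demo self-contained, a temporary directory stands in for $HOME/opt/bin and a tiny stub script stands in for the real ~100 MB binary at bazel-0.4.2/output/bazel:

```shell
# Stand-ins for the install prefix and the built binary:
BIN_DIR="$(mktemp -d)"                       # use $HOME/opt/bin on MARCC
printf '#!/bin/sh\necho bazel-stub\n' > "$BIN_DIR/bazel"
chmod +x "$BIN_DIR/bazel"                    # stands in for: cp output/bazel $HOME/opt/bin/
# The actual step: put that directory on your PATH.
export PATH="$BIN_DIR:$PATH"
command -v bazel                             # now resolves to the freshly copied binary
```

Add the export PATH line to your shell startup file so the binary stays visible across logins.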
Build TensorFlow
Download TensorFlow from the GitHub master branch

Run this command in any directory you prefer.

											
git clone https://github.com/tensorflow/tensorflow.git && cd tensorflow
											
										
Load all necessary modules

Note that you need to reload these modules each time you log in to a new machine/node.

											
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load cudnn/5.0
module load python/2.7.10b
module load java/1.8.0_112
											
										
Modify the CROSSTOOL file

Again, the same problem as when building Bazel: we must modify TensorFlow's CROSSTOOL file to pass the correct ENVs to the compiler toolchain. Since we are building TensorFlow with GPU support, we should look at the CROSSTOOL file in third_party/gpus/crosstool.

Modifications in third_party/gpus/crosstool/CROSSTOOL.tpl:

  • Again, look for the toolchain block marked as toolchain_identifier: "local_linux":
    • Replace the paths for cpp and binutils (same as for Bazel), but do NOT replace the path for gcc.
    • Change linker flags as follows
    • Change cxx_builtin_include_directory as follows
    Finally, the above mentioned section should look like:
    											
    tool_path { name: "ar" path: "/cm/shared/apps/binutils/2.25/src/bin/ar" }
    tool_path { name: "compat-ld" path: "/cm/shared/apps/binutils/2.25/src/bin/ld" }
    tool_path { name: "cpp" path: "/cm/shared/apps/gcc/4.9.2/bin/cpp" }
    tool_path { name: "dwp" path: "/usr/bin/dwp" }
    # As part of the TensorFlow release, we place some cuda-related compilation
    # files in @local_config_cuda//crosstool/clang/bin, and this relative
    # path, combined with the rest of our Bazel configuration causes our
    # compilation to use those files.
    tool_path { name: "gcc" path: "clang/bin/crosstool_wrapper_driver_is_not_gcc" }
    # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
    # and the device compiler to use "-std=c++11".
    cxx_flag: "-std=c++11"
    linker_flag: "-L/cm/shared/apps/gcc/4.9.2/lib64"
    linker_flag: "-Wl,-no-as-needed"
    linker_flag: "-lstdc++"
    linker_flag: "-Wl,-rpath,/cm/shared/apps/gcc/4.9.2/lib64"
    
    # linker_flag: "-B/usr/bin/"
    cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
    cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed"
    cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/include/c++/4.9.2"
    											
    										

    Modifications in third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl. This file is essential, since it generates the compiler and linker flags for the toolchain defined in CROSSTOOL.tpl and for all compile rules.

    • Modify lines 53 and 54 to give Bazel the absolute paths of the NVCC and GCC compilers. These two lines should become:
      													
      NVCC_PATH = '/cm/shared/apps/cuda/7.5/bin/nvcc'
      LLVM_HOST_COMPILER_PATH = ('/cm/shared/apps/gcc/4.9.2/bin/gcc')
      													
      												
    • Comment out line 232: cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd. This line creates an "as" (GNU assembler) problem (it links against the wrong assembler), so comment it out.
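Commenting out a specific line can also be scripted with sed. Since line number 232 is specific to this TensorFlow checkout, the sketch below works on a generated stand-in file rather than the real wrapper script:

```shell
# Generate a 232-line stand-in for the wrapper script:
seq -f 'line %g' 232 > /tmp/wrapper_demo.py
# Prefix line 232 with a Python comment marker (GNU sed in-place edit):
sed -i '232s/^/# /' /tmp/wrapper_demo.py
# Show the commented-out line:
sed -n '232p' /tmp/wrapper_demo.py
```

Point the same sed command at third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl after verifying the line number in your checkout.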
    Configuration

    Create a file named env.sh containing only these lines:

    											
    export TF_NEED_CUDA=1
    export GCC_HOST_COMPILER_PATH=/cm/shared/apps/gcc/4.9.2/bin/gcc
    export CUDA_TOOLKIT_PATH=/cm/shared/apps/cuda/7.5
    export TF_CUDA_VERSION="7.5"
    export TF_CUDNN_VERSION=
    export CUDNN_INSTALL_PATH=/cm/shared/apps/cudnn/5.0
    export TF_CUDA_COMPUTE_CAPABILITIES="3.7"
    											
    										

    Change line 25 in ./configure (or search for bazel clean --expunge): change bazel clean --expunge to bazel clean --expunge_async, since bazel clean --expunge no longer works with the latest version of Bazel.
    Then run source env.sh && ./configure. When running ./configure, use all default options; in other words, press Enter through to the end.

    We use an env.sh because the configure program always asks us to enter the gcc/nvcc/cudnn paths. By pre-defining those necessary environment variables before running configure, it will not ask us to input anything.
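One subtlety worth demonstrating: env.sh must be sourced into the current shell, not executed as a script, or the exported variables never reach ./configure. A self-contained demo of the difference, using one of the variables above on a stand-in file:

```shell
# Stand-in env file exporting one of the variables above:
printf 'export TF_NEED_CUDA=1\n' > /tmp/env_demo.sh
unset TF_NEED_CUDA
bash /tmp/env_demo.sh                 # runs in a child shell: the export is lost
echo "after bash: [${TF_NEED_CUDA:-}]"
. /tmp/env_demo.sh                    # "." is the POSIX spelling of source:
echo "after source: [${TF_NEED_CUDA:-}]"   # the export now sticks
```

The first echo prints an empty value and the second prints 1, which is why configure only sees the variables when env.sh is sourced.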

    During configuration, Bazel will fetch all external dependencies in the last step. Finally you will get output like:

    											
    WARNING: Output base '/home-4/MyUserName/.cache/bazel/_bazel_yliu120@jhu.edu/ab212480fad2cec733167496f42a4173' is on NFS. This may lead to surprising failures and undetermined behavior.
    INFO: All external dependencies fetched successfully.
    Configuration finished
    											
    										
    Change Protobuf.bzl

    We make this change because of a glitch in google/protobuf (see that repository); we have to modify it before building the entire TensorFlow. The detailed problem is stated in a pull request I submitted.

    Do the following steps:

    • Find the file /home-4/MyUserName/.cache/bazel/_bazel_yliu120@jhu.edu/HashCodeShownAsAbove/external/protobuf/protobuf.bzl
    • Search for ctx.action in that file.
    • Add the line use_default_shell_env=True to that block so it looks like this:
      												
          ctx.action(
              inputs=inputs,
              outputs=ctx.outputs.outs,
              arguments=args + import_flags + [s.path for s in srcs],
              executable=ctx.executable.protoc,
              mnemonic="ProtoCompile",
              use_default_shell_env=True,
          )
      												
      											
      By doing the above, Bazel will pass envs to the protoc compiler when compiling protos.
    Compile TensorFlow on one of the GPU nodes

    Run bazel build -c opt --config=cuda -s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 24 --linkopt '-lrt -lm' //tensorflow/tools/pip_package:build_pip_package

    Do NOT add flags like --copt="-DGPR_BACKWARDS_COMPATIBILITY_MODE" --conlyopt="-std=c99"; some packages will not compile under the old C99 standard. Also, if you follow the above exactly, you can delete -s --verbose_failures to get a faster, non-verbose compilation. Lots of warnings show up during compilation; this doesn't matter, since the official builds (see Jenkins on GitHub/TensorFlow) print many warnings in their console logs as well.

    I built TensorFlow on gpu072. Finally I got

    										
    Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
      bazel-bin/tensorflow/tools/pip_package/build_pip_package
    INFO: Elapsed time: 662.709s, Critical Path: 419.05s
    										
    									
    It took only about 11 minutes to build the entire TensorFlow with 24 CPUs.
    Post build: install TensorFlow's Python binding locally

    Simply do the following step:

    • bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
    • pip install --upgrade --user pip
    • pip install --user ~/tensorflow_pkg/*
    We add --user to pip because we want to install this Python package into a local directory (under $HOME/.local). We upgrade pip before installing the TensorFlow package since the pip version on MARCC is not the latest; if your local pip is already at the latest version (9.0.1), you do not need that step.
    Major Reference