From: http://biophysics.med.jhmi.edu/~yliu120/tensorflow.html
TensorFlow offers an excellent framework for executing mathematical operations. Equipped with TensorFlow, many complicated machine learning models, as well as general mathematical problems, can be programmed easily and launched on hierarchical and efficient architectures (multi-CPU and multi-GPU). However, TensorFlow is quite new and has not been open-sourced for long, and building it on a "non-standard" platform has proven to be a difficult task. This page provides all the necessary modifications to the current CROSSTOOL settings, plus workarounds, for building TensorFlow on MARCC. Even though MARCC has a specific architecture, many of these settings and workarounds can shed light on other environments as well.
No extra dependencies (GCC libs, libstdc++, CUDA, cuDNN, etc.) need to be compiled in this protocol. We only use the existing MARCC libraries. Here are the libraries we use:
- GCC 4.9.2
- libstdc++ (comes with GCC 4.9.2)
- CUDA 7.5
- CUDNN 5.0
- Python 2.7.10b
- Binutils 2.25
- Java 1.8.0_112
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load cudnn/5.0
module load python/2.7.10b
module load java/1.8.0_112
This protocol is nothing but passing the correct environment variables to Bazel and TensorFlow; all of the code builds successfully once the correct ENVs are in place. The following statements are both false:
- TensorFlow won't be built on CentOS 6
- TensorFlow won't be built on current arch of MARCC
Build Bazel
TensorFlow must be built with Bazel. In other words, Bazel is the only build tool supported by TensorFlow; this is true for all Linux/Darwin platforms. We therefore need to build Bazel prior to building TensorFlow.
Essentially, Bazel is the open-source version of Google's internal build tool. Bazel is a rather new project as well, so its cross-platform support is not yet elegant. However, Bazel is a really good build tool thanks to its scalability.
Download the latest Bazel and Uncompress it
Go to Bazel's GitHub releases page and check out the latest release (version 0.4.2 at the time of writing).
wget https://github.com/bazelbuild/bazel/releases/download/0.4.2/bazel-0.4.2-dist.zip
mkdir -p bazel-0.4.2
cd bazel-0.4.2 && unzip ../bazel-0.4.2-dist.zip
Note: please use the "dist.zip" archive for a release build. After version 0.3.2, Bazel's source-code release only builds a developer version; a few related issues are discussed on Bazel's GitHub issue tracker.
Be aware of our environment settings
In the current Bazel release, the C++ compiler toolchain is hard-coded in the code base. This should be improved in a future Bazel release, but for now we must change the cc toolchain rules in Bazel's CROSSTOOL files to provide the correct paths (ENVs).
To build Bazel, load the modules listed above (cuDNN is not needed here):
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load python/2.7.10b
module load java/1.8.0_112
After loading those modules, we can take a look at our current ENVs:
[MyUserName@login-node04 xxx]$ which gcc
/cm/shared/apps/gcc/4.9.2/bin/gcc
[MyUserName@login-node04 xxx]$ which ld
/cm/shared/apps/binutils/2.25/src/bin/ld
[MyUserName@login-node04 xxx]$ which nm
/cm/shared/apps/binutils/2.25/src/bin/nm
[MyUserName@login-node04 xxx]$ which ar
/cm/shared/apps/binutils/2.25/src/bin/ar
...
You don't have to do this; the above is shown only to illustrate the paths of our gcc and GNU binutils.
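If you want to record these locations for later (when editing the CROSSTOOL files), a small helper loop works; this snippet is a convenience sketch, not part of the original protocol:

```shell
# Print and save the resolved path of every build tool the modules put
# on PATH; a missing tool simply reports "not found".
for tool in gcc cpp ld nm ar as; do
  printf '%s -> %s\n' "$tool" "$(command -v "$tool" 2>/dev/null || echo 'not found')"
done | tee /tmp/toolpaths.txt
```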
Modify the hardcoded CROSSTOOL files
Let's modify the hardcoded CROSSTOOL files at tools/cpp/CROSSTOOL and tools/cpp/cc_configure.bzl.
In the CROSSTOOL file, you only have to modify the "local_linux" toolchain section. (Search for "toolchain_identifier" and you will see a toolchain block associated with "local_linux".) This is because we are building Bazel on a Linux system. Make the following modifications step by step:
- Change all binutils, gcc, and cpp tool paths to our paths listed above. For instance, change
tool_path { name: "ar" path: "/usr/bin/ar" }
to
tool_path { name: "ar" path: "/cm/shared/apps/binutils/2.25/src/bin/ar" }
and change
tool_path { name: "gcc" path: "/usr/bin/gcc" }
to
tool_path { name: "gcc" path: "/cm/shared/apps/gcc/4.9.2/bin/gcc" }
- Add a tool_path for "as" (the GNU assembler) to the tool_path bundle, to prevent assembler problems:
tool_path { name: "as" path: "/cm/shared/apps/binutils/2.25/src/bin/as" }
- Change only the linker flag lines right under the gcc tool_path (comment out the originals) to:
linker_flag: "-lstdc++"
linker_flag: "-Wl,-rpath,/cm/shared/apps/gcc/4.9.2/lib64"
- Change the cxx_builtin_include_directory entries right under the gcc tool_path to:
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/include/c++/4.9.2"
In the cc_configure.bzl file, make the following modification:
- Replace all occurrences of
"-B/usr/bin"
with
"-B/cm/shared/apps/binutils/2.25/src/bin/"
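If you prefer not to edit by hand, both files can be patched with sed. The sketch below is hedged: it runs on a scratch copy (/tmp/crosstool.demo) so it can be tried safely anywhere; in practice you would apply the same expressions to tools/cpp/CROSSTOOL and tools/cpp/cc_configure.bzl, substituting your own site's paths for the MARCC ones.

```shell
# Scratch copy mimicking the stock entries we need to change.
cat > /tmp/crosstool.demo <<'EOF'
tool_path { name: "ar" path: "/usr/bin/ar" }
tool_path { name: "gcc" path: "/usr/bin/gcc" }
"-B/usr/bin"
EOF
# Rewrite the stock /usr/bin paths to the MARCC module paths.
sed -i \
  -e 's|/usr/bin/ar|/cm/shared/apps/binutils/2.25/src/bin/ar|' \
  -e 's|/usr/bin/gcc|/cm/shared/apps/gcc/4.9.2/bin/gcc|' \
  -e 's|-B/usr/bin|-B/cm/shared/apps/binutils/2.25/src/bin/|' \
  /tmp/crosstool.demo
cat /tmp/crosstool.demo
```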
Build Bazel
Run the following:
export EXTRA_BAZEL_ARGS='-s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 24'
./compile.sh
Run this inside an interactive job (request one node for a fast build with
interact -n 24 -p parallel
). When entering a new node, don't forget to reload all the modules.
Some suggestions:
- In the EXTRA_BAZEL_ARGS env we use "--jobs 24"; with more than 24 jobs the compilation may run out of memory. You can raise Java's memory limit and use a higher job count, but I suggest no more than 50.
- If you follow the exact protocol steps and use the same version of Bazel, you do NOT need "-s --verbose_failures" in the EXTRA_BAZEL_ARGS env, and the build may go faster without them.
A successful build should end with final output like this:
Target //src:bazel up-to-date:
bazel-bin/src/bazel
INFO: Elapsed time: 63.711s, Critical Path: 49.16s
WARNING: /tmp/bazel_t7vQ9Fsh/out/external/bazel_tools/WORKSPACE:1: Workspace name in /tmp/bazel_t7vQ9Fsh/out/external/bazel_tools/WORKSPACE (@io_bazel) does not match the name given in the repository's definition (@bazel_tools); this will cause a build error in future versions.
Build successful! Binary is here: /home-4/MyUserName/compare/output/bazel
Bazel is a fully statically linked binary, and a rather large one (~100 MB). It is quite portable, so you just need to copy it to a directory on your local default PATH, for example $HOME/opt/bin.
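A minimal install sketch (the directory name is just an example; the cp is commented out so the snippet is safe to run before the build finishes):

```shell
# Create a local bin directory for the bazel binary.
BIN_DIR="$HOME/opt/bin"
mkdir -p "$BIN_DIR"
# cp output/bazel "$BIN_DIR/"     # run from the bazel-0.4.2 directory
# export PATH="$BIN_DIR:$PATH"    # add to ~/.bashrc to make it permanent
echo "install target: $BIN_DIR"
```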
Build Tensorflow
Download TensorFlow from Github Master Branch
Run this command in any directory you prefer.
git clone https://github.com/tensorflow/tensorflow.git && cd tensorflow
Load all necessary modules
Note that you need to load these modules each time you log in to a new machine/node.
module load binutils
module load gcc/4.9.2
module load cuda/7.5
module load cudnn/5.0
module load python/2.7.10b
module load java/1.8.0_112
Modify the CROSSTOOL file.
Again, this is the same problem as when building Bazel: we must modify TensorFlow's CROSSTOOL file to pass the correct ENVs to the compiler toolchain. Since we are building TensorFlow with GPU support, the relevant CROSSTOOL file is in third_party/gpus/crosstool.
Modifications in third_party/gpus/crosstool/CROSSTOOL.tpl:
- Again, look for the toolchain block marked as toolchain_identifier: "local_linux".
- Replace the paths for cpp and binutils (same as for Bazel), but do NOT replace the path for gcc.
- Change the linker flags as follows.
- Change cxx_builtin_include_directory as follows.
tool_path { name: "ar" path: "/cm/shared/apps/binutils/2.25/src/bin/ar" }
tool_path { name: "compat-ld" path: "/cm/shared/apps/binutils/2.25/src/bin/ld" }
tool_path { name: "cpp" path: "/cm/shared/apps/gcc/4.9.2/bin/cpp" }
tool_path { name: "dwp" path: "/usr/bin/dwp" }
# As part of the TensorFlow release, we place some cuda-related compilation
# files in @local_config_cuda//crosstool/clang/bin, and this relative
# path, combined with the rest of our Bazel configuration causes our
# compilation to use those files.
tool_path { name: "gcc" path: "clang/bin/crosstool_wrapper_driver_is_not_gcc" }
# Use "-std=c++11" for nvcc. For consistency, force both the host compiler
# and the device compiler to use "-std=c++11".
cxx_flag: "-std=c++11"
linker_flag: "-L/cm/shared/apps/gcc/4.9.2/lib64"
linker_flag: "-Wl,-no-as-needed"
linker_flag: "-lstdc++"
linker_flag: "-Wl,-rpath,/cm/shared/apps/gcc/4.9.2/lib64"
# linker_flag: "-B/usr/bin/"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/lib/gcc/x86_64-unknown-linux-gnu/4.9.2/include-fixed"
cxx_builtin_include_directory: "/cm/shared/apps/gcc/4.9.2/include/c++/4.9.2"
Modifications in third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl. This file is essential, since it generates the compiler and linker flags for the toolchain defined in CROSSTOOL.tpl and for all compiling rules.
- Modify line 53 and line 54 to give Bazel the absolute paths of the NVCC and GCC compilers. These two lines should become:
NVCC_PATH = '/cm/shared/apps/cuda/7.5/bin/nvcc'
LLVM_HOST_COMPILER_PATH = ('/cm/shared/apps/gcc/4.9.2/bin/gcc')
- Comment out line 232:
cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd
This line creates an "as" (GNU assembler) linking problem (it links to the wrong GNU assembler), so comment it out.
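These wrapper edits can also be scripted. The sed sketch below runs against a scratch file whose three lines stand in for the real ones, since the exact original contents and line numbers vary between TensorFlow revisions; to apply it for real, point it at third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc.tpl instead.

```shell
# Scratch stand-in for the three wrapper lines we need to change.
cat > /tmp/wrapper.demo <<'EOF'
NVCC_PATH = 'placeholder'
LLVM_HOST_COMPILER_PATH = ('placeholder')
  cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd
EOF
# Point NVCC_PATH/LLVM_HOST_COMPILER_PATH at the absolute compiler
# paths and comment out the PATH-prefixing line.
sed -i \
  -e "s|^NVCC_PATH = .*|NVCC_PATH = '/cm/shared/apps/cuda/7.5/bin/nvcc'|" \
  -e "s|^LLVM_HOST_COMPILER_PATH = .*|LLVM_HOST_COMPILER_PATH = ('/cm/shared/apps/gcc/4.9.2/bin/gcc')|" \
  -e "s|^\(  *\)cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd|\1# cmd = 'PATH=' + PREFIX_DIR + ' ' + cmd|" \
  /tmp/wrapper.demo
cat /tmp/wrapper.demo
```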
Configuration
Create a file, namely env.sh, containing only these lines:
export TF_NEED_CUDA=1
export GCC_HOST_COMPILER_PATH=/cm/shared/apps/gcc/4.9.2/bin/gcc
export CUDA_TOOLKIT_PATH=/cm/shared/apps/cuda/7.5
export TF_CUDA_VERSION="7.5"
export TF_CUDNN_VERSION=
export CUDNN_INSTALL_PATH=/cm/shared/apps/cudnn/5.0
export TF_CUDA_COMPUTE_CAPABILITIES="3.7"
Change line 25 in ./configure (or search for "bazel clean --expunge"). Change
bazel clean --expunge
to
bazel clean --expunge_async
because bazel clean --expunge no longer works with the latest version of Bazel.
Then run
source env.sh && ./configure
(source the file rather than running it with bash, so the exported variables are set in the shell that runs configure). When running ./configure, use all default options; in other words, just press Enter through to the end.
The reason we use an env.sh is that the configure program otherwise asks us to enter the gcc/nvcc/cudnn paths. By pre-defining those environment variables before running configure, it will not ask us to input anything. During configuration, Bazel fetches all external dependencies in the last step. Finally you will get some output like:
WARNING: Output base '/home-4/MyUserName/.cache/bazel/_bazel_yliu120@jhu.edu/ab212480fad2cec733167496f42a4173' is on NFS. This may lead to surprising failures and undetermined behavior.
INFO: All external dependencies fetched successfully.
Configuration finished
Change Protobuf.bzl
We make this change because of a glitch in google/protobuf (see that repository). We have to fix it before building the whole of TensorFlow. The detailed problem is described in the pull request I submitted.
Do the following steps:
- Find the file
/home-4/MyUserName/.cache/bazel/_bazel_yliu120@jhu.edu/HashCodeShownAsAbove/external/protobuf/protobuf.bzl
- Search for
ctx.action
in that file.
- Add a line
use_default_shell_env=True
in that block to make it look like this:
ctx.action(
    inputs=inputs,
    outputs=ctx.outputs.outs,
    arguments=args + import_flags + [s.path for s in srcs],
    executable=ctx.executable.protoc,
    mnemonic="ProtoCompile",
    use_default_shell_env=True,
)
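For reference, the insertion can also be done with sed. The sketch below operates on a scratch copy of the relevant lines so it runs anywhere; to apply it for real, substitute the protobuf.bzl path found above for the demo file:

```shell
# Scratch copy of the tail of the ctx.action(...) block.
cat > /tmp/protobuf.bzl.demo <<'EOF'
  ctx.action(
      executable=ctx.executable.protoc,
      mnemonic="ProtoCompile",
  )
EOF
# Append use_default_shell_env=True right after the mnemonic line.
sed -i '/mnemonic="ProtoCompile",/a\      use_default_shell_env=True,' /tmp/protobuf.bzl.demo
cat /tmp/protobuf.bzl.demo
```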
Compile TensorFlow on one of the GPU nodes
Run
bazel build -c opt --config=cuda -s --verbose_failures --ignore_unsupported_sandboxing --genrule_strategy=standalone --spawn_strategy=standalone --jobs 24 --linkopt '-lrt -lm' //tensorflow/tools/pip_package:build_pip_package
Do NOT add any flags like
--copt="-DGPR_BACKWARDS_COMPATIBILITY_MODE" --conlyopt="-std=c99"
since some packages will not compile under the old C99 standard. Also, if you do exactly as above, you can delete
-s --verbose_failures
to get a faster, non-verbose compilation. Lots of warnings show up during the compilation; this doesn't matter, since the official builds (see Jenkins on GitHub/TensorFlow) print many warnings in their console logs as well.
I built TensorFlow on gpu072. Finally I got:
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 662.709s, Critical Path: 419.05s
Post Build - Install TensorFlow's python binding to local
Simply do the following steps:
bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
pip install --upgrade --user pip
pip install --user ~/tensorflow_pkg/*
We pass
--user
to pip because we want to install this Python package to a local directory (pip's user scheme, typically $HOME/.local). We upgrade pip before installing the TensorFlow package because the pip version on MARCC is not the latest; if your local pip is already the latest version (9.0.1), you do not need that step.