A
This repo summarizes some techniques for optimizing TensorFlow code. The official document describing a collection of best practices can be found here; it is very helpful to read it before getting started.
A Dockerfile containing all of the libraries/packages introduced here is provided. It shows how to install the libraries/packages listed below.
First of all, it is important to find out whether the CPU is bottlenecking the GPU or vice versa (a quick check is to run `nvidia-smi` and watch GPU utilization). If the GPU is the bottleneck, optimization is relatively easy; it is more complicated if the CPU is your bottleneck.
Overall, I got a 1.5~2.0x performance gain by applying everything below.
If GPUs are fully utilized
- Use the `NCHW` data format for 4D tensors.
  - The native data format of the cuDNN library is `NCHW`. The performance gain grows as you add more layers.
  - If you use this format, using `_fused_batch_norm` is mandatory. Otherwise, your code will be almost 10x slower, since `nn.moments` cannot deal with this format efficiently.
  - Several preprocessing ops support only the `HWC` format, so tensors have to be transposed somewhere. If your input pipeline is a bottleneck, it is better to do the transpose on the GPU.
- Use fused batch norm.
  - Whatever your data format is, it is better to use fused batch norm (both tips are combined in the sketch right after this list).
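To make the two tips above concrete, here is a minimal sketch in the TensorFlow 1.x Python API (the function and variable names are made up for illustration): it transposes the input from `NHWC` to `NCHW` once, then applies a convolution and `tf.nn.fused_batch_norm` instead of an `nn.moments`-based batch norm.

```python
import tensorflow as tf

def conv_bn_relu_nchw(images_nhwc, filters):
    """Hypothetical conv block: transpose once to NCHW, then use fused batch norm."""
    # Transpose once from NHWC (preprocessing layout) to NCHW (cuDNN layout).
    # Called under a GPU device scope, this transpose runs on the GPU.
    x = tf.transpose(images_nhwc, [0, 3, 1, 2])

    in_channels = x.get_shape().as_list()[1]
    kernel = tf.get_variable('kernel', [3, 3, in_channels, filters])
    x = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1],
                     padding='SAME', data_format='NCHW')

    # Fused batch norm handles NCHW directly; an nn.moments-based batch norm
    # would be far slower here. Moving-average updates for inference are
    # omitted to keep the sketch short.
    scale = tf.get_variable('scale', [filters],
                            initializer=tf.ones_initializer())
    offset = tf.get_variable('offset', [filters],
                             initializer=tf.zeros_initializer())
    y, _, _ = tf.nn.fused_batch_norm(x, scale, offset,
                                     data_format='NCHW', is_training=True)
    return tf.nn.relu(y)
```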
If CPUs are your bottleneck
- Utilize queues for the input pipeline.
  - First, you have to use queues for reading and fetching input data. Please refer to the Reading Data guide and the `batch_inputs` function in the Inception code (a minimal queue-based pipeline is sketched at the end of this section).
  - CAREFULLY allocate threads for reading and for preprocessing. The right split depends entirely on your machine: how many threads can you use, can you read from an SSD, etc.
- Use TCMalloc.
  - TCMalloc is faster for multi-threaded programs.
  - It is therefore especially effective if you use multiple threads for the input pipeline.
  - Relevant issues or comments: here, here.
- Use advanced instructions (SSE, AVX, FMA) on Intel CPUs.
  - With the prebuilt TensorFlow v1.0.0 binaries, you may see the following warnings when you run your code:

    ```
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
    ```

  - To use these instructions, you have to build TensorFlow from source. The simplest way is to build this Dockerfile.
  - Relevant issues or comments: here, here, here.
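To illustrate the queue-based input pipeline mentioned above, here is a minimal sketch in the TensorFlow 1.x API. The file pattern, feature keys, image size, and thread/capacity numbers are placeholders to tune for your own data, disk, and CPU core count.

```python
import tensorflow as tf

# Hypothetical TFRecord files and feature keys; adjust to your dataset.
train_files = tf.gfile.Glob('/data/train-*.tfrecord')

filename_queue = tf.train.string_input_producer(train_files, shuffle=True)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized,
    features={'image/encoded': tf.FixedLenFeature([], tf.string),
              'image/label': tf.FixedLenFeature([], tf.int64)})

# CPU-side decoding and preprocessing (these image ops expect HWC tensors).
image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
image = tf.image.resize_images(image, [224, 224])
image = tf.image.random_flip_left_right(image)

# shuffle_batch runs `num_threads` preprocessing threads that fill a queue;
# tune num_threads and capacity to your machine.
images, labels = tf.train.shuffle_batch(
    [image, features['image/label']], batch_size=32, num_threads=8,
    capacity=2000 + 3 * 32, min_after_dequeue=2000)

# In the training loop, the queue runners must be started, e.g.:
#   coord = tf.train.Coordinator()
#   threads = tf.train.start_queue_runners(sess=sess, coord=coord)
```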
B
- Shrink the TensorFlow libraries. We can rebuild the TensorFlow libraries so that they only include the ops used by your specific model. My tutorial on how to shrink the TensorFlow library for Android shows how to reduce the library from 9.7MB down to 2.5MB! Even better, this optimization does not affect the accuracy of the model. :-)
- Choose a smaller model. For example, you can change your model from Inception to MobileNet. This can shrink an image classifier down from 53MB to around 1MB, though you will likely have some loss of accuracy.
- Shrink the model. We can remove unused ops from the model. This is part of the normal process of building for mobile, so you are probably already doing this. Another option is to quantize the model. Quantizing does reduce your model's accuracy somewhat, but it can shrink the model down to about 25% of its original size (see the sketch after this list). I plan to write a blog post on that next. :-)
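As a rough sketch of the "remove unused ops, then quantize" step, the snippet below uses the Graph Transform Tool's Python wrapper (`tensorflow.tools.graph_transforms.TransformGraph`), assuming it is available in your TensorFlow build. The file paths, input/output node names, and input shape are placeholders for your own frozen model.

```python
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# Hypothetical paths and node names; replace with the ones from your model.
INPUT_GRAPH = 'frozen_classifier.pb'
OUTPUT_GRAPH = 'quantized_classifier.pb'
INPUT_NODES = ['input']
OUTPUT_NODES = ['softmax']

# Load the frozen GraphDef.
graph_def = tf.GraphDef()
with tf.gfile.GFile(INPUT_GRAPH, 'rb') as f:
    graph_def.ParseFromString(f.read())

# strip_unused_nodes / remove_nodes drop ops that are not needed at inference
# time; quantize_weights stores weights as 8-bit values, roughly quartering
# the file size at some cost in accuracy.
transforms = [
    'strip_unused_nodes(type=float, shape="1,224,224,3")',
    'remove_nodes(op=Identity, op=CheckNumerics)',
    'fold_constants(ignore_errors=true)',
    'fold_batch_norms',
    'quantize_weights',
]
optimized_graph_def = TransformGraph(graph_def, INPUT_NODES, OUTPUT_NODES,
                                     transforms)

with tf.gfile.GFile(OUTPUT_GRAPH, 'wb') as f:
    f.write(optimized_graph_def.SerializeToString())
```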