A
This repo summarizes some techniques for optimizing TensorFlow code. The official document describing a collection of best practices can be found here; it is very helpful to read it before getting started.
A Dockerfile containing all of the libraries/packages introduced here is provided. It shows how to install the libraries/packages listed below.
First of all, it is important to find out whether the CPU is bottlenecking the GPU or vice versa (a quick check is to run `nvidia-smi` and watch GPU utilization). If the GPU is the bottleneck, optimization is relatively easy; it is more complicated if the CPU is your bottleneck.
Overall, I got a 1.5~2.0x performance gain by applying everything below.
If GPUs are fully utilized
- Use the `NCHW` data format for 4D tensors.
  - The native data format of the cuDNN library is `NCHW`. The performance gain grows as you add more layers.
  - If you use this format, using `_fused_batch_norm` is mandatory. Otherwise, your code will be almost 10x slower, since `nn.moments` cannot deal with this format efficiently.
  - Several preprocessing ops support only the `HWC` format, so tensors have to be transposed somewhere. If your input pipeline is a bottleneck, it is better to do the transpose on the GPU.
- Use fused batch norm.
  - Whatever your data format is, it is better to use fused batch norm (both tips are combined in the sketch right after this list).
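To make the two tips above concrete, here is a minimal sketch in the TensorFlow 1.x Python API (the function and variable names are made up for illustration): it transposes the input from `NHWC` to `NCHW` once, then applies a convolution and `tf.nn.fused_batch_norm` instead of an `nn.moments`-based batch norm.

```python
import tensorflow as tf

def conv_bn_relu_nchw(images_nhwc, filters):
    """Hypothetical conv block: transpose once to NCHW, then use fused batch norm."""
    # Transpose once from NHWC (preprocessing layout) to NCHW (cuDNN layout).
    # Called under a GPU device scope, this transpose runs on the GPU.
    x = tf.transpose(images_nhwc, [0, 3, 1, 2])

    in_channels = x.get_shape().as_list()[1]
    kernel = tf.get_variable('kernel', [3, 3, in_channels, filters])
    x = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1],
                     padding='SAME', data_format='NCHW')

    # Fused batch norm handles NCHW directly; an nn.moments-based batch norm
    # would be far slower here. Moving-average updates for inference are
    # omitted to keep the sketch short.
    scale = tf.get_variable('scale', [filters],
                            initializer=tf.ones_initializer())
    offset = tf.get_variable('offset', [filters],
                             initializer=tf.zeros_initializer())
    y, _, _ = tf.nn.fused_batch_norm(x, scale, offset,
                                     data_format='NCHW', is_training=True)
    return tf.nn.relu(y)
```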
If CPUs are your bottleneck
- Utilize queues for the input pipeline.
  - First, you have to use queues for reading and fetching input data. Please refer to the Reading Data guide and the `batch_inputs` function in the Inception code (a minimal queue-based pipeline is sketched at the end of this section).
  - CAREFULLY allocate threads for reading and for preprocessing. The right split depends entirely on your machine: how many threads can you use, can you read from an SSD, etc.
- Use TCMalloc.
  - TCMalloc is faster for multi-threaded programs.
  - It is therefore especially effective if you use multiple threads for the input pipeline.
  - Relevant issues or comments: here, here.
- Use advanced instructions (SSE, AVX, FMA) on Intel CPUs.
  - With the prebuilt TensorFlow v1.0.0 binaries, you may see the following warnings when you run your code:

    ```
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
    tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
    ```

  - To use these instructions, you have to build TensorFlow from source. The simplest way is to build this Dockerfile.
  - Relevant issues or comments: here, here, here.
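To illustrate the queue-based input pipeline mentioned above, here is a minimal sketch in the TensorFlow 1.x API. The file pattern, feature keys, image size, and thread/capacity numbers are placeholders to tune for your own data, disk, and CPU core count.

```python
import tensorflow as tf

# Hypothetical TFRecord files and feature keys; adjust to your dataset.
train_files = tf.gfile.Glob('/data/train-*.tfrecord')

filename_queue = tf.train.string_input_producer(train_files, shuffle=True)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized,
    features={'image/encoded': tf.FixedLenFeature([], tf.string),
              'image/label': tf.FixedLenFeature([], tf.int64)})

# CPU-side decoding and preprocessing (these image ops expect HWC tensors).
image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
image = tf.image.resize_images(image, [224, 224])
image = tf.image.random_flip_left_right(image)

# shuffle_batch runs `num_threads` preprocessing threads that fill a queue;
# tune num_threads and capacity to your machine.
images, labels = tf.train.shuffle_batch(
    [image, features['image/label']], batch_size=32, num_threads=8,
    capacity=2000 + 3 * 32, min_after_dequeue=2000)

# In the training loop, the queue runners must be started, e.g.:
#   coord = tf.train.Coordinator()
#   threads = tf.train.start_queue_runners(sess=sess, coord=coord)
```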
B
- Shrink the TensorFlow libraries. We can rebuild the TensorFlow libraries so that they only include the ops used by your specific model. My tutorial on how to shrink the TensorFlow library for Android shows how to reduce the library from 9.7MB down to 2.5MB! Even better, this optimization does not affect the accuracy of the model. :-)
- Choose a smaller model. For example, you can change your model from Inception to MobileNet. This can shrink an image classifier down from 53MB to around 1MB, though you will likely have some loss of accuracy.
- Shrink the model. We can remove unused ops from the model. This is part of the normal process of building for mobile, so you are probably already doing this. Another option is to quantize the model. Quantizing does reduce your model's accuracy somewhat, but it can shrink the model down to about 25% of its original size (see the sketch after this list). I plan to write a blog post on that next. :-)
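As a rough sketch of the "remove unused ops, then quantize" step, the snippet below uses the Graph Transform Tool's Python wrapper (`tensorflow.tools.graph_transforms.TransformGraph`), assuming it is available in your TensorFlow build. The file paths, input/output node names, and input shape are placeholders for your own frozen model.

```python
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# Hypothetical paths and node names; replace with the ones from your model.
INPUT_GRAPH = 'frozen_classifier.pb'
OUTPUT_GRAPH = 'quantized_classifier.pb'
INPUT_NODES = ['input']
OUTPUT_NODES = ['softmax']

# Load the frozen GraphDef.
graph_def = tf.GraphDef()
with tf.gfile.GFile(INPUT_GRAPH, 'rb') as f:
    graph_def.ParseFromString(f.read())

# strip_unused_nodes / remove_nodes drop ops that are not needed at inference
# time; quantize_weights stores weights as 8-bit values, roughly quartering
# the file size at some cost in accuracy.
transforms = [
    'strip_unused_nodes(type=float, shape="1,224,224,3")',
    'remove_nodes(op=Identity, op=CheckNumerics)',
    'fold_constants(ignore_errors=true)',
    'fold_batch_norms',
    'quantize_weights',
]
optimized_graph_def = TransformGraph(graph_def, INPUT_NODES, OUTPUT_NODES,
                                     transforms)

with tf.gfile.GFile(OUTPUT_GRAPH, 'wb') as f:
    f.write(optimized_graph_def.SerializeToString())
```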