Search keywords: quantize tensorflow
一、Question 1: How does TensorFlow do quantization and dequantization?
Details
According to the blog post https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/ *(key reference!)*, TensorFlow quantizes values before they go into a layer. After being processed by the layer, the values are dequantized. TensorFlow quantizes values by rescaling them to the range 0 to 255, so it needs to keep "min" and "max" in order to dequantize the values.
I would like to ask: 1. How are the "min" and "max" in the outputs of a "quantization" op determined? I mean, if we simply find the minimum and maximum values and map them to 0 and 255, we will get overflow or underflow when doing convolution. 2. How are the "min" and "max" in the outputs of a "convolution" op determined? Both weights and activations are quantized, so there are two sets of "min" and "max". How does a convolution op combine them to form a single set of "min" and "max"?
Answer
TensorFlow uses, among other libraries, gemmlowp for low-precision matrix multiplications. Although 8-bit values are used as inputs, intermediate results are 32-bit values. These 32-bit values are converted back to 8-bit before the results are returned.
From https://github.com/google/gemmlowp/blob/master/doc/low-precision.md :
To avoid overflow, we internally accumulate results on more than 8 bits, and at the end we keep only some significant 8 bits.
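To make this concrete, here is a minimal NumPy sketch of the two-stage arithmetic (illustrative only, not the actual gemmlowp or TensorFlow code; in particular, picking the output range from the accumulator's observed min/max is a simplification of what the real ops do):

```python
import numpy as np

# Two uint8-quantized matrices (codes only; their float "min"/"max" live alongside them).
qa = np.random.randint(0, 256, size=(4, 512), dtype=np.uint8)
qb = np.random.randint(0, 256, size=(512, 3), dtype=np.uint8)

# gemmlowp-style accumulation: each 8-bit x 8-bit product fits in 16 bits, and summing
# 512 of them stays far below the int32 limit (255 * 255 * 512 ~= 3.3e7 << 2^31).
acc = qa.astype(np.int32) @ qb.astype(np.int32)

# Only afterwards is the 32-bit accumulator rescaled back to "some significant 8 bits"
# against an output range, rather than being clamped during the multiplication itself.
lo, hi = int(acc.min()), int(acc.max())
q_out = np.round((acc - lo) / (hi - lo) * 255).astype(np.uint8)
```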
二、How to Quantize Neural Networks with TensorFlow (blog post / official guide)
2.1 Quantizing an existing model and testing it
Code: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/quantize/python/quantize_graph.py
```
curl http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz -o /tmp/inceptionv3.tgz
tar xzf /tmp/inceptionv3.tgz -C /tmp/
bazel build tensorflow/contrib/quantization/tools:quantize_graph
bazel-bin/tensorflow/contrib/quantization/tools/quantize_graph \
  --input=/tmp/classify_image_graph_def.pb \
  --output_node_names="softmax" --output=/tmp/quantized_graph.pb \
  --mode=eightbit
```
This will produce a new model that runs the same operations as the original, but with eight bit calculations internally, and all weights quantized as well. If you look at the file size, you’ll see it’s about a quarter of the original (23MB versus 91MB). You can still run this model using exactly the same inputs and outputs though, and you should get equivalent results. Here’s an example:
```
bazel build tensorflow/examples/label_image:label_image
bazel-bin/tensorflow/examples/label_image/label_image \
  --input_graph=/tmp/quantized_graph.pb \
  --input_width=299 \
  --input_height=299 \
  --mean_value=128 \
  --std_value=128 \
  --input_layer_name="Mul:0" \
  --output_layer_name="softmax:0"
```
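The same quantized graph can also be run from Python. Below is a minimal sketch assuming the TensorFlow 1.x API this guide was written against; the random array is only a stand-in for a real decoded and resized image:

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

# Load the eight-bit GraphDef produced by quantize_graph above.
graph_def = tf.GraphDef()
with tf.gfile.GFile("/tmp/quantized_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

# Preprocess as label_image does: 299x299 input, normalized with mean 128 and std 128.
image = np.random.rand(1, 299, 299, 3).astype(np.float32)  # placeholder for real image data
image = (image * 255.0 - 128.0) / 128.0

with tf.Session(graph=graph) as sess:
    probs = sess.run("softmax:0", feed_dict={"Mul:0": image})
print(probs.argmax())
```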
2.2 What representation is used for quantized tensors?
We approach converting floating-point arrays into eight-bit representations as a compression problem. We know that the weight and activation tensors in trained neural network models tend to have values distributed across comparatively small ranges (for example, you might have -15 to +15 for weights, or -500 to 1000 for activations on an image model, though the exact numbers will vary). We also know from experiments that neural networks tend to be very robust in the face of noise, so the noise-like error produced by quantizing down to a small set of values will not hurt the precision of the overall results very much. We also want to pick a representation that is easy to perform calculations on, especially the large matrix multiplications that make up the bulk of the work needed to run a model.
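In formula form (a restatement of the linear min/max encoding described above, not a quote from the blog): a float x in the range [min, max] is stored as q = round((x - min) / (max - min) * 255) and recovered as x ≈ min + q * (max - min) / 255. For example, with min = -10.0 and max = 30.0, the value x = 10.0 is stored as q = round(20 / 40 * 255) = 128 and recovered as roughly 10.08; the small difference is the quantization noise.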
2.3 Eight-bit arithmetic using gemmlowp
2.3.1 Quantization implementation code (important!)
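A minimal NumPy sketch of the min/max quantize/dequantize scheme described above (an illustrative re-implementation, not the actual code in quantize_graph.py or gemmlowp):

```python
import numpy as np

def quantize(x, num_bits=8):
    """Map a float array onto [0, 2^num_bits - 1] using its own min/max range."""
    lo, hi = float(x.min()), float(x.max())
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, lo, hi

def dequantize(q, lo, hi, num_bits=8):
    """Recover approximate floats from the codes plus the stored (lo, hi) range."""
    scale = (hi - lo) / (2 ** num_bits - 1)
    return lo + q.astype(np.float32) * scale

weights = np.random.randn(3, 3, 64, 64).astype(np.float32)
q, lo, hi = quantize(weights)
restored = dequantize(q, lo, hi)
print(np.abs(weights - restored).max())  # on the order of (hi - lo) / 255
```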
2.3.2 How it works: The low-precision paradigm in gemmlowp, and how it's implemented (gemmlowp)
“Low-precision” means that the input and output matrix entries are integers on at most 8 bits. The scalar type is uint8_t.
gemmlowp is flexible enough to support multiple low-precision paradigms, i.e. multiple ways that a meaning is attached to 8-bit values so that a computation can rely on an 8-bit GEMM provided by gemmlowp.
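As one concrete example of such a paradigm (the scale/zero-point scheme described in gemmlowp's quantization documentation; the numbers below are made up for illustration), each uint8 code q is interpreted as a real number scale * (q - zero_point):

```python
# One way to attach a meaning to 8-bit values: real ~= scale * (q - zero_point).
# The scale and zero_point here are illustrative, not taken from any real model.
scale, zero_point = 0.05, 128

def real_value(q: int) -> float:
    return scale * (q - zero_point)

print(real_value(0), real_value(128), real_value(255))  # -6.4 0.0 6.35
```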