[Quantization] SNPE

Quantization Algorithm

This section describes the concepts behind the quantization algorithm used in SNPE. These concepts are used by snpe-dlc-quantize and also by SNPE itself for input quantization when using the DSP runtime.

Overview

Note: SNPE supports multiple quantization modes. The basics of the quantization, regardless of mode, are described here. See Quantization Modes for more information.

  • Quantization converts floating point data to TensorFlow-style 8-bit fixed point format.
  • The following requirements are satisfied:
    • Full range of input values is covered.
    • Minimum range of 0.01 is enforced.
    • Floating point zero is exactly representable.
  • Quantization algorithm inputs:
    • Set of floating point values to be quantized.
  • Quantization algorithm outputs:
    • Set of 8-bit fixed point values.
    • Encoding parameters:
      • encoding-min - minimum floating point value representable (by fixed point value 0)
      • encoding-max - maximum floating point value representable (by fixed point value 255)
  • Algorithm
    1. Compute the true range (min, max) of input data.
    2. Compute the encoding-min and encoding-max.
    3. Quantize the input floating point values.
    4. Output:
      • fixed point values
      • encoding-min and encoding-max parameters

Details

  1. Compute the true range of the input floating point data.
    • finds the smallest and largest values in the input data
    • represents the true range of the input data
  2. Compute the encoding-min and encoding-max.
    • These parameters are used in the quantization step.
    • These parameters define the range and floating point values that will be representable by the fixed point format.
      • encoding-min: specifies the smallest floating point value that will be represented by the fixed point value of 0
      • encoding-max: specifies the largest floating point value that will be represented by the fixed point value of 255
      • floating point values at every step size, where step size = (encoding-max - encoding-min) / 255, will be representable
    1. encoding-min and encoding-max are first set to the true min and true max computed in the previous step
    2. First requirement: encoding range must be at least a minimum of 0.01
      • encoding-max is adjusted to max(true max, true min + 0.01)
    3. Second requirement: floating point value of 0 must be exactly representable
      • encoding-min or encoding-max may be further adjusted
  3. Handling 0.
    1. Case 1: Inputs are strictly positive
      • the encoding-min is set to 0.0
      • zero floating point value is exactly representable by smallest fixed point value 0
      • e.g. input range = [5.0, 10.0]  (author's note: here 5 is simply changed to 0; I don't understand why)
        • encoding-min = 0.0, encoding-max = 10.0
    2. Case 2: Inputs are strictly negative
      • encoding-max is set to 0.0
      • zero floating point value is exactly representable by the largest fixed point value 255
      • e.g. input range = [-20.0, -6.0]  (author's note: here -6 is simply changed to 0; I don't understand why)
        • encoding-min = -20.0, encoding-max = 0.0  
    3. Case 3: Inputs are both negative and positive
      • encoding-min and encoding-max are slightly shifted to make the floating point zero exactly representable
      • e.g. input range = [-5.1, 5.1]  (author's note: this guarantees that one fixed point value corresponds exactly to floating point 0)
        • encoding-min and encoding-max are first set to -5.1 and 5.1, respectively
        • encoding range is 10.2 and the step size is 10.2/255 = 0.04
        • zero value is currently not representable. The closest values representable are -0.02 and +0.02 by fixed point values 127 and 128, respectively
        • encoding-min and encoding-max are shifted by -0.02. The new encoding-min is -5.12 and the new encoding-max is 5.08
        • floating point zero is now exactly representable by the fixed point value of 128
  4. Quantize the input floating point values.
    • encoding-min and encoding-max parameter determined in the previous step are used to quantize all the input floating values to their fixed point representation
    • Quantization formula is:  (author's note: a simple linear transform that quantizes into the range 0 to 255)
      • quantized value = round(255 * (floating point value - encoding-min) / (encoding-max - encoding-min))
    • quantized value is also clamped to be within 0 and 255
  5. Outputs
    • the fixed point values
    • encoding-min and encoding-max parameters
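
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SNPE's actual implementation: the function names are hypothetical, and the zero adjustment in Case 3 is assumed to round the zero point to the nearest integer step, which reproduces the numbers in the example that follows.

```python
def compute_encoding(values, min_range=0.01, levels=255):
    """Return (encoding_min, encoding_max) for a list of floats."""
    true_min, true_max = min(values), max(values)
    enc_min = true_min
    # First requirement: the range must be at least min_range wide.
    enc_max = max(true_max, true_min + min_range)
    # Second requirement: 0.0 must be exactly representable.
    if enc_min > 0.0:            # Case 1: strictly positive inputs
        enc_min = 0.0
    elif enc_max < 0.0:          # Case 2: strictly negative inputs
        enc_max = 0.0
    else:                        # Case 3: range already straddles zero
        step = (enc_max - enc_min) / levels
        zero_point = round(-enc_min / step)   # assumed rounding rule
        enc_min = -zero_point * step
        enc_max = enc_min + levels * step
    return enc_min, enc_max


def quantize(values, enc_min, enc_max, levels=255):
    """Apply the quantization formula, then clamp to [0, levels]."""
    scale = levels / (enc_max - enc_min)
    return [min(levels, max(0, round(scale * (v - enc_min)))) for v in values]


enc_min, enc_max = compute_encoding([-1.8, -1.0, 0.0, 0.5])
print(enc_min, enc_max)
print(quantize([-1.8, -1.0, 0.0, 0.5], enc_min, enc_max))
```

Note that Cases 1 and 2 only widen the range so that it reaches 0.0; the data itself is not changed, it just occupies part of the 0 to 255 grid.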

Quantization Example

  • Inputs:
    • input values = [-1.8, -1.0, 0, 0.5]
  • encoding-min is set to -1.8 and encoding-max to 0.5
  • encoding range is 2.3, which is larger than the required 0.01
  • encoding-min is adjusted to −1.803922 and encoding-max to 0.496078 to make zero exactly representable
  • step size (delta or scale) is 0.009020
  • Outputs:
    • quantized values are [0, 89, 200, 255]
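
The arithmetic behind these numbers can be written out as follows. The symbols Δ (step size) and z (the fixed point value that represents zero) are introduced here for illustration, and rounding z to the nearest integer is an assumption that happens to reproduce the values listed above.

```latex
\begin{aligned}
\Delta &= \frac{0.5 - (-1.8)}{255} = \frac{2.3}{255} \approx 0.009020 \\
z &= \operatorname{round}\!\left(\frac{1.8}{\Delta}\right) = \operatorname{round}(199.57) = 200 \\
\text{encoding-min} &= -z\,\Delta = -200\,\Delta \approx -1.803922 \\
\text{encoding-max} &= (255 - z)\,\Delta = 55\,\Delta \approx 0.496078 \\
q(-1.8) &= \operatorname{round}\!\left(\frac{-1.8 + 1.803922}{\Delta}\right) = \operatorname{round}(0.43) = 0 \\
q(-1.0) &= \operatorname{round}\!\left(\frac{-1.0 + 1.803922}{\Delta}\right) = \operatorname{round}(89.13) = 89 \\
q(0) &= \operatorname{round}\!\left(\frac{1.803922}{\Delta}\right) = 200 \\
q(0.5) &= \min\!\left(255,\ \operatorname{round}(255.43)\right) = 255
\end{aligned}
```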

Dequantization Example

  • Inputs:
    • quantized values = [0, 89, 200, 255]
    • encoding-min = −1.803922, encoding-max = 0.496078
  • step size is 0.009020
  • Outputs:
    • dequantized values = [−1.8039, −1.0011, 0.0000, 0.4961]
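
Dequantization is the inverse linear map. A minimal sketch, with a hypothetical function name and the encoding values taken from the example above:

```python
def dequantize(quantized, enc_min, enc_max, levels=255):
    """Map fixed point values in [0, levels] back to floating point."""
    step = (enc_max - enc_min) / levels
    return [enc_min + q * step for q in quantized]


values = dequantize([0, 89, 200, 255], -1.803922, 0.496078)
print([round(v, 4) for v in values])
```

The results agree with the listed values to about three decimal places; the last digit of the second value depends on whether the step size is rounded to 0.009020 before multiplying.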

Bit Width Selection (author's note: supported on HTA devices only)

SNPE currently supports a default quantization bit width of 8 for both weights and biases. The bias bit width, however, can be overridden to use 32-bit quantization by passing the command line option "--bias_bitwidth 32" to snpe-dlc-quantize. For some models, using 32-bit biases may give a small improvement in accuracy. Unfortunately, it is difficult to predict which models may benefit from this, since model architectures, weight distributions, etc. all have an impact on quantization performance.

Quantization Modes

SNPE supports multiple quantization modes. They differ in how the quantization parameters are chosen.

Default Quantization Mode (author's note: this is the one described above)

The default mode has been described above, and uses the true min/max of the data being quantized, followed by an adjustment of the range to ensure a minimum range and to ensure 0.0 is exactly quantizable.

Enhanced Quantization Mode

Enhanced quantization mode (invoked by using the "use_enhanced_quantizer" parameter to snpe-dlc-quantize) uses an algorithm to try to determine a better set of quantization parameters to improve accuracy. The algorithm may pick a different min/max value than the default quantizer, and in some cases it may set the range such that some of the original weights and/or activations cannot fall into that range. However, this range does produce better accuracy than simply using the true min/max. The enhanced quantizer can be enabled independently for weights and activations by appending either "weights" or "activations" after the option.

This is useful for some models where the weights and/or activations may have "long tails". (Imagine a range with most values between -100 and 1000, but a few values much greater than 1000 or much less than -100.) In some cases these long tails can be ignored and the range -100, 1000 can be used more effectively than the full range.
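
The effect of a long tail can be illustrated with a small experiment. This is not the enhanced quantizer's algorithm (which the text does not specify); the clipped range below is picked by hand, the data is made up, and the zero adjustment is skipped to keep the sketch short.

```python
def quant_dequant(values, enc_min, enc_max, levels=255):
    """Quantize to [0, levels] with clamping, then map back to floats."""
    step = (enc_max - enc_min) / levels
    restored = []
    for v in values:
        q = min(levels, max(0, round((v - enc_min) / step)))
        restored.append(enc_min + q * step)
    return restored


def mean_abs_error(values, enc_min, enc_max):
    restored = quant_dequant(values, enc_min, enc_max)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


bulk = [i / 100.0 for i in range(-100, 101)]   # most values lie in [-1, 1]
outlier = 50.0                                  # a single long-tail value

# Error on the bulk when the range must also cover the outlier.
bulk_err_full = mean_abs_error(bulk, -1.0, outlier)
# Error on the bulk when the tail is ignored; the outlier saturates at 1.0.
bulk_err_clipped = mean_abs_error(bulk, -1.0, 1.0)

print(bulk_err_full, bulk_err_clipped)
```

With the full range the step size is 0.2, so most of the 256 levels are spent on values that almost never occur. Clipping shrinks the step for the bulk of the data at the cost of a large error on the rare outlier.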

Enhanced quantizer still enforces a minimum range and ensures 0.0 is exactly quantizable.

Adjusted Weights Quantization Mode (author's note: how does this differ from Enhanced Quantization Mode?)

This mode is used only for quantizing weights to 8-bit fixed point (invoked by using the "use_adjusted_weights_quantizer" parameter to snpe-dlc-quantize). It uses an adjusted min or max of the data being quantized rather than the true min/max or the min/max that excludes the long tail. It has been verified to provide an accuracy benefit specifically for denoising models. With this quantizer, the max will be increased or the min will be decreased if necessary.

Adjusted weights quantizer still enforces a minimum range and ensures 0.0 is exactly quantizable.

Enhanced Quantization Techniques

Quantization can be a difficult problem to solve due to the myriad of training techniques, model architectures, and layer types. In an attempt to mitigate quantization problems two new model preprocessing techniques have been added to snpe-dlc-quantize that may improve quantization performance on models which exhibit sharp drops in accuracy upon quantization.

The two new techniques introduced are CLE (Cross Layer Equalization) and BC (Bias Correction).

CLE works by scaling the convolution weight ranges in the network by making use of a scale-equivariance property of activation functions. In addition, the process absorbs high biases, which may result from weight scaling, from one convolution layer into a subsequent convolution layer.
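
The scale-equivariance idea can be shown on a toy network. The sketch below is not SNPE's implementation: it uses two small dense layers as a stand-in for convolutions, all names and numbers are made up, and the scale factor s = sqrt(r1 / r2) follows my reading of the paper linked at the end of this section. It relies on relu(s * x) = s * relu(x) for s > 0.

```python
import math


def relu(vec):
    return [max(0.0, x) for x in vec]


def dense(weights, bias, x):
    """weights[i] holds the weights of output channel i."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def equalize(w1, b1, w2):
    """Divide channel i of layer 1 by s_i and multiply input i of layer 2 by s_i."""
    new_w1, new_b1 = [], []
    new_w2 = [row[:] for row in w2]
    for i, (row, b) in enumerate(zip(w1, b1)):
        r1 = max(abs(w) for w in row)        # range of output channel i, layer 1
        r2 = max(abs(r[i]) for r in w2)      # range of input channel i, layer 2
        s = math.sqrt(r1 / r2)
        new_w1.append([w / s for w in row])
        new_b1.append(b / s)
        for r in new_w2:
            r[i] *= s
    return new_w1, new_b1, new_w2


w1, b1 = [[0.1, -0.2], [8.0, 4.0]], [0.05, -1.0]   # very different channel ranges
w2, b2 = [[2.0, 0.01], [-1.0, 0.02]], [0.0, 0.5]
x = [1.0, -0.5]

y_before = dense(w2, b2, relu(dense(w1, b1, x)))
e1, eb1, e2 = equalize(w1, b1, w2)
y_after = dense(e2, b2, relu(dense(e1, eb1, x)))
print(y_before, y_after)
```

The network output is unchanged up to floating point error, but after equalization each channel has the same weight range in both layers, so a single per-tensor encoding wastes fewer quantization levels.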

BC corrects the biased error that is introduced in the activations during quantization. It does this by accumulating convolution/MatMul activation data from the floating-point model and the quantized model and then corrects for the statistical bias on the layer’s output. This correction is added to the bias of the layer in question.
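
The correction step described above amounts to a per-channel mean difference. A minimal sketch, assuming the layer outputs of both models have already been collected over the same calibration inputs (the function name and the numbers are made up for illustration):

```python
def bias_correction(float_outputs, quant_outputs):
    """Per-channel mean of (float output - quantized output) over all samples."""
    n = len(float_outputs)
    channels = len(float_outputs[0])
    return [
        sum(f[c] - q[c] for f, q in zip(float_outputs, quant_outputs)) / n
        for c in range(channels)
    ]


# Two calibration samples, two output channels each.
float_outputs = [[1.0, 2.0], [3.0, 4.0]]
quant_outputs = [[0.9, 2.2], [2.9, 4.2]]

correction = bias_correction(float_outputs, quant_outputs)
layer_bias = [0.5, -0.5]
corrected_bias = [b + c for b, c in zip(layer_bias, correction)]
print(correction, corrected_bias)
```

In this toy data the quantized layer is on average 0.1 too low on the first channel and 0.2 too high on the second, so the bias is moved by roughly +0.1 and -0.2.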

Enhanced Quantization Techniques: Limitations

In many cases CLE+BC may enable quantized models to return to close to their original floating-point accuracy. There are some caveats/limitations to the current algorithms:

  1. CLE operates on specific patterns of operations that all exist in a single branch (outputs cannot be consumed by more than one op). The matched operation patterns (r=required, o=optional) are:
    • Conv(r)->Batchnorm(r)->activation(o)->Conv(r)->Batchnorm(r)->activation(o)
    • Conv(r)->Batchnorm(r)->activation(o)->DepthwiseConv(r)->Batchnorm(r)->activation(o)->Conv(r)->Batchnorm(r)->activation(o)
  2. The CLE algorithm currently only supports Relu activations. Any Relu6 activations will be automatically changed to Relu and any activations other than these will cause the algorithm to ignore the preceding convolution. Typically the switch from Relu6->Relu is harmless and does not cause any degradation in accuracy, however some models may exhibit a slight degradation of accuracy. In this case, CLE+BC can only recover accuracy to that degraded level, and not to the original float accuracy.
  3. CLE requires batchnorms (specifically detectable batchnorm beta/gamma data) to be present in the original model before conversion to DLC for the complete algorithm to run and to regain maximum accuracy. For TensorFlow, the beta and gamma can sometimes still be found even with folded batchnorms, as long as the folding did not fold the parameters into the convolution's static weights and bias. If the required information is not detected, you may see a message that looks like: "Invalid model for HBA quantization algorithm." This indicates that the algorithm will only partially run and that accuracy issues are likely to be present.

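(Author's note: see the Qualcomm paper "A Quantization-Friendly Separable Convolution for MobileNets". Using ReLU6 distorts the model's signal distribution and causes excessive loss after quantization, so ReLU is recommended.)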

Typically CLE and BC work best when used together, as they complement one another. CLE is very fast (seconds), whereas BC can be extremely slow (minutes to hours) because quantization data must be run through the network repeatedly as corrections are made. When time is critical, it might be prudent to try CLE first and check the accuracy of the resulting model. If the accuracy is good (within a couple of percent of the floating-point model), then CLE+BC should be run on the original model and will likely give better results. In cases where the accuracy is still poor, BC is unlikely to provide much additional accuracy gain.

To run the typical use case, CLE+BC, pass "--optimizations cle --optimizations bc" to snpe-dlc-quantize.

To run only CLE, pass "--optimizations cle" to snpe-dlc-quantize.

The original converted float model should always be used as input to snpe-dlc-quantize. Passing quantized models back to the quantizer is not supported and will result in undefined behavior. In addition, BC should not be run on its own without CLE as it won’t fix major issues with weight/activation quantization.
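
As a rough sketch, the command line for these options could be assembled like this. The helper and the file names are hypothetical; "--optimizations" and "--bias_bitwidth" come from the text above, while "--input_dlc", "--input_list" and "--output_dlc" are not mentioned in this article and are taken from my understanding of the snpe-dlc-quantize usage, so check them against the tool's help output.

```python
import shlex


def build_quantize_command(input_dlc, input_list, output_dlc,
                           optimizations=("cle", "bc"), bias_bitwidth=None):
    """Return the argument list for a snpe-dlc-quantize run (nothing is executed)."""
    cmd = [
        "snpe-dlc-quantize",
        "--input_dlc", input_dlc,      # the original float DLC, never a quantized one
        "--input_list", input_list,    # list of calibration inputs
        "--output_dlc", output_dlc,
    ]
    for opt in optimizations:
        cmd += ["--optimizations", opt]
    if bias_bitwidth is not None:
        cmd += ["--bias_bitwidth", str(bias_bitwidth)]
    return cmd


cmd = build_quantize_command("model_float.dlc", "calibration_list.txt",
                             "model_quantized.dlc")
print(shlex.join(cmd))
```

Passing optimizations=("cle",) gives the CLE-only variant, and the resulting list can be handed to subprocess.run when the SNPE tools are on the PATH.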

More information about the algorithms can be found here: https://arxiv.org/abs/1906.04721

Quantization Impacts

Quantizing a model and/or running it in a quantized runtime (like the DSP) can affect accuracy. Some models may not work well when quantized, and may yield incorrect results. The metrics for measuring the impact of quantization on a model that does classification are typically "Mean Average Precision", "Top-1 Error" and "Top-5 Error". These metrics are published in the SNPE release notes for various models.

Quantization Overrides

If the option --quantization_overrides is provided during model conversion, the user can provide a JSON file with parameters to use for quantization. These will be cached along with the model and can be used to override any quantization data carried over from conversion (e.g. TF fake quantization) or calculated during the normal quantization process in snpe-dlc-quantize. To override the params during snpe-dlc-quantize, the option --override_params must be passed, and the cached values will be used instead. The JSON format follows the AIMET specification and can be found below.

There are two sections in the json, a section for overriding operator output encodings called "activation_encodings" and a section for overriding parameter (weight and bias) encodings called "param_encodings". Both must be present in the file, but can be empty if no overrides are desired.

An example with all of the currently supported options:

{
  "activation_encodings": {
      "Conv1:0": [
          {
              "bitwidth": 8,
              "max": 12.82344407824954,
              "min": 0.0,
              "offset": 0,
              "scale": 0.050288015993135454
          }
      ],
      "input:0": [
          {
              "bitwidth": 8,
              "max": 0.9960872825108046,
              "min": -1.0039304197656937,
              "offset": 127,
              "scale": 0.007843206675594112
          }
      ]
  },
  "param_encodings": {
      "Conv2d/weights": [
          {
              "bitwidth": 8,
              "max": 1.700559472933134,
              "min": -2.1006477158567995,
              "offset": 140,
              "scale": 0.01490669485799974
          }
      ]
  }
}

Under "activation_encodings" the names (e.g. "Conv1:0") represent the output tensor names where quantization should be overridden. Under "param_encodings" the names represent the weights or biases for which the encodings will be specified. A brief breakdown of the common parameters:

  • bitwidth (int, required) - The bitwidth to use for quantization. Note that this must match the existing bitwidth support for the runtime on which the model will be run.
  • max (float, required) - The largest number in the distribution or desired range.
  • min (float, required) - The smallest number in the distribution or desired range.
  • offset (int) - The integer offset indicating the zero point (i.e. the point at which 0 is exactly represented)
  • scale (float) - The step size of the quantization grid, i.e. the desired distribution range divided by the number of integer steps (in the example above, (max - min) / 255 for 8 bits)

Note that it is not required to provide scale (also referred to as delta) and offset (zero point). If they are provided they will be used; otherwise they will be calculated from the provided bitwidth, min, and max parameters.
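
A small helper can generate such a file. This is a sketch with hypothetical names: it writes only the required fields plus scale, and it assumes scale = (max - min) / (2^bitwidth - 1), which matches the scale values in the example above. The offset is left out since the text says it is optional.

```python
import json


def make_encoding(bitwidth, enc_min, enc_max):
    """Build one encoding entry with the required fields and a derived scale."""
    steps = 2 ** bitwidth - 1
    return {
        "bitwidth": bitwidth,
        "max": enc_max,
        "min": enc_min,
        "scale": (enc_max - enc_min) / steps,   # assumed formula, see lead-in
    }


overrides = {
    "activation_encodings": {
        "Conv1:0": [make_encoding(8, 0.0, 12.82344407824954)],
    },
    "param_encodings": {},   # both sections must be present, even if empty
}

print(json.dumps(overrides, indent=2))
```

The printed JSON can be saved to a file and passed through the overrides option during model conversion.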
