[Paper Reading Notes] Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

This post introduces the IAO method, which performs neural-network inference using integer arithmetic only. It targets already-compact networks such as MobileNets and is evaluated on ImageNet. Unlike prior approaches, IAO runs on ordinary CPUs without special hardware. The post walks through the quantization scheme, quantized convolution, and the training procedure.


The method is abbreviated as IAO.

  The paper proposes a method that lets a network use integer-only arithmetic at inference time, i.e. float32 --> int8.
  The quantization is applied to already-compact networks such as MobileNets and evaluated on the ImageNet classification dataset. This differs from other approaches, which operate on highly redundant models such as AlexNet and whose proposed hardware acceleration requires special-purpose hardware (i.e. they never demonstrate speedups on real hardware, only in theory); this method runs on ordinary CPUs.


Introduction

  Previous work reduces model size and inference time along two directions:

  1. Network architecture, e.g. MobileNet, SqueezeNet, ShuffleNet, and DenseNet
  2. Quantizing weights and activations from 32-bit floating point to lower bit-depth representations

  However, prior methods suffer from two shortcomings:

  1. They are not evaluated on a reasonable baseline architecture. Most baselines are AlexNet, VGGNet, or GoogLeNet, which were designed with large redundancy in order to reach high accuracy, so compressing them yields sizeable gains almost for free.
  2. **Many quantization methods are never validated for efficiency on real hardware.** ① Some only quantize the weights, caring about on-device storage rather than compute efficiency. ② Others, e.g. binary/ternary weight networks and bit-shift networks, restrict weights to 0 or a power of two ($\pm 2^n$) so that multiplications can be replaced by bit shifts. But on existing hardware a bit shift is not better than a fused multiply-add; moreover, multiplication is only expensive when the operands are wide, so both inputs of the multiplication, weights and activations, should be quantized to small bit-widths.

Quantized Inference

Quantization scheme

  The author's quantization scheme uses integer-only arithmetic at inference time and floating-point arithmetic during training.
  The basic requirement of the scheme is that all arithmetic can be carried out using only integer operations on the quantized values.
  The scheme is an affine map between an integer $q$ and a real value $r$:
$$r = S(q-Z) \tag{1}$$
where $S$ and $Z$ are constants (the quantization parameters). [One pair of $S$/$Z$ is used per array in each layer, i.e. every activation tensor and every weight tensor has its own $S$ and $Z$.]

  For 8-bit quantization, $q$ is an 8-bit integer; for B-bit quantization, $q$ is a B-bit integer.
  The bias is always quantized as a 32-bit integer.

  The constant $S$ is the step size, a real-valued "scale". The constant $Z$ is the offset ("zero-point"): it is the quantized value corresponding to the real value $r=0$, and it has the same type as $q$.
$$S=\frac{array_{max}-array_{min}}{2^{bit\_width}-1}\tag{9}$$
Define the rounding function
$$round(x) = \begin{cases}0, & x<0 \\ \lfloor x \rceil, & 0 \le \lfloor x \rceil \le 2^{n}-1 \\ 2^{n}-1, & x>2^{n}-1\end{cases}$$
$$Z=round\left(-\frac{array_{min}}{S}\right)\tag{10}$$

From Eq. (1) it follows that
$$q = round\left(Z + \frac{r}{S}\right) \tag{11}$$

```cpp
#include <cstdint>
#include <vector>

template <typename QType>          // e.g. QType = uint8_t
struct QuantizedBuffer {
    std::vector<QType> q;          // the quantized values
    float S;                       // the scale
    QType Z;                       // the zero-point
};
```
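To make the mapping concrete, here is a minimal sketch (my own, not code from the paper) of how $S$, $Z$ and the quantized values could be derived from an array's min/max following Eqs. (9)-(11); the helper names `Quantize`/`Dequantize` are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Quantize a float array to uint8 following Eqs. (9)-(11).
// Assumes the array is non-degenerate, i.e. max > min.
QuantizedBuffer<uint8_t> Quantize(const std::vector<float>& r) {
    QuantizedBuffer<uint8_t> out;
    float lo = *std::min_element(r.begin(), r.end());
    float hi = *std::max_element(r.begin(), r.end());
    lo = std::min(lo, 0.0f);                                  // make sure 0 is representable
    hi = std::max(hi, 0.0f);
    out.S = (hi - lo) / 255.0f;                               // Eq. (9), bit_width = 8
    out.Z = static_cast<uint8_t>(std::lround(-lo / out.S));   // Eq. (10)
    for (float v : r) {                                       // Eq. (11): q = round(Z + r/S)
        long q = std::lround(out.Z + v / out.S);
        out.q.push_back(static_cast<uint8_t>(std::clamp(q, 0L, 255L)));
    }
    return out;
}

// Dequantize one element back to float via Eq. (1): r = S * (q - Z).
float Dequantize(const QuantizedBuffer<uint8_t>& b, size_t i) {
    return b.S * (static_cast<int>(b.q[i]) - static_cast<int>(b.Z));
}
```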

Quantized convolution procedure

1. Input: the quantized feature map lhs_quantized_val (uint8) and its offset lhs_zero_point (int32);
2. Input: the quantized convolution kernel rhs_quantized_val (uint8) and its offset rhs_zero_point (int32);
3. Cast uint8 to int32;
4. Accumulate each convolution window (the int32 multiply-accumulate risks overflow; it can be replaced by fixed-point fractional multiplication);
   int32_accumulator += (lhs_quantized_val(i, j) - lhs_zero_point) * (rhs_quantized_val(j, k) - rhs_zero_point);
5. Input: the quantized multiplier quantized_multiplier (int32) and the recorded shift amount right_shift (int);
6. Multiply, producing an int32 result (the int32 multiplication risks overflow; it can be replaced by fixed-point fractional multiplication);
   quantized_multiplier * int32_accumulator

7. Shift right by right_shift bits to rescale, giving an int32 result;
8. Add the output offset result_zero_point;
   (the order of steps 7 and 8 is reversed relative to the official description);
9. Clamp the int32 result to [0, 255], then cast to uint8;
10. Dequantize back to floating point and update the running statistics (max, min) of the output distribution;
11. Re-quantize to uint8;
12. Pass through the quantized activation layer;
13. Dequantize to floating point: this is the layer's output;

14. Move on to the next layer
    and repeat steps 1-13.

  When consecutive layers are all quantized, there is no need to dequantize in between; e.g. steps 10→11 above can actually be skipped [at the cost that the accumulated multiply-adds become more likely to overflow]. A sketch of the integer-only requantization (steps 5-9) follows below.
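A minimal sketch (my own, not the paper's gemmlowp kernel) of steps 5-9: the int32 accumulator is scaled by an integer multiplier, rounded-right-shifted, offset by the output zero-point, and clamped to uint8. The fixed-point details are simplified and overflow handling beyond the 64-bit product is omitted:

```cpp
#include <algorithm>
#include <cstdint>

// Requantize an int32 accumulator to uint8 (simplified sketch of steps 5-9).
// quantized_multiplier and right_shift encode M = S1*S2/S3 as M0 * 2^{-n}
// (see Eq. (5)); result_zero_point is the output array's Z.
uint8_t Requantize(int32_t acc,
                   int32_t quantized_multiplier,
                   int right_shift,
                   int32_t result_zero_point) {
    // Step 6: scale by the integer multiplier (done in 64 bits to avoid overflow).
    int64_t scaled = static_cast<int64_t>(acc) * quantized_multiplier;
    // Step 7: rounding right shift by right_shift bits.
    int64_t rounding = (right_shift > 0) ? (int64_t{1} << (right_shift - 1)) : 0;
    int32_t shifted = static_cast<int32_t>((scaled + rounding) >> right_shift);
    // Step 8: add the output zero-point.
    int32_t with_zero_point = shifted + result_zero_point;
    // Step 9: clamp to [0, 255] and cast to uint8.
    return static_cast<uint8_t>(std::clamp(with_zero_point, 0, 255));
}
```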
Integer-arithmetic-only matrix multiplication

  From Eq. (1), every real number $r_i$ in an array can be represented by its quantized value $q_i$ together with a pair of parameters $S$ and $Z$. For the product of two real matrices $R_1$ and $R_2$, each real entry of the result can be written as
$$S_3(q_3^{(i,k)}-Z_3)=\sum_{j=1}^{N}S_1(q_1^{(i,j)}-Z_1)\,S_2(q_2^{(j,k)}-Z_2)\tag{2}$$
$$q_3^{(i,k)}=Z_3+M\sum_{j=1}^{N}(q_1^{(i,j)}-Z_1)(q_2^{(j,k)}-Z_2)\tag{3}$$
where
$$M=\frac{S_1 S_2}{S_3}\tag{4}$$
$M$ is the only non-integer quantity in Eq. (3). Empirically $M$ always lies in $(0,1)$, so it is expressed as
$$M=2^{-n}M_0\tag{5}$$
where $n$ is a non-negative integer and $M_0$ is an integer. In this way all real-valued arithmetic becomes integer arithmetic, and the factor $2^{-n}$ can be implemented with a bit shift. Note that every value of the input image lies in $[0,255]$, which coincides with the quantized range $[0, 2^8-1]=[0,255]$.
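As an illustration (my own, simplified relative to gemmlowp's 31-bit fixed-point representation), a sketch of how the real multiplier $M$ could be decomposed into an integer multiplier $M_0$ and a shift $n$ per Eq. (5):

```cpp
#include <cmath>
#include <cstdint>

// Decompose a real multiplier M in (0, 1) as M ≈ M0 * 2^{-n}, with M0 an
// integer. Here M0 uses 15 fractional bits for simplicity; gemmlowp's real
// implementation uses a 31-bit fixed-point multiplier.
void DecomposeMultiplier(float M, int32_t* M0, int* n) {
    *n = 0;
    // Repeatedly double M until it reaches [0.5, 1); each doubling adds one
    // bit to the right shift n. Assumes 0 < M < 1.
    while (M < 0.5f) {
        M *= 2.0f;
        ++(*n);
    }
    // Store M in fixed point with 15 fractional bits, so that at runtime
    // result = (acc * M0) >> n approximates acc * M.
    *M0 = static_cast<int32_t>(std::lround(M * (1 << 15)));
    *n += 15;
}
```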

Note: at inference time the quantization step size of a weight matrix can be computed directly from the stored parameters, whereas the activations' step size is estimated from an exponential moving average over a large amount of training data. That is why Eq. (3) can already use $S_3$ even though $q_3$ has not been produced yet. [The formula for $S$ appears below; it is computed from the minimum $a$ and maximum $b$ of the pre-quantization range.]

Implementation of a typical fused layer

  The bias vector's quantization is defined by $S_{bias}=S_1 S_2$, $Z_{bias}=0$, and the quantized bias is stored as int32.
  The matrix multiplication of weights and activations accumulates as: int32 += uint8 * uint8.
  After the result is obtained in the 32-bit accumulator, three things remain: ① scale down the int32 output activations to the scale used by the 8-bit output; ② affinely map the int32 output activations onto uint8; ③ apply the activation function to the uint8 result.
  The down-scaling in step ① is exactly multiplication by the factor $M$ of Eq. (3).
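Putting the pieces together, here is a rough sketch (my own composition, not the paper's kernel) of one output entry of a fused layer: uint8 inputs with int32 accumulation, int32 bias addition, down-scaling via the `Requantize` helper from the earlier sketch, and an activation folded into a final clamp. Names and the fully-connected layout are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One output entry of a fused quantized fully-connected layer:
//   acc = sum_j (a[j]-Za) * (w[j]-Zw) + bias         (all int32)
//   out = clamp(Requantize(acc), act_min, act_max)   (uint8)
uint8_t FusedOutputEntry(const std::vector<uint8_t>& a_row,   // one row of activations
                         const std::vector<uint8_t>& w_col,   // one column of weights
                         int32_t Za, int32_t Zw,
                         int32_t bias,                         // quantized with S_bias = S1*S2
                         int32_t quantized_multiplier, int right_shift,
                         int32_t Zout,
                         uint8_t act_min, uint8_t act_max) {   // activation clamp, e.g. ReLU6
    int32_t acc = bias;
    for (size_t j = 0; j < a_row.size(); ++j) {
        acc += (static_cast<int32_t>(a_row[j]) - Za) *
               (static_cast<int32_t>(w_col[j]) - Zw);
    }
    uint8_t q = Requantize(acc, quantized_multiplier, right_shift, Zout);
    return std::min(std::max(q, act_min), act_max);
}
```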


Training with simulated quantization

Quantization scheme

  The author analyzes the conventional approach: train with floating point, then quantize the trained weights (sometimes followed by quantized fine-tuning). This works well for large models, but for small models the accuracy drop is considerable. Two problems are identified with this post-training approach:

  1. The weight distributions of different channels within the same layer can differ in scale by a large factor (more than 100x);
  2. Outlier weight values degrade the quantization precision of all the remaining weights.

  The author therefore proposes simulating quantization during the forward pass, while backpropagation proceeds as usual and all weights and biases are stored in floating point so that small updates can accumulate. The floating-point forward pass thus mimics the effect of the quantized arithmetic in Eq. (3).

  For each layer, values are quantized using its parameters (number of quantization levels and clamping range):
$$clamp(r;a,b) := \min(\max(r,a),b)\tag{6}$$
$$s(a,b,n) := \frac{b-a}{n-1} \tag{7}$$
$$q(r;a,b,n) := \left\lfloor\frac{clamp(r;a,b)-a}{s(a,b,n)}\right\rceil s(a,b,n)+a\tag{8}$$
where $r$ is the real value to be quantized, $[a,b]$ is the quantization range, $n$ is the number of quantization levels, fixed for all layers (e.g. for 8-bit quantization $n=2^8=256$), and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer.
  Eq. (6) constrains $r$ to the pre-quantization range; this seemingly redundant step matters because an activation's range is obtained from a moving average. Eq. (7) gives the size of each of the $n-1$ segments the range is divided into. Eq. (8) snaps a real value $r$ to the nearest exactly-quantizable real value, where $\lfloor\frac{clamp(r;a,b)-a}{s(a,b,n)}\rceil$ is the quantized result representable with our B bits.
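A minimal sketch (my own) of this fake-quantization function, Eqs. (6)-(8), as it might be applied elementwise in the simulated-quantization forward pass:

```cpp
#include <algorithm>
#include <cmath>

// Fake-quantize one real value r onto the grid defined by [a, b] and n levels
// (Eqs. (6)-(8)); the output is still a float, but lies exactly on the grid.
float FakeQuantize(float r, float a, float b, int n) {
    float clamped = std::min(std::max(r, a), b);       // Eq. (6): clamp(r; a, b)
    float s = (b - a) / static_cast<float>(n - 1);     // Eq. (7): step size
    float level = std::round((clamped - a) / s);       // Eq. (8): round to nearest level
    return level * s + a;                              // back onto the real-valued grid
}
```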

  The bias is not fake-quantized during training, because at inference time it is represented with 32 bits. [I don't fully understand this sentence: doesn't the earlier section define a quantization parameter $S_{bias}$ for the bias?]

Learning quantization ranges
  The quantization ranges of weights and activations are handled differently:

  • For weights: $a := \min(w)$, $b := \max(w)$
  • For activations: during training, the range $[a,b]$ is collected with an exponential moving average whose smoothing parameter is close to 1. [When the range changes rapidly, the moving average reflects it with a delay, so it is useful not to quantize activations in the early phase of training (from 50 thousand up to 2 million steps).] A sketch of such a range tracker follows below.
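A small sketch (my own) of how such an exponential-moving-average range tracker could look; the momentum value and the first-batch initialization are illustrative assumptions:

```cpp
#include <algorithm>
#include <vector>

// Tracks an activation tensor's range [a, b] with an exponential moving
// average, as used to set the quantization range during training.
struct EmaRangeTracker {
    float a = 0.0f, b = 0.0f;      // current smoothed range
    float momentum = 0.999f;       // smoothing parameter close to 1
    bool initialized = false;

    void Update(const std::vector<float>& activations) {
        float lo = *std::min_element(activations.begin(), activations.end());
        float hi = *std::max_element(activations.begin(), activations.end());
        if (!initialized) {        // first batch: take the observed range directly
            a = lo; b = hi; initialized = true;
            return;
        }
        a = momentum * a + (1.0f - momentum) * lo;
        b = momentum * b + (1.0f - momentum) * hi;
    }
};
```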