Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference: Paper Reading Notes

This post walks through how neural-network quantization is implemented in TensorFlow Lite to enable efficient integer-arithmetic-only inference. The core topics are the design of the uniform affine quantizer, the optimization of quantized matrix multiplication, the handling of zero-points, the implementation of a typical fused layer, and simulated-quantization training. With these techniques, model accuracy is preserved through training while inference becomes faster.

Introduction

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference is a Google paper on model compression. It is not just a piece of research: it has been deployed in TensorFlow Lite. The paper covers a lot of ground, including the uniform affine quantizer (the core quantization formula), matrix multiplication (how the quantized matrices are used to represent a real-valued matrix multiplication), efficient handling of zero-points (a performance optimization for that matrix multiplication), the implementation of a typical fused layer (how fake quantization, bias, and activation fit into the convolutional network), and how training is actually carried out.

Quantization Scheme

Uniform affine quantizer

The core formula:

$$r = S(q - Z)$$

where $r$ is the real value, $q$ is the quantized integer value, $S$ is a floating-point scale, and $Z$ is the zero-point: the integer $q$ that represents the real value 0 exactly. https://blog.youkuaiyun.com/qq_19784349/article/details/82883271 gives a detailed explanation of this formula.
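To make the formula concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of quantizing to uint8 given a scale $S$ and zero-point $Z$, and mapping back:

```python
import numpy as np

def quantize(r, S, Z):
    """q = Z + round(r / S), clamped to the uint8 range [0, 255]."""
    q = np.round(r / S) + Z
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, S, Z):
    """r = S * (q - Z): recover the (approximate) real values."""
    return S * (q.astype(np.int32) - Z)

r = np.array([-1.0, 0.0, 0.5, 2.0])
S, Z = 3.0 / 255.0, 85            # example scale/zero-point covering roughly [-1, 2]
q = quantize(r, S, Z)
print(q, dequantize(q, S, Z))     # real 0 maps exactly to q == Z == 85
```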
This quantization formula is used at inference time. During training, the formula below is used instead:

$$\mathrm{clamp}(r; a, b) := \min(\max(r, a), b)$$

$$s(a, b, n) := \frac{b - a}{n - 1}$$

$$q(r; a, b, n) := \left\lfloor \frac{\mathrm{clamp}(r; a, b) - a}{s(a, b, n)} \right\rceil s(a, b, n) + a$$

Here $r$ is the real value; $a = \min(r)$ and $b = \max(r)$; $n$ is the number of quantization levels, determined by the data type (256 for int8); and $\lfloor \cdot \rceil$ denotes rounding to the nearest integer.
Let's analyze how these two quantization formulas differ. The biggest difference: with the first scheme, multiplying two quantized matrices directly loses the meaning of the original real-valued matrix product, which is exactly why the matrix multiplication algorithm of the next section has to be designed. The second scheme merely snaps a real number to the nearest representable value; the result is still a real number, so multiplication remains meaningful and the formula can be used during network training.
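To see why the training-time formula keeps multiplication meaningful, here is a minimal NumPy sketch (my own, not the paper's code) of $q(r; a, b, n)$: the output is still a real number, merely snapped onto one of the $n$ representable levels:

```python
import numpy as np

def fake_quantize(r, a, b, n=256):
    """q(r; a, b, n): clamp to [a, b], then snap to the nearest of n levels."""
    s = (b - a) / (n - 1)              # step size s(a, b, n)
    r_clamped = np.clip(r, a, b)       # clamp(r; a, b)
    return np.round((r_clamped - a) / s) * s + a

r = np.array([-0.7, 0.013, 0.42, 1.5])
print(fake_quantize(r, a=-1.0, b=1.0))  # outputs stay real-valued, on the grid
```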

Matrix multiplication

This part studies how to estimate the product of the real-valued matrices using the quantized matrices. Before introducing the mechanism, consider why this is needed at all: once a conv layer's weight and activation matrices are fully quantized, why not simply multiply the quantized matrices directly when computing weight × activation? Why design a special multiplication scheme?
The key reason is that weights and activations follow different distributions, i.e. the $a$ and $b$ in the formulas above differ between the two matrices. After quantization, the same integer can therefore stand for very different real values in the two matrices. Each quantization is only meaningful relative to its own original matrix; arithmetic performed directly between two quantized matrices has no real-valued interpretation.
So how is the real-valued product estimated from the quantized matrices?

Writing $r_\alpha = S_\alpha(q_\alpha - Z_\alpha)$ for $\alpha = 1, 2, 3$ and requiring $r_3 = r_1 r_2$, the paper derives, entry by entry:

$$q_3^{(i,k)} = Z_3 + M \sum_{j=1}^{N} \left(q_1^{(i,j)} - Z_1\right)\left(q_2^{(j,k)} - Z_2\right) \qquad (4)$$

$$M := \frac{S_1 S_2}{S_3} \qquad (5)$$
Here $M$ is a floating-point number, the only non-integer quantity in formula (4). How can $M$ be expressed using integers? The authors propose the following:

$$M = 2^{-n} M_0, \qquad M_0 \in [0.5, 1)$$

So $M$ can be recorded with two integers: the number of right-shift bits $n$, and a large integer no smaller than $2^{30}$ (the int32 fixed-point representation of $M_0$, used to preserve precision).
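A minimal Python sketch of that decomposition (my own illustration of the idea, not the TFLite/gemmlowp implementation):

```python
def decompose_M(M):
    """Split M in (0, 1) into (M0_int32, shift) with M ~= M0_int32 * 2**-31 * 2**-shift."""
    assert 0.0 < M < 1.0                    # in practice M = S1*S2/S3 falls in (0, 1)
    shift = 0
    while M < 0.5:                          # normalize M0 into [0.5, 1)
        M *= 2.0
        shift += 1
    M0_int32 = int(round(M * (1 << 31)))    # int32 fixed point: M0_int32 >= 2**30
    return M0_int32, shift

M0, shift = decompose_M(0.0072474)
print(M0, shift, M0 * 2.0**-31 * 2.0**-shift)  # reconstructs M to high precision
```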

Efficient handling of zero-points

This part explains how to optimize formula (4) to lower its computational cost. The core idea is to expand the inner sum so that the zero-point subtractions are pulled out of it:

$$\sum_{j=1}^{N}\left(q_1^{(i,j)} - Z_1\right)\left(q_2^{(j,k)} - Z_2\right) = \sum_j q_1^{(i,j)} q_2^{(j,k)} - Z_2 \sum_j q_1^{(i,j)} - Z_1 \sum_j q_2^{(j,k)} + N Z_1 Z_2$$

The first term is a plain integer matrix multiplication, and the remaining correction terms cost only $O(N^2)$, instead of performing $2N^3$ subtractions inside the inner loop.
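A small NumPy sketch (my own) verifying the expansion numerically: the naive form subtracts zero-points inside the $N^3$ inner loop, while the expanded form needs one integer matmul plus $O(N^2)$ row- and column-sum corrections:

```python
import numpy as np

N = 4
rng = np.random.default_rng(0)
q1 = rng.integers(0, 256, size=(N, N)).astype(np.int32)
q2 = rng.integers(0, 256, size=(N, N)).astype(np.int32)
Z1, Z2 = 120, 97

# Naive form: subtract the zero-points first, then multiply.
naive = (q1 - Z1) @ (q2 - Z2)

# Expanded form: one integer matmul plus O(N^2) corrections.
row_sums = q1.sum(axis=1, keepdims=True)   # sum_j q1[i, j]
col_sums = q2.sum(axis=0, keepdims=True)   # sum_j q2[j, k]
expanded = q1 @ q2 - Z2 * row_sums - Z1 * col_sums + N * Z1 * Z2

assert np.array_equal(naive, expanded)     # both give the same result
```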

### PyTorch Quantization Aware Training (QAT) with YOLOv8

#### Overview of QAT in PyTorch

Quantization Aware Training (QAT) simulates the effects of quantization during training, allowing a model to be trained directly for lower-precision inference. This mitigates the accuracy loss that typically occurs when converting floating-point models to integer representations such as INT8.

Implementing QAT for YOLOv8 in PyTorch involves several key steps.

#### Preparation Before Applying QAT to a YOLOv8 Model

Before applying QAT, make sure the environment includes the necessary libraries, such as `torch` and `torchvision`, in versions compatible with your hardware and software stack.

Also verify that the model architecture supports QAT. Some layers do not support direct quantization, so modifications may be needed before proceeding.

```python
import torch
from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # Load the pre-trained YOLOv8 nano model
```

#### Configuring the Model for QAT

To prepare the YOLOv8 model for QAT, configure it according to the guidelines in the official PyTorch quantization documentation. The configuration attaches observers that collect statistics on activations and weights during forward passes.

```python
# Prepare the model for QAT. Note: module fusion (e.g. Conv + BatchNorm + ReLU)
# must be done *before* prepare_qat, and fuse_modules expects sequences of
# fusable layer names, e.g. [['conv', 'bn', 'relu']], not single Conv2d modules.
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)
```

#### Fine-Tuning During the QAT Phase

During the fine-tuning phase under QAT, continue training while periodically validating against the validation set. Adjust the learning rate carefully: aggressive changes can hurt the convergence behavior observed before the quantization constraints were applied.

Monitor the float32 evaluation scores alongside their INT8 counterparts produced by the fake_quant modules inserted across the network.

```python
# Standard fine-tuning loop; train_one_epoch, validate, the data loaders,
# num_epochs and device are assumed to be defined elsewhere.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    train_one_epoch(model, criterion, optimizer, data_loader_train, device=device)
    validate(model, criterion, data_loader_val, device=device)
```

#### Exporting the QAT-Trained Model to ONNX

Once accuracy is satisfactory after a sufficient number of epochs, export the optimized graph together with the learned parameters for deployment on platforms with efficient reduced-bit-width arithmetic. The exported file records how each tensor is mapped between full-range floats and narrow-scaled integers at runtime, which keeps it interoperable across the toolchain from development through production.
```python
dummy_input = torch.randn(1, 3, 640, 640).to(device)
torch.onnx.export(
    model.eval(),
    dummy_input,
    'qat_yolov8.onnx',
    opset_version=13,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output']
)
```

#### Best Practices When Implementing QAT on YOLOv8

A few best practices make QAT much more likely to succeed on the resource-constrained edge devices (mobile phones, cameras, drones, etc.) where these models typically run:

- **Start simple**: begin with the smaller variants, e.g. the nano model, before moving to larger ones.
- **Validate early and often**: regularly check intermediate results to ensure no significant accuracy drop relative to the float baseline.
- **Adjust the learning rate carefully**: reduce step sizes gradually, especially near the end of training; abrupt changes can cause instability.
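One step the walkthrough above leaves implicit is materializing the actual INT8 model for evaluation. A minimal sketch using the standard `torch.quantization.convert` API (assuming the model was prepared with `prepare_qat` as shown earlier):

```python
# Replace fake-quant/observer modules with real quantized kernels,
# producing an INT8 model that can be evaluated directly.
int8_model = torch.quantization.convert(model.eval(), inplace=False)
```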