ref
https://github.com/NVIDIA/TensorRT-Model-Optimizer
https://nvidia.github.io/TensorRT-Model-Optimizer/index.html
Installation
pip install "nvidia-modelopt[all]" --extra-index-url https://pypi.nvidia.com
The GitHub repo above only contains usage examples for modelopt; the actual (partial) source code can be inspected in the modelopt package installed via pip.
Another, earlier quantization tool:
https://github.com/NVIDIA/TensorRT/tree/main/tools/pytorch-quantization
ResNet-50 experiment
First, run a simple ResNet experiment to get the full flow working: INT8 quantization, export to ONNX, and conversion to TensorRT.
from transformers import AutoImageProcessor, ResNetForImageClassification
import torch
from diffusers.utils import load_image
import modelopt.torch.quantization as mtq

# https://huggingface.co/microsoft/resnet-50
processor = AutoImageProcessor.from_pretrained("resnet_50")
model = ResNetForImageClassification.from_pretrained("resnet_50")

img_url = "cat1.jpg"
image = load_image(img_url).resize((512, 512))
inputs = processor(image, return_tensors="pt")
pixel_values = inputs["pixel_values"]
data_loader = [pixel_values]

def forward_loop(model):
    for batch in data_loader:
        model(batch)

# mtq.INT8_SMOOTHQUANT_CFG
# Quantize the model and perform calibration (PTQ)
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

torch.onnx.export(
    model,                        # model being run
    (pixel_values,),              # model input (or a tuple for multiple inputs)
    "resnet50_quant.onnx",        # where to save the model (can be a file or file-like object)
    export_params=True,           # store the trained parameter weights inside the model file
    opset_version=15,             # the ONNX version to export the model to
    do_constant_folding=True,     # whether to execute constant folding for optimization
    input_names=["pixel_values"], # the model's input names
    output_names=["output"],      # the model's output names
    dynamic_axes={"pixel_values": {0: "batch"}},
)
So the core flow is: define a calibration function that runs inference over the calibration data, pick a quantization config, and pass both to mtq.quantize(), which inserts quantization (Q/DQ) nodes into the model. Finally, export the model with the inserted quantization nodes to ONNX and convert that ONNX model with TensorRT's trtexec.
Use trtexec to build the engine and look at the performance numbers it reports. You can also use trt-engine-explorer to inspect the engine structure and per-layer performance.
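For reference, a minimal sketch of driving the conversion step from Python (file names follow the export above; only standard trtexec flags are used, and because of the dynamic batch axis you may additionally need --minShapes/--optShapes/--maxShapes):

import subprocess

# Build an INT8 engine from the Q/DQ ONNX model; --fp16 lets layers that were
# not quantized still run in half precision.
subprocess.run(
    [
        "trtexec",
        "--onnx=resnet50_quant.onnx",
        "--saveEngine=resnet50_quant_int8.plan",
        "--int8",
        "--fp16",
    ],
    check=True,
)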
Testing the above INT8 model on a V100:
Latency: min = 1.51245 ms, max = 1.69751 ms, mean = 1.54279 ms
FP16 model:
Latency: min = 0.972412 ms, max = 1.14563 ms, mean = 0.993796 ms
Why is INT8 slower than FP16 here? Looking at the ONNX model exported above, BatchNorm is not fused into the Conv layers, and the inserted quantize/dequantize nodes further block that fusion. In the FP16 model BatchNorm does get fused, which is why it performs better. As the figure of the quantized ONNX model shows, BN is not fused into Conv. According to trt-engine-explorer, the INT8 Conv kernels themselves are actually faster, but the standalone BN layers also take a significant amount of time.
One option is to fuse Conv+BN directly on the ONNX model first and then quantize, as sketched below.
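A minimal sketch of that idea, assuming onnx-simplifier (onnxsim) is used to fold BatchNorm into the preceding Conv before quantization; the file names here are hypothetical:

import onnx
from onnxsim import simplify

# Start from a plain (not yet quantized) ONNX export of the model.
model = onnx.load("resnet50_fp32.onnx")
# onnxsim folds BatchNorm into the preceding Conv (plus general constant folding).
fused_model, ok = simplify(model)
assert ok, "simplified model failed the consistency check"
onnx.save(fused_model, "resnet50_fused.onnx")
# The fused model can then be quantized (e.g. with modelopt.onnx.quantization.quantize,
# see the ONNX quantization section below), so that Q/DQ nodes no longer end up
# between Conv and BatchNorm.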
https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/onnx_ptq/README.md
Quantization config
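For context, the presets referenced in this note (mtq.INT8_DEFAULT_CFG, mtq.INT8_SMOOTHQUANT_CFG, and others such as mtq.FP8_DEFAULT_CFG) are plain dicts with a "quant_cfg" pattern table and an "algorithm" field; a quick sketch to inspect them (preset names beyond the two used above are assumptions):

import modelopt.torch.quantization as mtq

for name in ("INT8_DEFAULT_CFG", "INT8_SMOOTHQUANT_CFG", "FP8_DEFAULT_CFG"):
    cfg = getattr(mtq, name)
    print(name, "algorithm:", cfg["algorithm"])
    # "quant_cfg" maps wildcard name patterns to quantizer settings.
    for pattern, settings in list(cfg["quant_cfg"].items())[:3]:
        print("  ", pattern, settings)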
Disabling quantization for certain layers
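A minimal sketch of disabling quantization for particular layers, assuming the config-dict structure of the mtq presets; the wildcard pattern "*classifier*" is just an illustrative name, and model/forward_loop are reused from the ResNet example above:

import copy
import modelopt.torch.quantization as mtq

cfg = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
# Wildcard patterns in "quant_cfg" are matched against module/quantizer names;
# {"enable": False} switches quantization off for every match.
cfg["quant_cfg"]["*classifier*"] = {"enable": False}

model = mtq.quantize(model, cfg, forward_loop)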
Stable Diffusion quantization
Ref
https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/diffusers/quantization
ONNX quantization
Besides quantizing PyTorch models directly, modelopt can also quantize an exported ONNX model. In general, prefer quantizing the torch model and then exporting to ONNX; support for direct ONNX quantization is less complete, although INT8 and traditional CV models look reasonably well supported.
import onnx
import numpy as np
from modelopt.onnx.quantization import quantize

onnx_path = "matmul_fp16.onnx"
output_path = "matmul_fp16.onnx_quant.onnx"

x = np.random.randn(1, 4608, 3072).astype("float16")
calibration_data = x

quantize(
    onnx_path=onnx_path,
    calibration_data=calibration_data,
    calibration_method="minmax",
    calibration_cache_path=None,
    op_types_to_quantize=["MatMul"],
    op_types_to_exclude=None,
    nodes_to_quantize=None,
    nodes_to_exclude=None,
    use_external_data_format=False,
    keep_intermediate_files=False,
    output_path=output_path,
    verbose=False,
    quantize_mode="fp8",
    trt_plugins=None,
    trt_plugins_precision=None,
    high_precision_dtype="fp32",
    mha_accumulation_dtype="fp32",
    disable_mha_qdq=True,
)