Setting Up a YOLOv11 Environment on Jetson (9): TensorRT Export Parameters Explained
0. Overview of TensorRT Export Parameters
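Argument | Type | Default | Description |
---|---|---|---|
format | str | engine | Target format for the exported model, which determines compatibility with different deployment environments. |
imgsz | int or tuple | 640 | Desired image size for the model input. Can be an integer for square images or a tuple (height, width) for specific dimensions. |
half | bool | False | Enables FP16 (half-precision) quantization, reducing model size and potentially speeding up inference on supported hardware. |
int8 | bool | False | Activates INT8 quantization, further compressing the model and speeding up inference with minimal accuracy loss, primarily for edge devices. |
dynamic | bool | False | Allows dynamic input sizes, improving flexibility when handling varying image dimensions. |
simplify | bool | True | Simplifies the model graph with onnxslim, potentially improving performance and compatibility. |
workspace | float or None | None | Maximum workspace size in GiB for TensorRT optimizations, balancing memory usage and performance; use None to let TensorRT allocate automatically, up to the device maximum. |
nms | bool | False | Adds Non-Maximum Suppression (NMS), essential for accurate and efficient detection post-processing. |
batch | int | 1 | Batch inference size of the exported model, i.e. the maximum number of images the exported model will process concurrently in predict mode. |
data | str | coco8.yaml | Path to the dataset configuration file (default: coco8.yaml), essential for INT8 quantization calibration. |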
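The arguments in the table map onto keyword arguments of the Ultralytics Python API (`YOLO(...).export(...)`). The following is a minimal sketch, assuming the `ultralytics` package is installed on the Jetson. `TRT_EXPORT_DEFAULTS`, `merge_export_args`, and `export_engine` are illustrative names rather than library objects, and the defaults simply mirror the table.

```python
# Defaults copied from the table above (illustrative dict, not a library object).
TRT_EXPORT_DEFAULTS = {
    "format": "engine",
    "imgsz": 640,
    "half": False,
    "int8": False,
    "dynamic": False,
    "simplify": True,
    "workspace": None,
    "nms": False,
    "batch": 1,
    "data": "coco8.yaml",
}


def merge_export_args(**overrides):
    """Return the export arguments with user overrides applied to the defaults."""
    unknown = set(overrides) - set(TRT_EXPORT_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown export argument(s): {sorted(unknown)}")
    return {**TRT_EXPORT_DEFAULTS, **overrides}


def export_engine(weights="yolo11n.pt", **overrides):
    """Export `weights` to a TensorRT engine (needs ultralytics and a CUDA device)."""
    # Imported lazily so the helpers above can be used without ultralytics.
    from ultralytics import YOLO

    return YOLO(weights).export(**merge_export_args(**overrides))


if __name__ == "__main__":
    # Only prints the merged arguments; call export_engine(...) to run a real export.
    print(merge_export_args(imgsz=(480, 640), half=True))
```

Rejecting unknown keys in the helper is a design choice for this sketch: a misspelled argument fails immediately instead of after a long engine build.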
1. Setting TensorRT Export Parameters
1.1 The imgsz parameter
imgsz=(height, width): sets the inference size to match the camera's input resolution. The example below uses imgsz=480,640:
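Because imgsz is given as (height, width) while camera resolutions are usually quoted as width x height, the two are easy to swap. The sketch below shows the conversion. `imgsz_from_camera` is a hypothetical helper, and rounding each side up to a multiple of 32 is an assumption based on the usual maximum stride of YOLO detection models.

```python
import math


def imgsz_from_camera(width, height, stride=32):
    """Convert a camera resolution (width x height) to an imgsz tuple (height, width).

    Each side is rounded up to a multiple of `stride`; 32 is assumed here as the
    model's maximum stride.
    """
    def round_up(x):
        return math.ceil(x / stride) * stride

    return round_up(height), round_up(width)


if __name__ == "__main__":
    h, w = imgsz_from_camera(640, 480)    # 640x480 camera, both sides already fit
    print(f"imgsz={h},{w}")
    h, w = imgsz_from_camera(1280, 720)   # 720 is not a multiple of 32, so it is rounded up
    print(f"imgsz={h},{w}")
```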
1.1.1 Export
yolo export model=yolo11n.pt format=engine imgsz=480,640
Export output:
TensorRT: export success ✅ 354.2s, saved as 'yolo11n.engine' (12.7 MB)
Export complete (364.2s)
Results saved to /home/nx/Desktop/yolov11/ultralytics-main
Predict: yolo predict task=detect model=yolo11n.engine imgsz=480,640
Validate: yolo val task=detect model=yolo11n.engine imgsz=480,640 data=/usr/src/ultralytics/ultralytics/cfg/datasets/coco.yaml WARNING ⚠️ non-PyTorch val requires square images, 'imgsz=[480, 640]' will not work. Use export 'imgsz=640' if val is required.
Visualize: https://netron.app
💡 Learn more at https://docs.ultralytics.com/modes/export
1.1.2 Inference
yolo predict task=detect model=weights/imgsz_480x640/yolo11n.engine imgsz=480,640 source=videos/街道.mp4
Measured result: about 13.0 ms of inference time per frame
(pytorch) nx@nx-desktop:~/Desktop/yolov11/ultralytics-main$ yolo predict task=detect model=weights/imgsz_480x640/yolo11n.engine imgsz=480,640 source=videos/街道.mp4
Ultralytics 8.3.70 🚀 Python-3.8.20 torch-2.1.0a0+41361538.nv23.06 CUDA:0 (Xavier, 6833MiB)
Loading weights/imgsz_480x640/yolo11n.engine for TensorRT inference...
[02/07/2025-20:33:59] [TRT] [I] Loaded engine size: 12 MiB
[02/07/2025-20:34:06] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +343, GPU +323, now: CPU 691, GPU 3017 (MiB)
[02/07/2025-20:34:06] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12, now: CPU 0, GPU 12 (MiB)
[02/07/2025-20:34:06] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 678, GPU 3017 (MiB)
[02/07/2025-20:34:06] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +15, now: CPU 0, GPU 27 (MiB)
video 1/1 (frame 1/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 12.9ms
video 1/1 (frame 2/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.1ms
video 1/1 (frame 3/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
video 1/1 (frame 4/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
video 1/1 (frame 5/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
video 1/1 (frame 6/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.1ms
video 1/1 (frame 7/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
video 1/1 (frame 8/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.1ms
video 1/1 (frame 9/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
video 1/1 (frame 10/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
video 1/1 (frame 11/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 13.0ms
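Rather than reading the per-frame times off the console by eye, they can be averaged with a few lines of Python. This is a rough sketch that assumes each frame line ends with the inference time in the form `12.9ms`, as in the log above. `frame_latencies` is an illustrative helper and the path `videos/demo.mp4` in the sample is a placeholder.

```python
import re
import statistics

# Matches the trailing per-frame inference time, e.g. "... 480x640 1 car, 12.9ms"
LATENCY_RE = re.compile(r"(\d+(?:\.\d+)?)ms\s*$")


def frame_latencies(log_text):
    """Extract per-frame inference times (ms) from `yolo predict` console output."""
    times = []
    for line in log_text.splitlines():
        match = LATENCY_RE.search(line)
        # Only count lines that describe a video frame, not other log lines.
        if match and "(frame " in line:
            times.append(float(match.group(1)))
    return times


if __name__ == "__main__":
    sample = """\
video 1/1 (frame 1/601) videos/demo.mp4: 480x640 1 car, 12.9ms
video 1/1 (frame 2/601) videos/demo.mp4: 480x640 1 car, 13.1ms
video 1/1 (frame 3/601) videos/demo.mp4: 480x640 1 car, 13.0ms
"""
    times = frame_latencies(sample)
    print(f"frames={len(times)} mean={statistics.mean(times):.1f} ms")
```

Saving the console output to a file and passing its contents to `frame_latencies` gives the mean over all frames instead of the first few.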
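1.2 The half parameter
half=True: enables FP16 (half-precision) quantization, reducing model size and potentially speeding up inference on supported hardware.
1.2.0 INT8 vs FP16
Unless you need the lowest possible latency, FP16 is the recommended choice. The figures below were measured on JetPack 6.0 (L4T 36.3), Ubuntu 22.04.4 LTS.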
Precision | Eval test | Mean (ms) | Min / Max (ms) | mAPval 50(B) | mAPval 50-95(B) | Batch | Size (pixels) |
---|---|---|---|---|---|---|---|
FP32 | Predict | 6.11 | 6.10 / 6.29 | - | - | 8 | 640 |
FP32 | COCOval | 6.17 | - | 0.52 | 0.37 | 1 | 640 |
FP16 | Predict | 3.18 | 3.18 / 3.20 | - | - | 8 | 640 |
FP16 | COCOval | 3.19 | - | 0.52 | 0.37 | 1 | 640 |
INT8 | Predict | 2.30 | 2.29 / 2.35 | - | - | 8 | 640 |
INT8 | COCOval | 2.32 | - | 0.46 | 0.32 | 1 | 640 |
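- FP16 (half precision):
  - Lowers the precision of the model weights and activations directly (from FP32 to FP16); no calibration-based quantization step is involved.
  - Needs no calibration dataset. FP16 mode only has to be enabled when the TensorRT engine is built, and TensorRT performs the precision conversion automatically.
  - Advantage: faster computation (GPUs usually have higher FP16 throughput) and roughly 50% lower GPU memory usage.
- INT8 (integer quantization):
  - Requires a calibration dataset (typically a few hundred representative images) to collect activation statistics and build a dynamic-range calibration table.
  - Quantization must go through a calibrator (such as IInt8EntropyCalibrator2); without it the engine cannot infer correctly.
  - Advantage: a further inference speedup (suited to latency-sensitive scenarios), at the cost of possible accuracy loss.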
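To make the trade-off concrete, the speedup and accuracy change relative to FP32 can be worked out from the table. The sketch below uses the mean Predict latency and the COCOval mAP50-95 values from the table above; `RESULTS` and `compare_to_fp32` are illustrative names.

```python
# Mean Predict latency (ms) and COCOval mAP50-95, copied from the table above.
RESULTS = {
    "FP32": {"latency_ms": 6.11, "map50_95": 0.37},
    "FP16": {"latency_ms": 3.18, "map50_95": 0.37},
    "INT8": {"latency_ms": 2.30, "map50_95": 0.32},
}


def compare_to_fp32(results):
    """Return {precision: (speedup over FP32, absolute mAP50-95 change)}."""
    base = results["FP32"]
    return {
        name: (
            base["latency_ms"] / r["latency_ms"],
            r["map50_95"] - base["map50_95"],
        )
        for name, r in results.items()
    }


if __name__ == "__main__":
    for name, (speedup, d_map) in compare_to_fp32(RESULTS).items():
        print(f"{name}: {speedup:.2f}x vs FP32, mAP50-95 change {d_map:+.2f}")
```

On these figures FP16 runs at roughly 1.9x the FP32 speed with no change in mAP50-95, while INT8 reaches roughly 2.7x at a cost of 0.05 mAP50-95, which is consistent with the recommendation to prefer FP16 unless latency is critical.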
1.2.1 Export
yolo export model=yolo11n.pt format=engine imgsz=480,640 half=True
Export output:
TensorRT: export success ✅ 1039.7s, saved as 'yolo11n.engine' (7.1 MB)
Export complete (1049.0s)
Results saved to /home/nx/Desktop/yolov11/ultralytics-main
Predict: yolo predict task=detect model=yolo11n.engine imgsz=480,640 half
Validate: yolo val task=detect model=yolo11n.engine imgsz=480,640 data=/usr/src/ultralytics/ultralytics/cfg/datasets/coco.yaml half WARNING ⚠️ non-PyTorch val requires square images, 'imgsz=[480, 640]' will not work. Use export 'imgsz=640' if val is required.
Visualize: https://netron.app
💡 Learn more at https://docs.ultralytics.com/modes/export
1.2.2 Inference
yolo predict task=detect model=weights/half/yolo11n.engine imgsz=480,640 source=videos/街道.mp4
Measured result: about 7.3 ms of inference time per frame
(pytorch) nx@nx-desktop:~/Desktop/yolov11/ultralytics-main$ yolo predict task=detect model=weights/half/yolo11n.engine imgsz=480,640 source=videos/街道.mp4
Ultralytics 8.3.70 🚀 Python-3.8.20 torch-2.1.0a0+41361538.nv23.06 CUDA:0 (Xavier, 6833MiB)
Loading weights/half/yolo11n.engine for TensorRT inference...
[02/07/2025-21:08:12] [TRT] [I] Loaded engine size: 7 MiB
[02/07/2025-21:08:14] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +261, GPU +426, now: CPU 609, GPU 3185 (MiB)
[02/07/2025-21:08:14] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +82, GPU +151, now: CPU 691, GPU 3336 (MiB)
[02/07/2025-21:08:14] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +5, now: CPU 0, GPU 5 (MiB)
[02/07/2025-21:08:14] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +13, now: CPU 684, GPU 3333 (MiB)
[02/07/2025-21:08:14] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +4, now: CPU 684, GPU 3337 (MiB)
[02/07/2025-21:08:14] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 14 (MiB)
video 1/1 (frame 1/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 9.1ms
video 1/1 (frame 2/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 8.0ms
video 1/1 (frame 3/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 4/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 5/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.5ms
video 1/1 (frame 6/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 7/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 8/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 9/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 10/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 11/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.7ms
video 1/1 (frame 12/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 8.9ms
video 1/1 (frame 13/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 14/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 15/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 16/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 17/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 18/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 19/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
video 1/1 (frame 20/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.4ms
video 1/1 (frame 21/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 7.3ms
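1.3 The dla parameter
device="dla:0": enables the NVIDIA Deep Learning Accelerator (DLA). By offloading work from the GPU (freeing it for more intensive processes), DLA lets the model run at lower power while maintaining high throughput (in short, trading speed for energy savings).
Note: dla must be used together with FP16 or INT8.
Ultralytics official documentation: Using the NVIDIA Deep Learning Accelerator (DLA)
When exporting with DLA, some layers may not be supported on the DLA and will fall back to the GPU for execution. This fallback adds latency and affects overall inference performance. Compared with TensorRT running entirely on the GPU, the main purpose of DLA is therefore not to reduce inference latency but to improve throughput and energy efficiency.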
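Since an engine build on a Jetson can take many minutes, it can be worth checking the DLA precision requirement before starting an export. The following is a small sketch of such a pre-flight check. `check_dla_args` is a hypothetical helper written for this article, not an Ultralytics function.

```python
def check_dla_args(device, half=False, int8=False):
    """Return True if `device` targets a DLA core, False otherwise.

    Raises ValueError when a DLA device is requested without FP16 or INT8,
    mirroring the note above. Hypothetical helper, not part of Ultralytics.
    """
    uses_dla = str(device).startswith("dla")
    if uses_dla and not (half or int8):
        raise ValueError(f"device={device!r} requires half=True or int8=True")
    return uses_dla


if __name__ == "__main__":
    print(check_dla_args("dla:0", half=True))  # DLA with FP16 passes the check
    print(check_dla_args(0))                   # plain GPU export needs no precision flag
```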
1.3.1 Export
yolo export model=yolo11n.pt format=engine imgsz=480,640 half=True device="dla:0"
Export output:
TensorRT: export success ✅ 119.3s, saved as 'yolo11n.engine' (7.3 MB)
Export complete (129.7s)
Results saved to /home/nx/Desktop/yolov11/ultralytics-main
Predict: yolo predict task=detect model=yolo11n.engine imgsz=480,640 half
Validate: yolo val task=detect model=yolo11n.engine imgsz=480,640 data=/usr/src/ultralytics/ultralytics/cfg/datasets/coco.yaml half WARNING ⚠️ non-PyTorch val requires square images, 'imgsz=[480, 640]' will not work. Use export 'imgsz=640' if val is required.
Visualize: https://netron.app
💡 Learn more at https://docs.ultralytics.com/modes/export
1.3.2 Inference
yolo predict task=detect model=weights/dla/yolo11n.engine imgsz=480,640 source=videos/街道.mp4
Measured result: about 27.0 ms of inference time per frame
(pytorch) nx@nx-desktop:~/Desktop/yolov11/ultralytics-main$ yolo predict task=detect model=weights/dla/yolo11n.engine imgsz=480,640 source=videos/街道.mp4
Ultralytics 8.3.70 🚀 Python-3.8.20 torch-2.1.0a0+41361538.nv23.06 CUDA:0 (Xavier, 6833MiB)
Loading weights/dla/yolo11n.engine for TensorRT inference...
[02/07/2025-22:12:54] [TRT] [I] Loaded engine size: 7 MiB
[02/07/2025-22:12:56] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +261, GPU +425, now: CPU 606, GPU 3086 (MiB)
[02/07/2025-22:12:56] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +7, GPU +0, now: CPU 7, GPU 0 (MiB)
[02/07/2025-22:12:56] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +26, now: CPU 598, GPU 3112 (MiB)
[02/07/2025-22:12:56] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +35, now: CPU 7, GPU 35 (MiB)
video 1/1 (frame 1/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 27.8ms
video 1/1 (frame 2/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 28.0ms
video 1/1 (frame 3/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 27.5ms
video 1/1 (frame 4/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 27.8ms
video 1/1 (frame 5/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 27.9ms
video 1/1 (frame 6/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.3ms
video 1/1 (frame 7/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.3ms
video 1/1 (frame 8/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.4ms
video 1/1 (frame 9/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.4ms
video 1/1 (frame 10/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.3ms
video 1/1 (frame 11/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.3ms
video 1/1 (frame 12/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.5ms
video 1/1 (frame 13/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.9ms
video 1/1 (frame 14/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.8ms
video 1/1 (frame 15/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 26.9ms
video 1/1 (frame 16/601) /home/nx/Desktop/yolov11/ultralytics-main/videos/街道.mp4: 480x640 1 car, 27.0ms
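1.3.3 Conclusion
With DLA, inference became slower (about 27.0 ms per frame versus 7.3 ms for the FP16 engine on the GPU), yet power consumption did not drop noticeably.
- Inference with the engine exported with DLA
- Inference with the engine exported without DLA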
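The three measurements from sections 1.1.2, 1.2.2, and 1.3.2 can be put side by side as frame rates. The sketch below converts the typical per-frame inference times into an upper bound on FPS. It assumes the first engine is FP32 (it was exported without half=True) and ignores pre-processing and post-processing, so real pipeline throughput will be lower. `MEASURED_MS` and `to_fps` are illustrative names.

```python
# Typical per-frame inference times (ms) reported in sections 1.1.2, 1.2.2 and 1.3.2.
MEASURED_MS = {
    "FP32, GPU": 13.0,
    "FP16, GPU": 7.3,
    "FP16, DLA": 27.0,
}


def to_fps(latency_ms):
    """Upper bound on frames per second if inference were the only cost."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    baseline = MEASURED_MS["FP16, GPU"]
    for name, ms in MEASURED_MS.items():
        ratio = ms / baseline
        print(f"{name}: {ms:.1f} ms -> {to_fps(ms):.0f} FPS, {ratio:.2f}x the FP16 GPU latency")
```

On these numbers the DLA engine takes about 3.7 times as long per frame as the FP16 GPU engine, so it only pays off if the power saving is substantial, which was not observed here.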
2. Summary
For a balance between inference speed and accuracy, export with imgsz set to the camera's input resolution and half=True:
yolo export model=yolo11n.pt format=engine imgsz=480,640 half=True