Preface
If the GPU on a Jetson platform has enough memory, it can run several models concurrently, and the inference time barely increases compared with running a single model, so real-time performance is preserved. In contrast, today's mainstream NPU solutions such as Rockchip's (RK3588, RK3568, etc.) do not support multi-model concurrent inference in the strict sense, i.e. true hardware-level parallel computation.
So if you want genuine multi-model concurrent inference, a GPU is still the platform to choose.
This article uses a Jetson Orin NX 16GB as the test platform; its memory is enough to run four YOLO models quantized to FP16/INT8 concurrently.
(Besides Jetson, the workflow and code in this article also apply to any computer with an NVIDIA GPU.)
Preparation
First, prepare a device with an NVIDIA GPU. The hardware platform used in this article:
https://jp.seeedstudio.com/reComputer-J4012-w-o-power-adapter-p-5628.html

On Jetson, the preinstalled JetPack already ships with CUDA, TensorRT, OpenCV and the other tools used to accelerate YOLO inference.
We still need to install pycuda manually. In a terminal, run:
sudo apt-get install python3-pip
pip3 install Cython
pip3 install pycuda --user
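As a quick sanity check (optional), you can confirm that both pycuda and the TensorRT Python bindings shipped with JetPack import correctly:
python3 -c "import tensorrt as trt, pycuda.autoinit; print(trt.__version__)"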
Next, convert the .onnx model you want to run into TensorRT's .engine format; this article uses yolov8n.onnx as the example. After TensorRT is installed, the conversion tool normally lives under /usr/src/tensorrt/bin; search for the executable named trtexec and you will find it.
Taking FP16 quantization as the example, generate an engine with a static batch size:
./trtexec --onnx=<onnx_file> \
--explicitBatch \
--saveEngine=<tensorRT_engine_file> \
--workspace=<size_in_megabytes> \
--fp16
--workspace: set it according to your available GPU memory; 2048, 4096, or higher.
--explicitBatch: build in explicit-batch mode, i.e. the batch size of every input tensor is fixed at build time (a static input). You can check in Netron whether your ONNX model has static or dynamic inputs (or use the small snippet after the dynamic-shape command below).
--onnx: path to the input ONNX model.
--saveEngine: path of the output engine file.
--fp16: can be replaced with --int8 or another precision.
To generate an engine with a dynamic batch size, use the command below (do not use it for models with static inputs):
./trtexec --onnx=<onnx_file> \
--minShapes=input:<shape_of_min_batch> \
--optShapes=input:<shape_of_opt_batch> \
--maxShapes=input:<shape_of_max_batch> \
--workspace=<size_in_megabytes> \
--saveEngine=<engine_file> \
--fp16
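Besides Netron, you can also print the input shapes with a small Python snippet (a minimal sketch, assuming the onnx Python package is installed and the model file is yolov8n.onnx); a symbolic name such as "batch" means a dynamic axis, a number means a static one:
import onnx

model = onnx.load("yolov8n.onnx")
for inp in model.graph.input:
    # dim_param is a symbolic (dynamic) axis name, dim_value a fixed size
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)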
The official yolov8 pretrained models use static inputs. If you are not sure how to convert a .pt model to .onnx, you can refer to this article:
https://docs.pytorch.org/tutorials/beginner/onnx/export_simple_model_to_onnx_tutorial.html
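For the official YOLOv8 weights there is also a one-line export via the ultralytics package (a minimal sketch, assuming ultralytics is installed and yolov8n.pt is in the current directory; the same call works for yolov8n-pose.pt):
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# writes yolov8n.onnx next to the .pt file; opset can be lowered if needed
model.export(format="onnx", opset=12)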
If the exported ONNX file's version is too high and you want to switch the ONNX file version and its opset version, you can use this script:
https://github.com/jjjadand/ONNX_Downgrade
This article uses yolov8-det and yolov8-pose, each quantized to both FP16 and INT8, which yields four .engine files. The conversion must be performed on the deployment device itself, because the hardware environment that loads an .engine has to match the one that built it exactly. In the end we get two FP16 and two INT8 models in TensorRT format.
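For reference, the four conversions would look roughly like this (a sketch: the .onnx file names are assumptions, the .engine names match what the code below loads; note that --int8 without a calibration cache makes trtexec fall back to dummy scales, so accuracy can drop):
./trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine --fp16
./trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n-int8.engine --int8
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose.engine --fp16
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-int8.engine --int8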

Real-time concurrent inference of four YOLO models
This article uses a USB camera as input. Run the command below in a terminal to see which device node the camera is mapped to:
ls /dev/video*
Check the frame rates and resolutions the camera supports, taking the video0 device as an example:
sudo apt install v4l-utils
v4l2-ctl -d /dev/video0 --list-formats-ext
Adjust the configuration in the code below to match your camera's parameters, and change the .engine file names to your own.
import cv2
import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
# ==== Configuration ====
cam_width = 640
cam_len = 480
MODEL_INPUT_SIZE = 640
CONF_THRESHOLD = 0.3
CONF_THRESHOLD_POSE = 0.1
POSE_KPT_THRESHOLD = 0.2
NMS_THRESHOLD = 0.1
# Not enabled for now: GStreamer pipeline (tuned for Jetson); modify it to match your camera
# gst_str = (
#     "v4l2src device=/dev/video0 ! "
#     "video/x-raw, width=640, height=480, framerate=30/1 ! "
#     "videoconvert ! "
#     "video/x-raw, format=BGR ! appsink"
# )
# Class names (for detection)
CLASSES = ["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck",
"boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench",
"bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra",
"giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",
"skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup",
"fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
"broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch",
"potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
"refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier",
"toothbrush"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))
# COCO pose keypoint skeleton connections (17 keypoints)
SKELETON = [
(0,1),(0,2),(1,3),(2,4),
(5,6),(5,7),(7,9),(6,8),(8,10),
(5,11),(6,12),(11,12),(11,13),(13,15),(12,14),(14,16)
]
# ==== Utility functions ====
def load_engine(path):
    logger = trt.Logger(trt.Logger.WARNING)
    with open(path, "rb") as f, trt.Runtime(logger) as rt:
        return rt.deserialize_cuda_engine(f.read())

def allocate_buffers(engine):
    # Allocate pinned host buffers and device buffers for every binding,
    # plus a dedicated CUDA stream for this engine.
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for binding in engine:
        shape = engine.get_binding_shape(binding)
        size = trt.volume(shape)
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        (inputs if engine.binding_is_input(binding) else outputs).append((host_mem, device_mem))
    return inputs[0], outputs[0], bindings, stream
def preprocess(img):
    h0, w0 = img.shape[:2]
    r = min(MODEL_INPUT_SIZE / w0, MODEL_INPUT_SIZE / h0)
    w, h = int(w0 * r), int(h0 * r)
    resized = cv2.resize(img, (w, h), interpolation=cv2.INTER_LINEAR)
    padded = np.full((MODEL_INPUT_SIZE, MODEL_INPUT_SIZE, 3), 114, dtype=np.uint8)
    padded[:h, :w] = resized
    blob = cv2.dnn.blobFromImage(padded, 1/255.0, (MODEL_INPUT_SIZE, MODEL_INPUT_SIZE), swapRB=False)
    return padded, blob, {"scale": r, "original_shape": (h0, w0)}
# def postprocess_det(output, meta):
# # output: np.array (anchors,84)
# scale = meta["scale"]
# # extract boxes and scores
# boxes = output[:, :4]
# scores = output[:,5:]
# cls_ids = np.argmax(scores, axis=1)
# confs = scores[np.arange(len(scores)), cls_ids]
# keep = confs>CONF_THRESHOLD
# boxes, confs, cls_ids = boxes[keep], confs[keep], cls_ids[keep]
# if boxes.shape[0]==0: return []
# # cx,cy,w,h → x1,y1,x2,y2
# cx,cy,w,h = boxes[:,0], boxes[:,1], boxes[:,2], boxes[:,3]
# x1 = ((cx - w/2)/scale).astype(int)
# y1 = ((cy - h/2)/scale).astype(int)
# x2 = ((cx + w/2)/scale).astype(int)
# y2 = ((cy + h/2)/scale).astype(int)
# xyxy = np.stack([x1,y1,x2,y2],axis=1)
# # NMS
# idxs = nms_numpy(xyxy, confs, NMS_THRESHOLD)
# return [(xyxy[i], confs[i], cls_ids[i]) for i in idxs]
def postprocess_det(output, meta):
    h_orig, w_orig = meta["original_shape"]
    scale = meta["scale"]
    boxes_raw = output[:, :4]   # (N, 4)
    scores = output[:, 4:]      # (N, num_classes)
    #print("scores shape:", scores.shape)
    #for i in range(5):
    #    print(f"cls{i} max score: {np.max(scores[:, i])}")
    cls_ids = np.argmax(scores, axis=1)                 # (N,)
    confs = scores[np.arange(len(output)), cls_ids]
    keep = confs > CONF_THRESHOLD
    boxes_raw = boxes_raw[keep]
    confs = confs[keep]
    cls_ids = cls_ids[keep]
    if len(boxes_raw) == 0:
        return []
    # (cx,cy,w,h) -> (x1,y1,x2,y2); only the resize scale is undone,
    # padding needs no compensation because the letterbox pads bottom/right only
    cx, cy, w, h = boxes_raw[:, 0], boxes_raw[:, 1], boxes_raw[:, 2], boxes_raw[:, 3]
    x1 = ((cx - w / 2) / scale).astype(int)
    y1 = ((cy - h / 2) / scale).astype(int)
    x2 = ((cx + w / 2) / scale).astype(int)
    y2 = ((cy + h / 2) / scale).astype(int)
    boxes = np.stack([x1, y1, x2, y2], axis=1)
    # NMS (cv2.dnn.NMSBoxes expects boxes in (x, y, w, h) format)
    nms_boxes = np.stack([x1, y1, x2 - x1, y2 - y1], axis=1)
    indices = cv2.dnn.NMSBoxes(nms_boxes.tolist(), confs.tolist(), CONF_THRESHOLD, NMS_THRESHOLD)
    results = []
    if len(indices) > 0:
        for i in indices.flatten():
            results.append((boxes[i], confs[i], cls_ids[i]))
    return results
def nms_numpy(boxes, scores, iou_thres=0.5):
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1)
        h = np.maximum(0.0, yy2 - yy1)
        inter = w * h
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        inds = np.where(iou <= iou_thres)[0]
        order = order[inds + 1]
    return keep
def postprocess_pose(output, meta):
    """
    Vectorized CPU post-processing (with NMS) for YOLOv8-Pose output.
    output: np.ndarray of shape (N, 5 + 3*K)
    meta: dict containing 'scale'
    Returns: list of persons, each a list of K entries that are (x, y) or None
    """
    scale = meta["scale"]
    N, C = output.shape
    K = (C - 5) // 3
    if N == 0:
        return []
    # --- 1. Extract bbox and objectness ---
    cx = output[:, 0]
    cy = output[:, 1]
    w = output[:, 2]
    h = output[:, 3]
    obj_conf = output[:, 4]
    # Convert to x1,y1,x2,y2 in original-image coordinates
    x1 = (cx - w * 0.5) / scale
    y1 = (cy - h * 0.5) / scale
    x2 = (cx + w * 0.5) / scale
    y2 = (cy + h * 0.5) / scale
    boxes = np.stack([x1, y1, x2, y2], axis=1)
    # --- 2. Filter by objectness threshold ---
    mask0 = obj_conf >= CONF_THRESHOLD_POSE
    if not np.any(mask0):
        return []
    boxes = boxes[mask0]
    scores = obj_conf[mask0]
    selected = output[mask0]            # shape = (M, 5+3K)
    # --- 3. NMS to remove duplicates ---
    keep = nms_numpy(boxes, scores, iou_thres=NMS_THRESHOLD)
    if len(keep) == 0:
        return []
    sel = selected[keep]                # shape = (M2, 5+3K)
    # --- 4. Vectorized keypoint post-processing ---
    M2 = sel.shape[0]
    kp = sel[:, 5:].reshape(M2, K, 3)   # (M2, K, 3)
    x_rel = kp[:, :, 0]
    y_rel = kp[:, :, 1]
    c_rel = kp[:, :, 2]
    valid = c_rel >= POSE_KPT_THRESHOLD
    xs = (x_rel / scale).astype(np.int32)
    ys = (y_rel / scale).astype(np.int32)
    xs[~valid] = -1
    ys[~valid] = -1
    # --- 5. Build the result list ---
    persons = [
        [(int(xs[i, j]), int(ys[i, j])) if valid[i, j] else None
         for j in range(K)]
        for i in range(M2)
    ]
    return persons
def main():
    # 1. Load the 4 TRT engines
    det_engine = load_engine("yolov8n.engine")
    det2_engine = load_engine("yolov8n-int8.engine")
    pose_engine = load_engine("yolov8n-pose.engine")
    pose2_engine = load_engine("yolov8n-pose-int8.engine")
    # 2. Create execution contexts and buffers (each engine gets its own context and stream)
    det_ctx = det_engine.create_execution_context()
    det2_ctx = det2_engine.create_execution_context()
    pose_ctx = pose_engine.create_execution_context()
    pose2_ctx = pose2_engine.create_execution_context()
    (h_din, d_din), (h_dout, d_dout), det_bind, det_stream = allocate_buffers(det_engine)
    (h_din2, d_din2), (h_dout2, d_dout2), det2_bind, det2_stream = allocate_buffers(det2_engine)
    (h_pin, d_pin), (h_pout, d_pout), pose_bind, pose_stream = allocate_buffers(pose_engine)
    (h_pin2, d_pin2), (h_pout2, d_pout2), pose2_bind, pose2_stream = allocate_buffers(pose2_engine)
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, cam_width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, cam_len)
    #cap = cv2.VideoCapture(gst_str, cv2.CAP_GSTREAMER)
    if not cap.isOpened():
        print("[ERROR] Camera pipeline failed to open. Check GStreamer string or camera device.")
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        start_time = time.time()        # frame start time
        # Preprocessing
        start_pre = time.time()         # preprocessing start time
        img_padded, blob, meta = preprocess(frame)
        np.copyto(h_din, blob.ravel()); np.copyto(h_pin, blob.ravel())
        np.copyto(h_din2, blob.ravel()); np.copyto(h_pin2, blob.ravel())
        over_pre = time.time()
        print("pre time:", over_pre - start_pre)
        # 4-way concurrent inference, one CUDA stream per model
        start_infer = time.time()       # inference start time
        cuda.memcpy_htod_async(d_din, h_din, det_stream)
        det_ctx.execute_async_v2(det_bind, stream_handle=det_stream.handle)
        cuda.memcpy_dtoh_async(h_dout, d_dout, det_stream)
        cuda.memcpy_htod_async(d_pin, h_pin, pose_stream)
        pose_ctx.execute_async_v2(pose_bind, stream_handle=pose_stream.handle)
        cuda.memcpy_dtoh_async(h_pout, d_pout, pose_stream)
        cuda.memcpy_htod_async(d_din2, h_din2, det2_stream)
        det2_ctx.execute_async_v2(det2_bind, stream_handle=det2_stream.handle)
        cuda.memcpy_dtoh_async(h_dout2, d_dout2, det2_stream)
        cuda.memcpy_htod_async(d_pin2, h_pin2, pose2_stream)
        pose2_ctx.execute_async_v2(pose2_bind, stream_handle=pose2_stream.handle)
        cuda.memcpy_dtoh_async(h_pout2, d_pout2, pose2_stream)
        det_stream.synchronize()
        pose_stream.synchronize()
        det2_stream.synchronize()
        pose2_stream.synchronize()
        over_infer = time.time()
        print("infer time:", over_infer - start_infer)
        # 5. Parse the results
        start_post = time.time()        # post-processing start time
        # detection models x2 post-processing
        anchors = h_dout.size // 84
        det_out = h_dout.reshape(1, 84, -1).transpose(0, 2, 1).squeeze(0)
        anchors2 = h_dout2.size // 84
        det2_out = h_dout2.reshape(1, 84, -1).transpose(0, 2, 1).squeeze(0)
        #print(det_out)
        dets = postprocess_det(det_out, meta)
        dets2 = postprocess_det(det2_out, meta)
        #print(dets)
        # pose models x2 post-processing
        pose_ch = h_pout.size // anchors
        pose_out = h_pout.reshape(1, pose_ch, -1).transpose(0, 2, 1).squeeze(0)
        pose2_ch = h_pout2.size // anchors2
        pose2_out = h_pout2.reshape(1, pose2_ch, -1).transpose(0, 2, 1).squeeze(0)
        people = postprocess_pose(pose_out, meta)
        people2 = postprocess_pose(pose2_out, meta)
        # 6. Visualization: detection + pose
        # 6.1 Detection boxes (FP16 model, then INT8 model)
        for box, conf, cid in dets:
            color = COLORS[cid]; x1, y1, x2, y2 = box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        for box, conf, cid in dets2:
            color = COLORS[cid]; x1, y1, x2, y2 = box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
        # 6.2 Pose keypoints + skeleton
        for pts in people:
            # draw keypoints
            for p in pts:
                if p: cv2.circle(frame, p, 3, (0, 255, 255), -1)
            # draw skeleton lines
            for i, j in SKELETON:
                if pts[i] and pts[j]:
                    cv2.line(frame, pts[i], pts[j], (0, 200, 0), 2)
        for pts in people2:
            # draw keypoints
            for p in pts:
                if p: cv2.circle(frame, p, 3, (0, 155, 155), -1)
            # draw skeleton lines
            for i, j in SKELETON:
                if pts[i] and pts[j]:
                    cv2.line(frame, pts[i], pts[j], (100, 100, 0), 2)
        over_post = time.time()
        print("post time:", over_post - start_post)
        # === Compute FPS ===
        elapsed = time.time() - start_time
        print("all time:", elapsed)     # total time for this frame
        fps = 1.0 / elapsed if elapsed > 0 else 0
        cv2.putText(frame, f"FPS: {fps:.2f}", (10, 25),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0), 2)
        cv2.imshow("Det+Pose", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
    del det_ctx
    del pose_ctx
    del det2_ctx
    del pose2_ctx
    print("[INFO] All TensorRT contexts released.")

if __name__ == "__main__":
    main()
- The code keeps a GStreamer camera pipeline commented out, because in my tests it was not noticeably faster than OpenCV's default camera capture API.
- The code is written around the official yolov8 models, so the CLASSES and SKELETON lists follow the official pretraining data. If you use a YOLO model trained on your own data, replace these two lists with your own detection classes and skeleton.
The final test result: running four yolov8 models concurrently, and counting preprocessing + inference + post-processing, the frame rate holds steady at roughly 35 FPS.
Counting only the inference time, it is about 50 FPS.

- Running a single model, the frame rate including preprocessing + inference + post-processing is around 80 FPS.
- Note that concurrent inference of multiple models requires enough GPU memory; if memory is insufficient, your only options are to quantize the models or to reduce the number of models running concurrently.
Summary
When multiple models run concurrently, preprocessing and post-processing also take a noticeable share of the time. If you do not want to hand-write CUDA kernels to accelerate them on the GPU, you can rely on numpy's vectorized (SIMD) operations on the CPU; but when many models run concurrently, it is still better to implement the preprocessing and post-processing in CUDA.