Modifying the YOLOv5 Detect Layer to Improve Triton Inference Serving Performance

This post is about optimizing YOLOv5 performance in inference mode. The default detect layer's output takes a long time to process, which drags down inference-serving performance. The model is converted to a TensorRT engine and deployed on Triton Inference Server for testing; after reworking the detect layer, throughput and other metrics improve markedly across batch sizes.


In inference mode, the default YOLOv5 detect layer outputs a tensor of shape [batches, 25200, 85]. When deployed on NVIDIA Triton, an output tensor this large inflates output-processing time and causes requests to back up in the queue. In particular, when the Triton server and client are not on the same machine and shared memory cannot be used, transferring the output to the client over the network adds further delay and hurts serving performance. (Related code link)
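To put the cost in concrete terms, here is a quick back-of-the-envelope calculation of the default output's per-image payload (FP32, 4 bytes per element):

# Per-image size of the default [25200, 85] detect output in FP32
elems = 25200 * 85                       # 2,142,000 elements
print(f"{elems * 4 / 2**20:.1f} MiB")    # ~8.2 MiB to serialize and transfer per image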


1. Test method

Convert the model to a TensorRT engine and deploy it on Triton Inference Server with an instance group of count 1 and kind GPU, then run performance tests from another machine with Triton's perf_analyzer tool.

  • Convert yolov5s.pt to ONNX format
  • Convert the ONNX model to a TensorRT engine

    /usr/src/tensorrt/bin/trtexec  \
    --onnx=yolov5s.onnx \
    --minShapes=images:1x3x640x640 \
    --optShapes=images:8x3x640x640 \
    --maxShapes=images:32x3x640x640 \
    --workspace=4096 \
    --saveEngine=yolov5s.engine \
    --shapes=images:1x3x640x640 \
    --verbose \
    --fp16 \
    > result-FP16.txt
    
  • Deploy on Triton Inference Server

    Upload the model to the model repository path configured on the Triton server, and write the model serving config (a sample config.pbtxt is sketched after this list).

  • Generate real input data

    python generate_input.py --input_images <image_path> --output_file <real_data>.json
    
  • Run the performance test with the real data

    perf_analyzer -m <triton_model_name> -b 1 --input-data <real_data>.json --concurrency-range 1:10 --measurement-interval 10000 -u <triton_server_endpoint> -i grpc -f <triton_model_name>.csv
    
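For the deployment step above, a minimal config.pbtxt might look like the sketch below. The model name and tensor names are assumptions based on the default yolov5 export ("images"/"output"), and the dims match the trtexec shapes used here; this is not the exact config from the original post:

# model_repository/yolov5s/config.pbtxt
# (the engine file goes to model_repository/yolov5s/1/model.plan)
name: "yolov5s"
platform: "tensorrt_plan"
max_batch_size: 32            # matches --maxShapes above
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000, 6 ]         # [25200, 85] with the unmodified detect layer
  }
]
instance_group [
  { count: 1, kind: KIND_GPU }
]

Because max_batch_size is set, Triton prepends the batch dimension itself, so dims omit it.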

2. Performance before the modification

Below are the performance test results for the YOLOv5 TensorRT engine with the default detect layer, deployed on Triton. With the default detect layer, a large share of the time is spent waiting in the queue (Server Queue) and processing the output (Server Compute Output), and throughput does not even reach 1 infer/sec.

Except for throughput, all metrics are in us; Client Send and Client Recv are the times to serialize and deserialize the gRPC data, respectively.

(Table: Concurrency | Inferences/Second | Client Send | Network+Server Send/Recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | p90 latency; measured at concurrency 1-7, with throughput between roughly 0.5 and 0.8 infer/sec.)

One way to rework this, then, is to slim down the output: coarsely filter the bounding boxes by confidence before they are fed to NMS. Following the detect-layer handling in tensorrtx, the output is reshaped into a tensor of shape [batches, num_bboxes, 6], where num_bboxes = 1000.

Each row holds 6 values, [cx, cy, w, h, conf, cls_id], where conf = obj_conf * cls_prob.
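With this layout, the per-image payload shrinks from about 8.2 MiB to 1000 × 6 × 4 bytes ≈ 24 KB, roughly a 350x reduction, and the client only has to threshold and NMS 1000 candidates. Below is a minimal sketch of that client-side post-processing under the [cx, cy, w, h, conf, cls_id] layout; the threshold values are illustrative defaults, not numbers from this post:

import torch
import torchvision

def postprocess(output, conf_thres=0.25, iou_thres=0.45):
    # output: float tensor of shape [batch, 1000, 6], rows = [cx, cy, w, h, conf, cls_id]
    results = []
    for det in output:                            # det: [1000, 6]
        det = det[det[:, 4] > conf_thres]         # coarse confidence filter
        if det.shape[0] == 0:
            results.append(det)
            continue
        boxes = torch.empty_like(det[:, :4])      # convert cxcywh -> xyxy for NMS
        boxes[:, 0] = det[:, 0] - det[:, 2] / 2
        boxes[:, 1] = det[:, 1] - det[:, 3] / 2
        boxes[:, 2] = det[:, 0] + det[:, 2] / 2
        boxes[:, 3] = det[:, 1] + det[:, 3] / 2
        offsets = det[:, 5:6] * 4096              # offset boxes by class id so NMS is per-class
        keep = torchvision.ops.nms(boxes + offsets, det[:, 4], iou_thres)
        results.append(torch.cat((boxes[keep], det[keep, 4:6]), 1))
    return results

The class-id offset is the same trick yolov5's own non_max_suppression uses to get class-aware results out of a class-agnostic NMS kernel.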


3. Implementation steps

3.1 Clone the ultralytics yolov5 repo

git clone -b v6.1 https://github.com/ultralytics/yolov5.git

3.2 Modify the detect layer

Replace the forward function of the Detect module with:

def forward(self, x):
    z = []  # inference output
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # conv
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

        if not self.training:  # inference
            if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

            y = x[i].sigmoid()
            if self.inplace:
                y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
            else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
                xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                y = torch.cat((xy, wh, y[..., 4:]), -1)
            z.append(y.view(bs, -1, self.no))

    # custom output >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    # [bs, 25200, 85]
    origin_output = torch.cat(z, 1)
    output_bboxes_nums = 1000
    # operator argsort to ONNX opset version 12 is not supported.
    # top_conf_index = origin_output[..., 4].argsort(descending=True)[:,:output_bboxes_nums]

    # [bs, 1000]
    top_conf_index = origin_output[..., 4].topk(k=output_bboxes_nums)[1]

    # torch.Size([bs, 1000, 85])
    filter_output = origin_output.gather(1, top_conf_index.unsqueeze(-1).expand(-1, -1, 85))

    filter_output[..., 5:] *= filter_output[..., 4].unsqueeze(-1)  # conf = obj_conf * cls_prob
    bboxes = filter_output[..., :4]
    conf, cls_id = filter_output[..., 5:].max(2, keepdim=True)
    # [bs, 1000, 6]
    filter_output = torch.cat((bboxes, conf, cls_id.float()), 2)

    return x if self.training else filter_output
    # custom output <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

    # original return:
    # return x if self.training else (torch.cat(z, 1), x)

3.3 Export to ONNX

When simplifying the ONNX model, the input_shapes argument below must be commented out; otherwise the exported ONNX model still has a static shape:

model_onnx, check = onnxsim.simplify(
    model_onnx,
    dynamic_input_shape=dynamic
    # this line must stay commented out:
    # input_shapes={'images': list(im.shape)} if dynamic else None
    )

Run python export.py --weights yolov5s.pt --dynamic --simplify --include onnx to export the ONNX model. The structure of the exported model is shown below:

(Figure: structure of the exported ONNX model)
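To sanity-check the export, the model can be run with onnxruntime to confirm both the dynamic batch dimension and the new [batch, 1000, 6] output. A quick sketch, assuming the default output tensor name from the yolov5 exporter:

import numpy as np
import onnxruntime as ort

# Verify the exported model: dynamic batch and a [N, 1000, 6] output
sess = ort.InferenceSession("yolov5s.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)                    # e.g. images ['batch', 3, 640, 640]

for bs in (1, 4):                             # two batch sizes to confirm the dynamic axis
    x = np.random.rand(bs, 3, 640, 640).astype(np.float32)
    outs = sess.run(None, {inp.name: x})
    print(outs[0].shape)                      # expected: (bs, 1000, 6)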

3.4 Export the TensorRT engine

See the trtexec command in Section 1.


4. Performance after the modification

  • batch size = 1

    Throughput improves by more than 25x, and the time spent in Server Queue and Server Compute Output also drops significantly.

    (Table: Concurrency | Inferences/Second | Client Send | Network+Server Send/Recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | p90 latency; measured at concurrency 1-10, with throughput roughly 12-21 infer/sec.)
  • batch size = 8

    Compared with batch size = 1, Server Compute Input, Server Compute Infer, and Server Compute Output are roughly 1.4x, 2x, and 4x faster, respectively; the trade-off is that data-transfer time grows as the batch size increases.

    (Table: same columns as the batch size = 1 table; measured at concurrency 1-10 with batch size 8, throughput roughly 15-20 infer/sec.)
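Putting it together, querying the deployed model from Python looks roughly like the sketch below; the model name "yolov5s" and the tensor names "images"/"output" are the same assumptions as in the config example above, and postprocess is the helper sketched in Section 2:

import numpy as np
import tritonclient.grpc as grpcclient

# Minimal gRPC client for the modified model
client = grpcclient.InferenceServerClient(url="localhost:8001")

x = np.random.rand(1, 3, 640, 640).astype(np.float32)   # stand-in for a preprocessed image
inp = grpcclient.InferInput("images", list(x.shape), "FP32")
inp.set_data_from_numpy(x)
out = grpcclient.InferRequestedOutput("output")

res = client.infer(model_name="yolov5s", inputs=[inp], outputs=[out])
dets = res.as_numpy("output")                            # shape: (1, 1000, 6)
print(dets.shape)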
