江大白 | TensorRT Model Deployment: Techniques for Balancing Flexibility and Performance When Debugging

Article link: https://blog.youkuaiyun.com/csdn_xmj/article/details/141097076

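This article comes from the WeChat public account “江大白” and is shared for academic purposes only; it will be removed on request in case of infringement.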

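Overview

Converting models is something we all do constantly while debugging them, and it can be a tedious job. Is there a better way, one that balances flexibility and performance? This article walks through two such approaches in detail; I hope it helps.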

Anyone who has used TensorRT has almost certainly come across trtexec[1], which quickly and conveniently converts your ONNX model into a TensorRT engine:

./trtexec --onnx=model.onnx

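How does it work under the hood? That brings in another library, onnx-tensorrt[2], which parses an ONNX model, converts each op in it into the corresponding TensorRT op, and then builds the engine. onnx-tensorrt is the core of trtexec's model conversion.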
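
To make the parse-then-build flow concrete, here is a minimal sketch using the ONNX parser exposed by the TensorRT Python bindings. It assumes a TensorRT 8.x-style Python API; the function name build_engine_from_onnx and its fp16 argument are placeholders of mine. This is not the trtexec source, only the same idea in a few lines.

```python
def build_engine_from_onnx(onnx_path, fp16=False):
    """Parse an ONNX file and return a serialized TensorRT engine.

    Sketch only: assumes a TensorRT 8.x-style Python API is installed.
    """
    # Imported lazily so this file can be loaded on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # The ONNX parser works on explicit-batch networks.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    # The parser turns each ONNX node into TensorRT layers inside `network`.
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parsing failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT failed to build the engine")
    return serialized
```

Unsupported operators surface at the parser.parse step, which is why the sketch collects and raises the parser errors instead of silently continuing.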

Without onnx-tensorrt[3], how would we use TensorRT to accelerate a model?

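Fortunately, TensorRT provides an official API[4] for building networks, so you can hand-build a network much as you would in PyTorch. The TensorRTx[5] library, for example, contains many TensorRT networks built directly with the API: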

nvinfer1::IHostMemory* buildEngineYolov8n(nvinfer1::IBuilder* builder,
                                          nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path) {
    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);

    /*******************************************************************************************************
    ******************************************  YOLOV8 INPUT  **********************************************
    *******************************************************************************************************/
    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});
    assert(data);

    /*******************************************************************************************************
    *****************************************  YOLOV8 BACKBONE  ********************************************
    *******************************************************************************************************/
    nvinfer1::IElementWiseLayer* conv0 = convBnSiLU(network, weightMap, *data, 16, 3, 2, 1, "model.0");
    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0), 32, 3, 2, 1, "model.1");
    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), 32, 32, 1, true, 0.5, "model.2");
    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0), 64, 3, 2, 1, "model.3");
    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), 64, 64, 2, true, 0.5, "model.4");
    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0), 128, 3, 2, 1, "model.5");
    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), 128, 128, 2, true, 0.5, "model.6");
    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0), 256, 3, 2, 1, "model.7");
    nvinfer1::IElementWiseLayer* conv8 = C2F(network, weightMap, *conv7->getOutput(0), 256, 256, 1, true, 0.5, "model.8");
    nvinfer1::IElementWiseLayer* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), 256, 256, 5, "model.9");
...
}

Compared with using onnx-tensorrt[6], building the network this way has the following advantages:

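You can control every layer of the network more precisely and avoid redundant structures in the ONNX graph that hurt performance, so in theory a TensorRT network built through the API performs somewhat better once built (it depends, of course: for most models, onnx2trt plus TensorRT now performs almost the same as a pure-API build).
It is relatively easy to modify an individual layer of the TensorRT network later, or to add a plugin.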

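The drawbacks are obvious, though: building the network is time-consuming, you need to be familiar with the TensorRT API, and you may hit countless pitfalls while getting started. In the same amount of time, onnx2trt would have converted the model with a single command, so this route is far less convenient than onnx2trt.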

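That said, you can't use ONNX blindly either. If your network contains unsupported operators, or is somewhat unusual, the conversion simply fails. Look at the issues in the onnx-tensorrt repository: as late as 2023 there were still all kinds of op problems.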

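Also, when the model is very large (yes, I mean LLMs) and has a huge number of layers, ONNX becomes awkward. It's not that the model can't be exported, but once the ONNX file gets large, inspecting the network structure and locating problems becomes difficult. You always have to go through ONNX as the IR, and ONNX has many small pitfalls. You can get the job done in the end, but the process is painful (grunt work; if you know, you know).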

So is there a better way, one that offers both flexibility and performance?

A better way, v1

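Some of you have probably used TensorRT conversion tools like torch2trt[7]. They traverse your PyTorch network, convert each op they encounter into the corresponding TensorRT op, and once the network is assembled, build it into a TensorRT engine: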

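import torch
from torchvision.models.segmentation import deeplabv3_resnet50
# Import path assumed from the torch2trt_dynamic project; adjust to your install.
from torch2trt_dynamic import torch2trt_dynamic

model = deeplabv3_resnet50().cuda().eval().half()
data = torch.randn((1, 3, 224, 224)).cuda().half()

print('Running torch2trt...')
model_trt = torch2trt_dynamic(
    model, [data], fp16_mode=True, max_workspace_size=1 << 25)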

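Take the converter below as an example: when the traversal reaches the op torch.nn.functional.leaky_relu, this conversion function runs and creates the corresponding op in the TensorRT network: ctx.network.add_activation(input_trt, trt.ActivationType.LEAKY_RELU).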

@tensorrt_converter('torch.nn.functional.leaky_relu')
@tensorrt_converter('torch.nn.functional.leaky_relu_')
def convert_leaky_relu(ctx):
    input = get_arg(ctx, 'input', pos=0, default=None)
    negative_slope = get_arg(ctx, 'negative_slope', pos=1, default=0.01)
    output = ctx.method_return

    input_trt = trt_(ctx.network, input)
    layer = ctx.network.add_activation(input_trt,
                                       trt.ActivationType.LEAKY_RELU)
    layer.alpha = negative_slope

    output._trt = layer.get_output(0)
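
The registration mechanism behind a decorator like tensorrt_converter can be sketched in plain Python. Everything here is a toy of my own: CONVERTERS, register_converter, FakeNetwork and convert_ops are hypothetical names, and the explicit op list stands in for the way the real tool intercepts PyTorch calls during a forward pass. It only shows the idea of "look up a converter by op name, then let it add a layer".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: qualified op name -> converter function.
CONVERTERS: Dict[str, Callable] = {}


def register_converter(op_name: str) -> Callable:
    """Decorator that records a converter under the given op name."""
    def decorator(fn: Callable) -> Callable:
        CONVERTERS[op_name] = fn
        return fn
    return decorator


class FakeNetwork:
    """Stand-in for a TensorRT network; it only records the layers added."""

    def __init__(self) -> None:
        self.layers: List[Tuple[str, float]] = []

    def add_activation(self, kind: str, alpha: float) -> None:
        self.layers.append((kind, alpha))


# Stacking the decorator registers one function under two op names,
# mirroring the leaky_relu / leaky_relu_ pair above.
@register_converter("torch.nn.functional.leaky_relu")
@register_converter("torch.nn.functional.leaky_relu_")
def convert_leaky_relu(network: FakeNetwork, negative_slope: float = 0.01) -> None:
    network.add_activation("LEAKY_RELU", negative_slope)


def convert_ops(network: FakeNetwork, ops: List[Tuple[str, dict]]) -> FakeNetwork:
    """Walk the ops in order and dispatch each one to its converter."""
    for name, kwargs in ops:
        if name not in CONVERTERS:
            raise NotImplementedError(f"no converter registered for {name}")
        CONVERTERS[name](network, **kwargs)
    return network


net = convert_ops(FakeNetwork(), [
    ("torch.nn.functional.leaky_relu", {"negative_slope": 0.2}),
    ("torch.nn.functional.leaky_relu_", {}),
])
print(net.layers)
```

The NotImplementedError branch is the toy equivalent of hitting an op that has no converter yet: supporting it means writing and registering one more small function.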

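The advantage of this approach is that modifying the network is fairly simple, because you convert directly from the PyTorch model instead of going through ONNX. You can modify the network via ONNX too, but you still have to pass through ONNX as the IR, and some ops change during the PyTorch -> ONNX export, which makes problems hard to locate when they appear.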

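Also, when you need to debug, it is easy to choose which tensors are outputs: find the point in the network you want as an output, cut out that sub-model, and convert it on its own, which makes problems easier to locate. With ONNX, you first have to find the layer that corresponds between PyTorch and ONNX and then set it in the onnx2trt script. TensorRT does officially provide debugging tools such as Polygraphy[8], but in practice they are not as convenient as modifying the PyTorch network directly.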
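
For the ONNX route, "setting it in the conversion script" roughly means marking an intermediate tensor of the parsed network as an extra output before building. Below is a minimal sketch of that step; mark_debug_outputs is a helper name I made up, and it assumes the network object behaves like tensorrt.INetworkDefinition (num_layers, get_layer, mark_output). Which names the parser gives its layers is also an assumption, so print them first.

```python
def mark_debug_outputs(network, layer_names):
    """Mark the first output of each named layer as a network output.

    `network` is expected to behave like tensorrt.INetworkDefinition.
    Returns the layer names that were actually found, in network order.
    """
    wanted = set(layer_names)
    marked = []
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.name in wanted:
            # After this call the tensor shows up as an engine output,
            # so it can be compared against the PyTorch activation.
            network.mark_output(layer.get_output(0))
            marked.append(layer.name)
    return marked
```

Comparing the returned list with the requested names shows right away whether a layer name was misspelled or renamed during export.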

The later trtorch, also known as torch-TensorRT[9], works on much the same principle as torch2trt: it also traverses the torch network and converts it layer by layer into TensorRT ops.

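A better way, v2

The v1 approach above is more direct than onnx2trt, since the conversion happens directly on the PyTorch model. However, all we get back is the built TensorRT engine; the construction of the intermediate TensorRT network is hidden. When a problem shows up later and we want to debug further, we still lack a full picture of the network. It would be better and more intuitive to debug a network built directly with the TensorRT API: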

import numpy as np
import tensorrt as trt


class Centernet_dla34(object):
    def __init__(self, weights) -> None:
        super().__init__()
        self.weights = weights
        self.levels = [1, 1, 1, 2, 2, 1]
        self.channels = [16, 32, 64, 128, 256, 512]
        self.down_ratio = 4
        self.last_level = 5
        self.engine = self.build_engine()

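    def add_batchnorm_2d(self, input_tensor, parent):
        # Fold inference-time BatchNorm into a single per-channel scale layer:
        # y = gamma * (x - mean) / sqrt(var + eps) + beta = scale * x + shift
        gamma = self.weights[parent + '.weight'].numpy()
        beta = self.weights[parent + '.bias'].numpy()
        mean = self.weights[parent + '.running_mean'].numpy()
        var = self.weights[parent + '.running_var'].numpy()
        eps = 1e-5  # assumed to match the eps of the original BatchNorm layer

        scale = gamma / np.sqrt(var + eps)
        shift = beta - mean * gamma / np.sqrt(var + eps)
        power = np.ones_like(scale)  # power of 1 leaves scale * x + shift unchanged

        # Despite its name, input_tensor is a layer here; its first output feeds the scale layer.
        return self.network.add_scale(input=input_tensor.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=shift, scale=scale, power=power)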
...
    def populate_network(self):