Project: A TensorRT-Based Inference Framework

This article introduces NVIDIA's TensorRT inference framework, focusing on its FP16 and INT8 computation support, PTX precompiled code, and how performance is optimized through network model pruning, graph reconstruction, and low-precision computation. It also describes how TensorRT uses hardware-level Tensor Cores to accelerate convolutions, and outlines the framework's development plan and the plugins implemented so far.


Introduction

After several years of rapid progress in computer vision, a large number of algorithms have been proposed. To deploy them effectively in real application scenarios, high-performance inference frameworks keep emerging: from ncnn and tf-lite on mobile, to TensorRT, which NVIDIA introduced after cudnn specifically for neural network inference. Over several release cycles its supported operations have grown steadily richer, and the supplementary plugins now largely meet deployment needs. In the author's view, since TensorRT 5.0 in particular, both the API and the bundled samples have become very easy to integrate.

Version Selection and Basic Concepts

FP16 and INT8

The easiest way to benefit from mixed precision in your application is to take advantage of the support for FP16 and INT8 computation in NVIDIA GPU libraries. Key libraries from the NVIDIA SDK now support a variety of precisions for both computation and storage.

The table below shows the current support for FP16 and INT8 in key CUDA libraries, as well as in PTX assembly and CUDA C/C++ intrinsics.

| Feature | FP16x2 | INT8/16 DP4A/DP2A |
| --- | --- | --- |
| PTX instructions | CUDA 7.5 | CUDA 8 |
| CUDA C/C++ intrinsics | CUDA 7.5 | CUDA 8 |
| cuBLAS GEMM | CUDA 7.5 | CUDA 8 |
| cuFFT | CUDA 7.5 | I/O via cuFFT callbacks |
| cuDNN | 5.1 | 6 |
| TensorRT | v1 | v2 Tech Preview |

PTX

PTX (Parallel Thread Execution) is a form of precompiled GPU code. Developers can have the compiler emit PTX via the "-keep" option, and can also write PTX-level code directly. In addition, PTX is independent of the GPU architecture, so the same code can be reused across different GPU architectures. For details, see the CUDA document "PTX ISA reference document".
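For example, assuming nvcc is on the PATH, the PTX for a kernel file can be kept alongside the build output (file names here are illustrative):

```shell
# Keep all intermediate files, including the generated .ptx, next to the output
nvcc -keep -o vector_add vector_add.cu

# Or stop after PTX generation only
nvcc -ptx vector_add.cu -o vector_add.ptx
```

Inspecting the emitted .ptx is a quick way to confirm which hardware instructions (e.g. dp4a) the compiler actually selected.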

We recommend CUDA 8.0 or later, with at least a GeForce GTX 1060. To use features such as INT8/DP4A, a GTX 1080 or Tesla P40 is required.

TensorRT Features for High-Performance Algorithms

Optimization Principles

Network Model Pruning and Reconstruction

The figures above illustrate the vertical fusion optimization that TRT performs. The convolution (C), bias (B), and activation (R, ReLU in this case) are all collapsed into a single node (implementation-wise, this means a single CUDA kernel launch for C, B, and R).

There is also a horizontal fusion: when multiple nodes performing the same operation feed into multiple downstream nodes, they are converted into a single node that feeds all of them. The three 1x1 CBRs are fused into one, and its output is directed to the appropriate nodes.

Other optimizations: apart from the graph optimizations, TRT experimentally selects efficient algorithms and CUDA kernels for the operations in the network, based on parameters such as batch size and convolution kernel (filter) sizes.

Support for Low-Precision Computation

  • Support for FP16 & INT8 instructions
  • DP4A(Dot Product of 4 8-bits Accumulated to a 32-bit)

TensorRT performs this optimization using DP4A (Dot Product of 4 8-bits Accumulated to a 32-bit), shown in the figure below:

This is a hardware instruction on Pascal-series GPUs, and INT8 convolution is computed using it. More information about DP4A can be found in NVIDIA's CUDA documentation.
