Project: A TensorRT-Based Inference Framework

This article introduces NVIDIA's TensorRT inference framework, focusing on its FP16 and INT8 computation support, PTX precompiled code, and how performance is optimized through network model pruning, graph reconstruction, and low-precision computation. It also describes how TensorRT uses hardware-level Tensor Cores to accelerate convolutions, and outlines the framework's development plan and the plugins implemented so far.


Introduction

After several years of rapid progress in computer vision, a large number of algorithms have been proposed. To deploy them effectively in real application scenarios, high-performance inference frameworks keep emerging: from ncnn and tf-lite on mobile, to TensorRT, which NVIDIA introduced after cudnn specifically for neural network inference. Over several release cycles its supported operations have grown steadily richer, and the supplementary plugins now largely meet deployment needs. In the author's view, since TensorRT 5.0 in particular, both the API and the bundled samples have become very easy to integrate.

Version Selection and Basic Concepts

FP16 and INT8

The easiest way to benefit from mixed precision in your application is to take advantage of the support for FP16 and INT8 computation in NVIDIA GPU libraries. Key libraries from the NVIDIA SDK now support a variety of precisions for both computation and storage.

The table below shows the current support for FP16 and INT8 in key CUDA libraries, as well as in PTX assembly and CUDA C/C++ intrinsics.

| Feature | FP16x2 | INT8/16 DP4A/DP2A |
| --- | --- | --- |
| PTX instructions | CUDA 7.5 | CUDA 8 |
| CUDA C/C++ intrinsics | CUDA 7.5 | CUDA 8 |
| cuBLAS GEMM | CUDA 7.5 | CUDA 8 |
| cuFFT | CUDA 7.5 | I/O via cuFFT callbacks |
| cuDNN | 5.1 | 6 |
| TensorRT | v1 | v2 Tech Preview |

PTX

PTX (Parallel Thread Execution) is a form of precompiled GPU code. Developers can have the compiler emit PTX via the "-keep" option, and can also write PTX-level code directly. In addition, PTX is independent of the GPU architecture, so the same code can be reused across different GPU architectures. For details, see the CUDA document "PTX ISA reference document".
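For example, assuming nvcc is on the PATH, the PTX for a kernel file can be kept alongside the build output (file names here are illustrative):

```shell
# Keep all intermediate files, including the generated .ptx, next to the output
nvcc -keep -o vector_add vector_add.cu

# Or stop after PTX generation only
nvcc -ptx vector_add.cu -o vector_add.ptx
```

Inspecting the emitted .ptx is a quick way to confirm which hardware instructions (e.g. dp4a) the compiler actually selected.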

We recommend CUDA 8.0 or later, with at least a GeForce GTX 1060. To use features such as INT8/DP4A, a GTX 1080 or Tesla P40 is required.

TensorRT Features for High-Performance Algorithms

Optimization Principles

Network Model Pruning and Reconstruction

The figures above illustrate the vertical fusion optimization that TRT performs. The convolution (C), bias (B), and activation (R, ReLU in this case) are all collapsed into a single node (implementation-wise, this means a single CUDA kernel launch for C, B, and R).

There is also a horizontal fusion: when multiple nodes performing the same operation feed into multiple downstream nodes, they are converted into a single node that feeds all of them. The three 1x1 CBRs are fused into one, and its output is directed to the appropriate nodes.

Other optimizations: apart from the graph optimizations, TRT experimentally selects efficient algorithms and CUDA kernels for the operations in the network, based on parameters such as batch size and convolution kernel (filter) sizes.

Support for Low-Precision Computation

  • Support for FP16 & INT8 instructions
  • DP4A(Dot Product of 4 8-bits Accumulated to a 32-bit)

TensorRT performs this optimization using DP4A (Dot Product of 4 8-bits Accumulated to a 32-bit), shown in the figure below:

This is a hardware instruction on Pascal-series GPUs, and INT8 convolution is computed using it. More information about DP4A can be found in NVIDIA's CUDA documentation.
