Triton & TensorRT-LLM

原创

已于 2024-12-27 19:20:00 修改 · 550 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#性能优化

于 2024-05-30 22:39:49 首次发布

Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton | NVIDIA Technical Blog

模型格式先Huggingface转为FasterTransformer；再用TensorRT-LLM将其compile为TensorRT engine；然后可用TensorRT-LLM的C++ runtime来跑推理（或者模型放到Triton Repo上，并指定TensorRT-LLM为backend）

Input的Tokenizing和Output的De-Tokenizing，视作前处理、后处理，创建"Python Model"；整个流程用一个"Ensemble Model"来表示，包含以上两个"Model"以及真正的GPT-Model;

https://github.com/triton-inference-server/tutorials/blob/main/Conceptual_Guide/Part_1-model_deployment/README.md

1. 想用onnx-runtime来做推理backend；因此先要将模型转换为onnx格式；

2. model repo: 新建一个目录（本地目录、远程目录、Azure Blob都可）；存放所有模型的名称(text_detection、text_recognition)、版本（1、2）、配置文件（config.pbtxt)、模型文件(model.onnx)。例如：

model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   ├── 2
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

3. config.pbtxt格式

name: "text_detection"
backend: "onnxruntime"
max_batch_size : 256
input [
  {
    name: "input_images:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 3 ]
  }
]
output [
  {
    name: "feature_fusion/Conv_7/Sigmoid:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 1 ]
  }
]
output [
  {
    name: "feature_fusion/concat_3:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 5 ]
  }
]

backend、max_batch_size要写; input、output应该可以由triton从模型文件里自动获取，也可不写；

4. 拉取和启动nvcr.io的triton server镜像：

docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:<yy.mm>-py3

5. 启动triton server

tritonserver --model-repository=/models

启动成功后，显示如下信息：（哪几个模型READY了；版本、内存、显存等信息；2个推理用的端口和1个状态查询端口）

I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| text_detection   | 1       | READY  |
| text_recognition | 1       | READY  |
+------------------+---------+--------+

I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton