Deploy an AI Coding Assistant with NVIDIA TensorRT-LLM and NVIDIA Triton | NVIDIA Technical Blog
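The model is first converted from Hugging Face format to FasterTransformer format, then compiled into a TensorRT engine with TensorRT-LLM. Inference can then run on the TensorRT-LLM C++ runtime (alternatively, the model can be placed in a Triton model repository with TensorRT-LLM specified as the backend).
Tokenizing the input and de-tokenizing the output are treated as pre-processing and post-processing, and each is implemented as a "Python Model". The whole pipeline is expressed as a single "Ensemble Model" that contains these two models plus the actual GPT model.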
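The ensemble idea can be pictured as plain function composition. The following is a toy sketch only: the whitespace tokenizer, the tiny vocabulary, the `fake_gpt` stand-in, and every function name are hypothetical and are not Triton or TensorRT-LLM APIs. In a real deployment each step would be a separate model in the repository, and Triton would route tensors between them.

```python
from typing import List

# Toy illustration of the ensemble: preprocessing -> model -> postprocessing.
# Everything here is a hypothetical stand-in, not a Triton or TensorRT-LLM API.
VOCAB = ["<unk>", "def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def preprocess(text: str) -> List[int]:
    """Tokenize: map whitespace-separated tokens to ids (unknown -> 0)."""
    return [TOKEN_TO_ID.get(tok, 0) for tok in text.split()]


def fake_gpt(input_ids: List[int]) -> List[int]:
    """Stand-in for the GPT engine: echo the prompt and append fixed ids."""
    completion = [TOKEN_TO_ID[t] for t in ["return", "a", "+", "b"]]
    return input_ids + completion


def postprocess(output_ids: List[int]) -> str:
    """De-tokenize: map ids back to tokens and join them with spaces."""
    return " ".join(VOCAB[i] for i in output_ids)


def ensemble(text: str) -> str:
    """Chain the three steps the way an ensemble chains three models."""
    return postprocess(fake_gpt(preprocess(text)))


result = ensemble("def add ( a , b ) :")
```

The client only ever calls `ensemble`; the tokenizer and de-tokenizer stay on the server side, which is the point of wrapping them as models.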
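1. The goal is to use ONNX Runtime as the inference backend, so the model must first be converted to ONNX format.
2. Model repo: create a directory (a local directory, a remote directory, or Azure Blob storage all work) that holds, for each model, its name (text_detection, text_recognition), versions (1, 2), configuration file (config.pbtxt), and model file (model.onnx). For example: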
model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   ├── 2
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
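As a minimal sketch of the layout above, the script below creates the directory skeleton and reports which files still have to be supplied. The helper names `create_repo_skeleton` and `missing_files` are hypothetical, and a temporary directory is used so that nothing is written to a real repository. The script does not produce `model.onnx` or `config.pbtxt`; those come from the ONNX export and from hand-written configuration.

```python
import tempfile
from pathlib import Path

# Model name -> version numbers, mirroring the example tree.
LAYOUT = {"text_detection": [1, 2], "text_recognition": [1]}


def create_repo_skeleton(root: Path, layout: dict) -> list:
    """Create <root>/<model>/<version>/ and return the files expected there."""
    expected = []
    for model, versions in layout.items():
        for version in versions:
            version_dir = root / model / str(version)
            version_dir.mkdir(parents=True, exist_ok=True)
            expected.append(version_dir / "model.onnx")
        expected.append(root / model / "config.pbtxt")
    return expected


def missing_files(expected: list) -> list:
    """Return the expected files that do not exist yet."""
    return [p for p in expected if not p.is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "model_repository"
    expected = create_repo_skeleton(root, LAYOUT)
    # Only directories were created, so every expected file is still missing.
    missing = sorted(p.relative_to(root).as_posix() for p in missing_files(expected))
```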
3. config.pbtxt format
name: "text_detection"
backend: "onnxruntime"
max_batch_size: 256
input [
  {
    name: "input_images:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 3 ]
  }
]
output [
  {
    name: "feature_fusion/Conv_7/Sigmoid:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 1 ]
  }
]
output [
  {
    name: "feature_fusion/concat_3:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 5 ]
  }
]
backend and max_batch_size must be specified. input and output should be derivable by Triton from the model file automatically, so they can also be omitted.
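If the input and output sections are indeed filled in by Triton, the smallest useful config.pbtxt has only three fields. The sketch below renders such a file as text; `minimal_config` is a hypothetical helper, and whether auto-completion works for a given model and Triton version is an assumption carried over from the note above, so it should be checked against the server log.

```python
def minimal_config(name: str, backend: str, max_batch_size: int) -> str:
    """Render a config.pbtxt holding only name, backend, and max_batch_size."""
    if max_batch_size < 0:
        raise ValueError("max_batch_size must be >= 0")
    return (
        f'name: "{name}"\n'
        f'backend: "{backend}"\n'
        f"max_batch_size: {max_batch_size}\n"
    )


# Same values as the text_detection example; input/output are left out on purpose.
config_text = minimal_config("text_detection", "onnxruntime", 256)
```

Writing `config_text` to `model_repository/text_detection/config.pbtxt` would give the shortened form of the configuration shown earlier.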
4. Pull and launch the Triton server image from nvcr.io:
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:<yy.mm>-py3
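To make the individual flags easier to read, the same command can be assembled as an argument list. This is only a sketch: `triton_docker_command` is a hypothetical helper, the repository path is a placeholder, and the image tag is left as the unfilled `<yy.mm>` placeholder from the command above.

```python
import shlex


def triton_docker_command(repo_dir: str, image_tag: str) -> list:
    """Build the argv for the docker run command, one flag per entry."""
    return [
        "docker", "run",
        "--gpus=all",        # expose all GPUs to the container
        "-it",               # interactive terminal
        "--shm-size=256m",   # size of /dev/shm inside the container
        "--rm",              # remove the container when it exits
        "-p8000:8000",       # HTTP inference port
        "-p8001:8001",       # gRPC inference port
        "-p8002:8002",       # metrics port
        "-v", f"{repo_dir}:/models",  # mount the model repository at /models
        f"nvcr.io/nvidia/tritonserver:{image_tag}-py3",
    ]


# Placeholder path and tag; substitute real values before running.
cmd = triton_docker_command("/path/to/model_repository", "<yy.mm>")
printable = shlex.join(cmd)  # shell-quoted string for display or logging
```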
5. Start the Triton server (inside the container)
tritonserver --model-repository=/models
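After a successful start, the following information is displayed (which models are READY; version, memory, and GPU memory information; the two inference ports, 8000 for HTTP and 8001 for gRPC, and the metrics port 8002):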
I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| text_detection | 1 | READY |
| text_recognition | 1 | READY |
+------------------+---------+--------+
I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton
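
Besides reading the startup log, readiness can be queried over HTTP on port 8000. The sketch below uses only the Python standard library and the readiness endpoints of Triton's HTTP API (`/v2/health/ready` and `/v2/models/<name>/ready`). The helper names `ready_url` and `is_ready` are hypothetical, and the host and port assume the port mapping from the docker command.

```python
import urllib.error
import urllib.request


def ready_url(host: str = "localhost", port: int = 8000, model: str = "") -> str:
    """Return the server readiness URL, or the per-model one if model is set."""
    base = f"http://{host}:{port}/v2"
    if model:
        return f"{base}/models/{model}/ready"
    return f"{base}/health/ready"


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers HTTP 200; False on any error or timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

With the server running, `is_ready(ready_url())` checks the server as a whole, and `is_ready(ready_url(model="text_detection"))` checks a single model from the status table.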




