The Complete Triton Guide: Getting Started with Model Deployment (Part 2)

Original article, last modified 2025-01-09 16:54:42

Licensed under CC 4.0 BY-SA

First published 2025-01-09 14:11:43

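Contents

Introduction
Model repository configuration
- Minimal configuration file `config.pbtxt`
- - input/output parameters in detail
Quick deployment of a Hello World
Miscellaneous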

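Introduction

Triton supports many model types, including TensorFlow, PyTorch, ONNX, TensorRT, and Python, and a real deployment may use one or several of them. Any deep learning model serving framework has to solve three problems:

Managing multiple types of models.
Controlling model versions, loading, and unloading.
Configuring model inputs and outputs.

Triton provides a solution for each of these. Specifically, Triton uses a configuration file to manage a model's versions, inputs and outputs, backend, and other features. So two things must be ready before deployment: the model file model.xx and the configuration file config.pbtxt.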

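Model repository configuration

In the previous part, the docker argument -v ${PWD}/model_repository:/models mapped a folder into the container. This folder serves as the model repository and holds all of our models and configuration files.
The model repository should be structured as follows: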

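models
├── test1 (model name)
│   ├── 1 (version number)
│   │   └── model.py (model file)
│   └── config.pbtxt (configuration file)
├── test2 (model name)
│   ├── 1 (version number)
│   │   └── model.pt (model file)
│   ├── 2 (version number)
│   │   └── model.pt (model file)
│   └── config.pbtxt (configuration file)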

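The model name, the version number (a dot cannot be used to split it into major and minor versions), and the model file name model.xx should be adjusted to your own situation. As the directory layout shows, Triton selects a model by its name and a version by its version number, while inputs, outputs, and other features live in the configuration file. The rest of this section focuses on how to write config.pbtxt.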
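
The layout rules above can be checked with a short script before starting the server. This is a minimal sketch, not part of Triton: the helper names list_models and build_demo_repo are hypothetical, and the check only assumes what the article states (one folder per model, a config.pbtxt next to integer-named version folders).

```python
import tempfile
from pathlib import Path


def list_models(repo_root):
    """Return {model_name: sorted integer versions} for a repository folder."""
    models = {}
    for model_dir in sorted(Path(repo_root).iterdir()):
        # A model folder is expected to hold config.pbtxt plus version folders.
        if not model_dir.is_dir() or not (model_dir / "config.pbtxt").is_file():
            continue
        versions = sorted(
            int(p.name)
            for p in model_dir.iterdir()
            if p.is_dir() and p.name.isdigit()  # a name like "1.2" is skipped
        )
        models[model_dir.name] = versions
    return models


def build_demo_repo(root):
    """Create the test1/test2 layout from the article with empty placeholder files."""
    layout = {"test1": {"1": "model.py"}, "test2": {"1": "model.pt", "2": "model.pt"}}
    for name, versions in layout.items():
        for version, filename in versions.items():
            version_dir = Path(root) / name / version
            version_dir.mkdir(parents=True)
            (version_dir / filename).touch()
        (Path(root) / name / "config.pbtxt").touch()


with tempfile.TemporaryDirectory() as tmp:
    build_demo_repo(tmp)
    demo_models = list_models(tmp)
    print(demo_models)  # {'test1': [1], 'test2': [1, 2]}
```

Pointing list_models at the real ${PWD}/model_repository folder gives a quick overview of which models and versions the server should find.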

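Minimal configuration file `config.pbtxt`

The configuration file config.pbtxt must specify the information the model needs. The example below uses the torch backend. See the official model configuration guide for more details.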

name: "test2"
backend: "pytorch"
max_batch_size: 16
input [
{
    name: "INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
}
]

output [
{
    name: "OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 256, 256 ]
}
]

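A configuration file must set at least the backend/platform, max_batch_size, input, and output properties.

name: Optional. If a model name is given, it must match the name of the model folder, otherwise an error is reported.
max_batch_size: Optional. The maximum batch size the model supports. Set it to 0 if the model does not support batching. It defaults to 0 (it can be left out, but writing it explicitly is recommended).
backend: Specify either this or platform. Supported backends include tensorrt, pytorch, onnxruntime, tensorflow, python, openvino, dali, and fil. Custom backends are also possible.
platform: Specify either this or backend. This parameter identifies the model format. For example, TensorFlow has several save formats, GraphDef and SavedModel, whose platform values are tensorflow_graphdef and tensorflow_savedmodel.
input: Required. A list of inputs. Each entry must give the input's name, its data type data_type, and its shape dims.
output: Required. A list of outputs, with the same fields as input.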
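
When several models share the same pattern, the minimal fields listed above can be filled in from a small script instead of by hand. The following is only a sketch under that assumption: render_tensor and render_config are hypothetical helpers, not part of Triton, and they emit just the fields shown in this article.

```python
def render_tensor(name, data_type, dims):
    """Render one input/output entry in the text format used by config.pbtxt."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        "{\n"
        f'    name: "{name}"\n'
        f"    data_type: {data_type}\n"
        f"    dims: [ {dims_text} ]\n"
        "}"
    )


def render_config(name, backend, max_batch_size, inputs, outputs):
    """Build config.pbtxt text; inputs/outputs are (name, data_type, dims) tuples."""

    def section(label, tensors):
        # Entries inside a [ ... ] list are separated by commas.
        body = ",\n".join(render_tensor(*t) for t in tensors)
        return f"{label} [\n{body}\n]"

    lines = [
        f'name: "{name}"',
        f'backend: "{backend}"',
        f"max_batch_size: {max_batch_size}",
        section("input", inputs),
        section("output", outputs),
    ]
    return "\n".join(lines) + "\n"


config_text = render_config(
    "test2",
    "pytorch",
    16,
    inputs=[("INPUT_0", "TYPE_UINT8", [-1])],
    outputs=[("OUTPUT_0", "TYPE_FP32", [3, 256, 256])],
)
print(config_text)
```

The printed text reproduces the test2 example above and can be written to models/test2/config.pbtxt.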

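input/output parameters in detail

name: names the tensor being transferred. Its main purpose is communication with the client, which refers to inputs and outputs by this name.
data_type: specifies the data type the model accepts. The table below shows how the boolean and integer types map across frameworks. TYPE_STRING, used in the Hello World example later, corresponds to BYTES in the API and to an object array in NumPy.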

Model Config	TensorRT	TensorFlow	ONNX Runtime	PyTorch	API	NumPy
TYPE_BOOL	kBOOL	DT_BOOL	BOOL	kBool	BOOL	bool
TYPE_UINT8	kUINT8	DT_UINT8	UINT8	kByte	UINT8	uint8
TYPE_UINT16		DT_UINT16	UINT16		UINT16	uint16
TYPE_UINT32		DT_UINT32	UINT32		UINT32	uint32
TYPE_UINT64		DT_UINT64	UINT64		UINT64	uint64
TYPE_INT8	kINT8	DT_INT8	INT8	kChar	INT8	int8
TYPE_INT16		DT_INT16	INT16	kShort	INT16	int16
TYPE_INT32	kINT32	DT_INT32	INT32	kInt	INT32	int32
TYPE_INT64	kINT64	DT_INT64	INT64	kLong	INT64	int64
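
On the client side, the two columns that matter most are API and NumPy, because the client names the datatype as a string and passes the data as a NumPy array. A lookup along these lines can help avoid mismatches. It is a sketch: TYPE_TABLE and the two helper functions are hypothetical names, the integer and boolean rows are copied from the table, and the TYPE_STRING row is taken from the Hello World example later in this article.

```python
# (API datatype, NumPy dtype name) for each Model Config type.
TYPE_TABLE = {
    "TYPE_BOOL": ("BOOL", "bool"),
    "TYPE_UINT8": ("UINT8", "uint8"),
    "TYPE_UINT16": ("UINT16", "uint16"),
    "TYPE_UINT32": ("UINT32", "uint32"),
    "TYPE_UINT64": ("UINT64", "uint64"),
    "TYPE_INT8": ("INT8", "int8"),
    "TYPE_INT16": ("INT16", "int16"),
    "TYPE_INT32": ("INT32", "int32"),
    "TYPE_INT64": ("INT64", "int64"),
    # Not in the table: the Hello World config says TYPE_STRING while the
    # client script sends "BYTES" with a NumPy object array.
    "TYPE_STRING": ("BYTES", "object"),
}


def api_datatype(config_type):
    """Datatype string a client uses for a given config.pbtxt data_type."""
    return TYPE_TABLE[config_type][0]


def numpy_name(config_type):
    """Name of the NumPy dtype that matches a config.pbtxt data_type."""
    return TYPE_TABLE[config_type][1]


print(api_datatype("TYPE_UINT8"), numpy_name("TYPE_UINT8"))
print(api_datatype("TYPE_STRING"), numpy_name("TYPE_STRING"))
```

Note that for every row of the table the API name is the Model Config name without the TYPE_ prefix; TYPE_STRING is the exception worth remembering.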

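dims gives the shape of the model input, where -1 stands for an axis of any size. If max_batch_size > 0, a batch dimension is prepended, so the full shape is [-1] + dims. If max_batch_size = 0, the input shape is exactly dims.
For example: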

max_batch_size: 16
input [
{
    name: "INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
}
]
output [
{
    name: "OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 256, 256 ]
}
]

In the example above, "INPUT_0" is a uint8 tensor of shape [-1, -1], and "OUTPUT_0" is a float32 tensor of shape [-1, 3, 256, 256].
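
The shape rule can be written down in a few lines. The sketch below is illustrative only: full_shape and shape_matches are hypothetical helpers that restate the rule from this section, not functions provided by Triton.

```python
def full_shape(max_batch_size, dims):
    """Shape expected for one tensor, with -1 marking an axis of any size."""
    if max_batch_size > 0:
        return [-1] + list(dims)  # batch dimension is prepended
    return list(dims)


def shape_matches(shape, max_batch_size, dims):
    """Check a concrete shape against the configured max_batch_size and dims."""
    expected = full_shape(max_batch_size, dims)
    if len(shape) != len(expected):
        return False
    if max_batch_size > 0 and shape[0] > max_batch_size:
        return False  # batch larger than the configured maximum
    return all(e == -1 or e == s for s, e in zip(shape, expected))


print(full_shape(16, [-1]))            # [-1, -1]
print(full_shape(16, [3, 256, 256]))   # [-1, 3, 256, 256]
print(full_shape(0, [-1]))             # [-1]
print(shape_matches([4, 100], 16, [-1]))   # True
print(shape_matches([32, 100], 16, [-1]))  # False
```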

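Quick deployment of a Hello World

Next is a simple demonstration with the python backend: the client sends a string to the backend, and the backend prepends Hello World to it and returns the result. The model repository looks like this: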

models
└── test1
    ├── 1
    │   └── model.py
    └── config.pbtxt

Configuration file config.pbtxt

name: "test1"
backend: "python"
max_batch_size: 0
input [
{
    name: "INPUT_0"
    data_type: TYPE_STRING
    dims: [ -1 ]
}
]

output [
{
    name: "OUTPUT_0"
    data_type: TYPE_STRING
    dims: [ -1 ]
}
]

Model file model.py

import json

import numpy as np

# triton_python_backend_utils is available in every Triton Python model. You
# need to use this module to create inference requests and responses. It also
# contains some utility functions for extracting information from model_config
# and converting Triton input/output types to numpy types.
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.

        Parameters
        ----------
        args : dict
          Both keys and values are strings. The dictionary keys and values are:
          * model_config: A JSON string containing the model configuration
          * model_instance_kind: A string containing model instance kind
          * model_instance_device_id: A string containing model instance device ID
          * model_repository: Model repository path
          * model_version: Model version
          * model_name: Model name
        """

        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args["model_config"])

        # Get the OUTPUT_0 configuration
        output0_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT_0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config["data_type"]
        )

    def execute(self, requests):
        """`execute` MUST be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference request is made
        for this model. Depending on the batching configuration (e.g. Dynamic
        Batching) used, `requests` may contain multiple requests. Every
        Python model, must create one pb_utils.InferenceResponse for every
        pb_utils.InferenceRequest in `requests`. If there is an error, you can
        set the error argument when creating a pb_utils.InferenceResponse

        Parameters
        ----------
        requests : list
          A list of pb_utils.InferenceRequest

        Returns
        -------
        list
          A list of pb_utils.InferenceResponse. The length of this list must
          be the same as `requests`
        """

        output0_dtype = self.output0_dtype

        responses = []

        # Every Python backend must iterate over everyone of the requests
        # and create a pb_utils.InferenceResponse for each of them.
        for request in requests:
            # Get INPUT_0
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT_0")

            # A string tensor arrives as a NumPy array of bytes objects
            in_0 = in_0.as_numpy()
            in_str = in_0[0].decode('utf-8')

            out_str = "Hello World: " + in_str

            # Construct the output
            out_tensor_0 = pb_utils.Tensor("OUTPUT_0", np.array([out_str], dtype=output0_dtype))

            # Create InferenceResponse. You can set an error here in case
            # there was a problem with handling this inference request.
            # Below is an example of how you can set errors in inference
            # response:
            #
            # pb_utils.InferenceResponse(
            #    output_tensors=..., error=pb_utils.TritonError("An error occurred"))
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0]
            )
            responses.append(inference_response)

        # You should return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print("Cleaning up...")
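
The model above reads only in_0[0], although dims: [ -1 ] allows the client to send more than one string. A possible variant that greets every element is sketched here outside Triton so it can be tried locally. The function greet_all is hypothetical, and a plain list of bytes objects stands in for the NumPy array returned by as_numpy().

```python
def greet_all(raw_values, encoding="utf-8"):
    """Decode each bytes element, prefix it, and return encoded bytes again."""
    greetings = []
    for raw in raw_values:
        text = raw.decode(encoding)
        greetings.append(("Hello World: " + text).encode(encoding))
    return greetings


# Stand-in for in_0.as_numpy(): a sequence of UTF-8 encoded bytes objects.
replies = greet_all([b"I'm Steve.", b"I'm Alex."])
print([reply.decode("utf-8") for reply in replies])
```

Inside execute, the returned list would be wrapped with np.array(..., dtype=output0_dtype) in the same way as the single string in the original code.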

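Starting the service

Start the Triton container and enter it. Once all files are in place, start the service with tritonserver --model-repository=/models --load-model=test1. By default the HTTP service listens on port 8000 and the gRPC service on port 8001. Note that --load-model only takes effect together with --model-control-mode=explicit. In the default mode (MODE_NONE in the log below), Triton loads every model it finds in the repository.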

I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| test1            | 1       | READY  |
+------------------+---------+--------+

I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.23.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0712 16:37:18.269464 128 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0712 16:37:18.269956 128 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0712 16:37:18.311686 128 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

READY in the Status column means the model loaded successfully. If an error appears, follow the message to resolve it.
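
Readiness can also be checked over HTTP instead of reading the log. The sketch below assumes the server is reachable at http://127.0.0.1:8000 (the default HTTP port from the log above) and uses the health endpoints of the HTTP protocol that Triton implements. The helper names ready_urls and is_ready are hypothetical.

```python
import urllib.error
import urllib.request


def ready_urls(base_url, model_name):
    """Build the server-level and model-level readiness URLs."""
    base = base_url.rstrip("/")
    return {
        "server": f"{base}/v2/health/ready",
        "model": f"{base}/v2/models/{model_name}/ready",
    }


def is_ready(url, timeout=2.0):
    """Return True when the endpoint answers HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as not ready.
        return False


urls = ready_urls("http://127.0.0.1:8000", "test1")
print(urls)
# With the server running, uncomment to query it:
# print({name: is_ready(url) for name, url in urls.items()})
```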

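Client

There are several ways to talk to the Triton server: 1. HTTP; 2. gRPC.
The official client package wraps the tools needed to make requests and can be used on the client side.
The official tritonclient currently supports a limited set of languages. For Python, for example, it can be installed with pip install tritonclient[all]. For some other languages, such as JS and C#, requests have to be wrapped by hand.
With this in mind, the official project provides generic proto files, so gRPC requests can be implemented through Protobuf on any platform. It is also possible to build HTTP requests by hand and call the API endpoints with a debugging tool such as Postman. This is more tedious, so it is recommended only for querying model status and for loading and unloading models.
The tritonclient script below calls the test1 model, completing a round trip between client and backend.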

import sys
import numpy as np
import tritonclient.grpc as tritongrpcclient

# Connect to the server
try:
    triton_client = tritongrpcclient.InferenceServerClient(
        url="127.0.0.1:8001", verbose=False
    )
except Exception as e:
    print("channel creation failed: " + str(e))
    sys.exit(1)

# Inputs
inputs = []
inputs.append(tritongrpcclient.InferInput("INPUT_0", [1], "BYTES"))
inputs[0].set_data_from_numpy(np.array(["I'm Steve."], dtype=np.object_))

# Outputs
outputs = []
outputs.append(tritongrpcclient.InferRequestedOutput("OUTPUT_0"))

# Call the backend
results = triton_client.infer(
    model_name='test1', inputs=inputs, outputs=outputs, model_version="1"
)

output0_data = results.as_numpy("OUTPUT_0")
print(output0_data[0].decode('utf-8'))

If Hello World: I'm Steve. is printed, the call succeeded. The string can also carry JSON to support other operations.
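
For platforms without tritonclient, the hand-built HTTP request mentioned earlier might look roughly like the sketch below. It assumes the JSON inference format of the HTTP protocol that Triton implements and the default port 8000. The helpers build_infer_request and parse_first_output are hypothetical, and sample_reply is a hand-written illustration of the response shape, not captured server output.

```python
import json
import urllib.request


def build_infer_request(base_url, model_name, text, version="1"):
    """Build a POST request that sends one string to INPUT_0."""
    payload = {
        "inputs": [
            {"name": "INPUT_0", "shape": [1], "datatype": "BYTES", "data": [text]}
        ],
        "outputs": [{"name": "OUTPUT_0"}],
    }
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/versions/{version}/infer"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_first_output(response_body):
    """Return the first element of the first output in a JSON response body."""
    reply = json.loads(response_body)
    return reply["outputs"][0]["data"][0]


request = build_infer_request("http://127.0.0.1:8000", "test1", "I'm Steve.")
print(request.full_url)

# Hand-written illustration of the response structure (assumption, not real output).
sample_reply = json.dumps(
    {"outputs": [{"name": "OUTPUT_0", "datatype": "BYTES", "shape": [1],
                  "data": ["Hello World: I'm Steve."]}]}
)
print(parse_first_output(sample_reply))

# With the server running, uncomment to send the request:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(parse_first_output(response.read()))
```

The same payload can be pasted into Postman as the request body, which is one way to port the call to JS or C#.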

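Miscellaneous

In theory Triton's python backend could take over what a backend framework such as Flask does, but in production it is still advisable to keep the API backend logic separate from Triton and let Triton expose only the algorithm interface.
Try not to use the python backend to load TensorRT, ONNX, torch, or TensorFlow model files and run inference yourself. If pre-processing or post-processing is needed, the recommended approach is Triton's Ensemble or BLS mechanisms. The differences between the two will be covered in detail later.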
References
Official model deployment guide
Official client guide
Official HTTP and gRPC protocol documentation
