A Guide to Triton Inference Server


On deploying models as a service, let's first have GPT give a quick introduction:

Triton Inference Server is an open-source inference service developed by NVIDIA for deploying machine learning models. It lets users expose trained deep learning models as RESTful APIs that serve real-time inference. Triton supports multiple deep learning frameworks (TensorFlow, PyTorch, ONNX, and more) and offers flexible deployment options, including single-model deployment, model ensembles, and model version management. It also provides state management, so stateful models can keep their state correct across requests.
In short, Triton Inference Server offers high performance, scalability, and ease of use for model deployment, making it well suited to serving inference in production.

☀️ Installation

Triton has two sides you interact with: the server and the client.

The server loads models, creates the service, dispatches requests, runs the computation, and sends back results.

The client sends requests and receives results.

Both can be installed from NGC or built from source; NGC is the most convenient.

Building from source supports customization and is more flexible, but it needs good network access and has quite a few pitfalls (detailed workarounds below).

NGC (recommended)

docker pull nvcr.io/nvidia/tritonserver:xx.yy-py3 

Here xx is the year and yy the month of the release; see the link below for the available versions:

Triton Inference Server | NVIDIA NGC

Once you have picked a version, just pull it:

docker pull nvcr.io/nvidia/tritonserver:23.12-py3

How do you find the tag that matches a specific CUDA version, e.g. CUDA 11.3?

  1. Open one of these links:

    Tag Layers
    or
    Framework Containers Support Matrix

  2. Each tag lists the layers it was built from; find one whose CUDA_VERSION matches:

(screenshot: image layers, check CUDA_VERSION)

Building from source

git clone https://github.com/triton-inference-server/server.git

cd server && mkdir build

./build.py -v --enable-all

build.py takes many options; choose whatever fits your setup.

During the build you will very likely hit the following pitfalls:

  • apt-get timeout

    The fix is to switch apt to a domestic (Chinese) mirror, as follows:

  1. First create a sources.list; taking Ubuntu 22.04 as an example:
➜  server git:(main) ✗ cat sources.list
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ jammy-security main restricted universe multiverse
  2. Modify build.py to copy sources.list into the image; the changes are as follows (+ marks added lines):
# Ensure apt-get won't prompt for selecting options
ENV DEBIAN_FRONTEND=noninteractive

+RUN cp /etc/apt/sources.list /etc/apt/sources.list.bak
+COPY . .
+ADD sources.list /etc/apt/
+RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 7EA0A9C3F273FCD8

# Install docker docker buildx
RUN apt-get update \
        && apt-get install -y ca-certificates curl gnupg \
        && install -m 0755 -d /etc/apt/keyrings \
        && curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg \
        && chmod a+r /etc/apt/keyrings/docker.gpg \
        && echo \
            "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
            "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
            tee /etc/apt/sources.list.d/docker.list > /dev/null \
        && apt-get update \
        && apt-get install -y docker.io docker-buildx-plugin
  • cuda-keyring.deb not found
Reading package lists...E: read, still have 8 to read but none left
E: Internal error, could not locate member control.tar{.zst,.lz4,.gz,.xz,.bz2,.lzma,}
E: Could not read meta data from /tmp/cuda-keyring.deb
E: The package lists or status file could not be parsed or opened.

The command '/bin/sh -c curl -o /tmp/cuda-keyring.deb     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb     && apt install /tmp/cuda-keyring.deb && rm /tmp/cuda-keyring.deb &&     apt-get update && apt-get install -y datacenter-gpu-manager=1:3.2.6' returned a non-zero code: 100

This happens because the download URL redirects. The fix:

Replace curl -o with curl -Lo:

curl -Lo /tmp/cuda-keyring.deb     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
  • pip timeouts

Configure a pip mirror; the change is as follows:

+RUN pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

RUN pip3 install --upgrade pip && \
    pip3 install --upgrade wheel setuptools docker

# Install boost version >= 1.78 for boost::span
# Current libboost-dev apt packages are < 1.78, so install from tar.gz
RUN wget -O /tmp/boost.tar.gz \
        https://boostorg.jfrog.io/artifactory/main/release/1.80.0/source/boost_1_80_0.tar.gz && \
    (cd /tmp && tar xzf boost.tar.gz) && \
    cd /tmp/boost_1_80_0 && ./bootstrap.sh --prefix=/usr && ./b2 install && \
    mv /tmp/boost_1_80_0/boost /usr/include/boost

# Server build requires recent version of CMake (FetchContent required)


Either way, you end up with an image, nvcr.io/nvidia/tritonserver, which is what you use to create the server container.

Installing the client

The client is the user-facing side and can run on any device: a mobile app, a web page, a robot, and so on.

Choosing --enable-all above builds the client automatically; otherwise you can install it with pip:

$ pip install tritonclient[all]

If you need CUDA (shared memory) support, add the extra:

$ pip install tritonclient[all, cuda]

Or build from source:

git clone https://github.com/triton-inference-server/client.git 
cd client
$ mkdir build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX=`pwd`/install -DTRITON_ENABLE_CC_HTTP=ON -DTRITON_ENABLE_CC_GRPC=ON -DTRITON_ENABLE_PERF_ANALYZER=ON -DTRITON_ENABLE_PERF_ANALYZER_C_API=ON -DTRITON_ENABLE_PERF_ANALYZER_TFS=ON -DTRITON_ENABLE_PERF_ANALYZER_TS=ON -DTRITON_ENABLE_PYTHON_HTTP=ON -DTRITON_ENABLE_PYTHON_GRPC=ON -DTRITON_ENABLE_JAVA_HTTP=ON -DTRITON_ENABLE_GPU=ON -DTRITON_ENABLE_EXAMPLES=ON -DTRITON_ENABLE_TESTS=ON ..
$ make cc-clients python-clients java-clients

☀️ A simple example: OCR

Pipeline overview

The OCR pipeline consists of the following steps:

  1. Fetch an image from the database;
  2. Preprocess the image into a normalized tensor;
  3. Run text detection to get text bounding boxes and scores;
  4. Post-process to get normalized cropped images;
  5. Run text recognition;
  6. Post-process to get the final text.

(figure: OCR pipeline)

Setting up the model repository

First, create the model repository:

export MODEL_REPO=~/model_repo

mkdir -p $MODEL_REPO && cd $MODEL_REPO

Then lay out the models in this directory with the following structure:

➜ tree $MODEL_REPO

./
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

4 directories, 4 files

Where:

  • Level one: text_detection and text_recognition are the model names and can be chosen freely;
  • Level two: the directory 1 is the model version. Triton requires version directories to be named with integers (so something like v1.0.1 will not work); create one directory per version if you keep several. config.pbtxt is the configuration file that specifies the model's parameters, such as batch size and backend; it is the most important piece and is covered in detail later;
  • Level three: model.onnx, the model file itself.

The two config.pbtxt files (text_detection first, then text_recognition) look like this:
name: "text_detection"
backend: "onnxruntime"
max_batch_size : 256
input [
  {
    name: "input_images:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 3 ]
  }
]
output [
  {
    name: "feature_fusion/Conv_7/Sigmoid:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 1 ]
  }
]
output [
  {
    name: "feature_fusion/concat_3:0"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, 5 ]
  }
]

name: "text_recognition"
backend: "onnxruntime"
max_batch_size : 0
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 1, 32, 100 ]
  }
]
output [
  {
    name: "307"
    data_type: TYPE_FP32
    dims: [ 1, 26, 37 ]
  }
]

Starting the service

Create the container (093eb0f6e20b is the local image ID of nvcr.io/nvidia/tritonserver; the image name works just as well):

docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${MODEL_REPO}:/models --name triton_server_demo 093eb0f6e20b

Inside the container, start the server:

tritonserver --model-repository=/models

If you don't need any extra configuration, the two steps above can be merged into one:

docker run --gpus=all --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${MODEL_REPO}:/models -d --name triton_server_demo 093eb0f6e20b tritonserver --model-repository=/models

On success you will see output similar to this:

I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| text_detection   | 1       | READY  |
| text_recognition | 1       | READY  |
+------------------+---------+--------+

I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.23.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0712 16:37:18.269464 128 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0712 16:37:18.269956 128 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0712 16:37:18.311686 128 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
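
Before moving on to the client, you can sanity-check the endpoints from Python. A minimal sketch (it assumes tritonclient is installed on the client machine and uses the model names configured above):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Liveness/readiness of the server itself
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())

# Readiness of each model in the repository
for model in ("text_detection", "text_recognition"):
    print(model, "ready:", client.is_model_ready(model))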

Sending requests from the client

import math
import cv2
import numpy as np
import tritonclient.http as httpclient

SAVE_INTERMEDIATE_IMAGES = False

def detection_preprocessing(image: cv2.Mat) -> np.ndarray:
    inpWidth = 640
    inpHeight = 480

    # pre-process image
    blob = cv2.dnn.blobFromImage(
        image, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False
    )
    blob = np.transpose(blob, (0, 2, 3, 1))
    return blob


def detection_postprocessing(scores, geometry, preprocessed_image):
    def fourPointsTransform(frame, vertices):
        vertices = np.asarray(vertices)
        outputSize = (100, 32)
        targetVertices = np.array(
            [
                [0, outputSize[1] - 1],
                [0, 0],
                [outputSize[0] - 1, 0],
                [outputSize[0] - 1, outputSize[1] - 1],
            ],
            dtype="float32",
        )

        rotationMatrix = cv2.getPerspectiveTransform(vertices, targetVertices)
        result = cv2.warpPerspective(frame, rotationMatrix, outputSize)
        return result

    def decodeBoundingBoxes(scores, geometry, scoreThresh=0.5):
        detections = []
        confidences = []

        ############ CHECK DIMENSIONS AND SHAPES OF geometry AND scores ########
        assert len(scores.shape) == 4, "Incorrect dimensions of scores"
        assert len(geometry.shape) == 4, "Incorrect dimensions of geometry"
        assert scores.shape[0] == 1, "Invalid dimensions of scores"
        assert geometry.shape[0] == 1, "Invalid dimensions of geometry"
        assert scores.shape[1] == 1, "Invalid dimensions of scores"
        assert geometry.shape[1] == 5, "Invalid dimensions of geometry"
        assert (
            scores.shape[2] == geometry.shape[2]
        ), "Invalid dimensions of scores and geometry"
        assert (
            scores.shape[3] == geometry.shape[3]
        ), "Invalid dimensions of scores and geometry"
        height = scores.shape[2]
        width = scores.shape[3]
        for y in range(0, height):
            # Extract data from scores
            scoresData = scores[0][0][y]
            x0_data = geometry[0][0][y]
            x1_data = geometry[0][1][y]
            x2_data = geometry[0][2][y]
            x3_data = geometry[0][3][y]
            anglesData = geometry[0][4][y]
            for x in range(0, width):
                score = scoresData[x]

                # If score is lower than threshold score, move to next x
                if score < scoreThresh:
                    continue

                # Calculate offset
                offsetX = x * 4.0
                offsetY = y * 4.0
                angle = anglesData[x]

                # Calculate cos and sin of angle
                cosA = math.cos(angle)
                sinA = math.sin(angle)
                h = x0_data[x] + x2_data[x]
                w = x1_data[x] + x3_data[x]

                # Calculate offset
                offset = [
                    offsetX + cosA * x1_data[x] + sinA * x2_data[x],
                    offsetY - sinA * x1_data[x] + cosA * x2_data[x],
                ]

                # Find points for rectangle
                p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
                p3 = (-cosA * w + offset[0], sinA * w + offset[1])
                center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
                detections.append((center, (w, h), -1 * angle * 180.0 / math.pi))
                confidences.append(float(score))

        # Return detections and confidences
        return [detections, confidences]

    scores = scores.transpose(0, 3, 1, 2)
    geometry = geometry.transpose(0, 3, 1, 2)
    frame = np.squeeze(preprocessed_image, axis=0)
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    [boxes, confidences] = decodeBoundingBoxes(scores, geometry)
    indices = cv2.dnn.NMSBoxesRotated(boxes, confidences, 0.5, 0.4)

    cropped_list = []
    cv2.imwrite("frame.png", frame)
    count = 0
    for i in indices:
        # get 4 corners of the rotated rect
        count += 1
        vertices = cv2.boxPoints(boxes[i])
        cropped = fourPointsTransform(frame, vertices)
        cv2.imwrite(str(count) + ".png", cropped)
        cropped = np.expand_dims(cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY), axis=0)

        cropped_list.append(((cropped / 255.0) - 0.5) * 2)
    cropped_arr = np.stack(cropped_list, axis=0)

    # Only keep the first image, since the models don't currently allow batching.
    # See part 2 for enabling batch sizes > 0
    return cropped_arr[None, 0]


def recognition_postprocessing(scores: np.ndarray) -> str:
    text = ""
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"

    scores = np.transpose(scores, (1, 0, 2))

    for i in range(scores.shape[0]):
        c = np.argmax(scores[i][0])
        if c != 0:
            text += alphabet[c - 1]
        else:
            text += "-"
    # adjacent same letters as well as background text must be removed
    # to get the final output
    char_list = []
    for i, char in enumerate(text):
        if char != "-" and (not (i > 0 and char == text[i - 1])):
            char_list.append(char)
    return "".join(char_list)


if __name__ == "__main__":
    # Setting up client
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Read image and create input object
    raw_image = cv2.imread("./img1.jpg")
    preprocessed_image = detection_preprocessing(raw_image)

    detection_input = httpclient.InferInput(
        "input_images:0", preprocessed_image.shape, datatype="FP32"
    )
    detection_input.set_data_from_numpy(preprocessed_image, binary_data=True)

    # Query the server
    detection_response = client.infer(
        model_name="text_detection", inputs=[detection_input]
    )

    # Process responses from detection model
    scores = detection_response.as_numpy("feature_fusion/Conv_7/Sigmoid:0")
    geometry = detection_response.as_numpy("feature_fusion/concat_3:0")
    cropped_images = detection_postprocessing(scores, geometry, preprocessed_image)

    # Create input object for recognition model
    recognition_input = httpclient.InferInput(
        "input.1", cropped_images.shape, datatype="FP32"
    )
    recognition_input.set_data_from_numpy(cropped_images, binary_data=True)

    # Query the server
    recognition_response = client.infer(
        model_name="text_recognition", inputs=[recognition_input]
    )

    # Process response from recognition model
    final_text = recognition_postprocessing(recognition_response.as_numpy("308"))

    print(final_text)

Input image:

(input image containing the text "stop")

Run the request:

python client.py

Output:

stop

☀️ perf_analyzer

This is an official tool that simulates clients sending requests in order to measure the performance of a running service, including throughput and latency. It offers a large number of options to reproduce realistic usage scenarios as closely as possible.

It also supports benchmarking many of today's popular AI services, such as LLMs.

Source code and detailed documentation:

perf_analyzer

Installation

If you have installed tritonclient, perf_analyzer is installed along with it.

At runtime you may hit these errors:

  • perf_analyzer: error while loading shared libraries: libb64.so.0d: cannot open shared object file: No such file or directory

Fix:

sudo apt-get install libb64-0d
  • libcudart.so.12: cannot open shared object file: No such file or directory

Fix (pin an older tritonclient whose bundled perf_analyzer does not require CUDA 12):

pip install tritonclient==2.9.0

Common options

-m                        model name
-x                        model version
--async                   use asynchronous mode. If the model is a stateful model (one of Triton's core model architectures, covered later), async is the default. In synchronous mode perf_analyzer starts as many threads as the concurrency level; async mode keeps the same concurrency with fewer threads.
--sync                    force synchronous mode
--measurement-interval    sampling interval for the statistics
--concurrency-range       range of concurrency levels; 'end' and --latency-threshold cannot both be 0
--request-rate-range      range of request-sending rates; 'end' and --latency-threshold cannot both be 0 (request rate is not the same thing as concurrency)
--request-distribution    distribution of the request intervals; must be used together with --request-rate-range
--request-intervals       path to a file containing request-interval data; cannot be combined with --request-rate-range or --concurrency-range
--num-of-sequences        number of sequences when benchmarking a sequence model
--latency-threshold       latency threshold; the run stops once it is exceeded
--max-threads             maximum number of threads
--stability-percentage    allowed fluctuation (%) for the latency stability measurement
--max-trials              maximum number of measurement attempts
--percentile              use a percentile to decide whether latency is stable
-b                        batch size
--input-data              input data source
--shared-memory           type of shared memory
--output-shared-memory-size  maximum memory that output tensors may allocate
--sequence-length         sequence length
--string-length           length of string-type inputs
-u                        server address
-i                        communication protocol

On synchronous versus asynchronous concurrent requests, quoting GPT:

In concurrent requests, synchronous and asynchronous are two processing modes; the main difference is how requests are waited on and handled:
Synchronous:
Blocking wait: in synchronous mode, after a request is sent the caller blocks and waits until the response arrives or the operation completes, so requests are handled one after another.
Blocking calls: a synchronous call blocks program execution until the request finishes, which can make the program unresponsive while it waits.
Asynchronous:
Non-blocking: in asynchronous mode the caller does not wait for each request to finish; it carries on with subsequent work while the request is handled in the background or on another thread.
Callbacks: asynchronous calls usually rely on callback functions or event handlers so that the appropriate logic runs when a request completes.
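
To make the difference concrete on the client side, here is a minimal sketch contrasting a blocking infer call with async_infer in the Python HTTP client. It reuses the text_recognition model from the OCR example above (input "input.1" of shape [1, 1, 32, 100], output "308") and is an illustration, not a benchmark:

import numpy as np
import tritonclient.http as httpclient

# concurrency controls the connection pool used by async_infer
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=4)

dummy = np.random.rand(1, 1, 32, 100).astype(np.float32)
inp = httpclient.InferInput("input.1", dummy.shape, "FP32")
inp.set_data_from_numpy(dummy, binary_data=True)

# Synchronous: blocks until the response arrives
result = client.infer(model_name="text_recognition", inputs=[inp])
print("sync:", result.as_numpy("308").shape)

# Asynchronous: calls return immediately, results are collected later
handles = [client.async_infer(model_name="text_recognition", inputs=[inp]) for _ in range(4)]
for h in handles:
    print("async:", h.get_result().as_numpy("308").shape)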

☀️ Triton model architectures

Triton mainly supports three model architectures: stateless, stateful, and ensemble.

Stateful

This one is the hardest to grasp; quoting the official documentation:

With respect to Triton’s schedulers, a stateful model does maintain state between inference requests. The model is expecting multiple inference requests that together form a sequence of inferences that must be routed to the same model instance so that the state being maintained by the model is correctly updated. Moreover, the model may require that Triton provide control signals indicating, for example, the start and end of the sequence.

Key phrases:

  • state between inference requests
  • routed to the same model instance
  • state being maintained by the model

Interpretation:

First, the state is state carried across requests, so it has nothing to do with the model architecture itself (RNN, CNN, ...);

Second, the sequence of requests that share that state is routed to the same model instance, so the state can be updated and consumed correctly;

Finally, the model relies on control signals provided by Triton, such as CONTROL_SEQUENCE_CORRID and CONTROL_SEQUENCE_END, to organize how the requests are executed.

So a stateful model is just an ordinary model plus state variables plus Triton control signals, as shown below:

(figure: stateful model)

It is easier to show with an example; take an accumulator:

ResetSequence is the start signal; a Where op selects between reusing the previous state Accumulate_In and re-initializing it.

Accumulate_In is the state maintained inside the model; logically it equals the previous Accumulate_Out:

Accumulate_In(t) = Accumulate_Out(t-1)

Accumulate_Out(t) = Accumulate_In(t) + ReduceSum(Input(t))

Output equals the input plus the current Accumulate_Out, giving the running total:

Output(t) = Input(t) + Accumulate_Out(t) = Input(t) + Accumulate_Out(t-1) + ReduceSum(Input(t))

As the formulas show, the output of the current request depends on the state left by the previous request; that is what stateful means, and it is why we rely on Triton's sequence control signals and the CORRID to guarantee that the state is updated correctly every time.

(figure: the accumulator example)
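
On the client side, what ties such a sequence together is the correlation ID plus the start/end flags on each request. A hypothetical sketch (the model name "accumulate" and the tensor names "INPUT"/"OUTPUT" are made up; the point is sequence_id, sequence_start, and sequence_end, which let Triton's sequence batcher route every request of the sequence to the same model instance and drive the CONTROL_SEQUENCE_* signals):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
seq_id = 42  # correlation ID shared by all requests of this sequence

for step in range(5):
    data = np.full((1, 4), step, dtype=np.float32)
    inp = httpclient.InferInput("INPUT", data.shape, "FP32")
    inp.set_data_from_numpy(data)
    resp = client.infer(
        model_name="accumulate",        # hypothetical stateful model
        inputs=[inp],
        sequence_id=seq_id,
        sequence_start=(step == 0),     # first request of the sequence
        sequence_end=(step == 4),       # last request of the sequence
    )
    print(resp.as_numpy("OUTPUT"))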

Stateless

Stateless means exactly that: requests are independent of one another, the mirror image of stateful. It is the most common and most basic architecture, and most of the optimizations introduced later apply to it.

Ensemble

A complex pipeline usually consists of several models; OCR, for example, has a text_detection part and a text_recognition part. In the earlier example, the client has to go through several steps to complete the whole OCR task:

  1. Fetch the full image;
  2. Preprocess the full image;
  3. Detection;
  4. Crop + preprocess;
  5. Recognition.

Steps 1, 2, and 4 run on the client, steps 3 and 5 on Triton, as shown on the left of the figure below.

Clearly the data keeps bouncing between the client and Triton, which wastes network bandwidth and compute.

(figure: client-side pipeline on the left vs. ensemble pipeline on the right)

An ensemble treats the intermediate processing steps, such as image cropping, as models too; the models are chained in order and their inputs and outputs are stitched together through those processing modules, which avoids the problem above, as shown on the right of the figure. The whole pipeline can then be treated as a single model; a sketch of calling such an ensemble from the client follows.
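
From the client's point of view an ensemble is just another model: one request with the raw image in, one response with the text out. A hypothetical sketch (the ensemble name "ensemble_model" and its tensor names depend entirely on how your ensemble config.pbtxt is written):

import cv2
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

raw = cv2.imread("./img1.jpg")
image = np.expand_dims(raw.astype(np.uint8), axis=0)  # e.g. [1, H, W, 3]

inp = httpclient.InferInput("input_image", image.shape, "UINT8")
inp.set_data_from_numpy(image, binary_data=True)

# Pre/post-processing and both models all run server side
resp = client.infer(model_name="ensemble_model", inputs=[inp])
print(resp.as_numpy("recognized_text"))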

☀️ Throughput optimization

In general, you can improve the throughput of an AI service along several axes:

  • Model: pruning, quantization, input resolution, choosing appropriate ops, etc.;
  • Hardware: more network bandwidth, faster accelerators, high-performance storage, etc.;
  • Scheduling: dynamically batching requests, asynchrony, etc.;
  • Runtime: a highly optimized inference backend, an optimized pipeline, etc.

Apart from model-level optimization, Triton provides mechanisms for all of these. Below is a comparative test across them.

First, the hardware of the test server:

processor       : 0                                                                                                                                                                                      
vendor_id       : GenuineIntel                                                                                                                                                                           
cpu family      : 6                                                                                                                                                                                      
model           : 85                                                                                                                                                                                     
model name      : Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz                                                                                                                                            
stepping        : 7                                                                                                                                                                                      
microcode       : 0x5003303
cpu MHz         : 1000.000
cache size      : 14080 KB
physical id     : 0
siblings        : 20
core id         : 0
cpu cores       : 10
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit mmio_stale_data retbleed eibrs_pbrsb
bogomips        : 4400.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-cea6c460-bfe4-934d-a9e6-1c0e505216f5)
         Link 0: <inactive>
         Link 1: <inactive>
         Link 2: <inactive>
         Link 3: <inactive>

The baseline config.pbtxt: no dynamic batching, a single instance, running on CPU:

name: "text_recognition"
backend: "onnxruntime"
max_batch_size : 8
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 32, 100 ]
  }
]
output [
  {
    name: "308"
    data_type: TYPE_FP32
    dims: [ 26, 37 ]
  }
]

instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
]

Baseline results from perf_analyzer (sweeping concurrency from 2 to 16):

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 5.2 infer/sec, latency 822731 usec
Concurrency: 4, throughput: 5.2 infer/sec, latency 1609278 usec
Concurrency: 6, throughput: 4.4 infer/sec, latency 2453611 usec
Concurrency: 8, throughput: 4 infer/sec, latency 3206366 usec
Concurrency: 10, throughput: 3.6 infer/sec, latency 4005856 usec
Concurrency: 12, throughput: 3.2 infer/sec, latency 4802846 usec
Concurrency: 14, throughput: 2.8 infer/sec, latency 5605523 usec
Concurrency: 16, throughput: 2.4 infer/sec, latency 6458274 usec

Batcher

Batching combines multiple requests into one batch before handing them to the model. In most cases batching outperforms non-batching; however, if you are latency-sensitive and requests arrive far apart, enabling batching will increase latency. Different model architectures use different batchers.

  • Dynamic Batcher

With a single instance and a stateless model, the server normally processes requests one by one; if the model supports batching, the corresponding Triton flow looks like this:

(figure: dynamic batching flow)

How to use it:

  1. The backend has to support it; for example, the ONNX model must be exported with a dynamic batch dimension;
  2. Define dynamic_batching in config.pbtxt.
name: "text_recognition"
backend: "onnxruntime"
max_batch_size : 8
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 32, 100 ]
  }
]
output [
  {
    name: "308"
    data_type: TYPE_FP32
    dims: [ 26, 37 ]
  }
]

dynamic_batching { }

instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
]

This field can define the following variables:

repeated int32 preferred_batch_size: batch sizes to prefer, e.g. [8, 16, 32];
uint64 max_queue_delay_microseconds: maximum time to wait while forming a batch;
bool preserve_ordering: whether to preserve request order within the batch;
uint64 priority_levels: priority levels;
ModelQueuePolicy default_queue_policy: queue policy
  • Sequence Batcher

If the model is a stateful model, use the sequence batcher.

  • Ensemble Scheduler

Dedicated to ensemble models; covered separately later.

With dynamic batching enabled, run perf_analyzer again:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 5.2 infer/sec, latency 841268 usec
Concurrency: 4, throughput: 11.2 infer/sec, latency 859879 usec
Concurrency: 6, throughput: 15.2 infer/sec, latency 891571 usec
Concurrency: 8, throughput: 17.6 infer/sec, latency 881096 usec
Concurrency: 10, throughput: 17.6 infer/sec, latency 1348658 usec
Concurrency: 12, throughput: 17.6 infer/sec, latency 1361539 usec
Concurrency: 14, throughput: 17.6 infer/sec, latency 1774365 usec
Concurrency: 16, throughput: 16 infer/sec, latency 1762362 usec

Device

A GPU generally offers far more compute than a CPU; set the device type via kind:

  enum Kind {
    //@@    .. cpp:enumerator:: Kind::KIND_AUTO = 0
    //@@
    //@@       This instance group represents instances that can run on either
    //@@       CPU or GPU. If all GPUs listed in 'gpus' are available then
    //@@       instances will be created on GPU(s), otherwise instances will
    //@@       be created on CPU.
    //@@
    KIND_AUTO = 0;

    //@@    .. cpp:enumerator:: Kind::KIND_GPU = 1
    //@@
    //@@       This instance group represents instances that must run on the
    //@@       GPU.
    //@@
    KIND_GPU = 1;

    //@@    .. cpp:enumerator:: Kind::KIND_CPU = 2
    //@@
    //@@       This instance group represents instances that must run on the
    //@@       CPU.
    //@@
    KIND_CPU = 2;

    //@@    .. cpp:enumerator:: Kind::KIND_MODEL = 3
    //@@
    //@@       This instance group represents instances that should run on the
    //@@       CPU and/or GPU(s) as specified by the model or backend itself.
    //@@       The inference server will not override the model/backend
    //@@       settings.
    //@@
    KIND_MODEL = 3;
  }

The default is KIND_AUTO, which picks automatically based on GPU availability; here we explicitly use the GPU:

name: "text_recognition"
backend: "onnxruntime"
max_batch_size : 8
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 32, 100 ]
  }
]
output [
  {
    name: "308"
    data_type: TYPE_FP32
    dims: [ 26, 37 ]
  }
]
dynamic_batching { }
instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

Results:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 771.6 infer/sec, latency 5285 usec
Concurrency: 4, throughput: 1049.6 infer/sec, latency 7728 usec
Concurrency: 6, throughput: 1196.4 infer/sec, latency 10154 usec
Concurrency: 8, throughput: 1500.8 infer/sec, latency 10799 usec
Concurrency: 10, throughput: 1499.2 infer/sec, latency 16075 usec
Concurrency: 12, throughput: 1504 infer/sec, latency 16094 usec
Concurrency: 14, throughput: 1502.4 infer/sec, latency 21401 usec
Concurrency: 16, throughput: 1500.8 infer/sec, latency 21471 usec

Instance

Similar to multithreading: you create multiple model instances on one device, and each instance can handle requests in parallel; the default is a single instance.

Think of a railway-station ticket office: each window is an instance, and with multiple instances every window can serve passengers. More instances increase throughput but also increase load.

name: "text_recognition"
backend: "onnxruntime"
max_batch_size : 8
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 32, 100 ]
  }
]
output [
  {
    name: "308"
    data_type: TYPE_FP32
    dims: [ 26, 37 ]
  }
]
dynamic_batching { }
instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]

Results:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 669.6 infer/sec, latency 6921 usec
Concurrency: 4, throughput: 867.2 infer/sec, latency 11136 usec
Concurrency: 6, throughput: 1130.4 infer/sec, latency 12631 usec
Concurrency: 8, throughput: 1326.4 infer/sec, latency 14248 usec
Concurrency: 10, throughput: 1506.8 infer/sec, latency 15818 usec
Concurrency: 12, throughput: 1639.6 infer/sec, latency 17343 usec
Concurrency: 14, throughput: 1643.6 infer/sec, latency 22555 usec
Concurrency: 16, throughput: 1649.6 infer/sec, latency 24245 usec
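
The numbers above come from perf_analyzer's concurrency sweep. If you want to generate a comparable concurrent load from your own code, a rough Python sketch follows (shapes follow the GPU config above: max_batch_size 8, input dims [1, 32, 100], so each request carries a [1, 1, 32, 100] tensor; throughput measured this way will not match perf_analyzer exactly):

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

CONCURRENCY = 8
REQUESTS_PER_THREAD = 50

def worker(_):
    # one client (and therefore one connection) per thread
    client = httpclient.InferenceServerClient(url="localhost:8000")
    data = np.random.rand(1, 1, 32, 100).astype(np.float32)
    inp = httpclient.InferInput("input.1", data.shape, "FP32")
    inp.set_data_from_numpy(data, binary_data=True)
    for _ in range(REQUESTS_PER_THREAD):
        client.infer(model_name="text_recognition", inputs=[inp])

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(worker, range(CONCURRENCY)))

elapsed = time.time() - start
total = CONCURRENCY * REQUESTS_PER_THREAD
print(f"{total} requests in {elapsed:.1f}s -> {total / elapsed:.1f} infer/sec")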

Backend

Official backends include OpenVINO, TensorRT, ONNX Runtime, PyTorch, and others. Different backends support different devices, and matching them sensibly increases throughput, e.g. TensorRT + GPU.

Triton's overall flow:

(figure: Triton flow diagram)

How do you choose a backend for your device type and model type? See the figure below:

(figure: backend selection guide)

TensorRT example

  • First convert the model:
trtexec --onnx=/models/text_recognition/1/model.onnx --saveEngine=model.plan

If you need dynamic batching, pass the shape options; see trtexec -h for other optimization flags:

trtexec --onnx=/models/text_recognition/1/model.onnx --saveEngine=model.plan --minShapes=input.1:1x1x32x100 --maxShapes=input.1:8x1x32x100 --optShapes=input.1:2x1x32x100

You can also export a TensorRT model directly from PyTorch:

import torch
import torch_tensorrt
torch.hub._validate_not_a_forked_repo=lambda a,b,c: True

# load model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

# Compile with Torch TensorRT;
trt_model = torch_tensorrt.compile(model,
    inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions= { torch.half}  # Run with FP16
)

# Save the model
torch.jit.save(trt_model, "model.pt")
  • Modify config.pbtxt; note that here you define platform rather than backend:
name: "text_recognition"
platform: "tensorrt_plan"
max_batch_size : 8
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 32, 100 ]
  }
]
output [
  {
    name: "308"
    data_type: TYPE_FP32
    dims: [ 26, 37 ]
  }
]

dynamic_batching { }

Results:

Concurrency: 2, throughput: 812 infer/sec, latency 5202 usec
Concurrency: 4, throughput: 974.4 infer/sec, latency 8359 usec
Concurrency: 6, throughput: 1154.4 infer/sec, latency 10738 usec
Concurrency: 8, throughput: 1296.4 infer/sec, latency 12570 usec
Concurrency: 10, throughput: 1477.6 infer/sec, latency 13800 usec
Concurrency: 12, throughput: 1566.4 infer/sec, latency 15485 usec
Concurrency: 14, throughput: 1561.6 infer/sec, latency 20661 usec
Concurrency: 16, throughput: 1560 infer/sec, latency 20777 usec

With FP16 precision:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 3212 infer/sec, latency 1340 usec
Concurrency: 4, throughput: 2600 infer/sec, latency 3656 usec
Concurrency: 6, throughput: 3409.6 infer/sec, latency 3915 usec
Concurrency: 8, throughput: 3225.2 infer/sec, latency 5332 usec
Concurrency: 10, throughput: 3533.6 infer/sec, latency 5970 usec
Concurrency: 12, throughput: 4222.4 infer/sec, latency 6002 usec
Concurrency: 14, throughput: 4222.4 infer/sec, latency 7700 usec
Concurrency: 16, throughput: 4190.4 infer/sec, latency 7828 usec

ONNX backend with TensorRT acceleration

A hybrid mode: eligible ops in the ONNX graph are preferentially accelerated with TensorRT. Using it is simple; just define the optimization settings in the pbtxt:

name: "text_recognition"
backend: "onnxruntime"
max_batch_size : 8
input [
  {
    name: "input.1"
    data_type: TYPE_FP32
    dims: [ 1, 32, 100 ]
  }
]
output [
  {
    name: "308"
    data_type: TYPE_FP32
    dims: [ 26, 37 ]
  }
]

optimization {
  graph : {
    level : 1
  }
 execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt",
      parameters { key: "precision_mode" value: "FP16" },
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}

Configurable parameters of gpu_execution_accelerator:

precision_mode: precision, one of "FP32", "FP16", "INT8"; default "FP32".
max_workspace_size_bytes: maximum pre-allocated GPU memory; default 1 GB.
int8_calibration_table_name: calibration table for INT8 mode.
int8_use_native_calibration_table: 1 (use the calibration table generated by TensorRT) or 0 (use the calibration table generated by ORT); default 0.
trt_engine_cache_enable: enable the engine cache.
trt_engine_cache_path: path of the engine cache.

Results:

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 2, throughput: 3088 infer/sec, latency 1381 usec
Concurrency: 4, throughput: 3070.4 infer/sec, latency 2721 usec
Concurrency: 6, throughput: 3036.4 infer/sec, latency 4131 usec
Concurrency: 8, throughput: 3028.8 infer/sec, latency 5518 usec
Concurrency: 10, throughput: 3002.8 infer/sec, latency 6964 usec
Concurrency: 12, throughput: 3047.2 infer/sec, latency 8128 usec
Concurrency: 14, throughput: 3000.4 infer/sec, latency 9714 usec
Concurrency: 16, throughput: 3068.4 infer/sec, latency 10575 usec

The config has many parameters; tune them to your actual situation.

Finally, if none of this meets your needs you can write a custom backend, which is beyond the scope of this post.

This post covers the basics; there is still plenty left to write up and experiment with:

  • how to replay a real request-interval trace with perf_analyzer;
  • a stateful example, including how to configure the scheduler;
  • how to build an ensemble model;
  • custom batchers;
  • custom backends;
  • automatically generating an optimal config.pbtxt;
  • Model Navigator

To keep the length under control: to be continued.
