TensorRT-LLM Autoscaling: K8s HPA Configuration in Practice
Introduction: The Elasticity Pain Points of LLM Inference
Are you facing these challenges: your TensorRT-LLM service exhausts GPU resources at peak hours, yet wastes them during off-peak periods? The Kubernetes Horizontal Pod Autoscaler (HPA) can adjust the number of Pods dynamically, but in GPU-accelerated LLM scenarios, how do you configure the right metrics, avoid scaling flapping, and react to traffic spikes before latency degrades? This article uses three core configuration files, five tuning parameters, and a complete monitoring pipeline to help you build a production-grade elastic TensorRT-LLM deployment.
After reading this article you will be able to:
- Configure a dual-metric HPA driven by GPU utilization and request queue length
- Apply K8s deployment best practices for TensorRT-LLM services
- Tune the key parameters for scaling stability (stabilization windows, minimum replicas)
- Set up a Prometheus + Grafana monitoring dashboard
- Troubleshoot failures and tune performance
1. Environment Preparation and Prerequisites
1.1 Software Version Requirements
| Component | Minimum Version | Recommended Version | Notes |
|---|---|---|---|
| Kubernetes | 1.23 | 1.27+ | Must support the autoscaling/v2 HPA API |
| TensorRT-LLM | 0.5.0 | 0.8.0+ | Ships the trtllm-serve server |
| NVIDIA GPU Operator | 22.9.0 | 23.9.0+ | Provides GPU monitoring metrics |
| Prometheus | 2.30.0 | 2.45.0+ | Scrapes service and GPU metrics |
| Metrics Server | 0.5.0 | 0.6.3+ | Provides Pod resource metrics |
1.2 Cluster Environment Check
# Verify GPU node labels
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.present}{"\n"}{end}'
# Check that Metrics Server is running
kubectl get pods -n kube-system | grep metrics-server
# Verify the Prometheus endpoint
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
2. TensorRT-LLM Service Deployment Basics
2.1 Deployment Architecture
(Architecture overview: client requests reach the TensorRT-LLM Pods through a ClusterIP Service; Prometheus scrapes GPU and queue metrics, and the HPA consumes them via the Prometheus Adapter to scale the Deployment.)
2.2 Base Deployment Configuration
Create trtllm-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-deploy
  namespace: llm-inference
spec:
  replicas: 3   # initial replica count
  selector:
    matchLabels:
      app: tensorrt-llm
  template:
    metadata:
      labels:
        app: tensorrt-llm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: tensorrt-llm
        image: nvcr.io/nvidia/tensorrt-llm:0.8.0-py3
        command: ["trtllm-serve", "--model", "/models/llama-7b", "--port", "8000"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per Pod
            cpu: "4"
            memory: "16Gi"
          requests:
            cpu: "2"
            memory: "8Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
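The Deployment mounts a PersistentVolumeClaim named model-pvc that is not defined in this article. Below is a minimal sketch of what it could look like; the ReadOnlyMany access mode, the `nfs-models` storage class name, and the 100Gi size are assumptions to adjust for your cluster and model size.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: llm-inference
spec:
  accessModes:
  - ReadOnlyMany                 # all replicas mount the same model files read-only
  storageClassName: nfs-models   # hypothetical storage class backed by shared storage
  resources:
    requests:
      storage: 100Gi             # size for the engine/model files (assumption)
```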
2.3 Service and Ingress Configuration
Create trtllm-service.yaml (an Ingress sketch follows the Service):
apiVersion: v1
kind: Service
metadata:
  name: tensorrt-llm-service
  namespace: llm-inference
spec:
  selector:
    app: tensorrt-llm
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
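The Service above only provides in-cluster access. For external traffic, an Ingress can front it; here is a minimal, hedged sketch that assumes the NGINX Ingress Controller is installed and uses the hypothetical host `llm.example.com`.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tensorrt-llm-ingress
  namespace: llm-inference
spec:
  ingressClassName: nginx          # assumes the NGINX Ingress Controller
  rules:
  - host: llm.example.com          # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tensorrt-llm-service
            port:
              number: 80
```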
3. Core HPA Configuration and Metric Monitoring
3.1 HPA Configuration (autoscaling/v2)
Create trtllm-hpa.yaml:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorrt-llm-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorrt-llm-deploy
  minReplicas: 2    # minimum replicas, avoids cold-start latency
  maxReplicas: 10   # maximum replicas, caps resource consumption
  metrics:
  # GPU utilization is not a built-in Resource metric (HPA Resource metrics only
  # cover cpu and memory), so it is consumed as a per-Pod custom metric exposed
  # through the Prometheus Adapter (see section 3.2).
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"          # GPU utilization threshold (%)
  - type: Pods
    pods:
      metric:
        name: trtllm_queue_length
      target:
        type: AverageValue
        averageValue: "10"          # average inference queue length threshold
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # scale-up stabilization window
      policies:
      - type: Percent
        value: 50
        periodSeconds: 120              # add at most 50% more Pods per 2-minute period
    scaleDown:
      stabilizationWindowSeconds: 300   # scale-down stabilization window (longer to avoid flapping)
      policies:
      - type: Percent
        value: 30
        periodSeconds: 300              # remove at most 30% of Pods per 5-minute period
3.2 Custom Metric Collection (Prometheus Adapter)
Create prometheus-adapter-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'trtllm_queue_length{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      # Keep the exposed metric name identical to the one referenced by the HPA.
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
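The HPA in section 3.1 also consumes a per-Pod nvidia_gpu_utilization metric, which the rule above does not expose. A hedged sketch of an additional rule for the same rules list follows; it assumes the GPU Operator's dcgm-exporter publishes DCGM_FI_DEV_GPU_UTIL with pod and namespace labels, so check the actual series and label names in your Prometheus before relying on it.

```yaml
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^DCGM_FI_DEV_GPU_UTIL$"
        as: "nvidia_gpu_utilization"
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```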
3.3 Monitored Metrics
| Metric | Type | Description | Suggested Threshold |
|---|---|---|---|
| nvidia_gpu_utilization | Pods (custom) | GPU utilization (%) | 70% |
| trtllm_queue_length | Pods (custom) | Inference request queue length | 10 |
| trtllm_inference_latency | Pods (custom) | P95 inference latency (ms) | 500 ms |
| http_requests_total | Pods (custom) | Request rate (RPS, derived with rate()) | 100 |
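If latency should also drive scaling, the metrics list of the HPA in section 3.1 can be extended with a third entry. A hedged sketch, assuming trtllm_inference_latency is exposed through the Prometheus Adapter in milliseconds like the queue-length metric:

```yaml
  - type: Pods
    pods:
      metric:
        name: trtllm_inference_latency
      target:
        type: AverageValue
        averageValue: "500"   # scale up once the average reported P95 latency exceeds 500 ms
```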
4. Deployment and Verification
4.1 Deployment Steps
# Create the namespace
kubectl create namespace llm-inference
# Deploy the TensorRT-LLM service
kubectl apply -f trtllm-deployment.yaml -n llm-inference
kubectl apply -f trtllm-service.yaml -n llm-inference
# Deploy the HPA
kubectl apply -f trtllm-hpa.yaml -n llm-inference
# Check deployment status
kubectl get pods -n llm-inference
kubectl get hpa -n llm-inference
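As an alternative to applying the files one by one, they can be grouped with kustomize, which is built into kubectl. A minimal kustomization.yaml sketch, assuming the three manifests live in the current directory; apply it with `kubectl apply -k .`:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-inference
resources:
- trtllm-deployment.yaml
- trtllm-service.yaml
- trtllm-hpa.yaml
```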
4.2 Load Testing and Scaling Verification
# Install the load-testing tool
pip install locust
# Create locustfile.py
cat > locustfile.py << EOF
from locust import HttpUser, task, between

class TRTLLMUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def inference(self):
        self.client.post("/v1/completions", json={
            "prompt": "What is TensorRT-LLM?",
            "max_tokens": 100
        })
EOF
# Start the load test (100 users, spawn rate of 10 users per second)
locust -f locustfile.py --headless -u 100 -r 10 --host=http://tensorrt-llm-service.llm-inference
4.3 Monitoring Dashboard (Grafana)
A minimal dashboard excerpt with two panels:
{
  "panels": [
    {
      "title": "GPU Utilization",
      "type": "graph",
      "targets": [
        {
          "expr": "avg(nvidia_gpu_utilization{pod=~\"tensorrt-llm-deploy-.*\"}) by (pod)",
          "legendFormat": "{{pod}}"
        }
      ]
    },
    {
      "title": "Pod Count",
      "type": "graph",
      "targets": [
        {
          "expr": "kube_deployment_status_replicas_available{deployment=\"tensorrt-llm-deploy\"}",
          "legendFormat": "available replicas"
        }
      ]
    }
  ]
}
5. Optimization and Best Practices
5.1 Scaling Stability Optimization
- Minimum replica count: size it from the lowest request rate the service must always absorb, so cold starts are avoided. Formula: min_replicas = ceil(baseline_qps / single_pod_qps)
- GPU sharing strategy: use MIG (Multi-Instance GPU) to partition GPUs and raise utilization:
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one MIG slice instead of a full GPU
- Warm-up mechanism: add an init container to the Deployment that pre-loads the model (a probe sketch for keeping un-warmed Pods out of rotation follows this list):
    initContainers:
    - name: model-warmup
      image: nvcr.io/nvidia/tensorrt-llm:0.8.0-py3
      command: ["python", "-c", "from tensorrt_llm.builder import Builder; Builder()"]
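Warm-up only helps if traffic is withheld until the engine has finished loading. Below is a hedged probe sketch for the container in section 2.2; it assumes trtllm-serve exposes a /health endpoint on port 8000, which you should verify against your TensorRT-LLM version.

```yaml
        startupProbe:               # tolerate a long model-loading phase (up to ~10 minutes here)
          httpGet:
            path: /health           # assumed health endpoint
            port: 8000
          failureThreshold: 60
          periodSeconds: 10
        readinessProbe:             # keep Pods out of the Service until they can serve requests
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
```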
5.2 Troubleshooting Guide
| Symptom | Likely Cause | Fix |
|---|---|---|
| HPA never scales up | metrics-server not running | kubectl rollout restart deployment metrics-server -n kube-system |
| Frequent scale up/down (flapping) | Stabilization window too short | Increase scaleDown.stabilizationWindowSeconds to 300 s or more |
| GPU metrics not collected | GPU Operator not installed | helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace |
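One more failure mode worth noting: the prometheus.io/* scrape annotations in section 2.2 are only honored by annotation-based Prometheus setups. If your cluster runs the Prometheus Operator instead, a PodMonitor is needed; the sketch below assumes the container port is given the name metrics in the Deployment, and the release label that matches your Prometheus' podMonitorSelector is a placeholder.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: tensorrt-llm
  namespace: llm-inference
  labels:
    release: kube-prometheus-stack   # hypothetical: must match your Prometheus' podMonitorSelector
spec:
  selector:
    matchLabels:
      app: tensorrt-llm
  podMetricsEndpoints:
  - port: metrics        # requires naming the containerPort "metrics" in the Deployment
    path: /metrics
```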
6. Summary and Outlook
This article walked through an HPA setup for TensorRT-LLM on Kubernetes: dual-metric triggering on GPU utilization and request queue length, stabilization-window tuning, and MIG-based resource partitioning combine into an elastic LLM inference service. Key takeaways:
- The end-to-end workflow for autoscaling TensorRT-LLM with the K8s HPA
- How to expose custom metrics through the Prometheus Adapter
- Resource-scheduling best practices for GPU-accelerated workloads
As TensorRT-LLM's support for Multi-Instance GPU (MIG) and time-slicing matures, and as Kubernetes autoscaling APIs continue to evolve, finer-grained resource control for LLM serving will become practical. Keep an eye on the official project documentation to pick up new features as they land.
Bookmark this article for the next time you configure TensorRT-LLM autoscaling, and feel free to ask questions in the comments!
Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.



