TensorRT-LLM Autoscaling: K8s HPA Configuration in Practice
Introduction: The Elasticity Pain Points of LLM Inference
Are you facing these challenges: your TensorRT-LLM service exhausts GPU resources at peak hours, yet wastes them during off-peak periods? The Kubernetes Horizontal Pod Autoscaler (HPA) can adjust the number of Pods dynamically, but in GPU-accelerated LLM scenarios, how do you configure the right metrics, avoid scaling flapping, and react to traffic spikes before latency degrades? This article uses three core configuration files, five tuning parameters, and a complete monitoring pipeline to help you build a production-grade elastic TensorRT-LLM deployment.
After reading this article you will be able to:
- Configure a dual-metric HPA driven by GPU utilization and request queue length
- Apply K8s deployment best practices for TensorRT-LLM services
- Tune the key parameters for scaling stability (stabilization windows, minimum replicas)
- Set up a Prometheus + Grafana monitoring dashboard
- Troubleshoot failures and tune performance
1. Environment Preparation and Prerequisites
1.1 Software Version Requirements
| Component | Minimum Version | Recommended Version | Notes |
|---|---|---|---|
| Kubernetes | 1.23 | 1.27+ | Must support the autoscaling/v2 HPA API |
| TensorRT-LLM | 0.5.0 | 0.8.0+ | Ships the trtllm-serve server |
| NVIDIA GPU Operator | 22.9.0 | 23.9.0+ | Provides GPU monitoring metrics |
| Prometheus | 2.30.0 | 2.45.0+ | Scrapes service and GPU metrics |
| Metrics Server | 0.5.0 | 0.6.3+ | Provides Pod resource metrics |
1.2 Cluster Environment Check
# Verify GPU node labels
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.present}{"\n"}{end}'
# Check that Metrics Server is running
kubectl get pods -n kube-system | grep metrics-server
# Verify the Prometheus endpoint
kubectl port-forward -n monitoring svc/prometheus-server 9090:80
2. TensorRT-LLM Service Deployment Basics
2.1 Deployment Architecture
(Architecture overview: client requests reach the TensorRT-LLM Pods through a ClusterIP Service; Prometheus scrapes GPU and queue metrics, and the HPA consumes them via the Prometheus Adapter to scale the Deployment.)
2.2 Base Deployment Configuration
Create trtllm-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorrt-llm-deploy
  namespace: llm-inference
spec:
  replicas: 3   # initial replica count
  selector:
    matchLabels:
      app: tensorrt-llm
  template:
    metadata:
      labels:
        app: tensorrt-llm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: tensorrt-llm
        image: nvcr.io/nvidia/tensorrt-llm:0.8.0-py3
        command: ["trtllm-serve", "--model", "/models/llama-7b", "--port", "8000"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per Pod
            cpu: "4"
            memory: "16Gi"
          requests:
            cpu: "2"
            memory: "8Gi"
        volumeMounts:
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
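The Deployment mounts a PersistentVolumeClaim named model-pvc that is not defined in this article. Below is a minimal sketch of what it could look like; the ReadOnlyMany access mode, the `nfs-models` storage class name, and the 100Gi size are assumptions to adjust for your cluster and model size.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: llm-inference
spec:
  accessModes:
  - ReadOnlyMany                 # all replicas mount the same model files read-only
  storageClassName: nfs-models   # hypothetical storage class backed by shared storage
  resources:
    requests:
      storage: 100Gi             # size for the engine/model files (assumption)
```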
2.3 Service and Ingress Configuration
Create trtllm-service.yaml (an Ingress sketch follows the Service):
apiVersion: v1
kind: Service
metadata:
  name: tensorrt-llm-service
  namespace: llm-inference
spec:
  selector:
    app: tensorrt-llm
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP
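The Service above only provides in-cluster access. For external traffic, an Ingress can front it; here is a minimal, hedged sketch that assumes the NGINX Ingress Controller is installed and uses the hypothetical host `llm.example.com`.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tensorrt-llm-ingress
  namespace: llm-inference
spec:
  ingressClassName: nginx          # assumes the NGINX Ingress Controller
  rules:
  - host: llm.example.com          # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tensorrt-llm-service
            port:
              number: 80
```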
3. Core HPA Configuration and Metric Monitoring
3.1 HPA Configuration (autoscaling/v2)
Create trtllm-hpa.yaml:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tensorrt-llm-hpa
  namespace: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tensorrt-llm-deploy
  minReplicas: 2    # minimum replicas, avoids cold-start latency
  maxReplicas: 10   # maximum replicas, caps resource consumption
  metrics:
  # GPU utilization is not a built-in Resource metric (HPA Resource metrics only
  # cover cpu and memory), so it is consumed as a per-Pod custom metric exposed
  # through the Prometheus Adapter (see section 3.2).
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"          # GPU utilization threshold (%)
  - type: Pods
    pods:
      metric:
        name: trtllm_queue_length
      target:
        type: AverageValue
        averageValue: "10"          # average inference queue length threshold
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # scale-up stabilization window
      policies:
      - type: Percent
        value: 50
        periodSeconds: 120              # add at most 50% more Pods per 2-minute period
    scaleDown:
      stabilizationWindowSeconds: 300   # scale-down stabilization window (longer to avoid flapping)
      policies:
      - type: Percent
        value: 30
        periodSeconds: 300              # remove at most 30% of Pods per 5-minute period
3.2 Custom Metric Collection (Prometheus Adapter)
Create prometheus-adapter-config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'trtllm_queue_length{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      # Keep the exposed metric name identical to the one referenced by the HPA.
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
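The HPA in section 3.1 also consumes a per-Pod nvidia_gpu_utilization metric, which the rule above does not expose. A hedged sketch of an additional rule for the same rules list follows; it assumes the GPU Operator's dcgm-exporter publishes DCGM_FI_DEV_GPU_UTIL with pod and namespace labels, so check the actual series and label names in your Prometheus before relying on it.

```yaml
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^DCGM_FI_DEV_GPU_UTIL$"
        as: "nvidia_gpu_utilization"
      metricsQuery: avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```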
3.3 Monitored Metrics
| Metric | Type | Description | Suggested Threshold |
|---|---|---|---|
| nvidia_gpu_utilization | Pods (custom) | GPU utilization (%) | 70% |
| trtllm_queue_length | Pods (custom) | Inference request queue length | 10 |
| trtllm_inference_latency | Pods (custom) | P95 inference latency (ms) | 500 ms |
| http_requests_total | Pods (custom) | Request rate (RPS, derived with rate()) | 100 |
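If latency should also drive scaling, the metrics list of the HPA in section 3.1 can be extended with a third entry. A hedged sketch, assuming trtllm_inference_latency is exposed through the Prometheus Adapter in milliseconds like the queue-length metric:

```yaml
  - type: Pods
    pods:
      metric:
        name: trtllm_inference_latency
      target:
        type: AverageValue
        averageValue: "500"   # scale up once the average reported P95 latency exceeds 500 ms
```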
4. Deployment and Verification
4.1 Deployment Steps
# Create the namespace
kubectl create namespace llm-inference
# Deploy the TensorRT-LLM service
kubectl apply -f trtllm-deployment.yaml -n llm-inference
kubectl apply -f trtllm-service.yaml -n llm-inference
# Deploy the HPA
kubectl apply -f trtllm-hpa.yaml -n llm-inference
# Check deployment status
kubectl get pods -n llm-inference
kubectl get hpa -n llm-inference
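As an alternative to applying the files one by one, they can be grouped with kustomize, which is built into kubectl. A minimal kustomization.yaml sketch, assuming the three manifests live in the current directory; apply it with `kubectl apply -k .`:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: llm-inference
resources:
- trtllm-deployment.yaml
- trtllm-service.yaml
- trtllm-hpa.yaml
```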
4.2 Load Testing and Scaling Verification
# Install the load-testing tool
pip install locust
# Create locustfile.py
cat > locustfile.py << EOF
from locust import HttpUser, task, between

class TRTLLMUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def inference(self):
        self.client.post("/v1/completions", json={
            "prompt": "What is TensorRT-LLM?",
            "max_tokens": 100
        })
EOF
# Start the load test (100 users, spawn rate of 10 users per second)
locust -f locustfile.py --headless -u 100 -r 10 --host=http://tensorrt-llm-service.llm-inference
4.3 Monitoring Dashboard (Grafana)
A minimal dashboard excerpt with two panels:
{
  "panels": [
    {
      "title": "GPU Utilization",
      "type": "graph",
      "targets": [
        {
          "expr": "avg(nvidia_gpu_utilization{pod=~\"tensorrt-llm-deploy-.*\"}) by (pod)",
          "legendFormat": "{{pod}}"
        }
      ]
    },
    {
      "title": "Pod Count",
      "type": "graph",
      "targets": [
        {
          "expr": "kube_deployment_status_replicas_available{deployment=\"tensorrt-llm-deploy\"}",
          "legendFormat": "available replicas"
        }
      ]
    }
  ]
}
5. Optimization and Best Practices
5.1 Scaling Stability Optimization
- Minimum replica count: size it from the lowest request rate the service must always absorb, so cold starts are avoided. Formula: min_replicas = ceil(baseline_qps / single_pod_qps)
- GPU sharing strategy: use MIG (Multi-Instance GPU) to partition GPUs and raise utilization:
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one MIG slice instead of a full GPU
- Warm-up mechanism: add an init container to the Deployment that pre-loads the model (a probe sketch for keeping un-warmed Pods out of rotation follows this list):
    initContainers:
    - name: model-warmup
      image: nvcr.io/nvidia/tensorrt-llm:0.8.0-py3
      command: ["python", "-c", "from tensorrt_llm.builder import Builder; Builder()"]
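Warm-up only helps if traffic is withheld until the engine has finished loading. Below is a hedged probe sketch for the container in section 2.2; it assumes trtllm-serve exposes a /health endpoint on port 8000, which you should verify against your TensorRT-LLM version.

```yaml
        startupProbe:               # tolerate a long model-loading phase (up to ~10 minutes here)
          httpGet:
            path: /health           # assumed health endpoint
            port: 8000
          failureThreshold: 60
          periodSeconds: 10
        readinessProbe:             # keep Pods out of the Service until they can serve requests
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 5
```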
5.2 Troubleshooting Guide
| Symptom | Likely Cause | Fix |
|---|---|---|
| HPA never scales up | metrics-server not running | kubectl rollout restart deployment metrics-server -n kube-system |
| Frequent scale up/down (flapping) | Stabilization window too short | Increase scaleDown.stabilizationWindowSeconds to 300 s or more |
| GPU metrics not collected | GPU Operator not installed | helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace |
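One more failure mode worth noting: the prometheus.io/* scrape annotations in section 2.2 are only honored by annotation-based Prometheus setups. If your cluster runs the Prometheus Operator instead, a PodMonitor is needed; the sketch below assumes the container port is given the name metrics in the Deployment, and the release label that matches your Prometheus' podMonitorSelector is a placeholder.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: tensorrt-llm
  namespace: llm-inference
  labels:
    release: kube-prometheus-stack   # hypothetical: must match your Prometheus' podMonitorSelector
spec:
  selector:
    matchLabels:
      app: tensorrt-llm
  podMetricsEndpoints:
  - port: metrics        # requires naming the containerPort "metrics" in the Deployment
    path: /metrics
```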
6. Summary and Outlook
This article walked through an HPA setup for TensorRT-LLM on Kubernetes: dual-metric triggering on GPU utilization and request queue length, stabilization-window tuning, and MIG-based resource partitioning combine into an elastic LLM inference service. Key takeaways:
- The end-to-end workflow for autoscaling TensorRT-LLM with the K8s HPA
- How to expose custom metrics through the Prometheus Adapter
- Resource-scheduling best practices for GPU-accelerated workloads
As TensorRT-LLM's support for Multi-Instance GPU (MIG) and time-slicing matures, and as Kubernetes autoscaling APIs continue to evolve, finer-grained resource control for LLM serving will become practical. Keep an eye on the official project documentation to pick up new features as they land.
Bookmark this article for the next time you configure TensorRT-LLM autoscaling, and feel free to ask questions in the comments!
Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.



