Building a GPU Containerized Monitoring Platform: A Hands-On Guide to Grafana Dashboard Configuration

Project repository: nvidia-docker (Build and run Docker containers leveraging NVIDIA GPUs): https://gitcode.com/gh_mirrors/nv/nvidia-docker

Introduction: Pain Points and Solutions in GPU Container Monitoring

In AI training and inference scenarios, do you often run into problems like these: GPU utilization swings wildly but the bottleneck cannot be located; GPU processes inside containers exit abnormally and are hard to trace; uneven resource allocation causes conflicts when multiple tenants share GPUs? Through a hands-on case study, this article explains how to build a GPU container monitoring platform on top of the NVIDIA Container Toolkit, providing end-to-end visibility from the physical GPUs down to container processes.

After reading this article you will know how to:

  • Design a GPU metrics collection scheme for containerized environments
  • Deploy a Prometheus + node-exporter + dcgm-exporter architecture
  • Configure custom Grafana dashboards and visualize key metrics
  • Set up multi-dimensional GPU performance alerting rules

Technical Architecture and Environment Preparation

Overall Monitoring System Architecture

(Architecture diagram omitted: dcgm-exporter, node-exporter, and cAdvisor expose metrics that Prometheus scrapes; Grafana queries Prometheus for dashboards and alerting.)

Hardware and Software Requirements

Component | Version requirement | Purpose
NVIDIA GPU | Kepler architecture or later | Provides hardware acceleration
NVIDIA driver | ≥450.80.02 | Supports containerized GPU scheduling
Docker Engine | ≥19.03 | Container runtime environment
NVIDIA Container Toolkit | ≥1.17.0 | GPU resource isolation for containers
Prometheus | ≥2.30.0 | Time-series storage and querying
Grafana | ≥8.0.0 | Data visualization and alerting
dcgm-exporter | ≥2.0.0 | GPU metrics collection
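
A quick way to confirm the host already meets the driver and Docker requirements (the toolkit itself is installed in the next step):

# Driver version reported by the NVIDIA driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Docker Engine version
docker --version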

Environment Deployment Steps

1. Install the NVIDIA Container Toolkit
# Configure the APT repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit packages
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get update
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

# Configure the Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
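
After the restart, the nvidia runtime is registered in Docker's daemon configuration. For reference, /etc/docker/daemon.json typically ends up containing an entry similar to the sketch below (the exact runtime path may differ on your system):

{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}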
2. Verify the GPU container runtime environment
# Test basic GPU functionality
docker run --rm --runtime=nvidia nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

# The expected output shows the GPU model, driver version, and CUDA version reported by nvidia-smi

Monitoring Component Deployment and Configuration

1. Deploy DCGM Exporter

# docker-compose.yml
version: '3'
services:
  dcgm-exporter:
    image: nvidia/dcgm-exporter:2.4.6
    command: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
    volumes:
      - ./dcp-metrics-included.csv:/etc/dcgm-exporter/dcp-metrics-included.csv
    ports:
      - "9400:9400"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

Key metrics configuration file (dcp-metrics-included.csv):

# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_FB_USED,        gauge,   Framebuffer memory used (in MiB)
DCGM_FI_DEV_FB_FREE,        gauge,   Framebuffer memory free (in MiB)
DCGM_FI_DEV_GPU_UTIL,       gauge,   GPU utilization (in %)
DCGM_FI_DEV_MEM_COPY_UTIL,  gauge,   Memory utilization (in %)
DCGM_FI_DEV_GPU_TEMP,       gauge,   GPU temperature (in C)
DCGM_FI_PROF_PCIE_RX_BYTES, counter, PCIe receive throughput (bytes)
DCGM_FI_PROF_PCIE_TX_BYTES, counter, PCIe transmit throughput (bytes)
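
Once the container is up, you can confirm that metrics are being exposed on port 9400 (the metric name below comes from the configuration file above):

curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL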
2. Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'dcgm'
    static_configs:
      - targets: ['dcgm-exporter:9400']
  
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'docker'
    static_configs:
      - targets: ['cadvisor:8080']
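
The node and docker jobs above assume node-exporter and cAdvisor are reachable under those service names, i.e. running on the same Docker network as Prometheus. If they are not deployed yet, they can be added to the docker-compose.yml from the previous step; a minimal sketch (image tags are assumptions, adjust them to your environment):

# Additional entries under the existing `services:` key
  node-exporter:
    image: prom/node-exporter:v1.3.1
    ports:
      - "9100:9100"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro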
3. Deploy the Grafana Container
docker run -d -p 3000:3000 \
  --name grafana \
  -v grafana-data:/var/lib/grafana \
  -v ./grafana-provisioning:/etc/grafana/provisioning \
  grafana/grafana:8.2.2

Grafana Dashboard Design and Configuration

Data Source Configuration

  1. Log in to the Grafana console (default credentials admin/admin)
  2. Add a Prometheus data source:
    • Name: Prometheus-GPU
    • URL: http://prometheus:9090
    • Scrape interval: 15s
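
As an alternative to the manual steps above, the same data source can be provisioned declaratively through the grafana-provisioning directory mounted in the previous step; a minimal sketch (the file name is arbitrary), e.g. grafana-provisioning/datasources/prometheus-gpu.yaml:

apiVersion: 1
datasources:
  - name: Prometheus-GPU
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true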

Custom Dashboard Design

1. Dashboard JSON Structure
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "iteration": 1622568445754,
  "links": [],
  "panels": [],
  "refresh": "10s",
  "schemaVersion": 27,
  "style": "dark",
  "tags": ["gpu", "container"],
  "templating": {
    "list": [
      {
        "allValue": null,
        "current": {
          "selected": false,
          "text": "All",
          "value": "$__all"
        },
        "datasource": "Prometheus-GPU",
        "definition": "label_values(dcgm_gpu_temp{gpu=~\"$gpu\"}, gpu)",
        "description": null,
        "error": null,
        "hide": 0,
        "includeAll": true,
        "label": "GPU",
        "multi": false,
        "name": "gpu",
        "options": [],
        "query": {
          "query": "label_values(dcgm_gpu_temp, gpu)",
          "refId": "StandardVariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "skipUrlSync": false,
        "sort": 1,
        "tagValuesQuery": "",
        "tags": [],
        "tagsQuery": "",
        "type": "query",
        "useTags": false
      }
    ]
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d"]
  },
  "timezone": "",
  "title": "GPU Container Monitoring",
  "uid": "gpu-container-monitor",
  "version": 1
}
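
To have Grafana load this dashboard automatically at startup, a dashboard provider can be dropped into the mounted provisioning directory, with the JSON above saved alongside it; a minimal sketch (provider name and path are assumptions), e.g. grafana-provisioning/dashboards/provider.yaml:

apiVersion: 1
providers:
  - name: 'gpu-dashboards'
    orgId: 1
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/provisioning/dashboards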
2. Key Metric Visualization Panels
GPU Utilization Panel
{
  "collapsed": false,
  "datasource": null,
  "gridPos": {
    "h": 1,
    "w": 24,
    "x": 0,
    "y": 0
  },
  "id": 2,
  "panels": [],
  "title": "GPU Utilization"
}

Query expression (DCGM_FI_DEV_GPU_UTIL is already a 0-100% gauge, so no rate() or scaling is needed):

avg by (gpu) (DCGM_FI_DEV_GPU_UTIL{gpu=~"$gpu"})
Memory Usage Trend Graph
{
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "datasource": "Prometheus-GPU",
  "fieldConfig": {
    "defaults": {
      "links": []
    },
    "overrides": []
  },
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 8,
    "w": 12,
    "x": 0,
    "y": 1
  },
  "hiddenSeries": false,
  "id": 4,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "alertThreshold": true
  },
  "percentage": false,
  "pluginVersion": "8.2.2",
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "dcgm_fb_used{gpu=~\"$gpu\"}",
      "interval": "",
      "legendFormat": "Used",
      "refId": "A"
    },
    {
      "expr": "dcgm_fb_free{gpu=~\"$gpu\"}",
      "interval": "",
      "legendFormat": "Free",
      "refId": "B"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "GPU Memory Usage (MiB)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "decbytes",
      "label": "Memory (MiB)",
      "logBase": 1,
      "max": null,
      "min": "0",
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}

Container-Level GPU Monitoring Panel

{
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "datasource": "Prometheus-GPU",
  "fill": 1,
  "fillGradient": 0,
  "gridPos": {
    "h": 8,
    "w": 12,
    "x": 12,
    "y": 1
  },
  "hiddenSeries": false,
  "id": 6,
  "legend": {
    "avg": false,
    "current": false,
    "max": false,
    "min": false,
    "show": true,
    "total": false,
    "values": false
  },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": {
    "alertThreshold": true
  },
  "percentage": false,
  "pluginVersion": "8.2.2",
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "sum by (container_name) (rate(container_gpu_usage_seconds_total{container_name!=\"\"}[5m])) * 100",
      "interval": "",
      "legendFormat": "{{container_name}}",
      "refId": "A"
    }
  ],
  "thresholds": [],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Container GPU Utilization (%)",
  "tooltip": {
    "shared": true,
    "sort": 0,
    "value_type": "individual"
  },
  "type": "graph",
  "xaxis": {
    "buckets": null,
    "mode": "time",
    "name": null,
    "show": true,
    "values": []
  },
  "yaxes": [
    {
      "format": "percentunit",
      "label": "Utilization (%)",
      "logBase": 1,
      "max": "100",
      "min": "0",
      "show": true
    },
    {
      "format": "short",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": null,
      "show": true
    }
  ],
  "yaxis": {
    "align": false,
    "alignLevel": null
  }
}

Alerting Rule Configuration

Key Metric Alert Thresholds

Metric | Alert threshold | Severity | Suggested action
GPU utilization | >90% for 5 minutes | Warning | Check for resource contention
GPU memory usage | >95% for 3 minutes | Critical | Optimize the model or add GPU resources
GPU temperature | >85°C for 2 minutes | Warning | Check the cooling system
PCIe bandwidth | >90% for 10 minutes | Info | Consider optimizing data transfers

Prometheus Alerting Rule Configuration

groups:
- name: gpu_alerts
  rules:
  - alert: HighGpuUtilization
    expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High GPU utilization detected"
      description: "GPU {{ $labels.gpu }} has utilization above 90% for 5 minutes (current value: {{ $value }})"
      
  - alert: HighMemoryUsage
    expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "High GPU memory usage detected"
      description: "GPU {{ $labels.gpu }} memory usage is above 95% (current value: {{ $value }})"

Advanced Features and Best Practices

Extending Monitoring to Multiple GPU Nodes

(Multi-node monitoring topology diagram omitted)
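
When several GPU nodes each run their own dcgm-exporter, a single Prometheus job can scrape all of them; a minimal sketch (host names and the cluster label are placeholders):

scrape_configs:
  - job_name: 'dcgm-multi-node'
    static_configs:
      - targets:
          - 'gpu-node-01:9400'
          - 'gpu-node-02:9400'
          - 'gpu-node-03:9400'
        labels:
          cluster: 'gpu-cluster-a'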

Performance Optimization Recommendations

  1. Metric collection optimization

    • Use longer collection intervals for non-critical metrics (e.g. sample temperature every 60s)
    • Use relabel_configs to drop unneeded labels
    • Configure a storage policy (e.g. retain 30 days of data, downsample every 2 hours)
  2. Query performance optimization (see the recording-rule sketch after this list)

    • Create recording rules for frequently used queries
    • Use cached query results in dashboards
    • Avoid aggregation queries over very wide time ranges
  3. High-availability configuration

    • Run Prometheus in a federated cluster architecture
    • Back Grafana with a persisted database
    • Run the monitoring components under a container orchestrator for automatic recovery
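
For example, the per-GPU utilization average used throughout the dashboards and alerts can be precomputed with a recording rule; a sketch (the rule name is arbitrary, the metric name matches the dcgm-exporter configuration above):

groups:
- name: gpu_recording_rules
  rules:
  - record: gpu:utilization:avg
    expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)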

Summary and Outlook

This article has walked through a plan for building a GPU container monitoring platform based on the NVIDIA Container Toolkit, using the Prometheus ecosystem to collect and visualize metrics end to end, from physical GPUs to container processes. Key outcomes include:

  1. A complete GPU container monitoring stack that solves the problem of collecting GPU metrics in containerized environments
  2. Multi-dimensional monitoring dashboards that visualize performance from the hardware up to the application
  3. An extensible alerting system that detects and handles GPU resource anomalies promptly

Looking ahead, the monitoring platform can evolve in the following directions:

  • Use AI models to predict GPU performance anomalies
  • Build a recommendation system for optimizing container GPU scheduling
  • Build a cross-cluster GPU resource scheduling visualization platform

With this monitoring solution, operations teams can track GPU resource usage in real time and developers can pinpoint performance bottlenecks precisely, providing a solid foundation for running AI applications reliably.


Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.
