Docker Container Health-Check Log: Tracking State Changes in stable-diffusion-webui-docker

Project: stable-diffusion-webui-docker (Easy Docker setup for Stable Diffusion with user-friendly UI). Repository: https://gitcode.com/gh_mirrors/st/stable-diffusion-webui-docker

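Overview of the Container Monitoring Setup

Has your Stable Diffusion service ever gone down without leaving a clue as to why? This article walks through building a container health monitoring system from the Docker tooling ecosystem, so that the state of a stable-diffusion-webui-docker deployment can be tracked across its whole lifecycle. It covers:

  • How to collect three core classes of container state metrics
  • How to set up automatic alerting on service anomalies
  • Log patterns for diagnosing common failures
  • A Docker Compose configuration for the monitoring stack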

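Basic Monitoring Architecture

Multi-Dimensional Monitoring Matrix

| Dimension | Key metrics | Collection tool | Alert threshold |
| --- | --- | --- | --- |
| Container lifecycle | Restart count / exit code / uptime | Docker API | More than 3 restarts within 10 minutes |
| Resource usage | GPU memory / CPU load / memory leaks | nvidia-smi / Prometheus | GPU memory usage above 90% for 5 minutes |
| Application health | HTTP status code / inference latency / queue length | curl / custom probe | More than 5 5xx errors per minute |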
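
The two count-based thresholds in the matrix (restarts within 10 minutes, 5xx errors per minute) can be read as sliding-window checks. The following is a minimal Python sketch of that idea. The class name `SlidingWindowCounter` and its methods are illustrative assumptions, not part of Docker or of the project.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events that fall inside a trailing time window (in seconds)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # event timestamps, oldest first

    def record(self, timestamp):
        """Register one event (for example a container restart)."""
        self._events.append(timestamp)

    def count(self, now):
        """Drop events that left the window and return how many remain."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)

    def breached(self, now):
        """True when the count strictly exceeds the threshold."""
        return self.count(now) > self.threshold


# Thresholds from the matrix: >3 restarts in 10 minutes, >5 5xx errors per minute.
restarts = SlidingWindowCounter(window_seconds=600, threshold=3)
server_errors = SlidingWindowCounter(window_seconds=60, threshold=5)
```

Timestamps are plain seconds here, so the same counter works with `time.time()` values or with timestamps parsed from logs.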

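Monitoring Component Deployment Architecture

[Mermaid diagram: monitoring component deployment architecture; the diagram source is not included here]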

Collecting Container State

Docker Compose Monitoring Configuration

Add the monitoring services on top of the original docker-compose.yml:

version: '3.8'

services:
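  # Existing Stable Diffusion service configuration
  auto: &automatic
    <<: *base_service
    profiles: ["auto"]
    build: ./services/AUTOMATIC1111
    image: sd-auto:78
    healthcheck:
      # Probes an API endpoint that is available when --api is set (see CLI_ARGS below).
      # Assumes curl is installed in the image; adjust the probe if it is not.
      test: ["CMD", "curl", "-fsS", "http://localhost:7860/sdapi/v1/options"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s  # grace period during which failed checks do not count (model loading time)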
    environment:
      - CLI_ARGS=--allow-code --medvram --xformers --enable-insecure-extension-access --api

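  # New monitoring services
  prometheus:
    image: prom/prometheus:v2.45.0
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/alert_rules.yml:/etc/prometheus/alert_rules.yml  # rule file referenced by rule_files in prometheus.yml
      - prometheus-data:/prometheus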
    ports:
      - "9090:9090"
    command: --config.file=/etc/prometheus/prometheus.yml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.1.0
    volumes:
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # placeholder; change it before exposing Grafana
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
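
To see how `retries` and `start_period` in the healthcheck interact, here is a simplified Python model of how Docker's health status changes over a series of probes. It is a sketch of the documented behaviour under simplifying assumptions, not Docker's implementation. The function name `health_status` and its arguments are hypothetical.

```python
def health_status(probes, start_period=300.0, retries=3):
    """Return the final health status after a series of probes.

    probes: list of (seconds_since_container_start, passed) in time order.
    """
    status = "starting"
    consecutive_failures = 0
    started = False  # becomes True after the first successful probe

    for elapsed, passed in probes:
        if passed:
            status = "healthy"
            consecutive_failures = 0
            started = True
            continue
        # Failures inside the start period are ignored until a probe has passed.
        if not started and elapsed < start_period:
            continue
        consecutive_failures += 1
        if consecutive_failures >= retries:
            status = "unhealthy"
    return status
```

With the values used above, a container that is still loading a model for the first 300 seconds stays in `starting` rather than being marked `unhealthy`, which is the point of the long `start_period`.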

Custom Health-Check Script

Create ./monitoring/healthcheck.sh to perform application-level health checks:

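#!/bin/bash
set -e

BASE_URL="http://localhost:7860"

# 1. Basic network connectivity check (-f makes curl fail on HTTP errors)
if ! curl -fsS --max-time 10 -o /dev/null "$BASE_URL"; then
    echo "HTTP connectivity failed"
    exit 1
fi

# 2. API validation: the options endpoint should return a JSON object.
#    (jq -e '.[]' would fail whenever the last option value is false or null.)
API_RESPONSE=$(curl -s --max-time 10 "$BASE_URL/sdapi/v1/options" || true)
if ! echo "$API_RESPONSE" | jq -e 'type == "object"' > /dev/null 2>&1; then
    echo "API endpoint invalid response"
    exit 1
fi

# 3. Inference test (lightweight request; it still uses the GPU, so run it sparingly)
INFERENCE_TEST=$(curl -s --max-time 60 -X POST "$BASE_URL/sdapi/v1/txt2img" \
  -H "Content-Type: application/json" \
  -d '{"prompt":"test", "steps":1, "batch_size":1, "width":64, "height":64}' || true)

# A successful response is expected to carry a non-empty "images" array
if ! echo "$INFERENCE_TEST" | jq -e '.images | length > 0' > /dev/null 2>&1; then
    echo "Inference failed: $INFERENCE_TEST"
    exit 1
fi

exit 0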
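
The same three layers can also be expressed as a pure function over responses that have already been fetched, which makes the decision logic easy to unit-test. The following Python sketch assumes that the options endpoint returns a JSON object and that a successful txt2img response carries a non-empty `images` list. The names `evaluate_health` and `_load_object` are illustrative.

```python
import json
from typing import Optional, Tuple


def _load_object(body: str) -> Optional[dict]:
    """Parse body as JSON and return it only if it is a JSON object."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def evaluate_health(http_ok: bool, options_body: str, txt2img_body: str) -> Tuple[bool, str]:
    """Apply the three checks in order and report the first failure."""
    if not http_ok:
        return False, "HTTP connectivity failed"
    if _load_object(options_body) is None:
        return False, "options endpoint did not return a JSON object"
    result = _load_object(txt2img_body)
    if result is None or not result.get("images"):
        return False, "inference returned no images"
    return True, "healthy"
```

Checking for the presence of images is stricter than searching the response for the word "error", which can miss failures or match unrelated text.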

Log Collection and Analysis

Multi-Level Log Collection Configuration

Add a logging driver configuration to docker-compose.yml:

services:
  auto:
    <<: *base_service
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
        tag: "{{.Name}}/{{.ID}}"
    environment:
      - LOG_LEVEL=DEBUG  # application log level
      - TRANSFORMERS_VERBOSITY=warning  # reduce log noise from third-party libraries
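
With the json-file driver, Docker stores each log line as one JSON object with the keys `log`, `stream`, and `time`. A minimal Python sketch for reading such lines might look like the following. The function names are illustrative, and the error handling is deliberately sparse.

```python
import json
from typing import Iterable, Iterator


def parse_log_line(line: str) -> dict:
    """Turn one json-file log line into a small record."""
    entry = json.loads(line)
    return {
        "time": entry["time"],                    # RFC 3339 timestamp, kept as a string
        "stream": entry["stream"],                # "stdout" or "stderr"
        "message": entry["log"].rstrip("\n"),     # the original output line
    }


def stderr_messages(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the messages that the container wrote to stderr."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_log_line(line)
        if record["stream"] == "stderr":
            yield record["message"]
```

In practice `docker logs` is usually the simpler way to read the output. Parsing the files directly is mainly useful for offline analysis of rotated logs.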

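Key Log Pattern Recognition

Create a log analysis rules file, ./monitoring/log_patterns.yml. This is a custom format rather than something Docker or Prometheus reads natively, so it needs a log-processing script to interpret it:

patterns:
  - name: "GPU out of memory"
    regex: "CUDA out of memory|CUBLAS_STATUS_ALLOC_FAILED"
    severity: "CRITICAL"
    action: "restart-container"

  - name: "Model load failure"
    regex: "Failed to load model|model not found in"
    severity: "ERROR"
    action: "notify-admin"

  - name: "Extension conflict"
    regex: "Extension conflict|module has no attribute"
    severity: "WARNING"
    action: "disable-extension"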
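
As a rough idea of how such rules could be applied, the Python sketch below hard-codes the three regexes from the rules file and returns the first match for a log line. It is an illustration only: the `classify` function is hypothetical, and a real processor would load the YAML file and carry out the listed actions.

```python
import re

# (name, compiled regex, severity, action), mirroring the rules file above.
PATTERNS = [
    ("GPU out of memory",
     re.compile(r"CUDA out of memory|CUBLAS_STATUS_ALLOC_FAILED"),
     "CRITICAL", "restart-container"),
    ("Model load failure",
     re.compile(r"Failed to load model|model not found in"),
     "ERROR", "notify-admin"),
    ("Extension conflict",
     re.compile(r"Extension conflict|module has no attribute"),
     "WARNING", "disable-extension"),
]


def classify(line):
    """Return the first matching rule for a log line, or None."""
    for name, regex, severity, action in PATTERNS:
        if regex.search(line):
            return {"name": name, "severity": severity, "action": action}
    return None
```

Rules are checked in order, so the most severe patterns should come first if a line could match more than one.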

Alerting System

Prometheus Alert Rule Configuration

Add the following to ./monitoring/prometheus.yml:

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

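Create ./monitoring/alert_rules.yml. A relative rule_files path is resolved against the directory of prometheus.yml, so the file must be available next to it inside the Prometheus container: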

groups:
- name: sd_webui_alerts
  rules:
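  # Metric names depend on the exporters in use; the names below assume a GPU
  # exporter and a container exporter that publish them under these names.
  - alert: HighGpuMemoryUsage
    expr: avg_over_time(nvidia_gpu_memory_used_bytes[5m]) / nvidia_gpu_memory_total_bytes > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU memory usage is too high"
      description: "GPU memory usage has stayed above 90% for 5 minutes (current value: {{ $value | humanizePercentage }})"

  - alert: ContainerRestarted
    expr: changes(container_restarts_total{name=~"auto|comfy"}[10m]) > 3
    labels:
      severity: warning
    annotations:
      summary: "Container is restarting frequently"
      description: "Container {{ $labels.name }} restarted {{ $value }} times within 10 minutes"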
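
The `for: 5m` clause means the expression has to stay true across consecutive rule evaluations for five minutes before the alert fires. A rule without `for` fires on the first true evaluation. The Python sketch below models that pending-to-firing transition in a simplified way. `first_firing_time` is a hypothetical helper, not a Prometheus API.

```python
def first_firing_time(evaluations, hold_seconds):
    """Return the evaluation time at which the alert first fires, or None.

    evaluations: list of (eval_time_seconds, condition_is_true) in time order.
    hold_seconds: the rule's `for` duration (0 when the rule has no `for`).
    """
    pending_since = None
    for eval_time, condition in evaluations:
        if not condition:
            pending_since = None  # a single false evaluation resets the pending state
            continue
        if pending_since is None:
            pending_since = eval_time
        if eval_time - pending_since >= hold_seconds:
            return eval_time
    return None
```

Because the GPU rule also averages over a 5-minute window, a sudden spike typically takes longer than five minutes to fire, which may or may not be the intended behaviour.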

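Multi-Channel Alert Configuration

Deploy Alertmanager and configure ./monitoring/alertmanager.yml. The example below sets up an email receiver; other channels are added as further receivers: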

route:
  receiver: 'email_notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
- name: 'email_notifications'
  email_configs:
  - to: 'admin@example.com'
    from: 'monitor@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'monitor@example.com'
    auth_password: 'your-app-password'
    send_resolved: true

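State Change Recording and Analysis

Typical Fault Diagnosis Workflow

[Mermaid diagram: typical fault diagnosis workflow; the diagram source is not included here]

State Change Record Template

Create the state recording script ./monitoring/state_tracker.sh: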

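#!/bin/bash
# Records key state once per run; schedule it every 5 minutes (for example with cron)

TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")
LOG_FILE="./monitoring/state_history.log"

# Record container state. RestartCount is a top-level field, not part of .State.
# "auto" and "comfy" are assumed container names; Compose usually generates
# longer names, so check `docker ps` and adjust.
docker inspect -f "$TIMESTAMP {{.Name}} {{.State.Status}} {{.RestartCount}} {{.State.StartedAt}}" auto comfy >> "$LOG_FILE"

# Record GPU usage (noheader keeps the CSV header from repeating on every run)
nvidia-smi --query-gpu=timestamp,name,memory.used,memory.total,utilization.gpu --format=csv,noheader >> ./monitoring/gpu_history.csv

# Record API response time, probing an endpoint that is available when --api is enabled
RESPONSE_TIME=$(curl -o /dev/null -s -w "%{time_total}" "http://localhost:7860/sdapi/v1/options")
echo "$TIMESTAMP,$RESPONSE_TIME" >> ./monitoring/api_latency.csv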
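
Once api_latency.csv has accumulated some rows, a percentile is usually more informative than an average. The Python sketch below computes a nearest-rank percentile from CSV text in the `timestamp,seconds` layout written by the script. `latency_percentile` is an illustrative name, and rows whose second column is not a number are skipped.

```python
import csv
import io


def latency_percentile(csv_text, percentile=95):
    """Nearest-rank percentile of the latency column, or None if there is no data."""
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        try:
            values.append(float(row[1]))
        except ValueError:
            continue  # e.g. an empty value recorded when curl failed
    if not values:
        return None
    values.sort()
    # Nearest-rank method: smallest value with at least `percentile`% of data at or below it.
    rank = max((percentile * len(values) + 99) // 100, 1)
    return values[rank - 1]
```

To use it on the recorded file, read the file contents and pass them in, for example `latency_percentile(open("./monitoring/api_latency.csv").read())`.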

Advanced Monitoring Visualization

Grafana Dashboard Configuration

  1. Import dashboard ID 1860 (Node Exporter Full)
  2. Import dashboard ID 14282 (Docker Monitoring)
  3. Create a custom SD WebUI dashboard:
{
  "panels": [
    {
      "title": "GPU memory usage",
      "type": "graph",
      "targets": [
        {
          "expr": "nvidia_gpu_memory_used_bytes{job='sd-webui-services'}",
          "legendFormat": "{{ instance }}",
          "interval": ""
        }
      ],
      "yaxes": [{"format": "bytes"}]
    },
    {
      "title": "API response time",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
          "legendFormat": "P95 latency",
          "interval": ""
        }
      ]
    }
  ]
}

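Deployment and Maintenance Best Practices

Monitoring System Deployment Checklist

  1. Deploy the base monitoring components. The command assumes the monitoring services are kept in a separate override file named docker-compose.monitoring.yml:

    docker-compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d

  2. Configure automatic startup:

    # Create the systemd service file
    sudo nano /etc/systemd/system/sd-monitor.service


    Service file contents: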

    [Unit]
    Description=Stable Diffusion Monitoring Stack
    Requires=docker.service
    After=docker.service
    
    [Service]
    WorkingDirectory=/data/web/disk1/git_repo/gh_mirrors/st/stable-diffusion-webui-docker
    ExecStart=/usr/local/bin/docker-compose -f docker-compose.yml -f docker-compose.monitoring.yml up
    ExecStop=/usr/local/bin/docker-compose -f docker-compose.yml -f docker-compose.monitoring.yml down
    Restart=always
    
    [Install]
    WantedBy=multi-user.target
    
  3. Enable and start the service:

    sudo systemctl daemon-reload
    sudo systemctl enable sd-monitor
    sudo systemctl start sd-monitor
    

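Monitoring System Maintenance Plan

| Maintenance task | Frequency | Steps |
| --- | --- | --- |
| Log rotation configuration | One-time | Set up logrotate and keep 30 days of logs |
| Monitoring data cleanup | Weekly | Delete Prometheus data older than 90 days |
| Dashboard template updates | Monthly | Import the latest community dashboard templates |
| Alert rule tuning | Quarterly | Adjust thresholds based on observed failure patterns |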

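Summary and Next Steps

The monitoring system built in this article covers the stable-diffusion-webui-docker project end to end: container lifecycle tracking, resource usage monitoring, application health checks, and alerting on anomalies. Possible next steps:

  1. AI-assisted diagnosis: train a log anomaly detection model to spot early signs of failure
  2. Automatic recovery: use state analysis to self-heal (for example automatic restarts or resource adjustments)
  3. Distributed tracing: integrate OpenTelemetry to trace call chains across services
  4. Cost optimization: adjust resource allocation dynamically based on usage patterns

Continued refinement of the monitoring setup helps push Stable Diffusion service availability toward targets such as 99.9%, giving AI content creation a stable and reliable infrastructure base.

Bookmark this article and follow the project for the latest monitoring configuration files. Coming next: "Dynamic GPU Resource Scheduling: A Multi-Instance Optimization Scheme for Stable Diffusion"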

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
