Docker Swarm Deployment: A Container Orchestration Cluster for stable-diffusion-webui-docker

[Free download] stable-diffusion-webui-docker — Easy Docker setup for Stable Diffusion with user-friendly UI. Project: https://gitcode.com/gh_mirrors/st/stable-diffusion-webui-docker

1. Pain Points and Solution

Are you facing any of these challenges: a single-node Stable Diffusion deployment short on compute, stuttering under concurrent multi-user access, or no guarantee of service availability? This article uses Docker Swarm to build a container orchestration cluster for stable-diffusion-webui-docker, delivering an enterprise-grade, highly available AI image-generation setup.

By the end of this article you will know how to:

  • Set up a Docker Swarm cluster
  • Schedule GPU resources and load-balance across nodes
  • Configure high availability with automatic failure recovery
  • Deploy multiple UI variants (AUTOMATIC1111/ComfyUI) side by side
  • Persist data and share models across nodes

2. Environment Preparation and Architecture Design

2.1 Hardware and Software Requirements

| Component     | Minimum                | Recommended                        |
|---------------|------------------------|------------------------------------|
| Nodes         | 1 manager + 1 worker   | 1 manager + 2+ workers             |
| CPU           | 8 cores                | 16+ cores                          |
| RAM           | 16GB                   | 32GB+                              |
| GPU           | single card, 8GB VRAM  | multiple cards, 16GB+ VRAM (NVIDIA)|
| Storage       | 100GB SSD              | 500GB NVMe                         |
| OS            | Ubuntu 20.04+          | Ubuntu 22.04 LTS                   |
| Docker        | 20.10+                 | 24.0.5+                            |
| NVIDIA driver | 470.57+                | 535.104.05+                        |

2.2 Cluster Architecture Diagram

(Mermaid architecture diagram not preserved in this export.)

3. Docker Swarm Cluster Deployment

3.1 Initialize the Swarm Cluster

# Initialize the cluster on the manager node
docker swarm init --advertise-addr 192.168.1.100

# Print the join command (to be run on worker nodes)
docker swarm join-token worker

# Install the NVIDIA Container Toolkit (on all nodes)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

3.2 Node Labels and Resource Constraints

To schedule GPU workloads precisely, add labels to the nodes:

# Mark GPU nodes (adjust to your environment)
docker node update --label-add gpu=true worker1
docker node update --label-add gpu=true worker2
docker node update --label-add gpu=false worker3

# Mark which UI variant each node serves
docker node update --label-add comfyui=true worker1
docker node update --label-add auto1111=true worker2
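As the node count grows, typing these commands by hand gets error-prone, so they can be generated from a single mapping. The following is a hypothetical convenience script, not part of the project:

```python
# Hypothetical helper: build "docker node update" label commands from a
# node -> labels mapping. Print them and run on the manager node.

def label_commands(node_labels):
    """node_labels: {"worker1": {"gpu": "true", "comfyui": "true"}, ...}"""
    cmds = []
    for node, labels in sorted(node_labels.items()):
        flags = " ".join(
            f"--label-add {key}={value}" for key, value in sorted(labels.items())
        )
        cmds.append(f"docker node update {flags} {node}")
    return cmds

if __name__ == "__main__":
    plan = {
        "worker1": {"gpu": "true", "comfyui": "true"},
        "worker2": {"gpu": "true", "auto1111": "true"},
        "worker3": {"gpu": "false"},
    }
    for cmd in label_commands(plan):
        print(cmd)
```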

4. Container Orchestration Configuration

4.1 Docker Compose Configuration for Swarm

Create a docker-compose-swarm.yml file that orchestrates the services:

version: '3.8'

x-base_service: &base_service
  ports:
    - "7860:7860"
  volumes:
    - nfs_data:/data
    - nfs_output:/output
  stop_signal: SIGKILL
  deploy:
    replicas: 2
    placement:
      constraints: [node.labels.gpu == true]
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']  # optionally pin specific GPU device IDs
            capabilities: [compute, utility]
    restart_policy:
      condition: on-failure
      delay: 10s
      max_attempts: 3
      window: 60s

services:
  download:
    build: ./services/download/
    volumes:
      - nfs_data:/data
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]

  auto1111:
    <<: *base_service
    build: ./services/AUTOMATIC1111
    image: sd-auto:78
    environment:
      - CLI_ARGS=--allow-code --medvram --xformers --enable-insecure-extension-access --api
    deploy:
      replicas: 2
      placement:
        constraints: [node.labels.auto1111 == true]
      update_config:
        parallelism: 1
        delay: 30s
        failure_action: rollback

  comfyui:
    <<: *base_service
    build: ./services/comfy/
    image: sd-comfy:7
    ports:
      - "7861:7860"
    environment:
      - CLI_ARGS=
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.comfyui == true]

  auto1111-cpu:
    <<: *base_service
    build: ./services/AUTOMATIC1111
    ports:
      - "7862:7860"  # override: 7860 is already published by auto1111
    environment:
      - CLI_ARGS=--no-half --precision full --allow-code --enable-insecure-extension-access --api
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.gpu == false]
      resources:
        reservations:
          devices: []  # no GPU reservation

volumes:
  nfs_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.10,rw
      device: ":/nfs/share/data"
  nfs_output:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.10,rw
      device: ":/nfs/share/output"
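One subtlety in this file: the `<<: *base_service` merge key performs a shallow merge, so any key a service redefines (such as `ports` or `deploy`) replaces the anchor's value wholesale rather than being merged field by field — a service that restates `deploy` must also restate the placement, resource, and restart settings it still wants. A minimal Python analogy of the semantics:

```python
# Illustrates YAML merge-key (<<:) semantics: keys the service defines
# replace the anchor's keys at the top level -- there is no deep merge.

base_service = {
    "ports": ["7860:7860"],
    "deploy": {"replicas": 2, "placement": {"constraints": ["node.labels.gpu == true"]}},
}

# comfyui overrides "ports" and "deploy"; the anchor's "deploy" (with its
# GPU constraint) is discarded wholesale, not merged field by field.
comfyui = {**base_service, "ports": ["7861:7860"], "deploy": {"replicas": 1}}

print(comfyui["ports"])   # the override wins
print(comfyui["deploy"])  # nothing from the anchor's deploy survives
```

This is why each service's `deploy:` block above spells out its own replicas and placement instead of relying on the anchor.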

4.2 Key Configuration Explained

4.2.1 Resource Scheduling Strategy

(Mermaid scheduling diagram not preserved in this export.)

4.2.2 Rolling Update Flow

sequenceDiagram
    participant Manager
    participant Worker1
    participant Worker2

    Manager->>Worker1: deploy new version (auto1111_v2)
    Worker1-->>Manager: health check passed
    Manager->>Worker2: deploy new version (auto1111_v2)
    Worker2-->>Manager: health check passed
    Manager->>Worker1: stop old version (auto1111_v1)
    Manager->>Worker2: stop old version (auto1111_v1)

5. NFS Shared Storage Configuration

To share data across nodes, set up an NFS server:

# Install the NFS server on the manager node
sudo apt install -y nfs-kernel-server
sudo mkdir -p /nfs/share/{data,output}
sudo chown -R nobody:nogroup /nfs/share
sudo chmod -R 777 /nfs/share

# Configure exports
echo "/nfs/share/data 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
echo "/nfs/share/output 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports

sudo exportfs -a
sudo systemctl restart nfs-kernel-server

# Install the NFS client on worker nodes
sudo apt install -y nfs-common
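The `nfs_data`/`nfs_output` volumes in the stack file mount the shares directly through the local NFS driver, so host-side mounts are optional. If you also want the shares mounted on each worker host (for debugging or uploading models), `/etc/fstab` entries like the following work; the mount points below are arbitrary examples:

```
192.168.1.10:/nfs/share/data    /mnt/nfs/data    nfs  defaults,_netdev  0  0
192.168.1.10:/nfs/share/output  /mnt/nfs/output  nfs  defaults,_netdev  0  0
```

The `_netdev` option delays mounting until the network is up, which matters on boot.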

6. Cluster Deployment and Management

6.1 Deploy the Service Stack

# Deploy the full stack
docker stack deploy -c docker-compose-swarm.yml sd-webui

# Check service status
docker stack ps sd-webui

# Follow the logs
docker service logs -f sd-webui_auto1111

# Scale a service
docker service scale sd-webui_auto1111=3

6.2 Monitoring and Alerting

Deploy Prometheus and Grafana to monitor GPU utilization:

# docker-compose-monitor.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    deploy:
      placement:
        constraints: [node.role == manager]

  grafana:
    image: grafana/grafana
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

volumes:
  grafana_data:

7. Performance Optimization and Best Practices

7.1 GPU Resource Optimization

# GPU memory tuning logic in config.py
def check_and_replace_config(config_file: str, target_file: str = None):
    data = json_file_to_dict(config_file) or {}
    
    # Adjust parameters automatically based on available GPU memory
    gpu_vram = get_gpu_vram()  # requires a GPU VRAM detection helper
    if gpu_vram < 8:
        data['medvram'] = True
        data['lowvram'] = True
    elif gpu_vram < 12:
        data['medvram'] = True
        data['lowvram'] = False
    else:
        data['medvram'] = False
        data['lowvram'] = False
    
    dict_to_json_file(target_file or config_file, data)
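The `get_gpu_vram()` call above is left as a placeholder; one way to implement it is to query `nvidia-smi`. The sketch below is an assumption about how you might wire it up (the parsing is split out so it can be tested without a GPU; `nvidia-smi` must be on the PATH):

```python
import subprocess

def parse_vram_gb(nvidia_smi_output: str) -> float:
    """Parse 'memory.total' query output (one MiB value per line, one line
    per GPU); return the smallest GPU's VRAM in GiB so tuning stays
    conservative on mixed-GPU nodes. Returns 0.0 if no GPU is reported."""
    mib = [int(line) for line in nvidia_smi_output.split() if line.strip()]
    return min(mib) / 1024 if mib else 0.0

def get_gpu_vram() -> float:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_gb(out)
```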

7.2 Load Balancing Configuration

Use Traefik as a reverse proxy for smart routing:

# traefik.yml
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

http:
  routers:
    auto1111-router:
      rule: "PathPrefix(`/auto`)"
      service: auto1111-service
    comfyui-router:
      rule: "PathPrefix(`/comfy`)"
      service: comfyui-service

  services:
    auto1111-service:
      loadBalancer:
        servers:
          - url: "http://auto1111:7860/"
    comfyui-service:
      loadBalancer:
        servers:
          - url: "http://comfyui:7860/"
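The two routers above dispatch on path prefixes; the decision Traefik makes is essentially the following (illustrative Python, slightly stricter than Traefik's PathPrefix, which matches any literal prefix):

```python
# Map path prefixes to backend services, mirroring the Traefik routers above.
ROUTES = {"/auto": "auto1111-service", "/comfy": "comfyui-service"}

def route(path: str):
    """Return the backend service for a request path, or None for no match.
    Matches the prefix exactly or at a '/' boundary."""
    for prefix, service in ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return None
```

In practice you may also need Traefik's stripPrefix middleware, since the WebUIs serve from the root path by default.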

8. Failure Handling and Recovery

8.1 Troubleshooting Flow for Common Issues

st=>start: Service abnormal
op1=>operation: Check service status
op2=>operation: Inspect logs
op3=>operation: Check GPU resources
op4=>operation: Check NFS connection
op5=>operation: Restart service
op6=>operation: Roll back version
cond1=>condition: Service running?
cond2=>condition: Errors in logs?
cond3=>condition: GPU available?
cond4=>condition: NFS mounted?
e=>end: Problem resolved

st->op1->cond1
cond1(yes)->op2->cond2
cond1(no)->op5->e
cond2(yes)->op3->cond3
cond2(no)->op4->cond4
cond3(no)->op5->e
cond3(yes)->op6->e
cond4(no)->op5->e
cond4(yes)->op6->e

8.2 Automatic Recovery Configuration

# Automatic recovery settings in the compose file
deploy:
  restart_policy:
    condition: on-failure
    delay: 10s
    max_attempts: 3
    window: 60s
  update_config:
    parallelism: 1
    delay: 10s
    failure_action: rollback
    monitor: 60s
    max_failure_ratio: 0.33
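With `max_failure_ratio: 0.33` and `monitor: 60s`, Swarm triggers `failure_action: rollback` once too many freshly updated tasks fail inside the monitor window. The check amounts to the following (an illustrative simplification of Swarm's behavior):

```python
def should_rollback(failed: int, updated: int, max_failure_ratio: float = 0.33) -> bool:
    """Roll back when the failure ratio among tasks updated so far
    exceeds max_failure_ratio. With the settings above, 1 failure out
    of 3 updated tasks (0.33...) is already over the line."""
    if updated == 0:
        return False
    return failed / updated > max_failure_ratio
```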

9. Summary and Outlook

Deploying a stable-diffusion-webui-docker cluster with Docker Swarm gives us:

  1. High availability: automatic service recovery and failover
  2. Elastic scaling: replica counts adjusted dynamically with load
  3. Resource efficiency: GPUs allocated sensibly, avoiding waste
  4. Multi-variant support: AUTOMATIC1111 and ComfyUI deployed side by side
  5. Shared data: NFS storage keeps models and outputs consistent

Directions for further improvement:

  • Autoscaling driven by GPU utilization
  • Integrated distributed training
  • Cost monitoring and smarter resource scheduling
  • Centralized management of WebUI configuration
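On the first of those directions, the core scaling decision could look like this (thresholds and bounds are hypothetical; the surrounding loop would poll average GPU utilization and apply the result via `docker service scale`):

```python
def desired_replicas(current: int, avg_gpu_util: float,
                     min_r: int = 1, max_r: int = 4) -> int:
    """Scale up above 80% average GPU utilization, down below 20%,
    otherwise keep the current count. Clamped to [min_r, max_r]."""
    if avg_gpu_util > 80.0:
        current += 1
    elif avg_gpu_util < 20.0:
        current -= 1
    return max(min_r, min(max_r, current))
```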

10. Support and Feedback

If this article helped you, please like 👍, bookmark ⭐, and follow the author for more AI deployment tutorials!

Coming next: an enterprise-grade Stable Diffusion solution on Kubernetes


Appendix: Quick Command Reference

# Check GPU usage
nvidia-smi

# List Swarm nodes
docker node ls

# Inspect a service
docker service inspect --pretty sd-webui_auto1111

# Open a shell inside a container
docker exec -it $(docker ps -q --filter name=sd-webui_auto1111) bash

# Force-update a service
docker service update --force sd-webui_auto1111


Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.
