Docker Swarm Deployment: A stable-diffusion-webui-docker Container Orchestration Cluster
1. Pain Points and Solution
Are you facing any of these challenges: not enough compute on a single-node Stable Diffusion deployment, sluggish performance under concurrent multi-user access, or no way to guarantee service availability? This article builds a container orchestration cluster for stable-diffusion-webui-docker with Docker Swarm, delivering an enterprise-grade, highly available AI image-generation setup.
After reading this article you will know how to:
- Set up a Docker Swarm cluster
- Schedule GPU resources and load-balance across multiple nodes
- Configure highly available services with automatic failure recovery
- Run multiple UI variants (AUTOMATIC1111/ComfyUI) in parallel
- Persist data and share models across nodes
2. Environment Preparation and Architecture Design
2.1 Hardware and Software Requirements
| Component | Minimum | Recommended |
|---|---|---|
| Node count | 1 manager + 1 worker | 1 manager + 2+ workers |
| CPU | 8 cores | 16+ cores |
| Memory | 16GB | 32GB+ |
| GPU | single card, 8GB VRAM | multiple cards, 16GB+ VRAM (NVIDIA) |
| Storage | 100GB SSD | 500GB NVMe |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 LTS |
| Docker version | 20.10+ | 24.0.5+ |
| NVIDIA driver | 470.57+ | 535.104.05+ |
2.2 Cluster Architecture
The cluster consists of one manager node (which hosts the download service and the monitoring stack), GPU-labeled workers running the AUTOMATIC1111 and ComfyUI services, a CPU-only worker as a fallback, and an NFS server (192.168.1.10) that shares model data and outputs with every node.
3. Docker Swarm Cluster Deployment
3.1 Initialize the Swarm Cluster
# Initialize the cluster on the manager node
docker swarm init --advertise-addr 192.168.1.100
# Print the join command (run it on each worker node)
docker swarm join-token worker
# Install the NVIDIA Container Toolkit (on all nodes)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
3.2 Node Labels and Resource Constraints
To schedule GPU workloads precisely, add labels to the nodes:
# Label the GPU nodes (adjust to your environment)
docker node update --label-add gpu=true worker1
docker node update --label-add gpu=true worker2
docker node update --label-add gpu=false worker3
# Label which nodes host which UI variant
docker node update --label-add comfyui=true worker1
docker node update --label-add auto1111=true worker2
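Typing these commands by hand gets error-prone as the cluster grows. The sketch below is an illustration, not part of the project: it generates the same `docker node update` invocations from a single mapping (the node names and label plan are examples; adjust them to your cluster):

```python
# Generate "docker node update --label-add" commands from a label plan.
def label_commands(node_labels: dict) -> list:
    cmds = []
    for node, labels in node_labels.items():
        for key, value in labels.items():
            cmds.append(["docker", "node", "update",
                         "--label-add", f"{key}={value}", node])
    return cmds

# Example plan mirroring the commands above; run each entry with
# subprocess.run(cmd, check=True) from a host that can reach the manager:
plan = {
    "worker1": {"gpu": "true", "comfyui": "true"},
    "worker2": {"gpu": "true", "auto1111": "true"},
    "worker3": {"gpu": "false"},
}
```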
4. Container Orchestration Configuration
4.1 Docker Compose Configuration for Swarm
Create a docker-compose-swarm.yml file that orchestrates the individual services:
version: '3.8'

x-base_service: &base_service
  ports:
    - "7860:7860"
  volumes:
    - nfs_data:/data
    - nfs_output:/output
  stop_signal: SIGKILL
  deploy:
    replicas: 2
    placement:
      constraints: [node.labels.gpu == true]
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0'] # optionally pin specific GPU device IDs
            capabilities: [compute, utility]
    restart_policy:
      condition: on-failure
      delay: 10s
      max_attempts: 3
      window: 60s

services:
  download:
    build: ./services/download/
    volumes:
      - nfs_data:/data
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]

  auto1111:
    <<: *base_service
    build: ./services/AUTOMATIC1111
    image: sd-auto:78
    environment:
      - CLI_ARGS=--allow-code --medvram --xformers --enable-insecure-extension-access --api
    deploy:
      replicas: 2
      placement:
        constraints: [node.labels.auto1111 == true]
      # the merge key replaces "deploy" wholesale rather than deep-merging it,
      # so the GPU reservation from the base service must be repeated here
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [compute, utility]
      update_config:
        parallelism: 1
        delay: 30s
        failure_action: rollback

  comfyui:
    <<: *base_service
    build: ./services/comfy/
    image: sd-comfy:7
    ports:
      - "7861:7860"
    environment:
      - CLI_ARGS=
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.comfyui == true]
      # same as above: repeat the GPU reservation, since "deploy" is not deep-merged
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [compute, utility]

  auto1111-cpu:
    <<: *base_service
    build: ./services/AUTOMATIC1111
    ports:
      - "7862:7860" # must differ from auto1111, which already publishes 7860
    environment:
      - CLI_ARGS=--no-half --precision full --allow-code --enable-insecure-extension-access --api
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.gpu == false]
      resources:
        reservations:
          devices: [] # no GPU reservation on CPU-only nodes

volumes:
  nfs_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.10,rw
      device: ":/nfs/share/data"
  nfs_output:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.10,rw
      device: ":/nfs/share/output"
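One subtlety worth internalizing: the `<<: *base_service` merge key performs a shallow merge, so any service that defines its own `deploy:` key replaces the base `deploy` block entirely instead of extending it; that is why a GPU service that overrides `deploy` must restate the reservations it still needs. A stdlib-only sketch (plain dicts standing in for the parsed YAML; the analogy to merge-key semantics is mine) makes the behavior concrete:

```python
# A dict union mirrors what YAML's "<<" merge key does: one level deep.
base_service = {
    "ports": ["7860:7860"],
    "deploy": {"replicas": 2,
               "placement": {"constraints": ["node.labels.gpu == true"]}},
}

# Equivalent of "<<: *base_service" plus a local "deploy:" override:
auto1111 = {**base_service, "deploy": {"replicas": 1}}

assert auto1111["ports"] == ["7860:7860"]     # top-level keys are inherited...
assert auto1111["deploy"] == {"replicas": 1}  # ...but "deploy" is replaced wholesale
```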
4.2 Key Configuration Notes
4.2.1 Resource Scheduling Strategy
GPU services are pinned to labeled nodes through placement constraints (node.labels.gpu == true plus the per-UI labels), while auto1111-cpu runs only on gpu=false nodes as a fallback. Be aware that, depending on your Docker version, docker stack deploy may ignore deploy.resources.reservations.devices; if GPUs are not visible inside the containers, set nvidia as the default runtime in /etc/docker/daemon.json on the GPU nodes or use Swarm generic resources instead.
4.2.2 Service Update Flow
sequenceDiagram
    participant Manager
    participant Worker1
    participant Worker2
    Manager->>Worker1: deploy new version (auto1111_v2)
    Worker1-->>Manager: health check passed
    Manager->>Worker2: deploy new version (auto1111_v2)
    Worker2-->>Manager: health check passed
    Manager->>Worker1: stop old version (auto1111_v1)
    Manager->>Worker2: stop old version (auto1111_v1)
5. NFS Shared Storage Configuration
To share data across nodes, set up an NFS server:
# Install the NFS server on the manager node
sudo apt install -y nfs-kernel-server
sudo mkdir -p /nfs/share/{data,output}
sudo chown -R nobody:nogroup /nfs/share
sudo chmod -R 777 /nfs/share
# Configure the exports
echo "/nfs/share/data 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
echo "/nfs/share/output 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -a
sudo systemctl restart nfs-kernel-server
# Install the NFS client on every worker node
sudo apt install -y nfs-common
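Before deploying the stack it helps to confirm every node can actually reach the NFS server; a silent mount failure surfaces later as a cryptic volume error. A minimal sketch (my own helper, not part of the project; it only checks TCP reachability of the NFS port, not export permissions):

```python
import socket

def nfs_reachable(host: str, port: int = 2049, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the NFS port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example, using the NFS server address configured above:
# print(nfs_reachable("192.168.1.10"))
```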
6. Cluster Deployment and Management
6.1 Deploy the Service Stack
# Build and push the images first: "docker stack deploy" ignores the build:
# directives, so every node must be able to pull the referenced images
# Deploy the full stack
docker stack deploy -c docker-compose-swarm.yml sd-webui
# Check service status
docker stack ps sd-webui
# Tail the logs
docker service logs -f sd-webui_auto1111
# Scale a service out
docker service scale sd-webui_auto1111=3
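Scaling by hand works, but the decision itself is easy to automate. The sketch below is only the sizing logic (the thresholds, and the idea of deriving load from queued jobs, are assumptions for illustration); feed its result to `docker service scale`:

```python
import math

def target_replicas(queued_jobs: int, jobs_per_replica: int = 2,
                    min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Pick a replica count proportional to queued work, clamped to a safe range."""
    wanted = math.ceil(queued_jobs / jobs_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

# Example: with 5 queued generation jobs and 2 jobs per replica, run
# "docker service scale sd-webui_auto1111=<n>" with n = target_replicas(5)
```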
6.2 Monitoring and Alerting
Deploy Prometheus and Grafana to track GPU utilization (note that depends_on is ignored by docker stack deploy, so start order is not guaranteed):
# docker-compose-monitor.yml
version: '3'
services:
prometheus:
image: prom/prometheus
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
deploy:
placement:
constraints: [node.role == manager]
grafana:
image: grafana/grafana
volumes:
- grafana_data:/var/lib/grafana
ports:
- "3000:3000"
depends_on:
- prometheus
volumes:
grafana_data:
7. Performance Optimization and Best Practices
7.1 GPU Resource Optimization
# config.py — pick GPU memory flags based on the detected VRAM
import json
import subprocess

def get_gpu_vram() -> int:
    # Total VRAM of GPU 0 in GB via nvidia-smi; 0 if no GPU is available
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"], text=True)
        return int(out.splitlines()[0]) // 1024  # MiB -> GB
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return 0

def check_and_replace_config(config_file: str, target_file: str = None):
    try:
        with open(config_file) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        data = {}
    gpu_vram = get_gpu_vram()
    if gpu_vram < 8:
        data['medvram'] = True
        data['lowvram'] = True
    elif gpu_vram < 12:
        data['medvram'] = True
        data['lowvram'] = False
    else:
        data['medvram'] = False
        data['lowvram'] = False
    with open(target_file or config_file, "w") as f:
        json.dump(data, f, indent=2)
7.2 Load Balancing Configuration
Use Traefik as a reverse proxy for path-based routing. Note that in Traefik v2 the entryPoints belong to the static configuration, while the http routers and services shown below belong in a separate dynamic configuration file loaded through the file provider:
# traefik.yml
entryPoints:
web:
address: ":80"
websecure:
address: ":443"
http:
routers:
auto1111-router:
rule: "PathPrefix(`/auto`)"
service: auto1111-service
comfyui-router:
rule: "PathPrefix(`/comfy`)"
service: comfyui-service
services:
auto1111-service:
loadBalancer:
servers:
- url: "http://auto1111:7860/"
comfyui-service:
loadBalancer:
servers:
- url: "http://comfyui:7860/"
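As a mental model of how the PathPrefix rules above dispatch requests, the dictionary lookup below reproduces the literal prefix matching (an illustration only; Traefik's real matcher supports priorities, regexes, and much more):

```python
# Path prefixes and backends mirroring the Traefik routers above.
ROUTES = {"/auto": "auto1111-service", "/comfy": "comfyui-service"}

def match_service(path: str):
    """Return the backend whose path prefix matches, or None."""
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):  # PathPrefix is a plain string prefix match
            return service
    return None
```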
8. Troubleshooting and Recovery
8.1 Diagnostic Flow for Common Problems
st=>start: Service failure
op1=>operation: Check service status
op2=>operation: Inspect the logs
op3=>operation: Check GPU resources
op4=>operation: Check the NFS connection
op5=>operation: Restart the service
op6=>operation: Roll back the version
cond1=>condition: Is the service running?
cond2=>condition: Errors in the logs?
cond3=>condition: Is the GPU available?
cond4=>condition: Is NFS mounted?
e=>end: Problem resolved
st->op1->cond1
cond1(yes)->op2->cond2
cond1(no)->op5->e
cond2(yes)->op3->cond3
cond2(no)->op4->cond4
cond3(no)->op5->e
cond3(yes)->op6->e
cond4(no)->op5->e
cond4(yes)->op6->e
8.2 Automatic Recovery Configuration
# Recovery settings in the deploy section of the compose file
deploy:
restart_policy:
condition: on-failure
delay: 10s
max_attempts: 3
window: 60s
update_config:
parallelism: 1
delay: 10s
failure_action: rollback
monitor: 60s
max_failure_ratio: 0.33
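Swarm restarts tasks that exit, but it cannot see a hung WebUI that still holds its port; an external probe closes that gap. A minimal sketch (my own helper; the /internal/ping endpoint is an assumption based on AUTOMATIC1111's API server, so verify it against your build):

```python
from urllib.request import urlopen
from urllib.error import URLError

def webui_healthy(url: str, timeout: float = 5.0) -> bool:
    """True if the WebUI answers the probe URL with HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

# Example, assuming the API was started with --api:
# webui_healthy("http://localhost:7860/internal/ping")
```

A cron job or a sidecar can call this and trigger `docker service update --force` when the probe fails repeatedly.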
9. Summary and Outlook
By deploying stable-diffusion-webui-docker as a Docker Swarm cluster, we achieved:
- High availability: automatic service recovery and failover
- Elastic scaling: instance counts adjusted to the load
- Resource efficiency: GPUs allocated deliberately, avoiding waste
- Multi-version support: AUTOMATIC1111 and ComfyUI deployed side by side
- Shared data: NFS storage keeps models and outputs consistent across nodes
Directions for further improvement:
- Autoscaling driven by GPU utilization
- Integrating distributed training
- Cost monitoring and smarter resource scheduling
- Centralized management of the WebUI configuration
10. Support and Interaction
If this article helped you, please like 👍, bookmark ⭐, and follow the author for more AI deployment tutorials!
Coming next: an enterprise-grade Stable Diffusion deployment on Kubernetes
Appendix: Quick Command Reference
# Check GPU usage
nvidia-smi
# List Swarm nodes
docker node ls
# Inspect a service
docker service inspect --pretty sd-webui_auto1111
# Open a shell in a running container
docker exec -it $(docker ps -q --filter name=sd-webui_auto1111) bash
# Force an update of a service
docker service update --force sd-webui_auto1111
Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only



