1. Cluster Architecture Design Principles
1.1 Core Design Goals
- Horizontal scalability: compute nodes can be added or removed dynamically
- Fault-domain isolation: fault tolerance at the compute, storage, and network layers
- Optimized resource utilization: priority-based, multi-level task scheduling
- Unified entry point: API Gateway + load-balancer architecture (see the Ingress sketch below)
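As a concrete example of the unified entry point, a Kubernetes Ingress can front the ComfyUI service behind a single host. A minimal sketch, assuming an NGINX ingress controller and a `comfy-api` Service on port 8188 (both names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: comfy-gateway
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"  # long-running generations
spec:
  ingressClassName: nginx
  rules:
    - host: comfy.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: comfy-api   # hypothetical Service name
                port:
                  number: 8188
```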
1.2 Comparison of Typical Topologies

| Architecture | Node count | Use case | Pros / cons |
|---|---|---|---|
| Single control node | <20 nodes | Experimental environments | Simple to deploy / single point of failure |
| Multi-AZ deployment | 50-200 nodes | Production | Highly available / sensitive to network latency |
| Hybrid cloud | 200+ nodes | Global services | Cost-optimized / high management complexity |
2. Base Environment Setup
2.1 Hardware Resource Planning
Tiered configuration for compute nodes:
```yaml
# cluster_config.yaml
node_profiles:
  - type: gpu_heavy
    specs:
      gpu: 4x A100 80GB
      cpu: 64 vCPU
      mem: 512GB
      storage: 10TB NVMe
    count: 8
  - type: cpu_preprocess
    specs:
      gpu: none
      cpu: 32 vCPU
      mem: 256GB
      storage: 5TB SSD
    count: 12
  - type: storage_node
    specs:
      network: 100GbE
      storage: 1PB Ceph
    count: 3
```
2.2 Network Architecture Configuration
```bash
# Build a BGP-routed pod network with Calico
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# Configure NIC bonding (example)
nmcli con add type bond con-name bond0 ifname bond0 mode 802.3ad
nmcli con add type ethernet ifname eth1 master bond0
nmcli con add type ethernet ifname eth2 master bond0
nmcli con mod bond0 ipv4.addresses 10.200.1.10/24
nmcli con up bond0
```
3. Kubernetes Cluster Deployment
3.1 Building the Cluster with kubeadm
```bash
# Initialize the control plane
kubeadm init --pod-network-cidr=192.168.0.0/16 \
  --apiserver-advertise-address=10.200.1.10 \
  --image-repository registry.aliyuncs.com/google_containers

# Join a worker node
kubeadm join 10.200.1.10:6443 --token xxxx \
  --discovery-token-ca-cert-hash sha256:xxxx
```
3.2 GPU Node Configuration
The NVIDIA Container Toolkit belongs on the GPU host itself, not inside the image (the CUDA base images already ship the CUDA libraries); it hooks the node's container runtime so GPUs can be passed through to pods:

```bash
# On each GPU node: install the NVIDIA Container Toolkit and point the
# container runtime at it
apt-get update && apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=containerd
systemctl restart containerd

# Deploy the NVIDIA device plugin so Kubernetes can schedule GPU resources
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```
4. Containerizing ComfyUI
4.1 Building a Production-Grade Docker Image
```dockerfile
# Multi-stage build: install dependencies in the full PyTorch image,
# then copy only the artifacts into a slim CUDA runtime image
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime AS builder
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt \
    && python -m compileall .

FROM nvidia/cuda:12.1.0-base-ubuntu22.04
# The CUDA base image ships no Python interpreter
RUN apt-get update && apt-get install -y --no-install-recommends python3.10 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app /app
# The pytorch/pytorch image keeps its packages under /opt/conda
COPY --from=builder /opt/conda/lib/python3.10/site-packages /usr/local/lib/python3.10/site-packages
ENV PYTHONPATH=/usr/local/lib/python3.10/site-packages
WORKDIR /app
EXPOSE 8188
CMD ["python3.10", "main.py", "--listen", "--port", "8188"]
```
4.2 Helm Chart Configuration
```yaml
# values.yaml
replicaCount: 8
resources:
  limits:
    nvidia.com/gpu: 2
    memory: 48Gi
  requests:
    cpu: 8
    memory: 32Gi
autoscaling:
  enabled: true
  minReplicas: 4
  maxReplicas: 32
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 85
nodeSelector:
  node-type: gpu_heavy
tolerations:
  - key: "gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
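For the `nodeSelector` and toleration above to match anything, the GPU nodes need the corresponding label and taint. A minimal sketch (the node name is illustrative):

```bash
# Label GPU nodes so the nodeSelector matches
kubectl label nodes gpu-node-01 node-type=gpu_heavy

# Taint them so only pods carrying the "gpu" toleration schedule there
kubectl taint nodes gpu-node-01 gpu=true:NoSchedule
```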
5. Distributed Task Scheduling
5.1 Celery + RabbitMQ Architecture
```python
# tasks.py
from celery import Celery
from comfy_api import execute_workflow

app = Celery('tasks',
             broker='pyamqp://user:pass@rabbitmq:5672//',
             backend='redis://redis:6379/0')

@app.task(bind=True, max_retries=3)
def execute_workflow_task(self, workflow_json):
    try:
        return execute_workflow(workflow_json)
    except Exception as exc:
        raise self.retry(exc=exc)
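```

Submitting work from an API handler then reduces to one call. A brief usage sketch (the queue name matches section 5.2 below):

```python
# Enqueue a workflow on the urgent queue with a high message priority
result = execute_workflow_task.apply_async(
    args=[workflow_json],
    queue='urgent',
    priority=9,
)
print(result.get(timeout=300))  # block until a worker finishes
```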
5.2 Priority Queue Configuration
```python
# Route tasks of different priorities to dedicated queues
task_routes = {
    'tasks.high_priority': {'queue': 'urgent'},
    'tasks.normal_priority': {'queue': 'default'},
    'tasks.low_priority': {'queue': 'batch'}
}
```

```bash
# RabbitMQ policy: cap queue length. Note that message priority cannot be
# set through a policy -- declare the queues with the x-max-priority
# argument instead (see the sketch below).
rabbitmqctl set_policy length-limit \
  "^(urgent|default|batch)$" \
  '{"max-length": 10000}' \
  --apply-to queues
```
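A minimal sketch of the matching queue declarations on the Celery side, using kombu (queue names follow the routes above):

```python
from kombu import Queue

# x-max-priority must be set when a queue is declared; RabbitMQ then
# honors the per-message priority passed to apply_async()
app.conf.task_queues = [
    Queue('urgent', queue_arguments={'x-max-priority': 10}),
    Queue('default', queue_arguments={'x-max-priority': 10}),
    Queue('batch', queue_arguments={'x-max-priority': 10}),
]
```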
6. Storage Optimization
6.1 Distributed Model Storage
```bash
# Deploy a distributed MinIO cluster
helm install minio minio/minio \
  --set mode=distributed \
  --set persistence.size=100Ti \
  --set resources.requests.memory=16Gi \
  --set accessKey=comfyadmin \
  --set secretKey=comfysecret123

# Model sync: continuously mirror the local model directory to object storage
mc mirror --watch /models/local s3/comfy-models/
```
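On the worker side, models can be pulled from the bucket on demand. A minimal sketch using the official `minio` Python client (the endpoint, bucket, and object names follow the deployment above; the local path is illustrative):

```python
from minio import Minio

# Credentials match the Helm values above
client = Minio("minio:9000",
               access_key="comfyadmin",
               secret_key="comfysecret123",
               secure=False)

# Download a checkpoint into the node-local model directory
client.fget_object("comfy-models", "sd-xl-v1.0.safetensors",
                   "/models/local/sd-xl-v1.0.safetensors")
```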
6.2 Cache Acceleration
```python
# Hot-model caching in Redis
import io
import redis
import torch
from functools import lru_cache

r = redis.Redis(host='redis-master', port=6379)

class ModelCache:
    @lru_cache(maxsize=32)
    def get_model(self, ckpt_name):
        model_data = r.get(f"model:{ckpt_name}")
        if not model_data:
            model_data = self._load_from_storage(ckpt_name)
            # Keep the hot checkpoint in Redis for one hour
            r.setex(f"model:{ckpt_name}", 3600, model_data)
        return torch.load(io.BytesIO(model_data))
```
7. Monitoring and Logging
7.1 Prometheus + Grafana Monitoring Stack
```yaml
# prometheus-rules.yaml
groups:
  - name: comfy-alerts
    rules:
      - alert: HighGPUUtilization
        # nvidia_gpu_utilization is a gauge, so average it over the window
        # rather than taking a counter rate
        expr: avg by (instance) (avg_over_time(nvidia_gpu_utilization[5m])) > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Sustained high GPU utilization ({{ $value }})"
      - alert: WorkflowTimeout
        expr: comfy_workflow_duration_seconds > 300
        labels:
          severity: warning
        annotations:
          description: "Workflow execution took {{ $value }} seconds"
```
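If the cluster runs kube-prometheus-stack, rule files like this one are usually delivered as a `PrometheusRule` resource so the operator loads them automatically. A minimal wrapper sketch (the `release` label value depends on your Prometheus installation):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: comfy-alerts
  labels:
    release: prometheus   # must match the operator's ruleSelector
spec:
  groups:
    - name: comfy-alerts
      rules:
        - alert: WorkflowTimeout
          expr: comfy_workflow_duration_seconds > 300
          labels:
            severity: warning
```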
7.2 Application Log Analysis
```python
# Structured log output
import time
import structlog

logger = structlog.get_logger()

def execute_workflow(workflow):
    try:
        start_time = time.monotonic()
        logger.info("workflow_started",
                    workflow_id=workflow.id,
                    user=workflow.user)
        # ... execution logic producing `results` ...
        logger.info("workflow_completed",
                    duration=time.monotonic() - start_time,
                    output_size=len(results))
    except Exception as e:
        logger.error("workflow_failed",
                     exception=str(e),
                     stack_info=True)
```
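For the log pipeline to parse these events, structlog should render them as JSON rather than the default console format. A minimal configuration sketch:

```python
import structlog

# Emit one JSON object per log event, with level and ISO timestamp fields
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
```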
8. Security Hardening
8.1 Zero-Trust Security Architecture
```yaml
# Example Istio authorization policy
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: comfy-api-policy
spec:
  selector:
    matchLabels:
      app: comfy-api
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/auth-system/sa/jwt-validator"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/api/v1/execute"]
      when:
        - key: request.auth.claims[group]
          values: ["ai-developer"]
```
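The `request.auth.claims` condition only has something to check once a `RequestAuthentication` has validated the incoming JWT. A minimal companion sketch (the issuer and JWKS URL are placeholders for your identity provider):

```yaml
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: comfy-jwt
spec:
  selector:
    matchLabels:
      app: comfy-api
  jwtRules:
    - issuer: "https://auth.example.com"            # placeholder issuer
      jwksUri: "https://auth.example.com/jwks.json" # placeholder JWKS endpoint
```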
8.2 Data Encryption
```bash
# Enable storage-side encryption
kubectl create secret generic minio-key \
  --from-literal=MINIO_KMS_SECRET_KEY=32byteslongsupersecretkey

# TLS certificate management
certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials ./cloudflare.ini \
  -d comfy.example.com
```
9. Disaster Recovery
9.1 Cross-Region Sync
```bash
# Install Velero for backups
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.7.0 \
  --bucket comfy-backup \
  --secret-file ./credentials \
  --use-volume-snapshots=false \
  --backup-location-config region=us-west-2

# Create a scheduled daily backup
velero schedule create daily-backup \
  --schedule="@every 24h" \
  --include-namespaces comfy-prod
```
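Restoring in the recovery region is the mirror operation. A brief usage sketch (the backup name carries the schedule's timestamp, so the one shown here is illustrative):

```bash
# List the backups produced by the schedule
velero backup get

# Restore the chosen backup into the target cluster
velero restore create --from-backup daily-backup-20240101000000
```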
9.2 Failover Drills
```python
# Chaos-engineering test case
import random

def test_gpu_node_failure():
    # Pick a GPU node at random
    node = random.choice(get_gpu_nodes())
    # Simulate a node failure
    node.power_off()
    # Verify that its tasks are redistributed automatically
    assert check_task_redistribution(node), "Tasks were not migrated correctly"
    # Bring the node back and verify data consistency
    node.power_on()
    assert check_data_integrity(), "Data consistency check failed"
```
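The helpers are left to the test harness; for illustration, `get_gpu_nodes` could query the API server with the official `kubernetes` Python client (the label matches section 4.2; wrapping the returned nodes with power controls is harness-specific):

```python
from kubernetes import client, config

def get_gpu_nodes():
    # Load credentials from the local kubeconfig
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # Select the nodes labeled for GPU workloads
    nodes = v1.list_node(label_selector="node-type=gpu_heavy")
    return nodes.items
```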
Companion toolkit for this chapter:
Chapter 1: Kubernetes Deployment Checklist
1.1 Pre-Deployment Checks
- [ ] Time synchronized on all nodes (`chrony status`)
- [ ] Inter-node DNS resolution works (`nslookup <node-name>`)
- [ ] Firewall ports open:
  - 6443 (API server)
  - 2379-2380 (etcd)
  - 10250-10259 (kube components)
- [ ] Swap disabled (`swapoff -a`)
- [ ] Kernel parameters tuned (see the sysctl sketch below):
  - net.ipv4.ip_forward=1
  - vm.swappiness=0
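A minimal sketch of applying those kernel parameters persistently (the file path is conventional, not mandated):

```bash
# Persist the kernel parameters required by the checklist
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes.conf
net.ipv4.ip_forward = 1
vm.swappiness = 0
EOF

# Apply them without a reboot
sudo sysctl --system
```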
1.2 Post-Install Verification
```bash
# Cluster status checks
kubectl get componentstatus
kubectl get nodes -o wide

# Network connectivity test
kubectl run test-pod --image=busybox --rm -it -- ping <service-ip>

# Storage volume verification
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 1Gi
EOF
```
Chapter 2: Chaos-Engineering Test Scripts
2.1 Node Failure Simulation
```bash
#!/bin/bash
# chaos_node_failure.sh
NODE=${1:-worker01}

echo "🔄 Cordoning node $NODE"
kubectl cordon $NODE

echo "💥 Evicting pods from the node"
kubectl drain $NODE --grace-period=5 --ignore-daemonsets --delete-emptydir-data

echo "🔌 Simulating a network outage (60 seconds)"
ssh $NODE "sudo iptables -A INPUT -j DROP && sleep 60 && sudo iptables -D INPUT -j DROP" &

echo "✅ Fault injection complete; auto-recovery in 60 seconds"
```
2.2 Resource Exhaustion Test
```python
# chaos_resource_stress.py
# Assumes a Chaos Mesh Python client that exposes StressChaos experiments
import time
import chaosmesh.k8s as chaos

def cpu_stress(namespace="comfy", duration=300):
    experiment = chaos.StressChaos(
        name="cpu-burn",
        mode="one",
        selector={"namespaces": [namespace]},
        stressors={
            "cpu": {
                "workers": 4,
                "load": 95,
                "duration": f"{duration}s"
            }
        }
    )
    experiment.create()
    time.sleep(duration)
    experiment.delete()

def memory_stress(size="2GB", duration=180):
    # Configured analogously to the CPU stress test
    ...
```
Chapter 3: Security Baseline Templates
3.1 Kubernetes Hardening Configuration
```yaml
# security_baseline.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources: [{"group": "*"}]
    omitStages: ["RequestReceived"]
---
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: [secrets]
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded-secret>
```
3.2 Pod Security Policy
```yaml
# pod_security_policy.yaml
# Note: PodSecurityPolicy was removed in Kubernetes 1.25; newer clusters
# enforce equivalent constraints via Pod Security Admission.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: comfy-restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'persistentVolumeClaim'
  hostNetwork: false
  hostIPC: false
  hostPID: false
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  supplementalGroups:
    rule: 'MustRunAs'
    ranges: [{min: 1, max: 65535}]
```
Chapter 4: Performance Benchmark Datasets
4.1 Load Test Configuration
```yaml
# loadtest-config.yaml
workloads:
  - name: image-generation
    concurrency: 100
    duration: 5m
    parameters:
      resolution: 1024x1024
      steps: 50
      batch_size: 4
  - name: model-inference
    concurrency: 50
    duration: 10m
    parameters:
      model: sd-xl-v1.0
      precision: fp16
metrics:
  - name: gpu_util
    query: avg(rate(container_gpu_utilization[1m])) * 100
  - name: p95_latency
    query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))
```
4.2 Test Data Generator
```python
# generate_benchmark_data.py
import numpy as np
from faker import Faker

def generate_image_requests(num=1000):
    fake = Faker()
    # Cast NumPy scalars to int so the payload stays JSON-serializable
    return [{
        "request_id": f"req_{i:04d}",
        "prompt": fake.sentence(),
        "seed": int(np.random.randint(0, 2**32, dtype=np.int64)),
        "steps": int(np.random.choice([20, 30, 50])),
        "width": int(np.random.choice([512, 768, 1024])),
        "height": int(np.random.choice([512, 768, 1024]))
    } for i in range(num)]

if __name__ == "__main__":
    import json

    data = generate_image_requests(5000)
    with open("workload.json", "w") as f:
        json.dump(data, f, indent=2)
```
How to get the toolkit:

- Kubernetes checklist: `wget https://comfyops.cc/k8s-checklist.zip && unzip k8s-checklist.zip`
- Chaos-engineering scripts: `pip install chaos-mesh && git clone https://github.com/comfyops/chaos-scripts`
- Security baseline templates: `kubectl apply -f https://comfyops.cc/security-baseline.yaml`
- Benchmark datasets: `aws s3 cp s3://comfyops-benchmark/datasets/v2.1.3.tar.gz . && tar xzf v2.1.3.tar.gz -C ./benchmark-data`
Implementation roadmap:

- Environment preparation: run the Kubernetes deployment checklist and confirm the base environment is compliant
- Security hardening: apply the security baseline templates and complete a vulnerability scan
- Performance baseline: run the benchmark datasets and record initial performance metrics
- Chaos testing: execute the planned fault injections to validate system robustness
- Iterative optimization: adjust resource allocation and deployment strategy based on the test results
Example dashboard (grafana-dashboard.json):

```json
{
  "title": "Comfy Cluster Monitoring",
  "panels": [
    {
      "type": "graph",
      "title": "GPU Utilization",
      "targets": [{
        "expr": "avg(rate(nvidia_gpu_utilization[1m])) by (instance)",
        "legendFormat": "{{instance}}"
      }]
    },
    {
      "type": "heatmap",
      "title": "Request Latency Distribution",
      "targets": [{
        "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1m])) by (le))"
      }]
    }
  ]
}
```