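Below is a complete centralized deployment plan based on 1 master + 3 clients and a custom image.
It covers the Dockerfile changes, the docker-compose orchestration, the SPU and Ray configuration (with non-conflicting ports), and the deployment verification steps:
I. Dockerfile
Starting from your Dockerfile, adjust the port plan and add the runtime dependencies for Ray/SPU so that SSH, Ray, and SPU each use their own port: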
FROM secretflow/secretflow-anolis8:latest
# Install dependencies (SSH + network tools, handy for debugging)
RUN yum install -y openssh-server openssh-clients net-tools nc && \
    yum clean all && \
    mkdir -p /var/empty/sshd && \
    chown 0:0 /var/empty/sshd && \
    chmod 0711 /var/empty/sshd
# Configure SSH (fixed port 22, no clash with the other services)
RUN mkdir -p /var/run/sshd && \
    ssh-keygen -A && \
    sed -i 's/^#PermitRootLogin yes/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed -i 's/^#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
# Set the root password (for SSH logins between nodes; use a stronger one outside a lab)
RUN echo 'root:123456' | chpasswd
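# Create the startup script: start SSH in the background, then run the
# container command if one is given (Ray/SPU, on demand); otherwise keep
# the container alive. Comments must stay outside the RUN continuation.
RUN mkdir -p /workspace && \
    echo '#!/bin/bash' > /workspace/start.sh && \
    echo 'mkdir -p /var/empty/sshd && chown -R root:root /var/empty/sshd && chmod 0711 /var/empty/sshd' >> /workspace/start.sh && \
    echo '/usr/sbin/sshd -D &' >> /workspace/start.sh && \
    echo 'if [ "$#" -gt 0 ]; then exec "$@"; fi' >> /workspace/start.sh && \
    echo 'tail -f /dev/null' >> /workspace/start.sh
# Make the script executable
RUN chmod +x /workspace/start.sh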
# Working directory
WORKDIR /workspace
ENV PYTHONPATH=/workspace
# Port plan (kept strictly separate to avoid conflicts)
# - 22: SSH
# - 6379: Ray head node port
# - 8265: Ray Dashboard
# - 10086: SPU inter-node communication port
EXPOSE 22 6379 8265 10086
ENTRYPOINT ["./start.sh"]
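Build the image (run this on every node so that all nodes use the same image; the tag must match the image name used in the compose files):
docker build -t secretflow-ssh:v1.0 .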
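The port plan in the Dockerfile can be checked mechanically before building. The following is a minimal sketch, assuming the plan is kept as a plain mapping; `PORT_PLAN` and `find_port_conflicts` are illustrative names and not part of SecretFlow or Docker.

```python
from collections import defaultdict

# Port plan copied from the Dockerfile comments (service -> container port).
PORT_PLAN = {
    "ssh": 22,
    "ray_head": 6379,
    "ray_dashboard": 8265,
    "spu": 10086,
}


def find_port_conflicts(plan):
    """Return {port: [services]} for every port claimed by more than one service."""
    by_port = defaultdict(list)
    for service, port in plan.items():
        by_port[port].append(service)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}


if __name__ == "__main__":
    conflicts = find_port_conflicts(PORT_PLAN)
    print("conflicts:", conflicts or "none")
```

If more services are added later (for example extra SPU ports), extending the mapping and rerunning the check is usually enough to catch a duplicate before it shows up as a bind error.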
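II. docker-compose files (one per node)
Create a separate compose file for each node role (master/alice/bob/charlie), making sure the nodes can reach one another and that no ports conflict.
1. Master node (192.168.127.130)
Create /opt/secretflow/docker-compose.yml: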
version: '3.8'
networks:
  secretflow-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/16  # dedicated subnet, avoids clashing with the physical network
services:
  master:
    image: secretflow-ssh:v1.0
    container_name: secretflow-master
    networks:
      secretflow-net:
        ipv4_address: 172.30.0.10  # fixed container IP
    volumes:
      - ./data:/workspace/data        # local data directory
      - ./scripts:/workspace/scripts  # training scripts
      - ./logs:/workspace/logs        # logs
    working_dir: /workspace
    ports:
      - "2222:22"    # SSH (host port 2222, since 22 is usually taken by the host's own sshd)
      - "6379:6379"  # Ray head node port
      - "8265:8265"  # Ray Dashboard
    environment:
      - NODE_ROLE=master
      - MASTER_IP=192.168.127.130  # physical host IP
      - RAY_PORT=6379
    # The folded scalar (>) joins these lines with spaces, so no trailing
    # backslashes are needed inside the quoted bash command.
    command: >
      bash -c "ray start --head
      --node-ip-address=172.30.0.10
      --port=6379
      --resources='{\"master\": 16}'
      --include-dashboard=True
      --dashboard-host=0.0.0.0
      --dashboard-port=8265
      --disable-usage-stats &&
      tail -f /dev/null"
    restart: unless-stopped
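Before starting the client nodes, it helps to confirm from a client host that the master's published Ray ports are reachable. The following is a minimal sketch using only the Python standard library; `is_port_open` is an illustrative helper, and the IP and ports are the ones from the master compose file.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    master_ip = "192.168.127.130"  # physical IP of the master host
    for name, port in (("Ray head", 6379), ("Ray Dashboard", 8265)):
        state = "reachable" if is_port_open(master_ip, port) else "NOT reachable"
        print(f"{name} {master_ip}:{port} is {state}")
```

A successful TCP connect only shows that the port is published and not blocked by a firewall; it does not prove that the Ray head itself is healthy.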
2. Alice node (192.168.127.131)
Create /opt/secretflow/docker-compose.yml:
version: '3.8'
networks:
  secretflow-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/16
services:
  alice:
    image: secretflow-ssh:v1.0
    container_name: secretflow-alice
    networks:
      secretflow-net:
        ipv4_address: 172.30.0.11
    volumes:
      - ./data:/workspace/data        # Alice's cultural-relic image data
      - ./scripts:/workspace/scripts
      - ./logs:/workspace/logs        # logs
    working_dir: /workspace
    ports:
      - "2222:22"
      - "10086:10086"  # SPU communication port
    environment:
      - NODE_ROLE=alice
      - MASTER_IP=192.168.127.130
      - SPU_PORT=10086
    # Join the Ray cluster on the master. Comments cannot go inside the
    # quoted command: a '#' there would comment out the rest of it.
    command: >
      bash -c "ray start
      --address=192.168.127.130:6379
      --node-ip-address=172.30.0.11
      --resources='{\"alice\": 16}'
      --disable-usage-stats &&
      tail -f /dev/null"
    restart: unless-stopped
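The SPU port published by each party container (10086) is typically what ends up in the SPU cluster definition used by the training script. The following is a minimal sketch that builds only the `nodes` list, with one `party` and `address` entry per party. The helper name `build_spu_nodes` is hypothetical, only the Alice and Bob host IPs come from this article, and the runtime configuration (protocol, field) is deliberately left out.

```python
def build_spu_nodes(party_hosts, spu_port=10086):
    """Build the 'nodes' list of an SPU cluster definition.

    party_hosts maps a party name to the physical IP of the host whose
    compose file publishes spu_port for that party.
    """
    # All parties share one port number, so two parties on the same host
    # would collide on the published port.
    if len(set(party_hosts.values())) != len(party_hosts):
        raise ValueError("each party needs its own host when all use the same port")
    return [
        {"party": party, "address": f"{host}:{spu_port}"}
        for party, host in party_hosts.items()
    ]


if __name__ == "__main__":
    nodes = build_spu_nodes({
        "alice": "192.168.127.131",
        "bob": "192.168.127.132",
    })
    for node in nodes:
        print(node)
```

A third party such as charlie would be added to the mapping in the same way once its host IP is known.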
3. Bob node (192.168.127.132)
Create /opt/secretflow/docker-compose.yml:
version: '3.8'
networks:
  secretflow-net:
    driver
