Open-Source Project Prime: A Framework for Efficient Distributed AI Model Training

Prime is a framework for efficient, globally distributed training of AI models over the internet. Project repository: https://gitcode.com/gh_mirrors/prime2/prime

1. Project Overview

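Prime (formerly ZeroBand) is a framework for efficient, globally distributed training of AI models, in which nodes collaborate on training over the internet. It introduces ElasticDeviceMesh, a new distributed abstraction that supports fault-tolerant training and can dynamically resize the global process group as nodes join or leave, without requiring a cold restart.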
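
To make the idea of a resizable process group concrete, here is a minimal toy sketch. It is not Prime's ElasticDeviceMesh API: the class `ToyElasticGroup` and its methods are hypothetical names invented for illustration, and the real implementation also handles communication, heartbeats, and checkpoint recovery, none of which is modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyElasticGroup:
    """Hypothetical stand-in for an elastic process group (not Prime's API)."""

    members: list[str] = field(default_factory=list)

    def join(self, node_id: str) -> None:
        # A new node is appended; existing members are left untouched.
        if node_id not in self.members:
            self.members.append(node_id)

    def leave(self, node_id: str) -> None:
        # A departed or failed node is dropped without resetting the others.
        if node_id in self.members:
            self.members.remove(node_id)

    @property
    def world_size(self) -> int:
        # The group size always reflects the current membership.
        return len(self.members)

    def rank_of(self, node_id: str) -> int:
        # Ranks are derived from the current membership order.
        return self.members.index(node_id)


group = ToyElasticGroup()
for node in ("node-a", "node-b", "node-c"):
    group.join(node)

group.leave("node-b")  # simulate a node dropping out mid-run
print(group.world_size, group.rank_of("node-c"))
```

The point of the sketch is only that the world size and ranks are recomputed from live membership, so the remaining nodes can carry on rather than being restarted from scratch.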

2. Quick Start

The steps below get Prime up and running.

First, install the required dependencies using the project's install script:

curl -sSL https://raw.githubusercontent.com/PrimeIntellect-ai/prime/main/scripts/install/install.sh | bash

Clone the repository and change into it:

git clone git@github.com:PrimeIntellect-ai/prime.git
cd prime

Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Set up the environment:

sudo apt install iperf -y
uv venv
source .venv/bin/activate
uv sync --extra all
git submodule update --init --recursive

Log in to Hugging Face:

huggingface-cli login
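
Before moving on, it can help to confirm that the command-line tools used in the steps above are actually on your PATH. The following is a small sketch of such a preflight check; the helper name `missing_tools` is made up for this example, and the tool list is simply taken from the commands in this guide (`huggingface-cli` is expected to be found only while the virtual environment is active).

```python
import shutil

# Tool names taken from the commands used in this guide.
REQUIRED_TOOLS = ["git", "uv", "iperf", "huggingface-cli"]


def missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH.

    `which` is injectable so the function can be exercised without
    depending on what is installed on the current machine.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If anything is reported missing, revisit the corresponding installation step before continuing.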

Download the data:

mkdir -p datasets
uv run python scripts/subset_data.py --dataset_name PrimeIntellect/fineweb-edu --data_world_size 1 --data_rank 0 --max_shards 32
mv fineweb-edu/ datasets/fineweb-edu/
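
The `--data_world_size` and `--data_rank` flags suggest that the dataset shards are divided among data-loading ranks. The sketch below shows one plausible scheme, a round-robin split. This is an assumption for illustration only and has not been checked against `scripts/subset_data.py`; the function name `shards_for_rank` is hypothetical.

```python
def shards_for_rank(num_shards: int, data_world_size: int, data_rank: int) -> list[int]:
    """Assign shard indices to a rank in round-robin order.

    Assumed scheme for illustration; the real script may partition differently.
    """
    if data_world_size < 1:
        raise ValueError("data_world_size must be at least 1")
    if not 0 <= data_rank < data_world_size:
        raise ValueError("data_rank must be in [0, data_world_size)")
    return list(range(data_rank, num_shards, data_world_size))


# With a single rank, as in the command above, rank 0 takes every shard.
print(len(shards_for_rank(32, data_world_size=1, data_rank=0)))

# With two ranks, rank 1 would take the odd-numbered shards under this scheme.
print(shards_for_rank(8, data_world_size=2, data_rank=1))
```

Under this assumed scheme, every shard is assigned to exactly one rank, so no two ranks would download the same data.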

Verify your setup:

GLOO_SOCKET_IFNAME=lo GLOBAL_ADDR=localhost GLOBAL_RANK=0 GLOBAL_UNIQUE_ID=0 GLOBAL_WORLD_SIZE=1 GLOBAL_PORT=8989 uv run torchrun --nproc_per_node=2 src/zeroband/train.py @configs/debug/diloco.toml
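
The verification command sets several environment variables inline. As a rough sketch of how the same launch could be assembled from a script, the helper below builds that environment and formats the full command line. The function names `build_launch_env` and `format_command` are hypothetical; the variable names and values are copied from the command above, and their exact semantics are not specified in this article.

```python
import shlex

# Command taken verbatim from the verification step above.
COMMAND = [
    "uv", "run", "torchrun", "--nproc_per_node=2",
    "src/zeroband/train.py", "@configs/debug/diloco.toml",
]


def build_launch_env(global_rank, global_unique_id, global_world_size,
                     global_addr="localhost", global_port=8989, iface="lo"):
    """Build the environment variables used by the verification command."""
    return {
        "GLOO_SOCKET_IFNAME": iface,
        "GLOBAL_ADDR": global_addr,
        "GLOBAL_RANK": str(global_rank),
        "GLOBAL_UNIQUE_ID": str(global_unique_id),
        "GLOBAL_WORLD_SIZE": str(global_world_size),
        "GLOBAL_PORT": str(global_port),
    }


def format_command(env, command):
    """Render the variables and command as a single shell line."""
    assignments = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{assignments} {shlex.join(command)}"


env = build_launch_env(global_rank=0, global_unique_id=0, global_world_size=1)
print(format_command(env, COMMAND))
```

To launch the run from Python rather than print it, the same dictionary can be merged into the current environment and passed to `subprocess.run(COMMAND, env={**os.environ, **env})`.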