Distributed Training with torch

Original article · last modified 2025-10-20 17:28:34

License: CC 4.0 BY-SA

Tags: #distributed

First published 2025-10-20 15:42:23

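torch.distributed.run (the successor to torch.distributed.launch, also available as the torchrun command) is PyTorch's tool for launching distributed training. Distributed training has two main modes: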

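1. The two main distributed training modes

Mode A: Data parallelism - the most common

Every GPU holds a complete copy of the model and works independently; the GPUs communicate only to synchronize gradients.

Mode B: Model parallelism - more specialized

The model is split across several GPUs, so the forward and backward passes must hand data from one GPU to the next, layer by layer.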

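2. Data parallelism (Mode A)

Workflow

# Conceptual illustration, written as if one script controlled every GPU
# (real DDP runs one process per GPU; Model and data are placeholders)
# Suppose there are 4 GPUs and a total batch_size of 256
# Each GPU processes 256 / 4 = 64 samples

# Every GPU holds a complete copy of the model
model_copy_0 = Model().cuda(0)  # GPU 0
model_copy_1 = Model().cuda(1)  # GPU 1
model_copy_2 = Model().cuda(2)  # GPU 2
model_copy_3 = Model().cuda(3)  # GPU 3

# The data is split across the GPUs
data_batch_0 = data[0:64].cuda(0)
data_batch_1 = data[64:128].cuda(1)
data_batch_2 = data[128:192].cuda(2)
data_batch_3 = data[192:256].cuda(3)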
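The hand-written slices above follow a simple rule that can be computed per rank. The following is a minimal sketch in plain Python, assuming the batch divides evenly across GPUs; `shard_bounds` is a hypothetical helper name, not a PyTorch API.

```python
def shard_bounds(total_batch: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) slice of the global batch owned by `rank`."""
    if total_batch % world_size != 0:
        raise ValueError("this sketch assumes an evenly divisible batch")
    per_gpu = total_batch // world_size
    return rank * per_gpu, (rank + 1) * per_gpu


# 4 GPUs and a global batch of 256: each rank owns 64 consecutive samples
bounds = [shard_bounds(256, 4, rank) for rank in range(4)]
print(bounds)
```

Each tuple matches one of the `data[a:b]` slices in the workflow above.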

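When communication happens

# 1. Forward pass: each GPU computes independently, no communication
output_0 = model_copy_0(data_batch_0)
output_1 = model_copy_1(data_batch_1)
# ...

# 2. Backward pass: each GPU computes gradients on its own shard
loss_0 = criterion(output_0, target_0)
loss_0.backward()  # gradients computed on GPU 0

loss_1 = criterion(output_1, target_1)
loss_1.backward()  # gradients computed on GPU 1
# ...

# 3. Gradient synchronization: all GPUs communicate and average their gradients
# This is the only communication phase in a training step!
# (pseudocode: `gradients` stands for the .grad tensor of every parameter)
dist.all_reduce(gradients, op=dist.ReduceOp.SUM)
gradients /= world_size

# 4. Parameter update: every GPU updates its own model copy with the same gradients
optimizer.step()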
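Why averaging the per-GPU gradients is enough can be checked with a small worked example. The sketch below simulates four GPUs in plain Python for a one-parameter model `y = w * x` with a mean-squared-error loss; the model, the data and the helper names are assumptions made for illustration, and `all_reduce_mean` only imitates the sum-then-divide step shown above.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values: list[float]) -> list[float]:
    """Simulate all_reduce(SUM) followed by division by world_size."""
    total = sum(values)
    return [total / len(values) for _ in values]


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
world_size = 4
per_gpu = len(xs) // world_size

# Each simulated GPU computes the gradient on its own shard only
local_grads = [
    mse_grad(w, xs[r * per_gpu:(r + 1) * per_gpu], ys[r * per_gpu:(r + 1) * per_gpu])
    for r in range(world_size)
]
synced_grads = all_reduce_mean(local_grads)
full_batch_grad = mse_grad(w, xs, ys)

# Every replica applies the same update, so the weights stay identical
lr = 0.01
new_weights = [w - lr * g for g in synced_grads]
print(local_grads, synced_grads[0], full_batch_grad)
```

With equally sized shards, the averaged gradient equals the gradient of the full batch, so the four replicas behave like one model trained on all 8 samples.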

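3. Model parallelism (Mode B)

Workflow

import torch.nn as nn

# Place different layers of the model on different GPUs
# (Layer1, Layer2 and Layer3 are placeholders for your own modules)
class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = Layer1().cuda(0)  # lives on GPU 0
        self.layer2 = Layer2().cuda(1)  # lives on GPU 1
        self.layer3 = Layer3().cuda(2)  # lives on GPU 2

    def forward(self, x):
        x = self.layer1(x.cuda(0))      # computed on GPU 0
        x = self.layer2(x.cuda(1))      # activations copied to GPU 1
        x = self.layer3(x.cuda(2))      # activations copied to GPU 2
        return x

# The forward pass hands data from layer to layer
model = SplitModel()
input_data = input_data.cuda(0)
output = model(input_data)  # data moves GPU 0→1→2; backward flows 2→1→0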
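Because each layer waits for the previous one, only one GPU is busy at any moment in this layout. The sketch below makes that concrete in plain Python, under the simplifying assumption that every stage takes one time unit for forward and one for backward; `naive_schedule` and `utilization` are hypothetical helper names.

```python
def naive_schedule(num_stages: int) -> list[int]:
    """Return which GPU is busy at each time step for one batch.

    Forward visits stages 0..S-1, backward visits them in reverse order.
    """
    forward = list(range(num_stages))
    backward = list(reversed(range(num_stages)))
    return forward + backward


def utilization(num_stages: int) -> float:
    """Fraction of (GPU, time step) slots that do useful work."""
    schedule = naive_schedule(num_stages)
    busy_slots = len(schedule)                # exactly one GPU is busy per step
    total_slots = len(schedule) * num_stages
    return busy_slots / total_slots


schedule = naive_schedule(3)
print(schedule)         # GPU index that is active at each step
print(utilization(3))   # share of GPU time spent computing
```

Under these assumptions the utilization is 1 / number of stages, which is one reason this layout is reserved for models that cannot fit on a single GPU.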

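4. Concrete implementation in PyTorch

Data parallelism (DDP - DistributedDataParallel)

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train():
    # The launcher sets LOCAL_RANK for every process it starts
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Every GPU holds a complete copy of the model
    # (MyModel and dataset are placeholders for your own model and dataset)
    model = MyModel().cuda(local_rank)

    # Wrap the model with DDP
    model = DDP(model, device_ids=[local_rank])
    criterion = torch.nn.CrossEntropyLoss()                   # example choice
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # example choice

    # The sampler gives each process a different shard of the data
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for inputs, target in dataloader:
        inputs, target = inputs.cuda(local_rank), target.cuda(local_rank)

        # Forward pass: each GPU works independently
        output = model(inputs)
        loss = criterion(output, target)

        # Backward pass: each GPU computes its local gradients, and DDP
        # averages them across GPUs during this call (communication happens here)
        loss.backward()

        # Parameter update: every GPU applies the same averaged gradients
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()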
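How the sampler hands each process a different part of the dataset can be imitated without PyTorch. The sketch below reproduces, to my understanding, the index assignment `DistributedSampler` uses when shuffling is off and `drop_last` is false: pad the index list so it divides evenly, then give each rank every `world_size`-th index. `shard_indices` is a hypothetical helper, and the PyTorch source remains the authoritative reference.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int) -> list[int]:
    """Indices one rank sees in an epoch (no shuffling, no drop_last).

    Assumes dataset_len >= world_size so the padding fits in one pass.
    """
    indices = list(range(dataset_len))
    per_rank = math.ceil(dataset_len / world_size)
    total_size = per_rank * world_size
    # Pad by repeating samples from the start so every rank gets the same count
    padding = total_size - len(indices)
    indices += indices[:padding]
    # Strided assignment: rank r takes positions r, r + world_size, ...
    return indices[rank:total_size:world_size]


shards = [shard_indices(10, 4, rank) for rank in range(4)]
print(shards)
```

With 10 samples and 4 processes, every process gets 3 indices, and samples 0 and 1 appear twice because of the padding.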

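Communication details

# Conceptually, the communication DDP performs behind the scenes is equivalent to this
# (real DDP groups gradients into buckets and overlaps the all-reduce with backward)
import torch.distributed as dist

def sync_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Average the gradient over all GPUs
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size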

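5. Comparison of the two modes

Property	Data parallelism (DDP)	Model parallelism
Model storage	Every GPU holds a full copy	The model is split across GPUs
Data	Each GPU processes a different shard of the batch	All GPUs work on the same batch
When communication happens	Gradient synchronization at the backward pass	Both the forward and the backward pass
Communication volume	Gradients only, once per step	Intermediate activations (and their gradients) at every split point
Suitable for	Models that fit on a single GPU	Very large models that do not fit on one GPU
How often used	The large majority of distributed training jobs	A minority of special cases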
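To put a rough number on the communication-volume row for data parallelism, the arithmetic can be sketched as follows. The parameter count is hypothetical, gradients are assumed to be fp32, and the ring all-reduce formula is only one possible algorithm; the communication backend may pick a different one.

```python
def gradient_bytes(num_params: int, bytes_per_value: int = 4) -> int:
    """Size of one full set of gradients (fp32 by default)."""
    return num_params * bytes_per_value


def ring_all_reduce_bytes_sent(payload_bytes: int, world_size: int) -> float:
    """Bytes each GPU sends in a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (world_size - 1) / world_size * payload_bytes


# Hypothetical model with 100M parameters trained on 8 GPUs
grads = gradient_bytes(100_000_000)
sent = ring_all_reduce_bytes_sent(grads, world_size=8)
print(grads, sent)
```

The per-GPU traffic depends on the parameter count rather than on the batch size, and it approaches twice the gradient size as the number of GPUs grows.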

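6. Hybrid strategies in practice

Hybrid parallelism

# Use data parallelism and model parallelism together
# Example: 8 GPUs, each model replica split over 2 GPUs, giving 4 data-parallel replicas

# Model-parallel groups (one model replica per pair of GPUs, listed by GPU index)
model_group_1 = [0, 1]  # the two halves of one replica
model_group_2 = [2, 3]  # another complete replica
model_group_3 = [4, 5]  # another complete replica
model_group_4 = [6, 7]  # another complete replica

# Data parallelism: the 4 model groups process different data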
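The grouping above can be derived from the rank numbers instead of being listed by hand. This is a minimal sketch in plain Python, assuming consecutive ranks share one model replica; `build_groups` is a hypothetical helper and other layouts are equally valid.

```python
def build_groups(world_size: int, model_parallel_size: int):
    """Return (model_parallel_groups, data_parallel_groups) as lists of ranks."""
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")
    # Consecutive ranks hold the pieces of one model replica
    mp_groups = [
        list(range(start, start + model_parallel_size))
        for start in range(0, world_size, model_parallel_size)
    ]
    # Ranks holding the same piece of the model average gradients with each other
    dp_groups = [
        list(range(shard, world_size, model_parallel_size))
        for shard in range(model_parallel_size)
    ]
    return mp_groups, dp_groups


mp_groups, dp_groups = build_groups(world_size=8, model_parallel_size=2)
print(mp_groups)
print(dp_groups)
```

In a real job, each rank list could be turned into a communication group with `torch.distributed.new_group(ranks=...)`, so that gradient averaging runs only among GPUs that hold the same part of the model.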

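7. For `torch.distributed.run --nproc_per_node 8 train.py`:

What the launcher does: run as `python -m torch.distributed.run` (or `torchrun`), it starts 8 processes, normally one per GPU, and tells each one its rank through environment variables such as RANK, LOCAL_RANK and WORLD_SIZE. The parallelism itself is decided by train.py, not by the launcher.
Typical case: train.py wraps the model in DDP, so this is data parallelism; every GPU holds a complete copy of the model and works independently.
Communication: it happens only for gradient synchronization, not at every layer.
How a training step proceeds:
- Forward pass: the 8 GPUs each compute independently
- Backward pass: the 8 GPUs each compute gradients
- Gradient synchronization: the 8 GPUs communicate and average their gradients
- Parameter update: the 8 GPUs update with the same gradients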
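Inside train.py, each of the 8 processes finds out which one it is by reading the environment variables the launcher exports. The sketch below reads the three most commonly used ones; `read_dist_env` is a hypothetical helper, and the single-process defaults are an assumption added so the script also runs without the launcher.

```python
import os
from typing import Mapping


def read_dist_env(env: Mapping[str, str] = os.environ) -> dict:
    """Read the rank information the launcher exports to each process.

    Falls back to single-process defaults when the variables are absent.
    """
    return {
        "rank": int(env.get("RANK", "0")),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # index on this node
        "world_size": int(env.get("WORLD_SIZE", "1")),  # total process count
    }


# What process 3 of 8 on a single node would see
info = read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
print(info)
```

The `local_rank` value is the one typically used to pick the GPU for the process, for example in `DDP(model, device_ids=[local_rank])`.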

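Summary

In the vast majority of cases, distributed training uses data parallelism: each GPU works independently and communicates only to synchronize gradients.
In the rare cases where the model is too large for one GPU, model parallelism is used, and data must be handed from layer to layer across GPUs.
PyTorch DDP handles the data-parallel communication automatically, so it is transparent to the user.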
