Pipeline-Parallel Model Sharding in FairScale: A Detailed Guide

Article link: https://blog.youkuaiyun.com/gitblog_01028/article/details/148578551

fairscale: PyTorch extensions for high performance and large scale training. Project repository: https://gitcode.com/gh_mirrors/fa/fairscale

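What Is Pipeline Parallelism

Pipeline parallelism splits a deep learning model by layers and places the pieces on different GPUs, so a model that does not fit in a single GPU's memory can still be trained. FairScale implements this approach in the fairscale.nn.Pipe module.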
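
To make the idea more concrete, here is a minimal pure-Python sketch of a GPipe-style schedule, the general scheme that pipeline implementations of this kind build on. The mini-batch is split into micro-batches (Pipe exposes this through its chunks argument), and at clock tick t, stage j works on micro-batch t - j, so different GPUs can be busy with different micro-batches at the same time. The function name clock_cycles and the tuple layout are illustrative choices here, not FairScale's public API.

```python
from typing import Iterator, List, Tuple


def clock_cycles(num_microbatches: int, num_stages: int) -> Iterator[List[Tuple[int, int]]]:
    """Yield, for each clock tick, the (micro-batch, stage) pairs that run concurrently."""
    for tick in range(num_microbatches + num_stages - 1):
        # Only stages that already received a micro-batch and still have work are active.
        first_stage = max(tick - num_microbatches + 1, 0)
        last_stage = min(tick + 1, num_stages)
        yield [(tick - stage, stage) for stage in range(first_stage, last_stage)]


if __name__ == "__main__":
    # 3 micro-batches flowing through 2 stages (for example, 2 GPUs)
    for tick, tasks in enumerate(clock_cycles(num_microbatches=3, num_stages=2)):
        print(f"tick {tick}: {tasks}")
```

With 3 micro-batches and 2 stages, both stages are active in the middle ticks; only the first and last tick leave one stage idle.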

A Basic Example Model

Let's start with a simple model made of two linear layers and one ReLU activation:

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10)  # first linear layer
        self.relu = torch.nn.ReLU()          # activation
        self.net2 = torch.nn.Linear(10, 5)   # second linear layer

    def forward(self, x):
        x = self.relu(self.net1(x))  # first layer + activation
        return self.net2(x)          # second layer produces the output

model = ToyModel()

Converting to a Pipeline-Parallel Model

To run the model above with pipeline parallelism, first express it as a torch.nn.Sequential, then wrap it with fairscale.nn.Pipe:

import fairscale
import torch
import torch.nn as nn

# Express the model as a Sequential
model = nn.Sequential(
    torch.nn.Linear(10, 10),  # first layer
    torch.nn.ReLU(),          # activation
    torch.nn.Linear(10, 5),   # second layer
)

# Wrap with Pipe; balance gives the number of layers placed on each GPU
model = fairscale.nn.Pipe(model, balance=[2, 1])

With this code:

The first two layers (Linear + ReLU) run on the first GPU (cuda:0)
The last layer (Linear) runs on the second GPU (cuda:1)
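
The layer-to-device mapping described above can be imitated on CPU. The sketch below reuses ToyModel's existing submodules to build the Sequential (so the weights are shared), cuts it into consecutive partitions according to balance, and checks that running the partitions one after another reproduces the original output. The helper split_by_balance is hypothetical and only mimics the partitioning idea; it is not part of FairScale, and no GPU placement happens here.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def split_by_balance(sequential: nn.Sequential, balance):
    """Hypothetical helper: cut a Sequential into consecutive partitions."""
    if sum(balance) != len(sequential):
        raise ValueError("balance must sum to the number of layers")
    partitions, start = [], 0
    for size in balance:
        partitions.append(sequential[start:start + size])  # slicing returns a Sequential
        start += size
    return partitions


torch.manual_seed(0)
toy = ToyModel()
# Reuse the existing submodules so the Sequential shares the same weights.
sequential = nn.Sequential(toy.net1, toy.relu, toy.net2)
partitions = split_by_balance(sequential, balance=[2, 1])

x = torch.randn(20, 10)
out = x
for partition in partitions:  # with Pipe, each partition would live on its own GPU
    out = partition(out)

assert torch.allclose(out, toy(x))  # same result as the unpartitioned model
print([len(p) for p in partitions])  # [2, 1]
```

The entries of balance need to add up to the number of layers in the Sequential, which is why the helper validates the sum before slicing.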

Configuring the Optimizer and Loss Function

The optimizer for a pipeline-parallel model is configured exactly as for an ordinary model:

import torch.optim as optim
import torch.nn.functional as F

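# Define the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
# Define the loss function
# (nll_loss expects log-probabilities; for raw linear outputs like these,
# F.cross_entropy is the usual choice in real training)
loss_fn = F.nll_loss

# Clear any stale gradients
optimizer.zero_grad()
# Prepare dummy data: 20 samples with 10 features each, and 20 class labels in {0, 1}
target = torch.randint(0, 2, size=(20, 1)).squeeze()
data = torch.randn(20, 10)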
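
One point worth checking in the snippet above: F.nll_loss expects log-probabilities, while the Sequential model ends in a plain Linear layer that emits raw scores. The CPU-only sketch below uses made-up tensors with the same shapes as above and shows that applying log_softmax before nll_loss gives the same value as F.cross_entropy on the raw scores. Which loss suits a real task is a separate decision; this only illustrates the relationship between the two functions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(20, 5)               # stand-in for the model's raw outputs
target = torch.randint(0, 2, size=(20,))  # class indices in {0, 1}

# nll_loss applied to log-probabilities ...
loss_from_log_probs = F.nll_loss(F.log_softmax(logits, dim=1), target)
# ... matches cross_entropy applied directly to the raw scores.
loss_from_logits = F.cross_entropy(logits, target)

assert torch.allclose(loss_from_log_probs, loss_from_logits)
print(loss_from_logits.item())
```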

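Key Details When Training

When training a pipeline-parallel model, pay particular attention to where tensors live: the input must be on the pipeline's first device, and the outputs and target must be on the same device when the loss is computed: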

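# First device of the pipeline; the input must be placed here
device = model.devices[0]

# Forward pass (input moved to the first device)
outputs = model(data.to(device))

# Compute the loss; Pipe returns outputs on the last device,
# so outputs and target are both moved onto one common device
loss = loss_fn(outputs.to(device), target.to(device))

# Backward pass and optimizer step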
loss.backward()
optimizer.step()
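
Putting the steps together, the following is a minimal sketch of a multi-step training loop, under a few assumptions that are not in the article: the helper names build_model and train are made up, the loop falls back to a plain CPU Sequential when fewer than two GPUs are visible (so it can run anywhere), and F.cross_entropy replaces F.nll_loss because the model emits raw scores. Instead of moving the outputs back to the first device, the target is moved to the last device, which avoids one transfer; either way, both tensors end up on the same device.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


def build_model():
    """Return (model, input_device, output_device)."""
    model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 5))
    if torch.cuda.device_count() >= 2:
        import fairscale  # imported only when a two-GPU pipeline is possible
        model = fairscale.nn.Pipe(model, balance=[2, 1])
        return model, model.devices[0], model.devices[-1]
    cpu = torch.device("cpu")  # assumption: CPU fallback so the sketch runs anywhere
    return model, cpu, cpu


def train(steps=5):
    torch.manual_seed(0)
    model, in_device, out_device = build_model()
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    data = torch.randn(20, 10)
    target = torch.randint(0, 2, size=(20,))
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()                # reset gradients every step
        outputs = model(data.to(in_device))  # input goes to the first device
        loss = F.cross_entropy(outputs, target.to(out_device))
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


losses = train()
print(len(losses))  # 5
```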