[Pytorch]基于DDP的单机多卡训练

博客围绕PyTorch、深度学习和Python展开,虽未给出具体内容,但可知涉及使用Python进行基于PyTorch的深度学习开发,这在信息技术领域是热门方向,可用于图像识别、自然语言处理等多种场景。
部署运行你感兴趣的模型镜像
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import torch.multiprocessing as mp
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
import os

class MyTrainDataset(Dataset):
    def __init__(self, size):
        self.size = size
        self.data = [(torch.rand(20), torch.rand(1)) for _ in range(size)]

    def __len__(self):
        return self.size
    
    def __getitem__(self, index):
        return self.data[index]


def ddp_setup(rank, world_size):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="gloo", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        optimizer: torch.optim.Optimizer,
        gpu_id: int,
        save_every: int,
    ) -> None:
        self.gpu_id = gpu_id
        self.model = model.to(gpu_id)
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.model = DDP(model, device_ids=[gpu_id])

    def _run_batch(self, source, targets):
        self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()
        self.optimizer.step()

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))[0])
        print(f"[GPU{self.gpu_id}] Epoch {epoch} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
        self.train_data.sampler.set_epoch(epoch)
        for source, targets in self.train_data:
            source = source.to(self.gpu_id)
            targets = targets.to(self.gpu_id)
            self._run_batch(source, targets)

    def _save_checkpoint(self, epoch):
        ckp = self.model.module.state_dict()
        PATH = "checkpoint.pt"
        torch.save(ckp, PATH)
        print(f"Epoch {epoch} | Training checkpoint saved at {PATH}")

    def train(self, max_epochs: int):
        for epoch in range(max_epochs):
            self._run_epoch(epoch)
            if self.gpu_id == 0 and epoch % self.save_every == 0:
                self._save_checkpoint(epoch)


def load_train_objs():
    train_set = MyTrainDataset(2048)  # load your dataset
    model = torch.nn.Linear(20, 1)  # load your model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    return train_set, model, optimizer


def prepare_dataloader(dataset: Dataset, batch_size: int):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        pin_memory=True,
        shuffle=False,
        sampler=DistributedSampler(dataset)
    )


def main(rank: int, world_size: int, save_every: int, total_epochs: int, batch_size: int):
    ddp_setup(rank, world_size)
    dataset, model, optimizer = load_train_objs()
    train_data = prepare_dataloader(dataset, batch_size)
    trainer = Trainer(model, train_data, optimizer, rank, save_every)
    trainer.train(total_epochs)
    destroy_process_group()

if __name__ == '__main__':
	world_size = torch.cuda.device_count()
	mp.spawn(main
         , args=(world_size
                ,10
                ,10
                ,10)
         , nprocs=world_size)

您可能感兴趣的与本文相关的镜像

PyTorch 2.5

PyTorch 2.5

PyTorch
Cuda

PyTorch 是一个开源的 Python 机器学习库,基于 Torch 库,底层由 C++ 实现,应用于人工智能领域,如计算机视觉和自然语言处理

下面是一个使用PyTorch DDP(分布式数据并行)进行训练的示例: 1.首先,需要使用torch.distributed.launch启动个进程,每个进程都运行相同的脚本并使用不同的参数。例如,在两台机器上运行以下命令: ``` # Machine 1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" --master_port=8888 train.py # Machine 2 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="10.0.0.1" --master_port=8888 train.py ``` 上面的命令将在两台机器上启动4个进程,每个进程使用2个GPU进行训练。 2.在代码中,使用torch.distributed初始化进程组,并将模型和数据加载到每个进程中。例如: ``` import torch import torch.nn as nn import torch.distributed as dist # Initialize distributed process group dist.init_process_group(backend='nccl', init_method='env://') # Load data and model train_data = ... train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True) model = nn.Sequential( nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10) ) # Distributed model and optimizer model = nn.parallel.DistributedDataParallel(model) optimizer = torch.optim.SGD(model.parameters(), lr=0.01) ``` 这里使用了nn.parallel.DistributedDataParallel将模型包装成分布式模型,使用torch.optim.SGD作为优化器。 3.在训练循环中,每个进程都会收集自己的梯度并将它们聚合到进程组中。然后,所有进程都将使用平均梯度更新模型参数。例如: ``` for epoch in range(10): for data, target in train_loader: optimizer.zero_grad() output = model(data) loss = nn.functional.cross_entropy(output, target) loss.backward() # All-reduce gradients for param in model.parameters(): dist.all_reduce(param.grad.data, op=dist.reduce_op.SUM) optimizer.step() ``` 在每个批次之后,使用dist.all_reduce将每个进程的梯度聚合到进程组中,然后使用平均梯度更新模型参数。 4.训练完成后,使用dist.destroy_process_group()关闭进程组并释放资源。例如: ``` dist.destroy_process_group() ``` 这个示例展示了如何使用PyTorch DDP进行训练。需要注意的是,使用DDP需要确保所有进程都能够访问相同的数据和模型,并且需要正确设置进程组中的参数。
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值