PyTorch distributed learning: distributed training

License: CC 4.0 BY-SA

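This post describes how to run distributed training in PyTorch. The emphasis is on debugging the model in a non-distributed setting first, then running it in distributed mode from the command line. The key steps are initializing the distributed backend in `__main__`, configuring the dataset, and launching multi-GPU training with the `torch.distributed.launch` tool. Note that the batch size should be at least equal to the number of participating nodes.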
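The batch-size note can be made concrete with a small check. This is only a sketch under the assumption that the batch size is a global value split evenly across processes; the helper name `per_process_batch_size` is hypothetical and not part of PyTorch.

```python
def per_process_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global batch evenly across `world_size` processes.

    Hypothetical helper: raises if some process would receive no samples.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if global_batch_size < world_size:
        raise ValueError(
            f"batch size {global_batch_size} is smaller than "
            f"the number of processes {world_size}"
        )
    # Integer division: any remainder is dropped in this simple sketch.
    return global_batch_size // world_size
```

For example, a global batch of 64 across 4 processes gives each process 16 samples, while a global batch of 2 across 4 processes is rejected.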

Reference: https://theaisummer.com/distributed-training-pytorch/

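In general, it is best to debug the model in a non-distributed setting first, then wrap it for distributed training and run it from the command line.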
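One way to keep the same script usable in both modes is to check whether the launcher's environment variables are present before doing any distributed setup. The sketch below assumes the launcher exports `RANK`, `WORLD_SIZE` and `LOCAL_RANK`; the function name `launched_distributed` is hypothetical.

```python
import os


def launched_distributed(env=None) -> bool:
    """Return True if the launcher's environment variables are all present.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    required = ("RANK", "WORLD_SIZE", "LOCAL_RANK")
    return all(name in env for name in required)
```

When the script is run directly with `python` for debugging, the check returns False and the distributed setup can be skipped; under the launcher it returns True.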

1. In `__main__`, initialize the distributed backend before parsing args


#============================Example 2=================================
import os

import torch
import torch.distributed as dist


def init_distributed():
    # Initialize the distributed backend, which takes care of synchronizing nodes/GPUs
    dist_url = "env://"  # default: read the rendezvous settings from environment variables

    # The variables below are only set when the script is launched with
    # torch.distributed.launch or torchrun
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(
        backend="nccl",
        init_method=dist_url,
        world_size=world_size,
        rank=rank,
    )
    # Bind this process to its own GPU so that plain .cuda() calls use the right device
    torch.cuda.set_device(local_rank)

    # synchron