Horovod: Usage Notes for the Distributed Deep Learning Framework


References:

Horovod GitHub homepage: https://github.com/horovod/horovod
Horovod example code: https://github.com/horovod/horovod/tree/master/examples

Training

To use Horovod, make the following additions to your program:

1. Run hvd.init().

2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. With the typical setup of one GPU per process, this can be set to local rank. In that case, the first process on the server will be allocated the first GPU, second process will be allocated the second GPU and so forth. 

3. Scale the learning rate by number of workers. Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.

4. Wrap optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

5. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. Alternatively, if you're not using MonitoredTrainingSession, you can simply execute the hvd.broadcast_global_variables op after global variables have been initialized.

6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. This can be accomplished by passing checkpoint_dir=None to tf.train.MonitoredTrainingSession if hvd.rank() != 0.

To summarize, the workflow is:
initialize Horovod → assign a GPU to each process → configure training hyperparameters (scale the learning rate) → broadcast the model parameters from rank 0 → wrap the optimizer in the distributed optimizer → save the model on rank 0 only

Horovod Demo: TensorFlow

import tensorflow as tf
import horovod.tensorflow as hvd


# Initialize Horovod (step 1)
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process, step 2)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model... (step 3: scale the learning rate by the number of workers)
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer (step 4)
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes
# during initialization (step 5).
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them (step 6).
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)
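
The example above uses MonitoredTrainingSession together with the broadcast hook. Step 5 also mentions an alternative for code that manages the session itself: run the hvd.broadcast_global_variables op once after the variables have been initialized. Below is a minimal sketch of that path; loss, opt, and config are assumed to be built exactly as in the example above, and num_batches is a placeholder.

# Minimal sketch of the step-5 alternative: a plain tf.Session instead of
# MonitoredTrainingSession. `loss`, `opt`, and `config` come from the code
# above; `num_batches` is a placeholder.
train_op = opt.minimize(loss)
bcast_op = hvd.broadcast_global_variables(0)  # broadcast initial values from rank 0

with tf.Session(config=config) as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(bcast_op)  # make every worker start from the same weights
  for _ in range(num_batches):
    sess.run(train_op)
  # Mirror step 6: only rank 0 writes a checkpoint
  if hvd.rank() == 0:
    tf.train.Saver().save(sess, '/tmp/train_logs/model.ckpt')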

Horovod Demo: MXNet

Gluon API (Gluon is MXNet's PyTorch-style imperative, dynamic-graph API)

from mxnet import autograd, gluon
import mxnet as mx
import horovod.mxnet as hvd

# Initialize Horovod (step 1)
hvd.init()

# Pin GPU to be used to process local rank (step 2)
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...
model.hybridize()

# Define hyper parameters 
optimizer_params = ...

# Add Horovod Distributed Optimizer (step 4)
opt = mx.optimizer.create('sgd', **optimizer_params)
opt = hvd.DistributedOptimizer(opt)

# Initialize parameters
model.initialize(initializer, ctx=context)

# Fetch and broadcast parameters (step 5)
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)

# Create trainer and loss function
trainer = gluon.Trainer(params, opt, kvstore=None)
loss_fn = ...

# Train model
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=[context],
                                          batch_axis=0)
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=[context],
                                           batch_axis=0)
        with autograd.record():
            outputs = [model(x.astype(dtype, copy=False)) for x in data]
            loss = [loss_fn(yhat, y) for yhat, y in zip(outputs, label)]
        for l in loss:
            l.backward()
        trainer.step(batch_size)
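
The optimizer_params placeholder above is where step 3 (scaling the learning rate by the number of workers) would be applied on the MXNet side. A minimal sketch with illustrative values follows; the concrete numbers are assumptions, not recommendations.

# Possible contents of the optimizer_params placeholder (assumed values);
# the base learning rate is scaled by the number of workers (step 3).
optimizer_params = {
    'learning_rate': 0.01 * num_workers,
    'momentum': 0.9,
    'wd': 0.0001,
}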

Module API

import mxnet as mx
import horovod.mxnet as hvd

# Initialize Horovod 
hvd.init()

# Pin GPU to be used to process local rank (step 2)
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()

# Build model
model = ...

# Define hyper parameters 
optimizer_params = ...

# Add Horovod Distributed Optimizer (step 4)
opt = mx.optimizer.create('sgd', **optimizer_params)
opt = hvd.DistributedOptimizer(opt)

# Initialize parameters
initializer = mx.init.Xavier(rnd_type='gaussian', factor_type="in",
                             magnitude=2)
model.bind(data_shapes=train_data.provide_data,
           label_shapes=train_data.provide_label)
model.init_params(initializer)

# Fetch and broadcast parameters (step 5)
(arg_params, aux_params) = model.get_params()
if arg_params:
    hvd.broadcast_parameters(arg_params, root_rank=0)
if aux_params:
    hvd.broadcast_parameters(aux_params, root_rank=0)
model.set_params(arg_params=arg_params, aux_params=aux_params)

# Train model
model.fit(train_data,
          kvstore=None,
          optimizer=opt,
          num_epoch=num_epoch)
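
The Module example stops at model.fit. Step 6 (saving checkpoints only on rank 0) still applies; a minimal sketch is shown below, where 'model_prefix' is an assumed path prefix rather than something from the original example.

# Save the trained Module only on rank 0 so other workers do not overwrite it
# (step 6). 'model_prefix' is an assumed checkpoint prefix.
if hvd.rank() == 0:
    model.save_checkpoint('model_prefix', num_epoch)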

Horovod Demo: PyTorch

import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd

# Initialize Horovod (step 1)
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process, step 2)
torch.cuda.set_device(hvd.local_rank())

# Define dataset...
train_dataset = ...

# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)

# Build model...
model = ...
model.cuda()

# Scale the base learning rate by the number of workers (step 3; 0.01 mirrors
# the TensorFlow example above)
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Add Horovod Distributed Optimizer (step 4)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Broadcast parameters from rank 0 to all other processes (step 5).
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch to the GPU pinned to this process
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
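
The PyTorch example broadcasts only the model parameters. Horovod also offers hvd.broadcast_optimizer_state for the optimizer buffers, and step 6 (checkpoints on rank 0 only) applies here as well. A minimal sketch follows; the broadcast call belongs right after hvd.broadcast_parameters, before the training loop, and the checkpoint path is an assumed placeholder.

# Broadcast the optimizer state from rank 0 as well (place this right after
# hvd.broadcast_parameters, before training starts) so momentum buffers etc.
# start out identical on every worker.
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Save checkpoints only on rank 0 (step 6); the path is an assumed placeholder.
if hvd.rank() == 0:
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict()}, '/tmp/checkpoint.pt')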

Inference (forward pass only)

What about inference? Inference may be done outside of the Python script that was used to train the model; such a script has no references to the Horovod library.

To run inference on a checkpoint generated by the Horovod-enabled training script, you should optimize the graph and keep only the operations necessary for a forward pass through the model. The Optimize for Inference script from the TensorFlow repository will do that for you: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference.py

If you want to convert your checkpoint to a Frozen Graph, do so after performing the optimization described above; otherwise the Freeze Graph script will fail to load the Horovod op and report an error like:

ValueError: No op named HorovodAllreduce in defined operations.
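
As a concrete illustration of the graph-optimization step described above, the following is a minimal sketch that uses the library form of the Optimize for Inference tool. It assumes the training run wrote its graph to /tmp/train_logs/graph.pbtxt (MonitoredTrainingSession's checkpoint saver writes one when checkpoint_dir is set) and that the inference input and output ops are named 'input' and 'output'; all of these names are assumptions, not part of the examples above.

# Minimal sketch: load the training GraphDef, keep only the forward-pass ops
# between the assumed 'input' and 'output' nodes, and write the result out.
import tensorflow as tf
from google.protobuf import text_format
from tensorflow.python.framework import dtypes
from tensorflow.python.tools import optimize_for_inference_lib

graph_def = tf.GraphDef()
with open('/tmp/train_logs/graph.pbtxt') as f:
    text_format.Merge(f.read(), graph_def)

optimized_def = optimize_for_inference_lib.optimize_for_inference(
    graph_def,
    ['input'],    # assumed input placeholder name
    ['output'],   # assumed output op name
    dtypes.float32.as_datatype_enum)

tf.train.write_graph(optimized_def, '/tmp/train_logs', 'optimized.pb',
                     as_text=False)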

Tensor Fusion: Horovod's computation and communication characteristics

One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability to batch small allreduce operations, which results in improved performance. We call this batching feature Tensor Fusion.

Tensor Fusion works by attempting to combine all the tensors that are ready to be reduced at a given moment in time into one reduction operation.
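
Tensor Fusion is enabled by default. Its behaviour can be tuned through environment variables that Horovod reads at startup, chiefly HOROVOD_FUSION_THRESHOLD (the fusion buffer size in bytes) and HOROVOD_CYCLE_TIME (how often, in milliseconds, ready tensors are gathered into a fused operation). A minimal sketch follows; the values are illustrative only, and the variables can just as well be exported in the shell that launches the job.

# Tune Tensor Fusion before hvd.init(); the values below are illustrative.
import os
os.environ['HOROVOD_FUSION_THRESHOLD'] = str(64 * 1024 * 1024)  # 64 MB fusion buffer
os.environ['HOROVOD_CYCLE_TIME'] = '5'  # look for fusable tensors every 5 ms

import horovod.tensorflow as hvd
hvd.init()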
