Horovod: Usage Guide for the Distributed Deep Learning Framework
References:
Horovod GitHub homepage
Horovod example code
Training
To use Horovod, make the following additions to your program:
1. Run hvd.init().
2. Pin a server GPU to be used by this process using config.gpu_options.visible_device_list. With the typical setup of one GPU per process, this can be set to the local rank. In that case, the first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.
3. Scale the learning rate by the number of workers. The effective batch size in synchronous distributed training is scaled by the number of workers, and an increase in learning rate compensates for the increased batch size.
4. Wrap optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.
5. Add hvd.BroadcastGlobalVariablesHook(0) to broadcast initial variable states from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint. Alternatively, if you're not using MonitoredTrainingSession, you can simply execute the hvd.broadcast_global_variables op after global variables have been initialized.
6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them. This can be accomplished by passing checkpoint_dir=None to tf.train.MonitoredTrainingSession if hvd.rank() != 0.
To summarize, the workflow is:
Horovod initialization -> process/GPU pinning -> training-parameter configuration -> model-parameter broadcast -> distributed optimizer -> model saving
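Before the framework-specific demos below, a tiny sanity check (an addition of mine, not part of the original examples) confirms that initialization and GPU pinning behave as expected: every process prints its global rank, its local rank, and the total worker count.
import horovod.tensorflow as hvd
hvd.init()
# With one GPU per process, local_rank() is the index of the GPU this process
# should pin; rank() and size() identify the worker globally.
print('worker %d of %d (local rank %d)' % (hvd.rank(), hvd.size(), hvd.local_rank()))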
Horovod_Demo_TensorFlow
import tensorflow as tf
import horovod.tensorflow as hvd
# Step 1: Initialize Horovod
hvd.init()
# Step 2: Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
# Step 3: Scale the learning rate by the number of workers
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
# Step 4: Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Step 5: Add hook to broadcast variables from rank 0 to all other processes
# during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# Step 6: Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
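The demo above leaves the input pipeline as loss = .... A common way to make sure each worker trains on a different slice of the data is to shard a tf.data pipeline by rank. This is a hedged sketch and not part of the original example; file_pattern, parse_fn, and batch_size are placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

def make_dataset(file_pattern, parse_fn, batch_size):
    """Per-worker input pipeline: each rank reads a disjoint shard of the files."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Give each worker its own subset of the input files.
    files = files.shard(num_shards=hvd.size(), index=hvd.rank())
    dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
    return dataset.map(parse_fn).shuffle(10000).batch(batch_size).repeat()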
Horovod_Demo_MXNet
Gluon API  # Gluon is MXNet's PyTorch-style imperative (dynamic graph) API
from mxnet import autograd, gluon
import mxnet as mx
import horovod.mxnet as hvd
# Step 1: Initialize Horovod
hvd.init()
# Step 2: Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()
# Build model
model = ...
model.hybridize()
# Define hyper parameters
optimizer_params = ...
# Step 4: Add Horovod Distributed Optimizer
opt = mx.optimizer.create('sgd', **optimizer_params)
opt = hvd.DistributedOptimizer(opt)
# Initialize parameters (e.g. with an mx.init.Xavier() initializer)
initializer = ...
model.initialize(initializer, ctx=context)
# Step 5: Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)
# Create trainer and loss function
trainer = gluon.Trainer(params, opt, kvstore=None)
loss_fn = ...
# Train model (num_epoch, dtype and batch_size are assumed to be defined above)
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=[context],
                                          batch_axis=0)
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=[context],
                                           batch_axis=0)
        with autograd.record():
            outputs = [model(x.astype(dtype, copy=False)) for x in data]
            loss = [loss_fn(yhat, y) for yhat, y in zip(outputs, label)]
        for l in loss:
            l.backward()
        trainer.step(batch_size)
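The Gluon demo omits evaluation. When every worker computes a metric on its own shard of the validation data, the per-worker values can be averaged with an allreduce. This hedged sketch uses horovod.mxnet's allreduce (which averages across workers by default) and is not part of the original example.
import mxnet as mx
import horovod.mxnet as hvd

def metric_average(value, name):
    """Average a Python scalar metric across all Horovod workers."""
    tensor = mx.nd.array([value])
    # Every worker receives the same global mean.
    return hvd.allreduce(tensor, name=name).asscalar()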
Module API
import mxnet as mx
import horovod.mxnet as hvd
# Initialize Horovod
hvd.init()
# Step 2: Pin GPU to be used to process local rank
context = mx.gpu(hvd.local_rank())
num_workers = hvd.size()
# Build model
model = ...
# Define hyper parameters
optimizer_params = ...
# Step 4: Add Horovod Distributed Optimizer
opt = mx.optimizer.create('sgd', **optimizer_params)
opt = hvd.DistributedOptimizer(opt)
# Initialize parameters
initializer = mx.init.Xavier(rnd_type='gaussian', factor_type="in",
                             magnitude=2)
model.bind(data_shapes=train_data.provide_data,
           label_shapes=train_data.provide_label)
model.init_params(initializer)
# Step 5: Fetch and broadcast parameters
(arg_params, aux_params) = model.get_params()
if arg_params:
    hvd.broadcast_parameters(arg_params, root_rank=0)
if aux_params:
    hvd.broadcast_parameters(aux_params, root_rank=0)
model.set_params(arg_params=arg_params, aux_params=aux_params)
# Train model
model.fit(train_data,
          kvstore=None,
          optimizer=opt,
          num_epoch=num_epoch)
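To mirror step 6 (save checkpoints only on worker 0) with the Module API, a checkpoint callback can be attached on rank 0 only. This is a hedged sketch, not part of the original example; the 'model' prefix is a placeholder and the rest of the fit() call repeats the demo above.
# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_callback = mx.callback.do_checkpoint('model') if hvd.rank() == 0 else None
model.fit(train_data,
          kvstore=None,
          optimizer=opt,
          num_epoch=num_epoch,
          epoch_end_callback=checkpoint_callback)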
Horovod_Demo_PyTorch
import torch
import torch.nn.functional as F
import torch.optim as optim
import horovod.torch as hvd
# Step 1: Initialize Horovod
hvd.init()
# Step 2: Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())
# Define dataset...
train_dataset = ...
# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)
# Build model...
model = ...
model.cuda()
# Step 3: scale the (illustrative 0.01) base learning rate by the number of workers
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())
# Step 4: Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Step 5: Broadcast parameters from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch to the GPU pinned to this process.
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        # args.log_interval is assumed to be defined by the surrounding script.
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
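One addition worth noting, shown here as a hedged extra rather than as part of the demo above: when the optimizer carries state of its own (for example SGD momentum buffers) or training resumes from a checkpoint, Horovod's PyTorch examples also broadcast the optimizer state from rank 0 right after the parameter broadcast.
# Keep optimizer state consistent across workers, in addition to the
# model parameters broadcast above.
hvd.broadcast_optimizer_state(optimizer, root_rank=0)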
Inference (forward pass only)
What about inference? Inference may be done outside of the Python script that was used to train the model. If you do this, that script will not have references to the Horovod library.
To run inference on a checkpoint generated by the Horovod-enabled training script, you should optimize the graph and keep only the operations necessary for a forward pass through the model. The Optimize for Inference script from the TensorFlow repository (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/tools/optimize_for_inference.py) will do that for you.
If you want to convert your checkpoint to a Frozen Graph, do so after the optimization described above; otherwise the Freeze Graph script will fail to load the Horovod op and report an error such as:
ValueError: No op named HorovodAllreduce in defined operations.
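The following is a minimal TF 1.x sketch of that optimize-then-freeze flow, not taken from the original text: the checkpoint directory, the 'input'/'output' node names, and the output path are illustrative placeholders. Importing horovod.tensorflow first registers the Horovod ops so the saved meta graph can be loaded at all.
import tensorflow as tf
import horovod.tensorflow as hvd  # registers Horovod ops so the meta graph loads
from tensorflow.python.tools import optimize_for_inference_lib

checkpoint_dir = '/tmp/train_logs'               # illustrative: directory written by worker 0
meta_path = checkpoint_dir + '/model.ckpt.meta'  # illustrative meta-graph file

with tf.Session() as sess:
    # Restore the graph and the variable values saved during training.
    saver = tf.train.import_meta_graph(meta_path)
    saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))
    # Keep only the ops needed for a forward pass from 'input' to 'output';
    # this strips the Horovod allreduce/broadcast training ops.
    graph_def = optimize_for_inference_lib.optimize_for_inference(
        sess.graph.as_graph_def(), ['input'], ['output'],
        tf.float32.as_datatype_enum)
    # Freezing now succeeds because no Horovod ops remain in the graph.
    frozen = tf.graph_util.convert_variables_to_constants(sess, graph_def, ['output'])
    with tf.gfile.GFile('/tmp/inference_graph.pb', 'wb') as f:
        f.write(frozen.SerializeToString())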
Tensor Fusion: how Horovod combines computation and communication
One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability to batch small allreduce operations, which results in improved performance. We call this batching feature Tensor Fusion.
Tensor Fusion works by attempting to combine all the tensors that are ready to be reduced at a given moment in time into one reduction operation.
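Tensor Fusion can be tuned through environment variables: HOROVOD_FUSION_THRESHOLD sets the fusion buffer size in bytes and HOROVOD_CYCLE_TIME the fusion cycle time in milliseconds. The sketch below is only illustrative (the 32 MB / 5 ms values are examples, not recommendations); Horovod reads these variables when it initializes, so they must be set before hvd.init(), or exported in the shell that launches the job.
import os
# Illustrative tuning values: 32 MB fusion buffer, 5 ms fusion cycle time.
os.environ.setdefault('HOROVOD_FUSION_THRESHOLD', str(32 * 1024 * 1024))
os.environ.setdefault('HOROVOD_CYCLE_TIME', '5')
import horovod.tensorflow as hvd  # any framework binding reads the same variables
hvd.init()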