Reposted from: Huggingface Accelerator Pitfall Guide | 广哥拉斯的树洞
Since the original lives on a personal website that might disappear someday, I copied it here for safekeeping.
Preface
My recent experiments use a fairly large dataset (millions of Chinese sentence pairs), and single-machine, single-GPU training was painfully slow, so I turned to single-machine multi-GPU training. I started with PyTorch's own torch.distributed package, which pulled a lot of configuration code into my project and left the original code structure a mess; on top of that, distributed training is inherently hard to debug, so I had to spend a lot of time fixing things bit by bit before anything ran at all (a tale of tears).
Then, by chance, I came across accelerate, developed by Hugging Face. After reading its introduction and examples, I couldn't help thinking: this looks far too convenient, and if something seems too good to be true, it usually is! Still, I decided to put it into practice and see whether the library is as easy to use as advertised.
My impressions so far (just one person's opinion):
- Like raw torch, the distributed environment is hard to debug, but it is still easier than torch
- Online tutorials are scarce and questions on the Hugging Face forum rarely get answered (activity is low, so you may have to fix bugs on your own…)
- There are a great many configuration options that "seem" to improve training efficiency, but as a beginner I can't make sense of them
- Far fewer code changes are needed, and training and inference look almost the same as on a single GPU
- accelerate supports both single-GPU and multi-GPU setups, and you can switch between them almost seamlessly
- There is no need to assign the model and the data to devices yourself; accelerate automatically places both on the correct device (which can greatly improve GPU memory utilization)
- It fixes messy logging: the logger that ships with accelerate prints only on the main process by default, so you are no longer dazzled by interleaved logs from who knows which process
- …and more than what I mentioned above
In short: worth a try.
The rest of this post covers basic usage of accelerate and the pitfalls I ran into. The code and explanations mainly follow Accelerate (huggingface.co).
Getting started
Adding an Accelerator
First, import and create an Accelerator instance; almost everything that follows depends on it.
from accelerate import Accelerator
accelerator = Accelerator()
Setting the device
If you need a device handle, set it up as follows; the accelerator will place your PyTorch objects on the correct device.
device = accelerator.device
# In my experience, only the model needs to be moved to the device explicitly;
# after that, .to(xxx) and .cuda() no longer appear anywhere in the code
model.to(device)
Wrapping objects with prepare()
prepare() gets every object you pass in ready for distributed training and mixed precision (which speeds up training), and returns them in the same order.
- The model you pass in must be a torch.nn.Module subclass; optimizers, dataloaders and LR schedulers are handled as well
- It is not needed for inference
- The returned objects can be used directly, whereas with torch.distributed you have to go through model.module to reach the actual model (see the short sketch after the code below)
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)
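To make the last bullet point concrete, here is a tiny sketch (mine, not from the original post): the prepared model can be used as-is, and accelerator.unwrap_model() gives you back the underlying nn.Module (for example before saving), which with raw torch.distributed DDP you would otherwise reach through model.module.

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 2)       # toy model, just for illustration
model = accelerator.prepare(model)  # may come back wrapped (e.g. in DDP) under multi-GPU

# Always returns the plain nn.Module, whatever wrapping prepare() applied
raw_model = accelerator.unwrap_model(model)
torch.save(raw_model.state_dict(), "model.pt")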
Replacing backward()
Use accelerator.backward() for backpropagation; many subsequent operations on gradients and losses go through this interface (for example gradient clipping, see the sketch after the snippet below).
loss.backward() -> accelerator.backward(loss)
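For instance, gradient clipping should also go through the accelerator so it stays correct under mixed precision and gradient accumulation. A minimal runnable sketch with a toy model (the max_norm value of 1.0 is arbitrary, my own choice):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 1)                            # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 4).to(accelerator.device)
y = torch.randn(8, 1).to(accelerator.device)

loss = torch.nn.functional.mse_loss(model(x), y)
accelerator.backward(loss)
# Clip through the accelerator (it handles gradient unscaling under fp16)
# instead of calling torch.nn.utils.clip_grad_norm_ directly
accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()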
A first complete version
Putting the pieces above together is enough to run, but to make it work in a distributed setting the code has to live inside a function; it cannot sit at the top level of the script. A first complete version therefore looks like this:
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )
    for batch in training_dataloader:
        optimizer.zero_grad()
        # Note: there is no need to move the data to a specific device here,
        # the accelerator does it for you. Traditionally you would first move
        # inputs and targets to the GPU with .cuda() or .to(device).
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        # No plain loss.backward() here!
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()
How to launch it
Officially it is recommended to run accelerate config first to set the training mode and related parameters, but I was too lazy to bother (truthfully, I didn't know how), so I will only cover the simplest way to launch. Similar to torch.distributed.launch, we use accelerate launch, followed by the script to run and whatever arguments you want to pass to it.
# Default: single machine, single GPU, no mixed precision (launching this way produces many warnings)
accelerate launch {script_name.py} --arg1 --arg2 ...
# Restrict which GPUs are used
CUDA_VISIBLE_DEVICES="0" accelerate launch {script_name.py} --arg1 --arg2 ...
# Single machine, multiple GPUs, no mixed precision
accelerate launch --multi_gpu {script_name.py} {--arg1} {--arg2} ...
# Single machine, multiple GPUs, mixed precision
accelerate launch --multi_gpu --mixed_precision=fp16 {script_name.py} {--arg1} {--arg2} ...
# Specify the number of processes
accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=2 {script_name.py} {--arg1} {--arg2} ...
# Specify the number of machines as well; this is my setup, and it produces no warnings
CUDA_VISIBLE_DEVICES="0,1" accelerate launch --multi_gpu --mixed_precision=fp16 --num_processes=2 --num_machines=1 train.py
All the other available options can be listed with accelerate launch -h.
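The trailing arguments after the script name are simply forwarded to your script, so you can parse them however you normally would; here is a minimal sketch (the argument names are made up for illustration). A script written with Accelerator should also run with a plain python train.py for quick single-process debugging.

# train.py -- launched e.g. as:
#   accelerate launch --multi_gpu --num_processes=2 train.py --lr 3e-4 --batch_size 32
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=16)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"lr={args.lr}, batch_size={args.batch_size}")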
Setting the seed
If you want reproducible experiments, setting the seed is essential. The helper accelerate provides is very easy to use; one line and you are done:
from accelerate.utils import set_seed

set_seed(42)
What it actually does:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
if is_tpu_available():
xm.set_rng_state(seed)
Setting up logging
I had always used loguru.logger, which prints very clear and well-organized logs, but during distributed training it cannot tell whether the current process is the main one, so the output was full of duplicated, out-of-order messages from different processes. You could of course check for the main process before every call, but that bloats the code and hurts readability; it looks roughly like the sketch below.
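For reference, the manual version is roughly this (my sketch, not the original author's code); every single call needs its own guard:

from accelerate import Accelerator
from loguru import logger

accelerator = Accelerator()

# Without a process-aware logger, each call has to be guarded by hand,
# otherwise every process prints its own copy of the message
if accelerator.is_local_main_process:
    logger.info("Epoch finished")
if accelerator.is_local_main_process:
    logger.info("Saving checkpoint...")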
The logging module that ships with accelerate solves all of this:
from accelerate.logging import get_logger

logger = get_logger(__name__)
# Printed by every process
logger.info("My log", main_process_only=False)
# Printed only by the main process; main_process_only defaults to True,
# so you normally don't need to set it explicitly
logger.debug("My log", main_process_only=True)
A pitfall at every step
Being so used to loguru.logger, I found that with the setup above nothing got printed at all! While I was scratching my head, I remembered that Python's built-in logging module needs logging.basicConfig(level=logging.INFO) before INFO-level messages are emitted, so the complete code should look like the following (which also suggests that accelerate's logging is a wrapper around Python's built-in logging):
from accelerate.logging import get_logger
import logging

logging.basicConfig(level=logging.INFO)
logger = get_logger(__name__)
logger.info("This log will be printed in the main process only")
Gradient accumulation
The official explanation:
Gradient accumulation is a technique where you can train on bigger batch sizes than your machine would normally be able to fit into memory. This is done by accumulating gradients over several batches, and only stepping the optimizer after a certain number of batches have been performed.
In short, instead of updating the weights on every batch, the gradients of n batches are accumulated before the optimizer takes a step, which is equivalent to training with an n-times-larger batch size. The accumulated update looks like:
$$\theta \leftarrow \theta - \sum_{i=1}^{n} \frac{\partial f_i}{\partial \theta}$$
A traditional gradient accumulation loop looks like this:
device = "cuda" | |
model.to(device) | |
gradient_accumulation_steps = 2 | |
for index, batch in enumerate(training_dataloader): | |
inputs, targets = batch | |
inputs = inputs.to(device) | |
targets = targets.to(device) | |
outputs = model(inputs) | |
loss = loss_function(outputs, targets) | |
loss = loss / gradient_accumulation_steps | |
loss.backward() | |
if (index + 1) % gradient_accumulation_steps == 0: | |
optimizer.step() | |
scheduler.step() | |
optimizer.zero_grad() |
With accelerate:

from accelerate import Accelerator

# Specify the accumulation steps (how many batches per update) when creating the accelerator
accelerator = Accelerator(gradient_accumulation_steps=2)

for batch in training_dataloader:
    # Wrap each step in accumulate() so the gradients are accumulated
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        # Note: no plain loss.backward() here either
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
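One extra detail (my addition, not in the original post): inside accumulate(), gradients are only synchronized across GPUs every gradient_accumulation_steps batches, and accelerator.sync_gradients tells you whether the current batch is one of those real update steps. This is useful if, for example, you only want to clip gradients or count optimizer updates when a step actually happens. A sketch extending the loop above (max_norm=1.0 is an arbitrary example value):

for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        # True only on batches where gradients are synchronized and the
        # optimizer performs a real update
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()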
Mixed precision
For the details please refer to other resources; the short version is that it speeds up training without noticeably affecting the results.
Add --mixed_precision when launching; it accepts fp16, bf16 and no, and defaults to no (disabled).
accelerate launch --multi_gpu --mixed_precision=fp16 {script_name.py} {--arg1} {--arg2}
Or specify it when creating the Accelerator:

# If not set here, it follows the mixed_precision passed at launch time, which defaults to no
accelerator = Accelerator(mixed_precision="bf16")
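Accelerate also exposes the configured policy as a context manager, accelerator.autocast(), which is handy when you compute something outside the prepared model's forward pass (a custom loss, say) and want it to run under the same precision policy. A minimal sketch, assuming a CUDA GPU is available; the toy model and tensors are my own:

import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")   # assumes a CUDA GPU is available
model = accelerator.prepare(torch.nn.Linear(8, 2))  # toy model for illustration
inputs = torch.randn(4, 8).to(accelerator.device)
targets = torch.randn(4, 2).to(accelerator.device)

# The prepared model's forward already runs under the chosen precision;
# autocast() applies the same policy to computations outside of it
with accelerator.autocast():
    outputs = model(inputs)
    loss = torch.nn.functional.mse_loss(outputs, targets)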
Model evaluation
Evaluation can run either on the main process only or distributed across all processes.
Evaluation on the main process
eval_loader = DataLoader(...)
...
# accelerator.is_local_main_process is the main process of the current machine (use it on a single machine)
# accelerator.is_main_process is the main process of the whole cluster (use it with multiple machines)
total_loss = 0
if accelerator.is_local_main_process:
    for i, batch in enumerate(eval_loader):
        inputs, targets = batch
        with torch.no_grad():
            out = model(inputs)
        loss = criterion(out, targets)
        total_loss += loss.item()
    logger.info(f'total eval loss is :{total_loss}')
Distributed evaluation

eval_loader = DataLoader(...)
# The eval_loader also has to go through prepare()
eval_loader = accelerator.prepare(eval_loader)
...
# accelerator.gather(tensor) collects the tensor from all processes
total_loss = 0
for i, batch in enumerate(eval_loader):
    inputs, targets = batch
    with torch.no_grad():
        out = model(inputs)
    # Gather all outputs and targets before computing the loss
    loss = criterion(accelerator.gather(out), accelerator.gather(targets))
    total_loss += loss.item()
logger.info(f'total eval loss is :{total_loss}')
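One caveat I would add (not in the original post): when the evaluation set is not evenly divisible by the number of processes, the prepared dataloader pads the last batch by repeating samples, so a plain gather() counts some examples more than once. Newer accelerate versions provide accelerator.gather_for_metrics(), which drops those duplicates; if your version has it, the loop above could instead look like:

total_loss = 0
for i, batch in enumerate(eval_loader):
    inputs, targets = batch
    with torch.no_grad():
        out = model(inputs)
    # gather_for_metrics() removes the samples duplicated to pad the last
    # batch, so the metric matches a single-process run
    all_out, all_targets = accelerator.gather_for_metrics((out, targets))
    loss = criterion(all_out, all_targets)
    total_loss += loss.item()
logger.info(f'total eval loss is :{total_loss}')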
A pitfall at every step
You might think: if accelerator.gather() is used during training as well, we can get the total training loss, just like on a single GPU. That is exactly what I thought at first, and I wrote this:
...
loss = criterion(accelerator.gather(out), accelerator.gather(targets))
accelerator.backward(loss)
total_train_loss += loss.item()
The model then simply stopped converging: after several epochs of training the loss would not go down. Later I checked the official description of gather:
Gather the values in tensor across all processes and concatenate them on the first dimension. Useful to regroup the predictions from all processes when doing evaluation.
In other words, it is meant to be used during evaluation, not during training!
But Why?
To understand why, we have to read the source code (talk is cheap…).
The source of gather():
def gather(tensor):
"""
Recursively gather tensor in a nested list/tuple/dictionary of tensors from all devices.
Args:
tensor (nested list/tuple/dictionary of `torch.Tensor`):
The data to gather.
Returns:
The same data structure as `tensor` with all tensors sent to the proper device.
"""
if AcceleratorState().distributed_type == DistributedType.TPU:
return _tpu_gather(tensor, name="accelerate.utils.gather")
elif AcceleratorState().distributed_type in CUDA_DISTRIBUTED_TYPES:
return _gpu_gather(tensor)
elif AcceleratorState().distributed_type == DistributedType.MULTI_CPU:
return _cpu_gather(tensor)
else:
return tensor
Nothing looks suspicious yet, so let's keep going and look at the source of _gpu_gather(tensor):
def _gpu_gather(tensor):
def _gpu_gather_one(tensor):
if tensor.ndim == 0:
tensor = tensor.clone()[None]
output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]
torch.distributed.all_gather(output_tensors, tensor)
return torch.cat(output_tensors, dim=0)
return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
Notice that it ultimately calls torch.distributed.all_gather, and that function does not propagate gradients! That is why the training loss never went down. So what if I still want the total training loss across all processes? I believe it can be done like this (untested):
total_train_loss = 0
for batch in training_dataloader:
    inputs, targets = batch
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # gather() returns a tensor with one entry per process; it is used only
    # for logging here, while backward() still runs on the local loss
    gathered_loss = accelerator.gather(loss.detach())
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    total_train_loss += gathered_loss.sum().item()
logger.info(f'total training loss is :{total_train_loss}')
Saving and loading the model
The library provides a way to save the training state; the official example goes like this:
from accelerate import Accelerator
import torch

accelerator = Accelerator()

my_scheduler = torch.optim.lr_scheduler.StepLR(my_optimizer, step_size=1, gamma=0.99)
my_model, my_optimizer, my_training_dataloader = accelerator.prepare(my_model, my_optimizer, my_training_dataloader)

# Register the scheduler for checkpointing
accelerator.register_for_checkpointing(my_scheduler)

# Save the current training state (the registered scheduler is included)
accelerator.save_state("my/save/path")

device = accelerator.device
my_model.to(device)

# Perform training
for epoch in range(num_epochs):
    for batch in my_training_dataloader:
        my_optimizer.zero_grad()
        inputs, targets = batch
        inputs = inputs.to(device)
        targets = targets.to(device)
        outputs = my_model(inputs)
        loss = my_loss_function(outputs, targets)
        accelerator.backward(loss)
        my_optimizer.step()
        my_scheduler.step()

# Restore the previously saved state
accelerator.load_state("my/save/path")
This approach has a few limitations:
- register_for_checkpointing only accepts objects that implement state_dict and load_state_dict
- It is not very portable: without accelerate, how do you get the model and the rest of the state back?
The approach I use
Saving the state
# Save only on the main process
if accelerator.is_local_main_process:
    # The module wrapped by prepare() may carry extra layers, so unwrap it first
    unwrap_model = accelerator.unwrap_model(model)
    unwrap_optim = accelerator.unwrap_model(optimizer)
    unwrap_lr = accelerator.unwrap_model(scheduler)
    torch.save({
        'model_state' : unwrap_model.state_dict(),
        'optim_state' : unwrap_optim.state_dict(),
        'lr_state' : unwrap_lr.state_dict()}, model_config['model_save_path'] + f'ckpt_{epoch+1}.pt')
    logger.info(f'checkpoint ckpt_{epoch+1}.pt is saved...')
Loading the state
ckpt = torch.load(model_config['model_save_path'] + model_config['state_ckpt'])
model.load_state_dict(ckpt['model_state'])
optimizer.load_state_dict(ckpt['optim_state'])
scheduler.load_state_dict(ckpt['lr_state'])
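One small precaution I would add when loading (a sketch reusing the config keys above): pass map_location="cpu" so that every process does not try to deserialize the tensors onto the GPU the checkpoint was saved from; the prepared objects will move the weights to the right device afterwards.

import torch

# Load to CPU first; with the default map_location all processes would try to
# place the tensors on the GPU the checkpoint was saved from (usually cuda:0)
ckpt = torch.load(model_config['model_save_path'] + model_config['state_ckpt'],
                  map_location="cpu")
model.load_state_dict(ckpt['model_state'])
optimizer.load_state_dict(ckpt['optim_state'])
scheduler.load_state_dict(ckpt['lr_state'])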
My code
A complete skeleton example:
...
import logging
import tqdm
import torch
from accelerate import Accelerator
from accelerate.logging import get_logger
from accelerate.utils import set_seed

logging.basicConfig(level=logging.INFO)
logger = get_logger(__name__)
device = None
model_config = {xxx: xxx
                ...
               }
...

def train():
    accelerator = Accelerator(gradient_accumulation_steps=model_config['accumulate_steps'])
    device = accelerator.device
    logger.info('Initializing model...')
    model = my_model()
    model.to(device)
    logger.info('Loading training & evaluation data...')
    train_loader, eval_loader = get_dataloader('train'), get_dataloader('eval')
    epochs = model_config['epochs']
    train_steps = epochs * len(train_loader)
    optimizer = get_optimizer()
    scheduler = get_scheduler()
    criterion = get_loss_func()
    if model_config['use_trained']:  # resume from a previously saved training state
        logger.info(f"Loading trained model :{model_config['trained_model']}")
        ckpt = torch.load(model_config['model_save_path'] + model_config['trained_model'])
        model.load_state_dict(ckpt['model_state'])
        optimizer.load_state_dict(ckpt['optim_state'])
        scheduler.load_state_dict(ckpt['lr_state'])
    logger.info('accelerator preparing...')
    model, train_loader, eval_loader, optimizer, scheduler = accelerator.prepare(
        model, train_loader, eval_loader, optimizer, scheduler)
    best_loss = 1e9
    for epoch in range(epochs):
        logger.info('=' * 10 + 'Start training' + '=' * 10)
        model.train()
        total_loss = 0
        pbar = tqdm.tqdm(enumerate(train_loader), total=len(train_loader),
                         disable=(not accelerator.is_local_main_process))
        for i, batch in pbar:
            # Enter accumulate() once per batch so gradients are only
            # synchronized every accumulate_steps batches
            with accelerator.accumulate(model):
                inputs, targets = batch
                out = model(inputs)
                loss = criterion(out, targets)
                accelerator.backward(loss)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
            pbar.set_description(f"epoch {epoch + 1} iter {i}: train loss {loss.item():.5f}. lr {scheduler.get_last_lr()[0]:e}")
            if accelerator.is_local_main_process:
                total_loss += loss.item()
        logger.info(f'Total local training loss is: {total_loss}', main_process_only=True)
        logger.info('=' * 10 + 'Start evaluating' + '=' * 10, main_process_only=True)
        model.eval()
        if accelerator.is_local_main_process:
            total_loss = 0
        pbar = tqdm.tqdm(enumerate(eval_loader), total=len(eval_loader),
                         disable=(not accelerator.is_local_main_process))
        for i, batch in pbar:
            inputs, targets = batch
            with torch.no_grad():
                out = model(inputs)
            loss = criterion(accelerator.gather(out), accelerator.gather(targets))
            pbar.set_description(f"epoch {epoch + 1} iter {i}: eval loss {loss.item():.5f}")
            if accelerator.is_local_main_process:
                total_loss += loss.item()
        logger.info(f'Total evaluating loss is: {total_loss}', main_process_only=True)
        logger.info('Saving checkpoint...')
        # Using this or not makes little difference here; it simply waits for
        # every process to reach this point before moving on
        accelerator.wait_for_everyone()
        if accelerator.is_local_main_process:
            unwrap_model = accelerator.unwrap_model(model)
            unwrap_optim = accelerator.unwrap_model(optimizer)
            unwrap_lr = accelerator.unwrap_model(scheduler)
            torch.save({
                'model_state' : unwrap_model.state_dict(),
                'optim_state' : unwrap_optim.state_dict(),
                'lr_state' : unwrap_lr.state_dict()}, model_config['model_save_path'] + f'ckpt_{epoch+1}.pt')
            logger.info(f'checkpoint ckpt_{epoch+1}.pt is saved...')
            if best_loss > total_loss:
                logger.info('The best model found!')
                best_loss = total_loss
                torch.save({
                    'model_state' : unwrap_model.state_dict(),
                    'optim_state' : unwrap_optim.state_dict(),
                    'lr_state' : unwrap_lr.state_dict()}, model_config['model_save_path'] + f'ggec_best.pt')
                logger.info('The best model saved!')

if __name__ == "__main__":
    set_seed(model_config['seed'])
    train()
Closing thoughts
This post is only my shallow understanding and summary of accelerate. The library has plenty of more advanced features: for example, when the model is so large that it is hard to settle on a suitable batch size, accelerate can search for a workable batch size automatically (see the sketch below), and it also integrates DeepSpeed for further acceleration, among other things.
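For the automatic batch-size search, accelerate.utils exposes the find_executable_batch_size decorator. A rough sketch of the documented usage pattern, where get_dataloader and the rest of the loop body are placeholders of my own:

from accelerate import Accelerator
from accelerate.utils import find_executable_batch_size

def training_function():
    accelerator = Accelerator()

    @find_executable_batch_size(starting_batch_size=128)
    def inner_training_loop(batch_size):
        # On a CUDA out-of-memory error the decorator frees memory and retries
        # this whole function with a smaller batch_size
        accelerator.free_memory()
        train_loader = get_dataloader('train', batch_size=batch_size)  # placeholder helper
        ...  # build model/optimizer, call accelerator.prepare(), and train as usual

    inner_training_loop()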
If you have any questions, feel free to leave a comment (╯▔皿▔)╯
References