PyTorch Major Update: Automatic Mixed Precision Training Is Coming!

紫云东启

Article link: https://blog.youkuaiyun.com/yusujifeng/article/details/109209667

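Mixed precision training speeds up model training with as little loss in model quality as possible, and it also reduces GPU memory usage. The mainstream deep learning frameworks have all started to support it. In PyTorch, mixed precision training has so far mainly relied on NVIDIA's open-source apex library. However, PyTorch is about to receive a major update: built-in support for mixed precision training, and automatic mixed precision training at that:
torch.cuda.amp.autocast: automatically chooses the precision for GPU ops to improve training performance without reducing model accuracy.
torch.cuda.amp.GradScaler: scales gradients to help the model converge, because float16 gradients are prone to underflow (values too small to represent).
Used together, the two provide automatic mixed precision training: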

    import torch.optim as optim
    from torch.cuda.amp import GradScaler, autocast

    # Creates model and optimizer in default precision
    model = Net().cuda()
    optimizer = optim.SGD(model.parameters(), ...)

    # Creates a GradScaler once at the beginning of training.
    scaler = GradScaler()

    for epoch in epochs:
        for input, target in data:
            optimizer.zero_grad()

            # Runs the forward pass with autocasting.
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)

            # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
            # Backward passes under autocast are not recommended.
            # Backward ops run in the same precision that autocast used for corresponding forward ops.
            scaler.scale(loss).backward()

            # scaler.step() first unscales the gradients of the optimizer's assigned params.
            # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
            # otherwise, optimizer.step() is skipped.
            scaler.step(optimizer)

            # Updates the scale for next iteration.
            scaler.update()
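
As you can see, to prevent gradient underflow, scaler.scale(loss) first multiplies the loss by a scale factor, so that during backward() every gradient is multiplied by the same factor. This keeps the gradient magnitudes large enough that they do not flush to zero. We do not want this scale factor to affect the learning rate, so scaler.step(optimizer) first unscales the gradients and only then applies the update. If the gradients contain infs or NaNs, the optimizer step is skipped for that iteration.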
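
To make the underflow argument concrete, here is a minimal sketch in plain Python. It does not use PyTorch: the standard-library struct module emulates float16 rounding, and ToyGradScaler is a hypothetical, heavily simplified stand-in for GradScaler, not its real implementation. The default values mirror the documented GradScaler defaults (init_scale=65536.0, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000), and the gradient value 1e-8 is chosen purely for illustration.

```python
import math
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class ToyGradScaler:
    """Hypothetical, simplified stand-in for GradScaler (illustration only)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def step_and_update(self, scaled_grads):
        """Unscale the gradients; return them, or None if the step is skipped."""
        grads = [g / self.scale for g in scaled_grads]
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: skip this step and shrink the scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return None
        # Clean step: count it and grow the scale after enough of them.
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= self.growth_factor
            self.good_steps = 0
        return grads


tiny_grad = 1e-8                      # below the smallest float16 subnormal
scaler = ToyGradScaler()

lost = to_float16(tiny_grad)                   # rounds to zero in float16
kept = to_float16(tiny_grad * scaler.scale)    # survives once scaled
recovered = scaler.step_and_update([kept])     # unscaled in full precision
skipped = scaler.step_and_update([math.inf])   # simulated overflow

print("float16 without scaling:", lost)
print("float16 with scaling:   ", kept)
print("after unscaling:        ", recovered[0])
print("overflow step skipped:  ", skipped is None, "| new scale:", scaler.scale)
```

The unscaled gradient is lost entirely in float16, while the scaled one survives and is recovered to within float16 rounding error after dividing by the scale. The simulated overflow shows the other half of the mechanism: the step is skipped and the scale is halved for the next iteration.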

If you want to clip the gradients before the update, that is possible too:

    scaler = GradScaler()

    for epoch in epochs:
        for input, target in data:
            optimizer.zero_grad()
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
            scaler.scale(loss).backward()

            # Unscales the gradients of optimizer's assigned params in-place
            scaler.unscale_(optimizer)

            # Since the gradients of optimizer's assigned params are unscaled, clips as usual:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

            # optimizer's gradients are already unscaled, so scaler.step does not unscale them,
            # although it still skips optimizer.step() if the gradients contain infs or NaNs.
            scaler.step(optimizer)

            # Updates the scale for next iteration.
            scaler.update()

Of course, mixed precision training also has to work with distributed training. Because autocast is thread-local, the following cases need attention.

If you use torch.nn.DataParallel: there is only one process, and the forward pass on each GPU runs in its own thread, so the following has no effect:

    model = MyModel()
    dp_model = nn.DataParallel(model)

    # Sets autocast in the main thread
    with autocast():
        # dp_model's internal threads won't autocast.  The main thread's autocast state has no effect.
        output = dp_model(input)
        # loss_fn still autocasts, but it's too late...
        loss = loss_fn(output)

In this case you need to apply autocast to the model's forward method, either as a decorator or as a context manager inside it:

    class MyModel(nn.Module):
        ...
        @autocast()
        def forward(self, input):
            ...

    # Alternatively
    class MyModel(nn.Module):
        ...
        def forward(self, input):
            with autocast():
                ...

    model = MyModel()
    dp_model = nn.DataParallel(model)

    with autocast():
        output = dp_model(input)
        loss = loss_fn(output)

If you use torch.nn.parallel.DistributedDataParallel: the usual setup is one GPU per process, in which case the original usage works as is. With multiple GPUs in one process, however, the same problem arises, and the model's forward has to be wrapped with autocast in the same way.
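
The thread-local behaviour described above can be illustrated with an analogy in plain Python. The sketch below does not call PyTorch: toy_autocast, toy_is_enabled and run_in_worker are hypothetical names, and the per-thread flag built on threading.local only assumes that autocast state behaves the way the text describes. It shows why a context entered in the main thread is invisible to a worker thread, and why entering it inside forward fixes that.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for autocast's per-thread state; not PyTorch code.
_state = threading.local()


def toy_is_enabled() -> bool:
    """Return the flag for the calling thread (disabled by default)."""
    return getattr(_state, "enabled", False)


@contextmanager
def toy_autocast():
    """Enable the flag for the calling thread only, restoring it on exit."""
    previous = toy_is_enabled()
    _state.enabled = True
    try:
        yield
    finally:
        _state.enabled = previous


def run_in_worker(fn):
    """Run fn in a new thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn()))
    worker.start()
    worker.join()
    return result[0]


def forward():
    # Relies on the caller's context, which a worker thread does not inherit.
    return toy_is_enabled()


def forward_with_autocast():
    # Enters the context itself, so it works in whichever thread runs it.
    with toy_autocast():
        return toy_is_enabled()


with toy_autocast():
    main_thread = toy_is_enabled()
    worker_plain = run_in_worker(forward)
    worker_wrapped = run_in_worker(forward_with_autocast)

print("main thread:", main_thread)
print("worker, plain forward:", worker_plain)
print("worker, forward with its own context:", worker_wrapped)
```

In this analogy the plain forward corresponds to the DataParallel case that has no effect, and forward_with_autocast corresponds to decorating or wrapping the model's forward method.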