深度学习训练模型时,GPU显存不够怎么办?

PyTorch AMP 实战
本文通过使用AlexNet网络架构、CIFAR10数据集、AdamW优化器及ReduceLROnPlateau学习率机制,详细介绍PyTorch自动混合精度(AMP)的使用方法,包括autocast和GradScaler的应用技巧,旨在加速模型训练过程。
部署运行你感兴趣的模型镜像

‍作者 | 游客26024 编辑 | 极市平台

原文链接:zhihu.com/question/461811359/answer/2492822726

点击下方卡片,关注“自动驾驶之心”公众号

ADAS巨卷干货,即可获取

点击进入→自动驾驶之心【全栈算法】技术交流群

导读

 

此篇博文以AlexNet为网络架构(其需要输入的图像大小为227x227x3),CIFAR10为数据集,Adamw为梯度下降函数,学习率机制为ReduceLROnPlateau举例。旨为如何让网络模型加速训练,而非去了解其原理。

题外话,我为什么要写这篇博客,就是因为我穷没钱!租的服务器使用多GPU时一会钱就烧没了(gpu内存不用),急需要一种trick,来降低内存加速。

回到正题,如果我们使用的数据集较大,且网络较深,则会造成训练较慢,此时我们要想加速训练可以使用Pytorch的AMPautocast与Gradscaler);本文便是依据此写出的博文,对Pytorch的AMP(autocast与Gradscaler进行对比)自动混合精度对模型训练加速

注意Pytorch1.6+,已经内置torch.cuda.amp,因此便不需要加载NVIDIA的apex库(半精度加速),为方便我们便不使用NVIDIA的apex库(安装麻烦),转而使用torch.cuda.amp

AMP (Automatic mixed precision): 自动混合精度,那什么是自动混合精度

先来梳理一下历史:先有NVIDIA的apex,之后NVIDIA的开发人员将其贡献到Pytorch 1.6+产生了torch.cuda.amp[这是笔者梳理,可能有误,请留言]

详细讲:默认情况下,大多数深度学习框架都采用32位浮点算法进行训练。2017年,NVIDIA研究了一种用于混合精度训练的方法(apex),该方法在训练网络时将单精度(FP32)与半精度(FP16)结合在一起,并使用相同的超参数实现了与FP32几乎相同的精度,且速度比之前快了不少

之后,来到了AMP时代(特指torch.cuda.amp),此有两个关键词:自动混合精度(Pytorch 1.6+中的torch.cuda.amp)其中,自动表现在Tensor的dtype类型会自动变化,框架按需自动调整tensor的dtype,可能有些地方需要手动干预;混合精度表现在采用不止一种精度的Tensor, torch.FloatTensor与torch.HalfTensor。并且从名字可以看出torch.cuda.amp,这个功能只能在cuda上使用

为什么我们要使用AMP自动混合精度?

1.减少显存占用(FP16优势)

2.加快训练和推断的计算(FP16优势)

3.张量核心的普及(NVIDIA Tensor Core),低精度(FP16优势)

4. 混合精度训练缓解舍入误差问题,(FP16有此劣势,但是FP32可以避免此)

5.损失放大,可能使用混合精度还会出现无法收敛的问题[其原因时激活梯度值较小],造成了溢出,则可以通过使用torch.cuda.amp.GradScaler放大损失来防止梯度的下溢

申明此篇博文主旨如何让网络模型加速训练,而非去了解其原理,且其以AlexNet为网络架构(其需要输入的图像大小为227x227x3),CIFAR10为数据集,Adamw为梯度下降函数,学习率机制为ReduceLROnPlateau举例。使用的电脑是2060的拯救者,虽然渣,但是还是可以搞搞这些测试。

本文从1.没使用DDP与DP训练与评估代码(之后加入amp),2.分布式DP训练与评估代码(之后加入amp),3.单进程占用多卡DDP训练与评估代码(之后加入amp) 角度讲解。

运行此程序时,文件的结构:

D:/PycharmProject/Simple-CV-Pytorch-master
|
|
|
|----AMP(train_without.py、train_DP.py、train_autocast.py、train_GradScaler.py、eval_XXX.py
|等,之后加入的alexnet也在这里,alexnet.py)
|
|
|
|----tensorboard(保存tensorboard的文件夹)
|
|
|
|----checkpoint(保存模型的文件夹)
|
|
|
|----data(数据集所在文件夹)

1.没使用DDP与DP训练与评估代码

没使用DDP与DP的训练与评估实验,作为我们实验的参照组

(1)原本模型的训练与评估源码:

训练源码:

注意:此段代码无比简陋,仅为代码的雏形,大致能理解尚可!

train_without.py

import time
import torch
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision.models import alexnet
from torchvision import transforms
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)
 
# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))
 
# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))
 
# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)
 
# 5.Create model
model = alexnet()
 
if args.cuda == cuda:
    model = model.cuda()
 
# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()
 
# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        outputs = model(imgs)
        loss_train = cross_entropy_loss(outputs, targets)
        loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
 
        optim.zero_grad()
        loss_train.backward()
        optim.step()
        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) // 60
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()

运行结果:

4e1df4460660988a658ac216a80f954a.jpeg 1840f884ec24aa21c4e6bc365a982cd5.jpeg

Tensorboard观察:

3dffcafc6186f06c5cc7d1e71139158f.jpeg 14c5d4b86eba166fe9591c563f57774a.jpeg

评估源码:

代码特别粗犷,尤其是device与精度计算,仅供参考,切勿模仿!

eval_without.py

import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
from alexnet import alexnet
import argparse
 
 
# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
# 1.Create model
model = alexnet()
 
 
# 2.Ready Dataset
if args.dataset == 'CIFAR10':
    test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                transform=transforms.Compose(
                                                    [transforms.Resize(args.img_size),
                                                     transforms.ToTensor()]),
                                                download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
# 3.Length
test_dataset_size = len(test_dataset)
print("the test dataset size is {}".format(test_dataset_size))
 
# 4.DataLoader
test_dataloader = DataLoader(dataset=test_dataset, batch_size=args.batch_size)
 
# 5. Set some parameters for testing the network
total_accuracy = 0
 
# test
model.eval()
with torch.no_grad():
    for data in test_dataloader:
        imgs, targets = data
        device = torch.device('cpu')
        imgs, targets = imgs.to(device), targets.to(device)
        model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
        model.load_state_dict(model_load)
        outputs = model(imgs)
        outputs = outputs.to(device)
        accuracy = (outputs.argmax(1) == targets).sum()
        total_accuracy = total_accuracy + accuracy
        accuracy = total_accuracy / test_dataset_size
    print("the total accuracy is {}".format(accuracy))

运行结果:

a9215706833a86154d8b919c6bf7feee.jpeg

分析:

原本模型训练完20个epochs花费了22分22秒,得到的准确率为0.8191

(2)原本模型加入autocast的训练与评估源码:

训练源码:

训练大致代码流程:

from torch.cuda.amp import autocast as autocast
 
...
 
# Create model, default torch.FloatTensor
model = Net().cuda()
 
# SGD,Adm, Admw,...
optim = optim.XXX(model.parameters(),..)
 
...
 
for imgs,targets in dataloader:
    imgs,targets = imgs.cuda(),targets.cuda()
 
    ....
    with autocast():
        outputs = model(imgs)
        loss = loss_fn(outputs,targets)
   ...
    optim.zero_grad()
    loss.backward()
    optim.step()
 
...

train_autocast_without.py

import time
import torch
import torchvision
from torch import nn
from torch.cuda.amp import autocast
from torchvision import transforms
from torchvision.models import alexnet
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)
 
# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))
 
# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))
 
# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)
 
# 5.Create model
model = alexnet()
 
if args.cuda == cuda:
    model = model.cuda()
 
# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()
 
# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        with autocast():
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
        loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
 
        optim.zero_grad()
        loss_train.backward()
        optim.step()
        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) // 60
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()

运行结果:

35de10d8f4e70b6fe376d8645956a376.jpeg f6e8b1a22e4b4288fc189ea87de4b7b6.jpeg

Tensorboard观察:

ed9ba2c3188c5b60e831cd1a49e4d854.jpeg ef426d241ad3c076eb556388c862fb0b.jpeg

评估源码:

eval_without.py 和 1.(1)一样

运行结果:

ade6b289cbd6f3f068d9f3397180ad52.jpeg

分析:

原本模型训练完20个epochs花费了22分22秒,加入autocast之后模型花费的时间为21分21秒,说明模型速度增加了,并且准确率从之前的0.8191提升到0.8403

(3)原本模型加入autocast与GradScaler的训练与评估源码:

使用torch.cuda.amp.GradScaler是放大损失值来防止梯度的下溢

训练源码:

训练大致代码流程:

from torch.cuda.amp import autocast as autocast
from torch.cuda.amp import GradScaler as GradScaler
...
 
# Create model, default torch.FloatTensor
model = Net().cuda()
 
# SGD,Adm, Admw,...
optim = optim.XXX(model.parameters(),..)
scaler = GradScaler()
 
...
 
for imgs,targets in dataloader:
    imgs,targets = imgs.cuda(),targets.cuda()
    ...
    optim.zero_grad()
    ....
    with autocast():
        outputs = model(imgs)
        loss = loss_fn(outputs,targets)
 
    scaler.scale(loss).backward()
    scaler.step(optim)
    scaler.update()
...

train_GradScaler_without.py

import time
import torch
import torchvision
from torch import nn
from torch.cuda.amp import autocast, GradScaler
from torchvision import transforms
from torchvision.models import alexnet
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)
 
# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))
 
# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))
 
# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)
 
# 5.Create model
model = alexnet()
 
if args.cuda == cuda:
    model = model.cuda()
 
# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()
 
# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
scaler = GradScaler()
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        optim.zero_grad()
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        with autocast():
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
 
        scaler.scale(loss_train).backward()
        scaler.step(optim)
        scaler.update()
        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) // 60
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()

运行结果:

ba7afc1dfc996461d9e076115ae18fb0.jpeg 3d6eae860a949888f333365506602892.jpeg

Tensorboard观察:

65104096742666bfff263383a85e5ef4.jpeg 4404995a273831ed320bd2530166518f.jpeg

评估源码:

eval_without.py 和 1.(1)一样

运行结果:

2e1322b7e9018b42ac9b27992880673c.jpeg

分析:

为什么,我们训练完20个epochs花费了27分27秒,比之前原模型未使用任何amp的时间(22分22秒)都多了?

这是因为我们使用了GradScaler放大了损失降低了模型训练的速度,还有个原因可能是笔者自身的显卡太小,没有起到加速的作用

2.分布式DP训练与评估代码

(1)DP原本模型的训练与评估源码:

训练源码:

train_DP.py

import time
import torch
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision.models import alexnet
from torchvision import transforms
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)
 
# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))
 
# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))
 
# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)
 
# 5.Create model
model = alexnet()
 
if args.cuda == cuda:
    model = model.cuda()
    model = torch.nn.DataParallel(model).cuda()
else:
    model = torch.nn.DataParallel(model)
 
# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()
 
# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        outputs = model(imgs)
        loss_train = cross_entropy_loss(outputs, targets)
        loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
 
        optim.zero_grad()
        loss_train.backward()
        optim.step()
        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) // 60
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()

运行结果:

6a1f5328e2cd96e682ac0d8bd59b6e27.jpeg 56ce2d1070dca696f923fe5d374ba631.jpeg

Tensorboard观察:

198d69e3911c53594ff4a8785b121581.jpeg f0f9011f76e0ae0eb4b415c35e33e0a7.jpeg

评估源码:

eval_DP.py

import torch
import torchvision
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
from alexnet import alexnet
import argparse
 
 
# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
# 1.Create model
model = alexnet()
model = torch.nn.DataParallel(model)
 
# 2.Ready Dataset
if args.dataset == 'CIFAR10':
    test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                transform=transforms.Compose(
                                                    [transforms.Resize(args.img_size),
                                                     transforms.ToTensor()]),
                                                download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
# 3.Length
test_dataset_size = len(test_dataset)
print("the test dataset size is {}".format(test_dataset_size))
 
# 4.DataLoader
test_dataloader = DataLoader(dataset=test_dataset, batch_size=args.batch_size)
 
# 5. Set some parameters for testing the network
total_accuracy = 0
 
# test
model.eval()
with torch.no_grad():
    for data in test_dataloader:
        imgs, targets = data
        device = torch.device('cpu')
        imgs, targets = imgs.to(device), targets.to(device)
        model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
        model.load_state_dict(model_load)
        outputs = model(imgs)
        outputs = outputs.to(device)
        accuracy = (outputs.argmax(1) == targets).sum()
        total_accuracy = total_accuracy + accuracy
        accuracy = total_accuracy / test_dataset_size
    print("the total accuracy is {}".format(accuracy))

运行结果:

555aa372a840bb7ca1771e7664ebac15.jpeg

(2)DP使用autocast的训练与评估源码:

训练源码:

如果你这样写代码,那么你的代码无效!!!

...
    model = Model()
    model = torch.nn.DataParallel(model)
    ...
    with autocast():
        output = model(imgs)
        loss = loss_fn(output)

正确写法,训练大致流程代码:

1.Model(nn.Module):
      @autocast()
      def forward(self, input):
      ...
 
2.Model(nn.Module):
      def foward(self, input):
          with autocast():
              ...

1与2皆可,之后:

...
model = Model()
model = torch.nn.DataParallel(model)
with autocast():
    output = model(imgs)
    loss = loss_fn(output)
...

模型:

须在forward函数上加入@autocast()或者在forward里面最上面加入with autocast():

alexnet.py

import torch
import torch.nn as nn
from torchvision.models.utils import load_state_dict_from_url
from torch.cuda.amp import autocast
from typing import Any
 
__all__ = ['AlexNet', 'alexnet']
 
model_urls = {
    'alexnet': 'https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth',
}
 
 
class AlexNet(nn.Module):
 
    def __init__(self, num_classes: int = 1000) -> None:
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )
 
    @autocast()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
 
 
def alexnet(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> AlexNet:
    r"""AlexNet model architecture from the
    `"One weird trick..." <https://arxiv.org/abs/1404.5997>`_ paper.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        progress (bool): If True, displays a progress bar of the download to stderr
    """
    model = AlexNet(**kwargs)
    if pretrained:
        state_dict = load_state_dict_from_url(model_urls["alexnet"],
                                              progress=progress)
        model.load_state_dict(state_dict)
    return model

train_DP_autocast.py 导入自己的alexnet.py

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torch.cuda.amp import autocast as autocast
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)
 
# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))
 
# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))
 
# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)
 
# 5.Create model
model = alexnet()
 
if args.cuda == cuda:
    model = model.cuda()
    model = torch.nn.DataParallel(model).cuda()
else:
    model = torch.nn.DataParallel(model)
 
# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()
 
# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        with autocast():
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
        loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
 
        optim.zero_grad()
        loss_train.backward()
        optim.step()
        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) // 60
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()

运行结果:

00504ec7e4dd349923e8e13754fdb1e2.jpeg e3cb67be3cbef45b9695746367212662.jpeg

Tensorboard观察:

9816cf1ac1447c25b96511bf165bfd83.jpeg 25825b0c7e3496e3b51adda9583cd7fc.jpeg

评估源码:

eval_DP.py 相比与2. (1)导入自己的alexnet.py

运行结果:

f1c3b4e8a8862704b648badce2af1b21.jpeg

分析:

可以看出DP使用autocast训练完20个epochs时需要花费的时间是21分21秒,相比与之前DP没有使用的时间(22分22秒)快了1分1秒

之前DP未使用amp能达到准确率0.8216,而现在准确率降低到0.8188,说明还是使用自动混合精度加速还是对模型的准确率有所影响,后期可通过增大batch_sizel让运行时间和之前一样,但是准确率上升,来降低此影响

(3)DP使用autocast与GradScaler的训练与评估源码:

训练源码:

train_DP_GradScaler.py 导入自己的alexnet.py

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torch.cuda.amp import autocast as autocast
from torch.cuda.amp import GradScaler as GradScaler
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
# 1.Create SummaryWriter
if args.tensorboard:
    writer = SummaryWriter(args.tensorboard_log)
 
# 2.Ready dataset
if args.dataset == 'CIFAR10':
    train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
        [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
else:
    raise ValueError("Dataset is not CIFAR10")
cuda = torch.cuda.is_available()
print('CUDA available: {}'.format(cuda))
 
# 3.Length
train_dataset_size = len(train_dataset)
print("the train dataset size is {}".format(train_dataset_size))
 
# 4.DataLoader
train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size)
 
# 5.Create model
model = alexnet()
 
if args.cuda == cuda:
    model = model.cuda()
    model = torch.nn.DataParallel(model).cuda()
else:
    model = torch.nn.DataParallel(model)
 
# 6.Create loss
cross_entropy_loss = nn.CrossEntropyLoss()
 
# 7.Optimizer
optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
scaler = GradScaler()
# 8. Set some parameters to control loop
# epoch
iter = 0
t0 = time.time()
for epoch in range(args.epochs):
    t1 = time.time()
    print(" -----------------the {} number of training epoch --------------".format(epoch))
    model.train()
    for data in train_dataloader:
        loss = 0
        imgs, targets = data
        optim.zero_grad()
        if args.cuda == cuda:
            cross_entropy_loss = cross_entropy_loss.cuda()
            imgs, targets = imgs.cuda(), targets.cuda()
        with autocast():
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
        if args.tensorboard:
            writer.add_scalar("train_loss", loss_train.item(), iter)
 
        scaler.scale(loss_train).backward()
        scaler.step(optim)
        scaler.update()
 
        iter = iter + 1
        if iter % 100 == 0:
            print(
                "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                    .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                            np.mean(loss)))
    if args.tensorboard:
        writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
    scheduler.step(np.mean(loss))
    t2 = time.time()
    h = (t2 - t1) // 3600
    m = ((t2 - t1) % 3600) // 60
    s = ((t2 - t1) % 3600) % 60
    print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
    if epoch % 1 == 0:
        print("Save state, iter: {} ".format(epoch))
        torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
t3 = time.time()
h_t = (t3 - t0) // 3600
m_t = ((t3 - t0) % 3600) // 60
s_t = ((t3 - t0) % 3600) // 60
print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
if args.tensorboard:
    writer.close()

运行结果:

9a260e8abe46c7036eef3994a337d580.jpeg 3da7ef0ee41096765793d2de81bcfd39.jpeg

Tensorboard观察:

16e558d6acdbc627d79ceaa677d0d599.jpeg 3ef28e9b82b8ef23b82644c6e3ab5bbb.jpeg

评估源码:

eval_DP.py 相比与2. (1)导入自己的alexnet.py

运行结果:

b332b82831bab5ea51e0a274aabcebd5.jpeg

分析:

跟之前一样,DP使用了GradScaler放大了损失降低了模型训练的速度

现在DP使用了autocast与GradScaler的准确率为0.8409,相比与DP只使用autocast准确率0.8188还是有所上升,并且之前DP未使用amp是准确率(0.8216)也提高了不少

3.单进程占用多卡DDP训练与评估代码

(1)DDP原模型训练与评估源码:

训练源码:

train_DDP.py

import time
import torch
from torchvision.models.alexnet import alexnet
import torchvision
from torch import nn
import torch.distributed as dist
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
 
def train():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)
 
    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
 
    else:
        raise ValueError("Dataset is not CIFAR10")
 
    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))
 
    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))
 
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                  num_workers=2,
                                  pin_memory=True)
 
    # 5.Create model
    model = alexnet()
 
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)
 
    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()
 
    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
 
    # 8. Set some parameters to control loop
    # epoch
    iter = 0
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            outputs = model(imgs)
            loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
 
            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                        .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                                np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) // 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()
 
 
if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: ".format(local_size))
    train()

运行结果:

3ea5d5df0d0c4bfc967eb798e22dc69a.jpeg 9ddbf27f7ee313f98bb8c2cdf6e0add8.jpeg

Tensorboard观察:

d73f9f83cedb652c80b6e02078745417.jpeg c1b12340e3b4835f1bc9b35091a456f9.jpeg

评估源码:

eval_DDP.py

import torch
import torchvision
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
# from alexnet import alexnet
from torchvision.models.alexnet import alexnet
import argparse
 
 
# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
 
def eval():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create model
    model = alexnet()
    model = torch.nn.parallel.DistributedDataParallel(model)
 
    # 2.Ready Dataset
    if args.dataset == 'CIFAR10':
        test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                    transform=transforms.Compose(
                                                        [transforms.Resize(args.img_size),
                                                         transforms.ToTensor()]),
                                                    download=True)
 
    else:
        raise ValueError("Dataset is not CIFAR10")
 
    # 3.Length
    test_dataset_size = len(test_dataset)
    print("the test dataset size is {}".format(test_dataset_size))
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
 
    # 4.DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, sampler=test_sampler, batch_size=args.batch_size,
                                 num_workers=2,
                                 pin_memory=True)
 
    # 5. Set some parameters for testing the network
    total_accuracy = 0
 
    # test
    model.eval()
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            device = torch.device('cpu')
            imgs, targets = imgs.to(device), targets.to(device)
            model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
            model.load_state_dict(model_load)
            outputs = model(imgs)
            outputs = outputs.to(device)
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy
            accuracy = total_accuracy / test_dataset_size
        print("the total accuracy is {}".format(accuracy))
 
 
if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: ".format(local_size))
    eval()

运行结果:

52a58e8cb9ee65abc66544698c4dc6d9.jpeg

(2)DDP使用autocast的训练与评估源码:

训练源码:

train_DDP_autocast.py 导入自己的alexnet.py

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
import torch.distributed as dist
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast as autocast
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
 
def train():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)
 
    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
 
    else:
        raise ValueError("Dataset is not CIFAR10")
 
    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))
 
    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))
 
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                  num_workers=2,
                                  pin_memory=True)
 
    # 5.Create model
    model = alexnet()
 
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)
 
    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()
 
    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
 
    # 8. Set some parameters to control loop
    # epoch
    iter = 0
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
            loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
 
            optim.zero_grad()
            loss_train.backward()
            optim.step()
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                        .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                                np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) // 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()
 
 
if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: ".format(local_size))
    train()

运行结果:

27712159b58fc280348198b00631993b.jpeg ee7031d48d8fd0d87b286a810ebee994.jpeg

Tensorboard观察:

13d1e2ee12782c1743880d5931cd3d11.jpeg 8106952106b7f95f88c8d120bc56fcf4.jpeg

评估源码:

eval_DDP.py 导入自己的alexnet.py

import torch
import torchvision
import torch.distributed as dist
from torch.utils.data import DataLoader
from torchvision.transforms import transforms
from alexnet import alexnet
# from torchvision.models.alexnet import alexnet
import argparse
 
 
# eval
def parse_args():
    parser = argparse.ArgumentParser(description='CV Evaluation')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
 
def eval():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create model
    model = alexnet()
    model = torch.nn.parallel.DistributedDataParallel(model)
 
    # 2.Ready Dataset
    if args.dataset == 'CIFAR10':
        test_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=False,
                                                    transform=transforms.Compose(
                                                        [transforms.Resize(args.img_size),
                                                         transforms.ToTensor()]),
                                                    download=True)
 
    else:
        raise ValueError("Dataset is not CIFAR10")
 
    # 3.Length
    test_dataset_size = len(test_dataset)
    print("the test dataset size is {}".format(test_dataset_size))
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
 
    # 4.DataLoader
    test_dataloader = DataLoader(dataset=test_dataset, sampler=test_sampler, batch_size=args.batch_size,
                                 num_workers=2,
                                 pin_memory=True)
 
    # 5. Set some parameters for testing the network
    total_accuracy = 0
 
    # test
    model.eval()
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            device = torch.device('cpu')
            imgs, targets = imgs.to(device), targets.to(device)
            model_load = torch.load("{}/AlexNet.pth".format(args.checkpoint), map_location=device)
            model.load_state_dict(model_load)
            outputs = model(imgs)
            outputs = outputs.to(device)
            accuracy = (outputs.argmax(1) == targets).sum()
            total_accuracy = total_accuracy + accuracy
            accuracy = total_accuracy / test_dataset_size
        print("the total accuracy is {}".format(accuracy))
 
 
if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: ".format(local_size))
    eval()

运行结果:

8c3363556603ae702a64608f76a5affc.jpeg

分析:

从DDP未使用amp花费21分21秒,DDP使用autocast花费20分20秒,说明速度提升了

DDP未使用amp的准确率0.8224,之后DDP使用了autocast准确率下降到0.8162

(3)DDP使用autocast与GradScaler的训练与评估源码

训练源码:

train_DDP_GradScaler.py 导入自己的alexnet.py

import time
import torch
from alexnet import alexnet
import torchvision
from torch import nn
import torch.distributed as dist
from torchvision import transforms
from torch.utils.data import DataLoader
from torch.cuda.amp import autocast as autocast
from torch.cuda.amp import GradScaler as GradScaler
from torch.utils.tensorboard import SummaryWriter
import numpy as np
import argparse
 
 
def parse_args():
    parser = argparse.ArgumentParser(description='CV Train')
    parser.add_mutually_exclusive_group()
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--world_size", type=int, default=1)
    parser.add_argument("--master_addr", type=str, default="127.0.0.1")
    parser.add_argument("--master_port", type=str, default="12355")
    parser.add_argument('--dataset', type=str, default='CIFAR10', help='CIFAR10')
    parser.add_argument('--dataset_root', type=str, default='../data', help='Dataset root directory path')
    parser.add_argument('--img_size', type=int, default=227, help='image size')
    parser.add_argument('--tensorboard', type=str, default=True, help='Use tensorboard for loss visualization')
    parser.add_argument('--tensorboard_log', type=str, default='../tensorboard', help='tensorboard folder')
    parser.add_argument('--cuda', type=str, default=True, help='if is cuda available')
    parser.add_argument('--batch_size', type=int, default=64, help='batch size')
    parser.add_argument('--lr', type=float, default=1e-4, help='learning rate')
    parser.add_argument('--epochs', type=int, default=20, help='Number of epochs to train.')
    parser.add_argument('--checkpoint', type=str, default='../checkpoint', help='Save .pth fold')
    return parser.parse_args()
 
 
args = parse_args()
 
 
def train():
    dist.init_process_group("gloo", init_method="tcp://{}:{}".format(args.master_addr, args.master_port),
                            rank=args.rank,
                            world_size=args.world_size)
    # 1.Create SummaryWriter
    if args.tensorboard:
        writer = SummaryWriter(args.tensorboard_log)
 
    # 2.Ready dataset
    if args.dataset == 'CIFAR10':
        train_dataset = torchvision.datasets.CIFAR10(root=args.dataset_root, train=True, transform=transforms.Compose(
            [transforms.Resize(args.img_size), transforms.ToTensor()]), download=True)
    else:
        raise ValueError("Dataset is not CIFAR10")
 
    cuda = torch.cuda.is_available()
    print('CUDA available: {}'.format(cuda))
 
    # 3.Length
    train_dataset_size = len(train_dataset)
    print("the train dataset size is {}".format(train_dataset_size))
 
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    # 4.DataLoader
    train_dataloader = DataLoader(dataset=train_dataset, batch_size=args.batch_size, sampler=train_sampler,
                                  num_workers=2,
                                  pin_memory=True)
 
    # 5.Create model
    model = alexnet()
 
    if args.cuda == cuda:
        model = model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(model).cuda()
    else:
        model = torch.nn.parallel.DistributedDataParallel(model)
 
    # 6.Create loss
    cross_entropy_loss = nn.CrossEntropyLoss()
 
    # 7.Optimizer
    optim = torch.optim.AdamW(model.parameters(), lr=args.lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, patience=3, verbose=True)
    scaler = GradScaler()
    # 8. Set some parameters to control loop
    # epoch
    iter = 0
    t0 = time.time()
    for epoch in range(args.epochs):
        t1 = time.time()
        print(" -----------------the {} number of training epoch --------------".format(epoch))
        model.train()
        for data in train_dataloader:
            loss = 0
            imgs, targets = data
            optim.zero_grad()
            if args.cuda == cuda:
                cross_entropy_loss = cross_entropy_loss.cuda()
                imgs, targets = imgs.cuda(), targets.cuda()
            with autocast():
                outputs = model(imgs)
                loss_train = cross_entropy_loss(outputs, targets)
                loss = loss_train.item() + loss
            if args.tensorboard:
                writer.add_scalar("train_loss", loss_train.item(), iter)
 
            scaler.scale(loss_train).backward()
            scaler.step(optim)
            scaler.update()
 
            iter = iter + 1
            if iter % 100 == 0:
                print(
                    "Epoch: {} | Iteration: {} | lr: {} | loss: {} | np.mean(loss): {} "
                        .format(epoch, iter, optim.param_groups[0]['lr'], loss_train.item(),
                                np.mean(loss)))
        if args.tensorboard:
            writer.add_scalar("lr", optim.param_groups[0]['lr'], epoch)
        scheduler.step(np.mean(loss))
        t2 = time.time()
        h = (t2 - t1) // 3600
        m = ((t2 - t1) % 3600) // 60
        s = ((t2 - t1) % 3600) % 60
        print("epoch {} is finished, and time is {}h{}m{}s".format(epoch, int(h), int(m), int(s)))
 
        if epoch % 1 == 0:
            print("Save state, iter: {} ".format(epoch))
            torch.save(model.state_dict(), "{}/AlexNet_{}.pth".format(args.checkpoint, epoch))
 
    torch.save(model.state_dict(), "{}/AlexNet.pth".format(args.checkpoint))
    t3 = time.time()
    h_t = (t3 - t0) // 3600
    m_t = ((t3 - t0) % 3600) // 60
    s_t = ((t3 - t0) % 3600) // 60
    print("The finished time is {}h{}m{}s".format(int(h_t), int(m_t), int(s_t)))
    if args.tensorboard:
        writer.close()
 
 
if __name__ == "__main__":
    local_size = torch.cuda.device_count()
    print("local_size: ".format(local_size))
    train()

运行结果:

a6d6160e054e1613bfbcca1cd5a0d383.jpeg 74f26921cb790c93baf6a003bfbb9200.jpeg

Tensorboard观察:

cbd6937879c619a89a57367e8553bb55.jpeg dba59fb85d53e9c252fdd738e9fea1d2.jpeg

评估源码:

eval_DDP.py 与3. (2) 一样,导入自己的alexnet.py

运行结果:

3abe091067f9b8baec4ef750289991d1.jpeg

分析:

运行起来了,速度也比DDP未使用amp(用时21分21秒)快了不少(用时20分20秒),之前DDP未使用amp准确率到达0.8224,现在DDP使用了autocast与GradScaler的准确率达到0.8252,提升了

参考:

1.Pytorch自动混合精度(AMP)训练:https://blog.youkuaiyun.com/ytusdc/article/details/122152244

2.PyTorch分布式训练基础--DDP使用:https://zhuanlan.zhihu.com/p/358974461

bf7fc8c8dfb287dd5172bb8999d2ffd4.png

自动驾驶之心】全栈技术交流群

自动驾驶之心是首个自动驾驶开发者社区,聚焦目标检测、语义分割、全景分割、实例分割、关键点检测、车道线、目标跟踪、3D目标检测、BEV感知、多传感器融合、SLAM、光流估计、深度估计、轨迹预测、高精地图、规划控制、模型部署落地、自动驾驶仿真测试、硬件配置、AI求职交流等方向;

1a30720f7a461602bca2bf05de236e8f.jpeg

添加汽车人助理微信邀请入群

备注:学校/公司+方向+昵称

自动驾驶之心【知识星球】

想要了解更多自动驾驶感知(分类、检测、分割、关键点、车道线、3D目标检测、多传感器融合、目标跟踪、光流估计、轨迹预测)、自动驾驶定位建图(SLAM、高精地图)、自动驾驶规划控制、领域技术方案、AI模型部署落地实战、行业动态、岗位发布,欢迎扫描下方二维码,加入自动驾驶之心知识星球(三天内无条件退款),日常分享论文+代码,这里汇聚行业和学术界大佬,前沿技术方向尽在掌握中,期待交流!

24bd795b0273db25231f3adb83bce36c.jpeg

您可能感兴趣的与本文相关的镜像

PyTorch 2.5

PyTorch 2.5

PyTorch
Cuda

PyTorch 是一个开源的 Python 机器学习库,基于 Torch 库,底层由 C++ 实现,应用于人工智能领域,如计算机视觉和自然语言处理

<think> 嗯,用户这次问的是深度学习模型训练为什么必须用GPU,这个问题其实在之前的对话中已经埋下了伏笔。之前我详细解释过GPU和CPU的架构差异,现在用户显然是希望把理论联系到实际应用场景。 用户可能正在搭建深度学习环境,遇到了GPU配置问题,或者对训练速度不满意。从ta引用OpenMP和多GPU训练的段落来看,应该是个有一定技术背景的研究者或工程师,但可能对硬件底层不太熟悉。需要特别注意避免堆砌术语,要把计算效率这样抽象的概念具象化。 啊,正好可以用之前CPU/GPU对比的四个维度来展开:并行能力、内存带宽、专用指令集和能效比。用户引用的多GPU训练资料[^3]特别有价值,那个梯度同步的例子简直是现成的教学案例。不过要注意控制篇幅,用户连续追问说明ta需要的是精准的要点而非长篇大论。 在解释矩阵运算得用LaTeX展示$XW+B$这样的典型操作,但要注意所有公式都严格按用户要求的$...$格式。显存容量那段可以结合用户引用的框架配置建议[^1],说明为什么ResNet这类模型GPU不可。 最后那个协同工作的比喻很重要——用户可能误以为GPU是完全替代CPU的,需要强调异构计算的价值。训练好的模型部署CPU的作用这个延伸点值得放进相关问题里,很多初学者会忽视这点。 </think>深度学习模型训练必须使用GPU的核心原因在于其**大规模并行计算能力与高吞吐特性**,这直接解决了深度学习的计算瓶颈问题。以下是具体原因分析: --- ### 一、计算需求与硬件特性匹配 1. **矩阵运算的本质** 深度学习模型的核心计算是**张量操作**(如卷积、矩阵乘法),以全连接层为例: $$ \mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b} $$ 其中 $\mathbf{X} \in \mathbb{R}^{n\times d}$, $\mathbf{W} \in \mathbb{R}^{d\times k}$。当 $n=1000, d=4096$ ,单次前向传播需完成 **1000×4096×4096 ≈ 16.78亿次浮点运算**。GPU的数千核心可并行处理此类运算[^1]。 2. **并行架构优势** | **硬件** | **核心数** | **线程并发能力** | **适用场景** | |----------|------------|------------------|--------------------| | CPU | 4-64核心 | 数百线程 | 逻辑控制、分支预测 | | GPU | 数千核心 | 数万线程 | 数据并行计算 | GPU的SIMT(单指令多线程)架构可将大型张量拆解为微小任务分发给流处理器并行执行[^3]。 --- ### 二、关键性能指标对比 | **指标** | **CPU** | **GPU** | **差异倍数** | |------------------|----------------------|-----------------------|-------------| | 浮点算力 (TFLOPS) | 1-2 (主流型号) | 20-200 (如A100/H100) | 10-100× | | 内存带宽 (GB/s) | 50-100 (DDR5) | 900-3000 (HBM2e/GDDR6)| 10-30× | | 能效比 (FLOPs/W) | 低 | 高 | 5-10× | > 💡 **典型案例**: > ResNet-50 训练一个Epoch(ImageNet数据集): > - CPU:约 **50小** > - GPU(V100×8):约 **8分钟** > *加速比达 **375倍*** [^1][^2] --- ### 三、深度学习训练的特殊需求 1. **显存容量要求** - 现代模型参数量巨大(如GPT-3达1750亿) - GPU显存(16-80GB)可缓存整个模型及中间激活值 - CPU内存虽大但带宽不足,无法支撑高速计算[^1] 2. **梯度同步效率** 多GPU训练,梯度同步采用 **All-Reduce算法**: ```mermaid graph LR A[GPU1梯度] --> C[NCCL通信库] B[GPU2梯度] --> C C --> D[聚合梯度] D --> A & B ``` GPU专用的NVLink/NVSwitch提供300GB/s互联带宽,远超CPU的PCIe 5.0(64GB/s)[^3]。 --- ### 四、CPU-GPU协同必要性 尽管训练依赖GPU,但CPU仍承担关键角色: 1. **数据预处理**:加载/增强数据(OpenMP多核优化)[^1] 2. **控制流管理**:条件分支、循环调度 3. **IO调度**:分布式训练中的网络通信 **最佳实践**:构建CPU-GPU异构流水线 ```python # PyTorch示例 dataset = DataLoader(..., num_workers=8) # CPU多线程预处理 model = nn.DataParallel(model) # GPU并行计算 ``` --- ### 结论:为什么必须用GPU? 1. **算力需求**:$O(N^3)$ 级矩阵运算需并行加速 2. **显存带宽**:百GB级参数/梯度传输需高带宽 3. **经济性**:GPU的FLOPs/$成本比CPU低两个数量级 > ⚠️ **例外情况**: > - 极小模型(如MNIST上的LeNet)可在CPU训练 > - 模型推理阶段可能使用CPU(边缘设备) ---
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值