PyTorch Best Practices and Code Style: How to Use PyTorch Efficiently

Latest recommended article published on 2025-06-21 14:54:07

Python大本营

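This article summarizes best practices for deep learning with the PyTorch framework, covering code style, building networks, multi-GPU training, and model training.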

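Introduction: Since PyTorch 1.0, more and more people have been choosing PyTorch. Today we introduce a GitHub project in which the author distills practical engineering experience with PyTorch into a set of very useful best practices that cover all aspects of working with the framework. It is a rewarding read.

Author | IgorSusmelj

Translator | ronghuaiyang

Source | AI公园 (ID: AI_Paradise)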

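This is not an official style guide for PyTorch. It summarizes best practices from more than a year of experience with deep learning using the PyTorch framework. Note that the learnings we share come mostly from a research and startup perspective.

This is an open project, and other collaborators are welcome to edit and improve the document.

The document has three main parts. First, a quick recap of best practices in Python, followed by tips and recommendations for using PyTorch. Finally, we share some insights and experiences with other frameworks that have generally helped us improve our workflow.

We recommend using Python 3.6+

From our experience, we recommend Python 3.6+ because of the following features, which are very handy for clean and simple code:

Type hints: the typing module has been available since Python 3.5, and variable annotations were added in Python 3.6
f-strings, supported since Python 3.6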
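
The two features above can be sketched together as follows. The function name `describe_run` and its parameters are made up for illustration only.

```python
from typing import List


def describe_run(name: str, losses: List[float]) -> str:
    """Summarize a training run using type hints and an f-string."""
    mean_loss: float = sum(losses) / len(losses)  # variable annotation (Python 3.6+)
    return f"{name}: mean loss {mean_loss:.3f} over {len(losses)} steps"


print(describe_run("baseline", [0.9, 0.7, 0.5]))
```

Type hints are not enforced at runtime, but editors and static checkers can use them to catch mismatched arguments early.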

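Python style guide recap

We try to follow the Google Style Guide for Python.

Please refer to the well-documented Python code style guide provided by Google.

We provide a summary of the most commonly used rules here:

Naming conventions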

[Table: naming conventions (image not preserved)]
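
Because the summary table is not reproduced in this copy, the sketch below illustrates the naming conventions of the Google Python Style Guide as we understand them. Every identifier is hypothetical and exists only to show the casing.

```python
# Module / file names: lower_with_under, e.g. data_utils.py

MAX_EPOCHS = 100  # global constants: CAPS_WITH_UNDER


class ImageClassifier:  # classes: CapWords
    """Toy class used only to demonstrate naming."""

    def __init__(self, num_classes: int):
        self.num_classes = num_classes  # instance variables: lower_with_under
        self._is_trained = False  # internal attributes: leading underscore

    def count_parameters(self) -> int:  # methods and functions: lower_with_under
        return 0


def build_classifier(num_classes: int) -> ImageClassifier:
    local_model = ImageClassifier(num_classes)  # local variables: lower_with_under
    return local_model
```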

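IDEs

Code editors

In general, we recommend using an IDE such as Visual Studio Code or PyCharm. VS Code provides syntax highlighting and autocompletion in a relatively lightweight editor, whereas PyCharm has many advanced features for working with remote clusters.

Jupyter Notebook vs Python Scripts

In general, we recommend using Jupyter notebooks for initial exploration and for playing around with new models and code.

Python scripts should be used as soon as you want to train the model on a bigger dataset, where reproducibility matters more.

Our recommended workflow:

Start with a Jupyter notebook
Explore the data and models
Build your classes/methods in cells of the notebook
Move your code to Python scripts
Train/deploy on a server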

[Figure: Jupyter notebooks vs Python scripts (image not preserved)]

Libraries

Commonly used libraries

[Table: commonly used libraries (image not preserved)]

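File structure

Don't put all layers and models into the same file. A best practice is to separate the final networks into a separate file (networks.py) and keep the layers, losses, and ops in their respective files (layers.py, losses.py, ops.py). The finished model (composed of one or multiple networks) should be referenced in a file with its name (e.g. yolov3.py, DCGAN.py).

The main routine and the respective train and test scripts should only import from the file that has the model's name.

Building a neural network in PyTorch

We recommend breaking the network up into smaller reusable pieces. A network is an nn.Module made up of operations or other nn.Modules as building blocks. Loss functions are also nn.Modules and can therefore be integrated directly into the network.

A class inheriting from nn.Module must have a forward method that implements the forward pass of the respective layer or operation.

An nn.Module can be applied to input data with self.net(input). This uses the object's __call__() method to feed the input through the module.

output = self.net(input)

A simple network in PyTorch

For a simple network with a single input and a single output, use the following pattern: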

import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self):
        super(ConvBlock, self).__init__()
        # layer arguments are elided in the original
        block = [nn.Conv2d(...)]
        block += [nn.ReLU()]
        block += [nn.BatchNorm2d(...)]
        self.block = nn.Sequential(*block)

    def forward(self, x):
        return self.block(x)


class SimpleNetwork(nn.Module):
    def __init__(self, num_resnet_blocks=6):
        super(SimpleNetwork, self).__init__()
        # here we add the individual layers
        layers = [ConvBlock(...)]
        for i in range(num_resnet_blocks):
            layers += [ResBlock(...)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

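Note the following:

We reuse simple, recurrent building blocks such as ConvBlock, which consists of the same recurrent pattern (convolution, activation, normalization), and put them into a separate nn.Module
We build up a list of the desired layers and finally turn them into a model using nn.Sequential(). We use the * operator before the list object to unwrap it
In the forward pass we just run the input through the model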
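
To make the pattern concrete, here is a minimal runnable sketch. The channel counts, kernel size, and the use of ConvBlock in place of the elided ResBlock are assumptions chosen for illustration, not values taken from the original guide.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Convolution -> activation -> normalization, as in the pattern above."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        block = [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)]
        block += [nn.ReLU()]
        block += [nn.BatchNorm2d(out_channels)]
        self.block = nn.Sequential(*block)

    def forward(self, x):
        return self.block(x)


class SimpleNetwork(nn.Module):
    def __init__(self, in_channels: int = 3, width: int = 8, num_blocks: int = 2):
        super().__init__()
        # collect the layers in a list, then unpack the list into nn.Sequential
        layers = [ConvBlock(in_channels, width)]
        for _ in range(num_blocks):
            layers += [ConvBlock(width, width)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


net = SimpleNetwork()
out = net(torch.randn(4, 3, 16, 16))
print(out.shape)
```

Because padding=1 is used with a 3x3 kernel, the spatial size is preserved and only the channel dimension changes.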

A network with skip connections in PyTorch

class ResnetBlock(nn.Module):
    def __init__(self, dim, padding_type, norm_layer, use_dropout, use_bias):
        super(ResnetBlock, self).__init__()
        self.conv_block = self.build_conv_block(...)

    # the argument lists are elided in the original
    def build_conv_block(self, ...):
        conv_block = []

        conv_block += [nn.Conv2d(...),
                       norm_layer(...),
                       nn.ReLU()]
        if use_dropout:
            conv_block += [nn.Dropout(...)]

        conv_block += [nn.Conv2d(...),
                       norm_layer(...)]

        return nn.Sequential(*conv_block)

    def forward(self, x):
        # skip connection: add the input to the output of the block
        out = x + self.conv_block(x)
        return out

Here, the skip connection of a ResNet block is implemented directly in the forward pass. PyTorch allows dynamic operations during the forward pass.
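
A runnable variant of the residual block might look like the sketch below. The class name `TinyResBlock`, the fixed BatchNorm2d normalization, and the kernel and dropout settings are assumptions that stand in for the elided arguments.

```python
import torch
import torch.nn as nn


class TinyResBlock(nn.Module):
    def __init__(self, dim: int, use_dropout: bool = False):
        super().__init__()
        layers = [nn.Conv2d(dim, dim, kernel_size=3, padding=1),
                  nn.BatchNorm2d(dim),
                  nn.ReLU()]
        if use_dropout:
            layers += [nn.Dropout(0.5)]
        layers += [nn.Conv2d(dim, dim, kernel_size=3, padding=1),
                   nn.BatchNorm2d(dim)]
        self.conv_block = nn.Sequential(*layers)

    def forward(self, x):
        # padding=1 with a 3x3 kernel preserves the spatial size, so x and
        # conv_block(x) have identical shapes and can be added
        return x + self.conv_block(x)


block = TinyResBlock(dim=8)
x = torch.randn(2, 8, 16, 16)
y = block(x)
```

The addition only works when input and output shapes match, which is why the block keeps the channel count at dim throughout.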

A network with multiple outputs in PyTorch

For a network that requires multiple outputs, such as building a perceptual loss with a pretrained VGG network, we use the following pattern:

import torch
from torchvision import models


class Vgg19(torch.nn.Module):
    def __init__(self, requires_grad=False):
        super(Vgg19, self).__init__()
        # note: newer torchvision releases replace pretrained=True with a weights= argument
        vgg_pretrained_features = models.vgg19(pretrained=True).features
        self.slice1 = torch.nn.Sequential()
        self.slice2 = torch.nn.Sequential()
        self.slice3 = torch.nn.Sequential()

        for x in range(7):
            self.slice1.add_module(str(x), vgg_pretrained_features[x])
        for x in range(7, 21):
            self.slice2.add_module(str(x), vgg_pretrained_features[x])
        for x in range(21, 30):
            self.slice3.add_module(str(x), vgg_pretrained_features[x])
        if not requires_grad:
            for param in self.parameters():
                param.requires_grad = False

    def forward(self, x):
        h_relu1 = self.slice1(x)
        h_relu2 = self.slice2(h_relu1)
        h_relu3 = self.slice3(h_relu2)
        out = [h_relu1, h_relu2, h_relu3]
        return out

Note the following:

We use a pretrained model provided by torchvision.
We split the network into three slices. Each slice consists of layers from the pretrained model.
We freeze the network by setting requires_grad = False
We return a list with the three outputs of our slices
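
The same slicing idea can be tried without downloading VGG weights. The sketch below uses a small, randomly initialized nn.Sequential as a stand-in backbone. The class name `MultiOutputFeatures` and the split points are hypothetical.

```python
import torch
import torch.nn as nn


class MultiOutputFeatures(nn.Module):
    def __init__(self, backbone: nn.Sequential, split_points=(2, 4), requires_grad=False):
        super().__init__()
        first, second = split_points
        children = list(backbone.children())
        # each slice reuses consecutive layers of the backbone
        self.slice1 = nn.Sequential(*children[:first])
        self.slice2 = nn.Sequential(*children[first:second])
        self.slice3 = nn.Sequential(*children[second:])
        if not requires_grad:
            for param in self.parameters():
                param.requires_grad = False

    def forward(self, x):
        h1 = self.slice1(x)
        h2 = self.slice2(h1)
        h3 = self.slice3(h2)
        return [h1, h2, h3]


backbone = nn.Sequential(
    nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
features = MultiOutputFeatures(backbone)
outs = features(torch.randn(1, 3, 8, 8))
channels = [o.shape[1] for o in outs]
```

Each later slice consumes the output of the previous one, so the backbone is evaluated only once per input.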

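Custom loss

Even though PyTorch already has a lot of standard loss functions, it is sometimes necessary to create your own. To do so, create a separate file losses.py and extend the nn.Module class to create your custom loss function:

import torch


class CustomLoss(torch.nn.Module):
    def __init__(self):
        super(CustomLoss, self).__init__()

    def forward(self, x, y):
        # mean squared error between x and y
        loss = torch.mean((x - y)**2)
        return loss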

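Recommended code structure for training your model

Note that we used the following patterns:

We use BackgroundGenerator from prefetch_generator to load the next batches in the background
We use tqdm to monitor training progress and to show the compute efficiency. This helps us find bottlenecks in our data loading pipeline.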

# import statements
import argparse
import os
import time

import numpy as np
import torch
import torch.nn as nn
from torch.utils import data
from torchvision import datasets, transforms
from prefetch_generator import BackgroundGenerator
from tensorboardX import SummaryWriter
from tqdm import tqdm
...

# set flags / seeds
torch.backends.cudnn.benchmark = True
np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)
...

# Start with main code
if __name__ == '__main__':
    # argparse for additional flags for experiment
    parser = argparse.ArgumentParser(description="Train a network for ...")
    ...
    opt = parser.parse_args()

    # add code for datasets (we always use train and validation/ test set)
    data_transforms = transforms.Compose([
        transforms.Resize((opt.img_size, opt.img_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    train_dataset = datasets.ImageFolder(
        root=os.path.join(opt.path_to_data, "train"),
        transform=data_transforms)
    train_data_loader = data.DataLoader(train_dataset, ...)

    test_dataset = datasets.ImageFolder(
        root=os.path.join(opt.path_to_data, "test"),
        transform=data_transforms)
    test_data_loader = data.DataLoader(test_dataset, ...)
    ...

    # instantiate network (which has been imported from *networks.py*)
    net = MyNetwork(...)
    ...

    # create losses (criterion in pytorch)
    criterion_L1 = torch.nn.L1Loss()
    ...

    # if running on GPU and we want to use cuda move model there
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        net = net.cuda()
    ...

    # create optimizers
    optim = torch.optim.Adam(net.parameters(), lr=opt.lr)
    ...

    # load checkpoint if needed/ wanted
    start_n_iter = 0
    start_epoch = 0
    if opt.resume:
        ckpt = load_checkpoint(opt.path_to_checkpoint)  # custom method for loading last checkpoint
        net.load_state_dict(ckpt['net'])
        start_epoch = ckpt['epoch']
        start_n_iter = ckpt['n_iter']
        optim.load_state_dict(ckpt['optim'])
        print("last checkpoint restored")
    ...

    # if we want to run experiment on multiple GPUs we move the models there
    net = torch.nn.DataParallel(net)
    ...

    # typically we use tensorboardX to keep track of experiments
    writer = SummaryWriter(...)

    # now we start the main loop
    n_iter = start_n_iter
    for epoch in range(start_epoch, opt.epochs):
        # set models to train mode
        net.train()
        ...

        # use prefetch_generator and tqdm for iterating through data
        pbar = tqdm(enumerate(BackgroundGenerator(train_data_loader, ...)),
                    total=len(train_data_loader))
        start_time = time.time()

        # for loop going through dataset
        # (the loop variable is named batch so it does not shadow the imported data module)
        for i, batch in pbar:
            # data preparation
            img, label = batch
            if use_cuda:
                img = img.cuda()
                label = label.cuda()
            ...

            # It's very good practice to keep track of preparation time and
            # computation time using tqdm to find any issues in your dataloader
            prepare_time = time.time() - start_time

            # forward and backward pass
            optim.zero_grad()
            ...
            loss.backward()
            optim.step()
            ...

            # update tensorboardX
            writer.add_scalar(..., n_iter)
            ...

            # compute computation time and *compute_efficiency*
            process_time = time.time() - start_time - prepare_time
            pbar.set_description("Compute efficiency: {:.2f}, epoch: {}/{}:".format(
                process_time/(process_time+prepare_time), epoch, opt.epochs))
            start_time = time.time()

        # maybe do a test pass every x epochs (x is chosen by the user)
        if epoch % x == x-1:
            # bring models to evaluation mode
            net.eval()
            ...
            # do some tests
            pbar = tqdm(enumerate(BackgroundGenerator(test_data_loader, ...)),
                        total=len(test_data_loader))
            for i, batch in pbar:
                ...

            # save checkpoint if needed
            ...
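
The compute-efficiency figure shown in the progress bar is the share of each iteration spent on computation rather than on waiting for data. The standalone sketch below isolates that calculation. The helper names `compute_efficiency` and `slow_loader` are hypothetical, and time.sleep stands in for both data loading and the forward/backward pass.

```python
import time


def compute_efficiency(batches, process_batch):
    """Return the fraction of wall time spent in process_batch rather than waiting for data."""
    prepare_total = 0.0
    process_total = 0.0
    start_time = time.perf_counter()
    for batch in batches:
        # time spent waiting for the iterator to deliver the batch
        loaded_time = time.perf_counter()
        prepare_total += loaded_time - start_time
        process_batch(batch)
        # time spent on the (stand-in) forward/backward pass
        done_time = time.perf_counter()
        process_total += done_time - loaded_time
        start_time = done_time
    return process_total / (process_total + prepare_total)


def slow_loader(n, delay=0.01):
    for i in range(n):
        time.sleep(delay)  # simulates slow disk reads or preprocessing
        yield i


efficiency = compute_efficiency(slow_loader(5), lambda batch: time.sleep(0.01))
print(f"Compute efficiency: {efficiency:.2f}")
```

A value close to 1.0 means the GPU rarely waits for data, while a low value points at the loading pipeline as the bottleneck.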

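Training on multiple GPUs in PyTorch

There are two distinct patterns for using multiple GPUs for training in PyTorch. From our experience, both patterns are valid. The first one, however, results in nicer and less code. The second one seems to have a slight performance advantage due to less communication between the GPUs. (I asked about this in the PyTorch forum: https://discuss.pytorch.org/t/how-to-best-use-dataparallel-with-multiple-models/39289)

Split the batch input of each network

The most common approach is to simply split the batches of all "networks" across the individual GPUs.

A model that runs on one GPU with a batch size of 64 would therefore run on two GPUs with a batch size of 32 each. This is done automatically by wrapping the model with nn.DataParallel(model).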
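
A minimal sketch of this first pattern is shown below. The linear layer and tensor sizes are placeholders. On a machine without GPUs the wrapper simply calls the wrapped module, so the snippet also runs on CPU.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
if torch.cuda.is_available():
    model = model.cuda()

# With N visible GPUs, each forward call splits the batch along dim 0 into
# N chunks, runs them in parallel, and gathers the outputs again.
parallel_model = nn.DataParallel(model)

batch = torch.randn(64, 10)
if torch.cuda.is_available():
    batch = batch.cuda()
out = parallel_model(batch)
print(out.shape)
```

The original model stays reachable as parallel_model.module, which is handy when saving a checkpoint that should load without the wrapper.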

Pack all networks into a super network and split the input batch

This pattern is less commonly used. A repository implementing this approach is the pix2pixHD implementation by Nvidia (https://github.com/NVIDIA/pix2pixHD)

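Do's and don'ts

Avoid NumPy code in the forward method of an nn.Module

NumPy runs on the CPU and is slower than torch code. Since torch has been developed with being similar to NumPy in mind, most NumPy functions are already supported by PyTorch.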
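
As an illustration, the sketch below keeps a forward method entirely in torch operations. The module `ClampedNorm` is a made-up example, and the comment names the NumPy calls it would replace.

```python
import torch
import torch.nn as nn


class ClampedNorm(nn.Module):
    """Normalizes each row to unit L2 norm, then clips values to [-0.5, 0.5]."""

    def forward(self, x):
        # NumPy equivalents: np.linalg.norm(x, axis=1, keepdims=True) and np.clip
        norm = torch.linalg.norm(x, dim=1, keepdim=True)
        return torch.clamp(x / norm, min=-0.5, max=0.5)


x = torch.tensor([[3.0, 4.0], [1.0, 0.0]], requires_grad=True)
y = ClampedNorm()(x)
y.sum().backward()  # gradients flow because every operation is a torch op
```

Besides speed, staying in torch keeps the operations on the autograd graph and on whatever device the input tensor lives on.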

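Separate the DataLoader from the main code

The data loading pipeline should be independent of your main training code. PyTorch uses background workers to load the data more efficiently without disturbing the main training process.

Don't log results in every step

Typically we train our models for thousands of steps. Logging the loss and other results every n-th step is therefore enough and reduces overhead. In particular, saving intermediate results as images can be costly during training.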
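
Logging every n-th step can be as simple as a modulo check. In the sketch below, `should_log` and `log_every` are hypothetical names and the loss is a stand-in value rather than a real training loss.

```python
def should_log(n_iter: int, log_every: int) -> bool:
    """True on every log_every-th iteration, counting iterations from 1."""
    return n_iter % log_every == log_every - 1


logged_steps = []
for n_iter in range(1000):
    loss = 1.0 / (n_iter + 1)  # stand-in for a real training loss
    if should_log(n_iter, log_every=100):
        logged_steps.append(n_iter)
        print(f"iter {n_iter}: loss {loss:.4f}")
```

With log_every=100, a run of 1000 iterations produces 10 log lines instead of 1000.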

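Use command-line arguments

It is very handy to use command-line arguments to set parameters during code execution (batch size, learning rate, etc.). An easy way to keep track of the arguments for an experiment is to save the namespace returned by parse_args:

...
# saves arguments to config.txt file
opt = parser.parse_args()
with open("config.txt", "w") as f:
    f.write(str(opt))
...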
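
A self-contained variant of this idea is sketched below. It swaps the plain string dump for JSON so that the file can be read back later. The flag names, the `build_parser` and `save_config` helpers, and the temporary output path are assumptions for illustration.

```python
import argparse
import json
import os
import tempfile


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train a network (illustrative flags)")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=1e-3)
    return parser


def save_config(opt: argparse.Namespace, path: str) -> None:
    # vars() turns the Namespace into a plain dict that json can serialize
    with open(path, "w") as f:
        json.dump(vars(opt), f, indent=2, sort_keys=True)


# An explicit argument list is passed so the sketch runs without a command line
opt = build_parser().parse_args(["--batch_size", "64"])
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
save_config(opt, config_path)
with open(config_path) as f:
    saved = json.load(f)
```

In a real script, parse_args() would be called without arguments so that it reads sys.argv.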

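Use .detach() to free tensors from the graph if possible

PyTorch keeps track of all operations involving tensors for automatic differentiation. Use .detach() to prevent the recording of unnecessary operations.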
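
The effect of .detach() can be seen on a tiny example. The tensor values here are arbitrary.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
loss = (w * 3.0).sum()

# detach() returns a tensor with the same value that is no longer linked to
# the graph, so later operations on it are not recorded for autograd
running_loss = loss.detach()
print(loss.requires_grad, running_loss.requires_grad)

loss.backward()  # the original tensor can still be used for backpropagation
```

This is useful when accumulating statistics such as a running loss, where keeping the graph alive would only waste memory.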

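Use .item() for printing scalar tensors

You can print variables directly, but it is recommended to use variable.detach() or variable.item(). In earlier PyTorch versions (< 0.4) you had to use .data to access the tensor of a variable.

Use the call method instead of forward on an nn.Module

The following two ways are not identical:

output = self.net.forward(input)
# they are not equal!
output = self.net(input)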
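
One observable difference is that hooks registered on a module are run by its __call__ method, not by forward itself. The sketch below demonstrates this with a forward hook on a small linear layer. The layer size and variable names are arbitrary.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
hook_calls = []
# forward hooks are run by nn.Module.__call__, not by forward() itself
net.register_forward_hook(lambda module, inputs, output: hook_calls.append(output.shape))

x = torch.randn(3, 4)
net.forward(x)  # bypasses __call__: the hook is skipped
calls_after_forward = len(hook_calls)
net(x)          # goes through __call__: the hook runs
calls_after_call = len(hook_calls)
```

Any tooling that relies on hooks would therefore silently miss calls made through forward directly.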

FAQ

1. How to keep my experiments reproducible?

We recommend setting the following seeds at the beginning of your code:

np.random.seed(1)
torch.manual_seed(1)
torch.cuda.manual_seed(1)

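2. How to improve training and inference speed further?

On Nvidia GPUs you can add the following line at the beginning of your code. This allows the cuDNN backend to optimize your graph during its first execution. Be aware, however, that if you change the size of the network's input/output tensors, the graph is optimized again each time a change occurs. This can lead to very slow runtimes and out-of-memory errors. Only set this flag if your input and output always have the same shape. Usually, this results in an improvement of about 20%.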

torch.backends.cudnn.benchmark = True

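3. What is a good value for compute efficiency using the tqdm + prefetch_generator pattern?

It depends on the machine used, the preprocessing pipeline, and the network size. Running on an SSD with a 1080Ti GPU, we see a compute efficiency of almost 1.0, which is the ideal scenario. With shallow (small) networks or a slow hard disk, the number may drop to around 0.1-0.2, depending on your setup.

4. How can I have a batch size > 1 even though I don't have enough memory?

In PyTorch we can implement virtual batch sizes very easily. We just prevent the optimizer from updating the parameters on every iteration and sum up the gradients over batch_size iterations.

...
# in the main loop
out = net(input)
loss = criterion(out, label)
# we just call backward to sum up gradients but don't perform step here
loss.backward()
total_loss += loss.item() / batch_size
if n_iter % batch_size == batch_size-1:
    # here we perform our optimization step using a virtual batch size
    optim.step()
    optim.zero_grad()
    print('Total loss: ', total_loss)
    total_loss = 0.0
...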
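
The sketch below checks this idea on a toy problem. Unlike the snippet above, it divides each micro-batch loss by the number of accumulation steps, so that the accumulated gradient equals the gradient of the mean loss over the full batch. The model, data, and the name `accumulation_steps` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 1)
criterion = nn.MSELoss()
optim = torch.optim.SGD(net.parameters(), lr=0.1)

inputs = torch.randn(8, 4)
labels = torch.randn(8, 1)
accumulation_steps = 4  # virtual batch of 4 micro-batches with 2 samples each

# Reference: gradient of the mean loss over all 8 samples at once
optim.zero_grad()
criterion(net(inputs), labels).backward()
full_batch_grad = net.weight.grad.clone()

# Accumulate the same gradient from micro-batches
optim.zero_grad()
micro_batches = zip(inputs.chunk(accumulation_steps), labels.chunk(accumulation_steps))
for n_iter, (x, y) in enumerate(micro_batches):
    loss = criterion(net(x), y) / accumulation_steps  # scale so the sum is a mean
    loss.backward()                                   # adds to the existing .grad
    if n_iter % accumulation_steps == accumulation_steps - 1:
        accumulated_grad = net.weight.grad.clone()
        optim.step()
        optim.zero_grad()
```

Without the scaling, the accumulated gradient is the sum over micro-batches, which behaves like a learning rate that is accumulation_steps times larger.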

5. How can I adjust the learning rate during training?

We can access the learning rate directly through the instantiated optimizer, as shown below:

...
for param_group in optim.param_groups:
    old_lr = param_group['lr']
    new_lr = old_lr * 0.1
    param_group['lr'] = new_lr
    print('Updated lr from {} to {}'.format(old_lr, new_lr))
...
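
Wrapped into a small helper, the same idea looks like the sketch below. The function name `decay_learning_rate` and the toy model are hypothetical.

```python
import torch
import torch.nn as nn

net = nn.Linear(2, 1)
optim = torch.optim.Adam(net.parameters(), lr=1e-3)


def decay_learning_rate(optimizer, factor):
    """Multiply the learning rate of every parameter group by factor."""
    for param_group in optimizer.param_groups:
        param_group["lr"] *= factor


decay_learning_rate(optim, 0.1)
new_lr = optim.param_groups[0]["lr"]
```

For standard schedules, torch.optim.lr_scheduler provides ready-made classes that manage the same param_groups entries.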

6. How to use a pretrained model as a loss (non-backprop) during training?

If you want to use a pretrained model such as VGG to compute a loss but not train it (e.g. a perceptual loss in style transfer/GANs/auto-encoders), you can use the following pattern:

...
# instantiate the model
pretrained_VGG = VGG19(...)

# disable gradients (prevent training)
for p in pretrained_VGG.parameters():  # reset requires_grad
    p.requires_grad = False
...
# you don't have to use the no_grad() context manager but can just run the model
# no gradients will be computed for the parameters of the VGG model
out_real = pretrained_VGG(input_a)
out_fake = pretrained_VGG(input_b)
loss = any_criterion(out_real, out_fake)
...
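
The sketch below shows why the frozen model is run outside no_grad(): gradients still have to flow through it back to the model being trained. A small random network stands in for the pretrained VGG, and `generator` is a hypothetical trainable model.

```python
import torch
import torch.nn as nn

feature_net = nn.Sequential(nn.Linear(4, 8), nn.ReLU())  # stand-in for a pretrained VGG
for p in feature_net.parameters():
    p.requires_grad = False

generator = nn.Linear(4, 4)  # the model actually being trained
target = torch.randn(2, 4)
fake = generator(torch.randn(2, 4))

# the loss compares features of the generated and the target data
loss = nn.functional.l1_loss(feature_net(fake), feature_net(target))
loss.backward()
```

After backward, the generator has gradients while the frozen feature network has none.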

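7. Why do we use .train() and .eval() in PyTorch?

These methods switch layers such as BatchNorm2d or Dropout2d between training mode and inference mode. Every module that inherits from nn.Module has an attribute called training. .eval() and .train() simply set this attribute to False/True. For details on how this is implemented, have a look at the module code in PyTorch.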
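
The flag and its effect on dropout can be inspected directly, as in this small sketch with an arbitrary two-layer model.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.eval()
eval_flags = [m.training for m in net.modules()]
same_in_eval = torch.equal(net(x), net(x))  # dropout is a no-op in eval mode

net.train()
train_flags = [m.training for m in net.modules()]
```

Both calls propagate recursively, so the container and every submodule end up with the same training flag.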

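8. My model uses lots of memory during inference / How do I run a model properly for inference in PyTorch?

Make sure that no gradients are computed and stored during code execution. You can simply use the following pattern to ensure that:

with torch.no_grad():
    # run model here
    out_tensor = net(in_tensor)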
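
A quick way to confirm that nothing is being recorded is to inspect the output tensor, as in this sketch with a placeholder model.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
net.eval()
in_tensor = torch.randn(3, 4)

with torch.no_grad():
    out_tensor = net(in_tensor)

# no graph was built: the output neither requires grad nor has a grad_fn
print(out_tensor.requires_grad, out_tensor.grad_fn)
```

Combining net.eval() with torch.no_grad() covers both concerns: layer behaviour and memory use.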

9. How to fine-tune a pretrained model?

In PyTorch you can freeze layers. This prevents them from being updated during an optimization step.

# you can freeze whole modules using
for p in pretrained_VGG.parameters():  # reset requires_grad
    p.requires_grad = False
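
A minimal fine-tuning setup might look like the sketch below, where a frozen backbone is combined with a new trainable head. The layer sizes and the names `backbone` and `head` are assumptions, and the backbone is randomly initialized rather than actually pretrained.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(8, 16), nn.ReLU())  # stand-in for pretrained layers
head = nn.Linear(16, 2)                                # new task-specific layer
model = nn.Sequential(backbone, head)

for p in backbone.parameters():
    p.requires_grad = False

# hand only the trainable parameters to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optim = torch.optim.SGD(trainable, lr=0.01)

frozen_before = backbone[0].weight.clone()
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optim.step()
```

After the step, only the head's weight and bias have changed while the backbone weights are untouched.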

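10. When to use Variable(...)?

Since PyTorch 0.4, Variable and Tensor have been merged. We no longer have to construct a Variable object explicitly.

11. Is PyTorch in C++ faster than in Python?

The C++ version is about 10% faster.

12. Can TorchScript / JIT speed up my code?

Todo...

13. Is PyTorch code faster with cudnn.benchmark=True?

From our experience you can gain about a 20% speed-up. However, the first time you run your model it takes quite some time to build the optimized graph. In some cases (loops in the forward pass, no fixed input shape, if/else in forward, etc.) this flag might result in out-of-memory or other errors.

14. How to train using multiple GPUs?

Todo...

15. How does .detach() work in PyTorch?

It frees a tensor from the computation graph. A nice illustration is shown here: http://www.bnikolic.co.uk/blog/pytorch-detach.html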

English original:

https://github.com/IgorSusmelj/pytorch-styleguide

(*This article represents only the author's views. To repost it, please contact the original author.)
