Supplementary Notes from Reading the PyTorch Automatic Differentiation Paper

First published 2022-03-17 15:37:42; last modified 2022-03-17 16:00:41

License: CC 4.0 BY-SA

Tags: #pytorch #deep-learning #python

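This article explores the basic principles of automatic differentiation: how in-place operations reduce GPU memory usage, the difference between eager and lazy evaluation, and the forward and reverse accumulation modes. It also introduces the vector-Jacobian product, a key computational tool in automatic differentiation. Using a memory measurement of PyTorch's ReLU function as an example, it highlights the advantages and drawbacks of in-place operations and points out what to watch for when using them.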

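Introduction

While reading the paper Automatic differentiation in PyTorch, published at the NIPS 2017 Workshop, I found that I was missing some background knowledge. I looked these topics up and recorded them here. The topics covered are:

In-place operations
Evaluation strategies in computer programs: eager evaluation versus lazy evaluation
Automatic differentiation, including forward accumulation and reverse accumulation
Vector-Jacobian products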

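In-place operations

An in-place operation is an operation that directly changes the content of a given linear algebra object, vector, or matrix (tensor) without making a copy. This definition is taken from a Python tutorial.

By definition, an in-place operation does not copy its input. This is why in-place operations can help reduce memory usage when working with high-dimensional data.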
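The "no copy" property can be observed directly by comparing storage addresses. The following is a minimal sketch that runs on the CPU; the variable names are illustrative and not part of the original measurements.

```python
import torch

# A small CPU tensor is enough here; the same idea applies on the GPU.
t = torch.randn(4, 4)
address_before = t.data_ptr()  # address of the first element of t's storage

out = torch.relu(t)  # out-of-place: the result lives in newly allocated storage
torch.relu_(t)       # in-place: t's own storage is overwritten

print(out.data_ptr() != address_before)  # the out-of-place result uses a new buffer
print(t.data_ptr() == address_before)    # the in-place call reused t's buffer
print(torch.equal(out, t))               # both hold ReLU of the original values
```

The out-of-place call needs a second buffer of the same size as the input, which is exactly the extra memory that the in-place variant avoids.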

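I want to demonstrate how in-place operations help consume less GPU memory. To do that, I will use the simple function below to measure the memory PyTorch allocates for the out-of-place ReLU and for the in-place ReLU: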

# Import PyTorch
import torch # import main library
import torch.nn as nn # import modules like nn.ReLU()
import torch.nn.functional as F # import torch functions like F.relu() and F.relu_()

def get_memory_allocated(device, inplace = False):
    '''
    Function measures allocated memory before and after the ReLU function call.
    INPUT:
      - device: gpu device to run the operation
      - inplace: True - to run ReLU in-place, False - for normal ReLU call
    '''
    
    # Create a large tensor
    t = torch.randn(10000, 10000, device=device)
    
    # Measure allocated memory
    torch.cuda.synchronize()
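    # torch.cuda.max_memory_allocated() returns the peak GPU memory occupied by tensors,
    # in bytes, since the start of the program; dividing by 1024**2 converts it to MB
    start_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    # torch.cuda.memory_allocated() returns the GPU memory currently occupied by tensors, in bytes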
    start_memory = torch.cuda.memory_allocated() / 1024**2
    
    # Call in-place or normal ReLU
    if inplace:
        F.relu_(t)
    else:
        output = F.relu(t)
    
    # Measure allocated memory after the call
    torch.cuda.synchronize()
    end_max_memory = torch.cuda.max_memory_allocated() / 1024**2
    end_memory = torch.cuda.memory_allocated() / 1024**2
    
    # Return amount of memory allocated for ReLU call
    return end_memory - start_memory, end_max_memory - start_max_memory

Code 1. Function that measures the allocated memory

Code 2 measures the memory allocated by the out-of-place ReLU:

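# Set up the device. Note: get_memory_allocated calls torch.cuda.synchronize(),
# so the measurement only works when a CUDA device is available.
device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")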

# call the function to measure allocated memory
memory_allocated, max_memory_allocated = get_memory_allocated(device, inplace = False)
print('Allocated memory: {}'.format(memory_allocated))
print('Allocated max memory: {}'.format(max_memory_allocated))

Code 2. Measuring the memory allocated by the out-of-place ReLU

This produces the following output:

Allocated memory: 382.0
Allocated max memory: 382.0

Output 1. Memory allocated by the out-of-place ReLU (in MB)
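The reported figure can be sanity-checked with a little arithmetic. The sketch below assumes the tensor uses the default float32 dtype of torch.randn (4 bytes per element). The small gap between the computed size and the reported 382.0 is presumably due to the allocator rounding the request up; that explanation is my assumption, not something the measurement itself shows.

```python
# Size of the ReLU output: same shape as the 10000 x 10000 input tensor.
rows, cols = 10000, 10000
bytes_per_element = 4  # assumption: default float32 dtype

size_bytes = rows * cols * bytes_per_element
size_mb = size_bytes / 1024**2  # same unit conversion as in Code 1

print(size_bytes)         # 400000000
print(round(size_mb, 2))  # 381.47
```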

Next, Code 3 measures the memory allocated by the in-place ReLU:

memory_allocated_inplace, max_memory_allocated_inplace = get_memory_allocated(device, inplace = True)
print('Allocated memory: {}'.format(memory_allocated_inplace))
print('Allocated max memory: {}'.format(max_memory_allocated_inplace))

Code 3. Measuring the memory allocated by the in-place ReLU

This produces the following output:

Allocated memory: 0.0
Allocated max memory: 0.0

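Output 2. Memory allocated by the in-place ReLU (in MB)

In-place operations do appear to save some GPU memory. However, they must be used with great care and their results checked carefully. The next section explains why.

Drawbacks of in-place operations

The main drawback of in-place operations is that they may overwrite values needed to compute gradients, which breaks the training of the model. This is what the official PyTorch autograd documentation says: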

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

There are two main reasons that limit the applicability of in-place operations:

In-place operations can potentially overwrite values required to compute gradients.

Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations, require changing the creator of all inputs to the Function representing this operation.
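The first reason quoted above can be reproduced in a few lines. This is a minimal sketch that assumes the backward pass of sigmoid reuses its own output, which is how PyTorch's autograd behaves as far as I know; the variable names are illustrative.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)  # the backward pass of sigmoid needs the output y
y.mul_(2)             # in-place edit overwrites the value that backward needs

error_message = None
try:
    y.sum().backward()
except RuntimeError as err:
    # autograd detects that a saved tensor was modified and refuses to continue
    error_message = str(err)

print(error_message)
```

In this case autograd notices the problem and raises an error instead of silently returning a wrong gradient. Replacing `y.mul_(2)` with the out-of-place `y = y * 2` lets the backward pass run normally.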

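Another reason to be careful with in-place operations is that they are tricky to implement. This is why I recommend using PyTorch's standard in-place operations (such as the in-place ReLU above) rather than implementing one by hand.

Let's look at the SiLU (or Swish-1) activation function as an example. Here is an out-of-place implementation of SiLU: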

def silu(input):
    '''
    Out-of-place implementation of SiLU activation function
    https://arxiv.org/pdf/1606.08415.pdf
    '''
    return input * torch.sigmoid(input)

Code 4. Out-of-place implementation of SiLU

Now let's try to implement an in-place SiLU using torch.sigmoid_:

def silu_inplace_1(input):
    '''
    Incorrect implementation of in-place SiLU activation function
    https://arxiv.org/pdf/1606.08415.pdf
    '''
    return input * torch.sigmoid_(input) # THIS IS INCORRECT!!!

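Code 5. Incorrect in-place implementation of SiLU

The code above does not implement the in-place SiLU correctly, which is easy to confirm by comparing the return values of the two functions. Because torch.sigmoid_ overwrites input before the multiplication reads it, silu_inplace_1 actually returns sigmoid(input) * sigmoid(input)! A working in-place SiLU that uses torch.sigmoid_ looks like this: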

def silu_inplace_2(input):
    '''
    Example of implementation of in-place SiLU activation function using torch.sigmoid_
    https://arxiv.org/pdf/1606.08415.pdf
    '''
    result = input.clone()
    torch.sigmoid_(input)
    input *= result
    return input

Code 6. Correct in-place implementation of SiLU

This small example shows why in-place operations call for caution and for checking the results.
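The comparison of return values mentioned above can be carried out as follows. This is a minimal sketch that repeats the three functions from Code 4 to Code 6 so that it is self-contained; the test values and the use of clone() to protect the input are my own choices.

```python
import torch

def silu(input):
    return input * torch.sigmoid(input)

def silu_inplace_1(input):
    return input * torch.sigmoid_(input)  # incorrect version from Code 5

def silu_inplace_2(input):
    result = input.clone()
    torch.sigmoid_(input)
    input *= result
    return input

# Fixed values so the comparison is deterministic.
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

# Each in-place function gets its own copy, because it modifies its argument.
expected = silu(x.clone())
wrong = silu_inplace_1(x.clone())
right = silu_inplace_2(x.clone())

print(torch.allclose(wrong, expected))               # the incorrect version differs
print(torch.allclose(wrong, torch.sigmoid(x) ** 2))  # it equals sigmoid(x) squared
print(torch.allclose(right, expected))               # the working version matches
```

Note that silu_inplace_2 still clones its input once, so it overwrites the caller's tensor but does not avoid the extra buffer.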

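Summary:

I described in-place operations and their purpose, and showed how they help consume less GPU memory.
I described the significant drawbacks of in-place operations. They should be used very carefully, and their results should be checked.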

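Eager evaluation versus lazy evaluation

In a programming language, an evaluation strategy is a set of rules for evaluating expressions. The term is often used for the more specific notion of a parameter-passing strategy, which defines the kind of value passed to the function for each parameter (the binding strategy), whether the arguments of a function call are evaluated at all, and if so, in what order (the evaluation order).

Eager evaluation (greedy evaluation). Applicative order is a family of evaluation orders in which a function's arguments are fully evaluated before the function is applied. This makes the function strict: if any argument is undefined, the function's result is undefined, so applicative order evaluation is usually called strict evaluation. In addition, a function call is performed as soon as it is encountered in a procedure, so it is also called eager or greedy evaluation. Some authors refer to strict evaluation as "call by value", because the call-by-value binding strategy requires strict evaluation.
Lazy evaluation. A non-strict evaluation order is one that is not strict, meaning that a function may return a result before all of its arguments have been fully evaluated. The typical example is normal order evaluation, which does not evaluate any argument until it is needed in the function body. Normal order evaluation has the property that it terminates without error whenever any other evaluation order would terminate without error. Note that lazy evaluation is, strictly speaking, a binding technique rather than an evaluation order. This distinction is not always followed: some authors define lazy evaluation as normal order evaluation or vice versa, or conflate non-strictness with lazy evaluation.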
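The difference between the two strategies can be illustrated in Python, which itself evaluates arguments eagerly. The sketch below emulates non-strict evaluation by passing zero-argument functions (thunks) instead of values. All names are hypothetical, and a real lazy language would additionally cache the result of each thunk.

```python
calls = []

def expensive(label):
    calls.append(label)  # record that this argument was actually evaluated
    return 42

def first_eager(a, b):
    return a  # b was already evaluated by the caller, although it is unused

def first_lazy(a_thunk, b_thunk):
    return a_thunk()  # b_thunk is never called, so b is never evaluated

eager_result = first_eager(expensive("a"), expensive("b"))
eager_calls = list(calls)

calls.clear()
lazy_result = first_lazy(lambda: expensive("a"), lambda: expensive("b"))
lazy_calls = list(calls)

print(eager_calls)  # ['a', 'b']
print(lazy_calls)   # ['a']
```

Both calls return the same value, but only the eager call pays for evaluating the unused second argument.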

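Boolean expressions in many languages use a form of non-strict evaluation called short-circuit evaluation, in which evaluation returns as soon as an unambiguous Boolean result is determined: for example, in a disjunction (OR) when a true operand is encountered, or in a conjunction (AND) when a false operand is encountered. Conditional expressions likewise use non-strict evaluation, since only one of the branches is evaluated.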
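Python's `or`, `and`, and conditional expressions behave this way. The following small sketch records which operands are actually evaluated; the helper `check` is hypothetical and exists only for illustration.

```python
evaluated = []

def check(name, value):
    evaluated.append(name)  # record that this operand was evaluated
    return value

result_or = check("left", True) or check("right", False)
evaluated_or = list(evaluated)
evaluated.clear()

result_and = check("left", False) and check("right", True)
evaluated_and = list(evaluated)
evaluated.clear()

chosen = check("then", 1) if True else check("else", 2)
evaluated_cond = list(evaluated)

print(evaluated_or)    # ['left']
print(evaluated_and)   # ['left']
print(evaluated_cond)  # ['then']
```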

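Automatic differentiation

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation, computational differentiation, auto-differentiation, or simply autodiff, is a set of techniques for computing the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, and so on) and elementary functions (exp, log, sin, cos, and so on). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurate to working precision, using at most a small constant factor more arithmetic operations than the original program.

Automatic differentiation differs from symbolic differentiation and numerical differentiation. Symbolic differentiation faces the difficulty of converting a computer program into a single mathematical expression and can lead to inefficient code. Numerical differentiation (the method of finite differences) can introduce round-off errors in the discretization process. Both classical methods have problems with higher-order derivatives, where complexity and errors increase. Finally, both are slow at computing partial derivatives of a function with respect to many inputs, which is what gradient-based optimization algorithms need. Automatic differentiation solves all of these problems.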
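To make the contrast with numerical differentiation concrete, here is a minimal sketch of AD based on dual numbers, compared with a forward finite difference. The `Dual` class, the test function x * sin(x), and the step size are my own illustrative choices, not taken from any library.

```python
import math

class Dual:
    """A value together with its derivative with respect to the input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other):
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_sin(d):
    # Chain rule for an elementary function: sin(u)' = cos(u) * u'
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

def f_dual(x):
    return x * dual_sin(x)

def f_float(x):
    return x * math.sin(x)

x0 = 1.5
exact = math.sin(x0) + x0 * math.cos(x0)  # derivative worked out by hand

ad = f_dual(Dual(x0, 1.0)).deriv               # seed dx/dx = 1
h = 1e-5
fd = (f_float(x0 + h) - f_float(x0)) / h       # forward finite difference

ad_error = abs(ad - exact)
fd_error = abs(fd - exact)
print(ad_error, fd_error)
```

The AD result agrees with the hand-derived value to working precision, while the finite-difference estimate carries an error that depends on the step size h.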

Chain rule

AD is built on the decomposition of differentials provided by the chain rule. For a simple composition, for example

$y=f(g(h(x)))=f(g(h(w_{0})))=f(g(w_{1}))=f(w_{2})=w_{3}$

$w_{0}=x$

$w_{1}=h(w_{0})$

$w_{2}=g(w_{1})$

$w_{3}=f(w_{2})=y$

the chain rule gives

$\frac{\mathrm{d} y }{\mathrm{d} x} = \frac{\mathrm{d} y }{\mathrm{d} w_{2}} \frac{\mathrm{d} w_{2} }{\mathrm{d} w_{1}} \frac{\mathrm{d} w_{1} }{\mathrm{d} x} = \frac{\mathrm{d} f(w_{2}) }{\mathrm{d} w_{2}} \frac{\mathrm{d} g(w_{1}) }{\mathrm{d} w_{1}} \frac{\mathrm{d} h(w_{0}) }{\mathrm{d} x}$
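The factorization above can be checked numerically. The sketch below picks concrete functions h(x) = x^2, g(w) = sin(w), and f(w) = exp(w), which are my own illustrative choices, and multiplies the three local derivatives both from the inside out and from the outside in.

```python
import math

x = 0.7

# Evaluate the intermediate variables w0..w3 as defined above.
w0 = x
w1 = w0 ** 2        # h(w0)
w2 = math.sin(w1)   # g(w1)
w3 = math.exp(w2)   # f(w2) = y

# Local derivatives of each step.
dw1_dx = 2 * w0          # dh/dw0
dw2_dw1 = math.cos(w1)   # dg/dw1
dy_dw2 = math.exp(w2)    # df/dw2

# The same product, accumulated in two different orders.
inside_out = (dw1_dx * dw2_dw1) * dy_dw2
outside_in = (dy_dw2 * dw2_dw1) * dw1_dx

# Independent check with a central finite difference.
def composite(v):
    return math.exp(math.sin(v ** 2))

eps = 1e-6
numeric = (composite(x + eps) - composite(x - eps)) / (2 * eps)

print(inside_out, outside_in, numeric)
```

For a scalar function of a scalar input the order of multiplication makes no difference to the result. It starts to matter for cost once the factors become Jacobian matrices.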

In general, AD has two distinct modes: forward accumulation (or forward mode) and reverse accumulation (or reverse mode). Forward accumulation traverses the chain rule from the inside out (that is, it first computes $\frac{\mathrm{d} w_{1} }{\mathrm{d} x}$, then $\frac{\mathrm{d} w_{2} }{\mathrm{d} w_{1}}$, and finally $\frac{\mathrm{d} y }{\mathrm{d} w_{2}}$), whereas reverse accumulation traverses it from the outside in (first computing $\frac{\mathrm{d} y }{\mathrm{d} w_{2}}$, then $\frac{\mathrm{d} w_{2} }{\mathrm{d} w_{1}}$