Learning PyTorch with Examples

License: CC 4.0 BY-SA

Original: LEARNING PYTORCH WITH EXAMPLES
Author: Justin Johnson
Translator: Jerry
Date: 2019-01-22

PyTorch: Tensors

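NumPy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or more, so unfortunately NumPy alone is not enough for deep learning.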

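Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a NumPy array: it is an n-dimensional array, and PyTorch provides many functions for operating on Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they are also useful as a generic tool for scientific computing.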
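
As a small illustration of how close the two are, a Tensor can wrap an existing NumPy array. This is a minimal sketch, assuming NumPy is installed alongside PyTorch; the array values are made up for illustration:

```python
import numpy as np
import torch

# A made-up 2x3 NumPy array used only for illustration.
array = np.arange(6, dtype=np.float32).reshape(2, 3)

# torch.from_numpy wraps the array without copying; both share memory.
tensor = torch.from_numpy(array)

# An in-place Tensor operation is therefore visible through the array.
tensor.mul_(2)
print(array)

# Tensor.numpy() converts back, again without copying for CPU Tensors.
round_trip = tensor.numpy()
print(round_trip.shape)
```

Because the memory is shared, this conversion is cheap, but it also means that in-place changes on one side affect the other.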

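Unlike NumPy, PyTorch Tensors can use GPUs to accelerate their numerical computations. To run a Tensor on a GPU, you only need to place it on a GPU device, for example by creating it with device=torch.device("cuda:0"), which is what the commented-out line in the code below selects.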
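
The device choice can be sketched as follows. This is an illustrative snippet rather than part of the original tutorial: it falls back to the CPU when no CUDA device is present, and the Tensor names are hypothetical:

```python
import torch

# Pick a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3)                   # created on the CPU by default
b = a.to(device)                       # moved to the chosen device (no-op on CPU)
c = torch.randn(3, 2, device=device)   # created directly on the chosen device

# Both operands live on the same device, so the product is computed there.
product = b.mm(c)
print(product.shape, product.device)
```

Operands of one operation must live on the same device, which is why the tutorial code passes the same device to every Tensor it creates.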

Here we use PyTorch Tensors to fit a two-layer network to random data. As with NumPy, we need to implement the forward and backward passes through the network by hand:

import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this line to run on the GPU

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize the weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute the predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print the loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)

    # Backward pass: compute the gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update the weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
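
The hand-written backward pass above is the chain rule applied to each step of the forward pass. As a sketch in matrix notation (the symbols X, W_1, W_2, H, R and \hat{Y} stand for the code's x, w1, w2, h, h_relu and y_pred, and are introduced here only for illustration):

```latex
\begin{aligned}
H &= X W_1, \qquad R = \max(H, 0), \qquad \hat{Y} = R W_2, \qquad
L = \sum_{i,j} \bigl(\hat{Y}_{ij} - Y_{ij}\bigr)^2 \\
\frac{\partial L}{\partial \hat{Y}} &= 2\,(\hat{Y} - Y) \\
\frac{\partial L}{\partial W_2} &= R^{\top}\, \frac{\partial L}{\partial \hat{Y}} \\
\frac{\partial L}{\partial R} &= \frac{\partial L}{\partial \hat{Y}}\, W_2^{\top} \\
\frac{\partial L}{\partial H} &= \frac{\partial L}{\partial R} \odot \mathbb{1}[H \ge 0] \\
\frac{\partial L}{\partial W_1} &= X^{\top}\, \frac{\partial L}{\partial H}
\end{aligned}
```

Each line corresponds to one grad_* assignment in the loop. The indicator uses H >= 0 because the code zeroes only the entries where h < 0.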

Autograd

PyTorch: Tensors and autograd

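In the example above, we implemented both the forward and backward passes of the network by hand. Implementing the backward pass manually is not a big deal for a small network such as the two-layer one above, but it quickly becomes very tricky for large, complex networks.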

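Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of the network defines a computational graph: nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then makes it easy to compute gradients.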

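This sounds complicated, but it is quite simple to use in practice. Each Tensor represents a node in a computational graph. If x is a Tensor with x.requires_grad=True, then after the backward pass x.grad is another Tensor holding the gradient of some scalar value (typically the loss) with respect to x.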
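
A minimal sketch of this behaviour, using a made-up three-element Tensor and the scalar sum(x^2), whose gradient with respect to x is 2x:

```python
import torch

# A leaf Tensor for which we want gradients.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The forward pass builds the graph; y remembers how it was computed.
y = (x ** 2).sum()
print(y.grad_fn is not None)

# The backward pass fills in x.grad with d(y)/d(x) = 2 * x.
y.backward()
print(x.grad)
```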

Here we use PyTorch Tensors and autograd to implement our two-layer network; we no longer need to implement the backward pass by hand:

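import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this line to run on the GPU

# N is the batch size; D_in is the input dimension;
# H is the hidden dimension; D_out is the output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold the input and output.
# requires_grad defaults to False, which means we do not need
# gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for the weights.
# Setting requires_grad=True means we want gradients
# with respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute the predicted y using operations on Tensors.
    # These are the same operations as in the Tensors example, but we do not
    # need to keep the intermediate values because we are not implementing
    # the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print the loss.
    # loss is a zero-dimensional (scalar) Tensor;
    # loss.item() returns the Python number it holds.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call computes the gradient
    # of loss with respect to every Tensor with requires_grad=True that took
    # part in computing it. Afterwards, w1.grad and w2.grad hold the gradients
    # of the loss with respect to w1 and w2.
    loss.backward()

    # Manually update the weights using gradient descent. Wrap the update in
    # torch.no_grad() because the weights have requires_grad=True, but we do
    # not want autograd to track the update itself.
    # An alternative is to operate on weight.data and weight.grad.data,
    # which does not track history either.
    # You can also use torch.optim.SGD to do this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating the weights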
        w1.grad.zero_()
        w2.grad.zero_()
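
The reason the loop zeroes the gradients after every update is that backward() adds to .grad instead of overwriting it. A small sketch with a made-up scalar weight shows the accumulation:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# d(3 * w)/dw is 3, so the first backward pass stores 3 in w.grad.
(3 * w).backward()
first = w.grad.item()

# A second backward pass adds another 3 instead of replacing the value.
(3 * w).backward()
second = w.grad.item()

# zero_() resets the buffer in place, ready for the next iteration.
w.grad.zero_()
after_zero = w.grad.item()

print(first, second, after_zero)
```

Without the zero_() calls, each training step would apply the sum of all gradients computed so far.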

(Translator's note: the code that follows is fairly simple, so its comments were left in the original English.)

PyTorch: Defining new autograd functions

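Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of some scalar value with respect to the output Tensors and computes the gradient of that same scalar with respect to the input Tensors.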

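In PyTorch, we can define our own autograd operator by subclassing torch.autograd.Function and implementing the forward and backward functions as static methods. We then use the new operator through its apply method, passing it Tensors containing the input data, just like calling a function.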

In this example, we define our own autograd function for the ReLU nonlinearity and use it to implement our two-layer network:

import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
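
A hand-written backward function is easy to get wrong, so it can help to compare it with numerical gradients. The sketch below uses a hypothetical Square operator (not part of the tutorial) and torch.autograd.gradcheck, which expects double-precision inputs:

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical custom operator computing input * input."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input * input

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        # Chain rule: d(input^2)/d(input) = 2 * input.
        return 2 * input * grad_output


# Analytic check on a fixed input: the gradient of sum(x^2) is 2 * x.
x = torch.tensor([1.0, -2.0, 3.0], dtype=torch.double, requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)

# Numerical check: gradcheck compares backward() with finite differences.
check_input = torch.randn(5, dtype=torch.double, requires_grad=True)
gradcheck_passed = torch.autograd.gradcheck(Square.apply, (check_input,))
print(gradcheck_passed)
```

The same check can be applied to MyReLU, with the caveat that ReLU is not differentiable at exactly zero, so inputs very close to zero may produce a mismatch.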

TensorFlow: Static Graphs

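PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static (as in the graph-mode API of TensorFlow 1.x), while PyTorch uses dynamic computational graphs.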

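In TensorFlow, we define the computational graph once and then execute the same graph over and over, possibly feeding it different input data each time. In PyTorch, each forward pass defines a new computational graph.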

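Static graphs are nice because the graph can be optimized up front. For example, a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If the same graph is reused over and over, this potentially costly up-front optimization can be amortized as the same graph is rerun many times.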

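One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform a different computation for each data point. For example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct needs to be part of the graph; for this reason, TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build the graph on the fly for each example, we can use ordinary imperative control flow to perform a computation that differs for each input.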
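
A minimal sketch of this on the PyTorch side, using a hypothetical helper repeated_double whose loop length changes from call to call. Each call records a different graph, and autograd differentiates whichever graph was actually built:

```python
import torch


def repeated_double(x, steps):
    # Ordinary Python loop: each call records `steps` multiplications.
    out = x
    for _ in range(steps):
        out = out * 2
    return out


x = torch.tensor(1.5, requires_grad=True)
grads = []
for steps in (1, 3):
    y = repeated_double(x, steps)   # graph depth depends on `steps`
    y.backward()                    # dy/dx = 2 ** steps
    grads.append(x.grad.item())
    x.grad.zero_()                  # reset before the next graph

print(grads)
```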

nn module

PyTorch: nn

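Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.

When building neural networks, we frequently think of arranging the computation into layers, some of which have learnable parameters that are optimized during training.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.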

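In PyTorch, the nn package serves the same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.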
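
To make the "internal state" concrete, the sketch below inspects a single torch.nn.Linear Module. The layer sizes (4 inputs, 3 outputs) are arbitrary values chosen for illustration:

```python
import torch

layer = torch.nn.Linear(4, 3)

# Calling the Module runs its forward computation on a batch of inputs.
x = torch.randn(2, 4)
out = layer(x)
print(out.shape)

# The learnable state is exposed as named parameters with requires_grad=True.
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}
all_trainable = all(p.requires_grad for p in layer.parameters())
print(param_shapes, all_trainable)
```

Note that the weight is stored with shape (out_features, in_features), the transpose of the w1 and w2 layout used in the hand-written examples.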

In this example, we use the nn package to implement our two-layer network:

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
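
The example above passes reduction='sum' to MSELoss so that the loss matches the summed squared error of the earlier hand-written versions. A small sketch with made-up values shows how it differs from the default reduction='mean':

```python
import torch

pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.zeros(2, 2)

# Squared errors are 1, 4, 9 and 16.
sum_loss = torch.nn.MSELoss(reduction='sum')(pred, target)
mean_loss = torch.nn.MSELoss(reduction='mean')(pred, target)

# 'sum' adds the squared errors; 'mean' divides that sum by the element count.
print(sum_loss.item(), mean_loss.item())
```

Because the two differ by a factor equal to the number of elements, a learning rate tuned for one reduction usually needs rescaling for the other.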

PyTorch: optim

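Up to this point, we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (using torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.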

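The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.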

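In this example, we use the nn package to define our model as before, but we optimize the model using the Adam algorithm provided by the optim package: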

import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
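
To see what a single optimizer.step() does, it helps to use the simplest optimizer. The sketch below uses plain torch.optim.SGD (not Adam) on a made-up two-element parameter, where one step reduces to w - lr * grad:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()     # gradient is 2 * w = [2, 4]

optimizer.zero_grad()     # clear any stale gradients
loss.backward()           # populate w.grad
optimizer.step()          # w <- w - 0.1 * w.grad

print(w)
```

Adam follows the same zero_grad / backward / step pattern, but scales each update using running estimates of the gradient's moments, so its steps are not a plain multiple of the gradient.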

PyTorch: Custom nn Modules

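Sometimes you will want to specify models that are more complex than a sequence of existing Modules. In these cases, you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.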

In this example, we implement our two-layer network as a custom Module subclass:

import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
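
The optimizer can find the weights through model.parameters() because assigning a Module as an attribute registers its parameters on the parent. A small sketch with a hypothetical TinyNet (layer sizes chosen arbitrarily) makes the registration visible:

```python
import torch


class TinyNet(torch.nn.Module):
    """Hypothetical two-layer Module used only to inspect registration."""

    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(4, 3)
        self.linear2 = torch.nn.Linear(3, 2)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))


model = TinyNet()

# Parameter names are prefixed with the attribute name of the submodule.
names = [name for name, _ in model.named_parameters()]
# 4*3 + 3 weights and biases in linear1, 3*2 + 2 in linear2.
total = sum(p.numel() for p in model.parameters())
print(names, total)
```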

PyTorch: Control Flow + Weight Sharing

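As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully connected ReLU network that, on each forward pass, chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.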

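For this model, we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.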

We can easily implement this model as a Module subclass:

import random

import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
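
When a Module is reused within one forward pass, as middle_linear is above, autograd sums the gradient contributions from every use into the same parameter. A minimal sketch with a hypothetical one-weight layer applied twice, so that y = w * (w * x):

```python
import torch

# A 1x1 linear layer without bias: a single shared weight w.
shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(3.0)

x = torch.tensor([[2.0]])

# Apply the same Module twice: y = w * (w * x) = w^2 * x.
y = shared(shared(x))
y.sum().backward()

# dy/dw = 2 * w * x, combining the contributions of both uses.
print(y.item(), shared.weight.grad.item())
print(len(list(shared.parameters())))
```

However many times the layer is applied, the model still owns a single copy of that weight, which is what keeps the parameter count of DynamicNet fixed while its depth varies.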