PyTorch: A 60 Minute Blitz (notes)
0. What
- A replacement for NumPy to use the power of GPUs
- A deep learning research platform that provides maximum flexibility and speed
1. Basic
1.1 tensors
import torch
x = torch.empty(5, 3) # uninitialized
x = torch.rand(5, 3)
x = torch.zeros(5, 3, dtype=torch.long)
x = torch.tensor([5.5, 3]) # construct from data
x = x.new_ones(5, 3, dtype=torch.double) # new_* methods take sizes and reuse x's dtype/device unless overridden
x = torch.randn_like(x, dtype=torch.float) # same size as x, override dtype!
print(x.size())
> torch.Size([5, 3])
torch.Size is a tuple subclass, so it supports all tuple operations.
1.2 Operations
- add
y = torch.rand(5, 3)
print(x + y) # opt1
print(torch.add(x, y)) # opt2
y.add_(x) # opt3: in-place, adds x to y
Any operation that mutates a tensor in-place is post-fixed with an _. For example, x.copy_(y) and x.t_() will change x.
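A minimal sketch of the difference between the out-of-place and in-place variants described above. The tensor shapes are arbitrary choices for illustration.

```python
import torch

x = torch.ones(2, 3)
y = torch.zeros(2, 3)

z = y.add(x)   # out-of-place: returns a new tensor, y is left unchanged
print(y)       # still all zeros

y.add_(x)      # in-place (trailing underscore): y itself now holds y + x
print(y)       # all ones

x.t_()         # in-place transpose: x changes from 2 x 3 to 3 x 2
print(x.size())
```

The out-of-place call is the safer default; in-place variants save memory but overwrite values that later code (or autograd) may still need.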
- index: standard NumPy-like indexing
print(x[:, 1])
- resizing: use torch.Tensor.view (called as x.view(...))
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)
Use .item() to get the value of a one-element tensor as a Python number.
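A short sketch of view and .item() together. It uses torch.arange instead of random data so the values are predictable, which is a choice made here for illustration only.

```python
import torch

x = torch.arange(16.0).view(4, 4)   # values 0.0 .. 15.0 arranged as 4 x 4
y = x.view(16)                      # flatten to 16 elements
z = x.view(-1, 8)                   # -1 is inferred from the other dims: 16 / 8 = 2 rows
print(y.size(), z.size())

# view does not copy: y shares storage with x, so writing through y changes x
y[0] = 100.0
value = x[0, 0].item()              # one-element tensor -> plain Python float
print(type(value), value)
```

Because a view shares memory with its source, use .clone() first when an independent copy is needed.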
1.3 NumPy Bridge
The Torch Tensor and NumPy array share their underlying memory locations (as long as the Tensor is on the CPU), so changing one will change the other.
- Tensor -> Array: ts.numpy()
- Array -> Tensor: torch.from_numpy(ar)
- CUDA tensor: move tensors onto any device using the .to(device, dtype, ...) method
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda") # a CUDA device object
    y = torch.ones_like(x, device=device) # directly create a tensor on GPU
    x = x.to(device) # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double)) # ``.to`` can also change dtype together!
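Returning to the memory sharing noted at the start of the NumPy bridge section, the following sketch shows it in both directions. It runs on the CPU only; a CUDA tensor would first have to be moved back with .cpu() before calling .numpy().

```python
import numpy as np
import torch

# Tensor -> Array: the ndarray is a view of the tensor's memory
a = torch.ones(3)
b = a.numpy()
a.add_(1)                 # in-place change on the tensor
print(b)                  # the array sees the change

# Array -> Tensor: the tensor is a view of the array's memory
c = np.zeros(3)
d = torch.from_numpy(c)
np.add(c, 5, out=c)       # in-place change on the array
print(d)                  # the tensor sees the change
```

Note that only in-place changes propagate; rebinding a name (for example a = a + 1) creates a new tensor and breaks the link.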
2. Autograd: automatic differentiation
- define-by-run framework: your backprop is defined by how your code is run, so every single iteration can be different
Tensor
- requires_grad, attribute: when True, all operations on the tensor are tracked. .requires_grad_( ... ) changes an existing Tensor's requires_grad flag in-place
- backward(), method
  - computes all the gradients automatically
  - for a tensor with more than one element, specify a gradient argument that is a tensor of matching shape
  - when the tensor is a scalar, out.backward() is equivalent to out.backward(torch.tensor(1.))
- grad, attribute: the gradient for this tensor is accumulated into this attribute
- detach(): detaches the tensor from the computation history and prevents future computation from being tracked
- with torch.no_grad():
  - prevents tracking history (and using memory)
  - helpful when evaluating a model
Function
- Each tensor has a .grad_fn attribute that references the Function that created it (grad_fn is None for tensors created directly by the user)
- Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of computation
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)
> tensor([[ 4.5000, 4.5000],
[ 4.5000, 4.5000]])
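The value 4.5 can be checked by hand: out = (1/4) * sum(3 * (x_i + 2)^2), so d(out)/dx_i = (3/2) * (x_i + 2), which is 4.5 at x_i = 1. The sketch below verifies this and also illustrates the gradient argument for a non-scalar output, plus no_grad and detach. The tensors v, w, u and the weights passed to backward are made up for illustration.

```python
import torch

# Scalar output: compare autograd with the hand-derived gradient
x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward()
expected = 1.5 * (x.detach() + 2)          # (3/2) * (x_i + 2)
print(torch.allclose(x.grad, expected))

# Non-scalar output: backward needs a gradient argument of matching shape
v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = v * 2
w.backward(torch.tensor([1.0, 0.1, 0.01]))  # weights each element of w
print(v.grad)                               # 2 * the weights

# Both no_grad and detach give tensors that are not tracked
with torch.no_grad():
    u = v * 2
print(u.requires_grad, w.detach().requires_grad)
```

Passing a gradient tensor to backward computes a vector-Jacobian product, which is why its shape has to match the output.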
3. Network
3.1 model
# every model should subclass nn.Module
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
print(model)
######
Model(
  (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(20, 20, kernel_size=(5, 5), stride=(1, 1))
)
If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
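A small sketch of the fake batch dimension, reusing the Model defined above. The 28 x 28 sample size is an assumption chosen for illustration; any size of at least 9 x 9 works with two 5 x 5 convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
sample = torch.randn(1, 28, 28)   # one image: channels x height x width (assumed size)
batch = sample.unsqueeze(0)       # add batch dim -> 1 x 1 x 28 x 28
out = model(batch)
print(batch.size(), out.size())   # each 5x5 conv without padding shrinks H and W by 4
```

The output keeps the batch dimension, so out.squeeze(0) recovers the per-sample result.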
3.2 build loss
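# net is the tutorial's LeNet-style network (1x32x32 input, 10 outputs), not the Model from 3.1
input = torch.randn(1, 1, 32, 32)
output = net(input)
target = torch.arange(1, 11, dtype=torch.float) # a dummy target, for example (float to match the output dtype)
target = target.view(1, -1) # make it the same shape as output
criterion = nn.MSELoss()
loss = criterion(output, target)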
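To make the loss snippet runnable on its own, here is a sketch with a hypothetical stand-in network. TinyNet is not the tutorial's network; it is a single linear layer assumed here only because it maps a 1 x 32 x 32 input to 10 outputs.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):                 # hypothetical stand-in for the tutorial's net
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(32 * 32, 10)

    def forward(self, x):
        # flatten everything except the batch dimension
        return self.fc(x.view(x.size(0), -1))

net = TinyNet()
input = torch.randn(1, 1, 32, 32)
output = net(input)                                          # shape 1 x 10
target = torch.arange(1, 11, dtype=torch.float).view(1, -1)  # same shape and dtype as output
criterion = nn.MSELoss()
loss = criterion(output, target)

# MSELoss (default reduction) is the mean of squared differences
manual = ((output - target) ** 2).mean()
print(loss.item(), manual.item())
```

The target is created as float because MSELoss expects the prediction and the target to share a dtype.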
3.3 backprop
When we call loss.backward(), the whole graph is differentiated w.r.t. the loss, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.
print(loss.grad_fn) # MSELoss
print(loss.grad_fn.next_functions[0][0]) # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0]) # ReLU per the tutorial comment; the run below printed ExpandBackward (node names vary by version)
###
<MseLossBackward object at 0x7fb9f7338780>
<AddmmBackward object at 0x7fb9f73385c0>
<ExpandBackward object at 0x7fb9f73385c0>
3.4 Update
- opt1: manual SGD update
net.zero_grad() # zeroes the gradient buffers of all parameters
loss.backward()
learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate) # weight = weight - learning_rate * gradient
- opt2: use an optimizer from torch.optim
import torch.optim as optim
# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)
# in your training loop:
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update
Observe how gradient buffers had to be manually set to zero using optimizer.zero_grad(). This is because gradients are accumulated, as explained in the backprop section (3.3).
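The accumulation behaviour can be seen directly. In this sketch the model, data, and iteration count are all hypothetical choices: two backward passes without zeroing double the stored gradient, and a loop that zeroes the buffers each step reduces the loss.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
net = nn.Linear(4, 1)                       # hypothetical tiny model
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)
input = torch.randn(8, 4)
target = torch.randn(8, 1)

# Two backward passes without zeroing: the second gradient is added to the first
criterion(net(input), target).backward()
g1 = net.weight.grad.clone()
criterion(net(input), target).backward()
g2 = net.weight.grad.clone()
print(torch.allclose(g2, 2 * g1))

# Training loop that clears the buffers at the start of every iteration
losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = criterion(net(input), target)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(losses[0], losses[-1])
```

Accumulation is sometimes used on purpose, for example to sum gradients over several small batches before a single optimizer.step().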
3.5 notes
- torch.Tensor: A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor.
- nn.Module: Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.
- nn.Parameter: A kind of Tensor that is automatically registered as a parameter when assigned as an attribute to a Module.
- autograd.Function: Implements forward and backward definitions of an autograd operation. Every Tensor operation creates at least a single Function node that connects to the functions that created a Tensor and encodes its history.
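Two of these notes are easy to check in code. The sketch below uses a hypothetical Scale module to show that only nn.Parameter attributes are registered, and a hypothetical Square operation to show the forward and backward definitions of an autograd.Function.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):                     # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))  # registered automatically
        self.offset = torch.zeros(3)               # plain tensor: not a parameter

    def forward(self, x):
        return x * self.weight + self.offset

names = [name for name, _ in Scale().named_parameters()]
print(names)                                # only the nn.Parameter shows up

class Square(torch.autograd.Function):      # hypothetical custom operation
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)            # keep x for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output          # d(x^2)/dx = 2x, chained with grad_output

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad, y.grad_fn)
```

Custom Functions are called through .apply rather than instantiated, and the resulting tensor's grad_fn points at the node created for that call.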
4. Train a classifier
4.1 Training on GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assume that we are on a CUDA machine, then this should print a CUDA device:
print(device)
- data conversion: move the model and every batch of inputs and labels to the device
net.to(device)
inputs, labels = inputs.to(device), labels.to(device)
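A device-agnostic sketch of one forward pass, which falls back to the CPU when CUDA is unavailable. The linear model and the random batch are hypothetical placeholders for a real network and data loader.

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = nn.Linear(4, 2)                  # hypothetical tiny classifier
net.to(device)                         # Module.to moves the parameters in place

inputs = torch.randn(8, 4)             # stand-in for one batch from a data loader
labels = torch.randint(0, 2, (8,))
# Tensor.to is not in-place, so the result has to be reassigned
inputs, labels = inputs.to(device), labels.to(device)

outputs = net(inputs)
loss = nn.CrossEntropyLoss()(outputs, labels)
print(outputs.device, loss.item())
```

The model and every tensor it consumes must be on the same device; mixing CPU and CUDA tensors in one operation raises an error.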