Automatic Differentiation in PyTorch

requires_grad

What distinguishes a tensor used for training data (or validation, or test) from a tensor used as a (trainable) parameter/weight?

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
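
Every tensor created with requires_grad=True tells PyTorch to track it for gradient computation; plain data tensors do not get (or need) that flag. The x_train_tensor and y_train_tensor used below come from earlier parts of this series, so here is a minimal sketch of comparable synthetic data (the true_b and true_w values and the lack of a validation split are assumptions for illustration only):

import numpy as np
import torch

# Assumed synthetic data: y = 1 + 2x + noise, using all 100 points for training
true_b, true_w = 1, 2
np.random.seed(42)
x = np.random.rand(100, 1)
y = true_b + true_w * x + 0.1 * np.random.randn(100, 1)

# Data tensors: note there is NO requires_grad here - we never train the data
x_train_tensor = torch.as_tensor(x).float().to(device)
y_train_tensor = torch.as_tensor(y).float().to(device)

print(x_train_tensor.requires_grad, b.requires_grad)  # False True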

backward

So, how do we tell PyTorch to do its thing and compute all gradients? That’s the role of the backward() method. It will compute gradients for all (requiring gradient) tensors involved in the computation of a given variable.

# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor
# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient descent.
# How wrong is our model? That's the error!
error = (y_train_tensor - yhat)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()
# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
loss.backward()

The backward() call above handles every tensor that either requires gradients (b and w) or was computed from one (yhat and error), as the quick check below shows:
  • b
  • w
  • yhat
  • error
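
A quick check of each tensor's attributes (a sketch, assuming the tensors above are still in scope) makes the split clear: b and w are leaves that require gradients, yhat and error are tracked only because they were computed from them, and the data tensors are not tracked at all.

# Leaf parameters: backward() accumulates gradients into their .grad attribute
print(b.requires_grad, w.requires_grad)          # True True
# Intermediate tensors: tracked because they depend on b and w
print(yhat.requires_grad, error.requires_grad)   # True True
print(yhat.grad_fn, error.grad_fn)               # graph nodes (AddBackward0, SubBackward0)
# Data tensors: not part of the gradient computation
print(x_train_tensor.requires_grad)              # False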

grad

What about the actual values of the gradients? We can inspect them by looking at the grad attribute of each tensor.

b.grad, w.grad
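
As a sanity check (a sketch, assuming the tensors from the steps above are still in scope and backward() has been called exactly once), the stored gradients should match the analytical derivatives of the MSE loss:

# Analytical gradients of the MSE loss w.r.t. b and w, for comparison:
# d(loss)/db = -2 * mean(error), d(loss)/dw = -2 * mean(x * error)
manual_b_grad = -2 * error.detach().mean()
manual_w_grad = -2 * (x_train_tensor * error.detach()).mean()
print(b.grad, manual_b_grad)
print(w.grad, manual_w_grad)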

zero_

PyTorch accumulates gradients: every new backward() call adds to whatever is already stored in grad. So, every time we use the gradients to update the parameters, we need to zero the gradients afterward. And that's what zero_() is good for (the trailing underscore means the operation happens in-place).

# This code will be placed after Step 4 (updating the parameters)
b.grad.zero_(), w.grad.zero_()
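
Why does this matter? A minimal sketch of the accumulation effect, reusing the tensors defined above:

# Start from a clean slate, then run forward + backward twice WITHOUT zeroing
b.grad.zero_(), w.grad.zero_()
for _ in range(2):
    yhat = b + w * x_train_tensor
    loss = ((y_train_tensor - yhat) ** 2).mean()
    loss.backward()
    print(b.grad)  # the second value is twice the first: the new gradient was ADDED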

Updating Parameters

To update a parameter, we multiply its gradient by a learning rate, flip the sign, and add it to the parameter’s former value. So, let’s first set our learning rate:

lr = 0.1
# Attempt at Step 4
b -= lr * b.grad
w -= lr * w.grad
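# The two lines above raise a RuntimeError: "b" and "w" are leaf tensors that
# require gradients, and autograd does not allow updating them in-place while
# it is still tracking operations on them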

no_grad

So, how do we tell PyTorch to “back off” and let us update our parameters without messing with its fancy dynamic computation graph? That’s what torch.no_grad() is good for. It allows us to perform regular Python operations on tensors without affecting PyTorch’s computation graph.

# Step 4, for real
with torch.no_grad():
    b -= lr * b.grad
    w -= lr * w.grad
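
A quick way to see what no_grad() actually changes (a sketch, assuming b from Step 0 is in scope): operations performed inside the block are simply not recorded.

with torch.no_grad():
    probe = 2 * b
print(probe.requires_grad)  # False - the multiplication was not added to the graph
print(b.requires_grad)      # True - b itself is untouched; only tracking was paused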

Putting it all together, the complete training loop looks like this:

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

lr = 0.1

for epoch in range(200):
    # Step 1 - Computes our model's predicted output - forward pass
    yhat = b + w * x_train_tensor
    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient descent.
    # How wrong is our model? That's the error!
    error = (y_train_tensor - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()
    # Step 3 - Computes gradients for both "b" and "w" parameters
    # No more manual computation of gradients!
    loss.backward()
    # Step 4 - Updates parameters using gradients and the learning rate
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad
    # Zeroes gradients so they do not accumulate into the next epoch
    b.grad.zero_()
    w.grad.zero_()
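
Once the loop finishes, a quick look at the parameters shows what gradient descent learned (with the synthetic data sketched at the top, they should move toward the values used to generate it):

print(b, w)  # the learned parameters, no longer the random values from Step 0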

Dynamic Computation Graph

Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself.

# make_dot comes from the torchviz package (pip install torchviz)
from torchviz import make_dot

make_dot(yhat)
  • blue boxes ((1)s): these boxes correspond to the tensors we use as parameters, the ones we’re asking PyTorch to compute gradients for
  • gray box (MulBackward0): a Python operation that involves a gradient-computing tensor or its dependencies
  • green box (AddBackward0): the same as the gray box, except that it is the starting point for the computation of gradients (assuming the backward() method is called from the variable used to visualize the graph) — they are computed from the bottom-up in a graph
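
If torchviz is not available, a rough text-only alternative (a sketch; the exact node names and this internal attribute may vary across PyTorch versions) is to walk the grad_fn attributes that back the same graph:

# The green box: the operation that produced yhat
print(yhat.grad_fn)                 # e.g. <AddBackward0 object at 0x...>
# Its inputs: the gray MulBackward0 box and the AccumulateGrad node for "b"
print(yhat.grad_fn.next_functions)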

What happens to the graph if we create a tensor that does not require gradients? Let's check:

# New Step 0
b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# New Step 1
yhat = b_nograd + w * x_train_tensor

make_dot(yhat)

Since b_nograd does not require gradients, it no longer shows up as a blue box: the graph only tracks the branch that starts at w.

The graph is built dynamically, operation by operation, so regular Python control flow fits right in, even when the computation itself makes no sense:

b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

# this makes no sense!!
if loss > 0:
    yhat2 = w * x_train_tensor
    error2 = y_train_tensor - yhat2

# neither does this :-)
loss += error2.mean()

make_dot(loss)

Nonsensical as the computation is, the graph for loss happily includes the extra branch created inside the if statement.

To be continued…

Autograd is just the beginning! Interested in learning more about training a model using PyTorch in a structured and incremental way?
