# Automatic Differentiation in PyTorch

Autograd is PyTorch’s automatic differentiation package. Thanks to it, we **don’t need to worry** about partial derivatives, chain rule, or anything like it.

To illustrate how it works, let’s say we’re trying to fit a simple linear regression with a single feature **x**, that is, a model of the form `yhat = b + w * x`, using Mean Squared Error (MSE), the average of the squared differences between labels and predictions, as our loss.

We need to create two tensors, one for each parameter our model needs to learn: **b** and **w**.

Without PyTorch, we would have to start with our loss and work out the partial derivatives to compute the gradients manually. Sure, it would be easy enough to do it for this toy problem, but we need **something that can scale**.
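To make the "manual" route concrete, here is a minimal NumPy sketch of what working out the gradients by hand could look like. The synthetic data (generated from y = 1 + 2x plus noise) and the starting values for **b** and **w** are assumptions made up for illustration; the two gradient formulas follow from applying the chain rule to the MSE.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 1 + 2x + noise
rng = np.random.default_rng(42)
x = rng.random(100)
y = 1 + 2 * x + 0.1 * rng.standard_normal(100)

# Arbitrary starting values for the two parameters
b, w = 0.5, -0.3

# Forward pass and MSE loss
yhat = b + w * x
error = y - yhat
loss = (error ** 2).mean()

# Partial derivatives of the MSE, worked out by hand with the chain rule
b_grad = -2 * error.mean()
w_grad = -2 * (x * error).mean()

print(loss, b_grad, w_grad)
```

Two parameters are manageable, but every new layer or loss function would mean deriving and coding new formulas like these.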

So, how do we do it? PyTorch provides some really handy methods we can use to easily compute the gradients. Let’s check them out!

# requires_grad

What distinguishes a *tensor* used for *training data* (or validation, or test) from a **tensor** used as a (*trainable*) **parameter/weight**?

The latter requires the **computation of its gradients**, so we can **update** its values (the parameter’s values, that is). That’s what the requires_grad=True argument is good for. It tells PyTorch to compute gradients for us.

Remember: a tensor for a **learnable parameter** requires a **gradient**!

In code, creating tensors for our two parameters looks like this:

    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Step 0 - Initializes parameters "b" and "w" randomly
    torch.manual_seed(42)
    b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
    w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
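A quick way to see the difference between a parameter and a data tensor is to inspect their attributes. The sketch below runs on the CPU only, and the values in the data tensor are made up for illustration.

```python
import torch

torch.manual_seed(42)

# A learnable parameter: PyTorch will track the operations it takes part in
b = torch.randn(1, requires_grad=True, dtype=torch.float)

# A data tensor (made-up values): requires_grad defaults to False
x = torch.tensor([0.1, 0.5, 0.9])

print(b.requires_grad)  # True
print(x.requires_grad)  # False
print(b.grad)           # None, since backward() has not been called yet
```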

# backward

So, how do we tell PyTorch to do its thing and **compute all gradients**? That’s the role of the backward() method. It will compute gradients for *all* (*requiring gradient*) *tensors* involved in the computation of a given variable.

Do you remember the **starting point** for **computing the gradients**? It is the **loss**, since we compute its partial derivatives with respect to our parameters.

Hence, we need to invoke the backward() method from the corresponding Python variable: loss.backward().

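The code below illustrates it, assuming the training data is already stored in the tensors x_train_tensor and y_train_tensor, so that both the predictions and the loss are computed with PyTorch operations that autograd can track: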

    # Step 1 - Computes our model's predicted output - forward pass
    yhat = b + w * x_train_tensor

    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient descent.
    # How wrong is our model? That's the error!
    error = (y_train_tensor - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    # No more manual computation of gradients!
    loss.backward()

Which tensors are going to be handled by the backward() method applied to the loss?

- b
- w
- yhat
- error

We have set requires_grad=True to both **b** and **w**, so they are obviously included in the list. We use them both to compute **yhat**, so it will also make it to the list. Then we use **yhat** to compute the **error**, which is also added to the list.

Do you see the pattern here? If a tensor in the list is used to compute another tensor, the latter will also be included in the list. Tracking these dependencies is exactly what the dynamic computation graph is doing, as we’ll see shortly.

What about **x_train_tensor** and **y_train_tensor**? They are involved in the computation too… but they contain data, and thus they are **not** created as **gradient-requiring tensors**. So, backward() does not care about them.
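One way to check this pattern is to print the requires_grad attribute of every tensor involved. This is only a sketch: the synthetic data below (generated from y = 1 + 2x plus noise) is an assumption standing in for the real training set.

```python
import torch

torch.manual_seed(42)

# Synthetic data (an assumption for illustration), created WITHOUT requires_grad
x_train_tensor = torch.rand(100, 1)
y_train_tensor = 1 + 2 * x_train_tensor + 0.1 * torch.randn(100, 1)

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

tensors = {
    "b": b, "w": w, "yhat": yhat, "error": error, "loss": loss,
    "x_train_tensor": x_train_tensor, "y_train_tensor": y_train_tensor,
}
for name, tensor in tensors.items():
    # Anything computed from b or w inherits requires_grad=True
    print(name, tensor.requires_grad)
```

Everything computed from **b** or **w** reports True, while the two data tensors report False.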

# grad

What about the **actual values of the gradients**? We can inspect them by looking at the grad attribute of each tensor.

`b.grad, w.grad`
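As a sanity check, the gradients computed by autograd can be compared with the ones derived by hand for the MSE. The sketch below assumes synthetic data and runs on the CPU; detach() is used only to get untracked copies of the error values for the manual formulas.

```python
import torch

torch.manual_seed(42)

# Synthetic data (an assumption for illustration)
x_train_tensor = torch.rand(100, 1)
y_train_tensor = 1 + 2 * x_train_tensor + 0.1 * torch.randn(100, 1)

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()
loss.backward()

# Hand-derived partial derivatives of the MSE, on untracked values
manual_b_grad = -2 * error.detach().mean()
manual_w_grad = -2 * (x_train_tensor * error.detach()).mean()

print(b.grad, manual_b_grad)
print(w.grad, manual_w_grad)
print(torch.allclose(b.grad, manual_b_grad, atol=1e-5))
print(torch.allclose(w.grad, manual_w_grad, atol=1e-5))
```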

OK, we got gradients, but there is one more thing to pay attention to: by default, **PyTorch accumulates the gradients**. How to handle that?

# zero_

Every time we use the **gradients** to **update** the parameters, we need to **zero the gradients afterward**. And that’s what zero_() is good for.

    # This code will be placed after Step 4 (updating the parameters)
    b.grad.zero_(), w.grad.zero_()
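The accumulation is easy to see on a tiny, made-up example: the derivative of 3w with respect to w is always 3, yet calling backward() twice without zeroing leaves 6 in the grad attribute.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w).sum().backward()
grad_after_first = w.grad.clone()
print(grad_after_first)   # tensor([3.])

# Second backward pass: the new gradient is ADDED to the stored one
(3 * w).sum().backward()
grad_after_second = w.grad.clone()
print(grad_after_second)  # tensor([6.])

# zero_() resets the stored gradient in place
w.grad.zero_()
print(w.grad)             # tensor([0.])
```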

So, we can definitely ditch the manual computation of gradients and use both backward() and zero_() methods instead.

That’s it? Well, pretty much… but there is always a **catch**, and this time it has to do with the **update of the parameters**…

# Updating Parameters

To update a parameter, we multiply its gradient by a learning rate, flip the sign, and add it to the parameter’s former value. So, let’s first set our learning rate:

`lr = 0.1`

And then use it to perform the updates:

    # Attempt at Step 4
    b -= lr * b.grad
    w -= lr * w.grad

But it turns out we **cannot** simply perform an update like this! Why not?! It is a case of “*too much of a good thing*”. The culprit is PyTorch’s ability to build a **dynamic computation graph** from every **Python operation** that involves any **gradient-computing tensor or its dependencies**: the update itself is one such operation, and PyTorch refuses to modify in place a parameter whose operations it is tracking.
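To see the failure for yourself, the sketch below tries the naive in-place update on a single parameter and catches the resulting exception. The toy loss (the square of **b**) is an assumption chosen only to produce a gradient, and the exact wording of the error message may differ between PyTorch versions.

```python
import torch

torch.manual_seed(42)

b = torch.randn(1, requires_grad=True)

# Toy loss (an assumption for illustration), just to populate b.grad
loss = (b ** 2).sum()
loss.backward()

lr = 0.1
try:
    # In-place update of a leaf tensor that requires gradients
    b -= lr * b.grad
    update_failed = False
except RuntimeError as err:
    update_failed = True
    print("RuntimeError:", err)

print(update_failed)  # True
```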

# no_grad

So, how do we tell PyTorch to “*back off*” and let us **update our parameters** without messing with its *fancy dynamic computation graph*? That’s what torch.no_grad() is good for. It allows us to **perform regular Python operations on tensors, without affecting PyTorch’s computation graph**.

This time, the update will work as expected:

    # Step 4, for real
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad
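The sketch below shows the effect of the context manager on a single parameter. The toy loss (the square of **b**) is again an assumption for illustration: its gradient is 2b, so one update with a learning rate of 0.1 should leave the parameter at 0.8 times its former value.

```python
import torch

torch.manual_seed(42)

b = torch.randn(1, requires_grad=True)

# Toy loss (an assumption for illustration): its gradient is 2 * b
loss = (b ** 2).sum()
loss.backward()

lr = 0.1
before = b.detach().clone()

with torch.no_grad():
    # The in-place update is allowed here because nothing is being tracked
    b -= lr * b.grad
    # Results of operations performed inside the block are not tracked either
    tracked_inside = (b * 2).requires_grad

print(before, b)
print(b.requires_grad)   # True: b is still a learnable parameter
print(tracked_inside)    # False
```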

Mission accomplished! We updated our parameters **b** and **w** using PyTorch’s automatic differentiation package, **autograd**.

I mean, we updated them **once**. To actually *train* a model, we need to place this code inside a loop. Putting it all together, and adding a loop to it, the code should look like this:

    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # x_train_tensor and y_train_tensor are assumed to hold the training data

    # Step 0 - Initializes parameters "b" and "w" randomly
    torch.manual_seed(42)
    b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
    w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

    lr = 0.1

    for epoch in range(200):
        # Step 1 - Computes our model's predicted output - forward pass
        yhat = b + w * x_train_tensor

        # Step 2 - Computes the loss
        # We are using ALL data points, so this is BATCH gradient descent.
        # How wrong is our model? That's the error!
        error = (y_train_tensor - yhat)
        # It is a regression, so it computes mean squared error (MSE)
        loss = (error ** 2).mean()

        # Step 3 - Computes gradients for both "b" and "w" parameters
        # No more manual computation of gradients!
        loss.backward()

        # Step 4, for real
        with torch.no_grad():
            b -= lr * b.grad
            w -= lr * w.grad

        # Zeroes the gradients after the update, so they do not accumulate
        b.grad.zero_(), w.grad.zero_()
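The loop relies on x_train_tensor and y_train_tensor already being defined. For a version that runs on its own, here is a compact sketch on the CPU that uses hypothetical data generated from y = 1 + 2x plus noise and records the loss at every epoch (the `losses` list is an addition for inspection, not part of the original steps).

```python
import torch

torch.manual_seed(42)

# Hypothetical training data, generated from y = 1 + 2x plus some noise
x_train_tensor = torch.rand(100, 1)
y_train_tensor = 1 + 2 * x_train_tensor + 0.1 * torch.randn(100, 1)

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

lr = 0.1
losses = []
for epoch in range(200):
    yhat = b + w * x_train_tensor                   # Step 1: forward pass
    loss = ((y_train_tensor - yhat) ** 2).mean()    # Step 2: MSE loss
    loss.backward()                                 # Step 3: gradients
    with torch.no_grad():                           # Step 4: update
        b -= lr * b.grad
        w -= lr * w.grad
    b.grad.zero_(), w.grad.zero_()                  # reset for next epoch
    losses.append(loss.item())

print(losses[0], losses[-1])
print(b.item(), w.item())
```

The loss should shrink over the epochs, and **b** and **w** should move toward 1 and 2, although 200 epochs at this learning rate may not be enough to get all the way there.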

That was **autograd in action**! Now it is time to take a peek at the…

# Dynamic Computation Graph

“*Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself.*”

– Morpheus

I want you to see the graph for yourself too!

The PyTorchViz package and its make_dot(variable) function allow us to easily visualize the graph associated with a given Python variable involved in the gradient computation.

So, let’s stick with the **bare minimum**: two (gradient computing) tensors for our parameters (**b** and **w**) and the predictions (**yhat**). These are Steps 0 and 1.

`make_dot(yhat)`

Running the code above will show us the graph below:

Let’s take a closer look at its components:

- **blue boxes** ((1)s): these boxes correspond to the **tensors** we use as **parameters**, the ones we’re asking PyTorch to **compute gradients** for
- **gray box** (MulBackward0): a **Python operation** that involves a **gradient-computing tensor or its dependencies**
- **green box** (AddBackward0): the same as the gray box, except that it is the **starting point for the computation** of gradients (assuming the backward() method is called from the variable used to visualize the graph); they are computed from the **bottom-up** in a graph

Now, take a closer look at the **green box** at the bottom of the graph: **two arrows** are pointing to it since it is **adding up two variables**, **b** and `w * x`. Seems obvious, right?

Then, look at the **gray box** (MulBackward0) of the same graph: it is performing a **multiplication**, namely, `w * x`. But there is **only one arrow** pointing to it! The arrow comes from the blue box that corresponds to our parameter **w**.

“*Why don’t we have a box for our data (x)?*”

The answer is: we **do not compute gradients** for it!

So, even though there are more tensors involved in the operations performed by the computation graph, it **only** shows **gradient-computing tensors and their dependencies**.
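If PyTorchViz is not at hand, the same structure can be explored through the grad_fn attribute of a tensor and the next_functions of each node. The sketch below assumes made-up data for **x** and runs on the CPU; the node names match the boxes described above.

```python
import torch

torch.manual_seed(42)

# Made-up data tensor (an assumption); it does NOT require gradients
x_train_tensor = torch.rand(100, 1)

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)
yhat = b + w * x_train_tensor

# The "green box": the operation that produced yhat
add_node = yhat.grad_fn
print(type(add_node).__name__)  # AddBackward0

# Its two inputs: the parameter b and the multiplication w * x
for fn, _ in add_node.next_functions:
    print(type(fn).__name__)

# The "gray box": only w gets a node, the slot for the data x is None
mul_node = add_node.next_functions[1][0]
print(type(mul_node).__name__)  # MulBackward0
print(mul_node.next_functions)
```

The None in the last line is the missing arrow: there is simply no node for the data.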

What would happen to the computation graph if we set requires_grad to False for our parameter **b**?

    # New Step 0
    b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device)
    w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

    # New Step 1
    yhat = b_nograd + w * x_train_tensor

    make_dot(yhat)

Unsurprisingly, the **blue box** corresponding to the parameter **b** is no more!

Simple enough: **no gradients, no graph**!
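The same thing can be observed without drawing anything. In the sketch below (made-up data, CPU only), the slot that used to hold **b** becomes None, and when no input requires gradients at all, PyTorch does not build a graph in the first place. The name `w_nograd` is hypothetical, introduced just for this second case.

```python
import torch

torch.manual_seed(42)

# Made-up data tensor (an assumption for illustration)
x_train_tensor = torch.rand(100, 1)

b_nograd = torch.randn(1, requires_grad=False)
w = torch.randn(1, requires_grad=True)
yhat = b_nograd + w * x_train_tensor

# The first input of the addition no longer has a node
print(yhat.grad_fn.next_functions[0][0])  # None

# Hypothetical extra case: neither parameter requires gradients
w_nograd = torch.randn(1, requires_grad=False)
yhat_nograd = b_nograd + w_nograd * x_train_tensor
print(yhat_nograd.grad_fn)        # None
print(yhat_nograd.requires_grad)  # False
```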

The **best** thing about the dynamic computation graph is the fact that you can make it **as complex as you want**. You can even use control flow statements (e.g., if statements) to **control the flow of the gradients**.

The figure below shows an example of this. And yes, I do know that the computation itself is complete *nonsense*…

    b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
    w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

    yhat = b + w * x_train_tensor
    error = y_train_tensor - yhat
    loss = (error ** 2).mean()

    # this makes no sense!!
    if loss > 0:
        yhat2 = w * x_train_tensor
        error2 = y_train_tensor - yhat2

    # neither does this :-)
    loss += error2.mean()

    make_dot(loss)

Even though the computation is nonsensical, you can clearly see the **effect** of adding a **control flow statement** like if loss > 0: it branches the computation graph in two parts. The **right branch** performs the computation **inside the if statement**, which gets added to the result of the left branch in the end. Cool, right?
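A smaller, equally nonsensical example makes the effect on the gradients themselves visible. The function `branchy` below is hypothetical: it computes w squared and, only if the result is greater than 3, adds 3w to it. Since the graph is rebuilt on every call, the gradient depends on which branch actually ran.

```python
import torch

def branchy(w_value):
    """Returns dy/dw for a computation that branches on its own result."""
    w = torch.tensor(w_value, requires_grad=True)
    y = w ** 2
    if y > 3:
        # This extra term only enters the graph when the condition holds
        y = y + 3 * w
    y.backward()
    return w.grad.item()

print(branchy(1.0))  # 2.0 -> y = w^2,      dy/dw = 2w
print(branchy(2.0))  # 7.0 -> y = w^2 + 3w, dy/dw = 2w + 3
```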

# To be continued…

Autograd is just the beginning! Interested in learning more about training a model using PyTorch in a structured and incremental way?

Don’t miss my talk at ODSC Europe 2020: “**PyTorch 101: building a model step-by-step**.”

The content of this post was adapted from my book “*Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide*”. Learn more about it at http://leanpub.com/pytorch.

About the author/speaker:

Daniel is a data scientist, developer, and author of “Deep Learning with PyTorch Step-by-Step: A Beginner’s Guide”.

He has been teaching machine learning and distributed computing technologies at Data Science Retreat, the longest-running Berlin-based bootcamp, for more than three years, helping more than 150 students advance their careers.

His professional background includes 20 years of experience working for companies in several industries: banking, government, fintech, retail and mobility.