# Automatic Differentiation in PyTorch

What distinguishes a tensor used for training data (or validation, or test) from a tensor used as a (trainable) parameter/weight?

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
```
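The answer is the `requires_grad` argument. A minimal sketch (with made-up values) of the distinction: data tensors are created without gradient tracking, while parameters set `requires_grad=True` so autograd records every operation involving them.

```python
import torch

# Data tensor: just holds values, no gradient tracking
x = torch.tensor([1.0, 2.0, 3.0])

# Parameter tensor: requires_grad=True tells autograd to track
# operations involving it and to compute gradients for it
w = torch.randn(1, requires_grad=True)

print(x.requires_grad)  # False
print(w.requires_grad)  # True
```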

# backward

So, how do we tell PyTorch to do its thing and compute all gradients? That's the role of the `backward()` method: it computes gradients for all (gradient-requiring) tensors involved in the computation of a given variable.

```python
# Step 1 - Computes our model's predicted output - forward pass
yhat = b + w * x_train_tensor

# Step 2 - Computes the loss
# We are using ALL data points, so this is BATCH gradient descent.
# How wrong is our model? That's the error!
error = (y_train_tensor - yhat)
# It is a regression, so it computes mean squared error (MSE)
loss = (error ** 2).mean()

# Step 3 - Computes gradients for both "b" and "w" parameters
# No more manual computation of gradients!
loss.backward()
```
The tensors involved in that gradient computation are:
• `b`
• `w`
• `yhat`
• `error`
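A self-contained sketch with made-up data illustrates where the gradients end up: after `backward()`, the leaf parameters `b` and `w` hold values in their `grad` attribute, while intermediate tensors like `yhat` and `error` take part in the computation but do not retain a `grad` of their own (they are not leaf tensors).

```python
import torch

torch.manual_seed(42)
x_train_tensor = torch.rand(5, 1)        # made-up inputs
y_train_tensor = 1 + 2 * x_train_tensor  # made-up targets

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()
loss.backward()

print(b.grad, w.grad)  # populated tensors
print(yhat.grad)       # None: intermediate tensors don't retain grads
```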

What about the actual values of the gradients? We can inspect them by looking at the `grad` attribute of each tensor.

```python
b.grad, w.grad
```

# zero_

Every time we use the gradients to update the parameters, we need to zero the gradients afterward, because PyTorch accumulates them by default. And that's what `zero_()` is good for (the trailing underscore marks it as an in-place operation).

```python
# This code will be placed after Step 4 (updating the parameters)
b.grad.zero_(), w.grad.zero_()
```
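Why zero at all? Because gradients accumulate across `backward()` calls instead of being replaced. A small sketch with made-up numbers:

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.ones(1, requires_grad=True)

(w * x).sum().backward()
print(w.grad)  # tensor([3.]) -> d/dw of (1*w + 2*w) is 3

(w * x).sum().backward()
print(w.grad)  # tensor([6.]) -> accumulated, not replaced!

w.grad.zero_()  # in-place reset before the next pass
print(w.grad)  # tensor([0.])
```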

# Updating Parameters

To update a parameter, we multiply its gradient by a learning rate, flip the sign, and add the result to the parameter's former value. So, let's first set our learning rate:

```python
lr = 0.1

# Attempt at Step 4
b -= lr * b.grad
w -= lr * w.grad
```
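If you run this attempt as-is, PyTorch raises a RuntimeError: `b` and `w` are leaf tensors that require gradients, and an in-place update on them would corrupt the computation graph. A minimal reproduction with made-up values:

```python
import torch

lr = 0.1
b = torch.randn(1, requires_grad=True)
(b * 2).sum().backward()

try:
    b -= lr * b.grad  # in-place op on a leaf that requires grad
except RuntimeError as err:
    print("update failed:", err)
```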

So, how do we tell PyTorch to "back off" and let us update our parameters without messing with its fancy dynamic computation graph? That's what `torch.no_grad()` is good for: it allows us to perform regular Python operations on tensors without affecting PyTorch's computation graph.

```python
# Step 4, for real
with torch.no_grad():
    b -= lr * b.grad
    w -= lr * w.grad
```
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

lr = 0.1

for epoch in range(200):
    # Step 1 - Computes our model's predicted output - forward pass
    yhat = b + w * x_train_tensor

    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient descent.
    # How wrong is our model? That's the error!
    error = (y_train_tensor - yhat)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    # No more manual computation of gradients!
    loss.backward()

    # Step 4, for real
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad

    # Zeroes the gradients after updating the parameters
    b.grad.zero_(), w.grad.zero_()
```
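The loop above assumes `x_train_tensor` and `y_train_tensor` already exist. Here is a self-contained run on made-up data generated from y = 1 + 2x plus a little noise, checking that gradient descent recovers parameters close to the true values:

```python
import torch

device = 'cpu'  # keep the sketch device-independent

# Made-up data: true b = 1, true w = 2, plus noise
torch.manual_seed(42)
x_train_tensor = torch.rand(100, 1, device=device)
y_train_tensor = 1 + 2 * x_train_tensor + 0.1 * torch.randn(100, 1, device=device)

b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
lr = 0.1

for epoch in range(1000):
    yhat = b + w * x_train_tensor                 # forward pass
    loss = ((y_train_tensor - yhat) ** 2).mean()  # MSE loss
    loss.backward()                               # gradients
    with torch.no_grad():                         # update
        b -= lr * b.grad
        w -= lr * w.grad
    b.grad.zero_()                                # reset
    w.grad.zero_()

print(b, w)  # should land close to 1 and 2
```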

# Dynamic Computation Graph

Unfortunately, no one can be told what the dynamic computation graph is. You have to see it for yourself.

```python
# make_dot comes from the torchviz package
from torchviz import make_dot

make_dot(yhat)
```
• blue boxes ((1)s): these boxes correspond to the tensors we use as parameters, the ones we’re asking PyTorch to compute gradients for
• gray box (MulBackward0): a Python operation that involves a gradient-computing tensor or its dependencies
• green box (AddBackward0): the same as the gray box, except that it is the starting point for the computation of gradients (assuming the `backward()` method is called from the variable used to visualize the graph); gradients are computed bottom-up in the graph
```python
# New Step 0
b_nograd = torch.randn(1, requires_grad=False, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

# New Step 1
yhat = b_nograd + w * x_train_tensor

make_dot(yhat)
```
```python
b = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, dtype=torch.float, device=device)

yhat = b + w * x_train_tensor
error = y_train_tensor - yhat
loss = (error ** 2).mean()

# this makes no sense!!
if loss > 0:
    yhat2 = w * x_train_tensor
    error2 = y_train_tensor - yhat2

# neither does this :-)
loss += error2.mean()

make_dot(loss)
```
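The point of the "nonsense" branch above is that the graph is built as the code runs, so ordinary Python control flow participates in it: only the branch that actually executes gets recorded. A small sketch with made-up values:

```python
import torch

w = torch.ones(1, requires_grad=True)
x = torch.tensor([2.0])

y = w * x
# Only the executed branch enters the graph
if y > 0:
    y = y * 3

y.backward()
print(w.grad)  # tensor([6.]) -> d(3*w*x)/dw = 3*x = 6
```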

# To be continued…

Autograd is just the beginning! Interested in learning more about training a model using PyTorch in a structured and incremental way?


## ODSC - Open Data Science

Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.