The magic of Autograd

In part 1 we saw how we can train a linear model with backpropagation. In a nutshell, we computed the gradient of a composition of functions (the model and the loss) with respect to their innermost parameters (w and b) by propagating derivatives backward using the chain rule.

The basic requirement here is that all the functions we're dealing with can be differentiated analytically. But as the depth of the model increases, writing the analytical expressions for the derivatives by hand quickly becomes impractical.

This is when PyTorch tensors come to the rescue, with a PyTorch component called autograd.

PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can automatically provide the chain of derivatives of such operations with respect to their inputs. This means that, given a forward expression, no matter how nested, PyTorch will automatically provide the gradient of that expression with respect to its input parameters.
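
For example, here is a tiny standalone sketch of this bookkeeping in action (the names x and y are made up for illustration):

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x  # y records the operations that produced it; inspect y.grad_fn
y.backward()    # walks those operations backward, applying the chain rule
x.grad          # tensor(7.), since dy/dx = 2*x + 3 evaluated at x = 2 is 7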

Computing the gradient automatically

Let's rewrite our code using autograd:

%matplotlib inline
import numpy as np
import torch
torch.set_printoptions(edgeitems=2)
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0,
                    3.0, -4.0, 6.0, 13.0, 21.0])  # known temperatures in Celsius
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9,
                    33.9, 21.8, 48.4, 60.4, 68.4])  # corresponding readings in unknown units
t_un = 0.1 * t_u  # normalized input, keeping the gradients on w and b at comparable scales

Recalling our model and loss function:

def model(t_u, w, b):
    return w * t_u + b
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

Let’s again initialize a parameters tensor:

params = torch.tensor([1.0, 0.0], requires_grad=True)

The requires_grad=True argument is telling PyTorch to track the entire family tree of tensors resulting from operations on params.

params.grad is None
True
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()

params.grad
tensor([4517.2969,   82.6000])

At this point, the grad attribute of params contains the derivatives of the loss with respect to each element of params.
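
As a quick sanity check, we can compare this with the analytic gradient from part 1: for our mean squared loss, dloss/dw = 2 * mean((t_p - t_c) * t_u) and dloss/db = 2 * mean(t_p - t_c). A small sketch:

t_p = model(t_u, *params)
dloss_dw = (2.0 * (t_p - t_c) * t_u).mean()  # analytic derivative with respect to w
dloss_db = (2.0 * (t_p - t_c)).mean()        # analytic derivative with respect to b
dloss_dw, dloss_db  # should match the two entries of params.grad above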

Figure: The forward graph and backward graph of the model as computed with autograd.

When we compute our loss while the parameters w and b require gradients, in addition to performing the actual computation, PyTorch creates the autograd graph with the operations (in black circles) as nodes, as shown in the top row of the figure above. When we call loss.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients, as shown by the arrows in the bottom row of the figure.
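
We can poke at this graph ourselves. A small sketch (the exact repr strings vary across PyTorch versions):

loss = loss_fn(model(t_u, *params), t_c)
loss.grad_fn                 # e.g. <MeanBackward0 ...>, the last operation of the forward pass
loss.grad_fn.next_functions  # its parent nodes in the backward graph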

Accumulating Grad Functions

Calling backward will lead derivatives to accumulate at leaf nodes. If backward was called earlier, the loss is evaluated again, backward is called again (as in any training loop), and the gradient at each leaf is accumulated (that is, summed) on top of the one computed at the previous iteration, which leads to an incorrect value for the gradient. We can see the accumulation in action with a quick sketch, continuing from the state above where params.grad is already populated:
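
loss = loss_fn(model(t_u, *params), t_c)
loss.backward()  # the new gradients are summed onto the existing params.grad
params.grad      # now twice the values printed above

To prevent this from occurring, we need to zero the gradient explicitly at each iteration. We can do this easily using the in-place zero_ method: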

if params.grad is not None:
    params.grad.zero_()
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:  # This could be done at any point in the loop prior to calling loss.backward().
            params.grad.zero_()
        
        t_p = model(t_u, *params) 
        loss = loss_fn(t_p, t_c)
        loss.backward()
        
        with torch.no_grad():  # This is a somewhat cumbersome bit of code, but as we’ll see in the next section, it’s not an issue in practice.
            params -= learning_rate * params.grad

        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            
    return params
training_loop(
    n_epochs = 5000, 
    learning_rate = 1e-2, 
    params = torch.tensor([1.0, 0.0], requires_grad=True), # <1> 
    t_u = t_un, # <2> 
    t_c = t_c)
Epoch 500, Loss 7.860115
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957698
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679
Epoch 4500, Loss 2.927652
Epoch 5000, Loss 2.927647
tensor([  5.3671, -17.3012], requires_grad=True)

Making this better

What we have just seen is that vanilla gradient descent works pretty well in our case, but there are several other optimization techniques that can help convergence, especially when models get complicated. PyTorch provides them ready to use in the optim module:

import torch.optim as optim
dir(optim)
['ASGD',
 'Adadelta',
 'Adagrad',
 'Adam',
 'AdamW',
 'Adamax',
 'LBFGS',
 'Optimizer',
 'RMSprop',
 'Rprop',
 'SGD',
 'SparseAdam',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_functional',
 '_multi_tensor',
 'lr_scheduler',
 'swa_utils']

Every optimizer constructor takes a list of parameters (aka PyTorch tensors, typically with requires_grad set to True) as the first input. All parameters passed to the optimizer are retained inside the optimizer object, so the optimizer can update their values and access their grad attribute, as represented in the figure below.

Each optimizer exposes two methods: zero_grad and step. zero_grad zeroes the grad attribute of all the parameters passed to the optimizer upon construction. step updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.
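
For plain SGD without momentum, step amounts to exactly the update we hand-rolled earlier. A minimal sketch of the idea, not the actual implementation:

with torch.no_grad():
    for p in parameters:  # the list of tensors passed to the optimizer's constructor
        p -= learning_rate * p.grad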

Using a Gradient Descent optimizer

Let’s create params and instantiate a gradient descent optimizer:

params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr=learning_rate)
t_p = model(t_u, *params)
loss = loss_fn(t_p, t_c)
loss.backward()

optimizer.step()
params
tensor([ 9.5483e-01, -8.2600e-04], requires_grad=True)

The value of params is updated upon calling step without us having to touch it ourselves! What happens is that the optimizer looks into params.grad and updates params, subtracting learning_rate times grad from it, exactly as in our former hand-rolled code.

But we forgot to zero out our gradients: had this happened in a loop, gradients would have accumulated in the leaves at every call to backward. Let's place zero_grad at the right place:

params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

t_p = model(t_u, *params)
loss = loss_fn(t_p, t_c)


optimizer.zero_grad()
loss.backward()
optimizer.step()

params
tensor([-41.5604,  -0.7800], requires_grad=True)

Let’s update our training loop accordingly:

def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))

    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate) # It’s important that both params are the same object; otherwise the optimizer won’t know what parameters were used by the model.

training_loop(
    n_epochs = 5000, 
    optimizer = optimizer,
    params = params,
    t_u = t_un,
    t_c = t_c)
Epoch 500, Loss 7.860120
Epoch 1000, Loss 3.828538
Epoch 1500, Loss 3.092191
Epoch 2000, Loss 2.957698
Epoch 2500, Loss 2.933134
Epoch 3000, Loss 2.928648
Epoch 3500, Loss 2.927830
Epoch 4000, Loss 2.927679
Epoch 4500, Loss 2.927652
Epoch 5000, Loss 2.927647
tensor([  5.3671, -17.3012], requires_grad=True)

Testing other optimizers

params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-1
optimizer = optim.Adam([params], lr=learning_rate) # New optimizer class

training_loop(
    n_epochs = 2000, 
    optimizer = optimizer,
    params = params,
    t_u = t_u, # We’re back to the original t_u as our input.
    t_c = t_c)
Epoch 500, Loss 7.612900
Epoch 1000, Loss 3.086700
Epoch 1500, Loss 2.928579
Epoch 2000, Loss 2.927644
tensor([  0.5367, -17.3021], requires_grad=True)

The optimizer is not the only flexible part of our training loop. Let’s turn our attention to the model.

Training, validation, and overfitting

The optimizer generally minimizes the loss at the given data points. Sure enough, if we had independent data points that we didn't use to evaluate our loss or descend along its negative gradient, we would soon find out that evaluating the loss at those independent data points yields a higher-than-expected loss. This phenomenon is called overfitting.

To detect it, we must take a few data points out of our dataset (the validation set) and only fit our model on the remaining data points (the training set), as shown in the figure.

Then, while we’re fitting the model, we can evaluate the loss once on the training set and once on the validation set. When we’re trying to decide if we’ve done a good job of fitting our model to the data, we must look at both!

A deep neural network can potentially approximate complicated functions, provided that the number of neurons, and therefore parameters, is high enough.

If the loss evaluated on the validation set doesn't decrease along with the training set, it means our model is improving its fit on the samples it is seeing during training, but it is not generalizing to samples outside this precise set. As soon as we evaluate the model at new, previously unseen points, the values of the loss function are poor. So, the rule: if the training loss and the validation loss diverge, we're overfitting.

How to prevent overfitting

  1. First of all, we should make sure we get enough data for the process.
  2. We can add penalization terms to the loss function, to make it cheaper for the model to behave more smoothly and change more slowly (up to a point); see the sketch after this list.
  3. Another option is to add noise to the input samples, to artificially create new data points in between training samples and force the model to try to fit those, too.
  4. Most important is choosing the right size for the neural network model in terms of parameters. The process is based on two steps: increase the size until it fits, then scale it down until it stops overfitting.
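
As an illustration of point 2, here is a minimal sketch of adding an L2 penalty on the parameters to our loss; the penalty strength l2_lambda is a made-up value:

l2_lambda = 0.001  # hypothetical penalty strength
t_p = model(t_un, *params)
loss = loss_fn(t_p, t_c) + l2_lambda * (params ** 2).sum()  # discourages large weights
loss.backward()  # the penalty's gradient flows into params.grad as well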

Using Validation

n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)  # hold out 20% of the samples for validation

shuffled_indices = torch.randperm(n_samples)  # random permutation of the sample indices

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices  # since the split is random, your values may differ
(tensor([ 2,  5,  3,  4,  8,  9, 10,  1,  7]), tensor([6, 0]))
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]

val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]

train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u

Our training loop doesn’t really change. We just want to additionally evaluate the validation loss at every epoch, to have a chance to recognize whether we’re overfitting:

def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)  # forward pass on the training set
        train_loss = loss_fn(train_t_p, train_t_c)
                             
        val_t_p = model(val_t_u, *params)  # forward pass on the validation set
        val_loss = loss_fn(val_t_p, val_t_c)
        
        optimizer.zero_grad()
        train_loss.backward()  # backward only on the training loss; we never train on validation data
        optimizer.step()

        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f},"
                  f" Validation loss {val_loss.item():.4f}")
            
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)

training_loop(
    n_epochs = 3000, 
    optimizer = optimizer,
    params = params,
    train_t_u = train_t_un, # again using the normalized inputs
    val_t_u = val_t_un,
    train_t_c = train_t_c,
    val_t_c = val_t_c)
Epoch 1, Training loss 97.1590, Validation loss 4.7885
Epoch 2, Training loss 33.2024, Validation loss 29.7473
Epoch 3, Training loss 26.7429, Validation loss 42.7580
Epoch 500, Training loss 8.7188, Validation loss 12.4889
Epoch 1000, Training loss 4.3259, Validation loss 4.3778
Epoch 1500, Training loss 3.2235, Validation loss 3.0086
Epoch 2000, Training loss 2.9469, Validation loss 2.9987
Epoch 2500, Training loss 2.8775, Validation loss 3.1634
Epoch 3000, Training loss 2.8601, Validation loss 3.2884
tensor([  5.4078, -17.5912], requires_grad=True)

Autograd nits and switching it off

The model is evaluated twice—once on train_t_u and once on val_t_u—and then backward is called. Won’t this confuse autograd? Won’t backward be influenced by the values generated during the pass on the validation set?

Luckily for us, this isn’t the case. The first line in the training loop evaluates model on train_t_u to produce train_t_p. Then train_loss is evaluated from train_t_p. This creates a computation graph that links train_t_u to train_t_p to train_loss. When model is evaluated again on val_t_u, it produces val_t_p and val_loss. In this case, a separate computation graph will be created that links val_t_u to val_t_p to val_loss. Separate tensors have been run through the same functions, model and loss_fn, generating separate computation graphs, as shown in the figure below.

Since we’re not ever calling backward on val_loss, why are we building the graph in the first place? We could in fact just call model and loss_fn as plain functions, without tracking the computation. However optimized, building the autograd graph comes with additional costs that we could totally forgo during the validation pass, especially when the model has millions of parameters. In order to address this, PyTorch allows us to switch off autograd when we don’t need it, using the torch.no_grad context manager.

We won’t see any meaningful advantage in terms of speed or memory consumption on our small problem. However, for larger models, the differences can add up. We can make sure this works by checking the value of the requires_grad attribute on the val_loss tensor:

def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)

        with torch.no_grad(): # switches off autograd for the validation pass
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            assert val_loss.requires_grad == False # checks that no graph is built inside the block
            
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

    return params

Summary

  1. Linear models are the simplest reasonable model to use to fit data.
  2. Convex optimization techniques can be used for linear models, but they do not generalize to neural networks, so we focus on stochastic gradient descent for parameter estimation.
  3. Deep learning can be used for generic models that are not engineered for solving a specific task, but instead can be automatically adapted to specialize themselves on the problem at hand.
  4. Learning algorithms amount to optimizing parameters of models based on observations. A loss function is a measure of the error in carrying out a task, such as the error between predicted outputs and measured values. The goal is to get the loss function as low as possible.
  5. The rate of change of the loss function with respect to the model parameters can be used to update the same parameters in the direction of decreasing loss.
  6. The optim module in PyTorch provides a collection of ready-to-use optimizers for updating parameters and minimizing loss functions.
  7. Optimizers use the autograd feature of PyTorch to compute the gradient for each parameter, depending on how that parameter contributes to the final output. This allows users to rely on the dynamic computation graph during complex forward passes.