Chapter 5: The Magic of Autograd and Optimizers
A summary of Chapter 5, Part 2, of Deep Learning with PyTorch.
The magic of Autograd
In Part 1 we saw how to train a linear model with backpropagation. In a nutshell, we computed the gradient of a composition of functions—the model and the loss—with respect to their innermost parameters (w and b) by propagating derivatives backward using the chain rule.
The basic requirement is that all the functions we're dealing with can be differentiated analytically. But as the depth of the model increases, writing the analytical expressions for the derivatives by hand quickly becomes impractical.
This is when PyTorch tensors come to the rescue, with a PyTorch component called autograd.
PyTorch tensors can remember where they come from, in terms of the operations and parent tensors that originated them, and they can automatically provide the chain of derivatives of such operations with respect to their inputs. This means PyTorch will automatically provide the gradient of that expression with respect to its input parameters.
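To see this provenance tracking in action, here is a tiny standalone illustration (my sketch, not code from the book):

import torch

x = torch.tensor(2.0, requires_grad=True)  # a leaf tensor we want gradients for
y = x * 3 + 1                              # y remembers the operations that produced it
print(y.grad_fn)                           # something like <AddBackward0 object at ...>
y.backward()
print(x.grad)                              # tensor(3.), since dy/dx = 3

With that in mind, let's set up the thermometer data from Part 1: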
%matplotlib inline
import numpy as np
import torch
torch.set_printoptions(edgeitems=2)
t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0,
                    3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9,
                    33.9, 21.8, 48.4, 60.4, 68.4])
t_un = 0.1 * t_u  # normalized input, as in Part 1
Recalling our model and loss function from Part 1:

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
Let’s again initialize a parameters tensor:
params = torch.tensor([1.0, 0.0], requires_grad=True)
The requires_grad=True argument is telling PyTorch to track the entire family tree of tensors resulting from operations on params.
params.grad is None  # True: no backward pass has run yet

loss = loss_fn(model(t_u, *params), t_c)
loss.backward()

params.grad  # now holds d loss / d w and d loss / d b
At this point, the grad attribute of params contains the derivatives of the loss with respect to each element of params.
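We can sanity-check this against the derivatives we worked out by hand in Part 1 (a quick check of mine, not code from the book): for the mean squared error, d loss/d w = mean(2 * (t_p - t_c) * t_u) and d loss/d b = mean(2 * (t_p - t_c)).

t_p = model(t_u, *params)
dloss_dw = (2.0 * (t_p - t_c) * t_u).mean()  # analytical gradient with respect to w
dloss_db = (2.0 * (t_p - t_c)).mean()        # analytical gradient with respect to b
assert torch.allclose(params.grad, torch.stack([dloss_dw, dloss_db]))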
Figure: the forward graph and backward graph of the model, as computed with autograd.
When we compute our loss while the parameters w and b require gradients, in addition to performing the actual computation, PyTorch creates the autograd graph with the operations (the black circles) as nodes, as shown in the top row of the figure above. When we call loss.backward(), PyTorch traverses this graph in the reverse direction to compute the gradients, as shown by the arrows in the bottom row of the figure.
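We can peek at this graph from Python (an exploratory sketch of mine, not code from the book): every tensor produced by an operation exposes the backward node that will handle it through its grad_fn attribute.

loss = loss_fn(model(t_u, *params), t_c)
print(loss.grad_fn)                 # the root of the backward graph, a MeanBackward node
print(loss.grad_fn.next_functions)  # the upstream backward nodes it will feed into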
Accumulating grad functions
Calling backward leads derivatives to accumulate at leaf nodes. So if backward was called earlier, the loss is evaluated again, and backward is called again (as in any training loop), the gradient at each leaf is accumulated (that is, summed) on top of the one computed at the previous iteration, which leads to an incorrect value for the gradient. To prevent this from occurring, we need to zero the gradient explicitly at each iteration, which we can do easily using the in-place zero_ method:
if params.grad is not None:
    params.grad.zero_()
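To make the accumulation concrete, here is a small demonstration (my sketch, not from the book): calling backward twice without zeroing doubles the stored gradient.

params = torch.tensor([1.0, 0.0], requires_grad=True)
loss_fn(model(t_u, *params), t_c).backward()
first_grad = params.grad.clone()
loss_fn(model(t_u, *params), t_c).backward()  # accumulates into params.grad instead of replacing it
assert torch.allclose(params.grad, 2 * first_grad)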
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        if params.grad is not None:  # This could be done at any point in the loop prior to calling loss.backward().
            params.grad.zero_()
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        loss.backward()
        with torch.no_grad():  # This is a somewhat cumbersome bit of code, but as we'll see in the next section, it's not an issue in practice.
            params -= learning_rate * params.grad
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params
training_loop(
    n_epochs = 5000,
    learning_rate = 1e-2,
    params = torch.tensor([1.0, 0.0], requires_grad=True),  # adding requires_grad=True is key
    t_u = t_un,  # again using the normalized input
    t_c = t_c)
Optimizers a la carte

So far we have updated the parameters by hand. PyTorch abstracts this strategy away in the torch.optim module, which provides a collection of ready-to-use optimizers; we can list them with dir:

import torch.optim as optim
dir(optim)
Every optimizer constructor takes a list of parameters (that is, PyTorch tensors, typically with requires_grad set to True) as the first input. All parameters passed to the optimizer are retained inside the optimizer object, so the optimizer can update their values and access their grad attribute.
Each optimizer exposes two methods: zero_grad and step. zero_grad zeroes the grad attribute of all the parameters passed to the optimizer upon construction. step updates the value of those parameters according to the optimization strategy implemented by the specific optimizer.
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-5
optimizer = optim.SGD([params], lr=learning_rate)
t_p = model(t_u, *params)
loss = loss_fn(t_p, t_c)
loss.backward()
optimizer.step()
params
The value of params is updated upon calling step without us having to touch it ourselves! What happens is that the optimizer looks into params.grad and updates params, subtracting learning_rate times grad from it, exactly as in our former hand-rolled code.
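We can verify this equivalence directly (a sanity check of mine, not from the book): a single SGD step should match the manual update rule.

params = torch.tensor([1.0, 0.0], requires_grad=True)
optimizer = optim.SGD([params], lr=learning_rate)
loss_fn(model(t_u, *params), t_c).backward()
expected = params.detach() - learning_rate * params.grad  # our former hand-rolled update
optimizer.step()
assert torch.allclose(params, expected)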
But we forgot to zero out our gradients! Had we run this code in a loop, gradients would have accumulated in the leaves at every call to backward. Let's place zero_grad in the right place:
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)
t_p = model(t_u, *params)
loss = loss_fn(t_p, t_c)
optimizer.zero_grad()
loss.backward()
optimizer.step()
params
Let’s update our training loop accordingly:
def training_loop(n_epochs, optimizer, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 500 == 0:
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate) # It’s important that both params are the same object; otherwise the optimizer won’t know what parameters were used by the model.
training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    params = params,  # the same params object handed to the optimizer above
    t_u = t_un,
    t_c = t_c)
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-1
optimizer = optim.Adam([params], lr=learning_rate) # New optimizer class
training_loop(
    n_epochs = 2000,
    optimizer = optimizer,
    params = params,
    t_u = t_u,  # We're back to the original t_u as our input.
    t_c = t_c)
The optimizer is not the only flexible part of our training loop. Let’s turn our attention to the model.
Training, validation, and overfitting
The optimizer generally minimizes the loss at the given data points. Sure enough, if we had independent data points that we didn't use to evaluate the loss or descend along its negative gradient, we would soon find out that evaluating the loss at those independent points yields a higher-than-expected loss. This phenomenon is called overfitting.
To detect it, we must take a few data points out of our dataset (the validation set) and only fit our model on the remaining data points (the training set).
Then, while we’re fitting the model, we can evaluate the loss once on the training set and once on the validation set. When we’re trying to decide if we’ve done a good job of fitting our model to the data, we must look at both!
A deep neural network can potentially approximate complicated functions, provided that the number of neurons, and therefore parameters, is high enough.
If the loss evaluated on the validation set doesn't decrease along with the training loss, it means our model is improving its fit of the samples it sees during training, but it is not generalizing to samples outside this precise set. As soon as we evaluate the model at new, previously unseen points, the values of the loss function are poor. So the rule is: if the training loss and the validation loss diverge, we're overfitting.
How to prevent overfitting
- First of all, we should make sure we get enough data for the process.
- We can add penalization terms to the loss function, making it cheaper for the model to behave more smoothly and change more slowly (up to a point); see the sketch after this list.
- We can add noise to the input samples, artificially creating new data points in between the training samples and forcing the model to try to fit those, too.
- Most important, we should choose the right size for the neural network model in terms of parameters. The process is based on two steps: increase the size until it fits, then scale it down until it stops overfitting.
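As a minimal sketch of the penalization idea (mine, not the book's; the penalty weight l2_lambda is a made-up hyperparameter), an L2 penalty can be folded into the loss like this:

def loss_fn_l2(t_p, t_c, params, l2_lambda=1e-3):  # l2_lambda chosen purely for illustration
    squared_diffs = (t_p - t_c) ** 2
    l2_penalty = l2_lambda * (params ** 2).sum()  # grows with parameter magnitude
    return squared_diffs.mean() + l2_penalty

Because the penalty grows with the magnitude of the parameters, gradient descent now trades fit quality against parameter size, which tends to keep the model smoother.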
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)  # hold out 20% of the samples for validation
shuffled_indices = torch.randperm(n_samples)  # shuffle, since the data may come in a meaningful order
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]
train_indices, val_indices  # the values are random, so yours may differ
train_t_u = t_u[train_indices]
train_t_c = t_c[train_indices]
val_t_u = t_u[val_indices]
val_t_c = t_c[val_indices]
train_t_un = 0.1 * train_t_u
val_t_un = 0.1 * val_t_u
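A quick sanity check (mine, not from the book) that the split is disjoint and covers every sample:

assert set(train_indices.tolist()).isdisjoint(set(val_indices.tolist()))
assert len(train_indices) + len(val_indices) == n_samples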
Our training loop doesn’t really change. We just want to additionally evaluate the validation loss at every epoch, to have a chance to recognize whether we’re overfitting:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)  # these two pairs of lines are the same,
        train_loss = loss_fn(train_t_p, train_t_c)
        val_t_p = model(val_t_u, *params)  # except that they operate on the training and validation data respectively
        val_loss = loss_fn(val_t_p, val_t_c)
        optimizer.zero_grad()
        train_loss.backward()  # only train_loss is backpropagated; we never train on the validation data
        optimizer.step()
        if epoch <= 3 or epoch % 500 == 0:
            print(f"Epoch {epoch}, Training loss {train_loss.item():.4f},"
                  f" Validation loss {val_loss.item():.4f}")
    return params
params = torch.tensor([1.0, 0.0], requires_grad=True)
learning_rate = 1e-2
optimizer = optim.SGD([params], lr=learning_rate)
training_loop(
    n_epochs = 3000,
    optimizer = optimizer,
    params = params,
    train_t_u = train_t_un,  # again using the normalized inputs
    val_t_u = val_t_un,
    train_t_c = train_t_c,
    val_t_c = val_t_c)
Autograd nits and switching it off
The model is evaluated twice—once on train_t_u and once on val_t_u—and then backward is called. Won’t this confuse autograd? Won’t backward be influenced by the values generated during the pass on the validation set?
Luckily for us, this isn't the case. The first line in the training loop evaluates model on train_t_u to produce train_t_p. Then train_loss is evaluated from train_t_p. This creates a computation graph that links train_t_u to train_t_p to train_loss. When model is evaluated again on val_t_u, it produces val_t_p and val_loss. In this case, a separate computation graph is created that links val_t_u to val_t_p to val_loss. Separate tensors have been run through the same functions, model and loss_fn, generating separate computation graphs.
Since we’re not ever calling backward on val_loss, why are we building the graph in the first place? We could in fact just call model and loss_fn as plain functions, without tracking the computation. However optimized, building the autograd graph comes with additional costs that we could totally forgo during the validation pass, especially when the model has millions of parameters. In order to address this, PyTorch allows us to switch off autograd when we don’t need it, using the torch.no_grad context manager.
We won’t see any meaningful advantage in terms of speed or memory consumption on our small problem. However, for larger models, the differences can add up. We can make sure this works by checking the value of the requires_grad attribute on the val_loss tensor:
def training_loop(n_epochs, optimizer, params, train_t_u, val_t_u,
                  train_t_c, val_t_c):
    for epoch in range(1, n_epochs + 1):
        train_t_p = model(train_t_u, *params)
        train_loss = loss_fn(train_t_p, train_t_c)
        with torch.no_grad():  # no graph is built inside this block
            val_t_p = model(val_t_u, *params)
            val_loss = loss_fn(val_t_p, val_t_c)
            assert val_loss.requires_grad == False  # confirms autograd is switched off here
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
    return params
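Relatedly, PyTorch provides the torch.set_grad_enabled context manager, which conditions whether autograd is active on a Boolean flag. A small sketch (the calc_forward helper is illustrative):

def calc_forward(t_u, t_c, is_train):
    with torch.set_grad_enabled(is_train):  # builds the graph only when is_train is True
        t_p = model(t_u, *params)
        loss = loss_fn(t_p, t_c)
    return loss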
Summary
- Linear models are the simplest reasonable model to use to fit data.
- Convex optimization techniques can be used for linear models, but they do not generalize to neural networks, so we focus on stochastic gradient descent for parameter estimation.
- Deep learning can be used for generic models that are not engineered for solving a specific task, but instead can be automatically adapted to specialize themselves on the problem at hand.
- Learning algorithms amount to optimizing parameters of models based on observations. A loss function is a measure of the error in carrying out a task, such as the error between predicted outputs and measured values. The goal is to get the loss function as low as possible.
- The rate of change of the loss function with respect to the model parameters can be used to update the same parameters in the direction of decreasing loss.
- The optim module in PyTorch provides a collection of ready-to-use optimizers for updating parameters and minimizing loss functions.
- Optimizers use the autograd feature of PyTorch to compute the gradient for each parameter, depending on how that parameter contributes to the final output. This allows users to rely on the dynamic computation graph during complex forward passes.