Chapter 5: The Mechanics of Learning
A summary of Chapter 5, Part 1, of Deep Learning with PyTorch.
- Let's choose a linear model as a first try
- Next we have to define a loss function
- Down along the gradient
- Getting analytical
- Iterating to fit the model
- OVERTRAINING
- Normalizing inputs
How is it exactly that a machine learns?
In the world of machine learning and deep learning, learning can be described as follows: "when a learning algorithm is presented with input data that is paired with desired outputs, and once learning has occurred, the algorithm will be capable of producing correct outputs when it is fed new data that is similar enough to the input data it was trained on."
Let's take a historical example: how Johannes Kepler, a German mathematical astronomer (1571–1630), figured out planetary motion together with his mentor, Tycho Brahe, using naked-eye observations.
His research consisted of the following steps:
- Got lots of good data from his friend Brahe (not without some struggle)
- Tried to visualize the heck out of it, because he felt there was something fishy going on
- Chose the simplest possible model that had a chance to fit the data (an ellipse)
- Split the data so that he could work on part of it and keep an independent set for validation
- Started with a tentative eccentricity and size for the ellipse and iterated until the model fit the observations
- Validated his model on the independent observations
- Looked back in disbelief
This is exactly what we will set out to do in order to learn something from data. In fact, there is virtually no difference between saying that we’ll fit the data or that we’ll make an algorithm learn from data. The process always involves a function with a number of unknown parameters whose values are estimated from data: in short, a model.
PyTorch is designed to make it easy to create models for which the derivatives of the fitting error, with respect to the parameters, can be expressed analytically.
Now we'll learn how we can take data, choose a model, and estimate the parameters of the model so that it will give good predictions on new data.
Given input data and the corresponding desired outputs (ground truth), as well as initial values for the weights, the model is fed input data (forward pass), and a measure of the error is evaluated by comparing the resulting outputs to the ground truth. In order to optimize the parameter of the model—its weights—the change in the error following a unit change in weights (that is, the gradient of the error with respect to the parameters) is computed using the chain rule for the derivative of a composite function (backward pass). The value of the weights is then updated in the direction that leads to a decrease in the error. The procedure is repeated until the error, evaluated on unseen data, falls below an acceptable level. If what we just said sounds obscure, we’ve got a whole chapter to clear things up.
Let’s try following the same process Kepler used. Along the way, we’ll use a tool he never had available: PyTorch!
%matplotlib inline
import numpy as np
import torch
torch.set_printoptions(edgeitems=2, linewidth=75)
We will run a small experiment to establish a relationship between temperature readings in Celsius and readings in an unknown unit: the t_c values are temperatures in Celsius, and the t_u values are in our unknown units.
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c)
t_u = torch.tensor(t_u)
Let's visualize the data
import matplotlib.pyplot as plt
plt.scatter(t_u, t_c)
plt.xlabel("Measurement")
plt.ylabel("Temperature (°Celsius)")
def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
Note that we are building a tensor of differences, taking their square element-wise, and finally producing a scalar loss by averaging all of the elements in the resulting tensor: this is the mean squared error (MSE) loss.
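As a sanity check (our addition, not part of the book's walkthrough), this hand-written loss should agree with PyTorch's built-in `torch.nn.functional.mse_loss`, which also defaults to mean reduction:

```python
import torch
import torch.nn.functional as F

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c) ** 2
    return squared_diffs.mean()

t_p = 1.0 * t_u + 0.0            # model output with w=1, b=0
ours = loss_fn(t_p, t_c)         # our hand-written MSE
builtin = F.mse_loss(t_p, t_c)   # PyTorch's built-in MSE (reduction='mean')
```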
We can now initialize the parameters and invoke the model:
w = torch.ones(())
b = torch.zeros(())
t_p = model(t_u, w, b)
t_p
Let's see the value of the loss:
loss = loss_fn(t_p, t_c)
loss
Down along the gradient
The value of the loss is very high, so we'll optimize it with respect to the parameters using the gradient descent algorithm.
The idea is to compute the rate of change of the loss with respect to each parameter, and modify each parameter in the direction of decreasing loss.
delta = 0.1
loss_rate_of_change_w = \
    (loss_fn(model(t_u, w + delta, b), t_c) -
     loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)
This is saying that in the neighborhood of the current values of w and b, a unit increase in w leads to some change in the loss. If the change is negative, then we need to increase w to minimize the loss, whereas if the change is positive, we need to decrease w. By how much? Applying a change to w that is proportional to the rate of change of the loss is a good idea, especially when the loss has several parameters: we apply a change to those that exert a significant change on the loss. It is also wise to change the parameters slowly in general, because the rate of change could be dramatically different at a distance from the neighborhood of the current w value. Therefore, we typically should scale the rate of change by a small factor. This scaling factor has many names; the one we use in machine learning is learning_rate:
learning_rate = 1e-2
w = w - learning_rate * loss_rate_of_change_w
loss_rate_of_change_b = \
    (loss_fn(model(t_u, w, b + delta), t_c) -
     loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)
b = b - learning_rate * loss_rate_of_change_b
Getting analytical
Computing the rate of change by using repeated evaluations of the model and loss in order to probe the behavior of the loss function in the neighborhood of w and b doesn’t scale well to models with many parameters. Also, it is not always clear how large the neighborhood should be.
What if we could make the neighborhood infinitesimally small? That is exactly what taking the derivative of the loss with respect to the parameters does.
Recall that our model is a linear function and our loss is a mean of squares. Let's figure out the expressions for the derivatives. Starting from the derivative of the loss with respect to its input t_p:
def dloss_fn(t_p, t_c):
    dsq_diffs = 2 * (t_p - t_c) / t_p.size(0)  # the division comes from the mean in the loss
    return dsq_diffs
we get these derivatives:
def dmodel_dw(t_u, w, b):
    return t_u

def dmodel_db(t_u, w, b):
    return 1.0
Putting all of this together, the function returning the gradient of the loss with respect to w and b is
def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = dloss_fn(t_p, t_c)
    dloss_dw = dloss_dtp * dmodel_dw(t_u, w, b)
    dloss_db = dloss_dtp * dmodel_db(t_u, w, b)
    return torch.stack([dloss_dw.sum(), dloss_db.sum()])  # sum over the data points for each partial derivative
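We can sanity-check the analytic gradient against the finite-difference estimate from the previous section; the two should agree closely (this check is our addition):

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

w, b = torch.ones(()), torch.zeros(())
analytic = grad_fn(t_u, t_c, model(t_u, w, b), w, b)

# finite-difference probes of the same two rates of change
delta = 0.1
numeric_w = (loss_fn(model(t_u, w + delta, b), t_c) -
             loss_fn(model(t_u, w - delta, b), t_c)) / (2.0 * delta)
numeric_b = (loss_fn(model(t_u, w, b + delta), t_c) -
             loss_fn(model(t_u, w, b - delta), t_c)) / (2.0 * delta)
```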
The same idea can be expressed in mathematical notation. Again, we're averaging (that is, summing and dividing by a constant) over all the data points to get a single scalar quantity for each partial derivative of the loss.
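Writing the chain rule out for our linear model and MSE loss (reconstructing the math the book shows in a figure), with N data points:

$$
\frac{\partial L}{\partial w} \;=\; \frac{2}{N}\sum_{i=1}^{N}\bigl(w\,t_{u,i} + b - t_{c,i}\bigr)\,t_{u,i},
\qquad
\frac{\partial L}{\partial b} \;=\; \frac{2}{N}\sum_{i=1}^{N}\bigl(w\,t_{u,i} + b - t_{c,i}\bigr).
$$

This is exactly what grad_fn computes: dloss_fn supplies the 2(t_p - t_c)/N factor, and the sum over data points happens in the final torch.stack.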
def training_loop(n_epochs, learning_rate, params, t_u, t_c):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)  # forward pass
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)  # backward pass
        params = params - learning_rate * grad
        print('Epoch %d, Loss %f' % (epoch, float(loss)))
    return params
Iterating to fit the model
We now have everything in place to optimize our parameters. Starting from a tentative value for a parameter, we can iteratively apply updates to it for a fixed number of iterations, or until w and b stop changing. There are several stopping criteria; for now, we’ll stick to a fixed number of iterations.
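As a sketch of one such alternative (our addition, not the loop the book uses), we could stop as soon as the parameter update becomes negligible, assuming a hypothetical tolerance `tol` and a safety cap `max_epochs`:

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])
t_un = 0.1 * t_u  # normalized input (introduced later in the chapter)

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

def training_loop_tol(learning_rate, params, t_u, t_c,
                      tol=1e-4, max_epochs=100_000):
    """Iterate until w and b effectively stop changing (or max_epochs)."""
    for epoch in range(1, max_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)
        grad = grad_fn(t_u, t_c, t_p, w, b)
        new_params = params - learning_rate * grad
        if (new_params - params).abs().max() < tol:  # update is negligible
            return new_params, epoch
        params = new_params
    return params, max_epochs

params, stopped_at = training_loop_tol(1e-2, torch.tensor([1.0, 0.0]), t_un, t_c)
```

With the normalized input this terminates well before the cap; the right `tol` is problem-dependent, which is one reason a fixed epoch count is a simpler first choice.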
def training_loop(n_epochs, learning_rate, params, t_u, t_c,
                  print_params=True):
    for epoch in range(1, n_epochs + 1):
        w, b = params
        t_p = model(t_u, w, b)  # forward pass
        loss = loss_fn(t_p, t_c)
        grad = grad_fn(t_u, t_c, t_p, w, b)  # backward pass
        params = params - learning_rate * grad
        if epoch in {1, 2, 3, 10, 11, 99, 100, 4000, 5000}:  # print only at selected epochs
            print('Epoch %d, Loss %f' % (epoch, float(loss)))
            if print_params:
                print('    Params:', params)
                print('    Grad:  ', grad)
        if epoch in {4, 12, 101}:
            print('...')
        if not torch.isfinite(loss).all():
            break  # stop as soon as the loss diverges
    return params
training_loop(
    n_epochs = 100,
    learning_rate = 1e-2,
    params = torch.tensor([1.0, 0.0]),
    t_u = t_u,
    t_c = t_c)
OVERTRAINING
Wait, what happened? Our training process literally blew up, leading to losses becoming inf. This is a clear sign that params is receiving updates that are too large, and their values start oscillating back and forth as each update overshoots and the next overcorrects even more. The optimization process is unstable: it diverges instead of converging to a minimum.
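A toy 1-D example (our illustration, not the book's) makes the oscillation concrete: minimizing f(w) = w² by gradient descent with too large a step flips the sign of w on every update while growing its magnitude.

```python
w = 1.0
learning_rate = 1.1     # too large: each update multiplies w by (1 - 2*1.1) = -1.2
history = []
for _ in range(10):
    grad = 2 * w                     # f'(w) for f(w) = w**2
    w = w - learning_rate * grad     # overshoots zero and lands farther away
    history.append(w)
# w oscillates in sign while |w| grows by 1.2x per step: -1.2, 1.44, -1.728, ...
```

Any learning rate above 1.0 diverges on this function; below 1.0 the same loop converges to 0.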
How can we limit the magnitude of learning_rate * grad? Well, that looks easy. We could simply choose a smaller learning_rate:
training_loop(
    n_epochs = 100,
    learning_rate = 1e-4,
    params = torch.tensor([1.0, 0.0]),
    t_u = t_u,
    t_c = t_c)
Nice: the behavior is now stable. But there's another problem: the updates to the parameters are very small, so the loss decreases very slowly and eventually stalls. We could obviate this issue by making learning_rate adaptive, that is, changing it according to the magnitude of the updates.
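One simple way to make the step size adapt (a sketch of our own, not the scheme the chapter builds toward) is backtracking: start each epoch from the nominal learning_rate and halve it until the proposed update actually decreases the loss.

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def loss_fn(t_p, t_c):
    return ((t_p - t_c) ** 2).mean()

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

params = torch.tensor([1.0, 0.0])
for epoch in range(100):
    w, b = params
    t_p = model(t_u, w, b)
    loss = loss_fn(t_p, t_c)
    grad = grad_fn(t_u, t_c, t_p, w, b)
    lr = 1e-2
    for _ in range(30):  # halve the step until the loss decreases
        if loss_fn(model(t_u, *(params - lr * grad)), t_c) < loss:
            break
        lr *= 0.5
    params = params - lr * grad
```

Even on the unnormalized input, where a fixed 1e-2 rate blows up, this stays finite; the cost is extra loss evaluations per epoch.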
Normalizing inputs
We can see that the first-epoch gradient for the weight is about 50 times larger than the gradient for the bias. This means the weight and bias live in differently scaled spaces. If this is the case, a learning rate that’s large enough to meaningfully update one will be so large as to be unstable for the other; and a rate that’s appropriate for the other won’t be large enough to meaningfully change the first.
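We can verify the claimed mismatch directly by taking the ratio of the two gradient components at the initial parameters (this check is our addition):

```python
import torch

t_c = torch.tensor([0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0])
t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

def model(t_u, w, b):
    return w * t_u + b

def grad_fn(t_u, t_c, t_p, w, b):
    dloss_dtp = 2 * (t_p - t_c) / t_p.size(0)
    return torch.stack([(dloss_dtp * t_u).sum(), dloss_dtp.sum()])

w, b = torch.ones(()), torch.zeros(())
grad = grad_fn(t_u, t_c, model(t_u, w, b), w, b)
ratio = (grad[0] / grad[1]).item()  # weight gradient is roughly 50x the bias gradient
```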
We can make sure the range of the input doesn’t get too far from the range of –1.0 to 1.0, roughly speaking. In our case, we can achieve something close enough to that by simply multiplying t_u by 0.1:
t_un = 0.1 * t_u
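Multiplying by 0.1 is an ad-hoc fix that happens to work here. A more systematic option (our addition; the chapter keeps the 0.1 scaling) is to standardize the input to zero mean and unit variance:

```python
import torch

t_u = torch.tensor([35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4])

# standardize: subtract the mean, divide by the standard deviation
t_un = (t_u - t_u.mean()) / t_u.std()
```

Note that the w and b learned on a standardized input map back to the raw scale differently than with the simple 0.1 scaling.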
training_loop(
    n_epochs = 100,
    learning_rate = 1e-2,
    params = torch.tensor([1.0, 0.0]),
    t_u = t_un,  # use the normalized input
    t_c = t_c)
params = training_loop(
    n_epochs = 5000,
    learning_rate = 1e-2,
    params = torch.tensor([1.0, 0.0]),
    t_u = t_un,
    t_c = t_c,
    print_params = False)
params
%matplotlib inline
from matplotlib import pyplot as plt
t_p = model(t_un, *params)  # remember that we trained on the normalized input
fig = plt.figure(dpi=600)
plt.xlabel("Temperature (°Fahrenheit)")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_p.detach().numpy())  # but we plot against the raw unknown values
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.savefig("temp_unknown_plot.png", format="png")
%matplotlib inline
from matplotlib import pyplot as plt
fig = plt.figure(dpi=600)
plt.xlabel("Measurement")
plt.ylabel("Temperature (°Celsius)")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.savefig("temp_data_plot.png", format="png")