A single neuron

At the core of deep learning are neural networks: mathematical entities capable of representing complicated functions through a composition of simpler functions. The basic building block of these complicated functions is the neuron, and it is nothing but a linear transformation of the input (for example, multiplying the input by a number [the weight] and adding a constant [the bias]) followed by the application of a fixed nonlinear function (referred to as the activation function).
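
For example, a single neuron with scalar input x, weight w, bias b, and tanh as its activation can be written in a couple of lines of PyTorch (the values of w and b here are illustrative, not learned):

import torch

x = torch.tensor(3.0)            # a single scalar input
w = torch.tensor(0.7)            # the weight (an illustrative value, not a learned one)
b = torch.tensor(-0.2)           # the bias (illustrative as well)
o = torch.tanh(w * x + b)        # linear transformation followed by the activation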

A multilayer perceptron

In a similar fashion, a multilayer neural network, as represented in the figure below, is made up of a composition of functions in which the output of one layer of neurons is used as the input for the following layer.
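
As a rough sketch of that composition, a tiny two-layer network over a batch of inputs might look like this (the weights and layer sizes below are made up for illustration):

import torch

x = torch.randn(5, 1)                          # a batch of 5 samples with 1 feature each
w0, b0 = torch.randn(3, 1), torch.randn(3)     # first layer: 1 input feature -> 3 outputs
w1, b1 = torch.randn(1, 3), torch.randn(1)     # second layer: 3 inputs -> 1 output

h = torch.tanh(x @ w0.t() + b0)                # the output of the first layer of neurons...
y = h @ w1.t() + b1                            # ...is used as the input of the following layer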

Activation function

Next we need an activation function. The activation function plays two important roles:

  1. In the inner parts of the model, it allows the output function to have different slopes at different values, something a linear function by definition cannot do. By cleverly composing many of these differently sloped pieces, neural networks can approximate arbitrary functions.
  2. At the last layer of the network, it has the role of concentrating the outputs of the preceding linear operation into a given range (see the short sketch after this list).
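
For a taste of both roles, torch.tanh has a slope close to 1 near zero that flattens out for large inputs, and it squashes every value into the range (-1, 1). A minimal illustration:

import torch

t = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(torch.tanh(t))   # roughly tensor([-0.9951, -0.7616, 0.0000, 0.7616, 0.9951])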

Characteristics of a good activation function

By definition, activation functions

  1. Are nonlinear. Repeated applications of (w*x + b) without an activation function result in a function of the same (affine linear) form, as the sketch after this list shows. The nonlinearity is what allows the overall network to approximate more complex functions.
  2. Are differentiable, so that gradients can be computed through them. Isolated points where the derivative is undefined or jumps, as in Hardtanh or ReLU, are fine.
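
The first point is easy to check numerically: stacking two affine maps without an activation in between collapses into a single affine map, so the extra layer buys nothing (a small sketch with made-up numbers):

import torch

x = torch.tensor(2.0)
w1, b1 = 3.0, 1.0                              # made-up first layer
w2, b2 = -0.5, 2.0                             # made-up second layer

two_layers = w2 * (w1 * x + b1) + b2           # two stacked affine maps, no activation
collapsed = (w2 * w1) * x + (w2 * b1 + b2)     # a single affine map with the same effect
print(two_layers, collapsed)                   # identical outputs: tensor(-1.5000) twice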

Using torch.nn

So far, we have written our own code for the temperature-conversion problem we are solving. The same can be achieved with PyTorch's nn module.

PyTorch has a whole submodule dedicated to neural networks, called torch.nn. It contains the building blocks needed to create all sorts of neural network architectures. Those building blocks are called modules in PyTorch.

A PyTorch module is a Python class deriving from the nn.Module base class. A module can have one or more Parameter instances as attributes, which are tensors whose values are optimized during the training process (think w and b in our linear model). A module can also have one or more submodules (subclasses of nn.Module) as attributes, and it will be able to track their parameters as well.
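
For instance, our earlier hand-rolled linear model could be packaged as a module along these lines (a minimal sketch with a hypothetical class name; below we will just use the ready-made nn.Linear instead):

import torch
import torch.nn as nn

class MyLinear(nn.Module):                        # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(1))     # Parameters are registered on the module...
        self.b = nn.Parameter(torch.zeros(1))     # ...so their values get optimized during training

    def forward(self, x):
        return self.w * x + self.b                # think w and b in our linear model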

%matplotlib inline
import numpy as np
import torch
import torch.optim as optim

torch.set_printoptions(edgeitems=2, linewidth=75)

Starting from the same dataset as before.

t_c = [0.5,  14.0, 15.0, 28.0, 11.0,  8.0,  3.0, -4.0,  6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c).unsqueeze(1) # Add an extra dimension so the shape is [11, 1]: one sample per row.
t_u = torch.tensor(t_u).unsqueeze(1) # Same for the inputs.

t_u.shape
torch.Size([11, 1])

Shuffling the samples and splitting them into training and validation sets.

n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)

shuffled_indices = torch.randperm(n_samples)

train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]

train_indices, val_indices
(tensor([ 3,  8,  4,  9,  1,  7, 10,  2,  5]), tensor([0, 6]))
t_u_train = t_u[train_indices]
t_c_train = t_c[train_indices]

t_u_val = t_u[val_indices]
t_c_val = t_c[val_indices]

t_un_train = 0.1 * t_u_train
t_un_val = 0.1 * t_u_val

Back to our linear model. The constructor of nn.Linear accepts three arguments: the number of input features, the number of output features, and whether the linear model includes a bias (defaulting to True here):

import torch.nn as nn

linear_model = nn.Linear(1, 1) # The arguments are input size, output size, and bias defaulting to True.
linear_model(t_un_val)
tensor([[0.2873],
        [0.3206]], grad_fn=<AddmmBackward>)

We have an instance of nn.Linear with one input and one output feature. That only requires one weight and one bias:

linear_model.weight
Parameter containing:
tensor([[-0.1850]], requires_grad=True)
linear_model.bias
Parameter containing:
tensor([0.9478], requires_grad=True)

We can call the module with some input:

x = torch.ones(1)
linear_model(x)
tensor([0.7628], grad_fn=<AddBackward0>)

Any module in nn is written to produce outputs for a batch of multiple inputs at the same time. Thus, assuming we need to run nn.Linear on 10 samples, we can create an input tensor of size B × Nin, where B is the size of the batch and Nin is the number of input features, and run it once through the model. For example:

x = torch.ones(10, 1)
linear_model(x)
tensor([[0.7628],
        [0.7628],
        [0.7628],
        [0.7628],
        [0.7628],
        [0.7628],
        [0.7628],
        [0.7628],
        [0.7628],
        [0.7628]], grad_fn=<AddmmBackward>)

We replace our handmade model with nn.Linear(1, 1), and then we need to pass the linear model's parameters to the optimizer:

linear_model = nn.Linear(1, 1) # This is just a redefinition from earlier.
optimizer = optim.SGD(
    linear_model.parameters(), # This method call replaces [params].
    lr=1e-2)

Earlier, it was our responsibility to create parameters and pass them as the first argument to optim.SGD. Now we can use the parameters method to ask any nn.Module for a list of parameters owned by it or any of its submodules:

linear_model.parameters()
<generator object Module.parameters at 0x7f0a80a052d0>
list(linear_model.parameters())
[Parameter containing:
 tensor([[0.6997]], requires_grad=True), Parameter containing:
 tensor([-0.5405], requires_grad=True)]

At this point, the SGD optimizer has everything it needs. When optimizer.step() is called, it will iterate through each Parameter and change it by an amount proportional to what is stored in its grad attribute. Pretty clean design.
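
Conceptually, plain SGD without momentum performs an update roughly equivalent to the manual one we wrote earlier (a sketch of the idea, not the actual optim.SGD implementation, which also handles features like momentum):

lr = 1e-2                                # the learning rate we passed to the optimizer
with torch.no_grad():                    # parameter updates must not be tracked by autograd
    for p in linear_model.parameters():
        if p.grad is not None:
            p -= lr * p.grad             # change each parameter proportionally to its grad

With that in mind, let's take a look at the training loop: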

def training_loop(n_epochs, optimizer, model, loss_fn, t_u_train, t_u_val,
                  t_c_train, t_c_val):
    for epoch in range(1, n_epochs + 1):
        t_p_train = model(t_u_train) # The model is now passed in, instead of the individual params.
        loss_train = loss_fn(t_p_train, t_c_train) # The loss function is also passed in.

        t_p_val = model(t_u_val) # The model is evaluated on the validation data as well.
        loss_val = loss_fn(t_p_val, t_c_val)

        optimizer.zero_grad()
        loss_train.backward() # We backpropagate only on the training loss; the validation loss is just monitored.
        optimizer.step()

        if epoch == 1 or epoch % 1000 == 0:
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}")
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()

linear_model = nn.Linear(1, 1) # Start from a freshly initialized model.
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs = 3000, 
    optimizer = optimizer,
    model = linear_model,
    loss_fn = loss_fn,
    t_u_train = t_un_train,
    t_u_val = t_un_val, 
    t_c_train = t_c_train,
    t_c_val = t_c_val)

print()
print(linear_model.weight)
print(linear_model.bias)
Epoch 1, Training loss 123.8770, Validation loss 2.0055
Epoch 1000, Training loss 4.2174, Validation loss 4.2122
Epoch 2000, Training loss 2.9401, Validation loss 3.0071
Epoch 3000, Training loss 2.8596, Validation loss 3.2937

Parameter containing:
tensor([[5.4096]], requires_grad=True)
Parameter containing:
tensor([-17.6019], requires_grad=True)

There’s one last bit that we can leverage from torch.nn: the loss. Indeed, nn comes with several common loss functions, among them nn.MSELoss (MSE stands for Mean Squared Error), which is exactly what we defined earlier as our loss_fn. Loss functions in nn are still subclasses of nn.Module, so we will create an instance and call it as a function.
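
As a quick sanity check, calling an nn.MSELoss instance (which defaults to reduction='mean') on the same tensors gives the same value as our handwritten loss_fn; the exact numbers depend on the current model parameters:

mse_loss = nn.MSELoss()                     # averages the squared differences, like loss_fn
t_p = linear_model(t_un_train)
print(mse_loss(t_p, t_c_train))             # matches the handwritten version below
print(((t_p - t_c_train) ** 2).mean())

In our case, we get rid of the handwritten loss_fn and replace it: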

linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)

training_loop(
    n_epochs = 3000, 
    optimizer = optimizer,
    model = linear_model,
    loss_fn = nn.MSELoss(), # We are no longer using our handwritten loss function from earlier.
    t_u_train = t_un_train,
    t_u_val = t_un_val, 
    t_c_train = t_c_train,
    t_c_val = t_c_val)

print()
print(linear_model.weight)
print(linear_model.bias)
Epoch 1, Training loss 103.8156, Validation loss 4.9917
Epoch 1000, Training loss 4.4250, Validation loss 4.5324
Epoch 2000, Training loss 2.9532, Validation loss 2.9918
Epoch 3000, Training loss 2.8605, Validation loss 3.2838

Parameter containing:
tensor([[5.4062]], requires_grad=True)
Parameter containing:
tensor([-17.5817], requires_grad=True)

Replacing the linear model

We are going to keep everything else fixed, including the loss function, and only redefine model. Let’s build the simplest possible neural network: a linear module, followed by an activation function, feeding into another linear module. The first linear + activation layer is commonly referred to as a hidden layer for historical reasons, since its outputs are not observed directly but fed into the output layer. While the input and output of the model are both of size 1 (they have one input and one output feature), the size of the output of the first linear module is usually larger than 1.

seq_model = nn.Sequential(
            nn.Linear(1, 13), # We chose 13 arbitrarily. We wanted a number that was a different size from the other tensor shapes we have floating around.
            nn.Tanh(),
            nn.Linear(13, 1)) # This 13 must match the first size, however.
seq_model
Sequential(
  (0): Linear(in_features=1, out_features=13, bias=True)
  (1): Tanh()
  (2): Linear(in_features=13, out_features=1, bias=True)
)

Inspecting the parameters

[param.shape for param in seq_model.parameters()]
[torch.Size([13, 1]), torch.Size([13]), torch.Size([1, 13]), torch.Size([1])]
for name, param in seq_model.named_parameters():
    print(name, param.shape)
0.weight torch.Size([13, 1])
0.bias torch.Size([13])
2.weight torch.Size([1, 13])
2.bias torch.Size([1])

The name of each module in Sequential is just the ordinal with which the module appears in the arguments. Interestingly, Sequential also accepts an OrderedDict, in which we can name each module passed to it:

from collections import OrderedDict

seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, 8)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(8, 1))
]))

seq_model
Sequential(
  (hidden_linear): Linear(in_features=1, out_features=8, bias=True)
  (hidden_activation): Tanh()
  (output_linear): Linear(in_features=8, out_features=1, bias=True)
)
for name, param in seq_model.named_parameters():
    print(name, param.shape)
hidden_linear.weight torch.Size([8, 1])
hidden_linear.bias torch.Size([8])
output_linear.weight torch.Size([1, 8])
output_linear.bias torch.Size([1])
seq_model.output_linear.bias
Parameter containing:
tensor([0.2155], requires_grad=True)

Training the full model

optimizer = optim.SGD(seq_model.parameters(), lr=1e-3) # We’ve dropped the learning rate a bit to help with stability.

training_loop(
    n_epochs = 5000, 
    optimizer = optimizer,
    model = seq_model,
    loss_fn = nn.MSELoss(),
    t_u_train = t_un_train,
    t_u_val = t_un_val, 
    t_c_train = t_c_train,
    t_c_val = t_c_val)
    
print('output', seq_model(t_un_val))
print('answer', t_c_val)
print('hidden', seq_model.hidden_linear.weight.grad)
Epoch 1, Training loss 233.0291, Validation loss 5.0650
Epoch 1000, Training loss 12.8599, Validation loss 3.4229
Epoch 2000, Training loss 5.4017, Validation loss 3.3166
Epoch 3000, Training loss 2.8311, Validation loss 3.5456
Epoch 4000, Training loss 1.9166, Validation loss 3.8386
Epoch 5000, Training loss 1.6425, Validation loss 4.1671
output tensor([[ 0.6594],
        [-0.0639]], grad_fn=<AddmmBackward>)
answer tensor([[0.5000],
        [3.0000]])
hidden tensor([[-7.8404],
        [-6.5679],
        [ 5.8290],
        [-0.1445],
        [-8.3380],
        [ 0.1884],
        [-8.5640],
        [ 0.2731]])

Comparing to the linear model

We can also evaluate the model on all of the data and see how it differs from a line:

from matplotlib import pyplot as plt

t_range = torch.arange(20., 90.).unsqueeze(1)

fig = plt.figure(dpi=600)
plt.xlabel("Fahrenheit")
plt.ylabel("Celsius")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.plot(t_range.numpy(), seq_model(0.1 * t_range).detach().numpy(), 'c-')
plt.plot(t_u.numpy(), seq_model(0.1 * t_u).detach().numpy(), 'kx')
[<matplotlib.lines.Line2D at 0x7f0a809ae790>]

Finally, let's try a wider hidden layer of 20 neurons, trained with a smaller learning rate, and compare the resulting fit:

neuron_count = 20

seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, neuron_count)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(neuron_count, 1))
]))

optimizer = optim.SGD(seq_model.parameters(), lr=1e-4)

training_loop(
    n_epochs = 5000, 
    optimizer = optimizer,
    model = seq_model,
    loss_fn = nn.MSELoss(),
    t_u_train = t_un_train,
    t_u_val = t_un_val, 
    t_c_train = t_c_train,
    t_c_val = t_c_val)

from matplotlib import pyplot as plt

t_range = torch.arange(20., 90.).unsqueeze(1)

fig = plt.figure(dpi=150)
plt.xlabel("Fahrenheit")
plt.ylabel("Celsius")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.plot(t_range.numpy(), seq_model(0.1 * t_range).detach().numpy(), 'c-')
plt.plot(t_u.numpy(), seq_model(0.1 * t_u).detach().numpy(), 'kx')
Epoch 1, Training loss 231.6852, Validation loss 4.3927
Epoch 1000, Training loss 52.6475, Validation loss 80.7551
Epoch 2000, Training loss 32.5132, Validation loss 39.5908
Epoch 3000, Training loss 19.7121, Validation loss 17.3253
Epoch 4000, Training loss 12.1086, Validation loss 6.7723
Epoch 5000, Training loss 8.0248, Validation loss 3.3610
[<matplotlib.lines.Line2D at 0x7f0a7fc35c50>]