Chapter 6 Artificial neurons
A summary of chapter 6 of Deep Learning with PyTorch
- A single neuron
- A Multilayer perceptron
- Activation function
- Characteristics of a good activation function
- Using torch.nn
A single neuron
At the core of deep learning are neural networks: mathematical entities capable of representing complicated functions through a composition of simpler functions. The basic building block of these complicated functions is the neuron, and it is nothing but a linear transformation of the input (for example, multiplying the input by a number [the weight] and adding a constant [the bias]) followed by the application of a fixed nonlinear function (referred to as the activation function).
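In PyTorch terms, a single neuron can be sketched directly with tensors. The weight of 0.5, the bias of -1.0, and the tanh nonlinearity below are arbitrary choices, purely for illustration:
import torch

x = torch.tensor([1.0, 2.0, 3.0])   # three scalar inputs
w = torch.tensor(0.5)               # the weight (illustrative value)
b = torch.tensor(-1.0)              # the bias (illustrative value)
o = torch.tanh(w * x + b)           # a linear transformation followed by a fixed nonlinearity
o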
A Multilayer perceptron
In a similar fashion, a multilayer neural network is made up of a composition of functions, where the output of one layer of neurons is used as the input to the following layer.
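As a rough sketch with made-up sizes and random weights, composing two such layers just means feeding the output of the first into the second:
import torch

x = torch.randn(1, 3)                        # one sample with 3 input features
w1, b1 = torch.randn(3, 4), torch.randn(4)   # first layer: 3 features in, 4 out (random, for illustration)
w2, b2 = torch.randn(4, 1), torch.randn(1)   # second layer: 4 features in, 1 out
h = torch.tanh(x @ w1 + b1)                  # the output of the first layer of neurons...
y = h @ w2 + b2                              # ...is used as the input for the following layer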
Activation function
Next we need an activation function. The activation function plays two important roles:
- In the inner parts of the model, it allows the output function to have different slopes at different values—something a linear function by definition cannot do. By trickily composing these differently sloped parts for many outputs, neural networks can approximate arbitrary functions.
- At the last layer of the network, it has the role of concentrating the outputs of the preceding linear operation into a given range.
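For example, torch.tanh (used here only as an illustration) squashes any input into the range (-1, 1), with a steep slope near zero and an almost flat one far from it:
import torch

t = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
torch.tanh(t)   # every output lands in (-1, 1); the slope differs across the input range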
Characteristics of a good activation function
By definition, activation functions
- Are nonlinear. Repeated applications of (w*x + b) without an activation function result in a function of the same (affine linear) form. The nonlinearity allows the overall network to approximate more complex functions.
- Are differentiable, so that gradients can be computed through them. Point discontinuities in the derivative, as in Hardtanh or ReLU, are fine.
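A quick sketch showing that autograd happily computes gradients through such a kinked function, here torch.relu:
import torch

x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)
torch.relu(x).sum().backward()
x.grad   # 0.0 where the input is negative, 1.0 where it is positive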
Using torch.nn
So far we have written our own code for the temperature-conversion problem we are working on. The same can be achieved with PyTorch's nn module.
PyTorch has a whole submodule dedicated to neural networks, called torch.nn. It contains the building blocks needed to create all sorts of neural network architectures. Those building blocks are called modules in PyTorch.
A PyTorch module is a Python class deriving from the nn.Module base class. A module can have one or more Parameter instances as attributes, which are tensors whose values are optimized during the training process (think w and b in our linear model). A module can also have one or more submodules (subclasses of nn.Module) as attributes, and it will be able to track their parameters as well.
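A minimal sketch of such a module (a hypothetical SimpleLinear, written here for illustration and not used later) could look like this:
import torch
import torch.nn as nn

class SimpleLinear(nn.Module):                   # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.randn(1))    # registered as a Parameter, so it is tracked and optimized
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.w * x + self.b               # the familiar linear model

list(SimpleLinear().parameters())                # the module reports its parameters automatically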
%matplotlib inline
import numpy as np
import torch
import torch.optim as optim
torch.set_printoptions(edgeitems=2, linewidth=75)
We reuse the same dataset as before.
t_c = [0.5, 14.0, 15.0, 28.0, 11.0, 8.0, 3.0, -4.0, 6.0, 13.0, 21.0]
t_u = [35.7, 55.9, 58.2, 81.9, 56.3, 48.9, 33.9, 21.8, 48.4, 60.4, 68.4]
t_c = torch.tensor(t_c).unsqueeze(1) # unsqueeze adds an extra dimension at axis 1, turning each 1D vector into a B × 1 tensor
t_u = torch.tensor(t_u).unsqueeze(1)
t_u.shape
Shuffling the samples and splitting them into training and validation sets.
n_samples = t_u.shape[0]
n_val = int(0.2 * n_samples)
shuffled_indices = torch.randperm(n_samples)
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]
train_indices, val_indices
t_u_train = t_u[train_indices]
t_c_train = t_c[train_indices]
t_u_val = t_u[val_indices]
t_c_val = t_c[val_indices]
t_un_train = 0.1 * t_u_train
t_un_val = 0.1 * t_u_val
Back to our linear model. The constructor to nn.Linear accepts three arguments: the number of input features, the number of output features, and whether the linear model includes a bias or not (defaulting to True, here):
import torch.nn as nn
linear_model = nn.Linear(1, 1) # The arguments are input size, output size, and bias defaulting to True.
linear_model(t_un_val)
We have an instance of nn.Linear with one input and one output feature. That only requires one weight and one bias:
linear_model.weight
linear_model.bias
We can call the module with some input:
x = torch.ones(1)
linear_model(x)
Any module in nn is written to produce outputs for a batch of multiple inputs at the same time. Thus, assuming we need to run nn.Linear on 10 samples, we can create an input tensor of size B × Nin, where B is the size of the batch and Nin is the number of input features, and run it once through the model. For example:
x = torch.ones(10, 1)
linear_model(x)
We replace our handmade model with nn.Linear(1, 1), and then we need to pass the linear model parameters to the optimizer:
linear_model = nn.Linear(1, 1) # This is just a redefinition from earlier.
optimizer = optim.SGD(
    linear_model.parameters(), # This method call replaces [params].
    lr=1e-2)
Earlier, it was our responsibility to create parameters and pass them as the first argument to optim.SGD. Now we can use the parameters method to ask any nn.Module for a list of parameters owned by it or any of its submodules:
linear_model.parameters()
list(linear_model.parameters())
At this point, the SGD optimizer has everything it needs. When optimizer.step() is called, it will iterate through each Parameter and change it by an amount proportional to what is stored in its grad attribute. Pretty clean design. Let's take a look at the training loop now:
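First, though, here is a rough sketch of what that step amounts to for plain SGD (momentum and other options are ignored; the check on .grad is only there to keep the sketch runnable before any backward pass):
learning_rate = 1e-2
with torch.no_grad():                      # parameter updates must not be tracked by autograd
    for p in linear_model.parameters():
        if p.grad is not None:             # grads exist only after a backward pass
            p -= learning_rate * p.grad    # move each Parameter against its gradient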
def training_loop(n_epochs, optimizer, model, loss_fn, t_u_train, t_u_val,
                  t_c_train, t_c_val):
    for epoch in range(1, n_epochs + 1):
        t_p_train = model(t_u_train) # The model is now passed in, instead of the individual params.
        loss_train = loss_fn(t_p_train, t_c_train) # The loss function is also passed in.

        t_p_val = model(t_u_val) # The same model is evaluated on the validation data.
        loss_val = loss_fn(t_p_val, t_c_val)

        optimizer.zero_grad()
        loss_train.backward() # Only the training loss is ever backpropagated.
        optimizer.step()

        if epoch == 1 or epoch % 1000 == 0:
            print(f"Epoch {epoch}, Training loss {loss_train.item():.4f},"
                  f" Validation loss {loss_val.item():.4f}")
def loss_fn(t_p, t_c):
    squared_diffs = (t_p - t_c)**2
    return squared_diffs.mean()
linear_model = nn.Linear(1, 1) # A fresh instance of the model, so training starts from scratch.
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)
training_loop(
    n_epochs = 3000,
    optimizer = optimizer,
    model = linear_model,
    loss_fn = loss_fn,
    t_u_train = t_un_train,
    t_u_val = t_un_val,
    t_c_train = t_c_train,
    t_c_val = t_c_val)
print()
print(linear_model.weight)
print(linear_model.bias)
There’s one last bit that we can leverage from torch.nn: the loss. Indeed, nn comes with several common loss functions, among them nn.MSELoss (MSE stands for Mean Square Error), which is exactly what we defined earlier as our loss_fn. Loss functions in nn are still subclasses of nn.Module, so we will create an instance and call it as a function. In our case, we get rid of the handwritten loss_fn and replace it:
linear_model = nn.Linear(1, 1)
optimizer = optim.SGD(linear_model.parameters(), lr=1e-2)
training_loop(
    n_epochs = 3000,
    optimizer = optimizer,
    model = linear_model,
    loss_fn = nn.MSELoss(), # We are no longer using our handwritten loss function from earlier.
    t_u_train = t_un_train,
    t_u_val = t_un_val,
    t_c_train = t_c_train,
    t_c_val = t_c_val)
print()
print(linear_model.weight)
print(linear_model.bias)
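As a quick sanity check (illustrative only, reusing the tensors and the freshly trained model from above), the module-based loss gives the same value as the handwritten mean of squared differences:
t_p = linear_model(t_un_train)
nn.MSELoss()(t_p, t_c_train), ((t_p - t_c_train) ** 2).mean()   # both report the same training loss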
Replacing the linear model
We are going to keep everything else fixed, including the loss function, and only redefine model. Let’s build the simplest possible neural network: a linear module, followed by an activation function, feeding into another linear module. The first linear + activation layer is commonly referred to as a hidden layer for historical reasons, since its outputs are not observed directly but fed into the output layer. While the input and output of the model are both of size 1 (they have one input and one output feature), the size of the output of the first linear module is usually larger than 1.
seq_model = nn.Sequential(
    nn.Linear(1, 13), # We chose 13 arbitrarily. We wanted a number that was a different size from the other tensor shapes we have floating around.
    nn.Tanh(),
    nn.Linear(13, 1)) # This 13 must match the first size, however.
seq_model
Inspecting the parameters
[param.shape for param in seq_model.parameters()]
for name, param in seq_model.named_parameters():
    print(name, param.shape)
The name of each module in Sequential is just the ordinal with which the module appears in the arguments. Interestingly, Sequential also accepts an OrderedDict, in which we can name each module passed to Sequential:
from collections import OrderedDict
seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, 8)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(8, 1))
]))
seq_model
for name, param in seq_model.named_parameters():
    print(name, param.shape)
seq_model.output_linear.bias
Training the neural network
optimizer = optim.SGD(seq_model.parameters(), lr=1e-3) # We’ve dropped the learning rate a bit to help with stability.
training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    model = seq_model,
    loss_fn = nn.MSELoss(),
    t_u_train = t_un_train,
    t_u_val = t_un_val,
    t_c_train = t_c_train,
    t_c_val = t_c_val)
print('output', seq_model(t_un_val))
print('answer', t_c_val)
print('hidden', seq_model.hidden_linear.weight.grad)
from matplotlib import pyplot as plt
t_range = torch.arange(20., 90.).unsqueeze(1)
fig = plt.figure(dpi=600)
plt.xlabel("Fahrenheit")
plt.ylabel("Celsius")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.plot(t_range.numpy(), seq_model(0.1 * t_range).detach().numpy(), 'c-')
plt.plot(t_u.numpy(), seq_model(0.1 * t_u).detach().numpy(), 'kx')
neuron_count = 20
seq_model = nn.Sequential(OrderedDict([
    ('hidden_linear', nn.Linear(1, neuron_count)),
    ('hidden_activation', nn.Tanh()),
    ('output_linear', nn.Linear(neuron_count, 1))
]))
optimizer = optim.SGD(seq_model.parameters(), lr=1e-4)
training_loop(
    n_epochs = 5000,
    optimizer = optimizer,
    model = seq_model,
    loss_fn = nn.MSELoss(),
    t_u_train = t_un_train,
    t_u_val = t_un_val,
    t_c_train = t_c_train,
    t_c_val = t_c_val)
from matplotlib import pyplot as plt
t_range = torch.arange(20., 90.).unsqueeze(1)
fig = plt.figure(dpi=150)
plt.xlabel("Fahrenheit")
plt.ylabel("Celsius")
plt.plot(t_u.numpy(), t_c.numpy(), 'o')
plt.plot(t_range.numpy(), seq_model(0.1 * t_range).detach().numpy(), 'c-')
plt.plot(t_u.numpy(), seq_model(0.1 * t_u).detach().numpy(), 'kx')