Fine-tuning your first Transformer!
Exploring the Hugging Face Transformers library in the MLT workshop, part 2
- Fine-tuning your first Transformer!
- Setup
- The dataset
- From Datasets to DataFrames and back
- Filtering for a product category
- Mapping the labels
- From text to tokens
- Loading a pretrained model
- Creating a Trainer
- Evaluating cross-lingual transfer
- Using your fine-tuned model
In this notebook we'll take a look at fine-tuning a multilingual Transformer model called XLM-RoBERTa for text classification. By the end of this notebook you should know how to:
- Load and process a dataset from the Hugging Face Hub
- Create a baseline with the zero-shot classification pipeline
- Fine-tune and evaluate a pretrained model on your data
- Push a model to the Hugging Face Hub
Let's get started!
If you're running this notebook on Google Colab or locally, you'll need a few dependencies installed. You can install them with pip
as follows:
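A cell along these lines should cover what's used in this notebook; the exact package list is an assumption (transformers, datasets, a SentencePiece backend for the XLM-R tokenizer, the Hub client, and scikit-learn for the metric):
# Install the core dependencies for this notebook (package list is indicative, not pinned)
!pip install transformers datasets huggingface_hub sentencepiece scikit-learn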
To be able to share your model with the community, there are a few more steps to follow.
First you'll need to store your authentication token from the Hugging Face website (sign up here if you haven't already!), so execute the following cell and enter your username and password:
from huggingface_hub import notebook_login
notebook_login()
Then you need to install Git-LFS. Uncomment and execute the following cell:
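On a Debian/Ubuntu environment such as Google Colab, a cell along these lines usually does the trick (adapt the command to your platform):
# Uncomment and run to install Git-LFS on a Debian/Ubuntu system
# !apt install git-lfs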
In this notebook we'll be using the 🤗 Datasets library to load and preprocess our data. If you're new to this library, check out the video below to get some additional context:
from IPython.display import YouTubeVideo
YouTubeVideo("_BZearw7f0w", width=600, height=400)
In this tutorial we'll use the Multilingual Amazon Reviews Corpus (or MARC for short). This is a large-scale collection of Amazon product reviews in several languages: English, Japanese, German, French, Spanish, and Chinese.
We can download the dataset from the Hugging Face Hub with the 🤗 Datasets library, but first let's take a look at the available subsets (also called configs):
from datasets import get_dataset_config_names
dataset_name = "amazon_reviews_multi"
langs = get_dataset_config_names(dataset_name)
langs
Okay, we can see the language codes associated with each language, as well as an all_languages
subset which presumably concatenates all the languages together. Let's begin by downloading the English subset with the load_dataset()
function from 🤗 Datasets:
from datasets import load_dataset
marc_en = load_dataset(path=dataset_name, name="en")
marc_en
One cool feature of 🤗 Datasets is that load_dataset()
will cache the files at ~/.cache/huggingface/dataset/
, so you won't need to re-download the dataset the next time you run the notebook.
We can see that marc_en
is a DatasetDict
object which is similar to a Python dictionary, with each key corresponding to a different split. We can access an element of one of these splits as follows:
marc_en["train"][0]
This certainly looks like an Amazon product review and we can see the number of stars associated with the review, as well as some metadata like the language and product category.
We can also access several rows with a slice:
marc_en["train"][:3]
and note that now we get a list of values for each column. This is because 🤗 Datasets is based on Apache Arrow, which defines a typed columnar format that is very memory efficient.
Note that although we downloaded the dataset from the Hub, it's also possible to load datasets both locally and from custom URLs. For example, the above dataset lives at the following URL:
dataset_url = "https://amazon-reviews-ml.s3-us-west-2.amazonaws.com/json/train/dataset_en_train.json"
so we can download it manually with wget:
!wget {dataset_url}
We can then load it locally using the json
loading script:
load_dataset("json", data_files="dataset_en_train.json")
You can actually skip the manual download step entirely by pointing data_files
directly to the URL:
load_dataset("json", data_files=dataset_url)
Now that we've had a quick look at the objects in 🤗 Datasets, let's explore the data in more detail by using our favourite tool - Pandas!
🤗 Datasets is designed to be interoperable with libraries like Pandas, as well as NumPy, PyTorch, TensorFlow, and JAX. To enable the conversion between various third-party libraries, 🤗 Datasets provides a Dataset.set_format() function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format which is Apache Arrow. The formatting is done in-place, so let’s convert our dataset to Pandas and look at a random sample:
from IPython.display import display, HTML
marc_en.set_format("pandas")
df = marc_en["train"][:]
# Create a random sample
sample = df.sample(n=5, random_state=42)
display(HTML(sample.to_html()))
We can see that the column headers are the same as we saw in the Arrow format and from the reviews we can see that negative reviews are associated with a lower star rating. Since we're now dealing with a pandas.DataFrame
we can easily query our dataset. For example, let's see what the distribution of reviews per product category looks like:
df["product_category"].value_counts()
Okay, the home, wireless, and sports categories seem to be the most popular. How about the distribution of star ratings?
df["stars"].value_counts()
In this case we can see that the dataset is balanced across each star rating, which will make it somewhat easier to evaluate our models. Imbalanced datasets are much more common in the real world, and in those cases some additional tricks like up- or down-sampling are usually needed.
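As a quick illustration of down-sampling (not needed for MARC, since it's already balanced), here's a minimal Pandas sketch that trims every class to the size of the rarest one:
# Purely illustrative: downsample each star rating to the size of the rarest one
min_count = df["stars"].value_counts().min()
balanced_df = df.groupby("stars").sample(n=min_count, random_state=42)
balanced_df["stars"].value_counts()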
Now that we've got a rough idea about the kind of data we're dealing with, let's reset the output format from pandas back to arrow:
marc_en.reset_format()
Although we could go ahead and fine-tune a Transformer model on the whole set of 200,000 English reviews, this will take several hours on a single GPU. So instead, we'll focus on fine-tuning a model for a single product category! In 🤗 Datasets, we can filter data very quickly by using the Dataset.filter()
method. This method expects a function that returns Boolean values, in our case True
if the product_category
matches the chosen category and False
otherwise. Here's one way to implement this, and we'll pick the book
category as the domain to train on:
product_category = "book"

def filter_for_product(example, product_category=product_category):
    return example["product_category"] == product_category
Now when we pass filter_for_product()
to Dataset.filter()
we get a filtered dataset:
product_dataset = marc_en.filter(filter_for_product)
product_dataset
Yep, this looks good - we have 13,748 reviews in the train split, which agrees with the number we saw in the distribution of categories earlier. Let's do a quick sanity check by taking a look at a few samples. Here 🤗 Datasets provides Dataset.shuffle()
and Dataset.select()
functions that we can chain to get a random sample:
product_dataset["train"].shuffle(seed=42).select(range(3))[:]
Okay, now that we have our corpus of book reviews, let's do one last bit of data preparation: creating label mappings from star ratings to human readable strings.
During training, 🤗 Transformers expects the labels to be consecutive integers starting from 0 (i.e. 0 to N-1 for N classes). But we've seen that our star ratings range from 1-5, so let's fix that. While we're at it, we'll create a mapping between the label IDs and names, which will be handy later on when we want to run inference with our model. First we'll define the label mapping from ID to name:
label_names = ["terrible", "poor", "ok", "good", "great"]
id2label = {idx:label for idx, label in enumerate(label_names)}
id2label
We can then apply this mapping to our whole dataset by using the Dataset.map()
method. Similar to the Dataset.filter()
method, this one expects a function which receives examples as input, but returns a Python dictionary as output. The keys of the dictionary correspond to the columns, while the values correspond to the column entries. The following function creates two new columns:
- A labels column which is the star rating shifted down by one
- A label_name column which provides a nice string for each rating
def map_labels(example):
    # Shift labels to start from 0
    label_id = example["stars"] - 1
    return {"labels": label_id, "label_name": id2label[label_id]}
To apply this mapping, we simply feed it to Dataset.map
as follows:
product_dataset = product_dataset.map(map_labels)
# Peek at the first example
product_dataset["train"][0]
Great, it works! We'll also need the reverse label mapping later, so let's define it here:
label2id = {v:k for k,v in id2label.items()}
Like other machine learning models, Transformers expect their inputs in the form of numbers (not strings) and so some form of preprocessing is required. For NLP, this preprocessing step is called tokenization. Tokenization converts strings into atomic chunks called tokens, and these tokens are subsequently encoded as numerical vectors.
For more information about tokenizers, check out the following video:
YouTubeVideo("VFp38yj8h3A", width=600, height=400)
Each pretrained model comes with its own tokenizer, so to get started let's download the tokenizer of XLM-RoBERTa from the Hub:
from transformers import AutoTokenizer
model_checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
The tokenizer has a few interesting attributes such as the vocabulary size:
tokenizer.vocab_size
This tells us that XLM-R has 250,002 tokens that it can use to represent text with. Some of these are special tokens, which indicate things like the start or end of a sequence, or the mask used for masked language modeling. Here's what the special tokens look like for XLM-R:
tokenizer.special_tokens_map
When you feed strings to the tokenizer, you'll get at least two fields (some models have more, depending on how they're trained):
- input_ids: These correspond to the numerical encodings that map each token to an integer
- attention_mask: This indicates to the model which tokens (e.g. padding) should be ignored when computing self-attention
Let's see how this works with a simple example. First we encode the string:
encoded_str = tokenizer("Today I'm giving an NLP workshop at MLT")
encoded_str
and then decode the input IDs to see the mapping explicitly:
for token in encoded_str["input_ids"]:
    print(token, tokenizer.decode([token]))
So to prepare our inputs, we simply need to apply the tokenizer to each example in our corpus. As before, we'll do this with Dataset.map()
so let's write a simple function to do so:
def tokenize_reviews(examples):
    return tokenizer(examples["review_body"], truncation=True, max_length=180)
Here we've enabled truncation, so the tokenizer will cut any inputs that are longer than 180 tokens (which is the setting used in the MARC paper). With this function we can go ahead and tokenize the whole corpus:
tokenized_dataset = product_dataset.map(tokenize_reviews, batched=True)
tokenized_dataset
tokenized_dataset["train"][0]
This looks good, so now let's load a pretrained model!
Loading a pretrained model from the Hub is quite simple: just select the appropriate AutoModelForXxx
class and use the from_pretrained()
function with the model checkpoint. In our case, we're dealing with 5 classes (one for each star) so to initialise the model we'll provide this information along with the label mappings:
from transformers import AutoModelForSequenceClassification
num_labels = 5
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label)
These warnings are perfectly normal - they are telling us that the weights in the head of the network are randomly initialised and so we should fine-tune the model on a downstream task.
Now that we have a model, the next step is to initialise a Trainer
that will take care of the training loop for us. Let's do that next.
To create a Trainer, we usually need a few basic ingredients:
- A TrainingArguments class to define all the hyperparameters
- A compute_metrics function to compute metrics during evaluation
- Datasets to train and evaluate on
For more information about the Trainer
check out the following video:
YouTubeVideo("nvBXf7s7vTI", width=600, height=400)
Let's start with the TrainingArguments:
from transformers import TrainingArguments
model_name = model_checkpoint.split("/")[-1]
batch_size = 16
num_train_epochs = 2
logging_steps = len(tokenized_dataset["train"]) // (batch_size * num_train_epochs)
args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-marc-en",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=0.01,
    logging_steps=logging_steps,
    push_to_hub=True,
)
Here we've defined output_dir
to save our checkpoints and tweaked some of the default hyperparameters like the learning rate and weight decay. The push_to_hub
argument will push each checkpoint to the Hub automatically for us, so we can reuse the model at any point in the future!
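For example, once training has finished, something like the following should let you reload the fine-tuned model from anywhere; the "your-username" part is a placeholder for your actual Hub username:
from transformers import pipeline

# Hypothetical repo ID: replace "your-username" with your Hub username
classifier = pipeline("text-classification", model="your-username/xlm-roberta-base-finetuned-marc-en")
classifier("I really enjoyed reading this book!")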
Now that we've defined the hyperparameters, the next step is to define the metrics. In the MARC paper, the authors point out that one should use the mean absolute error (MAE) for star ratings because:
star ratings for each review are ordinal, and a 2-star prediction for a 5-star review should be penalized more heavily than a 4-star prediction for a 5-star review.
We'll take the same approach here and we can get the metric easily from Scikit-learn as follows:
import numpy as np
from sklearn.metrics import mean_absolute_error
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"MAE": mean_absolute_error(labels, predictions)}
With these ingredients we can now instantiate a Trainer:
from transformers import Trainer
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Note that here we've also provided the tokenizer to the Trainer: doing so will ensure that all of our examples are automatically padded to the longest example in each batch. This is needed so that the matrix operations in the forward pass of the model can be computed.
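Under the hood, this dynamic padding is handled by a data collator; when a tokenizer is passed, the Trainer uses a padding collator by default. If you prefer to make that explicit, a rough sketch would be:
from transformers import DataCollatorWithPadding

# Pads each batch on the fly to the length of its longest example
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# This could then be passed to the Trainer via data_collator=data_collator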
With our Trainer, it is then a simple matter to train the model:
trainer.train()