We will use IMDB Movie Reviews Dataset

What is BERT

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP (Natural Language Processing) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. Google is leveraging BERT to better understand user searches.

BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.

image.png

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.

The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning all pretrained parameters.

The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

Why BERT

  • Accurate
  • Can be used for wide variety of task
  • Easy to use
  • It is game changer in NLP

Additional Reading

Ref BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

https://arxiv.org/abs/1810.04805

Understanding searches better than ever before:

https://www.blog.google/products/search/search-language-understanding-bert/

Good Resource to Read More About the BERT:

http://jalammar.github.io/illustrated-bert/


What is ktrain

ktrain is a library to help build, train, debug, and deploy neural networks in the deep learning software framework, Keras.

ktrain uses tf.keras in TensorFlow instead of standalone Keras.) Inspired by the fastai library, with only a few lines of code, ktrain allows you to easily:

  • estimate an optimal learning rate for your model given your data using a learning rate finder
  • employ learning rate schedules such as the triangular learning rate policy, 1cycle policy, and SGDR to more effectively train your model
  • employ fast and easy-to-use pre-canned models for both text classification (e.g., NBSVM, fastText, GRU with pretrained word embeddings) and image classification (e.g., ResNet, Wide Residual Networks, Inception)
  • load and preprocess text and image data from a variety of formats

  • inspect data points that were misclassified to help improve your model

  • leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data

Notebook Setup

!pip install ktrain
Requirement already satisfied: ktrain in /usr/local/lib/python3.6/dist-packages (0.16.1)
Requirement already satisfied: networkx>=2.3 in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.4)
Requirement already satisfied: whoosh in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.7.4)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from ktrain) (0.15.1)
Requirement already satisfied: ipython in /usr/local/lib/python3.6/dist-packages (from ktrain) (5.5.0)
Requirement already satisfied: keras-bert>=0.81.0 in /usr/local/lib/python3.6/dist-packages (from ktrain) (0.84.0)
Requirement already satisfied: seqeval in /usr/local/lib/python3.6/dist-packages (from ktrain) (0.0.12)
Requirement already satisfied: bokeh in /usr/local/lib/python3.6/dist-packages (from ktrain) (1.4.0)
Requirement already satisfied: langdetect in /usr/local/lib/python3.6/dist-packages (from ktrain) (1.0.8)
Requirement already satisfied: syntok in /usr/local/lib/python3.6/dist-packages (from ktrain) (1.3.1)
Requirement already satisfied: tensorflow-datasets in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.1.0)
Requirement already satisfied: transformers>=2.7.0 in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.11.0)
Requirement already satisfied: pandas>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from ktrain) (1.0.4)
Requirement already satisfied: fastprogress>=0.1.21 in /usr/local/lib/python3.6/dist-packages (from ktrain) (0.2.3)
Requirement already satisfied: tensorflow==2.1.0 in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.1.0)
Requirement already satisfied: packaging in /usr/local/lib/python3.6/dist-packages (from ktrain) (20.4)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.23.0)
Requirement already satisfied: cchardet==2.1.5 in /usr/local/lib/python3.6/dist-packages (from ktrain) (2.1.5)
Requirement already satisfied: jieba in /usr/local/lib/python3.6/dist-packages (from ktrain) (0.42.1)
Requirement already satisfied: scikit-learn==0.21.3 in /usr/local/lib/python3.6/dist-packages (from ktrain) (0.21.3)
Requirement already satisfied: matplotlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from ktrain) (3.2.1)
Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.6/dist-packages (from networkx>=2.3->ktrain) (4.4.2)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (1.0.18)
Requirement already satisfied: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (4.8.0)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (4.3.3)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (0.8.1)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (47.1.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (0.7.5)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from ipython->ktrain) (2.1.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from keras-bert>=0.81.0->ktrain) (1.18.4)
Requirement already satisfied: keras-transformer>=0.37.0 in /usr/local/lib/python3.6/dist-packages (from keras-bert>=0.81.0->ktrain) (0.37.0)
Requirement already satisfied: Keras in /usr/local/lib/python3.6/dist-packages (from keras-bert>=0.81.0->ktrain) (2.3.1)
Requirement already satisfied: tornado>=4.3 in /usr/local/lib/python3.6/dist-packages (from bokeh->ktrain) (4.5.3)
Requirement already satisfied: pillow>=4.0 in /usr/local/lib/python3.6/dist-packages (from bokeh->ktrain) (7.0.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from bokeh->ktrain) (2.8.1)
Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.6/dist-packages (from bokeh->ktrain) (3.13)
Requirement already satisfied: Jinja2>=2.7 in /usr/local/lib/python3.6/dist-packages (from bokeh->ktrain) (2.11.2)
Requirement already satisfied: six>=1.5.2 in /usr/local/lib/python3.6/dist-packages (from bokeh->ktrain) (1.12.0)
Requirement already satisfied: regex in /usr/local/lib/python3.6/dist-packages (from syntok->ktrain) (2019.12.20)
Requirement already satisfied: absl-py in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (0.9.0)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (0.16.0)
Requirement already satisfied: dill in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (0.3.1.1)
Requirement already satisfied: attrs>=18.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (19.3.0)
Requirement already satisfied: tensorflow-metadata in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (0.22.1)
Requirement already satisfied: wrapt in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (1.12.1)
Requirement already satisfied: protobuf>=3.6.1 in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (3.10.0)
Requirement already satisfied: promise in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (2.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (4.41.1)
Requirement already satisfied: termcolor in /usr/local/lib/python3.6/dist-packages (from tensorflow-datasets->ktrain) (1.1.0)
Requirement already satisfied: dataclasses; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from transformers>=2.7.0->ktrain) (0.7)
Requirement already satisfied: filelock in /usr/local/lib/python3.6/dist-packages (from transformers>=2.7.0->ktrain) (3.0.12)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.6/dist-packages (from transformers>=2.7.0->ktrain) (0.0.43)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.6/dist-packages (from transformers>=2.7.0->ktrain) (0.1.91)
Requirement already satisfied: tokenizers==0.7.0 in /usr/local/lib/python3.6/dist-packages (from transformers>=2.7.0->ktrain) (0.7.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=1.0.1->ktrain) (2018.9)
Requirement already satisfied: grpcio>=1.8.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (1.29.0)
Requirement already satisfied: keras-applications>=1.0.8 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (1.0.8)
Requirement already satisfied: astor>=0.6.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (0.8.1)
Requirement already satisfied: tensorboard<2.2.0,>=2.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (2.1.1)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (0.34.2)
Requirement already satisfied: gast==0.2.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (0.2.2)
Requirement already satisfied: tensorflow-estimator<2.2.0,>=2.1.0rc0 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (2.1.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (3.2.1)
Requirement already satisfied: google-pasta>=0.1.6 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (0.2.0)
Requirement already satisfied: keras-preprocessing>=1.1.0 in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (1.1.2)
Requirement already satisfied: scipy==1.4.1; python_version >= "3" in /usr/local/lib/python3.6/dist-packages (from tensorflow==2.1.0->ktrain) (1.4.1)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging->ktrain) (2.4.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->ktrain) (2020.4.5.1)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->ktrain) (2.9)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->ktrain) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->ktrain) (3.0.4)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->ktrain) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->ktrain) (1.2.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->ktrain) (0.2.2)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect; sys_platform != "win32"->ipython->ktrain) (0.6.0)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->ipython->ktrain) (0.2.0)
Requirement already satisfied: keras-multi-head>=0.27.0 in /usr/local/lib/python3.6/dist-packages (from keras-transformer>=0.37.0->keras-bert>=0.81.0->ktrain) (0.27.0)
Requirement already satisfied: keras-pos-embd>=0.11.0 in /usr/local/lib/python3.6/dist-packages (from keras-transformer>=0.37.0->keras-bert>=0.81.0->ktrain) (0.11.0)
Requirement already satisfied: keras-position-wise-feed-forward>=0.6.0 in /usr/local/lib/python3.6/dist-packages (from keras-transformer>=0.37.0->keras-bert>=0.81.0->ktrain) (0.6.0)
Requirement already satisfied: keras-embed-sim>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from keras-transformer>=0.37.0->keras-bert>=0.81.0->ktrain) (0.7.0)
Requirement already satisfied: keras-layer-normalization>=0.14.0 in /usr/local/lib/python3.6/dist-packages (from keras-transformer>=0.37.0->keras-bert>=0.81.0->ktrain) (0.14.0)
Requirement already satisfied: h5py in /usr/local/lib/python3.6/dist-packages (from Keras->keras-bert>=0.81.0->ktrain) (2.10.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from Jinja2>=2.7->bokeh->ktrain) (1.1.1)
Requirement already satisfied: googleapis-common-protos in /usr/local/lib/python3.6/dist-packages (from tensorflow-metadata->tensorflow-datasets->ktrain) (1.51.0)
Requirement already satisfied: click in /usr/local/lib/python3.6/dist-packages (from sacremoses->transformers>=2.7.0->ktrain) (7.1.2)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (1.0.1)
Requirement already satisfied: google-auth<2,>=1.6.3 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (1.7.2)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (0.4.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.6/dist-packages (from tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (3.2.2)
Requirement already satisfied: keras-self-attention==0.46.0 in /usr/local/lib/python3.6/dist-packages (from keras-multi-head>=0.27.0->keras-transformer>=0.37.0->keras-bert>=0.81.0->ktrain) (0.46.0)
Requirement already satisfied: cachetools<3.2,>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (3.1.1)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (4.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (0.2.8)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.6/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (1.3.0)
Requirement already satisfied: importlib-metadata; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from markdown>=2.6.8->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (1.6.0)
Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.6/dist-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (3.1.0)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<2.2.0,>=2.1.0->tensorflow==2.1.0->ktrain) (3.1.0)
import tensorflow as tf
tf.__version__
'2.1.0'
!git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git
Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (10/10), done.
import pandas as pd
import numpy as np
import ktrain
from ktrain import text
import tensorflow as tf
data_train = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype = str)
data_test = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype = str)
data_train.tail()
Reviews Sentiment
24995 Everyone plays their part pretty well in this ... pos
24996 It happened with Assault on Prescient 13 in 20... neg
24997 My God. This movie was awful. I can't complain... neg
24998 When I first popped in Happy Birthday to Me, I... neg
24999 So why does this show suck? Unfortunately, tha... neg
data_test.head()
Reviews Sentiment
0 Who would have thought that a movie about a ma... pos
1 After realizing what is going on around us ...... pos
2 I grew up watching the original Disney Cindere... neg
3 David Mamet wrote the screenplay and made his ... pos
4 Admittedly, I didn't have high expectations of... neg
data_test.shape, data_train.shape
((25000, 2), (25000, 2))
(X_train, y_train), (X_test, y_test), preproc = text.texts_from_df(train_df=data_train,
                                                                   text_column = 'Reviews',
                                                                   label_columns = 'Sentiment',
                                                                   val_df = data_test,
                                                                   maxlen = 500,
                                                                   preprocess_mode = 'bert')
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en
done.
Is Multi-Label? False
preprocessing test...
language: en
done.
model = text.text_classifier(name = 'bert',
                             train_data = (X_train, y_train),
                             preproc = preproc)
Is Multi-Label? False
maxlen is 500
done.
learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),
                   val_data = (X_test, y_test),
                   batch_size = 6)
# learner.lr_find()
# learner.lr_plot()

# it may take days or many days to find out.
learner.fit_onecycle(lr = 2e-5, epochs = 1)

begin training using onecycle policy with max lr of 2e-05...
Train on 25000 samples, validate on 25000 samples
25000/25000 [==============================] - 3500s 140ms/sample - loss: 0.1361 - accuracy: 0.9517 - val_loss: 0.0487 - val_accuracy: 0.9872
<tensorflow.python.keras.callbacks.History at 0x7fce51ef4940>
predictor = ktrain.get_predictor(learner.model, preproc)
data = ['this movie was horrible, the plot was really boring. acting was okay',
        'the fild is really sucked. there is not plot and acting was bad',
        'what a beautiful movie. great plot. acting was good. will see it again']
predictor.predict(data)
['neg', 'neg', 'pos']
predictor.predict(data, return_proba=True)
array([[0.99797565, 0.00202436],
       [0.99606663, 0.00393336],
       [0.00292433, 0.9970757 ]], dtype=float32)
predictor.get_classes()
['neg', 'pos']
predictor.save('/content/bert')
!zip -r /content/bert.zip /content/bert
  adding: content/bert/ (stored 0%)
  adding: content/bert/tf_model.h5 (deflated 11%)
  adding: content/bert/tf_model.preproc (deflated 52%)
predictor_load = ktrain.load_predictor('/content/bert')
predictor_load.get_classes()
['neg', 'pos']
predictor_load.predict(data)
['neg', 'neg', 'pos']