A Study on CoVe, Context2Vec, ELMo, ULMFiT and BERT

I did some research on some of the revolutionary models that had a powerful impact on Natural Language Processing (NLP) and Natural Language Understanding (NLU) and on some of their challenging tasks, including Question Answering, Sentiment Analysis, and Text Entailment. These models aim to build a better understanding of language using the Transfer Learning technique. My research focused on understanding their architectures and their results on NLP tasks. In this post, I explain what I learned during my studies and include some personal notes drawn from my own understanding.

Before going into the models, there are some terms that you need to be aware of to help you understand these models.

Vector Space Models (VSMs): A conceptual term in NLP. Words are represented as a set of vectors, and these vectors become the new identifiers of the words. The vectors are used in mathematical and statistical models for classification and regression tasks. They should also be unique so that the words can be distinguished from each other when they are passed to other models.

Word Embedding: A technique to represent words by fixed-size vectors so that words with similar or close meanings have close vectors (i.e., vectors whose Euclidean distance is small). In other words, the representation captures the semantic meaning of the words, which is very important in many different areas and tasks of NLP, such as Language Modeling and Machine Translation.

Sentence Embedding: A technique to represent a whole sentence by a fixed-size vector so that sentences with similar or close meanings have close vectors. Just like Word Embedding, but applied to the sentence.

Language Model: A core problem in NLP that aims to fit a statistical distribution over sentences so that we can predict a word given the previous one(s). The model can operate at the word level or the character level.

Transfer Learning: An advanced technique in Machine Learning, mainly Deep Learning, that aims to store the knowledge obtained on a specific task and then reuse that stored knowledge on another, similar task. The knowledge can be fine-tuned to suit the currently running task.

Multi-Task Learning: An advanced technique in Machine Learning, mainly Deep Learning, that aims to solve several subtasks at the same time so that the model gains knowledge built from all the tasks. This helps in getting a generalized representation of text, because the representation holds information from several tasks.

Domain Adaptation: A subfield of Transfer Learning that aims to get a model trained on a specific data distribution (a source task) to work on another data distribution (a target task). The model tries to build corresponding features between the two tasks.

Context Vectors (CoVe)

Paper Link: https://arxiv.org/pdf/1708.00107.pdf

Context Vectors (CoVe) are vectors learned on top of the original word vectors, which could be GloVe, Word2Vec, or FastText vectors. CoVe is obtained from the encoder of a specific task; in our case, it is trained on a Machine Translation task using a two-layer bidirectional Long Short-Term Memory (LSTM) network. In the Machine Translation task, we have a source language Ls = {ws0, ws1, …, wsn} and a target language Lt = {wt0, wt1, …, wtm}, and we aim to translate a given sentence from the source language to the target language. Neural Machine Translation (NMT) is considered a good resource for learning more complex representations: to be able to translate, the model has to learn the complex relations between words, so it builds an advanced representation of the full sentence. We can therefore expect the encoder of the NMT model to preserve much more complex semantic relations between words. Using the idea of Transfer Learning, we can do the following

  • Train a Sequence-to-Sequence (seq2seq) model on the Machine Translation task
  • Take the encoder of the seq2seq model and use it as a pre-trained layer in any other task

Let’s consider the following OpenNMT architecture (an encoder-decoder seq2seq model with attention) that performs German-to-English translation


The encoder learns a hidden representation of the German sentence. After training this model, we use the encoder as a pre-trained layer in any other classification or generation task. Formally, if we have a sentence S = {w0, w1, …, wm}, we first get the word representation of each word using GloVe, then apply the MT-LSTM to them: CoVe = MT-LSTM(GloVe(S))
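
As a minimal sketch of this composition (all names, dimensions, and the "encoder" here are made-up stand-ins, not the real GloVe table or MT-LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a tiny GloVe lookup table and a frozen
# "MT-LSTM" encoder reduced to a single linear map for illustration.
glove = {w: rng.standard_normal(4) for w in ["the", "cat", "sat"]}
W_enc = rng.standard_normal((4, 6))  # pretend pre-trained encoder weights

def glove_embed(sentence):
    # Replace each word by its GloVe vector -> (seq_len, 4)
    return np.stack([glove[w] for w in sentence])

def mt_lstm(X):
    # Stand-in for the pre-trained MT-LSTM encoder -> (seq_len, 6)
    return np.tanh(X @ W_enc)

def cove(sentence):
    # CoVe = MT-LSTM(GloVe(S))
    return mt_lstm(glove_embed(sentence))

vectors = cove(["the", "cat", "sat"])
print(vectors.shape)  # (3, 6): one context vector per word
```

The point is only the shape of the pipeline: a frozen lookup followed by a frozen encoder, producing one contextualized vector per word.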

The CoVe idea comes from CNNs pre-trained on the ImageNet classification problem: we take the pre-trained model and use it to extract feature representations of images.

To test the quality of the representation, the CoVe paper presented a novel architecture called the Bi-attentive Classification Network (BCN), which can be used for any classification problem with one input (like Sentiment Analysis) or two inputs (like Paraphrase Detection). The network takes the input after applying the MT-LSTM representation. However, you don’t need to use it for your task; you can simply use the encoder representation as an initialization layer in your own model.

The BCN model performs the following steps

  • It takes two inputs (two sentences); for tasks that have only one input, the input is duplicated
  • Replace the words of each sentence with pre-trained GloVe vectors
  • Apply the pre-trained MT-LSTM to the word vectors to get the context vectors
  • Layer 1: Each of the context vectors is fed to a feedforward network with a ReLU activation function
  • Layer 2: The output of each network is passed to a simple biLSTM encoder. The outputs of the feedforward networks, X and Y, are stacked to form matrices that are passed to the biLSTM encoder
  • Layer 3: A bi-attention layer that creates an affinity matrix A = XYᵀ from the stacked representations of the two sentences. It is not a trainable layer. Its goal is to obtain a similarity matrix between the two inputs by applying a softmax normalization to the columns of the affinity matrix: Ax = softmax(A) and Ay = softmax(Aᵀ)
  • Layer 3 (cont.): Compute the context summaries Cx = Axᵀ X and Cy = Ayᵀ Y
  • Layer 4: A one-layer biLSTM that acts as an integration layer. It concatenates the feedforward network outputs, X and Y, with the context summaries after applying subtraction (to get the difference from the original features) and the element-wise product (to amplify the original features): X|y = biLSTM([X; X − Cy; X ⊙ Cy]) and Y|x = biLSTM([Y; Y − Cx; Y ⊙ Cx])
  • Layer 5: A pooling layer that operates along the time dimension
  • Layer 6: The output of the pooling layer is passed to a three-layer maxout network with batch-normalization layers
  • Layer 7: A softmax or sigmoid layer to get the probability distribution over classes
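
The bi-attention arithmetic of Layers 3 and 4 can be sketched in a few lines of numpy; the matrices X and Y below are just random stand-ins for the encoder outputs, and the sizes are made up:

```python
import numpy as np

def softmax(M, axis):
    # Numerically stable softmax along the given axis
    e = np.exp(M - M.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
# Hypothetical encoder outputs for the two inputs: (time, features)
X = rng.standard_normal((5, 8))
Y = rng.standard_normal((7, 8))

# Layer 3: affinity matrix and column-wise softmax normalizations
A = X @ Y.T                    # (5, 7)
Ax = softmax(A, axis=0)        # distribution over X positions
Ay = softmax(A.T, axis=0)      # distribution over Y positions

# Layer 3 (cont.): context summaries
Cx = Ax.T @ X                  # (7, 8) summary of X for each Y step
Cy = Ay.T @ Y                  # (5, 8) summary of Y for each X step

# Layer 4 input: originals, differences, and element-wise products
X_given_y = np.concatenate([X, X - Cy, X * Cy], axis=1)  # (5, 24)
Y_given_x = np.concatenate([Y, Y - Cx, Y * Cx], axis=1)  # (7, 24)
print(X_given_y.shape, Y_given_x.shape)
```

In the real network these concatenated matrices are then fed through the integration biLSTM; here the sketch stops at building its input.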

They used the BCN on several downstream NLP classification tasks. The following are some recorded F1 scores of the model on the test data, compared with other SOTA models

Personal Notes

  • You don’t have to use the BCN network for your task. Just train the CoVe encoder layer on a large dataset, save the encoder weights, and finally prepend the layer to your custom training model
  • You can either freeze the layer or fine-tune it during your training process. From my experience, I prefer to fine-tune, so that the model changes its parameters slightly to suit my task
  • Use the FastText representation instead of GloVe if you are going to work on a task that requires some differentiation at the character level rather than word semantics only, such as word vectors for entities

Context to Embeddings (Context2Vec)

Paper Link: https://www.aclweb.org/anthology/K16-1006

Assume a case where we have a sentence like “I can’t find April.” The word April may refer to a month or to a person’s name. We use the words surrounding it (the context) to help us determine the most suitable option. This problem is known as the Word Sense Disambiguation task, in which we investigate the actual sense of a word based on several semantic and linguistic techniques. The Context2Vec idea is taken from the original CBOW Word2Vec model, but instead of relying on averaging the embeddings of the words, it relies on a much more complex parametric model based on one layer of biLSTM. The following figure shows the architecture of the CBOW model; you can also check this post for a full explanation of it.

CBOW applies an average function to the context words of the target word, Avg(Embedding(John), Embedding(a), Embedding(paper)), to infer the contextual target word embedding. The objective function computes the loss between the averaged context embedding and the target word embedding.
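
That averaging step is tiny; here it is as a sketch with a made-up embedding table (the words match the example above, the vectors are random):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical embedding table for a toy vocabulary
emb = {w: rng.standard_normal(5) for w in ["John", "published", "a", "paper"]}

def cbow_context_vector(context_words):
    # CBOW simply averages the context word embeddings
    return np.mean([emb[w] for w in context_words], axis=0)

# Infer a contextual vector for the target "published" from its window
ctx = cbow_context_vector(["John", "a", "paper"])
print(ctx.shape)  # (5,)
```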

Context2Vec applies the same windowing idea, but instead of a simple average function, it applies three stages that learn complex parametric networks

  • A biLSTM layer that produces left-to-right and right-to-left representations
  • A feedforward network that takes the concatenated hidden representations and produces a hidden representation by learning the network parameters
  • Finally, the objective function is applied to the network output
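
The first two stages can be sketched as follows; the directional hidden states are random stand-ins for the real biLSTM outputs, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6  # hidden size of each LSTM direction (illustrative)

# Stand-ins for the two directional hidden states at the target position;
# in the real model they come from left-to-right and right-to-left LSTMs
h_l2r = rng.standard_normal(d)
h_r2l = rng.standard_normal(d)

# Stage 2: a feedforward layer over the concatenated representations
W1 = rng.standard_normal((2 * d, d))
b1 = np.zeros(d)

def context2vec(h_forward, h_backward):
    h = np.concatenate([h_forward, h_backward])  # (2d,)
    return np.maximum(0.0, h @ W1 + b1)          # ReLU hidden representation

c = context2vec(h_l2r, h_r2l)
print(c.shape)  # (6,)
```

Stage 3 would then compare this context vector against target word embeddings under the objective function.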

They used the Word2Vec negative sampling idea to get better performance while calculating the loss value. You can refer to this post to see how it works.

The following are some samples of the closest words to a given context

Personal Notes

  • Context2Vec is very similar to the Doc2Vec model, but instead of using a regular projection layer, Context2Vec uses biLSTM and feedforward models for a deeper representation

Embeddings from Language Models (ELMo)

Paper Link: https://arxiv.org/pdf/1802.05365.pdf

Just like the problem we illustrated in Context2Vec, the semantics of a word are determined by the context the word is placed in. And as long as we are talking about the context of words, there is nothing better than using the power of Language Modeling to build a better understanding of sentence representation. A Language Model aims to produce a probability distribution that models a corpus of sentences.

Typically, in a Language Model, the target is to predict the next word given the previous words, P(Word | Context). From this, we can infer that a word is conditioned on the words mentioned before it; in representation terms, the vector representation of a word depends on the context’s vector representation. We can apply the same idea to the words that come after the target word. So we have a Forward Language Model (predict the target word given the previous words) and a Backward Language Model (predict the target word given the following words).

The following figure shows the forward and backward passes of the Language Model. Together they are called a Bi-directional Language Model.


Reference: https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27

ELMo uses the Bi-directional Language Model to get a new embedding that is combined with the initial word embedding. Concretely, the word “are” in the above figure would have a representation formed from the following embedding vectors

  • Original embedding, GloVe, Word2Vec or FastText for example
  • Forward pass hidden layer representation vector
  • Backward pass hidden layer representation vector

These vectors have equal size and can be weighted by real-valued coefficients according to the task you are working on. Finally, they are summed element-wise to form the final vector representation of the word.
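
As a rough numpy sketch of this combination (the dimensions and values are made up), assuming the task-specific coefficients are softmax-normalized and scaled by a factor gamma, as in the paper:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d = 8
# The three equal-size vectors for one word (illustrative values):
original = rng.standard_normal(d)    # e.g. a GloVe embedding
forward_h = rng.standard_normal(d)   # forward-LM hidden state
backward_h = rng.standard_normal(d)  # backward-LM hidden state

# Task-specific scalar weights, softmax-normalized, plus a scale gamma
s = softmax(np.array([0.2, 0.5, 0.3]))
gamma = 1.0

layers = np.stack([original, forward_h, backward_h])  # (3, d)
elmo_vector = gamma * (s[:, None] * layers).sum(axis=0)
print(elmo_vector.shape)  # (8,)
```

The weights s and gamma are the parts that get learned per downstream task; everything else stays frozen.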

So, we can conclude the steps for using ELMo as follows

  • Train a Bi-directional Language Model on a large corpus
  • Freeze the encoders and place them at the lowest level of your model
  • Replace the words with their word vectors
  • Apply the encoders to the words and sum the hidden representation vectors with the word vectors

The following figure shows the results of using the ELMo embedding on some NLP tasks

Personal Notes

  • It is better to train the language model on a corpus that is related to the task you are working on
  • You can use several techniques to improve the Language Model, such as including CNN character features or deeper models, as long as you get a hidden representation of the words

Universal Language Model Fine-tuning (ULMFiT)

Paper Link: https://arxiv.org/pdf/1801.06146.pdf

How about building one model that can serve any classification task? This was the main idea behind creating a Universal Language Model. As we showed before, Language Modeling is a perfect task for learning the complex characteristics of a language and the dependencies between words within a sentence.

Furthermore, ULMFiT introduced two novel techniques to improve the Transfer Learning process within the network layers: Discriminative Fine-tuning (Discr) and Slanted Triangular Learning Rates (STLR). To get the best language model, they used the SOTA ASGD Weight-Dropped LSTM (AWD-LSTM). I recommend checking this great post for a better explanation.

AWD-LSTM is an LSTM regularized with several techniques to get better generalization. The main motivation is to get rid of the LSTM overfitting problems on long sequences that make the model perform poorly on the test data. They introduced four techniques

  • DropConnect Mask
  • Variational Dropout
  • Average Stochastic Gradient Descent (ASGD)
  • Variable Length Backpropagation Through Time (BPTT)

DropConnect Mask

Paper Link for more info: http://yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdf

In simple Dropout, during the training phase we randomly select some hidden-layer neurons and set their output to zero. This helps prevent the model from relying on specific neurons to produce its activations. By applying Dropout to the neurons, we increase the model’s sparsity at the level of the layer itself.

In DropConnect we apply the same sparsity idea, but to the weight matrices between adjacent layers. In other words, we apply the random selection to the weights, not the neurons. Practically, this helps the model avoid completely forgetting the contribution of lower neurons to the computation. The following figure shows a mask matrix containing zeros and ones, initialized from a Bernoulli distribution according to the paper

DropConnect can be applied to feedforward or recurrent networks.

In the AWD-LSTM model, DropConnect is applied to the hidden-to-hidden weights; the masked weights are multiplied with the last hidden state.
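
A minimal sketch of that masking, with a made-up 4×4 weight matrix and keep probability:

```python
import numpy as np

rng = np.random.default_rng(5)

def dropconnect(W, p, rng):
    # Bernoulli mask over the WEIGHTS (not the activations);
    # each weight is kept with probability 1 - p and zeroed otherwise
    mask = rng.random(W.shape) >= p
    return W * mask

W_hh = rng.standard_normal((4, 4))  # hidden-to-hidden weights of an LSTM
W_masked = dropconnect(W_hh, p=0.5, rng=rng)

h_prev = rng.standard_normal(4)
h_contrib = W_masked @ h_prev  # masked weights multiply the last hidden state
print(W_masked.shape)
```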

Variational Dropout

Paper Link: https://arxiv.org/pdf/1512.05287.pdf

In Dropout, on each layer we randomly select some neurons whose output the model ignores while training the batch samples, and the same dropped neurons are applied to every sample of the batch. In Variational Dropout, we select different neurons for each sample of the batch, so we have a dynamic selection per sample. According to the paper, they find that this practically gives better generalization.
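
A sketch of the per-sample selection (the batch, time, and feature sizes are made up); here I also reuse each sample’s mask across all time steps, as is common for recurrent inputs:

```python
import numpy as np

rng = np.random.default_rng(6)
batch, time, d = 3, 5, 4
x = rng.standard_normal((batch, time, d))
p = 0.5

# One dropout mask PER SAMPLE, sampled once and reused at every time step;
# inverted-dropout scaling keeps the expected activation unchanged
mask = (rng.random((batch, 1, d)) >= p) / (1 - p)
dropped = x * mask
print(dropped.shape)  # (3, 5, 4)
```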

Average Stochastic Gradient Descent (ASGD)

Paper Link: https://leon.bottou.org/publications/pdf/compstat-2010.pdf

The regular SGD optimizer returns the weights of the current iteration. ASGD instead keeps the weights from previous iterations: starting from a trigger point, determined by a constant parameter, the weights of the previous iterations are summed with the current iteration’s weights and averaged, and this average is used as the final weight estimate.
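
A sketch of the averaging idea only; the "training history" and trigger iteration below are made up, and real ASGD maintains the running average during optimization rather than storing the whole history:

```python
import numpy as np

def asgd_estimate(weight_history, trigger):
    # Plain SGD would return weight_history[-1]; ASGD instead returns the
    # average of the weights from the trigger iteration onwards
    return np.mean(weight_history[trigger:], axis=0)

# Toy "training": weights hover around 1.0 with noise
rng = np.random.default_rng(7)
history = [np.array([1.0]) + 0.1 * rng.standard_normal(1) for _ in range(100)]

w_sgd = history[-1]                       # last-iterate estimate
w_asgd = asgd_estimate(history, trigger=50)  # averaged estimate
print(w_asgd.shape)
```

The averaged estimate smooths out the noise of the final iterates, which is the motivation for using it with noisy LSTM training.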

Variable Length Backpropagation Through Time (BPTT)

BPTT is the algorithm used for computing gradients while training recurrent networks. Because the model takes its input sequentially, regular Backpropagation can’t be applied directly to an unbounded input; in (truncated) BPTT, the weights are updated after a specific number of steps through the input sequence.

For example, if I have a sequence of 11 words and we set the update to happen every 5 tokens, then with this setup we won’t use the last word to update the parameters, so the model doesn’t benefit from the last token.

To get rid of this, they set a constant probability, called the base BPTT probability, and draw the actual window length from a Normal Distribution around the chosen base length. They assume here that the window length, like any other data, can be modeled by a Normal Distribution. The learning rate also becomes a function of the resulting sequence length.
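
A sketch of that sampling; the constants below (window 70, probability 0.95, standard deviation 5) follow the AWD-LSTM paper’s setup, but treat this as an illustrative sketch rather than the exact implementation:

```python
import numpy as np

def sample_bptt_length(bptt, base_prob, lr_max, rng):
    # With probability base_prob use the base window, otherwise half of it,
    # then draw the actual length from a Normal around that base
    base = bptt if rng.random() < base_prob else bptt / 2
    seq_len = max(5, int(rng.normal(base, 5)))
    # Rescale the learning rate so varying windows are weighted fairly
    lr = lr_max * seq_len / bptt
    return seq_len, lr

rng = np.random.default_rng(8)
lengths = [sample_bptt_length(70, 0.95, 30.0, rng)[0] for _ in range(200)]
print(min(lengths), max(lengths))
```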

Back to ULMFiT, which uses AWD-LSTM; the modeling consists of three stages

  • Perform a Language Model training on a large corpus (called LM pre-training)
  • Perform a fine-tuning on the trained Language Model using the task’s dataset (called LM fine-tuning)
  • Perform a fine-tuning on the Language Model using the task’s classifier and update the model parameters using the classification objective function (called Classifier fine-tuning)

The following figure shows the three stages diagrams

The network consists of three stacked AWD-LSTM layers, which are augmented with a classifier architecture in the final stage. Typically, they used a Computer-Vision-style classifier head (Batch Normalization + Dropout + ReLU layer + Batch Normalization + Dropout + ReLU layer + Softmax layer).

For better fine-tuning, they used two novel techniques, Discriminative Fine-tuning (Discr) and Slanted Triangular Learning Rates (STLR).

Discriminative Fine-tuning (Discr)

In stages two and three, they used a different learning rate for each layer. As we know, each layer in a network learns features that may differ from those of the other layers. Given this, each layer has its own way of learning, which means giving every layer the same learning rate isn’t efficient. So they set a different learning rate per layer, where each rate is a function of the layer above it: LR(l − 1) = LR(l) / 2.6, where LR(l) is the learning rate of layer l.
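
Building that set of rates is a one-liner loop; here is a sketch (the top-layer rate and layer count are arbitrary):

```python
def discriminative_lrs(lr_top, n_layers, factor=2.6):
    # Top layer gets lr_top; each lower layer gets the rate above it / factor
    lrs = [lr_top]
    for _ in range(n_layers - 1):
        lrs.append(lrs[-1] / factor)
    return list(reversed(lrs))  # ordered from lowest layer to top layer

print(discriminative_lrs(0.01, 4))
```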

Slanted Triangular Learning Rates (STLR)

In stages two and three, they used a scheduler for the learning rate that depends on the iteration number t:

cut = ⌊T · cut_frac⌋
p = t / cut if t < cut, otherwise p = 1 − (t − cut) / (cut · (1 / cut_frac − 1))
ηt = ηmax · (1 + p · (ratio − 1)) / ratio

where T is the total number of iterations, cut_frac is the fraction of iterations during which we increase the learning rate, cut is the iteration at which the schedule switches from increasing to decreasing the rate, p is the fraction of the increase or decrease phase completed so far, and ratio specifies how much smaller the lowest learning rate is than the maximum learning rate ηmax.

They use cut_frac = 0.1, ratio = 32 and ηmax = 0.01 according to the paper.

This produces the triangular shape of the learning rate across iterations.
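
The STLR schedule can be sketched directly from the paper’s definition (using its reported cut_frac = 0.1, ratio = 32 and η_max = 0.01):

```python
def stlr(t, T, cut_frac=0.1, ratio=32, eta_max=0.01):
    # Slanted triangular schedule: short linear increase, long linear decay
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

T = 1000
rates = [stlr(t, T) for t in range(T)]
print(max(rates))  # the peak eta_max is reached at iteration cut = 100
```

Plotting `rates` against the iteration number reproduces the slanted triangle: a steep ramp-up over the first 10% of iterations, then a long, gentle decay.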

So, we can conclude the steps for using ULMFiT as follows

  • Train an AWD-LSTM (forward, backward, or both) on a large general corpus
  • Fine-tune the model on the task’s dataset
  • Fine-tune with the classification model on the task’s training dataset

Bi-directional Encoder Representations from Transformers (BERT)

Paper Link: https://arxiv.org/pdf/1810.04805.pdf

Just like the ULMFiT model, BERT is one model that can serve any classification task, just by adding a classification layer on top, and it can act as a feature extractor for generation tasks. BERT is considered a revolutionary model that achieved SOTA results on almost all downstream NLP tasks (before XLNet).

BERT is an architecture based on the bidirectional Transformer model. The Transformer needs a full standalone blog post to be explained properly, so I recommend checking this great post, which clearly illustrates the idea of the Transformer and the Self-Attention layer.

In brief, the Transformer is a seq2seq model that consists of an Encoder and a Decoder. The Transformer’s strong point is its ability to identify the important tokens in a given sentence through its multi-head attention mechanism.

From Attention Is All You Need by Vaswani et al.

BERT uses the bidirectional Transformer encoder to encode the important words and their dependencies on other words.

Because BERT keeps only the encoder side of the Transformer, its training differs both from models that rely on a classification objective function and from regular seq2seq models used for generation tasks such as Machine Translation, Text Paraphrasing, and Text Entailment generation. (Seq2seq models with a decoder are typically trained with a technique called Teacher Forcing, used in recurrent-based networks and worth understanding here.)

Instead, they introduced a Masked Language Model task. Simply, they mask some percentage of the input tokens at random and then predict those masked tokens. This forces the model to use the context to predict the masked words and produces a contextual representation of each word; the masked positions are then predicted with a softmax over the vocabulary on top of the encoder output.
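
The masking step can be sketched as follows; the 80/10/10 replacement split is the one reported in the BERT paper, while the toy vocabulary and sentence are made up:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    # Pick ~15% of positions; of those, 80% become [MASK], 10% become a
    # random token, and 10% are left unchanged (per the BERT paper)
    rng = rng or random.Random(0)
    vocab = ["the", "cat", "sat", "on", "mat"]
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)  # the model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            targets.append(None)  # no prediction at this position
            masked.append(tok)
    return masked, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 10
masked, targets = mask_tokens(tokens)
print(sum(t is not None for t in targets))  # number of prediction positions
```

The loss is then computed only at the positions where `targets` is not None.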

Teacher Forcing

Paper Link: https://pdfs.semanticscholar.org/12dd/078034f72e4ebd9dfd9f80010d2ae7aaa337.pdf

Teacher Forcing is a strategy to train seq2seq models. In seq2seq models, we normally use the decoder’s output at step t − 1 as the input at step t; you can check this post for a full illustration. Now, let’s see an example of where this approach backfires.

Assume that we are working on the English-to-French Machine Translation task. The input sentence is “My name is Ahmed” and the output sentence is “Je m’appelle Ahmed“. Let’s consider that we have already fed the input to the encoder, and assume that the first output of the decoder is “Ahmed“! Then the input of the next step will be incorrect, as we are expecting the word “Je“. Instead, since we know the correct word, we ignore that output (after calculating and keeping the error) and feed the correct word to the next step, and so on. By doing this, we Force the model to learn, as in supervised learning, where we use the actual label to drive the parameter update.
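
The contrast can be shown with a tiny decoding loop and a deliberately bad stand-in "model" that always guesses the same word:

```python
def decode(step_fn, targets, teacher_forcing=True, start="[START]"):
    inputs, outputs, prev = [], [], start
    for gold in targets:
        inputs.append(prev)            # what the decoder sees at this step
        pred = step_fn(prev)
        outputs.append(pred)
        # teacher forcing: feed the ground-truth token, not the prediction
        prev = gold if teacher_forcing else pred
    return inputs, outputs

bad_model = lambda prev: "Ahmed"       # always predicts "Ahmed"
gold = ["Je", "m'appelle", "Ahmed"]

forced_in, _ = decode(bad_model, gold, teacher_forcing=True)
free_in, _ = decode(bad_model, gold, teacher_forcing=False)
print(forced_in)  # ['[START]', 'Je', "m'appelle"]
print(free_in)    # ['[START]', 'Ahmed', 'Ahmed']
```

With teacher forcing, each step still sees the correct previous word, so one early mistake doesn’t derail the rest of the sequence during training.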

Back to the BERT model. In the original Transformer, the decoder input is a variant of this idea: instead of passing the actual target sentence as-is, it is shifted to the right before being passed as input, so that the model cannot learn by simply copying the decoder’s input. From the above example, the decoder input would be “[START] Je m’appelle”, and the model learns to predict the final word “Ahmed” given the previous words. Note, however, that BERT itself keeps only the encoder; the decoder-side training above is background on the Transformer rather than part of BERT’s own pre-training.

To use BERT, we do the following steps

  • Train BERT on a large corpus (pairs of sentences)
  • Then fine-tune the model on your task-specific dataset; the input can be a single sentence or a sentence pair, depending on the task
  • Add your softmax layer on top of the model for classification

The following are some SOTA results using BERT

Personal Notes

  • BERT is really powerful. I used it in two classification tasks and it increased the performance significantly
  • Use the flair library to play with BERT and other embeddings
  • Look at the “Additional Details for BERT” section at the end of the BERT paper


We examined some novel architectures for Transfer Learning in NLP. However, I recommend reading the original papers of these models; this will help you understand more. We also mentioned some techniques that help in training models, like Teacher Forcing and other RNN regularization techniques.

All of these models have implementations in PyTorch, TensorFlow, and Keras that you can use for your projects.

A newly released model, XLNet, outperforms BERT on almost all tasks; however, I haven’t read its paper yet. I will update this post after finishing reading and understanding it.



