Previously, we talked about Word2vec model and its Skip-gram and Continuous Bag of Words (CBOW) neural networks.

Regularly, when we train any of Word2vec models, we need huge size of data. The number of words in the corpus could be millions, as you know, we want from Word2vec to build vectors representation to the words so that we can use it in NLP tasks and feed these vectors to any discriminative models. For example, in Machine Translation task, we need very huge size of words to be able to cover the language’s words and context.

Recalling Skip-gram model

You see that each of output layer in the above neural network has V dimension, which is the corpus unique vocabulary size. If we have 1M of unique words, then the size of each output layer would be 1M neuron unit. As we said in the previous post, to be able to get the words, we need to apply Softmax activation function to the output layer, so that we change the vector to probabilistic values and get the index of the maximum probability which corresponds to a 1-hot-encoding word vector.

To apply Softmax activation function, we use the following formula

The denominator is the summation of the output exponential values. The complexity of this operation is **O(V) **(regardless the exp() complexity), where **V** is the number of words in the corpus. So, this operation will have a bad performance if the vocabulary size is too high. Also, in Skip-gram, we need to do this in multiple Softmax output layer. This would be totally costly and waste alot of time when training the Word2vec models.

There was a need to make a new technique to reduce the Softmax complexity and make the training more efficient. There exist 2 different techniques for optimization, Hierarchical Softmax and Negative Sampling. They are mentioned in the original paper of Word2vec models.

Recalling our dataset from the previous post

D = {

“This battle will be my masterpiece”,

“The unseen blade is the deadliest”

};

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11

**Hierarchical Softmax**

In general, the target of Softmax layer is to maximize the probability of a word given its context **P(word | context)**. Assume that from our dataset and using CBOW model, we want to predict the word “unseen” given the context “The”, “blade” and “is”. So, our probability formula would be **P(unseen | the, blade, is)**.

Regularly, we use a flat Softmax layer to get that probability after applying the Softmax formula and get the probabilities. But in Hierarchical Softmax we don’t use a flat layer, but instead, we use a balanced Huffman tree layer, like the following.

So, to get the probability of the word “unseen”, we need to calculate (from the root) the following probability multiplication: **P(right) * P(left) * P(right) * P(right)**.

or **P(left) * P(right) * P(right) * P(right)**

The question now, how we could get these probabilities?

The idea of getting these probabilities is the same idea of logistic regression. Recalling a simple logistic regression which uses Sigmoid function in its output neuron

In logistic regression, we get the probability of Y given the features set X, P(Y | X). You get this probability by multiplying the features vector with the weights matrix, then apply the Sigmoid function to get the probability. Sigmoid(X * W + b).

Well, we will do the same with our tree. Our tree nodes would be the output of a logistic function and we connect each node of the tree by an associated weight matrix so that we do the multiplication between the hidden layer values of the neural network and the weights to get the probability of each node of the tree.

The input to the tree nodes are the hidden layer values of the neural network (the layer before the softmax). You can imagine it as we have now a new neural network that has the hidden layer values as input and the output is each node of the tree.

For each tree layer, we get the probabilities of the nodes which will lead us to know what is the predicted word by choosing the nodes with the maximum probability value in each tree layer.

You can see that the complexity is reduced from **O(N)** to **O(log(N))** which is a great performance improvement. Also, we don’t need to calculate the values of the 2 nodes of the subtree. You need only to get the probability of one of them and get the other by subtracting 1 from the other probability as **P(right) + P(left) = 1**.

**Negative Sampling**

Negative sampling is another approach for training the neural network in much more optimized approach. Actually, Negative Sampling is the most popular approach when training Word2vec models and it is implemented in Tensorflow’s Word2vec.

The idea of negative sampling is quit easy to understand and in implementation. Recalling our Softmax layer, you will see that the number of neurons are the same of the number of corpus’s words. After getting the error, we update the whole weights of the neural network. If you thought about that, actually, we don’t have to do this. We know that we have all the corpus words in the training items, so, the words weights will be certainly updated during the training phase.

Using this truth, we can instead update some of the words weights for each training example. These words would contain the positive word (the actual word that shall be predicted) and other words that are considered to be negative samples (incorrect words that shall not predicted). If you assumed that the number of negative samples are 10, then the softmax layer has only 11 neuron unit (10 negative + 1 positive) and the error will be used to update only 11 * hidden_layer_size weights instead of V * hidden_layer_size.

The question now is, how to select these negative samples?

Traditionally, choosing these negative samples depends on the words frequency in the given corpus. The higher frequency of the word means the higher chance of choosing it as a negative sample.

The following formula is used to determine the probability of selecting the word as a negative sample.

Where **c** is a constant that is selected by the model creator. After applying that, we choose the maximum **N** words, where **N** is the number of the negative samples.

**References**

– https://arxiv.org/pdf/1411.2738.pdf

– http://cs224d.stanford.edu/syllabus.html?utm_content=buffer891d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

– https://www.tensorflow.org/tutorials/word2vec

– http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Drawings are created using www.draw.io .