[Thesis Tutorials II] Understanding Word2vec for Word Embedding II

Previously, we talked about the Word2vec model and its two neural network architectures, Skip-gram and Continuous Bag of Words (CBOW).

Usually, when we train any of the Word2vec models, we need a huge amount of data. The number of words in the corpus could be millions. As you know, we want Word2vec to build vector representations of the words so that we can use them in NLP tasks and feed these vectors to any discriminative model. For example, in a Machine Translation task, we need a very large vocabulary to be able to cover the language’s words and contexts.

Recalling Skip-gram model
You see that each output layer in the above neural network has V dimensions, where V is the number of unique words in the corpus vocabulary. If we have 1M unique words, then the size of each output layer would be 1M neuron units. As we said in the previous post, to be able to get the words, we apply the Softmax activation function to the output layer, so that we turn the vector into probability values and take the index of the maximum probability, which corresponds to a 1-hot-encoded word vector.

To apply the Softmax activation function, we use the following formula:

Softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

The denominator is the summation of the exponentials of all the output values. The complexity of this operation is O(V) (ignoring the complexity of exp() itself), where V is the number of words in the corpus. So, this operation performs badly when the vocabulary size is very large. Also, in Skip-gram, we need to do this in multiple Softmax output layers. This is costly and wastes a lot of time when training the Word2vec models.
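As a quick illustration, here is a minimal NumPy sketch of a flat Softmax (the scores are made-up values, not taken from the post); the denominator is the part that sums over all V outputs:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    # The denominator sums over every output unit -- this is the O(V) cost.
    return e / e.sum()

# Illustrative output-layer scores for a tiny 3-word "vocabulary".
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# The index of the maximum probability maps back to a 1-hot-encoded word.
predicted_index = int(np.argmax(probs))
```

For a real vocabulary, `scores` would have one entry per unique word, which is exactly why a 1M-word vocabulary makes this step so expensive.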

There was a need for a new technique to reduce the Softmax complexity and make the training more efficient. There exist 2 different optimization techniques, Hierarchical Softmax and Negative Sampling. Both are mentioned in the original Word2vec paper.


Recalling our dataset from the previous post

D = {
“This battle will be my masterpiece”,
“The unseen blade is the deadliest”
};

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11


Hierarchical Softmax

In general, the target of the Softmax layer is to maximize the probability of a word given its context, P(word | context). Assume that, using the CBOW model on our dataset, we want to predict the word “unseen” given the context words “The”, “blade” and “is”. So, our probability formula would be P(unseen | the, blade, is).

Usually, we use a flat Softmax layer to get that probability by applying the Softmax formula. But in Hierarchical Softmax we don’t use a flat layer; instead, we use a binary Huffman tree layer, like the following.

So, to get the probability of the word “unseen”, we need to calculate (from the root) the following probability multiplication: P(right) * P(left) * P(right) * P(right).

or P(left) * P(right) * P(right) * P(right)

The question now is: how do we get these probabilities?

The idea of getting these probabilities is the same idea as in logistic regression. Recall simple logistic regression, which uses the Sigmoid function in its output neuron.


In logistic regression, we get the probability of Y given the feature set X, P(Y | X). We get this probability by multiplying the feature vector by the weight matrix, then applying the Sigmoid function: Sigmoid(X * W + b).
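A minimal sketch of that logistic-regression step (the feature values, weights and bias below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical feature vector, weight vector and bias.
x = np.array([0.5, 1.2, -0.3])
w = np.array([0.4, -0.2, 0.7])
b = 0.1

# P(Y | X) = Sigmoid(X * W + b)
p = sigmoid(x @ w + b)
```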

Well, we do the same with our tree. Each tree node is the output of a logistic function, and we connect each node of the tree to an associated weight vector, so that multiplying the hidden layer values of the neural network by these weights gives us the probability of each node of the tree.

The input to the tree nodes are the hidden layer values of the neural network (the layer before the softmax). You can imagine it as we have now a new neural network that has the hidden layer values as input and the output is each node of the tree.

For each tree layer, we get the probabilities of the nodes which will lead us to know what is the predicted word by choosing the nodes with the maximum probability value in each tree layer.

You can see that the complexity is reduced from O(V) to O(log(V)), which is a great performance improvement. Also, we don’t need to calculate the values of both children of a node. We only need to get the probability of one of them and obtain the other by subtracting it from 1, as P(right) + P(left) = 1.
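A small sketch of this idea with a depth-2 balanced tree (4 leaf words, 3 internal nodes); the hidden-layer values and node weights below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.8, 0.4, 0.5])  # hidden-layer values (illustrative)

# One weight vector per internal node of the tree; values are made up.
node_w = {
    "root":  np.array([0.2, -0.5, 0.1]),
    "left":  np.array([-0.3, 0.4, 0.6]),
    "right": np.array([0.7, 0.1, -0.2]),
}

def p_right(node):
    return sigmoid(h @ node_w[node])

def p_left(node):
    return 1.0 - p_right(node)  # P(left) + P(right) = 1, no extra sigmoid needed

# Probability of each leaf word = product of branch probabilities from the root.
leaf_probs = [
    p_left("root") * p_left("left"),
    p_left("root") * p_right("left"),
    p_right("root") * p_left("right"),
    p_right("root") * p_right("right"),
]
```

Only log2(V) sigmoid evaluations are needed per word, and the leaf probabilities still form a valid distribution over the vocabulary (they sum to 1).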


Negative Sampling

Negative Sampling is another, much more optimized approach for training the neural network. Actually, Negative Sampling is the most popular approach when training Word2vec models, and it is the one implemented in Tensorflow’s Word2vec tutorial.

The idea of negative sampling is quite easy to understand and to implement. Recalling our Softmax layer, you will see that the number of neurons equals the number of the corpus’s words. After getting the error, we update all the weights of the neural network. If you think about it, we actually don’t have to do this. We know that all the corpus words appear in the training examples, so every word’s weights will certainly be updated during the training phase anyway.

Using this fact, we can instead update only some of the words’ weights for each training example. These words consist of the positive word (the actual word that shall be predicted) and other words that are considered negative samples (incorrect words that shall not be predicted). If we assume the number of negative samples is 10, then the Softmax layer has only 11 neuron units (10 negative + 1 positive), and the error will be used to update only 11 * hidden_layer_size weights instead of V * hidden_layer_size.

The question now is, how to select these negative samples?

Traditionally, choosing these negative samples depends on the word frequencies in the given corpus. The higher the frequency of a word, the higher the chance of choosing it as a negative sample.

The following formula is used to determine the probability of selecting a word as a negative sample:

P(w_i) = f(w_i)^c / Σ_j f(w_j)^c

where f(w) is the frequency of the word w and c is a constant that is selected by the model creator. After applying that, we choose the N words with the maximum probability, where N is the number of negative samples.
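A sketch of this selection step, assuming P(w) is proportional to f(w)^c; the frequencies below are made up, and note that the reference word2vec implementation uses c = 3/4 and *samples* N words from this distribution rather than taking the top N:

```python
import numpy as np

# Hypothetical word frequencies counted from a corpus.
freqs = {"the": 120, "is": 80, "blade": 5, "unseen": 2, "deadliest": 1}
c = 0.75  # the constant chosen by the model creator

words = list(freqs)
weights = np.array([freqs[w] ** c for w in words])
probs = weights / weights.sum()  # probability of picking each word as a negative

# Following the post: take the N most probable words as negative samples.
N = 2
negatives = [words[i] for i in np.argsort(-probs)[:N]]
```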


References

– https://arxiv.org/pdf/1411.2738.pdf
– http://cs224d.stanford.edu/syllabus.html
– https://www.tensorflow.org/tutorials/word2vec
– http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Drawings are created using www.draw.io .

[Thesis Tutorials I] Understanding Word2vec for Word Embedding I

Vector Space Models (VSMs): A conceptual term in Natural Language Processing. It represents words as a set of vectors; these vectors are considered to be the new identifiers of the words. These vectors are used in mathematical and statistical models for classification and regression tasks. Also, they shall be unique, to be able to distinguish between them when they are passed to the other models.

Word Embedding: A technique to represent words by fixed-size vectors, so that words which have similar or close meanings have close vectors (i.e. vectors whose Euclidean distance is small). In other words, the representation captures the semantic meaning of the words. This is very important in many different areas and tasks of Natural Language Processing, such as Language Modeling and Machine Translation.

Shallow Neural Networks: Neural Networks which contain only 1 hidden layer. This layer is considered to be the projection layer of the input layer, so that the hidden layer is the new feature representation of the raw input data.


Basically, when we work with statistical and mathematical machine learning models, we always use numerical data to be able to produce results. Models such as classifiers, regressors and clustering algorithms deal with numerical data features. But how can this be done with text data? How do we represent words by numerical features so that they can be used and fed to our models?


One-Hot-Encoding

Traditionally, such a problem could be solved using a popular approach called One-Hot-Encoding, in which each word in our corpus gets a unique vector that becomes the new representation of the word.

For example, assume that we have the following 2 sentences:
“I like playing football”
“I like playing basketball”

First, we need to extract the unique words of the data. In our case, the unique words are the set V = {I, like, playing, football, basketball} and ||V|| = 5. Note that you need to normalize the words and make them all lower-case or upper-case (e.g. football is the same as Football); we don’t care about letter case in this representation.

Now, we create a vector of all zeros whose size is the number of unique words ||V||, and each word gets a 1 value at a unique index of that initial zeros vector. So, our word representations will be

I = [1, 0, 0, 0, 0], like = [0, 1, 0, 0, 0], playing = [0, 0, 1, 0, 0], football = [0, 0, 0, 1, 0], basketball = [0, 0, 0, 0, 1].
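The whole procedure can be sketched in a few lines of Python (a toy implementation, not library code):

```python
# Build the unique, lower-cased vocabulary from the two example sentences.
sentences = ["I like playing football", "I like playing basketball"]

vocab = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab.append(word)

def one_hot(word):
    # All-zeros vector of size ||V|| with a single 1 at the word's index.
    vector = [0] * len(vocab)
    vector[vocab.index(word.lower())] = 1
    return vector

# one_hot("football") -> [0, 0, 0, 1, 0]
```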

As you see, it is very simple and straightforward, and it can be implemented easily in code. But you have surely noticed its cons. This approach heavily depends on the number of words: if we have a corpus with 1M words, then the word dimension would be 1M too, which is very bad for performance. Also, the representation itself doesn’t show the semantic relations between the words and their meanings. It only focuses on transforming the words to the Vector Space Model where we have unique vectors, regardless of their distances.

But you can use this method if the problem you want to solve isn’t closely related to semantic relations or meaning, and if you know that your unique vocabulary isn’t large.


Word2vec Basics

Word2vec is a technique, or a paradigm, which consists of a group of models (Skip-gram and Continuous Bag of Words (CBOW)). The target of each model is to produce fixed-size vectors for the corpus words, so that words which have similar or close meanings have close vectors (i.e. vectors whose Euclidean distance is small).

The idea of word2vec is to represent a word using its surrounding words. For example, assume that we have the following sentence:

“I like playing football”

To be able to build a vector for the word “playing”, we need to be able to produce the surrounding words from the “playing” word; in this case, the surrounding words are “like” and “football”. Note that the number of surrounding words can be increased according to the window size you set; in other words, the surrounding words could be the N words before and after the input word, where N is the window size.


Word2vec Philosophy

Well, this idea is considered philosophical. If you think about it, you will realize that we, humans, learn the meaning of words from their nearby words. For example, assume there is an unknown word that you haven’t seen before, let’s call it “X”, and we have the following sentence:

“I like playing X”

So, you may not know the meaning of X, but you know for sure that “X” is something we can play with, and also that it is something that can be enjoyable to some people. Your brain reaches this conclusion after reviewing X’s surrounding words, “like” and “playing”. This is the idea behind the word2vec technique.

In conclusion, word2vec tries to use the context of the text to produce the word vectors. These word vectors consist of floating point values that are obtained during the training phase of the models, Skip-gram and CBOW. These models are Shallow Neural Networks; they do the same job, but they differ in their inputs and outputs.


Dataset

Before getting into the word2vec models, we need to define some sentences as a corpus to help us in understanding how the models work. So, our corpus is:

D = {
“This battle will be my masterpiece”, // LoL players will relate 😛
“The unseen blade is the deadliest”
};

As we did in One-Hot-Encoding, we need to extract the unique words and initialize their vectors as sparse vectors. We need to do that to be able to use these words in the Neural Networks.

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11,

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] = this
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] = battle
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] = will
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] = be
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] = my
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] = masterpiece
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] = the
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0] = unseen
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] = blade
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] = is
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] = deadliest


Skip-gram Model

Skip-gram is a Shallow Neural Network. The target of this model is to predict the nearby words given a target word. So, the input of this network is a 1-hot-encoded vector of the target word, and the output is N words, where N is the window size of the context, which is determined by the network creator. In the above figure, the number of words to be predicted is 3, so we have 3 Softmax layers, and the number of neurons in each layer is the vocabulary size.

Let’s see how the model works with the sentence “The unseen blade is the deadliest”. Assume that the target word whose representation we want is “unseen” and the nearby words are “The”, “blade” and “is”. Note that our target vector size will be 3; in other words, we want our embedding vectors to have a fixed size of 3. Also, the weights are initialized with random values, like in any other neural network; usually, the weights are initialized from a Gaussian distribution with known mean and variance.

Input-to-hidden

After feedforwarding the input word, we get the hidden layer values [0.8, 0.4, 0.5]. This is considered to be the first obtained representation of the word “unseen”. Well, what about the other words of the vocabulary? The values of the weight matrix are the representations of the words (i.e. each row of the weight matrix represents the vector representation of one word). For example, given the weight matrix, the word “this” representation will be [0.1, 0.9, 0.3], and so on for the other words.

Hidden-to-output

Now, we apply the Softmax activation function to each of these vectors, so that we get the output words given the target word “unseen”.

We take the maximum value of each vector and get its corresponding 1-hot-encoded word; the index of the maximum value is the index of the value “1” in the 1-hot-encoded vectors. According to these vectors, the predicted words are “is”, “deadliest” and “unseen”.

During the training phase, we have the actual output for the target word, so we get the error of each output layer using a loss function. The loss function could be Mean Squared Error or Cross Entropy. Now that we have 3 errors, one per Softmax layer, we take the average error by dividing the summation of the errors by the number of errors (3 in this case).

After obtaining the error, we backpropagate it to the weights so that we update the weight values according to the obtained error. As you see, the values of the weights change for each training batch, which also changes the word representations.

We continue iterating until we reach the maximum number of epochs or the target error, at which point we stop the training and take the input-to-hidden weight matrix, in which each row represents one word in the space.

Alternatively, we can use the hidden-to-output weights to represent the words. In this case, we average the hidden-to-output weight matrices, and each column of the averaged weight matrix represents a word. But regularly, we take the input-to-hidden weight matrix.
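The forward pass described above can be sketched as follows; the weights are randomly initialized (Gaussian), and the vocabulary index of “unseen” is assumed, so only the shapes and the flow match the post:

```python
import numpy as np

rng = np.random.default_rng(0)
V, E = 11, 3  # vocabulary size and embedding size from the post

W_in = rng.normal(0.0, 0.1, size=(V, E))   # input-to-hidden weights
W_out = rng.normal(0.0, 0.1, size=(E, V))  # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

target = 7     # assumed index of "unseen" in the vocabulary
x = np.zeros(V)
x[target] = 1.0  # 1-hot-encoded input word

hidden = x @ W_in  # equals W_in[target]: the word's current embedding
context_probs = softmax(hidden @ W_out)  # one such Softmax per context word

# After training, each row of W_in is taken as a word's embedding.
embedding = W_in[target]
```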




Continuous Bag of Words (CBOW) Model

CBOW is also a Shallow Neural Network. The target of this model is to predict a target word given its nearby words. So, the input of this network is N words, where N is the window size of the context, which is determined by the network creator, and the output is the target word, which is obtained using a Softmax layer. In the above figure, the number of words used as input is 3. Also, the network weights are initialized randomly at the beginning.

The idea of CBOW is that the hidden layer values represent the mean of the input words; in other words, the bottleneck features of the network contain the mean vector of the input words’ vectors.

Recalling our previous sentence, “The unseen blade is the deadliest”, let’s see how the model works with it. The target word to be predicted is “unseen” and the context words are “the”, “blade” and “is”.

Input-to-hidden

To get the values of the hidden layer, we just need to take the average of the result of multiplying the input words by their weight matrix.

So, the values of the hidden layer would be
H = [(0.8 + 0.2 + 0.2) / 3, (0.9 + 0.8 + 0.3) / 3, (0.1 + 0.9 + 0.7) / 3] = [0.4, 0.67, 0.57]
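To verify the averaging step, here are hypothetical embedding rows for “the”, “blade” and “is”, chosen only so that their per-component sums match the ones above (1.2, 2.0 and 1.7):

```python
import numpy as np

# Hypothetical context-word vectors (rows of the input-to-hidden weight matrix).
the   = np.array([0.8, 0.9, 0.1])
blade = np.array([0.2, 0.8, 0.9])
is_   = np.array([0.2, 0.3, 0.7])

# The CBOW hidden layer is simply the mean of the context vectors.
H = np.mean([the, blade, is_], axis=0)  # approximately [0.4, 0.67, 0.57]
```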

Hidden-to-output

After getting the hidden layer values, we get the output layer values by multiplying the hidden layer vector with the hidden to output weight matrix.

Now that we have the output layer values, we need to apply Softmax again to get the target word.

The predicted word is “masterpiece”. We have the actual output as before, so we compute the error and backpropagate it to the weights to update their values.


Conclusion

Word2vec is considered to be the main core of applying deep learning to Natural Language Processing. Traditionally, rule-based systems that depend on some syntactic and linguistic features were used to solve most NLP problems and applications.

A rule-based system cannot generalize its solution to all real-world data. Also, things that apply in one language don’t necessarily apply in other languages. So, it was a must to build deep learning systems that depend on the semantic representation of language words and text.

Also, there are some hard and challenging problems in NLP that need deep learning to be solved, such as Machine Translation and Paraphrase Detection. Both depend on the semantics of the words (which is a common thing in any language), so it was a must to move to deep learning to enhance learning in the semantic way, instead of the syntactic, language-grammar way of rule-based algorithms. [https://www.quora.com/Why-do-we-use-neural-network-to-do-NLP-task/answer/Ahmed-Hani-Ibrahim]

To be able to move to deep learning, it was a must to work with words in a much more semantic way and to build systems that actually understand the language itself like we humans do. Word2vec is the bridge to such a task: representing words in the vector space in a semantic manner, so that we make the machine understand words semantically.


As you see, it seems that Word2vec needs a huge amount of training data and high-performance machines to be able to train on that data. If you think about it, we applied Softmax on the output layer. Applying Softmax over a very large vocabulary is very costly and needs much time, so some optimization of the Softmax was needed to reduce the complexity. In the next post, we will talk about Negative Sampling and Hierarchical Softmax.


References

– https://arxiv.org/pdf/1411.2738.pdf
– http://cs224d.stanford.edu/syllabus.html
– https://www.tensorflow.org/tutorials/word2vec

Hello, Valor!

Being a researcher and a programmer can make life so boring. Actually, I have been feeling bored for 2 months. My life is full of the same routine: going to work, coming back home to study or code some freelancing tasks, waiting for Thursday to meet my friends and hang out. This routine really made my life so boring that it affected my daily life, at work and at home. So, I decided to try something new, something that could refresh my life and be added to my daily routine.

I love listening to music; actually, I can’t work, study or even play without listening to music (with lyrics or without), but I never thought that I could learn music. This state of boredom created a need to learn something new, and I couldn’t find a better idea than learning music, specifically learning to play the Violin, as it is my favourite instrument to listen to.

I took the decision, enrolled in a course and bought my Violin. I am still in Level 1 of the course, which consists of 7 levels, and I am very motivated to learn the basics and be somehow good with the Violin before the end of this year. It is a long way to go, but I am very happy, excited to learn something new, and glad that I am on my way to breaking out of my boredom.

So, let’s say: welcome, Valor (I named it Valor)! I hope that I will be a good companion to you!