# [Thesis Tutorials I] Understanding Word2vec for Word Embedding I

Vector Space Models (VSMs): A concept in Natural Language Processing that represents words as vectors, which serve as the new identifiers of the words. These vectors are used in mathematical and statistical models for classification and regression tasks. They should also be unique, so that the words can be told apart when they are passed on to other models.

Word Embedding: A technique that represents words by fixed-size vectors, so that words with similar or close meanings have close vectors (i.e. vectors whose Euclidean distance is small). In other words, the representation captures the semantic meaning of the words, which is very important in many areas and tasks of Natural Language Processing, such as Language Modeling and Machine Translation.

Shallow Neural Networks: Neural networks that contain only one hidden layer. This layer acts as the projection layer of the input layer, so the hidden layer is the new feature representation of the raw input data.

Basically, when we work with statistical and mathematical machine learning models, we always use numerical data to produce the results. Models such as classifiers, regressors and clustering algorithms deal with numerical features. But how can this be done with text data? How do we represent words by numerical features so that they can be fed to our models?

One-Hot-Encoding

Traditionally, this problem is solved using a popular approach called One-hot-Encoding, in which each word in our corpus gets a unique vector that becomes the new representation of the word.

For example, assume that we have the following 2 sentences:
“I like playing football”

First, we need to extract the unique words of the data. In our case, the unique words are the set V = {I, like, playing, football, basketball} and ||V|| = 5. Note that you need to normalize the words and make them all lower-case or upper-case (e.g. football is the same as Football); we don’t care about letter case in such a representation.

Now, we create a vector of all zeros whose size is the number of unique words ||V||, and each word gets a value of 1 at its own unique index of that zero vector. So, our word representations will be

I = [1, 0, 0, 0, 0], like = [0, 1, 0, 0, 0], playing = [0, 0, 1, 0, 0], football = [0, 0, 0, 1, 0], basketball = [0, 0, 0, 0, 1].
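The encoding above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available; the names `vocab`, `word_to_index` and `one_hot` are made up for this example.

```python
import numpy as np

# Vocabulary from the text, lower-cased and kept in a fixed order.
vocab = ["i", "like", "playing", "football", "basketball"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of `word` (letter case is ignored)."""
    vector = np.zeros(len(vocab), dtype=int)
    vector[word_to_index[word.lower()]] = 1
    return vector

print(one_hot("Football"))  # [0 0 0 1 0]
```

Lower-casing inside `one_hot` is one simple way to apply the normalization mentioned earlier.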

As you can see, it is very simple and straightforward, and it is easy to implement. But you have surely noticed its cons. The vector size depends directly on the vocabulary size, so if we have a corpus with 1M unique words, each word vector has 1M dimensions too, which is very bad for memory and performance. Also, the representation itself doesn’t show the semantic relations between the words: it only focuses on mapping the words into the Vector Space Model with unique vectors, regardless of the distances between them.

Still, you can use this method if the problem you want to solve isn’t closely related to semantic relations or meaning, and if you know that your vocabulary isn’t large.
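To make the "no semantic relations" point concrete, here is a small check (a sketch, assuming NumPy) showing that every pair of distinct one-hot vectors is exactly the same Euclidean distance apart, so no word ends up closer to another than to any other word.

```python
import numpy as np
from itertools import combinations

# One-hot vectors for a 5-word vocabulary are the rows of the identity matrix.
one_hot_vectors = np.eye(5)

# Euclidean distance between every pair of distinct words.
distances = [np.linalg.norm(a - b) for a, b in combinations(one_hot_vectors, 2)]

# Every pair is sqrt(2) apart, whatever the words mean.
print(np.allclose(distances, np.sqrt(2)))  # True
```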

Word2vec Basics

Word2vec is a technique, or a paradigm, that consists of a group of models (Skip-gram and Continuous Bag of Words (CBOW)). The goal of each model is to produce fixed-size vectors for the corpus words, so that words with similar or close meanings have close vectors (i.e. vectors whose Euclidean distance is small).

The idea of word2vec is to represent each word using the words that surround it. For example, assume that we have the following sentence:

“I like playing football”

To learn a vector for the word “playing”, we need to be able to produce its surrounding words from the word “playing” itself. In this case, the surrounding words are “like” and “football”. Note that the number of surrounding words depends on the window size you set: they are the N words before and the N words after the input word, where N is the window size.
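Extracting the surrounding words for a given window size can be sketched as follows. The helper name `context_words` is hypothetical and only used here for illustration.

```python
def context_words(tokens, position, window_size):
    """Return up to `window_size` words before and after tokens[position]."""
    start = max(0, position - window_size)
    before = tokens[start:position]
    after = tokens[position + 1 : position + 1 + window_size]
    return before + after

tokens = "I like playing football".lower().split()
print(context_words(tokens, tokens.index("playing"), window_size=1))
# ['like', 'football']
```

Words near the beginning or end of a sentence simply get fewer context words in this sketch.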

Word2vec Philosophy

This idea is quite philosophical. If you think about it, you will realize that we, humans, learn the meaning of words from their nearby words. For example, assume that there is an unknown word that you haven’t seen before, let’s call it “X”, and we have the following sentence:

“I like playing X”

So, you may not know the meaning of X, but you surely know that “X” is something we can play, and also that it is something some people enjoy. Your brain reaches this conclusion after looking at X’s surrounding words, “like” and “playing”. This is the idea behind the word2vec technique.

In conclusion, word2vec uses the context of the text to produce the word vectors. These word vectors consist of floating point values that are obtained during the training phase of the models, Skip-gram and CBOW. Both models are Shallow Neural Networks and do the same job, but they differ in their input and output.

Dataset

Before getting into the word2vec models, we need to define some sentences as a corpus to help us in understanding how the models work. So, our corpus is:

D = {
“This battle will be my masterpiece”, // LoL players will relate 😛
“The unseen blade is the deadliest”
};

As we did in One-Hot-Encoding, we need to extract the unique words and initialize each of them with a sparse one-hot vector. We need to do that to be able to feed these words to the Neural Networks.

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11,

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] = this
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] = battle
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] = will
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] = be
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] = my
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] = masterpiece
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] = the
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0] = unseen
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] = blade
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] = is
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] = deadliest
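The vocabulary and its one-hot vectors can be built from the corpus like this. It is a minimal sketch, assuming NumPy and assuming the corpus holds the two sentences used in this post (“This battle will be my masterpiece” and “The unseen blade is the deadliest”); words are indexed in order of first appearance, which reproduces the order of V above.

```python
import numpy as np

corpus = [
    "This battle will be my masterpiece",
    "The unseen blade is the deadliest",
]

# Collect unique lower-cased words in order of first appearance.
vocab = []
for sentence in corpus:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab.append(word)

# Row i of the identity matrix is the one-hot vector of vocab[i].
one_hot_vectors = np.eye(len(vocab), dtype=int)

for word, vector in zip(vocab, one_hot_vectors):
    print(vector, "=", word)
```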

Skip-gram Model

Skip-gram is a Shallow Neural Network. The goal of this model is to predict the nearby words given a target word. So, the input of this network is the 1-hot-encoded vector of the target word, and the output is N words, where N is the window size of the context, which is determined by the network creator. In the above figure, the number of words to be predicted is 3, so we have 3 Softmax layers, and the number of neurons in each layer is the vocabulary size.

Let’s see how the model works with the sentence “The unseen blade is the deadliest”. Assume that the target word whose representation we want is “unseen”, and the nearby words are “The”, “blade” and “is”. Note that our embedding size will be 3; in other words, we want our embedding vector to have a fixed size of 3. The weights are initialized with random values, as in any other neural network. Usually, they are drawn from a Gaussian distribution with a known mean and variance.

Input-to-hidden

After feedforwarding the input word, we get the hidden layer values [0.8, 0.4, 0.5]. This is considered to be the first obtained representation of the word “unseen”. And what about the other words of the vocabulary? The rows of the weight matrix are the representations of the words (i.e. each row of the weight matrix is the vector representation of one word). For example, given the weight matrix, the representation of the word “this” is [0.1, 0.9, 0.3], and so on for the other words.
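The reason each row of the weight matrix is a word representation is that multiplying a one-hot vector by the matrix simply selects one row. Here is a small sketch, assuming NumPy. Only the rows for “this” and “unseen” use the values quoted in the text; every other row is a random placeholder, and the name `W_in` is hypothetical.

```python
import numpy as np

vocab = ["this", "battle", "will", "be", "my", "masterpiece",
         "the", "unseen", "blade", "is", "deadliest"]

rng = np.random.default_rng(0)
# Input-to-hidden weights: one row of size 3 per vocabulary word.
W_in = rng.normal(0.0, 0.1, size=(len(vocab), 3))   # placeholder values
W_in[vocab.index("this")] = [0.1, 0.9, 0.3]         # value quoted in the text
W_in[vocab.index("unseen")] = [0.8, 0.4, 0.5]       # value quoted in the text

# One-hot input for the target word "unseen".
x = np.zeros(len(vocab))
x[vocab.index("unseen")] = 1.0

hidden = x @ W_in   # same as looking up the row W_in[vocab.index("unseen")]
print(hidden)
```

In practice the multiplication is usually replaced by the direct row lookup, since the result is the same and far cheaper.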

Hidden-to-output

Now, we can apply the Softmax activation function to each of the output vectors, so that we get the output words given the target word “unseen”.

We take the maximum value of each vector and get its corresponding 1-hot-encoded word: the index of the maximum value is the index of the value “1” in the 1-hot-encoded vector. According to these vectors, the predicted words are “is”, “deadliest” and “unseen”.
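The Softmax-then-argmax step can be sketched as below, assuming NumPy. Following the description above, the sketch uses one hidden-to-output matrix per predicted word. The matrices are random placeholders, so the words it prints will generally differ from the ones in the figure.

```python
import numpy as np

vocab = ["this", "battle", "will", "be", "my", "masterpiece",
         "the", "unseen", "blade", "is", "deadliest"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / exps.sum()

rng = np.random.default_rng(0)
hidden = np.array([0.8, 0.4, 0.5])  # hidden layer values of "unseen" from the text
# Assumption: one randomly initialized 3 x 11 matrix per predicted context word.
output_weights = [rng.normal(0.0, 0.1, size=(3, len(vocab))) for _ in range(3)]

predictions = []
for weights in output_weights:
    probabilities = softmax(hidden @ weights)
    # The index of the largest probability is the index of the "1"
    # in the predicted word's one-hot vector.
    predictions.append(vocab[int(np.argmax(probabilities))])

print(predictions)
```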

During the training phase, we have the actual output for the target word, so we compute the error of each output layer using a loss function. The loss function could be Mean Squared Error or Cross Entropy. Now we have 3 errors, one per Softmax layer, and we average them by dividing their sum by the number of errors (3 in this case).
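As a small illustration of the averaging, here is a sketch using Cross Entropy, assuming NumPy. The probability values and the 4-word vocabulary are hypothetical and chosen only to keep the example short.

```python
import numpy as np

def cross_entropy(probabilities, target_index):
    """Negative log-probability that the layer assigns to the correct word."""
    return -np.log(probabilities[target_index])

# Hypothetical Softmax outputs over a 4-word vocabulary, one row per Softmax layer.
outputs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
targets = [0, 1, 3]  # index of the actual context word for each layer

errors = [cross_entropy(p, t) for p, t in zip(outputs, targets)]
average_error = sum(errors) / len(errors)
print(average_error)
```

The more probability a layer puts on the correct word, the smaller its error, so the first layer contributes the least to the average here.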

After obtaining the error, we backpropagate it to the weights, so that the weight values are updated according to the obtained error. As you can see, the values of the weights change with each training batch, which also changes the word representations.

We continue the iterations until we reach the maximum number of epochs or the target error. Then we stop the training and take the Input-to-hidden weight matrix, in which each row represents one word in the space.
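Putting the steps together, a tiny Skip-gram training loop could look like the sketch below, assuming NumPy. Note the simplifications, which are my assumptions and not part of the description above: it trains on one (target word, context word) pair at a time with a single shared hidden-to-output matrix, uses Cross Entropy with plain gradient descent, and stops after a fixed number of epochs. The learning rate, window size and epoch count are arbitrary.

```python
import numpy as np

corpus = ["this battle will be my masterpiece", "the unseen blade is the deadliest"]
sentences = [sentence.split() for sentence in corpus]
vocab = sorted({word for tokens in sentences for word in tokens})
index = {word: i for i, word in enumerate(vocab)}

def training_pairs(window_size):
    """All (target, context) index pairs within the window."""
    pairs = []
    for tokens in sentences:
        for pos, target in enumerate(tokens):
            start = max(0, pos - window_size)
            for context in tokens[start:pos] + tokens[pos + 1 : pos + 1 + window_size]:
                pairs.append((index[target], index[context]))
    return pairs

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.1, size=(len(vocab), 3))    # Input-to-hidden
W_out = rng.normal(0.0, 0.1, size=(3, len(vocab)))   # Hidden-to-output
pairs = training_pairs(window_size=2)
learning_rate = 0.1

def average_loss():
    return np.mean([-np.log(softmax(W_in[t] @ W_out)[c]) for t, c in pairs])

loss_before = average_loss()
for epoch in range(200):
    for target, context in pairs:
        hidden = W_in[target].copy()
        probabilities = softmax(hidden @ W_out)
        grad_scores = probabilities.copy()
        grad_scores[context] -= 1.0            # gradient of Cross Entropy w.r.t. the scores
        grad_hidden = W_out @ grad_scores      # computed before W_out is changed
        W_out -= learning_rate * np.outer(hidden, grad_scores)
        W_in[target] -= learning_rate * grad_hidden
loss_after = average_loss()

print(loss_before, loss_after)
print(W_in[index["unseen"]])  # learned vector of "unseen"
```

After training, each row of `W_in` is the vector of one vocabulary word.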

Alternatively, we can use the Hidden-to-output weights to represent the words. In this case, we average the Hidden-to-output weight matrices, and each column of the averaged weight matrix represents one word. But usually, we take the Input-to-hidden weight matrix.


Continuous Bag of Words (CBOW) Model

CBOW is also a Shallow Neural Network. The goal of this model is to predict a target word given the nearby words. So, the input of this network is N words, where N is the window size of the context, which is determined by the network creator, and the output is the target word, which is obtained using a Softmax layer. In the above figure, 3 words are used as input. As before, the network weights are initialized randomly in the beginning.

The idea of CBOW is that the hidden layer values represent the mean of the input words; in other words, the bottleneck features of the network contain the mean vector of the input word vectors.

Recalling our previous sentence, “The unseen blade is the deadliest”, let’s see how the model works with it. The target word to be predicted is “unseen” and the context words are “the”, “blade” and “is”.

Input-to-hidden

To get the values of the hidden layer, we just need to multiply each input word by the weight matrix and average the resulting vectors.

So, the values of the hidden layer would be
H = [(0.8 + 0.2 + 0.2) / 3, (0.9 + 0.8 + 0.3) / 3, (0.1 + 0.9 + 0.7) / 3] ≈ [0.40, 0.67, 0.57]
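The same average can be checked with NumPy. The three rows below are the context word vectors implied by the numbers in the formula. Which row belongs to which context word is not stated here, so the row order is an assumption; it doesn't affect the mean.

```python
import numpy as np

# Vectors of the three context words, read column-wise from the formula above.
context_vectors = np.array([
    [0.8, 0.9, 0.1],
    [0.2, 0.8, 0.9],
    [0.2, 0.3, 0.7],
])

# The hidden layer is the element-wise mean of the context vectors.
hidden = context_vectors.mean(axis=0)
print(np.round(hidden, 2))  # [0.4  0.67 0.57]
```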

Hidden-to-output

After getting the hidden layer values, we get the output layer values by multiplying the hidden layer vector by the hidden-to-output weight matrix.

Now that we have the output layer values, we apply Softmax again to get the target word.

The predicted word is “masterpiece”. We have the actual output like before, so, we get the error and backpropagate the error to the weights so that we update the weights values.
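A CBOW forward pass followed by a few weight updates can be sketched as below, assuming NumPy. The weights are random placeholders rather than the ones in the figure, and the loss (Cross Entropy), learning rate and number of steps are assumptions made for illustration.

```python
import numpy as np

vocab = ["this", "battle", "will", "be", "my", "masterpiece",
         "the", "unseen", "blade", "is", "deadliest"]
index = {word: i for i, word in enumerate(vocab)}

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.1, size=(len(vocab), 3))    # Input-to-hidden
W_out = rng.normal(0.0, 0.1, size=(3, len(vocab)))   # Hidden-to-output

context = [index[word] for word in ("the", "blade", "is")]
target = index["unseen"]
learning_rate = 0.1

def forward():
    """Average the context vectors, then apply Softmax to the output scores."""
    hidden = W_in[context].mean(axis=0)
    return hidden, softmax(hidden @ W_out)

_, probabilities = forward()
loss_before = -np.log(probabilities[target])

for _ in range(50):
    hidden, probabilities = forward()
    grad_scores = probabilities.copy()
    grad_scores[target] -= 1.0               # gradient of Cross Entropy w.r.t. the scores
    grad_hidden = W_out @ grad_scores        # computed before W_out is changed
    W_out -= learning_rate * np.outer(hidden, grad_scores)
    # Each context word contributed 1/3 of the mean, so it gets 1/3 of the gradient.
    W_in[context] -= learning_rate * grad_hidden / len(context)

_, probabilities = forward()
loss_after = -np.log(probabilities[target])
print(loss_before, loss_after)
```

The error on this single example shrinks as the updates are repeated, which is the effect backpropagation is meant to have.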

Conclusion

Word2vec is considered to be at the core of applying deep learning in Natural Language Processing. Traditionally, rule-based systems that depend on syntactic and linguistic features were used to solve most NLP problems and applications.

No rule-based system can generalize its solution to all real-world data. Also, things that apply to one language don’t necessarily apply to other languages. So, it was a must to build deep learning systems that depend on the semantic representation of words and text.

Also, there are some hard and challenging problems in NLP that need deep learning to be solved, such as Machine Translation and Paraphrase Detection. Both depend on the semantics of the words (which is common to any language), so it was a must to move to deep learning to learn in a semantic way, instead of relying on the syntax and grammar used by rule-based algorithms. [https://www.quora.com/Why-do-we-use-neural-network-to-do-NLP-task/answer/Ahmed-Hani-Ibrahim]

To be able to move to deep learning, it was a must to work with words in a much more semantic way and to build systems that actually understand the language itself like we do as humans. Word2vec is the bridge to such a task: it represents words in the vector space in a semantic manner, so that the machine understands the words in a semantic way.

As you can see, Word2vec seems to need a large amount of training data and high-performance machines to train on it. Recall that we applied Softmax on the output layer. Applying Softmax over a very large vocabulary is very costly and takes a lot of time, so some optimizations to the Softmax were needed to reduce the complexity of the process. In the next post, we will talk about Negative Sampling and Hierarchical Softmax.


## 7 thoughts on “[Thesis Tutorials I] Understanding Word2vec for Word Embedding I”

1. Berkay says:

One of the best tutorial I have ever seen. Amazing job, thanks

2. Berkay Celik says:

Regarding my previous comment. I go over the tutorial carefully one more time. In your definition of skip-gram, you give a word, “unseen”, as an input.What I’m thinking is the output should be some other words that are obtained from skip-gram, in your example It might be better to show that these are the “The”, “blade” and “is”. This part of the tutorial gets a little bit confusing. Our text corpus might have other words also that includes “unseen” and the similar process will be repeated. In this manner, we will capture the semantics of the words. Again, this is a great tutorial, just to make comments to clear the confusion. Thanks.

1. Thanks a lot for your comment and feedback.
You are absolutely right. I am editing this to clear such confusion by tonight.

3. Maha says:

Why do you use different weight matrices in the hidden to output in the skip-gram model (same for the input to hidden in the CBOW)? Thanks.

1. Because the weights are regularly initialized random when building the neural network. The weights between the layers could be different according to the random generator model.