[Thesis Tutorials II] Understanding Word2vec for Word Embedding II

Previously, we talked about Word2vec model and its Skip-gram and Continuous Bag of Words (CBOW) neural networks.

Regularly, when we train any of Word2vec models, we need huge size of data. The number of words in the corpus could be millions, as you know, we want from Word2vec to build vectors representation to the words so that we can use it in NLP tasks and feed these vectors to any discriminative models. For example, in Machine Translation task, we need very huge size of words to be able to cover the language’s words and context.

Recalling Skip-gram model
You see that each of output layer in the above neural network has V dimension, which is the corpus unique vocabulary size. If we have 1M of unique words, then the size of each output layer would be 1M neuron unit. As we said in the previous post, to be able to get the words, we need to apply Softmax activation function to the output layer, so that we change the vector to probabilistic values and get the index of the maximum probability which corresponds to a 1-hot-encoding word vector.

To apply Softmax activation function, we use the following formula
The denominator is the summation of the output exponential values. The complexity of this operation is O(V) (regardless the exp() complexity), where V is the number of words in the corpus. So, this operation will have a bad performance if the vocabulary size is too high. Also, in Skip-gram, we need to do this in multiple Softmax output layer. This would be totally costly and waste alot of time when training the Word2vec models.

There was a need to make a new technique to reduce the Softmax complexity and make the training more efficient. There exist 2 different techniques for optimization, Hierarchical Softmax and Negative Sampling. They are mentioned in the original paper of Word2vec models.


Recalling our dataset from the previous post

D = {
“This battle will be my masterpiece”,
“The unseen blade is the deadliest”
};

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11


Hierarchical Softmax

In general, the target of Softmax layer is to maximize the probability of a word given its context P(word | context). Assume that from our dataset and using CBOW model, we want to predict the word “unseen” given the context “The”, “blade” and “is”. So, our probability formula would be P(unseen | the, blade, is).

Regularly, we use a flat Softmax layer to get that probability after applying the Softmax formula and get the probabilities. But in Hierarchical Softmax we don’t use a flat layer, but instead, we use a balanced Huffman tree layer, like the following.

So, to get the probability of the word “unseen”, we need to calculate (from the root) the following probability multiplication: P(right) * P(left) * P(right) * P(right).

or P(left) * P(right) * P(right) * P(right)

The question now, how we could get these probabilities?

The idea of getting these probabilities is the same idea of logistic regression. Recalling a simple logistic regression which uses Sigmoid function in its output neuron


In logistic regression, we get the probability of Y given the features set X, P(Y | X). You get this probability by multiplying the features vector with the weights matrix, then apply the Sigmoid function to get the probability. Sigmoid(X * W + b).

Well, we will do the same with our tree. Our tree nodes would be the output of a logistic function and we connect each node of the tree by an associated weight matrix so that we do the multiplication between the hidden layer values of the neural network and the weights to get the probability of each node of the tree.

The input to the tree nodes are the hidden layer values of the neural network (the layer before the softmax). You can imagine it as we have now a new neural network that has the hidden layer values as input and the output is each node of the tree.

For each tree layer, we get the probabilities of the nodes which will lead us to know what is the predicted word by choosing the nodes with the maximum probability value in each tree layer.

You can see that the complexity is reduced from O(N) to O(log(N)) which is a great performance improvement. Also, we don’t need to calculate the values of the 2 nodes of the subtree. You need only to get the probability of one of them and get the other by subtracting 1 from the other probability as P(right) + P(left) = 1.


Negative Sampling

Negative sampling is another approach for training the neural network in much more optimized approach. Actually, Negative Sampling is the most popular approach when training Word2vec models and it is implemented in Tensorflow’s Word2vec.

The idea of negative sampling is quit easy to understand and in implementation. Recalling our Softmax layer, you will see that the number of neurons are the same of the number of corpus’s words. After getting the error, we update the whole weights of the neural network. If you thought about that, actually, we don’t have to do this. We know that we have all the corpus words in the training items, so, the words weights will be certainly updated during the training phase.

Using this truth, we can instead update some of the words weights for each training example. These words would contain the positive word (the actual word that shall be predicted) and other words that are considered to be negative samples (incorrect words that shall not predicted). If you assumed that the number of negative samples are 10, then the softmax layer has only 11 neuron unit (10 negative + 1 positive) and the error will be used to update only 11 * hidden_layer_size weights instead of V * hidden_layer_size.

The question now is, how to select these negative samples?

Traditionally, choosing these negative samples depends on the words frequency in the given corpus. The higher frequency of the word means the higher chance of choosing it as a negative sample.

The following formula is used to determine the probability of selecting the word as a negative sample.

Where c is a constant that is selected by the model creator. After applying that, we choose the maximum N words, where N is the number of the negative samples.


References

– https://arxiv.org/pdf/1411.2738.pdf
– http://cs224d.stanford.edu/syllabus.html?utm_content=buffer891d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
– https://www.tensorflow.org/tutorials/word2vec
– http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Drawings are created using www.draw.io .

[Thesis Tutorials I] Understanding Word2vec for Word Embedding I

Vector Space Models (VSMs): A conceptual term in Natural Language Processing. It represents words as set of vectors, these vectors are considered to be the new identifiers of the words. These vectors are used in the mathematical and statistical models for classification and regression tasks. Also, they shall be unique to be able to distinguish between them when they are proceeded to the other models.

Word Embedding: A technique to represent the words by fixed-size vectors, so that the words which have similar or close meaning have close vectors (i.e vectors that their euclidean distance is small). In other words, the representation involves the semantic meaning of the words, this is very important in many different areas and tasks of Natural Language Processing, such as Language Modeling and Machine Translation.

Shallow Neural Networks: Neural Networks which contain only 1 hidden layer, this layer is considered to be the projection layer of the input layer, so that this hidden layer is the new representation of features of the raw data input.


Basically, when we work with machine learning statistical and mathematical models, we always use numerical values data to be able to produce the results. Models such as classifiers, regressors and clustering deal with numerical data features. But, how this could be done with text data? How to represent the words by numerical features so that they can be used and fed to our models?


One-Hot-Encoding

Traditionally, such problem could be solved using a popular approach which is called One-hot-Encoding. In which each word in our corpus will have a unique vector and it will be the new representation of the word.

For example, assume that we have the following 2 sentences:
“I like playing football”
“I like playing basketball”

First, we need to extract the unique words of the data, in our case, the unique words are the set V = {I, like, playing, football, basketball} and ||V|| = 5. Note that, you need to normalize the words and make them all in lower-case or upper-case (e.g football is the same of Football), we don’t care about the letter case in such representation.

Now, we create a vector with all zeros, and its size is the number of unique words ||U||, and and each word will have a 1 value in unique index of the initial zeros vector. So, our words representation will be

I = [1, 0, 0, 0, 0], like = [0, 1, 0, 0, 0], playing = [0, 0, 1, 0, 0], football = [0, 0, 0, 1, 0], basketball = [0, 0, 0, 0, 1].

As you see, it is very simple and straight forward. It can be implemented easily using programming. But, you for sure noticed its cons. This approach heavily depends on the size of the words, so, if we have a corpus with 1M words, then the word dimension would be 1M too! which is very bad in performance, also, the representation itself doesn’t show the semantic relations between the words and their meaning, so, this representation doesn’t care about the semantic meaning of the words, it only focus on transforming the words to the Vector Space Model where we have unique vectors, regardless their distances.

But, you can use this method if the problem that you want to solve isn’t closely related to the semantic relations or meaning, also, if you know that your unique words size isn’t large.


Word2vec Basics

Word2vec is a technique or a paradigm which consists of a group of models (Skip-gram, Continuous Bag of Words (CBOW)), the target of each model is to produce fixed-size vectors for the corpus words, so that the words which have similar or close meaning have close vectors (i.e vectors that their euclidean distance is small)

The idea of word2vec is to represent words using the surrounding words of that word. For example, assume that we have the following sentence:

“I like playing football”

To be able to represent a vector for the word “playing”, we need to be able to produce the surrounding words using the “playing” word, in that case, the surrounding words are “like” and “football”. Note that, these surrounding could be increased according to the window size you set, in other words, the surrounding words could be N words before and after the input word, where N is the window size.


Word2vec Philosophy

Well, this idea is considered to be philosophical. If you think about that, you will realize that we, humans, know the meaning of the words by their nearby words. For example, assume that we have an unknown word that you haven’t seen before, lets call it word “X”, and we have the following sentence:

“I like playing X”

So, you may not know the meaning of X, but you for sure know that “X” is something we can play with, and also, it is something that can be enjoyable to some people. Your brain reaches to this conclusion after reviewing the X’s surrounding words, “like” and “playing”. This is the idea of inventing word2vec technique.

In conclusion, word2vec tries to use the context of the text to be able to produce the word vectors, and these word vectors consist of floating point values that are obtained during the training phase of the models, Skip-gram and CBOW, these models are Shallow Neural Networks, they do the same job, but they are different in its input and output.


Dataset

Before getting into the word2vec models, we need to define some sentences as a corpus to help us in understanding how the models work. So, our corpus is:

D = {
“This battle will be my masterpiece”, // LoL players will relate 😛
“The unseen blade is the deadliest”
};

As we did in One-Hot-Encoding, we need to extract the unique words and initial its vectors by a sparse vector. We need to do that to be able to use these words in the Neural Networks.

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11,

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] = this
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] = battle
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] = will
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] = be
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] = my
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] = masterpiece
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] = the
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0] = unseen
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] = blade
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] = is
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] = deadliest


Skip-gram Model

Skip-gram is a Shallow Neural Network. The target of this model is to predict the nearby words given a target word. So, the input of this network is a 1-hot-encoded vector of the target word and the output is N words where is the window size of the context which is determined by the network creator. In the above figure, the number of words that shall be predicted are 3 words, so, we have 3 Softmax layers and the number of neurons each layer is the vocabulary size.

Let’s see how the model works with the sentence “The unseen blade is the deadliest”. Assume that the target word that we want to know its representation is “unseen” and the nearby words are “The”, “blade” and “is”. Note that, our target vector size will be 3, in other words, we want our embedding vector to be a fixed-size of 3, also, the weights are initialized with random values like any other neural network. Regularly, the weights are initialized with Gaussian distribution with known mean and variance.

Input-to-hidden

After feedforwarding the input word, we get the hidden layer values [0.8, 0.4, 0.5]. This is considered to be the first obtained representation of the word “unseen”. Well, and what about the other words of the vocabulary? The values of the weight matrix would be the representations of the words (e.g each row of the weight matrix represents the vector representation of the words). For example, given the weight matrix, the word “this” representation will be [0.1, 0.9, 0.3] and so the other words.

Hidden-to-output

Now, we can apply Softmax activation function to each of these vectors, so that we will get the output words given the target word “unseen”.

We get the maximum value of each vector and get its corresponding 1-hot-encoded words, the index of the maximum value would be the index of value “1” of the 1-hot-encoded vectors . According to these vectors, the predicted words are “is”, “deadliest” and “unseen”.

During the training phase, we have the actual output of the target word, so, we get the error of each output layer using a loss function. The loss function could be Mean Squared Error or Cross Entropy. Now, we have 3 errors of each Softmax layer, we get the average of the error by dividing the summation of the errors by the number of  errors (3 in this  case).

After obtaining the error, we backpropagate the error to the weights so that we update the weights values according the obtained error. As you see, the values of the weights would change for each batch of training, which also would change the words representation.

We continue the iterations till we reach the maximum number of epochs or we get the target error so that we can stop the training and get the Input-to-hidden weight matrix in which each row will represent each word in the space.

Alternatively, we can get the weights of the Hidden-to-output to represent the words. In this case, we get the average of the Hidden-to-output weight martrix and each column of the averaged weight matrix will represent the words. But regularly, we take the Input-to-hidden weights matrix.


Word2vec is a technique which consists of a group of models (Skip-gram, Continuous Bag of Words (CBOW)), the target of each model is to produce fixed-size vectors for the corpus words, so that the words which have similar or close meaning have close vectors (i.e vectors that their euclidean distance is small)


Continuous Bag of Words (CBOW) Model

CBOW is also a Shallow Neural Network. The target of this model is to predict a target word given nearby words. So, the input of this network are N words where is the window size of the context which is determined by the network creator and the output is the target word which is obtained using Softmax layer. In the above figure, the number of words that will be used as an input would be 3 words, also, the network weights are initialized randomly in the beginning.

The idea of CBOW is that the hidden layer values will represent the mean of the input words, in other words, the bottleneck features of the network is contains the mean vector of the input words vectors.

Recalling our previous sentence, “The unseen blade is the deadliest”, let’s see how the model works with it. The target word to be predicted is “unseen” and the context words are  “the”, “blade” and “is”.

Input-to-hidden

To get the values of the hidden layer, we just need to get the average of the resultant of multiplying the input words of their weights matrix.

So, the values of the hidden layer would be
H = [(0.8 + 0.2 + 0.2) / 3, (0.9 + 0.8 + 0.3) / 3, (0.1 + 0.9 + 0.7) / 3] = [0.39, 0.66, 0.56]

Hidden-to-output

After getting the hidden layer values, we get the output layer values by multiplying the hidden layer vector with the hidden to output weight matrix.

Now, we have the output layer values, we need to apply Softmax again to get the target word.

The predicted word is “masterpiece”. We have the actual output like before, so, we get the error and backpropagate the error to the weights so that we update the weights values.


Conclusion

Word2vec is considered to be a the main core of applying deep learning in Natural Language Processing. Traditionally, rule-based systems that depend on some syntactic and linguistic feature were used to solve most of NLP problems and application.

Any rule-based system cannot generalize the solution to all real world data. Also, things that could be applied in a language doesn’t necessarily could be applied to the other languages. So, it was a must to make deep learning systems that depend on the semantic representation of language words and text.

Also, there are some hard and challenging problems in NLP that needs deep learning to be solved, such as Machine Translation and Parapharsing Detection. Both, depends on the semantic of the words (which is common thing in any language), so, it was a must to move to deep learning to enhance the learning in the semantic way, instead of syntactic and language grammar that use rule-based algorithms. [https://www.quora.com/Why-do-we-use-neural-network-to-do-NLP-task/answer/Ahmed-Hani-Ibrahim]

To be able to move to deep learning, it was a must to work with the words in much more semantic way and make systems that actually understands the language itself like we do as humans. So, Word2vec is the bridge to do such a task. Representing words in the vector space in semantic manner, so that we make the machine understands the words in semantic way.


As you see, it seems that Word2vec needs so much amount of training data and high performance machines to be able to train these data. If you think of that, we applied Softmax on the output layer. Applying Softmax on very huge size of words would be very costly and would need so much time, so, it was needed to make some optimization to the Softamx to reduce the process complexity. In the next post, we will talk about Negative Sampling and Hierarchical Softmax.


References

– https://arxiv.org/pdf/1411.2738.pdf
– http://cs224d.stanford.edu/syllabus.html?utm_content=buffer891d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
– https://www.tensorflow.org/tutorials/word2vec

Hello, Valor!

Being a researcher and a programmer can make the life so boring. Actually, I have been feeling boring for 2 months. My life is full of same routine processes, going to work, back to home and study, or code some freelancing tasks, waiting Thursday to meet my friends and hangout. This routine really made my life so boring that effects on my daily life, at work and at home. So, I decided to try something new, try something that can refresh my life and to be added to my daily life.

I love listening to music, actually, I can’t work, study or even play without listening to music (with lyrics or without), but, I never thought that I can learn music. This boredom state was in need to learn something new, so, I didn’t find an idea better than learning music, specifically, learning to play on Violin, as it is my favourite instrument that I like to listen to.

I took the decision, enrolled in a course and bought my Violin. I am still in Level 1 at the course, the course consists of 7 levels, and I am very motivated to learn the basics and be somehow good with Violin before the end of this year. This is a long way to run, but, I am very happy, exciting to learn something new and glad that I am on my way to break my boredom state.

So, lets say, welcome, Valor (I named it Valor), I hope that I will be a good companion to you!

Generative Adversarial Networks [2] – Camouflage your Predator!

You can imagine the situation by learning without any prior knowledge about something, you just do mistakes and you let others correct them to you. Continuing this process will lead you to learn how to do that thing, although you may still don’t know what are you doing?!

saad

This was a comment from my professional friend Moustapha Saad on my recent post which was an overview about Generative Adversarial Networks (GANs). I find that his description is very true. It describes the true nature of GANs, so, I decided to use it as a title of this post.

This post is considered to be the part 2 of the previous post. In the previous post, we talked about the general idea of GANs, the discriminative and the generative models which cooperate with each other to improve themselves. In this post, we will focus on how the generative models work in general and some different types of generative models that could be used when building GANs.


Maximum Likelihood Generative Models

Assume that we have a dataset that contains m training samples, these samples follow an unknown probability distribution, it could be Gaussian, Bernoulli or any other probability distribution, actually, we don’t care about knowing such distribution. Our focus is to be able to generate similar data that has the same distribution of the training dataset. We can formulate the problem as a Maximum Likelihood Estimation problem.

The idea of Maximum Likelihood is to make a model that has some parameters which make it be able to generate the same probability distribution, P(X;Θ). As we we have m samples in the training data, we can generalize the Maximum Likelihood all over the data, Πm=1Pmodel(Xi;Θ).

Now, the target is to find the Θ that Maximize the likelihood, so, our mathematical representation of this problem would be Θ := arg Θ max Πm=1Pmodel(Xi;Θ)

To avoid any numerical problems, it is always better to transform to the log space, as you know, log(x * y) = log(x) + log(y), we turns the above equation to sum instead of product. So, our final Empirical Distribution equation to obtain the Maximum Likelihood of the training data would be Θ := arg Θ max Σm=1log Pmodel(Xi;Θ).

We call the above model as an Empirical Distribution, as we don’t pretty certain about the distribution we get after getting the parameter Θ. The question is, how to know whether the obtained distribution is really the actual distribution of the training data?

Kullback–Leibler Divergence (KL-divergence) is a measurement that could be used to answer the above question, and it can be used in both the discrete and continuous probability distributions. Given 2 probability distribution, P and Q, the KL-divergence is measured by DKL(P || Q) = Σi P(i) log (P(i) / Q(i)).

The target is to minimize the KL-divergence, the less difference between the distributions of P and Q means that they are close to each other. So, recall our Maximum Likelihood, after getting the empirical distribution, we measure the KL-divergence between the empirical distribution and the actual distribution, in other words, minimizing the KL-divergence means maximizing the Likelihood of the training dataset.


Explicit vs. Implicit Maximum Likelihood Models

After knowing the general idea of Maximum Likelihood models and their target, we need to be more specified when we consider talking about Maximum Likelihood generative models. Actually, there are 2 different types of Maximum Likelihood models, Explicit and Implicit models.

Explicit Models: Models that fully depend on the training data, and try to learn the parameters of P(X;Θ), which are the mean and variance, using the regular gradient or any other optimization technique. To be able to make the training efficient, we need to design a model that makes use of the training samples, also, the collected training samples should be generalized enough to cover all variables of the probability distribution and this may be hard in some real-life data, finally, the explicit models require some complex computational power to learn the parameters.

Implicit Models: In these models, we don’t actually use the data to learn the true distribution parameters, we assume that the distribution parameters are hidden and we try to predict it using some observed variables from the data. These observed variables have no rules to be determined, they are extracted by the researcher or by an expert who may have a pre-knowledge about the data. You can think of it as an inference stochastic process that could be done using Markov Chains, you use some observable data to be able to get the hidden variables (called Latent variables), refer to this post to know more about Markov Chains. We use these latent variables (which makes a latent space) to generate a similar data based on the given training samples. Regularly, implicit models could be used for dimension reduction of the data without any prior knowledge about the data distribution.

The question now, to what model the generative model in GANs belongs to?

Recalling GANs figure,

You find that GAN uses the latent space to be able to generate data that is considered to be in similar distribution of the real samples. So, it is an implicit model, but it doesn’t use any observable data like Markov Chains do, it deals directly with the training sample, in other words, it treats the samples of the training data as the latent variables, and change their distribution by adding some noise to the sample features. That’s why it is considered to be an implicit model, it doesn’t use the data distribution, and it doesn’t depend on any observable data, like we do with Markov Chains.

To clarify more, you can imagine the situation by learning without any prior knowledge about something, you just do mistakes and you let others correct them to you. Continuing this process will lead you to learn how to do that thing, although you may still don’t know what are you doing?!

Actually, there are many different types of Maximum Likelihood models in general, the following figure (taken from Ian Goodfellow tutorial, you fill find it in the references) is a taxonomy of  different types of models.

tax


The Generative Model’s Camouflage Process

In GANs, the generative model tries to take some real samples from the training data, without any prior knowledge about its distribution, and replacing some actual features with other fake features (camouflage). As the generative model is a Deep Neural Network, the fakes features could be added in the hidden layers after feedforward the input to the neural network. Alternatively, the noise could be added before input the features to the neural network, and let the network produce the fake sample after feedforwarding the noisy fake sample. We can represent the generative model in mathematical function by f(X, Noise) = Z.

The discriminator (Predator) then receives 2 signals, the first one is X, which is the real sample, and Z, which is the fake sample. The role of the discriminator is to assign 2 probabilities, using softmax for example. If it gives a probability that is close to 1 to the real sample, which means that it perfectly distinguishes between the real and fakes samples, then it backpropagates the error (calculated using cross entropy) to the generative model to fine-tune its parameters (neural network weights) to enhance its ability to produce a more tricky sample. If it gives a probability that is close to 1 to the fake sample, which means that it sadly doesn’t distinguish between the real and fakes samples, then it backpropagates the error to the discriminator model to fine-tune its parameters to enhance its ability to avoid the fake sample.


Existed GANs with Popular Generative Models

Commonly, Denoising Auto-Encoders are used nowadays when building GANs. There exists an interesting project about Face Completion using Generative Adversarial Denoising Autoencoder, you can find the project website here http://www.cc.gatech.edu/~hays/7476/projects/Avery_Wenchen/. Also, there are other models that are called Adversarial Autoencoder (AAEs) https://arxiv.org/pdf/1511.05644.pdf and Deep Convolutional Generative Adversarial Networks https://github.com/carpedm20/DCGAN-tensorflow , but I don’t know much about them.

The idea of the Adversarial Learning is dominating nowadays and it is considered to be a hot topic in Deep Learning that could be used in multiple fields, such as Speech Recognition and Computer Vision.

 

References

– https://arxiv.org/pdf/1701.00160.pdf
– https://openreview.net/pdf?id=B16Jem9xe
– https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
– http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
– http://www.theanalysisfactor.com/what-is-a-latent-variable/
– http://www.cc.gatech.edu/~hays/7476/projects/Avery_Wenchen/
– https://arxiv.org/pdf/1406.2661.pdf
– http://cs.stanford.edu/people/karpathy/gan/

Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory

Before going into the main topic of this article, which is about a new neural network model architecture called Generative Adversarial Networks (GANs), we need to illustrate some definitions and models in Machine Learning and Artificial Intelligence in general.

Discriminative Models: Models that predict a hidden observation (called class) given some evidence (called features). In other words, we have some features and observations about an entity, and we want to predict its class, category or label. You can imagine the model as a function that has features as input and produces an output. The criteria that is used to produce the output depends on the model architecture and nature.

So, the discriminative model could be described in mathematical formula by f(x1, x2, x3, …, xn) = y, where n is the number of features, and the target of the function is to get the conditional probability P(y|x1, x2, x3, …, xn).

Support Vector Machines (SVMs) and Feedforward Neural Networks (FNNs) are examples of discriminative models that are used for classification.

Generative Models: Given some features, the models target is to learn how these features are produced, it tries to learn the distribution of the features. Assume that we have features x1, x2, x3, …, xn where n is the number of features, the model targets to learn the joint probability distribution of the features with the classes.

We can formulate this in mathematics by the joint probability P(x1, x2, x3, …, xn, y). After learning this distribution, we can estimate the conditional probability of the discriminative models to get the probability of the class given these features with their distribution.

Restricted Boltzmann Machines (RBMs) and Hidden Markov Models (HMMs) are examples of generative models. Note that, Vanilla Auto-encoders (AEs) aren’t considered to be a generative model, all what they do is just reconstruction of the features, on the other hand, Variational Auto-encoders (VAEs) belongs to generative models family.

Nash Equilibrium: A conceptual term that is used in Game Theory to describe a game situation in which the game players are satisfied by the decision he/she makes after revealing the other players strategies, and each player has no intention to change the strategy after knowing the other strategies as they didn’t effect on the strategy he/she used to win the game.

For example, assume we have a game in which each player has 2 choices to choose between them, and the 2 choices are known that are right and have the same effect regarding to the game points or rewards. The first player strategy maybe to choose the first choice, while the other player’s choice is the second one. After revealing that, each player is satisfied by the strategy he/she took because the other’s choice hasn’t effect badly to him/her.

Minimax: An algorithm that belongs to Game Theory and Statistics. The algorithm is used in games in which the game participants are 2 players, and each player tries to win the game by minimizing the worst case that is provided by the other player move, in other words, the player Minimize the Maximum move of the other player.

You can imagine the game of Chess, in which each player tries to win by making the best available move while the other player tries to minimize this move which is considered to be the best move by his side. Minimax is commonly used when making an AI-bot agent in Chess, Tic-tak-toc and Connect-4 games, you can generalize in the decision-making rule-based games.


Generative Adversarial Networks (GANs)

GANs consists of 2 models, a discriminative model (D) and a generative model (G). These models are participants on the training phase which looks like a game between them, and each model tries to better than the other.

The target of the generative model is to generate samples that are considered to be fake and are supposed to have the same distribution of the original data samples, on the other hand, the discriminative’s target is to enhance itself to be able to recognize the real samples among the fake samples generated by the generative model.

It looks like a game, in which each player (model) tries to be better than the other, the generative model tries to generate samples that deceives and tricks the discriminative model, while the discriminative model tries to get better in recognizing the real data and avoid the fake samples. It is as mentioned before, it is the same idea of the Minimax algorithm, in which each player targets to fail the other and minimize the supposed loss.

This game continues till we get a state, in which each model becomes an expert on what it is doing, the generative model increases its ability to get the actual data distribution and produces data like it, and the discriminative becomes expert in identifying the real samples, which increases the system’s classification task. In such case, we know that it reached that in which each model satisfied by its output (strategy), which is called Nash Equilibrium in Game Theory.

During the training phase, the loss function, that calculates the error, is used to update the 2 models parameters (learning the weights), also, the model can’t change the other’s parameters, the parameters are locally updated in each model using the global error.


This was an overview of this new arising model. I am still learning it and looking forward to using it in many applications, specifically in Natural Language Processing field.

References

https://arxiv.org/pdf/1701.00160v1.pdf
– http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
– https://en.wikipedia.org/wiki/Generative_adversarial_networks
– https://en.wikipedia.org/wiki/Minimax
– https://en.wikipedia.org/wiki/Nash_equilibrium
Artificial Intelligence: A Modern Approach book by Stuart Russell and Peter Norvig

Another LSTM Tutorial

The figures are taken from this great blog post by Christopher Olah


Recurrent Neural Networks

Recurrent Neural Networks (RNN) is a type of Neural Networks (NN) that is commonly used in problems that depend on sequential data. In sequential data, we should assume that the data is dependent to each other. For example, if we have a sentence that contains some words, to be able to predict any word of it, we need to memorize the previous words, because the sentence words are naturally homogeneous, in grammar or part-of-speech (POS), with each other.

Traditionally in the regular Multi-layer Perceptron (MLP) Neural Networks, we assume that the data is independent to each other, which is a wrong with some data like text or sound.

 RNNs have capabilities to “memorize” the previous data, as it contains self-loops. It saves the state and use it with the new data. This helps the network to take care of the dependencies between the data and take them into consideration when predicting the output.

The following is a 1-unit of a RNN, A . It has an input X and output h.

  

As you have figured, it is very similar to the traditional neuron in MLP, but it differs in the self-loop at the unit A. The self-loop holds the previous state of the neuron and fed it with the new input.

We can unroll (unfold) the figured model. You can see that it isn’t much different with the traditional MLP model.

 1

Through time, the self-loop values stores the previous experience of the previous data, and use it with the new input to obtain the predicted output. This helps to memorize the dependencies between the data together.

Long-term Dependencies Problems with Regular RNNs
As you have figured, there are no restrictions when saving the memory of the previous data, the network keeps saving the previous data without identifying whether this memory would be helpful in the next iterations or not. However, practically, this isn’t always right in the real sequential data.

Assume that we have a sentence like that. “I live in France, I like playing football with my friends and going to the school, I speak french”

Assume that we want to predict the word “french”, we don’t need to look to the previous two terms, ” I like playing football with my friends” and “going to the school”, we need only to know that “I live in France”, and ignore the unnecessary context that may confuse the network while training it.

Practically, Regular RNN can’t connect the related information and dependencies together, specifically if the information has some noise within it that could avoid us from the actual target.

This problem is the main reason that pushed the scientists to develop and invent a new variant of the RNN model, called Long short-term Memory (LSTM). LSTM can solve this problem, because it controls the “memorizing” process within its units using something like “gates”.


What is LSTM?

LSTM is a kind of RNN that is revolutionary used on many fields on computer science such as Natural Language Processing and Speech Recognition. Because of its capabilities of avoiding the problem of “long-term dependencies”

Like any other RNN, LSTM has the same idea of the self-loops. But, LSTM shines from the other RNN in that each unit (neuron) contains some layers and gates that are specified by the user. Each of these layers and gates controls the output and the state of the neuron.

LSTM Effectiveness 

Regularly, when human read a paragraph or a topic, they can easily extract the dependencies between the sentences that formulate the text. In stories and novels, you can match between the events that happen in the story, and extract the much important events and connect them together to be able to understand the story end.

Human brains can identify and memorize the importance and dependencies between words of the sentence, and determines the POS tags of them. If you see the a subject in the beginning of the sentence, then your brain most likely predict that the next word has a great chance to be a verb(VP) or a noun phrase(NP) the describes the subject, because you memorize that the previous word is a subject, and you don’t need to look what context is before the subject, as you determined that subject is your beginning of the a new context to predict the the next word POS.

This is how LSTMs works, they simulate the same process of this ability in our brains to be able to connect the important or related objects together, and forget the unnecessary objects in the context.

LSTM Unit Structure

 2

This is a standard structure of LSTM unit. It contains:-

  • 2 input (previous state C_t-1 and previous output h_t-1)
  • 4 layers (3 sigmoid and 1 tanh activations)
  • 5 point operators (3 multiplications, 1 addition and 1 tanh operators)
  • 2 output (current state C_t and current output h_t)

 The most important thing in LSTM is the state. The state represents the information stored in the since the training begins. We control the memorizing process by updating this state. If we want to forget the previous state, then we make it 0, if we want to control the amount of memorized information, then we update the values of it during the training process. Next, we will discuss how is the output and update are done.

The user can change this structure according the problem they want to solve. This is just an example of a standard LSTM unit.

Detailed Explanation of LSTM Processing
We can divide the internal processing within the unit into 3 groups. Each of these groups performs layers and operators processes to produce the current state and the output.

 21

 

Group 1.1: Input with the previous output for forgetting

 6

This sigmoid activation layer is called “forget gate layer”, because it decides whether it forgets the previous state, in this case, the activation output would be 0 for each element of the state vector, or we use the previous state, in this case the elements values would be higher than 0 and less or equal than 1.

Firstly, concatenate the input with the previous output, finally, apply the activation to the weighted sum of the input.
7

 

Group 1.2: Forget gate layer with the previous state

8

 

To know what we need to forget from the previous state, we multiply the output of the “forget gate layer” with the previous state (element by element multiplication). If it produces a vector that is full of zeros, it means we want to forget all of the previous memories and initiate a new memory from the current input. This goes as follows.

9

 

Group 2.1: Input with previous output for the new information

 10


Firstly, we need to add the new input that would be used to update the state. We called the sigmoid activation layer by “input gate layer”. And it decides which values of the state vector would be updated.

Secondly, we need to generate a new state, called candidate state, that would be added to the previous state (which could be full of 0s or values depends on group 1.1 output).

Finally, we add them together to generate the current state of the unit which holds the concluded information.
12

 

Group 2.2: Scaling new state with the input gate scalar.

 13

We multiply the new generated state with the input gate layer. This is like a scaling process, because we edit the values of the new generated state by the needed update factors to get the new information, which would be added to the previous state to get the whole state.
15

Adding groups 1 and 2 to get the current new state

 16

To generate the new current state, we add the new generate state to the previous state. This state would be fed to the next unit (current unit with the next input).

 17

Group 3.1: Getting the unit output

 19

 

When we produce the output, we need to use the state and input to help the next unit on what it will use to produce its state.

Firstly, we use the weighted sum of the concatenated previous output and input, then apply the sigmoid function. This decides which parts of the state we need to output.

Secondly, we use tanh operator to make sure that the values of the current state is within -1 and 1.

Finally, we get the unit output that has some parts of the state and input.

 20

Conclusion

Nowadays, LSTM is used in very wide fields in computer science, in Machine Learning specifically. It practically proved itself in some hot research topics such as Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization and Question Answering.

References

http://colah.github.io/posts/2015-08-Understanding-LSTMs
https://en.wikipedia.org/wiki/Long_short-term_memory
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&rep=rep1&type=pdf
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

[Java] Hopfield Network Implementation

Why Java?

Well, actually I have a deep love with this language, I enjoy my time writing a Java code. Also, it is good for me to write some OOP projects to not forget its concepts. Getting used to Python and other scripted languages may effect badly to my designing and OOP skills.

So, from time to time, I like to code some Java code. And here we go!

This is a light simple Java implementation of Hopfield Recurrent Neural Network.

Supported Features (cont. updated)

  • Data Normalization
  • Network Visualization using GraphStream

An example of a hopfield network visualization with pattern of size 5 (recognizing pattern of 5 elements)

– Here you can see the repository of implementation.

– I am working on a brief tutorial that explains the hopfield network on the README file

 

Prim’s Running Time With Heaps

Input: Given a graph with N nodes and M weighted edges, we want to obtain the Minimum Spanning Tree (MST) of the graph.

Note: MST is a tree that is formed in a graph which makes all the graph nodes
connected with the minimum total edges weights, also cycles isn’t allowed in the tree.

Output: Using Prim’s algorithm , we obtain a tree (collection of nodes and edges) that
satisfy the above conditions (The green edges in the figure forms the MST).

kruskalll

Algorithm Steps (Taken from Wikipedia):

  1. Initialize a tree with a single vertex, chosen arbitrarily from the graph.
  2. Grow the tree by one edge of the edges that connect the tree to vertices not yet in the tree, find the minimum-weight edge, and transfer it to the tree.
  3. Repeat step 2 (until all vertices are in the tree).

To be able to put all the graph nodes to the tree, we need to iterate all over the graph nodes. So, the operation costs O(N).

In each iteration, we obtain the minimum edge weight to grow the tree. Regularly, getting the minimum value costs O(X) where X is the size of the data. Here, we have M edges. To find the minimum edge per iteration, we need to form O(M) iteration at most where M is the number of graph edges.

Combining the both, we obtain the final running time which is O(NM).

Time Improvement Proof Using Heaps:

Heap is a data structure that is popular among programming languages. Nearly all programming languages implement it in their Collections or Abstract Data Types.

H

The heap is a binary tree (each node has only 2 children), if it is a max heap then the tree sorts the values so that the direct root and parent nodes are always higher than the children, if it is a min heap then the tree sorts the values so that the direct root and parent nodes are always lower than the children.

Min-heap

The tree supports many operations with low running time compared to the naive implementations.

BH

The Improved Algorithm Steps will be:

  1. Create a Min Heap with size of N (costs O(N))
  2. Initialize the heap with a node with cost 0 as it is the root node
  3. While the heap isn’t empty, do the following
    – Extract the minimum node value
    u from the heap (costs O(1))
    – Delete the extracted node as it is visited
    (costs O(log(N)))
    – Get the adjacent nodes
    v from the extracted node (costs O(M) at most where M >= (N – 1) in the worst case)
    – For every adjacent node v, check if it is in the heap and not included in the MST
    – If it is in the heap but it can be relaxed (update the u-v weight with lower weight) and insert the new value of v (costs O(log(N)))
    Add u to the MST

 So, the final running time will be O(M log(N)), where M is the number of edges and N is the number of nodes.

References

https://en.wikipedia.org/wiki/Binary_heap

https://en.wikipedia.org/wiki/Heap_(data_structure)

https://en.wikipedia.org/wiki/Prim%27s_algorithm

 

 

 

[Python] U-V Decomposition using Swarm Optimization

Recently, I have implemented U-V decomposition technique for Recommendation Systems using Particle Swarm Optimization.

U-V decomposition is an optimization problem for a matrix. Here, our matrix elements represent some users review about movies.

M

We have 5 users and 5 movies. Each row represents a user’s review for each movie. There are some missing values from the matrix which means that the user hasn’t put his/her review on that movie.

You can check the code here https://github.com/AhmedHani/CS624-ML/tree/master/Implementations/U-V%20Decomposition . Later, I will write another post to explain my code.

Input: Matrix 5×5 with unknown reviews

Intermediate output: Matrix U and V which multiplication gives a new matrix without unknown reviews

Output: Matrix 5×5 without unknown values.

 

References 

– The matrix example is taken from the book http://infolab.stanford.edu/~ullman/mmds/book.pdf at chapter 9

 

The Unsolvability Halting Problem

Definition

Halting problem is a famous problem in computer science. The problem is simply we want to prove and know whether a computer program will be halted or terminated or it will continue running forever.

Alan Turing proved that there is known solution for the halting problem. It is impossible to determine whether a program with an input will be terminated or it will run forever.

The proof used contradiction method to prove its unsolvability.

The Proof

  • Assume that we have a program P(X) that takes X as an input
  • The output of the program will be 0 or 1 where 1 indicates the program will halt and 0 indicates the program will continue forever
  • Assume we have another program Q that takes program P as its input, Q(P(X)) that output 0 or 1 where 1 indicates P(X) will run forever and 0 indicates it will halt
  • Assume we pass the program Q to itself (Recursion) as its parameter Q(Q)
  • Now, we have 2 cases
  • Case 1: If Q(Q) halts, then P(X) returns 1, which means Q(P(X)) will return 1 which indicates P(X) runs forever. Contradiction
  • Case 2: If Q(Q) runs forever, then P(X) returns 0, which means Q(P(X)) will return 0 which indicates P(X) halts. Contradiction

References
https://en.wikipedia.org/wiki/Halting_problem
 http://www.comp.nus.edu.sg/~cs5234/FAQ/halt.html