[Thesis Tutorials II] Understanding Word2vec for Word Embedding II

Previously, we talked about the Word2vec model and its Skip-gram and Continuous Bag of Words (CBOW) neural networks.

Regularly, when we train any of the Word2vec models, we need a huge amount of data; the number of words in the corpus could be in the millions. As you know, we want Word2vec to build vector representations of the words so that we can use them in NLP tasks and feed these vectors to any discriminative model. For example, in a Machine Translation task, we need a very large number of words to be able to cover the language's vocabulary and context.

Recalling Skip-gram model
Each output layer in the above neural network has V dimensions, where V is the size of the corpus's unique vocabulary. If we have 1M unique words, then each output layer would have 1M neuron units. As we said in the previous post, to get the predicted words we apply the Softmax activation function to the output layer, which turns the vector into probabilistic values; the index of the maximum probability corresponds to a 1-hot-encoded word vector.

To apply the Softmax activation function, we use the following formula:

P(y = i) = exp(z_i) / Σ_j exp(z_j)

The denominator is the summation of the exponentials of all the output values. The complexity of this operation is O(V) (regardless of the exp() complexity), where V is the number of words in the corpus, so this operation performs badly when the vocabulary size is very large. Also, in Skip-gram, we need to do this in multiple Softmax output layers. This is costly and wastes a lot of time when training the Word2vec models.
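To make the O(V) cost concrete, here is a minimal Python sketch of the Softmax computation over a toy output layer; the score values are made up and the vocabulary size of 11 just mirrors our small dataset.

import numpy as np

def softmax(scores):
    # The denominator sums over every vocabulary entry,
    # so the cost grows linearly with the vocabulary size V.
    exp_scores = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp_scores / np.sum(exp_scores)

scores = np.random.randn(11)             # raw output-layer values for ||V|| = 11
probs = softmax(scores)
predicted_index = int(np.argmax(probs))  # index of the "1" in the predicted 1-hot vector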

There was a need for a new technique to reduce the Softmax complexity and make the training more efficient. There exist 2 different optimization techniques, Hierarchical Softmax and Negative Sampling. Both are mentioned in the original paper of the Word2vec models.


Recalling our dataset from the previous post

D = {
“This battle will be my masterpiece”,
“The unseen blade is the deadliest”
};

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11


Hierarchical Softmax

In general, the target of the Softmax layer is to maximize the probability of a word given its context, P(word | context). Assume that, from our dataset and using the CBOW model, we want to predict the word "unseen" given the context "The", "blade" and "is". So, our probability formula would be P(unseen | the, blade, is).

Regularly, we use a flat Softmax layer to get that probability after applying the Softmax formula. But in Hierarchical Softmax we don't use a flat layer; instead, we use a Huffman tree layer, like the following.

So, to get the probability of the word “unseen”, we need to calculate (from the root) the following probability multiplication: P(right) * P(left) * P(right) * P(right).

or P(left) * P(right) * P(right) * P(right)

The question now is, how can we get these probabilities?

The idea of getting these probabilities is the same idea as in logistic regression. Recall a simple logistic regression, which uses a Sigmoid function in its output neuron.


In logistic regression, we get the probability of Y given the feature set X, P(Y | X). We get this probability by multiplying the feature vector with the weight matrix and then applying the Sigmoid function: Sigmoid(X * W + b).

Well, we will do the same with our tree. Each tree node would be the output of a logistic function, and we connect each node of the tree to an associated weight vector, so that we multiply the hidden layer values of the neural network by these weights to get the probability of each node of the tree.

The input to the tree nodes is the hidden layer values of the neural network (the layer before the Softmax). You can imagine it as a new neural network whose input is the hidden layer values and whose outputs are the nodes of the tree.

For each tree level, we get the probabilities of its nodes, which leads us to the predicted word by choosing the node with the maximum probability value at each level.

You can see that the complexity is reduced from O(V) to O(log(V)), which is a great performance improvement. Also, we don't need to calculate the values of both children of a node: we only need the probability of one of them, and we get the other by subtracting it from 1, as P(right) + P(left) = 1.
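As a minimal sketch of this idea, the snippet below computes a word's probability as a product of sigmoid node probabilities along its path in a binary tree. The tree depth, the weight values and the left/right path encoding are all made up for illustration; a real implementation would build the Huffman tree from word frequencies.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(hidden, node_weights, path):
    # hidden       : hidden-layer vector of the network
    # node_weights : one weight vector per internal node on the word's path
    # path         : +1 for "go right", -1 for "go left" at each node
    # Only O(log V) sigmoids are evaluated instead of a V-way Softmax,
    # and P(left) + P(right) = 1 at every node by construction.
    prob = 1.0
    for w, direction in zip(node_weights, path):
        p_right = sigmoid(np.dot(hidden, w))
        prob *= p_right if direction == +1 else (1.0 - p_right)
    return prob

hidden = np.array([0.8, 0.4, 0.5])        # toy hidden layer of size 3
node_weights = np.random.randn(4, 3)      # 4 internal nodes on the path
path = [+1, -1, +1, +1]                   # right -> left -> right -> right
print(word_probability(hidden, node_weights, path))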


Negative Sampling

Negative sampling is another, much more optimized approach for training the neural network. Actually, Negative Sampling is the most popular approach when training Word2vec models, and it is the one implemented in TensorFlow's Word2vec tutorial.

The idea of negative sampling is quite easy to understand and to implement. Recalling our Softmax layer, you will see that the number of neurons is the same as the number of the corpus's words. After getting the error, we update all the weights of the neural network. If you think about it, we actually don't have to do this. We know that all the corpus words appear in the training items, so the words' weights will certainly be updated during the training phase.

Using this fact, we can instead update only some of the words' weights for each training example. These words would contain the positive word (the actual word that shall be predicted) and other words that are considered to be negative samples (incorrect words that shall not be predicted). If you assume that the number of negative samples is 10, then the Softmax layer has only 11 neuron units (10 negative + 1 positive) and the error will be used to update only 11 * hidden_layer_size weights instead of V * hidden_layer_size.

The question now is, how to select these negative samples?

Traditionally, choosing these negative samples depends on the word frequencies in the given corpus: the higher the frequency of a word, the higher the chance of choosing it as a negative sample.

The following formula is used to determine the probability of selecting a word as a negative sample:

P(w_i) = f(w_i)^c / Σ_j f(w_j)^c

where f(w) is the frequency of the word in the corpus and c is a constant that is selected by the model creator (3/4 is the value commonly used in Word2vec implementations). After computing these probabilities, we sample N words from this distribution, where N is the number of negative samples.
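Here is a small sketch of that selection step, assuming the common exponent c = 0.75 and some made-up word counts; a real implementation would also make sure the positive word is excluded from the drawn negatives.

import numpy as np

def negative_sampling_distribution(word_counts, c=0.75):
    # Raise raw frequencies to the power c to dampen very frequent words,
    # then normalize into a probability distribution.
    freqs = np.array(list(word_counts.values()), dtype=np.float64) ** c
    return list(word_counts.keys()), freqs / freqs.sum()

counts = {"the": 50, "is": 30, "blade": 3, "unseen": 2, "deadliest": 1}  # toy counts
words, probs = negative_sampling_distribution(counts)

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=3, replace=False, p=probs)  # draw 3 negative samples
print(negatives)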


References

– https://arxiv.org/pdf/1411.2738.pdf
– http://cs224d.stanford.edu/syllabus.html?utm_content=buffer891d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
– https://www.tensorflow.org/tutorials/word2vec
– http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Drawings are created using www.draw.io .

[Thesis Tutorials I] Understanding Word2vec for Word Embedding I

Vector Space Models (VSMs): A conceptual term in Natural Language Processing. It represents words as a set of vectors; these vectors are considered to be the new identifiers of the words. They are used in mathematical and statistical models for classification and regression tasks. Also, they shall be unique so that we can distinguish between the words when they are passed to the other models.

Word Embedding: A technique to represent words by fixed-size vectors, so that words which have similar or close meanings have close vectors (i.e., vectors whose Euclidean distance is small). In other words, the representation captures the semantic meaning of the words, which is very important in many different areas and tasks of Natural Language Processing, such as Language Modeling and Machine Translation.

Shallow Neural Networks: Neural Networks which contain only 1 hidden layer. This layer is considered to be the projection layer of the input layer, so that the hidden layer is the new feature representation of the raw input data.


Basically, when we work with machine learning statistical and mathematical models, we always use numerical data to be able to produce results. Models such as classifiers, regressors and clustering algorithms deal with numerical data features. But how can this be done with text data? How can we represent words by numerical features so that they can be fed to our models?


One-Hot-Encoding

Traditionally, such a problem could be solved using a popular approach called One-Hot-Encoding, in which each word in our corpus gets a unique vector, and this vector becomes the new representation of the word.

For example, assume that we have the following 2 sentences:
“I like playing football”
“I like playing basketball”

First, we need to extract the unique words of the data. In our case, the unique words are the set V = {I, like, playing, football, basketball} and ||V|| = 5. Note that you need to normalize the words and make them all lower-case or upper-case (e.g. "football" is the same as "Football"); we don't care about the letter case in such a representation.

Now, for each word we create a vector of all zeros whose size is the number of unique words ||V||, and each word gets a 1 value at its own unique index of that zero vector. So, our word representations will be

I = [1, 0, 0, 0, 0], like = [0, 1, 0, 0, 0], playing = [0, 0, 1, 0, 0], football = [0, 0, 0, 1, 0], basketball = [0, 0, 0, 0, 1].
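As a minimal sketch, the following Python function builds exactly these kinds of vectors from the two sentences above (the function name and the normalization to lower-case are just illustrative choices).

def one_hot_encode(sentences):
    # Collect the unique (lower-cased) words in order of first appearance.
    vocab = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab.append(word)
    # Give each word a zero vector with a single 1 at its own index.
    vectors = {}
    for index, word in enumerate(vocab):
        vector = [0] * len(vocab)
        vector[index] = 1
        vectors[word] = vector
    return vectors

encodings = one_hot_encode(["I like playing football", "I like playing basketball"])
print(encodings["playing"])   # [0, 0, 1, 0, 0]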

As you see, it is very simple and straightforward, and it can be implemented easily in code. But you have surely noticed its cons. This approach depends heavily on the vocabulary size: if we have a corpus with 1M unique words, then the word dimension would be 1M too, which is very bad for performance. Also, the representation itself doesn't capture the semantic relations between the words and their meanings; it only focuses on transforming the words into a Vector Space Model where we have unique vectors, regardless of their distances.

But you can use this method if the problem that you want to solve isn't closely related to semantic relations or meaning, and if you know that your number of unique words isn't large.


Word2vec Basics

Word2vec is a technique or a paradigm which consists of a group of models (Skip-gram and Continuous Bag of Words (CBOW)). The target of each model is to produce fixed-size vectors for the corpus words, so that words which have similar or close meanings have close vectors (i.e., vectors whose Euclidean distance is small).

The idea of word2vec is to represent a word using its surrounding words. For example, assume that we have the following sentence:

“I like playing football”

To be able to represent a vector for the word "playing", we need to be able to produce the surrounding words using the "playing" word; in this case, the surrounding words are "like" and "football". Note that the number of surrounding words can be increased according to the window size you set; in other words, the surrounding words could be the N words before and after the input word, where N is the window size.
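For illustration, here is a small sketch of how (target, surrounding words) pairs could be generated for a given window size N; the function and its details are assumptions for this post, not the exact preprocessing code of any Word2vec implementation.

def context_pairs(sentence, window_size):
    # Yield (target, surrounding words) pairs for every position in the sentence.
    words = sentence.lower().split()
    for i, target in enumerate(words):
        left = words[max(0, i - window_size):i]
        right = words[i + 1:i + 1 + window_size]
        yield target, left + right

for target, context in context_pairs("I like playing football", 1):
    print(target, "->", context)
# e.g. playing -> ['like', 'football'], as in the example above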


Word2vec Philosophy

Well, this idea is considered to be philosophical. If you think about it, you will realize that we, humans, learn the meaning of words from their nearby words. For example, assume that we have an unknown word that you haven't seen before; let's call it word "X", and we have the following sentence:

“I like playing X”

So, you may not know the meaning of X, but you know for sure that "X" is something we can play with, and also, it is something that can be enjoyable to some people. Your brain reaches this conclusion after reviewing X's surrounding words, "like" and "playing". This is the idea behind the word2vec technique.

In conclusion, word2vec tries to use the context of the text to produce the word vectors, and these word vectors consist of floating point values that are obtained during the training phase of the models, Skip-gram and CBOW. These models are Shallow Neural Networks; they do the same job, but they differ in their inputs and outputs.


Dataset

Before getting into the word2vec models, we need to define some sentences as a corpus to help us in understanding how the models work. So, our corpus is:

D = {
“This battle will be my masterpiece”, // LoL players will relate 😛
“The unseen blade is the deadliest”
};

As we did in One-Hot-Encoding, we need to extract the unique words and initialize their vectors with sparse 1-hot vectors. We need to do that to be able to use these words in the Neural Networks.

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11,

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] = this
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] = battle
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] = will
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] = be
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] = my
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] = masterpiece
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] = the
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0] = unseen
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] = blade
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] = is
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] = deadliest


Skip-gram Model

Skip-gram is a Shallow Neural Network. The target of this model is to predict the nearby words given a target word. So, the input of this network is a 1-hot-encoded vector of the target word and the output is N words, where N is the window size of the context, which is determined by the network creator. In the above figure, the number of words that shall be predicted is 3, so we have 3 Softmax layers and the number of neurons in each layer is the vocabulary size.

Let's see how the model works with the sentence "The unseen blade is the deadliest". Assume that the target word whose representation we want to learn is "unseen" and the nearby words are "The", "blade" and "is". Note that our target vector size will be 3; in other words, we want our embedding vector to be a fixed size of 3. Also, the weights are initialized with random values like in any other neural network; regularly, the weights are initialized from a Gaussian distribution with known mean and variance.

Input-to-hidden

After feedforwarding the input word, we get the hidden layer values [0.8, 0.4, 0.5]. This is considered to be the first obtained representation of the word "unseen". Well, what about the other words of the vocabulary? The values of the weight matrix are the representations of the words (e.g. each row of the weight matrix represents the vector representation of one word). For example, given the weight matrix, the word "this" representation would be [0.1, 0.9, 0.3], and so on for the other words.
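A tiny sketch of this input-to-hidden step, assuming random weights: multiplying a 1-hot vector by the input weight matrix simply selects one row, which is the word's current embedding.

import numpy as np

V, dim = 11, 3                         # vocabulary size and embedding size
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, dim))       # input-to-hidden weights, one row per word

unseen_idx = 7                         # index of "unseen" in our vocabulary
one_hot = np.zeros(V)
one_hot[unseen_idx] = 1.0

hidden = one_hot @ W_in                        # the full multiplication ...
assert np.allclose(hidden, W_in[unseen_idx])   # ... is just a row lookup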

Hidden-to-output

Now, we can apply Softmax activation function to each of these vectors, so that we will get the output words given the target word “unseen”.

We take the maximum value of each vector and get its corresponding 1-hot-encoded word; the index of the maximum value would be the index of the value "1" in the 1-hot-encoded vector. According to these vectors, the predicted words are "is", "deadliest" and "unseen".

During the training phase, we have the actual output for the target word, so we get the error of each output layer using a loss function. The loss function could be Mean Squared Error or Cross Entropy. Now, we have 3 errors, one for each Softmax layer, and we get the average error by dividing the summation of the errors by the number of errors (3 in this case).
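The following sketch shows that hidden-to-output step for the three context positions, using cross entropy as the loss function; the weight values, the hidden vector and the context indices are made up for illustration.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

V, dim = 11, 3
rng = np.random.default_rng(1)
W_out = rng.normal(size=(dim, V))       # hidden-to-output weights
hidden = np.array([0.8, 0.4, 0.5])      # hidden values for the target "unseen"

context_indices = [6, 8, 9]             # e.g. "the", "blade", "is"
losses = []
for true_idx in context_indices:
    probs = softmax(hidden @ W_out)           # one Softmax layer per context word
    losses.append(-np.log(probs[true_idx]))   # cross-entropy error of this layer

average_loss = np.mean(losses)          # average of the 3 errors, as described above
print(average_loss)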

After obtaining the error, we backpropagate it to the weights so that we update the weight values according to the obtained error. As you see, the values of the weights change with each training batch, which also changes the words' representations.

We continue the iterations until we reach the maximum number of epochs or the target error; then we stop the training and take the Input-to-hidden weight matrix, in which each row represents a word in the space.

Alternatively, we can use the Hidden-to-output weights to represent the words. In this case, we take the average of the Hidden-to-output weight matrices, and each column of the averaged matrix represents a word. But regularly, we take the Input-to-hidden weight matrix.




Continuous Bag of Words (CBOW) Model

CBOW is also a Shallow Neural Network. The target of this model is to predict a target word given the nearby words. So, the input of this network is N words, where N is the window size of the context, which is determined by the network creator, and the output is the target word, which is obtained using a Softmax layer. In the above figure, the number of words that will be used as input is 3, and the network weights are initialized randomly in the beginning.

The idea of CBOW is that the hidden layer values represent the mean of the input words; in other words, the bottleneck features of the network contain the mean vector of the input words' vectors.

Recalling our previous sentence, “The unseen blade is the deadliest”, let’s see how the model works with it. The target word to be predicted is “unseen” and the context words are  “the”, “blade” and “is”.

Input-to-hidden

To get the values of the hidden layer, we just need to average the results of multiplying the input words by their weight matrix.

So, the values of the hidden layer would be
H = [(0.8 + 0.2 + 0.2) / 3, (0.9 + 0.8 + 0.3) / 3, (0.1 + 0.9 + 0.7) / 3] ≈ [0.4, 0.67, 0.57]
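A minimal sketch of this averaging step, reusing the three context embedding rows from the worked example above (the row values come from the figure, so treat them as illustrative):

import numpy as np

# Embedding rows of the three context words ("the", "blade", "is").
context_vectors = np.array([
    [0.8, 0.9, 0.1],
    [0.2, 0.8, 0.9],
    [0.2, 0.3, 0.7],
])

hidden = context_vectors.mean(axis=0)   # CBOW hidden layer = average of the context embeddings
print(hidden)                           # approximately [0.4, 0.67, 0.57]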

Hidden-to-output

After getting the hidden layer values, we get the output layer values by multiplying the hidden layer vector with the hidden to output weight matrix.

Now, we have the output layer values, we need to apply Softmax again to get the target word.

The predicted word is “masterpiece”. We have the actual output like before, so, we get the error and backpropagate the error to the weights so that we update the weights values.


Conclusion

Word2vec is considered to be the main core of applying deep learning in Natural Language Processing. Traditionally, rule-based systems that depend on some syntactic and linguistic features were used to solve most NLP problems and applications.

Any rule-based system cannot generalize its solution to all real-world data. Also, things that can be applied in one language can't necessarily be applied to other languages. So, it was a must to build deep learning systems that depend on the semantic representation of language words and text.

Also, there are some hard and challenging problems in NLP that need deep learning to be solved, such as Machine Translation and Paraphrase Detection. Both depend on the semantics of the words (which is a common thing in any language), so it was a must to move to deep learning to enhance the learning in a semantic way, instead of the syntactic, grammar-based, rule-based algorithms. [https://www.quora.com/Why-do-we-use-neural-network-to-do-NLP-task/answer/Ahmed-Hani-Ibrahim]

To be able to move to deep learning, it was a must to work with words in a much more semantic way and to build systems that actually understand the language itself like we humans do. So, Word2vec is the bridge to do such a task: representing words in the vector space in a semantic manner, so that the machine understands the words semantically.


As you see, it seems that Word2vec needs a large amount of training data and high-performance machines to be able to train on this data. If you think about it, we applied Softmax on the output layer, and applying Softmax on a very large vocabulary is very costly and needs much time, so some optimization was needed to reduce the Softmax complexity. In the next post, we will talk about Negative Sampling and Hierarchical Softmax.


References

– https://arxiv.org/pdf/1411.2738.pdf
– http://cs224d.stanford.edu/syllabus.html?utm_content=buffer891d6&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
– https://www.tensorflow.org/tutorials/word2vec

Another LSTM Tutorial

The figures are taken from this great blog post by Christopher Olah


Recurrent Neural Networks

Recurrent Neural Networks (RNN) are a type of Neural Network (NN) that is commonly used in problems that depend on sequential data. In sequential data, we should assume that the data points are dependent on each other. For example, if we have a sentence that contains some words, then to be able to predict any word of it, we need to memorize the previous words, because the sentence words are naturally related to each other, in grammar and part-of-speech (POS).

Traditionally, in regular Multi-layer Perceptron (MLP) Neural Networks, we assume that the data points are independent of each other, which is wrong for data like text or sound.

RNNs have the capability to "memorize" the previous data, as they contain self-loops. An RNN saves its state and uses it with the new data. This helps the network take care of the dependencies between the data points and take them into consideration when predicting the output.

The following is a 1-unit of a RNN, A . It has an input X and output h.

  

As you may have noticed, it is very similar to the traditional neuron in an MLP, but it differs in the self-loop at the unit A. The self-loop holds the previous state of the neuron and feeds it back in with the new input.

We can unroll (unfold) the figured model. You can see that it isn't much different from the traditional MLP model.


Through time, the self-loop values store the experience gained from the previous data and use it with the new input to obtain the predicted output. This helps to memorize the dependencies between the data points.
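To make the self-loop concrete, here is a minimal sketch of a single vanilla RNN step: the previous hidden state is combined with the new input to produce the next state. The sizes, weights and inputs below are arbitrary.

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # New state from the current input and the previous state (the self-loop).
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(input_dim, hidden_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                      # initial state
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)         # h is fed back in at every step
print(h)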

Long-term Dependencies Problems with Regular RNNs
As you may have noticed, there are no restrictions when saving the memory of the previous data: the network keeps saving the previous data without identifying whether this memory will be helpful in the next iterations or not. Practically, however, this isn't always right for real sequential data.

Assume that we have a sentence like this: "I live in France, I like playing football with my friends and going to the school, I speak french"

Assume that we want to predict the word "french". We don't need to look at the previous two clauses, "I like playing football with my friends" and "going to the school"; we only need to know that "I live in France", and we should ignore the unnecessary context that may confuse the network while training it.

Practically, a regular RNN can't connect the related information and dependencies together, specifically if the information contains some noise that could distract us from the actual target.

This problem is the main reason that pushed the scientists to develop and invent a new variant of the RNN model, called Long short-term Memory (LSTM). LSTM can solve this problem, because it controls the “memorizing” process within its units using something like “gates”.


What is LSTM?

LSTM is a kind of RNN that is used, with revolutionary results, in many fields of computer science such as Natural Language Processing and Speech Recognition, because of its ability to avoid the problem of "long-term dependencies".

Like any other RNN, LSTM has the same idea of the self-loops. But LSTM stands out from other RNNs in that each unit (neuron) contains some layers and gates that are specified by the user. Each of these layers and gates controls the output and the state of the neuron.

LSTM Effectiveness 

Regularly, when humans read a paragraph or a topic, they can easily extract the dependencies between the sentences that formulate the text. In stories and novels, you can match the events that happen in the story, extract the most important ones and connect them together to be able to understand the story's end.

Human brains can identify and memorize the importance of, and dependencies between, the words of a sentence, and determine their POS tags. If you see a subject at the beginning of a sentence, then your brain will most likely predict that the next word has a great chance of being a verb phrase (VP) or a noun phrase (NP) that describes the subject, because you memorize that the previous word is a subject; you don't need to look at the context before the subject, as you determined that the subject is the beginning of a new context for predicting the next word's POS.

This is how LSTMs work: they simulate the same ability of our brains to connect the important or related objects together, and to forget the unnecessary objects in the context.

LSTM Unit Structure


This is a standard structure of an LSTM unit. It contains:-

  • 2 inputs (the previous state C_t-1 and the previous output h_t-1)
  • 4 layers (3 sigmoid and 1 tanh activations)
  • 5 pointwise operators (3 multiplications, 1 addition and 1 tanh)
  • 2 outputs (the current state C_t and the current output h_t)

The most important thing in LSTM is the state. The state represents the information stored in the unit since the training began. We control the memorizing process by updating this state: if we want to forget the previous state, we make it 0; if we want to control the amount of memorized information, we update its values during the training process. Next, we will discuss how the output and the update are done.

The user can change this structure according to the problem they want to solve. This is just an example of a standard LSTM unit.

Detailed Explanation of LSTM Processing
We can divide the internal processing within the unit into 3 groups. Each of these groups performs the layer and operator processing needed to produce the current state and the output.


Group 1.1: Input with the previous output for forgetting


This sigmoid activation layer is called the "forget gate layer", because it decides whether we forget the previous state (in which case the activation output would be 0 for each element of the state vector) or we keep the previous state (in which case the element values would be higher than 0 and less than or equal to 1).

Firstly, we concatenate the input with the previous output; then we apply the sigmoid activation to the weighted sum of this concatenation:

f_t = σ(W_f · [h_t-1, x_t] + b_f)

 

Group 1.2: Forget gate layer with the previous state


 

To know what we need to forget from the previous state, we multiply the output of the "forget gate layer" with the previous state (element-by-element multiplication). If it produces a vector that is full of zeros, it means we want to forget all of the previous memories and initiate a new memory from the current input. This goes as follows.

f_t * C_t-1

 

Group 2.1: Input with previous output for the new information



Firstly, we need to add the new input that will be used to update the state. We call this sigmoid activation layer the "input gate layer"; it decides which values of the state vector will be updated.

Secondly, we need to generate a new state, called the candidate state, that will be added to the previous state (which could be full of 0s or other values, depending on the group 1.1 output).

Finally, these pieces will be added together to generate the current state of the unit, which holds the concluded information. The input gate and the candidate state are computed as:

i_t = σ(W_i · [h_t-1, x_t] + b_i)
C̃_t = tanh(W_C · [h_t-1, x_t] + b_C)

 

Group 2.2: Scaling new state with the input gate scalar.


We multiply the newly generated candidate state with the input gate layer. This is like a scaling process, because we scale the values of the candidate state by the needed update factors to get the new information, which will then be added to the previous (partially forgotten) state to get the whole state.
i_t * C̃_t

Adding groups 1 and 2 to get the current new state


To generate the new current state, we add the scaled candidate state to the (partially forgotten) previous state. This state will be fed to the next step (the current unit with the next input).

C_t = f_t * C_t-1 + i_t * C̃_t

Group 3.1: Getting the unit output


 

When we produce the output, we use the state and the input, which helps the next unit decide what it will use to produce its own state.

Firstly, we take the weighted sum of the concatenated previous output and input, then apply the sigmoid function; this decides which parts of the state we need to output.

Secondly, we apply the tanh operator to make sure that the values of the current state are within -1 and 1.

Finally, we get the unit output, which carries the selected parts of the state given the input.

o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t * tanh(C_t)
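Putting the groups together, here is a sketch of one full LSTM step in NumPy. The weight shapes, random values and toy sequence are assumptions made only to keep the example self-contained.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b map the concatenated [h_prev, x_t] to the four layers:
    # forget gate (f), input gate (i), candidate state (c), output gate (o).
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(z @ W["f"] + b["f"])           # group 1.1: forget gate
    i = sigmoid(z @ W["i"] + b["i"])           # group 2.1: input gate
    c_tilde = np.tanh(z @ W["c"] + b["c"])     # group 2.1: candidate state
    c = f * c_prev + i * c_tilde               # groups 1.2 + 2.2: new state
    o = sigmoid(z @ W["o"] + b["o"])           # group 3.1: output gate
    h = o * np.tanh(c)                         # group 3.1: unit output
    return h, c

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(hidden_dim + input_dim, hidden_dim)) for k in "fico"}
b = {k: np.zeros(hidden_dim) for k in "fico"}

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):    # a toy sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, b)
print(h, c)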

Conclusion

Nowadays, LSTM is used in very wide-ranging fields of computer science, and in Machine Learning specifically. It has practically proven itself in some hot research topics such as Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization and Question Answering.

References

– http://colah.github.io/posts/2015-08-Understanding-LSTMs
– https://en.wikipedia.org/wiki/Long_short-term_memory
– http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&rep=rep1&type=pdf
– http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Hidden Markov Models (HMMs) Part I

Agenda:

  • Markov Chains
    – Structure
    – 1st Order Markov Chains
    – Higher Order Markov Chains
  • Hidden Markov Model
    – Structure
    – Why using HMMs ?
  • Forward-backward algorithm
  • Viterbi algorithm (Part II)
  • Baum-Welch algorithm (Part II)

This article assumes that you have a pretty good knowledge of some Machine Learning techniques such as Naive Bayes, and a background in Recursion and Dynamic Programming.


Markov Chains

A Markov Chain is a probabilistic model that consists of finite states. The states are connected with each other through edges, and there are values associated with each of these edges.


Structure

The following figure shows a Markov diagram that consists of 2 states, each one connected to the other through an edge (Note: the states could also contain self-loops).

Sunny and Rainy are 2 states. The values are the transition probabilities between the states

Given the above diagram, we can say that the probability of going from the Sunny state to the Rainy state is 0.1, while the probability of staying Sunny is 0.9 (Note: the summation of the probabilities from one state to the others must equal 1).

We can represent the previous chain using a Stochastic Transition Matrix, in which each row describes the transitions from one state to the other states; so, the sum of each row must be equal to 1. This is called a Right Stochastic Transition Matrix [Side Note #1].

T Matrix

The purpose of using Markov Chains is to predict the next state at time t + 1 given the current state at time t. Using the previous Markov model, assume that the current state is Sunny; then the current state vector will be [1, 0] (Sunny is 1, Rainy is 0), because we are 100% sure that the current state is Sunny.

From the current state, we want to predict the next state. This could be done by multiplying the current state vector by the transition matrix; the result of the multiplication will be a probability vector, and each element of that vector shows the probability of being in a given state at the next time step.

[1, 0] × T = [0.9, 0.1]

The resultant vector shows that the probability of staying Sunny is much higher than the probability of changing to the Rainy state.

What we have done so far is work with 1st order Markov Chains, in which we focus on predicting the next state depending on only 1 previous state. In Markov Chains, we can also predict a next state depending on m previous states, where m >= 2.

Assume that at time t the state S(t) is Sunny, and we want to predict the state after m steps, where m = 4. To predict the state after m steps, we need to find S(t + 4), which depends on S(t + 1), S(t + 2) and S(t + 3), which are unknown to us.

To derive the solution, we need to solve for the lower states first, then substitute:

S(t + 1) = S(t) · T
S(t + 2) = S(t + 1) · T = S(t) · T^2
S(t + 3) = S(t + 2) · T = S(t) · T^3
S(t + 4) = S(t + 3) · T = S(t) · T^4

Higher order Markov Chains have many real-life applications, such as predicting the weather for the upcoming days or weeks.
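As a small sketch of this prediction step, the snippet below multiplies the current state vector by the transition matrix, and looks m steps ahead using S(t + m) = S(t) · T^m. The Sunny row comes from the example above; the Rainy row is not given in the text, so its 0.5/0.5 values are made up.

import numpy as np

# Right stochastic transition matrix: rows = current state, columns = next state.
# State order: [Sunny, Rainy].
T = np.array([
    [0.9, 0.1],   # from Sunny (values from the example above)
    [0.5, 0.5],   # from Rainy (made-up values, not given in the text)
])

state = np.array([1.0, 0.0])     # we are 100% sure the current state is Sunny
print(state @ T)                 # next-state distribution: [0.9, 0.1]

m = 4
print(state @ np.linalg.matrix_power(T, m))   # S(t + m) = S(t) · T^m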


Hidden Markov Models

Practically, it may be hard to access the patterns or classes that we want to predict. In the previous example (weather), there could be some difficulty in obtaining the weather states directly (hidden states); instead, you can predict the weather state through some indicators (visible states).

Assume that you have a friend who is always badly affected by the weather, and these effects are visible to you; they may be Flu, Fever or Dry, while the weather states are Sunny, Cloudy or Rainy. Our target is to predict the weather state using the visible states.

Markov Chains can't be used in this situation because they deal directly with the predictable states, which aren't visible to us now, so it was necessary to develop another variant of Markov Chains that uses the visible states to predict the hidden states.


Structure


Figure 2

The above figure shows a Hidden Markov Model that consists of 3 visible states {Fever, Flu, Dry} and 3 hidden states {Sunny, Rainy, Cloudy}. We now need to extract information from this model.

The start state is the current state of the system. All the system knows now is that the next state will more likely be Sunny, because the transition probabilities between the start state and the hidden states {Rainy, Sunny, Cloudy} are {0.3, 0.4, 0.3} respectively, which makes Sunny the most likely initial prediction.

Hidden states are connected with each other through Transition Probabilities, which indicate the chance of changing state at the next time step. For example, if we are now in the Sunny state, there is a 0.3 probability that the next state will be Rainy, and if we are in the Rainy state, the next state will be Cloudy with probability 0.7. Note that the sum of the outgoing transition probabilities of a state should be equal to 1.

Hidden states are connected with the visible or observed states through Emission Probabilities, which indicate your friend's health state given the weather state. For example, if the state is Sunny, your friend will most likely feel Dry with probability 0.6 and have a Fever with probability 0.4. Note that the sum of the outgoing emission probabilities of a state shall also be equal to 1.


Why using HMMs?

Hidden Markov Models are a very powerful technique used in sequential and structured prediction, like weather prediction. HMMs also shine in speech recognition and pattern recognition applications such as handwriting recognition, machine translation and language detection, which are all based on sequences of signals or words.

Mainly, HMMs are used to solve 3 problems:-

  • Observations Probability Estimation
    – Given a model like above, observations, hidden states, transition probabilities and emission probabilities, estimate the probability of observation sequence given the model.
  • Optimal Hidden State Sequence
    – Given a model with sequence of observations, determine the optimal path or sequence of the hidden states
  • HMM Parameters Estimation
    – Choose the model parameters that maximizes the probability of a specific observation given a specific state

Each of these problems is solvable by a different algorithm. The Probability Estimation, Optimal State Sequence and Parameter Estimation problems are solved using the Forward-Backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm respectively. We will discuss the first one next and the others in another post.


For the math, we need to annotate HMMs with symbols that determine what we have in the model.

Our model will be λ = (Α, Β, π), where Α is the transition probability matrix between the hidden states, Β is the emission probability matrix between hidden states and observations, and π is the initial state distribution that the system starts with.

Observations Probability Estimation

Probability estimation is about estimating the probability of an observation sequence occurring given λ. This is considered an evaluation of the model, which indicates how well the model explains the observations in the current situation. In other words, considering the weather prediction problem, we want to determine the HMM that best describes the current month's weather given some observations such as wind, rain or others. This type of problem arises in many Speech Recognition and Natural Language Processing applications.

Brute Force

One of the solutions to this problem is Brute Force, in which we try all combinations of observation sequences and hidden state sequences.

  • We need to find P(O | λ), where O = {o_1, o_2, o_3, …, o_T} and λ is our HMM model, which passes through some state sequence Q = {q_1, q_2, q_3, …, q_T}, where T is the number of time instants.
  • To find P(O | Q, λ), we take the product, over all time instants, of the probabilities of the observations given the states: P(O | Q, λ) = ∏_{t=1..T} P(o_t | q_t, λ).
  • Each term P(o_t | q_t, λ) is obtained from the emission matrix Β, so P(O | Q, λ) = ∏_{t=1..T} b_{q_t}(o_t).
  • The probability of the state sequence determines how likely the states follow each other over the sequential time instants. It is the same idea as n-grams [Side Note #2].
  • So, this probability is estimated by P(Q | λ) = π_{q_1} ∏_{t=2..T} P(q_t | q_{t-1}).
  • Each term P(q_t | q_{t-1}) is taken from the transition matrix Α of the model, so P(Q | λ) = π_{q_1} ∏_{t=2..T} a_{q_{t-1} q_t}.
  • From the above, we can obtain the joint probability P(O, Q | λ) = P(Q | λ) * P(O | Q, λ) = π_{q_1} ∏_{t=2..T} a_{q_{t-1} q_t} ∏_{t=1..T} b_{q_t}(o_t).
  • Note that all we have done is consider only one state sequence, so we need to sum over all possible state sequences: P(O | λ) = Σ_Q π_{q_1} ∏_{t} b_{q_t}(o_t) a_{q_{t-1} q_t}.

Easy, right? But if we have a model with N states, there are N^T possible state sequences and about 2T calculations for each sequence, so the complexity is of order 2T·N^T, which isn't efficient at all.

Luckily, there is an optimized solution for that, which is the Forward-Backward algorithm.

Forward-backward Algorithm

Forward-backward algorithm is a dynamic programming approach to solve the above problem efficiently. To illustrate how this algorithm works, we will use the model at figure 2.

At first, we need to extract all the information we need from the model: the transition probabilities, the emission probabilities and the initial probability distribution.

  • Our states will be S = Sunny, R = Rainy and C = Cloudy, and the observations will be D = Dry, F = Fever and U = Flu
  • The initial probabilities are π(R) = 0.3, π(S) = 0.4, π(C) = 0.3
  • The transition probabilities are
    from R: P(R | R) = 0.2, P(S | R) = 0.1, P(C | R) = 0.7
    from S: P(R | S) = 0.3, P(S | S) = 0.4, P(C | S) = 0.3
    from C: P(R | C) = 0.1, P(S | C) = 0.4, P(C | C) = 0.5
  • The emission probabilities are
    for R: P(D | R) = 0.1, P(F | R) = 0.4, P(U | R) = 0.5
    for S: P(D | S) = 0.6, P(F | S) = 0.4, P(U | S) = 0.0
    for C: P(D | C) = 0.4, P(F | C) = 0.4, P(U | C) = 0.2

This algorithm is divided into 3 steps: calculating the forward probabilities, calculating the backward probabilities, and finally combining the two to obtain the full formula.

  1. Calculating Forward Probabilities

Let's assume that the observation sequence we want to check is O = { D, F, U, D }. We want to compute αt(i), which is the probability that the given model outputs the first t observations of D -> F -> U -> D and ends in state i at time t.

First, let’s compute the probability of being in each state and output the first observation D.

α1(R) = P(R) *  P(D | R) = 0.3 * 0.1 = 0.03
α1(S) = P(S) *  P(D | S) = 0.4 * 0.6 = 0.24
α1(C) = P(C) *  P(D | C) = 0.3 * 0.4 = 0.12

To compute the forward probabilities at time t = 2, we sum, over all possible previous states, the probabilities of reaching the current state, and then multiply by the emission probability of the current observation. This could be done using the following recursive relation.

αt(i) = (Σj αt-1(j) * P(Qt = i | Qt-1 = j)) * P(ot | Qt = i)

So,

We can compute the forward probabilities of the 3 states by the following

α2(R) = (α1(R) * P(R | R) + α1(S) * P(R | S) + α1(C) * P(R | C)) * P(F | R)
α2(R) = (0.03 * 0.2 + 0.24 * 0.3 + 0.12 * 0.1) * 0.4 = 0.09 * 0.4 = 0.036

α2(S) = (α1(R) * P(S | R) + α1(S) * P(S | S) + α1(C) * P(S | C)) * P(F | S)
α2(S) = (0.03 * 0.1 + 0.24 * 0.4 + 0.12 * 0.4) * 0.4 = 0.147 * 0.4 = 0.0588

α2(C) = (α1(R) * P(C | R) + α1(S) * P(C | S) + α1(C) * P(C | C)) * P(F | C)
α2(C) = (0.03 * 0.7 + 0.24 * 0.3 + 0.12 * 0.5) * 0.4 = 0.153 * 0.4 = 0.0612

We can now get the probabilities of t = 3 and so on …


Once we have finished the remaining time steps, we can go to the next step, which is calculating the backward probabilities.

2. Calculating Backward Probabilities

In the previous step, we calculated the probability that the system outputs the first t observations of O = { D, F, U, D } and ends in state i at time t. In this step, we calculate the probability of being in state i at time t and generating the remaining observations o_t+1, …, o_T, where T is the length of the observation sequence.

βt(i) = P(ot+1, ot+2, …, oT | Qt = i)

The base case is at time T: what is the probability of being in a state at time T and generating nothing? Why nothing? Because when we reach observation T, there are no other observations left in the sequence. So, for each model state, the probability of generating nothing is

βT(i) = 1

Going backward from time T - 1 down to 1, we use the following formula to compute the values of beta:

βt(i) = Σj P(Qt+1 = j | Qt = i) * P(ot+1 | j) * βt+1(j)

So,

β3(R) = (P(R | R) * P(D | R) * β4(R)) + (P(S | R) * P(D | S) * β4(S)) + (P(C | R) * P(D | C) * β4(C))
β3(R) = (0.2 * 0.1 * 1) + (0.1 * 0.6 * 1) + (0.7 * 0.4 * 1) = 0.36

β3(S) = (P(R | S) * P(D | R) * β4(R)) + (P(S | S) * P(D | S) * β4(S)) + (P(C | S) * P(D | C) * β4(C))
β3(S) = (0.3 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.3 * 0.4 * 1) = 0.39

β3(C) = (P(R | C) * P(D | R) * β4(R)) + (P(S | C) * P(D | S) * β4(S)) + (P(C | C) * P(D | C) * β4(C))
β3(C) = (0.1 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.5 * 0.4 * 1) = 0.45

Once we calculated the backward probabilities of t = 3, we can calculate at t = 2 using the same formula …


3. Add them together

Once we have calculated all the forward and backward probabilities, we can obtain the solution for the whole problem, because the forward step covers the observations from time 1 to t, and the backward step covers the observations from t + 1 to T.

So,

P(o1, o2, …, oT, Qt = i | λ) = αt(i) * βt(i)

Summing this over all states at any time t gives the quantity we were after: P(O | λ) = Σi αt(i) * βt(i).
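The following is a compact sketch of the whole computation in Python, using the initial, transition and emission probabilities that appear in the worked example above (the Flu column of the emission matrix is inferred so that each row sums to 1, since it is never used explicitly in the text).

import numpy as np

# States: 0 = Rainy, 1 = Sunny, 2 = Cloudy; observations: 0 = Dry, 1 = Fever, 2 = Flu.
pi = np.array([0.3, 0.4, 0.3])            # initial probabilities
A = np.array([[0.2, 0.1, 0.7],            # transition probabilities, rows = "from" state
              [0.3, 0.4, 0.3],
              [0.1, 0.4, 0.5]])
B = np.array([[0.1, 0.4, 0.5],            # emission probabilities, rows = hidden state
              [0.6, 0.4, 0.0],            # (Flu column inferred so each row sums to 1)
              [0.4, 0.4, 0.2]])

def forward_backward(obs):
    T_len, N = len(obs), len(pi)
    alpha = np.zeros((T_len, N))
    beta = np.zeros((T_len, N))

    alpha[0] = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * P(o_1 | i)
    for t in range(1, T_len):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    beta[-1] = 1.0                        # base case: beta_T(i) = 1
    for t in range(T_len - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()          # P(O | lambda) = sum_i alpha_t(i) * beta_t(i) at any t
    return alpha, beta, likelihood

alpha, beta, likelihood = forward_backward([0, 1, 2, 0])   # O = {D, F, U, D}
print(alpha[1])      # the alpha_2 values computed by hand above
print(beta[2])       # the beta_3 values computed by hand above
print(likelihood)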


[Side Note #1]:

Stochastic Transition Matrix is a probability matrix that has 3 different types

  • Right Stochastic Matrix in which the sum of each row is 1
  • Left Stochastic Matrix in which the sum of each column is 1
  • Doubly Stochastic Matrix in which the sum of each row and column are 1

[Side Note #2]: 

N-grams are an approach to language modelling that is commonly used in Natural Language Processing. N indicates the number of items that follow each other (a sequence) in the context. For instance, suppose we have this sentence: "How about traveling to Egypt and seeing the Nile river?". In text processing, it is necessary to know which words are related to each other, as these produce the meaning of the sentence. In other words, the pair "How about" should be treated as related words, so when we find this pair in a sentence, the probability P(about | How) is higher than that of another pair such as P(Nile | How). So, N-grams also use the Markov property to calculate the probabilities of words occurring as a sequence. In N-grams, a pair of contiguous words is called a Bigram (N = 2); we can use more than 2 items, like asking for the probability of seeing "Nile" given "How" and "about", P(Nile | How, about), which is called a Trigram, and so on.



Tools Used:

  • The HMM model is created using creatly.com
  • The tables and equations images are built using Microsoft Word 2013