Another LSTM Tutorial

The figures are taken from this great blog post by Christopher Olah


Recurrent Neural Networks

Recurrent Neural Networks (RNN) is a type of Neural Networks (NN) that is commonly used in problems that depend on sequential data. In sequential data, we should assume that the data is dependent to each other. For example, if we have a sentence that contains some words, to be able to predict any word of it, we need to memorize the previous words, because the sentence words are naturally homogeneous, in grammar or part-of-speech (POS), with each other.

Traditionally in the regular Multi-layer Perceptron (MLP) Neural Networks, we assume that the data is independent to each other, which is a wrong with some data like text or sound.

 RNNs have capabilities to “memorize” the previous data, as it contains self-loops. It saves the state and use it with the new data. This helps the network to take care of the dependencies between the data and take them into consideration when predicting the output.

The following is a 1-unit of a RNN, A . It has an input X and output h.

  

As you have figured, it is very similar to the traditional neuron in MLP, but it differs in the self-loop at the unit A. The self-loop holds the previous state of the neuron and fed it with the new input.

We can unroll (unfold) the figured model. You can see that it isn’t much different with the traditional MLP model.

 1

Through time, the self-loop values stores the previous experience of the previous data, and use it with the new input to obtain the predicted output. This helps to memorize the dependencies between the data together.

Long-term Dependencies Problems with Regular RNNs
As you have figured, there are no restrictions when saving the memory of the previous data, the network keeps saving the previous data without identifying whether this memory would be helpful in the next iterations or not. However, practically, this isn’t always right in the real sequential data.

Assume that we have a sentence like that. “I live in France, I like playing football with my friends and going to the school, I speak french”

Assume that we want to predict the word “french”, we don’t need to look to the previous two terms, ” I like playing football with my friends” and “going to the school”, we need only to know that “I live in France”, and ignore the unnecessary context that may confuse the network while training it.

Practically, Regular RNN can’t connect the related information and dependencies together, specifically if the information has some noise within it that could avoid us from the actual target.

This problem is the main reason that pushed the scientists to develop and invent a new variant of the RNN model, called Long short-term Memory (LSTM). LSTM can solve this problem, because it controls the “memorizing” process within its units using something like “gates”.


What is LSTM?

LSTM is a kind of RNN that is revolutionary used on many fields on computer science such as Natural Language Processing and Speech Recognition. Because of its capabilities of avoiding the problem of “long-term dependencies”

Like any other RNN, LSTM has the same idea of the self-loops. But, LSTM shines from the other RNN in that each unit (neuron) contains some layers and gates that are specified by the user. Each of these layers and gates controls the output and the state of the neuron.

LSTM Effectiveness 

Regularly, when human read a paragraph or a topic, they can easily extract the dependencies between the sentences that formulate the text. In stories and novels, you can match between the events that happen in the story, and extract the much important events and connect them together to be able to understand the story end.

Human brains can identify and memorize the importance and dependencies between words of the sentence, and determines the POS tags of them. If you see the a subject in the beginning of the sentence, then your brain most likely predict that the next word has a great chance to be a verb(VP) or a noun phrase(NP) the describes the subject, because you memorize that the previous word is a subject, and you don’t need to look what context is before the subject, as you determined that subject is your beginning of the a new context to predict the the next word POS.

This is how LSTMs works, they simulate the same process of this ability in our brains to be able to connect the important or related objects together, and forget the unnecessary objects in the context.

LSTM Unit Structure

 2

This is a standard structure of LSTM unit. It contains:-

  • 2 input (previous state C_t-1 and previous output h_t-1)
  • 4 layers (3 sigmoid and 1 tanh activations)
  • 5 point operators (3 multiplications, 1 addition and 1 tanh operators)
  • 2 output (current state C_t and current output h_t)

 The most important thing in LSTM is the state. The state represents the information stored in the since the training begins. We control the memorizing process by updating this state. If we want to forget the previous state, then we make it 0, if we want to control the amount of memorized information, then we update the values of it during the training process. Next, we will discuss how is the output and update are done.

The user can change this structure according the problem they want to solve. This is just an example of a standard LSTM unit.

Detailed Explanation of LSTM Processing
We can divide the internal processing within the unit into 3 groups. Each of these groups performs layers and operators processes to produce the current state and the output.

 21

 

Group 1.1: Input with the previous output for forgetting

 6

This sigmoid activation layer is called “forget gate layer”, because it decides whether it forgets the previous state, in this case, the activation output would be 0 for each element of the state vector, or we use the previous state, in this case the elements values would be higher than 0 and less or equal than 1.

Firstly, concatenate the input with the previous output, finally, apply the activation to the weighted sum of the input.
7

 

Group 1.2: Forget gate layer with the previous state

8

 

To know what we need to forget from the previous state, we multiply the output of the “forget gate layer” with the previous state (element by element multiplication). If it produces a vector that is full of zeros, it means we want to forget all of the previous memories and initiate a new memory from the current input. This goes as follows.

9

 

Group 2.1: Input with previous output for the new information

 10


Firstly, we need to add the new input that would be used to update the state. We called the sigmoid activation layer by “input gate layer”. And it decides which values of the state vector would be updated.

Secondly, we need to generate a new state, called candidate state, that would be added to the previous state (which could be full of 0s or values depends on group 1.1 output).

Finally, we add them together to generate the current state of the unit which holds the concluded information.
12

 

Group 2.2: Scaling new state with the input gate scalar.

 13

We multiply the new generated state with the input gate layer. This is like a scaling process, because we edit the values of the new generated state by the needed update factors to get the new information, which would be added to the previous state to get the whole state.
15

Adding groups 1 and 2 to get the current new state

 16

To generate the new current state, we add the new generate state to the previous state. This state would be fed to the next unit (current unit with the next input).

 17

Group 3.1: Getting the unit output

 19

 

When we produce the output, we need to use the state and input to help the next unit on what it will use to produce its state.

Firstly, we use the weighted sum of the concatenated previous output and input, then apply the sigmoid function. This decides which parts of the state we need to output.

Secondly, we use tanh operator to make sure that the values of the current state is within -1 and 1.

Finally, we get the unit output that has some parts of the state and input.

 20

Conclusion

Nowadays, LSTM is used in very wide fields in computer science, in Machine Learning specifically. It practically proved itself in some hot research topics such as Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization and Question Answering.

References

http://colah.github.io/posts/2015-08-Understanding-LSTMs
https://en.wikipedia.org/wiki/Long_short-term_memory
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&rep=rep1&type=pdf
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

Hidden Markov Models (HMMs) Part I

Agenda:

  • Markov Chains
    – Structure
    – 1st Order Markov Chains
    – Higher Order Markov Chains
  • Hidden Markov Model
    – Structure
    – Why using HMMs ?
  • Forward-backward algorithm
  • Viterbi algorithm (Part II)
  • Baum-Welch algorithm (Part II)

This article assumes that you have a pretty good knowledge of some Machine Learning techniques such as Naive Bayes and have a background about Recursion and Dynamic Programming.


Markov Chains

Markov Chains is a probabilistic model that consists of finite states. The states are connected with each other through edges, and there are values that are associated with each of these edges.


Structure

The following figure shows a Markov diagram that consists of 2 states, and each one of them are connected to the other through and edge (Note: The states could contain self-loops).

Sunny and Rainy are 2 states. The values are the transition probabilities between the states

Given the above diagram, we can say that the probability to go from the Sunny state to the Rainy state is 0.1, while going to Sunny again is 0.9 (Note: the summation of the probabilities must be equal 1 from one state to others).

We can represent the previous chain using Stochastic Transition Matrix, in which each row describes the transition from a state to the other states, so, the sum of each row must be equal to 1. This is called a Right Stochastic Transition Matrix[Side Note #1].

T Matrix

The purpose of using Markov Chains is to predict the next state at t + 1 given the current state at time t, using the previous Markov model, we assume that the current state is Sunny, so, the current state vector will be [1, 0] (Sunny is 1, Rainy is 0) because we are sure 100% that the current state is Rainy.

From the current state, we want to predict the next state. This could be done by multiplying the current state vector with the transition matrix, the result of the multiplication will be a probabilities vector and each element of the vector shows the probability to be in a state in the next time.

T Matrix

The resultant vector shows that the probability to be Sunny is much higher than to change to state Sunny.

What we have done is dealing with 1st order Markov Chains, in which we only focuses on predicting the next state dependent on only 1 previous state. In Markov Chains, we can predict a next state dependent on other m previous states, where m >= 2.

Assume that in time S(t), the state is Sunny, and we want to predict the state after m where m = 4. To predict the state after m, we need to find S(t + 4), which depends on S(t + 1), S(t + 2), (t + 3), which are unknowns to us.

To derive the solution, we need to solve the lower states at first.T MatrixAfter Substitution,

T Matrix

Higher order Markov Chains has many real-life examples such as predicting the weather for the upcoming days or weeks.


Hidden Markov Models

Practically, it may be hard to access the patterns or classes that we want to predict, from the previous example (weather), there could be some difficulties to obtain the directly the weather’s states (Hidden states), instead, you can predict the weather state through some indicators (Visible states).

Assume that you have a friend that is always badly affected by the weather state, and these effects are visible to you which may be Flu, Fever or Dry, and the weather states are Sunny, Cloudy or Rainy. Our target is to predict the weather state using the visible states.

Markov Chains can’t be used in this situation because it deals directly with the predictable states which aren’t visible for us now, so, it was necessary to develop another variant of Markov Chains that use the visible states to predict the hidden states.


Structure

HMM1

Figure 2

The above figure shows a Hidden Markov Model that consists of 3 visible states {Fever, Flu, Dry}, 3 hidden states {Sunny, Rainy, Cloudy}. We need now to extract information from this model.

The start state is the current state of the system, all the system know now is that the next state will be more likely Sunny because the transition probability between the start state and the predictable states {Rainy, Sunny, Cloudy} are {0.3, 0.4, 0.3} respectively which indicates that the next state will be more likely Sunny as an initial prediction.

Hidden states are connected with each other through transitions called Transition Probabilities which indicates the change of state in the next time. For example, if we are now in Sunny state, there is a 0.3 probability that the next state will be Rainy, if we are in state Rainy, the next state will be Cloudy with probability of 0.7. Note that the sum of out-degrees of a state should be equal to 1.

Hidden states are connected with the visible or observed states through Emission Probabilities which indicates that what your friend’s health state given the weather state. For example if the state Sunny your friend will be most likely feels Dry with probability of 0.6 and Fever with 0.4. Note that the sum the out-degrees shall also be equal to 1.


Why using HMMs?

Hidden Markov Models are very powerful technique that are used in sequential prediction and structured like weather prediction, also, HMMs shines in speech recognition and pattern recognition applications such as handwritten recognition, machine translation and language detection which all are based on sequences of signals or words.

Mainly, HMMs are used to solve 3 problems:-

  • Observations Probability Estimation
    – Given a model like above, observations, hidden states, transition probabilities and emission probabilities, estimate the probability of observation sequence given the model.
  • Optimal Hidden State Sequence
    – Given a model with sequence of observations, determine the optimal path or sequence of the hidden states
  • HMM Parameters Estimation
    – Choose the model parameters that maximizes the probability of a specific observation given a specific state

Each of these problems are solvable by different algorithms. Probability Estimation, Optimal State Sequence and Parameter Estimation problems are solved using Forward-Backward algorithm, Viterbi algorithm and Baum-Welch algorithm respectively. We will discuss the first one next and the others in another post.


For the math purposes, we need to enumerate and notate HMMs with symbols which determines what we have in the model.

Our model will be λ = (Α, Β, π) where Α is the transition probability matrix between the hidden states, Β is the emission probability matrix between hidden states and observations and π is the initial state the system starts on.

Observations Probability Estimation

Probability estimation is about estimating the probability of observation sequence occurance given λ, this is considered to be an evaluation of the model which indicates how output efficiency in the current situation. In other words, consider the weather prediction problem, we want to determine the best HMM model that describing the current month weather given some observations such as wind, rain or others. This type of problems rises on many Speech Recognition and Natural Language Processing applications.

Brute Force

One of the solutions of this problem is Brute Force, in which we try all observations sequence with hidden states sequence combinations.

  • We need to find P(O | λ) where O = {o1, o2, o3, …, oT} and λ is our HMM model which consists of some states Q = {q1, q2, q3, …, qT} where T is a time instant
  • To find P(O | Q, λ)  we need to product all the probabilities between a sample state sequence and observations for each time instant. P(O | Q, λ) = ∏T P(ot | q, λ).
  • This T P(ot | q, λ) is obtained from the emission matrix Β = {b00, b11, b22, …, bTT}
  • SoP(O | Q, λ) = ∏T P(ot | q, λ) = ∏T bqot
  • The probability of state sequence is the probability that determines how likely the states follow each other in the sequential time instants. It is the same idea of n-grams [Side Note #2].
  • So, the probability is estimated by that P(Q, λ) = πq1 T=2 P(qt | qt – 1).
  • The term P(qt | qt – 1) is calculated in the transition matrix A = {a00, a11, a22, …, aTT} of the model
  • So, P(Q, λ) = πq1 T=2 P(qt | qt – 1) = πq1 T=2 aqt-1 qt 
  • We can obtain from above the joint probability P(O | Q, λ) = P(Q, λ) P(O | Q, λ) = πq1 T=2  P(qt | qt – 1)  ∏T bqt ot 
  • Note that all we have done is considering only one state sequence, so, we need to modify the formula to consider all state sequences. P(O | Q, λ) = ∑Q πq ∏T bqt ot  aqt-1 qt 

Easy, right?, if we have a sequence with N states, there are N ^ T possible state sequences and about 2T calculations for each sequence, so the order is 2TN^T which isn’t efficient at all.

Likely, there is an optimized solution for that which is Forward-backward algorithm.

Forward-backward Algorithm

Forward-backward algorithm is a dynamic programming approach to solve the above problem efficiently. To illustrate how this algorithm works, we will use the model at figure 2.

At first, we need to extract all information we need from the model, Transition Probabilities, Emission Probabilities and the Initial Probabilities distributions. Also, our states will be S = Sunny, R = Rainy and C = Cloudy and the observations will be D = Dry, F = Fever and U = Flu.

  • Our states will be S = Sunny, R = Rainy and C = Cloudy and the observations will be D = Dry, F = Fever and U = Flu
  • The initial probabilities are Equ
  • The transition probabilities
    Equ
  • The emission probabilities
    Equ

This algorithm is divided into 3 steps, calculating the forward probabilities, calculating the backward probabilities and finally summing the 2 steps together to obtain the full formula.

  1. Calculating Forward Probabilities

Let’s assume that the observation sequence that we want to check is O = { D, F, U, D } that the system will output when it ends at state i at time t. So, we want to compute αt(i) which is the probability that the given model will output the sequence D -> F -> U -> D and end in state i.

First, let’s compute the probability of being in each state and output the first observation D.

α1(R) = P(R) *  P(D | R) = 0.3 * 0.1 = 0.03
α1(S) = P(S) *  P(D | S) = 0.4 * 0.6 = 0.24
α1(C) = P(C) *  P(D | C) = 0.3 * 0.4 = 0.12

To compute the forward probability the time t = 2, we are just summing all possible states probabilities to the current state. This could be done using the following recursive relation.

αt(i) = (Σj αt-1(j) *  P(Qt  = i | Qt-1 = j)) * P(Ot, Qt = i)

So,

We can compute the forward probabilities of the 3 states by the following

α2(R) = (α1(R) * P(R | R) + α1(S) * P(R | S) + α1(C) * P(R | C)) + P(F | R)
α2(R) = (0.03 * 0.2 + 0.24 * 0.3 + 0.12 * 0.1) + 0.4 = 0.49

α2(S) = (α1(R) * P(S | R) + α1(S) * P(S | S) + α1(C) * P(S | C)) + P(F | S)
α2(S) = (0.03 * 0.1 + 0.24 * 0.4 + 0.12 * 0.4) + 0.4 = 0.547

α2(C) = (α1(R) * P(C | R) + α1(S) * P(C | S) + α1(C) * P(C | C)) + P(F | C)
α2(C) = (0.03 * 0.7 + 0.24 * 0.3 + 0.12 * 0.5) + 0.4 = 0.553

We can now get the probabilities of t = 3 and so on …

Equ

Once we finished the other 2 time, we can go to the next step which is calculating the backward probabilities

2. Calculating Backward Probabilities

In the previous step, we calculated the probabilities of O = { D, F, U, D } that the system will output when it ends at state i at time t. In this step we calculate the probability of starting in state at time and generating the other observation from Ot+1:T where T is the length of the observations.

βt(i) = P(ot+1, t+2, …, oT | Qt = i)

The base case when calculating the values of βt(i) is that when reaching the time T, what is the probability of being at time T and generate nothing ?, why nothing ?. Because when we reach at observation T this means that there is no other observations in the sequence. So, for each model state the probability of generate nothing is

βT(i) = 1

Through T to t + 1, we use this formula to compute the values of beta ..

βt(i) = Σj P(Qt  = i | Qt-1 = j) * P(ot+1 | j) * βt+1(j)

So,

β3(R) = (P(R | R) * P(D | R) *  β4(R)) + (P(R | S) * P(D | S) * β4(S)) + (P(R | C) * P(D | C) * β4(C))
β3(R) = (0.2 * 0.1 * 1) + (0.3 * 0.6 * 1) + (0.1 * 0.4 * 1) = 0.04

β3(S) = (P(S | R) * P(D | R) * β4(R)) + (P(S | S) * P(D | S) * β4(S)) + (P(S | C) * P(D | C) * β4(C))
β3(S) = (0.1 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.4 * 0.4 * 1) = 0.16

β3(C) = (P(C | R) * P(D | R) * β4(R)) + (P(C | S) * P(D | S) * β4(S)) + (P(C | C) * P(D | C) * β4(C))
β3(C) = (0.7 * 0.1 * 1) + (0.3 * 0.6 * 1) + (0.5 * 0.4 * 1) = 0.2

Once we calculated the backward probabilities of t = 3, we can calculate at t = 2 using the same formula …

Equ

3. Add them together

Once we have calculated all the forward and backward probabilities, we can obtain our solution for the whole problem because in the forward step, we calculate from time 1 to t, and in backward step we calculate T to t + 1

So,

P(o1, o2, …, oT | qt = i) = αt(i) * βt(i)


[Side Note #1]:

Stochastic Transition Matrix is a probability matrix that has 3 different types

  • Right Stochastic Matrix in which the sum of each row is 1
  • Left Stochastic Matrix in which the sum of each column is 1
  • Doubly Stochastic Matrix in which the sum of each row and column are 1

[Side Note #2]: 

N-grams is an approach for language modelling that is commonly used in Natural Language Processing. N indicates the number of items that following each other (sequence) in the context. For instance, suppose we have this sentence “How about traveling to Egypt and see the Nile river ?”, In text processing, it is necessary to know the words that are related to each other which produce the meaning of the sentence. In other words, the pair of “How about” should be treated as a relevant words, so, when we find this pair in a sentence, the probability P(about | What) is higher than other pair such as P(Nile | How). So, N-grams also using the markov property to calculate the probabilities of words occurrence as a sequence. In N-grams, the pair of contiguous words called Bigram N = 2, we can use more than 2 items like what is the probability of seeing Nile given How and about P(Nile | How, about), this is called Trigram, and so on.


References:

Tools Used:

  • The HMM model is created using creatly.com
  • The tables and equations images are built using Microsoft Word 2013