Agenda:

- Markov Chains

– Structure

– 1st Order Markov Chains

– Higher Order Markov Chains - Hidden Markov Model

– Structure

– Why using HMMs ? - Forward-backward algorithm
- Viterbi algorithm (Part II)
- Baum-Welch algorithm (Part II)

This article assumes that you have a pretty good knowledge of some Machine Learning techniques such as Naive Bayes and have a background about Recursion and Dynamic Programming.

**Markov Chains**

Markov Chains is a probabilistic model that consists of finite states. The states are connected with each other through edges, and there are values that are associated with each of these edges.

**Structure**

The following figure shows a Markov diagram that consists of 2 states, and each one of them are connected to the other through and edge (**Note**: The states could contain self-loops).

Sunny and Rainy are 2 states. The values are the transition probabilities between the states

Given the above diagram, we can say that the probability to go from the Sunny state to the Rainy state is 0.1, while going to Sunny again is 0.9 (**Note**: the summation of the probabilities must be equal 1 from one state to others).

We can represent the previous chain using Stochastic Transition Matrix, in which each row describes the transition from a state to the other states, so, the sum of each row must be equal to 1. This is called a Right Stochastic Transition Matrix[**Side Note #1**].

The purpose of using Markov Chains is to predict the next state at **t + 1** given the current state at time **t**,** **using the previous Markov model, we assume that the current state is Sunny, so, the current state vector will be [1, 0] (Sunny is 1, Rainy is 0) because we are sure 100% that the current state is Rainy.

From the current state, we want to predict the next state. This could be done by multiplying the current state vector with the transition matrix, the result of the multiplication will be a probabilities vector and each element of the vector shows the probability to be in a state in the next time.

The resultant vector shows that the probability to be Sunny is much higher than to change to state Sunny.

What we have done is dealing with 1st order Markov Chains, in which we only focuses on predicting the next state dependent on only 1 previous state. In Markov Chains, we can predict a next state dependent on other **m** previous states, where m >= 2.

Assume that in time **S(t)**, the state is Sunny, and we want to predict the state after **m** where m = 4. To predict the state after **m**, we need to find **S(t + 4)**, which depends on **S(t + 1)**, **S(t + 2)**, **(t + 3), **which are unknowns to us.

To derive the solution, we need to solve the lower states at first.After Substitution,

Higher order Markov Chains has many real-life examples such as predicting the weather for the upcoming days or weeks.

**Hidden Markov Models**

Practically, it may be hard to access the patterns or classes that we want to predict, from the previous example (weather), there could be some difficulties to obtain the directly the weather’s states (Hidden states), instead, you can predict the weather state through some indicators (Visible states).

Assume that you have a friend that is always badly affected by the weather state, and these effects are visible to you which may be Flu, Fever or Dry, and the weather states are Sunny, Cloudy or Rainy. Our target is to predict the weather state using the visible states.

Markov Chains can’t be used in this situation because it deals directly with the predictable states which aren’t visible for us now, so, it was necessary to develop another variant of Markov Chains that use the visible states to predict the hidden states.

**Structure**

*Figure 2*

The above figure shows a Hidden Markov Model that consists of 3 visible states {Fever, Flu, Dry}, 3 hidden states {Sunny, Rainy, Cloudy}. We need now to extract information from this model.

The start state is the current state of the system, all the system know now is that the next state will be more likely Sunny because the transition probability between the start state and the predictable states {Rainy, Sunny, Cloudy} are {0.3, 0.4, 0.3} respectively which indicates that the next state will be more likely Sunny as an initial prediction.

Hidden states are connected with each other through transitions called Transition Probabilities which indicates the change of state in the next time. For example, if we are now in Sunny state, there is a 0.3 probability that the next state will be Rainy, if we are in state Rainy, the next state will be Cloudy with probability of 0.7. Note that the sum of out-degrees of a state should be equal to 1.

Hidden states are connected with the visible or observed states through Emission Probabilities which indicates that what your friend’s health state given the weather state. For example if the state Sunny your friend will be most likely feels Dry with probability of 0.6 and Fever with 0.4. Note that the sum the out-degrees shall also be equal to 1.

**Why using HMMs?**

Hidden Markov Models are very powerful technique that are used in sequential prediction and structured like weather prediction, also, HMMs shines in speech recognition and pattern recognition applications such as handwritten recognition, machine translation and language detection which all are based on sequences of signals or words.

Mainly, HMMs are used to solve 3 problems:-

**Observations Probability Estimation**

– Given a model like above, observations, hidden states, transition probabilities and emission probabilities, estimate the probability of observation sequence given the model.**Optimal Hidden State Sequence**

– Given a model with sequence of observations, determine the optimal path or sequence of the hidden states**HMM Parameters Estimation**

– Choose the model parameters that maximizes the probability of a specific observation given a specific state

Each of these problems are solvable by different algorithms. Probability Estimation, Optimal State Sequence and Parameter Estimation problems are solved using Forward-Backward algorithm, Viterbi algorithm and Baum-Welch algorithm respectively. We will discuss the first one next and the others in another post.

For the math purposes, we need to enumerate and notate HMMs with symbols which determines what we have in the model.

Our model will be **λ = (Α, Β, π)** where **Α** is the transition probability matrix between the hidden states, **Β** is the emission probability matrix between hidden states and observations and **π** is the initial state the system starts on.

**Observations Probability Estimation**

Probability estimation is about estimating the probability of observation sequence occurance given **λ, **this is considered to be an evaluation of the model which indicates how output efficiency in the current situation. In other words, consider the weather prediction problem, we want to determine the best HMM model that describing the current month weather given some observations such as wind, rain or others. This type of problems rises on many Speech Recognition and Natural Language Processing applications.

*Brute Force*

One of the solutions of this problem is Brute Force, in which we try all observations sequence with hidden states sequence combinations.

- We need to find
where*P(O | λ)***O**= {o_{1}, o_{2}, o_{3}, …, o} and_{T}is our HMM model which consists of some states*λ***Q**= {q_{1}, q_{2}, q_{3}, …, q} where_{T}**T**is a time instant

- To find
we need to product all the probabilities between a sample state sequence and observations for each time instant.*P(O | Q, λ)*.*P(O | Q, λ) = ∏*_{T}P(o_{t}| q_{t }, λ)

- This
is obtained from the emission matrix*∏*_{T}P(o_{t}| q_{t }, λ)**Β**= {b**00,**b11**,**b22**, …,**b_{TT}}

- So
**,***P(O | Q, λ) = ∏*_{T}P(o_{t}| q_{t }, λ) = ∏_{T}b_{qt ot}

- The probability of state sequence is the probability that determines how likely the states follow each other in the sequential time instants. It is the same idea of n-grams [
**Side Note #2**].

- So, the probability is estimated by that
*P(Q, λ) = πq1**∏*_{T=2 }*P(q*_{t}| q_{t – 1}).

- The term
is calculated in the transition matrix*P(q*_{t}| q_{t – 1})**A**= {a00, a11, a22, …, a_{TT}} of the model

- So,
*P(Q, λ) = πq1**∏*_{T=2 }*P(q*_{t}| q_{t – 1}) = πq1*∏*_{T=2}a_{qt-1 qt}_{ }

- We can obtain from above the joint probability
*P(O | Q, λ) = P(Q, λ) P(O | Q, λ) = πq1**∏*_{T=2 }*P(q*_{t}| q_{t – 1}) ∏_{T}b_{qt ot}

- Note that all we have done is considering only one state sequence, so, we need to modify the formula to consider all state sequences.
*P(O | Q, λ) = ∑*_{Q}πq ∏_{T}b_{qt ot}**a**_{qt-1 qt}_{ }

Easy, right?, if we have a sequence with **N** states, there are **N ^ T** possible state sequences and about **2T** calculations for each sequence, so the order is **2TN^T** which isn’t efficient at all.

Likely, there is an optimized solution for that which is Forward-backward algorithm.

*Forward-backward Algorithm*

Forward-backward algorithm is a dynamic programming approach to solve the above problem efficiently. To illustrate how this algorithm works, we will use the model at **figure 2.**

At first, we need to extract all information we need from the model, Transition Probabilities, Emission Probabilities and the Initial Probabilities distributions. Also, our states will be **S** = Sunny, **R** = Rainy and **C** = Cloudy and the observations will be **D** = Dry, **F** = Fever and **U** = Flu.

- Our states will be
**S**= Sunny,**R**= Rainy and**C**= Cloudy and the observations will be**D**= Dry,**F**= Fever and**U**= Flu

- The transition probabilities

This algorithm is divided into 3 steps, calculating the forward probabilities, calculating the backward probabilities and finally summing the 2 steps together to obtain the full formula.

*Calculating Forward Probabilities*

Let’s assume that the observation sequence that we want to check is **O** = { D, F, U, D } that the system will output when it ends at state **i** at time **t. **So, we want to compute * α_{t}(i) *which is the probability that the given model will output the sequence D -> F -> U -> D and end in state

**i.**

First, let’s compute the probability of being in each state and output the first observation D.

**α1(R) = P(R) * P(D | R) = 0.3 * 0.1 = 0.03
α _{1}(S) = P(S) * P(D | S) = 0.4 * 0.6 = 0.24
α_{1}(C) = P(C) * P(D | C) = 0.3 * 0.4 = 0.12
**

To compute the forward probability the time t = 2, we are just summing all possible states probabilities to the current state. This could be done using the following recursive relation.

**αt(i) = (Σ _{j} α_{t-1}(j) * P(Q_{t} = i | Q_{t-1} = j)) * P(O_{t}, Q_{t} = i)**

So,

We can compute the forward probabilities of the 3 states by the following

**α2(R) = (α1(R) * P(R | R) + α1(S) * P(R | S) + α1(C) * P(R | C)) + P(F | R)
**

**α2(R) = (0.03 * 0.2 + 0.24 * 0.3 + 0.12 * 0.1) + 0.4 = 0.49****α2(S) = (α1(R) * P(S | R) + α1(S) * P(S | S) + α1(C) * P(S | C)) + P(F | S)
**

**α2(S) = (0.03 * 0.1 + 0.24 * 0.4 + 0.12 * 0.4) + 0.4 = 0.547****α2(C) = (α1(R) * P(C | R) + α1(S) * P(C | S) + α1(C) * P(C | C)) + P(F | C)
**

**α2(C) = (0.03 * 0.7 + 0.24 * 0.3 + 0.12 * 0.5) + 0.4 = 0.553**We can now get the probabilities of t = 3 and so on …

Once we finished the other 2 time, we can go to the next step which is calculating the backward probabilities

*2**. Calculating Backward Probabilities*

In the previous step, we calculated the probabilities of **O** = { D, F, U, D } that the system will output when it ends at state **i** at time **t. **In this step we calculate the probability of **starting** in state **i **at time **t **and generating the other observation from **O _{t+1:T}** where

**T**is the length of the observations.

**β _{t}(i) = P(o_{t+1}, _{t+2}, …, o_{T} | Q_{t} = i)**

The base case when calculating the values of * β_{t}(i) *is that when reaching the time

**T,**what is the probability of being at time T and generate nothing ?, why nothing ?. Because when we reach at observation T this means that there is no other observations in the sequence. So, for each model state the probability of generate nothing is

**1**

**βT(i) = 1**

Through **T** to **t + 1, **we use this formula to compute the values of beta ..

**β _{t}(i) = Σ_{j} P(Q_{t} = i | Q_{t-1} = j) * P(o_{t+1} | j) * β_{t+1}(j)**

So,

**β3(R) = (P(R | R) * P(D | R) * β4(R)) + (P(R | S) * P(D | S) * β4(S)) + (P(R | C) * P(D | C) * β4(C))**

β3(R) = (0.2 * 0.1 * 1) + (0.3 * 0.6 * 1) + (0.1 * 0.4 * 1) = 0.04

**β3(S) = (P(S | R) * P(D | R) * β4(R)) + (P(S | S) * P(D | S) * β4(S)) + (P(S | C) * P(D | C) * β4(C))
β3(S) = (0.1 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.4 * 0.4 * 1) = 0.16
**

**β3(C) = (P(C | R) * P(D | R) * β4(R)) + (P(C | S) * P(D | S) * β4(S)) + (P(C | C) * P(D | C) * β4(C))
β3(C) = (0.7 * 0.1 * 1) + (0.3 * 0.6 * 1) + (0.5 * 0.4 * 1) = 0.2
**

Once we calculated the backward probabilities of t = 3, we can calculate at t = 2 using the same formula …

3. **Add them together**

Once we have calculated all the forward and backward probabilities, we can obtain our solution for the whole problem because in the forward step, we calculate from time **1** to **t**, and in backward step we calculate **T** to **t + 1**

So,

*P(o _{1}, o_{2}, …, o_{T} | q_{t} = i) = α_{t}(i) * β_{t}(i)*

**[Side Note #1]:**

Stochastic Transition Matrix is a probability matrix that has 3 different types

- Right Stochastic Matrix in which the sum of each
**row**is 1 - Left Stochastic Matrix in which the sum of each
**column**is 1 - Doubly Stochastic Matrix in which the sum of each
**row**and**column**are 1

**[Side Note #2]: **

N-grams is an approach for language modelling that is commonly used in Natural Language Processing. N indicates the number of items that following each other (sequence) in the context. For instance, suppose we have this sentence “How about traveling to Egypt and see the Nile river ?”, In text processing, it is necessary to know the words that are related to each other which produce the meaning of the sentence. In other words, the pair of “How about” should be treated as a relevant words, so, when we find this pair in a sentence, the probability P(about | What) is higher than other pair such as P(Nile | How). So, N-grams also using the markov property to calculate the probabilities of words occurrence as a sequence. In N-grams, the pair of contiguous words called Bigram N = 2, we can use more than 2 items like what is the probability of seeing Nile given How and about P(Nile | How, about), this is called Trigram, and so on.

**References:**

*Christopher D. Manning and Hinrich Schiitze:*Foundations of Statistical Natural Language Processing- http://research.cs.tamu.edu/prism/lectures/pr/pr_l23.pdf
- http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/hmms/s2_pg2.html

**Tools Used:**

- The HMM model is created using creatly.com
- The tables and equations images are built using Microsoft Word 2013

I think it is great except for some mistakes in some formulas.

Thanks for the feedback.

Could you please tell me the mistakes in the formulas to correct them?