[Java] Hidden Markov Model Implementation

This is my last post this year, so I have decided to finish the HMM implementation before the end of the year.

I tried to keep the project as clean as possible so that anyone who wants to use it finds it understandable and easy to use. However, this isn't the final version; there are many things I want to add to the API, along with other enhancements. Any updates will be added to the README of the repository.

– Here you can find the repository of the implementation.

– To see the features, check the README file

– To check the Javadoc of the whole project, click here

The test model is taken from an HMM model I created using the creately.com app.


Related posts:

I will add the other 2 parts after finishing them.

That's all. See you next year! 🙂

[Kaggle] Poker Rule Induction

I wrote a note http://nbviewer.ipython.org/github/AhmedHani/Kaggle-Machine-Learning-Competitions/blob/master/Easy/PokerRuleInduction/PokerRuleInduction.ipynb about the Poker Rule Induction problem. The note explains the problem description and the steps I used to solve it.

It is considered a good problem for those who want to start solving on Kaggle and get to know some of the Python Machine Learning libraries that are commonly used there.

Hidden Markov Models (HMMs) Part I


  • Markov Chains
    – 1st Order Markov Chains
    – Higher Order Markov Chains
  • Hidden Markov Model
    – Structure
    – Why use HMMs?
  • Forward-backward algorithm
  • Viterbi algorithm (Part II)
  • Baum-Welch algorithm (Part II)

This article assumes that you have a pretty good knowledge of some Machine Learning techniques, such as Naive Bayes, and a background in Recursion and Dynamic Programming.

Markov Chains

A Markov Chain is a probabilistic model that consists of finite states. The states are connected to each other through edges, and each edge has an associated value.


The following figure shows a Markov diagram that consists of 2 states, each connected to the other through an edge (Note: the states may contain self-loops).

Sunny and Rainy are 2 states. The values are the transition probabilities between the states

Given the above diagram, we can say that the probability of going from the Sunny state to the Rainy state is 0.1, while staying Sunny is 0.9 (Note: the probabilities going out of a state must sum to 1).

We can represent the previous chain using a Stochastic Transition Matrix, in which each row describes the transitions from one state to the other states, so the sum of each row must equal 1. This is called a Right Stochastic Transition Matrix [Side Note #1].

(Image: the transition matrix T)

The purpose of using Markov Chains is to predict the next state at t + 1 given the current state at time t. Using the previous Markov model, we assume that the current state is Sunny, so the current state vector will be [1, 0] (Sunny is 1, Rainy is 0), because we are 100% sure that the current state is Sunny.

From the current state, we want to predict the next state. This can be done by multiplying the current state vector by the transition matrix; the result of the multiplication is a probability vector, and each element of the vector shows the probability of being in a certain state at the next time step.

(Image: the state vector multiplied by the transition matrix)

The resultant vector shows that the probability of staying Sunny is much higher than the probability of changing to Rainy.

What we have done so far is 1st order Markov Chains, in which we predict the next state depending on only 1 previous state. In Markov Chains, we can also predict the next state depending on the m previous states, where m >= 2.

Assume that at time t the state S(t) is Sunny, and we want to predict the state after m steps, where m = 4. To predict the state after m steps, we need to find S(t + 4), which depends on S(t + 1), S(t + 2) and S(t + 3), which are unknown to us.

To derive the solution, we need to solve for the lower states first, expanding S(t + 4) in terms of S(t + 3), S(t + 2) and S(t + 1), then substituting back. (Images: the expansion and the result after substitution)

Higher order Markov Chains have many real-life applications, such as predicting the weather for the upcoming days or weeks.
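In Python, the next-state computation looks like this (a minimal sketch; the figure only specifies the Sunny row [0.9, 0.1], so the Rainy row below is an assumed example):

```python
# Right stochastic transition matrix over the states [Sunny, Rainy].
# The Sunny row comes from the figure; the Rainy row is an assumed
# example, since its values are not given in the text.
T = [
    [0.9, 0.1],  # from Sunny
    [0.5, 0.5],  # from Rainy (assumption)
]

def step(v, T):
    """One prediction step: multiply the state vector by the matrix."""
    n = len(T)
    return [sum(v[i] * T[i][j] for i in range(n)) for j in range(n)]

current = [1.0, 0.0]           # 100% sure the current state is Sunny
next_probs = step(current, T)  # probabilities for time t + 1

# Predicting m steps ahead is just repeated multiplication (here m = 4).
m_probs = current
for _ in range(4):
    m_probs = step(m_probs, T)
```

Since the matrix is row-stochastic, the result stays a valid probability vector after any number of steps.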

Hidden Markov Models

Practically, it may be hard to access the patterns or classes that we want to predict. In the previous example (weather), it could be difficult to obtain the weather's states directly (hidden states); instead, you can predict the weather state through some indicators (visible states).

Assume that you have a friend who is always badly affected by the weather, and these effects are visible to you: they may be Flu, Fever or Dry. The weather states are Sunny, Cloudy or Rainy. Our target is to predict the weather state using the visible states.

Markov Chains can't be used in this situation because they deal directly with the states we want to predict, which aren't visible to us now, so it was necessary to develop another variant of Markov Chains that uses the visible states to predict the hidden states.



Figure 2

The above figure shows a Hidden Markov Model that consists of 3 visible states {Fever, Flu, Dry} and 3 hidden states {Sunny, Rainy, Cloudy}. We now need to extract information from this model.

The start state is the current state of the system. All the system knows now is that the transition probabilities between the start state and the hidden states {Rainy, Sunny, Cloudy} are {0.3, 0.4, 0.3} respectively, which indicates that the next state will more likely be Sunny as an initial prediction.

Hidden states are connected with each other through transitions called Transition Probabilities, which indicate the chance of changing state at the next time step. For example, if we are now in the Sunny state, there is a 0.3 probability that the next state will be Rainy; if we are in the Rainy state, the next state will be Cloudy with probability 0.7. Note that the sum of the outgoing probabilities of a state must equal 1.

Hidden states are connected to the visible or observed states through Emission Probabilities, which indicate your friend's health state given the weather state. For example, if the state is Sunny, your friend will most likely feel Dry with probability 0.6, and Fever with probability 0.4. Note that the sum of the out-degrees must also equal 1.
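The model in Figure 2 can be written down as plain probability tables; a sketch in Python (the Flu emission values for Rainy and Cloudy are not stated in the text and are inferred here so that each row sums to 1):

```python
initial = {'Rainy': 0.3, 'Sunny': 0.4, 'Cloudy': 0.3}

# transition[a][b] = probability of moving from hidden state a to b
transition = {
    'Rainy':  {'Rainy': 0.2, 'Sunny': 0.1, 'Cloudy': 0.7},
    'Sunny':  {'Rainy': 0.3, 'Sunny': 0.4, 'Cloudy': 0.3},
    'Cloudy': {'Rainy': 0.1, 'Sunny': 0.4, 'Cloudy': 0.5},
}

# emission[state][observation]; Flu values for Rainy/Cloudy are inferred
emission = {
    'Rainy':  {'Dry': 0.1, 'Fever': 0.4, 'Flu': 0.5},
    'Sunny':  {'Dry': 0.6, 'Fever': 0.4, 'Flu': 0.0},
    'Cloudy': {'Dry': 0.4, 'Fever': 0.4, 'Flu': 0.2},
}

# Every outgoing distribution must sum to 1, as noted above.
rows_ok = all(abs(sum(row.values()) - 1.0) < 1e-9
              for table in (transition, emission)
              for row in table.values())
```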

Why use HMMs?

Hidden Markov Models are a very powerful technique for sequential and structured prediction, such as weather prediction. HMMs also shine in speech recognition and pattern recognition applications such as handwriting recognition, machine translation and language detection, which are all based on sequences of signals or words.

Mainly, HMMs are used to solve 3 problems:-

  • Observations Probability Estimation
    – Given a model like the one above (observations, hidden states, transition probabilities and emission probabilities), estimate the probability of an observation sequence given the model.
  • Optimal Hidden State Sequence
    – Given a model with a sequence of observations, determine the optimal path or sequence of the hidden states.
  • HMM Parameters Estimation
    – Choose the model parameters that maximize the probability of a specific observation given a specific state.

Each of these problems is solved by a different algorithm: the Probability Estimation, Optimal State Sequence and Parameter Estimation problems are solved using the Forward-Backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm respectively. We will discuss the first one next and the others in another post.

For the math, we need to enumerate and notate HMMs with symbols that determine what we have in the model.

Our model will be λ = (A, B, π), where A is the transition probability matrix between the hidden states, B is the emission probability matrix between hidden states and observations, and π is the initial state distribution the system starts with.

Observations Probability Estimation

Probability estimation is about estimating the probability of an observation sequence occurring given λ. This can be considered an evaluation of the model, indicating how well it fits the current situation. In other words, consider the weather prediction problem: we want to determine the best HMM model that describes the current month's weather given some observations such as wind, rain or others. This type of problem arises in many Speech Recognition and Natural Language Processing applications.

Brute Force

One of the solutions to this problem is Brute Force, in which we try all combinations of hidden state sequences for the given observation sequence.

  • We need to find P(O | λ), where O = {o1, o2, o3, …, oT} and λ is our HMM model, which consists of some states Q = {q1, q2, q3, …, qT}, where T is a time instant
  • To find P(O | Q, λ), we take the product of the probabilities between a sample state sequence and the observations for each time instant: P(O | Q, λ) = ∏t P(ot | qt, λ)
  • Each term P(ot | qt, λ) is obtained from the emission matrix B, so P(O | Q, λ) = ∏t P(ot | qt, λ) = ∏t bqt ot
  • The probability of a state sequence determines how likely the states follow each other over the sequential time instants. It is the same idea as n-grams [Side Note #2].
  • So, that probability is estimated by P(Q | λ) = πq1 ∏t=2..T P(qt | qt-1)
  • The term P(qt | qt-1) is taken from the transition matrix A of the model, so P(Q | λ) = πq1 ∏t=2..T aqt-1 qt
  • From the above we can obtain the joint probability P(O, Q | λ) = P(Q | λ) P(O | Q, λ) = πq1 ∏t=2..T aqt-1 qt ∏t bqt ot
  • Note that all we have done considers only one state sequence, so we need to modify the formula to consider all state sequences: P(O | λ) = ΣQ πq1 ∏t bqt ot aqt-1 qt

Easy, right? However, if the model has N states, there are N^T possible state sequences and about 2T calculations for each sequence, so the order is 2T·N^T, which isn't efficient at all.
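The enumeration above can be sketched directly in Python, using the Figure 2 tables (the Flu emissions for Rainy and Cloudy are inferred values, chosen so each row sums to 1):

```python
from itertools import product

initial = {'R': 0.3, 'S': 0.4, 'C': 0.3}
trans = {'R': {'R': 0.2, 'S': 0.1, 'C': 0.7},
         'S': {'R': 0.3, 'S': 0.4, 'C': 0.3},
         'C': {'R': 0.1, 'S': 0.4, 'C': 0.5}}
emit = {'R': {'D': 0.1, 'F': 0.4, 'U': 0.5},
        'S': {'D': 0.6, 'F': 0.4, 'U': 0.0},
        'C': {'D': 0.4, 'F': 0.4, 'U': 0.2}}

def brute_force(obs):
    """P(O | lambda): sum pi_q1 * prod(a) * prod(b) over all N^T sequences."""
    total = 0.0
    for q in product(initial, repeat=len(obs)):  # N^T state sequences
        p = initial[q[0]] * emit[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[q[t - 1]][q[t]] * emit[q[t]][obs[t]]
        total += p
    return total

p_obs = brute_force(['D', 'F', 'U', 'D'])
```

Even for this toy model the loop visits 3^4 = 81 sequences; the exponential blow-up is exactly why the next section matters.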

Luckily, there is an optimized solution for that: the Forward-Backward algorithm.

Forward-backward Algorithm

Forward-backward algorithm is a dynamic programming approach to solve the above problem efficiently. To illustrate how this algorithm works, we will use the model at figure 2.

At first, we need to extract all the information we need from the model: the Transition Probabilities, the Emission Probabilities and the Initial Probability distribution. Also, our states will be S = Sunny, R = Rainy and C = Cloudy, and the observations will be D = Dry, F = Fever and U = Flu.

  • The initial probabilities are π = (R: 0.3, S: 0.4, C: 0.3)
  • The transition probabilities are P(R | R) = 0.2, P(S | R) = 0.1, P(C | R) = 0.7; P(R | S) = 0.3, P(S | S) = 0.4, P(C | S) = 0.3; P(R | C) = 0.1, P(S | C) = 0.4, P(C | C) = 0.5
  • The emission probabilities are P(D | R) = 0.1, P(F | R) = 0.4, P(U | R) = 0.5; P(D | S) = 0.6, P(F | S) = 0.4, P(U | S) = 0.0; P(D | C) = 0.4, P(F | C) = 0.4, P(U | C) = 0.2 (the U values follow from each row summing to 1)

This algorithm is divided into 3 steps: calculating the forward probabilities, calculating the backward probabilities, and finally combining the two to obtain the full formula.

  1. Calculating Forward Probabilities

Let's assume that the observation sequence we want to check is O = { D, F, U, D }. We want to compute αt(i), which is the probability that the given model will output the sequence D -> F -> U -> D and end in state i at time t.

First, let’s compute the probability of being in each state and output the first observation D.

α1(R) = P(R) * P(D | R) = 0.3 * 0.1 = 0.03
α1(S) = P(S) * P(D | S) = 0.4 * 0.6 = 0.24
α1(C) = P(C) * P(D | C) = 0.3 * 0.4 = 0.12

To compute the forward probability at time t = 2, we sum over all the states we could have come from, then multiply by the emission probability of the current observation. This is done using the following recursive relation:

αt(i) = (Σj αt-1(j) * P(Qt = i | Qt-1 = j)) * P(ot | Qt = i)


We can compute the forward probabilities of the 3 states as follows:

α2(R) = (α1(R) * P(R | R) + α1(S) * P(R | S) + α1(C) * P(R | C)) * P(F | R)
α2(R) = (0.03 * 0.2 + 0.24 * 0.3 + 0.12 * 0.1) * 0.4 = 0.036

α2(S) = (α1(R) * P(S | R) + α1(S) * P(S | S) + α1(C) * P(S | C)) * P(F | S)
α2(S) = (0.03 * 0.1 + 0.24 * 0.4 + 0.12 * 0.4) * 0.4 = 0.0588

α2(C) = (α1(R) * P(C | R) + α1(S) * P(C | S) + α1(C) * P(C | C)) * P(F | C)
α2(C) = (0.03 * 0.7 + 0.24 * 0.3 + 0.12 * 0.5) * 0.4 = 0.0612

We can now get the probabilities of t = 3 and so on …


Once we have finished the remaining two time steps, we can go to the next step: calculating the backward probabilities.
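A sketch of the forward pass in Python, reproducing the α1 values above (the Flu emissions for Rainy and Cloudy are inferred values, chosen so each row sums to 1):

```python
initial = {'R': 0.3, 'S': 0.4, 'C': 0.3}
trans = {'R': {'R': 0.2, 'S': 0.1, 'C': 0.7},
         'S': {'R': 0.3, 'S': 0.4, 'C': 0.3},
         'C': {'R': 0.1, 'S': 0.4, 'C': 0.5}}
emit = {'R': {'D': 0.1, 'F': 0.4, 'U': 0.5},
        'S': {'D': 0.6, 'F': 0.4, 'U': 0.0},
        'C': {'D': 0.4, 'F': 0.4, 'U': 0.2}}

def forward(obs):
    """alpha[t][i] = P(o_1..o_t, ending in state i):
    sum over predecessor states, then multiply by the emission."""
    alpha = [{i: initial[i] * emit[i][obs[0]] for i in initial}]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append({i: sum(prev[j] * trans[j][i] for j in prev)
                         * emit[i][obs[t]]
                      for i in initial})
    return alpha

alpha = forward(['D', 'F', 'U', 'D'])
# alpha[0] gives the alpha_1 values: R 0.03, S 0.24, C 0.12
```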

2. Calculating Backward聽Probabilities

In the previous step, we calculated the probability that the system outputs O = { D, F, U, D } and ends in state i at time t. In this step we calculate the probability of starting in state i at time t and generating the remaining observations ot+1 … oT, where T is the length of the observation sequence.

βt(i) = P(ot+1, ot+2, …, oT | Qt = i)

The base case when calculating the values of βt(i) is at time T: what is the probability of being at time T and generating nothing? Why nothing? Because when we reach observation T there are no more observations in the sequence. So, for each model state, the probability of generating nothing is 1:

βT(i) = 1

From T down to t + 1, we use this formula to compute the values of beta:

βt(i) = Σj P(Qt+1 = j | Qt = i) * P(ot+1 | j) * βt+1(j)


β3(R) = (P(R | R) * P(D | R) * β4(R)) + (P(S | R) * P(D | S) * β4(S)) + (P(C | R) * P(D | C) * β4(C))
β3(R) = (0.2 * 0.1 * 1) + (0.1 * 0.6 * 1) + (0.7 * 0.4 * 1) = 0.36

β3(S) = (P(R | S) * P(D | R) * β4(R)) + (P(S | S) * P(D | S) * β4(S)) + (P(C | S) * P(D | C) * β4(C))
β3(S) = (0.3 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.3 * 0.4 * 1) = 0.39

β3(C) = (P(R | C) * P(D | R) * β4(R)) + (P(S | C) * P(D | S) * β4(S)) + (P(C | C) * P(D | C) * β4(C))
β3(C) = (0.1 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.5 * 0.4 * 1) = 0.45

Once we have calculated the backward probabilities at t = 3, we can calculate t = 2 using the same formula …


3. Add them together

Once we have calculated all the forward and backward probabilities, we can obtain the solution for the whole problem: the forward step covers times 1 through t, and the backward step covers times t + 1 through T. For any time t:

P(o1, o2, …, oT, Qt = i | λ) = αt(i) * βt(i), and summing over all states i gives P(O | λ).
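A sketch of the backward pass combined with the forward pass, using the Figure 2 tables (the Flu emissions for Rainy and Cloudy are inferred values); it checks that α and β combine to the same P(O | λ) at every time step:

```python
initial = {'R': 0.3, 'S': 0.4, 'C': 0.3}
trans = {'R': {'R': 0.2, 'S': 0.1, 'C': 0.7},
         'S': {'R': 0.3, 'S': 0.4, 'C': 0.3},
         'C': {'R': 0.1, 'S': 0.4, 'C': 0.5}}
emit = {'R': {'D': 0.1, 'F': 0.4, 'U': 0.5},
        'S': {'D': 0.6, 'F': 0.4, 'U': 0.0},
        'C': {'D': 0.4, 'F': 0.4, 'U': 0.2}}

def forward(obs):
    alpha = [{i: initial[i] * emit[i][obs[0]] for i in initial}]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append({i: sum(prev[j] * trans[j][i] for j in prev)
                         * emit[i][obs[t]] for i in initial})
    return alpha

def backward(obs):
    """beta[t][i] = P(o_{t+1}..o_T | state i at t); base case beta_T = 1."""
    T = len(obs)
    beta = [{i: 1.0 for i in initial} for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in initial:
            beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]]
                             * beta[t + 1][j] for j in initial)
    return beta

obs = ['D', 'F', 'U', 'D']
alpha, beta = forward(obs), backward(obs)

# P(O | lambda) = sum_i alpha_t(i) * beta_t(i): the same value at every t.
totals = [sum(alpha[t][i] * beta[t][i] for i in initial)
          for t in range(len(obs))]
```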

[Side Note #1]:

A Stochastic Transition Matrix is a probability matrix that has 3 different types:

  • Right Stochastic Matrix, in which the sum of each row is 1
  • Left Stochastic Matrix, in which the sum of each column is 1
  • Doubly Stochastic Matrix, in which the sums of each row and each column are 1

[Side Note #2]:聽

N-grams are an approach to language modelling that is commonly used in Natural Language Processing. N indicates the number of items that follow each other (a sequence) in the context. For instance, suppose we have this sentence: "How about traveling to Egypt and see the Nile river?". In text processing, it is necessary to know which words are related to each other, as they produce the meaning of the sentence. In other words, the pair "How about" should be treated as related words, so when we find this pair in a sentence, the probability P(about | How) is higher than that of another pair, such as P(Nile | How). So, N-grams also use the Markov property to calculate the probabilities of words occurring as a sequence. In N-grams, a pair of contiguous words is called a bigram (N = 2); we can use more than 2 items, like the probability of seeing Nile given How and about, P(Nile | How, about), which is called a trigram, and so on.
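The bigram probabilities can be estimated with simple maximum-likelihood counts; a sketch with a toy corpus (the corpus here is invented for illustration):

```python
from collections import Counter

corpus = ("how about traveling to egypt and see the nile river . "
          "how about a trip to the nile .")
words = corpus.split()

unigrams = Counter(words)                 # count(w1)
bigrams = Counter(zip(words, words[1:]))  # count(w1, w2)

def p(w2, w1):
    """Maximum-likelihood bigram probability P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]
```

In this toy corpus every occurrence of "how" is followed by "about", so P(about | how) = 1 while P(nile | how) = 0, which is the contrast described above.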


Tools Used:

  • The HMM model is created using creately.com
  • The tables and equations images are built using Microsoft Word 2013

[Week 2] Signals and Systems

Things are getting more exciting!

The week is divided into three segments. The first segment introduced some 2D and 3D discrete signals, such as images and videos, explaining the meaning of the Unit Impulse and Unit Step discrete signals; the segment also shows when multiple signals can be separated.

The second segment introduced the 2-dimensional complex exponential signal and how to build blocks of signals that have the same frequencies.

The third segment was about 2D convolution examples, in which some filters are used to manipulate the original image to produce an output image.

2D and 3D Discrete Signals

  • 2D discrete signals are signals that depend on 2 variables, each one of them has its own minimum and maximum values.
  • Example: Images
  • 2D Image consists of pixels, each pixel has its x-y coordinate in the image
  • The pixel holds 3 combined values which represent the RED, GREEN and BLUE values
  • If the image is Gray-scaled, this means that RED == GREEN == BLUE
  • 3D discrete signals have an additional variable
  • Videos could be considered 3D discrete signals because the dimensions are x, y, z, where z is the frame number in the video, and x, y are the dimensions of the frame


Discrete Unit Impulse

  • An impulse signal is a function whose value is zero everywhere except at zero
  • So, the 2D discrete unit impulse is 1 only when both of its indices are zero
  • The discrete unit impulse over indices n1, n2 is δ[n1, n2] = 1 if n1 = 0 and n2 = 0, and 0 otherwise
  • A signal is called "Separable" if x[n1, n2] = x1[n1] * x2[n2]

Discrete Unit Step

  • The 2D discrete unit step is 1 if both of its indices are greater than or equal to zero
  • The discrete unit step over indices n1, n2 is u[n1, n2] = 1 if n1 >= 0 and n2 >= 0, and 0 otherwise

Complex Exponential Signals

  • The 2D complex exponential signal is defined as x[n1, n2] = e^(j(ω1·n1 + ω2·n2))
  • where ω1 and ω2 determine the periodicity of the signal along n1 and n2 respectively
  • According to Euler's formula, e^(jθ) = cos(θ) + j·sin(θ)
  • The signal is periodic when each of ω1, ω2 is a rational multiple of 2π

2D Systems

  • Systems take x[n1, n2] as input and, using some operations or filters T[·], produce an output y[n1, n2] = T[x[n1, n2]]
  • There are 3 kinds of systems: Linear Systems, Spatially Invariant Systems, and Linear and Spatially Invariant Systems

Linear System

  • Given the input Σk ak·xk[n1, n2], where each ak is a weight
  • The system is called a "Linear System" if the corresponding output is Σk ak·yk[n1, n2], where yk is the output for input xk

Spatially Invariant Systems

  • Given y[n1, n2] = T[x[n1, n2]]
  • The system is called "Spatially Invariant" if T[x[n1 - m1, n2 - m2]] = y[n1 - m1, n2 - m2] for any shift (m1, m2)

Linear and Spatially Invariant Systems

  • Such a system depends on the impulse response of the sight signal; the sight signal could be obtained from devices such as a camera or a telescope
  • Let h[n1, n2] be the impulse response of the system
  • Now, for any given input x[n1, n2], the output is y[n1, n2] = x[n1, n2] ** h[n1, n2], where ** is 2D discrete convolution


Examples for 2D convolution

  • Noise Reduction
  • Edge Detection
  • Sharpen
  • Blur
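A minimal sketch of "valid" 2D convolution in Python (the averaging/blur kernel and the tiny image are assumed examples, not taken from the lectures):

```python
def convolve2d_valid(image, kernel):
    """2D convolution over the 'valid' region (no padding).
    The kernel is flipped in both directions, as convolution requires."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for r in range(ih - kh + 1):
        out.append([
            sum(flipped[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(iw - kw + 1)
        ])
    return out

# A 3x3 averaging (blur) kernel applied to a constant image leaves
# the valid region unchanged, which is an easy sanity check.
blur = [[1 / 9] * 3 for _ in range(3)]
image = [[9.0] * 4 for _ in range(4)]
result = convolve2d_valid(image, blur)
```

Swapping the kernel for an edge-detection or sharpening mask gives the other filter examples listed above.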



[Week 1] Introduction to Image and Video Processing

An introduction to the course. The week focuses on some definitions and basics about the different kinds of signals that we deal with in our lives.

Def Signal: In Electrical Engineering, a signal is a function that contains some information about an event that occurs in discrete/continuous time.


(Figures: heart signals, human voice)

Analog vs. Digital Signals

  • To do any processing of a signal using a device (computer, mobile, etc.), we need to convert the signal coming from the source, which is in an Analog form, to a Digital form
  • To convert the signal to digital form, the device transforms the acoustic signal to an electrical signal
  • Def Acoustic Signal: the noise (voice, sound) that humans/animals produce
  • Def Electrical Signal: the signal generated by the microphone by converting the sound signal to a voltage signal
  • The electrical signal is transformed to a digital signal using sampling and quantization
  • Def Sampling: transform the continuous signal to a discrete signal by taking samples from the input
  • Def Quantization: get the amplitude of the discrete signal by mapping the large set of values to a smaller, countable set
  • After the processing, we reverse the previous steps to convert the digital signal back to analog so it can be heard


Images Classification

  • Reflection Images: the image information originates at the object's surface, example {Radar}
  • Emission Images: the image information originates inside the object, example {Infrared}
  • Absorption Images: the image information originates in the internal structure of the object, example


Introduction to Reinforcement Learning

Hey, what's up guys?

Today is the beginning of a series about an interesting field of science and new for me. The series is about Reinforcement Learning.

The reason why I started to learn this topic is that I knew about supervised and unsupervised learning through Andrew Ng's Machine Learning course. I thought that these two techniques were the only available approaches for machine learning (aside from Evolutionary and Swarm methods), but later I found that there are other approaches, like Semi-supervised and Reinforcement Learning, so I decided to go through Reinforcement Learning to know more about it and its algorithms.

In this series I will explain what I learnt about Reinforcement Learning and mention the resources I used during my study, along with my code implementation. The main reasons for this blogging are that I didn't find many blogs talking about this field, it will motivate me to finish what I want to learn, and it also helps me memorize what I studied; you can consider these posts as documentation for each chapter.

Now, let鈥檚 get started.

Chapter 1: Introduction to Reinforcement Learning

Before we get into our topic, we need to talk about Supervised and Unsupervised Learning and mention the differences between each of these 2 techniques as they are very important to define the meaning of Reinforcement learning.

In supervised learning, we have training data (which could be images or numerical values) consisting of training examples obtained beforehand. The training data is used to train the classifier, which could be any known classifier such as SVM, KNN, etc., or we may use Neural Nets. The classifier uses this labeled data to help identify and classify new data (known as test data). The main factor in classification is extracting the object's features (known as the feature vector), which are considered the descriptors of the object.

In unsupervised learning, you don't have training data: you are given a set of unlabeled data, and the task is to cluster these data into separate classes. Clustering is based on the features of the given items; the well-known algorithm for clustering is the K-means algorithm and its variants. The famous Hidden Markov Models (HMMs) are used in unsupervised learning too.

On the other hand, reinforcement learning is simply learning what to do to get the best benefit in the current situation; the agent has the ability to learn from the current environment after several trials. Back in supervised learning, you will find that training data is a must for classification, and that isn't practical in some interactive problems in which the agent should decide what to do at run time in the current environment. The solution is that the agent should have the ability to learn from its experience instead of using training data, and that is what reinforcement learning introduces.

As with supervised and unsupervised learning, reinforcement learning has some basic characteristics that define and describe its process; we can summarize them in 3 points:-

  • Trials: the agent uses many trials in the same environment to obtain and predict the opponent's behavior (like in a chess game)
  • Error: after each situation or state in the environment, the agent calculates the error in its search so as to minimize it in similar situations
  • Reward: the benefit the agent will obtain after performing the action

Also, the agent which uses reinforcement learning has two important choices when choosing its action or prediction:-

  • Exploitation: The ability to use a previous successful experience in the current situation
  • Exploration: The ability to explore other solutions or paths which could lead to better reward

When we build an agent that uses reinforcement learning to perform some actions, we first need to define some elements that the agent uses to learn in the environment. Scientists identified 4 main elements for the learning system:-

  • Policy
  • Reward Function
  • Value Function
  • Model (optional)

We will talk about each of these elements, and to fully understand these elements we will use Tic-Tac-Toe as an example.

Policy defines the agent's behavior and attitude in the environment in a specific state. In Tic-Tac-Toe, the agent should play 'X' in each of its turns; this is the policy of the agent. Given this policy, an estimation should be made of the winning probability for each move.

Reward Function defines the benefit or desirability of the next state (only) after the agent's move; the higher the reward, the better the solution. In Tic-Tac-Toe, when the agent plays 'X', it estimates the reward which will be obtained after being in the resultant state; in other words, a higher reward means a good move, and a lower reward indicates a bad move.

Value Function defines the total reward in the long run over all the states, which is different from the Reward Function, which focuses on the next state only. From that, we can conclude that the Value Function is more important than the Reward Function.

Model is an optional element when we build an agent and only appears in Planning, which is used in games. The model predicts the next state of the current environment given the current state and the action the agent will perform. This simulation or prediction affects the decision-making process of the agent.

Note: A good agent may accept a lower reward in a specific state if that state will lead it to other higher-rewarded states, which could give a better Value Function.
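This trade-off is what the value function captures in practice. A minimal sketch of the classic temporal-difference backup used for Tic-Tac-Toe-style agents, V(s) <- V(s) + α·(V(s') - V(s)); the state names and values here are made up for illustration:

```python
# Estimated winning probabilities for two example states (made-up numbers).
V = {'b': 0.5, 'c': 0.8}
alpha = 0.1  # step-size / learning rate

def backup(V, s, s_next, alpha):
    """Move V(s) a fraction alpha toward V(s_next)."""
    V[s] += alpha * (V[s_next] - V[s])

backup(V, 'b', 'c', alpha)  # after moving from state b to state c
```

After the backup, the estimate for 'b' rises toward that of the better state 'c', so earlier states gradually inherit the value of the good positions they lead to.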

The following diagram shows an agent in Tic-Tac-Toe using reinforcement learning to perform its next moves:


  • The opponent started the game at point 'a', turning the game state to state 'b'
  • In state 'b' the agent checked all possible states, presented as dashed lines, then selected one of them (the policy), presented as state 'c'
  • The opponent played its move, turning the game to state 'd'
  • In state 'd', the agent made an exploratory move that led it to state 'e*', but the agent found that it wasn't a good state and wouldn't contribute to learning
  • The agent backtracked and chose another state 'e'
  • The curved lines denote a backup of the state; the reason for that is to get a better winning estimation after each move, so the backup is an option for the agent


In the next chapter, we will talk about one of the reinforcement learning applications.



Data Normalization and Standardization for Neural Networks Output Classification


  • Data standardization (encoding) for data input
    – Binary
    – Positive-Negative
    – Manhattan & Euclidean
    – Categories encoding
  • Standardization vs. Normalization
    – Min-Max Normalization
    – Gaussian Normalization
  • Data Decoding
    – Softmax activation function
    – Mean Squared Error
    – Entropy (Information Theory)
    – Mean Cross Entropy
  • MSE vs. MCE

This article assumes you have a pretty good knowledge about Neural Networks and some basic algorithms like Backpropagation.

When you work on neural networks, you always find yourself dealing with numeric data; basically, neural networks can only operate on numeric data. In algorithms such as backpropagation, or when you simulate a perceptron, you always use functions or equations to calculate your output. When you build your network, you use matrices to represent the biases and the layer-to-layer values; after you get the output, you estimate the error and try to reach the most optimal solution. As you see, you always use numeric data.

But what about other data types? Could neural networks be built to make a good prediction or produce an optimal output given data like "food", "location" or "gender"?

The solution is to encode the non-numerical data and normalize it so it can be represented as numeric data; this operation is called "Data Encoding and Decoding", and the name "Data Standardization" is used too. Suppose your input data looks like "ID", "Name", "Nationality", "Gender", "Preferred food" and "Wage". For clarification, the following table represents a training set; each ith row represents the ith person.

ID | Name  | Gender | Nationality | Preferred food | Wage
1  | Hani  | Male   | French      | Rice           | 30
2  | Saad  | Male   | Italian     | Pizza          | 40
3  | Sofia | Female | Russian     | Spaghetti      | 15

To make it easy, I will represent the above table in a simple 2D matrix

I will take each column, then check if it contains numeric data or not.

The first column is "ID". The ID is just a unique identifier for each item in the training set, so this column isn't involved in the process.

The second column is the person's name; it isn't involved in the process either, but you need to map each ID to its name and use the ID instead of the name to avoid duplicated names in the training set.

The third column is the Gender. Obviously, the Gender type has only 2 values, Male or Female, and you have 3 ways to encode this data:

  • 0-1 Encoding (Binary method)
    – You are free to choose: male is set to 0 and female to 1, or vice versa
  • +1 -1 Encoding
    – If a 0 value in your input would cause trouble, use +1 and -1 values instead
  • Manhattan Encoding (x-y axis)
    – Simply use a pair of values; this method is good when you deal with more than 2 values. In our Gender case, we set male to [0, 1] and female to [1, 0], or vice versa. This method works perfectly with up to 4 possible values; if there are more than 4 possible values, you can include -1 in your encoding, which is called Euclidean Encoding.

The fourth column represents the person's nationality. There is no single standard approach for this kind of type: a person's nationality is represented as a string value. Some of you might encode each character of the string to ASCII and then use some constant to normalize the data, but the method I prefer, which gives me flexibility when working on this kind of data, is using matrices.

The matrices approach depends on the number of possible values of the category, so first of all we count the number of different values in the column. Back to our training set, we see there are only 3 nationalities in the table. Let's create a matrix N that will hold the column values; the size of the matrix is m x 1, where m is the number of different values in the input data, in our case m = 3.


We will create a new identity matrix A of the same size as N


Let N = A.


Now we will set each nationality values to its corresponding row, thus

French = [1, 0, 0].
Italian = [0, 1, 0].
Russian = [0, 0, 1].

The only thing remaining is to replace each nationality string with its corresponding vector value.
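The identity-matrix approach above amounts to one-hot encoding; a sketch in Python:

```python
def one_hot_encode(values):
    """Map each distinct value to one row of an m x m identity matrix."""
    distinct = sorted(set(values))
    return {v: [1 if i == j else 0 for j in range(len(distinct))]
            for i, v in enumerate(distinct)}

nationalities = ['French', 'Italian', 'Russian']
codes = one_hot_encode(nationalities)
# codes['French'] is [1, 0, 0], matching the rows listed above
```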

This approach works perfectly with small and medium input data, and acceptably with large amounts of input data.


Pros:-

  • Dynamic
  • Easy to understand
  • Easy to code


Cons:-

  • Memory consumption
  • Complicated when dealing with very huge amounts of data
  • Bad performance with huge amounts of data

The fifth column is about the preferred food of each person. This will be treated like the previous column, using the matrices approach and counting the number of different foods in the input. There are some special cases that could appear in this column with other input data; we will talk about them in another post.

The last column is the wage. As you see, this column already uses numeric data to present the wage value per day of each person (you don't say!). Will we leave the values without doing any processing on them? The answer is no; we shall normalize these values to suit the other previous values. Experience shows that normalizing numeric data produces better output than leaving it unnormalized. But how will we normalize these values?

Before we answer this question, we first need to know the exact difference between "Normalization" and "Standardization".

The relation between Normalization and Standardization is like the relation between Recursion and Backtracking: any Backtracking is Recursion, but not every Recursion is Backtracking. Likewise, any Standardization is Normalization, but not every Normalization is Standardization. To clarify, we need the definition of each of them.

In statistics, normalizing data commonly means mapping values into a fixed interval, usually [0, 1] or [-1, 1]. For example, in RGB the basic value of each color channel runs from 0 to 255: a value could be 55, 40, etc., but it can't exceed 255 or go below 0. Suppose we want to normalize the color values into the interval [0, 1]. The most common method for that is:

  • Min-Max Normalization

In Min-Max normalization, we use the formula

x' = (x - min) / (max - min)

The resulting value won't exceed 1 or go below 0; you can use this method whenever you want to map a value into the range [0, 1].
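The RGB example above can be worked through in a couple of lines; the sample values here are illustrative:

```python
# Min-max normalization sketch: map each value into [0, 1]
values = [55, 40, 200, 0, 255]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# The smallest value maps to 0.0 and the largest to 1.0;
# everything else falls strictly in between.
print(normalized)
```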

For [-1, 1], we use the following formula if we want the data centered around 0:

x' = 2 * (x - min) / (max - min) - 1
Standardization is much the same idea, but it computes the Z-score: the data is transformed to have mean 0 and variance 1. The result is close to zero when the value is close to the mean. It is done using the Gaussian normalization equation

z = (x - mean) / (standard deviation)
Check the following figure (from Wikipedia):-


As you can see, there isn't much difference between Standardization and Normalization; experience shows that Gaussian normalization gives better output than Min-Max normalization.

Back to our table. As I said, it is better to standardize the numeric values of "Wage", so I will use Gaussian normalization.

Step 1: Compute the mean of the wage column; here it is 28.33.

Step 2: Compute the standard deviation of the wage column.

Step 3: Take each wage value and apply the Gaussian normalization equation.

For the wage value 30 the result is 0.09, which is near zero because 30 is very close to the mean 28.33. The value is also positive because the wage is above the mean; if it were below 28.33, the normalized value would be negative. That's why Gaussian normalization is better than Min-Max: it gives you more information about the true value.
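The three steps can be sketched as follows; the wage values are illustrative (chosen so their mean matches the 28.33 used above), not the original table:

```python
import math

# Gaussian (z-score) normalization sketch
wages = [30.0, 25.0, 10.0, 55.0, 21.65]

# Step 1: mean of the column
mean = sum(wages) / len(wages)            # 28.33

# Step 2: standard deviation of the column
variance = sum((w - mean) ** 2 for w in wages) / len(wages)
std = math.sqrt(variance)

# Step 3: z-score for each value; values near the mean land near 0
z_scores = [(w - mean) / std for w in wages]
```

By construction the z-scores themselves have mean 0 and variance 1.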

Now, let's put all of the above together to get the normalized input data for the neural network.

Data Decoding

Everything we have done so far was about normalizing the input data and encoding category values as numbers to suit neural networks. But what about the output? Obviously, the output for encoded input will be in encoded form too. So how can we get the right prediction or classification from the output? In other words, how can we decode the output back to its original form?

The best way to understand what we will do next is by example. In our previous table, we assigned an encoded value to each nationality:

French = [1, 0, 0].
Italian = [0, 1, 0].
Russian = [0, 0, 1].

Assume you get sample outputs like [0.6, 0.3, 0.1], [0.1, 0.7, 0.2] and [0.0, 0.3, 0.7] for the targets French, Italian and Russian respectively. How do you validate this example?

In this case, you can treat the values of each vector as probabilities: assume the values are in the range [0, 1] and that the values of each vector sum to 1. From this assumption we can easily read off the predicted class.

Output                        Nationality

[0.6, 0.3, 0.1]               [1, 0, 0]
[0.1, 0.7, 0.2]               [0, 1, 0]
[0.0, 0.3, 0.7]               [0, 0, 1]

Compare each output vector with its corresponding nationality vector. In the first case the value 0.6 is closer to 1 than 0.3 and 0.1, which shows the prediction is correct. The same holds for the second and third cases, so we can conclude the output makes good predictions.
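This decoding rule (take the position of the largest value) can be sketched directly; the labels and outputs are the ones from the table above:

```python
# Decode a network output vector back to its label by picking the
# position of the value closest to 1 (the largest value).
labels = ["French", "Italian", "Russian"]
outputs = [[0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]]

def decode(vector):
    best = vector.index(max(vector))   # index of the largest value
    return labels[best]

predictions = [decode(o) for o in outputs]
print(predictions)  # ['French', 'Italian', 'Russian']
```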

The previous approach works when the output values lie within [0, 1]. But what can we do with an output like [3.0, 2.5, 7.0], where all the values are greater than 1? You can't use the previous method here, so what shall we do?

The solution is the softmax activation function. Applying it transforms the output values into probabilities, so the new values lie in [0, 1] and sum to 1. The formula is

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
Let's calculate the new vector for [3.0, 2.5, 7.0] using the above equation.

Step 1: Compute the exponentials: e^3.0 ≈ 20.09, e^2.5 ≈ 12.18, e^7.0 ≈ 1096.63, and their sum ≈ 1128.9.

Step 2: Divide each exponential by the sum.

So the new output vector is [0.017, 0.010, 0.971], and now you can use this vector for your classification.

There is a lot of math behind the softmax function; you can read more about it if you are interested.

def softMax(self, output):
    # Exponentiate each raw output value (requires: import math),
    # then normalize so the results sum to 1
    exps = [math.exp(output[i]) for i in range(self.numberOfOutputs)]
    total = sum(exps)

    return [exps[i] / total for i in range(self.numberOfOutputs)]


All our assumptions so far presume that the neural network output is always correct, i.e. that the output always matches the target. In practice this isn't always true; you may face something like:

Output                        Target

[0.3, 0.3, 0.4]               [1, 0, 0]

According to the method we used, the last value in the output vector is the nearest to 1, but that doesn't match our target vector, so there is an error in the prediction or classification. It is important to compute this output error: it helps you improve the network's training and tells you how efficient your network is. To compute the error, there are two common approaches:

  • Mean Squared Error
  • Mean Cross Entropy

We will talk about each of them; let's begin with Mean Squared Error (MSE).

From Wikipedia, the definition of MSE is: "the mean squared error (MSE) of an estimator measures the average of the squares of the 'errors', that is, the difference between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss."

To see what that means, let's create some training items and their supposed target values:

Output                        Target

[0.3, 0.3, 0.4]               [1, 0, 0]
[0.2, 0.3, 0.5]               [0, 1, 0]


First, let's get the sum of squared differences between the two vectors of the first training item:

(1 - 0.3)^2 + (0 - 0.3)^2 + (0 - 0.4)^2 = 0.49 + 0.09 + 0.16 = 0.74

The second training item:

(0 - 0.2)^2 + (1 - 0.3)^2 + (0 - 0.5)^2 = 0.04 + 0.49 + 0.25 = 0.78

Finally, let's take the average of these sums:

MSE = (0.74 + 0.78) / 2 = 0.76

As you can see, the error is high, indicating that the predictions are far from correct; this should guide you to train your network more.
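The arithmetic above can be double-checked with a short sketch (variable names are mine):

```python
# MSE: sum of squared differences per training item, then the
# average over all training items.
items = [
    ([0.3, 0.3, 0.4], [1, 0, 0]),   # (output, target)
    ([0.2, 0.3, 0.5], [0, 1, 0]),
]

sums = [sum((t - o) ** 2 for o, t in zip(output, target))
        for output, target in items]
mse = sum(sums) / len(items)

print([round(s, 2) for s in sums])  # [0.74, 0.78]
print(round(mse, 2))                # 0.76
```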

Let's see the other approach, which I prefer: Mean Cross Entropy (MCE). Before we work through examples, I'd like to introduce the concept of "Entropy"; if that bores you, you can skip ahead to the MCE equation and example.

In Information Theory, quantification measures the amount of information you can gain from an event or sample, and that information feeds into your decisions. Assume you are building a system that sends and receives data, for example Skype: you send data (speech) and receive data. How do you determine the best encoding for those voice signals? You decide once you know something about the data. For example, if your input will only ever be "Yes" or "No", you can encode it with a single bit, so the Entropy of this case is 1: the minimum number of bits needed to encode it. There are other kinds of information you can obtain from an event, such as its probability distribution, which I will focus on.

So Entropy is the key measurement of the average number of bits needed to encode an event; the more information we can get from an event, the higher the Entropy value we should expect.

As I said before, the probability distribution of an event is valuable information, and we can use Entropy to evaluate it: the Entropy result measures the randomness in the probability distribution. This evaluation matters when you want to judge how far a distribution is from uniform. Assume you have variables X and Y with actual distributions [0.3, 0.2, 0.1, 0.4] and [0.5, 0.5] respectively. To calculate the Entropy of X, E(X), we use

E(p) = -Σ p_i * log2(p_i)

E(X) = -(0.3 log2 0.3 + 0.2 log2 0.2 + 0.1 log2 0.1 + 0.4 log2 0.4) ≈ 1.85
E(Y) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1

As you can see, the Entropy of distribution X is 1.85 while Y's is 1, because there is more randomness (more information) in X's distribution than in Y's (less information).
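The entropy values above can be reproduced with a small sketch (the function name is mine):

```python
import math

# Entropy of a probability distribution: E(p) = -sum(p_i * log2(p_i))
def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

X = [0.3, 0.2, 0.1, 0.4]
Y = [0.5, 0.5]

print(round(entropy(X), 2))  # 1.85
print(entropy(Y))            # 1.0
```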

You can also use Entropy to compare two or more probability models; the comparison shows how close or far each model is from your target model. Assume you have a variable T with actual probability distribution [0.2, 0.1, 0.7], and two probability models X and Y with distributions [0.3, 0.1, 0.6] and [0.3, 0.3, 0.4] respectively. You want the model nearest to your target model T. The first step is to calculate the target's Entropy:

E(T) = -(0.2 log2 0.2 + 0.1 log2 0.1 + 0.7 log2 0.7) ≈ 1.16
The second step is to calculate the Cross Entropy (CE). CE is a variant of the Entropy function that estimates how close a model B is to a model A:

CE(A, B) = -Σ a_i * log2(b_i)
Let's estimate how close model X is to model T:

CE(T, X) = -(0.2 log2 0.3 + 0.1 log2 0.1 + 0.7 log2 0.6) ≈ 1.20

And model Y with T:

CE(T, Y) = -(0.2 log2 0.3 + 0.1 log2 0.3 + 0.7 log2 0.4) ≈ 1.45
Since CE(T, X) is smaller, we observe that model X is much closer to T than model Y.
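The model comparison can be sketched the same way (function and variable names are mine):

```python
import math

# Cross entropy: CE(A, B) = -sum(a_i * log2(b_i)); the model closer
# to the target has the smaller cross entropy.
def cross_entropy(target, model):
    return -sum(a * math.log2(b) for a, b in zip(target, model))

T = [0.2, 0.1, 0.7]
X = [0.3, 0.1, 0.6]
Y = [0.3, 0.3, 0.4]

print(round(cross_entropy(T, X), 2))  # 1.2
print(round(cross_entropy(T, Y), 2))  # 1.45
```

Model X wins because its cross entropy against T is lower.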

There are many other variants of Entropy; I only mentioned the ones you are likely to use as a programmer in your applications. If you are interested in Entropy as a key concept of Information Theory, search for it and you will find good papers on the subject.

Mean Cross Entropy (MCE) is my preferred approach for computing the error of a neural network's category outputs. I will use the same example as for MSE.

Output                        Target

[0.3, 0.3, 0.4]               [1, 0, 0]
[0.2, 0.3, 0.5]               [0, 1, 0]

MCE measures on average how far the neural network output is from the target output. The MCE formula is

MCE = -(1/n) Σ_items Σ_i t_i * log2(o_i)

where n is the number of training items, t is the target vector and o is the output vector.
Let's begin with the first training item:

-(1 · log2 0.3 + 0 · log2 0.3 + 0 · log2 0.4) = -log2 0.3 ≈ 1.74

The second training item:

-(0 · log2 0.2 + 1 · log2 0.3 + 0 · log2 0.5) = -log2 0.3 ≈ 1.74

The MCE calculation:

MCE = (1.74 + 1.74) / 2 ≈ 1.74
A perfect prediction gives an MCE of 0 (for a one-hot target, -log2 1 = 0), so an MCE of about 1.74 indicates a large error.

If you don't feel you understood this well, please go back above, where you will find all the information you need to fully understand the MCE approach.


So which one should you use? Well, in machine learning the answer is always "it depends on the problem itself", but note that both of them affect the gradient of the backpropagation training.

Here is an implementation of both methods:

def getMeanSquaredError(self, trueTheta, output):
    # Sum of squared differences between the target and output vectors,
    # averaged over the outputs of a single training item
    sumOfSquares = 0.0

    for i in range(0, self.numberOfOutputs):
        sumOfSquares += pow(trueTheta[i] - output[i], 2)

    return sumOfSquares / self.numberOfOutputs

def getMeanCrossEntropy(self, trueTheta, output):
    # Cross entropy of the output against the target: -sum(t_i * log2(o_i));
    # averaging this over all training items gives the MCE
    sum = 0.0

    for i in range(0, self.numberOfOutputs):
        sum += trueTheta[i] * math.log2(output[i])

    return -1.0 * sum

Stanford’s Machine Learning Course
James McCaffrey Neural Networks Book

Facade Pattern

This is one of the most important patterns to know when you are working on huge systems that have many objects to interact with.


Pattern’s Idea:-

You may find yourself dealing with many objects in your application, and it would be good to have a way to deal with all the system's objects through a few of them. That is the main motivation behind the Facade pattern. As the word "Facade" suggests, this pattern builds a layer between the programmer and the rest of the system's objects, so the programmer interacts with those objects through only one object.


To better understand this pattern, assume you want to build a form application. The form may contain labels, text boxes and buttons, and each of these handlers has some properties you can change.

public abstract class Design
{
    public abstract void setPosition(double x, double y);
    public abstract void setWidth(double x);
    public abstract void setHeight(double y);
    public abstract void setColor(int r, int g, int b);
}

We have an abstract class with the handlers' common properties; we will override these methods in each handler's class.

public class TextBox : Design
{
    public override void setPosition(double x, double y)
    {
        Console.WriteLine("TextBox has been set at " + x + " " + y);
    }

    public override void setWidth(double x)
    {
        Console.WriteLine("TextBox width is " + x + " now");
    }

    public override void setHeight(double y)
    {
        Console.WriteLine("TextBox Height is " + y + " now");
    }

    public override void setColor(int r, int g, int b)
    {
        Console.WriteLine("TextBox Color is set!");
    }
}

public class Button : Design
{
    public override void setPosition(double x, double y)
    {
        Console.WriteLine("Button has been set at " + x + " " + y);
    }

    public override void setWidth(double x)
    {
        Console.WriteLine("Button width is " + x + " now");
    }

    public override void setHeight(double y)
    {
        Console.WriteLine("Button Height is " + y + " now");
    }

    public override void setColor(int r, int g, int b)
    {
        Console.WriteLine("Button Color is set!");
    }
}

The Structure:-


Let’s create a Facade class that considered to be a layer which enables the programmer to get access to each handler through this class.

public class GraphicalUserInterface
{
    private Design textBox;
    private Design button;
    private Design label;

    public GraphicalUserInterface()
    {
        this.textBox = new TextBox();
        this.label = new Label();   // Label is implemented like TextBox and Button
        this.button = new Button();
    }

    public void createTextBox()
    {
        this.textBox.setPosition(1.0, 1.0);
    }

    public void createLabel()
    {
        this.label.setPosition(2.0, 2.0);
    }

    public void createButton()
    {
        this.button.setPosition(3.0, 3.0);
    }
}

Finally, create the facade object and use it to reach the other objects:

static void Main(string[] args)
{
    GraphicalUserInterface buildGUI = new GraphicalUserInterface();

    buildGUI.createTextBox();
    buildGUI.createLabel();
    buildGUI.createButton();
}



Pattern's Benefits:-

[1] Reduces the number of objects the client deals with
[2] Simplifies huge systems into subsystems

You can see the full implementation HERE

Design Patterns Explained Book