# [Thesis Tutorials I] Understanding Word2vec for Word Embedding I

Vector Space Models (VSMs): A concept in Natural Language Processing in which words are represented as vectors, and these vectors become the new identifiers of the words. The vectors are fed to mathematical and statistical models for classification and regression tasks, so each word's vector must be unique in order to tell the words apart when they are passed on to those models.

Word Embedding: A technique that represents words by fixed-size vectors, so that words with similar or close meanings have close vectors (i.e. vectors whose Euclidean distance is small). In other words, the representation captures the semantic meaning of the words, which is very important in many areas and tasks of Natural Language Processing, such as Language Modeling and Machine Translation.

Shallow Neural Networks: Neural networks that contain only 1 hidden layer. This layer acts as a projection of the input layer, so the hidden layer is the new feature representation of the raw input data.

Basically, when we work with statistical and mathematical machine learning models, we always need numerical data to produce results. Models such as classifiers, regressors and clustering algorithms operate on numerical features. But how can this be done with text data? How can we represent words by numerical features so that they can be fed to our models?

One-Hot-Encoding

Traditionally, this problem is solved with a popular approach called One-Hot-Encoding, in which each word in our corpus gets a unique vector that becomes the new representation of the word.

For example, assume that we have the following 2 sentences:
“I like playing football”
“I like playing basketball”

First, we need to extract the unique words of the data. In our case, the unique words are the set V = {I, like, playing, football, basketball} and ||V|| = 5. Note that you need to normalize the words by converting them all to lower-case or upper-case (e.g. football is the same as Football); we don’t care about letter case in this representation.

Now, we create a vector of all zeros whose size is the number of unique words ||V||, and each word gets a 1 at its own unique index of that zero vector. So, our word representations will be

I = [1, 0, 0, 0, 0], like = [0, 1, 0, 0, 0], playing = [0, 0, 1, 0, 0], football = [0, 0, 0, 1, 0], basketball = [0, 0, 0, 0, 1].
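The encoding above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the helper names `build_vocabulary` and `one_hot` are made up for this example, and splitting on whitespace is an assumption that works for these two sentences only.

```python
def build_vocabulary(sentences):
    """Return the unique lower-cased words in order of first appearance."""
    vocab = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word.lower())] = 1
    return vector


sentences = ["I like playing football", "I like playing basketball"]
vocab = build_vocabulary(sentences)
print(vocab)                       # ['i', 'like', 'playing', 'football', 'basketball']
print(one_hot("football", vocab))  # [0, 0, 0, 1, 0]
```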

As you see, it is very simple and straightforward, and it is easy to implement. But you have surely noticed its cons. The vector size depends directly on the vocabulary size, so if we have a corpus with 1M unique words, each word vector has 1M dimensions too, which is very bad for performance. Also, the representation doesn’t capture the semantic relations between the words: it only transforms the words into a Vector Space Model where every word has a unique vector, regardless of the distances between them.

But you can use this method if the problem you want to solve isn’t closely related to semantic relations or meaning, and if you know that your vocabulary isn’t large.

Word2vec Basics

Word2vec is a technique, or a paradigm, that consists of a group of models (Skip-gram and Continuous Bag of Words (CBOW)). The target of each model is to produce fixed-size vectors for the corpus words, so that words with similar or close meanings have close vectors (i.e. vectors whose Euclidean distance is small).

The idea of word2vec is to represent words using the surrounding words of that word. For example, assume that we have the following sentence:

“I like playing football”

To learn a vector for the word “playing”, we need to be able to predict its surrounding words from “playing” itself. In this case, the surrounding words are “like” and “football”. Note that the number of surrounding words can be increased according to the window size you set: the surrounding words are the N words before and after the input word, where N is the window size.
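As a rough sketch of the window idea, the following Python snippet lists the surrounding words of every word in a sentence. The function name `context_pairs` is hypothetical, and whitespace tokenization with lower-casing is an assumption made for illustration.

```python
def context_pairs(sentence, window_size):
    """Return a (target, context_words) pair for every position in the sentence."""
    words = sentence.lower().split()
    pairs = []
    for i, target in enumerate(words):
        left = words[max(0, i - window_size):i]    # up to N words before
        right = words[i + 1:i + 1 + window_size]   # up to N words after
        pairs.append((target, left + right))
    return pairs


pairs = context_pairs("I like playing football", window_size=1)
print(pairs[2])  # ('playing', ['like', 'football'])
```

Words at the edges of the sentence simply get fewer context words, which is one common way to handle the boundaries.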

Word2vec Philosophy

Well, this idea is rather philosophical. If you think about it, you will realize that we humans learn the meaning of words from their nearby words. For example, assume there is an unknown word that you haven’t seen before, let’s call it “X”, and we have the following sentence:

“I like playing X”

So, you may not know the meaning of X, but you surely know that “X” is something we can play, and that it is something some people enjoy. Your brain reaches this conclusion after looking at X’s surrounding words, “like” and “playing”. This is the idea behind the word2vec technique.

In conclusion, word2vec uses the context of the text to produce the word vectors. These word vectors consist of floating point values that are obtained during the training phase of the models, Skip-gram and CBOW. Both models are Shallow Neural Networks; they do the same job, but they differ in their inputs and outputs.

Dataset

Before getting into the word2vec models, we need to define some sentences as a corpus to help us in understanding how the models work. So, our corpus is:

D = {
“This battle will be my masterpiece”, // LoL players will relate 😛
“The unseen blade is the deadliest”
};

As we did in One-Hot-Encoding, we need to extract the unique words and initialize their vectors as sparse vectors. We need to do that to be able to use these words in the Neural Networks.

V = {this, battle, will, be, my, masterpiece, the, unseen, blade, is, deadliest}, ||V|| = 11,

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] = this
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] = battle
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] = will
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0] = be
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] = my
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] = masterpiece
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] = the
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0] = unseen
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] = blade
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] = is
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] = deadliest

Skip-gram Model

Skip-gram is a Shallow Neural Network. The target of this model is to predict the nearby words given a target word. So, the input of this network is a 1-hot-encoded vector of the target word, and the output is N words, where N is the window size of the context, which is determined by the network creator. In the above figure, the number of words to be predicted is 3, so we have 3 Softmax layers, and the number of neurons in each layer is the vocabulary size.

Let’s see how the model works with the sentence “The unseen blade is the deadliest”. Assume that the target word whose representation we want is “unseen” and the nearby words are “The”, “blade” and “is”. Note that our embedding vectors will have a fixed size of 3. Also, the weights are initialized with random values like in any other neural network; typically, they are drawn from a Gaussian distribution with known mean and variance.

Input-to-hidden

After feedforwarding the input word, we get the hidden layer values [0.8, 0.4, 0.5]. This is the first representation obtained for the word “unseen”. And what about the other words of the vocabulary? The rows of the weight matrix are the representations of the words (i.e. each row of the weight matrix is the vector of one word). For example, given the weight matrix, the representation of the word “this” will be [0.1, 0.9, 0.3], and so on for the other words.

Hidden-to-output

Now, we can apply Softmax activation function to each of these vectors, so that we will get the output words given the target word “unseen”.

We take the maximum value of each vector and look up the corresponding 1-hot-encoded word: the index of the maximum value is the index of the “1” in the 1-hot-encoded vector. According to these vectors, the predicted words are “is”, “deadliest” and “unseen”.
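The forward pass described above can be sketched in plain Python. This is only an illustration under stated assumptions: the weights are drawn randomly here (the values from the figures are not reproduced), so the predicted words will be arbitrary and will not match the ones in the article. One hidden-to-output matrix per Softmax layer follows the description in this post; the names `w_in`, `w_out` and `forward` are made up for this sketch.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab = ["this", "battle", "will", "be", "my", "masterpiece",
         "the", "unseen", "blade", "is", "deadliest"]
V, EMBED, CONTEXT = len(vocab), 3, 3


def random_matrix(rows, cols):
    """Gaussian initialization (mean 0, std 0.1 are assumed values)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


w_in = random_matrix(V, EMBED)                             # input-to-hidden
w_out = [random_matrix(EMBED, V) for _ in range(CONTEXT)]  # one per Softmax layer


def forward(word):
    # Multiplying a 1-hot vector by w_in just selects the word's row.
    hidden = w_in[vocab.index(word)]
    outputs = []
    for w in w_out:
        scores = [sum(hidden[k] * w[k][j] for k in range(EMBED)) for j in range(V)]
        outputs.append(softmax(scores))
    return hidden, outputs


hidden, outputs = forward("unseen")
# The index of the largest probability gives the predicted word of each layer.
predicted = [vocab[p.index(max(p))] for p in outputs]
print(predicted)
```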

During the training phase, we have the actual output for the target word, so we compute the error of each output layer using a loss function. The loss function could be Mean Squared Error or Cross Entropy. Now we have 3 errors, one per Softmax layer, and we average them by dividing their sum by the number of errors (3 in this case).
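As a small sketch of the error averaging, the snippet below uses Cross Entropy on three hypothetical Softmax outputs. The probabilities and the 4-word vocabulary are invented for illustration and do not come from the figures of this post.

```python
import math


def cross_entropy(probabilities, true_index):
    """With a 1-hot target, cross entropy reduces to -log of the true word's probability."""
    return -math.log(probabilities[true_index])


def average_loss(outputs, true_indices):
    """Average the per-layer errors: sum of the errors divided by their count."""
    losses = [cross_entropy(p, t) for p, t in zip(outputs, true_indices)]
    return sum(losses) / len(losses)


# Hypothetical Softmax outputs, one per context position.
outputs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
true_indices = [0, 1, 3]  # index of the actual context word for each layer
loss = average_loss(outputs, true_indices)
print(round(loss, 4))
```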

After obtaining the error, we backpropagate it to the weights and update the weight values accordingly. As you see, the weight values change with each training batch, which also changes the word representations.

We continue iterating until we reach the maximum number of epochs or the target error. Then we stop the training and take the Input-to-hidden weight matrix, in which each row represents one word in the space.

Alternatively, we can use the Hidden-to-output weights to represent the words. In this case, we average the Hidden-to-output weight matrices, and each column of the averaged weight matrix represents a word. But usually, we take the Input-to-hidden weight matrix.


Continuous Bag of Words (CBOW) Model

CBOW is also a Shallow Neural Network. The target of this model is to predict a target word given the nearby words. So, the input of this network is N words, where N is the window size of the context, which is determined by the network creator, and the output is the target word, which is obtained using a Softmax layer. In the above figure, 3 words are used as input. As before, the network weights are initialized randomly at the beginning.

The idea of CBOW is that the hidden layer values represent the mean of the input words; in other words, the bottleneck features of the network contain the mean vector of the input word vectors.

Recalling our previous sentence, “The unseen blade is the deadliest”, let’s see how the model works with it. The target word to be predicted is “unseen” and the context words are  “the”, “blade” and “is”.

Input-to-hidden

To get the values of the hidden layer, we just average the results of multiplying each input word by the weight matrix.

So, the values of the hidden layer would be
H = [(0.8 + 0.2 + 0.2) / 3, (0.9 + 0.8 + 0.3) / 3, (0.1 + 0.9 + 0.7) / 3] = [0.40, 0.67, 0.57]
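The averaging step can be checked with a short Python sketch. The three vectors are read off the terms of the formula above; which vector belongs to which context word is an assumption, since the weight matrix figure is not reproduced here, and the name `mean_vector` is made up for this example.

```python
def mean_vector(vectors):
    """Average a list of equally sized vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Rows of the input-to-hidden matrix selected by the context words.
# The word-to-vector assignment is assumed for illustration.
context_vectors = {
    "the":   [0.8, 0.9, 0.1],
    "blade": [0.2, 0.8, 0.9],
    "is":    [0.2, 0.3, 0.7],
}

hidden = mean_vector(list(context_vectors.values()))
print([round(h, 2) for h in hidden])  # [0.4, 0.67, 0.57]
```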

Hidden-to-output

After getting the hidden layer values, we get the output layer values by multiplying the hidden layer vector with the hidden to output weight matrix.

Now, we have the output layer values, we need to apply Softmax again to get the target word.

The predicted word is “masterpiece”. We have the actual output like before, so, we get the error and backpropagate the error to the weights so that we update the weights values.

Conclusion

Word2vec is considered to be the core of applying deep learning in Natural Language Processing. Traditionally, rule-based systems that depend on syntactic and linguistic features were used to solve most NLP problems and applications.

No rule-based system can generalize to all real-world data. Also, what applies in one language doesn’t necessarily apply to other languages. So, it was a must to build deep learning systems that depend on the semantic representation of words and text.

Also, there are some hard and challenging problems in NLP that need deep learning to be solved, such as Machine Translation and Paraphrase Detection. Both depend on the semantics of the words (which is common to every language), so it was a must to move to deep learning to enhance learning in a semantic way, instead of relying on syntax and grammar through rule-based algorithms. [https://www.quora.com/Why-do-we-use-neural-network-to-do-NLP-task/answer/Ahmed-Hani-Ibrahim]

To be able to move to deep learning, it was a must to work with words in a much more semantic way and to build systems that actually understand the language itself like we humans do. Word2vec is the bridge to such a task: it represents words in the vector space in a semantic manner, so that the machine understands the words semantically.

As you see, Word2vec seems to need a large amount of training data and high performance machines to train on it. If you think about it, we applied Softmax on the output layer. Applying Softmax over a very large vocabulary is very costly and takes a lot of time, so some optimization of the Softmax was needed to reduce the complexity. In the next post, we will talk about Negative Sampling and Hierarchical Softmax.


# [Python] 3 Ways of Multi-threaded Matrix Multiplication

Recently, I have implemented 3 different ways of multi-threaded matrix multiplication. There are 3 ways of thinking when writing a parallel program:

• Input Decomposition
• Output Decomposition
• Intermediate Decomposition

We want to create a multi-threaded matrix multiplication (3 x 3) program.

Input: Matrix A, B and each one of them is 3×3 size.
Output: Matrix C which is the resultant of matrix A * B

Input Decomposition

There are several ways to multiply 2 matrices. One of them is the Block Matrix method, in which you divide the matrices into sub-matrices (under some constraints), multiply the sub-matrices, and finally sum the products up into matrix C.

Here, the matrices are 3×3, so we can divide A into its columns A1, A2, A3 and B into its rows B1, B2, B3 (stacked vertically), where A1, A2, A3 are of size 3×1 and B1, B2, B3 are of size 1×3. Thread i calculates the product Ai*Bi (that is, A1*B1, A2*B2, A3*B3), and each of these products is a 3×3 matrix.

The only thing remaining now is to add the three 3×3 matrices together to get matrix C.
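A minimal sketch of this decomposition with Python's `threading` module is shown below. It is not the code from the linked repository, only an illustration; the function name `outer_product` and the `partials` list are made up for this example. Note that in CPython the threads mostly illustrate the decomposition rather than give a speed-up, because of the global interpreter lock.

```python
import threading

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
N = 3


def outer_product(k, partials):
    """Thread k multiplies column k of A (3x1) by row k of B (1x3)."""
    partials[k] = [[A[i][k] * B[k][j] for j in range(N)] for i in range(N)]


partials = [None] * N  # each thread writes to its own slot, so no lock is needed
threads = [threading.Thread(target=outer_product, args=(k, partials)) for k in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Sum the three 3x3 partial products element by element to get C.
C = [[sum(partials[k][i][j] for k in range(N)) for j in range(N)] for i in range(N)]
print(C)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```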

Output Decomposition

We know that each cell in a row of matrix C depends on one row of matrix A and the columns of matrix B. So, we make each thread calculate one row of matrix C. The input of each thread is a row from matrix A and the whole matrix B.
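The row-per-thread idea can be sketched as follows. Again, this is an illustrative sketch rather than the repository code, and the function name `compute_row` is hypothetical.

```python
import threading

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
N = 3


def compute_row(i, C):
    """Thread i computes row i of C from row i of A and the whole matrix B."""
    C[i] = [sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]


C = [None] * N  # every thread owns exactly one output row
threads = [threading.Thread(target=compute_row, args=(i, C)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(C)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```

Because the threads write to disjoint rows of C, no final summation step is required in this variant.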

Intermediate Decomposition

It is similar to Input Decomposition. The only difference here is that every thread returns its own matrix C (an intermediate matrix). In other words, every thread returns an intermediate matrix of size 3×3. So, we will have 3 temporary matrices of size 3×3, and then matrix C = thread_1_matrix_C + thread_2_matrix_C + thread_3_matrix_C.

Here https://github.com/AhmedHani/CS617-HPC/tree/master/Assignments/parallel_matrix_multiplication/multi-threaded_matrix_multiplication is the source code that implements the 3 techniques

The input is: A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]], B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

The output: C = A * B

References

– Algorithms and Parallel Computing by Fayez Gebali
– Introduction to Parallel Computing, Second Edition by  Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar

# Getting started with Tensorflow

Tensorflow is an open source library created by Google for deep learning tasks. The library mainly works with matrix operations, and it represents the operations on matrices and data as a graph that shows the dependencies between the tasks. The edges between the nodes (called tensors) are the outputs of the operations.

There are many advantages of using tensorflow: for example, it reduces development time by sparing you from hard-coding some mathematical operations, and it supports GPUs.

Recently I have added a solution for XOR learning problem in my notebook using tensorflow. It was the first time for me to use tensorflow in Machine Learning, and I think it won’t be the last time as I find it an awesome library that I could use.

This is a model from my machine learning masters course’s assignment.

You can check the solution here https://github.com/AhmedHani/CS624-ML/blob/master/Assignments/xor_learning.ipynb

# [Kaggle] Poker Rule Induction

I wrote a note http://nbviewer.ipython.org/github/AhmedHani/Kaggle-Machine-Learning-Competitions/blob/master/Easy/PokerRuleInduction/PokerRuleInduction.ipynb about the Poker Rule Induction problem. The note explains the problem description and the steps I used to solve it.

It is considered a good problem for those who want to start solving at Kaggle and know about some Machine Learning libraries in Python that are commonly used when solving at Kaggle.

# Hidden Markov Models (HMMs) Part I

Agenda:

• Markov Chains
– Structure
– 1st Order Markov Chains
– Higher Order Markov Chains
• Hidden Markov Model
– Structure
– Why using HMMs ?
• Forward-backward algorithm
• Viterbi algorithm (Part II)
• Baum-Welch algorithm (Part II)

This article assumes that you have a pretty good knowledge of some Machine Learning techniques such as Naive Bayes and have a background about Recursion and Dynamic Programming.

Markov Chains

A Markov Chain is a probabilistic model that consists of a finite number of states. The states are connected with each other through edges, and there are values associated with each of these edges.

Structure

The following figure shows a Markov diagram that consists of 2 states, each of which is connected to the other through an edge (Note: the states could contain self-loops).

Sunny and Rainy are 2 states. The values are the transition probabilities between the states

Given the above diagram, we can say that the probability of going from the Sunny state to the Rainy state is 0.1, while the probability of staying Sunny is 0.9 (Note: the probabilities of the transitions leaving a state must sum to 1).

We can represent the previous chain using a Stochastic Transition Matrix, in which each row describes the transitions from one state to the other states, so the sum of each row must be equal to 1. This is called a Right Stochastic Transition Matrix [Side Note #1].

The purpose of using Markov Chains is to predict the next state at time t + 1 given the current state at time t. Using the previous Markov model, we assume that the current state is Sunny, so the current state vector will be [1, 0] (Sunny is 1, Rainy is 0) because we are 100% sure that the current state is Sunny.

From the current state, we want to predict the next state. This could be done by multiplying the current state vector with the transition matrix, the result of the multiplication will be a probabilities vector and each element of the vector shows the probability to be in a state in the next time.

The resultant vector shows that the probability of staying Sunny is much higher than the probability of changing to Rainy.
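The multiplication of the state vector by the transition matrix can be sketched as below. The Sunny row (0.9, 0.1) comes from the text; the Rainy row is a placeholder assumption, since the diagram is not reproduced here, and it does not affect the result when we start from Sunny. The function name `next_distribution` is made up for this example.

```python
def next_distribution(state, transition):
    """Multiply a row vector by a right stochastic matrix."""
    size = len(transition)
    return [sum(state[i] * transition[i][j] for i in range(size)) for j in range(size)]


# State order: [Sunny, Rainy]
transition = [
    [0.9, 0.1],  # from Sunny (values from the text)
    [0.5, 0.5],  # from Rainy (assumed values, for illustration only)
]

current = [1.0, 0.0]  # we are sure the current state is Sunny
print(next_distribution(current, transition))  # [0.9, 0.1]
```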

What we have done so far is dealing with 1st order Markov Chains, in which we only focus on predicting the next state depending on only 1 previous state. In Markov Chains, we can also predict a next state that depends on m previous states, where m >= 2.

Assume that at time t the state S(t) is Sunny, and we want to predict the state after m steps, where m = 4. To do so, we need to find S(t + 4), which depends on S(t + 1), S(t + 2) and S(t + 3), which are unknown to us.

To derive the solution, we need to solve the lower states first. After substitution,

Higher order Markov Chains have many real-life applications, such as predicting the weather for the upcoming days or weeks.
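Finding S(t + 4) from S(t) by first solving S(t + 1), S(t + 2) and S(t + 3) amounts to applying the transition matrix four times. A minimal sketch follows; the Rainy row of the matrix is an assumed placeholder (only the Sunny row is given in the text), so the printed numbers are illustrative only, and the names `step` and `predict` are hypothetical.

```python
def step(state, transition):
    """One time step: row vector times transition matrix."""
    size = len(transition)
    return [sum(state[i] * transition[i][j] for i in range(size)) for j in range(size)]


def predict(state, transition, steps):
    """Solve the lower states first, one step at a time."""
    for _ in range(steps):
        state = step(state, transition)
    return state


# State order: [Sunny, Rainy]; the Rainy row is an assumption.
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]

s4 = predict([1.0, 0.0], transition, steps=4)
print([round(p, 4) for p in s4])  # [0.8376, 0.1624]
```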

Hidden Markov Models

Practically, it may be hard to access the patterns or classes that we want to predict. In the previous example (weather), it could be difficult to observe the weather’s states directly (Hidden states); instead, you can predict the weather state through some indicators (Visible states).

Assume that you have a friend who is always badly affected by the weather state, and these effects are visible to you: they may be Flu, Fever or Dry, while the weather states are Sunny, Cloudy or Rainy. Our target is to predict the weather state using the visible states.

Markov Chains can’t be used in this situation because they deal directly with the states to be predicted, which aren’t visible to us now. So, it was necessary to develop another variant of Markov Chains that uses the visible states to predict the hidden states.

Structure

Figure 2

The above figure shows a Hidden Markov Model that consists of 3 visible states {Fever, Flu, Dry}, 3 hidden states {Sunny, Rainy, Cloudy}. We need now to extract information from this model.

The start state is the current state of the system. All the system knows now is that the next state will most likely be Sunny, because the transition probabilities from the start state to the states {Rainy, Sunny, Cloudy} are {0.3, 0.4, 0.3} respectively, which makes Sunny the most likely initial prediction.

Hidden states are connected with each other through Transition Probabilities, which indicate the change of state at the next time step. For example, if we are now in state Sunny, there is a 0.3 probability that the next state will be Rainy; if we are in state Rainy, the next state will be Cloudy with a probability of 0.7. Note that the probabilities on the outgoing transitions of a state must sum to 1.

Hidden states are connected with the visible (observed) states through Emission Probabilities, which indicate your friend’s health state given the weather state. For example, if the state is Sunny, your friend will most likely feel Dry with a probability of 0.6, or have a Fever with a probability of 0.4. Note that the emission probabilities of a state must also sum to 1.

Why using HMMs?

Hidden Markov Models are a very powerful technique used in sequential and structured prediction such as weather prediction. HMMs also shine in speech recognition and pattern recognition applications such as handwriting recognition, machine translation and language detection, all of which are based on sequences of signals or words.

Mainly, HMMs are used to solve 3 problems:

• Observations Probability Estimation
– Given a model like the one above, with its observations, hidden states, transition probabilities and emission probabilities, estimate the probability of an observation sequence given the model.
• Optimal Hidden State Sequence
– Given a model and a sequence of observations, determine the optimal path (sequence) of hidden states.
• HMM Parameters Estimation
– Choose the model parameters that maximize the probability of a specific observation sequence.

Each of these problems is solved by a different algorithm. The Probability Estimation, Optimal State Sequence and Parameter Estimation problems are solved using the Forward-Backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm respectively. We will discuss the first one next and the others in another post.

For the math, we need some notation for the components of an HMM.

Our model will be λ = (Α, Β, π), where Α is the transition probability matrix between the hidden states, Β is the emission probability matrix between hidden states and observations, and π is the initial state distribution the system starts from.

Observations Probability Estimation

Probability estimation is about estimating the probability of an observation sequence occurring given λ. This is considered an evaluation of the model, indicating how well it explains the current situation. In other words, consider the weather prediction problem: we want to determine the HMM that best describes the current month’s weather given some observations such as wind, rain or others. This type of problem arises in many Speech Recognition and Natural Language Processing applications.

Brute Force

One of the solutions to this problem is Brute Force, in which we try every combination of the observation sequence with a hidden state sequence.

• We need to find $P(O \mid \lambda)$, where $O = \{o_1, o_2, \dots, o_T\}$ is the observation sequence, $\lambda$ is our HMM, and $Q = \{q_1, q_2, \dots, q_T\}$ is a sequence of hidden states, one per time instant $t = 1, \dots, T$.
• To find $P(O \mid Q, \lambda)$ for one sample state sequence, we multiply the probabilities of each observation given its state over all time instants: $P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda)$.
• Each factor $P(o_t \mid q_t, \lambda)$ is read from the emission matrix $B$ as the entry $b_{q_t o_t}$.
• So $P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = \prod_{t=1}^{T} b_{q_t o_t}$.
• The probability of the state sequence determines how likely the states are to follow each other over the sequential time instants. It is the same idea as n-grams [Side Note #2].
• So, this probability is $P(Q \mid \lambda) = \pi_{q_1} \prod_{t=2}^{T} P(q_t \mid q_{t-1})$.
• The term $P(q_t \mid q_{t-1})$ is read from the transition matrix $A$ of the model as the entry $a_{q_{t-1} q_t}$.
• So, $P(Q \mid \lambda) = \pi_{q_1} \prod_{t=2}^{T} P(q_t \mid q_{t-1}) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}$.
• From the above we obtain the joint probability $P(O, Q \mid \lambda) = P(Q \mid \lambda)\, P(O \mid Q, \lambda) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t} \prod_{t=1}^{T} b_{q_t o_t}$.
• Note that so far we have considered only one state sequence, so we need to modify the formula to consider all state sequences: $P(O \mid \lambda) = \sum_{Q} \pi_{q_1} b_{q_1 o_1} \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t o_t}$.

Easy, right? If the model has N states and the sequence has T observations, there are N^T possible state sequences and about 2T calculations for each sequence, so the order is 2T·N^T, which isn’t efficient at all.

Luckily, there is an optimized solution for that, which is the Forward-backward algorithm.
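The brute force sum over all state sequences can be sketched with `itertools.product`. The initial, transition and Dry/Fever emission values are the ones used in the worked calculations of this article; the Flu emission values are not stated in the text and are inferred here from the rule that each state's emissions sum to 1, so treat them as an assumption. The function name `brute_force_likelihood` is made up for this sketch.

```python
from itertools import product

states = ["R", "S", "C"]  # Rainy, Sunny, Cloudy
initial = {"R": 0.3, "S": 0.4, "C": 0.3}
# transition[i][j] = P(next state = j | current state = i)
transition = {
    "R": {"R": 0.2, "S": 0.1, "C": 0.7},
    "S": {"R": 0.3, "S": 0.4, "C": 0.3},
    "C": {"R": 0.1, "S": 0.4, "C": 0.5},
}
# emission[i][o] = P(observation = o | state = i); the "U" (Flu) column is inferred.
emission = {
    "R": {"D": 0.1, "F": 0.4, "U": 0.5},
    "S": {"D": 0.6, "F": 0.4, "U": 0.0},
    "C": {"D": 0.4, "F": 0.4, "U": 0.2},
}


def brute_force_likelihood(observations):
    """Sum P(O, Q | model) over every possible hidden state sequence Q."""
    total = 0.0
    for path in product(states, repeat=len(observations)):  # N^T sequences
        p = initial[path[0]] * emission[path[0]][observations[0]]
        for t in range(1, len(path)):
            p *= transition[path[t - 1]][path[t]] * emission[path[t]][observations[t]]
        total += p
    return total


print(round(brute_force_likelihood(["D", "F"]), 6))  # 0.156
```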

Forward-backward Algorithm

Forward-backward algorithm is a dynamic programming approach that solves the above problem efficiently. To illustrate how this algorithm works, we will use the model in figure 2.

At first, we need to extract all the information we need from the model: the Transition Probabilities, the Emission Probabilities and the Initial Probabilities distributions.

• Our states will be S = Sunny, R = Rainy and C = Cloudy and the observations will be D = Dry, F = Fever and U = Flu
• The initial probabilities are
• The transition probabilities
• The emission probabilities

This algorithm is divided into 3 steps: calculating the forward probabilities, calculating the backward probabilities, and finally combining the 2 to obtain the full formula.

1. Calculating Forward Probabilities

Let’s assume that the observation sequence we want to check is O = { D, F, U, D }. We want to compute αt(i), which is the probability that the given model outputs the first t observations of the sequence D -> F -> U -> D and is in state i at time t.

First, let’s compute the probability of being in each state and output the first observation D.

α1(R) = P(R) *  P(D | R) = 0.3 * 0.1 = 0.03
α1(S) = P(S) *  P(D | S) = 0.4 * 0.6 = 0.24
α1(C) = P(C) *  P(D | C) = 0.3 * 0.4 = 0.12

To compute the forward probabilities at time t = 2, we sum, over all possible previous states, the probability of moving from that state to the current state, and then multiply by the probability of emitting the current observation. This is done using the following recursive relation.

αt(i) = (Σj αt-1(j) * P(Qt = i | Qt-1 = j)) * P(Ot | Qt = i)

So,

We can compute the forward probabilities of the 3 states by the following

α2(R) = (α1(R) * P(R | R) + α1(S) * P(R | S) + α1(C) * P(R | C)) * P(F | R)
α2(R) = (0.03 * 0.2 + 0.24 * 0.3 + 0.12 * 0.1) * 0.4 = 0.036

α2(S) = (α1(R) * P(S | R) + α1(S) * P(S | S) + α1(C) * P(S | C)) * P(F | S)
α2(S) = (0.03 * 0.1 + 0.24 * 0.4 + 0.12 * 0.4) * 0.4 = 0.0588

α2(C) = (α1(R) * P(C | R) + α1(S) * P(C | S) + α1(C) * P(C | C)) * P(F | C)
α2(C) = (0.03 * 0.7 + 0.24 * 0.3 + 0.12 * 0.5) * 0.4 = 0.0612

We can now get the probabilities at t = 3 and so on …

Once we finish the other 2 time steps, we can go to the next step, which is calculating the backward probabilities.
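The forward recursion can be sketched in Python as follows. The sum over previous states is multiplied by the emission probability of the current observation. The model values are taken from the worked calculations in this article, except the Flu ("U") emissions, which are inferred from the rows summing to 1 and should be read as an assumption; the function name `forward` is made up for this sketch.

```python
states = ["R", "S", "C"]  # Rainy, Sunny, Cloudy
initial = {"R": 0.3, "S": 0.4, "C": 0.3}
# transition[j][i] = P(Qt = i | Qt-1 = j)
transition = {
    "R": {"R": 0.2, "S": 0.1, "C": 0.7},
    "S": {"R": 0.3, "S": 0.4, "C": 0.3},
    "C": {"R": 0.1, "S": 0.4, "C": 0.5},
}
# The "U" (Flu) column is inferred, not stated in the text.
emission = {
    "R": {"D": 0.1, "F": 0.4, "U": 0.5},
    "S": {"D": 0.6, "F": 0.4, "U": 0.0},
    "C": {"D": 0.4, "F": 0.4, "U": 0.2},
}


def forward(observations):
    """Return a list of dicts: alpha[t][i] for t = 0 .. T-1 (0-based time)."""
    alpha = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append({
            i: sum(prev[j] * transition[j][i] for j in states) * emission[i][obs]
            for i in states
        })
    return alpha


alpha = forward(["D", "F", "U", "D"])
print({s: round(v, 4) for s, v in alpha[0].items()})  # {'R': 0.03, 'S': 0.24, 'C': 0.12}
print({s: round(v, 4) for s, v in alpha[1].items()})  # {'R': 0.036, 'S': 0.0588, 'C': 0.0612}
```

Summing the last entry, `sum(alpha[-1].values())`, gives the probability of the whole observation sequence under the model.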

2. Calculating Backward Probabilities

In the previous step, we calculated the probability of the observations up to time t, with the system ending in state i at time t. In this step, we calculate the probability of starting in state i at time t and generating the remaining observations ot+1, …, oT, where T is the length of the observation sequence.

βt(i) = P(ot+1, ot+2, …, oT | Qt = i)

The base case when calculating the values of βt(i) is time T: what is the probability of being at time T and generating nothing? Why nothing? Because when we reach observation T, there are no more observations in the sequence. So, for each state of the model, the probability of generating nothing is

βT(i) = 1

Going backwards from T - 1 down to 1, we use this formula to compute the values of beta:

βt(i) = Σj P(Qt+1 = j | Qt = i) * P(ot+1 | j) * βt+1(j)

So,

β3(R) = (P(R | R) * P(D | R) * β4(R)) + (P(S | R) * P(D | S) * β4(S)) + (P(C | R) * P(D | C) * β4(C))
β3(R) = (0.2 * 0.1 * 1) + (0.1 * 0.6 * 1) + (0.7 * 0.4 * 1) = 0.36

β3(S) = (P(R | S) * P(D | R) * β4(R)) + (P(S | S) * P(D | S) * β4(S)) + (P(C | S) * P(D | C) * β4(C))
β3(S) = (0.3 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.3 * 0.4 * 1) = 0.39

β3(C) = (P(R | C) * P(D | R) * β4(R)) + (P(S | C) * P(D | S) * β4(S)) + (P(C | C) * P(D | C) * β4(C))
β3(C) = (0.1 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.5 * 0.4 * 1) = 0.45

Once we have calculated the backward probabilities at t = 3, we can calculate those at t = 2 using the same formula …
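A sketch of the backward recursion in Python is given below. For each state i it sums, over the next states j, the probability of moving from i to j, emitting the next observation from j, and then generating the rest of the sequence. The model values come from the calculations in this article, with the Flu ("U") emissions inferred from the rows summing to 1 (an assumption); the function name `backward` is made up for this sketch.

```python
states = ["R", "S", "C"]  # Rainy, Sunny, Cloudy
# transition[i][j] = P(Qt+1 = j | Qt = i)
transition = {
    "R": {"R": 0.2, "S": 0.1, "C": 0.7},
    "S": {"R": 0.3, "S": 0.4, "C": 0.3},
    "C": {"R": 0.1, "S": 0.4, "C": 0.5},
}
# The "U" (Flu) column is inferred, not stated in the text.
emission = {
    "R": {"D": 0.1, "F": 0.4, "U": 0.5},
    "S": {"D": 0.6, "F": 0.4, "U": 0.0},
    "C": {"D": 0.4, "F": 0.4, "U": 0.2},
}


def backward(observations):
    """Return a list of dicts: beta[t][i] for t = 0 .. T-1 (0-based time)."""
    beta = [{s: 1.0 for s in states}]  # base case: beta at time T is 1
    # Walk backwards over the observations o_T, o_T-1, ..., o_2.
    for obs in reversed(observations[1:]):
        nxt = beta[0]
        beta.insert(0, {
            i: sum(transition[i][j] * emission[j][obs] * nxt[j] for j in states)
            for i in states
        })
    return beta


beta = backward(["D", "F", "U", "D"])
# beta[2] is the article's t = 3 (the list is 0-based).
print({s: round(v, 4) for s, v in beta[2].items()})  # {'R': 0.36, 'S': 0.39, 'C': 0.45}
```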

3. Add them together

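Once we have calculated all the forward and backward probabilities, we can obtain the solution to the whole problem, because the forward step covers the observations from time 1 to t, and the backward step covers the observations from t + 1 to T.

So,

P(o1, o2, …, oT, Qt = i | λ) = αt(i) * βt(i)

and summing over the states at any time t gives the probability of the whole observation sequence: P(O | λ) = Σi αt(i) * βt(i)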
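As a sanity check, the sketch below combines the two passes and verifies that summing αt(i) * βt(i) over the states gives the same sequence probability at every time step. The model values are those used in this article's calculations, except the Flu ("U") emissions, which are inferred from the rows summing to 1 and are therefore an assumption; the function names are made up for this sketch.

```python
states = ["R", "S", "C"]  # Rainy, Sunny, Cloudy
initial = {"R": 0.3, "S": 0.4, "C": 0.3}
# transition[i][j] = P(next state = j | current state = i)
transition = {
    "R": {"R": 0.2, "S": 0.1, "C": 0.7},
    "S": {"R": 0.3, "S": 0.4, "C": 0.3},
    "C": {"R": 0.1, "S": 0.4, "C": 0.5},
}
# The "U" (Flu) column is inferred, not stated in the text.
emission = {
    "R": {"D": 0.1, "F": 0.4, "U": 0.5},
    "S": {"D": 0.6, "F": 0.4, "U": 0.0},
    "C": {"D": 0.4, "F": 0.4, "U": 0.2},
}


def forward(observations):
    alpha = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append({i: sum(prev[j] * transition[j][i] for j in states) * emission[i][obs]
                      for i in states})
    return alpha


def backward(observations):
    beta = [{s: 1.0 for s in states}]
    for obs in reversed(observations[1:]):
        nxt = beta[0]
        beta.insert(0, {i: sum(transition[i][j] * emission[j][obs] * nxt[j] for j in states)
                        for i in states})
    return beta


observations = ["D", "F", "U", "D"]
alpha, beta = forward(observations), backward(observations)

# One value per time step; all of them should be equal.
likelihoods = [sum(alpha[t][s] * beta[t][s] for s in states)
               for t in range(len(observations))]
print([round(p, 7) for p in likelihoods])
```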

[Side Note #1]:

A Stochastic Transition Matrix is a probability matrix that comes in 3 different types

• Right Stochastic Matrix, in which the sum of each row is 1
• Left Stochastic Matrix, in which the sum of each column is 1
• Doubly Stochastic Matrix, in which the sum of each row and each column is 1

[Side Note #2]:

N-grams is an approach to language modelling that is commonly used in Natural Language Processing. N indicates the number of items that follow each other (as a sequence) in the context. For instance, suppose we have the sentence “How about traveling to Egypt and see the Nile river ?”. In text processing, it is necessary to know which words are related to each other and together produce the meaning of the sentence. In other words, the pair “How about” should be treated as related words, so when we find this pair in a sentence, the probability P(about | How) is higher than that of another pair such as P(Nile | How). So, N-grams also use the Markov property to calculate the probabilities of words occurring as a sequence. In N-grams, a pair of contiguous words is called a Bigram (N = 2). We can also use more than 2 items, such as the probability of seeing Nile given How and about, P(Nile | How, about); this is called a Trigram, and so on.
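A bigram probability can be estimated from counts, as in the following sketch. It uses the single example sentence as a toy corpus, so the estimates are only illustrative; the function name `bigram_probability` is made up for this example.

```python
from collections import Counter


def bigram_probability(tokens, previous, word):
    """Estimate P(word | previous) as count(previous, word) / count(previous)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    starts = Counter(tokens[:-1])  # words that have a successor
    if starts[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / starts[previous]


tokens = "how about traveling to egypt and see the nile river".split()
print(bigram_probability(tokens, "how", "about"))  # 1.0
print(bigram_probability(tokens, "how", "nile"))   # 0.0
```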


Tools Used:

• The HMM model is created using creatly.com
• The tables and equations images are built using Microsoft Word 2013

# Data Normalization and Standardization for Neural Networks Output Classification

Agenda

• Data standardization (encoding) for data input
– Binary
– Positive-Negative
– Manhattan & Euclidean
– Categories encoding
• Standardization vs. Normalization
– Min-Max Normalization
– Gaussian Normalization
• Data Decoding
– Softmax activation function
– Mean Squared Error
– Entropy (Information Theory)
– Mean Cross Entropy
• MSE vs. MCE

This article assumes you have a pretty good knowledge about Neural Networks and some basic algorithms like Backpropagation.

When you work with neural networks, you always find yourself dealing with numeric data. Basically, neural networks can only operate on numeric data. Whether you run an algorithm such as backpropagation or simulate a perceptron, you always use functions or equations to calculate your output. When you build your network, you use matrices to represent the biases and the layer-to-layer values, and after you get the output, you estimate the error and try to reach the optimal solution. As you see, you always use numeric data.

But what about the other data types? Could neural networks be built to make a good prediction or get an optimal output given data like “food”, “location” or “gender”?

The solution is to encode the non-numerical data and normalize it so that it is represented as numeric data. This operation is called “Data Encoding and Decoding”; the name “Data Standardization” is used too. Suppose your input data has the fields “ID”, “Name”, “Nationality”, “Gender”, “Preferred food” and “Wage”. For clarification, the following table represents a training set, where the ith row represents the ith person.

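| ID | Name | Gender | Nationality | Preferred food | Wage |
|----|-------|--------|-------------|----------------|------|
| 1 | Hani | Male | French | Rice | 30 |
| 2 | Saad | Male | Italian | Pizza | 40 |
| 3 | Sofia | Female | Russian | Spaghetti | 15 |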

To make it easy, I will represent the above table in a simple 2D matrix

I will take each column, then check if it contains numeric data or not.

The first column is “ID”. The ID is just a unique identifier for each item in the training set, so this column isn’t involved in the process.

The second column is the person’s name. It isn’t involved in the process either, but you need to map each ID to its name and use the ID instead of the name, to avoid problems with duplicated names in the training set.

The third column is the Gender. Obviously, the Gender type has only 2 values, Male or Female. You have 3 ways to encode this data:

• 0-1 Encoding (Binary method)
– You are free to choose: male is set to 0 and female to 1, or vice versa
• +1 -1 Encoding
– If a 0 value in your input would cause you trouble, use the values +1 and -1 instead
• Manhattan Encoding (x-y axis)
– Simply use a pair of values. This method is good when you deal with more than 2 values. In our Gender case, we set male to [0, 1] and female to [1, 0], or vice versa. This method works perfectly with up to 4 possible values; if there are more than 4 possible values, you can include -1 in your encoding, which is then called Euclidean Encoding.
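The three options above can be sketched as simple lookup tables. The dictionary names and the helper `encode_column` are hypothetical, and which value gets which code is an arbitrary choice, as the list notes.

```python
# Three ways to encode a two-valued category such as Gender.
# Which value gets which code is an arbitrary choice.
BINARY = {"Male": 0, "Female": 1}
PLUS_MINUS = {"Male": 1, "Female": -1}
PAIR = {"Male": [0, 1], "Female": [1, 0]}

def encode_column(values, mapping):
    """Replace each category string with its numeric code."""
    return [mapping[v] for v in values]

genders = ["Male", "Male", "Female"]  # Gender column of the training table
print(encode_column(genders, BINARY))      # [0, 0, 1]
print(encode_column(genders, PLUS_MINUS))  # [1, 1, -1]
print(encode_column(genders, PAIR))        # [[0, 1], [0, 1], [1, 0]]
```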

The fourth column represents the person’s nationality. There is no single agreed-upon method for dealing with this kind of data. A person’s nationality is represented as a string value; some of you might encode each character of the string to ASCII and then use some constant to normalize the data, but the method I prefer, which gives me flexibility when I work with this kind of data, is using matrices.

The matrices approach depends on the number of possible values of the category, so first of all we count the number of different values in the column. Back to our training set, we see there are only 3 nationalities in the table. Let’s create a matrix N that will hold the column values; the size of the matrix is m x 1, where m is the number of different values in the input data, in our case m = 3.

We will create a new identity matrix A of size m x m, so that A has one row for each entry of N.

Let the i-th row of A be the code of the i-th entry of N.

Now we assign each nationality value to its corresponding row, thus

French = [1, 0, 0].
Italian = [0, 1, 0].
Russian = [0, 0, 1].

The only thing remaining is to replace each nationality string with its corresponding vector value.
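A minimal sketch of the matrices approach, assuming the distinct values are numbered in order of first appearance. The function name `one_hot_table` is hypothetical.

```python
def one_hot_table(values):
    """Map each distinct value to a row of an m x m identity matrix."""
    distinct = list(dict.fromkeys(values))  # distinct values, in order of first appearance
    m = len(distinct)
    identity = [[1 if row == col else 0 for col in range(m)] for row in range(m)]
    return {value: identity[i] for i, value in enumerate(distinct)}

nationalities = ["French", "Italian", "Russian"]  # Nationality column of the table
codes = one_hot_table(nationalities)

# Replace each nationality string with its vector
encoded = [codes[n] for n in nationalities]
print(codes["Italian"])  # [0, 1, 0]
```

The identity matrix grows with the square of the number of distinct values, which is where the memory cost of this approach comes from.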

This approach works perfectly with small and medium input data, and works well with large amounts of input data.

Pros:

• Dynamic
• Easy to understand
• Easy in coding

Cons:

• Memory consumption
• Complicated when dealing with a very large amount of data
• Bad performance with a huge amount of data

The fifth column is the preferred food of each person. It is treated like the previous column: use the matrices approach and count the number of different foods in the input. Some special cases can appear in this column with other input data; we will talk about them in another post.

The last column is the wage. As you see, this column already uses numeric data to represent the wage per day of each person (you don’t say!). Will we leave the values without any processing? The answer is no; we should normalize these values so that they suit the other encoded values. Experience shows that normalizing numeric data produces better output than leaving it unnormalized. But how will we normalize these values?

Before we answer this question, we first need to know exactly the difference between “Normalization” and “Standardization”.

The relation between Normalization and Standardization looks like the relation between Recursion and Backtracking: any Backtracking is Recursion, but not every Recursion is Backtracking. Likewise, any Standardization is considered to be Normalization, but not every Normalization is Standardization. To clarify, we need to know the definition of each of them.

In statistics, normalizing data commonly means rescaling the values into a fixed interval, usually [0, 1] or [-1, 1]. For example, in RGB the value of each color channel ranges from 0 to 255; the values could be 55, 40, etc., but they can’t exceed 255 or go below 0. We want to normalize the color values to be in the interval [0, 1]. The most common method for normalization is:

• Min-Max Normalization

In Min-Max normalization, we use below formula

The resulting value won’t exceed 1 or go below 0; use this method when you want to map values into the range [0, 1].

For the range [-1, 1], we use the formula below if we want 0 to be the center
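A minimal sketch of both variants, assuming the usual linear form x' = low + (high - low) * (x - min) / (max - min), which reduces to (x - min) / (max - min) for the range [0, 1]. The function name `min_max` and its parameters are hypothetical, and the [-1, 1] version shown is one common choice rather than necessarily the exact formula the article had in mind.

```python
def min_max(values, low=0.0, high=1.0):
    """Rescale values linearly so the smallest maps to `low` and the largest to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [low + (high - low) * (v - v_min) / (v_max - v_min) for v in values]

wages = [30, 40, 15]
print(min_max(wages))             # [0.6, 1.0, 0.0]
print(min_max(wages, -1.0, 1.0))  # approximately [0.2, 1.0, -1.0]
```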

Standardization is much the same as Normalization, but Standardization computes the Z-score, which transforms the data to have mean 0 and variance 1. The resulting value is measured relative to the mean: if the original value is close to the mean, the result is close to zero. It is done using the Gaussian normalization equation.

Check the following figure (from Wikipedia):

As you see, there isn’t much difference between Standardization and Normalization, though experience shows that Gaussian normalization gives better output than Min-Max normalization.

Back to our table, as I said, it is better to use standardization with the numeric values of “Wage”; I will use Gaussian normalization.

Step 1:

Step 2:

Step 3:
Take each wage value and use Gaussian normalization equation

As you see, the result is about 0.16 (using the population standard deviation, roughly 10.27), which is near zero because the value “30” is very close to the mean “28.33”. In addition, the value is positive because the wage is above the mean; if the wage were less than 28.33, the normalized value would be negative. That is why Gaussian normalization is better than Min-Max: it gives you more information about the true value.
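The three steps can be sketched as follows. This assumes the population standard deviation (dividing by n); dividing by n - 1 instead gives slightly smaller magnitudes, so the exact figures depend on that convention. The function name `z_scores` is hypothetical.

```python
import math

def z_scores(values):
    """Standardize values to zero mean and unit variance (population standard deviation)."""
    n = len(values)
    mean = sum(values) / n                                      # step 1: mean
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)   # step 2: standard deviation
    return [(v - mean) / std for v in values]                   # step 3: z-score per value

wages = [30, 40, 15]
print([round(z, 2) for z in z_scores(wages)])  # [0.16, 1.14, -1.3]
```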

Now, let’s put all the above together and get the normalized input data to use it in the neural network

Data Decoding

All we have done so far is normalize the input data and use some encoding techniques to transform categorical values into numeric ones that suit neural networks. But what about the output? Obviously, the output for encoded input will be in encoded form too, so how can we get the right prediction or classification from the output? In other words, how can we decode the outputs back to their original form?

The best way to understand what we will do next is by example. From our previous table, we assigned an encoded value to each nationality

French = [1, 0, 0].
Italian = [0, 1, 0].
Russian = [0, 0, 1].

Assume that you get sample outputs such as [0.6, 0.3, 0.1], [0.1, 0.7, 0.2] and [0.0, 0.3, 0.7] for the targets French, Italian and Russian respectively. How do you validate this example?

In this case, you can treat the values of each vector as probabilities. You can make this assumption when the values are in the range [0, 1] and the sum of all vector values equals 1. From this assumption we can easily get the right prediction from the output.

| Output | Nationality |
|-----------------|-----------|
| [0.6, 0.3, 0.1] | [1, 0, 0] |
| [0.1, 0.7, 0.2] | [0, 1, 0] |
| [0.0, 0.3, 0.7] | [0, 0, 1] |

Compare each value of the output with its corresponding value in the nationality’s vector. In the first case, the value “0.6” is closer to 1 than “0.3” and “0.1”, and it sits in the position where the target holds 1, which shows that it is a correct prediction. The same holds for the second and third cases, so we can conclude that the output makes a good prediction.
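In code, this check amounts to picking the position of the largest output value and looking up the label stored at that position. A minimal sketch, where `LABELS` and `decode` are hypothetical names and the label order is assumed to match the one-hot vectors:

```python
LABELS = ["French", "Italian", "Russian"]  # order assumed to match the one-hot vectors

def decode(output, labels=LABELS):
    """Return the label whose position holds the largest output value."""
    best = max(range(len(output)), key=lambda i: output[i])
    return labels[best]

outputs = [[0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]]
print([decode(o) for o in outputs])  # ['French', 'Italian', 'Russian']
```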

We can use the previous approach when the values of the output vector are within [0, 1], but what can we do with an output like [3.0, 2.5, 7.0]? All of its values are greater than 1, so you can’t use the previous method with it. What shall we do?

The solution is to use the softmax activation function. It transforms your output values into probabilities: each new value lies in [0, 1] and together they sum to 1. The formula is:

Let’s calculate the new vector using the above equation

Step 1:

Step 2:

So, the new output vector is approximately [0.018, 0.011, 0.971]; now you can use this vector for your classification.
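For reference, softmax is usually defined as below. The symbols $z_i$ (the i-th raw output) and $K$ (the number of outputs) are notation assumed here rather than taken from the article. The two steps for the vector [3.0, 2.5, 7.0] then work out as follows, with intermediate values rounded:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

$$
e^{3.0} \approx 20.09, \quad e^{2.5} \approx 12.18, \quad e^{7.0} \approx 1096.63, \quad \sum_{j=1}^{3} e^{z_j} \approx 1128.90
$$

$$
\frac{20.09}{1128.90} \approx 0.018, \quad \frac{12.18}{1128.90} \approx 0.011, \quad \frac{1096.63}{1128.90} \approx 0.971
$$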

There is a lot of math behind the softmax function; you can read more about it if you are interested.

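```python
import math

def softMax(self, output):
    # Exponentiate each raw output, shifted by the maximum for numerical stability
    # (the shift cancels out in the division below)
    largest = max(output[:self.numberOfOutputs])
    exps = [math.exp(output[i] - largest) for i in range(self.numberOfOutputs)]

    # Divide by the sum so the new values lie in [0, 1] and add up to 1
    total = sum(exps)
    return [e / total for e in exps]
```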

Errors

All our assumptions so far depend on the neural network output always being correct, so that the output always matches the target output. In practice this isn’t always true; you may face something like

| Output | Target |
|-----------------|-----------|
| [0.3, 0.3, 0.4] | [1, 0, 0] |

According to what we have said and the method we have used, the last value in the output vector is the nearest to 1, but this doesn’t match our target vector. We conclude that there is an error in the prediction or classification. It is important to compute your output error: this helps to improve your neural network training and to know how efficient your network is. To compute the value of this error, there are 2 common approaches:

• Mean Squared Error
• Mean Cross Entropy

We will talk about each one of them; let’s begin with Mean Squared Error (MSE)

From Wikipedia, the definition of MSE is “the mean squared error (MSE) of an estimator measures the average of the squares of the “errors”, that is, the difference between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss.”

To see this in practice, let’s create some training items and their expected target values

| Output | Target |
|-----------------|-----------|
| [0.3, 0.3, 0.4] | [1, 0, 0] |
| [0.2, 0.3, 0.5] | [0, 1, 0] |

First, let’s get the sum of the squared differences between the values of the 2 vectors for the first training item

The second training item

Finally, let’s get the average of these sums

As you see, the error is high, which indicates that the prediction is very far from the correct one; this should guide you to train your network more.
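The calculation described above can be sketched as follows: the squared differences are summed per training item and the sums are then averaged over the items. The function name `mean_squared_error` is hypothetical, and some definitions additionally divide by the number of output values, which only changes the scale.

```python
def mean_squared_error(outputs, targets):
    """Average, over training items, of the summed squared differences."""
    per_item = [
        sum((t - o) ** 2 for o, t in zip(output, target))
        for output, target in zip(outputs, targets)
    ]
    return sum(per_item) / len(per_item)

outputs = [[0.3, 0.3, 0.4], [0.2, 0.3, 0.5]]
targets = [[1, 0, 0], [0, 1, 0]]
# First item: 0.49 + 0.09 + 0.16 = 0.74; second item: 0.04 + 0.49 + 0.25 = 0.78
print(round(mean_squared_error(outputs, targets), 2))  # 0.76
```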

Let’s look at the other approach, which I prefer: Mean Cross Entropy (MCE). Before we work through examples, I would like to get into the “Entropy” concept; if you get bored with that, you can skip it and jump to the MCE equation and example.

In Information Theory, quantifying information means measuring how much information you can gain from an event or sample, and that amount affects your decisions. Assume that you are creating a system that sends and receives data, for example Skype: you send data (speech) and receive data. How do you determine the best encoding method for these voice signals? You decide once you know something about the data. For example, if your input data will only ever be “Yes” or “No” (and both are equally likely), you can use just 1 bit to encode it, so we can say that the Entropy of this case is 1, which is the minimum number of bits needed to encode it. There are other kinds of information you can obtain from an event, such as its probability distribution, which I will focus on.

So, Entropy is the key measurement of the average number of bits needed to encode an event. It follows that the more information we can get from an event, the higher the Entropy value we expect.

As I said before, the probability distribution of an event is useful information, and we can use Entropy to evaluate it: the Entropy measures the randomness in the probability distribution. This evaluation is very important when you want to get the most out of a uniform distribution. Assume that you have variables X and Y with the actual distributions [0.3, 0.2, 0.1, 0.4] and [0.5, 0.5] respectively. To calculate the Entropy of X, E(X), we use

As you see, the Entropy of the first distribution X is 1.85, and the Entropy of Y is 1. This is because the randomness in the X distribution is higher (more information) than in Y (less information).
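These two values can be reproduced with the Shannon entropy in bits, E(X) = -Σ p log2(p). A minimal sketch, with the function name `entropy` chosen for illustration:

```python
import math

def entropy(distribution):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

print(round(entropy([0.3, 0.2, 0.1, 0.4]), 2))  # 1.85
print(entropy([0.5, 0.5]))                      # 1.0
```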

You can use Entropy to compare two or more probability models; this comparison shows you how close to or far from your target model each of them is. Assume that you have a variable T with the actual probability distribution [0.2, 0.1, 0.7], and 2 probability models X and Y with probability distributions [0.3, 0.1, 0.6] and [0.3, 0.3, 0.4] respectively. You want to find the model closest to your target model T. The first step is to calculate your target’s Entropy

The second step is to calculate the Cross Entropy (CE). CE is a variant of the Entropy function that estimates how close model B is to model A

Let’s estimate how close model X is to model T

Model Y with T

We can observe from this that model X is closer to T than model Y.
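The comparison can be sketched with the usual cross entropy in bits, CE(T, M) = -Σ t log2(m). The function name `cross_entropy` is hypothetical, and the figures below come from this standard definition. The closer a model's cross entropy is to the entropy of T itself, the closer the model is to T.

```python
import math

def cross_entropy(target, model):
    """Cross entropy in bits of `model` measured against `target`."""
    return -sum(t * math.log2(m) for t, m in zip(target, model) if t > 0)

T = [0.2, 0.1, 0.7]
X = [0.3, 0.1, 0.6]
Y = [0.3, 0.3, 0.4]
print(round(cross_entropy(T, T), 2))  # 1.16, the entropy of T itself
print(round(cross_entropy(T, X), 2))  # 1.2
print(round(cross_entropy(T, Y), 2))  # 1.45
```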

There are many other variants of Entropy; I have only mentioned the ones you may use as a programmer in your applications. If you are interested in Entropy as a key concept of Information Theory, you can search for it and you will find good papers about it.

Mean Cross Entropy (MCE) is my preferred approach for computing the error of a neural network’s categorical output. I will use the same example as for MSE.

| Output | Target |
|-----------------|-----------|
| [0.3, 0.3, 0.4] | [1, 0, 0] |
| [0.2, 0.3, 0.5] | [0, 1, 0] |

MCE measures, on average, how far the neural network output is from the target output. The MCE formula is the following

Let’s begin with the first training item

The second training item

MCE calculation,

A perfect prediction gives an MCE of 0, and here the MCE is about 1.7, which indicates that the outputs are still far from their targets.
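A minimal sketch of this calculation, assuming the cross entropy of each item is taken in bits and then averaged over the training items. The function name `mean_cross_entropy` is hypothetical.

```python
import math

def mean_cross_entropy(outputs, targets):
    """Average cross entropy (in bits) between each target and its output."""
    per_item = [
        -sum(t * math.log2(o) for o, t in zip(output, target) if t > 0)
        for output, target in zip(outputs, targets)
    ]
    return sum(per_item) / len(per_item)

outputs = [[0.3, 0.3, 0.4], [0.2, 0.3, 0.5]]
targets = [[1, 0, 0], [0, 1, 0]]
# Each item contributes -log2(0.3), because 0.3 is the output at the target position
print(round(mean_cross_entropy(outputs, targets), 2))  # 1.74
```

With one-hot targets, only the output value at the target position enters the sum, so an output that assigns probability 1 to the correct class yields an error of 0.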

If anything is still unclear, go back to the sections above, where you will find all the information you need to fully understand the MCE approach.

MCE vs. MSE

Well, in machine learning the answer is always “it depends on the problem itself”, but both of them affect the gradient in backpropagation training.

Here is the implementation of both methods

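```python
def getMeanSquaredError(self, trueTheta, output):
    # Sum the squared differences between target and output values
    sumOfSquares = 0.0
    for i in range(self.numberOfOutputs):
        sumOfSquares += (trueTheta[i] - output[i]) ** 2

    # Average over the output values of this training item
    return sumOfSquares / self.numberOfOutputs
```
```python
import math

def getMeanCrossEntropy(self, trueTheta, output):
    total = 0.0
    for i in range(self.numberOfOutputs):
        # Weight the log of the predicted value by the target value;
        # terms with a zero target contribute nothing and are skipped
        if trueTheta[i] > 0:
            total += trueTheta[i] * math.log2(output[i])

    # Negate and average over the output values of this training item
    return -1.0 * total / self.numberOfOutputs
```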

References
Stanford’s Machine Learning Course
http://en.wikipedia.org/wiki/Entropy_(information_theory)
http://www.cs.rochester.edu/u/james/CSC248/Lec6.pdf
http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-7.html
James McCaffrey Neural Networks Book