Generative Adversarial Networks [2] – Camouflage your Predator!

You can imagine it as learning something without any prior knowledge: you just make mistakes and let others correct them for you. Continuing this process will lead you to learn how to do that thing, although you may still not know what you are doing?!


This was a comment from my friend Moustapha Saad on my recent post, which was an overview of Generative Adversarial Networks (GANs). I find his description very accurate; it captures the true nature of GANs, so I decided to use it as the title of this post.

This post is part 2 of the previous one. In the previous post, we talked about the general idea of GANs and about the discriminative and generative models that push each other to improve. In this post, we will focus on how generative models work in general and on some different types of generative models that could be used when building GANs.


Maximum Likelihood Generative Models

Assume that we have a dataset that contains m training samples. These samples follow an unknown probability distribution; it could be Gaussian, Bernoulli or any other probability distribution, and actually we don't care about knowing that distribution. Our focus is to be able to generate similar data that has the same distribution as the training dataset. We can formulate the problem as a Maximum Likelihood Estimation problem.

The idea of Maximum Likelihood is to build a model with some parameters Θ that is able to reproduce the data's probability distribution, P_model(X; Θ). As we have m samples in the training data, we can write the likelihood over the whole dataset as ∏_{i=1..m} P_model(x_i; Θ).

Now, the target is to find the Θ that maximizes the likelihood, so our mathematical representation of this problem is Θ* = argmax_Θ ∏_{i=1..m} P_model(x_i; Θ).

To avoid numerical problems, it is always better to transform to log space: since log(x * y) = log(x) + log(y), the product above turns into a sum. So, our final equation for obtaining the Maximum Likelihood estimate from the training data is Θ* = argmax_Θ Σ_{i=1..m} log P_model(x_i; Θ).
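As a quick illustration of this argmax (a minimal NumPy sketch, with made-up data and an assumed Gaussian model family with fixed variance), we can evaluate the log-likelihood over a grid of candidate Θ values and pick the best one:

import numpy as np

# Made-up training samples and an assumed model family: a Gaussian with
# unknown mean (theta) and fixed variance 1.
x = np.array([1.9, 2.1, 2.4, 1.7, 2.0])
thetas = np.linspace(0, 4, 401)          # candidate values of theta

def log_likelihood(theta):
    # sum_i log P_model(x_i; theta) for a unit-variance Gaussian
    return np.sum(-0.5 * (x - theta) ** 2 - 0.5 * np.log(2 * np.pi))

best = max(thetas, key=log_likelihood)
print(best, x.mean())   # the argmax lands (approximately) on the sample mean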

We call the distribution defined by the training samples the empirical distribution, and we can't be completely certain about the distribution we obtain after fitting the parameter Θ. The question is: how do we know whether the obtained distribution is really close to the actual distribution of the training data?

Kullback–Leibler divergence (KL-divergence) is a measurement that can be used to answer that question, and it works for both discrete and continuous probability distributions. Given two probability distributions P and Q, the KL-divergence is measured by D_KL(P || Q) = Σ_i P(i) log(P(i) / Q(i)).

The target is to minimize the KL-divergence; the smaller it is, the closer the two distributions P and Q are to each other. So, recalling Maximum Likelihood: after obtaining the empirical distribution, we measure the KL-divergence between the empirical distribution and the model distribution. In other words, minimizing the KL-divergence is equivalent to maximizing the likelihood of the training dataset.
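Here is a minimal NumPy sketch of that measurement for two small discrete distributions (the distributions themselves are made up for illustration):

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), for discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.3, 0.2]          # empirical distribution (made up)
q = [0.4, 0.4, 0.2]          # model distribution (made up)
print(kl_divergence(p, q))   # small positive number; it is 0 only when P == Q
print(kl_divergence(p, p))   # 0.0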


Explicit vs. Implicit Maximum Likelihood Models

After knowing the general idea of Maximum Likelihood models and their target, we need to be more specific when talking about Maximum Likelihood generative models. Actually, there are two different types of Maximum Likelihood models: explicit and implicit models.

Explicit Models: Models that fully depend on the training data and try to learn the parameters of P(X; Θ), such as the mean and variance, using regular gradient descent or any other optimization technique. To make the training efficient, we need to design a model that makes good use of the training samples; also, the collected training samples should be representative enough to cover all the variables of the probability distribution, which may be hard for some real-life data. Finally, explicit models require considerable computational power to learn the parameters.

Implicit Models: In these models, we don't actually use the data to learn the true distribution parameters. We assume that the distribution parameters are hidden and try to infer them using some observed variables from the data. These observed variables have no fixed rules for being determined; they are extracted by the researcher or by an expert who may have prior knowledge about the data. You can think of it as a stochastic inference process that could be done using Markov Chains: you use some observable data to get the hidden variables (called latent variables); refer to this post to know more about Markov Chains. We use these latent variables (which form a latent space) to generate similar data based on the given training samples. Implicit models are regularly used for dimensionality reduction of the data without any prior knowledge about the data distribution.

The question now is: to which of these types does the generative model in GANs belong?

Recalling GANs figure,

You will find that a GAN uses the latent space to generate data that is considered to follow a distribution similar to that of the real samples. So, it is an implicit model, but it doesn't use observable data the way Markov Chains do; it deals directly with the training samples. In other words, it treats the samples of the training data as the latent variables and changes their distribution by adding some noise to the sample features. That's why it is considered an implicit model: it doesn't use the data distribution, and it doesn't depend on observable data the way we do with Markov Chains.

To clarify more, recall the quote at the top of this post: you can imagine it as learning something without any prior knowledge; you just make mistakes and let others correct them for you. Continuing this process will lead you to learn how to do that thing, although you may still not know what you are doing?!

Actually, there are many different types of Maximum Likelihood models in general. The following figure (taken from Ian Goodfellow's tutorial, which you will find in the references) is a taxonomy of the different types of models.

[Figure: taxonomy of Maximum Likelihood generative models, from Ian Goodfellow's GAN tutorial]


The Generative Model’s Camouflage Process

In GANs, the generative model takes some real samples from the training data, without any prior knowledge about their distribution, and replaces some actual features with other fake features (camouflage). As the generative model is a Deep Neural Network, the fake features could be added in the hidden layers after the input is fed forward through the network. Alternatively, the noise could be added before the features are input to the network, which then produces the fake sample after feeding the noisy input forward. We can represent the generative model by the mathematical function f(X, Noise) = Z.

The discriminator (predator) then receives two signals: X, which is the real sample, and Z, which is the fake sample. The role of the discriminator is to assign a probability to each of them, using softmax for example. If it gives a probability close to 1 to the real sample, which means that it perfectly distinguishes between the real and fake samples, then the error (calculated using cross entropy) is backpropagated to the generative model to fine-tune its parameters (neural network weights) and enhance its ability to produce a trickier sample. If it gives a probability close to 1 to the fake sample, which means that it sadly doesn't distinguish between the real and fake samples, then the error is backpropagated to the discriminator model to fine-tune its parameters and enhance its ability to reject fake samples.
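To make this loop concrete, here is a minimal sketch of the two-player training step written with PyTorch. The post itself doesn't include code, so the network sizes, learning rates and the use of binary cross entropy here are illustrative assumptions, not the exact setup of any particular GAN:

import torch
import torch.nn as nn

# Illustrative sizes only
latent_dim, data_dim = 16, 32
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    # 1) Train the discriminator: real samples -> 1, generated samples -> 0
    fake = G(torch.randn(n, latent_dim)).detach()   # don't update G in this step
    loss_d = bce(D(real_batch), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 2) Train the generator: push D to assign 1 to freshly generated samples
    loss_g = bce(D(G(torch.randn(n, latent_dim))), torch.ones(n, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()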


Existing GANs with Popular Generative Models

Commonly, Denoising Auto-Encoders are used nowadays when building GANs. There is an interesting project about Face Completion using a Generative Adversarial Denoising Autoencoder; you can find the project website here: http://www.cc.gatech.edu/~hays/7476/projects/Avery_Wenchen/. There are also other models called Adversarial Autoencoders (AAEs) https://arxiv.org/pdf/1511.05644.pdf and Deep Convolutional Generative Adversarial Networks https://github.com/carpedm20/DCGAN-tensorflow, but I don't know much about them yet.

The idea of adversarial learning is dominant nowadays and is considered a hot topic in Deep Learning that could be used in multiple fields, such as Speech Recognition and Computer Vision.

 

References

– https://arxiv.org/pdf/1701.00160.pdf
– https://openreview.net/pdf?id=B16Jem9xe
– https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
– http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
– http://www.theanalysisfactor.com/what-is-a-latent-variable/
– http://www.cc.gatech.edu/~hays/7476/projects/Avery_Wenchen/
– https://arxiv.org/pdf/1406.2661.pdf
– http://cs.stanford.edu/people/karpathy/gan/

Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory

Before going into the main topic of this article, which is about a new neural network model architecture called Generative Adversarial Networks (GANs), we need to illustrate some definitions and models in Machine Learning and Artificial Intelligence in general.

Discriminative Models: Models that predict a hidden observation (called a class) given some evidence (called features). In other words, we have some features and observations about an entity, and we want to predict its class, category or label. You can imagine the model as a function that takes the features as input and produces an output. The criterion used to produce the output depends on the model architecture and nature.

So, a discriminative model can be described mathematically as f(x1, x2, x3, …, xn) = y, where n is the number of features, and the target of the function is to estimate the conditional probability P(y | x1, x2, x3, …, xn).

Support Vector Machines (SVMs) and Feedforward Neural Networks (FNNs) are examples of discriminative models that are used for classification.

Generative Models: Given some features, the model's target is to learn how these features are produced; it tries to learn the distribution of the features. Assume that we have features x1, x2, x3, …, xn, where n is the number of features; the model targets the joint probability distribution of the features with the classes.

We can formulate this mathematically by the joint probability P(x1, x2, x3, …, xn, y). After learning this distribution, we can estimate the conditional probability of the discriminative models, i.e., the probability of the class given these features and their distribution.

Restricted Boltzmann Machines (RBMs) and Hidden Markov Models (HMMs) are examples of generative models. Note that vanilla Auto-encoders (AEs) aren't considered generative models; all they do is reconstruct the features. Variational Auto-encoders (VAEs), on the other hand, belong to the generative models family.

Nash Equilibrium: A term used in Game Theory to describe a game situation in which every player is satisfied with the strategy he or she chose after the other players' strategies are revealed, and no player has an incentive to change strategy after learning the others', because changing it would not improve the outcome.

For example, assume we have a game in which each player has two choices, and both choices are known to be valid and to have the same effect on the game points or rewards. The first player's strategy may be to pick the first choice, while the other player picks the second one. After revealing that, each player is satisfied with the strategy he or she took, because the other's choice didn't affect him or her badly.

Minimax: An algorithm that belongs to Game Theory and Statistics. The algorithm is used in games with two participants, where each player tries to win by minimizing the worst case provided by the other player's move; in other words, the player minimizes the maximum move of the other player.

You can imagine the game of Chess, in which each player tries to win by making the best available move while the other player tries to minimize the value of that move. Minimax is commonly used when making an AI bot agent for Chess, Tic-tac-toe and Connect 4, and it generalizes to rule-based decision-making games.
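As a small illustration of the idea (not from the original post), here is a minimax sketch for a toy two-player game: players alternately remove 1 or 2 stones from a pile, and whoever takes the last stone wins. The score is from the maximizing player's point of view:

def minimax(stones, maximizing):
    # Assumed toy game: remove 1 or 2 stones per turn; taking the last stone wins.
    # Score is +1 if the maximizing player wins, -1 otherwise.
    if stones == 0:
        # The opponent just took the last stone, so the player to move has lost.
        return -1 if maximizing else 1
    scores = [minimax(stones - take, not maximizing) for take in (1, 2) if take <= stones]
    return max(scores) if maximizing else min(scores)

# Facing 4 stones, the maximizing player can force a win (prints 1)
print(minimax(4, True))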


Generative Adversarial Networks (GANs)

A GAN consists of two models, a discriminative model (D) and a generative model (G). These models are participants in the training phase, which looks like a game between them, and each model tries to be better than the other.

The target of the generative model is to generate samples that are fake but are supposed to have the same distribution as the original data samples; on the other hand, the discriminative model's target is to improve itself so that it can recognize the real samples among the fake samples generated by the generative model.

It looks like a game in which each player (model) tries to be better than the other: the generative model tries to generate samples that deceive and trick the discriminative model, while the discriminative model tries to get better at recognizing the real data and rejecting the fake samples. As mentioned before, this is the same idea as the Minimax algorithm, in which each player tries to defeat the other and minimize the expected loss.

This game continues until we reach a state in which each model has become an expert at what it does: the generative model increases its ability to capture the actual data distribution and produce data like it, and the discriminative model becomes an expert at identifying the real samples, which improves the overall classification performance. When each model is satisfied with its output (strategy), we have reached what is called a Nash Equilibrium in Game Theory.

During the training phase, the loss function that calculates the error is used to update the parameters of the two models (learning the weights). Neither model can change the other's parameters; the parameters are updated locally in each model using the global error.


This was an overview of this newly emerging model. I am still learning it and looking forward to using it in many applications, specifically in the Natural Language Processing field.

References

– https://arxiv.org/pdf/1701.00160v1.pdf
– http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
– https://en.wikipedia.org/wiki/Generative_adversarial_networks
– https://en.wikipedia.org/wiki/Minimax
– https://en.wikipedia.org/wiki/Nash_equilibrium
– Artificial Intelligence: A Modern Approach book by Stuart Russell and Peter Norvig

Another LSTM Tutorial

The figures are taken from this great blog post by Christopher Olah


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of Neural Network (NN) that is commonly used for problems that depend on sequential data. In sequential data, we should assume that the data points are dependent on each other. For example, if we have a sentence that contains some words, then to predict any word of it we need to memorize the previous words, because the words of a sentence are naturally related to each other in grammar and part-of-speech (POS).

Traditionally, in regular Multi-layer Perceptron (MLP) Neural Networks, we assume that the data points are independent of each other, which is wrong for data like text or sound.

RNNs have the capability to "memorize" previous data, as they contain self-loops. A unit saves its state and uses it with the new data. This helps the network take care of the dependencies between the data points and take them into consideration when predicting the output.

The following is a single RNN unit, A. It has an input X and an output h.

  

As you may have noticed, it is very similar to the traditional neuron in an MLP, but it differs in the self-loop at the unit A. The self-loop holds the previous state of the neuron and feeds it back in with the new input.

We can unroll (unfold) the model in the figure. You can see that it isn't much different from the traditional MLP model.


Through time, the self-loop value stores the experience from the previous data and uses it with the new input to obtain the predicted output. This helps to memorize the dependencies between the data points.

Long-term Dependencies Problems with Regular RNNs
As you may have noticed, there are no restrictions when saving the memory of the previous data; the network keeps saving the previous data without identifying whether this memory will be helpful in the next iterations or not. However, this isn't always right for real sequential data.

Assume that we have a sentence like this: "I live in France, I like playing football with my friends and going to the school, I speak French".

Assume that we want to predict the word "French". We don't need to look at the previous two clauses, "I like playing football with my friends" and "going to the school"; we only need to know that "I live in France" and ignore the unnecessary context that may confuse the network while training it.

Practically, a regular RNN can't connect the related information and dependencies together, specifically if the information contains some noise that could distract it from the actual target.

This problem is the main reason that pushed scientists to develop and invent a new variant of the RNN model, called Long Short-Term Memory (LSTM). LSTM can solve this problem because it controls the "memorizing" process within its units using something like "gates".


What is LSTM?

LSTM is a kind of RNN that is widely used in many fields of computer science, such as Natural Language Processing and Speech Recognition, because of its ability to avoid the problem of "long-term dependencies".

Like any other RNN, LSTM has the same idea of the self-loops. But LSTM differs from other RNNs in that each unit (neuron) contains some layers and gates that are specified by the user. Each of these layers and gates controls the output and the state of the neuron.

LSTM Effectiveness 

Regularly, when humans read a paragraph or a topic, they can easily extract the dependencies between the sentences that form the text. In stories and novels, you can match the events that happen in the story, extract the most important ones and connect them together to understand the story's ending.

Human brains can identify and memorize the importance of, and the dependencies between, the words of a sentence, and determine their POS tags. If you see a subject at the beginning of a sentence, your brain most likely predicts that the next word has a great chance of being a verb (VP) or a noun phrase (NP) that describes the subject, because you memorize that the previous word is a subject, and you don't need to look at the context before the subject, as you have determined that the subject is the beginning of a new context for predicting the next word's POS.

This is how LSTMs work: they simulate the same ability of our brains to connect the important or related objects together and forget the unnecessary objects in the context.

LSTM Unit Structure


This is a standard structure of an LSTM unit. It contains:-

  • 2 inputs (previous state C_t-1 and previous output h_t-1)
  • 4 layers (3 sigmoid and 1 tanh activations)
  • 5 pointwise operators (3 multiplications, 1 addition and 1 tanh)
  • 2 outputs (current state C_t and current output h_t)

The most important thing in an LSTM is the state. The state represents the information stored in the unit since the beginning of the sequence. We control the memorizing process by updating this state: if we want to forget the previous state, we set it to 0; if we want to control the amount of memorized information, we update its values during the training process. Next, we will discuss how the output and the update are done.

The user can change this structure according to the problem they want to solve. This is just an example of a standard LSTM unit.

Detailed Explanation of LSTM Processing
We can divide the internal processing within the unit into 3 groups. Each of these groups applies some layers and pointwise operators to produce the current state and the output.


 

Group 1.1: Input with the previous output for forgetting


This sigmoid activation layer is called the "forget gate layer", because it decides whether to forget the previous state (in which case the activation output is 0 for an element of the state vector) or to keep it (in which case the element's value is greater than 0 and less than or equal to 1).

First, we concatenate the input with the previous output; then we apply the sigmoid activation to the weighted sum of that concatenation.

 

Group 1.2: Forget gate layer with the previous state


 

To know what we need to forget from the previous state, we multiply the output of the "forget gate layer" with the previous state (element-by-element multiplication). If it produces a vector full of zeros, it means we want to forget all of the previous memories and start a new memory from the current input.

 

Group 2.1: Input with previous output for the new information



First, we need to add the new input that will be used to update the state. We call this sigmoid activation layer the "input gate layer"; it decides which values of the state vector will be updated.

Secondly, we need to generate a new state, called the candidate state, that will be added to the previous state (which could be full of 0s or scaled values, depending on the output of group 1.1).

Finally, we add them together to generate the current state of the unit, which holds the accumulated information.

 

Group 2.2: Scaling the new state with the input gate output


We multiply the newly generated candidate state with the input gate layer's output. This is like a scaling process: we adjust the values of the candidate state by the needed update factors to get the new information, which will be added to the previous state to form the whole state.

Adding groups 1 and 2 to get the current new state


To generate the new current state, we add the scaled candidate state to the (scaled) previous state. This state will be fed to the next time step (the current unit with the next input).

Group 3.1: Getting the unit output


 

When we produce the output, we use the state and the input; this output also helps the next time step decide what to use when producing its own state.

Firstly, we take the weighted sum of the concatenated previous output and input, then apply the sigmoid function. This decides which parts of the state we need to output.

Secondly, we apply the tanh operator to the current state to make sure its values are between -1 and 1.

Finally, we multiply the two results together to get the unit output, which exposes only the selected parts of the state.
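Putting the three groups together, here is a minimal NumPy sketch of a single LSTM step; the toy dimensions and random weights are assumptions for illustration only:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])   # previous output concatenated with the input
    f = sigmoid(W_f @ z + b_f)           # forget gate layer (group 1.1)
    i = sigmoid(W_i @ z + b_i)           # input gate layer (group 2.1)
    c_hat = np.tanh(W_c @ z + b_c)       # candidate state (group 2.1)
    c_t = f * c_prev + i * c_hat         # scale, then add groups 1 and 2
    o = sigmoid(W_o @ z + b_o)           # output gate (group 3.1)
    h_t = o * np.tanh(c_t)               # unit output
    return h_t, c_t

# Toy sizes and random weights, purely for illustration
hidden, inputs = 3, 2
rng = np.random.default_rng(0)
make_W = lambda: rng.standard_normal((hidden, hidden + inputs))
make_b = lambda: np.zeros(hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.standard_normal(inputs), h, c,
                 make_W(), make_W(), make_W(), make_W(),
                 make_b(), make_b(), make_b(), make_b())
print(h, c)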

Conclusion

Nowadays, LSTM is used in a very wide range of fields in computer science, and in Machine Learning specifically. It has practically proved itself in some hot research topics such as Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization and Question Answering.

References

http://colah.github.io/posts/2015-08-Understanding-LSTMs
https://en.wikipedia.org/wiki/Long_short-term_memory
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&rep=rep1&type=pdf
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

[Java] Hopfield Network Implementation

Why Java?

Well, actually I have a deep love for this language; I enjoy my time writing Java code. Also, it is good for me to write some OOP projects so I don't forget the concepts. Getting used to Python and other scripting languages may badly affect my design and OOP skills.

So, from time to time, I like to code some Java code. And here we go!

This is a light simple Java implementation of Hopfield Recurrent Neural Network.

Supported Features (continuously updated)

  • Data Normalization
  • Network Visualization using GraphStream

An example of a Hopfield network visualization with a pattern of size 5 (recognizing patterns of 5 elements)

– Here you can see the repository of the implementation.

– I am working on a brief tutorial that explains the Hopfield network in the README file.

 

Getting started with Tensorflow

TensorFlow is an open source library created by Google for deep learning tasks. The library mainly works with matrix operations; it represents the operations between matrices and data as a graph that shows the dependencies between the tasks. The edges between the nodes (called tensors) carry the outputs of the operations.

There are many advantages to using TensorFlow, such as reducing development time by avoiding hand-coding common mathematical operations, and it also supports running on GPUs.

Recently I added a solution for the XOR learning problem to my notebook using TensorFlow. It was the first time for me to use TensorFlow for Machine Learning, and I think it won't be the last, as I find it an awesome library.

This is a model from my machine learning masters course’s assignment.


You can check the solution here https://github.com/AhmedHani/CS624-ML/blob/master/Assignments/xor_learning.ipynb
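For a rough idea of what such a model looks like, here is a minimal sketch using the Keras API that ships with TensorFlow; the layer sizes, activations and training settings are assumptions for illustration and not necessarily what the notebook uses:

import numpy as np
import tensorflow as tf

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# A small MLP: one hidden layer is enough to learn XOR
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="tanh", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=2000, verbose=0)
print(model.predict(X).round())   # should approach [[0], [1], [1], [0]]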

[Kaggle] Poker Rule Induction

I wrote a note http://nbviewer.ipython.org/github/AhmedHani/Kaggle-Machine-Learning-Competitions/blob/master/Easy/PokerRuleInduction/PokerRuleInduction.ipynb about the Poker Rule Induction problem; the note explains the problem description and the steps I used to solve it.

It is a good problem for those who want to start solving problems on Kaggle and get to know some of the Python Machine Learning libraries that are commonly used there.

Hidden Markov Models (HMMs) Part I

Agenda:

  • Markov Chains
    – Structure
    – 1st Order Markov Chains
    – Higher Order Markov Chains
  • Hidden Markov Model
    – Structure
    – Why using HMMs ?
  • Forward-backward algorithm
  • Viterbi algorithm (Part II)
  • Baum-Welch algorithm (Part II)

This article assumes that you have a pretty good knowledge of some Machine Learning techniques, such as Naive Bayes, and a background in Recursion and Dynamic Programming.


Markov Chains

A Markov Chain is a probabilistic model that consists of a finite set of states. The states are connected to each other through edges, and there is a value associated with each of these edges.


Structure

The following figure shows a Markov diagram that consists of 2 states, each of them connected to the other through an edge (Note: states can also contain self-loops).

Sunny and Rainy are 2 states. The values are the transition probabilities between the states

Given the above diagram, we can say that the probability of going from the Sunny state to the Rainy state is 0.1, while staying Sunny is 0.9 (Note: the probabilities leaving a state must sum to 1).

We can represent the previous chain using a Stochastic Transition Matrix, in which each row describes the transitions from one state to the other states, so the sum of each row must be equal to 1. This is called a Right Stochastic Transition Matrix [Side Note #1].


The purpose of using Markov Chains is to predict the next state at time t + 1 given the current state at time t. Using the previous Markov model, we assume that the current state is Sunny, so the current state vector will be [1, 0] (Sunny is 1, Rainy is 0) because we are 100% sure that the current state is Sunny.

From the current state, we want to predict the next state. This can be done by multiplying the current state vector by the transition matrix; the result of the multiplication is a probability vector, and each of its elements shows the probability of being in a given state at the next time step.


The resulting vector shows that the probability of staying Sunny is much higher than the probability of changing to the Rainy state.
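Here is a minimal NumPy sketch of that prediction. The Sunny row (0.9, 0.1) comes from the diagram described above; the Rainy row is an assumed example, since the figure isn't reproduced here:

import numpy as np

T = np.array([[0.9, 0.1],     # from Sunny: P(Sunny), P(Rainy)
              [0.5, 0.5]])    # from Rainy: assumed values for illustration
state = np.array([1.0, 0.0])  # we are 100% sure the current state is Sunny

next_state = state @ T        # probabilities for the next time step
print(next_state)             # -> [0.9, 0.1]

# Applying the matrix repeatedly predicts further ahead, e.g. 4 steps:
print(state @ np.linalg.matrix_power(T, 4))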

What we have done so far is work with 1st order Markov Chains, in which we predict only the next state given the current one. We can also use the chain to predict a state that is m steps ahead, where m >= 2.

Assume that at time t the state S(t) is Sunny, and we want to predict the state after m steps, where m = 4. To predict the state after m steps, we need to find S(t + 4), which depends on S(t + 1), S(t + 2) and S(t + 3), which are unknown to us.

To derive the solution, we first need to solve for these intermediate states and then substitute back.

Predicting several steps ahead like this has many real-life applications, such as predicting the weather for the upcoming days or weeks.


Hidden Markov Models

Practically, it may be hard to access the patterns or classes that we want to predict. In the previous example (weather), there could be some difficulty in obtaining the weather's states directly (hidden states); instead, you can predict the weather state through some indicators (visible states).

Assume that you have a friend who is always badly affected by the weather, and these effects are visible to you; they may be Flu, Fever or Dry, while the weather states are Sunny, Cloudy or Rainy. Our target is to predict the weather state using the visible states.

Markov Chains can't be used in this situation because they deal directly with the states we want to predict, which aren't visible to us now. So it was necessary to develop another variant of Markov Chains that uses the visible states to predict the hidden states.


Structure


Figure 2

The above figure shows a Hidden Markov Model that consists of 3 visible states {Fever, Flu, Dry} and 3 hidden states {Sunny, Rainy, Cloudy}. We now need to extract information from this model.

The start state is the current state of the system. All the system knows now is that the next state will more likely be Sunny, because the transition probabilities between the start state and the predictable states {Rainy, Sunny, Cloudy} are {0.3, 0.4, 0.3} respectively, which makes Sunny the most likely initial prediction.

Hidden states are connected to each other through transitions called transition probabilities, which indicate how the state changes at the next time step. For example, if we are now in the Sunny state, there is a 0.3 probability that the next state will be Rainy; if we are in the Rainy state, the next state will be Cloudy with probability 0.7. Note that the sum of the outgoing transition probabilities of a state should be equal to 1.

Hidden states are connected to the visible or observed states through emission probabilities, which indicate your friend's health state given the weather state. For example, if the state is Sunny, your friend will most likely feel Dry with probability 0.6 and Fever with probability 0.4. Note that the sum of the outgoing emission probabilities should also be equal to 1.


Why using HMMs?

Hidden Markov Models are a very powerful technique used in sequential and structured prediction, like weather prediction; HMMs also shine in speech recognition and pattern recognition applications such as handwriting recognition, machine translation and language detection, all of which are based on sequences of signals or words.

Mainly, HMMs are used to solve 3 problems:-

  • Observations Probability Estimation
    – Given a model like above, observations, hidden states, transition probabilities and emission probabilities, estimate the probability of observation sequence given the model.
  • Optimal Hidden State Sequence
    – Given a model with sequence of observations, determine the optimal path or sequence of the hidden states
  • HMM Parameters Estimation
    – Choose the model parameters that maximize the probability of a specific observation given a specific state

Each of these problems is solvable by a different algorithm. The Probability Estimation, Optimal State Sequence and Parameter Estimation problems are solved using the Forward-Backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm respectively. We will discuss the first one next and the others in another post.


For the math, we need to introduce symbols that describe what we have in the model.

Our model will be λ = (Α, Β, π), where Α is the transition probability matrix between the hidden states, Β is the emission probability matrix between hidden states and observations, and π is the initial state probability distribution that the system starts from.

Observations Probability Estimation

Probability estimation is about estimating the probability that an observation sequence occurs given λ. This is considered an evaluation of the model, indicating how well it describes the current situation. In other words, consider the weather prediction problem: we want to determine the HMM that best describes the current month's weather given some observations such as wind, rain or others. This type of problem arises in many Speech Recognition and Natural Language Processing applications.

Brute Force

One of the solutions to this problem is brute force, in which we try all combinations of hidden state sequences for the given observation sequence.

  • We need to find P(O | λ), where O = {o1, o2, o3, …, oT} and λ is our HMM model, whose hidden state sequence is Q = {q1, q2, q3, …, qT}, with T the last time instant
  • For a fixed state sequence Q, P(O | Q, λ) is the product of the probabilities of each observation given the state at that time instant: P(O | Q, λ) = ∏t P(ot | qt, λ)
  • Each term P(ot | qt, λ) is obtained from the emission matrix Β, so P(O | Q, λ) = ∏t P(ot | qt, λ) = ∏t b_qt(ot)
  • The probability of the state sequence itself determines how likely the states are to follow each other at sequential time instants. It is the same idea as n-grams [Side Note #2].
  • That probability is estimated by P(Q | λ) = π_q1 ∏t=2..T P(qt | qt-1)
  • The term P(qt | qt-1) is taken from the transition matrix A of the model, so P(Q | λ) = π_q1 ∏t=2..T P(qt | qt-1) = π_q1 ∏t=2..T a_qt-1,qt
  • From the above we can obtain the joint probability P(O, Q | λ) = P(Q | λ) P(O | Q, λ) = π_q1 ∏t=2..T a_qt-1,qt ∏t b_qt(ot)
  • Note that all we have done is consider only one state sequence, so we need to modify the formula to consider all state sequences: P(O | λ) = ΣQ π_q1 ∏t a_qt-1,qt b_qt(ot)

Easy, right? But if the model has N states, there are N^T possible state sequences and about 2T calculations for each sequence, so the complexity is on the order of 2T·N^T, which isn't efficient at all.

Luckily, there is an optimized solution for this: the Forward-Backward algorithm.

Forward-backward Algorithm

The Forward-Backward algorithm is a dynamic programming approach that solves the above problem efficiently. To illustrate how it works, we will use the model in Figure 2.

At first, we need to extract all the information we need from the model: the transition probabilities, the emission probabilities and the initial probability distribution. Our states will be S = Sunny, R = Rainy and C = Cloudy, and the observations will be D = Dry, F = Fever and U = Flu.

  • The initial probabilities: P(R) = 0.3, P(S) = 0.4, P(C) = 0.3
  • The transition probabilities: from R: {R: 0.2, S: 0.1, C: 0.7}; from S: {R: 0.3, S: 0.4, C: 0.3}; from C: {R: 0.1, S: 0.4, C: 0.5}
  • The emission probabilities: from R: {D: 0.1, F: 0.4}; from S: {D: 0.6, F: 0.4}; from C: {D: 0.4, F: 0.4}; the Flu (U) probabilities are the remaining mass in each row

This algorithm is divided into 3 steps: calculating the forward probabilities, calculating the backward probabilities, and finally combining the two to obtain the full solution.

  1. Calculating Forward Probabilities

Let's assume that the observation sequence we want to check is O = {D, F, U, D}. We want to compute αt(i), which is the probability that the model outputs the first t observations of the sequence D -> F -> U -> D and ends in state i at time t.

First, let's compute the probability of being in each state and outputting the first observation D.

α1(R) = P(R) *  P(D | R) = 0.3 * 0.1 = 0.03
α1(S) = P(S) *  P(D | S) = 0.4 * 0.6 = 0.24
α1(C) = P(C) *  P(D | C) = 0.3 * 0.4 = 0.12

To compute the forward probability at time t = 2, we sum over all possible previous states, weighting each by its transition probability into the current state, and then multiply by the emission probability of the current observation. This is expressed by the following recursive relation.

αt(i) = (Σj αt-1(j) * P(Qt = i | Qt-1 = j)) * P(ot | Qt = i)

So,

We can compute the forward probabilities of the 3 states by the following

α2(R) = (α1(R) * P(R | R) + α1(S) * P(R | S) + α1(C) * P(R | C)) * P(F | R)
α2(R) = (0.03 * 0.2 + 0.24 * 0.3 + 0.12 * 0.1) * 0.4 = 0.036

α2(S) = (α1(R) * P(S | R) + α1(S) * P(S | S) + α1(C) * P(S | C)) * P(F | S)
α2(S) = (0.03 * 0.1 + 0.24 * 0.4 + 0.12 * 0.4) * 0.4 = 0.0588

α2(C) = (α1(R) * P(C | R) + α1(S) * P(C | S) + α1(C) * P(C | C)) * P(F | C)
α2(C) = (0.03 * 0.7 + 0.24 * 0.3 + 0.12 * 0.5) * 0.4 = 0.0612

We can now get the probabilities of t = 3 and so on …

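As a sanity check on the calculations above, here is a minimal NumPy sketch of the forward pass for this model. The Dry and Fever emission values are the ones used above; the Flu column is an assumption chosen so that each row sums to 1:

import numpy as np

# States in order R, S, C; observations encoded as D=0, F=1, U=2
pi = np.array([0.3, 0.4, 0.3])                 # initial probabilities
A = np.array([[0.2, 0.1, 0.7],                 # transitions from R to R, S, C
              [0.3, 0.4, 0.3],                 # transitions from S
              [0.1, 0.4, 0.5]])                # transitions from C
B = np.array([[0.1, 0.4, 0.5],                 # emissions from R: D, F, U
              [0.6, 0.4, 0.0],                 # emissions from S (U assumed)
              [0.4, 0.4, 0.2]])                # emissions from C (U assumed)
obs = [0, 1, 2, 0]                             # the sequence D, F, U, D

alpha = pi * B[:, obs[0]]                      # alpha_1 = [0.03, 0.24, 0.12]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]              # the recursive relation above
print(alpha)                                   # alpha_T(i) for each state i
print(alpha.sum())                             # P(O | lambda)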

Once we have finished the remaining time steps, we can go to the next step, which is calculating the backward probabilities.

2. Calculating Backward Probabilities

In the previous step, we calculated the probability of outputting the first t observations of O = {D, F, U, D} and ending in state i at time t. In this step we calculate the probability of generating the remaining observations ot+1:T, starting from state i at time t, where T is the length of the observation sequence.

βt(i) = P(ot+1, ot+2, …, oT | Qt = i)

The base case when calculating the values of βt(i) is at time T: what is the probability of being at time T and generating nothing? Why nothing? Because when we reach observation T, there are no other observations left in the sequence. So, for each state of the model, the probability of generating nothing is

βT(i) = 1

Going backwards from t = T - 1 down to t = 1, we use this formula to compute the values of β:

βt(i) = Σj P(Qt+1 = j | Qt = i) * P(ot+1 | j) * βt+1(j)

So,

β3(R) = (P(R | R) * P(D | R) * β4(R)) + (P(S | R) * P(D | S) * β4(S)) + (P(C | R) * P(D | C) * β4(C))
β3(R) = (0.2 * 0.1 * 1) + (0.1 * 0.6 * 1) + (0.7 * 0.4 * 1) = 0.36

β3(S) = (P(R | S) * P(D | R) * β4(R)) + (P(S | S) * P(D | S) * β4(S)) + (P(C | S) * P(D | C) * β4(C))
β3(S) = (0.3 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.3 * 0.4 * 1) = 0.39

β3(C) = (P(R | C) * P(D | R) * β4(R)) + (P(S | C) * P(D | S) * β4(S)) + (P(C | C) * P(D | C) * β4(C))
β3(C) = (0.1 * 0.1 * 1) + (0.4 * 0.6 * 1) + (0.5 * 0.4 * 1) = 0.45

Once we have calculated the backward probabilities at t = 3, we can calculate those at t = 2 using the same formula …


3. Add them together

Once we have calculated all the forward and backward probabilities, we can obtain the solution for the whole problem, because the forward step covers the observations from time 1 to t and the backward step covers the observations from time t + 1 to T.

So,

P(o1, o2, …, oT, Qt = i | λ) = αt(i) * βt(i), and summing this product over all states i gives P(O | λ).
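Here is a minimal NumPy sketch of the full procedure: the forward pass, the backward pass and the combination. It uses the same model values (and the same assumption about the Flu emission column) as the forward sketch above; for every time step t, the combined sum gives the same P(O | λ):

import numpy as np

pi = np.array([0.3, 0.4, 0.3])
A = np.array([[0.2, 0.1, 0.7],
              [0.3, 0.4, 0.3],
              [0.1, 0.4, 0.5]])
B = np.array([[0.1, 0.4, 0.5],
              [0.6, 0.4, 0.0],
              [0.4, 0.4, 0.2]])
obs = [0, 1, 2, 0]                               # D, F, U, D
T = len(obs)

alpha = np.zeros((T, 3))                         # forward pass
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

beta = np.ones((T, 3))                           # backward pass, beta_T(i) = 1
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# For every t, summing alpha_t(i) * beta_t(i) over the states gives P(O | lambda)
print([float((alpha[t] * beta[t]).sum()) for t in range(T)])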


[Side Note #1]:

A Stochastic Transition Matrix is a probability matrix that comes in 3 different types:

  • Right Stochastic Matrix in which the sum of each row is 1
  • Left Stochastic Matrix in which the sum of each column is 1
  • Doubly Stochastic Matrix in which the sum of each row and column are 1

[Side Note #2]: 

N-grams are an approach to language modelling that is commonly used in Natural Language Processing. N indicates the number of items that follow each other (a sequence) in the context. For instance, suppose we have the sentence "How about traveling to Egypt and see the Nile river ?". In text processing, it is necessary to know which words are related to each other, as together they produce the meaning of the sentence. In other words, the pair "How about" should be treated as related words, so when we find this pair in a sentence, the probability P(about | How) is higher than that of another pair such as P(Nile | How). N-grams also use the Markov property to calculate the probabilities of words occurring as a sequence. In N-grams, a pair of contiguous words is called a bigram (N = 2); we can also use more than 2 items, for example the probability of seeing "traveling" given "How" and "about", P(traveling | How, about), which is called a trigram, and so on.
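A tiny sketch of how such bigram probabilities can be estimated by counting (the one-sentence corpus is just for illustration):

from collections import Counter

# A tiny corpus, just for illustration
corpus = "how about traveling to egypt and see the nile river".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Maximum Likelihood estimate of a bigram probability: count(w1 w2) / count(w1)
def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("how", "about"))   # 1.0 in this one-sentence corpus
print(bigram_prob("how", "nile"))    # 0.0, the pair never occurs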


References:

Tools Used:

  • The HMM model is created using creatly.com
  • The tables and equations images are built using Microsoft Word 2013

Introduction to Reinforcement Learning

Hey, what’s up guys?

Today is the beginning of a series about an interesting field of science that is new to me. The series is about Reinforcement Learning.

The reason I started learning this topic is that I knew about supervised and unsupervised learning through Andrew Ng's Machine Learning course, and I thought that these two techniques were the only available approaches to machine learning (setting aside Evolutionary and Swarm methods). Later I found that there are other approaches, like Semi-supervised and Reinforcement Learning, so I decided to go through Reinforcement Learning to know more about it and its algorithms.

In this series I explain what I learned about Reinforcement Learning and mention the resources I used during my study, together with my code implementations. The main reasons for blogging this are that I didn't find many blogs talking about this field, that it will motivate me to finish what I want to learn, and that it will help me memorize what I studied; you can consider the posts as documentation for each chapter.

Now, let’s get started.
—————————————————————————————————————————————————————–

Chapter 1: Introduction to Reinforcement Learning

Before we get into our topic, we need to talk about Supervised and Unsupervised Learning and mention the differences between these 2 techniques, as they are very important for defining the meaning of Reinforcement Learning.

In supervised learning, we have training data, which could be images or numerical values; it consists of some training examples that were obtained beforehand. The training data is used to train the classifier, which could be any known classifier such as SVM, KNN, etc., or we may use Neural Nets. The classifier uses this labeled data to help identify and classify new data, known as test data. The main factor for classification is extracting the object's features, known as the feature vector, which are considered to be the descriptors of the object.

In unsupervised learning, you don't have training data; you are given a set of unlabeled data, and the task is to cluster this data into separate classes. Clustering is based on the features of the given items; the well-known algorithm for clustering is K-means and its variants, and the famous Hidden Markov Models (HMMs) are also used in unsupervised learning.

On the other hand, reinforcement learning is simply learning what to do to get the best benefit in the current situation; the agent has the ability to learn from the current environment after several trials. Going back to supervised learning, you will find that training data is a must for classification, and that isn't practical for some interactive problems in which the agent should decide what to do at run time in the current environment. The solution is that the agent should learn from its own experience instead of using training data, and that is what reinforcement learning introduces.

As with supervised and unsupervised learning, reinforcement learning has some basic characteristics that define and describe its process; we can summarize them in 3 points:-

  • Trials: The agent uses many trials in the same environment to observe and predict the opponent's behavior (like in a chess game)
  • Error: After each situation or state in the environment, the agent calculates the error in its search so it can minimize it in similar situations
  • Reward: The benefit the agent will obtain after performing the action

Also, the agent which uses reinforcement learning has two important choices when choosing its action or prediction:-

  • Exploitation: The ability to use a previous successful experience in the current situation
  • Exploration: The ability to explore other solutions or paths which could lead to better reward

When we build an agent that uses reinforcement learning to perform some actions, we first need to define some elements that the agent uses to learn in the environment. Scientists have identified 4 main elements of the learning system:-

  • Policy
  • Reward Function
  • Value Function
  • Model (optional)

We will talk about each of these elements, and to fully understand these elements we will use Tic-Tac-Toe as an example.

Policy defines the agent's behavior and attitude in the environment in a specific state. In Tic-Tac-Toe, the agent should play 'X' on each of its turns; this is the policy of the agent. Given this policy, an estimation should be made of the winning probability for each move.

Reward Function defines the benefit or desirability of the next state (only) after the agent's move; the higher the reward, the better the move. In Tic-Tac-Toe, when the agent plays 'X', it estimates the reward that will be obtained in the resulting state; in other words, a higher reward means a good move, and a lower reward indicates a bad move.

Value Function defines the total reward in the long run over all states, which is different from the Reward Function, which focuses only on the next state. We can conclude from that that the Value Function is more important than the Reward Function.

Model is an optional element when we build an agent and only appears in planning, which is used in games; the model predicts the next state of the current environment given the current state and the action the agent will perform. This simulation or prediction affects the decision-making process of the agent.

Note: A good agent may accept a lower reward in a specific state if that state leads it to other, higher-rewarded states, which can give a better Value Function.

The following diagram shows an agent in Tic-Tac-Toe that uses reinforcement learning to choose its next moves.


  • The opponent started the game at point 'a', turning the game state to state 'b'
  • In state 'b' the agent checked all possible states, presented as dashed lines, then selected one of them (the policy), presented as state 'c'
  • The opponent played its move, turning the game to state 'd'
  • In state 'd', the agent made an exploratory move that led it to state 'e*', but the agent found that it wasn't a good state and wouldn't contribute to learning
  • The agent backtracked and chose another state 'e'
  • The curved lines denote a backup of the state; the reason for this is to get a better winning estimate after each move, so the backup is an option for the agent

 

In the next chapter, we will talk about one of reinforcement learning's applications.

 

References

Data Normalization and Standardization for Neural Networks Output Classification

Agenda

  • Data standardization (encoding) for data input
    – Binary
    – Positive-Negative
    – Manhattan & Euclidean
    – Categories encoding
  • Standardization vs. Normalization
    – Min-Max Normalization
    – Gaussian Normalization
  • Data Decoding
    – Softmax activation function
    – Mean Squared Error
    – Entropy (Information Theory)
    – Mean Cross Entropy
  • MSE vs. MCE

This article assumes you have a pretty good knowledge about Neural Networks and some basic algorithms like Backpropagation.

When you work on neural networks, you always find yourself dealing with numeric data; basically, neural networks can only operate on numeric data. In algorithms such as backpropagation, or when you simulate a perceptron, you always use some functions or equations to calculate your output; when you build your network you use matrices to represent the biases and layer-to-layer values; and after you get the output, you estimate the error and try to reach the most optimal solution. As you see, you always use numeric data.

But what about the other data types? Could neural networks be built to make a good prediction or produce an optimal output given data like "food", "location" or "gender"?

The solution is to encode the non-numerical data and normalize it so that it is represented as numeric data; this operation is called "Data Encoding and Decoding", and the name "Data Standardization" is used too. Suppose your input data looks like "ID", "Name", "Nationality", "Gender", "Preferred food" and "Wage". For clarification, the following table represents a training set; each ith row represents the ith person.

ID                    Name               Gender                        Nationality                  Preferred food                        Wage

1                      Hani                  Male                             French                              Rice                                   30
2                      Saad                 Male                              Italian                               Pizza                                 40
3                      Sofia                 Female                         Russian                           Spaghetti                            15

To make it easy, I will represent the above table in a simple 2D matrix

I will take each column, then check if it contains numeric data or not.

The first column is "ID". The ID is just a unique identifier for each item in the training set, so this column isn't involved in the process.

The second column is the person's name; it isn't involved in the process either, but you need to map each name to its ID and use the ID instead of the name to avoid duplicate names in the training set.

The third column is the Gender. Obviously, the Gender type has only 2 values, Male or Female, and you have 3 ways to encode this data:

  • 0-1 Encoding (Binary method)
    – You are free to choose: Male will be set to 0 and Female to 1, or vice versa
  • +1 -1 Encoding
    – If a 0 value in your input would cause trouble, use +1 and -1 instead
  • Manhattan Encoding (x-y axis)
    – Simply use a pair of values. This method is good when you deal with more than 2 values; in our Gender case we set Male to [0, 1] and Female to [1, 0], or vice versa. This method works perfectly with up to 4 possible values; if there are more than 4 possible values, you can include -1 in your encoding, which is then called Euclidean Encoding.

The fourth column represents the person's nationality. There is no single standard approach for this kind of type: a person's nationality is represented as a string value, and some of you might encode each character of the string to ASCII and then use some constant to normalize the data, but the method I prefer, which gives me flexibility when working with this kind of data, is using matrices.

The matrices approach depends on the number of possible values of the category, so first of all we count the number of different values in the column. Back in our training set, we can see there are only 3 nationalities in the table. Let's create a matrix N that holds the column values; the size of the matrix is m x 1, where m is the number of different values in the input data, in our case m = 3.


We will create a new identity matrix A of size m x m.


Then we associate each value of N with the corresponding row of A.


Now we set each nationality value to its corresponding row, thus:

French = [1, 0, 0].
Italian = [0, 1, 0].
Russian = [0, 0, 1].

The only thing remaining is to replace each nationality string with its corresponding vector value.

This approach works perfectly with small and medium-sized input data, and reasonably well with larger amounts of input data.
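Here is a minimal NumPy sketch of this identity-matrix (one-hot) encoding for the Nationality column:

import numpy as np

# One-hot encoding: each distinct value gets one row of the identity matrix
values = ["French", "Italian", "Russian"]
identity = np.eye(len(values))                    # the identity matrix A
encoding = {v: identity[i] for i, v in enumerate(values)}

print(encoding["French"])    # [1. 0. 0.]
print(encoding["Russian"])   # [0. 0. 1.]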

Pros:

  • Dynamic
  • Easy to understand
  • Easy in coding

Cons:

  • Memory
  • Complicated when dealing with very huge amount of data
  • Bad performance with huge amount of data

The fifth column is about the preferred food of each person. This is treated like the previous column: we use the matrices approach and count the number of different foods in the input. There are some special cases that can appear in this column when it is given in other input data; we will talk about them in another post.

The last column is the wage. As you see, this column already uses numeric data to represent each person's wage per day (you don't say!). Will we leave these values without doing any processing on them? The answer is no: we shall normalize these values to suit the other encoded values. Experience shows that normalizing numeric data produces better output than leaving it unnormalized. But how will we normalize these values?

Before we answer this question, we first need to know exactly the difference between "Normalization" and "Standardization".

The relation between Normalization and Standardization is like the relation between Recursion and Backtracking: any Backtracking is Recursion, but not every Recursion is Backtracking. Any Standardization is considered Normalization, but not every Normalization is considered Standardization. To clarify, we need the definition of each of them.

In statistics, normalizing data commonly means bringing the values within 1; the values should only be in the intervals [0, 1] or [-1, 1]. For example in RGB, the basic value of each color channel is from 0 to 255; the values could be 55, 40, etc., but they can't exceed 255 or go below 0, and we may want to normalize the color values to be in the interval [0, 1]. The most common method for normalization is:

  • Min-Max Normalization

In Min-Max normalization, we use the formula below:

x_normalized = (x – min) / (max – min)

The resulting value won't exceed 1 or go below 0; you can use this method when you want values in the range [0, 1].

For [-1, 1], we use the formula below if we want the values to be centered around 0:

x_normalized = 2 * (x – min) / (max – min) – 1

Standardization is pretty much the same thing as Normalization, but Standardization calculates the Z-score; this transforms the data to have mean 0 and variance 1. The resulting value will be close to zero when the original value is close to the mean. It is done using the Gaussian normalization equation.

z = (x – μ) / σ, where μ is the mean and σ is the standard deviation

Check the following figure (from Wikipedia):-


As you see, there isn't much difference between Standardization and Normalization, but experience shows that using Gaussian normalization gives better output than Min-Max normalization.

Back to our table: as I said, it will be better to use standardization for the numeric values of "Wage", so I will use Gaussian normalization.

Step 1:
Compute the mean of the wage values: (30 + 40 + 15) / 3 = 28.33

Step 2:
Compute the standard deviation of the wage values.

Step 3:
Take each wage value and apply the Gaussian normalization equation z = (x – mean) / standard deviation.

As you see, the result is 0.09, which is near zero because the value 30 is very close to the mean 28.33. In addition, the value is positive because the wage is above the mean; if it were less than 28.33, the normalized value would be negative. That's why Gaussian normalization is better than Min-Max: it gives you more information about the original value.
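Here is a minimal NumPy sketch of both normalizations applied to the wage column. Note that np.std uses the population formula by default, so the exact z-score it prints may differ slightly from the 0.09 quoted above, but the value for the wage 30 still comes out as a small positive number:

import numpy as np

wages = np.array([30.0, 40.0, 15.0])

# Min-Max normalization into [0, 1]
min_max = (wages - wages.min()) / (wages.max() - wages.min())

# Gaussian normalization (z-score)
z_scores = (wages - wages.mean()) / wages.std()

print(min_max)    # 15 -> 0.0, 40 -> 1.0, 30 in between
print(z_scores)   # the value for 30 is close to zero, as discussed above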

Now, let's put all the above together to get the normalized input data to use in the neural network.

[Table: the fully encoded and normalized training data]


Data Decoding

All we have done so far is normalize the input data and use some encoding techniques to transform category-type values into numbers that suit neural networks. But what about the output? Obviously, the output for encoded input will be in encoded form too, so how can we get the right prediction or classification from it? In other words, how can we decode the output back to its original form?

The best way to understand what we will do next is by example. From our previous table, we assigned an encoded value to each nationality:

French = [1, 0, 0].
Italy = [0, 1, 0].
Russia = [0, 0, 1].

Assume that you get sample outputs like [0.6, 0.3, 0.1], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7] for the targets French, Italy and Russia respectively. How do you check whether these outputs are correct?

In this case, you can think of the values of each vector as probabilities; you can make this assumption when the values are in the range [0, 1] and the sum of all the vector's values equals 1. From this assumption we can easily get the right prediction from the output.

Output                         Nationality

[0.6, 0.3, 0.1]               [1, 0, 0]
[0.1, 0.7, 0.2]               [0, 1, 0]
[0.0, 0.3, 0.7]               [0, 0, 1]

Check each output vector against its corresponding nationality vector; you can see that in the first case, the value 0.6 is closer to 1 than 0.3 and 0.1, which shows that it is a right prediction. The same holds for the second and third cases, so we can conclude that the output makes good predictions.

We can use the previous approach when the values of the output vector are within [0, 1], but what can we do with an output like [3.0, 2.5, 7.0], where all the values are greater than 1? You can't use the previous method here, so what shall we do?

The solution is to use the softmax function: applying it transforms the output values into probabilities, so the new values lie in [0, 1] and sum to 1. The formula is
softmax(xi) = e^xi / Σj e^xj

Let’s calculate the new vector using the above equation

Step 1:
Exponentiate each value: e^3.0 ≈ 20.09, e^2.5 ≈ 12.18, e^7.0 ≈ 1096.63, and their sum ≈ 1128.90

Step 2:
Divide each exponential by the sum: 20.09 / 1128.90 ≈ 0.018, 12.18 / 1128.90 ≈ 0.011, 1096.63 / 1128.90 ≈ 0.971

So, the new output vector is approximately [0.018, 0.011, 0.971]; now you can use this vector for your classification.

There is a lot of math behind the softmax function; you can read more about it if you are interested.

def softMax(self, output):
    # exponentiate each raw output value (assumes `import math` at the top of the module)
    exps = [math.exp(output[i]) for i in range(0, self.numberOfOutputs)]
    total = sum(exps)

    # divide each exponential by the sum so the new values lie in [0, 1] and sum to 1
    newOutput = [exps[i] / total for i in range(0, self.numberOfOutputs)]

    return newOutput
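Calling this method on the example vector [3.0, 2.5, 7.0] returns roughly [0.018, 0.011, 0.971], matching the hand calculation above.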


Errors

All our assumptions so far rely on the neural network output always being correct, i.e. the output always matching the target. Practically, this isn't always true; you may face something like

Output                           Target.

[0.3, 0.3, 0.4]               [1, 0, 0]

According to the method we have used, the last value in the output vector is the nearest to 1, but that position doesn't match the 1 in our target vector, so we conclude that there is an error in the prediction or classification. It is important to compute your output error: it helps you improve the training and tells you how efficient your network is. To compute this error, there are 2 common approaches:

  • Mean Squared Error
  • Mean Cross Entropy

We will talk about each one of them; let's begin with Mean Squared Error (MSE).

From Wikipedia, the definition of MSE is “the mean squared error (MSE) of an estimator measures the average of the squares of the “errors”, that is, the difference between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss.”

That is, let's create some training items and their supposed target values

Output                           Target.

[0.3, 0.3, 0.4]                [1, 0, 0]
[0.2, 0.3, 0.5]                [0, 1, 0]

MSE = (1 / n) Σi Σj (tij - oij)², where n is the number of training items, t is the target vector and o is the output vector

First, let's get the sum of the squared differences between the two vectors of the first training item
(1 - 0.3)² + (0 - 0.3)² + (0 - 0.4)² = 0.49 + 0.09 + 0.16 = 0.74

The second training item
(0 - 0.2)² + (1 - 0.3)² + (0 - 0.5)² = 0.04 + 0.49 + 0.25 = 0.78

Finally, let's get the average of these sums
MSE = (0.74 + 0.78) / 2 = 0.76

As you see, the error is high (a perfect prediction would give an MSE of 0), which indicates that the predictions are far from the targets; this should guide you to train your network more.
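To double-check the arithmetic, here is a small standalone sketch (plain Python, independent of the neural network class) that reproduces the two per-item sums and their average:

outputs = [[0.3, 0.3, 0.4], [0.2, 0.3, 0.5]]
targets = [[1, 0, 0], [0, 1, 0]]

# sum of squared differences for each training item
itemErrors = [sum((t - o) ** 2 for t, o in zip(target, output))
              for target, output in zip(targets, outputs)]

mse = sum(itemErrors) / len(itemErrors)
print(itemErrors)  # roughly [0.74, 0.78]
print(mse)         # roughly 0.76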

Let's see the other approach, which I prefer, Mean Cross Entropy (MCE). Before we work through examples, I would like to introduce the concept of "Entropy"; if that bores you, you can skip ahead to the MCE equation and example.


In Information Theory, quantification is about measuring the amount of information you can gain from an event or sample; that amount of information influences your decisions. Assume that you are building a system that sends and receives data, for example Skype: you send data (speech) and receive data. How do you determine the best encoding for these voice signals? You can decide that once you know something about the data. For example, if your input data can only be "Yes" or "No", you can encode it using a single bit, so we say the Entropy of this case is 1, which is the minimum number of bits needed to encode it. There are other kinds of information you can obtain from an event, such as its probability distribution, which is what I will focus on.

So, Entropy is the key measurement of the average number of bits needed to encode an event; from this we can see that the more information an event carries, the higher the Entropy value we should expect.

As I said before, the probability distribution of an event is valuable information, and we can use Entropy to evaluate it: the Entropy measures the randomness in the distribution, and it is highest when the distribution is uniform. Assume that you have variables X and Y with actual distributions [0.3, 0.2, 0.1, 0.4] and [0.5, 0.5] respectively. To calculate the Entropy of X, E(X), we use

E(X) = -Σi P(i) log2 P(i) = -(0.3 log2 0.3 + 0.2 log2 0.2 + 0.1 log2 0.1 + 0.4 log2 0.4) ≈ 1.85

E(Y) = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1

As you see, the Entropy of the first distribution X is 1.85, while the Entropy of Y is 1; this is because the randomness in the X distribution is higher (more information) than in Y (less information).

You can also use Entropy to compare two or more probability models; this comparison shows you how close or far each model is from your target model. Assume that you have a variable T with actual probability distribution [0.2, 0.1, 0.7], and two probability models X and Y with distributions [0.3, 0.1, 0.6] and [0.3, 0.3, 0.4] respectively, and you want to find the model that is closest to your target model T. The first step is to calculate your target's Entropy

E(T) = -(0.2 log2 0.2 + 0.1 log2 0.1 + 0.7 log2 0.7) ≈ 1.16

The second step is to calculate the Cross Entropy (CE). CE is a variant of the Entropy function that estimates how close a model B is to the true model A

CE(A, B) = -Σi A(i) log2 B(i)

Let's estimate how close model X is to model T

CE(T, X) = -(0.2 log2 0.3 + 0.1 log2 0.1 + 0.7 log2 0.6) ≈ 1.20

Model Y with T

CE(T, Y) = -(0.2 log2 0.3 + 0.1 log2 0.3 + 0.7 log2 0.4) ≈ 1.45

From this we can observe that model X (CE ≈ 1.20, close to T's Entropy of 1.16) is much closer to T than model Y (CE ≈ 1.45).
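To make these Entropy and Cross Entropy calculations concrete, here is a small standalone sketch (the function names entropy and cross_entropy are just illustrative):

import math

def entropy(p):
    # E(P) = -sum(P(i) * log2(P(i))), skipping zero probabilities
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # CE(P, Q) = -sum(P(i) * log2(Q(i)))
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

T = [0.2, 0.1, 0.7]
X = [0.3, 0.1, 0.6]
Y = [0.3, 0.3, 0.4]

print(round(entropy(T), 2))           # roughly 1.16
print(round(cross_entropy(T, X), 2))  # roughly 1.20
print(round(cross_entropy(T, Y), 2))  # roughly 1.45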

There are many other variants of Entropy; I only mentioned the ones you might use as a programmer in your applications. If you are interested in Entropy as a key concept of Information Theory, you can search for it and find good papers on the subject.


Mean Cross Entropy (MCE) is my preferred approach for computing the error of a neural network's category outputs. I will use the same example as for MSE.

Output                           Target.

[0.3, 0.3, 0.4]                [1, 0, 0]
[0.2, 0.3, 0.5]                [0, 1, 0]

MCE measures, on average, how far the neural network output is from the target output. The MCE formula is the following

MCE = -(1 / n) Σi Σj tij log2(oij), where n is the number of training items, t is the target vector and o is the output vector

Let’s begin with the first training item

-(1 × log2(0.3) + 0 × log2(0.3) + 0 × log2(0.4)) = -log2(0.3) ≈ 1.74

The second training item

-(0 × log2(0.2) + 1 × log2(0.3) + 0 × log2(0.5)) = -log2(0.3) ≈ 1.74

MCE calculation,

MCE = (1.74 + 1.74) / 2 ≈ 1.74

A perfect prediction would give an MCE of 0 (because -log2(1) = 0), so an MCE of about 1.7 indicates that the predictions are still far from the targets.
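Again, a small standalone sketch (independent of the neural network class) reproduces this value:

import math

outputs = [[0.3, 0.3, 0.4], [0.2, 0.3, 0.5]]
targets = [[1, 0, 0], [0, 1, 0]]

# cross entropy of each training item: -sum(target * log2(output))
itemErrors = [-sum(t * math.log2(o) for t, o in zip(target, output))
              for target, output in zip(targets, outputs)]

mce = sum(itemErrors) / len(itemErrors)
print(round(mce, 2))  # roughly 1.74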

If you find that you didn't understand this well, please go back through the Entropy section above, where you will find the information you need to fully understand the MCE approach.


MCE vs. MSE

Well, in machine learning the answer is always "it depends on the problem", but both of them affect the gradients used in backpropagation training; for classification outputs passed through softmax, cross entropy usually produces larger, more useful gradients than MSE when the prediction is badly wrong, which is one reason many people prefer it.

Here is the implementation of both methods; each computes the error for a single training item (note that both divide by the number of outputs, so the values are scaled slightly differently from the hand calculations above)

def getMeanSquaredError(self, trueTheta, output):
    # sum of squared differences between the target vector and the output vector
    sumOfSquares = 0.0

    for i in range(0, self.numberOfOutputs):
        sumOfSquares += pow(trueTheta[i] - output[i], 2)

    # average over the output values of this training item
    return sumOfSquares / self.numberOfOutputs

def getMeanCrossEntropy(self, trueTheta, output):
    # cross entropy: -sum(target * log2(output)); assumes `import math` at the top of the module
    crossEntropySum = 0.0

    for i in range(0, self.numberOfOutputs):
        crossEntropySum += trueTheta[i] * math.log2(output[i])

    # average over the output values of this training item
    return -1.0 * crossEntropySum / self.numberOfOutputs

References

  • Stanford's Machine Learning Course
  • http://en.wikipedia.org/wiki/Entropy_(information_theory)
  • http://www.cs.rochester.edu/u/james/CSC248/Lec6.pdf
  • http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-7.html
  • James McCaffrey's Neural Networks book