Overview: Generative Adversarial Networks – When Deep Learning Meets Game Theory

Before going into the main topic of this article, a new neural network architecture called Generative Adversarial Networks (GANs), we need to review some definitions and models from Machine Learning and Artificial Intelligence in general.

Discriminative Models: Models that predict a hidden observation (called the class) given some evidence (called the features). In other words, we have some features and observations about an entity, and we want to predict its class, category or label. You can imagine the model as a function that takes the features as input and produces an output. The criterion used to produce the output depends on the architecture and nature of the model.

So, a discriminative model can be described mathematically as f(x1, x2, x3, …, xn) = y, where n is the number of features, and the goal of the function is to estimate the conditional probability P(y|x1, x2, x3, …, xn).

Support Vector Machines (SVMs) and Feedforward Neural Networks (FNNs) are examples of discriminative models that are used for classification.

Generative Models: Given some features, the model’s goal is to learn how these features are produced; it tries to learn the distribution of the features. Assume that we have features x1, x2, x3, …, xn, where n is the number of features. The model aims to learn the joint probability distribution of the features together with the classes.

We can formulate this mathematically as the joint probability P(x1, x2, x3, …, xn, y). After learning this distribution, we can also estimate the conditional probability used by discriminative models, P(y|x1, x2, x3, …, xn), by dividing the joint probability by the probability of the features, P(x1, x2, x3, …, xn).
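
To make the step from the joint probability to the conditional one concrete, here is a minimal sketch. The joint table, the feature values and the class names are made up for illustration; a real generative model would learn these numbers from data.

```python
# Hypothetical joint distribution P(x, y) over one feature x and a class y.
joint = {
    ("sunny", "play"): 0.40,
    ("sunny", "stay"): 0.10,
    ("rainy", "play"): 0.15,
    ("rainy", "stay"): 0.35,
}

def conditional(joint, x):
    """Return P(y | x) for every class y, using P(y | x) = P(x, y) / P(x)."""
    # Marginal P(x): sum the joint probability over all classes.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

print(conditional(joint, "sunny"))
```

The same table therefore answers both kinds of questions: how likely a (feature, class) pair is, and which class is most probable once the feature is known.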

Restricted Boltzmann Machines (RBMs) and Hidden Markov Models (HMMs) are examples of generative models. Note that vanilla Auto-encoders (AEs) aren’t considered generative models, since all they do is reconstruct the features. Variational Auto-encoders (VAEs), on the other hand, belong to the family of generative models.

Nash Equilibrium: A concept from Game Theory that describes a situation in which every player is satisfied with the decision he/she made even after the other players’ strategies are revealed. No player has any incentive to change strategy after learning the others’ strategies, because no change made by that player alone would improve his/her own outcome.

For example, assume we have a game in which each player chooses between 2 options, and both options are known to be valid and to yield the same game points or rewards. The first player’s strategy may be to pick the first option, while the other player picks the second one. After the choices are revealed, each player is still satisfied with the strategy he/she took, because the other’s choice hasn’t affected him/her badly.
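
As a rough sketch of this definition, the function below checks every cell of a small 2-player payoff table and keeps the cells where neither player could gain by switching alone. The function name and both payoff tables are my own illustrative assumptions.

```python
def pure_nash_equilibria(payoffs):
    """payoffs[i][j] = (reward of player 1, reward of player 2)
    when player 1 picks option i and player 2 picks option j."""
    rows, cols = len(payoffs), len(payoffs[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            p1, p2 = payoffs[i][j]
            # Player 1 cannot gain by switching while player 2 stays at j.
            p1_ok = all(payoffs[k][j][0] <= p1 for k in range(rows))
            # Player 2 cannot gain by switching while player 1 stays at i.
            p2_ok = all(payoffs[i][k][1] <= p2 for k in range(cols))
            if p1_ok and p2_ok:
                equilibria.append((i, j))
    return equilibria

# The game from the text: both options give the same reward to everyone.
same_reward = [[(1, 1), (1, 1)],
               [(1, 1), (1, 1)]]

# A second, hypothetical game (prisoner's-dilemma style payoffs).
dilemma = [[(-1, -1), (-3, 0)],
           [(0, -3), (-2, -2)]]

print(pure_nash_equilibria(same_reward))
print(pure_nash_equilibria(dilemma))
```

In the first game every combination of choices is an equilibrium, which matches the example above: nobody regrets their choice. In the second one only a single cell survives the check.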

Minimax: A decision rule used in Game Theory and Statistics. It applies to games with 2 players in which each player tries to win by minimizing the worst case that the other player’s move can cause. In other words, the player minimizes the maximum gain of the other player.

Think of Chess, in which each player tries to make the best available move while the opponent tries to minimize the benefit of that move with the best reply from his/her side. Minimax is commonly used when building AI agents for Chess, Tic-tac-toe and Connect-4, and it generalizes to other rule-based decision-making games.
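
Here is a minimal sketch of Minimax on a tiny hand-made game tree. The nested-list encoding and the payoff numbers are assumptions for illustration; a real Chess or Tic-tac-toe agent would generate the tree from the game rules.

```python
def minimax(node, maximizing):
    """Return the value of a game tree given as nested lists.
    Leaves are numeric payoffs from the maximizing player's point of view."""
    if isinstance(node, (int, float)):
        return node  # leaf: the game is over, return its payoff
    # The turn alternates at every level of the tree.
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# The first player has 3 possible moves, the opponent then has 2 replies each.
tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(tree, True))
```

The opponent answers each of the three moves with the reply that is worst for us, so the first player compares the worst cases and picks the best of them.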


Generative Adversarial Networks (GANs)

GANs consist of 2 models, a discriminative model (D) and a generative model (G). Both models take part in a training phase that looks like a game between them, in which each model tries to do better than the other.

The goal of the generative model is to generate fake samples that are supposed to follow the same distribution as the original data samples. The goal of the discriminative model, on the other hand, is to improve itself so that it can tell the real samples apart from the fake samples generated by the generative model.

It looks like a game in which each player (model) tries to be better than the other: the generative model tries to generate samples that deceive the discriminative model, while the discriminative model tries to get better at recognizing the real data and rejecting the fake samples. As mentioned before, this is the same idea as Minimax, in which each player aims to beat the other and minimize its own expected loss.

This game continues until it reaches a state in which each model has become an expert at what it is doing: the generative model has improved its ability to capture the actual data distribution and produce data like it, and the discriminative model has become as good as it can be at identifying the real samples, which improves the system’s classification ability. At that point neither model can gain by changing its own output (strategy), which is what Game Theory calls a Nash Equilibrium.

During the training phase, the loss function, which measures the error, is used to update the parameters (weights) of both models. Neither model can change the other’s parameters: each model updates only its own parameters, using the shared error.
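
To give a feeling for that shared error, here is a small sketch of the two losses in their basic minimax form. The function names are mine, and d_real and d_fake are hypothetical lists holding the discriminator’s outputs (probabilities of being real) on real and generated samples; no actual networks are trained here.

```python
import math

def discriminator_loss(d_real, d_fake):
    """D wants D(x) close to 1 on real samples and D(G(z)) close to 0 on fakes."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """G wants D(G(z)) close to 1, so it minimizes log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# A confident discriminator versus a completely fooled one.
print(discriminator_loss([0.9, 0.9], [0.1, 0.1]), generator_loss([0.1, 0.1]))
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

Both losses are computed from the same discriminator outputs, but each model only uses its own loss to update its own weights. When the discriminator cannot tell real from fake and outputs 0.5 everywhere, its loss rises while the generator’s loss drops.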


This was an overview of this newly emerging model. I am still learning about it and looking forward to using it in many applications, specifically in the Natural Language Processing field.

References

– https://arxiv.org/pdf/1701.00160v1.pdf
– http://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html
– https://en.wikipedia.org/wiki/Generative_adversarial_networks
– https://en.wikipedia.org/wiki/Minimax
– https://en.wikipedia.org/wiki/Nash_equilibrium
– Artificial Intelligence: A Modern Approach, book by Stuart Russell and Peter Norvig

Another LSTM Tutorial

The figures are taken from this great blog post by Christopher Olah


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of Neural Network (NN) commonly used in problems that depend on sequential data. In sequential data, we should assume that the data points depend on each other. For example, if we have a sentence made of some words, then to be able to predict any word of it we need to remember the previous words, because the words of a sentence are naturally related to each other, in grammar or part-of-speech (POS).

Traditionally, in regular Multi-layer Perceptron (MLP) Neural Networks, we assume that the data points are independent of each other, which is wrong for some data such as text or sound.

RNNs are able to “memorize” the previous data because they contain self-loops. The network saves its state and uses it together with the new data. This helps the network capture the dependencies between the data points and take them into consideration when predicting the output.

The following is a single RNN unit, A. It has an input X and an output h.

  

As you can see, it is very similar to the traditional neuron in an MLP, but it differs in the self-loop at the unit A. The self-loop holds the previous state of the neuron and feeds it back in together with the new input.

We can unroll (unfold) the model in the figure. You can see that it isn’t much different from the traditional MLP model.

Through time, the self-loop stores the experience gathered from the previous data and uses it with the new input to obtain the predicted output. This helps the network memorize the dependencies between the data points.
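
A minimal sketch of this self-loop, using a single unit with scalar values. The weights w_x, w_h and b are arbitrary numbers I picked for illustration, not trained values, and tanh is just one common choice of activation.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a single RNN unit: mix the new input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0  # initial state: nothing memorized yet
    states = []
    for x in inputs:  # unrolling: the same unit and weights, once per time step
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Only the first input is non-zero, yet the later states still "remember" it.
print(run_rnn([1.0, 0.0, 0.0]))
```

The loop is the unrolled version of the unit: each iteration receives the state left behind by the previous one, which is how the first input keeps influencing later outputs.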

Long-term Dependencies Problems with Regular RNNs
As you may have noticed, there are no restrictions on what gets saved in the memory of the previous data: the network keeps storing the previous data without identifying whether this memory will be helpful in the next iterations or not. In practice, however, this isn’t always appropriate for real sequential data.

Assume that we have a sentence like this: “I live in France, I like playing football with my friends and going to the school, I speak french”

Assume that we want to predict the word “french”. We don’t need to look at the previous two phrases, “I like playing football with my friends” and “going to the school”; we only need to know that “I live in France”, and we can ignore the unnecessary context that may confuse the network while training it.

In practice, a regular RNN struggles to connect related information and dependencies, especially when the information is separated by noise that distracts from the actual target.

This problem is the main reason that pushed scientists to develop a new variant of the RNN model, called Long Short-Term Memory (LSTM). LSTM can solve this problem because it controls the “memorizing” process within its units using “gates”.


What is LSTM?

LSTM is a kind of RNN that is widely used in many fields of computer science, such as Natural Language Processing and Speech Recognition, because of its ability to avoid the problem of “long-term dependencies”.

Like any other RNN, LSTM is built on the idea of self-loops. But LSTM differs from other RNNs in that each unit (neuron) contains some layers and gates that are specified by the user. Each of these layers and gates controls the output and the state of the neuron.

LSTM Effectiveness 

Normally, when humans read a paragraph or a topic, they can easily extract the dependencies between the sentences that make up the text. In stories and novels, you can match up the events that happen in the story, pick out the most important ones and connect them together to be able to understand how the story ends.

Human brains can identify and memorize the importance of, and the dependencies between, the words of a sentence, and determine their POS tags. If you see a subject at the beginning of a sentence, then your brain most likely predicts that the next word has a great chance of being a verb phrase (VP) or a noun phrase (NP) that describes the subject. You can do this because you remember that the previous word is a subject, and you don’t need to look at the context before the subject, since you have determined that the subject is the beginning of a new context for predicting the POS of the next word.

This is how LSTMs work: they simulate this ability of our brains to connect the important or related objects together and forget the unnecessary objects in the context.

LSTM Unit Structure

This is a standard structure of an LSTM unit. It contains:

  • 2 inputs carried over from the previous step (previous state C_t-1 and previous output h_t-1), alongside the current input
  • 4 layers (3 sigmoid and 1 tanh activations)
  • 5 point operators (3 multiplications, 1 addition and 1 tanh operators)
  • 2 outputs (current state C_t and current output h_t)

The most important thing in an LSTM is the state. The state represents the information the unit has stored so far. We control the memorizing process by updating this state. If we want to forget the previous state, we set it to 0; if we want to control the amount of memorized information, we update its values during the training process. Next, we will discuss how the output and the update are computed.

The user can change this structure according to the problem they want to solve. This is just an example of a standard LSTM unit.

Detailed Explanation of LSTM Processing
We can divide the internal processing within the unit into 3 groups. Each group applies some of the layers and operators to produce the current state and the output.

 

Group 1.1: Input with the previous output for forgetting

This sigmoid activation layer is called the “forget gate layer”, because it decides how much of the previous state to forget. When the activation output is 0 for an element of the state vector, that element is forgotten; when we want to keep using the previous state, the values are higher than 0 and less than or equal to 1.

First, concatenate the input with the previous output; then apply the sigmoid activation to the weighted sum of the concatenated vector.

 

Group 1.2: Forget gate layer with the previous state

To decide what to forget from the previous state, we multiply the output of the “forget gate layer” with the previous state (element-by-element multiplication). If this produces a vector full of zeros, it means we want to forget all of the previous memories and start a new memory from the current input.

 

Group 2.1: Input with previous output for the new information

First, we need to bring in the new input that will be used to update the state. We call this sigmoid activation layer the “input gate layer”. It decides which values of the state vector will be updated.

Second, we need to generate a new state, called the candidate state, that will be added to the previous state (which could be full of 0s or hold other values, depending on the output of group 1.1).

Finally, we add them together to generate the current state of the unit, which holds the resulting information.

 

Group 2.2: Scaling new state with the input gate scalar.

We multiply the newly generated state with the output of the input gate layer. This works like a scaling process: we adjust the values of the newly generated state by the needed update factors to get the new information, which will be added to the previous state to get the whole state.

Adding groups 1 and 2 to get the current new state

To generate the new current state, we add the newly generated (scaled) state to the previous state (after the forget gate has been applied). This state will be fed to the next unit (the current unit with the next input).

Group 3.1: Getting the unit output

When we produce the output, we need to use the state and the input to help the next unit decide what it will use to produce its own state.

First, we take the weighted sum of the concatenated previous output and input, then apply the sigmoid function. This decides which parts of the state we need to output.

Second, we apply the tanh operator to make sure that the values of the current state lie between -1 and 1.

Finally, we multiply the two results to get the unit output, which carries some parts of the state and the input.
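
Putting the three groups together, one step of an LSTM unit can be sketched as follows. To keep it short, everything is a scalar instead of a vector, and the parameter dictionary p (one (w_h, w_x, b) triple per layer) is a layout I made up for this example; real implementations use weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM unit; p maps a layer name to (w_h, w_x, b)."""
    def layer(name, activation):
        w_h, w_x, b = p[name]
        # Weighted sum of [previous output, input], then the activation.
        return activation(w_h * h_prev + w_x * x + b)

    f = layer("forget", sigmoid)             # group 1: how much old state to keep
    i = layer("input", sigmoid)              # group 2: how much new information to let in
    c_tilde = layer("candidate", math.tanh)  # group 2: candidate state
    c = f * c_prev + i * c_tilde             # forgotten old state + scaled new state
    o = layer("output", sigmoid)             # group 3: which part of the state to output
    h = o * math.tanh(c)                     # squash the state into (-1, 1), then gate it
    return h, c

# Hypothetical parameters: a strongly positive forget bias keeps the old state,
# a strongly negative input bias blocks new information.
keep = {"forget": (0, 0, 100), "input": (0, 0, -100),
        "candidate": (0, 0, 0), "output": (0, 0, 0)}
print(lstm_step(1.0, 0.5, 2.0, keep))
```

With these extreme biases the new state stays equal to the old one. Flipping the two biases makes the unit forget everything and start over from the current input.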


Conclusion

Nowadays, LSTM is used in a wide range of fields in computer science, and in Machine Learning specifically. It has proved itself in practice in several hot research topics such as Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization and Question Answering.

References

http://colah.github.io/posts/2015-08-Understanding-LSTMs
https://en.wikipedia.org/wiki/Long_short-term_memory
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.248.4448&rep=rep1&type=pdf
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

[Java] Hopfield Network Implementation

Why Java?

Well, actually I have a deep love for this language, and I enjoy my time writing Java code. Also, it is good for me to write some OOP projects so that I don’t forget the concepts. Getting used to Python and other scripting languages may badly affect my design and OOP skills.

So, from time to time, I like to write some Java code. And here we go!

This is a light, simple Java implementation of a Hopfield Recurrent Neural Network.

Supported Features (continuously updated)

  • Data Normalization
  • Network Visualization using GraphStream

An example of a Hopfield network visualization with a pattern of size 5 (recognizing a pattern of 5 elements)

– Here you can see the repository of the implementation.

– I am working on a brief tutorial in the README file that explains the Hopfield network.

 

Prim’s Running Time With Heaps

Input: Given a graph with N nodes and M weighted edges, we want to obtain the Minimum Spanning Tree (MST) of the graph.

Note: An MST is a tree formed within a graph that connects all the graph nodes with the minimum total edge weight; cycles aren’t allowed in the tree.

Output: Using Prim’s algorithm, we obtain a tree (a collection of nodes and edges) that satisfies the above conditions (the green edges in the figure form the MST).

Algorithm Steps (Taken from Wikipedia):

  1. Initialize a tree with a single vertex, chosen arbitrarily from the graph.
  2. Grow the tree by one edge: of the edges that connect the tree to vertices not yet in the tree, find the minimum-weight edge, and transfer it to the tree.
  3. Repeat step 2 (until all vertices are in the tree).

To put all the graph nodes into the tree, we need to iterate over all of the graph nodes. So, this outer loop costs O(N).

In each iteration, we find the minimum-weight edge to grow the tree. In general, finding the minimum value by scanning costs O(X), where X is the size of the data. Here, we have M edges, so finding the minimum edge takes at most O(M) steps per iteration, where M is the number of graph edges.

Combining both, we obtain the final running time, which is O(NM).
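
The O(NM) version can be sketched directly from the steps above. The function name, the edge-list format and the small test graph are my own choices for illustration.

```python
def prim_naive(n, edges):
    """edges: list of (u, v, weight) for an undirected graph with nodes 0..n-1.
    Returns the list of MST edges."""
    in_tree = {0}  # step 1: start from an arbitrary vertex
    mst = []
    while len(in_tree) < n:        # O(N) iterations
        best = None
        for u, v, w in edges:      # O(M) scan for the cheapest crossing edge
            # An edge crosses if exactly one endpoint is already in the tree.
            if (u in in_tree) != (v in in_tree):
                if best is None or w < best[2]:
                    best = (u, v, w)
        if best is None:
            raise ValueError("graph is not connected")
        mst.append(best)
        in_tree.update(best[:2])
    return mst

edges = [(0, 1, 1), (1, 2, 2), (0, 2, 4), (2, 3, 3), (1, 3, 5)]
print(prim_naive(4, edges))
```

The nested loops are exactly where the O(N) and O(M) factors come from: one full pass over the edges for every node added to the tree.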

Time Improvement Proof Using Heaps:

The heap is a popular data structure. Nearly all programming languages provide an implementation in their collections or abstract data type libraries.

The heap is a binary tree (each node has at most 2 children). In a max heap, the tree orders the values so that the root and every parent node are always greater than or equal to their children; in a min heap, the root and every parent node are always less than or equal to their children.

The tree supports many operations with low running time compared to the naive implementations.

The Improved Algorithm Steps will be:

  1. Create a Min Heap with size of N (costs O(N))
  2. Initialize the heap with a node with cost 0 as it is the root node
  3. While the heap isn’t empty, do the following
    – Extract the minimum node u from the heap (reading it costs O(1))
    – Delete the extracted node from the heap as it is now visited (costs O(log(N)))
    – Add u to the MST
    – Get the adjacent nodes v of the extracted node (costs O(M) in total over the whole run, where M >= (N – 1) for a connected graph)
    – For every adjacent node v, check if it is still in the heap, i.e. not yet included in the MST
    – If it is in the heap and the weight of edge u-v is lower than the current cost of v, relax it: update the cost of v and reinsert the new value of v (costs O(log(N)))

So, the final running time will be O(M log(N)), where M is the number of edges and N is the number of nodes.
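
Below is a sketch of the heap version using Python’s heapq module. Note one deliberate difference from the steps above: heapq has no operation for updating a value that is already in the heap, so instead of relaxing a node in place this sketch pushes a new entry and skips outdated entries when they are popped. The heap can then hold up to M entries, which still gives O(M log(N)) since log(M) is at most 2 log(N). The adjacency format and the test graph are my own assumptions.

```python
import heapq

def prim_heap(n, adjacency):
    """adjacency[u] is a list of (v, weight) pairs; nodes are 0..n-1.
    Returns the total weight of the MST of a connected graph."""
    visited = [False] * n
    heap = [(0, 0)]  # (cost to attach the node, node); the root costs 0
    total = 0
    while heap:
        cost, u = heapq.heappop(heap)  # extract and delete the minimum
        if visited[u]:
            continue  # outdated entry: u was already attached more cheaply
        visited[u] = True              # add u to the MST
        total += cost
        for v, w in adjacency[u]:
            if not visited[v]:
                heapq.heappush(heap, (w, v))  # push instead of updating in place
    return total

adjacency = {
    0: [(1, 1), (2, 4)],
    1: [(0, 1), (2, 2), (3, 5)],
    2: [(1, 2), (0, 4), (3, 3)],
    3: [(2, 3), (1, 5)],
}
print(prim_heap(4, adjacency))
```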

References

https://en.wikipedia.org/wiki/Binary_heap

https://en.wikipedia.org/wiki/Heap_(data_structure)

https://en.wikipedia.org/wiki/Prim%27s_algorithm

 

 

 

The Unsolvability of the Halting Problem

Definition

The halting problem is a famous problem in computer science. Simply put, we want to determine whether a computer program will halt (terminate) or continue running forever.

Alan Turing proved that there is no general solution to the halting problem: it is impossible to write a single program that determines, for every program and input, whether that program will terminate or run forever.

The proof uses contradiction to show that the problem is unsolvable.

The Proof

  • Assume that we have a program P(X) that takes a program X (together with its input) as an input
  • The output of P will be 0 or 1, where 1 indicates that X will halt and 0 indicates that X will continue forever
  • Assume we have another program Q that takes a program X as its input, asks P about X running on itself, and then does the opposite: if P returns 1 (X halts), Q runs forever, and if P returns 0 (X runs forever), Q halts
  • Assume we pass the program Q to itself as its parameter, Q(Q)
  • Now, we have 2 cases
  • Case 1: If Q(Q) halts, then P must have returned 0, which means P claimed that Q(Q) runs forever. Contradiction
  • Case 2: If Q(Q) runs forever, then P must have returned 1, which means P claimed that Q(Q) halts. Contradiction
  • Both cases contradict the assumption that P is always correct, so no such program P can exist
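
The construction can be played with in code. In this sketch, p stands for any claimed halting decider and make_q builds the contrary program from it; both names are mine. Since a correct decider cannot exist, the example plugs in a deliberately naive one, pessimist, that always answers 0, and shows that the construction catches it being wrong.

```python
def make_q(p):
    """Build the contrary program Q from a claimed decider p(program, argument)."""
    def q(program):
        if p(program, program) == 1:  # p claims program(program) halts...
            while True:               # ...so Q does the opposite and runs forever
                pass
        return "halted"               # p claims it runs forever, so Q halts
    return q

def pessimist(program, argument):
    """A hypothetical (and wrong) decider that always answers 0: 'runs forever'."""
    return 0

q = make_q(pessimist)
claim = pessimist(q, q)  # 0: the decider says Q(Q) runs forever
result = q(q)            # ...yet this call returns, so the decider was wrong
print(claim, result)
```

A decider that answered 1 instead would be refuted the other way round: Q(Q) would then loop forever, so that case should only be reasoned about, not executed.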

References
https://en.wikipedia.org/wiki/Halting_problem
 http://www.comp.nus.edu.sg/~cs5234/FAQ/halt.html

 

Solving Time-dependent Graph Using Modified Dijkstra Algorithm

Paper: http://research.microsoft.com/pubs/173806/edbt08tdsp.pdf

Content: –

  • Time-dependent Graph – Definition
  • Problem Definition
  • Existing Solutions
    – Bellman-Ford
    – Extended A*
  • New Dijkstra Based Algorithm

 

Time-dependent Graph – Definition

We can define a Time-dependent Graph (TDG) as GT(V, E, W), where V is the set of graph vertices, E is the set of edges between vertices and W is the set of weights on the edges of the graph. The weights vary over time within T, where T is a time domain pre-specified by the graph. So, W is a function of time (called the edge-delay function) that has 3 parameters: the source node, the destination node and the departure time from the source node.

Figure 1 shows an example of a TDG. We have 4 nodes with directed edges between them, and each edge carries a weight that may change over time. In (b), the weight is constant (10) over all time intervals. (c) shows that the weight of the edge from node 1 to node 3 changes across 2 time intervals: over [0, 5] the weight is 10, and over [10, 60] the weight increases to 25.

 

Problem Definition

The problem is to find the least total travel time (LTT) from the source node to the destination node, given the edge-delay functions and the time interval T. The travel time is the arrival time minus the starting time.

The source node, destination node and starting time are given by the user. In other words, the LTT function should have 3 parameters: LTT(source, destination, departure time from source).

There is another time factor, which is the waiting time at a node. Waiting at a node is allowed; hence, the departure time from a node = arrival time + wait time.

The solution should find the time-dependent shortest path (TDSP) from the source node to the destination node.
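
To make the edge-delay functions concrete, here is a simplified sketch that answers the query for one fixed departure time only. It is not the algorithm from the paper, which works with arrival-time functions over a whole interval of starting times. It also assumes no waiting at nodes and that leaving an edge later never makes you arrive earlier (the FIFO property), which is what allows a plain Dijkstra-style search. The graph and its delays are hypothetical, loosely modeled on the example figure.

```python
import heapq

def earliest_arrival(graph, source, destination, departure):
    """graph[u] is a list of (v, delay) pairs, where delay(t) is the travel
    time of edge (u, v) when leaving u at time t."""
    best = {source: departure}
    heap = [(departure, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if u == destination:
            return t
        if t > best.get(u, float("inf")):
            continue  # outdated queue entry
        for v, delay in graph.get(u, []):
            arrival = t + delay(t)  # leave u immediately at time t (no waiting)
            if arrival < best.get(v, float("inf")):
                best[v] = arrival
                heapq.heappush(heap, (arrival, v))
    return None  # destination unreachable

graph = {
    1: [(2, lambda t: 10), (3, lambda t: 10 if t <= 5 else 25)],
    2: [(4, lambda t: 10)],
    3: [(4, lambda t: 5)],
}
for start in (0, 10):
    arrival = earliest_arrival(graph, 1, 4, start)
    print(start, arrival, arrival - start)  # departure, arrival, travel time
```

Changing the departure time changes which route is fastest, which is exactly why the travel time has to be treated as a function of the starting time.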

 

Existing Solutions

There are 2 existing solutions for TDSP. The objective is to minimize the LTT function given the starting node and destination node.

The Bellman-Ford based and extended A* algorithms are discrete-time solutions for this problem.

Given a graph GT(V, E, W) and a query LTT(source, destination, starting_time), minimize the LTT function.

Bellman-Ford Based Algorithm: –

We can illustrate the algorithm by the following steps: –

  • First, we initialize the time to reach every node of the graph from the source to infinity. G_L(t) denotes the earliest arrival time at node L from the source node S with starting time t.
  • Initialize the functions on the edges between each pair of nodes. H_K,L(t) denotes the earliest arrival time at node L from the source node S through the edge (K, L) with starting time t.
  • Repeatedly relax the edges by brute force, updating G_L(t) and H_K,L(t) until we reach the minimum time. This occurs when there are no more changes to these functions (convergence).
  • Updating G_L(t) and H_K,L(t) are called the path-selection and time-refinement steps respectively

A* Based Algorithm: –

This is another algorithm for solving the TDSP problem. Here, the algorithm assumes that there is no waiting time at any node.

There are many paths between the source node and the destination node. The main idea here is to use a Priority Queue (like regular A*) to explore the paths from source to destination. Let Pk be the kth path from the source node S towards the destination node D. Each path Pk has an associated function F_Pk(t) = G_Pk(t) + d_k,e(t) – t, where G_Pk(t) is the arrival time at node K from the source node S using path Pk and d_k,e(t) is a lower-bound estimate (heuristic) of the time from node K to the destination D.

In each iteration, the queue pops one of the paths and tries to minimize its function F_Pk(t). Then it takes the next path, and so on.

Here, path-selection and time-refinement are performed by the priority queue and the function update respectively.

The algorithm is called KDXZ after the first letters of its authors’ names. Experiments showed that this algorithm works fine with small graphs and struggles with larger ones, as it is hard to estimate the value of d_k,e(t).

 

New Dijkstra Based Algorithm: –

The algorithm decouples the 2 mentioned steps, time-refinement and path-selection, as outlined in two-step-LTT.

The time-refinement step estimates the earliest arrival time function of each node in the graph, then checks whether a path exists between the source and destination nodes. The path-selection step gets the path that matches the best arrival time for the best starting time t*.

Path-selection Algorithm: –


Time-refinement Algorithm (Dijkstra main function)


At first, we initialize the earliest arrival time function for each node of the graph. The algorithm refines the arrival time incrementally over sub-intervals of the node’s time domain, extending the refined interval each time. A well-refined function is one that specifies the earliest arrival time at a node from the source node.

In each iteration, the algorithm takes the front node of the priority queue and tries to expand its starting-time sub-interval to make its function well-refined. Then it updates the function of every neighbor of the current node.

If the current node is the destination node, then the algorithm returns the arrival time function of the destination node.