You can imagine the situation by learning without any prior knowledge about something, you just do mistakes and you let others correct them to you. Continuing this process will lead you to learn how to do that thing, although you may still don’t know what are you doing?!
This was a comment from my professional friend Moustapha Saad on my recent post, which was an overview of Generative Adversarial Networks (GANs). I find his description very accurate. It captures the true nature of GANs, so I decided to use it as the title of this post.
This post is part 2 of the previous post. There, we talked about the general idea of GANs and about the discriminative and generative models, which compete with each other and improve as a result. In this post, we will focus on how generative models work in general and on some types of generative models that can be used when building GANs.
Maximum Likelihood Generative Models
Assume we have a dataset of m training samples that follow an unknown probability distribution. It could be Gaussian, Bernoulli, or any other distribution; we don't actually need to know which. Our goal is to generate similar data that follows the same distribution as the training dataset. We can formulate this as a Maximum Likelihood Estimation problem.
The idea of Maximum Likelihood is to build a model whose parameters Θ let it reproduce the same probability distribution, $P(X;\Theta)$. Since we have m samples in the training data, we extend the likelihood over the whole dataset as a product over all samples, $\prod_{i=1}^{m} P_{model}(X_i;\Theta)$.
Now, the target is to find the Θ that maximizes the likelihood, so the mathematical representation of this problem is $\Theta^{*} = \arg\max_{\Theta} \prod_{i=1}^{m} P_{model}(X_i;\Theta)$.
To avoid numerical problems, it is better to work in log space. Since log(x * y) = log(x) + log(y), the product above turns into a sum, and because the logarithm is monotonically increasing, the maximizing Θ does not change. So our final equation for the Maximum Likelihood estimate on the training data is $\Theta^{*} = \arg\max_{\Theta} \sum_{i=1}^{m} \log P_{model}(X_i;\Theta)$.
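The numerical argument for the log space can be checked with a small sketch. It assumes, purely for illustration, that the model is a univariate Gaussian and that the training data are 2000 synthetic samples; the function names `likelihood` and `log_likelihood` are made up for this example.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def likelihood(samples, mu, sigma):
    """Product of the per-sample densities (the product form of the likelihood)."""
    result = 1.0
    for x in samples:
        result *= gaussian_pdf(x, mu, sigma)
    return result


def log_likelihood(samples, mu, sigma):
    """Sum of the per-sample log-densities (the log-space form)."""
    return sum(
        -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for x in samples
    )


# Synthetic training data (an assumption made only for this illustration)
random.seed(0)
samples = [random.gauss(2.0, 1.5) for _ in range(2000)]

# For a Gaussian model, the maximizing parameters have a closed form
mu_hat = sum(samples) / len(samples)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in samples) / len(samples))

print("product form:", likelihood(samples, mu_hat, sigma_hat))  # underflows to 0.0
print("log form:", log_likelihood(samples, mu_hat, sigma_hat))  # finite negative number
```

With this many samples the product of densities underflows to zero in floating point, while the sum of logs stays finite and can still be compared across different values of Θ.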
The model fitted this way is an empirical estimate: it is derived only from the m training samples, so we cannot be fully certain about the distribution we get once Θ is found. The question is: how do we know whether the obtained distribution really matches the actual distribution of the training data?
Kullback–Leibler divergence (KL-divergence) is a measure that can be used to answer the above question, and it applies to both discrete and continuous probability distributions. Given two probability distributions P and Q, the KL-divergence in the discrete case is $D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$.
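As a minimal sketch of the discrete formula, the function below computes the KL-divergence between two distributions given as lists of probabilities. The example distributions `p`, `q_close` and `q_far` are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with P(i) = 0 contribute 0 by convention
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical distributions over three outcomes
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]
q_far = [0.1, 0.1, 0.8]

print(kl_divergence(p, p))        # identical distributions give 0.0
print(kl_divergence(p, q_close))  # small value: Q is close to P
print(kl_divergence(p, q_far))    # larger value: Q is far from P
```

Note that the measure is not symmetric: `kl_divergence(p, q_far)` and `kl_divergence(q_far, p)` generally differ, so the order of the arguments matters.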
The target is to minimize the KL-divergence: the smaller it is, the closer the distributions P and Q are to each other. Recalling our Maximum Likelihood setup, we take P to be the distribution of the training data and Q to be the model distribution obtained with Θ. Only Q depends on Θ, so minimizing the KL-divergence with respect to Θ means maximizing the likelihood of the training dataset.
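The link between the two objectives can be made explicit with a short derivation. It is a sketch under one assumption that the text does not state: the distribution of the training data is represented by $\hat{P}_{data}$, which puts probability $1/m$ on each of the m (distinct) training samples.

```latex
\begin{aligned}
D_{KL}\big(\hat{P}_{data} \,\|\, P_{model}\big)
  &= \sum_{i=1}^{m} \frac{1}{m} \log \frac{1/m}{P_{model}(X_i;\Theta)} \\
  &= -\log m \;-\; \frac{1}{m} \sum_{i=1}^{m} \log P_{model}(X_i;\Theta) \\[4pt]
\arg\min_{\Theta} D_{KL}\big(\hat{P}_{data} \,\|\, P_{model}\big)
  &= \arg\max_{\Theta} \sum_{i=1}^{m} \log P_{model}(X_i;\Theta)
\end{aligned}
```

The term $-\log m$ does not depend on Θ, so the Θ that minimizes the divergence is the one that maximizes the log-likelihood sum.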
Explicit vs. Implicit Maximum Likelihood Models
Now that we know the general idea of Maximum Likelihood models and their target, we need to be more specific when talking about Maximum Likelihood generative models. There are two types of Maximum Likelihood models: explicit and implicit.
Explicit Models: Models that depend fully on the training data and try to learn the parameters of P(X;Θ) (for example, the mean and variance of a Gaussian) using gradient-based or other optimization techniques. For training to be efficient, we need to design a model that makes good use of the training samples. The collected samples should also be general enough to cover all variables of the probability distribution, which may be hard with some real-life data. Finally, explicit models require considerable computational power to learn the parameters.
Implicit Models: In these models, we don't use the data to learn the true distribution parameters directly. Instead, we assume that the distribution parameters are hidden and try to predict them using some observed variables from the data. There are no fixed rules for choosing these observed variables; they are extracted by the researcher or by an expert who has prior knowledge about the data. You can think of it as a stochastic inference process that could be carried out using Markov Chains: you use some observable data to infer the hidden variables (called latent variables). Refer to this post to learn more about Markov Chains. We then use these latent variables (which form a latent space) to generate data similar to the given training samples. Implicit models are also commonly used for dimensionality reduction without any prior knowledge about the data distribution.
The question now is: to which type does the generative model in GANs belong?
Recalling the GANs figure,
You find that a GAN uses the latent space to generate data whose distribution is considered similar to that of the real samples. So it is an implicit model, but it doesn't use any observable data as Markov Chains do. It deals directly with the training samples; in other words, it treats the samples of the training data as the latent variables and changes their distribution by adding some noise to the sample features. That's why it is considered an implicit model: it doesn't use the data distribution, and it doesn't depend on any observable data as we do with Markov Chains.
To clarify further, imagine learning something without any prior knowledge about it: you simply make mistakes and let others correct them for you. Repeating this process will eventually teach you how to do that thing, although you may still not know what you are doing!
There are in fact many types of Maximum Likelihood models. The following figure (taken from Ian Goodfellow's tutorial, which you will find in the references) is a taxonomy of the different types of models.
The Generative Model’s Camouflage Process
In GANs, the generative model takes some real samples from the training data, without any prior knowledge about their distribution, and replaces some actual features with fake ones (camouflage). Since the generative model is a deep neural network, the fake features could be added in the hidden layers after feeding the input forward through the network. Alternatively, the noise could be added to the features before they are input to the network, letting the network produce the fake sample by feeding the noisy sample forward. We can represent the generative model as a mathematical function, f(X, Noise) = Z.
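The two places where noise can enter, as described above, can be sketched with a toy network. Everything here is an assumption made for illustration: the class name `ToyGenerator`, the layer sizes, the Gaussian noise, and the random untrained weights. It only shows the shape of f(X, Noise) = Z, not a trained generator.

```python
import math
import random


def dense(vector, weights, biases):
    """Affine layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]


def random_matrix(rows, cols, rng):
    """Random weights, standing in for parameters that training would learn."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyGenerator:
    """Untrained two-layer network standing in for f(X, Noise) = Z."""

    def __init__(self, n_features, n_hidden, rng):
        self.rng = rng
        self.w1 = random_matrix(n_hidden, n_features, rng)
        self.b1 = [0.0] * n_hidden
        self.w2 = random_matrix(n_features, n_hidden, rng)
        self.b2 = [0.0] * n_features

    def _noise(self, size, scale):
        return [self.rng.gauss(0.0, scale) for _ in range(size)]

    def generate(self, x, noise_scale=0.1, noise_at="input"):
        if noise_at == "input":
            # Variant 1: perturb the features before the feedforward pass
            x = [v + n for v, n in zip(x, self._noise(len(x), noise_scale))]
        hidden = [math.tanh(h) for h in dense(x, self.w1, self.b1)]
        if noise_at == "hidden":
            # Variant 2: perturb the hidden activations inside the network
            hidden = [h + n for h, n in zip(hidden, self._noise(len(hidden), noise_scale))]
        return dense(hidden, self.w2, self.b2)


rng = random.Random(42)
generator = ToyGenerator(n_features=4, n_hidden=8, rng=rng)
x_real = [0.2, -1.0, 0.5, 0.7]  # hypothetical real sample X

z_input_noise = generator.generate(x_real, noise_at="input")
z_hidden_noise = generator.generate(x_real, noise_at="hidden")
print(z_input_noise)
print(z_hidden_noise)
```

In both variants the output Z has the same number of features as X, so the discriminator can be given either one.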
The discriminator (predator) then receives two signals: X, the real sample, and Z, the fake sample. Its role is to assign two probabilities, using softmax for example. If it gives a probability close to 1 to the real sample, meaning it distinguishes perfectly between the real and fake samples, then the error (calculated using cross entropy) is backpropagated to the generative model to fine-tune its parameters (the neural network weights) and enhance its ability to produce a trickier sample. If it gives a probability close to 1 to the fake sample, meaning it fails to distinguish between the real and fake samples, then the error is backpropagated to the discriminator to fine-tune its parameters and enhance its ability to reject the fake sample.
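The two cases can be illustrated with a small sketch of the softmax and cross-entropy computations. The raw scores below are invented, and the setup (one score per sample, softmax over the pair, cross entropy against "X is the real one") is only one possible reading of the description above, not a full training loop.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy error when the correct answer is probs[true_index]."""
    return -math.log(probs[true_index])


REAL, FAKE = 0, 1  # index 0 scores the real sample X, index 1 the fake sample Z

# Hypothetical raw discriminator scores for the pair (X, Z)
confident_scores = [4.0, -1.0]  # discriminator favors the real sample
fooled_scores = [-1.0, 4.0]     # discriminator favors the fake sample

p_confident = softmax(confident_scores)
p_fooled = softmax(fooled_scores)

loss_confident = cross_entropy(p_confident, REAL)
loss_fooled = cross_entropy(p_fooled, REAL)

print(p_confident, loss_confident)  # probability near 1 on X, small error
print(p_fooled, loss_fooled)        # probability near 1 on Z, large error
```

When the discriminator is fooled, its cross-entropy error is large, and that error is the signal used to adjust its weights. When it is confident and correct, its own error is small, and the generator is the one that needs to improve.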
Existing GANs with Popular Generative Models
Denoising Auto-Encoders are commonly used nowadays when building GANs. There is an interesting project on Face Completion using a Generative Adversarial Denoising Autoencoder; you can find the project website here http://www.cc.gatech.edu/~hays/7476/projects/Avery_Wenchen/. There are also other models called Adversarial Autoencoders (AAEs) https://arxiv.org/pdf/1511.05644.pdf and Deep Convolutional Generative Adversarial Networks https://github.com/carpedm20/DCGAN-tensorflow , but I don't know much about them.
Adversarial learning is a dominant idea nowadays and is considered a hot topic in Deep Learning, with applications in multiple fields such as Speech Recognition and Computer Vision.