GANs: A Brief Theory and Practice, and Image-to-Image Conversion with Pix2Pix

ODSC - Open Data Science
8 min read · Oct 6, 2021


Editor’s note: Ajay is a speaker for ODSC West 2021. Be sure to check out his talk, “GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow,” to go beyond Pix2Pix.

There are prerequisites for making the most of this blog.

The reader should be familiar with deep neural networks, CNNs, U-Net, dropout, etc., and with how to train, fine-tune, evaluate, and test machine learning models. In addition, acquaintance with data distributions, loss functions, gradients, and backpropagation is required.

Replicating human creative art is hard for machines.

Human cognition and creativity are by far the most intricate systems for machines to mimic. Recognizing patterns in images and data, so apparent to the human brain, remains a difficult task for machines, not to mention the ability to create new, meaningful imagery and textual artworks. This is where the rubber meets the road: modern deep neural networks (DNNs) are arguably powerful enough to make the infeasible feasible. Machine creativity is attainable with Generative Adversarial Networks (GANs), a specific style of DNN that Ian Goodfellow [1] introduced in 2014. Through their strength in replicating human abilities, GANs have made creative imagery and textual tasks achievable for machines. In particular, regardless of the type of imagery artwork, a GAN can create an impressive new piece.


We can use GANs to create new data such as images, text, audio, and video. GANs belong to the family of generative models, which, when trained well on a given dataset, learn its patterns and can start producing new, similar data (Figure 1). In this blog, I will focus on creating a new image from another one.

Figure 1: GAN learns patterns from the animal images and creates a dog’s image [Source: ImageNet]

The GAN framework can create new data and has two key ingredients.

A Generator agent that generates data and a Discriminator critic that helps improve the Generator are the two main ingredients of a GAN. The Generator agent (or function) G learns and estimates the data distribution ‒ the patterns ‒ from the given dataset. The Discriminator critic (or function) D helps the Generator estimate that distribution accurately by providing feedback ‒ a measure of the Generator’s goodness ‒ in the form of error gradients. The agent and the critic play against and learn from each other in a non-cooperative manner; this is where the “adversarial” in GAN comes from. Once the GAN has been trained in this adversarial manner, the Generator is ready to generate new data. An analogy for a GAN is two players playing a game, where both advance their moves against each other and improve in the process by learning from each other’s moves. We call this a zero-sum or minimax game, since one player’s gain is exactly the other’s loss.

Mathematically, the Generator is a parametric function G(z; Ө(G)): z → X that learns a mapping from a simple distribution ‒ say, a uniform or standard normal one ‒ to the actual data distribution, with Ө(G) as its parameters. Call z the latent variable and X the data variable (Figure 2). We denote the generated data by Xg and the real data by Xr.

The Discriminator is a binary classifier whose job is to differentiate the real data from the generated data. This parametric function, D(X; Ө(D)): X → 0|1, is defined to reward the real data Xr and penalize the generated data Xg. The GAN framework, which is the minimax game, is shown in Figure 2.

Figure 2: The GAN framework

A value function and two training loops define GAN’s training.

Further, define the GAN loss function for this minimax game as follows:
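$$J^{(D)} = -\tfrac{1}{2}\,\mathbb{E}_{X_r \sim P_{data}}\big[\log D(X_r)\big] \;-\; \tfrac{1}{2}\,\mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big], \qquad J^{(G)} = -J^{(D)}$$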

Pdata is the real data distribution, and Pmodel is the distribution of the data produced by the Generator. We define J(D) as the Discriminator loss and J(G) as the Generator loss. Note that driving J(D) toward 0 (minimizing the error) forces D(Xr) toward 1 and D(Xg) toward 0, so the Discriminator loss is that of a sigmoid-output binary classifier. Also note the ½ before each term in the loss: for the Discriminator’s training, half of the data is real and half is generated.
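To make these losses concrete, here is a minimal TensorFlow sketch (an illustration of the math above, not code from [1]). The inputs real_logits and fake_logits are the Discriminator’s raw outputs on Xr and Xg; in place of the exact minimax J(G) = −J(D), it uses the non-saturating Generator loss common in practice.

```python
import tensorflow as tf

# Logits-based binary cross-entropy, matching a sigmoid-output classifier.
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # Push D(Xr) toward 1 and D(Xg) toward 0; the 1/2 weights reflect
    # the half-real, half-generated training batch.
    real_loss = bce(tf.ones_like(real_logits), real_logits)
    fake_loss = bce(tf.zeros_like(fake_logits), fake_logits)
    return 0.5 * real_loss + 0.5 * fake_loss

def generator_loss(fake_logits):
    # Non-saturating heuristic: G tries to make D label its samples real.
    return bce(tf.ones_like(fake_logits), fake_logits)
```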

A value function is defined to play this minimax game as follows:
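$$V(D, G) = \mathbb{E}_{X_r \sim P_{data}}\big[\log D(X_r)\big] \;+\; \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big]$$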

The goal of the GAN is to find a Nash equilibrium ‒ the solution concept for a minimax game ‒ of the above value function, such that the following criterion is satisfied:
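$$\Theta^{(G)*} = \arg\min_{G}\,\max_{D}\; V(D, G)$$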

Ө(G)* is the optimal value of the Generator’s parameters for generating data similar to the real data.

As is apparent in the preceding value-function optimization, GAN training happens in a sequence of two loops, inner and outer (note the maxD and the minG, respectively). In the inner loop, we freeze the Generator’s parameters and train the Discriminator; in the outer loop, we freeze the Discriminator’s parameters and train the Generator. These two loops alternate until the training converges.
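Here is a minimal sketch of how these alternating loops look in TensorFlow, assuming a generator and a discriminator (hypothetical tf.keras.Model instances for this sketch) plus the loss functions above. Each call performs one Discriminator step with G effectively frozen, then one Generator step with D effectively frozen.

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
LATENT_DIM = 100  # size of z; an assumption for this sketch

@tf.function
def train_step(real_images):
    z = tf.random.normal([tf.shape(real_images)[0], LATENT_DIM])

    # Inner-loop step: only the Discriminator's variables receive
    # gradient updates, which freezes the Generator for this step.
    with tf.GradientTape() as d_tape:
        fake_images = generator(z, training=True)
        d_loss = discriminator_loss(
            discriminator(real_images, training=True),
            discriminator(fake_images, training=True))
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # Outer-loop step: only the Generator's variables receive gradient
    # updates, which freezes the Discriminator.
    with tf.GradientTape() as g_tape:
        fake_images = generator(z, training=True)
        g_loss = generator_loss(discriminator(fake_images, training=True))
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```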

We now conclude the brief theory and training of GANs and switch to cGANs next.

We can condition GANs, which gives us the cGAN.

The vanilla GAN defined previously has no control over what type of data it generates. For example, in a GAN that generates human faces, is there a way to create images of only females? Controlling which data to generate is where the cGAN, aka the conditional GAN, comes into the picture. We can condition either or both of the Generator and the Discriminator on some extra information ‒ such as a variable or class labels ‒ during the GAN’s training, and then control what type of data to generate at inference time by furnishing this additional information to the Generator. The paper “Conditional Generative Adversarial Nets” [2] formulated the cGAN.

Figure 3: The cGAN: both the Discriminator and the Generator are conditioned on the class label.

The new Generator function for the cGAN becomes G(z | c; Ө(G)): {z, c} → X, and the Discriminator becomes D(X | c; Ө(D)): {X, c} → 0|1, with c being the condition. The changes in the cost functions are as follows:
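$$J^{(D)} = -\tfrac{1}{2}\,\mathbb{E}_{X_r \sim P_{data}}\big[\log D(X_r \mid c)\big] \;-\; \tfrac{1}{2}\,\mathbb{E}_{z}\big[\log\big(1 - D(G(z \mid c) \mid c)\big)\big], \qquad J^{(G)} = -J^{(D)}$$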

The value function of the new minimax game becomes as follows:
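$$V(D, G) = \mathbb{E}_{X_r \sim P_{data}}\big[\log D(X_r \mid c)\big] \;+\; \mathbb{E}_{z}\big[\log\big(1 - D(G(z \mid c) \mid c)\big)\big]$$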

And the optimization criterion is now the following:
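$$\Theta^{(G)*} = \arg\min_{G}\,\max_{D}\; V(D, G)$$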

The training of cGANs is the same as GANs once the cost functions are modified.
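In code, the conditioning often amounts to simple concatenation. Here is a minimal TensorFlow sketch (an illustration, assuming one-hot class labels c; the function names are my own):

```python
import tensorflow as tf

def condition_latent(z, c):
    # Generator input {z, c}: append the one-hot label to the latent vector.
    return tf.concat([z, c], axis=-1)

def condition_image(x, c):
    # Discriminator input {X, c}: broadcast the label into a per-pixel
    # channel map and stack it onto the image channels.
    # x: [batch, H, W, C] images; c: [batch, num_classes] one-hot labels.
    h, w = x.shape[1], x.shape[2]
    c_map = tf.tile(tf.reshape(c, [-1, 1, 1, c.shape[-1]]), [1, h, w, 1])
    return tf.concat([x, c_map], axis=-1)
```

At inference time, fixing c selects what to generate ‒ for example, passing a “female” one-hot label to condition_latent steers a face generator toward female faces.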

Using an input image as the condition to a GAN, we can build a Pix2Pix model to convert images of one type into another.

Here comes the fun part: being creative. The Berkeley AI lab’s paper “Image-to-Image Translation with Conditional Adversarial Networks,” also known as Pix2Pix GAN [3], features many image-creation tasks made possible with GANs, such as creating maps from aerial photographs of roads, converting pictures from day to night, or breathing life into old black-and-white photos by converting them to color (Figure 4).

Figure 4: Pix2Pix GAN generating various types of images. [3]

Let’s review the Pix2Pix GAN’s architecture and loss functions. First, note that the Pix2Pix model converts an input image to a target image while preserving the structure of the input. We call this converting an image from one domain to another. Let’s denote the input-domain image variable by X and the output-domain variable by Y. In Pix2Pix, the Generator’s job is to learn the mapping from the input domain to the output domain. The Discriminator’s goal is still to differentiate between the training images and the generated images, though with one change ‒ it is also conditioned on the input image.

Define the Generator function as G(X; Ө(G)): X → Y and the Discriminator as D(Y | X; Ө(D)): {Y, X} → 0|1. We condition the Discriminator on X, the input image. Figure 5 shows the complete Pix2Pix architecture. Note that the latent variable is dropped from the original Generator function definition in the Pix2Pix case; here, the latent information comes from within the Generator’s network ‒ either an Encoder-decoder or a U-Net [4] ‒ through dropout. Figure 6 shows the Generator’s architecture.

Figure 5: Pix2Pix GAN architecture. The Discriminator is conditioned on the input image.[3]

Figure 6: The Generator ‒ either an Encoder-decoder or U-Net ‒ for Pix2Pix. [3]
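As a minimal TensorFlow sketch of how this conditioning and structure preservation appear in the losses (an illustration, not the authors’ implementation; note that the full objective in [3] adds an L1 reconstruction term, weighted by λ = 100):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # L1 weight used in the Pix2Pix paper [3]

def pix2pix_generator_loss(fake_logits, generated, target):
    # Adversarial term plus an L1 term that keeps G(X) close to Y,
    # helping preserve the input image's structure.
    adv = bce(tf.ones_like(fake_logits), fake_logits)
    l1 = tf.reduce_mean(tf.abs(target - generated))
    return adv + LAMBDA * l1

def discriminator_input(input_image, candidate):
    # Condition D on X: concatenate the input image with the real target Y
    # or the generated image G(X) along the channel axis.
    return tf.concat([input_image, candidate], axis=-1)
```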

Once Pix2Pix is trained, the Generator can produce new target images from input images. Figure 7 shows some examples of generated images.

Figure 7: Generated images using Pix2Pix. [3]

Closing remarks

GANs have been among the most helpful DL techniques of the past several years, particularly for data synthesis. Estimating and learning the intrinsic data distribution of a given dataset and generating new data that looks authentic is one of GANs’ most significant successes. In this blog, I presented a brief mathematical formulation of GANs for generating new data and discussed a couple of GAN varieties, cGAN and Pix2Pix. Apologies if I missed some specific details and nuances due to the size limit of this blog. I strongly suggest you read the referenced papers, and I will be glad to clarify any doubts. Further, I hope you enjoyed reading this blog.

Original post here.

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform.


Written by ODSC - Open Data Science

Our passion is bringing thousands of the best and brightest data scientists together under one roof for an incredible learning and networking experience.