Arjovsky et al. [1] proposed a new framework to suppress the vanishing-gradient issue in traditional Generative Adversarial Networks (GANs) [2]. A distance measure, the Earth-Mover (EM) or Wasserstein distance, is utilized to guarantee a continuous and (almost everywhere) differentiable objective function during training. This blog is organized as follows. Section 1 introduces the background knowledge for learning unknown distributions. Section 2 presents the Wasserstein distance and shows its advantages by comparing it with three other distance measures. Section 3 describes the framework of the Wasserstein GAN. Finally, a discussion is given.
1. Background
Learning unknown distributions (for example, for the learning of generative models) is a fundamental problem in machine learning. Given data $\{x^{(i)}\}_{i=1}^m$ sampled from an unknown distribution $\mathbb{P}_r$, the goal is to learn a distribution $\mathbb{P}_\theta$ to approximate the real one, where $\theta$ are the parameters of the distribution. Usually, there are two approaches to achieve this goal. The first is to directly learn the probability density function $P_\theta$, e.g. using maximum likelihood estimation (MLE). Yet $P_\theta$ may be difficult to obtain, so another approach is to learn a function $g_\theta$ that transforms an existing distribution $Z$ into $\mathbb{P}_\theta$. Here, $g_\theta$ is some differentiable function, $Z$ is a common distribution (usually uniform or Gaussian), and $\mathbb{P}_\theta$ is the distribution of $g_\theta(z)$ with $z \sim Z$.
1.1. Maximum likelihood estimation
For samples $\{x^{(i)}\}_{i=1}^m$ drawn from the real distribution and a given parametric density $P_\theta$, the MLE is written as:

$$\max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)}) \quad (1)$$
When $m \to \infty$, the samples approximate the real data distribution $\mathbb{P}_r$ very closely. Thus, based on Eq. 1, we can get the following result:

$$\max_{\theta} \int_x P_r(x) \log P_\theta(x) \, dx = \min_{\theta} -\int_x P_r(x) \log P_\theta(x) \, dx = \min_{\theta} \int_x P_r(x) \log P_r(x) \, dx - \int_x P_r(x) \log P_\theta(x) \, dx \quad (2)$$
In the last line of Eq. 2, the new term $\int_x P_r(x) \log P_r(x) \, dx$ is the negative entropy of $\mathbb{P}_r$. This value is a constant in $\theta$, so it does not affect the search for the $\theta$ minimizing Eq. 2. According to the definition of the Kullback-Leibler (KL) divergence, for two continuous distributions $\mathbb{P}_r$ and $\mathbb{P}_\theta$ with densities $P_r$ and $P_\theta$, the KL divergence is $KL(\mathbb{P}_r \| \mathbb{P}_\theta) = \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)} \, dx$. So Eq. 2 can be further written as:

$$\min_{\theta} KL(\mathbb{P}_r \| \mathbb{P}_\theta) \quad (3)$$
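As a concrete numerical illustration of Eq. 1 (a toy sketch of mine, not from the references, assuming a 1-D Gaussian model with unit variance and unknown mean), maximizing the average log-likelihood over a grid of candidate means recovers approximately the sample mean:

```python
import math
import random

random.seed(1)

# Toy MLE sketch: fit the mean of a unit-variance Gaussian by maximizing
# the average log-likelihood of Eq. 1 over a grid of candidate means.
xs = [random.gauss(1.5, 1.0) for _ in range(5000)]

def avg_loglik(mu):
    # (1/m) * sum_i log P_theta(x_i) for P_theta = N(mu, 1)
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs) / len(xs)

grid = [i / 100 for i in range(-100, 300)]  # candidate means in [-1.0, 3.0)
best = max(grid, key=avg_loglik)
print(abs(best - sum(xs) / len(xs)) < 0.01)  # True: MLE is close to the sample mean
```

The grid search stands in for gradient ascent only to keep the sketch dependency-free; in practice the maximizer is found analytically or by optimization.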
However, if $P_\theta(x) = 0$ at a point $x$ where $P_r(x) > 0$, the KL divergence becomes infinite. This is bad for MLE learning because $P_\theta$ is very likely to assign zero density to some data points. A typical method to fix the issue is to add random noise to the learned distribution, which ensures the density is defined (nonzero) everywhere. But this increases computation, and the noise model may still require handcrafted selection and design.
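The blow-up and the noise fix can be seen on a tiny discrete example (my own sketch, not from the references):

```python
import math

# Toy discrete illustration: KL(Pr || Ptheta) diverges wherever Ptheta
# assigns zero mass to an outcome that Pr supports.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pr = [0.5, 0.5, 0.0]
ptheta = [0.5, 0.0, 0.5]  # zero mass on an outcome pr covers -> kl(pr, ptheta) diverges

# The classic fix: mix a little uniform noise into the model so its
# density is nonzero everywhere (here eps of uniform mass per outcome).
eps = 1e-3
ptheta_smooth = [(1 - 3 * eps) * q + eps for q in ptheta]
print(math.isfinite(kl(pr, ptheta_smooth)))  # True: the smoothed KL is finite
```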
1.2. Generative adversarial networks
Generative Adversarial Networks are a well-known example of the indirect approach (i.e. the distribution transformation discussed above) to distribution learning. Training a GAN is a competing game between two neural networks. The first network, the generator, takes noise as input (e.g. a noise image) and tries to generate “intermediate” samples for the other network. The second network, the discriminator (also called the critic), receives either a generated sample or a real data sample, and tries to learn a model that can distinguish between the “fake” and real data. The generator is trained to fool the discriminator, which has access to the real data. After many training iterations, the generator acquires the ability to produce results that are very close to the real ones.
Formally, the competition between the generator and the discriminator is the minimax objective:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))] \quad (4)$$
where $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution $p(z)$, such as the uniform distribution or a spherical Gaussian distribution).
The first step in solving the GAN objective is to find the optimal discriminator. Setting the derivative of Eq. 4 (with respect to $D$) to 0, the optimal $D^*$ is obtained as follows:

$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} \quad (5)$$
Putting the optimal $D^*$ back into Eq. 4, we can get the following equations:

$$\max_D L(D, G) = \mathbb{E}_{x \sim \mathbb{P}_r}\!\left[\log \frac{P_r(x)}{P_r(x) + P_g(x)}\right] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\!\left[\log \frac{P_g(\tilde{x})}{P_r(\tilde{x}) + P_g(\tilde{x})}\right] \quad (6)$$

$$= 2\,JS(\mathbb{P}_r \| \mathbb{P}_g) - 2 \log 2 \quad (7)$$
where $JS$ represents the Jensen-Shannon (JS) divergence, defined as $JS(\mathbb{P}_r \| \mathbb{P}_g) = \frac{1}{2} KL(\mathbb{P}_r \| \mathbb{P}_m) + \frac{1}{2} KL(\mathbb{P}_g \| \mathbb{P}_m)$, and $\mathbb{P}_m$ is the mixture $(\mathbb{P}_r + \mathbb{P}_g)/2$.
Thus, minimizing Eq. 7 with respect to $G$ is equivalent to minimizing the JS divergence when we have the optimal $D^*$. In most practical training situations, $\mathbb{P}_r$ and $\mathbb{P}_g$ have (almost) no overlap; then $JS(\mathbb{P}_r \| \mathbb{P}_g) = \log 2$ is a constant, and the gradient vanishes.
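The saturation at $\log 2$ is easy to verify on a small discrete example (my own sketch):

```python
import math

# Toy discrete illustration: when Pr and Pg have disjoint supports,
# JS(Pr || Pg) equals the constant log 2, so it carries no gradient.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # the mixture (Pr + Pg) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pr = [0.5, 0.5, 0.0, 0.0]
pg = [0.0, 0.0, 0.5, 0.5]  # no overlap with pr
print(abs(js(pr, pg) - math.log(2)) < 1e-12)  # True: JS saturates at log 2
```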
2. Distance Measures
2.1. Definition of Wasserstein distance
Because the KL and JS divergences suffer from the gradient-vanishing issue, the Earth-Mover (EM) distance, or Wasserstein distance, is utilized as the measure for training the GAN framework. The definition of the Wasserstein distance is as follows:

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|] \quad (8)$$
Here, $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginal distributions are $\mathbb{P}_r$ and $\mathbb{P}_g$; $\|x - y\|$ could be the Euclidean norm. Intuitively, $\gamma(x, y)$ indicates how much “mass” must be transported from $x$ to $y$, so the Wasserstein distance assesses the minimum effort required to move probability mass from one distribution to the other, and $\|x - y\|$ represents the moved distance.
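As a quick numerical sanity check of what $W$ measures (a sketch of mine, not from the paper): in one dimension the optimal transport plan pairs sorted samples, so the empirical Wasserstein-1 distance is the mean absolute difference of the order statistics.

```python
import random

random.seed(42)

# Empirical 1-D Wasserstein-1 distance: in one dimension the optimal
# coupling matches sorted samples, so W1 is the mean absolute difference
# of the paired order statistics.
def wasserstein1(xs, ys):
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Two unit Gaussians whose means differ by 2: the true W1 equals 2.
p = [random.gauss(0.0, 1.0) for _ in range(20000)]
q = [random.gauss(2.0, 1.0) for _ in range(20000)]
print(abs(wasserstein1(p, q) - 2.0) < 0.1)  # True: the estimate is close to 2
```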
Consider a simple example with probability distributions defined over $\mathbb{R}^2$. Let $Z \sim U[0, 1]$ be uniform on the unit interval. The true data distribution is $\mathbb{P}_0$, the distribution of $(0, Z)$: a vertical line segment through the origin. The model distribution is $\mathbb{P}_\theta$, the distribution of $(\theta, Z)$, where $\theta$ is a single real parameter. For comparison, the KL divergence, the JS divergence, and the Total Variation distance ($\delta$) are employed alongside the Wasserstein distance. The following shows the different behavior of the four distance measures:
Total Variation (TV) distance: $\delta(\mathbb{P}_0, \mathbb{P}_\theta) = 1$ if $\theta \neq 0$, and $0$ if $\theta = 0$.
Kullback-Leibler (KL) divergence: $KL(\mathbb{P}_\theta \| \mathbb{P}_0) = KL(\mathbb{P}_0 \| \mathbb{P}_\theta) = +\infty$ if $\theta \neq 0$, and $0$ if $\theta = 0$.
Jensen-Shannon (JS) divergence: $JS(\mathbb{P}_0 \| \mathbb{P}_\theta) = \log 2$ if $\theta \neq 0$, and $0$ if $\theta = 0$.
Wasserstein distance: $W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|$.
Here, only the output of the Wasserstein distance depends continuously on the parameter $\theta$. For the other three measures, when $\mathbb{P}_0$ and $\mathbb{P}_\theta$ have no overlap (a very common situation in GAN training), the output is a constant independent of $\theta$, which may yield no meaningful gradient.
This example shows that when TV, KL, or JS is used as the distance measure for distribution estimation, there exist situations in which learning does not converge. Moreover, these measures may also make the gradient zero when computing $\nabla_\theta \delta(\mathbb{P}_0, \mathbb{P}_\theta)$, $\nabla_\theta KL(\mathbb{P}_0 \| \mathbb{P}_\theta)$, or $\nabla_\theta JS(\mathbb{P}_0 \| \mathbb{P}_\theta)$. The Wasserstein distance, in contrast, suppresses both the non-convergence and the gradient-vanishing issues.
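The four cases above can be written down directly as closed-form functions of $\theta$:

```python
import math

# The parallel-lines example in closed form (values as listed above):
# only the Wasserstein distance varies smoothly with theta.
def tv(theta):
    return 0.0 if theta == 0 else 1.0

def kl(theta):
    return 0.0 if theta == 0 else math.inf

def js(theta):
    return 0.0 if theta == 0 else math.log(2)

def w(theta):
    return abs(theta)

for theta in (1.0, 0.5, 0.1, 0.0):
    print(tv(theta), js(theta), w(theta))
# Only the last column shrinks toward 0 with theta; TV and JS stay constant
# until theta hits exactly 0, so their gradient with respect to theta is 0.
```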
2.2. Theorem proof
Now let us go deeper and see that, under general conditions, the Wasserstein distance is continuous and differentiable, and that it still works well even when the transform function $g_\theta$ is a neural network.
Theorem 1: Let $\mathbb{P}_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable (e.g. Gaussian) over another space $\mathcal{Z}$. Let $g: \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, denoted $g_\theta(z)$ with $z$ the first coordinate and $\theta$ the second. Let $\mathbb{P}_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(\mathbb{P}_r, \mathbb{P}_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity Assumption 1, then $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere, and differentiable almost everywhere.
3. Statements 1–2 are false for the Jensen-Shannon divergence $JS(\mathbb{P}_r, \mathbb{P}_\theta)$ and all the KL divergences.
Assumption 1: Let $g: \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be locally Lipschitz. For a distribution $p$ over $\mathcal{Z}$, $g$ satisfies this assumption if there exist local Lipschitz constants $L(\theta, z)$ such that:

$$\mathbb{E}_{z \sim p}[L(\theta, z)] < +\infty$$
Proof of Theorem 1: 1. Bound $W(\mathbb{P}_\theta, \mathbb{P}_{\theta'})$, where $\theta$ and $\theta'$ are two parameter vectors in $\mathbb{R}^d$. The distribution of the joint $(g_\theta(Z), g_{\theta'}(Z))$ is a coupling $\gamma$ that has marginals $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$. Based on the definition of the Wasserstein distance in Eq. 8, we have:

$$W(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \leq \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|] = \mathbb{E}_z[\|g_\theta(z) - g_{\theta'}(z)\|] \quad (9)$$
$g$ is continuous in $\theta$, and $\mathcal{X}$ is compact, so $\|g_\theta(z) - g_{\theta'}(z)\| \to 0$ pointwise and is uniformly bounded. By the bounded convergence theorem, as $\theta' \to \theta$ we have:

$$W(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \leq \mathbb{E}_z[\|g_\theta(z) - g_{\theta'}(z)\|] \to 0$$
Moreover, because $W$ is a distance measure, using the triangle inequality for $W(\mathbb{P}_r, \mathbb{P}_\theta)$ and $W(\mathbb{P}_r, \mathbb{P}_{\theta'})$, we have:

$$|W(\mathbb{P}_r, \mathbb{P}_\theta) - W(\mathbb{P}_r, \mathbb{P}_{\theta'})| \leq W(\mathbb{P}_\theta, \mathbb{P}_{\theta'})$$
Thus, as $\theta'$ gets very close to $\theta$, $|W(\mathbb{P}_r, \mathbb{P}_\theta) - W(\mathbb{P}_r, \mathbb{P}_{\theta'})|$ also gets very close to zero, so $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous in $\theta$ when $g$ is continuous in $\theta$. This proves item 1 of the Theorem.
2. If $g$ is locally Lipschitz, then for a given pair $(\theta, z)$ there exist a constant $L(\theta, z)$ and an open set $U \ni (\theta, z)$ such that, for every $(\theta', z') \in U$, we have:

$$\|g_\theta(z) - g_{\theta'}(z')\| \leq L(\theta, z)\,(\|\theta - \theta'\| + \|z - z'\|)$$
If we set $z' = z$ and take expectations over $z$, then we have:

$$\mathbb{E}_z[\|g_\theta(z) - g_{\theta'}(z)\|] \leq \|\theta - \theta'\| \, \mathbb{E}_z[L(\theta, z)] \quad (10)$$
Define $L(\theta) = \mathbb{E}_z[L(\theta, z)]$, which is finite by Assumption 1. Using Eq. 9 and Eq. 10, we have:

$$|W(\mathbb{P}_r, \mathbb{P}_\theta) - W(\mathbb{P}_r, \mathbb{P}_{\theta'})| \leq W(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \leq L(\theta)\,\|\theta - \theta'\|$$
This shows that $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is locally Lipschitz, hence continuous everywhere and, by Rademacher's theorem, differentiable almost everywhere. This proves item 2 of the Theorem.
3. The example in Section 2 provides a counterexample that proves item 3.
Corollary 1: Let $g_\theta$ be any feedforward neural network parameterized by $\theta$, and let $p(z)$ be a prior over $z$ such that $\mathbb{E}_{z \sim p(z)}[\|z\|] < \infty$ (e.g. Gaussian or uniform). Then Assumption 1 is satisfied, and therefore $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere and differentiable almost everywhere.
Proof of Corollary 1: If we can show that there exist local Lipschitz constants $L(\theta, z)$ with $\mathbb{E}_{z \sim p(z)}[L(\theta, z)] < +\infty$, then the corollary follows from Theorem 1.
Let $H$ be the number of layers, and let $H_{i:j}$ denote the application of layers $i$ through $j$ inclusively (e.g. $g_\theta = H_{1:H}$). Then we have:

$$\nabla_z g_\theta(z) = \prod_{k=1}^{H} W_k D_k$$
where $W_k$ are the weight matrices and $D_k$ are the diagonal Jacobians of the nonlinearities.
Let $L$ be the Lipschitz constant of the nonlinearity of the network; then $\|D_k\| \leq L$ and $\|\nabla_z g_\theta(z)\| \leq \prod_{k=1}^{H} L \|W_k\|$. Considering the gradient with respect to the parameters of each layer, an analogous product bound involves the layer activations $H_{1:k-1}(z)$ and therefore grows at most linearly in $\|z\|$, so we have:

$$\|\nabla_{\theta, z}\, g_\theta(z)\| \leq C_1(\theta) + C_2(\theta)\,\|z\|$$
If we set $L(\theta, z) = C_1(\theta) + C_2(\theta)\,\|z\|$, we get the following formula, which proves Corollary 1:

$$\mathbb{E}_{z \sim p(z)}[L(\theta, z)] \leq C_1(\theta) + C_2(\theta)\,\mathbb{E}_{z \sim p(z)}[\|z\|] < +\infty$$
3. Wasserstein GAN
Although the Wasserstein distance shows good properties for distribution learning, the original formula for $W(\mathbb{P}_r, \mathbb{P}_\theta)$ (the infimum in Eq. 8) is intractable. So the Kantorovich-Rubinstein (KR) duality is utilized to approximate $W(\mathbb{P}_r, \mathbb{P}_\theta)$:

$$W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)] \quad (11)$$
where the supremum is over all 1-Lipschitz functions $f: \mathcal{X} \to \mathbb{R}$, and $\|f\|_L$ is the Lipschitz constant of $f$. Let $f^*$ be the function that attains the supremum.
Thus, based on $f^*$, the envelope theorem can be applied to obtain the following gradient:

$$\nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f^*(g_\theta(z))]$$
Therefore, if we employ a parameterized family of functions $\{f_w\}_{w \in \mathcal{W}}$ that are all $K$-Lipschitz for some $K$, we can solve the following problem to approximate the optimization over $f$:

$$\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))] \quad (12)$$
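The dual form can be checked numerically (my own toy sketch, not the paper's experiment): every 1-Lipschitz $f$ gives a lower bound $\mathbb{E}_{\mathbb{P}_r}[f] - \mathbb{E}_{\mathbb{P}_\theta}[f] \leq W$, and for two unit Gaussians whose means differ by 2 the identity map $f(x) = x$ attains the supremum.

```python
import random

random.seed(7)

# Kantorovich-Rubinstein sketch: estimate E_Pr[f] - E_Ptheta[f] from samples.
# For two unit Gaussians with means 2 and 0, f(x) = x is 1-Lipschitz and
# attains the supremum, so the dual estimate should be close to W = 2.
p = [random.gauss(2.0, 1.0) for _ in range(20000)]  # "real" samples
q = [random.gauss(0.0, 1.0) for _ in range(20000)]  # "model" samples

def dual_estimate(f, xs, ys):
    return sum(f(x) for x in xs) / len(xs) - sum(f(y) for y in ys) / len(ys)

print(abs(dual_estimate(lambda x: x, p, q) - 2.0) < 0.1)  # True
```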
The KR-duality-approximated Wasserstein distance can be applied to a GAN framework whose training process consists of three steps per iteration:
1. For a fixed $\theta$, train the critic by ascending the gradient of Eq. 12 with respect to $w$. To guarantee the Lipschitz constraint, a weight-clipping sub-step is implemented, keeping each weight in $[-c, c]$ (e.g. $c = 0.01$).
2. With an (approximately) optimal critic $f_w$ (discriminator), compute the generator gradient $-\mathbb{E}_{z \sim p(z)}[\nabla_\theta f_w(g_\theta(z))]$ for finding the optimal generator parameters $\theta$.
3. Update $\theta$ and repeat for the next iteration.
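The three steps above can be sketched for one iteration on an assumed toy 1-D setup (a linear critic $f_w(x) = w x$ and generator $g_\theta(z) = z + \theta$; the model and its hand-derived gradients are illustrative assumptions of mine, not the paper's implementation):

```python
import random

random.seed(0)

# One WGAN iteration, toy 1-D version: linear critic f_w(x) = w * x,
# generator g_theta(z) = z + theta, real data ~ N(3, 1), noise z ~ N(0, 1).
theta, w = 0.0, 0.2
c, lr, n = 0.01, 1e-4, 64  # clipping threshold c (e.g. 0.01), step size, batch

real = [random.gauss(3.0, 1.0) for _ in range(n)]
z = [random.gauss(0.0, 1.0) for _ in range(n)]
fake = [zi + theta for zi in z]

# Step 1: critic gradient ascent on E[f_w(x)] - E[f_w(g_theta(z))]; for the
# linear critic the gradient wrt w is mean(real) - mean(fake). Then clip.
grad_w = sum(real) / n - sum(fake) / n
w = max(-c, min(c, w + lr * grad_w))  # weight clipping keeps f_w Lipschitz

# Step 2: generator gradient of -E[f_w(g_theta(z))] wrt theta is simply -w.
grad_theta = -w

# Step 3: update theta for the next iteration.
theta -= lr * grad_theta
print(-c <= w <= c)  # True: the clipped critic weight stays in [-c, c]
```

In practice the critic takes several gradient steps per generator step, and both networks are deep models trained by backpropagation; the sketch only makes the three-step structure and the clipping constraint concrete.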
4. Discussion
The major methodological contribution of the Wasserstein GAN is that a KR-duality-approximated Wasserstein distance is employed to build a GAN framework. The Wasserstein GAN suppresses the gradient-vanishing issue of traditional methods and provides remarkably clean gradients. Therefore, even when batch normalization is removed from the generator of a previous approach (e.g. DCGAN [3]), the Wasserstein GAN still works very well, whereas the previous GANs fail. However, one possible drawback of the Wasserstein GAN is that its performance depends heavily on the choice of the clipping threshold $c$. An unsuitable $c$ may leave the Lipschitz constraint effectively unsatisfied. A further improvement is therefore the method of “Improved Training of Wasserstein GANs” [4]. This method introduces a gradient penalty, $\lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2]$, into the objective function (where $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$, $x \sim \mathbb{P}_r$, $\tilde{x} \sim \mathbb{P}_\theta$, $\epsilon \sim U[0, 1]$). The gradient penalty automatically regularizes the objective function and thus increases the robustness of the Wasserstein GAN across different applications. All in all, the Wasserstein GAN offers a very effective distance measure for the development of GAN models.
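The gradient penalty can be made concrete on a toy critic (a sketch of mine, not the paper's implementation; for a linear critic $f_w(x) = wx$ the input gradient is $w$ everywhere, so the penalty reduces to $\lambda(|w| - 1)^2$ at every interpolate):

```python
import random

random.seed(0)

# Gradient-penalty sketch: for a toy linear critic f_w(x) = w * x the
# gradient wrt the input is w everywhere, so the WGAN-GP penalty
# lambda * (||grad_xhat f(xhat)|| - 1)^2 reduces to lam * (abs(w) - 1)**2,
# evaluated at random interpolates xhat between real and generated samples.
def gradient_penalty(w, real, fake, lam=10.0):
    pen = 0.0
    for x, y in zip(real, fake):
        eps = random.random()
        xhat = eps * x + (1 - eps) * y  # interpolate as in WGAN-GP
        grad_norm = abs(w)              # d/dxhat of (w * xhat)
        pen += lam * (grad_norm - 1.0) ** 2
    return pen / len(real)

real = [random.gauss(3.0, 1.0) for _ in range(8)]
fake = [random.gauss(0.0, 1.0) for _ in range(8)]
print(gradient_penalty(1.0, real, fake))      # 0.0: a unit-slope critic is not penalized
print(gradient_penalty(0.5, real, fake) > 0)  # True: a too-flat critic is penalized
```

The penalty pulls the critic's input-gradient norm toward 1, a soft version of the 1-Lipschitz constraint that weight clipping enforces by hard truncation.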
Reference:
[1] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. “Wasserstein GAN.” arXiv preprint arXiv:1701.07875 (2017).
[2] Arjovsky, Martin, and Léon Bottou. “Towards principled methods for training generative adversarial networks.” arXiv preprint arXiv:1701.04862 (2017).
[3] Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).
[4] Gulrajani, Ishaan, et al. “Improved training of Wasserstein GANs.” arXiv preprint arXiv:1704.00028 (2017).