Arjovsky et al. [1] proposed a new framework to suppress the vanishing-gradient issue in traditional Generative Adversarial Networks (GANs) [2]. A distance measure, the Earth-Mover (EM) or Wasserstein distance, is utilized to guarantee a continuous and (almost everywhere) differentiable objective function during training. This blog is organized as follows. Section 1 introduces the background knowledge for learning unknown distributions. Section 2 presents the Wasserstein distance and shows its advantages by comparing it with three other distance measures. Section 3 describes the framework of the Wasserstein GAN. Finally, a discussion is given.
1. Background
Learning unknown distributions (for example, for the learning of generative models) is a fundamental problem in machine learning. Given data $\{x^{(i)}\}_{i=1}^m$ sampled from an unknown distribution $\mathbb{P}_r$, the goal is to learn a distribution $\mathbb{P}_\theta$ to approximate the real one, where $\theta$ are the parameters of the distribution. Usually, there are two approaches to achieve this goal. The first is to directly learn the probability density function $P_\theta$, e.g. using maximum likelihood estimation (MLE). Yet $P_\theta$ may be difficult to obtain, so another approach is to learn a function $g_\theta$ that transforms an existing distribution $Z$ into $\mathbb{P}_\theta$. Here, $g_\theta$ is some differentiable function, $Z$ is a common distribution (usually uniform or Gaussian), and $\mathbb{P}_\theta$ is the distribution of $g_\theta(z)$ with $z \sim Z$.
1.1. Maximum likelihood estimation
For samples $\{x^{(i)}\}_{i=1}^m$ drawn from the real distribution and a given parametric density $P_\theta$, the MLE is written as:

$$\max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)}) \quad (1)$$
When $m \to \infty$, the samples approximate the real data distribution $\mathbb{P}_r$ very closely. Thus, based on Eq. 1, we can get the following result:

$$\max_{\theta} \int_x P_r(x) \log P_\theta(x) \, dx = \min_{\theta} -\int_x P_r(x) \log P_\theta(x) \, dx = \min_{\theta} \int_x P_r(x) \log P_r(x) \, dx - \int_x P_r(x) \log P_\theta(x) \, dx \quad (2)$$
In the last line of Eq. 2, the new term $\int_x P_r(x) \log P_r(x) \, dx$ is the negative entropy of $\mathbb{P}_r$. This value is a constant in $\theta$, so it does not affect the search for the $\theta$ minimizing Eq. 2. According to the definition of the Kullback-Leibler (KL) divergence, for two continuous distributions $\mathbb{P}_r$ and $\mathbb{P}_\theta$ with densities $P_r$ and $P_\theta$, the KL divergence is $KL(\mathbb{P}_r \| \mathbb{P}_\theta) = \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)} \, dx$. So Eq. 2 can be further written as:

$$\min_{\theta} KL(\mathbb{P}_r \| \mathbb{P}_\theta) \quad (3)$$
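As a concrete numerical illustration of Eq. 1 (a toy sketch of mine, not from the references, assuming a 1-D Gaussian model with unit variance and unknown mean), maximizing the average log-likelihood over a grid of candidate means recovers approximately the sample mean:

```python
import math
import random

random.seed(1)

# Toy MLE sketch: fit the mean of a unit-variance Gaussian by maximizing
# the average log-likelihood of Eq. 1 over a grid of candidate means.
xs = [random.gauss(1.5, 1.0) for _ in range(5000)]

def avg_loglik(mu):
    # (1/m) * sum_i log P_theta(x_i) for P_theta = N(mu, 1)
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs) / len(xs)

grid = [i / 100 for i in range(-100, 300)]  # candidate means in [-1.0, 3.0)
best = max(grid, key=avg_loglik)
print(abs(best - sum(xs) / len(xs)) < 0.01)  # True: MLE is close to the sample mean
```

The grid search stands in for gradient ascent only to keep the sketch dependency-free; in practice the maximizer is found analytically or by optimization.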
However, if $P_\theta(x) = 0$ at a point $x$ where $P_r(x) > 0$, the KL divergence becomes infinite. This is bad for MLE learning because $P_\theta$ is very likely to assign zero density to some data points. A typical method to fix the issue is to add random noise to the learned distribution, which ensures the density is defined (nonzero) everywhere. But this increases computation, and the noise model may still require handcrafted selection and design.
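The blow-up and the noise fix can be seen on a tiny discrete example (my own sketch, not from the references):

```python
import math

# Toy discrete illustration: KL(Pr || Ptheta) diverges wherever Ptheta
# assigns zero mass to an outcome that Pr supports.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pr = [0.5, 0.5, 0.0]
ptheta = [0.5, 0.0, 0.5]  # zero mass on an outcome pr covers -> kl(pr, ptheta) diverges

# The classic fix: mix a little uniform noise into the model so its
# density is nonzero everywhere (here eps of uniform mass per outcome).
eps = 1e-3
ptheta_smooth = [(1 - 3 * eps) * q + eps for q in ptheta]
print(math.isfinite(kl(pr, ptheta_smooth)))  # True: the smoothed KL is finite
```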
1.2. Generative adversarial networks
Generative Adversarial Networks are a well-known example of the indirect approach (i.e. the distribution transformation discussed above) to distribution learning. Training a GAN is a competing game between two neural networks. The first network, the generator, takes noise as input (e.g. a noise image) and tries to generate “intermediate” samples for the other network. The second network, the discriminator (also called the critic), receives either a generated sample or a real data sample, and tries to learn a model that can distinguish between the “fake” and real data. The generator is trained to fool the discriminator, which has access to the real data. After many training iterations, the generator acquires the ability to produce results that are very close to the real ones.
Formally, the competition between the generator and the discriminator is the minimax objective:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))] \quad (4)$$
where $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution $p(z)$, such as the uniform distribution or a spherical Gaussian distribution).
The first step in solving the GAN objective is to find the optimal discriminator. Setting the derivative of Eq. 4 (with respect to $D$) to 0, the optimal $D^*$ is obtained as follows:

$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} \quad (5)$$
Putting the optimal $D^*$ back into Eq. 4, we can get the following equations:

$$\max_D L(D, G) = \mathbb{E}_{x \sim \mathbb{P}_r}\!\left[\log \frac{P_r(x)}{P_r(x) + P_g(x)}\right] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\!\left[\log \frac{P_g(\tilde{x})}{P_r(\tilde{x}) + P_g(\tilde{x})}\right] \quad (6)$$

$$= 2\,JS(\mathbb{P}_r \| \mathbb{P}_g) - 2 \log 2 \quad (7)$$
where $JS$ represents the Jensen-Shannon (JS) divergence, defined as $JS(\mathbb{P}_r \| \mathbb{P}_g) = \frac{1}{2} KL(\mathbb{P}_r \| \mathbb{P}_m) + \frac{1}{2} KL(\mathbb{P}_g \| \mathbb{P}_m)$, and $\mathbb{P}_m$ is the mixture $(\mathbb{P}_r + \mathbb{P}_g)/2$.
Thus, minimizing Eq. 7 with respect to $G$ is equivalent to minimizing the JS divergence when we have the optimal $D^*$. In most practical training situations, $\mathbb{P}_r$ and $\mathbb{P}_g$ have (almost) no overlap; then $JS(\mathbb{P}_r \| \mathbb{P}_g) = \log 2$ is a constant, and the gradient vanishes.
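The saturation at $\log 2$ is easy to verify on a small discrete example (my own sketch):

```python
import math

# Toy discrete illustration: when Pr and Pg have disjoint supports,
# JS(Pr || Pg) equals the constant log 2, so it carries no gradient.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # the mixture (Pr + Pg) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pr = [0.5, 0.5, 0.0, 0.0]
pg = [0.0, 0.0, 0.5, 0.5]  # no overlap with pr
print(abs(js(pr, pg) - math.log(2)) < 1e-12)  # True: JS saturates at log 2
```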
2. Distance Measures
2.1. Definition of Wasserstein distance
Because the KL and JS divergences suffer from the gradient-vanishing issue, the Earth-Mover (EM) distance, or Wasserstein distance, is utilized as the measure for training the GAN framework. The definition of the Wasserstein distance is as follows:

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|] \quad (8)$$
Here, $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginal distributions are $\mathbb{P}_r$ and $\mathbb{P}_g$; $\|x - y\|$ could be the Euclidean norm. Intuitively, $\gamma(x, y)$ indicates how much “mass” must be transported from $x$ to $y$, so the Wasserstein distance assesses the minimum effort required to move probability mass from one distribution to the other, and $\|x - y\|$ represents the moved distance.
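As a quick numerical sanity check of what $W$ measures (a sketch of mine, not from the paper): in one dimension the optimal transport plan pairs sorted samples, so the empirical Wasserstein-1 distance is the mean absolute difference of the order statistics.

```python
import random

random.seed(42)

# Empirical 1-D Wasserstein-1 distance: in one dimension the optimal
# coupling matches sorted samples, so W1 is the mean absolute difference
# of the paired order statistics.
def wasserstein1(xs, ys):
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Two unit Gaussians whose means differ by 2: the true W1 equals 2.
p = [random.gauss(0.0, 1.0) for _ in range(20000)]
q = [random.gauss(2.0, 1.0) for _ in range(20000)]
print(abs(wasserstein1(p, q) - 2.0) < 0.1)  # True: the estimate is close to 2
```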
Consider a simple example with probability distributions defined over $\mathbb{R}^2$. Let $Z \sim U[0, 1]$ be uniform on the unit interval. The true data distribution is $\mathbb{P}_0$, the distribution of $(0, Z)$: a vertical line segment through the origin. The model distribution is $\mathbb{P}_\theta$, the distribution of $(\theta, Z)$, where $\theta$ is a single real parameter. For comparison, the KL divergence, the JS divergence, and the Total Variation distance ($\delta$) are employed alongside the Wasserstein distance. The following shows the different behavior of the four distance measures:
Total Variation (TV) distance: $\delta(\mathbb{P}_0, \mathbb{P}_\theta) = 1$ if $\theta \neq 0$, and $0$ if $\theta = 0$.
Kullback-Leibler (KL) divergence: $KL(\mathbb{P}_\theta \| \mathbb{P}_0) = KL(\mathbb{P}_0 \| \mathbb{P}_\theta) = +\infty$ if $\theta \neq 0$, and $0$ if $\theta = 0$.
Jensen-Shannon (JS) divergence: $JS(\mathbb{P}_0 \| \mathbb{P}_\theta) = \log 2$ if $\theta \neq 0$, and $0$ if $\theta = 0$.
Wasserstein distance: $W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|$.
Here, only the output of the Wasserstein distance depends continuously on the parameter $\theta$. For the other three measures, when $\mathbb{P}_0$ and $\mathbb{P}_\theta$ have no overlap (a very common situation in GAN training), the output is a constant independent of $\theta$, which may yield no meaningful gradient.
This example shows that when TV, KL, or JS is used as the distance measure for distribution estimation, there exist situations in which learning does not converge. Moreover, these measures may also make the gradient zero when computing $\nabla_\theta \delta(\mathbb{P}_0, \mathbb{P}_\theta)$, $\nabla_\theta KL(\mathbb{P}_0 \| \mathbb{P}_\theta)$, or $\nabla_\theta JS(\mathbb{P}_0 \| \mathbb{P}_\theta)$. The Wasserstein distance, in contrast, suppresses both the non-convergence and the gradient-vanishing issues.
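The four cases above can be written down directly as closed-form functions of $\theta$:

```python
import math

# The parallel-lines example in closed form (values as listed above):
# only the Wasserstein distance varies smoothly with theta.
def tv(theta):
    return 0.0 if theta == 0 else 1.0

def kl(theta):
    return 0.0 if theta == 0 else math.inf

def js(theta):
    return 0.0 if theta == 0 else math.log(2)

def w(theta):
    return abs(theta)

for theta in (1.0, 0.5, 0.1, 0.0):
    print(tv(theta), js(theta), w(theta))
# Only the last column shrinks toward 0 with theta; TV and JS stay constant
# until theta hits exactly 0, so their gradient with respect to theta is 0.
```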
2.2. Theorem proof
Now let us go deeper and see that, under general conditions, the Wasserstein distance is continuous and differentiable, and that it still works well even when the transform function $g_\theta$ is a neural network.
Theorem 1: Let $\mathbb{P}_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable (e.g. Gaussian) over another space $\mathcal{Z}$. Let $g: \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, denoted $g_\theta(z)$ with $z$ the first coordinate and $\theta$ the second. Let $\mathbb{P}_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(\mathbb{P}_r, \mathbb{P}_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity Assumption 1, then $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere, and differentiable almost everywhere.
3. Statements 1–2 are false for the Jensen-Shannon divergence $JS(\mathbb{P}_r, \mathbb{P}_\theta)$ and all the KL divergences.
Assumption 1: Let $g: \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be locally Lipschitz. For a distribution $p$ over $\mathcal{Z}$, $g$ satisfies this assumption if there exist local Lipschitz constants $L(\theta, z)$ such that:

$$\mathbb{E}_{z \sim p}[L(\theta, z)] < +\infty$$
Proof of Theorem 1: 1. Bound $W(\mathbb{P}_\theta, \mathbb{P}_{\theta'})$, where $\theta$ and $\theta'$ are two parameter vectors in $\mathbb{R}^d$. The distribution of the joint $(g_\theta(Z), g_{\theta'}(Z))$ is a coupling $\gamma$ that has marginals $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$. Based on the definition of the Wasserstein distance in Eq. 8, we have:

$$W(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \leq \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|] = \mathbb{E}_z[\|g_\theta(z) - g_{\theta'}(z)\|] \quad (9)$$
$g$ is continuous in $\theta$, and $\mathcal{X}$ is compact, so $\|g_\theta(z) - g_{\theta'}(z)\| \to 0$ pointwise and is uniformly bounded. By the bounded convergence theorem, as $\theta' \to \theta$ we have:

$$W(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \leq \mathbb{E}_z[\|g_\theta(z) - g_{\theta'}(z)\|] \to 0$$
Moreover, because $W$ is a distance measure, using the triangle inequality for $W(\mathbb{P}_r, \mathbb{P}_\theta)$ and $W(\mathbb{P}_r, \mathbb{P}_{\theta'})$, we have:

$$|W(\mathbb{P}_r, \mathbb{P}_\theta) - W(\mathbb{P}_r, \mathbb{P}_{\theta'})| \leq W(\mathbb{P}_\theta, \mathbb{P}_{\theta'})$$
Thus, as $\theta'$ gets very close to $\theta$, $|W(\mathbb{P}_r, \mathbb{P}_\theta) - W(\mathbb{P}_r, \mathbb{P}_{\theta'})|$ also gets very close to zero, so $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous in $\theta$ when $g$ is continuous in $\theta$. This proves item 1 of the Theorem.
2. If $g$ is locally Lipschitz, then for a given pair $(\theta, z)$ there exist a constant $L(\theta, z)$ and an open set $U \ni (\theta, z)$ such that, for every $(\theta', z') \in U$, we have:

$$\|g_\theta(z) - g_{\theta'}(z')\| \leq L(\theta, z)\,(\|\theta - \theta'\| + \|z - z'\|)$$
If we set $z' = z$ and take expectations over $z$, then we have:

$$\mathbb{E}_z[\|g_\theta(z) - g_{\theta'}(z)\|] \leq \|\theta - \theta'\| \, \mathbb{E}_z[L(\theta, z)] \quad (10)$$
Define $L(\theta) = \mathbb{E}_z[L(\theta, z)]$, which is finite by Assumption 1. Using Eq. 9 and Eq. 10, we have:

$$|W(\mathbb{P}_r, \mathbb{P}_\theta) - W(\mathbb{P}_r, \mathbb{P}_{\theta'})| \leq W(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \leq L(\theta)\,\|\theta - \theta'\|$$
This shows that $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is locally Lipschitz, hence continuous everywhere and, by Rademacher's theorem, differentiable almost everywhere. This proves item 2 of the Theorem.
3. The example in Section 2 provides a counterexample that proves item 3.
Corollary 1: Let $g_\theta$ be any feedforward neural network parameterized by $\theta$, and let $p(z)$ be a prior over $z$ such that $\mathbb{E}_{z \sim p(z)}[\|z\|] < \infty$ (e.g. Gaussian or uniform). Then Assumption 1 is satisfied, and therefore $W(\mathbb{P}_r, \mathbb{P}_\theta)$ is continuous everywhere and differentiable almost everywhere.
Proof of Corollary 1: If we can show that there exist local Lipschitz constants $L(\theta, z)$ with $\mathbb{E}_{z \sim p(z)}[L(\theta, z)] < +\infty$, then the corollary follows from Theorem 1.
Let $H$ be the number of layers, and let $H_{i:j}$ denote the application of layers $i$ through $j$ inclusively (e.g. $g_\theta = H_{1:H}$). Then we have:

$$\nabla_z g_\theta(z) = \prod_{k=1}^{H} W_k D_k$$
where $W_k$ are the weight matrices and $D_k$ are the diagonal Jacobians of the nonlinearities.
Let $L$ be the Lipschitz constant of the nonlinearity of the network; then $\|D_k\| \leq L$ and $\|\nabla_z g_\theta(z)\| \leq \prod_{k=1}^{H} L \|W_k\|$. Considering the gradient with respect to the parameters of each layer, an analogous product bound involves the layer activations $H_{1:k-1}(z)$ and therefore grows at most linearly in $\|z\|$, so we have:

$$\|\nabla_{\theta, z}\, g_\theta(z)\| \leq C_1(\theta) + C_2(\theta)\,\|z\|$$
If we set $L(\theta, z) = C_1(\theta) + C_2(\theta)\,\|z\|$, we get the following formula, which proves Corollary 1:

$$\mathbb{E}_{z \sim p(z)}[L(\theta, z)] \leq C_1(\theta) + C_2(\theta)\,\mathbb{E}_{z \sim p(z)}[\|z\|] < +\infty$$
3. Wasserstein GAN
Although the Wasserstein distance shows good properties for distribution learning, the original formula for $W(\mathbb{P}_r, \mathbb{P}_\theta)$ (the infimum in Eq. 8) is intractable. So the Kantorovich-Rubinstein (KR) duality is utilized to approximate $W(\mathbb{P}_r, \mathbb{P}_\theta)$:

$$W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)] \quad (11)$$
where the supremum is over all 1-Lipschitz functions $f: \mathcal{X} \to \mathbb{R}$, and $\|f\|_L$ is the Lipschitz constant of $f$. Let $f^*$ be the function that attains the supremum.
Thus, based on $f^*$, the envelope theorem can be applied to obtain the following gradient:

$$\nabla_\theta W(\mathbb{P}_r, \mathbb{P}_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f^*(g_\theta(z))]$$
Therefore, if we employ a parameterized family of functions $\{f_w\}_{w \in \mathcal{W}}$ that are all $K$-Lipschitz for some $K$, we can solve the following problem to approximate the optimization over $f$:

$$\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))] \quad (12)$$
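The dual form can be checked numerically (my own toy sketch, not the paper's experiment): every 1-Lipschitz $f$ gives a lower bound $\mathbb{E}_{\mathbb{P}_r}[f] - \mathbb{E}_{\mathbb{P}_\theta}[f] \leq W$, and for two unit Gaussians whose means differ by 2 the identity map $f(x) = x$ attains the supremum.

```python
import random

random.seed(7)

# Kantorovich-Rubinstein sketch: estimate E_Pr[f] - E_Ptheta[f] from samples.
# For two unit Gaussians with means 2 and 0, f(x) = x is 1-Lipschitz and
# attains the supremum, so the dual estimate should be close to W = 2.
p = [random.gauss(2.0, 1.0) for _ in range(20000)]  # "real" samples
q = [random.gauss(0.0, 1.0) for _ in range(20000)]  # "model" samples

def dual_estimate(f, xs, ys):
    return sum(f(x) for x in xs) / len(xs) - sum(f(y) for y in ys) / len(ys)

print(abs(dual_estimate(lambda x: x, p, q) - 2.0) < 0.1)  # True
```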
The KR-duality-approximated Wasserstein distance can be applied to a GAN framework whose training process consists of three steps per iteration:
1. For a fixed $\theta$, train the critic by ascending the gradient of Eq. 12 with respect to $w$. To guarantee the Lipschitz constraint, a weight-clipping sub-step is implemented, keeping each weight in $[-c, c]$ (e.g. $c = 0.01$).
2. With an (approximately) optimal critic $f_w$ (discriminator), compute the generator gradient $-\mathbb{E}_{z \sim p(z)}[\nabla_\theta f_w(g_\theta(z))]$ for finding the optimal generator parameters $\theta$.
3. Update $\theta$ and repeat for the next iteration.
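The three steps above can be sketched for one iteration on an assumed toy 1-D setup (a linear critic $f_w(x) = w x$ and generator $g_\theta(z) = z + \theta$; the model and its hand-derived gradients are illustrative assumptions of mine, not the paper's implementation):

```python
import random

random.seed(0)

# One WGAN iteration, toy 1-D version: linear critic f_w(x) = w * x,
# generator g_theta(z) = z + theta, real data ~ N(3, 1), noise z ~ N(0, 1).
theta, w = 0.0, 0.2
c, lr, n = 0.01, 1e-4, 64  # clipping threshold c (e.g. 0.01), step size, batch

real = [random.gauss(3.0, 1.0) for _ in range(n)]
z = [random.gauss(0.0, 1.0) for _ in range(n)]
fake = [zi + theta for zi in z]

# Step 1: critic gradient ascent on E[f_w(x)] - E[f_w(g_theta(z))]; for the
# linear critic the gradient wrt w is mean(real) - mean(fake). Then clip.
grad_w = sum(real) / n - sum(fake) / n
w = max(-c, min(c, w + lr * grad_w))  # weight clipping keeps f_w Lipschitz

# Step 2: generator gradient of -E[f_w(g_theta(z))] wrt theta is simply -w.
grad_theta = -w

# Step 3: update theta for the next iteration.
theta -= lr * grad_theta
print(-c <= w <= c)  # True: the clipped critic weight stays in [-c, c]
```

In practice the critic takes several gradient steps per generator step, and both networks are deep models trained by backpropagation; the sketch only makes the three-step structure and the clipping constraint concrete.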
4. Discussion
The major methodological contribution of the Wasserstein GAN is that a KR-duality-approximated Wasserstein distance is employed to build a GAN framework. The Wasserstein GAN suppresses the gradient-vanishing issue of traditional methods and provides remarkably clean gradients. Therefore, even when batch normalization is removed from the generator of a previous approach (e.g. DCGAN [3]), the Wasserstein GAN still works very well, whereas the previous GANs fail. However, one possible drawback of the Wasserstein GAN is that its performance depends heavily on the choice of the clipping threshold $c$. An unsuitable $c$ may leave the Lipschitz constraint effectively unsatisfied. A further improvement is therefore the method of “Improved Training of Wasserstein GANs” [4]. This method introduces a gradient penalty, $\lambda\, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1)^2]$, into the objective function (where $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$, $x \sim \mathbb{P}_r$, $\tilde{x} \sim \mathbb{P}_\theta$, $\epsilon \sim U[0, 1]$). The gradient penalty automatically regularizes the objective function and thus increases the robustness of the Wasserstein GAN across different applications. All in all, the Wasserstein GAN offers a very effective distance measure for the development of GAN models.
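The gradient penalty can be made concrete on a toy critic (a sketch of mine, not the paper's implementation; for a linear critic $f_w(x) = wx$ the input gradient is $w$ everywhere, so the penalty reduces to $\lambda(|w| - 1)^2$ at every interpolate):

```python
import random

random.seed(0)

# Gradient-penalty sketch: for a toy linear critic f_w(x) = w * x the
# gradient wrt the input is w everywhere, so the WGAN-GP penalty
# lambda * (||grad_xhat f(xhat)|| - 1)^2 reduces to lam * (abs(w) - 1)**2,
# evaluated at random interpolates xhat between real and generated samples.
def gradient_penalty(w, real, fake, lam=10.0):
    pen = 0.0
    for x, y in zip(real, fake):
        eps = random.random()
        xhat = eps * x + (1 - eps) * y  # interpolate as in WGAN-GP
        grad_norm = abs(w)              # d/dxhat of (w * xhat)
        pen += lam * (grad_norm - 1.0) ** 2
    return pen / len(real)

real = [random.gauss(3.0, 1.0) for _ in range(8)]
fake = [random.gauss(0.0, 1.0) for _ in range(8)]
print(gradient_penalty(1.0, real, fake))      # 0.0: a unit-slope critic is not penalized
print(gradient_penalty(0.5, real, fake) > 0)  # True: a too-flat critic is penalized
```

The penalty pulls the critic's input-gradient norm toward 1, a soft version of the 1-Lipschitz constraint that weight clipping enforces by hard truncation.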
Reference:
[1] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. “Wasserstein GAN.” arXiv preprint arXiv:1701.07875 (2017).
[2] Arjovsky, Martin, and Léon Bottou. “Towards principled methods for training generative adversarial networks.” arXiv preprint arXiv:1701.04862 (2017).
[3] Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).
[4] Gulrajani, Ishaan, et al. “Improved training of Wasserstein GANs.” arXiv preprint arXiv:1704.00028 (2017).