
ganhacks's Introduction

(this list is no longer maintained, and I am not sure how relevant it is in 2020)

How to Train a GAN? Tips and tricks to make GANs work

While research in Generative Adversarial Networks (GANs) continues to improve the fundamental stability of these models, we use a bunch of tricks to train them and make them stable day to day.

Here is a summary of some of the tricks.

The authors of this document are listed at the end.

If you find a trick that is particularly useful in practice, please open a Pull Request to add it to the document. If we find it to be reasonable and verified, we will merge it in.

1: Normalize the inputs

  • normalize the images between -1 and 1 (see the sketch below)
  • use Tanh as the last layer of the generator output
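A minimal PyTorch sketch of both points (the transform pipeline and generator head here are illustrative assumptions, not code from this document):

import torch.nn as nn
from torchvision import transforms

# Scale image tensors from [0, 1] to [-1, 1], matching the generator's tanh range.
transform = transforms.Compose([
    transforms.ToTensor(),                                   # uint8 [0, 255] -> float [0, 1]
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # [0, 1] -> [-1, 1]
])

# Final generator layer: tanh squashes outputs into [-1, 1].
generator_head = nn.Sequential(
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
    nn.Tanh(),
)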

2: A modified loss function

In GAN papers, the loss function to optimize G is min log(1 - D), but in practice folks use max log D

  • because the first formulation has vanishing gradients early on
  • Goodfellow et al. (2014)

In practice, this works well:

  • Flip labels when training the generator: real = fake, fake = real (see the sketch below)
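A sketch of the non-saturating formulation, assuming hypothetical netG and netD modules where D ends in a sigmoid:

import torch
import torch.nn as nn

criterion = nn.BCELoss()

def generator_step(netG, netD, z):
    # max log D(G(z)) instead of min log(1 - D(G(z))):
    # label the generator's fakes as "real" (the label flip) and minimize BCE,
    # which equals -log D(G(z)) and keeps gradients alive early in training.
    fake = netG(z)
    output = netD(fake)
    real_labels = torch.ones_like(output)
    return criterion(output, real_labels)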

3: Use a spherical Z

  • Don't sample from a uniform distribution (points fill a hypercube)

[image: cube]

  • Sample from a Gaussian distribution (points concentrate near a hypersphere; see the sketch below)

[image: sphere]
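A sketch of the two sampling choices (batch and latent sizes are arbitrary):

import torch

batch_size, z_dim = 64, 100

# Don't: uniform sampling fills a hypercube.
# z = torch.rand(batch_size, z_dim) * 2 - 1

# Do: Gaussian sampling; in high dimensions the mass concentrates near a sphere.
z = torch.randn(batch_size, z_dim)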

4: BatchNorm

  • Construct different mini-batches for real and fake, i.e. each mini-batch should contain only real images or only generated images (see the sketch below).
  • When batchnorm is not an option, use instance normalization (for each sample, subtract the mean and divide by the standard deviation).

[image: batchmix]
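A sketch of the separate-batch discriminator update; netD, netG, criterion, the label tensors, and d_optimizer are assumed to exist:

# Two separate passes through D, so batchnorm statistics are computed
# over an all-real batch and an all-fake batch, never a mixture.
d_optimizer.zero_grad()

loss_real = criterion(netD(real_images), real_labels)     # all-real mini-batch
loss_real.backward()

fake_images = netG(z).detach()                            # don't backprop into G here
loss_fake = criterion(netD(fake_images), fake_labels)     # all-fake mini-batch
loss_fake.backward()

d_optimizer.step()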

5: Avoid Sparse Gradients: ReLU, MaxPool

  • the stability of the GAN game suffers if you have sparse gradients
  • LeakyReLU = good (in both G and D)
  • For Downsampling, use: Average Pooling, Conv2d + stride
  • For Upsampling, use: PixelShuffle, ConvTranspose2d + stride (see the sketch below)
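Illustrative PyTorch blocks for these choices (channel sizes are arbitrary):

import torch.nn as nn

# Discriminator: strided conv instead of MaxPool, LeakyReLU instead of ReLU.
d_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),   # downsampling
    nn.LeakyReLU(0.2, inplace=True),                          # dense gradients
)

# Generator: strided transposed conv (or nn.PixelShuffle) for upsampling.
g_block = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2, inplace=True),
)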

6: Use Soft and Noisy Labels

  • Label Smoothing, i.e. if you have two target labels, Real=1 and Fake=0, then for each incoming sample: if it is real, replace the label with a random number between 0.7 and 1.2; if it is fake, replace it with a random number between 0.0 and 0.3 (for example).
    • Salimans et al. (2016)
  • make the labels noisy for the discriminator: occasionally flip the labels when training the discriminator (see the sketch below)
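A sketch combining both ideas into one helper (the flip probability is an arbitrary choice; note that targets above 1.0 need a loss that accepts them, e.g. MSE, or should be clamped for BCELoss):

import torch

def soft_noisy_labels(batch_size, is_real, flip_p=0.05):
    if is_real:
        labels = 0.7 + 0.5 * torch.rand(batch_size)    # real: U[0.7, 1.2]
    else:
        labels = 0.3 * torch.rand(batch_size)          # fake: U[0.0, 0.3]
    # Occasionally flip real <-> fake when training the discriminator.
    flip = torch.rand(batch_size) < flip_p
    labels[flip] = 1.0 - labels[flip]
    return labels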

7: DCGAN / Hybrid Models

  • Use DCGAN when you can. It works!
  • if you can't use DCGANs and no model is stable, use a hybrid model: KL + GAN or VAE + GAN

8: Use stability tricks from RL

  • Experience Replay
    • Keep a replay buffer of past generations and occasionally show them (see the sketch after this list)
    • Keep checkpoints from the past of G and D and occasionally swap them out for a few iterations
  • All stability tricks that work for deep deterministic policy gradients
  • See Pfau & Vinyals (2016)
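A minimal replay-buffer sketch in the spirit of this trick (capacity and replay probability are arbitrary assumptions):

import random
import torch

class ReplayBuffer:
    # Store past generated images; sometimes show D an old fake
    # instead of the freshly generated one.
    def __init__(self, capacity=3200):
        self.capacity = capacity
        self.images = []

    def query(self, fakes, p_replay=0.5):
        out = []
        for img in fakes:
            if len(self.images) < self.capacity:
                self.images.append(img)
                out.append(img)
            elif random.random() < p_replay:
                i = random.randrange(self.capacity)
                out.append(self.images[i])     # replay an old fake
                self.images[i] = img           # store the new one in its place
            else:
                out.append(img)
        return torch.stack(out)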

9: Use the ADAM Optimizer

  • optim.Adam rules!
    • See Radford et al. (2015)
  • Use SGD for the discriminator and Adam for the generator (see the sketch below)
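A sketch of the split, assuming hypothetical netG and netD; the Adam hyperparameters follow the DCGAN paper (Radford et al. 2015):

import torch.optim as optim

optimizerG = optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))  # generator: Adam
optimizerD = optim.SGD(netD.parameters(), lr=2e-4)                       # discriminator: SGD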

10: Track failures early

  • D loss goes to 0: failure mode
  • check norms of gradients: if they are over 100, things are screwing up (see the helper below)
  • when things are working, D loss has low variance and goes down over time, vs. having huge variance and spiking
  • if the loss of the generator steadily decreases, it is fooling D with garbage (says Martin)
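A small helper for the gradient-norm check (call it after loss.backward(); the threshold of 100 comes from the bullet above):

def grad_norm(model):
    # Total L2 norm over all parameter gradients of the model.
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# e.g.: if grad_norm(netD) > 100: print("gradients blowing up")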

11: Don't balance loss via statistics (unless you have a good reason to)

  • Don't try to find a (number of G iterations / number of D iterations) schedule to uncollapse training
  • It's hard and we've all tried it.
  • If you do try it, have a principled approach to it, rather than intuition

For example

while lossD > A:
  train D
while lossG > B:
  train G

12: If you have labels, use them

  • if you have labels available, train the discriminator to also classify the samples: auxiliary classifier GANs (see the sketch below)
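A sketch of a discriminator with an auxiliary classification head, assuming a hypothetical trunk module that maps images to feat_dim features:

import torch
import torch.nn as nn

class AuxDiscriminator(nn.Module):
    def __init__(self, trunk, feat_dim, n_classes):
        super().__init__()
        self.trunk = trunk                               # shared feature extractor
        self.adv_head = nn.Linear(feat_dim, 1)           # real vs. fake
        self.cls_head = nn.Linear(feat_dim, n_classes)   # auxiliary class logits

    def forward(self, x):
        h = self.trunk(x)
        return torch.sigmoid(self.adv_head(h)), self.cls_head(h)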

13: Add noise to inputs, decay over time
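A sketch of decaying noise on the discriminator's inputs (the initial sigma and decay horizon are arbitrary assumptions):

import torch

def add_input_noise(images, step, sigma0=0.1, decay_steps=10000):
    # Gaussian noise on D's inputs, annealed linearly to zero over training.
    sigma = sigma0 * max(0.0, 1.0 - step / decay_steps)
    return images + sigma * torch.randn_like(images)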

14: [notsure] Train discriminator more (sometimes)

  • especially when you have noise
  • hard to find a schedule of number of D iterations vs G iterations

15: [notsure] Batch Discrimination

  • Mixed results

16: Discrete variables in Conditional GANs

  • Use an Embedding layer
  • Add as additional channels to images
  • Keep embedding dimensionality low and upsample to match image channel size (see the sketch below)
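A sketch of conditioning via an embedding broadcast to image-sized channels (all sizes are arbitrary):

import torch
import torch.nn as nn

n_classes, embed_dim, H, W = 10, 16, 64, 64
embed = nn.Embedding(n_classes, embed_dim)         # keep embedding dimensionality low

def condition_images(images, labels):
    e = embed(labels)                              # (B, embed_dim)
    e = e[:, :, None, None].expand(-1, -1, H, W)   # upsample to H x W
    return torch.cat([images, e], dim=1)           # append as extra channels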

17: Use Dropouts in G in both train and test phase
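A sketch: keep the generator's dropout layers in train mode even at inference, so dropout acts as a noise source (netG is an assumed generator module):

import torch.nn as nn

g_block = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),           # kept active at test time too
)

netG.eval()                    # eval mode for the rest of the generator...
for m in netG.modules():
    if isinstance(m, nn.Dropout):
        m.train()              # ...but keep dropout stochastic at test time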

Authors

  • Soumith Chintala
  • Emily Denton
  • Martin Arjovsky
  • Michael Mathieu


ganhacks's Issues

Generator tries to maximize, discriminator tries to minimize

Hi,
I am training a CycleGAN, but the logs (TensorBoard scalars) for both G and D are confusing to me.
As written in the paper (and in all the GAN papers), G aims to minimize the objective against an adversary D that tries to maximize it.

At the beginning of training, G tries to minimize, but after 60 epochs it starts to maximize; likewise, D tries to maximize at the beginning and then starts to minimize after epoch 60. I want to know the intuition behind this. Any ideas?

Shouldn't the model minimize G and maximize D all the time? Why, after some epochs, does the generator start to maximize and the discriminator minimize?

Tensorboard scalar: https://imgur.com/rdTxrUP

Thanks in advance.

Please explain trick 2

https://github.com/soumith/ganhacks#2-a-modified-loss-function

When training the generator, the standard way is to pass [batch_generated_imgs, batch_real_imgs] with np.ones(shape=batch_size * 2), where all images are labelled 1 (real) to trick the discriminator.

If I understand this trick correctly, it is saying to pass [batch_generated_imgs, batch_real_imgs] with np.concatenate(np.ones(shape=batch_size), np.zeros(shape=batch_size)), where the labels are now flipped for fake and real?

Label smoothing one-sided?

A previous issue was closed, but I don't think it should have been.

I'm opening this as I don't know how to request re-opening that one:
#10

There are two opinions:

  • only smooth for D (both the 0s and the 1s)
  • only smooth the 1s, for data coming from the training set and from G

Which one is correct is not clear yet.

There is no source that has empirically demonstrated a positive effect of a replay buffer on GAN training

I understand that trick #8, "Use stability tricks from RL", is based on an actual paper. But other than that paper and this page, no source has suggested experience replay for GANs. The paper only suggests that it is reasonable to try by analogy, and so does this page. I would like to see empirical evidence of the extent to which GAN training benefits from keeping a replay buffer of past generations and occasionally showing them.

How to handle failure mode

What's the best practice for handling a failure mode, i.e. when the discriminator or the generator has 0 loss? Should we just terminate learning and debug the algorithm? Or should we roll back a couple of epochs and continue training from there?

Sampling from a sphere instead of a cube

Trick number 3 in particular (sampling z from a sphere) sounds interesting to me, but I'm unsure how it should be implemented.
Is it just that I need to normalize my z vector to have unit length, such that it lies on the surface of a hypersphere?
If not, is there some special function to sample points from inside a hypersphere in Python?
Is there a repository where these ganhacks are already implemented (for DCGAN)?

Saturation of G_loss and D_loss

Hi,

I am working on the AWA dataset, trying to generate images of animals with a GAN. The generator and discriminator losses have saturated (G_loss ≈ 2.0 ± 0.3 and D_loss ≈ 0.65 ± 0.2). Should I continue to train my GAN for more epochs to improve performance? (I have currently trained for 100 epochs, and the losses have been saturated for over 50 of them.)

Probable things to investigate when Generator falls behind the Discriminator

Hi,

I am unsure whether this is worth creating a new issue or not, so please feel free to let me know if it's not. Actually I am quite new to training GANs and hence was hoping someone with more experience can provide some guidance.

My problem is that my generator error keeps increasing steadily (not spiking suddenly, but gradually) while the discriminator error keeps decreasing. The statistics are below:

Generator error / Discriminator error
0.75959807634354 / 0.59769108891487
1.3820139408112 / 0.35363309383392
1.9390360116959 / 0.2000379934907
2.1018676519394 / 0.16694237589836
2.5574874728918 / 0.10423161834478
2.8098415493965 / 0.082516837120056
3.2860078886151 / 0.046023709699512
3.630749514699 / 0.028832530975342
3.7707495708019 / 0.022863862104714
3.8990840911865 / 0.020417057722807
4.1248006802052 / 0.017872251570225
4.259504699707 / 0.01507920846343
4.2479295730591 / 0.013462643604726
4.4426490783691 / 0.010646429285407
4.6057481756434 / 0.0098107368685305
4.6718273162842 / 0.0096474666148424
4.8214926728979 / 0.0079655896406621
4.7656826004386 / 0.0076067917048931
4.8425741195679 / 0.0080536706373096
4.9743659980595 / 0.0066521260887384

When this is the case, what are some probable things to investigate or look out for while trying to mitigate the problem?

For example, I came across: Make the discriminator much less expressive by using a smaller model. Generation is a much harder task and requires more parameters, so the generator should be significantly bigger. [1]

If someone has similar pointers up their sleeves, it will be very helpful.

Secondly, just out of curiosity, is there a reason that most of the implementations I have come across use the same lr for both the generator and the discriminator? (DC-GAN, pix2pix, Text-to-image.)

I mean, since the generator's job is much harder (generating something plausible from random noise), giving it a higher lr intuitively makes more sense. Or is it simply application-specific, and the same lr just works out for the above-mentioned works?

Thanks in advance !

I can't understand this trick

Hello, Soumith.
The trick: add Gaussian noise to every layer of D, not G.
Today I applied this trick to DCGAN.

The D network

def dis_net(data_array, weights, biases, reuse=False):
    data_array = GaussionNoisy_layers(data_array)

    conv1 = conv2d(data_array, weights['wc1'], biases['bc1'])
    conv1 = lrelu(conv1)
    conv1 = GaussionNoisy_layers(conv1, sigma=0.3)

    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    conv2 = batch_norma(conv2, scope="dis_bn1", reuse=reuse)
    conv2 = lrelu(conv2)
    conv2 = GaussionNoisy_layers(conv2, sigma=0.5)

    conv3 = conv2d(conv2, weights['wc3'], biases['bc3'])
    conv3 = batch_norma(conv3, scope="dis_bn2", reuse=reuse)
    conv3 = lrelu(conv3)
    conv3 = GaussionNoisy_layers(conv3, sigma=0.5)

    conv4 = conv2d(conv3, weights['wc4'], biases['bc4'])
    conv4 = batch_norma(conv4, scope="dis_bn3", reuse=reuse)
    conv4 = lrelu(conv4)
    conv4 = GaussionNoisy_layers(conv4, sigma=0.5)

    out = tf.reshape(conv4, [-1, weights['wd'].get_shape().as_list()[0]])
    out = fully_connect(out, weights['wd'], biases['bd'])

    return tf.nn.sigmoid(out), out

The Gaussian noise layer

def GaussionNoisy_layers(input_layer, sigma=0.1):
    # Sample fresh noise inside the TF graph each step; the original used
    # np.random.normal, which bakes one fixed noise array into the graph
    # at construction time instead of resampling it every forward pass.
    noisy = tf.random_normal(shape=tf.shape(input_layer), mean=0.0, stddev=sigma)
    return noisy + input_layer

But it can't converge.
If I just add noise to one layer of D, the GAN can converge.

Why?
Thanks for your answer!

about generator loss

You said

if loss of generator steadily decreases, then it's fooling D with garbage (says martin)

How should the loss behave? Could you elaborate, please?

Trick 2 explanation

I don't fully understand the phrase in trick 2:

"In practice, works well:
Flip labels when training generator: real = fake, fake = real"

Is this saying that with low probability you should corrupt the labels? (although that seems to be what trick 6 says.)

Have you tried deeper networks?

Hello, Soumith.
As we all know, VGG, Inception, and ResNet are very deep networks.
Have you tried deeper networks to construct GANs? For example, using ResNet in the D network.

Alternating training vs. DCGAN?

I'm a bit confused about the generator and discriminator training. Perhaps it's just semantics, but the DCGAN "starter" code that many published GANs use (and that is promoted here) performs a discriminator update and a generator update for each minibatch. However, you warn against using loss statistics to balance the training of the generator and discriminator, which suggests that you freeze one for a while and only train the other, and vice versa.

So which is it: both are updated each minibatch, or you alternate training one network at a time (for N iterations)?

Are the orders ranked?

Hi guys,

I was wondering if the list of tricks is ordered by priority? If not, could you rank the tricks?

Thanks

Discriminator with accuracy 1, generator fools it perfectly. What's going on?

I'm training a GAN where the discriminator has almost perfect accuracy (generated samples are classified 0, true samples are classified 1), but at the same time the generator is able to produce samples that fool the discriminator perfectly (always classified 1).
I'm guessing this is a vanishing gradient problem, but I cannot figure out how to solve it. None of the usual hacks seem to work.

Does anybody have some suggestions?

multiple Discriminator networks

Thanks for the list! Multiple discriminator networks have helped me out a fair amount. Would this make your list as a [notsure] method? I haven't tried using RL stuff where you keep checkpoints of past Gs and Ds so maybe that is a similar but much better method.

Generator ADAM X Discriminator SGD

It's not an issue but a question. I would like to know why it would be better to train the discriminator with SGD than to train both parts with Adam. I've been trying to improve the results of my GAN, and before testing this I would like to understand why it is better!

Adam in the generator?

There is a little problem with using the Adam optimizer in the generator. In my experience, using an optimizer with momentum may cause instability in training. Actually, I think RMSprop is a better choice.
This is my personal view and experience; discussion is welcome.

About the generalization of these tricks

It's not an issue that I am reporting. I just want to ask to what extent these tricks can be generalized. Can they be applied to any kind of adversarial training?

Feasible loss simplification?

Hello, I wonder if the following simplifications lead in practice to the same result as the original loss functions:

  • Use inverted labels everywhere such that D(x) = 0 and D(G(z)) = 1 in the optimal case.
  • Drop the logarithm and even the "+1" could be dropped since constants do not affect the gradient:

       [equation image]

  • Then one can directly use a gradient descent optimizer to do:

       [equation image]

  • And also the generator loss would look like this and can be minimized for training the generator:

       [equation image]

  • If we assume that the discriminator is good enough and will produce an output close to 1 for D(G(z)) then vanishing gradients should be no problem, correct?

Thank you very much for your help!

Florian

How hard is it to make a GAN experiment reproducible?

Are there any complications with reproducibility when working on GAN-type models?
With TensorFlow, I was getting about a ±2 score difference when applying DANN.

Or is it a problem with the design of the model?
[I want some opinions; this is not an issue with this repository.]

G's error is always much smaller than D's

I am training a generative adversarial network to perform style transfer between two image domains (source and target). Since I have class information available, I have an extra Q network (besides G and D) that measures the classification results of the generated images for the target domain against their labels. Watching the system converge, I have noticed that D's error starts at 8 and slowly drops to 4.5, while the generator's error starts at 1 and quickly drops to 0.2. Is that behaviour an example of mode collapse? What exactly is the relationship between the errors of D and G? The loss functions of D and G I am using can be found here: https://github.com/r0nn13/conditional-dcgan-keras, while the loss function of the Q network is categorical cross-entropy. The loss curves can be found here: https://imgur.com/a/bDrTcpm

Validate a GAN

What is the best way to validate a GAN, since we can't straightforwardly track a single loss? E.g., in the case of SISR, is picking the model with the best PSNR a reasonable choice?

Standard G & D Loss Values & Trends?

Hello, I've been training my first GANs, it's far from easy.

I wondered if there are typical values and trends to look for in the loss? I'm using Wasserstein-GP. The starting value of the D loss depends on the lambda value for the W-GP loss. If lambda is 0, then both losses start at 0; if I increase lambda, then the D loss starts at values higher than 0.

The general trend of the losses in my model: If lambda is greater than 0, then D loss very rapidly drops from the starting value down to 0 and stays there for some time. G loss moves above and below zero with seemingly little pattern.

So I wondered if there are typical values for the various losses and whether we should look for certain trends? And can these be added to the list of hacks? (Section 10: track failures early.)

normal behavior of GAN

I am wondering whether there are any rules to check whether a GAN converges or not if I cannot inspect the generated samples directly. Those tricks are very helpful, but when I look at the loss it is still hard for me to draw a conclusion.

So in my case this is the result:
Epoch 1:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 5.93 | 0.73            | 5.20
generator (test)       | 2.48 | 1.50            | 0.98
discriminator (train)  | 6.22 | 0.59            | 5.63
discriminator (test)   | 4.13 | 0.32            | 3.82

Epoch 2:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 3.06 | 0.88            | 2.18
generator (test)       | 1.66 | 1.51            | 0.14
discriminator (train)  | 4.03 | 0.44            | 3.58
discriminator (test)   | 3.43 | 0.28            | 3.16

Epoch 3:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 2.50 | 0.83            | 1.67
generator (test)       | 1.76 | 1.70            | 0.06
discriminator (train)  | 3.51 | 0.36            | 3.14
discriminator (test)   | 3.20 | 0.20            | 2.99

Epoch 4:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 2.27 | 0.73            | 1.54
generator (test)       | 2.19 | 2.16            | 0.03
discriminator (train)  | 3.27 | 0.29            | 2.98
discriminator (test)   | 3.05 | 0.13            | 2.91

Epoch 5:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 2.06 | 0.62            | 1.44
generator (test)       | 2.68 | 2.66            | 0.02
discriminator (train)  | 3.11 | 0.21            | 2.89
discriminator (test)   | 2.96 | 0.09            | 2.87

The losses of the generator and discriminator are both decreasing, but the loss on the CV set is increasing. I already add noise to the discriminator's inputs and use dropout in the generator (but I think it is not applied on the CV set).

Label smoothing

I believe label smoothing should not be performed for fake samples, as currently suggested under the Use Soft and Noisy Labels section in this repo. In the referenced paper by Salimans et al. (2016), they mention that they smooth only the positive labels, leaving negative labels set to 0.

Goodfellow later explained in the NIPS 2016 tutorial why label smoothing is done only for real samples:

It is important to not smooth the labels for the fake samples. Suppose we use a target of 1 − α for the real data and a target of 0 + β for the fake samples. When β is zero, then smoothing by α does nothing but scale down the optimal value of the discriminator. When β is nonzero, the shape of the optimal discriminator function changes. The discriminator will thus reinforce incorrect behavior in the generator; the generator will be trained either to produce samples that resemble the data or to produce samples that resemble the samples it already makes.

Label smoothing should be one-sided?

Regarding trick 6, label smoothing should be one-sided, for real images only (Salimans et al., 2016). Their rationale makes sense. Did you find evidence to the contrary?

Update tips for WGANs

Wasserstein GANs are supposed to fix some of the challenges of GANs.

  • No more vanishing gradients!
  • No more mode collapse!
  • Loss function is more meaningful.

Some of the tricks here seem incompatible with, or no longer relevant to, WGANs. Does anyone have experience with WGANs and can comment on which tricks are applicable and which are not?

Thanks for a great repo!

What do you mean by trick 4?

Thanks for your very insightful tricks for training the GAN.

But I have trouble understanding the first trick in 4 (construct different mini-batches for real and fake, i.e. each mini-batch needs to contain only all real images or all generated images).

Do you suggest that instead of training D with 1:1 positive and negative examples in each mini-batch, as done in DCGAN (https://github.com/carpedm20/DCGAN-tensorflow), we should train D with only positive or only negative examples in each mini-batch? Why should this work?

For GAN, why does my D loss increase and my G loss decrease to 0 at the beginning?

The generated pictures are noise.
step: 4650, G_loss_adv: 0.325, G_accuracy: 0.984,
D_loss_adv: 0.982, d_loss_pos: 0.598, d_loss_neg: 1.366,
D_accuracy: 0.258, d_pos_acc: 0.500, d_neg_acc: 0.016
My G_loss is less than my D_loss, and the generated samples score significantly higher than the real pictures, so D is completely abnormal (normally D_loss is small and D can distinguish real from fake, right?). My D structure uses four convs + a fully connected layer. I do not know where the mistake is.

G loss increases, what does this mean?

Hi, I am training a conditional GAN.
At the beginning, both the G and D losses decrease, but around epoch 200 the G loss starts to increase from 1 to 3, and the image quality seems to stop improving.

Any ideas? Thank you in advance.

Another trick that may be helpful. Could you test this?

In the pix2pix TensorFlow code, I have introduced a shared layer between the generator and discriminator.
The shared layer is placed before the last layer of the discriminator.
The training progress shows that the loss decreases more stably than in the original.
I have not figured out why yet; I got the inspiration from taiji.

Trick 11: How to schedule thresholds A and B?

I want to use trick 11, because I have a collapsed GAN problem: the discriminator is too strong (accuracy = 1, loss = 0). I want to use the trick:

while lossD > A:
  train D
while lossG > B:
  train G

But how should I schedule A and B at each step? Do you have a principled approach?
For example, can I choose (A_{n+1}, B_{n+1}) = 0.5 * (A_n, B_n)?
What do you recommend?

Do you have other suggestions? New references on this topic?

Regarding sampling images in batches

I have a question regarding sampling images in batches for both training and testing.

Is it better for the batches to be randomly sampled, or is simple slicing (e.g. dataset[:batch_size-1]) fine?

how does trick 6 change ideal output activation/loss

Regarding Trick 6:

What activation do you recommend for the discriminator's real_or_generated output layer? I'm using a sigmoid activation function, but since the target can be as large as 1.2, I'm wondering if something like leaky ReLU would be better, since the sigmoid's range is [0, 1.0]. And would you still recommend using binary_crossentropy as the loss for this output, or something else now that we're using soft labels?
