soumith / ganhacks Goto Github PK

starter from "How to Train a GAN?" at NIPS2016

ganhacks's Introduction

(this list is no longer maintained, and I am not sure how relevant it is in 2020)

How to Train a GAN? Tips and tricks to make GANs work

While research in Generative Adversarial Networks (GANs) continues to improve the fundamental stability of these models, we use a bunch of tricks to train them and make them stable day to day.

Here are a summary of some of the tricks.

Here's a link to the authors of this document

If you find a trick that is particularly useful in practice, please open a Pull Request to add it to the document. If we find it to be reasonable and verified, we will merge it in.

1. Normalize the inputs

normalize the images between -1 and 1
Tanh as the last layer of the generator output

2: A modified loss function

In GAN papers, the loss function to optimize G is min (log 1-D), but in practice folks practically use max log D

because the first formulation has vanishing gradients early on
Goodfellow et. al (2014)

In practice, works well:

Flip labels when training generator: real = fake, fake = real

3: Use a spherical Z

Dont sample from a Uniform distribution

Sample from a gaussian distribution

When doing interpolations, do the interpolation via a great circle, rather than a straight line from point A to point B
Tom White's Sampling Generative Networks ref code https://github.com/dribnet/plat has more details

4: BatchNorm

Construct different mini-batches for real and fake, i.e. each mini-batch needs to contain only all real images or all generated images.
when batchnorm is not an option use instance normalization (for each sample, subtract mean and divide by standard deviation).

5: Avoid Sparse Gradients: ReLU, MaxPool

the stability of the GAN game suffers if you have sparse gradients
LeakyReLU = good (in both G and D)
For Downsampling, use: Average Pooling, Conv2d + stride
For Upsampling, use: PixelShuffle, ConvTranspose2d + stride
- PixelShuffle: https://arxiv.org/abs/1609.05158

6: Use Soft and Noisy Labels

Label Smoothing, i.e. if you have two target labels: Real=1 and Fake=0, then for each incoming sample, if it is real, then replace the label with a random number between 0.7 and 1.2, and if it is a fake sample, replace it with 0.0 and 0.3 (for example).
- Salimans et. al. 2016
make the labels the noisy for the discriminator: occasionally flip the labels when training the discriminator

7: DCGAN / Hybrid Models

Use DCGAN when you can. It works!
if you cant use DCGANs and no model is stable, use a hybrid model : KL + GAN or VAE + GAN

8: Use stability tricks from RL

Experience Replay
- Keep a replay buffer of past generations and occassionally show them
- Keep checkpoints from the past of G and D and occassionaly swap them out for a few iterations
All stability tricks that work for deep deterministic policy gradients
See Pfau & Vinyals (2016)

9: Use the ADAM Optimizer

optim.Adam rules!
- See Radford et. al. 2015
Use SGD for discriminator and ADAM for generator

10: Track failures early

D loss goes to 0: failure mode
check norms of gradients: if they are over 100 things are screwing up
when things are working, D loss has low variance and goes down over time vs having huge variance and spiking
if loss of generator steadily decreases, then it's fooling D with garbage (says martin)

11: Dont balance loss via statistics (unless you have a good reason to)

Dont try to find a (number of G / number of D) schedule to uncollapse training
It's hard and we've all tried it.
If you do try it, have a principled approach to it, rather than intuition

For example

while lossD > A:
  train D
while lossG > B:
  train G

12: If you have labels, use them

if you have labels available, training the discriminator to also classify the samples: auxillary GANs

13: Add noise to inputs, decay over time

Add some artificial noise to inputs to D (Arjovsky et. al., Huszar, 2016)
- http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/
- https://openreview.net/forum?id=Hk4_qw5xe
adding gaussian noise to every layer of generator (Zhao et. al. EBGAN)
- Improved GANs: OpenAI code also has it (commented out)

14: [notsure] Train discriminator more (sometimes)

especially when you have noise
hard to find a schedule of number of D iterations vs G iterations

15: [notsure] Batch Discrimination

Mixed results

16: Discrete variables in Conditional GANs

Use an Embedding layer
Add as additional channels to images
Keep embedding dimensionality low and upsample to match image channel size

17: Use Dropouts in G in both train and test phase

Provide noise in the form of dropout (50%).
Apply on several layers of our generator at both training and test time
https://arxiv.org/pdf/1611.07004v1.pdf

Authors

Soumith Chintala
Emily Denton
Martin Arjovsky
Michael Mathieu

ganhacks's People

Contributors

Stargazers

Watchers

Forkers

donghyunlee peratham junhocho xypan1232 ml-lab ducha-aiki chao1224 spk921 alexander-rakhlin pmadhyastha wanjinchang jdc08161063 dragon9001 chetkhatri dromanov chagge nieshaoshuai fredhuang16 effectiveai vyraun coderx7 multipath allensmile digideskio weiningzhang kyocen ieswxia gwnudt benjamesbabala zdx3578 codeaudit hasnainv leonbai johnrocamora omgitshongyu alinaag richardyang40148 leezqcst albertxiebnu miguelbautista dmartinalbo schangpi soledad89 pavelgonchar raminia misc-git-forks kokirijedi anandsrao yydxlv mave5 zshanwei ganji15 hedgefair mmmika rockt ychernushenko dshaun xiximeng brade31919 zhangyuancv oztc johnsonc silasxue shabanian kvananth mouatez andandandand algoskynet lianboxu vseledkin stevenstoner poivrenoir alakia ranniee farahanams tryerrorman loliverhennigh projectafey styanddty liuf1989 xflee junmyung soonminhwang fodrh1201 qinmanjun fuzhaohai stevenygd aerophile jungel2star qingsong99 lixuejian zbessinger dwang6257 drjzhou jithsjoy obinsc nidetaoge jacklone brettll jerrygood0703

ganhacks's Issues

Generator try to maximize, Discriminator try to minimize

Hi,
I am training a CycleGAN but the log file(Tensorbaord scalar) for both G and D is confusing to me.
As it written in the paper (Also, all the GAN papers), G aims to minimize the objective against an adversary D that tries to maximize it.

At the beginning of training, G is tried to minimize but after 60 epochs start to maximize also D is tried to maximize at the beginning and then start to minimize after epoch 60. I want to know the intuition behind of it. any idea?

should not model to minimize G and maximize D all the time, why after some epochs generators start to maximize and discriminator minimize?

Tensorboard scalar: https://imgur.com/rdTxrUP

Thanks in advance.

Any tricks for text embedding?

Please explain trick 2

https://github.com/soumith/ganhacks#2-a-modified-loss-function

When training generator standard way is to pass [batch_generated_imgs, batch_real_imgs] and np.ones(shape=batch_size * 2) where all images are labelled 1 (real) to trick discriminator.

If I understand this trick correctly it is saying to pass [batch_generated_imgs, batch_real_imgs] and np.concat(np.ones(shape=batch_size), np.zeroes(shape=batch_size)) where labels are now flipped for fake and real?

label smoothing one-sided?

a previous issue was closed but I don't think it should be.

I'm opening this as I don't know how to request re-opening that one:
#10

there are two opinions:

only smooth D (0s and 1s)
only smooth 1s for data sources coming from the training data and G

which one is not clear yet

There is no source that has empirically demonstrated a positive effect of replay buffer on GAN training

I understand that the trick #8 "Use stability tricks from RL" is based on an actual paper. But other than the paper and this webpage, there's no source that has suggested experience replay for GAN. In the paper, it was only suggested it was reasonable to try it from the analogy, and so is it in this page. I would like to see an empirical evidence of the extent to which the GAN training benefits from keeping a replay buffer of past generations and occasionally showing them.

How to handle failure mode

What's the best practice for handling failure mode, i.e. when the discriminator or the generator has 0 loss. Should we just terminate learning and debug the algorithm? Or should we rollback a couple of epochs back and continue training from there?

Sampling from a sphere instead of a cube

Especially the trick number 3 (sampling z from a sphere) sounds interesting to me, but I'm unsure how it should be implemented.
Is it just that I need to normalize my z vector to have unit length such that it lies on the surface of a hyper sphere?
If not is there some special function to sample points from inside a hypersphere in python?
Is there a repository where these ganhacks are already implemented (for dcgan)?

Saturation of G_loss and D_loss

Hi,

I am working on AWA dataset, trying to generate images of animals with the help of GAN. The generator and discriminator loss becomes saturated( G_loss ~ 2.0+- 0.3) and D_loss ~ (0.65 +- 0.2). Should I continue to train my GAN for more epochs(currently I have trained for 100 epochs and it has been saturated for over 50 epochs) to improve performance?

Probable things to investigate when Generator falls behind the Discriminator

Hi,

I am unsure whether this is worth creating a new issue or not, so please feel free to let me know if it's not. Actually I am quite new to training GANs and hence was hoping someone with more experience can provide some guidance.

My problem is, my Generator error keeps on increasing steadily (not spiking suddenly, but gradually) and the Discriminator error keeps on reducing simultaneously. Below I provide the statistics :

Generator error / Discriminator error
0.75959807634354 / 0.59769108891487
1.3820139408112 / 0.35363309383392
1.9390360116959 / 0.2000379934907
2.1018676519394 / 0.16694237589836
2.5574874728918 / 0.10423161834478
2.8098415493965 / 0.082516837120056
3.2860078886151 / 0.046023709699512
3.630749514699 / 0.028832530975342
3.7707495708019 / 0.022863862104714
3.8990840911865 / 0.020417057722807
4.1248006802052 / 0.017872251570225
4.259504699707 / 0.01507920846343
4.2479295730591 / 0.013462643604726
4.4426490783691 / 0.010646429285407
4.6057481756434 / 0.0098107368685305
4.6718273162842 / 0.0096474666148424
4.8214926728979 / 0.0079655896406621
4.7656826004386 / 0.0076067917048931
4.8425741195679 / 0.0080536706373096
4.9743659980595 / 0.0066521260887384

When this is the case, what are some of the probable things to investigate or I should look out for while trying to mitigate the problem ?

For example, I came across: Make the discriminator much less expressive by using a smaller model. Generation is a much harder task and requires more parameters, so the generator should be significantly bigger. [1]

If someone has similar pointers up their sleeves, it will be very helpful.

Secondly, just out of curiosity, is there a reason that most of the implementations that I have come across, uses the same lr for both the generator and discriminator ? (DC-GAN, pix2pix, Text-to-image).

I mean, since the generator's job is much harder (generating something plausible from random noise), in hindsight, giving a higher lr to it makes more sense. Or is it just simply application specific and the same lr just works out for the above mentioned works ?

Thanks in advance !

I can't understand this trick

Hello , Soumith
The trick ----- Add gaussion noisy to every layer of D not G.
Today , I use this trick to DCGAN

the D network

def dis_net(data_array , weights , biases ,reuse=False):
    data_array = GaussionNoisy_layers(data_array)

    conv1 = conv2d(data_array , weights['wc1'] , biases['bc1'])

    conv1 = lrelu(conv1)
    conv1 = GaussionNoisy_layers(conv1 , sigma=0.3)

    conv2 = conv2d(conv1 , weights['wc2']  , biases['bc2'])
    conv2 = batch_norma(conv2 , scope="dis_bn1" , reuse=reuse)
    conv2 = lrelu(conv2)
    conv2 = GaussionNoisy_layers(conv2 , sigma=0.5)

    conv3 = conv2d(conv2 , weights['wc3'] , biases['bc3'])
    conv3 = batch_norma(conv3 , scope="dis_bn2" , reuse=reuse)
    conv3 = lrelu(conv3)
    conv3 = GaussionNoisy_layers(conv3 , sigma=0.5)

    conv4 = conv2d(conv3 , weights['wc4'] , biases['bc4'])
    conv4 = batch_norma(conv4 , scope="dis_bn3" , reuse=reuse)
    conv4 = lrelu(conv4)
    conv4 = GaussionNoisy_layers(conv4 , sigma=0.5)

    out = tf.reshape(conv4 , [-1 , weights['wd'].get_shape().as_list()[0]])
    out = fully_connect(out ,weights['wd'] , biases['bd'])

    return tf.nn.sigmoid(out) , out

the Gaussion noisy

def GaussionNoisy_layers(input_layer , sigma = 0.1):

    noisy = np.random.normal(0.0 , sigma , tf.to_int64(input_layer).get_shape())
    return noisy + input_layer

but it can't converge.
if i just add noisy to one layer of D
Gan can converge.

Why ?
Thanks for your answer!

Comments or Intuition why these tricks work?

Hi,
Thanks for the interesting collection. Is it possible to add a comment to each trick on why it actually helps training?
Thanks!

about generator loss

You said

if loss of generator steadily decreases, then it's fooling D with garbage (says martin)

How should be the loss ? Could you elaborate please ?

Trick 2 explanation

I don't fully understand the phrase in trick 2:

"In practice, works well:
Flip labels when training generator: real = fake, fake = real"

Is this saying that with low probability you should corrupt the labels? (although that seems to be what trick 6 says.)

Have you try more deeper networks?

Hello , Soumith
As we all know , vgg , inception , resnet , this networks are very deep
And Have you tried deeper networks to contruct the gans?
For example Using reset in D networks.

Why Discriminator Loss 0 is a failure mode ?

In 10. you say Discriminator loss 0 is failure mode , but in the paper they say that,

What I'm getting wrong here ?

Thanks,

Why are the biases set to zero in the training of GAN?

I found that in some implementations of GAN, the biases of convolution layers in both Generator and Discriminator are set to zero before updating. I am wondering if you can tell the reason why this is happening. Thanks a lot.

refer to:

https://github.com/reedscot/icml2016 -> main_cls_int.lua -> line 310-311 and line 395-396.
https://github.com/phillipi/pix2pix -> train.lua -> line 232-233 and line 262-263

Alternating training vs. DCGAN?

I'm a bit confused about the generator and discriminator training. Perhaps it's just the semantics, but the DCGAN "starter" code that many published GANs often use (and is promoted here), performs a discriminator update and a generator update for each minibatch. However, you talk about (warn against) using losses to balance statistics in training the generator and discriminator. This suggests that you freeze one for a while and only train the other, and vice versa.

So which is it: both are updated each minibatch, or you alternate training one network at a time (for N iterations)?

Are the orders ranked?

Hi guys,

I was wondering if the list of tricks are ordered in priorities? If not, could you rank the tricks?

Thanks

Using generator and discriminator loss to evaluate GANs.

Hey!

I was studying upon evaluation metric for GANs.
Was a little confused as to why the loss of the generator and discriminator are not good measures to evaluate the performance of GANs.

Thanks

How would one go about training the network with one channel images?

Let's say I have normal RGB images and would like to only take the Y component of the YUV color space, how would one go about doing that and setting up tfrecords?

Discriminator with accuracy 1, generator fools it perfectly. What's going on?

I'm training a GAN where the discriminator has almost perfect accuracy (generated samples are classified 0, true samples are classified 1), but at the same time the generator is able to produce samples that fool the discriminator perfectly (always classified 1).
I'm guessing this is a vanishing gradient problem, but I cannot figure out how to solve it. All usual hacks don't seem to work.

Does anybody have some suggestions?

multiple Discriminator networks

Thanks for the list! Multiple discriminator networks have helped me out a fair amount. Would this make your list as a [notsure] method? I haven't tried using RL stuff where you keep checkpoints of past Gs and Ds so maybe that is a similar but much better method.

Generator ADAM X Discriminator SGD

It`s not an issue but a question. I would like to know why it would be better train the Discriminator with SGD than train both parts with ADAM. I've been trying to improve the results of my GAN and before test this I would like to know why is it better to understand this!

Adam in the generator?

There is a little problem with the using of the adam optimizer in generator. From my experience, the using of optimizer with momentum may cause instability in training. Actually, I think the RMSprop is a better choice.
This is my personal view and experience, discussion is welcomed.

About the generalization of these tricks

Its not an issue that I am reporting. But I just want to ask that up to what extent these tricks can be generalized? Can these be applied to any kind of adversarial training?

Feasible loss simplification?

Hello, I wonder if the following simplifications lead in practice to the same result as the original loss functions:

Use inverted labels everywhere such that D(x) = 0 and D(G(z)) = 1 in the optimal case.
Drop the logarithm and even the "+1" could be dropped since constants do not affect the gradient:

Then one can directly use a gradient descent optimizer to do:

And also the generator loss would look like this and can be minimized for training the generator:

If we assume that the discriminator is good enough and will produce an output close to 1 for D(G(z)) then vanishing gradients should be no problem, correct?

Thank you very much for your help!

Florian

How hard it is to make a GAN experiment reproducible?

Is there any complication of reproducibility while working on GAN type model.
For tensorflow, I was getting about (+-2) score difference while applying DANN.

Or it's a problem regarding the design of the model.
[Want some opinion. Not an issue of this repository.]

G is always really smaller than D

I am training a generative adversarial network to perform style transfer from two different image domains (source and target). Since I have available class information i have an extra Q network (except G and D) that measures the classification results fo the generated images for the target domain and their labels. From the convergence of the system I have noticed that D is starting from 8 (the error of the network) and slightly drops until 4.5 and the generator error is starting from 1 and quickly drops to 0.2. Is that behaviour an example of mode-collapse? What is exactly the relationship between the errors of D and G? The loss function of D and G I am using can be found here https://github.com/r0nn13/conditional-dcgan-keras while the loss function of Q network is categorical cross-entropy. The loss functions can be found here: https://imgur.com/a/bDrTcpm

Validate a GAN

What is the best way to validate a GAN, since we can't straightforward track one loss. E.g. in case of SISR, is the choice of best PSNR a reasonable act?

Standard G & D Loss Values & Trends?

Hello, I've been training my first GANs, it's far from easy.

I wondered if there are typical values and trends to look for in the loss? I'm using Wasserstain-GP. The starting value of the D loss is dependent on the lambda value for the W-GP loss. If lambda is 0 then both losses start at 0; if I increase lambda then D loss starts at values higher than 0.

The general trend of the losses in my model: If lambda is greater than 0, then D loss very rapidly drops from the starting value down to 0 and stays there for some time. G loss moves above and below zero with seemingly little pattern.

So I wondered if there are typical values for the various losses and whether we should look for certain trends? And can these be added into the list of hacks? (section 10: track failures early).

measuring quality

What are you guys thinking about this?
http://markus-liedl.com/2018-01-13-measure-generative-adversarial-networks-objectively.html
I'm restarting a measure discriminator to count how many minibatches it needs till convergence. My assumption is: The more similar the real and fake data are the longer it takes. It's a very simple way of measuring progress.

normal behavior of GAN

I am wondering is there any rules to check whether GAN converges or not if I cannot check the generators directly. Those tricks are very helpful but when I face the loss it is still hard for me to make conclusion.

So in my case this is the result:
Epoch 1:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 5.93 | 0.73            | 5.20
generator (test)       | 2.48 | 1.50            | 0.98
discriminator (train)  | 6.22 | 0.59            | 5.63
discriminator (test)   | 4.13 | 0.32            | 3.82

Epoch 2:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 3.06 | 0.88            | 2.18
generator (test)       | 1.66 | 1.51            | 0.14
discriminator (train)  | 4.03 | 0.44            | 3.58
discriminator (test)   | 3.43 | 0.28            | 3.16

Epoch 3:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 2.50 | 0.83            | 1.67
generator (test)       | 1.76 | 1.70            | 0.06
discriminator (train)  | 3.51 | 0.36            | 3.14
discriminator (test)   | 3.20 | 0.20            | 2.99

Epoch 4:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 2.27 | 0.73            | 1.54
generator (test)       | 2.19 | 2.16            | 0.03
discriminator (train)  | 3.27 | 0.29            | 2.98
discriminator (test)   | 3.05 | 0.13            | 2.91

Epoch 5:

component              | loss | generation_loss | auxiliary_loss
-----------------------------------------------------------------
generator (train)      | 2.06 | 0.62            | 1.44
generator (test)       | 2.68 | 2.66            | 0.02
discriminator (train)  | 3.11 | 0.21            | 2.89
discriminator (test)   | 2.96 | 0.09            | 2.87

Loss of generator and discriminator are both decreasing but the loss on cv set is increasing. I already insert noise to the discriminator and used dropout for generator(but I think it is not used for cv).

Label smoothing

I believe the label smoothing is not performed for fake samples as mentioned under Use Soft and Noisy Labels section in this repo. In the referenced paper by Salimans et. al. 2016 they mention that they smooth only the positive labels, leaving negative labels set to 0.

It was later explained by Goodfellow at NIPS 2016 Tutorial why label smoothing is done only for real samples:

It is important to not smooth the labels for the fake samples. Suppose we use a target of 1 − α for the real data and a target of 0 + β for the fake samples. When β is zero, then smoothing by α does nothing but scale down the optimal value of the discriminator. When β is nonzero, the shape of the optimal discriminator function changes. The discriminator will thus reinforce incorrect behavior in the generator; the generator will be trained either to produce samples that resemble the data or to produce samples that resemble the samples it already makes.

When to use labels (12: If you have labels, use them)

Can you explain what benefits you've seen from using labels, how they were used and in which context and application?

Label smoothing should be one-sided?

Regarding trick 6, label smoothing should be one-sided, real images only (Salimans et. al. 2016). Their rationale makes sense. Did you find evidence to the contrary?

Update tips for WGANs

Wasserstein GANs are supposed to fix some of the challenges of GANs.

No more vanishing gradients!
No more mode collapse!
Loss function is more meaningful.

Some of the tricks in here seem not compatible or no longer relevant with WGANs. Anyone have experience with WGAN and can comment on which tricks are applicable and which one are not?

Thanks for a great repo!

What do you mean by trick 4?

Thanks for your very insightful tricks for training the GAN.

But I have problem understanding the first trick in 4 (Construct different mini-batches for real and fake, i.e. each mini-batch needs to contain only all real images or all generated images.)

Do you suggest that instead of training D with 1:1 pos&neg examples in each mini-batch as done in DCGAN (https://github.com/carpedm20/DCGAN-tensorflow), we should train the D with only pos or neg examples in each mini-batch? Why should this work?

for GAN ，why my D loss is increse，and G loss decrease to 0 at the begining

The generated picture is noise.
step: 4650,G_loss_adv: 0.325, G_accuracy: 0.984,
D_loss_adv: 0.982, d_loss_pos: 0.598, d_loss_neg: 1.366,
D_accuracy: 0.258, d_pos_acc: 0.500, d_neg_acc: 0.016
my G_loss less than D_loss and generated samples score significantly higher than the real picture, D is completely abnormal (normal D_loss is small, D can distinguish true and false it?), my D structure using four convs + fullly connected, I do not know Why do you make a mistake?

Discriminator accuracy: 0.5 on average, but collapsed into 0 for negative samples and 1 for positive samples

I am monitoring the discriminator accuracy on separate batches for positive and negative samples as suggested in trick 4, but it often occurs the following situation:

Average accuracy = 0.5 (average loss 0.69)
Acc on negative samples = 0.0 (loss 0.69)
Acc on positive samples = 1.0 (loss 0.7)

What could be the cause of this problem?

Cheers,
Daniele

G loss increase, what is this mean?

Hi, I am training a conditional GAN.
At the beginning, both G and D loss decrease, but around 200 epoch, G loss start to increase from 1 to 3, and the image quality seems to stop improve.

Any ideas? Thank you in advance.

Regarding 13: Add noise to inputs...

Improved GANs: OpenAI code also has it but in D, the supervised model. Not G as Zhao et. al. in the EBGAN.

Reference:
https://github.com/openai/improved-gan/blob/master/mnist_svhn_cifar10/train_mnist_feature_matching.py

another trick maybe helpful, Could you test on this?

In the pix2pix tensorflow code, I have introduce a shared layer between generator and discriminator.
the shared layer is before the last layer in discriminator.
The training progress shows that the trend of loss decreasing are stabler than the original.
I have not figure out why yet, and I got the inspiration from taiji.

Trick 11: How to schedule thresholds A and B?

I want to use trick 11, because I have a collapsed GAN problem: discriminator is too strong (accuracy=1, loss =0). I want to use the trick:

while lossD > A:
train D
while lossG > B:
train G

But how to schedule A and B at each step? Do you have a principled approach?
For example, can I choose (A_{n+1}, B_{n+1})= 0.5 * (A_n, B_n) ?
What do you recommend?

Do you have other suggestions? New references on this topic?

Replace BatchNorm with WeightNorm

According to On the Effects of Batch and Weight Normalization in Generative Adversarial Networks, WeightNorm is generally superior to BatchNorm in the context of GANs.

Why "Construct different mini-batches for real and fake, i.e. each mini-batch needs to contain only all real images or all generated images"?

What is motivation behind "Construct different mini-batches for real and fake, i.e. each mini-batch needs to contain only all real images or all generated images" can you eleborate on that?

GAN - Multiple iterations for generator training

Regarding sampling image in batches

I have question regarding sampling image in batches for both training/testing

Is it better for it to be randomly sampled or just simple slicing (eg. dataset[:batch_size-1])?

not working in trick 13: adding gaussian noise to every layer of generator

when I add normal noise to each layers of G, the output becomes total garbege even after thousands of epochs. I am wondering if I am doing this in the wrong way by adding noise from tf.random_normal(tensor_shape, mean=0, var=1).

Question: Any interest in putting together a premium version for sale on SugarKubes?

Hi @soumith , congrats on the traction! Would you be interested in putting together a premium version of this repo to sell on SugarKubes? It's a code and container marketplace I'm putting together.

how does trick 6 change ideal output activation/loss

Regarding Trick 6:

What activation do you recommend for the discriminator's real_or_generated output layer? 'm using a sigmoid activation function but since the output can be as large as 1.2 I'm wondering if something like leaky relu would be better since sigmoid activation fcn is [0, 1.0]. And would you still recommend using binary_crossentropy as loss for this output or something else now that using soft labels?