
deep-vector-quantization's Introduction

deep vector quantization

Implements training code for VQVAEs, i.e. autoencoders with categorical latent variable bottlenecks, which are then easy to plug into existing infrastructure for modeling sequences of discrete variables (GPT and friends). dvq/vqvae.py is the entry point of the training script; a small training run can be launched e.g. as:

cd dvq; python vqvae.py --gpus 1 --data_dir /somewhere/to/store/cifar10

This will reproduce the original DeepMind VQVAE paper (see references below) using a semi-small network on CIFAR-10. Work on this repo is ongoing and for now requires reading the code and understanding these approaches. Next up is reproducing the DALL-E result; most of the code for this is in place, but we still need to train with the logit-Laplace distribution, tune the Gumbel-Softmax hyperparameters, and train on ImageNet+.
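
For orientation, here is a minimal sketch of the logit-Laplace reconstruction term described in the DALL-E paper, which is the piece not yet used here. The epsilon remapping constant and tensor shapes below are illustrative assumptions, not values taken from this repo:

import torch

def logit_laplace_nll(x, mu, log_b, eps=0.1):
    # x:     target pixels in [0, 1], shape (B, C, H, W)
    # mu:    predicted location parameter in logit space
    # log_b: predicted log of the Laplace scale parameter
    # Map pixels into (eps, 1 - eps) so logit(x) stays finite, as in the DALL-E paper.
    x = (1 - 2 * eps) * x + eps
    logit_x = torch.log(x) - torch.log(1 - x)
    b = log_b.exp()
    # negative log-likelihood of the logit-Laplace density
    # p(x | mu, b) = exp(-|logit(x) - mu| / b) / (2 b x (1 - x))
    nll = torch.log(2 * b) + (logit_x - mu).abs() / b + torch.log(x * (1 - x))
    return nll.mean()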

References

DeepMind's VQVAE

The VQVAE from the paper can be trained with --vq_flavor vqvae --enc_dec_flavor deepmind. I am able to get what I believe are the expected results on CIFAR-10 using VQVAE (judging by the reconstruction loss achieved). However, I had to resort to a data-driven initialization scheme for the codebook using k-means (which, in the current implementation, is not multi-GPU compatible), something the sonnet repo does not use, potentially due to more careful model initialization treatment. Without the data-driven init, training exhibits catastrophic index collapse.
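
For illustration, a minimal sketch of this kind of data-driven codebook initialization, mirroring the kmeans2-based approach used in dvq; the function name, variable names, and sample size here are assumptions:

import torch
from scipy.cluster.vq import kmeans2

def init_codebook_from_data(embed, flat_encodings, n_embed, n_samples=20000):
    # flat_encodings: (N, embedding_dim) encoder outputs flattened over batch and space
    rp = torch.randperm(flat_encodings.size(0))[:n_samples]
    centroids, _ = kmeans2(flat_encodings[rp].detach().cpu().numpy(), n_embed, minit='points')
    # overwrite the randomly initialized codebook with the k-means centroids
    embed.weight.data.copy_(torch.from_numpy(centroids))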

Jang et al. Gumbel Softmax

For this use --vq_flavor gumbel. It trains and converges to a slightly higher reconstruction loss, but tuning the scale of the KL divergence loss, the temperature decay rate, and the version of Gumbel-Softmax (soft/hard) has so far proved a bit finicky. The whole thing also trains much more slowly, and requires a more thorough hyperparameter search than a few one-off guesses.
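
A minimal sketch of the Gumbel-Softmax quantization step with an exponentially decayed temperature; the decay schedule and hyperparameter values below are illustrative assumptions, not the repo's defaults:

import torch.nn.functional as F
from torch import nn, einsum

class GumbelQuantizeSketch(nn.Module):
    def __init__(self, num_hiddens, n_embed, embedding_dim):
        super().__init__()
        self.proj = nn.Conv2d(num_hiddens, n_embed, 1)   # logits over codebook entries
        self.embed = nn.Embedding(n_embed, embedding_dim)

    def forward(self, z, step, hard=False):
        # exponentially decayed temperature, clamped at a floor (illustrative schedule)
        tau = max(0.5, 1.0 * 0.99996 ** step)
        logits = self.proj(z)                             # (B, n_embed, H, W)
        soft_one_hot = F.gumbel_softmax(logits, tau=tau, dim=1, hard=hard)
        # expected embedding under the (relaxed) categorical distribution
        z_q = einsum('b n h w, n d -> b d h w', soft_one_hot, self.embed.weight)
        return z_q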

OpenAI's DALL-E

The re-implementation is not yet complete: e.g. MSE is still used as the reconstruction loss, we still only train on CIFAR-10 with a smaller network, etc. However, the different encoder/decoder architecture trains and gives results comparable to the (simpler) DeepMind version on untuned 1-GPU trial runs on stride /4 VQVAEs. Situation is developing...

deep-vector-quantization's People

Contributors

karpathy


deep-vector-quantization's Issues

Related questions about the gumbel softmax

Hey Andrej,

I've been following the discussion you had with Phil regarding the VQ-VAE code.

Do you happen to know what the whole Gumbel parameterization trick brings to the table that could not also be achieved with the original VQ-VAE? I find it a bit hard to understand intuitively what

z_q = einsum('b n h w, n d -> b d h w', soft_one_hot, self.embed.weight)

i.e. the contraction over the codebook dimension with the embedding rows, actually does. And how does it relate to the traditional approach, in which one finds the closest embedding vector for every feature vector?
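
For intuition, a small sketch contrasting the two paths; the tensor shapes and names here are illustrative assumptions:

import torch
from torch import einsum

B, n_embed, d, H, W = 2, 512, 64, 8, 8
soft_one_hot = torch.softmax(torch.randn(B, n_embed, H, W), dim=1)  # relaxed code assignment
embed_weight = torch.randn(n_embed, d)                              # codebook rows

# Gumbel/soft path: z_q is the expected embedding under the relaxed categorical,
# i.e. a convex combination of codebook rows weighted by soft_one_hot.
z_q_soft = einsum('b n h w, n d -> b d h w', soft_one_hot, embed_weight)

# Traditional VQ-VAE path: pick a single codebook row per spatial position
# (here the most likely code; in VQ-VAE it is the nearest row to the encoder output).
ind = soft_one_hot.argmax(dim=1)                  # (B, H, W) hard code indices
z_q_hard = embed_weight[ind].permute(0, 3, 1, 2)  # (B, d, H, W)

# As soft_one_hot approaches a one-hot distribution (low temperature),
# the soft contraction converges to the hard lookup.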

Also, I most of the time see transposed convolutions in the decoder part. Would convolutions + upsampling or pixel-shuffle methods improve the results? Or is it intentional that the encoder/decoder in most VQ-VAE implementations remains relatively simple? (See the sketch below for the alternatives I mean.)
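
For reference, a minimal sketch of the three upsampling choices being asked about; the channel counts are placeholders:

import torch.nn as nn

# transposed convolution, as commonly seen in VQ-VAE decoders
up_transposed = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)

# nearest-neighbor upsampling followed by a regular convolution,
# a common alternative often used to avoid checkerboard artifacts
up_resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),
    nn.Conv2d(128, 64, kernel_size=3, padding=1),
)

# pixel shuffle: a conv expands channels, then sub-pixel rearrangement upsamples
up_pixel_shuffle = nn.Sequential(
    nn.Conv2d(128, 64 * 4, kernel_size=3, padding=1),
    nn.PixelShuffle(2),
)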

Thank you in advance and greetings from Germany!

I have another implementation. Is it correct?

def forward(self, z):
    B, C, H, W = z.size()

    z_e = self.proj(z)
    z_e = z_e.permute(0, 2, 3, 1)  # make (B, H, W, C)
    flatten = z_e.reshape(-1, self.embedding_dim)

    # DeepMind's definition does not do this, but I find I have to... ;\
    if self.training and self.data_initialized.item() == 0:
        print('running kmeans!!')  # data-driven initialization for the embeddings
        rp = torch.randperm(flatten.size(0))
        kd = kmeans2(flatten[rp[:20000]].data.cpu().numpy(), self.n_embed, minit='points')
        self.embed.weight.data.copy_(torch.from_numpy(kd[0]))
        self.data_initialized.fill_(1)
        # TODO: this won't work in multi-GPU setups

    # squared distance, expanded as ||a||^2 - 2 a·b + ||b||^2, which guarantees it is non-negative
    dist = (
        flatten.pow(2).sum(1, keepdim=True)
        - 2 * flatten @ self.embed.weight.t()
        + self.embed.weight.pow(2).sum(1, keepdim=True).t()
    )
    _, ind = (-dist).max(1)
    ind = ind.view(B, H, W)  # every spatial position is quantized, so 64*128 vectors in total here

    # vector quantization cost that trains the embedding vectors
    z_q = self.embed_code(ind)  # (B, H, W, C)
    commitment_cost = 0.25
    diff = commitment_cost * (z_q.detach() - z_e).pow(2).mean() + (z_q - z_e.detach()).pow(2).mean()
    diff *= self.kld_scale

    diff2 = (z_q - z_e).pow(2).mean()
    diff2 *= self.kld_scale

    z_q = z_e + (z_q - z_e).detach()  # no-op in the forward pass, straight-through gradient estimator in the backward pass
    z_q = z_q.permute(0, 3, 1, 2)  # stack encodings into channels again: (B, C, H, W)

    return z_q, diff2, ind  # z_q is the quantized feature map, diff2 is the loss term, ind are the code indices

I changed diff to diff2. I have no clue whether it is correct, but when I trained with it, it converged faster.
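
For comparison, a minimal sketch of the standard VQ-VAE objective from the van den Oord et al. paper, which keeps the two detached terms separate rather than using a plain MSE between z_e and z_q (the tensor names refer to the snippet above):

import torch.nn.functional as F

# Standard VQ-VAE auxiliary losses (van den Oord et al., 2017):
#   codebook loss   moves the embeddings toward the (frozen) encoder outputs
#   commitment loss keeps the encoder outputs close to the (frozen) embeddings
codebook_loss = F.mse_loss(z_q, z_e.detach())
commitment_loss = F.mse_loss(z_q.detach(), z_e)
beta = 0.25  # commitment cost
diff = codebook_loss + beta * commitment_loss

# By contrast, diff2 = (z_q - z_e).pow(2).mean() lets gradients from the same term
# flow into both the codebook and the encoder, which changes the training dynamics.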

Missing 1x1 convolutions at the beginning of the decoder

I believe that there is at least one 1x1 conv missing. In the paper (p. 3) they mention the crucial importance of those, but I could only find a projection prior to the bottleneck here:

self.proj = nn.Conv2d(num_hiddens, n_embed, 1)
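
For illustration, one way to add a 1x1 convolution at the decoder input would be something like the following hypothetical sketch; DecoderWithInputProj, the decoder_body argument, and the placement are assumptions, not code from this repo:

import torch.nn as nn

class DecoderWithInputProj(nn.Module):
    # hypothetical sketch: project the embedding_dim-channel quantized latents back up
    # to the decoder's hidden width with a 1x1 conv before the main decoder stack
    def __init__(self, embedding_dim, num_hiddens, decoder_body):
        super().__init__()
        self.proj_in = nn.Conv2d(embedding_dim, num_hiddens, 1)
        self.body = decoder_body  # e.g. the existing residual/upsampling stack

    def forward(self, z_q):
        return self.body(self.proj_in(z_q))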

As a side question: what is the reason that many autoencoder architectures do away completely with normalization layers in both the encoder and the decoder? I tried to research this question but couldn't find a proper answer. Also, does the size and complexity of both directly relate to the reconstruction quality? I have seen huge encoder/decoder structures that did not perform significantly better than the modest form you have in this repo, or Phil's simple architecture for that matter.

kl divergence loss of GumbelQuantize

Hi Andrej!
I have a hard time understanding the KL divergence loss to the prior. Where does it come from? Does it mean we assume the prior is a uniform distribution with prob = 1/self.n_embed?

diff = kld_scale * torch.sum(qy * torch.log(qy * self.n_embed + 1e-10), dim=1).mean()
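
That reading is consistent with the formula: for a categorical posterior $q$ over $K$ = n_embed codes and a uniform prior $\mathcal{U}(K)$, a short derivation gives

$$
D_{\mathrm{KL}}\bigl(q \,\|\, \mathcal{U}(K)\bigr)
= \sum_{i=1}^{K} q_i \log \frac{q_i}{1/K}
= \sum_{i=1}^{K} q_i \log (q_i K),
$$

which matches torch.sum(qy * torch.log(qy * self.n_embed + 1e-10), dim=1) up to the small constant added for numerical stability.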

Thanks for any hints!

About the input of F.gumbel_softmax

From my understanding, the input of F.gumbel_softmax (i.e., the logits parameter) should be the log of a discrete distribution. However, I didn't see any softmax or log_softmax before the gumbel_softmax. It seems like you're treating the output of self.proj as log-probabilities in the range (-inf, inf), which would mean the probabilities of the discrete distribution lie in (0, inf).

I'm curious why you don't use softmax to normalize things into (0, 1) and make them sum to 1. Does the mathematics still make sense without normalizing?
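
For what it's worth, a small sketch of why unnormalized logits behave the same here: F.gumbel_softmax applies a softmax internally, and softmax is invariant to adding a per-row constant to the logits, which is exactly what log_softmax would subtract (the log of the normalizer):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 512)               # raw, unnormalized scores, e.g. from self.proj
log_probs = F.log_softmax(logits, dim=-1)  # properly normalized log-probabilities

# Fix the Gumbel noise so both calls draw the same samples.
torch.manual_seed(0)
y1 = F.gumbel_softmax(logits, tau=1.0, dim=-1)
torch.manual_seed(0)
y2 = F.gumbel_softmax(log_probs, tau=1.0, dim=-1)

# The inputs differ only by a per-row additive constant inside the softmax,
# so the outputs match up to floating point error.
print(torch.allclose(y1, y2, atol=1e-6))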
