
vq-diffusion's Introduction

VQ-Diffusion (CVPR 2022, Oral) and
Improved VQ-Diffusion

Overview

This is the official repo for the papers Vector Quantized Diffusion Model for Text-to-Image Synthesis and Improved Vector Quantized Diffusion Models.

The code is the same as https://github.com/cientgu/VQ-Diffusion; for some previously raised issues, please refer to that repository.

VQ-Diffusion is based on a VQ-VAE whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). It produces significantly better text-to-image generation results than autoregressive models with a similar number of parameters. Compared with previous GAN-based methods, VQ-Diffusion can handle more complex scenes and improves synthesized image quality by a large margin.
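For intuition, the discrete diffusion operates directly on VQ token indices with a mask-and-replace corruption: at each step a token is kept, resampled uniformly from the codebook, or replaced by a special [MASK] token. Below is a minimal, illustrative sketch of one forward corruption step; the actual implementation in this repo works in log-probability space and uses the schedules defined in the configs, so treat the names, probabilities, and codebook size here as assumptions for illustration only.

import torch

def mask_and_replace_step(x, alpha_t, gamma_t, num_tokens, mask_id):
    """Illustrative q(x_t | x_{t-1}) step: keep a token with prob alpha_t, move it to
    [MASK] with prob gamma_t, otherwise resample it uniformly from the codebook."""
    u = torch.rand(x.shape, device=x.device)
    uniform_tokens = torch.randint(0, num_tokens, x.shape, device=x.device)
    x_t = torch.where(u < alpha_t, x, uniform_tokens)                 # keep or resample
    x_t = torch.where((u >= alpha_t) & (u < alpha_t + gamma_t),
                      torch.full_like(x, mask_id), x_t)               # replace by [MASK]
    return torch.where(x == mask_id, x, x_t)                          # [MASK] is absorbing

# toy usage: a batch of 2 images, 16x16 = 256 tokens each, with a toy codebook of 1024 entries
x0 = torch.randint(0, 1024, (2, 256))
x1 = mask_and_replace_step(x0, alpha_t=0.9, gamma_t=0.05, num_tokens=1024, mask_id=1024)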

Framework

Integration with 🤗 Diffusers library

VQ-Diffusion is now also available in 🧨 Diffusers and accessible via the VQDiffusionPipeline. Diffusers allows you to test VQ-Diffusion in just a couple of lines of code.

You can install diffusers as follows:

pip install diffusers torch accelerate transformers

And then try out the model with just a couple of lines of code:

import torch
from diffusers import VQDiffusionPipeline

pipeline = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq", torch_dtype=torch.float16, revision="fp16")
pipeline = pipeline.to("cuda")

image = pipeline("teddy bear playing in the pool").images[0]

# save image
image.save("./teddy_bear.png")

You can find the model card of the ITHQ checkpoint here.

Requirements

We suggest using Docker. Alternatively, you may run:

bash install_req.sh

Data Preparing

Microsoft COCO

│MSCOCO_Caption/
├──annotations/
│  ├── captions_train2014.json
│  ├── captions_val2014.json
├──train2014/
│  ├── train2014/
│  │   ├── COCO_train2014_000000000009.jpg
│  │   ├── ......
├──val2014/
│  ├── val2014/
│  │   ├── COCO_val2014_000000000042.jpg
│  │   ├── ......
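
The captions_*.json files follow the standard COCO captions format, so you can sanity-check the layout with pycocotools (not required by this repo; this is just a quick verification sketch, assuming pycocotools is installed):

import os
from pycocotools.coco import COCO

root = "MSCOCO_Caption"
coco = COCO(os.path.join(root, "annotations", "captions_train2014.json"))

img_id = coco.getImgIds()[0]
file_name = coco.loadImgs(img_id)[0]["file_name"]          # e.g. COCO_train2014_000000000009.jpg
captions = [a["caption"] for a in coco.loadAnns(coco.getAnnIds(imgIds=img_id))]

# the image itself lives under the nested train2014/train2014/ directory shown above
print(os.path.join(root, "train2014", "train2014", file_name), captions[0])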

CUB-200

│CUB-200/
├──images/
│  ├── 001.Black_footed_Albatross/
│  ├── 002.Laysan_Albatross
│  ├── ......
├──text/
│  ├── text/
│  │   ├── 001.Black_footed_Albatross/
│  │   ├── 002.Laysan_Albatross
│  │   ├── ......
├──train/
│  ├── filenames.pickle
├──test/
│  ├── filenames.pickle

ImageNet

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
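
The ImageNet layout above is the standard synset-per-folder structure, so it can be sanity-checked with torchvision's ImageFolder (again, only a verification sketch; the repo uses its own dataloader):

from torchvision import datasets, transforms

train_set = datasets.ImageFolder(
    "imagenet/train",
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(256),
        transforms.ToTensor(),
    ]),
)
image, class_index = train_set[0]   # class_index follows the sorted synset folder order
print(len(train_set), image.shape, class_index)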

Pretrained Model

We release four text-to-image pretrained models, trained on the Conceptual Captions, MSCOCO, CUB-200, and LAION-human datasets. We also release the ImageNet pretrained model and provide the CLIP pretrained model for convenience. These should be put under OUTPUT/pretrained_model/. The pretrained model files may be large because they are training checkpoints, which contain gradient information, optimizer state, the EMA model, and more.

In addition, we release four pretrained models with learnable classifier-free sampling, trained on the ITHQ, ImageNet, Conceptual Captions, and MSCOCO datasets.

We provide the VQVAE models for the FFHQ, OpenImages, and ImageNet datasets. These models are from Taming Transformers; we provide them here for convenience. Please put them under OUTPUT/pretrained_model/taming_dvae/.

To support the ITHQ dataset, we trained a new VQVAE model on it.

For your convenience, we provide a script for downloading all models. You may run bash vqdiffusion_download_checkpoints.sh.

Inference

To generate images from in-the-wild text:

from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='configs/ithq.yaml', path='OUTPUT/pretrained_model/ithq_learnable.pth')

# Inference VQ-Diffusion
VQ_Diffusion_model.inference_generate_sample_with_condition("teddy bear playing in the pool", truncation_rate=0.86, save_root="RESULT", batch_size=4)

# Inference Improved VQ-Diffusion with learnable classifier-free sampling
VQ_Diffusion_model.inference_generate_sample_with_condition("teddy bear playing in the pool", truncation_rate=1.0, save_root="RESULT", batch_size=4, guidance_scale=5.0)
VQ_Diffusion_model.inference_generate_sample_with_condition("a long exposure photo of waterfall", truncation_rate=1.0, save_root="RESULT", batch_size=4, guidance_scale=5.0)

# Inference Improved VQ-Diffusion with fast/high-quality inference
VQ_Diffusion_model.inference_generate_sample_with_condition("a long exposure photo of waterfall", truncation_rate=0.86, save_root="RESULT", batch_size=4, infer_speed=0.5) # high-quality inference, 0.5x inference speed
VQ_Diffusion_model.inference_generate_sample_with_condition("a long exposure photo of waterfall", truncation_rate=0.86, save_root="RESULT", batch_size=4, infer_speed=2) # fast inference, 2x inference speed
# infer_speed should be a float in [0.1, 10]; a larger infer_speed means faster inference, a smaller one means slower inference

# Inference Improved VQ-Diffusion with purity sampling
VQ_Diffusion_model.inference_generate_sample_with_condition("a long exposure photo of waterfall", truncation_rate=0.86, save_root="RESULT", batch_size=4, prior_rule=2, prior_weight=1) # purity sampling

# Inference Improved VQ-Diffusion with both learnable classifier-free sampling and fast inference
VQ_Diffusion_model.inference_generate_sample_with_condition("a long exposure photo of waterfall", truncation_rate=1.0, save_root="RESULT", batch_size=4, guidance_scale=5.0, infer_speed=2) # classifier-free guidance and fast inference
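
The calls above can of course be combined in a plain Python loop, for example to sweep prompts and guidance scales with the same loaded model (a sketch using only the arguments shown above):

prompts = [
    "teddy bear playing in the pool",
    "a long exposure photo of waterfall",
]
for guidance_scale in (3.0, 5.0):
    for prompt in prompts:
        VQ_Diffusion_model.inference_generate_sample_with_condition(
            prompt, truncation_rate=1.0, save_root="RESULT",
            batch_size=4, guidance_scale=guidance_scale)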

To generate images from given text on the MSCOCO/CUB/CC datasets:

from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/coco_learnable.pth')

# Inference VQ-Diffusion
VQ_Diffusion_model.inference_generate_sample_with_condition("A group of elephants walking in muddy water", truncation_rate=0.86, save_root="RESULT", batch_size=4)

# Inference Improved VQ-Diffusion with learnable classifier-free sampling
VQ_Diffusion_model.inference_generate_sample_with_condition("A group of elephants walking in muddy water", truncation_rate=1.0, save_root="RESULT", batch_size=4, guidance_scale=3.0)

You may change coco_learnable.pth to another pretrained model to test different datasets and texts.

To generate images from a given ImageNet class label:

from inference_VQ_Diffusion import VQ_Diffusion

# Inference VQ-Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.86, save_root="RESULT", batch_size=4)


# Inference Improved VQ-Diffusion with classifier-free sampling
VQ_Diffusion_model = VQ_Diffusion(config='configs/imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_learnable.pth', imagenet_cf=True)
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.94, save_root="RESULT", batch_size=8, guidance_scale=1.5)
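
As in the text-to-image case, you can loop over several ImageNet class indices with the same loaded model (a sketch; 207, 407, and 980 are arbitrary example class ids):

for class_id in (207, 407, 980):
    VQ_Diffusion_model.inference_generate_sample_with_class(
        class_id, truncation_rate=0.94, save_root="RESULT",
        batch_size=8, guidance_scale=1.5)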

Training

First, change data_root to the correct path in configs/coco.yaml or the other configs.

Train Text2Image generation on MSCOCO dataset:

python running_command/run_train_coco.py

Train Text2Image generation on CUB200 dataset:

python running_command/run_train_cub.py

Train conditional generation on ImageNet dataset:

python running_command/run_train_imagenet.py

Train unconditional generation on FFHQ dataset:

python running_command/run_train_ffhq.py

Fine-tune Text2Image generation on MSCOCO dataset with learnable classifier-free sampling:

python running_command/run_tune_coco.py
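
The scripts under running_command/ are thin wrappers around train.py. If you need to control things like CUDA_VISIBLE_DEVICES or the checkpoint to fine-tune from, you can call train.py directly; the following command (taken from one of the issues below, for CUB-200) shows the expected flags:

CUDA_VISIBLE_DEVICES=0,1 python train.py --name cub200_train --config_file configs/cub200.yaml --num_node 1 --tensorboard --load_path OUTPUT/pretrained_model/CC_pretrained.pth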

Cite VQ-Diffusion

If you find our code helpful for your research, please consider citing:

@article{gu2021vector,
  title={Vector Quantized Diffusion Model for Text-to-Image Synthesis},
  author={Gu, Shuyang and Chen, Dong and Bao, Jianmin and Wen, Fang and Zhang, Bo and Chen, Dongdong and Yuan, Lu and Guo, Baining},
  journal={arXiv preprint arXiv:2111.14822},
  year={2021}
}

Acknowledgement

Thanks to everyone who makes their code and models available.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VQ-Diffusion, please submit a GitHub issue. For other communications related to VQ-Diffusion, please contact Shuyang Gu ([email protected]) or Dong Chen ([email protected]).

vq-diffusion's People

Contributors

cene555, cientgu, microsoft-github-policy-service[bot], patrickvonplaten, tzco, williamberman, youchenghuanxian

vq-diffusion's Issues

CUB dataset

Hello, thanks for the code.
How can I obtain the dataset for CUB in the suggested format?
Thanks!

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

When I ran the model on ImageNet, the following error occurred. How can I tackle this tensor device problem? Thanks!

Model unexpected keys:
['transformer.log_alpha', 'transformer.log_1_min_alpha', 'transformer.log_cumprod_alpha', 'transformer.log_1_min_cumprod_alpha']
Evaluate EMA model
/home/zvwang/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py:1967: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
File "test_imagenet.py", line 5, in
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.86, save_root="RESULT", batch_size=4)
File "/home/zvwang/VQ-Diffusion/inference_VQ_Diffusion.py", line 93, in inference_generate_sample_with_class
sample_type="top"+str(truncation_rate)+'r',
File "/home/zvwang/miniconda3/envs/myenv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/zvwang/VQ-Diffusion/image_synthesis/modeling/models/conditional_dalle.py", line 184, in generate_content
content = self.content_codec.decode(trans_out['content_token']) #(8,1024)->(8,3,256,256)
File "/home/zvwang/VQ-Diffusion/image_synthesis/modeling/codecs/image_codec/taming_gumbel_vqvae.py", line 202, in decode
img_seq=self.quantize_to_full[img_seq].type_as(img_seq)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
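
A possible workaround (only a sketch of a local patch, not an official fix) is to move the lookup table onto the same device as the indices before indexing:

# in image_synthesis/modeling/codecs/image_codec/taming_gumbel_vqvae.py, decode():
img_seq = self.quantize_to_full.to(img_seq.device)[img_seq].type_as(img_seq)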

How to train VQVAE models?

Hi,
I want to use this work for other data, but it seems that only the Text2Image training scripts are provided.
How to train VQVAE models?

Unable to train in >1 GPU

I am running on an 8 GPU node. For some reason, I am unable to run on more than 1 GPU. In the following, I tried to run on 2 GPUs by setting CUDA_VISIBLE_DEVICES=0,1. I have set the batch size to 1 (COCO).

 File "/home/eos/workspace/image_synthesis/modeling/transformers/transformer_utils.py", line 50, in forward                 att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))) # (B, nh, T, T)                                    
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 1; 11.77 GiB total capacity; 10.46 GiB already allocated; 31.06 MiB free; 10.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF   
  

On 1 GPU, it trains very slowly, now giving an ETA of 330 days.

How do you prevent generating [MASK] token at the last step

Thanks for your great work! I have a question about the sampling process. The sampling process at each timestep t is equivalent to sampling from p_theta(x0 | xt) and then sampling from q(x_{t-1} | x0, xt), right? This conditional posterior distribution contains the [MASK] token, which is theoretically inevitable to encounter when sampling. But in the end, we cannot use a [MASK] sample to generate the image. So, can you elaborate on how you prevent this from happening, i.e., getting [MASK] when sampling from q(x_{t-1} | x0, xt)?
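
For reference, a common way to guarantee that no [MASK] tokens survive the final step (whether this is exactly what the repo's p_sample does should be checked against diffusion_transformer.py; the snippet below is only an illustrative sketch) is to forbid the [MASK] index in the categorical log-probabilities before the last sampling step:

import torch

def sample_without_mask(log_probs, mask_id):
    """log_probs: (B, K+1, N) log-probabilities over K codebook tokens plus [MASK]."""
    log_probs = log_probs.clone()
    log_probs[:, mask_id, :] = -float("inf")                                  # forbid [MASK]
    log_probs = log_probs - torch.logsumexp(log_probs, dim=1, keepdim=True)   # renormalize
    gumbel = -torch.log(-torch.log(torch.rand_like(log_probs) + 1e-30) + 1e-30)
    return (log_probs + gumbel).argmax(dim=1)                                 # Gumbel-max categorical sample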

About the training of CUB200 dataset

Regarding the training of CUB-200, I followed all the parameter settings of the source code. I first loaded cc_learned.pth and started training; according to the paper, this should be the VQ-Diffusion-F model. After training for up to 300 epochs, I observed that the validation loss had almost stabilized and tested the checkpoint at epoch 299. But the test result is very bad, with an FID of 28. I am curious why this is the case.

My test command is:
VQ_Diffusion_model.inference_generate_sample_with_condition(data,truncation_rate=1.0, save_root="pre/ep299_tr1_gs5",batch_size=1, guidance_scale=5.0)

Here are the TensorBoard curves and a visualization of the test results (screenshots attached in the original issue).

I also tried to train VQ-Diffusion-B, which has no pretrained model, but the result is worse.

Does anyone encounter the same problem?

Cannot run training script on single GPU

Run training script:

import os

string = "CUDA_VISIBLE_DEVICES=0 python train.py --name cub200_train --config_file configs/cub200.yaml --num_node 1 --tensorboard --load_path OUTPUT/pretrained_model/CC_pretrained.pth"

os.system(string)

The following error occurred:

Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                          
  File "train.py", line 169, in <module>                                                                                                                                                                                                                                                                                                                                                    
    main()                                                                                                                                                                                                                                                                                                                                                                                  
  File "train.py", line 125, in main                                                                                                                                                                                                                                                                                                                                                        
    launch(main_worker, args.ngpus_per_node, args.num_node, args.node_rank, args.dist_url, args=(args,))                                                                                                                                                                                                                                                                                    
  File "/home/user/Github/MUGE/VQ-Diffusion/image_synthesis/distributed/launch.py", line 50, in launch                                                                                                                                                                                                                                                                              
    fn(local_rank, *args)                                                                                                                                                                                                                                                                                                                                                                   
  File "train.py", line 158, in main_worker                                                                                                                                                                                                                                                                                                                                                 
    solver.resume(path=args.load_path,                                                                                                                                                                                                                                                                                                                                                      
  File "/home/user/Github/MUGE/VQ-Diffusion/image_synthesis/engine/solver.py", line 374, in resume                                                                                                                                                                                                                                                                                  
    self.model.load_state_dict(state_dict['model'])                                                                                                                                                                                                                                                                                                                                         
  File "/home/user/miniconda3/envs/vqd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict                                                                                                                                                                                                                                                               
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(                                                                                                                                                                                                                                                                                                               
RuntimeError: Error(s) in loading state_dict for DALLE:                                                                                                                                                                                                                                                                                                                                     
        Unexpected key(s) in state_dict: "transformer.log_alpha", "transformer.log_1_min_alpha", "transformer.log_cumprod_alpha", "transformer.log_1_min_cumprod_alpha".

If I set CUDA_VISIBLE_DEVICES=0,1, then everything is OK.
How can I load the state_dict for DALLE on a single GPU?
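
As a stop-gap (a sketch, not the official fix), you can drop the unexpected schedule buffers before loading, or load non-strictly:

import torch

state = torch.load("OUTPUT/pretrained_model/CC_pretrained.pth", map_location="cpu")
model_sd = state["model"]
for k in list(model_sd.keys()):
    if k.startswith("transformer.log_"):       # the unexpected alpha-schedule buffers reported above
        model_sd.pop(k)
model.load_state_dict(model_sd, strict=False)  # `model` is the built DALLE model, as in solver.resume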

Hugging Face API in the readme needs to be updated.

Running the old one gives:

ValueError: Pipeline <class 'diffusers.pipelines.vq_diffusion.pipeline_vq_diffusion.VQDiffusionPipeline'> expected {'vqvae', 'transformer', 'scheduler', 'learned_classifier_free_sampling_embeddings', 'tokenizer', 'text_encoder'}, but only {'vqvae', 'tokenizer', 'transformer', 'text_encoder', 'scheduler'} were passed.

And calling as this works fine:

import torch
# from diffusers import VQDiffusionPipeline
# pipeline = VQDiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq", torch_dtype=torch.float16, revision="fp16")

from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained("microsoft/vq-diffusion-ithq")

pipeline = pipeline.to("cuda")

image = pipeline("teddy bear playing in the pool").images[0]

# save image
image.save("./teddy_bear.png")

Also, it seems to be a problem with specifying torch_dtype=torch.float16, revision="fp16".

How to implement the irregular mask inpainting results

Hi,

I have some questions about the irregular-mask inpainting results.
How can I reproduce the results in Figure 5 of Appendix C?
I tried to generate samples with the code in the repo, but found that some of the tokens given in the sample image turn into different tokens while being sampled from the diffusion model (p_sample).

It seems that the repo does not include inference code for inpainting.
It would be nice if anyone could help me with the inpainting results!

Thanks

Progress toward Google Colab with Inference Error

I created the following Colab notebook but I am getting an error during inference. https://colab.research.google.com/drive/15JABpusfx_vk32GXDFHSiB9Y-CXJowe5#scrollTo=ox_nqiA6MMno

Here is the error from the line starting with VQ_Diffusion_model = ....


ImportError Traceback (most recent call last)

in ()
1 from inference_VQ_Diffusion import VQ_Diffusion
----> 2 VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
3 VQ_Diffusion_model.inference_generate_sample_with_condition("a huge white stone castle in a meadow painted by Rene Magritte",truncation_rate=0.85, save_root="RESULT",batch_size=4)
4 VQ_Diffusion_model.inference_generate_sample_with_condition("a woman in a dark red dress painted by Norman Rockwell",truncation_rate=0.85, save_root="RESULT",batch_size=4,fast=2) # for fast inference

24 frames

/usr/local/lib/python3.7/dist-packages/torchtext/vocab/vocab_factory.py in ()
2 from typing import Dict, Iterable, Optional, List
3 from collections import Counter, OrderedDict
----> 4 from torchtext._torchtext import (
5 Vocab as VocabPybind,
6 )

ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZTVN5torch3jit6MethodE

q_posterior computation

Thank you for releasing the codes.
Could you elaborate a bit more on one part of the q_posterior function?
I have not understood why this is necessary when x_t == [MASK].

Calculation of q_posterior?

log_qt = self.q_pred(log_x_t, t) # q(xt|x0)

q_pred(log_x_start, t) is the forward computation for sampling x_t given x_0 and t, but in this line of the posterior, the given condition is log_x_t and t?

log_qt_one_timestep = self.q_pred_one_timestep(log_x_t, t) # q(xt|xt_1)

The same confusion also appears in this line. q_pred_one_timestep(self, log_x_t, t) is also the forward computation for sampling x_t given x_{t-1}, but in this line of the posterior, the given condition is log_x_t and t?
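
For what it's worth, the quantity being assembled here is the standard discrete-diffusion posterior obtained from Bayes' rule (this is generic DDPM algebra, not something specific to this repo):

q(x_{t-1} | x_t, x_0) = q(x_t | x_{t-1}) q(x_{t-1} | x_0) / q(x_t | x_0)

Viewed this way, the code needs q(x_t | x_0) and q(x_t | x_{t-1}) evaluated at the observed x_t, as functions of x_0 and x_{t-1} respectively. Because the transition matrices are shared with the forward process, those evaluations can be expressed with the same q_pred / q_pred_one_timestep routines applied to log_x_t, which is presumably why the observed x_t appears as their argument (with the [MASK] case handled separately by the surrounding code).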

download links are invalid

I'm trying to download the pretrained models, but I find that the links are all invalid.
How can I get these pretrained models?

Equations in the original paper

Hi, I have several questions about the equations in the paper:

  1. Eq. 10: where does the prior distribution come from? I don't understand. Can you help me figure this out?
  2. Eq. 11: the summation runs over a variable from 1 to K, so it seems that the variable is a scalar. But the noised input x_t or the estimated x_0 should be a tensor with height and width, not just a scalar. Am I right?

Filter Ratio when Sample?

Terrific work! Great thanks for sharing your code! I am a little confused about the filter ratio [0.0, 0.5, 1.0] used when sampling images. I empirically find that 0.5 performs better. What does this parameter control? Is it mentioned in your paper?

About the proof that introducing a small uniform noise helps prevent a trivial posterior

Hi.

Fantastic work on discrete diffusion, but something is still not clear to me.

You mention in your paper that introducing a small uniform noise instead of fully masking helps prevent model collapse, and that the proof is in the supplementary material. But in the supplementary I have only found the proof of Equation 8. I wonder where the actual proof of your claim is, or how to derive your claim from the proof of Equation 8.

A million thanks.

Cannot download the pretrained model

After more than 2 hours of waiting, my download of the pretrained (.pth) model failed. Ouch. Is there a way you can put the models on a faster, more reliable server? I have a fast internet connection on this computer: 200 Mbps.

File "taming_f8_8192_openimages_last.pth" not found?

Thanks for your code. But when running:

from inference_VQ_Diffusion import VQ_Diffusion
VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/coco_learnable.pth')

an error occurred

FileNotFoundError: [Errno 2] No such file or directory: 'OUTPUT/pretrained_model/taming_dvae/taming_f8_8192_openimages_last.pth'

However, it seems that this file is not provided. I'll be grateful for any response.

The whole error traceback is copied below:

Traceback (most recent call last):
File "inference_VQ_Diffusion.py", line 152, in
path='OUTPUT/pretrained_model/cub_pretrained.pth')
File "inference_VQ_Diffusion.py", line 25, in init
self.info = self.get_model(ema=True, model_path=path, config_path=config, imagenet_cf=imagenet_cf)
File "inference_VQ_Diffusion.py", line 45, in get_model
model = build_model(config)
File "/root/VQ-Diffusion/image_synthesis/modeling/build.py", line 5, in build_model
return instantiate_from_config(config['model'])
File "/root/VQ-Diffusion/image_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "/root/VQ-Diffusion/image_synthesis/modeling/models/dalle.py", line 35, in init
self.content_codec = instantiate_from_config(content_codec_config)
File "/root/VQ-Diffusion/image_synthesis/utils/misc.py", line 132, in instantiate_from_config
return cls(**config.get("params", dict()))
File "/root/VQ-Diffusion/image_synthesis/modeling/codecs/image_codec/taming_gumbel_vqvae.py", line 225, in init
model = self.LoadModel(config_path, ckpt_path)
File "/root/VQ-Diffusion/image_synthesis/modeling/codecs/image_codec/taming_gumbel_vqvae.py", line 248, in LoadModel
sd = torch.load(ckpt_path, map_location="cpu")["state_dict"]
File "/root/miniconda3/envs/vqdm/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
with _open_file_like(f, 'rb') as opened_file:
File "/root/miniconda3/envs/vqdm/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/root/miniconda3/envs/vqdm/lib/python3.7/site-packages/torch/serialization.py", line 211, in init
super(_open_file, self).init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'OUTPUT/pretrained_model/taming_dvae/taming_f8_8192_openimages_last.pth'

text-guide image editing?

How can I do text-guided image editing with VQ-Diffusion? Is the model the same as the text-to-image one? Are both the input image and the masked image fed into the network? How should they be used?

Difference from LDM

Hi, thanks for the great work!

I just noticed that your paper is actually concurrent work with LDM (exactly the same conference publication!). I'm wondering what the main difference between these two works is in terms of method? (I took a quick pass, but it seems that these two papers propose basically the same technique?)

Thanks!

About unconditional synthesis on FFHQ.

Hi Authors,

Thanks for sharing this nice work!
I am trying to reproduce the results of unconditional synthesis on FFHQ dataset.
Compared to other tasks, however, training and inference details for this experiment seem to be insufficient.
Could you tell me the training details and inference code for unconditional image generation on FFHQ?

Thanks a lot:)

State-of-the-art comparison

Table from the recent "Improved Vector Quantized Diffusion Models" comparing to "state-of-the-art".

It seems that you're not aware of all previous approaches (disregarding concurrent work like Make-a-scene and Imagen):

COCO
As I understand, your approach has been trained on COCO, hence, it is not fair to compare it to GLIDE directly (which is zero-shot). In the training+testing on coco setting, prior work also achieved better numbers (FID=8.12).

ImageNet
Current best FID=2.26 at resolution 256x256.

Having problem downloading pretrained models

Hello, I tried downloading from the provided links, but the download keeps getting interrupted. Is there a way to get these weights from other sources? I saw an already-closed issue where they solved it by retrying with a bit of luck, but I have been trying for a week without success. Thanks in advance.

cannot reproduce inference using coco_pretrained.pth

When I run the code below, I get an AttributeError.

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/coco_learnable.pth')
VQ_Diffusion_model.inference_generate_sample_with_condition("A group of elephants walking in muddy water", truncation_rate=0.86, save_root="RESULT", batch_size=4)

The whole error traceback is as follows.

Working with z of shape (1, 256, 32, 32) = 262144 dimensions.
/home/lizhibing/anaconda3/envs/vqdif/lib/python3.9/site-packages/torchvision/transforms/transforms.py:280: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
  warnings.warn(
{'overall': {'trainable': '370.77M', 'non_trainable': '126.29M', 'total': '497.06M'}, 'content_codec': {'trainable': '0', 'non_trainable': '65.8M', 'total': '65.8M'}, 'condition_codec': {'trainable': '0', 'non_trainable': '0', 'total': '0'}, 'transformer': {'trainable': '370.77M', 'non_trainable': '60.49M', 'total': '431.26M'}}
Model missing keys:
 []
Model unexpected keys:
 ['transformer.empty_text_embed']
Evaluate EMA model
Traceback (most recent call last):
  File "/home/lizhibing/repo/VQ-Diffusion/inference_VQ_Diffusion.py", line 184, in <module>
    VQ_Diffusion_model.inference_generate_sample_with_condition("A group of elephants walking in muddy water", truncation_rate=1.0, save_root="RESULT", batch_size=4, guidance_scale=3.0)
  File "/home/lizhibing/repo/VQ-Diffusion/inference_VQ_Diffusion.py", line 126, in inference_generate_sample_with_condition
    model_out = self.model.generate_content(
  File "/home/lizhibing/anaconda3/envs/vqdif/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/lizhibing/repo/VQ-Diffusion/image_synthesis/modeling/models/dalle.py", line 164, in generate_content
    cf_cond_emb = self.transformer.empty_text_embed.unsqueeze(0).repeat(batch_size, 1, 1)
  File "/home/lizhibing/anaconda3/envs/vqdif/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DiffusionTransformer' object has no attribute 'empty_text_embed'

Some parameters don't receive gradients.

Hello, when I am running training command on coco, I encounter the following error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.

Then I found that the parameters in "module.content_codec" and "module.transformer.condition_emb" don't receive gradients. So should we set find_unused_parameters=True in DDP?
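
If you just need to unblock training, one pragmatic option (a sketch; whether it is the intended fix is for the authors to confirm) is to freeze the modules that do not contribute to the loss before DDP wrapping, or to enable unused-parameter detection where train.py wraps the model:

import torch

# freeze the modules reported above so DDP does not expect gradients for them
for p in model.content_codec.parameters():          # likewise for transformer.condition_emb
    p.requires_grad = False

# or enable unused-parameter detection when wrapping (model / local_rank as set up in train.py)
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], find_unused_parameters=True)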

Cannot reproduce FID on Imagenet.

I tested the pretrained checkpoint from "Improved VQ-Diffusion". I used the following inference script provided in inference_VQ_Diffusion.py and your pretrained checkpoints, but only got an FID of 20.4958 on ImageNet. This performance is far from the 11.89 in the paper. What causes the difference?

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.86, save_root="RESULT", batch_size=4)

And with the following sampling script, I can get an FID of 7.2740. Which number should I match, Improved VQ-Diffusion or Improved VQ-Diffusion*?

VQ_Diffusion_model = VQ_Diffusion(config='configs/imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_learnable.pth', imagenet_cf=True)
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.94, save_root="RESULT", batch_size=8, guidance_scale=1.5)

device-side assert triggered

I trained taming-transformers on my own dataset and got the ckpt file and the corresponding yaml file. When I apply it to VQ-Diffusion, an error is reported. I followed configs/imagenet.yaml and only replaced the ckpt file path and the corresponding yaml file path.

I feel that some parameters need to be adjusted accordingly, but I have not managed to debug it myself. My suspicion is that help_folder/statistics/taming_vqvae_974.pt may not match the parameters I used to train taming-transformers. If you could provide the training details for the ifhq dataset, I would greatly appreciate it.

configs/mydataset.yaml

# change from o4
model:
  target: image_synthesis.modeling.models.conditional_dalle.C_DALLE
  params:
    content_info: {key: image}
    condition_info: {key: label}
    content_codec_config: 
      target: image_synthesis.modeling.codecs.image_codec.taming_gumbel_vqvae.TamingVQVAE
      params:
        trainable: False
        token_shape: [16, 16]
        config_path: 'OUTPUT/pretrained_model/mydataset/mydataset.yaml'
        ckpt_path: 'OUTPUT/pretrained_model/mydataset/last.ckpt'
        num_tokens: 1024
        quantize_number: 974
        mapping_path: './help_folder/statistics/taming_vqvae_974.pt'
        # return_logits: True
    diffusion_config:      
      target: image_synthesis.modeling.transformers.diffusion_transformer.DiffusionTransformer
      params:
        diffusion_step: 100
        alpha_init_type: 'alpha1'        
        auxiliary_loss_weight: 1.0e-3
        adaptive_auxiliary_loss: True
        mask_weight: [1, 1]    # the loss weight on mask region and non-mask region

        transformer_config:
          target: image_synthesis.modeling.transformers.transformer_utils.Condition2ImageTransformer
          params:
            attn_type: 'selfcondition'
            n_layer: 24
            class_type: 'adalayernorm'
            class_number: 15
            content_seq_len: 256  # 16 x 16
            content_spatial_size: [16, 16]
            n_embd: 512 # the dim of embedding dims   # both this and content_emb_config
            n_head: 16 
            attn_pdrop: 0.0
            resid_pdrop: 0.0
            block_activate: GELU2
            timestep_type: 'adalayernorm'    # adainsnorm or adalayernorm and abs
            mlp_hidden_times: 4
            mlp_type: 'conv_mlp'
        condition_emb_config:
          target: image_synthesis.modeling.embeddings.class_embedding.ClassEmbedding
          params:
            num_embed: 15 # 
            embed_dim: 512
            identity: True
        content_emb_config:
          target: image_synthesis.modeling.embeddings.dalle_mask_image_embedding.DalleMaskImageEmbedding
          params:
            num_embed: 974
            spatial_size: !!python/tuple [32, 32]
            embed_dim: 512
            trainable: True
            pos_emb_type: embedding

solver:
  base_lr: 3.0e-6
  adjust_lr: none # not adjust lr according to total batch_size
  max_epochs: 100
  save_epochs: 2
  validation_epochs: 100
  sample_iterations: epoch  # epoch #30000      # how many iterations to perform sampling once ?
  print_specific_things: True

  # config for ema
  ema:
    decay: 0.99
    update_interval: 25
    device: cpu

  clip_grad_norm:
    target: image_synthesis.engine.clip_grad_norm.ClipGradNorm
    params:
      start_iteration: 0
      end_iteration: 5000
      max_norm: 0.5
  optimizers_and_schedulers: # a list of configures, so we can config several optimizers and schedulers
  - name: none # default is None
    optimizer:
      target: torch.optim.AdamW
      params: 
        betas: !!python/tuple [0.9, 0.96]
        weight_decay: 4.5e-2
    scheduler:
      step_iteration: 1
      target: image_synthesis.engine.lr_scheduler.ReduceLROnPlateauWithWarmup
      params:
        factor: 0.5
        patience: 100000
        min_lr: 1.0e-6
        threshold: 1.0e-1
        threshold_mode: rel
        warmup_lr: 4.5e-4 # the lr to be touched after warmup
        warmup: 5000 

dataloader:
........

OUTPUT/pretrained_model/mydataset/mydataset.yaml

model:
  base_learning_rate: 4.5e-06
  target: image_synthesis.taming.models.vqgan.VQModel
  params:
    embed_dim: 256
    n_embed: 1024
    monitor: val/rec_loss
    ddconfig:
      double_z: false
      z_channels: 256
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult:
      - 1
      - 1
      - 2
      - 2
      - 4
      num_res_blocks: 2
      attn_resolutions:
      - 16
      dropout: 0.0
    lossconfig:
      target: image_synthesis.taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
      params:
        disc_conditional: false
        disc_in_channels: 3
        disc_start: 0
        disc_weight: 0.8
        codebook_weight: 1.0
        # ssim_loss: true

