
latent-diffusion's Introduction

Latent Diffusion Models

arXiv | BibTeX

High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach*, Andreas Blattmann*, Dominik Lorenz, Patrick Esser, Björn Ommer
* equal contribution

News

July 2022

April 2022

Requirements

A suitable conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

Pretrained Models

A general list of all available checkpoints is available via our model zoo. If you use any of these models in your work, we are always happy to receive a citation.

Retrieval Augmented Diffusion Models

rdm-figure

We include inference code to run our retrieval-augmented diffusion models (RDMs) as described in https://arxiv.org/abs/2204.11824.

To get started, install the additional required Python packages into your ldm environment

pip install transformers==4.19.2 scann kornia==0.6.4 torchmetrics==0.6.0
pip install git+https://github.com/arogozhnikov/einops.git

and download the trained weights (preliminary checkpoints):

mkdir -p models/rdm/rdm768x768/
wget -O models/rdm/rdm768x768/model.ckpt https://ommer-lab.com/files/rdm/model.ckpt

As these models are conditioned on a set of CLIP image embeddings, our RDMs support different inference modes, which are described in the following.

RDM with text-prompt only (no explicit retrieval needed)

Since CLIP offers a shared image/text feature space, and RDMs learn to cover a neighborhood of a given example during training, we can directly take a CLIP text embedding of a given prompt and condition on it. Run this mode via

python scripts/knn2img.py  --prompt "a happy bear reading a newspaper, oil on canvas"

RDM with text-to-image retrieval

To be able to run an RDM conditioned on a text prompt and, additionally, on images retrieved from this prompt, you will also need to download the corresponding retrieval database. We provide two distinct databases extracted from the OpenImages and ArtBench datasets. Interchanging the databases results in different capabilities of the model, as visualized below, although the learned weights are the same in both cases.

Download the retrieval databases, which contain the retrieval datasets (OpenImages (~11GB) and ArtBench (~82MB)) compressed into CLIP image embeddings:

mkdir -p data/rdm/retrieval_databases
wget -O data/rdm/retrieval_databases/artbench.zip https://ommer-lab.com/files/rdm/artbench_databases.zip
wget -O data/rdm/retrieval_databases/openimages.zip https://ommer-lab.com/files/rdm/openimages_database.zip
unzip data/rdm/retrieval_databases/artbench.zip -d data/rdm/retrieval_databases/
unzip data/rdm/retrieval_databases/openimages.zip -d data/rdm/retrieval_databases/

We also provide trained ScaNN search indices for ArtBench. Download and extract via

mkdir -p data/rdm/searchers
wget -O data/rdm/searchers/artbench.zip https://ommer-lab.com/files/rdm/artbench_searchers.zip
unzip data/rdm/searchers/artbench.zip -d data/rdm/searchers

Since the index for OpenImages is large (~21 GB), we provide a script to create and save it for usage during sampling. Note, however, that sampling with the OpenImages database will not be possible without this index. Run the script via

python scripts/train_searcher.py

Retrieval based text-guided sampling with visual nearest neighbors can be started via

python scripts/knn2img.py  --prompt "a happy pineapple" --use_neighbors --knn <number_of_neighbors> 

Note that the maximum supported number of neighbors is 20. The database can be changed via the command-line parameter --database, which can be one of [openimages, artbench-art_nouveau, artbench-baroque, artbench-expressionism, artbench-impressionism, artbench-post_impressionism, artbench-realism, artbench-renaissance, artbench-romanticism, artbench-surrealism, artbench-ukiyo_e]. To use --database openimages, the above script (scripts/train_searcher.py) must be executed beforehand. Due to their relatively small size, the ArtBench databases are best suited for creating more abstract concepts and do not work well for detailed text control.
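For example, a retrieval-augmented run against the ArtBench impressionism database could look like the following (the prompt and neighbor count are illustrative):

python scripts/knn2img.py --prompt "a foggy harbor at dawn, oil on canvas" --database artbench-impressionism --use_neighbors --knn 10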

Coming Soon

  • better models
  • more resolutions
  • image-to-image retrieval

Text-to-Image

text2img-figure

Download the pre-trained weights (5.7GB)

mkdir -p models/ldm/text2img-large/
wget -O models/ldm/text2img-large/model.ckpt https://ommer-lab.com/files/latent-diffusion/nitro/txt2img-f8-large/model.ckpt

and sample with

python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0  --ddim_steps 50

This will save each sample individually as well as a grid of size n_iter x n_samples at the specified output location (default: outputs/txt2img-samples). Quality, sampling speed and diversity are best controlled via the scale, ddim_steps and ddim_eta arguments. As a rule of thumb, higher values of scale produce better samples at the cost of a reduced output diversity.
Furthermore, increasing ddim_steps generally also gives higher quality samples, but returns are diminishing for values > 250. Fast sampling (i.e. low values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0.
Faster sampling (i.e. even lower values of ddim_steps) while retaining good quality can be achieved by using --ddim_eta 0.0 and --plms (see Pseudo Numerical Methods for Diffusion Models on Manifolds).
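For example, a faster PLMS run of the prompt above could look like this (the reduced step count is illustrative):

python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 25 --plms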

Beyond 256²

For certain inputs, simply running the model in a convolutional fashion on larger features than it was trained on can sometimes produce interesting results. To try it out, tune the H and W arguments (which will be integer-divided by 8 in order to calculate the corresponding latent size), e.g. run

python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  

to create a sample of size 384x1024. Note, however, that controllability is reduced compared to the 256x256 setting.

The example below was generated using the above command.

text2img-figure-conv

Inpainting

inpainting

Download the pre-trained weights

wget -O models/ldm/inpainting_big/last.ckpt https://heibox.uni-heidelberg.de/f/4d9ac7ea40c64582b7c9/?dl=1

and sample with

python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results

indir should contain images *.png and masks <image_fname>_mask.png like the examples provided in data/inpainting_examples.
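For illustration, an input directory could be laid out as follows (the file names are hypothetical):

data/my_inpainting_inputs/
├── photo_01.png
├── photo_01_mask.png
├── photo_02.png
├── photo_02_mask.png
├── ...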

Class-Conditional ImageNet

Available via a notebook.

class-conditional

Unconditional Models

We also provide a script for sampling from unconditional LDMs (e.g. LSUN, FFHQ, ...). Start it via

CUDA_VISIBLE_DEVICES=<GPU_ID> python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l <logdir> -n <#samples> --batch_size <batch_size> -c <#ddim steps> -e <#eta>
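A filled-in example could look like this (keep <model_spec> as the directory of the model you downloaded; the step count and eta below mirror the LSUN-Churches entry in the model zoo further down):

CUDA_VISIBLE_DEVICES=0 python scripts/sample_diffusion.py -r models/ldm/<model_spec>/model.ckpt -l logs/samples -n 50 --batch_size 10 -c 400 -e 0.0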

Train your own LDMs

Data preparation

Faces

For downloading the CelebA-HQ and FFHQ datasets, proceed as described in the taming-transformers repository.

LSUN

The LSUN datasets can be conveniently downloaded via the script available here. We performed a custom split into training and validation images, and provide the corresponding filenames at https://ommer-lab.com/files/lsun.zip. After downloading, extract them to ./data/lsun. The beds/cats/churches subsets should also be placed in (or symlinked at) ./data/lsun/bedrooms, ./data/lsun/cats, and ./data/lsun/churches, respectively; a sketch of this layout follows below.
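A minimal sketch of that layout, assuming the raw LSUN subsets were extracted to /path/to/lsun (the source paths are placeholders):

mkdir -p ./data/lsun
ln -s /path/to/lsun/bedrooms ./data/lsun/bedrooms
ln -s /path/to/lsun/cats ./data/lsun/cats
ln -s /path/to/lsun/churches ./data/lsun/churches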

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ and ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, respectively; they will then be extracted into the above structure without downloading the data again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exists. Remove them if you want to force running the dataset preparation again.
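For example, assuming the tar files live under /path/to/imagenet (a placeholder) and ${XDG_CACHE} is set (it defaults to ~/.cache), the symlinks could be created with:

mkdir -p ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/
ln -s /path/to/imagenet/ILSVRC2012_img_train.tar ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/
ln -s /path/to/imagenet/ILSVRC2012_img_val.tar ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/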

Model Training

Logs and checkpoints for trained models are saved to logs/<START_DATE_AND_TIME>_<config_spec>.

Training autoencoder models

Configs for training a KL-regularized autoencoder on ImageNet are provided at configs/autoencoder. Training can be started by running

CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/autoencoder/<config_spec>.yaml -t --gpus 0,    

where config_spec is one of {autoencoder_kl_8x8x64(f=32, d=64), autoencoder_kl_16x16x16(f=16, d=16), autoencoder_kl_32x32x4(f=8, d=4), autoencoder_kl_64x64x3(f=4, d=3)}.
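For example, training the f=8, d=4 KL-regularized autoencoder on a single GPU would be started with (the GPU id is illustrative):

CUDA_VISIBLE_DEVICES=0 python main.py --base configs/autoencoder/autoencoder_kl_32x32x4.yaml -t --gpus 0,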

For training VQ-regularized models, see the taming-transformers repository.

Training LDMs

In configs/latent-diffusion/ we provide configs for training LDMs on the LSUN-, CelebA-HQ, FFHQ and ImageNet datasets. Training can be started by running

CUDA_VISIBLE_DEVICES=<GPU_ID> python main.py --base configs/latent-diffusion/<config_spec>.yaml -t --gpus 0,

where <config_spec> is one of {celebahq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), ffhq-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_bedrooms-ldm-vq-4 (f=4, VQ-reg. autoencoder, spatial size 64x64x3), lsun_churches-ldm-kl-8 (f=8, KL-reg. autoencoder, spatial size 32x32x4), cin-ldm-vq-8 (f=8, VQ-reg. autoencoder, spatial size 32x32x4)}.
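For example, training the CelebA-HQ model on a single GPU would be started with (the GPU id is illustrative):

CUDA_VISIBLE_DEVICES=0 python main.py --base configs/latent-diffusion/celebahq-ldm-vq-4.yaml -t --gpus 0,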

Model Zoo

Pretrained Autoencoding Models

rec2

All models were trained until convergence (no further substantial improvement in rFID).

| Model | rFID vs val | train steps | PSNR | PSIM | Link | Comments |
|---|---|---|---|---|---|---|
| f=4, VQ (Z=8192, d=3) | 0.58 | 533066 | 27.43 +/- 4.26 | 0.53 +/- 0.21 | https://ommer-lab.com/files/latent-diffusion/vq-f4.zip | |
| f=4, VQ (Z=8192, d=3) | 1.06 | 658131 | 25.21 +/- 4.17 | 0.72 +/- 0.26 | https://heibox.uni-heidelberg.de/f/9c6681f64bb94338a069/?dl=1 | no attention |
| f=8, VQ (Z=16384, d=4) | 1.14 | 971043 | 23.07 +/- 3.99 | 1.17 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | |
| f=8, VQ (Z=256, d=4) | 1.49 | 1608649 | 22.35 +/- 3.81 | 1.26 +/- 0.37 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | |
| f=16, VQ (Z=16384, d=8) | 5.15 | 1101166 | 20.83 +/- 3.61 | 1.73 +/- 0.43 | https://heibox.uni-heidelberg.de/f/0e42b04e2e904890a9b6/?dl=1 | |
| f=4, KL | 0.27 | 176991 | 27.53 +/- 4.54 | 0.55 +/- 0.24 | https://ommer-lab.com/files/latent-diffusion/kl-f4.zip | |
| f=8, KL | 0.90 | 246803 | 24.19 +/- 4.19 | 1.02 +/- 0.35 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | |
| f=16, KL (d=16) | 0.87 | 442998 | 24.08 +/- 4.22 | 1.07 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f16.zip | |
| f=32, KL (d=64) | 2.04 | 406763 | 22.27 +/- 3.93 | 1.41 +/- 0.40 | https://ommer-lab.com/files/latent-diffusion/kl-f32.zip | |

Get the models

Running the following script downloads and extracts all available pretrained autoencoding models.

bash scripts/download_first_stages.sh

The first stage models can then be found in models/first_stage_models/<model_spec>

Pretrained LDMs

| Dataset | Task | Model | FID | IS | Prec | Recall | Link | Comments |
|---|---|---|---|---|---|---|---|---|
| CelebA-HQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 5.11 (5.11) | 3.29 | 0.72 | 0.49 | https://ommer-lab.com/files/latent-diffusion/celeba.zip | |
| FFHQ | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 4.98 (4.98) | 4.50 (4.50) | 0.73 | 0.50 | https://ommer-lab.com/files/latent-diffusion/ffhq.zip | |
| LSUN-Churches | Unconditional Image Synthesis | LDM-KL-8 (400 DDIM steps, eta=0) | 4.02 (4.02) | 2.72 | 0.64 | 0.52 | https://ommer-lab.com/files/latent-diffusion/lsun_churches.zip | |
| LSUN-Bedrooms | Unconditional Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=1) | 2.95 (3.0) | 2.22 (2.23) | 0.66 | 0.48 | https://ommer-lab.com/files/latent-diffusion/lsun_bedrooms.zip | |
| ImageNet | Class-conditional Image Synthesis | LDM-VQ-8 (200 DDIM steps, eta=1) | 7.77 (7.76)* / 15.82** | 201.56 (209.52)* / 78.82** | 0.84* / 0.65** | 0.35* / 0.63** | https://ommer-lab.com/files/latent-diffusion/cin.zip | *: w/ guiding, classifier_scale 10; **: w/o guiding; scores in bracket calculated with script provided by ADM |
| Conceptual Captions | Text-conditional Image Synthesis | LDM-VQ-f4 (100 DDIM steps, eta=0) | 16.79 | 13.89 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/text2img.zip | finetuned from LAION |
| OpenImages | Super-resolution | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip | BSR image degradation |
| OpenImages | Layout-to-Image Synthesis | LDM-VQ-4 (200 DDIM steps, eta=0) | 32.02 | 15.92 | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/layout2img_model.zip | |
| Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis256.zip | |
| Landscapes | Semantic Image Synthesis | LDM-VQ-4 | N/A | N/A | N/A | N/A | https://ommer-lab.com/files/latent-diffusion/semantic_synthesis.zip | finetuned on resolution 512x512 |

Get the models

The LDMs listed above can jointly be downloaded and extracted via

bash scripts/download_models.sh

The models can then be found in models/ldm/<model_spec>.

Coming Soon...

Comments

BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{https://doi.org/10.48550/arxiv.2204.11824,
  doi = {10.48550/ARXIV.2204.11824},
  url = {https://arxiv.org/abs/2204.11824},
  author = {Blattmann, Andreas and Rombach, Robin and Oktay, Kaan and Ommer, Björn},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Retrieval-Augmented Diffusion Models},
  publisher = {arXiv},
  year = {2022},  
  copyright = {arXiv.org perpetual, non-exclusive license}
}


latent-diffusion's People

Contributors

ak391, crowsonkb, pesser, rromb

latent-diffusion's Issues

Text2Image Training

Maybe I'm an idiot, but if it's not already somewhere in the repo, are you intending to release the script you used to train the text2image model? And what specs were needed to train it in the first place? Thanks!

Colab Notebook example fails on weight inputs for conv.py

Was attempting to run the Colab for this project to gauge functionality. Here is the error it threw when running the model:

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    441                             _pair(0), self.dilation, self.groups)
    442         return F.conv2d(input, weight, bias, self.stride,
--> 443                         self.padding, self.dilation, self.groups)
    444 
    445     def forward(self, input: Tensor) -> Tensor:

RuntimeError: Given groups=1, weight of size [128, 3, 3, 3], expected input[1, 4, 128, 128] to have 3 channels, but got 4 channels instead

I have seen this before when there is a mismatch in PyTorch libraries. Are we sure the dependencies are accurate?

how can I train with semantic

Excuse me, I want to train the LDM with semantic conditioning, but I cannot find the appropriate dataloaders (they may be 'landscapes.RFWTrain' and 'RFWValidation'). Will the dataloaders be released later, or can I find them somewhere else?

Conda env yaml should be changed

Thanks for the great research. I found a conda env settings error while using the scripts below.

conda env create -f environment.yaml
conda activate ldm
# environment.yaml
name: ldm
...
dependencies:
  ...
  - pytorch=1.7.0
  - torchvision=0.8.1
  - pip:
    ...
    - pytorch-lightning==1.4.2
    ...

pytorch-lightning==1.4.2 automatically runs "from torchmetrics.utilities.data import get_num_classes as _get_num_classes", but that function was dropped by this PR.

So the yaml file should be changed, either by updating pytorch, torchvision, and pytorch-lightning, or by adding an explicit torchmetrics version.
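A minimal sketch of the second option, assuming the torchmetrics pin used for the RDM environment above also works here (the exact version may need adjusting):

# environment.yaml
dependencies:
  ...
  - pip:
    ...
    - pytorch-lightning==1.4.2
    - torchmetrics==0.6.0
    ...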


Error Log

(ldm) ubuntu@nipa2021-19981:~/jwk/latent-diffusion$ python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  
Loading model from models/ldm/text2img-large/model.ckpt
Traceback (most recent call last):
  File "scripts/txt2img.py", line 101, in <module>
    model = load_model_from_config(config, "models/ldm/text2img-large/model.ckpt")  # TODO: check path
  File "scripts/txt2img.py", line 18, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "/home/ubuntu/jwk/latent-diffusion/ldm/util.py", line 78, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/ubuntu/jwk/latent-diffusion/ldm/util.py", line 86, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ubuntu/jwk/latent-diffusion/ldm/models/diffusion/ddpm.py", line 12, in <module>
    import pytorch_lightning as pl
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics, void
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 22, in <module>
    from torchmetrics.utilities.data import get_num_classes as _get_num_classes
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/torchmetrics/utilities/data.py)

Evaluating first stage autoencoders

First of all, thank you so much for making this high-quality repository as well as pretrained models publicly available! This is highly useful for exploring your research.

I am currently training first stage autoencoders on a custom dataset (SoundCloud images) and am struggling with evaluating these models (other than with the loss values logged in TensorBoard). I plan to compare the performance of initializing the autoencoder weights randomly vs. fine-tuning one of your pretrained autoencoders.

I would prefer to calculate rFID, PSNR, and PSIM the same way as you did for your results table. Could you please provide a hint as to how you evaluate your autoencoders? Is there some other repository or toolkit that you rely on?

terminate called after throwing an instance of 'c10::Error'

I am playing with ldm.models.diffusion.ddpm.LatentDiffusion with 4 GPUs and DDP distribution. After around 30 epochs, it stopped,

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1603729096996/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f082820c8b2 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1070 (0x7f082845ef20 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f08281f7b7d in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: + 0x5f65b2 (0x7f08725575b2 in /root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
[frames #4 through #63, passing through the CPython interpreter and numpy.random, omitted]

Epoch 37: 69%|▋| 227/328 [18:34<08:13, 4.89s/it, loss=0.794, v_num=2, train/loss_simple_step=0.792, train/loss_vlb_step=0.0081, tra
Traceback (most recent call last):
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
super().run(batch, batch_idx, dataloader_idx)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
model_ref.optimizer_step(
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/optim/adamw.py", line 65, in step
loss = closure()
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 537, in training_step_and_backward
result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 307, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 383, in training_step
return self.model(*args, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 343, in training_step
loss, loss_dict = self.shared_step(batch)
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 887, in shared_step
x, c = self.get_input(batch, self.first_stage_key)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 661, in get_input
z = self.get_first_stage_encoding(encoder_posterior).detach()
File "/root/Desktop/ldm/ldm/models/diffusion/ddpm.py", line 544, in get_first_stage_encoding
z = encoder_posterior.sample()
File "/root/Desktop/ldm/ldm/modules/distributions/distributions.py", line 36, in sample
x = self.mean + self.std * torch.randn(self.mean.shape).to(device=self.parameters.device)
File "/root/miniconda3/envs/ldm/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21388) is killed by signal: Aborted.

I am sure it is related to this issue, but I was unable to fix it by setting rank_zero_only=True.

Any help is appreciated

config file for conditional LDM

thank you for sharing this great work.

Where can I find the config files for these conditional tasks: Text-conditional Image Synthesis, Super-resolution, Layout-to-Image Synthesis, and Semantic Image Synthesis?

The download links are merely ckpt files, and the config files at configs/latent-diffusion are all for unconditional tasks.

Same seed produces different outputs

Prompt: winter, sunrise, path in the forest, painted by Caspar David Friedrich (royaltyfree)
Steps: 50
ETA: 0
Iterations: 1
Width: 384
Height: 256
Samples_in_parallel: 1
Diversity_scale: 5
PLMS_sampling: 0
Seed: 1

Download (27)

re-run with same parameters:

Download (26)

The stability of training

We use our own data, which only contains faces.
However, when we train the LDM, we find that the loss does not decrease. The loss == 0.798.

Epoch 5: 8%|▊ | 107/1422 [00:57<11:36, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.800, train/loss_vlb_step=0.0366, train/loss_step=0.800, global_step=6675.0, lr_abs=0.0032, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 107/1422 [00:57<11:36, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.799, train/loss_vlb_step=0.00406, train/loss_step=0.799, global_step=6676.0, lr_abs=0.0032, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 108/1422 [00:57<11:35, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.799, train/loss_vlb_step=0.00406, train/loss_step=0.799, global_step=6676.0, lr_abs=0.0032, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 108/1422 [00:57<11:35, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.799, train/loss_vlb_step=0.0137, train/loss_step=0.799, global_step=6677.0, lr_abs=0.0032, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 109/1422 [00:58<11:35, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.799, train/loss_vlb_step=0.0137, train/loss_step=0.799, global_step=6677.0, lr_abs=0.0032, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 109/1422 [00:58<11:35, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.795, train/loss_vlb_step=0.00617, train/loss_step=0.795, global_step=6678.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 110/1422 [00:58<11:35, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.795, train/loss_vlb_step=0.00617, train/loss_step=0.795, global_step=6678.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 110/1422 [00:58<11:35, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.798, train/loss_vlb_step=0.00474, train/loss_step=0.798, global_step=6679.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 111/1422 [00:59<11:34, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.798, train/loss_vlb_step=0.00474, train/loss_step=0.798, global_step=6679.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 111/1422 [00:59<11:34, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.796, train/loss_vlb_step=0.00421, train/loss_step=0.796, global_step=6680.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 112/1422 [00:59<11:34, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.796, train/loss_vlb_step=0.00421, train/loss_step=0.796, global_step=6680.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 112/1422 [00:59<11:34, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.797, train/loss_vlb_step=0.00464, train/loss_step=0.797, global_step=6681.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 113/1422 [01:00<11:34, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.797, train/loss_vlb_step=0.00464, train/loss_step=0.797, global_step=6681.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]
Epoch 5: 8%|▊ | 113/1422 [01:00<11:34, 1.89it/s, loss=0.798, v_num=0, train/loss_simple_step=0.794, train/loss_vlb_step=0.00373, train/loss_step=0.794, global_step=6682.0, lr_abs=0.00321, train/loss_simple_epoch=0.748, train/loss_vlb_epoch=0.0059, train/loss_epoch=0.748]

Thanks for comments

add web demo/models to Huggingface

Hi, would you be interested in adding latent-diffusion to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models/datasets/spaces (web demos) can be added to a user account or organization, similar to GitHub.

Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft
Facebook: https://huggingface.co/facebook

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

and here are guides for adding spaces/models/datasets to your org

How to add a Space: https://huggingface.co/blog/gradio-spaces
how to add models: https://huggingface.co/docs/hub/adding-a-model
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

Question about training stability

Hello, thank you so much for this wonderful paper and codebase. I am trying to reproduce the results of lsun_churches-ldm-kl-8.yaml. I have not modified any parameters in the config and I am using your pretrained first stage model.

However, some part of training is not working correctly -- the losses are not decreasing as expected.

My loss curves are below:
Loss curves

Do you know what might be going wrong here? I feel like I have done something incorrectly, but I believe that I followed the instructions closely.

Thank you for your help!

Error in the configuration files for SIS task

Hi,

I notice that the config.yaml files in models/ldm/semantic_synthesis256 and models/ldm/semantic_synthesis512 follow the configuration of LDM-4.
However, on p. 25 and p. 5 of your paper, you state that the LDM for semantic image synthesis is LDM-8.
Although I can change the hyperparameters for the models thanks to the detailed description, I still need the hyperparameters related to training/optimizing the model.
Could you provide the correct configuration files?

Constraining the output to within the borders?

(Might be able to be solved as part of #34 where e.g. transparent areas are forbidden?)

I'm generating movie posters / book covers / etc. and most of the time, the output is off the edge of the image (see attachment.)

Would be super if there was a way to hint / constrain the output - it shouldn't have seen anything cut-off like that in the training sets, I think? VQGAN-CLIP doesn't have this issue (but also isn't generating as good output in as quick a time which is why I'd prefer to use LD.)

000544_BROGUE_NATION_in_the_style_of_a_1950s_book_cover_cl1oz9co00003ucobpewjzwmd_s9 0_3x2

License

Thank you for the awesome work!
What is the license the models and code are released under?

Allow inference on CPU...

Tried to allocate 128.00 MiB (GPU 0; 7.79 GiB total capacity; 6.19 GiB already allocated; 81.44 MiB free; 6.34 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.

Setting CUDA_VISIBLE_DEVICES to -1 to force CPU results in no CUDA devices being found.

Please allow CPU inference at the expense of time.

memory problem

Hello.
How much memory do you need to run?

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 6.00 GiB total capacity; 4.78 GiB already allocated; 0 bytes free; 4.82 GiB reserved in total by PyTorch)

Any solution to run with a small amount of memory?

question about disc_start

Thanks for this great work.

I am trying to train VQ models with custom data and only realized that disc_start in the VQ configs is very different,
for example,
vq-f8-n256 disc_start: 250001
vq-f8 disc_start: 1

Are there any particular reasons that the discriminator could start from 1?

Reproducing inpainting results

Hi,
thanks for this great repo! I was trying to reproduce the inpainting results on the example images and obtained noticeable artifacts (see the attached images).

Do you have an idea what could be the reason? I am running:
python scripts/inpaint.py --indir data/inpainting_examples/ --outdir outputs/inpainting_results

pip error

When creating the ldm environment, pip encounters the following error:

ERROR: Requested clip from git+https://github.com/openai/CLIP.git@main#egg=cli (from -r requirements.txt (line 14)) has different name in metadata: 'clip'

Model size

Thanks for your interesting work.
How does the model size of LDM compare with StyleGAN and ProjectedGAN?

All checkpoints and links on ommer-lab 404 not found

Hey thanks for all your work and the excellent readme!

There seems to be an issue with all files having moved or disappeared from https://ommer-lab.com/files/latent-diffusion/*; all the links are 404-ing now. The heibox links still work fine.

If the hosting is going to be an issue, it would be nice if the checkpoints were all uploaded as a GitHub release on this repo (https://github.com/CompVis/latent-diffusion/releases); that way GitHub will cover the hosting indefinitely and it doesn't have to be a worry for any future maintenance. See SwinIR's release as an example: https://github.com/JingyunLiang/SwinIR/releases

I'm super interested in text2img so hopefully these can be restored 😃.
Thanks again!

cannot load vq-f4 model

All the vq models work for me except the first one at https://ommer-lab.com/files/latent-diffusion/vq-f4.zip

using this config:

model:
  base_learning_rate: 4.5e-06
  target: ldm.models.autoencoder.VQModel
  params:
    embed_dim: 3
    n_embed: 8192
    monitor: val/rec_loss
    ddconfig:
      double_z: false
      z_channels: 3
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult:
      - 1
      - 2
      - 4
      num_res_blocks: 2
      attn_resolutions: []
      dropout: 0.0
    lossconfig:
      target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
      params:
        disc_conditional: false
        disc_in_channels: 3
        disc_start: 0
        disc_weight: 0.75
        codebook_weight: 1.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 8
    num_workers: 16
    wrap: true
    train:
      target: ldm.data.openimages.FullOpenImagesTrain
      params:
        crop_size: 256
    validation:
      target: ldm.data.openimages.FullOpenImagesValidation
      params:
        crop_size: 256

code:

# imports needed to run this snippet standalone
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

config = OmegaConf.load('./vq-f4/config.yaml')
pl_sd = torch.load('./vq-f4/model.ckpt', map_location="cpu")
sd = pl_sd["state_dict"]
ldm = instantiate_from_config(config.model)
ldm.load_state_dict(sd, strict=False)

error:

RuntimeError: Error(s) in loading state_dict for VQModel:
	size mismatch for encoder.down.1.block.0.conv1.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 128, 3, 3]).
	size mismatch for encoder.down.1.block.0.conv1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.0.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.0.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.0.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for encoder.down.1.block.0.conv2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.1.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.1.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.1.conv1.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for encoder.down.1.block.1.conv1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.1.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.1.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.block.1.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for encoder.down.1.block.1.conv2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.1.downsample.conv.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for encoder.down.1.downsample.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.2.block.0.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.2.block.0.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for encoder.down.2.block.0.conv1.weight: copying a param with shape torch.Size([256, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 256, 3, 3]).
	size mismatch for encoder.down.2.block.0.conv1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.0.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.0.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for encoder.down.2.block.0.conv2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.0.nin_shortcut.weight: copying a param with shape torch.Size([256, 128, 1, 1]) from checkpoint, the shape in current model is torch.Size([512, 256, 1, 1]).
	size mismatch for encoder.down.2.block.0.nin_shortcut.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.1.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.1.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.1.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for encoder.down.2.block.1.conv1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.1.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.1.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.down.2.block.1.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for encoder.down.2.block.1.conv2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for encoder.conv_out.weight: copying a param with shape torch.Size([8, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([3, 512, 3, 3]).
	size mismatch for encoder.conv_out.bias: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([3]).
	size mismatch for decoder.conv_in.weight: copying a param with shape torch.Size([512, 8, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 3, 3, 3]).
	size mismatch for decoder.up.0.block.0.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.0.block.0.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.0.block.0.conv1.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([128, 256, 3, 3]).
	size mismatch for decoder.up.1.block.0.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.1.block.0.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.1.block.0.conv1.weight: copying a param with shape torch.Size([128, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 512, 3, 3]).
	size mismatch for decoder.up.1.block.0.conv1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.0.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.0.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.0.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for decoder.up.1.block.0.conv2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.0.nin_shortcut.weight: copying a param with shape torch.Size([128, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 512, 1, 1]).
	size mismatch for decoder.up.1.block.0.nin_shortcut.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.1.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.1.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.1.conv1.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for decoder.up.1.block.1.conv1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.1.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.1.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.1.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for decoder.up.1.block.1.conv2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.2.norm1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.2.norm1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.2.conv1.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for decoder.up.1.block.2.conv1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.2.norm2.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.2.norm2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.block.2.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for decoder.up.1.block.2.conv2.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.1.upsample.conv.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
	size mismatch for decoder.up.1.upsample.conv.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
	size mismatch for decoder.up.2.block.0.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.0.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.0.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.block.0.conv1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.0.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.0.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.0.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.block.0.conv2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.1.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.1.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.1.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.block.1.conv1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.1.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.1.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.1.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.block.1.conv2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.2.norm1.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.2.norm1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.2.conv1.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.block.2.conv1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.2.norm2.weight: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.2.norm2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.block.2.conv2.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.block.2.conv2.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for decoder.up.2.upsample.conv.weight: copying a param with shape torch.Size([256, 256, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
	size mismatch for decoder.up.2.upsample.conv.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
	size mismatch for loss.discriminator.main.8.weight: copying a param with shape torch.Size([1, 256, 4, 4]) from checkpoint, the shape in current model is torch.Size([512, 256, 4, 4]).
	size mismatch for quantize.embedding.weight: copying a param with shape torch.Size([16384, 8]) from checkpoint, the shape in current model is torch.Size([8192, 3]).
	size mismatch for quant_conv.weight: copying a param with shape torch.Size([8, 8, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 3, 1, 1]).
	size mismatch for quant_conv.bias: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([3]).
	size mismatch for post_quant_conv.weight: copying a param with shape torch.Size([8, 8, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 3, 1, 1]).
	size mismatch for post_quant_conv.bias: copying a param with shape torch.Size([8]) from checkpoint, the shape in current model is torch.Size([3]).

Training steps for autoencoder training

Thanks for sharing your great project.

For how many steps did you train your autoencoders on ImageNet? The provided configuration files do not seem to specify the number of training steps or epochs. Sorry if this is mentioned elsewhere.

Question about some techniques

Hi @pesser!
Thank you for sharing the implementation of your wonderful work!

I have a few questions about some of the techniques used.
Could you help me with them?

I used your pretrained celeba256 weights.
The intermediate images were recorded via intermediates['x_inter'].append(img).

  1. Why do you subsample the time steps? According to this line, it seems you pick one time value every num_ddpm_timesteps // num_ddim_timesteps steps.
    I have never seen this technique before. (See the timestep-subsampling sketch after this list.)

  2. If I do not subsample and instead run T: 1000 -> 0, i.e. the time steps are contiguous and range from 1 to 1000, I cannot get clear results. This image was recorded in six separate snapshots across the 1000 iterations.
    This image was recorded using x_inter.
    [image]

This image was recorded using pred_x0:
[image]

  3. If the time steps are fixed to the default values in this line and the start time is decreased, e.g. to 800, I cannot get clear results.
    Why is that? Does your method only work well when starting from t=1000? (Actually, with t=1000 there are only 50 iterations, because the time steps are subsampled.)
    [image]

  4. If your method cannot do what is asked in question (3), does that mean it cannot perform the unique denoising technique shown in Sohl-Dickstein et al., ICML 2015? Is there any way to accomplish it?
    [image]
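
For reference, here is a minimal sketch of the kind of uniform timestep subsampling question 1 refers to. This is a simplified illustration, not necessarily the repository's exact implementation:

import numpy as np

def uniform_ddim_timesteps(num_ddim_timesteps, num_ddpm_timesteps=1000):
    # Keep one timestep every num_ddpm_timesteps // num_ddim_timesteps steps,
    # so a 1000-step DDPM schedule is reduced to, e.g., 50 DDIM steps.
    c = num_ddpm_timesteps // num_ddim_timesteps
    return np.arange(0, num_ddpm_timesteps, c)

# uniform_ddim_timesteps(50) -> array([  0,  20,  40, ..., 960, 980])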

Best regards,
Udon

autoencoder for LDM

Hi!
Could you indicate in the table which autoencoding models correspond to which LDMs, please?
Maybe I am missing this information somewhere, but it is not clear which one belongs to which.

Preventing/constraining words in the output

I asked LD for "fire" (samples=3, iter=2) and 5 of the 6 outputs had rendered some variant of the word "FIRE". Is it possible to somehow control whether text is rendered or not? Sometimes it's ideal (generating posters, books, etc.) but sometimes it ruins the render (cf "fire" above).

Or is it just a case that text from the training set is being picked up and there's nothing that can be done (other than training on non-textual sources)?

conda env settings issue

Thanks for the great research. I ran into a conda env setup error while using the commands below.

conda env create -f environment.yaml
conda activate ldm

Error Log

(ldm) ubuntu@nipa2021-19981:~/jwk/latent-diffusion$ python scripts/txt2img.py --prompt "a sunset behind a mountain range, vector image" --ddim_eta 1.0 --n_samples 1 --n_iter 1 --H 384 --W 1024 --scale 5.0  
Loading model from models/ldm/text2img-large/model.ckpt
Traceback (most recent call last):
  File "scripts/txt2img.py", line 101, in <module>
    model = load_model_from_config(config, "models/ldm/text2img-large/model.ckpt")  # TODO: check path
  File "scripts/txt2img.py", line 18, in load_model_from_config
    model = instantiate_from_config(config.model)
  File "/home/ubuntu/jwk/latent-diffusion/ldm/util.py", line 78, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/ubuntu/jwk/latent-diffusion/ldm/util.py", line 86, in get_obj_from_str
    return getattr(importlib.import_module(module, package=None), cls)
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/ubuntu/jwk/latent-diffusion/ldm/models/diffusion/ddpm.py", line 12, in <module>
    import pytorch_lightning as pl
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics, void
  File "/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 22, in <module>
    from torchmetrics.utilities.data import get_num_classes as _get_num_classes
ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/home/ubuntu/anaconda3/envs/ldm/lib/python3.8/site-packages/torchmetrics/utilities/data.py)

# environment.yaml
name: ldm
...
dependencies:
  ...
  - pytorch=1.7.0
  - torchvision=0.8.1
  - pip:
    ...
    - pytorch-lightning==1.4.2
    ...

The problem is that pytorch-lightning==1.4.2 automatically does from torchmetrics.utilities.data import get_num_classes as _get_num_classes, but that function was dropped by this PR.

So the yaml file should be changed, either by updating pytorch, torchvision, and pytorch-lightning, or by pinning an explicit torchmetrics version.

I solved it by updating pytorch-lightning.

Text + partial image prompting

Hi !

In Dall-E, we can provide a partial image in addition to the text description so that the model only completes the image. See:

[image]

Can we do the same with your models? That would be awesome.
I tried to modify the LAION-400M model notebook but without much success.

super resolution example

Thanks for the great model :).
I read the README but couldn't find a super-resolution example like scripts/inpaint.py.
It looks like super-resolution can be done with notebook_helpers.py.
Do you have plans to add an SR example script?

the stability of training (a collapsing loss)

Thanks for the great work.
I tried to train the LDM model on ImageNet with 8 V100s, but got a bad result. The loss was normal at first, but soon collapsed:

[loss curve images]

and the sampled images are all noise at 5000 steps:
[image]

How can I solve this problem? Thank you very much.

Cool!

Thank you very much for releasing the new checkpoints!

Would you mind sharing more details about the training of the text2img-large model? Did you train it on the full, unfiltered LAION-400M, and for how many epochs?
What hardware did you use, and for how long? :)

Kind regards,
Christoph Schuhmann
Organization Lead LAION

Btw, here is our new dataset, LAION-5B :-)
https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/

Non-descriptive error when sampling in Colab

When I run !python scripts/txt2img.py --prompt "a virus monster is playing guitar, oil on canvas" --ddim_eta 0.0 --n_samples 4 --n_iter 4 --scale 5.0 --ddim_steps 50

It produces

Loading model from models/ldm/text2img-large/model.ckpt
1
2
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 872.30 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
^C

I added the print(1) and print(2) statements to see where the script is failing, by modifying it like so:

def load_model_from_config(config, ckpt, verbose=False):
    print(f"Loading model from {ckpt}")
    pl_sd = torch.load(ckpt, map_location="cpu")
    print(1)
    sd = pl_sd["state_dict"]
    print(2)
    model = instantiate_from_config(config.model)
    print(3)
    m, u = model.load_state_dict(sd, strict=False)
    print(4)
    if len(m) > 0 and verbose:
        print("missing keys:")
        print(m)
    if len(u) > 0 and verbose:
        print("unexpected keys:")
        print(u)

    model.cuda()
    model.eval()
    return model

What is that ^C and how do I debug this?

The problem with the checkpoint finetuning.

Hi. First of all, thank you for this wonderful repository. I am trying to run training and have the following problem:

I downloaded a small part of the ImageNet dataset (2 GB) and unzipped it. It contained only images, so I had to modify ./ldm/data/imagenet.py a bit to be able to load my dataset. The output provides example["image"] and example["LR_image"] as required.

Then I adjusted a few lines in ./models/ldm/bsr_sr/config.yaml: in train and validation I changed target to point to my imagenet.py.

Then I downloaded your ckpt file referenced in notebook_helpers.py and decided to try to finetune the weights.

CUDA_VISIBLE_DEVICES=0 python main.py \
    --base "./models/ldm/bsr_sr/config.yaml" \
    --name "test" \
    --resume_from_checkpoint "./logs/diffusion/superresolution_bsr/last.yaml/?dl=1" \
    -t --gpus=0

But I got an error:

RuntimeError: Error(s) in loading state_dict for LatentDiffusion:
Unexpected key(s) in state_dict: "ddim_sigmas", "ddim_alphas", "ddim_alphas_prev", "ddim_sqrt_one_minus_alphas".

If I load the weights, delete those 4 keys and write them to a new file, the training starts fine. Do I understand correctly that training will not work well without them? If I start training from scratch, the resulting checkpoints do not contain these 4 keys at all. Can you tell me what I'm doing wrong?
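
For reference, a minimal sketch of the key-stripping workaround described above, assuming a Lightning-style checkpoint with a "state_dict" entry (the paths are placeholders):

import torch

ckpt_path = "path/to/last.ckpt"  # placeholder path to the downloaded checkpoint
ddim_buffer_keys = ["ddim_sigmas", "ddim_alphas",
                    "ddim_alphas_prev", "ddim_sqrt_one_minus_alphas"]

ckpt = torch.load(ckpt_path, map_location="cpu")
sd = ckpt["state_dict"]
# Drop the DDIM buffers the training model does not expect; they are presumably
# registered by the DDIM sampler for inference and recomputed there anyway.
for k in ddim_buffer_keys:
    sd.pop(k, None)
torch.save(ckpt, "path/to/last_stripped.ckpt")  # placeholder output path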


And another small question: I separately trained the autoencoder (first_stage_models) and got a checkpoint, but I can't find where to specify it when training the diffusion model (LDM). Perhaps the autoencoder is not involved in that step, but then where do I specify it if I want to run inference with my weights?

Render between two images

Is it possible to generate a sequence of images between two prompts to realize keyframe animation?
The basic idea is to render a set of frames using keyframes / multiple prompts.

If not, is it possible to dump the intermediate step images?
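
One common approach (not an official feature of this repository) is to interpolate between the text conditionings of the two prompts and sample one frame per interpolation step. A minimal sketch of spherical interpolation between two conditioning tensors, assuming cond_a and cond_b come from the model's text encoder:

import torch

def slerp(t, a, b, eps=1e-7):
    # Spherical interpolation between two conditioning tensors of the same shape.
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_omega = torch.dot(a_flat / a_flat.norm(), b_flat / b_flat.norm())
    omega = torch.arccos(torch.clamp(cos_omega, -1 + eps, 1 - eps))
    so = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

# Hypothetical usage: cond_a and cond_b are the conditionings for the two prompts;
# sample one image per interpolated conditioning to get the in-between frames.
# for t in torch.linspace(0, 1, steps=20):
#     cond_t = slerp(t, cond_a, cond_b)
#     ...run the usual sampler with cond_t...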

Thank you.

Models on COCO dataset

Hi, thank you for the wonderful paper and available code. The unconditional models you have released are all trained on very specific datasets, like faces and buildings. I know it was not mentioned in the paper, but did you by any chance try to train an unconditional model on the COCO dataset, which contains a wider variety of objects? If you did, what does the performance look like, and could you share the pre-trained model?

Thank you!

Details about training super resolution model

Hi @rromb, @ablattmann, @pesser, and thank you for making your great work publicly available.

Could you please supply the code for the class ldm.data.openimages.SuperresOpenImagesAdvancedTrain/Validation to train your model for super-resolution, as required in bsr_sr/config.yaml (see this line)?
Otherwise, some more information about how to train the SR model with datasets not included in your repository would be very helpful.
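
In the meantime, a hypothetical stand-in dataset (not the authors' class) only has to return example["image"] and example["LR_image"]. A minimal sketch, assuming RGB images on disk and the common convention of float32 values scaled to [-1, 1]:

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class SuperresFolderDataset(Dataset):
    # Hypothetical placeholder for the missing OpenImages SR dataset class.
    def __init__(self, image_paths, size=256, downscale_f=4):
        self.paths = image_paths
        self.size = size
        self.lr_size = size // downscale_f

    def __len__(self):
        return len(self.paths)

    def _to_array(self, img):
        # Assumed convention: float32 HWC array scaled to [-1, 1].
        return (np.array(img).astype(np.float32) / 127.5) - 1.0

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert("RGB")
        img = img.resize((self.size, self.size), Image.BICUBIC)
        lr = img.resize((self.lr_size, self.lr_size), Image.BICUBIC)
        return {"image": self._to_array(img), "LR_image": self._to_array(lr)}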

Thank you very much!

[Question] Is it possible to gradually diffuse/transform one given real image to another using diffusion model?

Thanks for this great work. I'm quite interested in the possible applications of the (latent) diffusion model proposed in this impressive paper. Your work has shown many promising applications of this newly emerging generative modeling approach. However, I have another question that has bothered me for several days, and it would be great if you could give some advice or suggestions on it. The problem is actually an open one, and it is detailed below.

Question
Given an initial image (e.g. a 256x256 image of a red dog) as the starting point, can we use a diffusion model to gradually diffuse/transform it until it satisfies the target condition (e.g. a text prompt of "a yellow cat"), so that the final image depicts "a yellow cat"?

Difficulty
As we know, the diffusion model assumes that the starting point of sampling is drawn from a Gaussian distribution. In our situation that is not the case: the initial image is a real image, which I think breaks this assumption.
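
One common workaround (not confirmed by the authors here) is to only partially noise the real starting image with the forward process and then run the reverse process from that intermediate timestep under the new condition. A minimal sketch of the forward-noising step, using the standard closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise; x0 (the encoded real image) and alphas_cumprod are assumed to come from your own diffusion setup:

import torch

def partially_noise(x0, alphas_cumprod, t_start):
    # Jump the (encoded) real image to an intermediate timestep t_start instead of
    # starting the reverse process from pure Gaussian noise.
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Hypothetical usage:
# x_t = partially_noise(x0, alphas_cumprod, t_start=600)
# then denoise from t_start down to 0 with the new conditioning ("a yellow cat").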

I have tried to implement this idea in a naive way, but it does not seem to work; it generates blurry, vague results.
It would be great if you could give some advice or suggestions on this problem. Thank you!!
