antixk / pytorch-vae
A Collection of Variational Autoencoders (VAE) in PyTorch.
License: Apache License 2.0
The code fails for general values of hidden dims, image sizes, etc. due to shape mismatches. Since image size and hidden dims are parameters anyway, can you please increase the flexibility in other parts of the code? This will allow it to be used as it is on other datasets and different architectures.
In experiment.py you define M_N like this
train_loss = self.model.loss_function(*results,
                                      M_N=self.params['batch_size'] / self.num_train_imgs,
                                      optimizer_idx=optimizer_idx,
                                      batch_idx=batch_idx)
M_N is then used as a weighting factor for the KLD term in the VAE implementations. Why does the weight depend on the ratio of the batch size and dataset size?
I saw other literature that uses the ratio of the latent dimension and the input dimension instead. I am not sure which is correct.
Related issues without answers (or not directly answering my question):
#11 (says that it is to correct variances caused by small batches)
#23
#35
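To make the quantity being asked about concrete, here is a minimal sketch of how the ratio is computed and how it enters the loss. The batch size, dataset size and loss values are made-up illustrative numbers, and `kld_weight_from_batch` and `weighted_vae_loss` are hypothetical helper names, not functions from the repository.

```python
def kld_weight_from_batch(batch_size: int, num_train_imgs: int) -> float:
    """Hypothetical helper: the M_N ratio passed to loss_function."""
    return batch_size / num_train_imgs


def weighted_vae_loss(recons_loss: float, kld_loss: float, kld_weight: float) -> float:
    """Combine the two terms as recons_loss + kld_weight * kld_loss."""
    return recons_loss + kld_weight * kld_loss


# Illustrative numbers only (not taken from a real run).
m_n = kld_weight_from_batch(batch_size=100, num_train_imgs=10_000)
loss = weighted_vae_loss(recons_loss=0.05, kld_loss=30.0, kld_weight=m_n)
print(m_n)   # 0.01
print(loss)
```

With this weighting, the KL term's contribution shrinks as the dataset grows relative to the batch size, which is the behaviour the question is about.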
I have checked previous issues about the image size problem. You mentioned that
model = *VAE*(<in_chanels>, <latent_dim>, hidden_dims=[16, 32, 64, 128, 256, 512])
would increase the image size to 128. To make it 256, should we use [8, 16, 32, 64, 128, 256, 512] as well?
Also, why does this change the image size? I don't understand. I learned about this from issue #29.
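A rough way to see why the length of hidden_dims could be tied to the image size: assuming each entry adds one convolution with kernel_size=3, stride=2, padding=1 (so every layer halves the spatial size), and assuming the layer after the encoder expects a 2x2 feature map, the input size has to be 2 * 2**len(hidden_dims). Both assumptions concern the repository's encoder and should be checked against the model file; `conv_out_size` and `encoder_out_size` are hypothetical helper names.

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_out_size(img_size: int, hidden_dims: list) -> int:
    """Spatial size left after one stride-2 convolution per entry of hidden_dims."""
    size = img_size
    for _ in hidden_dims:
        size = conv_out_size(size)
    return size


print(encoder_out_size(64, [32, 64, 128, 256, 512]))            # 2
print(encoder_out_size(128, [16, 32, 64, 128, 256, 512]))       # 2
print(encoder_out_size(256, [8, 16, 32, 64, 128, 256, 512]))    # 2
print(encoder_out_size(128, [32, 64, 128, 256, 512]))           # 4
```

Under these assumptions, a 128-pixel input with only five entries leaves a 4x4 map, which would not fit a linear layer sized for a 2x2 map. That would be consistent with the shape mismatches reported on this page.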
Hi, is there any way to change the image size from 64 to 128 easily?
Hi,
When I run the code, I encounter the following error. Would you please let me know how I can fix this issue?
python run.py -c configs/wae_mmd_imq.yaml
Traceback (most recent call last):
File "run.py", line 44, in <module>
runner = Trainer(default_save_path=f"{tt_logger.save_dir}",
TypeError: __init__() got an unexpected keyword argument 'default_save_path'
According to the original paper by Kim et al., the permutation function permutes across the batch for each dimension. In the case here, if B, D = z.size(), then def permute_latent(self, z: Tensor) should permute z along the dimension of B, i.e., z[i, j] = z[new_indices[i], j], where new_indices = torch.randperm(B).
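A sketch of the per-dimension permutation this comment describes, drawing a separate torch.randperm(B) for every latent dimension. This is one reading of the description, written as a free function for illustration; it is not the repository's code.

```python
import torch
from torch import Tensor


def permute_latent(z: Tensor) -> Tensor:
    """Shuffle each latent dimension independently across the batch."""
    B, D = z.size()
    # One independent permutation of the batch indices per dimension, shape (B, D).
    new_indices = torch.stack(
        [torch.randperm(B, device=z.device) for _ in range(D)], dim=1
    )
    # gather along dim 0: out[i, j] = z[new_indices[i, j], j]
    return torch.gather(z, 0, new_indices)


z = torch.arange(12, dtype=torch.float32).reshape(4, 3)
z_perm = permute_latent(z)
print(z_perm)
```

Each column of z_perm holds the same values as the matching column of z, only in a different batch order, so the marginals are preserved while the dependence between dimensions is broken.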
Hi, nice work.
I use a dynamic batch size in my training. Is it OK to train beta-TCVAE with a dynamic batch size, given that the start_weight calculation depends on batch_size?
PyTorch-VAE/models/betatc_vae.py
Line 177 in 8700d24
Thank you for sharing this repo!
In VanillaVAE (and maybe others) I was wondering about the choice of using tanh as the final activation for outputs (that has a range of [-1,1]) without normalizing the input images to the same range [-1,1] (in the dataset transforms).
Wouldn't this make the reconstruction loss work much harder, as opposed to either normalizing the input or using a final activation like sigmoid?
command: python run.py --config configs/vae.yaml
Then error messages pop up as the following:
INFO:root:gpu available: True, used: True
INFO:root:VISIBLE GPUS: 0
======= Training VanillaVAE =======
3099it [00:00, 8467848.92it/s]
Using downloaded and verified file: ../../shared/Data/celeba/list_attr_celeba.txt
Using downloaded and verified file: ../../shared/Data/celeba/identity_CelebA.txt
Using downloaded and verified file: ../../shared/Data/celeba/list_bbox_celeba.txt
Using downloaded and verified file: ../../shared/Data/celeba/list_landmarks_align_celeba.txt
Using downloaded and verified file: ../../shared/Data/celeba/list_eval_partition.txt
3099it [00:00, 6145696.50it/s]
Using downloaded and verified file: ../../shared/Data/celeba/list_attr_celeba.txt
Using downloaded and verified file: ../../shared/Data/celeba/identity_CelebA.txt
Using downloaded and verified file: ../../shared/Data/celeba/list_bbox_celeba.txt
Using downloaded and verified file: ../../shared/Data/celeba/list_landmarks_align_celeba.txt
Using downloaded and verified file: ../../shared/Data/celeba/list_eval_partition.txt
Traceback (most recent call last):
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/site-packages/pytorch_lightning/core/decorators.py", line 17, in _get_data_loader
value = getattr(self, attr_name)
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 591, in __getattr__
type(self).__name__, name))
AttributeError: 'VAEXperiment' object has no attribute '_lazy_train_dataloader'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/jbhuang/MyWork/vae/PyTorch-VAE/utils.py", line 17, in func_wrapper
return pl.data_loader(fn)(self)
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/site-packages/pytorch_lightning/core/decorators.py", line 20, in _get_data_loader
value = fn(self) # Lazy evaluation, done only once.
File "/home/jbhuang/MyWork/vae/PyTorch-VAE/experiment.py", line 143, in train_dataloader
download=True) #Bill
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/site-packages/torchvision/datasets/celeba.py", line 63, in __init__
self.download()
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/site-packages/torchvision/datasets/celeba.py", line 117, in download
with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/zipfile.py", line 1258, in __init__
self._RealGetContents()
File "/home/jbhuang/anaconda3/envs/vae2/lib/python3.7/zipfile.py", line 1325, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Please help, thank you.
update README
Hello and so happy to see you use Pytorch-Lightning!
Just wondering if you have already heard about the fairly new Pytorch Lightning (PL) ecosystem CI, which we would like to invite you to... You can check out our blog post about it: Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI
As you use the PL framework for your cool project, we would like to enhance your experience and offer you safe updates to our future releases. At the moment you run tests with a particular PL version, but it may accidentally happen that the next version is incompatible with your project... We do not intend to change anything on our project side, but we still have a solution: an ecosystem CI that tests both your project and our latest development head, so that we can find problems very early and prevent the release of a bad version...
What is needed to do?
What will you get?
Thank you for sharing your comprehensive and illuminating set of examples in this repository. I'm currently thinking of re-implementing a subset of these models, based on your Python implementations, using LibTorch, PyTorch's C++ frontend.
Provided I obtain some fruitful results, would you be interested in hosting some of those models here?
Hi,
I am running TC-Beta VAE on my data, and I changed my architecture to an MLP encoder and decoder. But I am getting NaN in the loss function, and it seems I am getting NaNs for log_importance_weights, log_q_z and log_prod_q_z. Should I just add an epsilon to each of these quantities before taking the log, or is there some other issue that I am missing?
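Without the modified code it is not possible to say where the NaNs come from. One common source in this kind of computation is evaluating log(sum(exp(...))) naively on very negative log-densities. The sketch below (plain Python, not the repository's implementation) contrasts the naive form with the usual max-shift trick; in PyTorch, torch.logsumexp computes the same quantity stably. The input values are illustrative.

```python
import math


def naive_logsumexp(xs):
    # exp() underflows to 0.0 for very negative inputs, so log() receives 0.0.
    return math.log(sum(math.exp(x) for x in xs))


def stable_logsumexp(xs):
    # Shift by the maximum so that the largest exponent is exp(0) = 1.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


log_densities = [-1000.0, -1001.0, -1002.0]

try:
    naive_logsumexp(log_densities)
    naive_failed = False
except ValueError:
    # math.log(0.0) raises "math domain error".
    naive_failed = True

stable = stable_logsumexp(log_densities)
print(naive_failed, stable)
```

If the NaNs persist with a stable log-sum-exp, checking the range of the predicted log-variance is another reasonable place to look.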
Hey, cool repo. Just wanted to let you know that I got the error below when I ran pip install -r requirements.txt. Not sure if others get this.
My python version: 3.8.5
My pip version: 20.0.2
Collecting pytorch-lightning==0.6.0
Using cached pytorch-lightning-0.6.0.tar.gz (95 kB)
Collecting PyYAML==5.1.2
Using cached PyYAML-5.1.2.tar.gz (265 kB)
Collecting tensorboard==2.1.0
Using cached tensorboard-2.1.0-py3-none-any.whl (3.8 MB)
Collecting tensorboardX==1.6
Using cached tensorboardX-1.6-py2.py3-none-any.whl (129 kB)
Collecting terminado==0.8.1
Using cached terminado-0.8.1-py2.py3-none-any.whl (33 kB)
Collecting test-tube==0.7.0
Using cached test_tube-0.7.0.tar.gz (20 kB)
ERROR: Could not find a version that satisfies the requirement torch==1.2.0 (from -r requirements.txt (line 7)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1)
ERROR: No matching distribution found for torch==1.2.0 (from -r requirements.txt (line 7))
I'm new to PyTorch and looking to run your work. When I set download=True in the appropriate locations in experiment.py to download the CelebA dataset, I encounter this error:
Traceback (most recent call last):
File "run.py", line 55, in <module>
runner.fit(experiment)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 602, in fit
self.single_gpu_train(model)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 470, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 796, in run_pretrain_routine
self.reset_val_dataloader(ref_model)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 154, in reset_val_dataloader
self.val_dataloaders = self.request_data_loader(model.val_dataloader)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 220, in request_data_loader
data_loader = data_loader_fx()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pytorch_lightning/core/decorators.py", line 16, in inner_fx
return fn(self)
File "/home/ubuntu/PyTorch-VAE/experiment.py", line 161, in val_dataloader
download=True),
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torchvision/datasets/celeba.py", line 63, in __init__
self.download()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torchvision/datasets/celeba.py", line 117, in download
with zipfile.ZipFile(os.path.join(self.root, self.base_folder, "img_align_celeba.zip"), "r") as f:
File "/home/ubuntu/anaconda3/lib/python3.6/zipfile.py", line 1108, in __init__
self._RealGetContents()
File "/home/ubuntu/anaconda3/lib/python3.6/zipfile.py", line 1175, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Running on an Ubuntu Deep Learning AMI instance with torch==1.3.1 and torchvision==0.4.2.
Would appreciate any help you can give! Thanks a lot.
I am using your wonderful library in my research project. There seems to be a bug in the VQ VAE model class where the reconstruction is a blank image. Is it a known bug? Can you please help me with this issue?
The error message is:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4096, 6]], which is output 0 of TBackward, is at version 2; expected version 1 instead.
The problem seems to be that a term (self.D_z_reserve) used in D_tc_loss, which was computed at the vae_loss stage, was modified somehow.
D_tc_loss = 0.5 * (F.cross_entropy(self.D_z_reserve, false_labels) + F.cross_entropy(D_z_perm, true_labels))
Details:
I computed the VAE loss and applied its update first, like this:
self.optim_VAE.zero_grad()
vae_loss.backward(retain_graph=True)
self.optim_VAE.step()
Then, when updating the discriminator:
z = z.detach()
z_perm = self.permute_latent(z)
D_z_perm = self.D(z_perm)
D_tc_loss = 0.5 * (F.cross_entropy(self.D_z_reserve, false_labels) + F.cross_entropy(D_z_perm, true_labels))
self.optim_D.zero_grad()
D_tc_loss.backward()
self.optim_D.step()
The error message described at the beginning then occurs.
When I delete the term F.cross_entropy(self.D_z_reserve, false_labels) from D_tc_loss, or change D_tc_loss into
D_tc_loss = 0.5 * (F.cross_entropy(self.D_z_reserve.detach(), false_labels) + F.cross_entropy(D_z_perm, true_labels))
everything works.
But I'm not sure whether using .detach() here is fine, and I'm wondering what exactly the problem is. Waiting for your reply, thanks a lot.
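One plausible reading (an assumption, since the full training loop is not shown) is that self.D_z_reserve still belongs to the graph built before optim_VAE.step(), and that the step updates weights in place, so backpropagating through that old graph trips the version check. A sketch of a discriminator update that avoids reusing the old graph by recomputing the discriminator output on the detached z; `enc`, `D` and the loss used for the VAE step are stand-ins, not the poster's actual modules.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in modules; the real encoder and discriminator are not shown in the post.
enc = nn.Linear(8, 4)
D = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optim_VAE = torch.optim.Adam(enc.parameters(), lr=1e-3)
optim_D = torch.optim.Adam(D.parameters(), lr=1e-3)

x = torch.randn(32, 8)
B = x.size(0)
false_labels = torch.zeros(B, dtype=torch.long)
true_labels = torch.ones(B, dtype=torch.long)

# VAE step (only a TC-style stand-in term is used as the loss here).
z = enc(x)
D_z = D(z)
vae_loss = (D_z[:, 0] - D_z[:, 1]).mean()
optim_VAE.zero_grad()
vae_loss.backward()
optim_VAE.step()
enc_grad_after_vae_step = enc.weight.grad.clone()

# Discriminator step: build a fresh graph from the detached latent codes.
z = z.detach()
z_perm = z[torch.randperm(B)]  # placeholder for self.permute_latent(z)
D_z_fresh = D(z)
D_z_perm = D(z_perm)
D_tc_loss = 0.5 * (F.cross_entropy(D_z_fresh, false_labels)
                   + F.cross_entropy(D_z_perm, true_labels))
optim_D.zero_grad()
D_tc_loss.backward()
optim_D.step()
```

Note that detaching the discriminator output itself, as in the self.D_z_reserve.detach() variant, turns that cross-entropy term into a constant with respect to the discriminator's parameters, so only the permuted term would contribute gradients. Recomputing the output on a detached z keeps both terms trainable.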
It should be possible to sample
Hi, thanks for this repository. I noticed while running that the train loss numbers are below 0.1 after an epoch, but the validation loss ranges from 20-30 (for Vanilla VAE, but the mismatch holds across a few other models with the same base, like CVAE). I think this mismatch was introduced by #2: the division for M_N now uses a different denominator in train vs. val. The original intent was to get around val being run for the first time before train in the newer versions of lightning, but I don't think this is correct. Instead, one workaround is to run Trainer.fit with num_sanity_val_steps=0; this way num_train_steps is set before validation is run, so the train loss and val loss are back on the same scale. By doing this, I get similar numbers for both.
Am I misinterpreting something? Please let me know if I'm incorrect/the train and val losses should be very different. I'm not sure I understand the different scaling. Although the previous fix gets the two terms to be comparable again, I'm not really sure why we reweight the KL term in the first place - any insight would be appreciated.
Hi,
In the VAE paper (https://arxiv.org/pdf/1312.6114.pdf), the VAE loss function has no additional weight parameter for the KLD loss:
However, in the implementation of the Vanilla VAE model, the loss function is written as below:
loss = recons_loss + kld_weight * kld_loss
When I set "kld_weight" to 1 in my model, it could not learn how to reconstruct the images. If I understand correctly, the "kld_weight" reduces the effect of the KLD loss to balance it with the reconstruction loss. However, as I mentioned, it is not defined in the VAE paper. Could anyone please explain to me why this parameter is used and why it is set to 0.00025 by default?
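For reference, the per-datapoint objective from that paper, written as a loss to minimise, can be set next to the weighted form quoted above. Here \beta stands for kld_weight and \mathcal{L}_{\text{recons}} for recons_loss; the notation is a sketch and is not copied from the paper.

```latex
\mathcal{L}_{\text{VAE}}(x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
    + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)

\mathcal{L}_{\text{weighted}}(x)
  = \mathcal{L}_{\text{recons}}(x)
    + \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The two coincide when \beta = 1 and the reconstruction term is the full negative log-likelihood. If the reconstruction term is instead averaged over pixels (an assumption about the implementation that should be checked in the model file), its scale changes, and a weight below 1 on the KL term partly compensates for that.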
I am trying to reproduce published results on the CIFAR-10 dataset. My results currently do not look good using the default parameters for e.g. Vanilla VAE (and others). I.e. the model learns something very blurry, similar to e.g. https://bjlkeng.github.io/images/m2_images.png
Any suggestions on how to improve these? For example, I noticed that reducing the weight of the KL loss already makes a big difference.
I've been running into this error when using your package when running on the newest master version of lightning. Running this command:
python run.py -c configs/cvae.yaml
results in ultimate error message:
AttributeError: 'VAEXperiment' object has no attribute 'num_train_imgs'
This is with Python 3.7.
When I downgrade to Lightning version 0.6.0 the same command works.
Hi,
When using beta_VAE for my own dataset, I'm not sure how I could set values for gamma and max_capacity. Should I just use the default one? Or is there any rule for setting them? Does anyone have a sense or explanation of this? Thank you!
Hello @AntixK !
Thanks for sharing this helpful repository.
I am a beginner in VAE implementation and hence, had a confusion related to vanilla_vae:
Could you please provide me insight on these?
Thank you, and have a nice day!
Hi, I was wondering why the img_size parameter for experiments is set to 64, as the CelebA dataset images are much larger.
I am trying to experiment on a different dataset whose images are larger than CelebA's (512 x 512), and was wondering if I should change img_size to get better reconstructed images. I tried changing it myself, but I ran into size mismatch issues even though I don't see where the size of 64 comes into play in the model file.
Any help would be much appreciated.
Hello,
I used VanillaVAE to reconstruct game images, but I failed to do so.
You can see the images: the first row shows the original images, and the lower row shows the reconstructed images.
The background of the image is reconstructed perfectly, but the key object is not. Do you have any suggestions?
Thank you!
Hi, I am new to Python programming and still new to writing programs in general, so I find it a bit messy to manage the folders and packages. While running the code, I ran into the following error.
pytorch_lightning.utilities.debugging.MisconfigurationException:
You requested GPUs: [0]
But your machine only has: []
May I ask whether there is any way to solve it? Thanks a lot.
The computation of the gaussian kernel is missing a sign in the exponent:
PyTorch-VAE/models/mssim_vae.py
Line 204 in 8700d24
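For comparison, a 1D Gaussian window with the negative sign in the exponent can be sketched as follows. This is plain Python rather than the repository's tensor code; `gaussian_window` is a hypothetical name, and the window size and sigma are illustrative values, not necessarily the repository's defaults.

```python
import math


def gaussian_window(window_size: int, sigma: float) -> list:
    """Normalised 1D Gaussian weights centred on the middle of the window."""
    center = window_size // 2
    weights = [
        math.exp(-((x - center) ** 2) / (2 * sigma ** 2))  # note the minus sign
        for x in range(window_size)
    ]
    total = sum(weights)
    return [w / total for w in weights]


kernel = gaussian_window(window_size=11, sigma=1.5)
print(kernel)
```

Without the minus sign the weights would grow towards the edges of the window instead of peaking at the centre.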
Hi!
Maybe it's a silly question, but why do you use a KL weight term? I understand that it's the fraction of the total dataset that a batch represents. For instance, if there are 100 observations and the batch size is 10, the kl_weight should be 0.1, but why do you use it? I've looked at some other implementations and didn't find it there. I'm sure there's a reason, but I cannot see why you weight just the KL divergence and not the reconstruction loss.
Thank you so much! :)
Why do you compute the dynamic range in the MS-SSIM VAE from the data range of the reconstructed images? If I understand the original SSIM paper correctly, the dynamic range should be the largest value that the images might assume (e.g. 1.0).
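In the SSIM definition the dynamic range L enters through the two stabilising constants C1 = (K1 L)^2 and C2 = (K2 L)^2. A minimal sketch with the commonly used K1 = 0.01 and K2 = 0.03 shows how a fixed L gives fixed constants; `ssim_constants` is a hypothetical helper, not a function from the repository.

```python
def ssim_constants(dynamic_range: float, k1: float = 0.01, k2: float = 0.03):
    """Stabilising constants C1 = (k1 * L)^2 and C2 = (k2 * L)^2."""
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    return c1, c2


c1_unit, c2_unit = ssim_constants(1.0)     # images scaled to [0, 1]
c1_byte, c2_byte = ssim_constants(255.0)   # 8-bit images
print(c1_unit, c2_unit, c1_byte, c2_byte)
```

If L were instead taken from the range of each reconstruction, the constants would change from batch to batch along with the network's outputs.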
I am trying to run this repo for the first time. I am getting the following error. Torch is installed and I am able to import torch outside of this script. Has anyone experienced a similar issue?
Traceback (most recent call last):
File "run.py", line 5, in <module>
from models import *
File "D:\github\PyTorch-VAE\models\__init__.py", line 1, in <module>
from .base import *
File "D:\github\PyTorch-VAE\models\base.py", line 2, in <module>
from torch import nn
ModuleNotFoundError: No module named 'torch'
Hi @AntixK,
Thank you for sharing your project with us.
I have a question. How do I use another dataset? I would like to use the CIFAR-10 collection. I changed the experiment file, but it gives the following error:
File "/home/josi/doutorado-2019/PyTorch-VAE/experiment.py", line 143, in train_dataloader
download=True)
TypeError: __init__() got an unexpected keyword argument 'split'
Could you help me?
When I try to run this code, I get many errors that seem to be related to deprecated options in pytorch_lightning, such as the max_nb_epochs option or the pytorch_lightning.logging module (which was replaced with the pytorch_lightning.loggers module).
Lightning-AI/pytorch-lightning#663
Will this repo be updated to use the latest version of pytorch_lightning? I'm having a difficult time getting things to work
Thank you,
Ryan
Is it possible to use these models on our own custom dataset?
Thanks for your code. I encountered a problem when running the program. Because of the initialization of the linear layers, every dimension of the learned hidden vector is very small. Is there a suggested initialization method that achieves better results?
A line self.save_hyperparameters() needs to be added in
Line 16 in 8700d24
(vae_model and params here), so that LightningModule.load_from_checkpoint(PATH) and runner.test() can be called successfully afterwards.
Hi, I'm new to PyTorch and VAEs in general. I attempted to run your VanillaVAE, but I can't figure out how to reference the dataset inside the config file. Specifically, this is my folder structure:
and my config looks like this:
exp_params:
  dataset: celeba
  data_path: "/content/drive/MyDrive/Colab/celeba"
  img_size: 64
  batch_size: 144 # Better to have a square number
  LR: 0.005
  weight_decay: 0.0
  scheduler_gamma: 0.95
but still the error remains the same:
RuntimeError: Dataset not found or corrupted. You can use download=True to download it
Apart from the fact that using download=True doesn't work (looks like it attempts to download an invalid dataset), is there anything to take into account like unzipping the archive img_align_celeba.zip or stuff like that?
Is there a reason for manually implementing reparameterization instead of using the rsample method provided in PyTorch?
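For comparison, the two approaches can be put side by side. This is a sketch with illustrative tensor shapes, not the repository's reparameterize method; it uses torch.distributions.Normal and its rsample method.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)
mu = torch.zeros(4, 3, requires_grad=True)
log_var = torch.zeros(4, 3, requires_grad=True)
std = torch.exp(0.5 * log_var)

# Manual reparameterization: z = mu + std * eps, with eps ~ N(0, I).
eps = torch.randn_like(std)
z_manual = mu + eps * std

# The same idea via torch.distributions; rsample() keeps the sample differentiable.
z_rsample = Normal(mu, std).rsample()

# Gradients flow back to mu and log_var through the sampled values.
z_rsample.sum().backward()
print(mu.grad.shape, log_var.grad.shape)
```

Both variants draw from the same distribution and let gradients reach mu and log_var, so the choice between them looks largely stylistic.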
Hi @AntixK
Many thanks for this great effort.
Based on my understanding so far, the original VAE does not talk about weighting the kl_divergence_loss. Later, beta-VAE and many other papers made the case for weighting the kl_div (and essentially treating the weight as a hyper-parameter).
In your implementations, I see that you consistently use kld_weight = kwargs['M_N'] = batch_size/num_of_images.
Is it the norm to select the weight for the kl_div loss using the ratio of the batch size and the number of images?
Since no weighting was done in the original VAE paper, is it okay to use it in vanilla_vae.py?
Regards
Kapil
Hi,
First, thanks for all the shared work !
I have a question concerning the sampling function in the Vanilla VAE. Why do you sample from a normal distribution (0,1) and not from a normal distribution with the learned parameters mu and sigma? Since during training we decode from the latent space over this distribution, isn't it more meaningful to sample from this distribution? Maybe there is something I didn't get.
Thank you again
Hi Anand and all,
As a weighting of samples, weight should be detached from the current computational graph for the expected optimization objective, right? See
Line 155 in 8700d24
Hi, I have a question about the sampling process in the Vamp VAE model. code
I am new to VAEs, so maybe my question is naive. My question is: why does the code draw samples from the standard Gaussian instead of the VampPrior?
hey! really cool repo. Just wanted to let you know you should upgrade to 0.7.2 as it includes a good amount of additional functionality including TPU training and solves a good deal of bugs for edge cases.
in addition, it makes the code MUCH simpler and adds a few more training loop hooks
I've noticed that on non-training data, the latent representation produced by the encoder contains negative values for the standard deviation. Could you please explain how this is possible?
Thank you in advance!
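One possible explanation, assuming the encoder output in question is a log-variance rather than a standard deviation (an assumption about the model that this page does not confirm): a log-variance may legitimately be negative, and the standard deviation derived from it is always positive. A plain-Python sketch with a hypothetical helper `std_from_log_var`:

```python
import math


def std_from_log_var(log_var: float) -> float:
    """Standard deviation implied by a log-variance: sigma = exp(0.5 * log_var)."""
    return math.exp(0.5 * log_var)


for log_var in (-4.0, 0.0, 2.0):
    # A negative log-variance simply means a variance below 1.
    print(log_var, std_from_log_var(log_var))
```

If the value really is used directly as a standard deviation, an unconstrained linear output can produce negative numbers unless it is passed through something like exp or softplus.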
Nice work!
One thing confuses me: why do you multiply kld_loss by kwargs['M_N']?
Thanks!