
cld-sgm's People

Contributors

karstenkreis, timudk


cld-sgm's Issues

inf Loss

Thanks for open-sourcing this awesome repo. However, I am running experiments and consistently get an inf loss no matter how I change the hyperparameters (even after setting the learning rate to zero and reducing the model size).

To reproduce

# clone the repo and set up Python and dependencies
python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root $(pwd) --mode train --workdir logs/debug --n_gpus_per_node 1 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64 --log_freq 1

Env

PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 8.4.0-3ubuntu2) 8.4.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000

Nvidia driver version: 510.60.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-fid==0.2.1
[pip3] torch==1.8.1+cu111
[pip3] torchdiffeq==0.2.3
[pip3] torchvision==0.9.1+cu111
[conda] torch                     1.8.1+cu111              pypi_0    pypi
[conda] torchdiffeq               0.2.3                    pypi_0    pypi
[conda] torchvision               0.9.1+cu111              pypi_0    pypi

Output

WARNING - module_wrapper.py - 2022-05-04 07:51:58,366 - From /home/qzhang419/anaconda3/envs/cld/lib/python3.8/site-packages/tensorflow_gan/python/estimator/tpu_gan_estimator.py:42: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

INFO - run_lib.py - 2022-05-04 07:51:58,368 - Namespace(attention_type='ddpm', attn_resolutions='16', autocast_eval=True, autocast_train=True, beta0=4.0, beta1=0.0, beta_type='linear', cc='configs/default_cifar10.txt', center_image=True, ch_mult='1,2,2,2', checkpoint=None, ckpt_file=None, cld_objective='hsm', cont_nbr=None, conv_size=3, data_dim=None, data_location=None, dataset='cifar10', denoising=True, device=device(type='cuda', index=0), distributed=True, dropout=0.1, ema_rate=0.9999, embedding_type='fourier', eval_density=False, eval_density_npts=101, eval_fid=False, eval_fid_samples=50000, eval_folder=None, eval_freq=20000, eval_hist_samples=100000, eval_iw_likelihood=False, eval_jacobian_norm=False, eval_likelihood=False, eval_loss=False, eval_loss_variance=False, eval_loss_variance_images=1, eval_sample=False, eval_sample_hist=False, eval_sample_samples=1, eval_seed=0, eval_threshold=1, fid_freq=50000, fid_samples_training=20000, fid_threshold=100000, fir_kernel='1,3,3,1', fourier_scale=16, gamma=0.04, global_rank=0, global_size=1, grad_clip=1.0, image_channels=3, image_size=32, init_scale=0.0, is_image=True, learning_rate=0.0002, likelihood_atol=1e-05, likelihood_eps=1e-05, likelihood_freq=50000, likelihood_hutchinson_type='rademacher', likelihood_rtol=1e-05, likelihood_solver='scipy_solver', likelihood_solver_options={'solver': 'RK45'}, likelihood_threshold=2000000, local_rank=0, log_freq=1, loss_eps=1e-05, m_inv=4.0, master_address='127.0.0.1', master_port=6020, mixed_score=True, mode='train', n_channels=128, n_discrete_steps=None, n_eval_batches=1, n_gpus_per_node=1, n_likelihood_batches=1, n_nodes=1, n_resblocks=8, n_train_iters=800000, n_warmup_iters=100000, name='ncsnpp', node_rank=0, nonlinearity='swish', normalization='GroupNorm', numerical_eps=1e-09, optimizer='Adam', overwrite=False, progressive='none', progressive_combine='sum', progressive_input='residual', resamp_with_conv=True, resblock_type='biggan', root='/tmp/CLD-SGM', sampling_atol=1e-05, sampling_batch_size=64, sampling_eps=0.001, sampling_method='ode', sampling_rtol=1e-05, sampling_solver='scipy_solver', sampling_solver_options={'solver': 'RK45'}, save_freq=50000, save_threshold=300000, sc='configs/specific_cifar10.txt', sde='cld', seed=0, skip_rescale=True, snapshot_freq=10000, snapshot_threshold=1, sscs_num_stab=0.0, striding='linear', testing_batch_size=64, training_batch_size=64, use_fir=True, weight_decay=0.0, weighting='reweightedv2', workdir='logs/debug')
INFO - run_lib.py - 2022-05-04 07:52:02,146 - Number of trainable parameters in model: 107593859
INFO - run_lib.py - 2022-05-04 07:52:03,269 - Number of total iterations: 800000
INFO - resolver.py - 2022-05-04 07:52:03,379 - Using /tmp/tfhub_modules to cache modules.
INFO - run_lib.py - 2022-05-04 07:52:10,876 - Starting training at step 0
INFO - run_lib.py - 2022-05-04 07:52:13,258 - Iter 1/800000 Loss: inf Time: 1.815
INFO - distributed.py - 2022-05-04 07:52:13,280 - Reducer buckets have been rebuilt in this iteration.
INFO - run_lib.py - 2022-05-04 07:52:13,710 - Iter 2/800000 Loss: inf Time: 0.451
INFO - run_lib.py - 2022-05-04 07:52:14,115 - Iter 3/800000 Loss: inf Time: 0.403
INFO - run_lib.py - 2022-05-04 07:52:14,510 - Iter 4/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:14,908 - Iter 5/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:15,304 - Iter 6/800000 Loss: inf Time: 0.395
INFO - run_lib.py - 2022-05-04 07:52:15,722 - Iter 7/800000 Loss: inf Time: 0.417
INFO - run_lib.py - 2022-05-04 07:52:16,128 - Iter 8/800000 Loss: inf Time: 0.405
INFO - run_lib.py - 2022-05-04 07:52:16,527 - Iter 9/800000 Loss: inf Time: 0.398
INFO - run_lib.py - 2022-05-04 07:52:16,925 - Iter 10/800000 Loss: inf Time: 0.396
INFO - run_lib.py - 2022-05-04 07:52:17,324 - Iter 11/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:17,723 - Iter 12/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:18,118 - Iter 13/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:18,513 - Iter 14/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:18,925 - Iter 15/800000 Loss: inf Time: 0.411
INFO - run_lib.py - 2022-05-04 07:52:19,328 - Iter 16/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:19,728 - Iter 17/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:20,132 - Iter 18/800000 Loss: inf Time: 0.403
INFO - run_lib.py - 2022-05-04 07:52:20,532 - Iter 19/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:20,929 - Iter 20/800000 Loss: inf Time: 0.396
INFO - run_lib.py - 2022-05-04 07:52:21,330 - Iter 21/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:21,730 - Iter 22/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:22,132 - Iter 23/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:22,532 - Iter 24/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:22,935 - Iter 25/800000 Loss: inf Time: 0.402
INFO - run_lib.py - 2022-05-04 07:52:23,336 - Iter 26/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:23,736 - Iter 27/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:24,136 - Iter 28/800000 Loss: inf Time: 0.398
INFO - run_lib.py - 2022-05-04 07:52:24,543 - Iter 29/800000 Loss: inf Time: 0.406
INFO - run_lib.py - 2022-05-04 07:52:24,945 - Iter 30/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:25,347 - Iter 31/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:25,749 - Iter 32/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:26,147 - Iter 33/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:26,548 - Iter 34/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:26,950 - Iter 35/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:27,351 - Iter 36/800000 Loss: inf Time: 0.400
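
Since the loss is inf from the very first iteration, the overflow likely happens in the forward pass rather than from training divergence. One thing worth checking: the config dump above shows autocast_train=True, and fp16 overflow under autocast is a common source of inf losses, so disabling mixed precision is a quick way to isolate the problem. Below is a minimal sketch of a finiteness guard for a generic PyTorch training step; model, loss_fn, optimizer, and batch are hypothetical stand-ins, not names from this repo.

import torch

# Minimal sketch: refuse to step the optimizer on a non-finite loss,
# so the failure surfaces immediately instead of corrupting the model.
def training_step(model, loss_fn, optimizer, batch):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    if not torch.isfinite(loss):
        # Report instead of propagating inf/NaN gradients.
        raise RuntimeError(f"non-finite loss encountered: {loss.item()}")
    loss.backward()
    optimizer.step()
    return loss

If the loss becomes finite with autocast disabled, the problem is an fp16 range issue rather than a bug in the objective itself.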

train on a single node with 4 GPUs

Thank you so much for open-sourcing this awesome work. I would like to know how to train on a single node with multiple GPUs (e.g., 4 GPUs). Is the following command correct? A sketch of the usual launch pattern follows the command. Thank you in advance.

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root ./ROOT --mode train --workdir work_dir/cifar10 --n_gpus_per_node 4 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64
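
The command is consistent with the options visible in the logs above (--n_gpus_per_node, master_port=6020). For reference, here is a minimal sketch of how single-node multi-GPU training is typically launched with torch.multiprocessing and DistributedDataParallel; this is the generic pattern, not this repo's actual launcher, and build_model is a placeholder.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def build_model():
    # Placeholder model; in CLD-SGM this would be the NCSN++ score model.
    return torch.nn.Linear(8, 8)

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "6020")  # matches master_port in the logs above
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(build_model().cuda(rank), device_ids=[rank])
    # ... training loop: each rank processes its own shard of the data ...
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = 4  # corresponds to --n_gpus_per_node 4
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)

One thing worth checking in the config handling: whether --training_batch_size is interpreted as a global batch size or per GPU, since that changes the effective batch size as you add GPUs.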

ninja: build stopped: subcommand failed

Traceback (most recent call last):
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 45, in setup
    fn(config)
  File "main.py", line 74, in main
    import run_lib
  File "/import/home/a/TSG/CLD-SGM/run_lib.py", line 21, in <module>
    from models import ncsnpp
  File "/import/home/a/TSG/CLD-SGM/models/ncsnpp.py", line 25, in <module>
    from . import utils, layers, layerspp, normalization
  File "/import/home/a/TSG/CLD-SGM/models/layerspp.py", line 20, in <module>
    from . import up_or_down_sampling
  File "/import/home/a/TSG/CLD-SGM/models/up_or_down_sampling.py", line 25, in <module>
    from op import upfirdn2d
  File "/import/home/a/TSG/CLD-SGM/op/__init__.py", line 9, in <module>
    from .fused_act import FusedLeakyReLU, fused_leaky_relu
  File "/import/home/a/TSG/CLD-SGM/op/fused_act.py", line 19, in <module>
    fused = load(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused': [1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/a/anaconda3/envs/diffusion/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++14 -c /import/home/a/TSG/CLD-SGM/op/fused_bias_act_kernel.cu -o fused_bias_act_kernel.cuda.o
FAILED: fused_bias_act_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/a/anaconda3/envs/diffusion/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++14 -c /import/home/a/TSG/CLD-SGM/op/fused_bias_act_kernel.cu -o fused_bias_act_kernel.cuda.o
In file included from /import/home/a/TSG/CLD-SGM/op/fused_bias_act_kernel.cu:11:0:
/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:7:10: fatal error: cublas_v2.h: No such file or directory
#include <cublas_v2.h>
^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
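
The root cause here is the fatal error near the end: nvcc cannot find cublas_v2.h. This usually means the CUDA installation that PyTorch's JIT extension builder resolves to is a runtime-only install without the toolkit headers, or that CUDA_HOME points at the wrong place. A small diagnostic sketch follows; the header location used is the conventional one (on some CUDA 11 layouts it lives under targets/x86_64-linux/include instead):

import os
import shutil
from torch.utils.cpp_extension import CUDA_HOME

# Check which CUDA toolkit PyTorch's extension builder will use and
# whether it actually ships the cuBLAS headers.
print("CUDA_HOME:", CUDA_HOME)
print("nvcc on PATH:", shutil.which("nvcc"))
if CUDA_HOME is not None:
    header = os.path.join(CUDA_HOME, "include", "cublas_v2.h")
    print("cublas_v2.h present:", os.path.exists(header))

If the header is missing, installing a full CUDA toolkit (matching the CUDA version PyTorch was built with) or pointing CUDA_HOME at a complete toolkit installation typically resolves this class of build failure.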

Regarding time_cond in ncsnn++

Hi,
Can somebody tell me where the time_cond parameter is computed? It is used in the forward function of ncsnpp.py.
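
time_cond is not computed inside ncsnpp.py itself; it is the diffusion time (or a transformation of it) that the loss and sampling code pass into the model's forward call. Inside the network it is mapped to an embedding. With embedding_type='fourier' (as in the config dump above), score-SDE style models typically use a Gaussian Fourier projection along the following lines; this is a sketch of the standard construction, not necessarily this repo's exact code.

import numpy as np
import torch
import torch.nn as nn

# Sketch of a Gaussian Fourier time embedding, the usual meaning of
# embedding_type='fourier' in score-SDE style models.
class GaussianFourierProjection(nn.Module):
    def __init__(self, embedding_size=256, scale=16.0):
        super().__init__()
        # Fixed random frequencies; not trained.
        self.W = nn.Parameter(torch.randn(embedding_size) * scale,
                              requires_grad=False)

    def forward(self, t):
        # t: (batch,) diffusion times, i.e. the time_cond input.
        t_proj = t[:, None] * self.W[None, :] * 2 * np.pi
        return torch.cat([torch.sin(t_proj), torch.cos(t_proj)], dim=-1)

The resulting embedding is then fed through small dense layers and added to the feature maps throughout the U-Net, which is how the network is conditioned on time.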

NaN in Loss/Score

Dear Authors,

Thanks for the open-source code. I ran the CIFAR-10 experiment on a single card with the following command:

  • python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root $ROOT --mode train --workdir work_dir/cifar10 --n_gpus_per_node 1 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64

However, the loss is always NaN (the error occurs at line 61 of losses.py). I checked the output values; the score values at line 45 are all NaN. Do you know how to solve this issue?

Regards,
Hanshu
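
For localizing NaNs like this, two generic PyTorch tools help: torch.autograd.set_detect_anomaly, which reports the operation that produced a NaN gradient, and forward hooks, which find the first module whose activations go non-finite. A sketch, with model as a hypothetical stand-in for the score network:

import torch

# Sketch: register forward hooks that flag the first module whose
# output contains non-finite values.
def add_nan_hooks(model):
    def hook(module, inputs, output):
        outs = output if isinstance(output, (tuple, list)) else (output,)
        for o in outs:
            if torch.is_tensor(o) and not torch.isfinite(o).all():
                raise RuntimeError(
                    f"non-finite output in {module.__class__.__name__}")
    for m in model.modules():
        m.register_forward_hook(hook)

# Makes the backward pass report the op that produced a NaN gradient.
# Significantly slows training; enable for debugging only.
torch.autograd.set_detect_anomaly(True)

If the score is already NaN in the forward pass, the hooks will pinpoint the offending module before the loss is even computed.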
