
cld-sgm's People

Contributors

karstenkreis, timudk


cld-sgm's Issues

inf Loss

Thanks for open-sourcing this awesome repo. However, I am running experiments and consistently get an inf loss no matter how I change the hyperparameters (even after setting the learning rate to zero and reducing the model size).

To reproduce

# clone the repo and set up Python and dependencies
python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root $(pwd) --mode train --workdir logs/debug --n_gpus_per_node 1 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64 --log_freq 1

Env

PyTorch version: 1.8.1+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 8.4.0-3ubuntu2) 8.4.0
Clang version: Could not collect
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000

Nvidia driver version: 510.60.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.1.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-fid==0.2.1
[pip3] torch==1.8.1+cu111
[pip3] torchdiffeq==0.2.3
[pip3] torchvision==0.9.1+cu111
[conda] torch                     1.8.1+cu111              pypi_0    pypi
[conda] torchdiffeq               0.2.3                    pypi_0    pypi
[conda] torchvision               0.9.1+cu111              pypi_0    pypi

Output

WARNING - module_wrapper.py - 2022-05-04 07:51:58,366 - From /home/qzhang419/anaconda3/envs/cld/lib/python3.8/site-packages/tensorflow_gan/python/estimator/tpu_gan_estimator.py:42: The name tf.estimator.tpu.TPUEstimator is deprecated. Please use tf.compat.v1.estimator.tpu.TPUEstimator instead.

INFO - run_lib.py - 2022-05-04 07:51:58,368 - Namespace(attention_type='ddpm', attn_resolutions='16', autocast_eval=True, autocast_train=True, beta0=4.0, beta1=0.0, beta_type='linear', cc='configs/default_cifar10.txt', center_image=True, ch_mult='1,2,2,2', checkpoint=None, ckpt_file=None, cld_objective='hsm', cont_nbr=None, conv_size=3, data_dim=None, data_location=None, dataset='cifar10', denoising=True, device=device(type='cuda', index=0), distributed=True, dropout=0.1, ema_rate=0.9999, embedding_type='fourier', eval_density=False, eval_density_npts=101, eval_fid=False, eval_fid_samples=50000, eval_folder=None, eval_freq=20000, eval_hist_samples=100000, eval_iw_likelihood=False, eval_jacobian_norm=False, eval_likelihood=False, eval_loss=False, eval_loss_variance=False, eval_loss_variance_images=1, eval_sample=False, eval_sample_hist=False, eval_sample_samples=1, eval_seed=0, eval_threshold=1, fid_freq=50000, fid_samples_training=20000, fid_threshold=100000, fir_kernel='1,3,3,1', fourier_scale=16, gamma=0.04, global_rank=0, global_size=1, grad_clip=1.0, image_channels=3, image_size=32, init_scale=0.0, is_image=True, learning_rate=0.0002, likelihood_atol=1e-05, likelihood_eps=1e-05, likelihood_freq=50000, likelihood_hutchinson_type='rademacher', likelihood_rtol=1e-05, likelihood_solver='scipy_solver', likelihood_solver_options={'solver': 'RK45'}, likelihood_threshold=2000000, local_rank=0, log_freq=1, loss_eps=1e-05, m_inv=4.0, master_address='127.0.0.1', master_port=6020, mixed_score=True, mode='train', n_channels=128, n_discrete_steps=None, n_eval_batches=1, n_gpus_per_node=1, n_likelihood_batches=1, n_nodes=1, n_resblocks=8, n_train_iters=800000, n_warmup_iters=100000, name='ncsnpp', node_rank=0, nonlinearity='swish', normalization='GroupNorm', numerical_eps=1e-09, optimizer='Adam', overwrite=False, progressive='none', progressive_combine='sum', progressive_input='residual', resamp_with_conv=True, resblock_type='biggan', root='/tmp/CLD-SGM', sampling_atol=1e-05, sampling_batch_size=64, sampling_eps=0.001, sampling_method='ode', sampling_rtol=1e-05, sampling_solver='scipy_solver', sampling_solver_options={'solver': 'RK45'}, save_freq=50000, save_threshold=300000, sc='configs/specific_cifar10.txt', sde='cld', seed=0, skip_rescale=True, snapshot_freq=10000, snapshot_threshold=1, sscs_num_stab=0.0, striding='linear', testing_batch_size=64, training_batch_size=64, use_fir=True, weight_decay=0.0, weighting='reweightedv2', workdir='logs/debug')
INFO - run_lib.py - 2022-05-04 07:52:02,146 - Number of trainable parameters in model: 107593859
INFO - run_lib.py - 2022-05-04 07:52:03,269 - Number of total iterations: 800000
INFO - resolver.py - 2022-05-04 07:52:03,379 - Using /tmp/tfhub_modules to cache modules.
INFO - run_lib.py - 2022-05-04 07:52:10,876 - Starting training at step 0
INFO - run_lib.py - 2022-05-04 07:52:13,258 - Iter 1/800000 Loss: inf Time: 1.815
INFO - distributed.py - 2022-05-04 07:52:13,280 - Reducer buckets have been rebuilt in this iteration.
INFO - run_lib.py - 2022-05-04 07:52:13,710 - Iter 2/800000 Loss: inf Time: 0.451
INFO - run_lib.py - 2022-05-04 07:52:14,115 - Iter 3/800000 Loss: inf Time: 0.403
INFO - run_lib.py - 2022-05-04 07:52:14,510 - Iter 4/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:14,908 - Iter 5/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:15,304 - Iter 6/800000 Loss: inf Time: 0.395
INFO - run_lib.py - 2022-05-04 07:52:15,722 - Iter 7/800000 Loss: inf Time: 0.417
INFO - run_lib.py - 2022-05-04 07:52:16,128 - Iter 8/800000 Loss: inf Time: 0.405
INFO - run_lib.py - 2022-05-04 07:52:16,527 - Iter 9/800000 Loss: inf Time: 0.398
INFO - run_lib.py - 2022-05-04 07:52:16,925 - Iter 10/800000 Loss: inf Time: 0.396
INFO - run_lib.py - 2022-05-04 07:52:17,324 - Iter 11/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:17,723 - Iter 12/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:18,118 - Iter 13/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:18,513 - Iter 14/800000 Loss: inf Time: 0.394
INFO - run_lib.py - 2022-05-04 07:52:18,925 - Iter 15/800000 Loss: inf Time: 0.411
INFO - run_lib.py - 2022-05-04 07:52:19,328 - Iter 16/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:19,728 - Iter 17/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:20,132 - Iter 18/800000 Loss: inf Time: 0.403
INFO - run_lib.py - 2022-05-04 07:52:20,532 - Iter 19/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:20,929 - Iter 20/800000 Loss: inf Time: 0.396
INFO - run_lib.py - 2022-05-04 07:52:21,330 - Iter 21/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:21,730 - Iter 22/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:22,132 - Iter 23/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:22,532 - Iter 24/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:22,935 - Iter 25/800000 Loss: inf Time: 0.402
INFO - run_lib.py - 2022-05-04 07:52:23,336 - Iter 26/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:23,736 - Iter 27/800000 Loss: inf Time: 0.399
INFO - run_lib.py - 2022-05-04 07:52:24,136 - Iter 28/800000 Loss: inf Time: 0.398
INFO - run_lib.py - 2022-05-04 07:52:24,543 - Iter 29/800000 Loss: inf Time: 0.406
INFO - run_lib.py - 2022-05-04 07:52:24,945 - Iter 30/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:25,347 - Iter 31/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:25,749 - Iter 32/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:26,147 - Iter 33/800000 Loss: inf Time: 0.397
INFO - run_lib.py - 2022-05-04 07:52:26,548 - Iter 34/800000 Loss: inf Time: 0.400
INFO - run_lib.py - 2022-05-04 07:52:26,950 - Iter 35/800000 Loss: inf Time: 0.401
INFO - run_lib.py - 2022-05-04 07:52:27,351 - Iter 36/800000 Loss: inf Time: 0.400
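
Since the loss is inf from the very first iteration, the overflow likely happens in the forward pass rather than from training divergence. One thing worth checking: the config dump above shows autocast_train=True, and fp16 overflow under autocast is a common source of inf losses, so disabling mixed precision is a quick way to isolate the problem. Below is a minimal sketch of a finiteness guard for a generic PyTorch training step; model, loss_fn, optimizer, and batch are hypothetical stand-ins, not names from this repo.

import torch

# Minimal sketch: refuse to step the optimizer on a non-finite loss,
# so the failure surfaces immediately instead of corrupting the model.
def training_step(model, loss_fn, optimizer, batch):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    if not torch.isfinite(loss):
        # Report instead of propagating inf/NaN gradients.
        raise RuntimeError(f"non-finite loss encountered: {loss.item()}")
    loss.backward()
    optimizer.step()
    return loss

If the loss becomes finite with autocast disabled, the problem is an fp16 range issue rather than a bug in the objective itself.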

train on a single node with 4 GPUs

Thank you so much for open-sourcing this awesome work. I would like to know how to train on a single node with multiple GPUs (e.g., 4 GPUs). Is the following command correct? A sketch of the usual launch pattern follows the command. Thank you in advance.

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root ./ROOT --mode train --workdir work_dir/cifar10 --n_gpus_per_node 4 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64
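
The command is consistent with the options visible in the logs above (--n_gpus_per_node, master_port=6020). For reference, here is a minimal sketch of how single-node multi-GPU training is typically launched with torch.multiprocessing and DistributedDataParallel; this is the generic pattern, not this repo's actual launcher, and build_model is a placeholder.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def build_model():
    # Placeholder model; in CLD-SGM this would be the NCSN++ score model.
    return torch.nn.Linear(8, 8)

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "6020")  # matches master_port in the logs above
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(build_model().cuda(rank), device_ids=[rank])
    # ... training loop: each rank processes its own shard of the data ...
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = 4  # corresponds to --n_gpus_per_node 4
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)

One thing worth checking in the config handling: whether --training_batch_size is interpreted as a global batch size or per GPU, since that changes the effective batch size as you add GPUs.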

ninja: build stopped: subcommand failed

Traceback (most recent call last):
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "main.py", line 45, in setup
    fn(config)
  File "main.py", line 74, in main
    import run_lib
  File "/import/home/a/TSG/CLD-SGM/run_lib.py", line 21, in <module>
    from models import ncsnpp
  File "/import/home/a/TSG/CLD-SGM/models/ncsnpp.py", line 25, in <module>
    from . import utils, layers, layerspp, normalization
  File "/import/home/a/TSG/CLD-SGM/models/layerspp.py", line 20, in <module>
    from . import up_or_down_sampling
  File "/import/home/a/TSG/CLD-SGM/models/up_or_down_sampling.py", line 25, in <module>
    from op import upfirdn2d
  File "/import/home/a/TSG/CLD-SGM/op/__init__.py", line 9, in <module>
    from .fused_act import FusedLeakyReLU, fused_leaky_relu
  File "/import/home/a/TSG/CLD-SGM/op/fused_act.py", line 19, in <module>
    fused = load(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused': [1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/a/anaconda3/envs/diffusion/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++14 -c /import/home/a/TSG/CLD-SGM/op/fused_bias_act_kernel.cu -o fused_bias_act_kernel.cuda.o
FAILED: fused_bias_act_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/a/anaconda3/envs/diffusion/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++14 -c /import/home/a/TSG/CLD-SGM/op/fused_bias_act_kernel.cu -o fused_bias_act_kernel.cuda.o
In file included from /import/home/a/TSG/CLD-SGM/op/fused_bias_act_kernel.cu:11:0:
/home/a/anaconda3/envs/diffusion/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:7:10: fatal error: cublas_v2.h: No such file or directory
#include <cublas_v2.h>
^~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
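
The root cause here is the fatal error near the end: nvcc cannot find cublas_v2.h. This usually means the CUDA installation that PyTorch's JIT extension builder resolves to is a runtime-only install without the toolkit headers, or that CUDA_HOME points at the wrong place. A small diagnostic sketch follows; the header location used is the conventional one (on some CUDA 11 layouts it lives under targets/x86_64-linux/include instead):

import os
import shutil
from torch.utils.cpp_extension import CUDA_HOME

# Check which CUDA toolkit PyTorch's extension builder will use and
# whether it actually ships the cuBLAS headers.
print("CUDA_HOME:", CUDA_HOME)
print("nvcc on PATH:", shutil.which("nvcc"))
if CUDA_HOME is not None:
    header = os.path.join(CUDA_HOME, "include", "cublas_v2.h")
    print("cublas_v2.h present:", os.path.exists(header))

If the header is missing, installing a full CUDA toolkit (matching the CUDA version PyTorch was built with) or pointing CUDA_HOME at a complete toolkit installation typically resolves this class of build failure.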

Regarding time_cond in ncsnn++

Hi,
Can somebody tell me where the time_cond parameter is computed? It is used in the forward function of ncsnpp.py.
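
time_cond is not computed inside ncsnpp.py itself; it is the diffusion time (or a transformation of it) that the loss and sampling code pass into the model's forward call. Inside the network it is mapped to an embedding. With embedding_type='fourier' (as in the config dump above), score-SDE style models typically use a Gaussian Fourier projection along the following lines; this is a sketch of the standard construction, not necessarily this repo's exact code.

import numpy as np
import torch
import torch.nn as nn

# Sketch of a Gaussian Fourier time embedding, the usual meaning of
# embedding_type='fourier' in score-SDE style models.
class GaussianFourierProjection(nn.Module):
    def __init__(self, embedding_size=256, scale=16.0):
        super().__init__()
        # Fixed random frequencies; not trained.
        self.W = nn.Parameter(torch.randn(embedding_size) * scale,
                              requires_grad=False)

    def forward(self, t):
        # t: (batch,) diffusion times, i.e. the time_cond input.
        t_proj = t[:, None] * self.W[None, :] * 2 * np.pi
        return torch.cat([torch.sin(t_proj), torch.cos(t_proj)], dim=-1)

The resulting embedding is then fed through small dense layers and added to the feature maps throughout the U-Net, which is how the network is conditioned on time.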

NaN in Loss/Score

Dear Authors,

Thanks for the open-source code. I ran the CIFAR-10 experiment on a single card with the following command:

  • python main.py -cc configs/default_cifar10.txt -sc configs/specific_cifar10.txt --root $ROOT --mode train --workdir work_dir/cifar10 --n_gpus_per_node 1 --training_batch_size 64 --testing_batch_size 64 --sampling_batch_size 64

However, the loss is always NaN (the error occurs at line 61 of losses.py). I checked the output values; the score values at line 45 are all NaN. Do you know how to solve this issue?

Regards,
Hanshu
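
For localizing NaNs like this, two generic PyTorch tools help: torch.autograd.set_detect_anomaly, which reports the operation that produced a NaN gradient, and forward hooks, which find the first module whose activations go non-finite. A sketch, with model as a hypothetical stand-in for the score network:

import torch

# Sketch: register forward hooks that flag the first module whose
# output contains non-finite values.
def add_nan_hooks(model):
    def hook(module, inputs, output):
        outs = output if isinstance(output, (tuple, list)) else (output,)
        for o in outs:
            if torch.is_tensor(o) and not torch.isfinite(o).all():
                raise RuntimeError(
                    f"non-finite output in {module.__class__.__name__}")
    for m in model.modules():
        m.register_forward_hook(hook)

# Makes the backward pass report the op that produced a NaN gradient.
# Significantly slows training; enable for debugging only.
torch.autograd.set_detect_anomaly(True)

If the score is already NaN in the forward pass, the hooks will pinpoint the offending module before the loss is even computed.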
