
svd_xtend's Introduction

SVD Xtend

Stable Video Diffusion Training Code and Extensions 🚀

💡 Highlight

  • Finetuning SVD. See Part 1.
  • Tracklet-Conditioned Video Generation. Building upon SVD, you can control the movement of objects using tracklets (bounding boxes). See Part 2.

Part 1: Training

Comparison

size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00
generator=torch.manual_seed(111)
[Image grid: Init Image | Before Fine-tuning | After Fine-tuning, four example rows.]
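
For reference, a minimal inference sketch that reproduces these settings with the diffusers StableVideoDiffusionPipeline (the model path and input image below are placeholders, not files shipped with this repo):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the original or fine-tuned SVD weights; replace the path with your own.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "/path/to/weight", torch_dtype=torch.float16
)
pipe.to("cuda")

image = load_image("init_frame.png").resize((512, 320))   # placeholder init image
generator = torch.manual_seed(111)
frames = pipe(
    image, height=320, width=512,
    motion_bucket_id=127, fps=7, noise_aug_strength=0.0,
    generator=generator,
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)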

Video Data Processing

Note that BDD100K is a driving video/image dataset, but it is not required for training; any videos can be used. Please refer to the DummyDataset data reading logic. In short, you only need to modify self.base_folder and then arrange your videos in the following file structure (a minimal dataset sketch follows the tree below):

self.base_folder
    ├── video_name1
    │   ├── video_frame1
    │   ├── video_frame2
    │   └── ...
    ├── video_name2
    │   ├── video_frame1
    │   └── ...
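
A minimal reading sketch for this layout (clip length, target resolution, and the [-1, 1] normalization below are illustrative assumptions, not the repo's exact values; DummyDataset in train_svd.py is the reference implementation):

import os
import random
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class FolderVideoDataset(Dataset):
    def __init__(self, base_folder, num_frames=14, width=512, height=320):
        self.base_folder = base_folder            # point this at your video folders
        self.videos = sorted(os.listdir(base_folder))
        self.num_frames, self.width, self.height = num_frames, width, height

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        folder = os.path.join(self.base_folder, self.videos[idx])
        frame_names = sorted(os.listdir(folder))
        # Sample a random window of consecutive frames (assumes enough frames per video).
        start = random.randint(0, max(0, len(frame_names) - self.num_frames))
        clip = []
        for name in frame_names[start:start + self.num_frames]:
            img = Image.open(os.path.join(folder, name)).convert("RGB")
            img = img.resize((self.width, self.height))
            clip.append(np.asarray(img, dtype=np.float32) / 127.5 - 1.0)  # scale to [-1, 1]
        # Returned tensor has shape (num_frames, channels, height, width).
        return {"pixel_values": torch.from_numpy(np.stack(clip)).permute(0, 3, 1, 2)}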

Training Configuration (on the BDD100K dataset)

This training configuration is for reference only. All parameters of the UNet were set to be trainable during training, with a learning rate of 1e-5.

accelerate launch train_svd.py \
    --pretrained_model_name_or_path=/path/to/weight \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200

Part 2: Tracklet2Video

Tracklet2Video

We have attempted to incorporate layout control on top of img2video, which makes the motion of objects more controllable, similar to what is demonstrated in the images below. The code and weights will be updated soon. Note that we use a resolution of 512×320 for SVD to generate videos, so the quality of the generated videos appears poor (which is somewhat unfair to SVD); our intention is to demonstrate the effectiveness of tracklet control, and we will resolve the video quality issue as soon as possible.

[Image grid: Init Image | Gen Video by SVD | Gen Video by Ours, two example rows.]

Methods

We have utilized the Self-Tracking training from Boximator and the Instance-Enhancer from TrackDiffusion. For more details, please refer to the paper.

🏷️ TODO List

  • Support text2video (WIP)
  • Support more conditional inputs, such as layout

♥️ Acknowledgement

Our model builds on Diffusers and Stability AI's Stable Video Diffusion. Thanks for their great work!

Thanks to Boximator and GLIGEN for their awesome models.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{li2023trackdiffusion,
  title={Trackdiffusion: Multi-object tracking data generation via diffusion models},
  author={Li, Pengxiang and Liu, Zhili and Chen, Kai and Hong, Lanqing and Zhuge, Yunzhi and Yeung, Dit-Yan and Lu, Huchuan and Jia, Xu},
  journal={arXiv preprint arXiv:2312.00651},
  year={2023}
}

svd_xtend's People

Contributors

pixeli99, blakeone, danielvegamyhre, kiteretsu77, ciarastrawberry


svd_xtend's Issues

training gpu cost

Hi, thank you for your open-source code. How much GPU memory is required during training? Is it necessary to add DeepSpeed or gradient checkpointing to reduce training memory consumption?

Thank you, fine_tuned weights

Dear,

Thank you for uploading this amazing repository. Are the fine-tuned model weights available? I would love to see how the model runs on my machine.

With regards,

Jasper

How can I use the trained module

|-- outputs
|   |-- checkpoint-1000
|   |   |-- optimizer.bin
|   |   |-- random_states_0.pkl
|   |   |-- scaler.pt
|   |   |-- scheduler.bin
|   |   `-- unet
|   |       |-- config.json
|   |       `-- diffusion_pytorch_model.safetensors
|   |-- feature_extractor
|   |   `-- preprocessor_config.json
|   |-- image_encoder
|   |   |-- config.json
|   |   `-- model.safetensors
|   |-- logs
|   |-- model_index.json
|   |-- scheduler
|   |   `-- scheduler_config.json
|   |-- unet
|   |   |-- config.json
|   |   `-- diffusion_pytorch_model.safetensors
|   |-- vae
|   |   |-- config.json
|   |   `-- diffusion_pytorch_model.safetensors
|   `-- validation_images

Thank you, I have successfully completed the training; the output directory structure is shown above. I have a question: how can I use the trained model? Can I use optimizer.bin instead of svd_xt.safetensors? The size of optimizer.bin is over 3 GB, whereas svd_xt.safetensors is over 9 GB, so I am wondering whether they can be used interchangeably.
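
One possible way to load the fine-tuned UNet back into the img2vid pipeline (paths below are placeholders, not an answer confirmed by the authors). Note that optimizer.bin holds the optimizer state saved by Accelerate rather than model weights, so the unet subfolder is what you would load:

import torch
from diffusers import StableVideoDiffusionPipeline, UNetSpatioTemporalConditionModel

# Load the fine-tuned UNet weights saved under the checkpoint's unet subfolder.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "outputs/checkpoint-1000", subfolder="unet", torch_dtype=torch.float16
)
# Reuse the remaining components (VAE, image encoder, scheduler) from the base model.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    unet=unet, torch_dtype=torch.float16
)
pipe.to("cuda")

Since the output directory also contains model_index.json with the exported components, loading the whole pipeline directly from the output directory should work as well.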

Question about classifier guidance for image in training code

Hello, nice work on the training code! Thank you for sharing this code.
I have a question about the image-conditioned classifier-free guidance in your code.

if args.conditioning_dropout_prob is not None:
  random_p = torch.rand(
      bsz, device=latents.device, generator=generator)
  # Sample masks for the edit prompts.
  prompt_mask = random_p < 2 * args.conditioning_dropout_prob
  prompt_mask = prompt_mask.reshape(bsz, 1, 1)
  # Final text conditioning.
  null_conditioning = torch.zeros_like(encoder_hidden_states)
  encoder_hidden_states = torch.where(
      prompt_mask, null_conditioning.unsqueeze(1), encoder_hidden_states.unsqueeze(1))
  # Sample masks for the original images.
  image_mask_dtype = conditional_latents.dtype
  image_mask = 1 - (
      (random_p >= args.conditioning_dropout_prob).to(
          image_mask_dtype)
      * (random_p < 3 * args.conditioning_dropout_prob).to(image_mask_dtype)
  )
  image_mask = image_mask.reshape(bsz, 1, 1, 1)
  # Final image conditioning.
  conditional_latents = image_mask * conditional_latents

I wonder whether this is the official way of implementing classifier-free guidance for image conditions. If the dropout probability is 0.1 (the default), then:

  • with prob 0.1: first frame concat remains, first frame for cross attention is 0
  • with prob 0.1: first frame concat is 0, first frame for cross attention is 0
  • with prob 0.1: first frame concat is 0, first frame for cross attention remains
  • with prob 0.7: first frame concat remains, first frame for cross attention remains

Is this your intention?

Thank you
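
A quick way to sanity-check these four cases against the snippet above (a small sketch; the dropout probability of 0.1 is assumed):

import torch

p = 0.1
random_p = torch.rand(1_000_000)
drop_text = random_p < 2 * p                       # cross-attention condition zeroed
drop_image = (random_p >= p) & (random_p < 3 * p)  # concatenated image latents zeroed
for t, i in [(True, False), (True, True), (False, True), (False, False)]:
    frac = ((drop_text == t) & (drop_image == i)).float().mean().item()
    print(f"drop_text={t!s:5}  drop_image={i!s:5}  fraction≈{frac:.3f}")
# Expected fractions: ~0.1, ~0.1, ~0.1, ~0.7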

Hyperparameters configuration for training

Hi! Thank you so much for this helpful implementation!

I am interested in the configuration of the hyperparameters you use to finetune SVD. Do you just use the default values written in parse_args() (e.g., lr=1e-4, conditioning_dropout_prob=0.1, etc.)? I would deeply appreciate it if a training script could be provided.

How to support batch_size > 1?

Thanks for your amazing job!
At line 1017 of train_svd.py, the batch size is frozen to 1 (noise_aug_strength = cond_sigmas[0]  # TODO: support batch > 1). What should I do to support batch > 1?
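
Not an official fix, but one possible sketch is to keep the whole cond_sigmas vector instead of its first element and broadcast it per sample; the shapes, the sigma distribution, and the add_time_ids handling below are assumptions about how the rest of the script would need to change:

import torch

bsz, c, h, w = 4, 3, 320, 512
fps, motion_bucket_id = 7, 127
conditional_pixel_values = torch.randn(bsz, c, h, w)

# Instead of collapsing to cond_sigmas[0], keep one sigma per sample
# (log-normal here is only in the same spirit as the script's rand_log_normal).
cond_sigmas = torch.exp(torch.randn(bsz) * 0.5 - 3.0)
noise_aug_strength = cond_sigmas                       # shape [bsz], not a scalar

conditional_pixel_values = (
    torch.randn_like(conditional_pixel_values) * cond_sigmas[:, None, None, None]
    + conditional_pixel_values
)

# _get_add_time_ids would then need one [fps, motion_bucket_id, strength] row per sample:
add_time_ids = torch.tensor(
    [[fps, motion_bucket_id, s.item()] for s in noise_aug_strength]
)                                                      # shape [bsz, 3]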

Question about the forward pass

Hi,

While exploring diffusion models, I noticed the standard forward pass often uses the formula
$\alpha \cdot x + \sigma \cdot \epsilon$.
However, in your video diffusion model code, I saw a different approach:

sigmas = rand_log_normal(shape=[bsz,], loc=0.7, scale=1.6).to(latents.device)
noisy_latents = latents + noise * sigmas
inp_noisy_latents = noisy_latents / ((sigmas**2 + 1) ** 0.5)

You're sampling noise levels from a log-normal distribution and I'm curious about the reasoning behind this choice.
If there are any papers or references that guided this decision, could you share them?

Thanks for your insights!
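
For context: the log-normal σ proposal and the division by $\sqrt{\sigma^2 + 1}$ both match the EDM formulation (Karras et al., 2022, "Elucidating the Design Space of Diffusion-Based Generative Models"). There, noise is added as $x_\sigma = x_0 + \sigma \cdot \epsilon$ and the network input is rescaled by the preconditioner

$c_{\text{in}}(\sigma) = \dfrac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}$,

so with $\sigma_{\text{data}} = 1$ the input becomes $(x_0 + \sigma \cdot \epsilon) / \sqrt{\sigma^2 + 1}$, which is exactly the inp_noisy_latents line above. EDM also samples $\ln \sigma$ from a normal distribution, i.e. $\sigma$ is log-normal, which is what rand_log_normal implements here.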

Suggestions on loading video data

Thanks to the authors for sharing the training code of SVD; it works quite well. I just want to give some suggestions to accelerate data loading. First, if you have the original .mp4 (or other format) video data, using the decord library is much faster than loading video frames iteratively. Second, setting --num_workers to 0 largely improved loading speed for me (I am using two 40 GB A100s for training).

Just some small tips; thanks again to the authors.

Multi-GPU training error

Thank you for your work!
Single-GPU training works fine for me, but multi-GPU training raises the following error:
Traceback (most recent call last):
File "train_svd.py", line 1264, in
main()
File "train_svd.py", line 1045, in main
added_time_ids = _get_add_time_ids(
File "train_svd.py", line 949, in _get_add_time_ids
passed_add_embed_dim = unet.config.addition_time_embed_dim *
File "/.pt2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'config'

The command I used:

accelerate launch train_svd.py \
    --pretrained_model_name_or_path=stable-video-diffusion-img2vid-xt-1-1 \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=100 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=50 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=20 \
    --num_workers=0

Tensor size mismatch error

Traceback (most recent call last):
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\train_svd.py", line 1286, in <module>
    main()
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\train_svd.py", line 1114, in main
    model_pred = unet(
                 ^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\accelerate\utils\operations.py", line 817, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\accelerate\utils\operations.py", line 805, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\diffusers\models\unets\unet_spatio_temporal_condition.py", line 463, in forward
    sample = upsample_block(
             ^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\diffusers\models\unets\unet_3d_blocks.py", line 2351, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 1 in the list.

I get this error whenever I try to run the script. These are the arguments I use:

accelerate launch train_svd.py --pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt-1-1 --per_gpu_batch_size=2 --gradient_accumulation_steps=1 --max_train_steps=5000 --width=10 --height=10 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0 --seed=123 --mixed_precision="fp16" --validation_steps=200

(Using low resolutions to avoid out-of-VRAM issues while testing; the same issue happens at higher resolutions too. I was not able to run at the default resolution on my PC.)

Any idea what is causing the error?

Questions about the noise sampling.

Great work!
I am wondering:

  1. Why not use EDM noise sampling instead of the strategy from Simple Diffusion?
  2. Why use a fixed noise strength (0) on the condition image? I think the sampling expression is given in the SVD paper.

Out of Memory

How much GPU memory is used for a resolution of 512×320 and a batch size of 1?

Questions on text2video?

When I try to figure out how to adapt the framework for text2video synthesis, I found that the spatio-temporal UNet has 8 input channels, as shown in this snippet:


    @register_to_config
    def __init__(
        self,
        sample_size: Optional[int] = None,
        in_channels: int = 8,
        out_channels: int = 4,
        down_block_types: Tuple[str] = (

Then I checked the pipeline inference and found that the denoising input is actually a concatenation of the noisy latents and the image latents:


# Concatenate image_latents over channels dimention
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)

My question is: how do we obtain the image_latents if we only use text as input when training a text2video model?
Have you made any recent progress on text2video?

Question about the new push for partial fine-tuning.

Hello! In the latest push, I noticed a modification for partial fine-tuning: only the parameters whose names contain "temporal_transformer_block" are trained. I wonder if there is any source explaining why we should do that. Thank you so much for your attention and participation.

Multi-gpu training

First off, thank you for providing the SVD training code.

I'm trying to train using multiple GPUs; are there any changes I need to make to the code and the launch command?
In the code, should I uncomment this part, and what else do I need to do?

    # ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
        # kwargs_handlers=[ddp_kwargs]
    )
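
For reference, one standard way to launch multi-GPU training with Accelerate (the flags below are generic Accelerate options, not a setup confirmed by the authors; uncommenting the ddp_kwargs line is typically only needed if DDP complains about unused parameters, e.g. when training a subset of the weights):

accelerate config          # interactively select multi-GPU and the number of processes
# or pass the options explicitly on the command line:
accelerate launch --multi_gpu --num_processes=2 train_svd.py \
    --pretrained_model_name_or_path=/path/to/weight \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --mixed_precision="fp16"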

How much memory is needed when training SVD

First of all, thank you very much for sharing the code. I would also like to ask how much memory is needed when training SVD, and what the minimum GPU memory configuration is.

DeepSpeed Accelerate Config

Hi,

Thanks for building this amazing repo.

Have you tried to use deepspeed? If so, is it possible for you to provide a config file?

Thanks!

Error with pretrained_model_name_or_path

Is this file supposed to be svd_xt_image_decoder.safetensors? Why do I get an error after setting the path? I configured pretrained_model_name_or_path="/SVD_Xtend-main/svd_xt_image_decoder.safetensors" and then got this error:
raise EnvironmentError(f"It looks like the config file at '{config_file}' is not a valid JSON file.")
OSError: It looks like the config file at '/SVD_Xtend-main/svd_xt_image_decoder.safetensors' is not a valid JSON file.
Is the file I downloaded not the pretrained weights?

text2video data

Hi @pixeli99,

Thank you for your nice work. I have a question about text2video data. Could you describe the dataset file structure for text2video training in more detail? Where should I put the image files and text files?

Thank you so much.

Questions about the noise sampling.

Thank you for sharing! This code helps me a lot.

When using this code to finetune SVD, I have some questions about the noise sampling. The noise sampling in this code is as follows:
sigmas = rand_cosine_interpolated(shape=[bsz,], image_d=image_d, noise_d_low=noise_d_low, noise_d_high=noise_d_high, sigma_data=sigma_data, min_value=min_value, max_value=max_value).to(latents.device)
sigmas = sigmas[:, None, None, None, None]
noisy_latents = latents + noise * sigmas

I want to know whether I can simply replace this with a diffusers noise scheduler such as DDPMScheduler.

Hope to get your help!

Support more conditional inputs, such as layout

Hello, I saw a "Support more conditional inputs, such as layout" in the to-do list. What does this mean. Is it that the first frame + the BB trajectory of the subsequent frames are given to guide video generation? When will it probably be online?

Exception: Could not find the transformer layer class to wrap in the model.

When trying to use Accelerate with FSDP, I get the error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/acidhax/dev/SVD_Xtend/train_svd.py", line 1255, in <module>
[rank0]:     main()
[rank0]:   File "/home/acidhax/dev/SVD_Xtend/train_svd.py", line 881, in main
[rank0]:     unet, optimizer, lr_scheduler, train_dataloader = accelerator.prepare(
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1443, in prepare_model
[rank0]:     self.state.fsdp_plugin.set_auto_wrap_policy(model)
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 1182, in set_auto_wrap_policy
[rank0]:     raise Exception("Could not find the transformer layer class to wrap in the model.")
[rank0]: Exception: Could not find the transformer layer class to wrap in the model.

Failure to start training at high resolutions.

Hi there, thank you for this great reproduction. I also reproduced the training code based on the Stability AI codebase. When I start my training at high resolutions (e.g. 576x1024, 512x896), I found the model can only produce a sequence of blurry frames like the one below:
[blurred sample frames]
However, if I run it at a relatively lower resolution (e.g. 320x576, 448x768) with all other settings fixed, the sampling results are as good as the public results. In fact, I found that the terrible results occur when the video width exceeds 800 pixels, despite the official recommendation to run at 576x1024. It's possible that I mistakenly touched some code in my codebase, and I'm still working on this issue. I'm curious, have you ever attempted training at high resolutions? Have you encountered any similar problems?

Finetune SVD with discrete time noise scheduler

Hi, thanks for your great training code for SVD!

When I switch the default SVD sampler to a discrete-time noise scheduler, the generated video looks very bad.
May I ask which layers/parameters of SVD I should finetune to make it compatible with a discrete-time noise scheduler? Or is there a way to transform the SVD noisy_latents (which equal latents + sigma * noise, with large sigma values) into the corresponding noisy_latents (with smaller sigmas) used for training with discrete noise schedulers?

Thanks!

Error when I start the training

I get an error about the UNetSpatioTemporalConditionModel.
This is the Traceback:
Traceback (most recent call last):
File "/workspace/SVD_Xtend/train_svd.py", line 1255, in
main()
File "/workspace/SVD_Xtend/train_svd.py", line 1043, in main
added_time_ids = _get_add_time_ids(
File "/workspace/SVD_Xtend/train_svd.py", line 951, in _get_add_time_ids
expected_add_embed_dim = unet.module.add_embedding.linear_1.in_features
File "/opt/conda/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 218, in getattr
return super().getattr(name)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'UNetSpatioTemporalConditionModel' object has no attribute 'module'

Do you have any advice on text2video?

Hi,

Thanks for open-sourcing this great project.

I am curious about how to implement a text2video version of SVD. Given an input image and a prompt, how do we generate a video? Can I simply replace the encoder_hidden_states with the text embedding to finetune SVD?

Thanks!

How is BF16 supposed to be run? I get an error when running it; please help take a look.

Traceback (most recent call last):
File "train_svd.py", line 1262, in
main()
File "train_svd.py", line 1089, in main
model_pred = unet(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1833, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/workspace/swx689421/diffusers/src/diffusers/models/unets/unet_spatio_temporal_condition.py", line 409, in forward
emb = self.time_embedding(t_emb)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/workspace/swx689421/diffusers/src/diffusers/models/embeddings.py", line 228, in forward
sample = self.linear_1(sample)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype

Which BDD100K data is used for training?

Hi @pixeli99, thank you for your amazing work! I just read your code and am wondering which subset of the BDD100K data is used for your training (I have not found "bdd100k/images/track/mini" on their official website)?

AttributeError: 'UNetSpatioTemporalConditionModel' object has no attribute 'module'. Did you mean: 'modules'?

I got an error when I run with: accelerate launch train_svd.py --pretrained_model_name_or_path=/root/stable-video-diffusion-img2vid-xt --output_dir="model_out" --per_gpu_batch_size=1 --gradient_accumulation_steps=1 --max_train_steps=50000 --width=512 --height=320 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0 --seed=123 --mixed_precision="fp16" --validation_steps=200

train_dataloader is <accelerate.data_loader.DataLoaderShard object at 0x7f4c101b5150>
Traceback (most recent call last):
File "/root/SVD_Xtend/train_svd.py", line 1246, in
main()
File "/root/SVD_Xtend/train_svd.py", line 1025, in main
added_time_ids = _get_add_time_ids(
File "/root/SVD_Xtend/train_svd.py", line 940, in _get_add_time_ids
passed_add_embed_dim = unet.module.config.addition_time_embed_dim *
File "/root/SVD_Xtend/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 222, in getattr
return super().getattr(name)
File "/root/SVD_Xtend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'UNetSpatioTemporalConditionModel' object has no attribute 'module'. Did you mean: 'modules'?

About the step_loss == nan

Hello,
Thanks for your brilliant work!
When I run the code, I find that the step loss always equals NaN when I use the BDD dataset. After carefully checking the code, I found that the output of the last upsample_block becomes NaN. I just use the fp16 model and follow the pipeline.
Could anyone tell me what the reason is?

Thanks a lot!
