
svd_xtend's Introduction

SVD Xtend

Stable Video Diffusion Training Code and Extensions 🚀

💡 Highlight

  • Finetuning SVD. See Part 1.
  • Tracklet-Conditioned Video Generation. Building upon SVD, you can control the movement of objects using tracklets (bounding boxes). See Part 2.

Part 1: Training

Comparison

size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00
generator=torch.manual_seed(111)
[Image grid: Init Image | Before Fine-tuning | After Fine-tuning, four example rows.]
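
For reference, a minimal inference sketch that reproduces these settings with the diffusers StableVideoDiffusionPipeline (the model path and input image below are placeholders, not files shipped with this repo):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the original or fine-tuned SVD weights; replace the path with your own.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "/path/to/weight", torch_dtype=torch.float16
)
pipe.to("cuda")

image = load_image("init_frame.png").resize((512, 320))   # placeholder init image
generator = torch.manual_seed(111)
frames = pipe(
    image, height=320, width=512,
    motion_bucket_id=127, fps=7, noise_aug_strength=0.0,
    generator=generator,
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)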

Video Data Processing

Note that BDD100K is a driving video/image dataset, but it is not required for training; any videos can be used. Please refer to the DummyDataset data reading logic. In short, you only need to modify self.base_folder and then arrange your videos in the following file structure (a minimal dataset sketch follows the tree below):

self.base_folder
    ├── video_name1
    │   ├── video_frame1
    │   ├── video_frame2
    │   └── ...
    ├── video_name2
    │   ├── video_frame1
    │   └── ...
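
A minimal reading sketch for this layout (clip length, target resolution, and the [-1, 1] normalization below are illustrative assumptions, not the repo's exact values; DummyDataset in train_svd.py is the reference implementation):

import os
import random
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class FolderVideoDataset(Dataset):
    def __init__(self, base_folder, num_frames=14, width=512, height=320):
        self.base_folder = base_folder            # point this at your video folders
        self.videos = sorted(os.listdir(base_folder))
        self.num_frames, self.width, self.height = num_frames, width, height

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        folder = os.path.join(self.base_folder, self.videos[idx])
        frame_names = sorted(os.listdir(folder))
        # Sample a random window of consecutive frames (assumes enough frames per video).
        start = random.randint(0, max(0, len(frame_names) - self.num_frames))
        clip = []
        for name in frame_names[start:start + self.num_frames]:
            img = Image.open(os.path.join(folder, name)).convert("RGB")
            img = img.resize((self.width, self.height))
            clip.append(np.asarray(img, dtype=np.float32) / 127.5 - 1.0)  # scale to [-1, 1]
        # Returned tensor has shape (num_frames, channels, height, width).
        return {"pixel_values": torch.from_numpy(np.stack(clip)).permute(0, 3, 1, 2)}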

Training Configuration (on the BDD100K dataset)

This training configuration is for reference only. All parameters of the UNet were set to be trainable during training, with a learning rate of 1e-5.

accelerate launch train_svd.py \
    --pretrained_model_name_or_path=/path/to/weight \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200

Part 2: Tracklet2Video

Tracklet2Video

We have attempted to incorporate layout control on top of img2video, which makes the motion of objects more controllable, similar to what is demonstrated in the images below. The code and weights will be updated soon. Note that we use a resolution of 512×320 for SVD to generate videos, so the quality of the generated videos appears poor (which is somewhat unfair to SVD); our intention is to demonstrate the effectiveness of tracklet control, and we will resolve the video quality issue as soon as possible.

[Image grid: Init Image | Gen Video by SVD | Gen Video by Ours, two example rows.]

Methods

We have utilized the Self-Tracking training from Boximator and the Instance-Enhancer from TrackDiffusion. For more details, please refer to the paper.

🏷️ TODO List

  • Support text2video (WIP)
  • Support more conditional inputs, such as layout

♥️ Acknowledgement

Our model builds on Diffusers and Stability AI's Stable Video Diffusion. Thanks for their great work!

Thanks to Boximator and GLIGEN for their awesome models.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{li2023trackdiffusion,
  title={Trackdiffusion: Multi-object tracking data generation via diffusion models},
  author={Li, Pengxiang and Liu, Zhili and Chen, Kai and Hong, Lanqing and Zhuge, Yunzhi and Yeung, Dit-Yan and Lu, Huchuan and Jia, Xu},
  journal={arXiv preprint arXiv:2312.00651},
  year={2023}
}

svd_xtend's People

Contributors

pixeli99, blakeone, danielvegamyhre, kiteretsu77, ciarastrawberry


svd_xtend's Issues

training gpu cost

Hi, thank you for your open-source code. How much GPU memory is required during training? Is it necessary to add DeepSpeed or gradient checkpointing to reduce training memory consumption?

Thank you, fine_tuned weights

Dear,

Thank you for uploading this amazing repository. Are the fine-tuned model weights available? I would love to see how the model runs on my machine.

With regards,

Jasper

How can I use the trained module

|-- outputs
|   |-- checkpoint-1000
|   |   |-- optimizer.bin
|   |   |-- random_states_0.pkl
|   |   |-- scaler.pt
|   |   |-- scheduler.bin
|   |   `-- unet
|   |       |-- config.json
|   |       `-- diffusion_pytorch_model.safetensors
|   |-- feature_extractor
|   |   `-- preprocessor_config.json
|   |-- image_encoder
|   |   |-- config.json
|   |   `-- model.safetensors
|   |-- logs
|   |-- model_index.json
|   |-- scheduler
|   |   `-- scheduler_config.json
|   |-- unet
|   |   |-- config.json
|   |   `-- diffusion_pytorch_model.safetensors
|   |-- vae
|   |   |-- config.json
|   |   `-- diffusion_pytorch_model.safetensors
|   `-- validation_images

Thank you, I have successfully completed the training; the output directory structure is shown above. I have a question: how can I use the trained model? Can I use optimizer.bin instead of svd_xt.safetensors? The size of optimizer.bin is over 3 GB, whereas svd_xt.safetensors is over 9 GB, so I am wondering whether they can be used interchangeably.
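
One possible way to load the fine-tuned UNet back into the img2vid pipeline (paths below are placeholders, not an answer confirmed by the authors). Note that optimizer.bin holds the optimizer state saved by Accelerate rather than model weights, so the unet subfolder is what you would load:

import torch
from diffusers import StableVideoDiffusionPipeline, UNetSpatioTemporalConditionModel

# Load the fine-tuned UNet weights saved under the checkpoint's unet subfolder.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "outputs/checkpoint-1000", subfolder="unet", torch_dtype=torch.float16
)
# Reuse the remaining components (VAE, image encoder, scheduler) from the base model.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    unet=unet, torch_dtype=torch.float16
)
pipe.to("cuda")

Since the output directory also contains model_index.json with the exported components, loading the whole pipeline directly from the output directory should work as well.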

Question about classifier guidance for image in training code

Hello, nice work on the training code! Thank you for sharing this code.
I have a question about the image-conditioned classifier-free guidance in your code.

if args.conditioning_dropout_prob is not None:
  random_p = torch.rand(
      bsz, device=latents.device, generator=generator)
  # Sample masks for the edit prompts.
  prompt_mask = random_p < 2 * args.conditioning_dropout_prob
  prompt_mask = prompt_mask.reshape(bsz, 1, 1)
  # Final text conditioning.
  null_conditioning = torch.zeros_like(encoder_hidden_states)
  encoder_hidden_states = torch.where(
      prompt_mask, null_conditioning.unsqueeze(1), encoder_hidden_states.unsqueeze(1))
  # Sample masks for the original images.
  image_mask_dtype = conditional_latents.dtype
  image_mask = 1 - (
      (random_p >= args.conditioning_dropout_prob).to(
          image_mask_dtype)
      * (random_p < 3 * args.conditioning_dropout_prob).to(image_mask_dtype)
  )
  image_mask = image_mask.reshape(bsz, 1, 1, 1)
  # Final image conditioning.
  conditional_latents = image_mask * conditional_latents

I wonder whether this is the official way of implementing classifier-free guidance for image conditions. If the dropout probability is 0.1 (the default), then:

  • with prob 0.1: first frame concat remains, first frame for cross attention is 0
  • with prob 0.1: first frame concat is 0, first frame for cross attention is 0
  • with prob 0.1: first frame concat is 0, first frame for cross attention remains
  • with prob 0.7: first frame concat remains, first frame for cross attention remains

Is this your intention?

Thank you
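
A quick way to sanity-check these four cases against the snippet above (a small sketch; the dropout probability of 0.1 is assumed):

import torch

p = 0.1
random_p = torch.rand(1_000_000)
drop_text = random_p < 2 * p                       # cross-attention condition zeroed
drop_image = (random_p >= p) & (random_p < 3 * p)  # concatenated image latents zeroed
for t, i in [(True, False), (True, True), (False, True), (False, False)]:
    frac = ((drop_text == t) & (drop_image == i)).float().mean().item()
    print(f"drop_text={t!s:5}  drop_image={i!s:5}  fraction≈{frac:.3f}")
# Expected fractions: ~0.1, ~0.1, ~0.1, ~0.7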

Hyperparameters configuration for training

Hi! Thank you so much for this helpful implementation!

I am interested in the configuration of the hyperparameters you use to finetune SVD. Do you just use the default values written in parse_args() (e.g., lr=1e-4, conditioning_dropout_prob=0.1, etc.)? I would deeply appreciate it if a training script could be provided.

How to support batch_size > 1?

Thanks for your amazing job!
At line 1017 of train_svd.py, the batch size is frozen to 1 (noise_aug_strength = cond_sigmas[0]  # TODO: support batch > 1). What should I do to support batch > 1?
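
Not an official fix, but one possible sketch is to keep the whole cond_sigmas vector instead of its first element and broadcast it per sample; the shapes, the sigma distribution, and the add_time_ids handling below are assumptions about how the rest of the script would need to change:

import torch

bsz, c, h, w = 4, 3, 320, 512
fps, motion_bucket_id = 7, 127
conditional_pixel_values = torch.randn(bsz, c, h, w)

# Instead of collapsing to cond_sigmas[0], keep one sigma per sample
# (log-normal here is only in the same spirit as the script's rand_log_normal).
cond_sigmas = torch.exp(torch.randn(bsz) * 0.5 - 3.0)
noise_aug_strength = cond_sigmas                       # shape [bsz], not a scalar

conditional_pixel_values = (
    torch.randn_like(conditional_pixel_values) * cond_sigmas[:, None, None, None]
    + conditional_pixel_values
)

# _get_add_time_ids would then need one [fps, motion_bucket_id, strength] row per sample:
add_time_ids = torch.tensor(
    [[fps, motion_bucket_id, s.item()] for s in noise_aug_strength]
)                                                      # shape [bsz, 3]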

Question about the forward pass

Hi,

While exploring diffusion models, I noticed the standard forward pass often uses the formula
$\alpha \cdot x + \sigma \cdot \epsilon$.
However, in your video diffusion model code, I saw a different approach:

sigmas = rand_log_normal(shape=[bsz,], loc=0.7, scale=1.6).to(latents.device)
noisy_latents = latents + noise * sigmas
inp_noisy_latents = noisy_latents / ((sigmas**2 + 1) ** 0.5)

You're sampling noise levels from a log-normal distribution and I'm curious about the reasoning behind this choice.
If there are any papers or references that guided this decision, could you share them?

Thanks for your insights!
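
For context: the log-normal σ proposal and the division by $\sqrt{\sigma^2 + 1}$ both match the EDM formulation (Karras et al., 2022, "Elucidating the Design Space of Diffusion-Based Generative Models"). There, noise is added as $x_\sigma = x_0 + \sigma \cdot \epsilon$ and the network input is rescaled by the preconditioner

$c_{\text{in}}(\sigma) = \dfrac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}$,

so with $\sigma_{\text{data}} = 1$ the input becomes $(x_0 + \sigma \cdot \epsilon) / \sqrt{\sigma^2 + 1}$, which is exactly the inp_noisy_latents line above. EDM also samples $\ln \sigma$ from a normal distribution, i.e. $\sigma$ is log-normal, which is what rand_log_normal implements here.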

Suggestions on loading video data

Thanks to the authors for sharing the training code of SVD; it works quite well. I just want to give some suggestions to accelerate data loading. First, if you have the original .mp4 (or other format) video data, using the decord library is much faster than loading video frames iteratively. Second, setting --num_workers to 0 largely improved loading speed for me (I am using two 40 GB A100s for training).

Just some small tips; thanks again to the authors.

Multi-GPU training error

Thank you for your work!
Single-GPU training works fine for me, but multi-GPU training raises the following error:
Traceback (most recent call last):
File "train_svd.py", line 1264, in
main()
File "train_svd.py", line 1045, in main
added_time_ids = _get_add_time_ids(
File "train_svd.py", line 949, in _get_add_time_ids
passed_add_embed_dim = unet.config.addition_time_embed_dim *
File "/.pt2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'config'

The command I used:

accelerate launch train_svd.py \
    --pretrained_model_name_or_path=stable-video-diffusion-img2vid-xt-1-1 \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=100 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=50 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=20 \
    --num_workers=0

Tensor size mismatch error

Traceback (most recent call last):
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\train_svd.py", line 1286, in <module>
    main()
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\train_svd.py", line 1114, in main
    model_pred = unet(
                 ^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\accelerate\utils\operations.py", line 817, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\accelerate\utils\operations.py", line 805, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\amp\autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\diffusers\models\unets\unet_spatio_temporal_condition.py", line 463, in forward
    sample = upsample_block(
             ^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\MyStuff\Programming\Python\AI\SVD_Xtend\.venv\Lib\site-packages\diffusers\models\unets\unet_3d_blocks.py", line 2351, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 1 in the list.

I get this error whenever I try to run the script. These are the arguments I use:

accelerate launch train_svd.py --pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt-1-1 --per_gpu_batch_size=2 --gradient_accumulation_steps=1 --max_train_steps=5000 --width=10 --height=10 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0 --seed=123 --mixed_precision="fp16" --validation_steps=200

(Using low resolutions to avoid out-of-VRAM issues while testing; the same issue happens at higher resolutions too. I was not able to run at the default resolution on my PC.)

Any idea what is causing the error?

Questions about the noise sampling.

Great work!
I am wondering:

  1. Why not use EDM noise sampling instead of the strategy from Simple Diffusion?
  2. Why use a fixed noise strength (0) on the condition image? I think the sampling expression is given in the SVD paper.

Out of Memory

How much GPU memory is used for a resolution of 512×320 and a batch size of 1?

Questions on text2video?

When I try to figure out how to adapt the framework for text2video synthesis, I found that the spatio-temporal UNet has 8 input channels, as shown in this snippet:


    @register_to_config
    def __init__(
        self,
        sample_size: Optional[int] = None,
        in_channels: int = 8,
        out_channels: int = 4,
        down_block_types: Tuple[str] = (

Then I checked the pipeline inference and found that the denoising input is actually a concatenation of the noisy latents and the image latents:


# Concatenate image_latents over channels dimention
latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)

My question is: how do we obtain the image_latents if we only use text as input when training a text2video model?
Have you made any recent progress on text2video?

Question about the new push for partial fine-tuning.

Hello! In the latest push, I noticed a modification for partial fine-tuning: only the parameters whose names contain "temporal_transformer_block" are trained. I wonder if there is any source explaining why we should do that. Thank you so much for your attention and participation.

Multi-gpu training

First off, thank you for providing the SVD training code.

I'm trying to train using multiple GPUs; are there any changes I need to make to the code and the launch command?
In the code, should I uncomment this part, and what else do I need to do?

    # ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        mixed_precision=args.mixed_precision,
        log_with=args.report_to,
        project_config=accelerator_project_config,
        # kwargs_handlers=[ddp_kwargs]
    )
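
For reference, one standard way to launch multi-GPU training with Accelerate (the flags below are generic Accelerate options, not a setup confirmed by the authors; uncommenting the ddp_kwargs line is typically only needed if DDP complains about unused parameters, e.g. when training a subset of the weights):

accelerate config          # interactively select multi-GPU and the number of processes
# or pass the options explicitly on the command line:
accelerate launch --multi_gpu --num_processes=2 train_svd.py \
    --pretrained_model_name_or_path=/path/to/weight \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --mixed_precision="fp16"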

How much memory is needed when training SVD

First of all, thank you very much for sharing the code. I would also like to ask how much memory is needed when training SVD, and what the minimum GPU memory configuration is.

DeepSpeed Accelerate Config

Hi,

Thanks for building this amazing repo.

Have you tried to use deepspeed? If so, is it possible for you to provide a config file?

Thanks!

Error with pretrained_model_name_or_path

Is this file supposed to be svd_xt_image_decoder.safetensors? Why do I get an error after setting the path? I configured pretrained_model_name_or_path="/SVD_Xtend-main/svd_xt_image_decoder.safetensors" and then got this error:
raise EnvironmentError(f"It looks like the config file at '{config_file}' is not a valid JSON file.")
OSError: It looks like the config file at '/SVD_Xtend-main/svd_xt_image_decoder.safetensors' is not a valid JSON file.
Is the file I downloaded not the pretrained weights?

text2video data

Hi @pixeli99,

Thank you for your nice work. I have a question about text2video data. Could you describe the dataset file structure for text2video training in more detail? Where should I put the image files and text files?

Thank you so much.

Questions about the noise sampling.

Thank you for sharing! This code helps me a lot.

When using this code to finetune SVD, I have some questions about the noise sampling. The noise sampling in this code is as follows:
sigmas = rand_cosine_interpolated(shape=[bsz,], image_d=image_d, noise_d_low=noise_d_low, noise_d_high=noise_d_high, sigma_data=sigma_data, min_value=min_value, max_value=max_value).to(latents.device)
sigmas = sigmas[:, None, None, None, None]
noisy_latents = latents + noise * sigmas

I want to know whether I can simply replace this with a diffusers noise scheduler such as DDPMScheduler.

Hope to get your help!

Support more conditional inputs, such as layout

Hello, I saw a "Support more conditional inputs, such as layout" in the to-do list. What does this mean. Is it that the first frame + the BB trajectory of the subsequent frames are given to guide video generation? When will it probably be online?

Exception: Could not find the transformer layer class to wrap in the model.

When trying to use Accelerate with FSDP, I get the error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/acidhax/dev/SVD_Xtend/train_svd.py", line 1255, in <module>
[rank0]:     main()
[rank0]:   File "/home/acidhax/dev/SVD_Xtend/train_svd.py", line 881, in main
[rank0]:     unet, optimizer, lr_scheduler, train_dataloader = accelerator.prepare(
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1292, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1293, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1169, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/accelerator.py", line 1443, in prepare_model
[rank0]:     self.state.fsdp_plugin.set_auto_wrap_policy(model)
[rank0]:   File "/home/acidhax/miniconda3/envs/training/lib/python3.10/site-packages/accelerate/utils/dataclasses.py", line 1182, in set_auto_wrap_policy
[rank0]:     raise Exception("Could not find the transformer layer class to wrap in the model.")
[rank0]: Exception: Could not find the transformer layer class to wrap in the model.

Failure to start training at high resolutions.

Hi there, thank you for this great reproduction. I also reproduced the training code based on the Stability AI codebase. When I start my training at high resolutions (e.g. 576x1024, 512x896), I found the model can only produce a sequence of blurry frames like the one below:
[blurred sample frames]
However, if I run it at a relatively lower resolution (e.g. 320x576, 448x768) with all other settings fixed, the sampling results are as good as the public results. In fact, I found that the terrible results occur when the video width exceeds 800 pixels, despite the official recommendation to run at 576x1024. It's possible that I mistakenly touched some code in my codebase, and I'm still working on this issue. I'm curious, have you ever attempted training at high resolutions? Have you encountered any similar problems?

Finetune SVD with discrete time noise scheduler

Hi, thanks for your great training code for SVD!

When I switch the default SVD sampler to a discrete-time noise scheduler, the generated video looks very bad.
May I ask which layers/parameters of SVD I should finetune to make it compatible with a discrete-time noise scheduler? Or is there a way to transform the SVD noisy_latents (which equal latents + sigma * noise, with large sigma values) into the corresponding noisy_latents (with smaller sigmas) used for training with discrete noise schedulers?

Thanks!

Error when I start the training

I get an error about the UNetSpatioTemporalConditionModel.
This is the Traceback:
Traceback (most recent call last):
File "/workspace/SVD_Xtend/train_svd.py", line 1255, in
main()
File "/workspace/SVD_Xtend/train_svd.py", line 1043, in main
added_time_ids = _get_add_time_ids(
File "/workspace/SVD_Xtend/train_svd.py", line 951, in _get_add_time_ids
expected_add_embed_dim = unet.module.add_embedding.linear_1.in_features
File "/opt/conda/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 218, in getattr
return super().getattr(name)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'UNetSpatioTemporalConditionModel' object has no attribute 'module'

Do you have any advice on text2video?

Hi,

Thanks for open-sourcing this great project.

I am curious about how to implement a text2video version of SVD. Given an input image and a prompt, how do we generate a video? Can I simply replace the encoder_hidden_states with the text embedding to finetune SVD?

Thanks!

How is BF16 supposed to be run? I get an error when running it; please help take a look.

Traceback (most recent call last):
File "train_svd.py", line 1262, in
main()
File "train_svd.py", line 1089, in main
model_pred = unet(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1833, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/workspace/swx689421/diffusers/src/diffusers/models/unets/unet_spatio_temporal_condition.py", line 409, in forward
emb = self.time_embedding(t_emb)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/workspace/swx689421/diffusers/src/diffusers/models/embeddings.py", line 228, in forward
sample = self.linear_1(sample)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype

Which BDD100K data is used for training?

Hi @pixeli99, thank you for your amazing work! I just read your code and am wondering which subset of the BDD100K data is used for your training (I have not found "bdd100k/images/track/mini" on their official website)?

AttributeError: 'UNetSpatioTemporalConditionModel' object has no attribute 'module'. Did you mean: 'modules'?

I got an error when I run with: accelerate launch train_svd.py --pretrained_model_name_or_path=/root/stable-video-diffusion-img2vid-xt --output_dir="model_out" --per_gpu_batch_size=1 --gradient_accumulation_steps=1 --max_train_steps=50000 --width=512 --height=320 --checkpointing_steps=1000 --checkpoints_total_limit=1 --learning_rate=1e-5 --lr_warmup_steps=0 --seed=123 --mixed_precision="fp16" --validation_steps=200

train_dataloader is <accelerate.data_loader.DataLoaderShard object at 0x7f4c101b5150>
Traceback (most recent call last):
File "/root/SVD_Xtend/train_svd.py", line 1246, in
main()
File "/root/SVD_Xtend/train_svd.py", line 1025, in main
added_time_ids = _get_add_time_ids(
File "/root/SVD_Xtend/train_svd.py", line 940, in _get_add_time_ids
passed_add_embed_dim = unet.module.config.addition_time_embed_dim *
File "/root/SVD_Xtend/venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 222, in getattr
return super().getattr(name)
File "/root/SVD_Xtend/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in getattr
raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
AttributeError: 'UNetSpatioTemporalConditionModel' object has no attribute 'module'. Did you mean: 'modules'?

About the step_loss == nan

Hello,
Thanks for your brilliant work!
When I run the code, I find that the step loss always equals NaN when I use the BDD dataset. After carefully checking the code, I found that the output of the last upsample_block becomes NaN. I just use the fp16 model and follow the pipeline.
Could anyone tell me what the reason is?

Thanks a lot!
