alibaba / animate-anything

Fine-Grained Open Domain Image Animation with Motion Guidance

Home Page: https://animationai.github.io/AnimateAnything/

License: MIT License

Python 99.38% Shell 0.05% Jupyter Notebook 0.57%
animation video-diffusion-model video-generation

animate-anything's Introduction

👉 AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance

Zuozhuo Dai, Zhenghao Zhang, Menghao Li, Junchao Liao, Siyu Zhu, Long Qin, Weizhi Wang


Showcases

video_demol.mp4

Input Image with Mask | Prompt | Result
(input image) | Barbie watching the camera with a smiling face. | (result video)
(input image) | The cloak swaying in the wind. | (result video)
(input image) | A red fish is swimming. | (result video)

Framework

(framework diagram)

News 🔥

2024.2.5: Support multi-GPU training with Accelerate and DeepSpeed. With DeepSpeed zero_stage 2 and offload_optimizer_device cpu, you can now fully finetune animate-anything on 4x16G V100 GPUs and SVD on 4x24G A10 GPUs.

2023.12.27: Support finetuning based on the SVD (Stable Video Diffusion) model. Release the SVD-based animate_anything_svd_v1.0.

2023.12.18: Update model to animate_anything_512_v1.02

Features Planned

  • 💥 Enhanced prompt-following: generate long, detailed captions using LLaVA.
  • 💥 Replace the U-Net with a Diffusion Transformer (DiT) as the base model.
  • 💥 Variable resolutions.
  • 💥 Support a Hugging Face demo / Google Colab.
  • etc.

Getting Started

This repository is based on Text-To-Video-Finetuning.

Create Conda Environment (Optional)

It is recommended to install Anaconda.

Windows Installation: https://docs.anaconda.com/anaconda/install/windows/

Linux Installation: https://docs.anaconda.com/anaconda/install/linux/

conda create -n animation python=3.10
conda activate animation

Python Requirements

pip install -r requirements.txt

Running inference

Please download the pretrained model to output/latent, then run the following command. Replace {download_model} with the name of the model you downloaded:

python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/barbie2.jpg validation_data.prompt='A cartoon girl is talking.'

To control the motion area, we use labelme to generate a binary mask. First, use labelme to draw a polygon over the region of the reference image that should move.

Then run the following command to convert the labelme JSON file into a mask:

labelme_json_to_dataset qingming2.json
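
Alternatively, here is a minimal Python sketch (not part of this repository; labelme_json_to_mask is a hypothetical helper) that rasterizes the polygons in a labelme JSON file into a binary mask, assuming the standard labelme JSON layout with shapes, imageWidth, and imageHeight fields, and assuming white (255) marks the region that is allowed to move:

import json
from PIL import Image, ImageDraw

def labelme_json_to_mask(json_path, out_path):
    # Load the labelme annotation (standard layout: "shapes", "imageWidth", "imageHeight").
    with open(json_path) as f:
        ann = json.load(f)
    # Start from an all-black (zero) mask at the original image size.
    mask = Image.new('L', (ann['imageWidth'], ann['imageHeight']), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann['shapes']:
        # Fill each annotated polygon with white; white marks the motion region.
        draw.polygon([tuple(p) for p in shape['points']], fill=255)
    mask.save(out_path)

labelme_json_to_mask('example/qingming2.json', 'example/qingming2_label.jpg')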

Then run the following command for inference:

python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/qingming2.jpg validation_data.prompt='Peoples are walking on the street.' validation_data.mask=example/qingming2_label.jpg 

Users can adjust the motion strength with the mask motion model:

python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/qingming2.jpg validation_data.prompt='Peoples are walking on the street.' validation_data.mask=example/qingming2_label.jpg validation_data.strength=5

Video super resolution

The model outputs low-resolution videos; you can use a video super-resolution model to produce high-resolution output. For example, we can use Real-CUGAN for cartoon-style video super resolution:

git clone https://github.com/bilibili/ailab.git
cd ailab/Real-CUGAN
python inference_video.py

Training

Using Captions

You can use caption files when training with videos. Simply place the videos into a folder and create a JSON file with captions like this:

[
      {"caption": "Cute monster character flat design animation video", "video": "000001_000050/1066697179.mp4"}, 
      {"caption": "Landscape of the cherry blossom", "video": "000001_000050/1066688836.mp4"}
]
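
If you build this caption JSON programmatically, here is a minimal sketch (illustrative only; the output file name and video paths are placeholders, and video paths are assumed to be relative to video_dir):

import json

# Hypothetical mapping from video path (relative to video_dir) to its caption.
captions = {
    "000001_000050/1066697179.mp4": "Cute monster character flat design animation video",
    "000001_000050/1066688836.mp4": "Landscape of the cherry blossom",
}

# Write the list-of-dicts layout shown above.
entries = [{"caption": caption, "video": video} for video, caption in captions.items()]
with open("40K.json", "w") as f:
    json.dump(entries, f, indent=2)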

Then in your config, make sure to set dataset_types to video_json and set the video_dir and video_json paths like this:

  - dataset_types: 
      - video_json
    train_data:
      video_dir: '/webvid/webvid/data/videos'
      video_json: '/webvid/webvid/data/40K.json'

Process Automatically

You can automatically caption the videos using the Video-BLIP2-Preprocessor Script and set the dataset_types and json_path like this:

  - dataset_types: 
      - video_blip
    train_data:
      json_path: 'blip_generated.json'

Configuration

The configuration uses a YAML config borrowed from the Tune-A-Video repository.

All configuration details are in example/train_mask_motion.yaml, and each parameter is documented with a description of what it does.

Finetuning animate-anything

You can finetune animate-anything with text, motion mask, and motion strength guidance on your own dataset. The following config requires around 30G of GPU RAM. You can reduce train_batch_size, train_data.width, train_data.height, and n_sample_frames in the config to lower the GPU RAM usage:

python train.py --config example/train_mask_motion.yaml pretrained_model_path=<download_model>

Finetune Stable Video Diffusion

The Stable Video Diffusion (SVD) img2vid model can generate high-resolution videos, but it does not support text or motion mask control. You can finetune SVD with motion mask guidance using the following command and the pretrained SVD model. This config requires around 80G of GPU RAM.

python train_svd.py --config example/train_svd_mask.yaml pretrained_model_path=<download_model>

If you only want to finetune SVD on your own dataset without motion mask control, please use the following config:

python train_svd.py --config example/train_svd.yaml pretrained_model_path=<svd_model>

Multiple GPUs training

We strongly recommend multi-GPU training with Accelerate, which greatly reduces the per-GPU VRAM requirement. First configure Accelerate with DeepSpeed; an example config is located in example/deepspeed.yaml.

Then replace the 'python train_xx.py ...' commands above with 'accelerate launch train_xx.py ...', for example:

accelerate launch --config_file example/deepspeed.yaml train_svd.py --config example/train_svd_mask.yaml pretrained_model_path=<download_model>

SVD video2video

We have released the finetuned vid2vid SVD model; you can try it via the Gradio UI.

Please download the vid2vid_SVD model and extract it to output/svd/{download_model} and then run the command:

python app_svd.py --config example/train_svd_v2v.yaml pretrained_model_path=output/svd/{download_model}

We provide several examples in the svd_video2video_examples directory.

Bibtex

Please cite this paper if you find the code useful for your research:

@misc{dai2023animateanything,
      title={AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance}, 
      author={Zuozhuo Dai and Zhenghao Zhang and Yao Yao and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      year={2023},
      eprint={2311.12886},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


animate-anything's People

Contributors

dailingx, daizuozhuo, eltociear, leojc, sculmh


animate-anything's Issues

question about some implementation details

I compared your implementations of 'train_svd.py' and 'train.py' and found several interesting points:

https://github.com/alibaba/animate-anything/blob/main/train_svd.py#L441
Could you explain the difference between latent_dist.mode() and latent_dist.sample()?
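
For context, here is a minimal standalone sketch of how these two calls differ in diffusers (illustrative only, not code from this repository; the VAE checkpoint name is just an example):

import torch
from diffusers import AutoencoderKL

# The VAE encoder returns a diagonal Gaussian over latents:
# .mode() deterministically returns the distribution mean, while
# .sample() draws mean + std * noise, so it changes from run to run.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
image = torch.randn(1, 3, 256, 256)  # placeholder input batch
with torch.no_grad():
    dist = vae.encode(image).latent_dist
    z_mode = dist.mode()      # deterministic latent
    z_sample = dist.sample()  # stochastic latent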

https://github.com/alibaba/animate-anything/blob/main/train_svd.py#L433
Why not pass the vae and image_encoder as arguments into this function? Is there any difference between passing them in as arguments and taking them from the pipeline?

Finetune SVD with webvid

Thanks for your contribution!
I want to finetune with WebVid-2M, but I found that the format of WebVid does not match your scripts.
How should I process WebVid?

How to run inference with the trained SVD model

I use python train_svd.py --config output/svd/animate_anything_svd_v1.0/config.yaml --eval validation_data.prompt_image=example/XX.png validation_data.prompt='XXXX.'

I got the error: ValueError: The provided pretrained_model_name_or_path "output/svd/animate_anyhting_svd_v1.0" is neither a valid local path nor a valid repo id. Please check the parameter.

How to config lora.yaml

When I use the configuration 'lora_training_config.yaml' or 'stable_lora_config.yaml', I get this error: File "/animate-anything/train.py", line 345, in handle_trainable_modules for tm in tuple(trainable_modules): TypeError: 'NoneType' object is not iterable

Could you show a correct example config for LoRA?

Finetune SVD with LoRA?

Can your code train_svd.py be used to finetune SVD with LoRA? I changed your .yaml file for LoRA fine-tuning but found many bugs. I need your help.

Small range of motion

thanks for your great work!

It seems that the generated videos all have a small motion range, whether the strength is large or not. What do you think is the reason for this? In principle, training could also capture relatively large movements; is it because of the motion loss or the motion module?

How to setup dataset

I cannot figure out how to set up my training data and how to adjust train_svd.yaml. Since captions are not needed for SVD, I only have a dataset of videos.

videos
|----- video1.mp4
|----- video2.mp4

With this layout I have a videos folder, but setting video_dir to "/videos" and "json_path" to "blip_generated.json" does not work. I get this error:

Traceback (most recent call last):
  File "D:\MyStuff\Programming\Python\AI\animate-anything\utils\dataset.py", line 182, in load_from_json
    with open(path) as jpath:
         ^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'blip_generated.json'
Non-existant JSON path. Skipping.
Traceback (most recent call last):
  File "D:\MyStuff\Programming\Python\AI\animate-anything\train_svd.py", line 841, in <module>
    main(**args_dict)
  File "D:\MyStuff\Programming\Python\AI\animate-anything\train_svd.py", line 585, in main
    train_dataloader = torch.utils.data.DataLoader(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\MyStuff\Programming\Python\AI\animate-anything\animate_anything\Lib\site-packages\torch\utils\data\dataloader.py", line 351, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\MyStuff\Programming\Python\AI\animate-anything\animate_anything\Lib\site-packages\torch\utils\data\sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

Error when loading the VAE model

Hello, the following error occurs during inference. Could it be caused by a mismatched VAE model? Thanks.
Traceback (most recent call last):
File "/home/data/liboxian/animate-anything/train.py", line 1167, in
main_eval(**args_dict)
File "/home/data/liboxian/animate-anything/train.py", line 1134, in main_eval
noise_scheduler, tokenizer, text_encoder, vae, unet = load_primary_models(pretrained_model_path, motion_mask, motion_strength)
File "/home/data/liboxian/animate-anything/train.py", line 134, in load_primary_models
vae = AutoencoderKL.from_pretrained(pretrained_model_path, subfolder="vae")
File "/home/liboxian/miniconda3/envs/sd/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 583, in from_pretrained
raise ValueError(
ValueError: Cannot load <class 'diffusers.models.autoencoder_kl.AutoencoderKL'> from output/latent/animate_anything_512_v1.01 because the following keys are missing:
decoder.mid_block.attentions.0.proj_attn.bias, decoder.mid_block.attentions.0.proj_attn.weight, encoder.mid_block.attentions.0.query.weight, encoder.mid_block.attentions.0.query.bias, decoder.mid_block.attentions.0.key.bias, decoder.mid_block.attentions.0.query.weight, decoder.mid_block.attentions.0.key.weight, decoder.mid_block.attentions.0.value.weight, encoder.mid_block.attentions.0.key.bias, encoder.mid_block.attentions.0.proj_attn.bias, decoder.mid_block.attentions.0.query.bias, decoder.mid_block.attentions.0.value.bias, encoder.mid_block.attentions.0.value.weight, encoder.mid_block.attentions.0.key.weight, encoder.mid_block.attentions.0.proj_attn.weight, encoder.mid_block.attentions.0.value.bias.

Encountering "Tensor Shape Mismatch" Error during Training

Thank you for your contributions to the aigc community. I've encountered an issue while training using the train_mask_motion.yaml configuration file. I modified the training and testing datasets in the configuration file and initiated training using the command:
python train.py --config ./example/train_mask_motion.yaml
However, after training for 32 iterations, I encountered the following error in this code block (error screenshot omitted):

if mode == 'attn':
    def custom_forward(
        hidden_states,
        encoder_hidden_states=None,
        cross_attention_kwargs=None
    ):
        inputs = module(
            hidden_states,
            encoder_hidden_states,
            cross_attention_kwargs
        )
        return inputs

I find it puzzling why a tensor shape mismatch error is occurring midway through training. I would appreciate any insights or guidance you can provide to help me understand and resolve this issue.

Thank you once again for your assistance!

animate_anything_svd_v1.0

Good project.

Is the animate_anything_svd_v1.0 model fine-tuned on your 20k data based on the SVD pre-trained model?
I am curious about which modules of the model are fine-tuned.

This problem?

File "C:\Users\xxx\anaconda3\envs\animation\lib\site-packages\yaml\scanner.py", line 1238, in scan_flow_scalar_spaces
raise ScannerError("while scanning a quoted scalar", start_mark,
yaml.scanner.ScannerError: while scanning a quoted scalar
in "", line 1, column 1:
'A
^
found unexpected end of stream
in "", line 1, column 3:
'A
^

train_svd.py error

Hello, I ran into the following problem while fine-tuning. How can I solve it?
The command I ran is:
accelerate launch --config_file example/deepspeed.yaml train_svd.py --config example/train_svd_mask.yaml pretrained_model_path=output/latent/animate_anything_512_v1.02

The pretrained_model_path model comes from the README.
When fine-tuning, I get this error:
File "/animate-anything/train_svd.py", line 89, in load_primary_models
pipeline = StableVideoDiffusionPipeline.from_pretrained(pretrained_model_path)
File "/home/admin/anaconda3/envs/animation/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 1352, in from_pretrained
raise ValueError(
ValueError: Pipeline <class 'diffusers.pipelines.stable_video_diffusion.pipeline_stable_video_diffusion.StableVideoDiffusionPipeline'> expected {'scheduler', 'unet', 'image_encoder', 'vae', 'feature_extractor'}, but only {'unet', 'scheduler', 'vae'} were passed.

I am really looking forward to your reply.

Inference issue II

running CLI python train.py --config ./animate_anything_512_v1.02/config.yaml --eval validation_data.prompt_image=example/barbie2.jpg validation_data.prompt='A cartoon girl is talking.'

File "train.py", line 1134, in
main_eval(**args_dict)
File "train.py", line 1121, in main_eval
batch_eval(unet, text_encoder, vae, vae_processor, lora_manager, pretrained_model_path,
File "train.py", line 1081, in batch_eval
precision = eval(pipeline, vae_processor,
File "train.py", line 1035, in eval
imageio.mimwrite(out_file, video_frames, fps=validation_data.get('fps', 8))
File "/home/kka/.local/lib/python3.8/site-packages/imageio/v2.py", line 495, in mimwrite
return file.write(ims, is_batch=True, **kwargs)
File "/home/kka/.local/lib/python3.8/site-packages/imageio/plugins/pyav.py", line 632, in write
self.init_video_stream(codec, fps=fps, pixel_format=out_pixel_format)
File "/home/kka/.local/lib/python3.8/site-packages/imageio/plugins/pyav.py", line 836, in init_video_stream
stream = self._container.add_stream(codec, fps)
File "av/container/output.pyx", line 61, in av.container.output.OutputContainer.add_stream
ValueError: needs one of codec_name or template

I checked the package requirements; requirements.txt places no version constraint on imageio.
Is something wrong?

Finetuning animate-anything VRAM issue

I tried to finetune the animate-anything model with a single sample video preprocessed with the Video-BLIP2-Preprocessor script, using the example/train_mask_motion.yaml config file and a single A6000 GPU (48GB), but it keeps hitting a CUDA out-of-memory error, trying to allocate more than 48GB.

Then I tried to reduce the resolution in the config file from 512x512 to 256x256, but it reports a tensor size mismatch while forwarding through unet_3d_blocks.

It would be nice if you could share the config file and the GPU you used to finetune animate-anything with around 30GB of VRAM, as written in the README, and what we need to do if we want to downscale the input resolution.

Question about SVD finetuning result

Hi! Thanks for your cool work.
I'm curious about your SVD finetuning results. Are the current demos generated by the SVD-finetuned model or not?
Is it possible for you to show animation results from SVD finetuning?
Thanks!

Missing train_mask_8fps/checkpoint-7500

Hi, I tried to run inference with instructions from README but I got this error:

OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like output/latent/train_mask_8fps/checkpoint-7500 is not the path to a directory containing a scheduler_config.json file.

It seems that the config file mask_motion_v1/config.yaml requires pretrained_model_path: output/latent/train_mask_8fps/checkpoint-7500, which is not included.

How is LORA used in fine-tuning?

Hi there,

First, thank you so much for this awesome repo! I have a question about LoRA usage in the fine-tuning stage. Even if I choose "all" parameters to fine-tune in the config file, it prints that only around 10k parameters are tuned. Are these parameters selected by LoRA? Suppose I am tuning on a very different video modality; is it possible to tune more parameters? Thanks in advance!

ValueError: num_samples should be a positive integer value, but got num_samples=0

I get an error when running python train_svd.py --config example/train_svd.yaml pretrained_model_path=/root/sspaas-fs/stable-video-diffusion-img2vid-xt.

16 Attention layers using Scaled Dot Product Attention.
dataset: video_blip
Non-existant JSON path. Skipping.
train_datasets: <utils.dataset.VideoBLIPDataset object at 0x7fe54c3e3ca0>
Traceback (most recent call last):
  File "train_svd.py", line 1007, in <module>
    main(**args_dict)
  File "train_svd.py", line 701, in main
    train_dataloader = torch.utils.data.DataLoader(
  File "/root/sspaas-fs/animate-anything/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 351, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/root/sspaas-fs/animate-anything/venv/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 107, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

My video dataset is set up, and I don't know what could be causing this error.

VRAM Requirements to Fine-tune SVD

Hi,
Found your repo very useful.

I was wondering whether you know how much VRAM is necessary to train/fine-tune SVD in the normal setting (without mask)?

Thank you,
Miguel

Inference issue

running CLI python train.py --config ./animate_anything_512_v1.02/config.yaml --eval validation_data.prompt_image=example/barbie2.jpg validation_data.prompt='A cartoon girl is talking.'

Traceback (most recent call last):
File "train.py", line 1134, in
main_eval(**args_dict)
File "train.py", line 1121, in main_eval
batch_eval(unet, text_encoder, vae, vae_processor, lora_manager, pretrained_model_path,
File "train.py", line 1081, in batch_eval
precision = eval(pipeline, vae_processor,
File "train.py", line 1020, in eval
video_frames, video_latents = pipeline(
File "/home/kka/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/kka/AI/animate-anything/models/pipeline.py", line 205, in call
video = tensor2vid(video_tensor)
TypeError: tensor2vid() missing 1 required positional argument: 'processor'

It looks like an argument is missing, and I can't find it in pipeline.py. Is this a coding error or something else?

How to prepare the training dataset?

Thanks for your amazing work and for sharing the code with us! It's truly impressive and much appreciated.

here I have some questions regarding the code:

For the train_svd_mask.yaml,

  1. Regarding the step count of 20,000, could you advise on the recommended number of GPUs? I am concerned that 20,000 steps might not cover the entire 100k data samples that the svd_mask model used.

For the dataset, the paper mentions random sampling of 20k from HD-VILA-100M. Would you share more information about:

  1. When training on a new dataset with svd_mask, are there specific preprocessing requirements such as resolution standards or data filtering based on motion thresholds?
  2. And which parameters require adjustment to ensure the model adapts appropriately?

how to explain the mask loss?

Hi, I notice the finetune loss has been changed.
For the mask loss: loss += F.mse_loss(predict_x0*(1-mask), condition_latent*(1-mask))
Could you please tell me how this loss controls the motion region?
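
For reference, here is a small runnable restatement of that term (interpretation only; it assumes predict_x0 is the predicted clean latent, condition_latent is the latent of the conditioning first frame, and mask is 1 inside the intended motion region):

import torch
import torch.nn.functional as F

# Toy shapes: batch of 1, 4 latent channels, an 8x8 latent grid (illustrative only).
predict_x0 = torch.randn(1, 4, 8, 8)        # predicted clean latent
condition_latent = torch.randn(1, 4, 8, 8)  # latent of the conditioning first frame
mask = torch.zeros(1, 1, 8, 8)              # 1 inside the motion region
mask[..., 2:6, 2:6] = 1.0

# Outside the mask (where 1 - mask == 1) the prediction is pulled toward the
# first-frame latent, i.e. penalized for changing; inside the mask the term is
# zero, so only the masked region is free to move.
mask_loss = F.mse_loss(predict_x0 * (1 - mask), condition_latent * (1 - mask))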

error: "AttributeError: 'DistributedDataParallel' object has no attribute 'enable_gradient_checkpointing'"

thanks for your great work!

I encountered an error while executing the command python train_svd.py --config example/train_svd.yaml pretrained_model_path=stabilityai/stable-video-diffusion-img2vid/.

error detail:
Traceback (most recent call last):
File "/checkpoint/binary/train_package/./train_svd.py", line 840, in
main(**args_dict)
File "/checkpoint/binary/train_package/./train_svd.py", line 599, in main
unet_and_text_g_c(
File "/checkpoint/binary/train_package/./train_svd.py", line 107, in unet_and_text_g_c
unet.enable_gradient_checkpointing()
File "/root/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DistributedDataParallel' object has no attribute 'enable_gradient_checkpointing'

I want to know how to solve this problem. This problem has been bothering me for a long time.

RuntimeError while running train_svd

Thank you for sharing the SVD training code. I am trying to run it using a single video/caption pair. I have reduced num_frames in the train_svd.yaml file to 2 in order to save VRAM. However, I am getting the following tensor mismatch. Any idea what is going wrong?

python train_svd.py --config example/train_svd.yaml pretrained_model_path=stabilityai/stable-video-diffusion-img2vid-xt
Initializing the conversion map
animation/lib/python3.10/site-packages/accelerate/accelerator.py:371: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
01/05/2024 13:02:43 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

Loading pipeline components...:   0%| 0/5 [00:00<?, ?it/s]
Loaded feature_extractor as CLIPImageProcessor from `feature_extractor` subfolder of stabilityai/stable-video-diffusion-img2vid-xt.
Loaded vae as AutoencoderKLTemporalDecoder from `vae` subfolder of stabilityai/stable-video-diffusion-img2vid-xt.
Loaded unet as UNetSpatioTemporalConditionModel from `unet` subfolder of stabilityai/stable-video-diffusion-img2vid-xt.
Loaded image_encoder as CLIPVisionModelWithProjection from `image_encoder` subfolder of stabilityai/stable-video-diffusion-img2vid-xt.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-video-diffusion-img2vid-xt.
Loading pipeline components...: 100%| 5/5 [00:00<00:00,  5.15it/s]
16 Attention layers using Scaled Dot Product Attention.
01/05/2024 13:02:49 - INFO - __main__ - ***** Running training *****
01/05/2024 13:02:49 - INFO - __main__ -   Num examples = 192
01/05/2024 13:02:49 - INFO - __main__ -   Num Epochs = 53
01/05/2024 13:02:49 - INFO - __main__ -   Instantaneous batch size per device = 1
01/05/2024 13:02:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
01/05/2024 13:02:49 - INFO - __main__ -   Gradient Accumulation steps = 1
01/05/2024 13:02:49 - INFO - __main__ -   Total optimization steps = 10000
Steps:   0%|                                                                                  | 0/10000 [00:00<?, ?it/s]1428 params have been unfrozen for training.
Traceback (most recent call last):
  File "animate-anything/train_svd.py", line 1006, in <module>
    main(**args_dict)
  File "animate-anything/train_svd.py", line 805, in main
    loss = finetune_unet(pipeline, batch, use_offset_noise, 
  File "animate-anything/train_svd.py", line 503, in finetune_unet
    model_pred = unet(input_latents, c_noise.reshape([bsz]), encoder_hidden_states=encoder_hidden_states, 
  File "animation/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "animation/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "animation/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "animation/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "animation/lib/python3.10/site-packages/diffusers/models/unet_spatio_temporal_condition.py", line 463, in forward
    sample = upsample_block(
  File "animation/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "animation/lib/python3.10/site-packages/diffusers/models/unet_3d_blocks.py", line 2351, in forward
    hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 136 but got size 135 for tensor number 1 in the list.

Question about the training data

Thanks for providing animate_anything_svd_v1.0 model.

May I ask how many videos were used to finetune the SVD model to obtain animate_anything_svd_v1.0?
Also, is the difference between the svd and svd_mask models the conv_in module, where the input dimension goes from 8 to 9?

Set-of-Masks for Multiple Motion Guidance

Hi,

Thanks for a great repo. When I read your paper, I see that the model can generate multiple motions from multiple masks. In the demo code, I see that you combine all motions into one mask and one prompt.

Is this the official setting, or can we give the model multiple masks and prompts for a clearer description? I mean something like Set-of-Mark prompting.

Create a loop

Hi,
Thank you for making this public. Is there any way to make a video that loops seamlessly, where the first frame follows the last one?

Running python train.py with example/train_mask_motion.yaml, I get the warning

Some weights of UNet3DConditionModel were not initialized from the model checkpoint at output/latent/animate_anything_512_v1.02 and are newly initialized because the shapes did not match:

  • conv_in.weight: found shape torch.Size([320, 4, 3, 3]) in the checkpoint and torch.Size([320, 5, 3, 3]) in the model instantiated

I want to know whether this indicates a problem.

how to control motion magnitude

In train.py, I noticed that for motion control the motion magnitude is computed in both RGB space (batch["motion"]) and latent space (latent_motion = calculate_latent_motion_score(latents)), but only the latter is used in the UNet prediction. Could you explain why the former is not used and only the latter is?
