
showlab / motiondirector


MotionDirector: Motion Customization of Text-to-Video Diffusion Models.

Home Page: https://showlab.github.io/MotionDirector/

License: Apache License 2.0

Languages: Python 100.00%
Topics: diffusion-models, text-to-motion, text-to-video, text-to-video-generation, video-generation, motion-customization

motiondirector's Introduction

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

Rui Zhao · Yuchao Gu · Jay Zhangjie Wu · David Junhao Zhang · Jia-Wei Liu · Weijia Wu · Jussi Keppo · Mike Zheng Shou


Show Lab, National University of Singapore   |   Zhejiang University


MotionDirector can customize text-to-video diffusion models to generate videos with desired motions.

Task Definition

Motion Customization of Text-to-Video Diffusion Models:
Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate diverse videos with this motion.

Demos

Demo Video:

Demo Video of MotionDirector

Customize both Appearance and Motion:

Reference images for appearance customization: "A Terracotta Warrior on a pure color background."
Videos generated by MotionDirector:
  • "A Terracotta Warrior is riding a horse through an ancient battlefield." (seed: 1455028)
  • "A Terracotta Warrior is playing golf in front of the Great Wall." (seed: 5804477)
  • "A Terracotta Warrior is walking cross the ancient army captured with a reverse follow cinematic shot." (seed: 653658)

Reference videos for motion customization: "A person is riding a bicycle."
Videos generated by MotionDirector:
  • "A Terracotta Warrior is riding a bicycle past an ancient Chinese palace." (seed: 166357)
  • "A Terracotta Warrior is lifting weights in front of the Great Wall." (seed: 5635982)
  • "A Terracotta Warrior is skateboarding." (seed: 9033688)

News

ToDo

  • Gradio Demo
  • More trained weights of MotionDirector

Model List

  • MotionDirector for Sports — Training data: multiple videos per model. Learns motion concepts of sports, e.g. lifting weights, riding a horse, playing golf. Link
  • MotionDirector for Cinematic Shots — Training data: a single video per model. Learns motion concepts of cinematic shots, e.g. dolly zoom, zoom in, zoom out. Link
  • MotionDirector for Image Animation — Training data: a single image for the spatial path, and a single video or multiple videos for the temporal path. Animates the given image with learned motions. Link
  • MotionDirector with Customized Appearance — Training data: a single image or multiple images for the spatial path, and a single video or multiple videos for the temporal path. Customizes both appearance and motion in video generation. Link

Setup

Requirements

# create virtual environment
conda create -n motiondirector python=3.8
conda activate motiondirector
# install packages
pip install -r requirements.txt

Weights of Foundation Models

git lfs install
## You can choose the ModelScopeT2V or ZeroScope, etc., as the foundation model.
## ZeroScope
git clone https://huggingface.co/cerspense/zeroscope_v2_576w ./models/zeroscope_v2_576w/
## ModelScopeT2V
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope/

Weights of trained MotionDirector

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ruizhaocv/MotionDirector_weights ./outputs

# More and better-trained MotionDirector weights are released in a new repo:
git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs
# The usage is slightly different; the documentation will be updated later.

Usage

Training

Train MotionDirector on multiple videos:

python MotionDirector_train.py --config ./configs/config_multi_videos.yaml

Train MotionDirector on a single video:

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Note:

  • Before running the above commands, make sure you replace the paths to the foundation model weights and the training data with your own in the config files config_multi_videos.yaml or config_single_video.yaml (a field-level sketch follows this list).
  • Training on multiple 16-frame videos usually takes 300~500 steps, about 9~16 minutes on one A5000 GPU. Training on a single video takes 50~150 steps, about 1.5~4.5 minutes on one A5000 GPU. Training requires around 14 GB of VRAM.
  • Reduce n_sample_frames if your GPU memory is limited.
  • Reduce the learning rate and increase the training steps for better performance.
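
For reference, the config fields you will most likely need to edit look roughly like the sketch below. This is a minimal, illustrative excerpt: the authoritative key names and defaults are defined in configs/config_multi_videos.yaml and config_single_video.yaml, so treat the exact names and values here as assumptions.

# Illustrative excerpt of a training config (key names are assumptions; check the shipped YAML files)
pretrained_model_path: "./models/zeroscope_v2_576w/"   # path to the foundation model weights
output_dir: "./outputs/train"
train_data:
  path: "./test_data/your_motion_videos/"              # hypothetical folder with your training video(s)
  n_sample_frames: 16                                  # reduce if GPU memory is limited
learning_rate: 5.0e-4                                  # lower it and train longer for better results
max_train_steps: 500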

Inference

python MotionDirector_inference.py --model /path/to/the/foundation/model  --prompt "Your prompt" --checkpoint_folder /path/to/the/trained/MotionDirector --checkpoint_index 300 --noise_prior 0.

Note:

  • Replace /path/to/the/foundation/model with your own path to the foundation model, such as ZeroScope.
  • The value of checkpoint_index selects the checkpoint saved at that training step.
  • The value of noise_prior controls how much the inversion noise of the reference video affects the generation. We recommend setting it to 0 for a MotionDirector trained on multiple videos, to achieve the most diverse generation, and to 0.1~0.5 for a MotionDirector trained on a single video, for faster convergence and closer alignment with the reference video (see the illustrative sketch after this list).
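
To build intuition for noise_prior, one way such a blend can work is to mix the DDIM-inversion noise of the reference video with freshly sampled Gaussian noise before denoising. The snippet below is only an illustrative sketch under that assumption; the actual blending logic lives in MotionDirector_inference.py and may differ in detail.

import torch

def blend_noise(inversion_noise: torch.Tensor, noise_prior: float) -> torch.Tensor:
    # Illustrative only: mix reference-video inversion noise with fresh Gaussian noise.
    # Square-root weights keep the blend at unit variance when both inputs are
    # independent and unit-variance.
    fresh_noise = torch.randn_like(inversion_noise)
    return noise_prior ** 0.5 * inversion_noise + (1.0 - noise_prior) ** 0.5 * fresh_noise

# noise_prior = 0.0 -> pure random noise (most diverse generation);
# noise_prior = 1.0 -> pure inversion noise (closest to the reference motion).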

Inference with pre-trained MotionDirector

All available weights are at the official Huggingface Repo. Run the download command above to fetch the weights into the outputs folder, then run the following inference commands to generate videos.

MotionDirector trained on multiple videos:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A person is riding a bicycle past the Eiffel Tower." --checkpoint_folder ./outputs/train/riding_bicycle/ --checkpoint_index 300 --noise_prior 0. --seed 7192280

Note:

  • Replace /path/to/the/ZeroScope with your own path to the foundation model, i.e. ZeroScope.
  • Change the prompt to generate different videos.
  • The seed is set to a random value by default. Setting it to a specific value reproduces the results given in the table below.

Results:

Reference video: "A person is riding a bicycle."
Videos generated by MotionDirector:
  • "A person is riding a bicycle past the Eiffel Tower." (seed: 7192280)
  • "A panda is riding a bicycle in a garden." (seed: 2178639)
  • "An alien is riding a bicycle on Mars." (seed: 2390886)

MotionDirector trained on a single video:

16 frames:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A tank is running on the moon." --checkpoint_folder ./outputs/train/car_16/ --checkpoint_index 150 --noise_prior 0.5 --seed 8551187
Reference video: "A car is running on the road."
Videos generated by MotionDirector:
  • "A tank is running on the moon." (seed: 8551187)
  • "A lion is running past the pyramids." (seed: 431554)
  • "A spaceship is flying past Mars." (seed: 8808231)

24 frames:

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A truck is running past the Arc de Triomphe." --checkpoint_folder ./outputs/train/car_24/ --checkpoint_index 150 --noise_prior 0.5 --width 576 --height 320 --num-frames 24 --seed 34543
Reference video: "A car is running on the road."
Videos generated by MotionDirector:
  • "A truck is running past the Arc de Triomphe." (seed: 34543)
  • "An elephant is running in a forest." (seed: 2171736)
  • "A person on a camel is running past the pyramids." (seed: 4904126)
  • "A spacecraft is flying past the Milky Way galaxy." (seed: 3235677)

MotionDirector for Sports

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A panda is lifting weights in a garden." --checkpoint_folder ./outputs/train/lifting_weights/ --checkpoint_index 300 --noise_prior 0. --seed 9365597
Videos generated by MotionDirector:
  • Lifting Weights: "A panda is lifting weights in a garden." (seed: 1699276); "A police officer is lifting weights in front of the police station." (seed: 6804745)
  • Riding Bicycle: "A panda is riding a bicycle in a garden." (seed: 2178639); "An alien is riding a bicycle on Mars." (seed: 2390886)
  • Riding Horse: "A knight riding on horseback passing by a castle." (seed: 6491893); "A man riding an elephant through the jungle." (seed: 6230765); "A girl riding a unicorn galloping under the moonlight." (seed: 6940542); "An adventurer riding a dinosaur exploring through the rainforest." (seed: 6972276)
  • Skateboarding: "A robot is skateboarding in a cyberpunk city." (seed: 1020673); "A teddy bear skateboarding in Times Square New York." (seed: 3306353)
  • Playing Golf: "A man is playing golf in front of the White House." (seed: 8870450); "A monkey is playing golf on a field full of flowers." (seed: 2989633)

More sports, to be continued ...

MotionDirector for Cinematic Shots

1. Zoom

1.1 Dolly Zoom (Hitchcockian Zoom)

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A firefighter standing in front of a burning forest captured with a dolly zoom." --checkpoint_folder ./outputs/train/dolly_zoom/ --checkpoint_index 150 --noise_prior 0.5 --seed 9365597
Reference video: "A man standing in room captured with a dolly zoom."
Videos generated by MotionDirector:
  • "A firefighter standing in front of a burning forest captured with a dolly zoom." (seed: 9365597, noise_prior: 0.5)
  • "A lion sitting on top of a cliff captured with a dolly zoom." (seed: 1675932, noise_prior: 0.5)
  • "A Roman soldier standing in front of the Colosseum captured with a dolly zoom." (seed: 2310805, noise_prior: 0.5)
  • "A firefighter standing in front of a burning forest captured with a dolly zoom." (seed: 4615820, noise_prior: 0.3)
  • "A lion sitting on top of a cliff captured with a dolly zoom." (seed: 4114896, noise_prior: 0.3)
  • "A Roman soldier standing in front of the Colosseum captured with a dolly zoom." (seed: 7492004)

1.2 Zoom In

The reference video was shot with my own water cup. You can also pick up your cup, or any other object, to practice camera movements and turn the footage into imaginative videos. Create your own AI films with customized camera movements!

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A firefighter standing in front of a burning forest captured with a zoom in." --checkpoint_folder ./outputs/train/zoom_in/ --checkpoint_index 150 --noise_prior 0.3 --seed 1429227
Reference video: "A cup in a lab captured with a zoom in."
Videos generated by MotionDirector:
  • "A firefighter standing in front of a burning forest captured with a zoom in." (seed: 1429227)
  • "A lion sitting on top of a cliff captured with a zoom in." (seed: 487239)
  • "A Roman soldier standing in front of the Colosseum captured with a zoom in." (seed: 1393184)

1.3 Zoom Out

python MotionDirector_inference.py --model /path/to/the/ZeroScope  --prompt "A firefighter standing in front of a burning forest captured with a zoom out." --checkpoint_folder ./outputs/train/zoom_out/ --checkpoint_index 150 --noise_prior 0.3 --seed 4971910
Reference video: "A cup in a lab captured with a zoom out."
Videos generated by MotionDirector:
  • "A firefighter standing in front of a burning forest captured with a zoom out." (seed: 4971910)
  • "A lion sitting on top of a cliff captured with a zoom out." (seed: 1767994)
  • "A Roman soldier standing in front of the Colosseum captured with a zoom out." (seed: 8203639)

2. Advanced Cinematic Shots

  • Follow: "A fireman is walking through fire captured with a follow cinematic shot." (seed: 4926511); "A spaceman is walking on the moon with a follow cinematic shot." (seed: 7594623)
  • Reverse Follow: "A fireman is walking through fire captured with a reverse follow cinematic shot." (seed: 9759630); "A spaceman walking on the moon captured with a reverse follow cinematic shot." (seed: 4539309)
  • Chest Transition: "A fireman is walking through the burning forest captured with a chest transition cinematic shot." (seed: 5236349); "An ancient Roman soldier walks through the crowd on the street captured with a chest transition cinematic shot." (seed: 3982271)
  • Mini Jib Reveal (Foot-to-Head Shot): "An ancient Roman soldier walks through the crowd on the street captured with a mini jib reveal cinematic shot." (seed: 654178); "A British Redcoat soldier is walking through the mountains captured with a mini jib reveal cinematic shot." (seed: 566917)
  • Pull Back (Subject Enters from the Left): "A robot looks at a distant cyberpunk city captured with a pull back cinematic shot." (seed: 9342597); "A woman looks at a distant erupting volcano captured with a pull back cinematic shot." (seed: 4197508)
  • Orbit: "A fireman in the burning forest captured with an orbit cinematic shot." (seed: 8450300); "A spaceman on the moon captured with an orbit cinematic shot." (seed: 5899496)

More Cinematic Shots, to be continued ....

MotionDirector for Image Animation

Train

Train the spatial path with the reference image.

python MotionDirector_train.py --config ./configs/config_single_image.yaml

Then train the temporal path to learn the motion in the reference video.

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Inference

Inference with the spatial path learned from the reference image and the temporal path learned from the reference video.

python MotionDirector_inference_multi.py --model /path/to/the/foundation/model  --prompt "Your prompt" --spatial_path_folder /path/to/the/trained/MotionDirector/spatial/lora/ --temporal_path_folder /path/to/the/trained/MotionDirector/temporal/lora/ --noise_prior 0.

Example

Download the pre-trained weights.

git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs

Run the following command.

python MotionDirector_inference_multi.py --model /path/to/the/ZeroScope  --prompt "A car is running on the road." --spatial_path_folder ./outputs/train/image_animation/train_2023-12-26T14-37-16/checkpoint-300/spatial/lora/ --temporal_path_folder ./outputs/train/image_animation/train_2023-12-26T13-08-20/checkpoint-300/temporal/lora/ --noise_prior 0.5 --seed 5057764
Reference image: "A car is running on the road."
Reference video: "A car is running on the road."
Videos generated by MotionDirector:
  • "A car is running on the road." (seed: 5057764)
  • "A car is running on the road covered with snow." (seed: 4904543)

MotionDirector with Customized Appearance

Train

Train the spatial path with reference images.

python MotionDirector_train.py --config ./configs/config_multi_images.yaml

Then train the temporal path to learn the motions in reference videos.

python MotionDirector_train.py --config ./configs/config_multi_videos.yaml

Inference

Inference with the spatial path learned from the reference images and the temporal path learned from the reference videos.

python MotionDirector_inference_multi.py --model /path/to/the/foundation/model  --prompt "Your prompt" --spatial_path_folder /path/to/the/trained/MotionDirector/spatial/lora/ --temporal_path_folder /path/to/the/trained/MotionDirector/temporal/lora/ --noise_prior 0.

Example

Download the pre-trained weights.

git clone https://huggingface.co/ruizhaocv/MotionDirector ./outputs

Run the following command.

python MotionDirector_inference_multi.py --model /path/to/the/ZeroScope  --prompt "A Terracotta Warrior is riding a horse through an ancient battlefield." --spatial_path_folder ./outputs/train/customized_appearance/terracotta_warrior/checkpoint-default/spatial/lora --temporal_path_folder ./outputs/train/riding_horse/checkpoint-default/temporal/lora/ --noise_prior 0. --seed 1455028

Results are shown in the "Customize both Appearance and Motion" table above.

More results

If you have trained a more impressive MotionDirector or generated better videos, please feel free to open an issue and share them with us; we would greatly appreciate it. Improvements to the code are also highly welcome.

Please refer to the Project Page for more results.

Astronaut's daily life on Mars:

Motion concepts learned by MotionDirector:
  • Lifting Weights: "An astronaut is lifting weights on Mars, 4K, high quailty, highly detailed." (seed: 4008521)
  • Playing Golf: "Astronaut playing golf on Mars" (seed: 659514)
  • Riding Horse: "An astronaut is riding a horse on Mars, 4K, high quailty, highly detailed." (seed: 1913261)
  • Riding Bicycle: "An astronaut is riding a bicycle past the pyramids Mars, 4K, high quailty, highly detailed." (seed: 5532778)
  • Skateboarding: "An astronaut is skateboarding on Mars" (seed: 6615212)
  • Cinematic Shot "Reverse Follow": "An astronaut is walking on Mars captured with a reverse follow cinematic shot." (seed: 1224445)
  • Cinematic Shot "Follow": "An astronaut is walking on Mars captured with a follow cinematic shot." (seed: 6191674)
  • Cinematic Shot "Orbit": "An astronaut is standing on Mars captured with an orbit cinematic shot." (seed: 7483453)

Citation

@article{zhao2023motiondirector,
  title={MotionDirector: Motion Customization of Text-to-Video Diffusion Models},
  author={Zhao, Rui and Gu, Yuchao and Wu, Jay Zhangjie and Zhang, David Junhao and Liu, Jiawei and Wu, Weijia and Keppo, Jussi and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2310.08465},
  year={2023}
}

Shoutouts

motiondirector's People

Contributors

eltociear · ruizhaocv


motiondirector's Issues

Some confusion about the inconsistency between code and paper description

I appreciate your awesome work very much; however, there are some things I don't understand.
During training, each iteration has a 20% probability that mask_spatial_lora is True, while mask_temporal_lora is always False. If a LoRA is masked, its LoRA scale is set to 0. This does not freeze the LoRA, it cancels it. That seems inconsistent with the dual-path design described in the paper. Is it because I misunderstood the code?
Thank you for your reply.
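
For readers following this thread, the pattern described above, setting a LoRA branch's scale to zero so it contributes nothing on a given step while its weights remain in the model, can be sketched as below. This is an illustrative toy module, not the repository's actual LoraHandler classes.

import torch
import torch.nn as nn

class ToyLoRALinear(nn.Module):
    """Linear layer plus a low-rank residual gated by `scale` (toy example)."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.scale = 1.0

    def forward(self, x):
        # scale = 0 drops the LoRA branch from this forward pass (it also receives
        # no gradient on this step), but its parameters still exist and can be
        # trained on steps where scale is restored to 1.
        return self.base(x) + self.scale * self.up(self.down(x))

layer = ToyLoRALinear(8)
layer.scale = 0.0                      # "mask" the LoRA for this iteration
out = layer(torch.randn(2, 8))         # output now comes from the base layer only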

pickle.UnpicklingError: invalid load key, 'v'.

When trying to run inference with my LoRAs and with those provided on Hugging Face (i.e., after running this command):

python MotionDirector_inference_multi.py --model /workspace/nina-home/MotionDirector/models/zeroscope_v2_576w/  --prompt "A car is running on the road." --spatial_path_folder /workspace/nina-home/MotionDirector/outputs/ready/image_animation/train_2023-12-26T14-37-16/checkpoint-300/spatial/lora/ --temporal_path_folder /workspace/nina-home/MotionDirector/outputs/ready/image_animation/train_2023-12-26T13-08-20/checkpoint-300/temporal/lora/ --noise_prior 0.5 --seed 5057764

I get this error:

[2024-03-01 16:38:56,120] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Initializing the conversion map
Traceback (most recent call last):
  File "MotionDirector_inference_multi.py", line 287, in <module>
    video_frames = inference(
  File "/workspace/nina-home/.conda/envs/motiondirector/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "MotionDirector_inference_multi.py", line 174, in inference
    pipe = initialize_pipeline(model, device, xformers, sdp, spatial_lora_path, temporal_lora_path, lora_rank,
  File "MotionDirector_inference_multi.py", line 65, in initialize_pipeline
    unet_lora_params, unet_negation = lora_manager_spatial.add_lora_to_model(
  File "/workspace/nina-home/MotionDirector/utils/lora_handler.py", line 214, in add_lora_to_model
    params, negation, is_injection_hybrid = self.do_lora_injection(
  File "/workspace/nina-home/MotionDirector/utils/lora_handler.py", line 183, in do_lora_injection
    params, negation = self.lora_injector(**injector_args)  # inject_trainable_lora_extended
  File "/workspace/nina-home/MotionDirector/utils/lora.py", line 471, in inject_trainable_lora_extended
    loras = torch.load(loras)
  File "/workspace/nina-home/.conda/envs/motiondirector/lib/python3.8/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/workspace/nina-home/.conda/envs/motiondirector/lib/python3.8/site-packages/torch/serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

Help is greatly appreciated:)
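
A common cause of _pickle.UnpicklingError: invalid load key, 'v'. when torch.load reads a checkpoint is that the file on disk is still a Git LFS pointer (a small text file beginning with "version https://git-lfs.github.com/spec/v1") rather than the actual weights. The quick check below is offered as a guess under that assumption, not as the maintainers' diagnosis; the file path is a placeholder.

# If this prints a "version https://git-lfs.github.com/spec/v1" header, the real
# weights were never downloaded.
head -c 200 /path/to/the/lora/file/being/loaded
# Re-fetch the real files inside the cloned weights repository.
git lfs install
git lfs pull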

Weird Results

Hi~ Thanks for sharing your inspiring work. But why is my output not as expected?

A_tank_is_running_on_the_moon_8551187

Below is my config:
{
    "name": "Python: inference",
    "type": "python",
    "request": "launch",
    "program": "MotionDirector_inference.py",
    "console": "integratedTerminal",
    "justMyCode": false,
    "args": [
        "--model", "/mnt/lustre/usr/.cache/huggingface/hub/models--cerspense--zeroscope_v2_576w/snapshots/6963642a64dbefa93663d1ecebb4ceda2d9ecb28",
        "--prompt", "A tank is running on the moon.",
        "--checkpoint_folder", "/mnt/lustre/usr/outputs/train/car_16",
        "--checkpoint_index", "150",
        "--noise_prior", "0.5",
        "--seed", "8551187"
    ]
}

Training the example code but Crashed

I am using a 4090 graphics card to train, but it crashes halfway through the process, and it always crashes at 50 steps...
Example code:

python MotionDirector_train.py --config ./configs/config_single_video.yaml

Error Output:

(motiondirector) PS D:\Coding\AILearning\AI_Art_Technology_Demo\MotionDirector> python MotionDirector_train.py --config ./configs/config_single_video.yaml
Initializing the conversion map
D:\Applications\Miniconda3\envs\motiondirector\lib\site-packages\accelerate\accelerator.py:359: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
02/11/2024 14:26:18 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
33 Attention layers using Scaled Dot Product Attention.
Lora successfully injected into UNet3DConditionModel.
Lora successfully injected into UNet3DConditionModel.
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
Caching Latents.:   0%|                                                                                                                       | 0/1 [00:00<?, ?it/s]D:\Applications\Miniconda3\envs\motiondirector\lib\site-packages\diffusers\models\attention_processor.py:1129: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  hidden_states = F.scaled_dot_product_attention(
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.91it/s]
Caching Latents.: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:10<00:00, 10.88s/it]
02/11/2024 14:26:36 - INFO - __main__ - ***** Running training *****
02/11/2024 14:26:36 - INFO - __main__ -   Num examples = 1
02/11/2024 14:26:36 - INFO - __main__ -   Num Epochs = 150
02/11/2024 14:26:36 - INFO - __main__ -   Instantaneous batch size per device = 1
02/11/2024 14:26:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
02/11/2024 14:26:36 - INFO - __main__ -   Gradient Accumulation steps = 1
02/11/2024 14:26:36 - INFO - __main__ -   Total optimization steps = 150
Steps:  33%|███████████████████████████████████████▋                                                                               | 50/150 [00:46<01:31,  1.10it/s]
{'rescale_betas_zero_snr', 'timestep_spacing'} was not found in config. Values will be initialized to default values.
(motiondirector) PS D:\Coding\AILearning\AI_Art_Technology_Demo\MotionDirector>

Memory Leak

Thanks for this very interesting work that at last allows us to have some level of direction in video generation.
The first run works quite well (less than 40 s on an RTX 4090 24 GB), but successive runs are very slow.
There appears to be a memory leak towards the end of the generation process, as GPU memory is never released.

Errors in LoRA: UnboundLocalError: local variable '_tmp' referenced before assignment

Hello, when running the inference code I am getting this error. Is there anything I am missing?
Traceback (most recent call last):
  File "/media/MotionDirector/MotionDirector_train.py", line 1048, in <module>
    main(**OmegaConf.load(args.config))
  File "/media/MotionDirector/MotionDirector_train.py", line 607, in main
    unet_lora_params_temporal, unet_negation_temporal = lora_manager_temporal.add_lora_to_model(
        use_unet_lora, unet, lora_manager_temporal.unet_replace_modules, lora_unet_dropout,
        lora_path + '/temporal/lora/', r=lora_rank)
  File "/media/MotionDirector/utils/lora_handler.py", line 214, in add_lora_to_model
    params, negation, is_injection_hybrid = self.do_lora_injection(
  File "/media/MotionDirector/utils/lora_handler.py", line 183, in do_lora_injection
    params, negation = self.lora_injector(**injector_args)  # inject_trainable_lora_extended
  File "/media/MotionDirector/utils/lora.py", line 532, in inject_trainable_lora_extended
    _tmp.to(_child_module.weight.device).to(_child_module.weight.dtype)
UnboundLocalError: local variable '_tmp' referenced before assignment

earlier checkpoints doing better than later checkpoints

Hi,

In my experiments, earlier checkpoints (~200-400) for both temporal and spatial tuning do much better than later checkpoints, to the point that later checkpoints seem to generate pure noise videos! What could be the cause of this?

How to fix random seed?

Congratulations on a great job.
I've fixed the random seed using the code below, but the output is still different every time.

import random

import numpy as np
import torch

def init_seed(seed):
    # seed every RNG that could affect sampling
    torch.cuda.manual_seed_all(seed)
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)
    # torch.backends.cudnn.enabled = False
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
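
Seeding the global RNGs may not be enough if the pipeline draws its initial latents from a separate torch.Generator (or on a different device). MotionDirector_inference.py already exposes a --seed argument, which is the simpler route; the sketch below only illustrates the general diffusers pattern of passing an explicit, device-matched generator, and the pipeline class and paths are assumptions, not the repository's exact code.

import torch
from diffusers import TextToVideoSDPipeline  # assumed pipeline class for ZeroScope-style models

pipe = TextToVideoSDPipeline.from_pretrained(
    "./models/zeroscope_v2_576w", torch_dtype=torch.float16
).to("cuda")

# An explicit generator pins the latent noise to the seed, independent of global RNG state.
generator = torch.Generator(device="cuda").manual_seed(8551187)
frames = pipe("A tank is running on the moon.", num_frames=16, generator=generator).frames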

training time

Hi, thanks for your nice work and open code! I tried training on a single video using the code, but it takes an hour on an A100. Is that normal? Hoping for your answer.

The code for directly fine-tuning the foundation model

Hey, thanks for your great work! I hope your paper gets accepted at ICLR. I am new to this area and still learning about diffusion models. Do you have code for directly fine-tuning the foundation model instead of customizing it? I am trying to fine-tune it directly on multiple motions instead of one... If not, can this be achieved by modifying your MotionDirector_train.py file? Thanks a lot for any help!

Error occurred on loras

Hi, thanks for releasing the nice work.
When I try to train MotionDirector on a single video, using the provided data "./test_data/car_turn/car-turn-24.mp4", the code encounters an error:
Traceback (most recent call last):
  File "MotionDirector_train.py", line 1024, in <module>
    main(**OmegaConf.load(args.config))
  File "MotionDirector_train.py", line 894, in main
    loss_spatial, loss_temporal, latents, init_noise = finetune_unet(batch, step, mask_spatial_lora=mask_spatial_lora, mask_temporal_lora=mask_temporal_lora)
  File "MotionDirector_train.py", line 820, in finetune_unet
    loras[lora_idx + step].scale = 1.
IndexError: list index out of range

What is going on?

How to animate a single image?

Great work!
I am wondering how to reproduce Row 4 of Figure 2 in the main paper.
Could you please give me some advice?
Do we need the inversion latents of the reference video in this case?

May I ask what work the LoRA-related code (e.g. class LoraHandler) is based on?

Thank you very much for your paper and open-source code.
I am having a hard time reading the LoRA-related code. Did you write the LoraHandler class yourself, or is it based on previous work? Whose work is it? I would like to look up further material to understand LoraHandler.
Looking forward to your answer.

problem about training for image animation

Thanks for releasing the nice work.
When I run the command "python MotionDirector_train.py --config ./configs/config_single_image.yaml",
the following error occurs:
Traceback (most recent call last):
  File "MotionDirector_train.py", line 1050, in <module>
    main(**OmegaConf.load(args.config))
  File "MotionDirector_train.py", line 912, in main
    loss_spatial, loss_temporal, latents, init_noise = finetune_unet(batch, step, mask_spatial_lora=mask_spatial_lora, mask_temporal_lora=mask_temporal_lora)
  File "MotionDirector_train.py", line 836, in finetune_unet
    loras = extract_lora_child_module(unet, target_replace_module=["TransformerTemporalModel"])
  File "/home/MotionDirector/utils/lora.py", line 702, in extract_lora_child_module
    raise ValueError("No lora injected.")
ValueError: No lora injected.
Why does this happen? And how can I solve it?

Adding more data to the test folder

I was planning to add some new videos to test_data/bicycle, so I downloaded some clips from Shutterstock and placed them there, but I get:
decord._ffi.base.DECORDError: [13:32:25] /github/workspace/src/video/video_reader.cc:270: [./test_data/bicycle/a_man_riding_cycle_in_rain.mp4] Failed to measure duration/frame-count due to broken metadata.

Can you please provide any suggestions on how I can add more data?
Thank you.
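
A common workaround for decord's "broken metadata" error, offered here as a general suggestion rather than the maintainers' answer, is to re-encode the downloaded clip with ffmpeg so the container carries clean duration/frame-count metadata before placing it in test_data/:

# Re-encode the clip so decord can read its duration and frame count (output name is arbitrary).
ffmpeg -i a_man_riding_cycle_in_rain.mp4 -c:v libx264 -pix_fmt yuv420p -an a_man_riding_cycle_in_rain_fixed.mp4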

Not able to save spatial and temporal data

Hi,
I am not able to see the spatial and temporal data being saved when training on a single image and a single video. Is there anything in the config file that I am missing and need to change?

This is the output at present (screenshot attached).

Errors in dataset.py

Hi, I have checked vr[0].shape; it is (h, w, c) (screenshot attached).
I wonder whether this problem is caused by the decord version? I use decord==0.6.0, as in requirements.txt.

Testing samples

Thanks for your great work! Can you share all the testing videos on your project page?

Why is the GPU utilization low?

Hi, thanks for releasing this nice work.
I am running the code but find that the GPU utilization is very low; is your GPU utilization also low? (screenshot attached)
Training speed is also a little slow. (screenshot attached)

Does every new animation need to be retrained?

Hello, thanks for releasing this nice work.
I have just come into contact with this field, and I have a question I am curious about: if I want to animate a new image with a different appearance and customize new motions, do I need to train again on that image and the related videos?

If so, does it mean that the weights given by the link were trained only on the reference image/video, and no other images/videos were used?

Looking forward to your answer!
