
guyyariv / tempotokens


This repo contains the official PyTorch implementation of: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Home Page: https://pages.cs.huji.ac.il/adiyoss-lab/TempoTokens/

License: MIT License

Language: Python 100.00%
Topics: ai-art, audio-to-video, audio-visual, deep-learning, diffusion-models, generative-ai, video-synthesis, modelscope, pytorch

tempotokens's Introduction

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

This repo contains the official PyTorch implementation of Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

audio-to-video.mp4

Abstract

We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.
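The AV-Align metric is described here only at a high level (matching energy peaks across the two modalities). Below is a minimal, hedged sketch of that idea in Python: it uses librosa onset peaks for the audio side, simple frame differencing as a stand-in for the optical-flow energy used in the paper, and a 0.1 s matching window as an illustrative assumption. It is not the official implementation.

# Hedged sketch of an AV-Align-style score: agreement between audio and video energy peaks.
# NOT the official implementation; the peak detectors and matching window are assumptions.
import numpy as np
import librosa

def audio_peaks(wav_path, sr=16000):
    """Detect audio energy peaks (onsets), returned as times in seconds."""
    y, sr = librosa.load(wav_path, sr=sr)
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr)
    return librosa.frames_to_time(frames, sr=sr)

def video_peaks(frames, fps):
    """Detect motion-energy peaks via frame differencing (stand-in for optical flow)."""
    frames = np.asarray(frames, dtype=np.float32)             # (T, H, W, C)
    energy = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    is_peak = (energy[1:-1] > energy[:-2]) & (energy[1:-1] > energy[2:])
    return (np.where(is_peak)[0] + 1) / fps                   # peak times in seconds

def av_align(audio_times, video_times, window=0.1):
    """IoU-style agreement: a peak is matched if the other modality has a peak within +/- window seconds."""
    matched = sum(any(abs(a - v) <= window for v in video_times) for a in audio_times)
    union = len(audio_times) + len(video_times) - matched
    return matched / union if union else 0.0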

Installation

git clone git@github.com:guyyariv/TempoTokens.git
cd TempoTokens
pip install -r requirements.txt

And initialize an Accelerate environment with:

accelerate config

Download the BEATs pre-trained model

mkdir -p models/BEATs/ && wget -P models/BEATs/ -O "models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt" "https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D"
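To sanity-check the download, the checkpoint can be loaded with the BEATs reference code from microsoft/unilm. A minimal sketch, assuming the BEATs module is importable as `BEATs` (the exact import path inside this repo may differ):

# Load the downloaded BEATs checkpoint and extract features from dummy audio (sketch;
# assumes the microsoft/unilm BEATs module is on PYTHONPATH).
import torch
from BEATs import BEATs, BEATsConfig

ckpt = torch.load("models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt", map_location="cpu")
cfg = BEATsConfig(ckpt["cfg"])
beats = BEATs(cfg)
beats.load_state_dict(ckpt["model"])
beats.eval()

# One second of 16 kHz dummy audio and its padding mask.
audio = torch.randn(1, 16000)
padding_mask = torch.zeros(1, 16000).bool()
features = beats.extract_features(audio, padding_mask=padding_mask)[0]
print(features.shape)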

Training

Run the command for the dataset you want to train on; we provide configurations for VGGSound, Landscape, and AudioSet-Drum.

accelerate launch train.py --config configs/v2/vggsound.yaml
accelerate launch train.py --config configs/v2/landscape.yaml
accelerate launch train.py --config configs/v2/audioset_drum.yaml

We strongly recommend reviewing the configuration files and customizing the parameters according to your preferences.
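Before launching a run, it can help to see exactly which parameters a config exposes. A minimal sketch that loads a config and prints its top-level keys (it assumes nothing about the key names themselves):

# Inspect a training config before editing it (sketch; the key names are whatever
# the YAML file actually contains, nothing specific is assumed here).
import yaml

with open("configs/v2/vggsound.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    print(f"{key}: {value!r}")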

Pre-trained weights

Obtain the pre-trained weights for the three datasets we trained on from the following link:
https://drive.google.com/drive/folders/10pRWoq0m5torvMXILmIQd7j9fLPEeHtS
We advise saving the downloaded folders in the models/ directory.
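If you prefer to fetch the folder non-interactively, the gdown package (`pip install gdown`) can download a public Drive folder. A sketch, assuming the folder above stays publicly readable:

# Download the pre-trained mapper weights into models/ (sketch; requires `pip install gdown`
# and assumes the Google Drive folder remains publicly accessible).
import gdown

url = "https://drive.google.com/drive/folders/10pRWoq0m5torvMXILmIQd7j9fLPEeHtS"
gdown.download_folder(url=url, output="models/", quiet=False)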

Inference

The inference.py script generates videos using trained checkpoints. Once you have trained a model with the commands above (or downloaded our pre-trained weights), you can generate videos from the datasets used for training: VGGSound, Landscape, and AudioSet-Drum.

accelerate launch inference.py --mapper_weights models/vggsound/learned_embeds.pth --testset vggsound
accelerate launch inference.py --mapper_weights models/landscape/learned_embeds.pth --testset landscape
accelerate launch inference.py --mapper_weights models/audioset_drum/learned_embeds.pth --testset audioset_drum

You can also generate a video from your own audio, as demonstrated below:

accelerate launch inference.py --mapper_weights models/vggsound/learned_embeds.pth --audio_path /audio/path
For the full list of options, run:

> python inference.py --help

usage: inference.py [-h] -m MODEL -p PROMPT [-n NEGATIVE_PROMPT] [-o OUTPUT_DIR]
                    [-B BATCH_SIZE] [-W WIDTH] [-H HEIGHT] [-T NUM_FRAMES]
                    [-WS WINDOW_SIZE] [-VB VAE_BATCH_SIZE] [-s NUM_STEPS]
                    [-g GUIDANCE_SCALE] [-i INIT_VIDEO] [-iw INIT_WEIGHT] [-f FPS]
                    [-d DEVICE] [-x] [-S] [-lP LORA_PATH] [-lR LORA_RANK] [-rw]

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        HuggingFace repository or path to model checkpoint directory
  -p PROMPT, --prompt PROMPT
                        Text prompt to condition on
  -n NEGATIVE_PROMPT, --negative-prompt NEGATIVE_PROMPT
                        Text prompt to condition against
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Directory to save output video to
  -B BATCH_SIZE, --batch-size BATCH_SIZE
                        Batch size for inference
  -W WIDTH, --width WIDTH
                        Width of output video
  -H HEIGHT, --height HEIGHT
                        Height of output video
  -T NUM_FRAMES, --num-frames NUM_FRAMES
                        Total number of frames to generate
  -WS WINDOW_SIZE, --window-size WINDOW_SIZE
                        Number of frames to process at once (defaults to full
                        sequence). When smaller than num_frames, a round-robin diffusion
                        process is used to denoise the full sequence iteratively, one
                        window at a time. Must divide num_frames exactly!
  -VB VAE_BATCH_SIZE, --vae-batch-size VAE_BATCH_SIZE
                        Batch size for VAE encoding/decoding to/from latents (higher
                        values = faster inference, but more memory usage).
  -s NUM_STEPS, --num-steps NUM_STEPS
                        Number of diffusion steps to run per frame.
  -g GUIDANCE_SCALE, --guidance-scale GUIDANCE_SCALE
                        Scale for guidance loss (higher values = more guidance, but
                        possibly more artifacts).
  -i INIT_VIDEO, --init-video INIT_VIDEO
                        Path to video to initialize diffusion from (will be resized to
                        the specified num_frames, height, and width).
  -iw INIT_WEIGHT, --init-weight INIT_WEIGHT
                        Strength of visual effect of init_video on the output (lower
                        values adhere more closely to the text prompt, but have a less
                        recognizable init_video).
  -f FPS, --fps FPS     FPS of output video
  -d DEVICE, --device DEVICE
                        Device to run inference on (defaults to cuda).
  -x, --xformers        Use XFormers attention, a memory-efficient attention
                        implementation (requires `pip install xformers`).
  -S, --sdp             Use SDP attention, PyTorch's built-in memory-efficient
                        attention implementation.
  -lP LORA_PATH, --lora_path LORA_PATH
                        Path to Low Rank Adaptation checkpoint file (defaults to empty
                        string, which uses no LoRA).
  -lR LORA_RANK, --lora_rank LORA_RANK
                        Size of the LoRA checkpoint's projection matrix (defaults to
                        64).
  -rw, --remove-watermark
                        Post-process the videos with LAMA to inpaint ModelScope's
                        common watermarks.
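If you want to script inference over a directory of audio files, a thin wrapper around the CLI is enough. A hedged sketch that only composes flags documented above; the --model value is a placeholder for the base text-to-video checkpoint (its exact value is not specified in this README), and the input folder name is hypothetical:

# Batch-generate a video for every .wav file in a folder by shelling out to inference.py (sketch).
import subprocess
from pathlib import Path

AUDIO_DIR = Path("my_audio")                     # hypothetical input folder
MODEL = "path/to/base-text-to-video-checkpoint"  # placeholder; see --model in the help above

for audio in sorted(AUDIO_DIR.glob("*.wav")):
    cmd = [
        "accelerate", "launch", "inference.py",
        "--model", MODEL,
        "--mapper_weights", "models/vggsound/learned_embeds.pth",
        "--audio_path", str(audio),
        "--output-dir", f"outputs/{audio.stem}",
    ]
    subprocess.run(cmd, check=True)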

Acknowledgments

Our code is partially built upon Text-To-Video-Finetuning.

Cite

If you use our work in your research, please cite the following paper:

@misc{yariv2023diverse,
      title={Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation}, 
      author={Guy Yariv and Itai Gat and Sagie Benaim and Lior Wolf and Idan Schwartz and Yossi Adi},
      year={2023},
      eprint={2309.16429},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

This repository is released under the MIT license as found in the LICENSE file.

tempotokens's People

Contributors

guyyariv


tempotokens's Issues

Error when training on the landscape dataset

When I run the training code on the landscape dataset, I encounter an error. How should I solve it?

LoRA rank 16 is too large. setting to: 4
Traceback (most recent call last):
  File "train.py", line 1221, in <module>
    main(**config)
  File "train.py", line 770, in main
    unet_lora_params, unet_negation = inject_lora(
  File "train.py", line 293, in inject_lora
    params, negation = injector(**injector_args)
  File "/home/TempoTokens/utils/lora.py", line 461, in inject_trainable_lora_extended
    _tmp.to(_child_module.bias.device).to(_child_module.bias.dtype)
AttributeError: 'NoneType' object has no attribute 'device'

Thank you for your answer!

AV-Align metric implementation

Hi,

Thanks again for responding to my previous issue. I was wondering if you plan to release the code to reproduce the AV-Align metric proposed in the paper? I looked through the repo and could not find any source code for this (apologies if I'm mistaken). I think it's a very interesting metric and would be important in trying to reproduce your results. Thanks!

Question about the config and the structure of the dataset folder

Nice work! I encountered an issue where the data could not be found when I ran train.py. Your config file expects the dataset to be split into video/ and audio/ subfolders, but the dataset files you provided are not organized that way. I get errors similar to the following when running the code.

FileNotFoundError: [Errno 2] No such file or directory: 'None/video/'

Thus, I am unsure how to modify the contents of the config file or the division of the dataset files. Could you show me how to edit the config file or change the division of the dataset subfolders? Or could you upload a dataset that matches the content of the config file?

Thanks!
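(For anyone hitting the same error: below is a hedged sketch of one way to split a flat dataset folder into the video/ and audio/ subfolders the configs appear to expect. The dataset root, file extensions, and expected layout are all assumptions, not documented behaviour.)

# Reorganize a flat dataset folder into video/ and audio/ subfolders (sketch; treat the
# layout below as a guess, since the expected structure is not documented here).
import shutil
from pathlib import Path

root = Path("datasets/landscape")   # hypothetical dataset root
(root / "video").mkdir(exist_ok=True)
(root / "audio").mkdir(exist_ok=True)

for f in root.iterdir():
    if f.suffix == ".mp4":
        shutil.move(str(f), str(root / "video" / f.name))
    elif f.suffix in {".wav", ".mp3"}:
        shutil.move(str(f), str(root / "audio" / f.name))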

Example inference command does not work

Your example
inference.py --mapper_weights models/vggsound/learned_embeds.pth --audio_path /audio/path
I try the following
python inference.py --mapper_weights models\vggsound\learned_embeds.pth --audio_path croaking.mp3
which gives this error

usage: inference.py [-h] -m MODEL --mapper_weights MAPPER_WEIGHTS [-p PROMPT] [-n NEGATIVE_PROMPT] [-o OUTPUT_DIR] [-B BATCH_SIZE] [-W WIDTH] [-H HEIGHT] [-T NUM_FRAMES] [-WS WINDOW_SIZE] [-VB VAE_BATCH_SIZE] [-s NUM_STEPS] [-g GUIDANCE_SCALE] [-i INIT_VIDEO]
                    [-iw INIT_WEIGHT] [-f FPS] [-d DEVICE] [-x] [-S] [-lP LORA_PATH] [-lR LORA_RANK] [-rw] [-l] [-r SEED] [--n N] [--testset TESTSET] [--audio_path AUDIO_PATH]
inference.py: error: the following arguments are required: -m/--model

What do I need to specify for the --model parameter? Thanks for any tips.

Cannot download BEATs_iter3+ (AS2M) (cpt2)

mkdir -p models/BEATs/ && wget -P models/BEATs/ -O "models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt" "https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D"

Will not apply HSTS. The HSTS database must be a regular and non-world-writable file.
ERROR: could not open HSTS store at '/home/yang/.wget-hsts'. HSTS will be disabled.
--2024-03-30 12:01:55--  https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D
Resolving valle.blob.core.windows.net (valle.blob.core.windows.net)... 20.60.231.33
Connecting to valle.blob.core.windows.net (valle.blob.core.windows.net)|20.60.231.33|:443... connected.
HTTP request sent, awaiting response... 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
2024-03-30 12:01:56 ERROR 403: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.

Can you provide another link?

Filtered VGGSound file list?

[screenshot of a paper excerpt mentioning the filtered version of VGGSound]
Hi,

First of all, really like the paper :) I think it presents an elegant solution and impressive results. As seen above, the paper mentions a filtered version of VGGSound, which seems like a really good idea since the dataset is usually quite noisy. I was wondering if you could share the file list for this filtered version (as a .csv, for example)? I looked around in the repo, and the only list I could find was datasets/vggsound.csv, which seems to contain (almost?) the whole dataset rather than the filtered version. Any help would be greatly appreciated. Thank you.
