
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

This repo contains the official PyTorch implementation of AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation.

Project page: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken/

Abstract

In recent years, image generation has shown a great leap in performance, with diffusion models playing a central role. Although such models generate high-quality images, they are mainly conditioned on textual descriptions. This begs the question: how can we adapt such models to be conditioned on other modalities? In this paper, we propose a novel method utilizing latent diffusion models, trained for text-to-image generation, to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered as an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods considering both objective and subjective metrics.

Hugging Face Spaces

A demo is available on Hugging Face Spaces.

Installation

git clone git@github.com:guyyariv/AudioToken.git
cd AudioToken
pip install -r requirements.txt

And initialize an Accelerate environment with:

accelerate config
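
If you prefer a non-interactive setup, you can instead have Accelerate write a default configuration with accelerate config default (a standard Accelerate command, not specific to this repo).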

Download BEATs pre-trained model

mkdir -p models/BEATs/ && wget -O models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt "https://valle.blob.core.windows.net/share/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt?sv=2020-08-04&st=2023-03-01T07%3A51%3A05Z&se=2033-03-02T07%3A51%3A00Z&sr=c&sp=rl&sig=QJXmSJG9DbMKf48UDIU1MfzIro8HQOf3sqlNXiflY1I%3D"
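
As a quick sanity check that the download completed, the checkpoint can be opened with torch.load. A minimal sketch; the key layout noted in the comments is how BEATs checkpoints are usually packaged and may differ:

import torch

# Load the downloaded BEATs checkpoint on CPU just to inspect it.
ckpt = torch.load(
    "models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt",
    map_location="cpu",
)

# BEATs checkpoints typically hold a 'cfg' (model config) entry and a
# 'model' (state dict) entry; adjust if yours differs.
print(list(ckpt.keys()))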

Pre-Trained Embedder

The embedder weights we pre-trained, on which the paper's results are based, can be found at output/embedder_learned_embeds.bin.
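
To inspect what the file contains, it can be loaded with torch.load. A minimal sketch; the exact structure of the saved object depends on how train.py serializes it, so the state-dict-style loop below is an assumption:

import torch

# Load the pre-trained embedder weights shipped with the repo.
weights = torch.load("output/embedder_learned_embeds.bin", map_location="cpu")

# Assumption: the file holds a mapping from parameter names to tensors.
for name, tensor in weights.items():
    print(name, tuple(tensor.shape))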

Training

First, download our dataset, VGGSound; download links are available from the dataset's maintainers.

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./vggsound/"
export OUTPUT_DIR="output/"

accelerate launch train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --resolution=512 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=30000 \
  --learning_rate=1.0e-05

Note: Change the resolution to 768 if you are using the stable-diffusion-2 768x768 model.
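
With the settings above, the effective batch size per GPU is train_batch_size x gradient_accumulation_steps = 4 x 4 = 16; multiply by the number of processes configured in Accelerate for the global batch size.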

Inference

After you've trained a model with the above command, you can simply generate images using the following script:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="./vggsound/"
export OUTPUT_DIR="output/"
export LEARNED_EMBEDS="output/embedder_learned_embeds.bin"

accelerate launch inference.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --data_dir=$DATA_DIR \
  --output_dir=$OUTPUT_DIR \
  --learned_embeds=$LEARNED_EMBEDS

Cite

If you use our work in your research, please cite the following paper:

@article{yariv2023audiotoken,
  title={AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation},
  author={Yariv, Guy and Gat, Itai and Wolf, Lior and Adi, Yossi and Schwartz, Idan},
  journal={arXiv preprint arXiv:2305.13050},
  year={2023}
}

License

This repository is released under the MIT license as found in the LICENSE file.


audiotoken's Issues

I cannot find any audio files in the VGGSound dataset

Dear author, after downloading the VGGSound dataset from the Hugging Face link you provided and unzipping it, I found that it contains only videos and no audio files. Where should I download the corresponding audio files? Without them, I cannot run train.py.

Some details about how to run inference

Dear author, is the test set used in inference.py the entire VGGSound dataset? I see that the code uses this dataset's dataloader, so how should I proceed if I want to run inference.py on other arbitrary audio?

Speech and Image Embeddings

Given an audio file and an image, how can I use AudioToken's pre-trained model to derive an audio feature vector (i.e., a speech embedding) and an image feature vector (i.e., an image embedding)?
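
For the audio side, one plausible starting point is the BEATs encoder downloaded above. The sketch below follows the usage documented in the upstream BEATs repository; the exact import path inside this repo, and whether AudioToken applies a further learned projection on top of these features, are assumptions to verify against train.py:

import torch
# Assumption: the BEATs implementation vendored in this repo is importable;
# adjust the import path to wherever BEATs.py lives in your checkout.
from BEATs import BEATs, BEATsConfig

# Load the checkpoint downloaded in the installation step.
checkpoint = torch.load(
    "models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt",
    map_location="cpu",
)
cfg = BEATsConfig(checkpoint["cfg"])
model = BEATs(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()

# Dummy 16 kHz waveform; replace with your own loaded audio.
audio_16khz = torch.randn(1, 16000)
padding_mask = torch.zeros(1, 16000).bool()

# extract_features returns frame-level audio features; AudioToken's
# embedder maps such features into a single conditioning token.
with torch.no_grad():
    features = model.extract_features(audio_16khz, padding_mask=padding_mask)[0]
print(features.shape)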

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.