Official codebase for "Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling".

License: Apache License 2.0

Topics: attention-mechanism, audio-visual-correspondence, audio-visual-learning, multimodal-learning, self-supervised-learning, sound-source-localization, sound-source-separation, transformer-architecture, masked-autoencoder, masked-image-modeling


Masked Autoencoders enable strong Audio-Visual Early Fusion

Official codebase and pre-trained models for our DeepAVFusion framework, as described in the following paper:

Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Shentong Mo, Pedro Morgado
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

DeepAVFusion Illustration

Setup

Environment

Our environment was created as follows:

conda create -n deepavfusion python=3.10
conda activate deepavfusion
conda install pytorch=2.0 torchvision=0.15 torchaudio=2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install submitit hydra-core av wandb tqdm scipy scikit-image scikit-learn timm mir_eval jupyter matplotlib

Alternatively, run conda env create -f requirements.yml to replicate it in one step.
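To sanity-check the installation before launching jobs, a minimal Python snippet like the one below (an optional sketch, not part of the codebase) prints the installed versions, which should roughly match the commands above:

import torch, torchvision, torchaudio

# Versions should roughly match the install commands above (2.0 / 0.15 / 2.0).
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())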

Datasets

In this work, we used a variety of datasets, including VGGSound, AudioSet, MUSIC, and AVSBench. We assume that all datasets have been downloaded. The expected data format is briefly described in DATASETS.md.

PATH2VGGSOUND="/path/to/vggsound"
PATH2AUDIOSET="/path/to/audioset"
PATH2MUSIC="/path/to/music"
PATH2AVSBENCH="/path/to/avsbench"
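Before launching any job, it can be worth verifying that these roots actually exist. A minimal sketch (the variable names mirror the shell variables above; export them first):

import os

# Mirrors the shell variables defined above; export them before running.
for var in ("PATH2VGGSOUND", "PATH2AUDIOSET", "PATH2MUSIC", "PATH2AVSBENCH"):
    root = os.environ.get(var)
    status = "ok" if root and os.path.isdir(root) else "missing"
    print(f"{var}={root} [{status}]")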

DeepAVFusion Pre-training

We release two models based on the ViT-Base architecture, trained on the VGGSound and AudioSet datasets, respectively. The models were trained with the following commands.

# Pre-training on VGGSound
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_vggsound_ep\${opt.epochs} \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  model.fusion.layers=all model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=1 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0
  
# Pre-training on AudioSet
PYTHONPATH=. python launcher.py --config-name=deepavfusion job_name=deepavfusion_vitb_as2m_ep\${opt.epochs} \
  data.dataset=audioset data.data_path=${PATH2AUDIOSET} \
  model.fusion.layers=all model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  opt.epochs=200 opt.warmup_epochs=40 opt.batch_size=64 opt.accum_iter=4 opt.blr=1.5e-4 \
  env.ngpu=8 env.world_size=1 env.seed=0 
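For reference, the effective batch size is the product of opt.batch_size, opt.accum_iter, env.ngpu, and env.world_size. MAE-style codebases typically derive the absolute learning rate from the base rate as blr * effective_batch / 256; whether this repository applies exactly that rule is an assumption, so treat the arithmetic below as a sketch:

# Assumed MAE-style linear LR scaling; numbers from the AudioSet command above.
batch_size, accum_iter, ngpu, world_size = 64, 4, 8, 1
blr = 1.5e-4
eff_batch = batch_size * accum_iter * ngpu * world_size
lr = blr * eff_batch / 256
print(eff_batch, lr)  # 2048, 1.2e-3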

The nearest-neighbor retrieval curve of the model trained on VGGSound is shown below. The retrieval performance of fusion tokens is substantially better than that of the uni-modal representations, suggesting that fusion tokens aggregate high-level semantics, while uni-modal representations encode the low-level details required for masked reconstruction.

DeepAVFusion training curve

The pre-trained models are available in the checkpoints/ directory.
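To inspect a downloaded checkpoint, something like the following works (the file name and dictionary keys are assumptions; list checkpoints/ for the actual layout):

import torch

# Hypothetical file name; MAE-style checkpoints often store 'model',
# 'optimizer', and 'epoch' entries, but verify against the actual files.
ckpt = torch.load("checkpoints/deepavfusion_vitb_vggsound_ep200.pth", map_location="cpu")
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)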

Downstream tasks

We evaluate our model on a variety of downstream tasks. In each case, the pre-trained model is used for feature extraction (with or without fine-tuning, depending on the evaluation protocol), and a task-specific decoder is trained from scratch to carry out the task.

Audio Event Recognition

Dataset     Eval Protocol   Pre-trained Model   Top1 Acc
VGGSound    Linear Probe    VGGSound-200ep      53.08
VGGSound    Linear Probe    AudioSet2M-200ep    53.08
VGGSound    Fine-tuning     VGGSound-200ep      58.19
VGGSound    Fine-tuning     AudioSet2M-200ep    57.91

# Linear probe with the VGGSound-200ep model
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound \
  pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 \
  env.ngpu=4 env.world_size=1

# Linear probe with the AudioSet2M-200ep model
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_vggsound \
  pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=60 opt.warmup_epochs=10 opt.batch_size=64 opt.accum_iter=4 opt.blr=0.3 \
  env.ngpu=4 env.world_size=1

# Fine-tuning with the VGGSound-200ep model
PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_vggsound \
  pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 \
  env.ngpu=4 env.world_size=1

# Fine-tuning with the AudioSet2M-200ep model
PYTHONPATH=. python launcher.py --config-name=finetune job_name=finetune_vggsound \
  pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  data.dataset=vggsound data.data_path=${PATH2VGGSOUND} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 \
  env.ngpu=4 env.world_size=1
Dataset       Eval Protocol   Pre-trained Model   mAP
AudioSet-Bal  Linear Probe    VGGSound-200ep      53.08
AudioSet-Bal  Linear Probe    AudioSet2M-200ep    53.08
AudioSet-Bal  Fine-tuning     VGGSound-200ep      58.19
AudioSet-Bal  Fine-tuning     AudioSet2M-200ep    57.91

# Linear probe with the VGGSound-200ep model
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal \
  pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 \
  env.ngpu=2 env.world_size=1

# Linear probe with the AudioSet2M-200ep model
PYTHONPATH=. python launcher.py --config-name=linprobe job_name=eval_linprobe_as2mbal \
  pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=300 opt.warmup_epochs=20 opt.batch_size=256 opt.accum_iter=1 opt.blr=0.3 \
  env.ngpu=2 env.world_size=1

# Fine-tuning with the VGGSound-200ep model
PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal \
  pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 \
  env.ngpu=4 env.world_size=1

# Fine-tuning with the AudioSet2M-200ep model
PYTHONPATH=. python launcher.py --config-name=finetune job_name=eval_finetune_as2mbal \
  pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  data.dataset=audioset-bal-orig data.data_path=${PATH2AUDIOSET} \
  opt.epochs=200 opt.warmup_epochs=20 opt.batch_size=32 opt.accum_iter=4 opt.blr=3e-4 \
  env.ngpu=4 env.world_size=1
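Conceptually, the linear probe protocol freezes the pre-trained encoder and trains only a linear classifier on top of pooled features. The sketch below illustrates the idea with a stand-in encoder; the real model classes, feature dimensions, and class count live in this repository and may differ:

import torch
import torch.nn as nn

# Stand-in for the frozen pre-trained encoder (hypothetical; the real
# DeepAVFusion trunk is a ViT-Base with dense audio-visual fusion tokens).
encoder = nn.Linear(768, 768).eval()
for p in encoder.parameters():
    p.requires_grad = False

num_classes = 309  # assumed VGGSound class count; adjust to your split
head = nn.Linear(768, num_classes)
optim = torch.optim.SGD(head.parameters(), lr=0.1)

feats = torch.randn(8, 768)                  # stand-in for pooled fusion tokens
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():
    z = encoder(feats)                       # frozen features, no gradients
loss = nn.functional.cross_entropy(head(z), labels)
loss.backward()
optim.step()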

Visually Guided Source Separation

Dataset         Pre-training      SDR    SIR    SAR
VGGSound-Music  VGGSound-200ep    5.79   8.24   13.82
VGGSound-Music  AudioSet2M-200ep  6.93   9.93   13.49

# Source separation with the VGGSound-200ep model
PYTHONPATH=. python launcher.py --config-name=avsrcsep job_name=eval_avsrcsep_vggsound_music \
  pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  data.dataset=vggsound_music data.data_path=${PATH2VGGSOUND} \
  opt.epochs=300 opt.warmup_epochs=40 opt.batch_size=16 opt.accum_iter=8 opt.blr=3e-4 \
  avss.log_freq=True avss.weighted_loss=True avss.binary_mask=False avss.num_mixtures=2 \
  env.ngpu=4 env.world_size=1

# Source separation with the AudioSet2M-200ep model
PYTHONPATH=. python launcher.py --config-name=avsrcsep job_name=eval_avsrcsep_vggsound_music \
  pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  data.dataset=vggsound_music data.data_path=${PATH2VGGSOUND} \
  opt.epochs=300 opt.warmup_epochs=40 opt.batch_size=16 opt.accum_iter=8 opt.blr=3e-4 \
  avss.log_freq=True avss.weighted_loss=True avss.binary_mask=False avss.num_mixtures=2 \
  env.ngpu=4 env.world_size=1
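The SDR/SIR/SAR metrics above can be computed with the mir_eval package already listed in the environment setup. A toy example with random signals (not this repo's evaluation code):

import numpy as np
import mir_eval

# Shapes are (n_sources, n_samples); here two 1-second sources at 16 kHz.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))
estimated = reference + 0.1 * rng.standard_normal((2, 16000))
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
print(sdr, sir, sar)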

Audio-Visual Semantic Segmentation

Dataset      Pre-training      mIoU    FScore
AVSBench-S4  VGGSound-200ep    89.94   92.34
AVSBench-S4  AudioSet2M-200ep  90.27   92.49

# Segmentation with the VGGSound-200ep model
PYTHONPATH=. python launcher.py --config-name=avsegm job_name=eval_avsbench_s4 \
  pretrain_job_name=deepavfusion_vitb_vggsound_ep200 model.fusion.attn_ratio=0.25 model.fusion.mlp_ratio=1.0 \
  data.dataset=avsbench_s4 data.data_path=${PATH2AVSBENCH} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=16 opt.accum_iter=8 opt.blr=2e-4 \
  env.ngpu=4 env.world_size=1

# Segmentation with the AudioSet2M-200ep model
PYTHONPATH=. python launcher.py --config-name=avsegm job_name=eval_avsbench_s4 \
  pretrain_job_name=deepavfusion_vitb_as2m_ep200 model.fusion.attn_ratio=1.0 model.fusion.mlp_ratio=4.0 \
  data.dataset=avsbench_s4 data.data_path=${PATH2AVSBENCH} \
  opt.epochs=100 opt.warmup_epochs=20 opt.batch_size=16 opt.accum_iter=8 opt.blr=2e-4 \
  env.ngpu=4 env.world_size=1
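For reference, mIoU and F-score for binary segmentation masks can be computed as below (a minimal numpy sketch; the β² = 0.3 weighting follows common AVSBench practice and is an assumption here, not this repo's confirmed setting):

import numpy as np

def iou_and_fscore(pred, gt, beta2=0.3, eps=1e-8):
    # pred, gt: boolean masks of shape (H, W)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    precision = inter / (pred.sum() + eps)
    recall = inter / (gt.sum() + eps)
    fscore = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return iou, fscore

# Toy masks: prediction covers 2 rows, ground truth covers 3.
pred = np.zeros((4, 4), bool); pred[:2] = True
gt = np.zeros((4, 4), bool); gt[:3] = True
print(iou_and_fscore(pred, gt))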

Demonstrations

Demo 1

Original Video

demo1_orig.mp4

Localized Sources

Separated Source #1

demo1_sep1.mp4

Separated Source #2

demo1_sep2.mp4

Demo 2

Original Video

demo2_orig.mp4

Localized Sources

Separated Source #1

demo2_sep1.mp4

Separated Source #2

demo2_sep2.mp4

Demo 3

Original Video

demo3_orig.mp4

Localized Sources

Separated Source #1

demo3_sep1.mp4

Separated Source #2

demo3_sep2.mp4

Demo 4

Original Video

demo4_orig.mp4

Localized Sources

Separated Source #1

demo4_sep1.mp4

Separated Source #2

demo4_sep2.mp4

Citation

If you find this repository useful, please cite our paper:

@inproceedings{mo2024deepavfusion,
  title={Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling},
  author={Mo, Shentong and Morgado, Pedro},
  booktitle={Proceedings of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}


