ego4d_talknet_asd's Introduction

Ego4d Audio-visual Diarization Benchmark: Active Speaker Detection

This repository contains code adapted from TalkNet, an active speaker detection model that determines whether a face on screen is speaking. For more details, please refer to [Paper] [Video_English] [Video_Chinese].


Dependencies

Start by building a new environment

sudo apt-get install ffmpeg
conda create -n TalkNet python=3.7.9 anaconda
conda activate TalkNet
pip install -r requirement.txt

Start from an existing environment

pip install -r requirement.txt

TalkNet in Ego4d dataset

Data preparation

Download the data manifest (manifest.csv) and annotations (av_{train/val/test_unannotated}.json) for the audio-visual diarization benchmark, following the Ego4D download instructions.

Note: the default folder for videos and annotations is ./data. If you save them in another directory, create symbolic links to them in ./data. The structure should look like this:

data/

  • csv/
    • manifest.csv
  • json/
    • av_train.json
    • av_val.json
    • av_test_unannotated.json
  • split/
    • test.list
    • train.list
    • val.list
    • full.list
  • videos/
    • 00407bd8-37b4-421b-9c41-58bb8f141716.mp4
    • 007beb60-cbab-4c9e-ace4-f5f1ba73fccf.mp4
    • ...
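
Before running the later steps, it can help to confirm that ./data matches the layout above. The following is a minimal sketch, not part of this repository: the helper names `missing_entries` and `link_into_data` are hypothetical, and the list of expected entries is copied from the structure shown above (some entries may only exist after later steps in your setup).

```python
from pathlib import Path

# Expected entries under the data root, copied from the layout listed above.
EXPECTED = [
    "csv/manifest.csv",
    "json/av_train.json",
    "json/av_val.json",
    "json/av_test_unannotated.json",
    "split/test.list",
    "split/train.list",
    "split/val.list",
    "split/full.list",
    "videos",
]


def missing_entries(data_root="data"):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    # Path.exists() follows symbolic links, so linked folders count as present.
    return [rel for rel in EXPECTED if not (root / rel).exists()]


def link_into_data(source, name, data_root="data"):
    """Create data_root/name as a symbolic link to an existing source path."""
    src = Path(source).resolve()
    root = Path(data_root)
    root.mkdir(parents=True, exist_ok=True)
    link = root / name
    # Skip if something (including a dangling link) already occupies the name.
    if not link.exists() and not link.is_symlink():
        link.symlink_to(src, target_is_directory=src.is_dir())
    return link


if __name__ == "__main__":
    for rel in missing_entries():
        print("missing:", rel)
```

For example, if the videos are stored on another disk, `link_into_data("/path/to/videos", "videos")` would create ./data/videos as a link to that folder (the source path here is a placeholder).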

Run the following script to download videos and generate clips:

python utils/download_clips.py

Run the following scripts to preprocess the videos and annotations:

bash scripts/extract_frame.sh
bash scripts/extract_wave.sh
python utils/preprocessing.py

Training

Then you can train TalkNet on Ego4d using:

python trainTalkNet.py

The results will be saved in exps/exp:

exps/exp/score.txt: output score file

exps/exp/model/model_00xx.model: trained model

exps/exp/val_res.csv: predictions for the val set

Pretrained model

The model pretrained on AVA will be downloaded automatically to data/pretrain_AVA.model.

Our model trained on Ego4d achieves 79.27% accuracy (ACC) on the test set.


Inference

Data preparation

We can predict active speakers for each person given the face tracks. Please put the tracking results in ./data/track_results. The structure should be like this:

data/

  • track_results/
    • results/
      • 0.txt
      • 1.txt
      • ...
    • v.txt
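
When inspecting the tracking results, note that a plain alphabetical sort places 10.txt before 2.txt. The sketch below lists the numbered files under results/ in numeric order. It is an illustration only: the helper name `list_track_files` is hypothetical, and it assumes nothing about the contents of the files or about what the numbers index.

```python
from pathlib import Path


def list_track_files(track_root="data/track_results"):
    """Return results/<N>.txt files under track_root, sorted by the integer N."""
    results = Path(track_root) / "results"
    # Keep only files whose name (without .txt) is a non-negative integer.
    numbered = [p for p in results.glob("*.txt") if p.stem.isdigit()]
    return sorted(numbered, key=lambda p: int(p.stem))


if __name__ == "__main__":
    for path in list_track_files():
        print(path)
```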

Run the following script to make the tracking results compatible with the dataloader (specify the subset from ['full', 'val', 'test']):

python utils/process_tracking_result.py --evalDataType ${SUBSET}

Usage

Run the following script, specifying the checkpoint and subset:

python inferTalkNet.py --checkpoint ${MODEL_PATH} --evalDataType ${SUBSET}

Finally, run the postprocessing script to make the predictions compatible with other components in this diarization benchmark:

python utils/postprocess.py --evalDataType ${SUBSET}
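
The three inference commands above (tracking-result conversion, inference, postprocessing) share the same subset, so they can be chained from one small driver. This is a sketch under assumptions, not a script shipped with the repository: the function names `inference_commands` and `run_all` are hypothetical, and only the script paths and flags shown above are used.

```python
import shlex
import subprocess
import sys

# Subsets accepted by --evalDataType, as listed above.
SUBSETS = ("full", "val", "test")


def inference_commands(checkpoint, subset):
    """Build the three inference-stage commands, in order, as argument lists."""
    if subset not in SUBSETS:
        raise ValueError("subset must be one of %s, got %r" % (SUBSETS, subset))
    return [
        ["python", "utils/process_tracking_result.py", "--evalDataType", subset],
        ["python", "inferTalkNet.py", "--checkpoint", checkpoint,
         "--evalDataType", subset],
        ["python", "utils/postprocess.py", "--evalDataType", subset],
    ]


def run_all(checkpoint, subset):
    """Run each step in turn; check=True stops at the first failing step."""
    for cmd in inference_commands(checkpoint, subset):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # With two arguments (checkpoint, subset), print the commands for review.
    if len(sys.argv) == 3:
        for cmd in inference_commands(sys.argv[1], sys.argv[2]):
            print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first and calling `run_all` only after reviewing them keeps the driver safe to try from the repository root.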

Citation

Please cite the following paper if our code is helpful to your research.

@article{grauman2021ego4d,
  title={Ego4d: Around the world in 3,000 hours of egocentric video},
  author={Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and Chavis, Zachary and Furnari, Antonino and Girdhar, Rohit and Hamburger, Jackson and Jiang, Hao and Liu, Miao and Liu, Xingyu and others},
  journal={arXiv preprint arXiv:2110.07058},
  year={2021}
}
