
avcleanse's Introduction

Audio-visual speaker recognition

Introduction

Audio-visual speaker recognition on VoxCeleb2. This repository includes the speaker recognition code (speaker folder), the face recognition code (face folder), and the speaker-face recognition code (speaker_face folder). We separate the code into three folders for convenience.

Our paper is here. This project contains the code for audio-visual speaker recognition and the cleansed training list.

This code uses Mixed Precision Training (torch.cuda.amp).
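For reference, here is a minimal sketch of a mixed-precision training step with torch.cuda.amp; the model, loss_fn, optimizer, and loader names are placeholders, not this repository's actual classes:

    import torch

    scaler = torch.cuda.amp.GradScaler()
    for speech, label in loader:                  # placeholder DataLoader
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():           # forward pass runs in fp16 where safe
            embedding = model(speech.cuda())
            loss = loss_fn(embedding, label.cuda())
        scaler.scale(loss).backward()             # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)                    # unscale gradients, then update weights
        scaler.update()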

Preparation

Requirements

pip install -r requirements.txt

Pretrained models

The link to the pretrained models can be found here.

A-Vox2.model:  The speaker network (ECAPA-L) trained on VoxCeleb2
V-Vox2.model:  The face network (ResNet18) trained on VoxCeleb2
V-Glint.model: The face network (ResNet50) trained on Glint360K

Create a pretrain folder in the root directory and put these models into it.
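Before plugging a checkpoint into its network, it can help to inspect it first. A small sketch, assuming the .model file stores a plain state_dict; the matching network class comes from the speaker or face folders:

    import torch

    # Load one of the pretrained checkpoints and list its parameter tensors.
    state = torch.load("pretrain/A-Vox2.model", map_location="cpu")
    print(len(state), "parameter tensors")
    print(list(state.keys())[:5])

    # Then load it into the matching network defined in the speaker/ folder, e.g.:
    # model.load_state_dict(state, strict=False)  # strict=False skips loss-head keys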

Dataset and text files

The .txt files can be found here. The faces for VoxCeleb1 can be downloaded here. There is no official link for downloading the videos of VoxCeleb2, so unfortunately we cannot help with that.

The structure of the dataset looks like:

    # VoxCeleb2
    # ├── frame_align (face frames dataset, after alignment)
    # │   ├── id00012 (speaker id)
    # │       ├── 21Uxsk56VDQ (video id)
    # │           ├── 00001 (utterance id)
    # │               ├── 0010.jpg (face frames, I extract one frame every 0.4 seconds)
    # │               ├── 0020.jpg
    # │               ├── 0030.jpg
    # │   ├── ...
    # ├── wav (speech dataset)
    # │   ├── id00012 (speaker id)
    # │       ├── 21Uxsk56VDQ (video id)
    # │           ├── 00001.wav (utterance id)
    # │   ├── ...
    # ├── train_all.txt (speaker_id-wav_file_name-duration)
    # ├── train_all_clean.txt (speaker_id-wav_file_name-duration-audio_sim_score-visual_sim_score-clean_or_noisy)

    # VoxCeleb1
    # ├── frame_align (face frames dataset, similar to Vox2)
    # ├── wav (speech dataset, similar to Vox2)
    # ├── O_list.txt (data list of VoxCeleb1-O, wav_file_name-duration)
    # ├── E_list.txt (data list of VoxCeleb1-E)
    # ├── H_list.txt (data list of VoxCeleb1-H)
    # ├── O_trials.txt (original test trials of VoxCeleb1-O)
    # ├── E_trials.txt (original test trials of VoxCeleb1-E)
    # ├── H_trials.txt (original test trials of VoxCeleb1-H)

The O_list, E_list and H_list are used to speed up the testing process; train_all.txt is the original training list; train_all_clean.txt is the cleansed training list.
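A small sketch of reading the cleansed list and keeping only the utterances marked as clean; it assumes the six fields described above are whitespace-separated and that the last field literally says clean or noisy:

    clean_utts = []
    with open("VoxCeleb2/train_all_clean.txt") as f:
        for line in f:
            spk_id, wav_name, duration, a_sim, v_sim, tag = line.split()
            if tag == "clean":                       # drop utterances flagged as noisy
                clean_utts.append((spk_id, wav_name, float(duration)))
    print(len(clean_utts), "clean utterances kept")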

For face alignment, I follow the approach from here. My code is provided for reference (it is not written very cleanly).
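For the face frames themselves, here is a rough sketch of extracting one frame every 0.4 seconds (2.5 fps) with ffmpeg before alignment; the video path and output naming are only illustrative, not the exact convention used above:

    import pathlib
    import subprocess

    video = "VoxCeleb2/video/id00012/21Uxsk56VDQ/00001.mp4"          # hypothetical path
    out_dir = pathlib.Path("VoxCeleb2/frame/id00012/21Uxsk56VDQ/00001")
    out_dir.mkdir(parents=True, exist_ok=True)

    # fps=2.5 keeps one frame every 0.4 seconds; alignment is applied afterwards.
    subprocess.run(
        ["ffmpeg", "-i", video, "-vf", "fps=2.5", str(out_dir / "%04d.jpg")],
        check=True,
    )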

Speaker Recognition

In the speaker folder, we train the ECAPA-TDNN speaker network on VoxCeleb2. The details of speaker recognition can be found here: ECAPA-TDNN.

Modality   System                Vox1-O   Vox1-E   Vox1-H   (EER, %)
Speech     (1) ECAPA-L-Vox2      0.98     1.21     2.30

Note that the results in our paper are the mean of three training runs, so they may differ slightly from these results.
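For completeness, one common way to compute EER from verification trial scores (not necessarily the exact implementation in this repo); the scores and labels below are toy values:

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(scores, labels):
        # EER is the operating point where false-accept and false-reject rates meet.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2 * 100       # in percent

    scores = np.array([0.71, 0.12, 0.65, 0.30])      # toy cosine similarities
    labels = np.array([1, 0, 1, 0])                  # 1 = same identity, 0 = different
    print(f"EER = {compute_eer(scores, labels):.2f}%")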

Face Recognition

In the face folder, we train a face recognition model on VoxCeleb2. Here are the results:

Modality   System                Vox1-O   Vox1-E   Vox1-H   (EER, %)
Face       (2) ResNet18-Vox2     0.97     0.81     1.16
Face       (3) ResNet50-Glint    0.03     0.07     0.09

Note that (3) is a model pretrained on the Glint360K dataset. We did not check whether there is identity overlap between Glint360K (360K identities) and VoxCeleb (6K identities), since the identity files are not available. This result is only meant to show that the multi-modal system is strong; it does not affect our cleansing purpose, since the final target is a cleansed VoxCeleb2.

We did not run experiments without face alignment for training and evaluation. For the face model pretrained on Glint360K, the results are poor if no alignment is applied at evaluation.

Speaker-Face Recognition

The pipeline of speaker-face recognition is in the speaker_face folder. We suggest training the two networks separately. Here are the results:

Modality   System      Vox1-O   Vox1-E   Vox1-H   (EER, %)
Fusion     (1)+(2)     0.15     0.23     0.41
Fusion     (1)+(3)     0.01     0.08     0.15
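A minimal sketch of score-level fusion of the speech and face systems: each system's trial scores are normalized and averaged. The equal weights and z-normalization here are illustrative choices, not necessarily the paper's exact fusion rule:

    import numpy as np

    def znorm(s):
        return (s - s.mean()) / s.std()              # put both systems on the same scale

    speech_scores = np.array([0.71, 0.12, 0.65])     # toy per-trial cosine scores
    face_scores = np.array([0.80, 0.05, 0.40])

    fused = 0.5 * znorm(speech_scores) + 0.5 * znorm(face_scores)
    print(fused)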

Training and evaluation

Get into the speaker, face or speaker_face folder.

For training: bash run_train.sh

Note: set the paths of the training data, the evaluation data, and the output.

For evaluation: bash run_eval.sh.

Note: set the path of the evaluation data.

I have optimized the code to speed up the evaluation process (': minutes, '': seconds):

Modality   System      Vox1-O   Vox1-E    Vox1-H
Speech     (1)         0'14''   5'30''    5'16''
Face       (2)         0'13''   5'33''    5'30''
Face       (3)         0'24''   11'26''   10'52''
Fusion     (1)+(2)     0'25''   11'26''   10'55''
Fusion     (1)+(3)     0'38''   16'57''   16'11''

For the speech modality alone, we can evaluate Vox1-E and Vox1-H within 6 minutes (on one RTX 3090).
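The speed-up mainly comes from embedding each unique file once (using the O/E/H lists) and scoring the trials by lookup. A pseudocode-level sketch of that idea; extract_embedding is a dummy standing in for the actual speaker or face network, and the trial file is assumed to follow the usual label/enrol/test format:

    import torch
    import torch.nn.functional as F

    def extract_embedding(wav_path):
        # Dummy stand-in for the real network; returns a random 192-dim embedding.
        return torch.randn(192)

    emb = {}
    with open("VoxCeleb1/O_list.txt") as f:          # one line per unique wav file
        for line in f:
            wav_name = line.split()[0]
            if wav_name not in emb:                  # each file is embedded only once
                emb[wav_name] = extract_embedding("VoxCeleb1/wav/" + wav_name)

    with open("VoxCeleb1/O_trials.txt") as f:        # label, enrol wav, test wav
        for line in f:
            label, wav_a, wav_b = line.split()
            score = F.cosine_similarity(emb[wav_a], emb[wav_b], dim=0)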

avcleanse's People

Contributors

taoruijie


avcleanse's Issues

About the speaker-face step

Thanks for your open-source code. In the paper, I guess you train the speaker and face embedding networks separately and then use them to clean the data. But in the speaker_face folder, you train the speaker and face embedding networks together. I want to ask why you do this. Thank you.

face data

Hello, I want to ask how to generate the corresponding face data from the VoxCeleb video dataset, because I haven't learned much about face recognition.

Question about the cleansing step

Hello, I'm working on an audio-visual speaker recognition system and I have read your code.

I would like to ask whether your cleansing code is included in this repo, or whether you only provide the clean list for the training data.

I didn't find the part where you perform the cleansing steps. Does that mean the models in this repo are just plain speaker recognition models?

Thanks for your reply!

Question about the pretrained visual model

Hello, thanks for the code and for sharing it. When I ran the evaluation with this code, the pretrained visual models (V-Vox2.model and V-Glint.model) did not have the correct number of parameters for the visual network, for both IResNet18 and IResNet50, resulting in terrible evaluation scores and EER.

Specifically, V-Vox2.model has 422 parameter tensors while model.state_dict() only has 188. Meanwhile, V-Glint.model is missing face_loss.weights, so its number of parameters is one smaller. I wonder if I did something wrong in the evaluation process or if the given pretrained models are not correct.

(By the way, the given pretrained audio model is correct and works well.)
