hechmik / voxceleb_enrichment_age_gender

Code and data repository for paper "VoxCeleb enrichment for Age and Gender recognition" submitted at ASRU 2021

License: MIT License

VoxCeleb enrichment for Age and Gender recognition

This repository contains all the material related to the paper "VoxCeleb enrichment for Age and Gender recognition", submitted for publication at ASRU 2021. Readers mainly interested in the data can download the ENRICHED DATASET csv file directly.

Arxiv Link

https://arxiv.org/abs/2109.13510

Abstract

VoxCeleb datasets are widely used in speaker recognition studies. Our work serves two purposes.

First, we provide speaker age labels and (an alternative) annotation of speaker gender. Second, we demonstrate the use of this metadata by constructing age and gender recognition models with different features and classifiers. We query different celebrity databases and apply consensus rules to derive age and gender labels. We also compare the original VoxCeleb gender labels with our labels to identify records that might be mislabeled in the original VoxCeleb data.
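The consensus step can be pictured with a minimal sketch. The rule below (accept a label only when a strict majority of the sources that returned a value agree) is an assumption made for illustration and is not necessarily the rule used in the paper; the source names and the function `consensus_label` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Mapping, Optional


def consensus_label(votes: Mapping[str, Optional[Hashable]]) -> Optional[Hashable]:
    """Return the label backed by a strict majority of the sources, else None.

    `votes` maps a (hypothetical) source name to the label it returned,
    or to None when that source had no entry for the speaker.
    """
    answers = [label for label in votes.values() if label is not None]
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    # Strict majority: more than half of the sources that answered must agree.
    return label if count * 2 > len(answers) else None


# Illustrative values only, not taken from the dataset.
birth_year = consensus_label({"source_a": 1975, "source_b": 1975, "source_c": 1976})
gender = consensus_label({"source_a": "f", "source_b": "m", "source_c": None})
```

Here two of three sources agree on the birth year, so it is accepted, while the two gender answers disagree and the speaker would be left unlabeled.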

On the modeling side, the lowest mean absolute error (MAE) in age regression, 9.443 years, is obtained using i-vector features with ridge regression. This indicates how challenging age estimation from in-the-wild speech data is.
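For reference, MAE is the average absolute difference between true and predicted ages. A minimal sketch follows; the function name and the ages are made up for illustration and are not taken from the paper's experiments.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """Average of |true - predicted| over all utterances, in years."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)


# Made-up ages, for illustration only: errors of 5, 10 and 0 years.
mae = mean_absolute_error([30, 40, 50], [35, 30, 50])
```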

Authors

Repo structure

This repository is structured as follows:

dataset: here the ENRICHED DATASET can be found and downloaded, as well as support files detailing which records have been used for training and testing
best_models: the best models reported in the paper, Linear Regression with i-Vectors (Age regression) and Logistic regression with i-Vectors (Gender recognition), are made available so that other users can try them in a variety of scenarios (assuming that features were computed as described)
notebooks: Python scripts and Jupyter notebooks used throughout the various steps

Acknowledgments

This work has been partially sponsored by Academy of Finland (proj. no. 309629).

Considering the nature of the work, we would also like to cite the original VoxCeleb 1 and VoxCeleb 2 papers in this README:

[1] A. Nagrani*, J. S. Chung*, A. Zisserman, VoxCeleb: a large-scale speaker identification dataset, INTERSPEECH, 2017

[2] J. S. Chung*, A. Nagrani*, A. Zisserman, VoxCeleb2: Deep Speaker Recognition, INTERSPEECH, 2018

Similar works

This work was carried out in 2020, when the first author was affiliated with the University of Eastern Finland. The authors later came across an independent but closely related work that addresses age labeling of VoxCeleb. The key difference between our work and theirs is that we assigned age labels based on the videos' semantics and the speakers' identities, while they trained a facial age estimation model for the labeling task, taking as input the visual frames of the original YouTube videos. For the reader's convenience, the paper's full reference follows, together with its GitHub repository.

N. Tawara, A. Ogawa, Y. Kitagishi and H. Kamiyama, "Age-VOX-Celeb: Multi-Modal Corpus for Facial and Speech Estimation," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6963-6967, doi: 10.1109/ICASSP39728.2021.9414272.

GitHub repository: https://github.com/nttcslab-sp/agevoxceleb

Contact information

For any comments, clarifications or suggestions, please feel free to open an issue here on GitHub and/or send me an email at hechmi DOT khaled1995 AT gmail DOT com

voxceleb_enrichment_age_gender's Issues

The final dataframe does not contain VoxCeleb1 data?

Thank you for creating this dataset. When inspecting final_dataframe_extended.csv, I found that no VoxCeleb ID starts with id1xx, which indicates VoxCeleb1. All VoxCeleb IDs in that file start with id0xx, which indicates VoxCeleb2. A quick check of https://github.com/hechmik/voxceleb_enrichment_age_gender/blob/main/dataset/age-train.txt and https://github.com/hechmik/voxceleb_enrichment_age_gender/blob/main/dataset/age-test.txt shows the same thing.

Is there a bug in the code that removes the VoxCeleb1 data? Or is there an overlap between VoxCeleb1 and VoxCeleb2 speakers, so that VoxCeleb1 samples are dropped when the dataframes are merged? Thank you.
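A check like the one described above can be reproduced with a few lines. This is only a sketch: the helper `count_id_prefixes` is hypothetical, and the IDs below are placeholders rather than rows read from final_dataframe_extended.csv, whose column layout is not shown here.

```python
from collections import Counter
from typing import Iterable


def count_id_prefixes(speaker_ids: Iterable[str], prefix_len: int = 3) -> Counter:
    """Count how many speaker IDs share each leading prefix (e.g. 'id0', 'id1')."""
    cleaned = (speaker_id.strip() for speaker_id in speaker_ids)
    # Skip empty lines so a trailing newline in a list file does not add a '' prefix.
    return Counter(speaker_id[:prefix_len] for speaker_id in cleaned if speaker_id)


# Placeholder IDs for illustration; in practice these would come from the csv file.
sample_ids = ["id00012", "id00015", "id10001", ""]
prefix_counts = count_id_prefixes(sample_ids)
```

If the resulting counter has no `id1` key, the list contains no IDs with the VoxCeleb1 prefix.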

test_data.npz not found?

Hi author, thanks for the great work! I want to test your pre-trained model. Could you share your test set file test_data.npz?
I tried to create it myself, but failed.
I only need the MFCC age test_data.npz. Thanks a lot!

input dimension

Hello!

ASVTorch generates 24 MFCCs, so the MFCCs have shape (n, 24). Your input is (200, 30). Where does the 30 come from? Can you please provide some test samples?

npz data files creation

Hi,
Thanks for the nice repository. I have a question regarding the train/test phase in some of the notebooks:
You are loading npz data files (e.g., '/media/hdd1/khaled/npz_files/final_version/test_data.npz') that make up the train and test datasets. However, I can't find them in the repository, and I didn't understand how you created them.

Any help finding out how you packed the data files will be great, thanks!
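For context, an .npz file is NumPy's archive format: `np.savez` writes several named arrays into one file and `np.load` reads them back by name. The sketch below only illustrates that mechanism. The key names `features` and `labels`, the array shapes and the values are assumptions, since the keys the notebooks actually expect are not stated here.

```python
import os
import tempfile

import numpy as np

# Hypothetical contents: one feature vector and one age label per utterance.
features = np.zeros((4, 400), dtype=np.float32)
labels = np.array([25.0, 31.0, 47.0, 52.0], dtype=np.float32)

# Written to a temporary directory so the sketch does not touch the repository.
path = os.path.join(tempfile.mkdtemp(), "test_data.npz")
np.savez(path, features=features, labels=labels)

# Arrays are read back by the keyword names used in np.savez.
with np.load(path) as archive:
    loaded_features = archive["features"]
    loaded_labels = archive["labels"]
```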

Age prediction on new utterances

Hi,

Thank you for sharing the code. I am trying to understand how to predict age for a new test set. I have a csv file with utterance_ids and paths. Would it be possible to share a code snippet showing how the model(s) could be tested on a single utterance?

Thanks a lot!

Trying ivec_log_reg_model.torch

I am trying to use the gender recognition model shown here ('ivec_log_reg_model.torch'), but the method suggested runs into an error:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    model.load_state_dict(torch.load('../best_models/ivec_log_reg_model.torch'))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LogisticRegression:
        size mismatch for linear.weight: copying a param with shape torch.Size([2, 400]) from checkpoint, the shape in current model is torch.Size([1, 512]).
        size mismatch for linear.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([1]).

Replacing (512,1) with (400,2) in the example does seem to work. Now, the problem is that there's no mention of how to test it with your own audio files. I'll see if I can find it, but any suggestion would be welcome.
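A model definition consistent with the checkpoint shapes in the traceback (`linear.weight` of size [2, 400], `linear.bias` of size [2]) could look like the sketch below. Only the layer name and sizes come from the error message. Returning raw logits from `forward`, applying softmax afterwards and the random input are assumptions, and the mapping from class index to gender is not given here.

```python
import torch
from torch import nn


class LogisticRegression(nn.Module):
    """Single linear layer; the attribute name `linear` matches the checkpoint keys."""

    def __init__(self, input_dim: int = 400, num_classes: int = 2) -> None:
        super().__init__()
        self.linear = nn.Linear(input_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumption: the model returns raw logits, one per class.
        return self.linear(x)


model = LogisticRegression()
# With the repository checked out, the weights would be loaded like this:
# model.load_state_dict(torch.load("../best_models/ivec_log_reg_model.torch"))
model.eval()

# Random stand-in for a batch of three 400-dimensional i-vectors.
ivectors = torch.randn(3, 400)
with torch.no_grad():
    probabilities = torch.softmax(model(ivectors), dim=1)
predicted_class = probabilities.argmax(dim=1)
```

Running this on real audio would still require extracting 400-dimensional i-vectors with the same pipeline the paper describes, which this sketch does not cover.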