taoruijie / ecapa-tdnn

Unofficial reimplementation of ECAPA-TDNN for speaker recognition (EER = 0.86 on Vox1_O when trained only on VoxCeleb2)

License: MIT License

Language: Python 100.00%
Topics: speaker-recognition, ecapa-tdnn, voxceleb2, speaker-verification, voxceleb1


Introduction

This repository contains my unofficial reimplementation of the standard ECAPA-TDNN model for speaker recognition, trained on the VoxCeleb2 dataset.

This repository is adapted from voxceleb_trainer.

Best Performance in this project (with AS-norm)

Dataset    Vox1_O    Vox1_E    Vox1_H
EER        0.86      1.18      2.17
minDCF     0.0686    0.0765    0.1295

Note: these results use the Vox1_O clean list; on the Vox1_O noisy list, the EER is 1.00 and the minDCF is 0.0713.


System Description

I have uploaded the system description; please see Section 3, ECAPA-TDNN SYSTEM.

Dependencies

Note: these versions match my own machine; adjust the torch and torchaudio versions to suit your device.

Start by building the environment

conda create -n ECAPA python=3.7.9 anaconda
conda activate ECAPA
pip install -r requirements.txt

Or start from an existing environment

pip install -r requirements.txt

Data preparation

Please follow the official code to prepare your VoxCeleb2 dataset, as described in the 'Data preparation' part of that repository.

Datasets used for training:

  1. VoxCeleb2 training set;

  2. MUSAN dataset;

  3. RIR dataset.

Datasets used for evaluation:

  1. VoxCeleb1 test set for Vox1_O

  2. VoxCeleb1 train set for Vox1_E and Vox1_H (Optional)

Training

Then change the data paths in trainECAPAModel.py and train the ECAPA-TDNN model end-to-end with:

python trainECAPAModel.py --save_path exps/exp1 

Every test_step epochs, the system is evaluated on the Vox1_O set and the EER is printed.

The results are saved in exps/exp1/score.txt, and the model checkpoints are saved in exps/exp1/model.

In my case, I trained for 80 epochs on one 3090 GPU. Each epoch takes about 37 minutes, so the total training time is about 48 hours.

Pretrained model

Our pretrained model achieves an EER of 0.96 on the Vox1_O set without AS-norm; you can verify it with:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model

With AS-norm, this system achieves an EER of 0.86. We do not plan to update this code in the near future, as we lack the time for this work. If you want to add AS-norm or other score normalization methods, I suggest the following paper:

Matejka, Pavel, et al. "Analysis of Score Normalization in Multilingual Speaker Recognition." INTERSPEECH. 2017.
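
For orientation, here is a minimal sketch of adaptive score normalization (AS-norm), not taken from this repository: it assumes enroll and test are 1 x D L2-normalized embeddings, cohort is an N x D matrix of L2-normalized cohort embeddings, and top_k <= N.

import torch

def as_norm(enroll, test, cohort, top_k=300):
    # Raw trial score: cosine similarity, since the embeddings are L2-normalized.
    raw = torch.matmul(enroll, test.t()).item()
    # Score each side of the trial against the cohort and keep the top-k closest speakers.
    score_e = torch.matmul(cohort, enroll.t()).squeeze(1)
    score_t = torch.matmul(cohort, test.t()).squeeze(1)
    top_e = torch.topk(score_e, top_k).values
    top_t = torch.topk(score_t, top_k).values
    # Normalize the raw score by each side's cohort statistics and average the two.
    norm_e = (raw - top_e.mean()) / top_e.std()
    norm_t = (raw - top_t.mean()) / top_t.std()
    return 0.5 * (norm_e + norm_t)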

We have also uploaded a score file at exps/pretrain_score.txt; it records the training loss, training accuracy and Vox1_O EER for each epoch, for your reference.


Reference

Original ECAPA-TDNN paper

@inproceedings{desplanques2020ecapa,
  title={{ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification}},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle={Interspeech 2020},
  pages={3830--3834},
  year={2020}
}

Our reimplementation report

@article{das2021hlt,
  title={HLT-NUS SUBMISSION FOR 2020 NIST Conversational Telephone Speech SRE},
  author={Das, Rohan Kumar and Tao, Ruijie and Li, Haizhou},
  journal={arXiv preprint arXiv:2111.06671},
  year={2021}
}

VoxCeleb_trainer paper

@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Interspeech},
  year={2020}
}

Acknowledgements

We studied many useful projects during our coding process, including:

clovaai/voxceleb_trainer.

lawlict/ECAPA-TDNN.

speechbrain/speechbrain

ranchlai/speaker-verification

Thanks to these authors for open-sourcing their code!

Notes

If you run into problems with this repository, please ask in the 'Issues' section on GitHub (in English) instead of messaging me on Bilibili, so that others can also benefit. Thanks for your understanding!

If you improve on the results of this repository with some method, please let me know. Thanks!

Cooperation

If you are interested in working on this topic and have ideas to implement, I am glad to collaborate and contribute my experience and knowledge in this area. Please contact me at [email protected].


ecapa-tdnn's Issues

Why do training and testing audio have different lengths?

Hi,
I notice that at training time num_frames is 200, so each training segment is 2 seconds long.
But at evaluation time the segments are 3 seconds long (ECAPAModel.py, line 63).
Why aren't training and testing done at the same length?

Online augmentation affects speed

Hello, while training I find that data augmentation always wastes a lot of time, which causes the GPU (a 3090) to run intermittently. Almost 20 seconds are spent on augmentation for every batch (batch_size = 400). May I ask why yours runs so fast? Looking forward to your reply.

result EER

I tested on Vox1_O (veri_test.txt) and got the results below:
EER 1.12%, minDCF 0.0745
I noticed that the result reported in the README is actually evaluated on the Vox1_O clean list (veri_test2.txt). Still great work.
By the way, without TTA I got EER 1.0052% and minDCF 0.08051 on Vox1_O (clean).

GPU question

May I ask: does this program require the GPU build of PyTorch, or can it also run with the CPU build?

Fail to download the VoxCeleb1 & 2 datasets

Hi there, do you have any idea how to download the requested datasets for training? I tried to download VoxCeleb1 & 2 as stated in the README docs, but it fails.

Accelerating evaluation speed

During evaluation, the current implementation calculates the similarity scores one by one in a for loop, which can be slow as the size of "lines" grows. Is there an elegant way to vectorize it?
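
One possible vectorization, sketched under assumptions about this code (that feats maps each file name to a 1 x D L2-normalized embedding, and that each trial line reads "label enroll test" as in the veri_test2.txt excerpt elsewhere on this page): stack all embeddings once and compute every pairwise cosine score with a single matrix multiply.

import torch

files = sorted(feats.keys())
index = {f: i for i, f in enumerate(files)}
emb = torch.stack([feats[f].squeeze(0) for f in files])   # N x D matrix of embeddings

sim = emb @ emb.t()                                       # all pairwise cosine scores at once

scores, labels = [], []
for line in lines:
    label, file1, file2 = line.split()
    scores.append(sim[index[file1], index[file2]].item()) # O(1) lookup per trial
    labels.append(int(label))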

The content format of these two files train_list_with_len.txt and veri_test2.txt

I want to use my own data to train the model, but I ran into some problems. I wonder whether my training-list format is wrong, so I would like to ask about the content format of these two files. I followed the Data preparation steps.
Here is the content of my /data08/VoxCeleb2/train_list_with_len.txt:

d10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00001.wav
id10001 id10001/Y8hIVOBuels/00002.wav
id10001 id10001/Y8hIVOBuels/00002.wav

Here is the content of my /data08/VoxCeleb1/veri_test2.txt:

1 id10001/Y8hIVOBuels/00001.wav id10001/1zcIwhmdeo4/00001.wav
0 id10001/Y8hIVOBuels/00001.wav id10943/vNCVj7yLWPU/00005.wav
1 id10001/Y8hIVOBuels/00001.wav id10001/7w0IBEWc9Qw/00004.wav
0 id10001/Y8hIVOBuels/00001.wav id10999/G5R2-Hl7YX8/00008.wav
1 id10001/Y8hIVOBuels/00002.wav id10001/7w0IBEWc9Qw/00002.wav
0 id10001/Y8hIVOBuels/00002.wav id10787/qZInQxuCSVo/00008.wav

Dataset

Could I use this code on my own dataset? I want to run it on the UrbanSound8K dataset.

Question about Res2Net module configuration

Thank you for a good resource. :)

Is there any special reason to implement the multi-scale (Res2Net) module of the ECAPA-TDNN model differently from the original paper?
(i.e., in the Res2Net paper it is the first split that passes through unchanged, but in your implementation it is the last split)


ECAPA-TDNN/model.py, lines 59 to 72 (commit a229093):

spx = torch.split(out, self.width, 1)
for i in range(self.nums):
    if i == 0:
        sp = spx[i]
    else:
        sp = sp + spx[i]
    sp = self.convs[i](sp)
    sp = self.relu(sp)
    sp = self.bns[i](sp)
    if i == 0:
        out = sp
    else:
        out = torch.cat((out, sp), 1)
out = torch.cat((out, spx[self.nums]), 1)

Cannot prepare the dataset

When I followed the Data preparation part in the link and ran python3 dataprep.py --save_path data --download --user USERNAME --password PASSWORD, I got the following error:

--2021-11-26 14:04:56-- http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Resolving www.robots.ox.ac.uk (www.robots.ox.ac.uk)... 129.67.94.2
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa [following]
--2021-11-26 14:04:58-- https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Connecting to www.robots.ox.ac.uk (www.robots.ox.ac.uk)|129.67.94.2|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-11-26 14:04:59 ERROR 404: Not Found.

Traceback (most recent call last):
File "Downloads/voxceleb_trainer-master/dataprep.py", line 176, in
download(args,fileparts)
File "Downloads/voxceleb_trainer-master/dataprep.py", line 58, in download
raise ValueError('Download failed %s. If download fails repeatedly, use alternate URL on the VoxCeleb website.'%url)
ValueError: Download failed http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa. If download fails repeatedly, use alternate URL on the VoxCeleb website.

How can I solve this problem? Thanks!

ECAPA-TDNN

Hi!

I'm having trouble understanding ECAPA-TDNN architecture.


To be specific, I don't understand what the elements in ECAPA-TDNN (PreEmphasis, MelSpectrogram, FBankAug, Conv1d, ReLU, BatchNorm1d, bottleneck, attention...) do in the context of speaker verification.

What about the AAMsoftmax classifier, the Adam optimizer and the StepLR scheduler?

Thanks for your attention and time!

file structure of the dataset

Could you please post the file structure of the dataset, it would be great if you could upload a demo of the dataset, thanks

Is the accuracy not very high?

80 epoch, LR 0.000090, LOSS 1.371865, ACC 74.04%, EER 1.03%, bestEER 0.96%
Looking at the training log, by epoch 80 the EER has already reached 1.03, but the ACC is only 74.04%. Isn't the accuracy rather low?

The motivation of the two modifications

I noticed that there are two modifications here:

  1. self.attention = nn.Sequential(
         nn.Conv1d(96, 128, kernel_size=1),
         nn.ReLU(),
         nn.BatchNorm1d(128),
         nn.Tanh(), # I add this layer
         nn.Conv1d(128, 96, kernel_size=1),
         nn.Softmax(dim=2),
     )

  2. self.se = nn.Sequential(
         nn.AdaptiveAvgPool1d(1),
         nn.Conv1d(channels, bottleneck, kernel_size=1, padding=0),
         nn.ReLU(),
         # nn.BatchNorm1d(bottleneck), # I remove this layer
         nn.Conv1d(bottleneck, channels, kernel_size=1, padding=0),
         nn.Sigmoid(),
     )

What is the motivation here? And do these changes benefit the performance?
Thanks

Inconsistent model input?

I see the following in the inference code:

with torch.no_grad():
    embedding_1 = self.speaker_encoder.forward(data_1, aug = False)
    embedding_1 = F.normalize(embedding_1, p=2, dim=1)
    embedding_2 = self.speaker_encoder.forward(data_2, aug = False)
    embedding_2 = F.normalize(embedding_2, p=2, dim=1)
    embeddings[file] = [embedding_1, embedding_2]

Here data_1 is the full utterance and data_2 is the split-and-stacked segments. For utterances of different lengths, data_1 and data_2 have no fixed length, yet both can be fed into self.speaker_encoder.forward to compute embeddings?

Single-GPU to multi-GPU

Have you tried adapting the code to multiple GPUs? After my changes, convergence is very slow:
89 epoch, LR 0.000069, LOSS 2.890798, ACC 48.38%, EER 1.90%, bestEER 1.76%
Performance does not improve in later epochs either.

Code usage

Could I use your code for an R&E project at high school, please?

How to fix slow training

Hello, when I run experiments with your code, training is slow. It seems the CPU takes a long time to load the data (with 4 data-loading threads, there is a wait of about half a minute after every four batches). Is there any way to fix this?
Here are my experiment and configuration screenshots:
[two configuration screenshots omitted]

AS-Norm

Has anyone implemented AS-norm? Could you share it with me?
Thank you very much!

The feature passed into the model is not MFCC

In the original paper, the features passed into the model are 80-dimensional MFCCs, but in your code I don't see the raw speech being converted to MFCC. I'm not sure whether I misunderstood or whether you feed a different feature; does that make any difference? Waiting for your reply.
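
For reference, a hedged sketch of an 80-dimensional log-Mel filterbank front-end, in the spirit of the PreEmphasis / MelSpectrogram steps named elsewhere on this page; the exact STFT parameters are an assumption here, not taken from the repository's code.

import torch
import torchaudio

# 16 kHz audio, 25 ms windows, 10 ms hop, 80 Mel bins (assumed values).
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=400, hop_length=160, n_mels=80)

waveform = torch.randn(1, 32000)              # two seconds of dummy audio
fbank = (melspec(waveform) + 1e-6).log()      # log-Mel filterbanks, not MFCC

Unlike MFCC, no discrete cosine transform is applied at the end, so the features stay in the Mel filterbank domain.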

How do you apply AS-NORM?

Hi, thanks for sharing your code. You say the best performance is achieved with AS-norm. Can you share how you apply AS-norm, and with which cohort set?

A question about AS-norm

Hi Ruijie, I have followed you on Bilibili for a long time! I am working on the SASVC challenge recently and found that this repository is used as the ASV baseline code. In the README you write that this ECAPA-TDNN result is obtained after AS-norm, but I could not find any backend normalization code in the repository. Is this a typo, or did you simply not add that part of the code to this repository?

Train data?

Hi, what is the training data: only VoxCeleb2, or VoxCeleb1 + VoxCeleb2?

KeyError: 'id10004/JKMfqmjUwMU/00003.wav'

Hi,
Thanks for this great work.
I'm getting the following error while executing trainSpeakerNet.py with a customized dataset (a smaller number of samples from VoxCeleb1) for training and evaluation, and I couldn't solve it.

Please help me to resolve this issue.

Reading 0 of 37: 0.00 Hz, embedding size 512
Computing 0 of 143: 0.00 HzTraceback (most recent call last):
File "./trainSpeakerNet.py", line 310, in
main()
File "./trainSpeakerNet.py", line 304, in main
mp.spawn(main_worker, nprocs=n_gpus, args=(n_gpus, args))
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/workspace/work/voxceleb_trainer-master/trainSpeakerNet.py", line 193, in main_worker
sc, lab, _ = trainer.evaluateFromList(**vars(args))
File "/workspace/work/voxceleb_trainer-master/SpeakerNet.py", line 221, in evaluateFromList
com_feat = feats[data[2]].cuda()
KeyError: 'id10004/JKMfqmjUwMU/00003.wav'

Threshold selection and score distribution

Hello, I noticed that some test scores are negative. What is the score range? I had assumed it was 0 to 1. And how is the decision threshold determined?
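
Since the embeddings are L2-normalized before scoring (see the inference snippet quoted above), the trial scores are cosine similarities in [-1, 1]. For the EER, the threshold is wherever the false-accept and false-reject rates cross; a minimal sketch, assuming scores and 0/1 labels collected from the trial list:

import numpy as np

def compute_eer(scores, labels):
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.sort(scores)
    # False-accept rate: non-target trials scored at or above the threshold.
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    # False-reject rate: target trials scored below the threshold.
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))        # point where the two rates cross
    return (far[idx] + frr[idx]) / 2, thresholds[idx]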

GPU utilization error!

Hi, author. I am training the ECAPA-TDNN model end-to-end using: python trainECAPAModel.py
However, the training time per epoch is very long. After checking, I found that GPU memory is occupied, but GPU utilization stays at 0. I manually set model.cuda(), but it does not help. Which part of the program should I change to get the model running on the GPU?

Questions about the reproduced ECAPA-TDNN paper

Hi

I found some differences between your code configuration and the original ECAPA configuration.

The most important one: in your code, you randomly choose one of the six noise types to add, whereas ECAPA applies all six augmentation methods, which effectively gives them a larger dataset.

I trained the 512-channel model, which only achieves 1.16 EER (1.01 in ECAPA), yet your 1024-channel result is even better than ECAPA's. So is there any training trick you are holding back, or did you change the configuration in the uploaded code? (I just copied your project and changed the channel number; everything else stayed the same.) Or do the small differences in your code only pay off on a larger model?

And thank you for your excellent work! Any help will be appreciated!

Best

Vox1_E and Vox1_H

Hello! Without score normalization, what are the test metrics on Vox1_E and Vox1_H?

What's the effect of PreEmphasis?

I think the ECAPA-TDNN paper didn't mention any pre-processing of the waveform, and voxceleb_trainer doesn't have this step either. Does it affect performance?
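
For context, pre-emphasis is the classic first-order high-pass filter y[t] = x[t] - 0.97 * x[t-1], applied to the waveform before the STFT to boost high frequencies. A minimal sketch as a fixed conv1d filter; the coefficient 0.97 is the common default and an assumption here:

import torch
import torch.nn.functional as F

def pre_emphasis(waveform, coef=0.97):
    # waveform: (batch, time); reflect-pad one sample on the left so the output length matches.
    padded = F.pad(waveform.unsqueeze(1), (1, 0), mode='reflect')
    kernel = torch.tensor([[[-coef, 1.0]]])    # shape (out_channels, in_channels, kernel_size)
    return F.conv1d(padded, kernel).squeeze(1)

Whether this small spectral tilt measurably changes the EER is exactly the question being asked; it is cheap to apply either way.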

How can I use pretrain.model to continue training?

I want to add some Chinese audio to the training data.

Can I use your pretrain.model and continue training on my data,

or do I have to download all the VoxCeleb1 data plus my data and train from the beginning?

Thank you for your reply.
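
Assuming the --initial_model flag shown in the README's evaluation command is also honored in training mode (an assumption, not verified here), a warm-start run might look like:

python trainECAPAModel.py --save_path exps/finetune --initial_model exps/pretrain.model

Note that the classifier head is sized for the original training-speaker set, so adding new speakers would still require the output layer and training list to match your combined data.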

About splitting utterances into a matrix

Excuse me, I noticed that you split utterances into a matrix in the evaluation stage. Could you please explain why you do that?
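
A hypothetical sketch of that split-and-stack step (the shapes and the num=5 choice are assumptions, not taken from the repository's code): the utterance is cut into fixed-length windows at evenly spaced start points and stacked into one batch, so a single forward pass yields several segment embeddings that can be combined with the full-utterance one, a test-time-augmentation-style trick.

import numpy
import torch

max_audio = 300 * 160 + 240                    # a 3-second window in samples at 16 kHz (assumed)
starts = numpy.linspace(0, audio.shape[0] - max_audio, num=5)
segments = [audio[int(s):int(s) + max_audio] for s in starts]
data_2 = torch.FloatTensor(numpy.stack(segments, axis=0))   # (5, max_audio) matrix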

About the training time

Hello, thank you so much for contributing this project.
I have been training this model recently. I also use one 3090 and the same settings as you, but each epoch takes me about 20 hours. Do you know what the reason might be?
Thank you in advance for your answer.

training set is not 5 times bigger after augmentation

I notice that in the dataloader, the training set after augmentation has the same size as the original audio set.

So augmentation here does not increase the amount of training data, only its diversity?

How to evaluate your nn

Hi!
I'm new to neural networks and I'm having trouble figuring out how to evaluate your implementation.
For now I'm using an audio dataset different from your --eval_path and --eval_list, so I'm running this command:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model --eval_list /eval_list_directory --eval_path /eval_path_directory

Is this the correct way to evaluate your implementation? Should I use any different arguments? The thing is, I don't think I understand what exps/pretrain.model is, so I don't know how to use it.

Looking forward to your response!
Thanks
