qqueing / deepspeaker-pytorch

Speaker embedding (verification and recognition) using PyTorch

License: MIT License

Python 100.00%

deepspeaker-pytorch's Issues

data_a & testloader.dataset()

Hello! In the test section of triplet_train.py, the progress printout is computed as batch_idx * len(data_a) / len(test_loader.dataset). batch_idx grows with each iteration, while len(data_a) stays constant (512 by default, it seems). However, len(test_loader.dataset) is 37720, the number of entries in voxceleb1_test.txt. The numerator and denominator no longer match once the numerator grows very large. What's wrong here? Any suggestions?
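For reference, here is a minimal sketch of two ways to compute a test-loop progress percentage that cannot exceed 100%. It is not the repository's code: the function names are hypothetical, and the only values taken from the issue are the dataset length (37720) and the batch size (512). It also assumes the loader keeps the last, smaller batch (drop_last=False), in which case len(test_loader) equals the batch count computed below.

```python
import math


def progress_by_samples(batch_idx, batch_size, dataset_len):
    """Percentage of samples seen so far.

    The last batch may be smaller than batch_size, so the sample count
    is clamped to dataset_len to keep the result at or below 100.
    """
    seen = min(batch_idx * batch_size, dataset_len)
    return 100.0 * seen / dataset_len


def progress_by_batches(batch_idx, num_batches):
    """Percentage of batches processed; independent of the batch size."""
    return 100.0 * batch_idx / num_batches


dataset_len = 37720  # number of test entries reported in the issue
batch_size = 512     # default batch size reported in the issue

# Number of batches when the final partial batch is kept (assumption).
num_batches = math.ceil(dataset_len / batch_size)

if __name__ == "__main__":
    for batch_idx in (0, num_batches // 2, num_batches):
        print(batch_idx,
              round(progress_by_samples(batch_idx, batch_size, dataset_len), 1),
              round(progress_by_batches(batch_idx, num_batches), 1))
```

If the printed sample count still runs past the dataset length in practice, comparing len(test_loader) against the number of iterations actually executed is one way to check whether the loader yields more batches than expected.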

question about the loss function

Could you tell me why you use the L2 distance rather than the cosine similarity proposed in the paper?
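One relevant fact for this question: for unit-length vectors, the squared L2 distance and the cosine similarity are related by ||a - b||^2 = 2 - 2 cos(a, b), so the two measures rank pairs identically. The sketch below checks this numerically. Whether the repository L2-normalizes its embeddings before computing the loss is an assumption here, and the helper names and example vectors are made up for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Arbitrary example vectors, normalized to unit length (assumption:
# embeddings are L2-normalized before the distance is computed).
a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([-2.0, 0.5, 4.0])

lhs = squared_l2_distance(a, b)
rhs = 2.0 - 2.0 * cosine_similarity(a, b)

if __name__ == "__main__":
    print(lhs, rhs)  # the two values agree up to floating-point error
```

Without normalization the identity does not hold, and the two losses can behave differently.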

txt issue

Did you test the difference between using and not using delta features?

Will using delta features improve performance?
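For readers unfamiliar with the term, delta features are commonly computed with the regression formula d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2) and then appended to the static features. The sketch below illustrates that formula only. It says nothing about whether deltas help in this repository; the window size N=2 and the edge handling (repeating the first and last frame) are assumptions, and the function name is hypothetical.

```python
def delta(features, N=2):
    """Regression-based delta of a feature sequence.

    features: list of frames, each frame a list of floats.
    Edge frames are repeated where the window runs past either end
    (an assumption; other padding schemes exist).
    """
    T = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        frame = []
        for k in range(len(features[t])):
            acc = 0.0
            for n in range(1, N + 1):
                nxt = features[min(t + n, T - 1)][k]
                prv = features[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            frame.append(acc / denom)
        out.append(frame)
    return out


# A one-dimensional ramp with slope 1 per frame.
ramp = [[float(t)] for t in range(6)]
ramp_delta = delta(ramp)

# Static and delta features stacked frame by frame.
stacked = [f + d for f, d in zip(ramp, ramp_delta)]

if __name__ == "__main__":
    print(ramp_delta)
```

Away from the edges, the delta of the ramp equals its slope, and the delta of a constant sequence is zero.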

How to prepare the dataset

Hi,

Thanks for the great implementation! I am wondering how I should download the VoxCeleb dataset. What is the expected directory structure? Also, could you give us an example of a test_pairs file?

Best,
Xin

The pretrained model

Hello! Thanks for your code. Could you provide the pretrained model so that I can evaluate it on my own dataset?

Minor issue with avg pooling

I could be wrong, since I normally don't do speech verification.
But in my case, running the training gives the following error when the affine transform is applied to match the embedding dimension after the average pooling:
"x = self.model.fc(x)" gives
RuntimeError: size mismatch, m1: [512 x 1024], m2: [2048 x 512] at /opt/conda/conda-bld/pytorch_1550813258230/work/aten/src/THC/generic/THCTensorMathBlas.cu:266

I think this is because avgpool is supposed to operate on the temporal dimension by design, whereas in the committed version of the code the average pooling is done over the frequency dimension.
avg pool2d is supposed to give [F, T] = [4, 2] => [4, 1], but instead it gives [4, 2] => [1, 2].
Thus the dimension after torch.view is half of what the model.fc layer expects.

So I suggest

for myResNet.init()

Again, I'm no expert in speech verification.
If anybody has another idea on how to fix this bug, please let me know.
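The shape argument in this issue can be checked with plain arithmetic. The sketch below computes the flattened feature size for both pooling directions, using the standard output-size rule for average pooling with stride equal to the kernel size and no padding. The channel count of 512 is an assumption inferred from the error message (2048 / 4), and the function names are hypothetical, not taken from the repository.

```python
def pooled_size(size, kernel):
    """Output length of average pooling with stride == kernel, no padding."""
    return (size - kernel) // kernel + 1


def flattened_features(channels, freq, time, kernel_f, kernel_t):
    """Number of features after pooling over [F, T] and flattening."""
    return channels * pooled_size(freq, kernel_f) * pooled_size(time, kernel_t)


CHANNELS = 512      # assumption: inferred from the error message, 2048 / 4
FREQ, TIME = 4, 2   # [F, T] before pooling, as given in the issue

# Pooling over time: [4, 2] -> [4, 1]
over_time = flattened_features(CHANNELS, FREQ, TIME, kernel_f=1, kernel_t=2)

# Pooling over frequency: [4, 2] -> [1, 2]
over_freq = flattened_features(CHANNELS, FREQ, TIME, kernel_f=4, kernel_t=1)

if __name__ == "__main__":
    print(over_time, over_freq)
```

Under these assumptions, pooling over time yields 2048 features, which matches the first dimension of m2 in the error, while pooling over frequency yields 1024, which matches the second dimension of m1.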

dataset divide

Hello, I can download the dataset from this URL: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/download.sh with this command: ./download.sh list.txt, but all the downloaded audio is in one file. How can I separate it by speaker? I hope this is understandable; my English is awful. Many thanks.

What CUDA version is required to run this?

Can I use this with CUDA 9.0?

Numbers of frames

Hi! Why do you use such a low number of frames by default (32, as far as I can see)? The VoxCeleb dataset was not preprocessed to drop silent segments, so many parts of the training data contain only silence. Accuracy increases when I use a larger number of frames (of course, this is not only because of silence segments). Have you run any experiments with the number of frames?
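To put the frame count in perspective, the sketch below converts a number of frames into the duration of audio it covers. The 10 ms hop and 25 ms window are common defaults used here purely as assumptions; the repository's actual feature settings may differ, and the function name is hypothetical.

```python
def frames_to_seconds(num_frames, hop_ms=10.0, win_ms=25.0):
    """Duration of audio spanned by num_frames analysis frames.

    hop_ms and win_ms are assumed values, not taken from the repository.
    """
    return ((num_frames - 1) * hop_ms + win_ms) / 1000.0


if __name__ == "__main__":
    for n in (32, 64, 128, 256):
        print(n, frames_to_seconds(n))
```

Under these assumptions, 32 frames span roughly a third of a second, so a crop that short can easily fall entirely inside a pause.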