a-nagrani / vggvox

VGGVox models for Speaker Identification and Verification trained on the VoxCeleb (1 & 2) datasets

MATLAB 100.00%

vggvox's Issues

Model architecture seems different from original paper

Hi, I have tried to export both the VGG-M and the ResNet-50 verification models to Keras. In the first case everything worked well, and the architecture proposed in the paper matched the architecture I obtained from the MATLAB file containing the model. In the second case, however, I found the following discrepancies:

The embedding dimension proposed in the paper is 512. In the MATLAB model the dimension is 128. Why is this the case?
In the VoxCeleb2 paper (and in the original ResNet paper: https://arxiv.org/pdf/1512.03385.pdf) an activation is applied after the nonlinear stack's output is added to the shortcut connection. In the MATLAB model, however, the ReLU activation is applied both before and after the shortcut addition. The original ResNet paper is explicit about applying the activation only after the shortcut addition, so I don't understand the reason for the extra activation.
Just before the last block (fc_1, pool_time, fc_2, following the VoxCeleb2 paper's notation), the MATLAB model adds a pooling layer (pool_final_b1 and pool_final_b2, one for each branch of the Siamese architecture). I couldn't find any mention of this layer in the original paper.
Except for the first convolutional layer (conv0_b1, conv0_b2 in the MATLAB model's notation) and the feed-forward layers (fc65_b1, fc65_b2, fc8_s1, fc8_s2 in the same notation), no intermediate conv layer has bias parameters. Is there any reason for this?
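
To make the second discrepancy concrete, here is a minimal sketch of the two activation orderings on plain Python lists. The function names and `toy_stack` (a stand-in for the conv/batch-norm stack) are hypothetical and are not taken from the paper or the MATLAB model.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def add(a, b):
    """Element-wise sum of two equally sized lists."""
    return [x + y for x, y in zip(a, b)]


def block_relu_after_addition(x, stack):
    """Ordering in the ResNet paper: ReLU once, after the shortcut addition."""
    return relu(add(stack(x), x))


def block_relu_before_and_after(x, stack):
    """Ordering reported for the MATLAB model: ReLU on the stack output,
    then ReLU again after the shortcut addition."""
    return relu(add(relu(stack(x)), x))


def toy_stack(x):
    # Hypothetical stand-in for the nonlinear stack: scale every value by -0.5.
    return [-0.5 * v for v in x]


x = [2.0, -4.0]
print(block_relu_after_addition(x, toy_stack))
print(block_relu_before_and_after(x, toy_stack))
```

The two orderings give different outputs whenever the stack produces negative values, so weights trained under one ordering should not be expected to reproduce the reported numbers under the other.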

Undefined function or variable 'dagnn.ContrastiveLoss'.

demo_vggvox_verif
Downloading the VGGVox model for Verification ... this may take a while
Warning: Directory already exists.

In demo_vggvox_verif (line 27)

Undefined function or variable 'dagnn.ContrastiveLoss'.

Error in dagnn.DagNN.loadobj (line 26)
block = constr() ;
Error in demo_vggvox_verif (line 33)
load(opts.modelPath); net = dagnn.DagNN.loadobj(netStruct);

Dataset license

Hello

I am curious as to which license allows you to use YouTube videos to create a dataset. Motivated by your work, I am also interested in using YouTube data for research purposes.

Can you please point me to the relevant tutorial or laws so that I can read more on this subject?

problems in configuring the environment

Hello,
thanks for sharing the trained model!

I am a beginner. I have some problems in configuring the environment and executing the test code. Some of them I tried to solve but failed. Could you please give me your own running environment? For example, the version of Linux system, the version of MATLAB, the version of matconvnet, the version of CUDA, and the version of cudnn?

Looking forward to your reply, thank you!

number of filters in conv3 layer?

I was looking at reimplementing the model in the VoxCeleb paper, and then cross checking with the setup in this repo. In the paper, conv3 has 256 filters, whereas in http://www.robots.ox.ac.uk/~vgg/data/voxceleb/models/vggvox_ident_net.mat it appears to have 384. Did you find that bigger was better for this layer+dataset?

Thanks - big fan of the VoxCeleb paper, great work :-)

about the pool_time layer

I am confused by the pool_time layer. As the paper says, 'modify the layers to adapt to the spectrogram'.
With an input of size 512*300 for a 3 s segment, the ResNet-50 outputs 9*8*2048, which is followed by a 9*1*2048 fully connected layer. How does the 1*N average pooling layer work? The 9*1*2048 fc layer has nothing to do with N, so it could be followed directly by the fc2 (5994) layer to produce the output...
Please help.
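
A minimal shape-arithmetic sketch may help here. It assumes that the fully connected layer is applied as a 9x1 convolution with stride 1 and no padding, and that pool_time averages over the whole remaining time axis (so N equals that length). Both are assumptions about the setup described above, and the time lengths 16 and 30 are made-up examples.

```python
def fc1_output_shape(freq, time, kernel=(9, 1)):
    """Output size of a stride-1 convolution without padding."""
    return (freq - kernel[0] + 1, time - kernel[1] + 1)


def pool_time_output_shape(freq, time):
    """Average pooling with a 1 x N window, where N is the full time length."""
    n = time
    return (freq, time - n + 1)


# 9 x 8 is the feature map quoted for a 3 s segment; 16 and 30 are
# hypothetical longer inputs.
for time_steps in (8, 16, 30):
    f, t = fc1_output_shape(9, time_steps)
    print((9, time_steps), "->", (f, t), "->", pool_time_output_shape(f, t))
```

Under these assumptions the 9x1 layer only collapses the frequency axis, so its weights are independent of N, while the pooling collapses whatever time length remains. That would let utterances of any duration map to a single vector before fc2.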

Removal of DC component required?

The paper doesn't mention removing the DC component.
However, the code seems to remove it during preprocessing.
Is removing the DC component necessary?
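
For readers unfamiliar with the term, a minimal sketch of DC removal by mean subtraction follows. This only illustrates the idea; the repository's preprocessing may use a different method (for example a high-pass filter), and `remove_dc` is a hypothetical name.

```python
def remove_dc(samples):
    """Subtract the mean so the signal has no constant (0 Hz) offset."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


print(remove_dc([1.0, 2.0, 3.0]))
```

After mean subtraction the samples sum to zero, which means the zero-frequency bin of the spectrum is zero as well.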

Equal Error Rate

Hi,

I want to compute the Equal Error Rate (EER) as an error metric, but I couldn't find it in your code. Could you tell me how to compute it?
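
A minimal sketch of one common way to compute the EER from a list of trial scores and labels, assuming that a higher score means "more likely the same speaker". This is not taken from the repository; if the model outputs distances, negate them first. The function name and example scores are made up.

```python
def equal_error_rate(scores, labels):
    """Return the error rate at the threshold where the false-accept rate
    and the false-reject rate are closest.

    scores: higher means more likely the same speaker.
    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both positive and negative trials")

    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is >= threshold.
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / negatives
        frr = false_rejects / positives
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # 0.25
```

The loop is quadratic in the number of trials, which is fine for a sketch. For large trial lists, sorting the scores once and sweeping the threshold would be faster.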

Can anyone help me create a trial list with equal numbers of positive and negative pairs?

1 id10001/Y8hIVOBuels/00001.wav id10001/utrA-v8pPm4/00001.wav
0 id10001/Y8hIVOBuels/00001.wav id10341/rX4LkvzySSM/00014.wav
1 id10001/Y8hIVOBuels/00001.wav id10001/zELwAz2W6hM/00010.wav
0 id10001/Y8hIVOBuels/00001.wav id10341/5DAommAsxmE/00007.wav
1 id10001/Y8hIVOBuels/00002.wav id10001/zELwAz2W6hM/00005.wav
0 id10001/Y8hIVOBuels/00002.wav id10293/X7uOKQUYTCM/00001.wav
1 id10001/Y8hIVOBuels/00002.wav id10001/7gWzIy6yIIk/00001.wav
0 id10001/Y8hIVOBuels/00002.wav id10016/o524HaR7jfE/00007.wav
1 id10001/Y8hIVOBuels/00003.wav id10001/J9lHsKG98U8/00024.wav
0 id10001/Y8hIVOBuels/00003.wav id10246/ojc6G1jqXOU/00001.wav
1 id10001/Y8hIVOBuels/00003.wav id10001/9mQ11vBs1wc/00003.wav
0 id10001/Y8hIVOBuels/00003.wav id10425/kV-qT4iLGTs/00002.wav
1 id10001/Y8hIVOBuels/00004.wav id10001/J9lHsKG98U8/00023.wav
0 id10001/Y8hIVOBuels/00004.wav id10166/PPZBsH24NyE/00002.wav
1 id10001/Y8hIVOBuels/00004.wav id10001/zELwAz2W6hM/00015.wav
0 id10001/Y8hIVOBuels/00004.wav id10425/x2ZdgyFnZwc/00002.wav
1 id10001/Y8hIVOBuels/00005.wav id10001/J9lHsKG98U8/00004.wav
0 id10001/Y8hIVOBuels/00005.wav id10104/i6tWIMbpZFs/00004.wav
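
A minimal sketch of one way to generate such a list, assuming the utterance paths are already grouped by speaker and are unique within each speaker. This is not the authors' procedure; `make_balanced_trials` and its parameters are hypothetical.

```python
import random


def make_balanced_trials(utterances_by_speaker, pairs_per_utterance=1, seed=0):
    """For every utterance, emit the same number of positive (same speaker)
    and negative (different speaker) trials as (label, first, second)."""
    rng = random.Random(seed)
    speakers = sorted(utterances_by_speaker)
    trials = []
    for speaker in speakers:
        own = utterances_by_speaker[speaker]
        others = [
            s for s in speakers if s != speaker and utterances_by_speaker[s]
        ]
        if len(own) < 2 or not others:
            continue  # cannot form both kinds of pair for this speaker
        for utt in own:
            for _ in range(pairs_per_utterance):
                positive = rng.choice([u for u in own if u != utt])
                negative = rng.choice(utterances_by_speaker[rng.choice(others)])
                trials.append((1, utt, positive))
                trials.append((0, utt, negative))
    return trials


example = {
    "id10001": ["id10001/Y8hIVOBuels/00001.wav", "id10001/utrA-v8pPm4/00001.wav"],
    "id10341": ["id10341/rX4LkvzySSM/00014.wav", "id10341/5DAommAsxmE/00007.wav"],
}
for label, first, second in make_balanced_trials(example):
    print(f"{label} {first} {second}")
```

Because every utterance contributes one positive and one negative trial per round, the list is balanced by construction. Fixing the seed keeps the list reproducible across runs.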

Train, validation and test sets for VoxCeleb1 and VoxCeleb2

Hi,
I am working on DNN-based speaker ID using VoxCeleb. How do you define the train, validation and test sets for both datasets? I can see the identification file list for VoxCeleb1, but it is not clear what to choose for training, validation or testing. Thanks for your support!

-Best,
Hari

Verification Siamese embedding

Hello,
thanks for sharing the trained model!

I wanted to use the verification model to extract a speaker embedding, as described in the paper. There it is explained that the embedding is trained as the output of a Siamese network at layer fc8, with a dimension of 256.
It seems that in the provided verification model the last layer has a dimension of 1024 instead of 256 (or the number of classes, 1251).

Is this the correct embedding, or am I extracting the wrong layer?

I want to compare the embeddings with a distance function, as proposed in the paper.

Regards

What is the final embedding dimension used for computing distance?

In the paper, the dimension of the final embedding is given as 512, and the model's final computed embedding has dimension 128, but the provided code uses a 2048-dimensional embedding for computing distance.

Missing training code

The repo does not contain the training code for the VGGVox architecture.

Proportion of positive and negative pair in verification task

Hi,

I am curious about the proportion in the verification dataset.

How many positive (same speaker) pairs are in the dataset compared to negative (different speaker) pairs?

Thanks,
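
Given a trial list file, the proportion can be counted directly. This is a minimal sketch that assumes each line has the form "label path path" with label 1 for same-speaker and 0 for different-speaker pairs; `count_trial_labels` is a hypothetical helper, not part of the repository.

```python
def count_trial_labels(lines):
    """Count positive (label 1) and negative (label 0) trials."""
    counts = {"positive": 0, "negative": 0}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        if fields[0] == "1":
            counts["positive"] += 1
        elif fields[0] == "0":
            counts["negative"] += 1
    return counts


sample = [
    "1 id10001/Y8hIVOBuels/00001.wav id10001/utrA-v8pPm4/00001.wav",
    "0 id10001/Y8hIVOBuels/00001.wav id10341/rX4LkvzySSM/00014.wav",
    "1 id10001/Y8hIVOBuels/00001.wav id10001/zELwAz2W6hM/00010.wav",
    "0 id10001/Y8hIVOBuels/00001.wav id10341/5DAommAsxmE/00007.wav",
]
print(count_trial_labels(sample))
```

To run it on a real list, pass an open file object instead of `sample`, since iterating over a file yields its lines.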

Loss and Acc on VGGVox (Thin-Resnet & NetVLAD)

Hi, correct me if I'm wrong, but you trained VGGVox on VoxCeleb2 as a classification task, right? If so, what loss and accuracy did you get on the training set and the validation set?
Thanks.

Trial list

Could you please explain how the trial lists are created? What is the strategy behind them? @a-nagrani

some problems in configuring the environment

Hello to the author of the code! I am a beginner. I have some problems configuring the environment and executing the test code. I tried to solve some of them but failed. Could you please share your own running environment, such as the Linux system version, MATLAB version, MatConvNet version, CUDA version and cuDNN version? Looking forward to your reply, thank you!

Get true label for the class

How do we get the true label, i.e. the celebrity name, after getting a class from this line of code in the demo_vggvox_identif.m file?

fprintf('Score:%d\nClass:%d\n',score, class);

This gives me the class value with the maximum probability. But how do I get the label from the class value? Thank you so much for all the help.
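
A minimal sketch of such a lookup, under two unconfirmed assumptions: that class indices are 1-based (as printed by MATLAB) and follow the sorted order of the speaker IDs used in training, and that a separate metadata table maps speaker IDs to names. The helper name and the placeholder name "Speaker A" are hypothetical; the real ordering should be checked against the training file list.

```python
def build_class_lookup(speaker_ids, names_by_id):
    """Map 1-based class indices to (speaker_id, name), assuming classes
    follow the sorted order of speaker IDs."""
    ordered = sorted(speaker_ids)
    return {
        index + 1: (sid, names_by_id.get(sid, "unknown"))
        for index, sid in enumerate(ordered)
    }


# Placeholder data: the IDs follow the dataset's naming pattern, the name is
# invented for illustration.
lookup = build_class_lookup(["id10002", "id10001"], {"id10001": "Speaker A"})
print(lookup[1])
print(lookup[2])
```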

voxceleb1 dataset

What is the meaning of the third to fifth rows?

404 Not Found: VoxCeleb2 model

Would you mind checking the following link? Thanks for open-sourcing the network!

Model trained for verification on VoxCeleb2 (this is a resnet based model)

http://www.robots.ox.ac.uk/~vgg/data/voxceleb2/ver_net.mat

Features computation

Hi,

From 'mfccspec.m', it seems that you compute the Fourier transform (with Matlab's fft function) of 25ms-long windows, and keep the entire spectrum (which is Hermitian symmetric) to compute the features that are then fed to the network.

Usually, we only keep the Fourier coefficients corresponding to the "positive frequencies", as other coefficients are redundant due to Hermitian symmetry.

Can you confirm that it is indeed the procedure you follow?

Thanks a lot in advance for your answer.

Best,

Simon
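
The redundancy mentioned in this question can be checked numerically. The sketch below uses a naive DFT from the Python standard library to show that, for a real-valued window, bin N-k is the complex conjugate of bin k, so the first N/2+1 bins carry all the information. It does not reproduce 'mfccspec.m'; the helper names and the sample window are made up.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def is_hermitian_symmetric(spectrum, tol=1e-9):
    """True when X[(N - k) mod N] equals conj(X[k]) for every bin k."""
    n = len(spectrum)
    return all(
        abs(spectrum[k] - spectrum[(n - k) % n].conjugate()) < tol
        for k in range(n)
    )


def positive_half(spectrum):
    """Keep bins 0 .. N/2, which suffice for a real-valued input."""
    return spectrum[: len(spectrum) // 2 + 1]


window = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
spectrum = dft(window)
print(is_hermitian_symmetric(spectrum), len(spectrum), len(positive_half(spectrum)))
```

Keeping the full spectrum therefore adds no information, but it does change the input height the network sees, so a model trained on the full spectrum would still need the full spectrum at test time.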