Question About Batch Size, number of Epochs and Learning Rate about vqmivc HOT 6 CLOSED

wendison commented on June 11, 2024

Question About Batch Size, number of Epochs and Learning Rate

from vqmivc.

Comments (6)

Wendison commented on June 11, 2024

I didn't test how many utterances are required per speaker. I just used all the data from train-clean-360h & train-clean-100h of LibriTTS for training, and test-clean for testing, so these were only preliminary experiments. But your idea sounds reasonable: data balance across speakers should be considered, and it may lead to better results.

Wendison commented on June 11, 2024

Hi, during my experiments I tried different settings for batch size, training epochs and learning rate.

For batch size, I remember that the initial value was 64, but I later set it to 256 to speed up training and found that it did not hurt performance, so I fixed it at 256.

For training epochs, I chose the best epoch based on two criteria: (1) on the validation data, the reconstruction loss no longer decreases and the CPC prediction accuracy no longer increases; (2) I listened to the intermediate converted results for validation speakers (considering several conversion pairs) and chose the epoch with the best converted quality.
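
Criterion (1) can be approximated with a simple plateau check. The sketch below is only an illustration, not code from the VQMIVC repository: the function name, the `patience` and `tol` parameters, and the metric values in the usage note are all hypothetical. It covers only the numeric criterion; the listening test in (2) still has to be done by ear.

```python
def first_plateau_epoch(recon_loss, cpc_acc, patience=3, tol=1e-4):
    """Return the last epoch at which either validation metric improved,
    once `patience` epochs have passed with no improvement in both.
    Returns None if no such plateau is found in the given history."""
    best_loss = float("inf")
    best_acc = float("-inf")
    best_epoch = 0
    for epoch, (loss, acc) in enumerate(zip(recon_loss, cpc_acc)):
        improved = False
        if loss < best_loss - tol:   # reconstruction loss still decreasing
            best_loss = loss
            improved = True
        if acc > best_acc + tol:     # CPC prediction accuracy still increasing
            best_acc = acc
            improved = True
        if improved:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None
```

For example, `first_plateau_epoch([1.0, 0.8, 0.7, 0.7, 0.7, 0.7], [0.5, 0.6, 0.65, 0.65, 0.65, 0.65])` flags the epoch at index 2 as a candidate; checkpoints around that epoch are the ones worth listening to.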

For the learning rate, it was set empirically: I found that the validation reconstruction loss began to increase after epoch 300, so I decreased the learning rate to avoid overfitting.
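
A step decay of this kind can be sketched in a few lines. Only the milestone at epoch 300 comes from the comment above; the base learning rate and the decay factor `gamma` are assumptions chosen for illustration, not the values used in the repository.

```python
def stepped_lr(epoch, base_lr=1e-3, milestones=(300,), gamma=0.5):
    """Multiply the learning rate by `gamma` once for each milestone reached.
    base_lr and gamma are hypothetical; only epoch 300 is from the thread."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rate just before and just after the milestone.
for epoch in (299, 300, 400):
    print(epoch, stepped_lr(epoch))
```

In a PyTorch training loop, `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[300], gamma=0.5)` applies the same rule to an optimizer.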

As for your point quoted below, I'm not sure whether it's true, as I didn't notice much difference between smaller and larger batch sizes:
"For complex data like this there should be an improvement on bigger batches"

In my experience, listening to the intermediate converted results is very important for determining whether training is successful, so I suggest you do so. Besides, you may tune the value of mi_weight in

VQMIVC/config/train.yaml

Line 8 in 72c650c

mi_weight: 0.01

as it influences disentanglement performance: a larger value leads to fewer dependencies between the different speech representations, but may also degrade the quality of the converted voice.
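
To make the trade-off concrete, here is a minimal sketch that assumes the mutual-information term enters the training objective as a weighted sum with the other losses. The function name, the `other_losses` placeholder, and the numeric values are hypothetical and are not taken from the repository.

```python
def total_loss(recon_loss, mi_loss, mi_weight=0.01, other_losses=0.0):
    """Combine loss terms; only the MI term is scaled by mi_weight.
    Assumes a plain weighted sum, which is an illustration only."""
    return recon_loss + other_losses + mi_weight * mi_loss


# Hypothetical per-batch values, only to show how the MI share grows with the weight.
recon, mi = 0.50, 2.00
for w in (0.001, 0.01, 0.1):
    total = total_loss(recon, mi, mi_weight=w)
    print(f"mi_weight={w}: total={total:.3f}, MI share={w * mi / total:.1%}")
```

With a larger weight the optimizer spends more of its effort reducing the MI term, which matches the observation above that disentanglement improves while converted quality can suffer.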

Hope this helps :)

jlmarrugom commented on June 11, 2024

Thank you, I've created a notebook to check audio conversions of the saved checkpoints.

From my experiments, I think this model will perform better on large datasets like LibriTTS, since it generalizes well. It would be important to add male speakers, since VCTK is gender-imbalanced towards female speakers; some conversions tend to mimic a female voice.

Wendison commented on June 11, 2024

That's a great finding! I also conducted some experiments on LibriTTS and found that the content encoder can extract more accurate and robust content representations for out-of-domain speakers, so the source content is well preserved. VCTK has a relatively limited vocabulary, while LibriTTS has a more diverse one, which improves the generalization ability of the content encoder.

jlmarrugom commented on June 11, 2024

Excellent! Do you know how many utterances are needed for each speaker? I tested the model with 60 utt/spkr and 120 utt/spkr, and obtained decent results with 60 and comparable results with 120. Maybe, to speed up training on LibriTTS, one could select 100 utterances per speaker and still obtain a good result. What do you think?
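
The subsampling idea could look roughly like the sketch below. It assumes the dataset is available as `(speaker_id, path)` pairs; the function name, the input format, and the seed are hypothetical, and the cap of 100 is the value proposed above rather than a tested recommendation.

```python
import random
from collections import defaultdict


def subsample_per_speaker(utterances, max_per_speaker=100, seed=0):
    """Keep at most `max_per_speaker` randomly chosen utterances per speaker.
    `utterances` is an iterable of (speaker_id, path) pairs (assumed format)."""
    by_speaker = defaultdict(list)
    for speaker, path in utterances:
        by_speaker[speaker].append(path)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for speaker in sorted(by_speaker):
        paths = sorted(by_speaker[speaker])
        if len(paths) > max_per_speaker:
            paths = rng.sample(paths, max_per_speaker)
        subset.extend((speaker, p) for p in paths)
    return subset
```

Speakers with fewer utterances than the cap are kept in full, so this limits the largest speakers without fully balancing the dataset.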

jlmarrugom commented on June 11, 2024

Thank You!
