Comments (6)

Wendison commented on June 11, 2024

I didn't test how many utterances are required per speaker. I just used all of the train-clean-360h and train-clean-100h data from LibriTTS for training and test-clean for testing, so these were only preliminary experiments. But your idea sounds reasonable: balancing the data across speakers should be considered and may lead to better results.

from vqmivc.

Wendison commented on June 11, 2024

Hi, during my experiments I tried different settings for batch size, training epochs, and learning rate.

For batch size, I remember that the initial value was 64, but I later set it to 256 to speed up training and found that it did not hurt performance, so I fixed it at 256.

For training epochs, I chose the best epoch based on two criteria: (1) on the validation data, the reconstruction loss no longer decreases and the CPC prediction accuracy no longer increases; (2) listening to the intermediate converted results for validation speakers (considering several conversion pairs) and choosing the epoch with the best converted quality.
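As a rough illustration of criterion (1), the sketch below flags the first epoch at which neither validation metric has improved for a few consecutive evaluations. The record keys (`epoch`, `val_recon_loss`, `val_cpc_acc`), the function name, and the `patience` and `min_delta` values are assumptions made for this example and are not taken from the vqmivc code. Criterion (2), listening to the converted audio, still has to be done by hand.

```python
def first_plateau_epoch(history, patience=3, min_delta=1e-4):
    """Return the first epoch at which both validation metrics have stalled.

    history: list of dicts with the hypothetical keys 'epoch',
    'val_recon_loss' (lower is better) and 'val_cpc_acc' (higher is better).
    Returns None if no plateau of length `patience` is found.
    """
    best_loss = float("inf")
    best_acc = float("-inf")
    stale = 0  # consecutive evaluations without any improvement

    for record in history:
        improved = False
        if record["val_recon_loss"] < best_loss - min_delta:
            best_loss = record["val_recon_loss"]
            improved = True
        if record["val_cpc_acc"] > best_acc + min_delta:
            best_acc = record["val_cpc_acc"]
            improved = True

        stale = 0 if improved else stale + 1
        if stale >= patience:
            return record["epoch"]
    return None
```

The epoch returned is only a candidate: checkpoints around it would then be compared by listening to a few conversion pairs.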

The learning rate schedule was set empirically: I found that the validation reconstruction loss began to increase after epoch 300, so I decreased the learning rate to avoid overfitting.
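A minimal sketch of such a step decay is shown below. Only the epoch-300 boundary comes from the comment above; the base learning rate, the decay factor, and the function name are placeholder assumptions, not the values used in the actual experiments.

```python
def stepped_lr(epoch, base_lr=1e-3, decay_epoch=300, decay_factor=0.5):
    """Learning rate for a 1-indexed epoch under a single step decay.

    base_lr and decay_factor are hypothetical; decay_epoch=300 mirrors the
    point where the validation reconstruction loss reportedly began to rise.
    """
    if epoch <= decay_epoch:
        return base_lr
    return base_lr * decay_factor
```

In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` provides a comparable milestone-based decay for an optimizer.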

As for your point below, I'm not sure whether it's true, as I didn't notice much difference between smaller and larger batch sizes:
"For complex data like this there should be an improvement on bigger batches"

In my experience, listening to the intermediate converted results is very important for determining whether training is successful, so I suggest you do so. Besides, you may tune the value of mi_weight (set as mi_weight: 0.01), as it influences the disentanglement performance: a larger value leads to fewer dependencies between the different speech representations, but may also degrade the quality of the converted voice.
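To make the trade-off concrete, here is a simplified sketch of how a weight like mi_weight typically enters a training objective. It assumes the mutual-information term is added as a weighted penalty to the other losses; the function name and the exact set of terms are illustrative assumptions, and the real vqmivc objective may include further terms.

```python
def total_loss(recon_loss, cpc_loss, mi_loss, mi_weight=0.01):
    """Combine losses, scaling the mutual-information penalty by mi_weight.

    Hypothetical simplification: a larger mi_weight pushes the optimizer to
    reduce dependencies between representations, at the expense of the
    relative importance of the reconstruction term.
    """
    return recon_loss + cpc_loss + mi_weight * mi_loss
```

With mi_weight raised from 0.01 to 0.1, the same MI estimate contributes ten times more to the objective, which is why converted quality should be re-checked by listening after each change.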

Hope this helps :)


jlmarrugom commented on June 11, 2024

Thank you, I've created a notebook to check audio conversions of the saved checkpoints.

From my experiments, I think this model will perform better on large datasets like LibriTTS, since it generalizes well. It would be important to add male speakers, since VCTK is gender-imbalanced toward female speakers. Some conversions tend to mimic a female voice.


Wendison commented on June 11, 2024

That's a great finding! I also conducted some experiments on LibriTTS and found that the content encoder can extract more accurate and robust content representations for out-of-domain speakers, so the source content is well preserved. This is because VCTK has a relatively limited vocabulary while LibriTTS has a more diverse one, which improves the generalization ability of the content encoder.


jlmarrugom commented on June 11, 2024

Excellent! Do you know how many utterances are needed for each speaker? I tested the model with 60 utt/spkr and 120 utt/spkr and obtained decent results with 60 and comparable results with 120. Maybe, to speed up training on LibriTTS, one could select 100 utterances per speaker and still obtain a good result. What do you think?
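One way to try this is to cap the number of utterances per speaker before building the training list. The sketch below assumes the data is available as (speaker_id, path) pairs; the function name, the input layout, and the fixed seed are assumptions for illustration, and the cap of 100 is simply the value proposed above, not a tested recommendation.

```python
import random
from collections import defaultdict


def subsample_per_speaker(utterances, max_per_speaker=100, seed=0):
    """Keep at most max_per_speaker randomly chosen utterances per speaker.

    utterances: iterable of (speaker_id, path) pairs (hypothetical layout).
    Returns a dict mapping speaker_id to a sorted list of selected paths.
    """
    by_speaker = defaultdict(list)
    for speaker_id, path in utterances:
        by_speaker[speaker_id].append(path)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = {}
    for speaker_id, paths in sorted(by_speaker.items()):
        if len(paths) <= max_per_speaker:
            # Speakers with fewer utterances than the cap keep everything.
            selected[speaker_id] = sorted(paths)
        else:
            selected[speaker_id] = sorted(rng.sample(paths, max_per_speaker))
    return selected
```

Because speakers below the cap keep all of their utterances, this only limits the largest speakers; it does not by itself correct a gender imbalance in the speaker set.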


jlmarrugom commented on June 11, 2024

Thank You!

