Comments (10)

michael-conrad avatar michael-conrad commented on May 28, 2024

Is this the right approach?

https://pytorch.org/tutorials/recipes/recipes/warmstarting_model_using_parameters_from_a_different_model.html

from multilingual_text_to_speech.

michael-conrad avatar michael-conrad commented on May 28, 2024

I noticed that the code is set to overwrite a checkpoint's params when an explicit param file is given.

So I'm trying:

python train-ga.py --checkpoint generated_switching --hyper_parameters generated_switching_cherokee6 --accumulation_size 5

after making sure the alphabets and languages from the checkpointed version are appended to the versions in the new params file.

Tomiinek avatar Tomiinek commented on May 28, 2024

Ah, I am sorry for the late response, I forgot ...

Please include instructions on how to resume training starting with your 70k iteration weights.
Is this the right approach?
https://pytorch.org/tutorials/recipes/recipes/warmstarting_model_using_parameters_from_a_different_model.html

These are just weights and not checkpoints (so it is missing optimizer-related things and so on), but you can use them for initialization. Look at these lines. The last four lines are not relevant in this case, so you can remove them.
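The warm-start the tutorial above describes can be sketched roughly like this; `TinyModel` and the file name are hypothetical stand-ins for the repo's actual model class and the published 70k-iteration weight file:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the repo's model class; the real code builds
# the model from hyper-parameters before calling load_state_dict.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 8)
        self.decoder = nn.Linear(8, 4)

pretrained = TinyModel()
torch.save(pretrained.state_dict(), "weights.pt")  # stands in for the published weights

# Warm start: restore only the weights. A full checkpoint would also carry
# optimizer state; plain weights do not, so a fresh optimizer is created.
model = TinyModel()
model.load_state_dict(torch.load("weights.pt", map_location="cpu"))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Training then proceeds as for a new model, just with initialized weights instead of random ones.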

Would it be possible to add additional languages as part of a fine tuning process?

I originally wanted to include the "fine-tuning" feature, but the code became very complicated and I actually did not need it for my experiments. I removed all the code related to fine-tuning in this commit 6c603ef. Check out the train.py file.

The typical use case is probably that you fine-tune the multilingual model to a single new language or speaker. Things are complicated because you have to make sure that the alphabet, speakers, etc. match and decide what to do if they do not (which approach to initialization to use, etc.). In the case of the generated model, you also (IMHO) want to freeze all the encoder parameters and fine-tune just the language and speaker embeddings, and maybe also the decoder, but in the case of other models supported by the code, you would want to freeze or train different parts ...

michael-conrad avatar michael-conrad commented on May 28, 2024

These are just weights and not checkpoints (so it is missing optimizer-related things and so on), but you can use them for initialization. Look at these lines. The last four lines are not relevant in this case, so you can remove them.

So, I can add a CLI option like "--with__weights" or similar, load the weights, but otherwise do everything as for a new model?

If yes, would there be any advantage in starting with the previous parameters and then adding the additional language, so that everything stays in the same order of embeddings?
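For what it's worth, keeping the pre-trained rows in their original order while appending a new language could look roughly like this (the embedding sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Sketch: grow a language embedding by one new language while keeping
# the pre-trained rows at their original indices (hypothetical sizes).
old_emb = nn.Embedding(10, 32)            # 10 pre-trained languages
new_emb = nn.Embedding(11, 32)            # 10 old + 1 new

with torch.no_grad():
    new_emb.weight[:10] = old_emb.weight  # copy old rows; row 10 keeps random init
```

This way any language id that was valid for the old model still selects the same embedding in the new one.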

stale avatar stale commented on May 28, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

padmanabhankrishnamurthy avatar padmanabhankrishnamurthy commented on May 28, 2024

Hi,

Just wanted to know if there has been any movement on this, and if there's a clearer path to fine-tuning the model with new languages/speakers now?

For example, if I wanted to add support for English without having to re-train, what parameters would I have to freeze / train to enable this?

Thanks!

Tomiinek avatar Tomiinek commented on May 28, 2024

Hello, I am sorry guys, no movement. The training script is also not very fine-tuning friendly 😔

padmanabhankrishnamurthy avatar padmanabhankrishnamurthy commented on May 28, 2024

Thanks for the reply!

I've been trying to adapt the current code for fine-tuning on the LJSpeech dataset, i.e., adding support for English and for the LJSpeech speaker.

My approach currently involves freezing all parameters of the character encoder using param.requires_grad = False and training only the language encoder and the speaker encoder. Since there is only one speaker in the LJSpeech dataset, I have even set multi_speaker to False to turn off the adversarial speaker classifier. My model has been training for around two days (150 epochs on only the LJSpeech dataset), and while speech is starting to be generated in the LJSpeech speaker's voice, the model appears to have lost all information about the other speakers. Consequently, feeding in any speaker id produces speech only in the LJSpeech speaker's voice.
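The freezing step described above can be sketched as follows; the module layout and names are hypothetical, not the repo's actual ones:

```python
import torch
import torch.nn as nn

# Hypothetical module layout standing in for the real model.
model = nn.ModuleDict({
    "encoder": nn.Linear(8, 8),                 # character encoder (to freeze)
    "language_embedding": nn.Embedding(5, 8),   # to train
    "speaker_embedding": nn.Embedding(3, 8),    # to train
})

# Freeze the character encoder; leave everything else trainable.
for name, param in model.named_parameters():
    param.requires_grad = not name.startswith("encoder")

# Hand the optimizer only the trainable parameters, so the frozen
# encoder accumulates no gradients and no optimizer state.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```

Filtering the parameter list passed to the optimizer (rather than relying on requires_grad alone) also keeps the optimizer from allocating state for the frozen weights.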

Does this approach seem right to you?

Tomiinek avatar Tomiinek commented on May 28, 2024

Oh, interesting!

Just to clarify ... are you using GeneratedConvolutionalEncoder as the encoder? If so, how did you add English? Did you make the inner embedding bigger and trainable while fixing the rest of the encoder parameters?
Also, how do you load the pre-trained model and treat the speaker embeddings? Because if you set multi_speaker=False, the checkpoint has some extra parameters (and maybe the decoder expects larger inputs?).

Fixing the decoder seems OK, but you cannot expect that the resulting voice will exactly match Linda's. Maybe you can try fine-tuning it too, but with a lower learning rate.
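One way to deal with a checkpoint that carries extra parameters (such as the adversarial speaker classifier) is PyTorch's strict=False loading, which reports mismatched keys instead of raising. The module names below are made up for illustration:

```python
import torch
import torch.nn as nn

# A checkpoint saved with multi_speaker=True has parameters (here a fake
# "speaker_classifier") that a multi_speaker=False model lacks.
full = nn.ModuleDict({
    "decoder": nn.Linear(8, 8),
    "speaker_classifier": nn.Linear(8, 3),
})
slim = nn.ModuleDict({"decoder": nn.Linear(8, 8)})

# strict=False loads the matching keys and returns the rest for inspection.
result = slim.load_state_dict(full.state_dict(), strict=False)
# result.unexpected_keys now lists the ignored classifier parameters.
```

This only solves key mismatches; if the decoder's input size differs between the two configurations, the shapes themselves will still conflict and need explicit handling.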

padmanabhankrishnamurthy avatar padmanabhankrishnamurthy commented on May 28, 2024

Hi,

So unfortunately, our fine-tuning experiments didn't work out.
But we're trying another line of experiments in which we're attempting to get a single English speaker to speak in another language (say, German). In this case, since the use case involves only one English speaker, is it sufficient to train the model on English recordings of only the target speaker and German recordings of multiple other speakers? That is, am I right in concluding that recordings of multiple English speakers are unnecessary, since we wish to synthesise German speech in only one particular English voice?

Thanks!
