Comments (8)

erew123 commented on July 24, 2024

Do a git pull... it should be there!

It will also confirm at the prompt when you train:

[screenshot of the confirmation shown at the training prompt]

from alltalk_tts.

erew123 commented on July 24, 2024

Ah, sorry, yes I did misunderstand your question. So you're asking about the ratio of the % split between evaluation and training data. It's currently set at 15% for evaluation, with the remaining 85% used as training data. I can add a setting to the interface to let you adjust that, if I'm now understanding your question correctly?
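For anyone curious, the split itself is just a ratio applied when the list of clips is divided; a minimal sketch of the idea in Python (the function name and the fixed 15% default are illustrative assumptions, not the project's actual code):

```python
import random

def split_clips(clips, eval_ratio=0.15, seed=42):
    """Divide Whisper-generated clips into training and evaluation sets.

    eval_ratio=0.15 mirrors the fixed 15% evaluation share described above;
    exposing it as a parameter is what the proposed interface setting would do.
    """
    shuffled = clips[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)
```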

from alltalk_tts.

erew123 commented on July 24, 2024

You want me to introduce...

[screenshot of the proposed evaluation/training split setting]

correct?

from alltalk_tts.

erew123 commented on July 24, 2024

All the samples generated in Step 1 (Whisper splitting the original sample) are passed into training and used in Step 2 (the actual training). It's just that for the voice generation at the end (Step 3), you need something longer than 6 seconds to properly generate TTS (it wants a 6+ second sample). So what's actually occurring at Step 3 is that all voice samples shorter than 7 seconds (just to be sure) are not displayed, or copied over alongside the model (Step 4/what to do next), as those shorter clips would be useless to put in your "voices" folder. I hope that makes sense; even I had to read it twice and I wrote it.
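As a rough illustration of that Step 3 filtering (a sketch only; the directory layout and helper name are assumptions, not alltalk_tts internals), clips under 7 seconds are simply skipped when collecting candidate voice samples:

```python
import wave
from pathlib import Path

MIN_SECONDS = 7.0  # TTS wants a 6+ second reference clip; 7 adds a safety margin

def usable_voice_samples(wav_dir):
    """Yield only the Whisper-split clips long enough to use as voice samples."""
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        if duration >= MIN_SECONDS:
            yield wav_path
```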

On the more general point of having more/longer voice samples at the end, a few people have told me (so this is anecdotal) that the Whisper 2 model splits sentences better, both in how it cuts the sample WAVs and in their overall length. I've not had enough time yet to fully test this, but I have made a note in the finetuning documentation: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-finetuning-a-model

As a side note, many people seem to think that the Whisper v2 model (used in Step 1) gives better results when generating training datasets, so you may prefer to try that rather than the Whisper v3 model. It is another roughly 3GB download, so see how it fares for you.
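If you want to compare the two outside the finetuning interface, switching models with the openai-whisper package is just a matter of the model name (a small sketch; the audio filename is a placeholder):

```python
import whisper

# "large-v2" instead of "large-v3" -- the v2 weights are a separate ~3GB download
model = whisper.load_model("large-v2")
result = model.transcribe("voice_sample.wav")
print(result["text"])
```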

from alltalk_tts.

erew123 commented on July 24, 2024

As we've passed another message or two, I'm assuming I'll be OK to close this. But feel free to reply if you want to know something else on this ticket's topic. Thanks

from alltalk_tts.

Urammar commented on July 24, 2024

But feel free to reply if you want to know something else on this ticket's topic. Thanks

You've misunderstood the ticket, sorry. I'm not talking about Whisper incorrectly cutting up voice samples; I'm talking about those samples simply not existing. A minor character in a single episode of a TV show, for instance, who never performed again.

The computer in Star Trek is another example. There exist only a few minutes of the computer's voice lines across the whole show, and only a fraction of those are clean, without sirens and whatnot going off in the background.

Now, the actual problem.


Whisper breaks down the voice samples, yes, but it populates those samples into two separate datasets.

The training dataset, and the evaluation dataset.

These two sets are intentionally kept separate, as otherwise the model can just train to the test, so to speak, overfitting until it can only reproduce those exact samples.

The training and eval datasets are saved in finetune->tmp-trn as CSV files, and are not cross-pollinated.

This behavior is absolutely correct for large voice sample sets, but for something like a video game character with only a few minutes of spoken lines at most, it can cause problems because there is insufficient training material. The ability to lower the share of clips dedicated to evaluation and raise the number used for actual training would be a welcome feature.

In addition, it would let you quickly and automatically add all possible voice clips to training as a final run, so overfitting is minimized but training data is maximized.
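To make that last idea concrete, here is a hedged sketch of a "final run" helper that folds the evaluation CSV back into the training CSV so every clip gets used (the file names and pipe-delimited row format are assumptions based on common Coqui-style metadata files, not alltalk_tts internals):

```python
import csv

def merge_eval_into_train(train_csv, eval_csv, out_csv):
    """Combine the train and eval metadata so every clip is available
    for one last training pass on a very small dataset."""
    rows = []
    for path in (train_csv, eval_csv):
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.reader(f, delimiter="|"))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(rows)
```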

from alltalk_tts.

Urammar commented on July 24, 2024

YES! Exactly that!

from alltalk_tts.

Urammar commented on July 24, 2024

Absolute legend.

from alltalk_tts.
