Comments (8)

erew123 commented on July 24, 2024

Do a git pull... it should be there!

It will also confirm at the prompt when you train:

[screenshot of the confirmation shown at the training prompt]

from alltalk_tts.

erew123 commented on July 24, 2024

Ah, sorry, yes I did misunderstand your question. So you're asking about the ratio of the % split between evaluation and training data. It's currently set at 15% for evaluation, with the remaining 85% used as training data. I can add a setting to the interface to let you adjust that, if I'm now understanding your question correctly?
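For anyone curious, the split itself is just a ratio applied when the list of clips is divided; a minimal sketch of the idea in Python (the function name and the fixed 15% default are illustrative assumptions, not the project's actual code):

```python
import random

def split_clips(clips, eval_ratio=0.15, seed=42):
    """Divide Whisper-generated clips into training and evaluation sets.

    eval_ratio=0.15 mirrors the fixed 15% evaluation share described above;
    exposing it as a parameter is what the proposed interface setting would do.
    """
    shuffled = clips[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)
```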

from alltalk_tts.

erew123 commented on July 24, 2024

You want me to introduce...

[screenshot of the proposed evaluation/training split setting]

correct?

from alltalk_tts.

erew123 commented on July 24, 2024

All the samples generated in Step 1 (Whisper splitting the original sample) are passed into training and used in Step 2 (the actual training). It's just that for the voice generation at the end (Step 3), you need something longer than 6 seconds to properly generate TTS (it wants a 6+ second sample). So what's actually occurring at Step 3 is that all voice samples shorter than 7 seconds (just to be sure) are not displayed, or copied over alongside the model (Step 4/what to do next), as those shorter clips would be useless to put in your "voices" folder. I hope that makes sense; even I had to read it twice and I wrote it.
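As a rough illustration of that Step 3 filtering (a sketch only; the directory layout and helper name are assumptions, not alltalk_tts internals), clips under 7 seconds are simply skipped when collecting candidate voice samples:

```python
import wave
from pathlib import Path

MIN_SECONDS = 7.0  # TTS wants a 6+ second reference clip; 7 adds a safety margin

def usable_voice_samples(wav_dir):
    """Yield only the Whisper-split clips long enough to use as voice samples."""
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            duration = wav.getnframes() / wav.getframerate()
        if duration >= MIN_SECONDS:
            yield wav_path
```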

On the more general point of having more/longer voice samples at the end, a few people have told me (so this is anecdotal) that the Whisper 2 model splits sentences better, both in how it cuts the sample WAVs and in their overall length. I've not had enough time yet to fully test this, but I have made a note in the finetuning documentation: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-finetuning-a-model

As a side note, many people seem to think that the Whisper v2 model (used in Step 1) gives better results when generating training datasets, so you may prefer to try that rather than the Whisper v3 model. It is another roughly 3GB download, so see how it fares for you.
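If you want to compare the two outside the finetuning interface, switching models with the openai-whisper package is just a matter of the model name (a small sketch; the audio filename is a placeholder):

```python
import whisper

# "large-v2" instead of "large-v3" -- the v2 weights are a separate ~3GB download
model = whisper.load_model("large-v2")
result = model.transcribe("voice_sample.wav")
print(result["text"])
```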

from alltalk_tts.

erew123 commented on July 24, 2024

As we've passed another message or two, I'm assuming I'll be OK to close this. But feel free to reply if you want to know something else on this ticket's topic. Thanks

from alltalk_tts.

Urammar commented on July 24, 2024

But feel free to reply if you want to know something else on this ticket's topic. Thanks

You've misunderstood the ticket, sorry. I'm not talking about Whisper incorrectly cutting up voice samples; I'm talking about those samples simply not existing. A minor character in a single episode of a TV show, for instance, who never performed again.

The computer in Star Trek is another example. There exist only a few minutes of the computer's voice lines across the whole show, and only a fraction of those are clean, without sirens and whatnot going off in the background.

Now, the actual problem.


Whisper breaks down the voice samples, yes, but it populates those samples into two separate datasets.

The training dataset, and the evaluation dataset.

These two sets are intentionally kept separate, as otherwise the model can just train to the test, so to speak, overfitting until it can only reproduce those exact samples.

The training and eval datasets are saved in finetune->tmp-trn as CSV files, and are not cross-pollinated.

This behavior is absolutely correct for large voice sample sets, but for something like a video game character with only a few minutes of spoken lines at most, it can cause problems because there is insufficient training material. The ability to lower the share of clips dedicated to evaluation and raise the number used for actual training would be a welcome feature.

In addition, it would let you quickly and automatically add all possible voice clips to training as a final run, so overfitting is minimized but training data is maximized.
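To make that last idea concrete, here is a hedged sketch of a "final run" helper that folds the evaluation CSV back into the training CSV so every clip gets used (the file names and pipe-delimited row format are assumptions based on common Coqui-style metadata files, not alltalk_tts internals):

```python
import csv

def merge_eval_into_train(train_csv, eval_csv, out_csv):
    """Combine the train and eval metadata so every clip is available
    for one last training pass on a very small dataset."""
    rows = []
    for path in (train_csv, eval_csv):
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.reader(f, delimiter="|"))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(rows)
```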

from alltalk_tts.

Urammar commented on July 24, 2024

YES! Exactly that!

from alltalk_tts.

Urammar commented on July 24, 2024

Absolute legend.

from alltalk_tts.
