Arbitrarily long text (bark, closed)

suno-ai commented on May 31, 2024
Arbitrarily long text

from bark.

Comments (4)

mcamac commented on May 31, 2024

The best way right now is probably to split the text into sentences or chunks and generate each one separately, passing in the same speaker prompt for consistency. We could also consider a better API that does this under the hood.
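A minimal sketch of that split-and-generate loop, for illustration only. The names split_sentences and generate_long are made up here, the regex splitter is deliberately naive, and generate is a placeholder for whatever function actually produces audio; none of this is bark's own API.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_long(
    text: str,
    generate: Callable[[str, str], List[float]],
    speaker: str,
) -> List[float]:
    """Generate each sentence separately with the same speaker and join the results.

    `generate` is a hypothetical callable taking (sentence, speaker) and
    returning a sequence of audio samples.
    """
    audio: List[float] = []
    for sentence in split_sentences(text):
        # Reusing the same speaker for every piece keeps the voice consistent.
        audio.extend(generate(sentence, speaker))
    return audio
```

With bark, generate could plausibly be a small wrapper around generate_audio(sentence, history_prompt=speaker), with the pieces joined by numpy.concatenate instead of a Python list; treat that wiring as an assumption to check against the bark README.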

JonathanFly commented on May 31, 2024

Is there a way to run on arbitrarily long text for example breaking up by max token (not splitting words)?

For now you can just eyeball what would be audio chunks less than 14 seconds long, and use the same history_prompt in all the generations.

I think it's a little trickier than just chunking on tokens, because the tokens that are input into generate_text_semantic seem to be only loosely correlated with audio length. They are just regular transformer encoded tokens. A typical couple of sentences is probably 30-something tokens, which is about 14 seconds of audio. But it could be more because the speaker is a slow talker, or less because it's a fast rap song. Or half the tokens are just describing the audio and aren't actually things to be said aloud, like WOMAN: or [whispering].
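To make the "eyeball it" step a bit more concrete, here is a rough sketch that estimates spoken duration from a word count after dropping things that aren't said aloud. Everything in it is an assumption: the 2.5 words-per-second rate is a guess rather than a measured bark figure, and the two regexes only catch bracketed cues like [whispering] and upper-case labels like WOMAN:.

```python
import re
from typing import List

# Assumed average speaking rate; real pace varies a lot by speaker and style.
WORDS_PER_SECOND = 2.5


def spoken_words(text: str) -> List[str]:
    """Return only the words likely to be spoken aloud."""
    # Drop bracketed cues such as [whispering] or [laughs].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Drop upper-case speaker labels such as WOMAN: or MAN:.
    text = re.sub(r"\b[A-Z]{2,}:", " ", text)
    return text.split()


def estimate_seconds(text: str, words_per_second: float = WORDS_PER_SECOND) -> float:
    """Very rough duration estimate; treat the result as a hint, not a measurement."""
    return len(spoken_words(text)) / words_per_second
```

A slow talker or a fast rap will still throw this off, so it is only a starting point for deciding where to cut.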

generate_text_semantic() in generation.py will happily accept even multiple paragraphs - up to 256 tokens. I think the model is like, "Okay, so what 14 seconds of audio best represents these two paragraphs of words based on all the 14 second transcripts I was given during training..." But it wasn't given much text like that. So it's like, "I guess I can say anything? Maybe yell out a few words from the text prompt in the middle? Sounds good to me, based on all the clips I saw in my training."

This is the best btw. Sometimes it sounds like you caught the actors that were getting to read your text prompt between takes, or doing vocal warmups with it.

SoSuno.mp4

I love this so much I'd be all over this model even if every prompt ended up like that.

10% of the time it abridges it perfectly, so there must be some segments like that in training.

If you change the 256 in this check to 64, long outputs are generally abridged properly. But that's not super useful because (unless I'm missing something?) you can't directly decode the semantic token outputs and easily check "Oh it stopped at this word, so I'll pick up there next time."

if len(encoded_text) > 256:
    p = round((len(encoded_text) - 256) / len(encoded_text) * 100, 1)
    logger.warning(f"warning, text too long, lopping of last {p}%")
    encoded_text = encoded_text[:256]

And even if you could check, output quality seems worse if you pack it like that. Probably best to keep it simple: feed bark enough text for 10 to 14 seconds of audio at a time, and use your judgement to estimate. Bark is pretty good about stopping early for text shorter than 14 seconds, so the final output is pretty seamless.
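One way to automate that judgement call is to pack whole sentences into chunks that stay under a word budget derived from a target duration. This is only a sketch under stated assumptions: pack_chunks is a made-up helper, the 13-second target and 2.5 words-per-second rate are guesses, and a single sentence longer than the budget is left as its own oversized chunk rather than split mid-sentence.

```python
import re
from typing import List


def pack_chunks(
    text: str,
    max_seconds: float = 13.0,
    words_per_second: float = 2.5,
) -> List[str]:
    """Greedily group sentences so each chunk stays under the estimated budget."""
    max_words = int(max_seconds * words_per_second)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk would then be generated on its own with the same history_prompt, as suggested above.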

notnil commented on May 31, 2024

Yeah, it would be great to point this at larger documents.

gkucsko commented on May 31, 2024

gonna close to consolidate conversations, see here: #79
