Arbitrarily long text about bark HOT 4 CLOSED

suno-ai commented on May 31, 2024

Arbitrarily long text


Comments (4)

mcamac commented on May 31, 2024 3

The best way right now is probably to split the text into sentences or chunks and generate each one separately, passing in the same speaker prompt for consistency. We could also consider a better API that does this under the hood.
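A minimal sketch of that approach, assuming a naive regex sentence splitter and a caller-supplied `generate_fn` (both `split_sentences` and `generate_long` are hypothetical helper names, not part of Bark):

```python
import re
from typing import Callable, List, TypeVar

Audio = TypeVar("Audio")


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def generate_long(
    text: str,
    generate_fn: Callable[[str, str], Audio],
    history_prompt: str,
) -> List[Audio]:
    """Generate one clip per sentence, passing the same speaker prompt each time."""
    return [generate_fn(sentence, history_prompt) for sentence in split_sentences(text)]
```

With Bark, `generate_fn` could presumably wrap `generate_audio(sentence, history_prompt=history_prompt)`, and the returned arrays could be joined with `numpy.concatenate`. The splitter is deliberately simple and will mis-split abbreviations such as "Dr. Smith".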


JonathanFly commented on May 31, 2024 2

Is there a way to run on arbitrarily long text, for example by breaking it up at a max token count (without splitting words)?

For now you can just eyeball what would be audio chunks less than 14 seconds long, and use the same history_prompt in all the generations.

I think it's a little trickier than just chunking on tokens, because the tokens that are input into generate_text_semantic seem to be only loosely correlated with audio length. They are just regular transformer-encoded tokens. A typical couple of sentences is probably 30-something tokens, which is about 14 seconds of audio. Or it's more, because the speaker is a slow talker. Or it's less, because it's a fast rap song. Or half the tokens are just describing the audio and aren't actually things to be said aloud, like WOMAN: or [whispering].
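One rough way to account for the non-spoken parts is to strip them before estimating length. This is only a sketch: the 2.5 words-per-second rate is an assumed average, the regexes are guesses at what cues and speaker labels look like, and `spoken_words` / `estimate_seconds` are hypothetical names.

```python
import re

# Assumed speaking rate; the real rate depends on the speaker and style.
WORDS_PER_SECOND = 2.5


def spoken_words(prompt: str) -> list:
    """Return the words likely to be said aloud, dropping cues and speaker labels."""
    no_cues = re.sub(r"\[[^\]]*\]", " ", prompt)        # e.g. [whispering]
    no_labels = re.sub(r"\b[A-Z]{2,}:", " ", no_cues)   # e.g. WOMAN:
    return no_labels.split()


def estimate_seconds(prompt: str, words_per_second: float = WORDS_PER_SECOND) -> float:
    """Very rough duration estimate based on spoken word count only."""
    return len(spoken_words(prompt)) / words_per_second
```

The estimate says nothing about slow talkers or fast rap, so it is a starting point for eyeballing rather than a replacement for it.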

generate_text_semantic() in generation.py will happily accept even multiple paragraphs - up to 256 tokens. I think the model is like, "Okay, so what 14 seconds of audio best represents these two paragraphs of words based on all the 14 second transcripts I was given during training..." But it wasn't given much text like that. So it's like, "I guess I can say anything? Maybe yell out a few words from the text prompt in the middle? Sounds good to me, based on all the clips I saw in my training."

This is the best btw. Sometimes it sounds like you caught the actors that were getting to read your text prompt between takes, or doing vocal warmups with it.

SoSuno.mp4

I love this so much I'd be all over this model even if every prompt ended up like that.

10% of the time it abridges it perfectly, so there must be some segments like that in training.

If you change the 256 in the check below to 64, long outputs generally come out properly abridged. That's not super useful, though, because (unless I'm missing something?) you can't directly decode the semantic token outputs and easily check "Oh, it stopped at this word, so I'll pick up there next time."

if len(encoded_text) > 256:
    p = round((len(encoded_text) - 256) / len(encoded_text) * 100, 1)
    logger.warning(f"warning, text too long, lopping of last {p}%")
    encoded_text = encoded_text[:256]

And even if you could check, output quality seems worse if you pack it like that. It's probably best to keep it simple: feed Bark text worth 10 to 14 seconds of audio at a time, and use your judgement to estimate. Bark is pretty good about stopping early for text shorter than 14 seconds, so the final output is pretty seamless.
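If eyeballing gets tedious, the same idea can be approximated by packing whole sentences into chunks under a word budget. This is a sketch, not something Bark provides: `pack_chunks` is a hypothetical helper, and the default of 30 words is an assumed stand-in for roughly 12 seconds of speech.

```python
import re
from typing import List


def pack_chunks(text: str, max_words: int = 30) -> List[str]:
    """Greedily group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then be generated separately with the same history_prompt. The budget should be lowered for slow speakers and can be raised for fast ones.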


notnil commented on May 31, 2024

Yeah, it would be great to point this at larger documents.


gkucsko commented on May 31, 2024

gonna close to consolidate conversations, see here: #79

