
Real Time Streaming (alltalk_tts, closed, 9 comments)

erew123 commented on July 24, 2024
Real Time Streaming


Comments (9)

Sascha353 commented on July 24, 2024

Do you mean audio streaming decoupled from the main text-generation-webui wav handling? A tts-engine-independent solution inside the webui would be best, but as a workaround it could be implemented in alltalk_tts or any tts extension, without returning any audio chunks to the webui and instead streaming them directly with a library like sounddevice. One disadvantage would be that the user has no control to pause, stop or resume playback.
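For illustration, a minimal sketch of that workaround (not AllTalk code): chunks are played with sounddevice as soon as they arrive, bypassing the webui's wav handling entirely. The 24 kHz sample rate and the chunk generator are assumptions.

```python
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # assumption; depends on the TTS engine in use

def play_chunks(chunk_iter, sample_rate=SAMPLE_RATE):
    # chunk_iter is assumed to yield 1-D float32 sample arrays from the engine
    with sd.OutputStream(samplerate=sample_rate, channels=1, dtype="float32") as stream:
        for chunk in chunk_iter:
            # blocks until the chunk is queued for playback, so audio starts
            # while later chunks are still being synthesised
            stream.write(np.asarray(chunk, dtype=np.float32))
```

As noted above, playing directly like this gives the user no pause/stop/resume control over the stream.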


mercuryyy commented on July 24, 2024

I can implement it later into text-generation-webui. The main thing I am trying to achieve is being able to generate an instantly playable .wav that can be streamed in chunks, so we can achieve real-time TTS.

The main thing is streaming the raw audio to stdout as it's produced.
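As an illustration of that idea, a minimal sketch (the chunk generator and sample format are assumptions, not existing AllTalk code):

```python
import sys

def stream_pcm_to_stdout(chunk_iter):
    """Write raw PCM chunks to stdout the moment they are produced.

    chunk_iter is assumed to yield bytes objects, e.g. 16-bit little-endian
    PCM frames from whichever TTS engine is in use.
    """
    out = sys.stdout.buffer
    for chunk in chunk_iter:
        out.write(chunk)
        out.flush()  # flush per chunk so a downstream player hears audio immediately
```

The output could then be piped straight into a player, e.g. `python tts_stream.py "some text" | aplay -f S16_LE -r 24000 -c 1`, with the format flags matching whatever the engine actually outputs.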


erew123 commented on July 24, 2024

Streaming is possible with https://github.com/KoljaB/RealtimeTTS, though that is another step down the line. My current workload is re-tidying all the documentation, both on this GitHub and within the app, and catching a few minor bugs/issues.

Then I'm working on the new API for 3rd party/standalone use, which is 70-80% complete.

From there, I'll look at options for other TTS engines and features such as the above. However, it's worth noting there is a memory overhead for this, and there will be coding around certain things like the LowVRAM option: the two are not incompatible as such, but you would just be shuffling the TTS model between VRAM and system RAM all the time, resulting in zero gain and probably a lot of complaints about speed.


Sascha353 commented on July 24, 2024

There are a few more things to consider:

  • How and when does text input reach the tts engine: to receive answers as fast as possible, we should start here. TG-webui can be used in streaming mode or normal mode. Normal mode is what is currently used by all tts engines AFAIK, but it is obviously not the best option in terms of speed, as synthesis only starts after the whole reply has been generated. Streaming mode would feed the tts engine individual words, which can't be used to generate a coherent sentence. So if we talk about instant or real-time tts, we are talking about sentence-by-sentence and not word-by-word streaming. That leaves two options: add "sentence streaming" to TG-webui, or add a feature to a tts extension which gathers individual words from TG-webui in streaming mode, waits until at least one sentence has been generated, and calls the tts engine for synthesis on each complete sentence (see the sketch after this list). Sentences could also be very short, so there must be a little more logic to it, to wait for a certain number of characters/tokens.
  • I'm not sure if RealtimeTTS is capable of doing this. I know it can split a whole paragraph into sentences, but does it also work with word-by-word streaming from the text generation?
  • Parallelisation: the "word listener", tts synthesis and playback of the audio must run in parallel.
  • xtts actually has a native streaming mode which I have not tested yet, and which they themselves did not use in their voice chat space. In that space they do pretty much what should be the fastest way of getting audio results from the tts; they also utilised gradio to still be able to control the streamed audio.
  • It's true that this would be a feature which mostly benefits systems with spare VRAM, as it would do text generation and speech synthesis in parallel, and all the models must be loaded. @erew123 do you think it could be an optional feature for people who most likely wouldn't use the LowVRAM feature anyway, or is it out of scope due to the LowVRAM "incompatibility"?
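To make the sentence-gathering idea from the first bullet concrete, here is a rough sketch of such a buffer (the names and the 40-character threshold are illustrative, not existing extension code):

```python
import re

# Collect word-by-word tokens from TG-webui's streaming mode and only hand
# complete, reasonably long sentences to the TTS engine.
SENTENCE_END = re.compile(r"[.!?]['\")\]]?\s*$")
MIN_CHARS = 40  # don't synthesise very short fragments on their own

def sentences_from_tokens(token_iter, min_chars=MIN_CHARS):
    buffer = ""
    for token in token_iter:
        buffer += token
        if len(buffer.strip()) >= min_chars and SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush whatever is left when generation finishes
        yield buffer.strip()

# Example wiring; synthesize() stands in for the real engine call:
# for sentence in sentences_from_tokens(webui_token_stream):
#     synthesize(sentence)
```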


mercuryyy commented on July 24, 2024

@Sascha353 great overview, and great find on https://docs.coqui.ai/en/dev/models/xtts.html#streaming-manually
Should be easy to implement into the addon.
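For reference, the linked Coqui documentation describes manual streaming roughly as in the sketch below; the paths, the reference clip and the exact signatures are placeholders and may differ between TTS versions:

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Paths and the reference wav below are placeholders.
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/")
model.cuda()

# Conditioning latents are computed once per speaker from a short reference clip.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# inference_stream yields audio chunks (torch tensors) as they are generated,
# so each chunk could be handed to a player immediately instead of waiting
# for the full wav.
chunks = model.inference_stream(
    "This sentence is synthesised and can be played back chunk by chunk.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for chunk in chunks:
    wav_chunks.append(chunk)  # or stream each chunk out as it arrives
wav = torch.cat(wav_chunks, dim=0)
```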


erew123 commented on July 24, 2024

@Sascha353 @mercuryyy
"xtts actually has a native streaming mode which I did not test yet"
I have not tested it either, but I had spotted it. I am curious how well it will handle compared to the other one I suggested, and also how it deals with sentence breakdown.

"Should be easy to implement into the addon."
Yes and no. All the other code will need to be caveated around it, e.g. making sure that LowVRAM is disabled when people use such a mode. It's probable that it won't work with the narrator function, depending on how it would stream split sentences, so it would need testing and then potentially code to flip that off and notify the user. Then of course, I'll need to document it because, if I don't, I'll be getting all the questions about why X isn't working correctly, etc.

"is it out of scope, due to the LowVRAM incompatibility?"
Technically speaking, it's not out of scope. However, I've only built AllTalk in the last few weeks. There's been good adoption, but I've also been fighting some fires here and there, a couple of minor hiccups, and also helping the less technical people who have struggled with some things. Hence my focus on getting AllTalk very stable in its current form, with very clear documentation and good troubleshooting (I just added a basic diagnostic utility today, cleaned up the whole built-in documentation, and re-wrote the whole GitHub front page). I think I've spent about 14 hours on documentation over the last day or so. My next goal is to complete the JSON request API for 3rd party apps and document it. From there, potentially other TTS engines and things such as streaming.


erew123 commented on July 24, 2024

@Sascha353 @mercuryyy Let me ask you both a question, as this also has considerations. Where would you both want the streaming output to be played? E.g. within text-generation-webui's interface as it generates content? Over the API and back to your own player of some kind? Over the API and through a built-in Python-based player that runs within the AllTalk Python process?


mercuryyy commented on July 24, 2024

@erew123 First choice would be "Over the API and back to your own player of some kind", sort of a live-stream .wav.
I was playing with the TTS built-in options and got it to somewhat work outside of alltalk; it is not bad at all.
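One hedged sketch of what "over the API" could look like, using FastAPI's StreamingResponse; the endpoint name and the chunk generator are hypothetical, not AllTalk's actual API:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_audio_chunks(text: str):
    # Hypothetical generator: yields raw audio bytes from the TTS engine
    # as each sentence finishes synthesising.
    yield b""  # placeholder

@app.get("/api/tts-stream")
def tts_stream(text: str):
    # A client such as ffplay, VLC or a browser audio element can start
    # playback while later chunks are still being generated.
    return StreamingResponse(generate_audio_chunks(text), media_type="audio/wav")
```

Note that a genuinely valid streamed .wav needs a header written up front, or the client has to accept raw PCM with an agreed sample rate and format.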


Sascha353 commented on July 24, 2024

I tend to aim for the "best" option first and reduce the scope if needed, based on feasibility and resources. In my opinion, it would be best if audio output is sent to TG-webui, as there is already a player in gradio where the user can interact with the audio file/stream. This makes it generally more accessible and understandable for the user, as no other output/player is introduced. I know the gradio player is capable of supporting a wav stream (utilised in the Coqui space). However, I don't think TG-webui is ready to receive and handle audio chunks, as receiving and working with one full wav file coming from the tts engine is obviously vastly different from handling a stream of incoming audio chunks. I described some of the challenges already in my FR here.

As a proof of concept, probably the easiest approach would be to stream directly using a library like sounddevice or PyAudio. In that case the tts extension should have auto-play disabled, so that the stream is not played once by the streaming feature and then again in the webui after the full wav is generated and transferred.

It's just my opinion, but I would not introduce another UI to control the stream. The user is working inside the webui, and usability and immersion drop if you have to switch apps, windows, tabs etc.

