Comments (9)
Do you mean audio streaming decoupled from text-generation-webui's main wav handling? A TTS-engine-independent solution inside the webui would be best, but as a workaround it could be implemented in alltalk_tts or any TTS extension: instead of returning audio chunks to the webui, stream them directly with a library like sounddevice. One disadvantage is that the user would have no control to pause, stop, or resume playback.
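For illustration, direct playback from the extension could look roughly like this. This is a sketch, not AllTalk code: the chunk generator is a stand-in for a real TTS engine's output, the 24 kHz sample rate is an assumption, and the actual sounddevice call is left in comments because it needs an audio device:

```python
# Sketch: play TTS audio chunk-by-chunk as it is produced, bypassing the webui.
# The generator below is a stand-in for a real TTS engine's chunk stream.
import math
from array import array

SAMPLE_RATE = 24000  # assumed mono 24 kHz output for this sketch

def fake_tts_chunks(n_chunks=5, chunk_samples=2400):
    """Stand-in for a TTS engine yielding 16-bit PCM chunks (a 440 Hz tone)."""
    for i in range(n_chunks):
        yield array("h", (int(8000 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
                          for t in range(i * chunk_samples, (i + 1) * chunk_samples)))

def play_stream(chunks):
    # Real playback would be (requires `pip install sounddevice`):
    #   import sounddevice as sd
    #   with sd.RawOutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as s:
    #       for chunk in chunks:
    #           s.write(chunk.tobytes())
    # Here we only count the bytes that would have been written.
    return sum(len(c) * c.itemsize for c in chunks)

total = play_stream(fake_tts_chunks())
```

The point is that playback starts on the first chunk rather than after the full file exists.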
from alltalk_tts.
I can implement it later into text-generation-webui. The main thing I am trying to achieve is generating an instantly playable .wav file that can be streamed in chunks, so we can achieve real-time TTS.
The main point is streaming the raw audio to stdout as it's produced.
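Streaming raw PCM to stdout so an external player can consume it in real time could be sketched like this (the script name and player flags are illustrative; sample rate/format must match whatever the engine actually emits):

```python
# Sketch: write raw 16-bit PCM to stdout as each chunk arrives, so an external
# player can consume it live, e.g.:
#   python tts_stream.py | aplay -f S16_LE -r 24000 -c 1
#   python tts_stream.py | ffplay -f s16le -ar 24000 -ac 1 -i -
import sys

def stream_to_stdout(chunks, out=None):
    """Write each PCM bytes chunk and flush immediately for low latency."""
    out = out or sys.stdout.buffer
    written = 0
    for chunk in chunks:   # `chunks` yields bytes from the TTS engine
        out.write(chunk)
        out.flush()        # flushing per chunk is what makes it "real time"
        written += len(chunk)
    return written
```

The `out` parameter exists only so the writer can be exercised against an in-memory buffer.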
from alltalk_tts.
Streaming is possible with this https://github.com/KoljaB/RealtimeTTS though that is another step down the line. My current workload is re-tidying all the documentation, both on this GitHub and within the app, and catching a few minor bugs/issues.
Then I'm working on the new API for 3rd party/standalone, which is 70-80% completed.
From there, I'll look at options for other TTS engines and features such as the above. However, it's worth noting there is a memory overhead for this, and there will be coding around certain things like the LowVRAM option. The two, whilst not incompatible as such, mean you're just going to be shuffling the TTS model between VRAM and system RAM all the time, resulting in zero gain and probably a lot of complaints about speed.
from alltalk_tts.
There are a few more things to consider:
- How and when does text input reach the TTS engine? To get answers as fast as possible, we should start here. TG-webui can be used in streaming mode or normal mode. Normal mode is what all TTS engines currently use AFAIK, but it is obviously not the best option in terms of speed, as synthesis only starts after the whole reply is finished. Streaming mode would feed the TTS engine individual words, which can't be used to generate a coherent sentence. So if we talk about instant or real-time TTS, we are talking about sentence-by-sentence rather than word-by-word streaming. That leaves two options: add "sentence streaming" to TG-webui, or add a feature to a TTS extension that gathers individual words from TG-webui in streaming mode and waits until at least one sentence has been generated. For each complete sentence, it calls the TTS engine for synthesis. Sentences can also be very short, so a little more logic is needed to wait for a certain number of characters/tokens.
- I'm not sure if RealtimeTTS is capable of doing this. I know it can split a whole paragraph into sentences, but does it also work with word-by-word streaming from the text generation?
- Parallelisation: the "word listener", TTS synthesis, and audio playback must run in parallel.
- xtts actually has a native streaming mode which I did not test yet, and which they themselves did not use in their voice chat Space. In that Space they do pretty much what should be the fastest way of getting audio results from the TTS, and they also utilized Gradio to keep control over the streamed audio.
- It's true that this feature would mostly benefit systems with spare VRAM, as it would run text generation and speech synthesis in parallel with all the models loaded. @erew123 do you think it could be an optional feature for people who most likely wouldn't use the LowVRAM feature anyway, or is it out of scope due to the LowVRAM "incompatibility"?
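The sentence-gathering idea from the first point above can be sketched like this (the function name, regex, and minimum-length threshold are illustrative, not from any existing extension):

```python
# Sketch: buffer a word/token stream from the text generator and emit text
# only when a complete, long-enough sentence has accumulated.
import re

SENTENCE_END = re.compile(r"[.!?…]\s*$")
MIN_CHARS = 40  # don't synthesize fragments too short to sound natural

def sentences_from_tokens(tokens, min_chars=MIN_CHARS):
    buf = ""
    for tok in tokens:
        buf += tok
        if SENTENCE_END.search(buf) and len(buf.strip()) >= min_chars:
            yield buf.strip()
            buf = ""
    if buf.strip():        # flush whatever remains when generation ends
        yield buf.strip()
```

Each yielded sentence would be handed to the TTS engine while later tokens are still arriving.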
from alltalk_tts.
@Sascha353 great overview, and great find on https://docs.coqui.ai/en/dev/models/xtts.html#streaming-manually
Should be easy to implement into the addon.
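The chunk-consuming side could be sketched as below. The XTTS calls in the comment follow the shape shown in the linked Coqui docs but are untested here (treat the exact signatures as unverified); the helper itself just appends incoming float chunks to a wav file using the stdlib:

```python
# Per the linked docs, XTTS's manual streaming looks roughly like:
#   gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
#       audio_path=["reference.wav"])
#   chunks = model.inference_stream(
#       "Hello world", "en", gpt_cond_latent, speaker_embedding)
# Each yielded chunk is float audio; below we convert such chunks to
# 16-bit PCM and append them to a wav file as they arrive.
import wave
from array import array

def write_chunks_to_wav(path, chunks, rate=24000):
    """Append float sample chunks (values in [-1, 1]) as 16-bit mono PCM."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        n = 0
        for chunk in chunks:
            pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in chunk))
            wf.writeframes(pcm.tobytes())
            n += len(pcm)
    return n
```

Note the stdlib `wave` writer patches the header sizes on close, so this produces a valid file, but it is only playable once synthesis finishes; true live playback needs one of the streaming approaches discussed above.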
from alltalk_tts.
@Sascha353 @mercuryyy
xtts actually has a native streaming mode which I did not test yet
Not tested it either, but had spotted it. I am curious how well it will handle compared to the other one I suggested. Also how it deals with sentence breakdown.
Should be easy to implement into the addon.
Yes and no. All the other code will need to be caveated around it, e.g. making sure that low VRAM is disabled when people use such a mode. It's probable that it won't work with the narrator function, depending on how it streams split sentences, so it would need testing and then potentially code to flip that off and notify the user. Then of course I'll need to document it because, if I don't, I'll be getting all the questions about why X isn't working correctly, etc.
is out of scope, due to the LowVram "incompatibility"?
Technically speaking, not out of scope. However, I've only built AllTalk in the last few weeks. There's been good adoption, but I've also been fighting some fires here and there, a couple of minor hiccups, and also helping less technical people who have struggled with some things. Hence my focus on getting AllTalk very stable in its current form, with very clear documentation and good troubleshooting (I just added a basic diagnostic utility today, cleaned up the whole built-in documentation, and re-wrote the whole GitHub front page). I think I've spent about 14 hours on documentation over the last day or so. My next goal is to complete the JSON requests API for 3rd party apps, plus its documentation. From there, potentially other TTS engines and things such as streaming.
from alltalk_tts.
@Sascha353 @mercuryyy Let me ask you both a question, as this also has design implications. Where would you both want the streaming output to be played? Within text-generation-webui's interface as it generates content? Over the API and back to your own player of some kind? Or over the API and through a built-in Python-based player that runs within the AllTalk Python process?
from alltalk_tts.
@erew123 First choice would be "Over the API and back to your own player of some kind", sort of a live-streamed .wav.
I was playing with the built-in TTS options and got it to somewhat work outside of AllTalk; it is not bad at all.
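One practical wrinkle with a "live stream .wav": the WAV header contains total sizes that aren't known until synthesis finishes. A common workaround is to emit a header with placeholder sizes, which most players accept and start playing anyway. A minimal sketch (a generic RIFF trick, not anything AllTalk currently does):

```python
# Sketch: build a 44-byte WAV header with "unknown" RIFF/data sizes
# (0xFFFFFFFF) so a client can start playing an open-ended PCM stream.
import struct

def streaming_wav_header(rate=24000, channels=1, sampwidth=2):
    byte_rate = rate * channels * sampwidth
    block_align = channels * sampwidth
    return (b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, rate,
                                    byte_rate, block_align, sampwidth * 8)
            + b"data" + struct.pack("<I", 0xFFFFFFFF))
```

A server would send this header first, then raw PCM chunks as the engine produces them, e.g. via HTTP chunked transfer encoding.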
from alltalk_tts.
I tend to aim for the "best" option first and reduce the scope if needed, based on feasibility and resources. In my opinion, it would be best if audio output is sent to TG-webui, as there is already a player in Gradio where the user can interact with the audio file/stream. This makes it generally more accessible and understandable for the user, as no other output/player is introduced. I know the Gradio player is capable of supporting a wav stream (as utilized in the Coqui Space). However, I don't think TG-webui is ready to receive and handle audio chunks, as receiving and working with one full wav file coming from the TTS engine is obviously vastly different from handling a stream of incoming audio chunks. I described some of the challenges in my FR here.
As a proof of concept, probably the easiest approach would be to stream directly using a library like sounddevice or PyAudio. In that case the TTS extension should have auto-play disabled, so that the stream is not played once by the streaming feature and then again in the webui after the full wav is generated and transferred.
It's just my opinion, but I would not introduce another UI to control the stream. The user is working inside the webui, and usability and immersion drop if you have to switch apps, windows, tabs, etc.
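For the proof-of-concept route, the parallelisation point raised earlier (synthesize the next sentence while the current one is still playing) fits a simple producer/consumer queue. A stdlib-only sketch, where the `synthesize` and `play` callables are stand-ins for real engine/player code:

```python
# Sketch: overlap synthesis and playback with a bounded queue.
import queue
import threading

def run_pipeline(sentences, synthesize, play):
    """Synthesize upcoming sentences in a background thread while the
    main thread plays finished audio in order."""
    q = queue.Queue(maxsize=2)   # small buffer keeps latency low

    def producer():
        for s in sentences:
            q.put(synthesize(s))  # blocks when playback falls behind
        q.put(None)               # sentinel: no more audio

    threading.Thread(target=producer, daemon=True).start()
    played = []
    while (audio := q.get()) is not None:
        played.append(play(audio))
    return played
```

The bounded queue is the key design choice: it stops synthesis from racing far ahead (wasting VRAM on buffered audio) while still hiding synthesis time behind playback.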
from alltalk_tts.