Hello, I have found your project interesting, good job. I believe th

Voice Activity Controller about whisper_streaming HOT 5 OPEN

rodrigoGA commented on June 26, 2024

Voice Activity Controller

from whisper_streaming.

Comments (5)

Gldkslfmsd commented on June 26, 2024

Wow, thank you, @rodrigoGA! This is very interesting feedback. I want to review and test your approach and possibly merge the useful parts later, when I have time.
Thanks!

rodrigoGA commented on June 26, 2024

If the suggestion is integrated, I would also suggest changing the way the translation is returned. All streaming systems indicate in some way whether a result is partial or final. That way, what is in the buffer could be returned as partial, and the user would get more realistic feedback on what is being said, with the understanding that a partial result can still change.

Gldkslfmsd commented on June 26, 2024

Yes, an option for |||-separated partial output is possible. But anyway, I don't want a more complicated output protocol. Plaintext is enough.
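For illustration only, here is a minimal sketch of what a |||-separated plaintext line could look like. The thread does not specify the layout, so everything here is an assumption: the confirmed text goes before the separator, the still-changing partial text goes after it, and `format_line` and `parse_line` are hypothetical helper names, not part of whisper_streaming.

```python
from typing import Tuple

# The "|||" separator is mentioned in the thread; the field order
# (confirmed text first, partial text second) is an assumption.
SEPARATOR = "|||"


def format_line(confirmed: str, partial: str = "") -> str:
    """Build one plaintext output line from confirmed and partial text."""
    confirmed = confirmed.strip()
    partial = partial.strip()
    if not partial:
        # Nothing tentative to report: emit plain text only.
        return confirmed
    return f"{confirmed} {SEPARATOR} {partial}"


def parse_line(line: str) -> Tuple[str, str]:
    """Split an output line back into (confirmed, partial)."""
    confirmed, _sep, partial = line.partition(SEPARATOR)
    return confirmed.strip(), partial.strip()
```

A client that ignores partial output would keep only the first element returned by `parse_line`, so the line stays usable as plain text.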

rodrigoGA commented on June 26, 2024

I understand the idea of keeping it simple. However, this is the standard in streaming ASR. You can check how Nvidia uses 'is_final' for all streaming models supported by the Riva platform https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#_CPPv428SpeechRecognitionAlternative or how companies that sell the model as a service expose it in their streaming APIs https://www.assemblyai.com/docs/guides/real-time-streaming-transcription
All of them use the same concept. As a consumer of these services, I can tell you that this is very useful for knowing when the user is speaking and for getting feedback on what is happening, even though the transcription has not finished. Imagine you want to use an ASR in a real-world use case, for example, transcribing a phone call. You would need to know when the user stops speaking and that the transcription is finished in order to do something with the text. Otherwise, you would have to wait until the call ends to consider the transcription complete, which would lose the real-time aspect.
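The partial/final idea can be sketched roughly as follows. This is not the Riva or AssemblyAI API and not whisper_streaming code: `StreamingResult`, `label_results`, and `final_texts` are hypothetical names, and the sketch assumes some voice activity signal is already available for each hypothesis update.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class StreamingResult:
    """One streaming update: the current text and whether it can still change."""
    text: str
    is_final: bool


def label_results(updates: Iterable[Tuple[str, bool]]) -> Iterator[StreamingResult]:
    """Tag each (hypothesis_text, speech_active) update as partial or final.

    Assumption: a hypothesis becomes final as soon as the voice activity
    signal reports that the speaker has stopped.
    """
    for text, speech_active in updates:
        yield StreamingResult(text=text, is_final=not speech_active)


def final_texts(results: Iterable[StreamingResult]) -> List[str]:
    """Keep only the results a consumer can safely act on."""
    return [r.text for r in results if r.is_final]
```

In the phone call scenario, the consumer could display every result as live feedback but trigger downstream processing only on the entries returned by `final_texts`.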

Gldkslfmsd commented on June 26, 2024

@rodrigoGA, thank you very much again. I integrated your VAC in https://github.com/ufal/whisper_streaming/tree/vad-streaming. It seems to work well, but the code needs to be reviewed and made clearer and simpler. Then I can merge it.
