
whisper_streaming's Introduction

whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

Turning Whisper into Real-Time Transcription System

Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023

Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.

Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf

Demo video: https://player.vimeo.com/video/840442741

Slides -- 15-minute oral presentation at IJCNLP-AACL 2023

Please cite us. BibTeX citation:

@InProceedings{machacek-dabre-bojar:2023:ijcnlp,
  author    = {Macháček, Dominik  and  Dabre, Raj  and  Bojar, Ondřej},
  title     = {Turning Whisper into Real-Time Transcription System},
  booktitle      = {System Demonstrations},
  month          = {November},
  year           = {2023},
  address        = {Bali, Indonesia},
  publisher      = {Asian Federation of Natural Language Processing},
  pages     = {17--24},
}

Installation

  1. pip install librosa -- audio processing library

  2. Whisper backend.

Two alternative backends are integrated. The recommended one is faster-whisper with GPU support. Follow their instructions for the NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install it with pip install faster-whisper.

An alternative, less restrictive but slower backend is whisper-timestamped: pip install git+https://github.com/linto-ai/whisper-timestamped

The backend is loaded only when chosen. The unused one does not have to be installed.
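For example, the backend is selected with the --backend flag documented in the usage below (audio.wav is a placeholder file name):

python3 whisper_online.py audio.wav --language en --backend faster-whisper
python3 whisper_online.py audio.wav --language en --backend whisper_timestamped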

  3. Optional, not recommended: sentence segmenter (aka sentence tokenizer)

Two buffer trimming options are integrated and evaluated. They have an impact on quality and latency. The default "segment" option performs better according to our tests and does not require any sentence segmentation to be installed.

The other option, "sentence" -- trimming at the end of confirmed sentences -- requires a sentence segmenter to be installed. It splits punctuated text into sentences by full stops, avoiding the dots that are not full stops. The segmenters are language specific. The unused one does not have to be installed. We integrate the following segmenters, but suggestions for better alternatives are welcome.

  • pip install opus-fast-mosestokenizer for languages with codes bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh

  • pip install tokenize_uk for Ukrainian -- uk

  • for other languages, we integrate a well-performing multilingual model from wtpsplit. It requires pip install torch wtpsplit, and its neural model wtp-canine-s-12l-no-adapters, which is downloaded to the default huggingface cache during the first use (a minimal usage sketch follows this list).

  • we did not find a segmenter for languages such as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt that are supported by Whisper but not by wtpsplit. The default fallback option for them is wtpsplit with unspecified language. Alternative suggestions are welcome.
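A minimal usage sketch of the wtpsplit segmenter mentioned in the list above, assuming the WtP class of the wtpsplit package (the example sentence is made up):

from wtpsplit import WtP

# loads the neural model named above; it is downloaded to the huggingface cache on first use
wtp = WtP("wtp-canine-s-12l-no-adapters")

# split a punctuated transcript into a list of sentences
print(wtp.split("Chairman, thank you. If the debate today had a subject, it was Gaza."))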

In case of installation issues with opus-fast-mosestokenizer, especially on Windows and Mac, we recommend using only the "segment" option, which does not require it.
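For illustration, the trimming strategy is selected with the --buffer_trimming flag described in the usage below (audio.wav is a placeholder file name):

python3 whisper_online.py audio.wav --language en --buffer_trimming segment
python3 whisper_online.py audio.wav --language en --buffer_trimming sentence   # requires a sentence segmenter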

Usage

Real-time simulation from audio file

usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
                         [--backend {faster-whisper,whisper_timestamped}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
                         audio_path

positional arguments:
  audio_path            Filename of 16kHz mono channel wav, on which live streaming is simulated.

options:
  -h, --help            show this help message and exit
  --min-chunk-size MIN_CHUNK_SIZE
                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
  --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}
                        Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
  --model_cache_dir MODEL_CACHE_DIR
                        Overriding the default model cache dir where models downloaded from the hub are saved
  --model_dir MODEL_DIR
                        Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
  --lan LAN, --language LAN
                        Language code for transcription, e.g. en,de,cs.
  --task {transcribe,translate}
                        Transcribe or translate.
  --backend {faster-whisper,whisper_timestamped}
                        Load only this backend for Whisper processing.
  --vad                 Use VAD = voice activity detection, with the default parameters.
  --buffer_trimming {sentence,segment}
                        Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
  --buffer_trimming_sec BUFFER_TRIMMING_SEC
                        Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
  --start_at START_AT   Start processing audio at this time.
  --offline             Offline mode.
  --comp_unaware        Computationally unaware simulation.

Example:

It simulates realtime processing from a pre-recorded mono 16 kHz wav file.

python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt

Simulation modes:

  • default mode, no special option: real-time simulation from file, computationally aware. The chunk size is MIN_CHUNK_SIZE or larger, if more audio arrived during the last update's computation.

  • --comp_unaware option: computationally unaware simulation. It means that the timer that counts the emission times "stops" when the model is computing. The chunk size is always MIN_CHUNK_SIZE. The latency is caused only by the model being unable to confirm the output, e.g. because of language ambiguity etc., and not because of slow hardware or suboptimal implementation. We implement this feature for finding the lower bound for latency.

  • --start_at START_AT: Start processing audio at this time. The first update receives the whole audio up to START_AT. It is useful for debugging, e.g. when we observe a bug at a specific time in the audio file and want to reproduce it quickly without a long wait.

  • --offline option: It processes the whole audio file at once, in offline mode. We implement it to find the lowest possible WER on a given audio file. Example invocations for these modes are shown below.
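For illustration, assuming the demo file en-demo16.wav from the example above (the --start_at value is arbitrary):

python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1                  # default: real-time, computationally aware
python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 --comp_unaware   # computationally unaware simulation
python3 whisper_online.py en-demo16.wav --language en --start_at 120.0                    # start processing at 120 seconds
python3 whisper_online.py en-demo16.wav --language en --offline                           # process the whole file at once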

Output format

2691.4399 300 1380 Chairman, thank you.
6914.5501 1940 4940 If the debate today had a
9019.0277 5160 7160 the subject the situation in
10065.1274 7180 7480 Gaza
11058.3558 7480 9460 Strip, I might
12224.3731 9460 9760 have
13555.1929 9760 11060 joined Mrs.
14928.5479 11140 12240 De Kaiser and all the
16588.0787 12240 12560 other
18324.9285 12560 14420 colleagues across the

See description here
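In this example output, the first column appears to be the emission time measured from the beginning of processing, and the next two columns the begin and end timestamps of the emitted text segment, all in milliseconds, followed by the text itself. A minimal parsing sketch under that assumption:

def parse_output_line(line):
    # split one output line into emission time, segment begin/end (all in ms) and the text
    emission_ms, beg_ms, end_ms, text = line.split(" ", 3)
    return float(emission_ms), float(beg_ms), float(end_ms), text

with open("out.txt") as f:   # out.txt as produced by the example command above
    for line in f:
        emission, beg, end, text = parse_output_line(line.rstrip("\n"))
        print(f"{beg/1000:.2f}-{end/1000:.2f} s: {text}")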

As a module

TL;DR: use OnlineASRProcessor object and its methods insert_audio_chunk and process_iter.

The code whisper_online.py is nicely commented; read it as the full documentation.

This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.

from whisper_online import *

src_lan = "en"  # source language
tgt_lan = "en"  # target language  -- same as source for ASR, "en" if translate task is used

asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps Whisper model
# set options:
# asr.set_translate_task()  # it will translate from src_lan into English
# asr.use_vad()  # set using VAD

online = OnlineASRProcessor(asr)  # create processing object with default buffer trimming option

while audio_has_not_ended:   # processing loop:
	a = # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
	online.insert_audio_chunk(a)
	o = online.process_iter()
	print(o) # do something with current partial output
# at the end of this audio processing
o = online.finish()
print(o)  # do something with the last output


online.init()  # refresh if you're going to re-use the object for the next audio
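A more complete sketch that simulates streaming from a 16 kHz mono wav file, along the lines of what whisper_online.py does in its simulation mode (file name and chunk length are placeholders; the exact return format of process_iter is documented in the code):

import librosa
from whisper_online import FasterWhisperASR, OnlineASRProcessor

SAMPLING_RATE = 16000
min_chunk = 1.0   # seconds, placeholder value

asr = FasterWhisperASR("en", "large-v2")
online = OnlineASRProcessor(asr)

# load the whole file once and feed it to the processor chunk by chunk
audio, _ = librosa.load("en-demo16.wav", sr=SAMPLING_RATE, mono=True)

step = int(min_chunk * SAMPLING_RATE)
for beg in range(0, len(audio), step):
    online.insert_audio_chunk(audio[beg:beg + step])
    o = online.process_iter()
    if o[0] is not None:
        print(o)             # (begin_ts, end_ts, 'confirmed text')

print(online.finish())       # flush the remaining unconfirmed text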

Server -- real-time from mic

whisper_online_server.py has the same model options as whisper_online.py, plus --host and --port of the TCP connection. See help message (-h option).
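For example (model and connection parameters are illustrative):

python3 whisper_online_server.py --model large-v2 --language en --min-chunk-size 1 --host localhost --port 43007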

Client example:

arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001

  • arecord sends realtime audio from a sound device (e.g. mic), in raw audio format -- 16000 sampling rate, mono channel, S16_LE -- signed 16-bit integer, little endian. (Use the alternative to arecord that works for you; a minimal Python client is sketched below.)

  • nc is netcat with server's host and port
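A minimal Python client sketch for platforms without arecord; it assumes the pyaudio package and the server port used in the example above:

import socket
import pyaudio

SAMPLING_RATE = 16000
CHUNK = 1024   # frames per read, placeholder value

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=SAMPLING_RATE,
                    input=True, frames_per_buffer=CHUNK)

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.connect(("localhost", 43001))
    try:
        while True:
            # forward raw 16-bit mono PCM from the microphone to the server
            sock.sendall(stream.read(CHUNK, exception_on_overflow=False))
    except KeyboardInterrupt:
        pass
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()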

Background

Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with an "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("committed") and followed up, and when the future content makes the transcript clearer.

For that, there is LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
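A minimal sketch of the LocalAgreement-2 idea on plain word lists (not the project's actual implementation, which works on timestamped words):

def agreed_prefix(prev_words, new_words):
    # returns the longest common prefix of two consecutive hypotheses; these words are confirmed
    confirmed = []
    for a, b in zip(prev_words, new_words):
        if a == b:
            confirmed.append(b)
        else:
            break
    return confirmed

# hypotheses from two consecutive updates of the growing audio buffer
prev = "chairman thank you if the debate today".split()
new = "chairman thank you if the debate today had a".split()
print(agreed_prefix(prev, new))   # ['chairman', 'thank', 'you', 'if', 'the', 'debate', 'today']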

In this project, we re-use the idea of Peter Polák from this demo: https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py However, it doesn't do any sentence segmentation, but Whisper produces punctuation and the libraries faster-whisper and whisper-timestamped provide word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on a timestamp of a confirmed complete sentence. The processing audio buffer is not too long, and the processing is fast.

In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.
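A rough sketch of the buffer scrolling described above (hypothetical helper names; the actual logic lives in OnlineASRProcessor in whisper_online.py):

import numpy as np

SAMPLING_RATE = 16000

def trim_buffer(audio_buffer, buffer_offset_sec, confirmed_end_sec):
    # drop audio up to the end of the last confirmed sentence/segment
    # and move the buffer offset to that timestamp
    cut = int((confirmed_end_sec - buffer_offset_sec) * SAMPLING_RATE)
    return audio_buffer[cut:], confirmed_end_sec

# example: a 25 s buffer that started at t=10 s, with a sentence confirmed up to t=18 s
buf = np.zeros(25 * SAMPLING_RATE, dtype=np.float32)
buf, offset = trim_buffer(buf, 10.0, 18.0)
print(len(buf) / SAMPLING_RATE, offset)   # 17.0 seconds of audio remain, new offset 18.0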

Contributions are welcome.

Performance evaluation

See the paper.

Contact

Dominik Macháček, [email protected]

whisper_streaming's People

Contributors

gldkslfmsd, lifefeel, luca-pozzi, luweigen, skripnik


whisper_streaming's Issues

I stop receiving predictions while streaming after some time

I'm running

 python whisper_online_server.py --model base.en --host localhost --port 43001 --vad
arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001

and the client stops receiving predictions after 50-70 s. If I restart the client, it starts working again. I noticed that this happens more often with audio that has frequent silences. Some audio that has background noise as "silence" did okay.

Also, I am wondering how I can get word-level timestamps. Is there an option for that? Because I'm currently getting something like this:

20530 21390  Just saying, you know what?
24230 24770  before anything, because
24770 24830  for
24830 25770  anything because we've handled business
35130 36750  Well, I mean, the most
36770 37650  recent two rounds at
37650 38970  NIP's been able to put on the board have
38970 40210  been the result of some
40210 41190  form of aggression for
41190 42090  CI and the push into
42090 43250  halls to actually try and fight
43250 43550  out the
43550 45470  balconies at this point. And then now
45850 47010  here trying to be aggressive with
47010 48130  a boost off at the half wall.

Thank you for this library 🙌

Using VAD

I'm having trouble using VAD.
To use VAD, I set the setting to True and ran the server module.
The error "numpy.core.multiarray failed to import" occurred when the numpy version was low, so I upgraded the version and solved this problem.

However, the following error occurred during the speech recognition process: "onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(double)) , expected: (tensor(float))"

Since it seems to be an issue with the development environment, I'm sharing the following development environment.

virtual environment : conda
python : 3.8.5
numpy : 1.23.5
torch.version : 1.12.1+cu113

Should I install tensor...?
I need your advice.

whisper_online_server.py freezes after a few seconds

I would like to express my gratitude for the exceptional project you have developed. Currently, I am hosting the "whisper_online_server.py" script in a cloud environment, where I stream audio from my local computer through the microphone.

However, I have encountered an issue that I am struggling to resolve. For the initial few seconds, the audio transcription process functions as expected, but subsequently it freezes. I have not made any alterations to the provided code; I don't know why the transcription freezes.
Here is the client code I am using:

import pyaudio
import socket
import time

# Define audio parameters
audio_format = pyaudio.paInt16
channels = 1
sample_rate = 16000
chunk_size = 1 * sample_rate
# chunk_size = 65536

server_address = <host>
server_port = <port>


audio = pyaudio.PyAudio()


client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_socket.connect((server_address, server_port))


stream = audio.open(format=audio_format, channels=channels,
                    rate=sample_rate, input=True, frames_per_buffer=chunk_size)

print("Recording and streaming audio...")

try:
    while True:
        audio_data = b""
        audio_data = stream.read(chunk_size)
        client_socket.sendall(audio_data)
        print("Sending audio data to server")

        time.sleep(0.1)


except KeyboardInterrupt:
    print("Recording and streaming stopped.")

# Close the audio stream and the socket
stream.stop_stream()
stream.close()
client_socket.close()
audio.terminate()

I am relatively new to this field, and despite my attempts to adjust the client code, the issue persists. I would greatly appreciate your assistance in resolving this matter or any guidance you may provide to identify the root cause.

Thank you for your time and support.

export LD_LIBRARY_PATH when nvidia-cudnn-cu11 is installed via pip in the venv

The GPU models need CUDA, specifically CUDA 11 for the time being.
For some reason, even though the libraries are installed as dependencies in the virtualenv, both whisper_online.py and whisper_online_server.py require the path to the nvidia-cudnn-cu11 .so libraries to be exported.
(Only when using faster-whisper, to my knowledge).

If others experience the error:

Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_ops_infer.so.8 is in your library path!
[1]    16931 IOT instruction (core dumped)  python3.10 whisper_streaming/whisper_online_server.py --min-chunk-size 3

It can be solved by running this before the Python program:

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`

'str' object has no attribute 'sep'

I am using this code to set up a ws server. The client sends audio chunks to this ws server. The ws connection is established, but the first upstream audio send gives me the error 'str' object has no attribute 'sep'.
Basically it fails at https://github.com/ufal/whisper_streaming/blob/main/whisper_online.py#L250 -- not so sure why though. Do you see an obvious issue I am missing?


src_lan = "en"  # source language
tgt_lan = "en"  # target language -- same as source for ASR, "en" if translate task is used
# Set options:
asr = FasterWhisperASR(src_lan, "large-v2") 
online = OnlineASRProcessor(tgt_lan, asr)
async def audio_processing(websocket, path):
    try:
        online.init() 
        async for audio_chunk in websocket:
            a = audio_chunk  # Receive new audio chunk
            print("audio",a)
            online.insert_audio_chunk(a)
            o = online.process_iter()
            await websocket.send(o)  # Send the current partial output
        # At the end of audio processing, send the final output
        final_output = online.finish()
        await websocket.send(final_output)
    except websockets.exceptions.ConnectionClosedOK:
        pass
    except Exception as e:
        print(e)
# Start the WebSocket server
start_server = websockets.serve(audio_processing, "localhost", 8765)

asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()

For real-time transcription, crashes around the 60 second mark.

Hi there, this has been super awesome, but for some reason it always crashes for me around the 60 second mark on the default model. One interesting thing to note is that as I increase the min-chunk-size, the time it takes to crash also increases (but it always inevitably crashes). I can't figure out why and I don't even know where to begin to debug. It also does not matter which model I use, but I usually use the default.

I've connected the real-time streaming via a simple streaming socket app. I'm not sure if that is causing any issues, but they both tend to crash together. Would appreciate any guidance on how to solve or troubleshoot the issue. The very basic streaming app is as follows:

import socket
import pyaudio

# Audio configuration
FORMAT = pyaudio.paInt16  # Audio format (16-bit PCM)
CHANNELS = 1              # Mono audio
RATE = 16000              # Sampling rate 16kHz
CHUNK = 1024              # Size of each audio chunk

# Server configuration
SERVER = 'localhost'  # Server IP address (adjust as needed)
PORT = 43007          # Port number for the server

def stream_audio_to_server():
    audio = pyaudio.PyAudio()

    # Open the microphone stream
    stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)

    # Create a socket connection to the server
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client_socket:
        client_socket.connect((SERVER, PORT))

        print("Streaming audio to server... Press Ctrl+C to stop.")

        try:
            while True:
                # Read audio chunk from microphone
                data = stream.read(CHUNK, exception_on_overflow=False)
                # Send audio chunk to server
                client_socket.sendall(data)
        except KeyboardInterrupt:
            pass
        finally:
            # Stop and close the stream
            stream.stop_stream()
            stream.close()
            audio.terminate()

            print("Stopped streaming.")

if __name__ == "__main__":
    stream_audio_to_server()

Any recommendations to solve this would be greatly appreciated! Thanks!

Usage as Frontend+Backend app with websockets

Hi.

Thank you so much for this great product.

I want to build a product that does real-time transcription. It will consist of two parts:
1. A lightweight front-end app (Android app or a simple web UI) that records the audio and sends it to a server
2. A server that uses Whisper to do real-time transcription and returns the results

My question is how I can do this using whisper_streaming. Should I use websockets to connect the frontend with the backend? What should be the length of the audio chunks that are sent one at a time?

I hope you can help me with this.

Thanks

Real-time input from microphone

Is there a way to use your program in real-time, so no audio file needed? I am looking for a solution to use whisper with my laptop microphone

windows opus-fast-mosestokenizer installation

Hello! When I try to run the whisper_online_server I get this error:

Exception has occurred: FileNotFoundError
[WinError 2] The system cannot find the file specified
File "S:\AI\Testing\HuggingFace\openai\whisper_streaming\whisper_online.py", line 436, in create_tokenizer
return MosesTokenizer(lan)
^^^^^^^^^^^^^^^^^^^
File "S:\AI\Testing\HuggingFace\openai\whisper_streaming\whisper_online_server.py", line 64, in
online = OnlineASRProcessor(asr,create_tokenizer(tgt_language))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Everything is the default, I just cloned the repo and installed all dependencies. How can I solve this?

I'm on Windows, using py3.11.3 in a .venv.

How can i do inference on microphone

Please can you tell me how I can do inference with my microphone instead of using an audio file? In the demo video, the speaker is talking into the mic and the model is transcribing it on the go.

BUG ImportError('libmosestokenizer-dev.so: cannot open shared object file: No such file or directory')

Hi, after prolonged building of the required libs, I was finally able to run faster-whisper with GPU, but now I have this error about opus-fast-mosestokenizer.

When running this command

python3 whisper_online.py out.wav --language en --model small --min-chunk-size 1 > out.txt

I get an error about libmosestokenizer-dev.so

Audio duration is: 35.55 seconds
Loading Whisper small model for en... done. It took 5.62 seconds.
Traceback (most recent call last):
  File "/home/ivo/.local/lib/python3.8/site-packages/mosestokenizer/__init__.py", line 16, in <module>
    from mosestokenizer.lib import _mosestokenizer
ImportError: libmosestokenizer-dev.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "whisper_online.py", line 495, in <module>
    online = OnlineASRProcessor(asr,create_tokenizer(tgt_language))
  File "whisper_online.py", line 427, in create_tokenizer
    from mosestokenizer import MosesTokenizer
  File "/home/ivo/.local/lib/python3.8/site-packages/mosestokenizer/__init__.py", line 20, in <module>
    raise RuntimeError(_msg)
RuntimeError: Failed to import mosestokenizer c++ library
Full error log: ImportError('libmosestokenizer-dev.so: cannot open shared object file: No such file or directory')

I installed opus-fast-mosestokenizer with pip

pip install opus-fast-mosestokenizer

edit>

I tried changing the tokenizer to sacremoses, with the line from sacremoses import MosesTokenizer, and that got the transcription running and showing words; it even processed the first sentence with a time/delay report, but then, understandably, it failed because the tokenizer couldn't split(). My point is that the rest of the infrastructure seems to work; I just can't get opus-fast-mosestokenizer to work.

large-v3 model

It still fails when I run with Whisper large-v3. My command simply uses the default for all arguments except for the model:
python whisper_online.py audio.wav --model large-v3

Because the large-v3 model of faster-whisper was only supported recently, it might not have been tested yet.

Originally posted by @zmlee0514 in #44 (comment)

How to run model inference from mic on windows

Please tell me how I can run the model for streaming transcription from mic input.

I don't know much about any terminal-based library, so if possible please help me out.

I am a Windows user and arecord isn't available for Windows.

whisper_online.py returns only 'Python'

(streaming) C:\Users\user>python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
Python
(streaming) C:\Users\user>python3 whisper_online.py 3-1.wav --model large-v2
Python
(streaming) C:\Users\user\Documents\Python>python3 whisper_online.py 3-1.wav --model large-v2
Python

I installed and ran the streaming Whisper according to the manual.
(I replaced opus-fast-mosestokenizer with sacremoses.)
However, there is no error and it only repeats just 'Python' like that.

I am using Windows 10, Anaconda and Python 3.8.0.
What's wrong with me...

Client sample in python.

I want to develop a client module written in Python,
so I wrote the sample code below.

import socket

def send_file_over_socket(file_path, host, port, chunk_size):
    try:
        with open(file_path, 'rb') as file:
            client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client_socket.connect((host, port))

            while True:
                data = file.read(chunk_size)
                if not data:
                    break
                client_socket.send(data)

            client_socket.close()
            print("sending complete")

    except Exception as e:
        print(f"Error on sending data: {str(e)}")

if __name__ == '__main__':
    file_path = '/Users/Thankyou/Desktop/data/readalong.wav'  # wave path
    host = 'XXX.XXX.XXX.XXX'
    port = XXXXXX  # port
    chunk_size = 32044  # chunk_size

    send_file_over_socket(file_path, host, port, chunk_size)

And I got the following error:
From cffi callback <function SoundFile._init_virtual_io.<locals>.vio_read at 0x7f897e1703a0>:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/soundfile.py", line 1214, in vio_read
    data_read = file.readinto(buf)
  File "/opt/conda/lib/python3.8/site-packages/soundfile.py", line 713, in __getattr__
    raise AttributeError(
AttributeError: 'SoundFile' object has no attribute 'readinto'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/soundfile.py", line 1219, in vio_read
    buf[0:data_read] = data
ValueError: right operand length must match slice length
Traceback (most recent call last):
  File "whisper_online_server.py", line 223, in <module>
    proc.process()
  File "whisper_online_server.py", line 187, in process
    a = self.receive_audio_chunk()
  File "whisper_online_server.py", line 145, in receive_audio_chunk
    audio, _ = librosa.load(sf,sr=SAMPLING_RATE)
  File "/opt/conda/lib/python3.8/site-packages/librosa/core/audio.py", line 165, in load
    raise (exc)
  File "/opt/conda/lib/python3.8/site-packages/librosa/core/audio.py", line 146, in load
    with sf.SoundFile(path) as sf_desc:
  File "/opt/conda/lib/python3.8/site-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/opt/conda/lib/python3.8/site-packages/soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "/opt/conda/lib/python3.8/site-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening SoundFile(<_io.BytesIO object at 0x7f898182bd10>, mode='r', samplerate=16000, channels=1, format='RAW', subtype='PCM_16', endian='LITTLE'): File contains data in an unknown format.
whisper-server-INFO: killing process 60900

Please comment with your ideas.

Thank you.

whisper_online_server.py not producing correct text

Hello,

First thanks a lot for opensourcing the software 👍

I encountered some problems with whisper_online_server.py; it does not seem to work and produces some "random" content. I tested the same file with whisper_online.py and it works perfectly. Any idea what it could be?

Client side

Command

ffmpeg -i GMT20231102-120819_Recording.m4a -f s16le -acodec pcm_s16le - | nc localhost 43007

Log

ffmpeg -i GMT20231102-120819_Recording.m4a -f s16le -acodec pcm_s16le - | nc localhost 43007
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-pocketsphinx --enable-librsvg --enable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared
  libavutil      56. 70.100 / 56. 70.100
  libavcodec     58.134.100 / 58.134.100
  libavformat    58. 76.100 / 58. 76.100
  libavdevice    58. 13.100 / 58. 13.100
  libavfilter     7.110.100 /  7.110.100
  libswscale      5.  9.100 /  5.  9.100
  libswresample   3.  9.100 /  3.  9.100
  libpostproc    55.  9.100 / 55.  9.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'GMT20231102-120819_Recording.m4a':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    creation_time   : 2023-11-02T12:08:19.000000Z
  Duration: 01:22:02.94, start: 0.000000, bitrate: 127 kb/s
  Stream #0:0(und): Audio: aac (LC) (mp4a / 0x6134706D), 32000 Hz, mono, fltp, 126 kb/s (default)
    Metadata:
      creation_time   : 2023-11-02T12:08:19.000000Z
      handler_name    : AAC audio
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:0 -> #0:0 (aac (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, s16le, to 'pipe:':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    encoder         : Lavf58.76.100
  Stream #0:0(und): Audio: pcm_s16le, 32000 Hz, mono, s16, 512 kb/s (default)
    Metadata:
      creation_time   : 2023-11-02T12:08:19.000000Z
      handler_name    : AAC audio
      vendor_id       : [0][0][0][0]
      encoder         : Lavc58.134.100 pcm_s16le
0 1660  Thank you very much.9.37 bitrate= 512.6kbits/s speed=18.1x
4000 8420  dramatic presentation of a video12.1kbits/s speed=4.89x
8420 12420  for sports.
13260 17020  If you're interested in watching.1kbits/s speed=3.99x
38740 39180  please feel free to leave a= 512.1kbits/s speed=1.95x
42540 42560  please feel free to leave a comment or a like, and don't forget to

Server side

Command

python whisper_online_server.py --min-chunk-size 1

Log

Loading Whisper large-v2 model for en... done. It took 3.65 seconds.
Whisper is not warmed up
whisper-server-INFO: INFO: Listening on('localhost', 43007)
whisper-server-INFO: INFO: Connected to client on ('127.0.0.1', 56506)
...
INCOMPLETE: (4.04, 10.0, ' Well, that was nice of you. Okay, uh, maybe next time. Take care.')
len of buffer now: 10.37
(None, None, '')
b'C\xfeG\xfeK\xfeP\xfeT\xfe'
65536
PROMPT:
CONTEXT:  Thank you very much.
transcribing 12.42 seconds from 0.00
whisper-server-INFO: Processing audio with duration 00:12.416
>>>>COMPLETE NOW: (None, None, '')
INCOMPLETE: (4.0600000000000005, 12.38, " Well that is it for this episode of Gamer Gear. I hope you enjoyed it. Thank you all for watching. I'll see you all next time.")
len of buffer now: 12.42
(None, None, '')
b'\xdd\xff\xea\xff\x15\x00\xff\xff\xd6\xff'
65536
PROMPT:
CONTEXT:  Thank you very much.
transcribing 14.46 seconds from 0.00
whisper-server-INFO: Processing audio with duration 00:14.464
>>>>COMPLETE NOW: (None, None, '')
INCOMPLETE: (4.36, 13.26, " Well, that was not too hard, was it? No, it was a little bit difficult. It was a long time ago. It's hard to say.")
len of buffer now: 14.46
(None, None, '')
b'<\x07[\x07\x82\x08)\nj\x0c'
65536
PROMPT:
CONTEXT:  Thank you very much.
transcribing 16.51 seconds from 0.00
whisper-server-INFO: Processing audio with duration 00:16.512
>>>>COMPLETE NOW: (None, None, '')
INCOMPLETE: (4.079999999999999, 16.48, ' dramatic presentation of a video that we all enjoy watching. If you enjoy it, please subscribe.')
len of buffer now: 16.51
(None, None, '')
b'\xd3\xfd\xda\xfd\xe2\xfd\xef\xfd\x00\xfe'
65536
...

VAD and whisper-timestamped

First, thank you. I am super happy to see whisper-timestamped used in such a good project.
Having Whisper streamed in real time is a super feature!

I see here that VAD is not available when using whisper-timestamped backend:

def use_vad(self):
    raise NotImplemented("Feature use_vad is not implemented for whisper_timestamped backend.")

But VAD IS implemented in whisper-timestamped (it was even before faster-whisper integrated it). It's currently based on SILERO (same as what was done in faster-whisper).
Am I missing a sticking point? (Maybe the fact that things required for VAD are not by default in the requirements?)
I can contribute if help is needed on this.

(VAD is important to prevent some hallucinations of Whisper models, and make timestamps more accurate)

Also, I want to mention:
After being disappointed with weird results on some files, I opened a branch to replace SILERO with AUDITOK : linto-ai/whisper-timestamped#78 (see the linked issue to have an illustration of possible "hallucinations" of Silero).
I had good experience with Auditok. I was hoping some user feedback to confirm before merging in master. But as it's not coming, maybe we just need to establish a benchmark to confirm the improvement.

Running whisper_streaming with a fine-tuned whisper model

Hi Everyone,
I want to use whisper_streaming with my custom fine-tuned whisper model.
However, as far as I see from the command line parameters, it only accepts whisper's pretrained models, tiny, base, small, medium and large. Is it possible to make it work with a fine-tuned model ?

Review a demo paper

Dear followers watching this repo, stargazers and all,

I'm working on a demonstration paper for a conference that has a deadline soon. Would you like to volunteer to read and review the paper, and give me feedback? I'd like to have it by Friday 30th June if possible. If not, then at least by 7th July.

If you are interested, please, send me an email, and I will send you pdf.

Btw., you can watch a demo video linked from README.

Thanks!

Best,

Dominik

Voice Activity Controller

Hello, I have found your project interesting, good job.

I believe there is an incorrect use of VAD. The function get_speech_timestamps used by faster-whisper is a copy of the function from Silero, which is intended for complete audio files. However, when working with streaming, audio fragments are being received. Silero already includes a utility for this at https://github.com/snakers4/silero-vad/blob/5e7ee10ee065ab2b98751dd82b28e3c6360e19aa/utils_vad.py#L428

I have forked your project to test this: https://github.com/rodrigoGA/whisper_streaming/tree/main
Changing the way VAD is used seemed to improve the results.

One of the main drawbacks I found is the delay in obtaining the transcription, which gives an unpleasant feeling, especially when the conversation ends, as no transcription is received for a few seconds. Therefore, I created a class based on VAD to flush the buffer once it detects that the user has not spoken for 0.5 seconds https://github.com/rodrigoGA/whisper_streaming/blob/main/voice_activity_controller.py
In this file, you can find an example that transcribes from the microphone: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_streaming.py
It greatly improves the feeling of real-time transcription, perhaps a similar idea can be applied. I say feeling because I haven't done any serious performance testing.

I've also created a simple example that transcribes when the user stops talking to compare results: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_simple.py

Another point I think you should consider is the tokens you are using. In languages like Spanish, questions are enclosed in question marks at the beginning and end, and can have other punctuation marks in the middle. For example, sentences like this: "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?" However, in some situations, your approach has transcribed it as: "cual es la capital de Francia, ¿por qué es conocida por su arquitectura?" It might be a problem with whisper, but I think it's the use of tokens you have applied.

macOS support

opus-fast-mosestokenizer does not seem to be able to build on macOS.

any help?

[BUG] AttributeError: 'MosesTokenizer' object has no attribute 'split'

Hello again.

I managed to run the server. However, I got this error when I was translating foreign audio into English:

  File "/home/eng_amghasan/whisper_streaming/whisper_online_server.py", line 215, in <module>
    proc.process()
  File "/home/eng_amghasan/whisper_streaming/whisper_online_server.py", line 186, in process
    o = online.process_iter()
  File "/home/eng_amghasan/whisper_streaming/whisper_online.py", line 276, in process_iter
    self.chunk_completed_sentence()
  File "/home/eng_amghasan/whisper_streaming/whisper_online.py", line 322, in chunk_completed_sentence
    sents = self.words_to_sentences(self.commited)
  File "/home/eng_amghasan/whisper_streaming/whisper_online.py", line 376, in words_to_sentences
    s = self.tokenizer.split(t)
AttributeError: 'MosesTokenizer' object has no attribute 'split'
whisper-server-INFO: killing process 138718

Always chunking because of len

I have tested whisper_online.py on a 10 min video, but it always chunks when the buffer reaches 30 s, not at a sentence boundary.
Running log: AI_pin.txt

I found it is due to the string comparison in the words_to_sentences function, so I added strip() to it. I wonder if this is right, because I want Whisper to output one sentence per line.

def words_to_sentences(self, words):
    """Uses self.tokenizer for sentence segmentation of words.
    Returns: [(beg,end,"sentence 1"),...]
    """
    
    cwords = [w for w in words]
    t = " ".join(o[2] for o in cwords)
    s = self.tokenizer.split(t)
    out = []
    while s:
        beg = None
        end = None
        sent = s.pop(0).strip()
        fsent = sent
        while cwords:
            b,e,w = cwords.pop(0)
            if beg is None and sent.startswith(w.strip()):
                beg = b
            elif end is None and sent == w.strip():
                end = e
                out.append((beg,end,fsent))
                break
            sent = sent[len(w):].strip()
    return out

The result log: AI_pin.txt

batching inference and forced decoding for speedup and multi-target

Batching inference should be used in Whisper-Streaming. It's currently not implemented.

This could work: huggingface/transformers#27658

  • if "forced decoding" really works for Whisper, it should help to avoid re-processing the current buffer from start of segment, and it should be faster

Why batching:

  • If more than chunk-size audio is accumulated, process a batch of the full audio buffer, and the buffer minus chunk size. Then apply local agreement as on the two subsequent iterations. It will be faster.
  • it could enable joint transcription and translation on one GPU. It might be slower than separately -- due to padding, one of them might have short buffer and the other long. But not so much with forced decoding. And it might be good anyway
  • it could enable multiple clients in one instance

Usage Real Time Speech to Text

I want to use your application in my graduation project. I will only use the speech-to-text side. What should I do to make it run as smoothly and fast as your program?

How did you measure computationally unaware latency?

This question is not about a code issue.

I read your paper and would like to know more specifically about computationally unaware latency.

In your code, computationally unaware mode does not measure latency. How did you measure latency?

I want to reproduce your result.

Thanks.

youtube m3u8 use whisper streaming?

Hello, I am currently making software for YouTube live real-time translation.
I know that I can use yt-dlp to obtain the .m3u8 file of YouTube streaming audio only.

How do I connect whisper_streaming to the m3u8 stream?

Import error when using `whisper_online` as module

Hi!

First of all, thank you very much for this repo, impressive work!

Issue

As I am trying to use Whisper for ASR on a robot, I want to use whisper_online.py as a module. In my main application, I tried to use
something like

import whisper
import whisper_timestamped
import whisper_streaming

...

sr = whisper_online.WhisperTimestampedASR(lan = kwargs.get('lan'),
                                                       modelsize = kwargs.get('model'), 
                                                       cache_dir = kwargs.get('model_cache_dir'), 
                                                       model_dir = kwargs.get('model_dir')
                                                       )

while True:
    a = # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...
    online.insert_audio_chunk(a)
    o = online.process_iter()
    print(o) # do something with current partial output

but I get a ModuleNotFoundError: No module named 'whisper'. I am sure that whisper is installed as running whisper_online.py as main works smoothly. Moreover, I have run import whisper in a Python shell, and it throws no error.

Solution(s)

  • placing the import statements at the top of the whisper_online script works, but indeed the possibility to import only the necessary backend is lost.
  • alternatively, I have tried to add an import_backend method in ASRBase, to import the required libraries. E.g. for WhisperTimestampedASR it would be:
def import_backend(self):
    global whisper, whisper_timestamped
    import whisper
    import whisper_timestamped

Is this workable for FastAPI?

I want to use it as the backend of a service that receives requests from mobile devices, for instance for voice recognition. Is it possible?

batching multi-client server

> How to use this to allow multiple clients to connect when you host a server or create an API for live transcription?

I don't know, it's a topic that requires a separate issue. But first, there must be a Whisper backend that enables batching -- more inputs processing at once. If there's not, then use one GPU with one server for one client.

Thank you. Using one GPU for each client is a tall ask for me as there could be up to a dozen clients active at a particular time for my use case. I think there are a few backends which do support batched processing. e.g. https://github.com/Blair-Johnson/batch-whisper
If you have any references or you can point me to the parts where changes are needed to implement this.
Or is it alright if I create a new issue for this?

Originally posted by @umaryasin33 in #10 (comment)

Error while online.process_iter()

I'm trying to use ffmpeg to create the audio chunks and send them to the online.process_iter() function, but it's failing with:

AttributeError: 'str' object has no attribute 'sep' 

I managed to get it working with a CLI command; you have to run it twice: the first time it creates the wav file, and since the file does not exist yet, the whisper script fails. The second time you already have the file with some audio, so it works; just overwrite the file with the ffmpeg -y option:

ffmpeg -i https://cbsn-us.cbsnstream.cbsnews.com/out/v1/55a8648e8f134e82a470f83d562deeca/master.m3u8 -ac 1 -ar 16000 -acodec pcm_s16le -f wav out2.wav | sleep 3 | python3 whisper_online.py out2.wav --language en --min-chunk-size 3 > output.txt --model tiny.en

But when i try to run via python script it fails.

Below is my test.py script

import os
import subprocess
import tempfile
import time
import whisper_online

# Configuration
src_lan = "en"
tgt_lan = "en"
m3u8_stream_path = "https://cbsn-us.cbsnstream.cbsnews.com/out/v1/55a8648e8f134e82a470f83d562deeca/master.m3u8"
ffmpeg_command = [
    "ffmpeg", "-i", m3u8_stream_path, 
    "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", 
    "-f", "wav", "pipe:1"
]
chunk_size = 128000  # For 4s of audio, based on previous calculations

# Initialize whisper_online
asr = whisper_online.FasterWhisperASR(src_lan, "tiny.en")
online = whisper_online.OnlineASRProcessor(tgt_lan, asr)
online.init()

# Create a temporary file
temp_fd, temp_filename = tempfile.mkstemp(suffix=".wav")

try:
    # Start FFmpeg process
    ffmpeg_process = subprocess.Popen(ffmpeg_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=10**5)
    
    # if not hasattr(asr, 'sep'):
    #     asr.sep = " "

    with open(temp_filename, 'wb') as temp_file:
        while True:  # Main loop
            try:
                # Read and write audio chunk
                audio_chunk = ffmpeg_process.stdout.read(chunk_size)
                if not audio_chunk:
                    print("No more audio chunks. Ending.")
                    break
                temp_file.write(audio_chunk)
                temp_file.flush()
                
                # Process with whisper_online
                print("Inserting audio chunk...")
                result = online.insert_audio_chunk(temp_filename)                
                print("Processing...")
                print("Insert result:", result)
                print("Insert result type:", type(result))
                try:
                    print(type(asr))
                    print(hasattr(asr, 'sep'))
                    partial_output = online.process_iter()
                except AttributeError as e:
                    print("AttributeError during process_iter: ", str(e))
                    print(f"Attribute Error: {str(e)}")
                    print(f"ASR Object Type: {type(asr)}")
                    print(f"ASR has 'sep': {hasattr(asr, 'sep')}")
                    # Additional logging, recovery, or handling code here
                    continue 
                print("Partial output:", partial_output)
                print("Type of partial_output:", type(partial_output))
            except Exception as e:
                print(f"Error during processing: {str(e)}")
                print("Retrying in 1 second...")
                time.sleep(1)  # Wait before retrying
                
            time.sleep(0.1)  # Optional delay
            
finally:
    # Cleanup
    os.close(temp_fd)
    os.remove(temp_filename)
    ffmpeg_process.terminate()
    ffmpeg_process.wait()
    final_output = online.finish()
    print(final_output)
    online.init()

I'm pretty sure it's somehow related to the audio_chunk that is being passed, but I cannot understand why.
