softcatala / whisper-ctranslate2

Whisper command line client compatible with original OpenAI client based on CTranslate2.

License: MIT License

Makefile 2.15% Python 97.85%
speech-recognition speech-to-text whisper openai- openai-whisper

whisper-ctranslate2's Introduction


Introduction

Whisper command line client compatible with original OpenAI client based on CTranslate2.

It is based on CTranslate2 and Faster-Whisper, a Whisper implementation that is up to 4 times faster than openai/whisper for the same accuracy while using less memory.

Goals of the project:

  • Provide an easy way to use the CTranslate2 Whisper implementation
  • Ease the migration for people using OpenAI Whisper CLI

Installation

To install the latest stable version, just type:

pip install -U whisper-ctranslate2

Alternatively, if you are interested in the latest development (non-stable) version from this repository, just type:

pip install git+https://github.com/Softcatala/whisper-ctranslate2

CPU and GPU support

GPU and CPU support are provided by CTranslate2.

It is compatible with x86-64 and AArch64/ARM64 CPUs and integrates multiple backends optimized for these platforms: Intel MKL, oneDNN, OpenBLAS, Ruy, and Apple Accelerate.

GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the CTranslate2 documentation for installation details.

By default, the best available hardware is selected for inference. You can use the --device and --device_index options to control the selection manually, as in the example below.
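For example, to force inference on the first CUDA device (a hedged illustration; it assumes a CUDA-capable GPU is available):

whisper-ctranslate2 myfile.mp3 --model medium --device cuda --device_index 0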

Usage

Same command line as OpenAI Whisper.

To transcribe:

whisper-ctranslate2 inaguracio2011.mp3 --model medium


To translate:

whisper-ctranslate2 inaguracio2011.mp3 --model medium --task translate


The Whisper translate task translates the transcription from the source language to English (the only target language supported).

Additionally, running:

whisper-ctranslate2 --help

shows all the supported options and their help text.

CTranslate2 specific options

On top of the OpenAI Whisper command line options, there are some specific options provided by CTranslate2 or whisper-ctranslate2.

Quantization

The --compute_type option, which accepts the values default, auto, int8, int8_float16, int16, float16 and float32, indicates the type of quantization to use. On CPU, int8 will give the best performance:

whisper-ctranslate2 myfile.mp3 --compute_type int8
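On GPU, float16 or int8_float16 is typically a better fit. A hedged illustration, assuming a CUDA-capable GPU is available:

whisper-ctranslate2 myfile.mp3 --device cuda --compute_type float16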

Loading the model from a directory

The --model_directory option allows you to specify the directory from which to load a CTranslate2 Whisper model, for example your own quantized Whisper model or your own fine-tuned Whisper version. The model must be in CTranslate2 format.
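As a hedged sketch of that workflow (the model name and directory below are placeholders), a Hugging Face Whisper model can be converted with the ct2-transformers-converter tool that ships with CTranslate2, and the resulting directory can then be passed to --model_directory:

ct2-transformers-converter --model openai/whisper-medium --output_dir whisper-medium-ct2 --quantization int8

whisper-ctranslate2 myfile.mp3 --model_directory whisper-medium-ct2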

Using Voice Activity Detection (VAD) filter

The --vad_filter option enables voice activity detection (VAD) to filter out parts of the audio without speech. This step uses the Silero VAD model:

whisper-ctranslate2 myfile.mp3 --vad_filter True

The VAD filter accepts multiple additional options that determine the filter behavior (a combined example follows the list):

--vad_threshold VALUE (float)

Probabilities above this value are considered speech.

--vad_min_speech_duration_ms VALUE (int)

Final speech chunks shorter than min_speech_duration_ms are thrown out.

--vad_max_speech_duration_s VALUE (int)

Maximum duration of speech chunks in seconds. Longer chunks are split at the timestamp of the last silence.
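A combined example (the threshold and duration values are illustrative only, not recommendations):

whisper-ctranslate2 myfile.mp3 --vad_filter True --vad_threshold 0.5 --vad_min_speech_duration_ms 250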

Print colors

The --print_colors True option prints the transcribed text using an experimental color coding strategy, based on whisper.cpp, to highlight words with high or low confidence:

whisper-ctranslate2 myfile.mp3 --print_colors True


Live transcribe from your microphone

The --live_transcribe True option activates the live transcription mode from your microphone:

whisper-ctranslate2 --live_transcribe True --language en

Diarization (speaker identification)

There is experimental diarization support using pyannote.audio to identify speakers. At the moment, support is at the segment level.

To enable diarization you need to follow these steps:

  1. Install pyannote.audio with pip install pyannote.audio
  2. Accept pyannote/segmentation-3.0 user conditions
  3. Accept pyannote/speaker-diarization-3.1 user conditions
  4. Create access token at hf.co/settings/tokens.

Then execute, passing the Hugging Face API token as a parameter to enable diarization:

whisper-ctranslate2 --hf_token YOUR_HF_TOKEN

The name of the speaker is then added to the output files (e.g. JSON, VTT and SRT files):

[SPEAKER_00]: There is a lot of people in this room

The --speaker_name SPEAKER_NAME option allows you to use your own string to identify the speaker.
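For example (the audio file, token and speaker label below are placeholders):

whisper-ctranslate2 interview.mp3 --hf_token YOUR_HF_TOKEN --speaker_name Interviewee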

Need help?

Check our frequently asked questions for common questions.

Contact

Jordi Mas [email protected]

whisper-ctranslate2's People

Contributors

fflorent, flippfuzz, foul11, jordimas, mayeaux


whisper-ctranslate2's Issues

Live Inference API

Hi Jordi,

Thanks for open sourcing it! Could you please provide an example of running live inference programmatically instead of via CLI? Thanks.

Use "official" faster-whisper models

Hi,

Would it be possible for this project to use the same models that are downloaded by faster-whisper? See the available models here: https://huggingface.co/guillaumekln

Using the same models would ensure consistency between this project and faster-whisper. For example your models do not seem to contain a recent fix to the configuration:

https://huggingface.co/guillaumekln/faster-whisper-tiny.en/commit/7d45cf02c1ed72d240c0dbf99d544d19bef1b5a3
https://huggingface.co/guillaumekln/faster-whisper-tiny/commit/518d6e0b5a068b278f66842b17377f9523de5cd1

(all models were updated similarly)

initial_prompt?

Can you add initial_prompt to this? It is really helpful in a lot of situations:

whisper-ctranslate2 --model large-v2 --compute_type int8 --output_format json --vad_filter True --language en --word_timestamps True --initial_prompt "Joan Svarczkopf" 12694263-Joan-Svarczkopf.mp3

.DLL's needed to run on Windows Anaconda command prompt

Spent several hours trying to get this working under Anaconda under Windows (after failing to get it to work in Ubuntu WSL2) with the GPU.

What's needed are cublasLt64_12.dll and the cuDNN DLLs, and (this is the tricky part) zlibwapi.dll is also needed.

I put them all in the C:\Users\[username]\anaconda3\Lib\site-packages\ctranslate2 folder with the other DLLs and it finally worked.

Honestly, I have forgotten which downloads included the NVIDIA DLLs, but I think some were in cuda_12.1.0_531.14_windows.exe, which I opened with 7-Zip to extract. It probably didn't install "correctly" but it seems to be working okay now.

Thanks for all the work as the speed is great!

https://i.stack.imgur.com/PXtpG.png

No file output if output folder/type not selected

Using whisper-ctranslate2 0.1.8 (Windows 11 Pro 64 bit build 22621) I am not getting any transcription output if I don't select a type/location for it.

Text is generated but nothing is output at C:\Users\rsmit\Dropbox\Videos

Using this command line: whisper-ctranslate2.exe --language en --model "base" --device CPU --output_dir "C:\Users\rsmit\Dropbox\Videos" --output_format "srt" "C:\Users\rsmit\Dropbox\Videos\080109-002.wav"

I get a SRT file in that folder with the contents:
1
00:00:00,000 --> 00:00:07,280
Hi, this is Roger Smith with the U.S. Base Conservation Group, my dearth.

I get error code 126 with CUDA installed and running.

I get this error message:

Could not load library cudnn_cnn_infer64_8.dll. Error code 126
Please make sure cudnn_cnn_infer64_8.dll is in your library path!

I double-checked my Windows 10 environment variables, and the cudnn_cnn_infer64_8.dll directory is there.

Did I do anything wrong?

can Whisper run on the gaps between each speech section?

Could you add an option? Sometimes using VAD will miss dialogue, and --vad_threshold VALUE (float) doesn't help.

Like this:

  • silero-vad

  • Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Whisper is also run on the gaps between each speech section, by either expanding the section up to the max merge size, or running Whisper independently on the non-speech section.

  • silero-vad-expand-into-gaps

    • Use Silero VAD to detect sections that contain speech, and run Whisper independently on each section. Each speech section will be expanded such that it covers any adjacent non-speech sections. For instance, if an audio file of one minute contains the speech sections 00:00 - 00:10 (A) and 00:30 - 00:40 (B), the first section (A) will be expanded to 00:00 - 00:30, and (B) will be expanded to 00:30 - 00:60.

only works in cpu mode , but gpu outputs nothing

os: win11 , cuda 11.6 cudnn 8.9 rtx3060 4G

Every example in the README I've tried: if no --device is given, there is no output; with --device cpu it works but is slow.

And I've tried the newest versions:

whisper-ctranslate2-0.1.9
whisper-ctranslate2-0.2.0

unexpected keyword argument 'repetition_penalty'

After upgrading to 0.2.9, whisper-ctranslate2 gives me this error when translating a .wav file

Traceback (most recent call last):
  File "/home/user/.local/bin/whisper-ctranslate2", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 554, in main
    result = transcribe.inference(
  File "/home/user/.local/lib/python3.10/site-packages/src/whisper_ctranslate2/transcribe.py", line 130, in inference
    segments, info = self.model.transcribe(
TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'repetition_penalty'

Use the output of --live_transcribe

Hi, and thanks for your project, it works flawlessly and I have met no issues at all.
I would just like to ask if there is a way to "trap" or echo somewhere the real-time transcriptions obtained with the --live_transcribe True option.
Thanks in advance for your support

--live_transcribe does nothing for me in macOS

I did double-check terminal permissions and indeed it has microphone permission.

(base) ➜  ~ whisper-ctranslate2 --live_transcribe True --language es
Live stream device: Mac mini Speakers
Listening.. (Ctrl+C to Quit)

It says "Listening" and that's about it. I also noticed that it didn't even get to this stage when I first installed and launched it directly; I had to transcribe a .wav file first.

Documentation for translation functionality?

Is it possible to use ctranslate to detect audio and then translate it into a target language from the command line prompt? I searched for documentation and while I see a translate flag and a language flag it's not clear to me if that's the source or target language.

For one English video I mistakenly flagged it as Japanese and ended up getting more or less a Japanese translation with timecodes, which shocked me. I wasn't able to replicate this on purpose with another file where I actually wanted it to be translated (despite telling it translate and Japanese, it just gave me an English transcription).

It would be great if the help were a little more explicit about the options and how to set them.

sometimes srt file not generated

I tried a few files; sometimes the SRT file is not generated. I used --output_format srt

I also tried to debug it and set breakpoints at the following lines; it seems execution may never reach here, and there is no error or exception, which is strange.

            writer = get_writer(output_format, output_dir)
            writer(result, audio_path)

How to turn off --highlight_words ?

When I use --word_timestamps True, word highlighting turns on automatically, and when I add --highlight_words False, it shows the error: whisper-ctranslate2: error: unrecognized arguments: --highlight_words False.

123 is an example; the resulting SRT looks like this:

These three lines are the same subtitle

Live transcription is extremely inaccurate

I'm not sure if I'm doing something wrong, but the live transcription feature is coming up with very strange transcriptions, even using a big model like medium.en. I'm speaking clearly into a nice microphone, and I have tested with another recording app that the microphone is working correctly.

"This is a test of the microphone" somehow transcribed to... "This is a stomacher from Eastern Siberia"!?

The transcription models work great if I record into an audio file and then pass that to whisper-ctranslate2, so it's just something weird going on with the live transcribe feature.

I'm on Windows 11, and I'm running this natively on Windows. (using the CPU since GPU transcription doesn't work for me on Windows at the moment, but it works fine in WSL2... but WSL2 doesn't have access to microphones.) So, it could be something weird with the mic API on Windows.

Output file location

The help specifies that the output files should be written to the directory I am running whisper-ctranslate2 from, but it does not appear to be writing anything out at all. I am unsure whether it is writing any files at all, or if the paths are doing something really unconventional like the .dll problems that have been mentioned elsewhere.

I am using --task translate and running this from a Windows cmd terminal. The regular version of Whisper outputs all of the files to the directory, but this does not appear to produce anything.

It looks like this specifically applies to the Translate task. Transcribe generates the expected outputs in the expected spot.

EDIT: It looks like it depends on which model I am using? Medium will complete on a larger file, whereas large-v1 and large-v2 finish the translation but do not produce any output files. large-v1 will work on a smaller test file that I made, whereas large-v2 will not produce any output files I can see.

OSError: PortAudio library not found when using whisper-ctranslate2-0.1.7

time whisper-ctranslate2 --model large-v2 --language Japanese --task translate --verbose True --compute_type float32 --threads 4 --beam_size 1 --vad_filter True --vad_min_silence_duration_ms 2000 --output_dir output --output_format all -- audio.webm
Traceback (most recent call last):
  File "/home/jenkins/workspace/whisper-ctranslate2/venv/bin/whisper-ctranslate2", line 5, in <module>
    from src.whisper_ctranslate2.whisper_ctranslate2 import main
  File "/home/jenkins/workspace/whisper-ctranslate2/venv/lib/python3.10/site-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 11, in <module>
    from .live import Live
  File "/home/jenkins/workspace/whisper-ctranslate2/venv/lib/python3.10/site-packages/src/whisper_ctranslate2/live.py", line 4, in <module>
    import sounddevice as sd
  File "/home/jenkins/workspace/whisper-ctranslate2/venv/lib/python3.10/site-packages/sounddevice.py", line 71, in <module>
    raise OSError('PortAudio library not found')
OSError: PortAudio library not found

Not sure if it helps, but this might be because I'm running this on a headless Ubuntu server, with no audio devices.
Works fine with whisper-ctranslate2-0.1.6


pip list
Package             Version
------------------- ---------
av                  10.0.0
Brotli              1.0.9
certifi             2022.12.7
cffi                1.15.1
charset-normalizer  3.1.0
coloredlogs         15.0.1
ctranslate2         3.10.3
faster-whisper      0.4.1
filelock            3.10.7
flatbuffers         23.3.3
huggingface-hub     0.13.3
humanfriendly       10.0
idna                3.4
mpmath              1.3.0
mutagen             1.46.0
numpy               1.24.2
onnxruntime         1.14.1
packaging           23.0
pip                 22.0.2
protobuf            4.22.1
pycparser           2.21
pycryptodomex       3.17
PyYAML              6.0
requests            2.28.2
setuptools          59.6.0
sounddevice         0.4.6
sympy               1.11.1
tokenizers          0.13.2
tqdm                4.65.0
typing_extensions   4.5.0
urllib3             1.26.15
websockets          11.0
whisper-ctranslate2 0.1.7
yt-dlp              2023.3.4

Enable GPU

Hi sir, how do I enable GPU mode? I managed to run it, but it uses my CPU and not my GPU. Thank you!

'whisper-ctranslate2' is not recognized as an internal or external command, operable program or batch file.

I installed it as provided with pip install -U whisper-ctranslate2 without problems (other than uninstalling the faster-whisper library between install steps), and it finished. But I cannot run it, and it throws the error 'whisper-ctranslate2' is not recognized as an internal or external command, operable program or batch file.

What should I do to fix this issue?
I am on a Win10 machine, Python 3.10.11, CUDA version: release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
Installed packages:
Package Version


av 10.0.0
certifi 2022.12.7
charset-normalizer 3.1.0
colorama 0.4.6
coloredlogs 15.0.1
ctranslate2 3.10.2
filelock 3.10.7
flatbuffers 23.3.3
huggingface-hub 0.13.3
humanfriendly 10.0
idna 3.4
Jinja2 3.1.2
MarkupSafe 2.1.2
mpmath 1.3.0
networkx 3.1
numpy 1.24.2
onnxruntime 1.14.1
packaging 23.0
pip 23.0.1
protobuf 4.22.1
pyreadline3 3.4.1
PyYAML 6.0
requests 2.28.2
sympy 1.11.1
tokenizers 0.13.2
torch 2.0.0
tqdm 4.65.0
typing_extensions 4.5.0
urllib3 1.26.15

Random stops

@jordimas
Thanks for your effort to offer this great tool.
I've enjoyed it many times with different audios and videos.

However, I've also encountered several stops: the transcription stops randomly at different locations of a video. When I try a 2nd or 3rd time after ending the 1st task, it finishes transcribing that audio or video.
I'm using an Intel NUC11 with a modified cooling fan (no GPU, but better cooling for the CPU).
What's the possible reason and how can I tackle it? Could you please share your thoughts?
Thanks!

Error with Japanese subtitles

I'm attempting to generate text for Japanese audio through SubtitleEdit but getting this error which appears to be a python issue with the characters:

Date: 04/14/2023 13:57:40
SE: 3.6.12.62 - Microsoft Windows NT 10.0.22621.0 - 64-bit
Message: Calling whisper (CTranslate2) with : C:\Users\rsmit\Dropbox\transfer settings\Whisper-Faster\Whisper-Faster\whisper-ctranslate2.exe --language ja --model "large" "D:\Temp\3787e9c7-46dd-4055-aa58-377ada7b89e0.wav"
UnicodeEncodeError: 'charmap' codec can't encode characters in position 26-45: character maps to

File "encodings\cp1252.py", line 19, in encode

File "D:\whisper-fast_main_.py", line 399, in cli

File "D:\whisper-fast_main_.py", line 406, in

Traceback (most recent call last):

[3696] Failed to execute script 'main' due to unhandled exception!

Calling whisper CTranslate2 done in 00:00:09.1675218
Loading result from STDOUT

How to calculate the probability value of a word?

Hi everyone, I am new here, so I do not know how the authors calculate the probability value. I guess they use one of the following methods:

  1. Comparing the audio of a word in audio with a lot of sample sounds. Then, they count how many times it's correct and calculate the percentage.
    Example: With 100 sample sounds, if the word in the audio matches 50 sample sounds, the probability value is 50%.

  2. Comparing the syllables of the word.
    Example: For the word 'contribute' - /kənˈtrɪb.juːt/. They check if the phoneme sounds are correct. If /kən/ sounds like /kan/ or /kon/, the word will only get 66.6% probability.

If they are using the second approach, could you please explain how I can identify the incorrect phoneme?

I would like to receive results like:

  • "birch": /bir/ may lack confidence, so its color is yellow, and /ch/ is the correct phoneme, so its color is green.
  • "planks": /pl/ may lack confidence, so its color is green, and /anks/ is the correct phoneme, so its color is red.
    I came across this picture in a discussion, but it doesn't have a solution.
    Examples: "Agricultural," "depletion."
    I would greatly appreciate your assistance. Thank you so much.

I see the website "apeuni" has this feature; you can see more examples there.

Help, the software is not working!

Hello! I installed the regular Whisper, and then yours, after installing CTranslate2. Your software doesn't want to work even when set to medium, while the regular Whisper works on large-v2. Please help, I really need this software! In the first screenshot, I tried two commands: one letting it detect the language itself, and one where I specified the language. When it detects the language itself, nothing happens, and when I specify it, the line instantly resets to a new one without results. For comparison, I entered the same commands through the regular Whisper, where everything works, even the large-v2 model; it just takes a lot of time.

argument error when doing vad_filter

whisper-ctranslate2 --live_transcribe True --language en --device cpu --model small --vad_filter True

It will listen, but when it comes to transcribing it will crash with this error

Traceback (most recent call last):
  File "/home/hackerman/anaconda3/envs/whisper/bin/whisper-ctranslate2", line 8, in <module>
    sys.exit(main())
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 498, in main
    Live(
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/src/whisper_ctranslate2/live.py", line 163, in inference
    self.listen()
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/src/whisper_ctranslate2/live.py", line 159, in listen
    self.process()
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/src/whisper_ctranslate2/live.py", line 134, in process
    result = self.transcribe.inference(
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/src/whisper_ctranslate2/transcribe.py", line 128, in inference
    segments, info = self.model.transcribe(
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/faster_whisper/transcribe.py", line 252, in transcribe
    speech_chunks = get_speech_timestamps(audio, vad_parameters)
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/faster_whisper/vad.py", line 94, in get_speech_timestamps
    speech_prob, state = model(chunk, state, sampling_rate)
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/faster_whisper/vad.py", line 288, in __call__
    out, h, c = self.session.run(None, ort_inputs)
  File "/home/hackerman/anaconda3/envs/whisper/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 217, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(double)) , expected: (tensor(float))

word_timestamps doesn't output .json file

The original Whisper also outputs a .json file when using word_timestamps, such as:

{
  "text": " Welcome to English in a minute. Most people enjoy going to parties. You got to be",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 1.22,
      "end": 7.98,
      "text": " Welcome to English in a minute. Most people enjoy going to parties. You got to be",
      "tokens": [
        50364,
        4027,
        281,
        3669,
        294,
        257,
        3456,
        13,
        4534,
        561,
        2103,
        516,
        281,
        8265,
        13,
        509,
        658,
        281,
        312,
        50760
      ],
      "temperature": 0,
      "avg_logprob": -0.3057987576439267,
      "compression_ratio": 1.0384615384615385,
      "no_speech_prob": 0.07186834514141083,
      "words": [
        {
          "word": " Welcome",
          "start": 1.22,
          "end": 1.84,
          "probability": 0.7827746272087097
        },
        {
          "word": " to",
          "start": 1.84,
          "end": 2.14,
          "probability": 0.99225252866745
        },
        {
          "word": " English",
          "start": 2.14,
          "end": 2.52,
          "probability": 0.9609076976776123
        },
        {
          "word": " in",
          "start": 2.52,
          "end": 2.78,
          "probability": 0.8311975598335266
        },
        {
          "word": " a",
          "start": 2.78,
          "end": 2.86,
          "probability": 0.996508777141571
        },
        {
          "word": " minute.",
          "start": 2.86,
          "end": 3.38,
          "probability": 0.721484899520874
        },
        {
          "word": " Most",
          "start": 4.3,
          "end": 4.32,
          "probability": 0.9801965355873108
        },
        {
          "word": " people",
          "start": 4.32,
          "end": 4.86,
          "probability": 0.9989566802978516
        },
        {
          "word": " enjoy",
          "start": 4.86,
          "end": 5.44,
          "probability": 0.9946292042732239
        },
        {
          "word": " going",
          "start": 5.44,
          "end": 5.94,
          "probability": 0.9926981329917908
        },
        {
          "word": " to",
          "start": 5.94,
          "end": 6.22,
          "probability": 0.9955708384513855
        },
        {
          "word": " parties.",
          "start": 6.22,
          "end": 6.54,
          "probability": 0.9153341054916382
        },
        {
          "word": " You",
          "start": 7.4,
          "end": 7.52,
          "probability": 0.9837445020675659
        },
        {
          "word": " got",
          "start": 7.52,
          "end": 7.72,
          "probability": 0.5474305152893066
        },
        {
          "word": " to",
          "start": 7.72,
          "end": 7.86,
          "probability": 0.9965183734893799
        },
        {
          "word": " be",
          "start": 7.86,
          "end": 7.98,
          "probability": 0.9901227355003357
        }
      ]
    }
  ],
  "language": "English"
}

This is useful if you're applying post-processing to create your own subtitle files (for example with max character limits per line).

I don't know how complex it would be to implement something like this, but it would help me a lot. With the current version there's no good way I can find to get each word's timestamp (I could go through the VTT file and pull out each word within a block, but it's not a very graceful approach).
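For reference, a minimal sketch of the kind of post-processing meant above, assuming a JSON file with the structure shown earlier in this issue (the file name and character limit are placeholders):

import json

MAX_CHARS = 42  # illustrative per-line character limit

with open("result.json", encoding="utf-8") as f:
    result = json.load(f)

lines = []
current = []
for segment in result["segments"]:
    for word in segment.get("words", []):
        current.append(word)
        text = "".join(w["word"] for w in current).strip()
        if len(text) >= MAX_CHARS:
            # Close the line using the first and last word timestamps.
            lines.append((current[0]["start"], current[-1]["end"], text))
            current = []
if current:
    text = "".join(w["word"] for w in current).strip()
    lines.append((current[0]["start"], current[-1]["end"], text))

for start, end, text in lines:
    print(f"{start:.2f} --> {end:.2f}: {text}")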

Thanks for the great module!

No outputs

I spent half an hour running the large-v2 model on a 25-minute video. At the end of the process, there were no outputs.

The command i used: whisper-ctranslate2 [the video file] --model large-v2 --output_format srt --output_dir .\ --word_timestamps True --no_speech_threshold 0.2 --logprob_threshold None

GPU -> GTX 1060 (6GB VRAM model)
Average VRAM used by whisper-ctranslate2 during the process -> varies from 2.5 to 4.5GB
Windows 10

Edit: tried with the tiny model. Doesn't work either. No outputs.

--model_directory just got broken

Hello, I just updated to the latest version and the --model_directory argument seems to be broken.

error: unrecognized arguments: --model_directory

Please make sure libcudnn_ops_infer.so.8 is in your library path?

I keep getting this error no matter what I do:

Run on GPU with float16
Estimating duration from bitrate, this may be inaccurate
Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_ops_infer.so.8 is in your library path!
Aborted (core dumped)

find / -name "libcudnn_ops_infer.so.8"
/home/silvacarl/.local/lib/python3.8/site-packages/nvidia/cudnn/lib/libcudnn_ops_infer.so.8

any ideas?

PATH=$(echo "$PATH:/usr/local/cuda-11.7/bin:/usr/lib/x86_64-linux-gnu:/home/silvacarl/.local/lib/python3.8/site-packages/nvidia/cudnn/lib")
LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/home/silvacarl/.local/lib/python3.8/site-packages/nvidia/cudnn/lib"

Even that doesn't work.

Enabling VAD results in subtitle timing issue

For example, with a silent segment from 00:01:00 to 00:01:30: after processing by VAD, the following subtitle appears at 00:01:00, causing the entire 30-second silence period to have a subtitle. How can I fix this?

It works fine, but gives an error.

  • Excuse me, after I run the whisper-ctranslate2 test.mp3 --task transcribe --language Chinese command, I can successfully recognize the speech and generate the corresponding file, but what is the reason for the following error?
  • Reported error:
Traceback (most recent call last):
  File "/usr/local/bin/whisper-ctranslate2", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/src/whisper_ctranslate2/whisper_ctranslate2.py", line 517, in main
    result = Transcribe().inference(
  File "/usr/local/lib/python3.10/dist-packages/src/whisper_ctranslate2/transcribe.py", line 125, in inference
    segments, info = model.transcribe(
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 236, in transcribe
    audio = decode_audio(audio, sampling_rate=sampling_rate)
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/audio.py", line 45, in decode_audio
    with av.open(input_file, metadata_errors="ignore") as container:
  File "av/container/core.pyx", line 401, in av.container.core.open
  File "av/container/core.pyx", line 272, in av.container.core.Container.__cinit__
  File "av/container/core.pyx", line 292, in av.container.core.Container.err_check
  File "av/error.pyx", line 336, in av.error.err_check
  • My system is Ubuntu; the Python version is 3.10.6

How to transcribe as fast as possible on a CPU-only server?

I found it has a --threads option to make it work as fast as possible. Should I set the --threads parameter to the number of cores of my CPU, or just leave it at the default?

Or is there any other parameter that should be set to achieve this goal? (An illustrative command follows the CPU info below.)

CPU Info:

$ lscpu
Architecture:                    x86_64                                                     
CPU op-mode(s):                  32-bit, 64-bit  
Byte Order:                      Little Endian                                                                                                                                           
Address sizes:                   46 bits physical, 48 bits virtual                                                                                                                       
CPU(s):                          72                                                                                                                                                      
On-line CPU(s) list:             0-71        
Thread(s) per core:              2                                                          
Core(s) per socket:              18                                                                                                                                                      
Socket(s):                       2                                                                                                                                                       
NUMA node(s):                    2                                                                                                                                                       
Vendor ID:                       GenuineIntel                                                                                                                                            
CPU family:                      6                                                                                                                                                       
Model:                           85  
Model name:                      Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz
Stepping:                        5
CPU MHz:                         1209.473
CPU max MHz:                     4000.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        6200.00
Virtualization:                  VT-x
L1d cache:                       1.1 MiB
L1i cache:                       1.1 MiB
L2 cache:                        36 MiB
L3 cache:                        49.5 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:          Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; Clear CPU buffers; SMT vulnerable
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
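An illustrative command along those lines (the thread count simply matches the 36 physical cores reported above; it is an assumption, not a recommendation):

whisper-ctranslate2 myfile.mp3 --device cpu --compute_type int8 --threads 36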

Multi audio input support?

I’ve tried running whisper-ctranslate2 with multiple audio files using a command like whisper-ctranslate2 1.opus 2.opus, as shown in the official example, but it doesn’t seem to work. Is it possible to add support for processing multiple input files at once?

If there’s a way to do this, could you please advise?

--verbose argument should work with lowercase true and false

$ whisper-ctranslate2 "audiofile.wav" --model base.en --compute_type int8 --output_format vtt --threads 4 --verbose false
whisper-ctranslate2: error: argument --verbose: invalid str2bool value: 'false'

If I specify "True" or "False", it works fine, but this seems like it would be a quick fix.

2 hour video but generated transcription of only 35 mins.

I had used this command:

whisper-ctranslate2 video.mp4 --model medium.en --output_format srt > x.srt

It stopped midway.

The transcription also started with long lines, like one sentence, and then gave word-level timestamps. This was probably because I was simultaneously using MacWhisper, but I stopped that later on.

Is that because my terminal can't hold that much output?

I would also love to see a progress % like I can see in MacWhisper.

And an export option would be cool rather than using > x.srt... I don't know if it would still stop then.

Anyway, what do you think is happening in this case?

compress json filesize

It seems the precision of the "accuracy" parameter is quite high. Could we reduce it to, say, 2 or even 1 significant digits if we want to save some space in the .json? It seems a lot of tokens are consumed right now by this extremely precise parameter... maybe it could be an argument that states the precision (an int from 1 to 10)?

0.2.9 --help explained incorrectly

--repetition_penalty REPETITION_PENALTY
prevent repetitions of ngrams with this size (set 0 to disable) (default: 1.0)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
Penalty applied to the score of previously generated tokens (set > 1 to penalize) (default: 0)

The explanations of these two items are reversed, but "default" is correct.

Replacing 'whisper' with 'whisper-ctranslate2' in Projects

Background: Some projects utilize the 'whisper' library by importing it using the 'import whisper' command. I would like to replace 'whisper' with 'whisper-ctranslate2' in these projects.

Issue: When attempting to import 'whisper-ctranslate2', the hyphen (-) in the name results in a "SyntaxError: invalid syntax" error.

Solution: Is there an alternative way to import 'whisper-ctranslate2'? If so, this would make it easier to enhance the performance of projects currently using 'whisper'.
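For what it's worth, whisper-ctranslate2 is a command-line client rather than an importable replacement for the whisper module; a minimal sketch of library-level usage would go through faster-whisper, the implementation this tool wraps, assuming it is installed. The model name, file name and options below are placeholders:

from faster_whisper import WhisperModel

# Load a CTranslate2 Whisper model (downloaded on first use).
model = WhisperModel("medium", device="cpu", compute_type="int8")

# transcribe() returns a generator of segments plus info about the audio.
segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")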

Using GPU without any output


These two screenshots show me using the CPU and the GPU to transcribe the same audio; the CPU run works normally, but the GPU run produces no output. My OS is Windows 11.
