systran / faster-whisper

Faster Whisper transcription with CTranslate2

License: MIT License

Python 100.00%
deep-learning inference quantization speech-recognition speech-to-text transformer whisper openai

faster-whisper's Introduction

Faster Whisper transcription with CTranslate2

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.

This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

Benchmark

Whisper

For reference, here are the time and memory usage required to transcribe 13 minutes of audio using different implementations:

Large-v2 model on GPU

Implementation Precision Beam size Time Max. GPU memory Max. CPU memory
openai/whisper fp16 5 4m30s 11325MB 9439MB
faster-whisper fp16 5 54s 4755MB 3244MB
faster-whisper int8 5 59s 3091MB 3117MB

Executed with CUDA 11.7.1 on an NVIDIA Tesla V100S.

Small model on CPU

Implementation Precision Beam size Time Max. memory
openai/whisper fp32 5 10m31s 3101MB
whisper.cpp fp32 5 17m42s 1581MB
whisper.cpp fp16 5 12m39s 873MB
faster-whisper fp32 5 2m44s 1675MB
faster-whisper int8 5 2m04s 995MB

Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.

Distil-whisper

Implementation Precision Beam size Time Gigaspeech WER
distil-whisper/distil-large-v2 fp16 4 - 10.36
faster-distil-large-v2 fp16 5 - 10.28
distil-whisper/distil-medium.en fp16 4 - 11.21
faster-distil-medium.en fp16 5 - 11.21

Executed with CUDA 11.4 on a NVIDIA 3090.

testing details (click to expand)

For distil-whisper/distil-large-v2, the WER is tested with the code sample from the link. For faster-distil-whisper, the WER is tested with the following settings:

from faster_whisper import WhisperModel

model_size = "distil-large-v2"
# model_size = "distil-medium.en"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")

Requirements

  • Python 3.8 or greater

Unlike openai-whisper, FFmpeg does not need to be installed on the system. The audio is decoded with the Python library PyAV which bundles the FFmpeg libraries in its package.
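
If you need the decoded waveform yourself, the package also exposes the PyAV-based decoder it uses internally. A minimal sketch, assuming decode_audio is exported at the package level and that model.transcribe accepts NumPy arrays (the file name is just an example):

from faster_whisper import WhisperModel, decode_audio

# Decode to a 16 kHz mono float32 NumPy array without a system-wide FFmpeg install.
audio = decode_audio("audio.mp3", sampling_rate=16000)

model = WhisperModel("base")
segments, info = model.transcribe(audio, beam_size=5)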

GPU

GPU execution requires the following NVIDIA libraries to be installed:

  • cuBLAS for CUDA 11
  • cuDNN 8 for CUDA 11

There are multiple ways to install these libraries. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.

Other installation methods (click to expand)

Use Docker

The libraries are installed in this official NVIDIA Docker image: nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04.

Install with pip (Linux only)

On Linux these libraries can be installed with pip. Note that LD_LIBRARY_PATH must be set before launching Python.

pip install nvidia-cublas-cu11 nvidia-cudnn-cu11

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`

Download the libraries from Purfview's repository (Windows & Linux)

Purfview's whisper-standalone-win provides the required NVIDIA libraries for Windows & Linux in a single archive. Decompress the archive and place the libraries in a directory included in the PATH.

Installation

The module can be installed from PyPI:

pip install faster-whisper
Other installation methods (click to expand)

Install the master branch

pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"

Install a specific commit

pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"

Usage

Faster-whisper

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Warning: segments is a generator so the transcription only starts when you iterate over it. The transcription can be run to completion by gathering the segments in a list or a for loop:

segments, _ = model.transcribe("audio.mp3")
segments = list(segments)  # The transcription will actually run here.
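
Since the segments stream lazily, they can also be consumed incrementally, for example to write subtitles while the transcription is still running. A minimal sketch (the SRT formatting helper below is hypothetical, not part of the library):

def format_timestamp(t):
    milliseconds = int(t * 1000)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    seconds, milliseconds = divmod(milliseconds, 1_000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, milliseconds)

segments, _ = model.transcribe("audio.mp3")
with open("audio.srt", "w", encoding="utf-8") as srt:
    for i, segment in enumerate(segments, start=1):  # transcription runs as we iterate
        srt.write("%d\n%s --> %s\n%s\n\n" % (
            i,
            format_timestamp(segment.start),
            format_timestamp(segment.end),
            segment.text.strip(),
        ))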

Faster Distil-Whisper

The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest distil-large-v3 checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet demonstrates how to run inference with distil-large-v3 on a specified audio file:

from faster_whisper import WhisperModel

model_size = "distil-large-v3"

model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

For more information about the distil-large-v3 model, refer to the original model card.

Word-level timestamps

segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

VAD filter

The library integrates the Silero VAD model to filter out parts of the audio without speech:

segments, _ = model.transcribe("audio.mp3", vad_filter=True)

The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the source code. They can be customized with the dictionary argument vad_parameters:

segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)

Logging

The library logging level can be configured like this:

import logging

logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

Going further

See more model and transcription options in the WhisperModel class implementation.
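
For illustration, several of these options can be combined in a single call. This is only a sketch of commonly used parameters, not an exhaustive list; check the class implementation for the full set and defaults:

segments, info = model.transcribe(
    "audio.mp3",
    language="en",                     # skip language detection when the language is known
    task="transcribe",                 # or "translate" to translate into English
    beam_size=5,
    temperature=0.0,
    initial_prompt="Glossary: CTranslate2, Whisper.",  # example prompt to bias the decoder
    condition_on_previous_text=False,  # can reduce repetition loops on some audio
    word_timestamps=True,
    vad_filter=True,
)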

Community integrations

Here is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!

  • WhisperX is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment
  • whisper-ctranslate2 is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
  • whisper-diarize is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
  • whisper-standalone-win provides standalone CLI executables of faster-whisper for Windows, Linux & macOS.
  • asr-sd-pipeline provides a scalable, modular, end-to-end multi-speaker speech-to-text solution implemented using AzureML pipelines.
  • Open-Lyrics is a Python library that transcribes voice files using faster-whisper, and translates/polishes the resulting text into .lrc files in the desired language using OpenAI-GPT.
  • wscribe is a flexible transcript generation tool supporting faster-whisper. It can export word-level transcripts, and the exported transcript can then be edited with wscribe-editor.
  • aTrain is a graphical user interface implementation of faster-whisper developed at the BANDAS-Center at the University of Graz for transcription and diarization in Windows (Windows Store App) and Linux.
  • Whisper-Streaming implements real-time mode for offline Whisper-like speech-to-text models with faster-whisper as the most recommended back-end. It implements a streaming policy with self-adaptive latency based on the actual source complexity, and demonstrates the state of the art.
  • WhisperLive is a nearly-live implementation of OpenAI's Whisper which uses faster-whisper as the backend to transcribe audio in real-time.
  • Faster-Whisper-Transcriber is a simple but reliable voice transcriber that provides a user-friendly interface.

Model conversion

When loading a model from its size such as WhisperModel("large-v3"), the corresponding CTranslate2 model is automatically downloaded from the Hugging Face Hub.

We also provide a script to convert any Whisper models compatible with the Transformers library. They could be the original OpenAI models or user fine-tuned models.

For example the command below converts the original "large-v3" Whisper model and saves the weights in FP16:

pip install "transformers[torch]>=4.23"

ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 \
    --copy_files tokenizer.json preprocessor_config.json --quantization float16
  • The option --model accepts a model name on the Hub or a path to a model directory.
  • If the option --copy_files tokenizer.json is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.

Models can also be converted from the code. See the conversion API.
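
For example, a rough sketch using CTranslate2's Transformers converter (assuming the ctranslate2.converters.TransformersConverter interface; check the conversion API documentation for the exact signature):

import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-large-v3",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("whisper-large-v3-ct2", quantization="float16")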

Load a converted model

  1. Directly load the model from a local directory:
model = faster_whisper.WhisperModel("whisper-large-v3-ct2")
  2. Upload your model to the Hugging Face Hub and load it from its name:
model = faster_whisper.WhisperModel("username/whisper-large-v3-ct2")

Comparing performance against other implementations

If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:

  • Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, model.transcribe uses a default beam size of 1 but here we use a default beam size of 5.
  • When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable OMP_NUM_THREADS, which can be set when running your script:
OMP_NUM_THREADS=4 python3 my_script.py
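
The number of threads used by faster-whisper itself can also be pinned when creating the model; a sketch assuming the cpu_threads constructor parameter of WhisperModel:

from faster_whisper import WhisperModel

# Match the thread count used by the other implementation being compared.
model = WhisperModel("small", device="cpu", compute_type="int8", cpu_threads=4)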

faster-whisper's People

Contributors

bbc-esq, bekirbakar, claytonjy, daxaxelrod, entn-at, flippfuzz, gabrielrolfsen, geekodour, gldkslfmsd, guillaumekln, hedrergudene, hoonlight, ilianp, jordimas, juergenfleiss, mahmoudashraf97, makaveli10, mayeaux, metame-none, minorjinx, nguyendc-systran, oscaarjs, otakutyrant, ozancaglayan, palladium123, purfview, sanchit-gandhi, tekacs, trungkienbkhn, zh-plus

faster-whisper's Issues

IndexError: index 1 is out of bounds for axis 0 with size 1

Got error when transcribing segments.

Traceback (most recent call last):
  File "E:\AI\faster-whisper\trans.py", line 102, in <module>
    gensrt(segments, output_file, True)
  File "E:\AI\faster-whisper\trans.py", line 55, in gensrt
    for i, segment in enumerate(segments):
  File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 389, in generate_segments    self.add_word_timestamps(
  File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 547, in add_word_timestamps
    alignment = self.find_alignment(tokenizer, text_tokens, mel, num_frames)
  File "E:\AI\faster-whisper\faster_whisper\transcribe.py", line 617, in find_alignment
    start_times = jump_times[word_boundaries[:-1]]
IndexError: index 1 is out of bounds for axis 0 with size 1

Segmentation fault on Mac M1 during conversion

I have been unable to convert the model
Regardless of whether I have tried with or without quantization, or different models - unfortunately, I have had no success.

192:faster-whisper$ ct2-transformers-converter --model openai/whisper-small --output_dir output
Segmentation fault: 11

192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny --output_dir output
Segmentation fault: 11

192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny.en --output_dir output
Downloading: 100%|██████████| 1.92k/1.92k [00:00<00:00, 486kB/s]
Downloading: 100%|██████████| 151M/151M [00:04<00:00, 37.2MB/s]
Segmentation fault: 11

192:faster-whisper$ /usr/local/Cellar/python@3.9/3.9.16/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
^^ this one just freezes

192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny --output_dir output --quantization float16
Segmentation fault: 11

192:faster-whisper$ ct2-transformers-converter --model openai/whisper-small --output_dir output --quantization float16
Segmentation fault: 11

192:faster-whisper$ ct2-transformers-converter --model openai/whisper-tiny.en --output_dir output --quantization float16
Segmentation fault: 11

Memory spike at the end of transcription

Hello, great work! I experimented a bit with this and came across an anomaly. While transcribing the George Bush Columbia talk, the memory stays around 2.5GB, but then I encounter a sudden spike beyond 3.5GB (VRAM when using the GPU, RAM when using the CPU) with int8, after all spoken text was already out of the model. Is it due to silence at the end or some additional operations? Would you know why this happens and how to prevent it?

!wget https://upload.wikimedia.org/wikipedia/commons/1/1f/George_W_Bush_Columbia_FINAL.ogg
model_path = "whisper-large-v2-ct2/"
model = WhisperModel(model_path, device="cuda", compute_type="int8",)
segments, info = model.transcribe("./George_W_Bush_Columbia_FINAL.ogg", beam_size=1, language="en", condition_on_previous_text=False)

The output:

...
[183.06s -> 185.82s]  are safely home.
[185.82s -> 192.62s]  May God bless the grieving families and may God continue to bless America.
['transcribe /home/ubuntu/src/faster-whisper/run.py:10', 'time_delta', 44.965]
Traceback (most recent call last):
  File "/home/ubuntu/src/faster-whisper/run.py", line 16, in <module>
    for segment in segments:
  File "/home/ubuntu/src/faster-whisper/faster_whisper/transcribe.py", line 285, in generate_segments
    result, avg_log_prob, temperature = self.generate_with_fallback(
  File "/home/ubuntu/src/faster-whisper/faster_whisper/transcribe.py", line 461, in generate_with_fallback
    result = self.model.generate(
RuntimeError: CUDA failed with error out of memory

Different output for `medium` between openai/whisper and this

Hi,

First, let me start by saying great job. This is awesome! It's crazy how much faster this is than openai/whisper. I really appreciate the effort here.

I am running into a few issues and just want to better understand.

I am on the latest ctranslate2==3.6.0 and am using the medium model both with a beam_size=5 set. On faster-whisper, I get:

 ______ available?î Yeah. Speaker? ______, this is

And on openai/whisper I get:

I am using [NAME] here.

 __________ Hello. __________ Hello, is [NAME] available? __________ Yes, speaking. __________ Hi,

I have other examples of this as well but need to redact the text before I can post them.

I've also noticed non-English characters in the translation as well for faster-whisper for English-only audio. And even on 3.6.0, I still appear to get the beginning of my audio chopped off as well (but not all the time). It's seemingly random.

Is it normal to have some differences? Is there some config difference I am missing between the two?

While the speed increases are great, the inconsistencies are enough that we can't really use this over openai/whisper for our tasks. Anything I can do to help debug, let me know.

As a side note, I've been testing medium.en and the random characters seem to not happen and accuracy appears to be better for faster-whisper compared to the multilingual model for faster-whisper, whereas medium for openai/whisper appears to work fine.

Thanks in advance!

initial_prompt?

In the transcribe.py code I don't see the initial_prompt parameter.
Is it somewhere?
Can it be added?

The timestamps of the texts are not accurate

I very much appreciate your project; I was just comparing it with whisper.cpp and found the timestamps are not accurate, while whisper.cpp's seem accurate. The model is whisper-large-v2.

whisper.cpp
[00:00:00.000 --> 00:00:03.440] [MUSIC PLAYING]
[00:00:03.440 --> 00:00:03.940] Impossible.
[00:00:03.940 --> 00:00:10.440] A woman leading a man's army.
[00:00:10.440 --> 00:00:16.440] It is my duty to fight for the kingdom.
[00:00:16.440 --> 00:00:25.940] The girl who has come to save the dynasty.
[00:00:27.380 --> 00:00:30.380] [SCREAMING]
[00:00:30.380 --> 00:00:35.380] You will die pretending to be something you are not.
[00:00:35.380 --> 00:00:40.380] Get here, I stand.
[00:00:40.380 --> 00:00:48.380] I'm Hua Mulan.
[00:00:48.380 --> 00:00:51.380] I will bring honor to us all.
[00:00:51.380 --> 00:00:53.880] Disney's Mulan, rated PG-13.
[00:00:53.880 --> 00:00:55.880] Streaming September 4th.
[00:00:55.880 --> 00:00:58.380] Exclusively available to Disney+ subscribers
[00:00:58.380 --> 00:01:00.740] with Premier Access.

faster-whisper
[0.00s -> 5.00s] Impossible.
[5.00s -> 11.00s] A woman leading a man's army.
[11.00s -> 17.00s] It is my duty to fight for the kingdom.
[17.00s -> 26.00s] A girl who has come to save the dynasty.
[26.00s -> 39.00s] You will die pretending to be something you're not.
[39.00s -> 47.00s] Yet here I stand.
[47.00s -> 51.00s] I'm Hua Mulan. I will bring honor to us all.
[51.00s -> 56.00s] Disney's Mulan. Rated PG-13. Streaming September 4th.
[56.00s -> 74.00s] Exclusively available to Disney Plus subscribers with Premiere Access.

Here is my test audio link. Please try it, you will find the timestamps are incorrect.
https://stream.lestream.cn/source.mp3

TypeError when providing language to model

Hey, when I am providing language to transcribe method like segments, info = model.transcribe(file_name, language="english", beam_size=5)

i am getting the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 25
     
     20  segments, info = model.transcribe(file_name, language="english", beam_size=5)
     21 
     23 print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
---> 25 for segment in segments:
     26     print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
   (...)

File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:187, in WhisperModel.generate_segments(self, features, language, options)
    182 def generate_segments(self, features, language, options):
    183     tokenized_segments = self.generate_tokenized_segments(
    184         features, language, options
    185     )
--> 187     for start, end, tokens in tokenized_segments:
    188         text = self.decode_text_tokens(tokens)
    189         if not text.strip():

File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:224, in WhisperModel.generate_tokenized_segments(self, features, language, options)
    216 previous_tokens = all_tokens[prompt_reset_since:]
    217 prompt = self.get_prompt(
    218     language,
    219     previous_tokens,
    220     task=options.task,
    221     without_timestamps=options.without_timestamps,
    222 )
--> 224 result, temperature = self.generate_with_fallback(segment, prompt, options)
    226 if (
    227     result.no_speech_prob > options.no_speech_threshold
    228     and result.scores[0] < options.log_prob_threshold
    229 ):
    230     offset += segment.shape[-1]

File /usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py:315, in WhisperModel.generate_with_fallback(self, segment, prompt, options)
    309     kwargs = {
    310         "beam_size": options.beam_size,
    311         "patience": options.patience,
    312     }
    314 final_temperature = temperature
--> 315 result = self.model.generate(
    316     features,
    317     [prompt],
    318     max_length=max_length,
    319     return_scores=True,
    320     return_no_speech_prob=True,
    321     **kwargs,
    322 )[0]
    324 tokens = result.sequences_ids[0]
    325 text = self.decode_text_tokens(tokens)

TypeError: generate(): incompatible function arguments. The following argument types are supported:
    1. (self: ctranslate2._ext.Whisper, features: ctranslate2._ext.StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_no_speech_prob: bool = False, sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[ctranslate2._ext.WhisperGenerationResult], List[ctranslate2._ext.WhisperGenerationResultAsync]]

Invoked with: <ctranslate2._ext.Whisper object at 0x7f19f702bf30>, <ctranslate2._ext.StorageView object at 0x7f19fef2df70>, [[50258, None, 50359]]; kwargs: max_length=448, return_scores=True, return_no_speech_prob=True, beam_size=5, patience=1

transcription speed blowouts

Thanks again for this project - for context I'm testing it transcribing a live public radio stream, appreciate the rapid speed and low memory as it's most useful providing near-live transcription.
The radio stream is maybe 60% voice on studio mic, 30% phone voice, 5% voice talking over music, and 5% music.
I have a simple python script running 30s chunks from the live radio stream into faster-whisper continuously.
Using base model, on a cheap VPS with just 2GB RAM - I'm sure I could get better results with a higher spec machine but it's a proof of concept - would be useful to run across a large number of different streams here.

Most 30s chunks take between 6-8s to transcribe, which is perfect, but roughly 1 in 10 can blow out to 20-50s.

I haven't quite figured out what causes it; I wonder if it's when the 30s chunk has a mix of music and talk, or a mix of different audio sources. Could you shed light on the reason from your experience? Would a larger model stop the blowouts?

decode_audio error

❯ python3 transcribe.py
Traceback (most recent call last):
  File "/home/user/git/faster-whisper/transcribe.py", line 13, in <module>
    segments, info = model.transcribe("audio.opus", beam_size=5)
  File "/home/user/git/faster-whisper/faster_whisper/transcribe.py", line 156, in transcribe
    audio = decode_audio(
  File "/home/user/git/faster-whisper/faster_whisper/audio.py", line 27, in decode_audio
    fifo.write(new_frame)
  File "av/audio/fifo.pyx", line 25, in av.audio.fifo.AudioFifo.write
  File "av/audio/fifo.pyx", line 90, in av.audio.fifo.AudioFifo.write
  File "av/error.pyx", line 336, in av.error.err_check
av.error.ValueError: [Errno 22] Invalid argument

The target file is an Opus file. Is mp3 the only supported filetype? What codecs should the file be?

edit: it seems to work on smaller/shorter files. The one with the error is a ~12 hour video I wanted to test on.

Has anybody shared the converted models?

Will this project support ".pt" files instead of the bin models hosted on Hugging Face? Or could this project store converted models in some public storage?

Because the bin models on Hugging Face are very large and my network is pretty slow, it is a big challenge for me to download models of this size.

Incompatible with CUDA v12.1

OS: Arch Linux x86_64, python-numpy-1.24.2, CUDA v12.1, all other system packages at latest versions.
Using a GPU for transcription

Steps to reproduce

  1. Install cuda-12.1.0-1-x86_64.pkg.tar.zst via pacman
  2. Attempt to run transcription using the sample Python snippet provided

Expected behaviour

Transcription begins as with CUDA v11.8 installed

Actual behaviour

An error message is shown in the output:

Traceback (most recent call last):
  File "faster-whisper/faster.py", line 15, in <module>
    segments, info = model.transcribe(sys.argv[1], beam_size=5)
  File "faster_whisper/transcribe.py", line 207, in transcribe
    results = self.model.detect_language(input)
RuntimeError: Library libcublas.so.11 is not found or cannot be loaded

Workaround

cd /var/cache/pacman/pkg
# downgrade CUDA and dependecies to CUDA v11.8, using older package files
sudo pacman -U cuda-11.8.0-1-x86_64.pkg.tar.zst cudnn-8.6.0.163-1-x86_64.pkg.tar.zst

Any Colab notebook to test?

Is there any Google Colab notebook for this implementation? It would be very good for people who have no access to GPUs.

How to assign more CPU on this python script

Is there any possible way to assign more CPU to this script? Honestly, it's super fast on my Windows machine. However, I noticed that it only uses maybe 60%-70% CPU, so is there any way to make full use of my CPU? Or is there any other way to improve the speed without losing quality?

Exception: Model "openai/whisper-tiny.en" on the Hub doesn't have a tokenizer

Currently hitting this exception in this block of code. Looks like huggingface got rid of tiny.

tokenizer_file = os.path.join(model_path, "tokenizer.json")
if os.path.isfile(tokenizer_file):
    self.hf_tokenizer = tokenizers.Tokenizer.from_file(tokenizer_file)
else:
    self.hf_tokenizer = tokenizers.Tokenizer.from_pretrained(
        "openai/whisper-tiny" + ("" if self.model.is_multilingual else ".en")
    )

Speed tests and comparison to other Whisper versions

Just wanted to say thanks for this great port of Whisper to CTranslate2.

I've done some tests and compared it to other ports like a TFlite version and a C++ version on Raspberry Pi 4. You can find the results >here<.

In conclusion it is as fast as the Tflite version, but smaller and has the better API right now 🙂 👍.

What files need to be prepared to convert my own model

Hugging Face provides fine-tuning code: https://huggingface.co/blog/fine-tune-whisper
These files are obtained after fine-tuning:
image

When I use the following command, I get the following error

ct2-transformers-converter --model ./checkpoint-90 --output_dir ./tmp

image

I compared with the Hugging Face model and found that there are many files missing: https://huggingface.co/openai/whisper-small/tree/main
image

Please tell me how to get the missing files

Seek help

Support float32?

16:52:07 kris ~/faster-whisper $ python test_transcription.py
Traceback (most recent call last):
  File "/home/kris/faster-whisper/test_transcription.py", line 8, in <module>
    model = WhisperModel(model_path, device="cpu", compute_type="float16") # compute_type="float16"
  File "/home/kris/faster-whisper/faster_whisper/transcribe.py", line 71, in __init__
    self.model = ctranslate2.models.Whisper(
ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.

I get somewhat the same error with whisper; however, it automatically changes to float32. Is there a possible fix for this?

Inference on long files

Hello,

Thank you for this great library!
Is there any way we can chunk the initial audio into shorter samples, let's say 50 seconds each, run inference on those, and end up with a final reconstruction.
I came across this article and I wonder if it's possible to get it working here.
Any ideas if this is possible?

Slower than original Whisper on ARM 64bit (Raspberry Pi 4/Orange Pi 5)

Hi @guillaumekln ,

we've discussed this in #9 a bit, but I think it's worth creating an extra issue to keep track of it.
In my tests on Raspberry Pi 4 and Orange Pi 5 Whisper.cpp is actually slower than the original Whisper. Here is an excerpt of results:

Raspberry Pi 400

Test date: 2023.02.17

Engine Model File Threads Stream Time RTF Quality
Whisper original tiny 1 4 - 5.9s 0.54 perfect
Whisper original tiny 2 4 - 4.3s 1.19 perfect
Whisper Cpp ggml-tiny 1 4 - 9.1s 0.83 perfect
Whisper Cpp ggml-tiny 2 4 - 8.6s 2.39 perfect
Whisper Cpp (BLAS) ggml-tiny 1 4 - 8.4s 0.76 perfect
Whisper Cpp (BLAS) ggml-tiny 2 4 - 8.0s 2.22 perfect
Whisper CT2 whisper-tiny-ct2 1 4 - 3.9s 0.36 perfect
Whisper CT2 whisper-tiny-ct2 2 4 - 3.2s 0.90 perfect

Orange Pi 5

Test date: 2023.02.19

Engine Model File Threads Stream Time RTF Quality
Whisper original tiny 1 4 - 3.0s 0.27 perfect
Whisper original tiny 2 4 - 1.9s 0.53 perfect
Whisper Cpp (BLAS) ggml-tiny 1 4 - 3.7s 0.34 perfect
Whisper Cpp (BLAS) ggml-tiny 2 4 - 3.5s 0.97 perfect
Whisper CT2 whisper-tiny-ct2 1 4 - 1.3s 0.12 perfect
Whisper CT2 whisper-tiny-ct2 2 4 - 1.4s 0.39 perfect

I've repeated the tests yesterday on Orange Pi 5 with similar results.

beam_size default setting

Hey great job on this package. Already enjoying the improvements.

I found in your README the following:

Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, model.transcribe uses a default beam size of 1 but here we use a default beam size of 5.

A link below shows the default beam size from openAI to be 5 as well.

https://github.com/openai/whisper/blob/a6b36ede1f060860d5676a543176a6439d91eae6/whisper/transcribe.py#L272

Two different errors with converting MP3s to WAV

Hi again!

I am running into two different issues consistently, mainly with av but I am not sure if you've seen this before.

av.error.InvalidDataError: [Errno 1094995529] Invalid data found when processing input

and

invalid new backstep -1

with libav.mp3float

Is there a newer version of av to use or should I just do the conversion myself and pass down .wav? Or is it possible my version of ffmpeg is older since av is just a binding?

inference slowed down with `word_timestamps=True`

Hi,
I have a simple script to run inference on a wav file. I noticed when word_timestamps=True, the processing time is much longer.
I'm using the same wav file in each of these cases, you can see duration below for each:

model = WhisperModel(model_path, "cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe(input, word_timestamps=True, beam_size=1)
OMP_NUM_THREADS=4 python inference.py
duration: 310.73524594306946s
model = WhisperModel(model_path, "cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe(input, word_timestamps=False, beam_size=1)
OMP_NUM_THREADS=4 python inference.py
duration: 225.86893439292908s

Is this expected? Or is there some optimization that could be done for word-level timestamps.

Repeated text within and between segments

Sometimes, a sentence repeats itself multiple times at the end of a segment, and may continue to repeat in subsequent segments.

This was a known issue of openai/whisper (upstream issue openai/whisper#977 and openai/whisper#1059), and may be fixed by openai/whisper@38f2f4d

When I use "faster-whisper", I encountered the same sentence repetition. I found it's also reported on #35 (comment)

Could you please check if this commit openai/whisper@38f2f4d can/should be ported here?

Feature : Add support for VAD filter

Thank you for releasing the code

Since this implementation requires less memory than other implementations,
adding VAD (voice activity detection) should be a good fit.
Voice activity detection makes Whisper more accurate, especially for non-English audio.

(openai/whisper#29 (comment))

Would it be possible to add this?
thank you

Requested float16 compute type, but the target device or backend do not support efficient float16 computation.

I recently tried this wonderful tool on the CPU of my Windows 10 machine and got quite good results. But when I tried on GPU via model = WhisperModel(model_path, device="cuda", compute_type="float16") I received the following error: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.
I have GTX1050 Ti and main driver is 31.0.15.1694. How can I fix this error and run on my GPU card?

Crosstalk when multithreading.

I'm trying to reuse the same CTranslate2 model instance to handle multiple different 30s chunks of audio from different streams. In order to prepare for that, I lifted the WhisperModel.model , WhisperModel.tokenizer out into a core instance so I can initialize a single WhisperModel from faster-whisper per worker but have them share the CTranslate2 Whisper Model.

I'm also handling prompt in the .transcribe() call.

However, I'm getting wildly different results where wording from one thread is crossing over to the other thread. This continued even after I implemented a lock around WhisperModel's transcribe call and the subsequent segments iterator. It's as if the CTranslate2 Whisper model has internal state and I'm failing to clear it.

Does anyone have suggestions for how this may be?

My code for reference:

def translationWorker(work_queue, language, primer: str, persist, task, core: WhisperModelCore, start_ts: int, id: str):
    # model = whisper.load_model("large")
    options = {
        "language": language,
        "task": task if task is not None else ('translate' if language != 'en' else 'transcribe')
    }
    stripped_primer = ""
    if (primer is not None and len(primer) > 3):
        stripped_primer = primer.strip() + " "
        options["initial_prompt"] = stripped_primer
    model = WhisperModel(core)

    first_send = True;

    print("OpenAI Whisper Ready")
    while True:
        audio_chunks = work_queue.get()
        if (audio_chunks == _DONE):
            print("Finishing up over here too")
            return "ok"
        first_chunk: ChunkRecord
        last_chunk: ChunkRecord
        first_chunk, last_chunk, audio = audio_chunks
        # Transcribe audio into subtitles
        unique = [];
        with core.lock:
            out, b = model.transcribe(audio, **options) 
            # out is of type Segment

            for seg in out:
                text: str
                start = seg.start;
                end = seg.end;
                text = seg.text;
                stripped_text = text.strip()
                print(
                    colored("[" + str(first_chunk.chunk_id + start) + ":" +
                            str(first_chunk.chunk_id + end) + "]", "dark_grey"),
                    " :: ", colored(text, "green"))
                persist.append({
                    "relstart": start,
                    "relend": end,
                    "start": first_chunk.chunk_id + start,
                    "end": first_chunk.chunk_id + end,
                    "text": stripped_text})
                if(len(unique) == 0 or unique[-1]["text"] != stripped_text):
                    unique.append({
                        "relstart": start,
                        "relend": end,
                        "start": first_chunk.chunk_id + start,
                        "end": first_chunk.chunk_id + end,
                        "text": stripped_text})
                else:
                    unique[-1]["end"] = first_chunk.chunk_id + end;
    
        # process UNIQUE into pieces to prompt and prefix.
        local_overlap_split_at = OVERLAP_LATENCY if OVERLAP_LATENCY > 0 and OVERLAP_LATENCY < EXPECTED_CHUNK_DURATION else EXPECTED_CHUNK_DURATION
            # context for the next run is going to be between 0 and OVERLAP_LATENCY. 
        options["initial_prompt"] = stripped_primer + " ".join(map(
            lambda x: x["text"],
            filter(lambda x: x["relend"] < local_overlap_split_at, unique)))
        # prefix for the next run will be between local_overlap_split_at til the end.
        options["prefix"] = " ".join(map(
            lambda x: x["text"],
            filter(lambda x: x["relend"] >= local_overlap_split_at, unique)))
        send_candidates = list(map(lambda u: {
            "timestamp": (start_ts + u["start"]) * 1000,
            "text": u["text"],
            "duration": int((u["end"] - u["start"])*1000.0)
        }, filter(lambda u: first_send or u['relstart'] > (EXPECTED_CHUNK_DURATION - local_overlap_split_at), unique)))
        # context is going to be the OVERLAP level.
        send_translation(id, 'tl' if options["task"] == 'translate' else 'tc', send_candidates)

        # print(out)
        time.sleep(0.0001)

word_timestamps on Faster Whisper

Hello, I would like to know if it's possible to add the "--word_timestamps" option to Faster Whisper now, since this new option has been added to the official Whisper repository. It would be very helpful if this option could be included in Faster Whisper. Thank you in advance.

Shorter segments?

Would it be possible to produce shorter segments? (some are way too long)

Different output from mono audio and stereo audio

First, I would like to say I am very grateful for this project. I have a fairly unique problem, namely that the transcript results obtained from mono audio and stereo audio are quite different. The transcript results from stereo audio are better than from mono audio, even though we know that before the transcription process the audio is converted to mono.

Parameters used:
compute_type : int8
model : large
device: gpu
beam_size: 5
language: 'id'

Link Audio:
In Here

How to wait for function to finish

Hello, I have never used async in Python before. I just wanted to ask: is it possible to get the result of the transcription once it's done, instead of asynchronously?

instead of this

    def generate_segments(self, features, language, options):
        tokenized_segments = self.generate_tokenized_segments(
            features, language, options
        )

        for start, end, tokens in tokenized_segments:
            text = self.decode_text_tokens(tokens)
            if not text.strip():
                continue

            yield Segment(
                start=start,
                end=end,
                text=text,
            )

how would I go about returning an array of all segments

    def generate_segments(self, features, language, options):
        tokenized_segments = self.generate_tokenized_segments(
            features, language, options
        )
        
        res = []
        for start, end, tokens in tokenized_segments:
            text = self.decode_text_tokens(tokens)
            if not text.strip():
                continue

            res.append(Segment(
                start=start,
                end=end,
                text=text,
            ))
       
        return res

Why is gpu slower than cpu

GPU: A100
wav:test.zip
code:


from tqdm import tqdm
import time
from faster_whisper import WhisperModel
import os
os.environ["OMP_NUM_THREADS"] = "4"

audio_path = 'test.wav'

gpu_model_path = "whisper-large-v2-ct2-float16/"
cpu_model_path = "whisper-large-v2-ct2-int8/"

# or run on GPU with INT8
gpu_model = WhisperModel(gpu_model_path, device="cuda", compute_type="float16")
# or run on CPU with INT8
cpu_model = WhisperModel(cpu_model_path, device="cpu", compute_type="int8")

startTime = time.time()
print('Transcribing with gpu model')
segments, info = gpu_model.transcribe(audio_path, beam_size=5, language='zh')
intermediateTime = time.time()
print('gpu model: %s' % (intermediateTime - startTime))

print('-'*100)

startTime = time.time()
print('Transcribing with cpu model')
segments, info = cpu_model.transcribe(audio_path, beam_size=5, language='zh')
intermediateTime = time.time()
print('cpu model: %s' % (intermediateTime - startTime))

log:

Transcribing with gpu model
gpu model: 0.012365579605102539
----------------------------------------------------------------------------------------------------
Transcribing with cpu model
cpu model: 0.006761789321899414

nvidia-smi:
image

word-level timestamps

Hi, I really appreciate you sharing this implementation.
I found it to be very fast with accurate results.
I do not see word-level timestamps in the result. Are word level timestamps possible?

Apple Neural Engine on M1 Macs?

First of all, this is amazing work. It mops the floor with vanilla Whisper on Apple M1 chips.

However, I noticed that it only supports float32 types on Apple M1. All types seem to fall back to that, presumably due to ctranslate2. I read that the Apple Neural Engine supports fp16, int16 and int8 types. Any chance we can support those via ANE? That might take the performance to an even higher level.

thanks!

RuntimeError: No SGEMM backend on CPU

I am testing the branch word-level-timestamps dc780dc

I installed your fork of CTranslate2 https://github.com/guillaumekln/CTranslate2.git on branch whisper-align.

When I try to run inference with word timestamps, I get:

Traceback (most recent call last):
[...]
    segments, info = model.transcribe(audio)
  File "/usr/local/lib/python3.10/dist-packages/faster_whisper/transcribe.py", line 206, in transcribe
    results = self.model.detect_language(input)
RuntimeError: No SGEMM backend on CPU

I have seen this issue OpenNMT/CTranslate2#646 but in my case it looks different.
I did the cmake with default options (also tried explicit option -DWITH_MKL=ON) and I have MKL

-- Found MKL include directory: /opt/intel/oneapi/mkl/latest/include
-- Found MKL library directory: /usr/lib/x86_64-linux-gnu

Do you see what I can be missing?

Word confidence scores

Thanks for the excellent work on this package. A quick query: I've been looking at the following suggestions on how to get word confidence from vanilla Whisper in greedy mode. Could you provide some pointers on how to implement this in the faster-whisper / CTranslate2 implementation, as the code here deviates significantly? github.com/openai/whisper/discussions/284

Support for variable size chunks

I can see that now only 30 second chunks are supported by the CTranslate2 model. Shorter chunks are padded to 30s such that model.generate can accept exclusively [batch_size, 80, 3000] inputs.

In some real-time applications, shorter chunks may be used, and the original Whisper model supports shorter chunks despite being trained on 30s. Would it be possible to allow shorter chunks for faster inference, in contrast to always padding to 30s?

Translation feature in Faster-Whisper not translating language fully for certain audio/video files

Description

I noticed that the translation feature in Faster-Whisper does not seem to translate the language fully for certain audio/video files. It appears to (at random) only translate parts of the language into English, whereas Whisper is mostly capable of translating the entire language. Is there a reason for this difference in translation capabilities?

Reproduction Steps

  1. Use the following code to transcribe a video and translate it to English using Faster-Whisper:
from faster_whisper import WhisperModel
import torch

model_path = "whisper-large-v2"

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "int8"

# Run on GPU with INT8 or FP16
model = WhisperModel(model_path, device=device, compute_type=compute_type)   

# Transcribe video and translate to English
with torch.no_grad():
    segments, info = model.transcribe("test.mp4", beam_size=5, task="translate")

# Print transcription and translation segments
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("\n[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
  1. Replace "test.mp4" with the video file provided by me that was used to test the translation feature.

Expected Behavior

The entire language in the video file should be translated into English.

Actual Behavior

The translation feature only translates/transcribes the English portion of the video.

Additional Information

I have attached the video file used to test the translation feature.

arabic.mp4

A part of the beginning of my audio was cut

Hello, first of all thank you very much for your work on this project, it really was much faster and consumed less RAM and VRAM.
I'm testing and unfortunately a significant part of my audio has been cut.
My audio is in Portuguese and is 13 minutes long; apparently the problem only occurred at the beginning of it.
Is there a way to solve this problem?
I used the following code:

from faster_whisper import WhisperModel

model_path = "whisper-medium-ct2/"

# Run on GPU with FP16
model = WhisperModel(model_path, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_path, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_path, device="cpu", compute_type="int8")

segments, info = model.transcribe("jota.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%ds -> %ds] %s" % (segment.start, segment.end, segment.text))

The result I got running the standard version of Whisper on the same medium model:

[00:00.000 --> 00:03.500]  Afinal de contas, imprimir dinheiro gera ou não gera inflação?
--------------------- CUT ---------------------
[00:03.500 --> 00:05.500]  É isso que a gente vai responder neste vídeo.
[00:05.500 --> 00:10.500]  Música
[00:10.500 --> 00:13.500]  Muito bem, todos aqueles que estão chegando agora aqui no canal, meu nome é Fernando Urch,
[00:13.500 --> 00:16.500]  aqui a gente fala de economia, mercados e investimentos, se vocês gostarem do conteúdo,
[00:16.500 --> 00:21.000]  considerem se inscrever, ativando o sininho aqui embaixo e também compartilhando este vídeo.
[00:21.000 --> 00:24.500]  Pois o assunto de inflação é recorrente aqui no canal pela sua importância,
[00:24.500 --> 00:30.500]  o impacto que tem na nossa vida financeira, profissional, na economia, na vida em sociedade.
--------------------- CUT ---------------------
[00:30.500 --> 00:34.500]  E o debate em torno da relação entre impressão de moeda e inflação,
[00:34.500 --> 00:38.500]  ele ressurge de tempos em tempos, como foi lá no início da pandemia,
[00:38.500 --> 00:42.500]  quando muitos economistas, banqueiros centrais, políticos,
[00:42.500 --> 00:47.500]  de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[00:47.500 --> 00:50.500]  afirmavam categoricamente que imprimir dinheiro
[00:50.500 --> 00:54.500]  não geraria inflação naquele momento, naquelas circunstâncias.
[00:54.500 --> 00:57.500]  E a verdade é que não é tão simples responder essa pergunta,
[00:57.500 --> 01:02.500]  porque imprimir dinheiro não necessariamente vai gerar inflação,
[01:02.500 --> 01:05.500]  depende de outros fatores, depende das circunstâncias.
[01:05.500 --> 01:09.500]  Mas sim que imprimir dinheiro é sempre um fator inflacionário.

The result I got running this faster version of Whisper:

Detected language 'pt' with probability 0.996094
[0s -> 3s]  Afinal de contas, imprimir dinheiro gera ou não gera inflação?
[30s -> 36s]  E o debate em torno da relação entre impressão de moeda e inflação resurge de tempos em tempos,
[36s -> 42s]  como foi lá no início da pandemia, quando muitos economistas, banqueiros centrais, políticos,
[42s -> 47s]  de vários espectros ideológicos, esquerda e direita, com raríssimas exceções,
[47s -> 54s]  afirmavam categoricamente que imprimir dinheiro não geraria inflação naquele momento, naquelas circunstâncias.
[54s -> 58s]  E a verdade é que não é tão simples responder essa pergunta, porque
[58s -> 65s]  imprimir dinheiro não necessariamente vai gerar inflação, depende de outros fatores, depende das circunstâncias.
[65s -> 69s]  Mas sim que imprimir dinheiro é sempre um fator inflacionário.

As you can see there was a part cut off at the beginning of my audio, in case you want to test my audio to see if you get my results: https://www.dropbox.com/s/m0q30hmzbx6mvt2/jota.mp3?dl=1
And another question, is it possible to get the return as an srt or vtt file, like the standard Whisper?
Thank you very much.

ctranslate2 version 3.9.0 has an error

Hi,

ctranslate2 version 3.9.0 has the error below; I solved the problem by pinning version 3.8.* in requirements.txt.

Traceback (most recent call last):
  File "/usr/local/bin/ct2-transformers-converter", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 863, in main
    converter.convert_from_args(args)
  File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
    return self.convert(
  File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/converter.py", line 89, in convert
    model_spec = self._load()
  File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 99, in _load
    spec = loader(model, tokenizer)
  File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 154, in __call__
    self.set_config(spec.config, model, tokenizer)
  File "/usr/local/lib/python3.9/dist-packages/ctranslate2/converters/transformers.py", line 584, in set_config
    range(config.decoder_layers // 2, config.decoder_layers),
AttributeError: 'WhisperConfig' object has no attribute 'decoder_layers'

How to select a GPU?

Whisper is using model="cuda:0" or model="cuda:1"
With faster-whisper, I get this error:
ValueError: unsupported device cuda:1

TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'

Description:
While running the code, I encountered an error with the following message: "TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'". The error occurs when trying to transcribe an audio file with the 'word_timestamps' argument set to True.

Steps to Reproduce:

  1. Install the required dependencies for the code snippet: faster_whisper and torch.
  2. Download the pre-trained WhisperModel from the given model_path.
  3. Run the code snippet with an audio file named "audio.mp3".

Expected Result:
The code should transcribe the audio file and print the start and end times of each word in the audio file.

Actual Result:
The code throws a TypeError with the message "WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'".

Error Message:
TypeError: WhisperModel.transcribe() got an unexpected keyword argument 'word_timestamps'

Code Snippet:

from faster_whisper import WhisperModel
import torch

model_path = "whisper-large-v2-ct2/"

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "int8"

model = WhisperModel(model_path, device=device, compute_type=compute_type)   

# Transcribe video and translate to English
with torch.no_grad():
    segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))

Zaz frances wrong

Hello!
Everything is fine, superb! I really like your code on my old first-generation Intel Core i3 (no cpp version works on such an old CPU),
but
https://www.youtube.com/results?search_query=zaz+belle++live+
there is a strange problem with Zaz.
With whisper it's OK, no problem with the file; I downloaded it again, but it's absolutely the same.
But with faster-whisper something is wrong.
It's not exactly wrong recognition, it's something with a time shift, etc.
Also it's not a problem with French - other Zaz songs are fine.
Thank you very much.
