
insanely-fast-whisper's Introduction

Insanely Fast Whisper

An opinionated CLI to transcribe audio files w/ Whisper on-device! Powered by 🤗 Transformers, Optimum & flash-attn

TL;DR - Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds - with OpenAI's Whisper Large v3. Blazingly fast transcription is now a reality! ⚡️

pipx install insanely-fast-whisper==0.0.15 --force

Not convinced? Here are some benchmarks we ran on an NVIDIA A100 (80GB) 👇

| Optimisation type | Time to transcribe (150 mins of audio) |
|---|---|
| large-v3 (Transformers) (fp32) | ~31 min (31 min 1 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + bettertransformer) | ~5 min (5 min 2 sec) |
| large-v3 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~2 min (1 min 38 sec) |
| distil-large-v2 (Transformers) (fp16 + batching [24] + bettertransformer) | ~3 min (3 min 16 sec) |
| distil-large-v2 (Transformers) (fp16 + batching [24] + Flash Attention 2) | ~1 min (1 min 18 sec) |
| large-v2 (Faster Whisper) (fp16 + beam_size [1]) | ~9 min (9 min 23 sec) |
| large-v2 (Faster Whisper) (8-bit + beam_size [1]) | ~8 min (8 min 15 sec) |

P.S. We also ran the benchmarks on a Google Colab T4 GPU instance!

P.P.S. This project originally started as a way to showcase benchmarks for Transformers, but has since evolved into a lightweight CLI for people to use. It is purely community-driven: we add whatever the community has a strong demand for!

🆕 Blazingly fast transcriptions via your terminal! ⚡️

We've added a CLI to enable fast transcriptions. Here's how you can use it:

Install insanely-fast-whisper with pipx (pip install pipx or brew install pipx):

pipx install insanely-fast-whisper

โš ๏ธ If you have python 3.11.XX installed, pipx may parse the version incorrectly and install a very old version of insanely-fast-whisper without telling you (version 0.0.8, which won't work anymore with the current BetterTransformers). In that case, you can install the latest version by passing --ignore-requires-python to pip:

pipx install insanely-fast-whisper --force --pip-args="--ignore-requires-python"

If you're installing with pip, you can pass the argument directly: pip install insanely-fast-whisper --ignore-requires-python.

Run inference from any path on your computer:

insanely-fast-whisper --file-name <filename or URL>

Note: if you are running on macOS, you also need to add the --device-id mps flag.
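
For example, a run on an Apple Silicon Mac might look like this (audio.mp3 is a placeholder file name):

insanely-fast-whisper --file-name audio.mp3 --device-id mps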

🔥 You can run Whisper-large-v3 w/ Flash Attention 2 from this CLI too:

insanely-fast-whisper --file-name <filename or URL> --flash True 

🌟 You can run distil-whisper directly from this CLI too:

insanely-fast-whisper --model-name distil-whisper/large-v2 --file-name <filename or URL> 

Don't want to install insanely-fast-whisper? Just use pipx run:

pipx run insanely-fast-whisper --file-name <filename or URL>

Note

The CLI is highly opinionated and only works on NVIDIA GPUs & Mac. Make sure to check out the defaults and the list of options you can play around with to maximise your transcription throughput. Run insanely-fast-whisper --help or pipx run insanely-fast-whisper --help to get all the CLI arguments along with their defaults.

CLI Options

The insanely-fast-whisper repo provides all-round support for running Whisper in various settings. Note that, as of 26th Nov, insanely-fast-whisper works on both CUDA- and mps-enabled (Mac) devices.

  -h, --help            show this help message and exit
  --file-name FILE_NAME
                        Path or URL to the audio file to be transcribed.
  --device-id DEVICE_ID
                        Device ID for your GPU. Just pass the device number when using CUDA, or "mps" for Macs with Apple Silicon. (default: "0")
  --transcript-path TRANSCRIPT_PATH
                        Path to save the transcription output. (default: output.json)
  --model-name MODEL_NAME
                        Name of the pretrained model/checkpoint to perform ASR. (default: openai/whisper-large-v3)
  --task {transcribe,translate}
                        Task to perform: transcribe or translate to another language. (default: transcribe)
  --language LANGUAGE   
                        Language of the input audio. (default: "None" (Whisper auto-detects the language))
  --batch-size BATCH_SIZE
                        Number of parallel batches you want to compute. Reduce if you face OOMs. (default: 24)
  --flash FLASH         
                        Use Flash Attention 2. Read the FAQs to see how to install FA2 correctly. (default: False)
  --timestamp {chunk,word}
                        Whisper supports both chunked as well as word level timestamps. (default: chunk)
  --hf-token HF_TOKEN
                        Provide a hf.co/settings/token for Pyannote.audio to diarise the audio clips
  --diarization_model DIARIZATION_MODEL
                        Name of the pretrained model/checkpoint to perform diarization. (default: pyannote/speaker-diarization)
  --num-speakers NUM_SPEAKERS
                        Specifies the exact number of speakers present in the audio file. Useful when the exact number of participants in the conversation is known. Must be at least 1. Cannot be used together with --min-speakers or --max-speakers. (default: None)
  --min-speakers MIN_SPEAKERS
                        Sets the minimum number of speakers that the system should consider during diarization. Must be at least 1. Cannot be used together with --num-speakers. Must be less than or equal to --max-speakers if both are specified. (default: None)
  --max-speakers MAX_SPEAKERS
                        Defines the maximum number of speakers that the system should consider in diarization. Must be at least 1. Cannot be used together with --num-speakers. Must be greater than or equal to --min-speakers if both are specified. (default: None)
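
For example, combining the options above, a hypothetical invocation that transcribes with word-level timestamps and diarises a recording known to have two speakers (the file name and token are placeholders) could look like:

insanely-fast-whisper --file-name meeting.mp3 --timestamp word --hf-token <YOUR_HF_TOKEN> --num-speakers 2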

Frequently Asked Questions

How to correctly install flash-attn to make it work with insanely-fast-whisper?

Make sure to install it via pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation. Massive kudos to @li-yifei for helping with this.

How to solve an AssertionError: Torch not compiled with CUDA enabled error on Windows?

The root cause of this problem is still unknown; however, you can resolve it by manually installing torch in the virtualenv, e.g. python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121. Thanks to @pto2k for all the debugging.

How to avoid Out-Of-Memory (OOM) exceptions on Mac?

The mps backend isn't as optimised as CUDA and is therefore much more memory-hungry. Typically you can run with --batch-size 4 without any issues (it should use roughly 12GB of GPU VRAM). Don't forget to set --device-id mps.
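
Putting that together, a minimal Mac invocation would be (placeholder file name):

insanely-fast-whisper --file-name audio.mp3 --device-id mps --batch-size 4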

How to use Whisper without a CLI?

Install the dependencies first:

pip install --upgrade transformers optimum accelerate

Then all you need to run is the snippet below:
import torch
from transformers import pipeline
from transformers.utils import is_flash_attn_2_available

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3", # select checkpoint from https://huggingface.co/openai/whisper-large-v3#model-details
    torch_dtype=torch.float16,
    device="cuda:0", # or mps for Mac devices
    model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
)

outputs = pipe(
    "<FILE_NAME>",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)

outputs
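
If you want word-level timestamps instead (mirroring the CLI's --timestamp word option), the pipeline also accepts return_timestamps="word". A minimal sketch reusing the pipe object from above:

outputs = pipe(
    "<FILE_NAME>",             # placeholder audio path, as above
    chunk_length_s=30,
    batch_size=24,
    return_timestamps="word",  # word-level instead of chunk-level timestamps
)

outputs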

Acknowledgements

  1. OpenAI Whisper team for open-sourcing such a brilliant checkpoint.
  2. Hugging Face Transformers team, specifically Arthur, Patrick, Sanchit & Yoach (alphabetical order) for continuing to maintain Whisper in Transformers.
  3. Hugging Face Optimum team for making the BetterTransformer API so easily accessible.
  4. Patrick Arminio for helping me tremendously to put together this CLI.

Community showcase

  1. @ochen1 created a brilliant MVP for a CLI here: https://github.com/ochen1/insanely-fast-whisper-cli (Try it out now!)
  2. @arihanv created an app (Shush) using NextJS (Frontend) & Modal (Backend): https://github.com/arihanv/Shush (Check it outtt!)
  3. @kadirnar created a python package on top of the transformers with optimisations: https://github.com/kadirnar/whisper-plus (Go go go!!!)

insanely-fast-whisper's People

Contributors

benjaminjackson, broyojo, bt-nia, eltociear, felixcarmona, gsheni, kadirnar, li-yifei, mjgiarlo, oliverwehrens, omahs, orion-zheng, patrick91, paulmeller, python481516, sanchit-gandhi, skocur, tmm1, vaibhavs10, zackees


insanely-fast-whisper's Issues

Output formats and input parameters

How can I define the output format, such as "srt" or "text"? And can I define a VAD filter or the spoken language as a parameter? Nice work btw. Thank you

TO-DO before the next release

  • Add FA2 support via --fa2 parameter
  • Add FA2 benchmarks
  • New release on PyPI
  • Update the README to be more informative (add instructions to install, run, instructions on running without the CLI)
  • Announce!

Issue while installing it on an Ubuntu machine

pip install pipx
pipx install insanely-fast-whisper

Ubuntu 22.04
Python 3.10 used for this
Still getting this issue, how can I solve it?

Traceback (most recent call last):
  File "/home/faster_whisper.py", line 1, in <module>
    from faster_whisper import WhisperModel
  File "/home/faster_whisper.py", line 1, in <module>
    from faster_whisper import WhisperModel
ImportError: cannot import name 'WhisperModel' from partially initialized module 'faster_whisper' (most like

Flash Attention issue

Hi,

Thanks for this repo. It works great. I am just not able to make Flash Attention work. I have an RTX 3090, so it's supposed to work based on their documentation.

I followed the instructions and everything is installed as expected:

pip install flash-attn --no-build-isolation

I can import the Flash Attention dependencies:

python -c "from flash_attn import flash_attn_qkvpacked_func, flash_attn_func"

But when I try to run insanely-fast-whisper with --flash True, I get this error:

$ insanely-fast-whisper --file-name audio.mp3 --language fr --flash True
Traceback (most recent call last):
  File "/home/romain/.local/bin/insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
  File "/home/romain/.local/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/insanely_fast_whisper/cli.py", line 78, in main
    pipe = pipeline(
  File "/home/romain/.local/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 870, in pipeline
    framework, model = infer_framework_load_model(
  File "/home/romain/.local/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 269, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/home/romain/.local/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/romain/.local/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3233, in from_pretrained
    config = cls._check_and_enable_flash_attn_2(config, torch_dtype=torch_dtype, device_map=device_map)
  File "/home/romain/.local/pipx/venvs/insanely-fast-whisper/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1273, in _check_and_enable_flash_attn_2
    raise ImportError(
ImportError: Flash Attention 2 is not available. Please refer to the documentation of https://github.com/Dao-AILab/flash-attention for installing it. Make sure to have at least the version 2.1.0

Make it easy to import and use in other python modules

I wanted to build on top of the work being done here (especially since diarization is coming soon) and I had to copy-paste the transformers code into my code rather than doing:

from insanely_fast_whisper import pipe

pipe(...)

Or maybe

from insanely_fast_whisper import build_pipe

pipe = build_pipe("large-v3", ...)
pipe(...)

Can we get initial support for lib use?

Is it available for Commercial Use?

I am truly impressed with the remarkable work on this project.

I am interested in understanding whether this model is accessible for commercial applications. Could you please provide more details on the commercial usage terms and conditions?

Thanks

What's this based on? Not seeing source code.

Hello,

I'm fascinated by this project, but I don't see any of the source code even though there's a "src" folder and you can install it from PyPI. What is this based on? I'm unable to replicate your test results saying it's 6x faster than faster-whisper. Is this just faster-whisper with an increased batch_size, or does it actually innovate something new?

You're making bold claims about its speed that I have not been able to verify so...

Thanks! Always looking for latest and greatest. I'd also like to know how you think it compares to the Jax implementation located here https://github.com/sanchit-gandhi/whisper-jax

Error installing insanely-fast-whisper on M1 Mac even after several retries

Invocation
pipx install insanely-fast-whisper

Console Output
Fatal error from pip prevented installation. Full pip output in file:
/Users/ray/.local/pipx/logs/cmd_2023-11-22_10.36.26_pip_errors.log

pip seemed to fail to build package:

optimum

Some possibly relevant errors from pip install:
error: subprocess-exited-with-error
FileNotFoundError: [Errno 2] No such file or directory: 'optimum/version.py'
AssertionError: Error: Could not open 'optimum/version.py' due [Errno 2] No such file or directory: 'optimum/version.py'

Error installing insanely-fast-whisper.

Error Log
cmd_2023-11-22_10.36.26_pip_errors.log

  Downloading optimum-0.1.1.tar.gz (17 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
  Downloading optimum-0.1.0.tar.gz (16 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'

PIP STDERR
----------
  WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out. (read timeout=15)")': /packages/dc/0c/f95215bc5f65e0a5fb97d4febce7c18420002a4c3ea5182294dc576f17fb/accelerate-0.16.0-py3-none-any.whl
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Traceback (most recent call last):
        File "<string>", line 7, in <module>
      FileNotFoundError: [Errno 2] No such file or directory: 'optimum/version.py'
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/Users/ray/.local/pipx/shared/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/Users/ray/.local/pipx/shared/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/ray/.local/pipx/shared/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/gf/74t2xp_90_bgkn8nhp74xkn00000gn/T/pip-build-env-lyfsyn4h/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/gf/74t2xp_90_bgkn8nhp74xkn00000gn/T/pip-build-env-lyfsyn4h/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/private/var/folders/gf/74t2xp_90_bgkn8nhp74xkn00000gn/T/pip-build-env-lyfsyn4h/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 480, in run_setup
          super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
        File "/private/var/folders/gf/74t2xp_90_bgkn8nhp74xkn00000gn/T/pip-build-env-lyfsyn4h/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 10, in <module>
      AssertionError: Error: Could not open 'optimum/version.py' due [Errno 2] No such file or directory: 'optimum/version.py'
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

How is the VRAM usage?

Hi there, the project looks really neat. Congrats and thanks! I am curious whether this one is as VRAM-efficient as faster-whisper? I couldn't see anything regarding that in the readme.

whisper-large-v2 with flash attention

I have a very big list (~1M) of audio files and I would like to transcribe them using whisper-large-v2 and Flash Attention.

I am running it on an A100 GPU with the help of the HF pipeline() functionality.

I feel the decoding is very slow, and I would like to ask if my code has any issues:

import torch
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                model_kwargs={"use_flash_attention_2": True},
                generate_kwargs = {"language":"<|es|>","task": "transcribe"},
                device="cuda:0")
audioListPath="wav_audio2_16k_utts.list" ## list of audio files

audioListFP = open(audioListPath)
audioListL = audioListFP.readlines()
audioListL = [audio.strip("\n") for audio in audioListL]
total = len(audioListL)

outTxt = open("wav_audio2_esHyp_whisperlargeV2.txt", "w")  ## writing transcription to this file
s = 0
while s < total:
    e = s + 50000
    print("---- decoding {}, {} ---".format(s, e))
    batchUtts = audioListL[s:e]
    outputs = pipe(batchUtts, chunk_length_s=30, batch_size=50)
    for output in outputs:
        outTxt.write(output['text'] + "\n")
    s = e

When I run the above script I see the following warning:
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').

Hence, I am asking if I am using the model and the pipeline properly.
Help is appreciated.

how to install?

Noob here, so I don't know much about how to install this.

Can you write a guide?

Transcribe short audio

Hi,

I conducted several tests and the results for long audio files are impressive. However, my primary focus is on transcribing short audio files (around 10 seconds) as quickly as possible, and it's in this area that I need the most efficiency.

With Transformers (fp16 + batching [24] + Flash Attention 2):
it takes less than 90 seconds to process a 300-minute audio file, achieving a speed that is 200 times faster than the total length of the audio.
But it takes 1 second to process an 8-second audio file (so only 8x faster than the audio length).

I tried distil-whisper; the results are better for short audio files: around 0.25s for my 8-second file. But it only supports English.

Is there any other optimisation I can use to transcribe short audio files?

mac m1 pro - python3.12 install errors

Installing with python3.10 works, but it appears the optimum dependency's install is broken for python3.12:

> pipx install insanely-fast-whisper
Fatal error from pip prevented installation. Full pip output in file:
    /Users/tavis/.local/pipx/logs/cmd_2023-11-27_15.42.39_pip_errors.log

pip seemed to fail to build package:
    optimum

Some possibly relevant errors from pip install:
    error: subprocess-exited-with-error
    FileNotFoundError: [Errno 2] No such file or directory: 'optimum/version.py'
    AssertionError: Error: Could not open 'optimum/version.py' due [Errno 2] No such file or directory: 'optimum/version.py'

Error installing insanely-fast-whisper.

Snippet of cmd_2023-11-27_15.42.39_pip_errors.log:


PIP STDERR
----------
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [27 lines of output]
      Traceback (most recent call last):
        File "<string>", line 7, in <module>
      FileNotFoundError: [Errno 2] No such file or directory: 'optimum/version.py'

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/Users/tavis/.local/pipx/shared/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/Users/tavis/.local/pipx/shared/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/tavis/.local/pipx/shared/lib/python3.12/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/gx/5p9zxn1j6w1fc1_4rtg9t9wm0000gn/T/pip-build-env-m3rlum3k/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/private/var/folders/gx/5p9zxn1j6w1fc1_4rtg9t9wm0000gn/T/pip-build-env-m3rlum3k/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/private/var/folders/gx/5p9zxn1j6w1fc1_4rtg9t9wm0000gn/T/pip-build-env-m3rlum3k/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 480, in run_setup
          super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
        File "/private/var/folders/gx/5p9zxn1j6w1fc1_4rtg9t9wm0000gn/T/pip-build-env-m3rlum3k/overlay/lib/python3.12/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 10, in <module>
      AssertionError: Error: Could not open 'optimum/version.py' due [Errno 2] No such file or directory: 'optimum/version.py'

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

Does it / will it support large-v3?

Hi @Vaibhavs10, just came across your repository, love the name.
Is large-v3 already supported, or would it need to construct/transform the new model?
Thanks
PS. There was no discussion tab so I used the issues tab. Hope you don't mind.

.mp4 as input?

Is it possible to use an .mp4 as input? It requires a wav, flac or mp3 file, but all my files are .mp4, so I have to convert them, spending a lot of time on it :-(

Thanks

[brainstorming] Torch.mps backend speedup!

Enable support for mps so this can be run natively on Apple silicon. When run with device="mps" instead of device="cuda:0", the error shown in the error log below occurs. Clearly this is lacking in the underlying PyTorch implementation, so we can use this issue to comment on the PyTorch issue and garner additional support.

error log:

Traceback (most recent call last):
  File "./insanely-fast-whisper/whisper.py", line 15, in <module>
    outputs = pipe("/path/to/audio/file.mp3",
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1132, in __call__
    return next(
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 555, in _forward
    encoder_outputs=encoder(inputs, attention_mask=attention_mask),
  File "./insanely-fast-whisper/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py", line 1159, in forward
    layer_outputs = encoder_layer(
  File "./insanely-fast-whisper/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "./insanely-fast-whisper/lib/python3.10/site-packages/optimum/bettertransformer/models/encoder_models.py", line 1039, in forward
    hidden_states = torch._transformer_encoder_layer_fwd(
NotImplementedError: The operator 'aten::_transformer_encoder_layer_fwd' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.

ValueError : The generation config is outdated

Hi @Vaibhavs10

I forgot to mention that I had changed the model from large-v2 to openai/whisper-medium.en

Does it mean that, at present, the implementation supports large-v2 only?

I changed it back, and this time the error did not come up; it has started downloading the model weights (6 GB).


Speaker diarization

Hello,

I'm really impressed with this library.

And I have a question: how can I achieve accurate speaker diarization using it?

[feat] CLI Tool

These optimizations are fantastic. To make it more accessible and user-friendly for those who prefer command-line tools, someone should consider developing a CLI (Command Line Interface) version of this project. This would greatly enhance the ease of use and automation for many users.

Passing parameters for whisper model inference results in unexpected keyword argument error

I am encountering an issue while attempting to pass parameters for the Whisper model inference using the code snippet below:

outputs = pipe(
    "audio.mp3",
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
    condition_on_previous_text=trans["condition_on_previous_text"],
    vad_filter=True,
    word_timestamps=True,
    repetition_penalty=trans["repetition_penalty"],
    temperature=trans["temperature"],
)

The code above is intended to perform inference using the Whisper model, but it results in the following error:
TypeError: AutomaticSpeechRecognitionPipeline._sanitize_parameters() got an unexpected keyword argument 'condition_on_previous_text'.

speaker diarise: ValueError: attempt to get argmin of an empty sequence

Diarisation is not working when running the CLI with:
python cli.py --device-id cuda:0 --batch-size 1 --timestamp word --file-name mono.wav --hf_token XXXXXXX


🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:20
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
🤗 Transcribing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:20
🤗 Segmenting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01
Traceback (most recent call last):
  File "/home/user/cli.py", line 282, in <module>
    segmented_transcript = post_process_segments_and_transcripts(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/cli.py", line 128, in post_process_segments_and_transcripts
    upto_idx = np.argmin(np.abs(end_timestamps - end_time))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/insanely-fast-whisper/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 1325, in argmin
    return _wrapfunc(a, 'argmin', axis=axis, out=out, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/insanely-fast-whisper/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 59, in _wrapfunc
    return bound(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^
ValueError: attempt to get argmin of an empty sequence

Error: Torch not compiled with CUDA enabled

I get the following error:

AssertionError: Torch not compiled with CUDA enabled

Although I do have torch (2.1) and CUDA (12.1):

>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.1.0+cu121
>>> print("Is CUDA available:", torch.cuda.is_available())
Is CUDA available: True
>>>
>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:36:24_Pacific_Standard_Time_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

I run in conda env on Windows. My GPU is nvidia RTX3060 with the latest drivers.

Initial testing led to a crash, FYI in case this is of use.

$ insanely-fast-whisper --file-name 2023-11-15\ 10.45.06.mp4
Traceback (most recent call last):
  File "/home/administrator/PodVision/venv-ai/bin/insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
  File "/home/administrator/PodVision/venv-ai/lib/python3.10/site-packages/insanely_fast_whisper/cli.py", line 86, in main
    pipe = pipeline(
  File "/home/administrator/PodVision/venv-ai/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 885, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/administrator/PodVision/venv-ai/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 691, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/administrator/PodVision/venv-ai/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
    return cls._from_pretrained(
  File "/home/administrator/PodVision/venv-ai/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1857, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "/home/administrator/PodVision/venv-ai/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
    raise ValueError(
ValueError: Non-consecutive added token '<|0.02|>' found. Should have index 50365 but has index 50366 in saved vocabulary.

$ python --version
Python 3.10.8

$ nvidia-smi
Wed Nov 15 22:11:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.76       Driver Version: 515.76       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T1000 8GB    Off  | 00000000:03:00.0 Off |                  N/A |
| 32%   45C    P0    N/A /  50W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

Tried a couple of different input files, crash occurs in the same location.

Weird error with a URL-pointed file in Colab


Input:

EX_FILE_URL="https://upload.wikimedia.org/wikipedia/commons/c/c8/Example.ogg"
!pip install insanely-fast-whisper
!echo insanely-fast-whisper --file-name $EX_FILE_URL
!insanely-fast-whisper --file-name $EX_FILE_URL

Output:

insanely-fast-whisper --file-name https://upload.wikimedia.org/wikipedia/commons/c/c8/Example.ogg
2023-11-14 19:04:01.002323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-14 19:04:01.002389: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-14 19:04:01.002461: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-14 19:04:02.254552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
Traceback (most recent call last):
  File "/usr/local/bin/insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/insanely_fast_whisper/cli.py", line 101, in main
    outputs = pipe(
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 1132, in __call__
    return next(
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/pt_utils.py", line 183, in __next__
    processed = next(self.subiterator)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/automatic_speech_recognition.py", line 434, in preprocess
    inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/audio_utils.py", line 41, in ffmpeg_read
    raise ValueError(
ValueError: Soundfile is either not in the correct format or is malformed. Ensure that the soundfile has a valid audio file extension (e.g. wav, flac or mp3) and is not corrupted. If reading from a remote URL, ensure that the URL is the full address to **download** the audio file.

While the file seemed to be processed fine with:

import torch
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v3",
                torch_dtype=torch.float16,
                device="cuda:0")

pipe(EX_FILE_URL)

AssertionError: Torch not compiled with CUDA enabled

I installed it using pipx but I'm having this error running the tool.
How could I fix this? Do I need to install a proper version of Torch with pipx?

For context, I have the original Whisper installed with pip and it runs okay.

Thank you!

D:\Tools\Whisper\Video>insanely-fast-whisper --file-name video.wav
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "C:\Users\andy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\andy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\andy\.local\bin\insanely-fast-whisper.exe\__main__.py", line 7, in <module>
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\insanely_fast_whisper\cli.py", line 86, in main
    pipe = pipeline(
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\transformers\pipelines\__init__.py", line 1070, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py", line 239, in __init__
    self.model.to(device)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\transformers\modeling_utils.py", line 2271, in to
    return super().to(*args, **kwargs)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\nn\modules\module.py", line 1160, in to
    return self._apply(convert)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\nn\modules\module.py", line 833, in _apply
    param_applied = fn(param)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\nn\modules\module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "C:\Users\andy\.local\pipx\venvs\insanely-fast-whisper\lib\site-packages\torch\cuda\__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Segments

Does the output have segments, the same way model.transcribe() does in the original model?

import torch
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                "openai/whisper-large-v2",
                torch_dtype=torch.float16,
                device="cuda:0")

pipe.model = pipe.model.to_bettertransformer()

outputs = pipe("<FILE_NAME>",
               chunk_length_s=30,
               batch_size=24,
               return_timestamps=True)

outputs["segments"]

Support for more file types

Support various file types, including video formats like .mov, .mp4, etc. This is easily achievable via a number of Python packages.

Example implementation:

!pip install moviepy

import moviepy.editor as mp
clip = mp.VideoFileClip("/path/to/video/file.ext")
clip.audio.write_audiofile("/path/to/video/file.mp3")

.srt/.txt files & Speaker Recognition

Very nice speedup - easily 10x faster than Whisper.

Is there a way to have it output the .txt and .srt that Whisper delivers as well? Is there an easy conversion script from the output.json?

Also is there a way to recognize speakers i.e. Speaker 1 , Speaker 2?

translation

Thanks for the example!

Can you add a translation example?
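
The CLI already exposes this via the --task flag documented above; a minimal sketch (placeholder file name; note that Whisper's translate task always translates into English):

insanely-fast-whisper --file-name audio.mp3 --task translate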

word timestamps crashes

When specifying word timestamps on a 3m 45s file, I am seeing a crash:

insanely-fast-whisper --file-name test.wav --timestamp word
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "insanely_fast_whisper/cli.py", line 101, in main
    outputs = pipe(
              ^^^^^
  File "transformers/pipelines/automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/base.py", line 1132, in __call__
    return next(
           ^^^^^
  File "transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/automatic_speech_recognition.py", line 552, in _forward
    generate_kwargs["num_frames"] = stride[0] // self.feature_extractor.hop_length
                                    ~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for //: 'tuple' and 'int'
