sanchit-gandhi / whisper-jax

JAX implementation of OpenAI's Whisper model for up to 70x speed-up on TPU.

License: Apache License 2.0

Python 1.79% Makefile 0.01% Shell 0.01% Jupyter Notebook 98.20%

deep-learning jax speech-recognition speech-to-text whisper

whisper-jax's Introduction

Whisper JAX

This repository contains optimised JAX code for OpenAI's Whisper Model, largely built on the 🤗 Hugging Face Transformers Whisper implementation. Compared to OpenAI's PyTorch code, Whisper JAX runs over 70x faster, making it the fastest Whisper implementation available.

The JAX code is compatible with CPU, GPU and TPU, and can be run standalone (see Pipeline Usage) or as an inference endpoint (see Creating an Endpoint).

For a quick-start guide to running Whisper JAX on a Cloud TPU, refer to the following Kaggle notebook, where we transcribe 30 mins of audio in approx 30 sec:

The Whisper JAX model is also running as a demo on the Hugging Face Hub:

Installation

Whisper JAX was tested using Python 3.9 and JAX version 0.4.5. Installation assumes that you already have the latest version of the JAX package installed on your device. You can do so using the official JAX installation guide: https://github.com/google/jax#installation

Once the appropriate version of JAX has been installed, Whisper JAX can be installed through pip:

pip install git+https://github.com/sanchit-gandhi/whisper-jax.git

To update the Whisper JAX package to the latest version, simply run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/sanchit-gandhi/whisper-jax.git

Pipeline Usage

The recommended way of running Whisper JAX is through the FlaxWhisperPipline abstraction class. This class handles all the necessary pre- and post-processing, as well as wrapping the generate method for data parallelism across accelerator devices.

Whisper JAX makes use of JAX's pmap function for data parallelism across GPU/TPU devices. This function is Just-In-Time (JIT) compiled the first time it is called. Thereafter, the compiled function is cached, so subsequent calls run much faster:

from whisper_jax import FlaxWhisperPipline

# instantiate pipeline
pipeline = FlaxWhisperPipline("openai/whisper-large-v2")

# JIT compile the forward call - slow, but we only do this once
text = pipeline("audio.mp3")

# use the cached function thereafter - super fast!!
text = pipeline("audio.mp3")

Half-Precision

The model computation can be run in half-precision by passing the dtype argument when instantiating the pipeline. This speeds up the computation considerably by storing intermediate tensors in half-precision. There is no change to the precision of the model weights.

For most GPUs, the dtype should be set to jnp.float16. For A100 GPUs or TPUs, the dtype should be set to jnp.bfloat16:

from whisper_jax import FlaxWhisperPipline
import jax.numpy as jnp

# instantiate pipeline in bfloat16
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)
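
The rule of thumb above can be written as a small helper. This is a minimal sketch: the function name pick_dtype_name is hypothetical, and falling back to float32 on CPU is an assumption that the text above does not state.

```python
def pick_dtype_name(platform: str, device_kind: str = "") -> str:
    """Return the name of a jax.numpy dtype following the rule of thumb above."""
    platform = platform.lower()
    kind = device_kind.lower()
    # TPUs and A100 GPUs: bfloat16, as recommended above
    if platform == "tpu" or "a100" in kind:
        return "bfloat16"
    # most other GPUs: float16, as recommended above
    if platform in ("gpu", "cuda", "rocm"):
        return "float16"
    # assumption (not from the README): keep full precision on CPU
    return "float32"


# Possible usage with a live JAX install (kept as comments so the sketch
# stays runnable without an accelerator):
#   import jax, jax.numpy as jnp
#   device = jax.devices()[0]
#   dtype = getattr(jnp, pick_dtype_name(device.platform, device.device_kind))
```

The helper only inspects strings, so it can be checked without any accelerator attached.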

Batching

Whisper JAX also provides the option of batching a single audio input across accelerator devices. The audio is first chunked into 30-second segments, and the chunks are then dispatched to the model to be transcribed in parallel. The resulting transcriptions are stitched back together at the boundaries to give a single, uniform transcription. In practice, batching provides a 10x speed-up compared to transcribing the audio samples sequentially, with a less than 1% penalty to the WER¹, provided the batch size is large enough.

To enable batching, pass the batch_size parameter when you instantiate the pipeline:

from whisper_jax import FlaxWhisperPipline

# instantiate pipeline with batching
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", batch_size=16)
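
To make the chunking step concrete, the sketch below computes overlapping chunk boundaries for an audio array. It illustrates the general technique only, not the pipeline's actual code: the function names are hypothetical, and the 5-second overlap on each side (stride_length_s) is an assumed value.

```python
import math


def chunk_bounds(num_samples, sampling_rate=16000, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # consecutive chunks overlap by one stride on each side
    step = chunk - 2 * stride
    if step <= 0:
        raise ValueError("stride_length_s must be less than half of chunk_length_s")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step
    return bounds


def num_batches(num_chunks, batch_size):
    """Number of forward passes needed to transcribe all chunks."""
    return math.ceil(num_chunks / batch_size)


# one minute of 16 kHz audio
bounds = chunk_bounds(60 * 16000)
print(len(bounds), num_batches(len(bounds), batch_size=16))
```

Under these assumptions, a longer file produces more chunks per batch, which is why the batch size matters most for long audio.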

Task

By default, the pipeline transcribes the audio file in the language it was spoken in. For speech translation, set the task argument to "translate":

# translate
text = pipeline("audio.mp3", task="translate")

Timestamps

The FlaxWhisperPipline also supports timestamp prediction. Note that enabling timestamps requires a second JIT compilation of the forward call, this time including the timestamp outputs:

# transcribe and return timestamps
outputs = pipeline("audio.mp3", task="transcribe", return_timestamps=True)
text = outputs["text"]  # transcription
chunks = outputs["chunks"]  # transcription + timestamps
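
The timestamped chunks can be turned into subtitles. The sketch below assumes each chunk is a dict with a "timestamp" pair of (start, end) in seconds and a "text" string, following the Hugging Face Transformers pipeline convention; the README only states that chunks holds the transcription plus timestamps, so check this on your own output. The helper names are hypothetical.

```python
def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks) -> str:
    """Convert a list of {"timestamp": (start, end), "text": ...} dicts to SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # assumption: a missing end time is replaced by the start time
            end = start
        blocks.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(blocks)


example = [{"timestamp": (0.0, 2.5), "text": " Hello"}, {"timestamp": (2.5, 4.0), "text": " world"}]
print(chunks_to_srt(example))
```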

Putting it all together

In the following code snippet, we instantiate the model in bfloat16 precision with batching enabled, and transcribe the audio file, returning timestamp tokens:

from whisper_jax import FlaxWhisperPipline
import jax.numpy as jnp

# instantiate pipeline with bfloat16 and enable batching
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)

# transcribe and return timestamps
outputs = pipeline("audio.mp3", task="transcribe", return_timestamps=True)

Model Usage

The Whisper JAX model can be used on a more granular level in much the same way as the original Hugging Face Transformers implementation. This requires the Whisper processor to be loaded separately from the model to handle the pre- and post-processing, and the generate function to be wrapped using pmap by hand:

import jax.numpy as jnp
from datasets import load_dataset
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from jax import device_get, pmap
from transformers import WhisperProcessor

from whisper_jax import FlaxWhisperForConditionalGeneration

# load the processor and model
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model, params = FlaxWhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", dtype=jnp.bfloat16, _do_init=False,
)

def generate_fn(input_features):
    pred_ids = model.generate(
        input_features, task="transcribe", return_timestamps=False, max_length=model.config.max_length, params=params,
    )
    return pred_ids.sequences

# pmap the generate function for data parallelism
p_generate = pmap(generate_fn, "input_features")
# replicate the parameters across devices
params = replicate(params)

# load a dummy sample from the LibriSpeech dataset
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]

# pre-process: convert the audio array to log-mel input features
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="np").input_features
# shard the input features across devices for DP
input_features = shard(input_features)

# run the forward pass (JIT compiled the first time it is called)
pred_ids = p_generate(input_features)
output_ids = device_get(pred_ids.reshape(-1, model.config.max_length))

# post-process: convert tokens ids to text string
transcription = processor.batch_decode(output_ids, skip_special_tokens=True)
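
The shape bookkeeping behind shard and the final reshape can be followed with plain tuples. This sketch mirrors what flax.training.common_utils.shard does to the leading batch dimension; the helper names are hypothetical and the max_length of 448 is only an example value.

```python
def sharded_shape(shape, num_devices):
    """Shape after splitting the batch dimension across devices, as shard does."""
    batch = shape[0]
    if batch % num_devices != 0:
        raise ValueError(
            f"batch size {batch} is not divisible by {num_devices} devices"
        )
    return (num_devices, batch // num_devices) + tuple(shape[1:])


def gathered_shape(shape, max_length):
    """Shape after pred_ids.reshape(-1, max_length) merges the device and batch dims."""
    num_devices, per_device_batch = shape[0], shape[1]
    return (num_devices * per_device_batch, max_length)


# 8 log-mel inputs of shape (80, 3000) spread over 8 devices
features = sharded_shape((8, 80, 3000), num_devices=8)
print(features)                              # one sample per device
print(gathered_shape((8, 1, 448), 448))      # back to (batch, max_length)
```

One consequence is that the batch dimension must be divisible by the number of local devices, so a single sample only shards cleanly on a single device.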

Available Models and Languages

All Whisper models on the Hugging Face Hub with Flax weights are compatible with Whisper JAX. This includes, but is not limited to, the official OpenAI Whisper checkpoints:

Size	Parameters	English-only	Multilingual
tiny	39 M	✓	✓
base	74 M	✓	✓
small	244 M	✓	✓
medium	769 M	✓	✓
large	1550 M	x	✓
large-v2	1550 M	x	✓

Should you wish to use a fine-tuned Whisper checkpoint in Whisper JAX, you should first convert the PyTorch weights to Flax. This is straightforward through use of the from_pt argument, which will convert the PyTorch state dict to a frozen Flax parameter dictionary on the fly. You can then push the converted Flax weights to the Hub to be used directly in Flax the next time they are required. Note that converting weights from PyTorch to Flax requires both PyTorch and Flax to be installed.

For example, to convert the fine-tuned checkpoint sanchit-gandhi/whisper-small-hi from the blog post Fine-Tuning Whisper:

from whisper_jax import FlaxWhisperForConditionalGeneration, FlaxWhisperPipline
import jax.numpy as jnp

checkpoint_id = "sanchit-gandhi/whisper-small-hi"
# convert PyTorch weights to Flax
model = FlaxWhisperForConditionalGeneration.from_pretrained(checkpoint_id, from_pt=True)
# push converted weights to the Hub
model.push_to_hub(checkpoint_id)

# now we can load the Flax weights directly as required
pipeline = FlaxWhisperPipline(checkpoint_id, dtype=jnp.bfloat16, batch_size=16)

Advanced Usage

More advanced users may wish to explore different parallelisation techniques. The Whisper JAX code is built on top of the T5x codebase, meaning it can be run using model, activation, and data parallelism using the T5x partitioning convention. To use T5x partitioning, the logical axis rules and number of model partitions must be defined. For more details, refer to the official T5x partitioning guide: https://github.com/google-research/t5x/blob/main/docs/usage/partitioning.md

Pipeline

The following code snippet demonstrates how data parallelism can be achieved using the pipeline shard_params method in an entirely equivalent way to pmap:

from whisper_jax import FlaxWhisperPipline
import jax.numpy as jnp

# 2D parameter and activation partitioning for DP
logical_axis_rules_dp = (
    ("batch", "data"),
    ("mlp", None),
    ("heads", None),
    ("vocab", None),
    ("embed", None),
    ("joined_kv", None),
    ("kv", None),
    ("length", None),
    ("num_mel", None),
    ("channels", None),
)

pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16, batch_size=16)
pipeline.shard_params(num_mp_partitions=1, logical_axis_rules=logical_axis_rules_dp)
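
To see what the logical axis rules express, the sketch below resolves the logical axis names of one array to mesh axes. It is a simplified illustration of the first-match-wins idea, not the T5x implementation; the function name and the model-parallel rule set are hypothetical.

```python
def resolve_axes(logical_axes, rules):
    """Map each logical axis name to a mesh axis (or None for replicated)."""
    resolved = []
    used = set()
    for name in logical_axes:
        mesh_axis = None
        for rule_name, rule_mesh in rules:
            # first matching rule wins; a mesh axis may be used once per array
            if rule_name == name and (rule_mesh is None or rule_mesh not in used):
                mesh_axis = rule_mesh
                break
        if mesh_axis is not None:
            used.add(mesh_axis)
        # simplification: names without a matching rule stay unpartitioned
        resolved.append(mesh_axis)
    return tuple(resolved)


# data-parallel rules as in the snippet above: only "batch" is partitioned
rules_dp = (("batch", "data"), ("mlp", None), ("embed", None), ("length", None))
print(resolve_axes(("batch", "length", "embed"), rules_dp))

# hypothetical model-parallel variant: shard the MLP dimension over "model"
rules_mp = (("batch", "data"), ("mlp", "model"), ("embed", None))
print(resolve_axes(("embed", "mlp"), rules_mp))
```

With the data-parallel rules every parameter axis maps to None, so the parameters are replicated and only the batch is split, which matches the pmap behaviour described above.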

Model

It is also possible to use the Whisper JAX model with T5x partitioning by defining a T5x inference state and T5x partitioner:

import jax
import jax.numpy as jnp
from flax.core.frozen_dict import freeze
from jax.sharding import PartitionSpec as P

from whisper_jax import FlaxWhisperForConditionalGeneration, InferenceState, PjitPartitioner


# 2D parameter and activation partitioning for DP
logical_axis_rules_dp = [
    ("batch", "data"),
    ("mlp", None),
    ("heads", None),
    ("vocab", None),
    ("embed", None),
    ("joined_kv", None),
    ("kv", None),
    ("length", None),
    ("num_mel", None),
    ("channels", None),
]

model, params = FlaxWhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    _do_init=False,
    dtype=jnp.bfloat16,
)


def init_fn():
    input_shape = (1, 80, 3000)

    input_features = jnp.zeros(input_shape, dtype="f4")
    input_features = input_features.at[(..., -1)].set(model.config.eos_token_id)

    decoder_input_ids = jnp.zeros((input_shape[0], 1), dtype="i4")
    decoder_attention_mask = jnp.ones_like(decoder_input_ids)

    batch_size, sequence_length = decoder_input_ids.shape
    decoder_position_ids = jnp.broadcast_to(jnp.arange(sequence_length)[None, :], (batch_size, sequence_length))

    rng = jax.random.PRNGKey(0)
    init_params = model.module.init(
        rng,
        input_features=input_features,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        decoder_position_ids=decoder_position_ids,
        return_dict=False,
    )
    return init_params


# Axis names metadata
param_axes = jax.eval_shape(init_fn)["params_axes"]

# Create InferenceState, since the partitioner expects it
state = InferenceState(
    step=jnp.array(0),
    params=freeze(model.params_shape_tree),
    params_axes=freeze(param_axes),
    flax_mutables=None,
    flax_mutables_axes=param_axes,
)

# Define the pjit partitioner with 1 model partition
partitioner = PjitPartitioner(
    num_partitions=1,
    logical_axis_rules=logical_axis_rules_dp,
)

mesh_axes = partitioner.get_mesh_axes(state)
params_spec = mesh_axes.params

p_shard_params = partitioner.partition(model.to_bf16, (params_spec,), params_spec)


def generate(params, input_features):
    output_ids = model.generate(input_features, params=params, max_length=model.config.max_length).sequences
    return output_ids


p_generate = partitioner.partition(
    generate,
    in_axis_resources=(params_spec, P("data")),
    out_axis_resources=P("data"),
)

# This will auto-magically run in mesh context
params = p_shard_params(freeze(params))

# you can now run the forward pass with:
# pred_ids = p_generate(params, input_features)

Benchmarks

We compare Whisper JAX to the official OpenAI implementation and the 🤗 Transformers implementation. We benchmark the models on audio samples of increasing length and report the average inference time in seconds over 10 repeat runs. For all three systems, we pass a pre-loaded audio file to the model and measure the time for the forward pass. Leaving the task of loading the audio file to the systems adds an equal offset to all the benchmark times, so the actual time for loading and transcribing an audio file will be higher than the reported numbers.
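
The measurement procedure described above (average over repeat runs, excluding compilation) can be sketched as a small timing harness. This is not the script used to produce the reported numbers; the function name and the single warm-up call are assumptions.

```python
import time
from statistics import mean


def benchmark(fn, *args, repeats=10, warmup=1, **kwargs):
    """Return the average wall-clock time of fn over `repeats` runs, in seconds."""
    # warm-up calls trigger JIT compilation and are not timed
    for _ in range(warmup):
        fn(*args, **kwargs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Possible usage with a pre-loaded audio input (kept as a comment because it
# needs the model and an accelerator):
#   avg_seconds = benchmark(pipeline, audio, repeats=10)

print(benchmark(sum, range(1000), repeats=3))
```

When timing raw JAX functions rather than the pipeline, call block_until_ready() on the returned arrays inside fn, since JAX dispatches work asynchronously.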

OpenAI and Transformers both run in PyTorch on GPU. Whisper JAX runs in JAX on GPU and TPU. OpenAI transcribes the audio sequentially in the order it is spoken. Both Transformers and Whisper JAX use a batching algorithm, where chunks of audio are batched together and transcribed in parallel (see section Batching).

Table 1: Average inference time in seconds for audio files of increasing length. GPU device is a single A100 40GB GPU. TPU device is a single TPU v4-8.

	OpenAI	Transformers	Whisper JAX	Whisper JAX
Framework	PyTorch	PyTorch	JAX	JAX
Backend	GPU	GPU	GPU	TPU
1 min	13.8	4.54	1.72	0.45
10 min	108.3	20.2	9.38	2.01
1 hour	1001.0	126.1	75.3	13.8
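
The headline speed-up follows directly from Table 1. The short calculation below uses only the numbers in the table; the dictionary keys and function names are made up for illustration.

```python
# average inference times in seconds, copied from Table 1
table = {
    "1 min": {"openai_gpu": 13.8, "transformers_gpu": 4.54, "jax_gpu": 1.72, "jax_tpu": 0.45},
    "10 min": {"openai_gpu": 108.3, "transformers_gpu": 20.2, "jax_gpu": 9.38, "jax_tpu": 2.01},
    "1 hour": {"openai_gpu": 1001.0, "transformers_gpu": 126.1, "jax_gpu": 75.3, "jax_tpu": 13.8},
}
audio_seconds = {"1 min": 60, "10 min": 600, "1 hour": 3600}


def speedup(row, baseline="openai_gpu", system="jax_tpu"):
    """How many times faster `system` is than `baseline` for one audio length."""
    return table[row][baseline] / table[row][system]


def real_time_factor(row, system):
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds[row] / table[row][system]


for row in table:
    print(row, round(speedup(row), 1), round(real_time_factor(row, "jax_tpu"), 1))
```

For the 1 hour row, 1001.0 / 13.8 is roughly 72.5, which is where the "over 70x" figure comes from.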

Creating an Endpoint

The Whisper JAX model is running as a demo on the Hugging Face Hub:

However, at peak times there may be a queue of users that limit how quickly your audio input is transcribed. In this case, you may benefit from running the model yourself, such that you have unrestricted access to the Whisper JAX model.

If you are just interested in running the model in a standalone Python script, refer to the Kaggle notebook Whisper JAX TPU:

Otherwise, we provide all the necessary code for creating an inference endpoint. To obtain this code, first clone the repository on the GPU/TPU on which you want to host the endpoint:

git clone https://github.com/sanchit-gandhi/whisper-jax

And then install Whisper JAX from source, with the required additional endpoint dependencies:

cd whisper-jax
pip install -e .["endpoint"]

We recommend that you set up an endpoint in the same zone/region as the one you are based in. This reduces the communication time between your local machine and the remote one, which can significantly reduce the overall request time.

Gradio App

The Python script app.py contains the code to launch a Gradio app with the Whisper large-v2 model. By default, it uses a batch size of 16 and bfloat16 half-precision. You should update these parameters depending on your GPU/TPU device (as explained in the sections on Half-precision and Batching).

We can launch the Gradio app on port 7860 (default) on our GPU/TPU device through the following command:

python app/app.py

This will launch a Gradio demo with the same interface as the official Whisper JAX demo. To view the Gradio app remotely, we have two options:

1. Open the port 7860 on the GPU/TPU device to listen to all requests
2. Start an ngrok server on the GPU/TPU that redirects requests to port 7860

To open the port 7860 on your GPU/TPU, refer to your hardware provider's firewall instructions (for GCP, these can be found here). Once you have opened port 7860, you should be able to access the Gradio demo through the http address:

http://DEVICE-IP:7860

where DEVICE-IP is the public IP address of your GPU/TPU. We can verify this address is accessible by opening this http address in a browser window on our local machine.

Alternatively, we can direct network requests to the Gradio app using ngrok. By using ngrok, we don't need to open the port 7860 on our GPU/TPU - ngrok will provide us with a public http address that will automatically redirect requests to port 7860 on our accelerator. However, in our experience, using ngrok was less reliable than a direct tunnel to port 7860, thus we recommend option 1 here where possible.

To set up ngrok on your GPU/TPU, first install ngrok according to the official installation guide. You should authenticate your ngrok account if you have one, otherwise your ngrok server will be time-limited to 2 hours. Once installed and authenticated, you can launch an ngrok server on port 7860:

ngrok http 7860

The ngrok http address will be of the form:

https://NGROK-ADDRESS.ngrok.io

which can be used to access the Gradio demo through a web browser.

Sending Requests

Independent of whether you've chosen to open the port 7860 or use ngrok, we're now ready to send audio file requests to our endpoint. To do this, we'll make use of the gradio_client library. If you already have a recent version of Gradio, then the gradio_client library is included as a dependency.

Otherwise, the lightweight gradio_client package can be installed from pip and is tested to work with Python versions 3.9 or higher:

pip install --upgrade gradio_client

We can now send JSON requests to our endpoint. The function transcribe_audio sends an audio file to our endpoint and returns the transcription:

from gradio_client import Client

# make sure this URL matches your http web address (set exactly one of the two)
API_URL = "http://DEVICE-IP:7860/"  # if using port 7860
# API_URL = "https://NGROK-ADDRESS.ngrok.io/"  # use this line instead if using ngrok

# set up the Gradio client
client = Client(API_URL)

def transcribe_audio(audio_path, task="transcribe", return_timestamps=False):
    """Function to transcribe an audio file using our endpoint"""
    text, runtime = client.predict(
        audio_path,
        task,
        return_timestamps,
        api_name="/predict_1",
    )
    return text

# transcribe an audio file using our endpoint
output = transcribe_audio("audio.mp3")

# transcribe with timestamps
output_with_timestamps = transcribe_audio("audio.mp3", return_timestamps=True)
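
When several files need transcribing, a thin wrapper around transcribe_audio keeps one failed request from aborting the rest. This is a minimal sketch: transcribe_many is a hypothetical helper, and it takes the transcription function as an argument so that it does not depend on a running endpoint.

```python
def transcribe_many(paths, transcribe_fn):
    """Transcribe files one after another, collecting results and errors separately."""
    results, errors = {}, {}
    for path in paths:
        try:
            results[path] = transcribe_fn(path)
        except Exception as exc:
            # record the failure and carry on with the remaining files
            errors[path] = repr(exc)
    return results, errors


# Possible usage with the function defined above:
#   results, errors = transcribe_many(["a.mp3", "b.mp3"], transcribe_audio)

# stand-in transcription function for a dry run without an endpoint
def fake_transcribe(path):
    if path.endswith(".txt"):
        raise ValueError("not an audio file")
    return f"transcript of {path}"


print(transcribe_many(["a.mp3", "notes.txt"], fake_transcribe))
```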

Acknowledgements

🤗 Hugging Face Transformers for the base Whisper implementation, particularly to andyehrenberg for the Flax Whisper PR and ArthurZucker for the batching algorithm
Gradio for their easy-to-use package for building ML demos, and pcuenca for the help in hooking the demo up to the TPU
Google's TPU Research Cloud (TRC) programme for Cloud TPUs
Google's t5x Repository for the model partitioning framework

¹ See WER results from Colab: https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor?usp=sharing

whisper-jax's Issues

How to indicate the input audio file?

How can I pass the path of the input audio file to the pipeline? In the Kaggle notebook you are passing a dataset. Should we just replace it with a path to our input, or is there another way?

Process Killed without any Error

I have the following specs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 T...    On | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P5                8W /  60W|     54MiB /  4096MiB |     41%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

I see the following warning before the program is killed:
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I do not see other errors:

python whisperJAX.py 
2023-04-23 22:28:46.200680: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Killed

How can I resolve this issue? Please let me know if I need to share any more details

AttributeError: module 'jax.tree_util' has no attribute 'register_pytree_with_keys_class'

I cannot seem to get rid of this on google colab:

AttributeError: module 'jax.tree_util' has no attribute 'register_pytree_with_keys_class'

Default test files are generating repeating text / hallucinating

I used the default settings on the Kaggle notebook.

https://huggingface.co/datasets/sanchit-gandhi/whisper-jax-test-files

Recreate Benchmarks on A100

Hey all,

Very interesting work! I am trying to recreate some of the results you have in table 1.

Do you happen to have the script + audio used on hand? I am having trouble matching it on my machine:

from whisper_jax import FlaxWhisperPipline
import jax.numpy as jnp
import time 
import librosa

SAMPLING_RATE = 16000
audio, sr = librosa.load('test_audio.mp3', sr=SAMPLING_RATE)

# instantiate pipeline in half precision (float16)
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.float16, batch_size=32)

print("Warmup compiling forward pass")
text = pipeline(audio)


start_time = time.time()
for i in range(10):
    print(f"Go iter {i}")
    text = pipeline(audio)
end_time = time.time()
print(text)
print(f"Took {end_time - start_time} s")

# Took 330.93562269210815 s

test_audio.mp3 is a 13 min TED talk clip. I get about 30s per transcription iteration with this. It could be a number of things, but I just want to know whether this code would be expected to give the benchmark results under an optimal config.

realtime transcriptions

Hi, I appreciate you sharing this framework; it looks very useful.
I'm wondering if it's possible to do real-time transcription using
from transformers.pipelines.audio_utils import ffmpeg_microphone_live, as detailed in this PR:

huggingface/transformers#21196

Add speed comparison to whisper.cpp

It would be nice to know how this compares to the ggml-based whisper.cpp implementation.

https://github.com/ggerganov/whisper.cpp

Specify the device when loading the pipeline

Is there a way to specify the device when loading the pipeline? It doesn't seem possible to pass the device id as you can with the 🤗 pipeline, like:
pipe = FlaxWhisperPipline("openai/whisper-large-v2", device=0, dtype=jnp.bfloat16, batch_size=16)

I'm running a benchmark on multiple models/pipelines, and Whisper JAX takes up all the VRAM available on the 2 GPUs I have (A100 80GB), which causes an OOM error when I try to process an audio file.
I'd like to have the possibility to load Whisper JAX on device 0 and the other models on any other devices I have.

Please recommend a way to do something like this.
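
One approach that may help here, outside of the pipeline API, is to configure the process through environment variables before JAX is imported. CUDA_VISIBLE_DEVICES and the XLA_PYTHON_CLIENT_* variables are documented CUDA and JAX settings; the helper name is hypothetical, and whether this resolves the OOM in this particular setup is untested.

```python
import os


def restrict_jax_to_gpu(device_index, mem_fraction=None, preallocate=False):
    """Configure GPU visibility and memory behaviour; call BEFORE importing jax."""
    # only this GPU is visible to the current process
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    # by default JAX preallocates most of the GPU memory; turn that off here
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        # cap the share of GPU memory that JAX may take
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"


restrict_jax_to_gpu(0, mem_fraction=0.5)
# import jax and whisper_jax only after this point
```

Note that CUDA_VISIBLE_DEVICES hides the other GPUs from the whole process, so other models that should run on a different GPU would need their own process, or the memory settings alone.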

OpenAI and Transformers Benchmarks

Hello, I want to confirm whether the OpenAI implementation in the benchmark uses the openai-whisper library or the WhisperForConditionalGeneration model from Hugging Face. I also want to confirm whether the Hugging Face implementation uses the FlaxWhisperForConditionalGeneration model.

If the OpenAI implementation uses the model in openai-whisper, is the performance test the execution time of DecodingTask.run()?

Selecting Language Manually e.g. Hindi Language

Is there a way to force the model to transcribe in a specific language, e.g. Hindi?
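
A snippet in another issue on this page passes language="es" to the pipeline call, which suggests the language can be set per call. The wrapper below is a sketch under that assumption: transcribe_in_language is a hypothetical helper, and the accepted values (for example "hi" for Hindi) are not confirmed here.

```python
def transcribe_in_language(pipeline, audio, language, return_timestamps=False):
    """Call the pipeline with an explicit language instead of auto-detection."""
    return pipeline(
        audio,
        task="transcribe",
        language=language,  # assumption: same keyword as in the issue snippet
        return_timestamps=return_timestamps,
    )


# Possible usage with a real pipeline object:
#   outputs = transcribe_in_language(pipeline, "audio.mp3", "hi")

# stand-in pipeline that just echoes its arguments, for a dry run
def fake_pipeline(audio, **kwargs):
    return {"audio": audio, **kwargs}


print(transcribe_in_language(fake_pipeline, "audio.mp3", "hi"))
```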

Estimated JIT time on a Colab Premium GPU

What is the estimated first JIT compile time on a Colab Premium GPU (A100)? I'm talking about the code right below this line:

# JIT compile the forward call - slow, but we only do once

Single ckpt file to use as a local checkpoint

Hey,

I appreciate your work, it is amazing. I want to use a model that I have created with the .ckpt extension. I found issue #17, where you answered:

Download the entire repository to your local system, and then pass the path to this folder. E.g. if I cloned [this checkpoint](https://huggingface.co/sanchit-gandhi/whisper-small-hi) into a folder called whisper-small-hi, I would pass ./whisper-small-hi

I do not have any folder for my ckpt file. The model is only that file which is larger than 10GB. When I try to pass that file at:

cc.initialize_cache("./jax_cache")
checkpoint = "my_checkpoint.ckpt"
BATCH_SIZE = 16
CHUNK_LENGTH_S = 30
NUM_PROC = 8
FILE_LIMIT_MB = 1000
YT_ATTEMPT_LIMIT = 3

It produces the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

Any help would be great. Thanks a lot in advance.

No GPU/TPU found

I have an RTX 4090, and running

import json

import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline


def transcribe_70():
    # instantiate pipeline
    pipeline = FlaxWhisperPipline("openai/whisper-large-v2", batch_size=16, dtype=jnp.bfloat16)

    outputs = pipeline("audio.mp3", task="transcribe", return_timestamps=True)
    # used cached function thereafter - super fast!!
    with open("output70.json", "w") as f:
        f.write(json.dumps(outputs))


if __name__ == '__main__':
    transcribe_70()

gives me:

2023-04-21 08:22:07.870777: I external/xla/xla/service/service.cc:168] XLA service 0x56074ee3c980 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2023-04-21 08:22:07.870792: I external/xla/xla/service/service.cc:176]   StreamExecutor device (0): Interpreter, <undefined>
2023-04-21 08:22:07.873147: I external/xla/xla/pjrt/tfrt_cpu_pjrt_client.cc:218] TfrtCpuClient created.
2023-04-21 08:22:07.873292: I external/xla/xla/stream_executor/tpu/tpu_initializer_helper.cc:269] Libtpu path is: libtpu.so
2023-04-21 08:22:07.873367: I external/xla/xla/stream_executor/tpu/tpu_initializer_helper.cc:277] Failed to open libtpu: libtpu.so: cannot open shared object file: No such file or directory
2023-04-21 08:22:07.873389: I external/xla/xla/stream_executor/tpu/tpu_platform_interface.cc:73] No TPU platform found.
No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Using whisper in the same virtual env works with the GPU.
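
A quick way to confirm which backend JAX actually selected is to list its devices. The sketch below uses jax.devices() and jax.default_backend(); the helper name is hypothetical. If it reports cpu on a machine with a GPU, one common cause is a CPU-only jaxlib build, in which case reinstalling JAX with CUDA support following the official installation guide may help.

```python
def describe_backend():
    """Summarise the backend and devices that JAX will use in this process."""
    try:
        import jax
    except ImportError:
        return "jax is not installed"
    devices = jax.devices()
    kinds = ", ".join(sorted({device.device_kind for device in devices}))
    return f"backend={jax.default_backend()} devices={len(devices)} ({kinds})"


print(describe_backend())
```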

Model not providing an accurate transcription, mixing in some other language

I tried to get transcriptions for a video from David Silver's reinforcement learning playlist on YouTube.
The model was able to generate very good transcriptions at some timestamps, but at many timestamps it generates transcriptions in a language other than English. I haven't changed any settings; I just pasted the URL of the video and clicked on transcribe. The result was out in 23.4 seconds but wasn't accurate.

For more information, please have a look at the image I'm attaching below:

In the image, you can clearly observe that the model is generating transcriptions in another language, even though English is asked for. Some parts were in English, and other parts in some other language.

Error when running on Google Colab (TPU)

Hey,

I'm assuming this is a JAX issue, but I'm getting the following errors when trying to run the notebook on Google Colab:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-308fe9e13fe9> in <cell line: 1>()
----> 1 from whisper_jax import FlaxWhisperPipline
      2 import jax.numpy as jnp
      3 
      4 pipeline = FlaxWhisperPipline("openai/whisper-medium", dtype=jnp.bfloat16, batch_size=16)

4 frames
/usr/local/lib/python3.9/dist-packages/flax/core/frozen_dict.py in <module>
     48 
     49 
---> 50 @jax.tree_util.register_pytree_with_keys_class
     51 class FrozenDict(Mapping[K, V]):
     52   """An immutable variant of the Python dict."""

AttributeError: module 'jax.tree_util' has no attribute 'register_pytree_with_keys_class'

I've already tried the hints mentioned on JAX's GitHub page, but with no success:

# tpu
import jax.tools.colab_tpu
jax.tools.colab_tpu.setup_tpu()

!pip install "jax<=0.3.25" "jaxlib<=0.3.25"

# gpu
import jax
jax.devices()

using command line

Is there any way to use the command line for this project? I'm looking for an example, thanks.

[Suggestion]Please add a requirements.txt file for smooth installation

I have just started working with this awesome repository. One way to improve the user experience would be to create a requirements.txt file to install the required frameworks for this repository to work.

The three frameworks that need to be installed are gradio, pytube and transformers.

Bug: Numpy array as input

When a Numpy array is passed in, the model runs fine, but this causes the model to perform poorly because the audio array is not resampled to the appropriate sample rate.

This is fixed by passing a dict with array and sampling_rate keys.
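
The fix described above can be sketched as follows. The key names array and sampling_rate come from the issue text; the helper name as_pipeline_input is hypothetical.

```python
def as_pipeline_input(array, sampling_rate):
    """Bundle raw audio with its sampling rate so the pipeline can resample it."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return {"array": array, "sampling_rate": int(sampling_rate)}


# Possible usage (kept as comments because it needs librosa and the model):
#   audio, sr = librosa.load("test_audio.mp3", sr=None)  # sr=None keeps the native rate
#   text = pipeline(as_pipeline_input(audio, sr))

print(as_pipeline_input([0.0, 0.1, -0.1], 44100))
```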

Could not find TensorRT

Hi there, I got sanchit's example from another issue working, but the speed-up is only 8-10x real-time on an RTX 4090. The GPU is being used at 100%, as far as I can tell from nvtop.
Maybe the following warning is the reason?

2023-04-28 14:27:07.026383: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Compilation:  198.21863865852356
Cached:  173.1455545425415

Performance on M1 chips compared to PyTorch implementation?

Looks like JAX does not support accelerated M1 / Apple Neural Engine.

Curious if anyone has done a benchmark comparison. Reference:

google/jax#8074

openai/whisper#382

is it usable with LoRA fine-tuned whisper?

curious to know if it runs well with a fine-tuned whisper model using PEFT?
is it possible to load it in int8?

No GPU/TPU found, falling back to CPU.

Configuration:
GPU: 3090
Driver Version: 525.105.17
CUDA Version: 12.0

However, the GPU is never found. How can I solve this problem?

Looks like no way to use on windows

I did research and jax is not installable on windows

The only version that I have found that can be installed on windows is jax 0.3.7

But your software is requiring 0.4.7

Any solutions for this?

where is the model cache

Thank you. I want to set the cache directory.

Where to put model in local install

I don't get how I can link to the model on a local install. Should I replace /openai/largev2/ by the path of my model on the disk?
And should I download everything from the folder on Hugging Face, or should I just download the Flax file?

Jax installation Consulting

Hi,
I'm glad to have discovered this project, and after hearing how large the speed-up can be, I can't wait to give it a try.

Can JAX only be installed on Linux?

Please forgive my poor English; the above is machine translated.

ability to provide initial_prompt

The original whisper model can take an initial_prompt value to improve accuracy of the transcript. Is this possible in this improved version of whisper? It really helps a lot for context words.

OpenAI Whisper medium-model error while processing timestamps

I am getting the following error when using "openai/whisper-medium" model with timestamp prediction:
There was an error while processing timestamps, we haven't found a timestamp as last token. Was WhisperTimeStampLogitsProcessor used?
This error comes from "transformers/models/whisper/tokenization_whisper.py" line 885. The generated tokens do not include any timestamps, except for the first one (0.0).

I have tested to use audios of different length (1min to 1h) and different parameters (half-precision, stride) and always the same error occurs. On the other hand, with the base-model and large-v2-model this error does not occur.

Code:

model = "openai/whisper-medium"
whisper = FlaxWhisperPipline(model, dtype=jnp.float32)
res: dict = whisper(audio_file, stride_length_s=0.0, language="es", return_timestamps=True)

My computer:

Python 3.8.10
OS: Ubuntu 20.04 LTS 64-bit (WSL on Windows 11)
CPU: 12th Gen Intel® Core™ i7-12700
GPU: Nvidia RTX 3060
RAM: 32.0 GB

Huggingface model

What is the huggingface model? Not the space, the model

ValueError: Received incompatible devices for pjitted computation

Awesome repo! I have one question tho: Whenever I try running this code on my own TPU-v4-8, I get the following error:

WARNING:absl:Tiling device assignment mesh by hosts, which may lead to reduced XLA collective performance. To avoid this, modify the model parallel submesh or run with more tasks per host.
Traceback (most recent call last):
  File "fastapi_app.py", line 17, in <module>
    pipeline.shard_params()
  File "/root/ai/whisper-jax/whisper_jax/pipeline.py", line 127, in shard_params
    self.params = p_shard_params(freeze(self.params))
  File "/root/ai/whisper-jax/whisper_jax/partitioner.py", line 787, in __call__
    return self._pjitted_fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/traceback_util.py", line 166, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/pjit.py", line 238, in cache_miss
    outs, out_flat, out_tree, args_flat = _python_pjit_helper(
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/pjit.py", line 193, in _python_pjit_helper
    raise ValueError(msg) from None
jax._src.traceback_util.UnfilteredStackTrace: ValueError: Received incompatible devices for pjitted computation. Got argument params['model']['decoder']['embed_positions']['embedding'] of FlaxPreTrainedModel.to_bf16 with shape float32[448,1280] and device ids [0] on platform CPU and pjit's devices with device ids [0, 2, 1, 3] on platform TPU

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "fastapi_app.py", line 17, in <module>
    pipeline.shard_params()
  File "/root/ai/whisper-jax/whisper_jax/pipeline.py", line 127, in shard_params
    self.params = p_shard_params(freeze(self.params))
  File "/root/ai/whisper-jax/whisper_jax/partitioner.py", line 787, in __call__
    return self._pjitted_fn(*args)
ValueError: Received incompatible devices for pjitted computation. Got argument params['model']['decoder']['embed_positions']['embedding'] of FlaxPreTrainedModel.to_bf16 with shape float32[448,1280] and device ids [0] on platform CPU and pjit's devices with device ids [0, 2, 1, 3] on platform TPU

Any idea how I can fix it?

How to save txt, vtt and srt outputs? How to set beam_size, initial_prompt, best_of and other parameters?

I have checked the main page and the Kaggle notebook, and there is no example of these.

With regular Whisper I was doing the following.

How can I do the same with Whisper JAX?

result = model.transcribe(
    "../input/whisper2/lecture_" + str(lectureId) + ".mp3",
    language="en",
    beam_size=10,
    initial_prompt="Welcome to the Software Engineering Courses channel.",
    best_of=10,
    verbose=True,
    temperature=0.0,
)

# save SRT
language = result["language"]
sub_name = f"/kaggle/working/lecture_{lectureId}.srt"
with open(sub_name, "w", encoding="utf-8") as srt:
    write_srt(result["segments"], file=srt)

# Save output: map each file extension to its writer
writing_lut = {
    '.txt': whisper.utils.write_txt,
    '.vtt': whisper.utils.write_vtt,
    '.srt': whisper.utils.write_srt,
}
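
For the SRT part of the question, one possible approach is to format the timestamped chunks yourself. The sketch below assumes the pipeline was called with return_timestamps=True and returned a dict whose "chunks" entry is a list of {"timestamp": (start, end), "text": ...} items with numeric start and end values. The helper names format_srt_time and chunks_to_srt are hypothetical, and the example data is made up rather than real model output.

```python
def format_srt_time(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks):
    """Render [{'timestamp': (start, end), 'text': ...}, ...] as SRT text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Made-up output in the assumed shape; real values come from the pipeline.
example = {"chunks": [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 63.042), "text": " Welcome to the course."},
]}
srt_text = chunks_to_srt(example["chunks"])

# with open("lecture.srt", "w", encoding="utf-8") as f:
#     f.write(srt_text)
```

This only covers subtitle output; it does not address beam_size, initial_prompt or best_of.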

Add train code

Thanks for your nice project. OpenAI has not open-sourced the Whisper training code. Could this project implement it? When I use the large-v2 model, it often outputs YouTube video advertisements, which points to a problem with the training data. I want to train a model on clean data. The problem is discussed here:
openai/whisper#928

Enhancement: Add the WebUI for kaggle

Slower than faster-whisper (2x)

Hi! I am running on WSL2 with an RTX 3090.
I've noticed that faster-whisper runs about twice as fast on my 16k sampled 30s audio clip.

Is that to be expected or did I do something wrong with my JAX installation?
whisper-jax takes about 10s (once cached), while faster-whisper takes 5.1s

I set the faster-whisper beam_size to 1, is there an equivalent setting for whisper-jax?

AttributeError: module 'jax.tree_util' has no attribute 'register_pytree_with_keys_class'

what version of jax and jaxlib works with this?

[Bug]Invalid URL 'None': No schema supplied

This issue occurs when I provide a Youtube link. I'm on Windows 11 (Python 3.10.6) using command python app.py

Traceback (most recent call last):
  File "/home/ethan/.local/lib/python3.10/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/ethan/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "/home/ethan/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1025, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/ethan/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/ethan/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/ethan/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/mnt/c/Users/rosha/whisper-jax/app/app.py", line 185, in transcribe_youtube
    text, runtime = tqdm_generate(inputs, task=task, return_timestamps=return_timestamps, progress=progress)
  File "/mnt/c/Users/rosha/whisper-jax/app/app.py", line 126, in tqdm_generate
    model_outputs.append(forward(batch, task=task, return_timestamps=return_timestamps))
  File "/mnt/c/Users/rosha/whisper-jax/app/app.py", line 69, in forward
    outputs = chunked_query(
  File "/mnt/c/Users/rosha/whisper-jax/app/app.py", line 62, in chunked_query
    response = requests.post(API_URL_FROM_FEATURES, json=payload)
  File "/home/ethan/.local/lib/python3.10/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/home/ethan/.local/lib/python3.10/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ethan/.local/lib/python3.10/site-packages/requests/sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "/home/ethan/.local/lib/python3.10/site-packages/requests/sessions.py", line 456, in prepare_request
    p.prepare(
  File "/home/ethan/.local/lib/python3.10/site-packages/requests/models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "/home/ethan/.local/lib/python3.10/site-packages/requests/models.py", line 390, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?

Speaker diarization?

Is there a recommended method to implement speaker diarization with this whisper solution?

KeyError: 'tokens'

Hello @sanchit-gandhi, thanks for sharing this repo.
I installed all the dependencies and ran this command in terminal 1: bash launch_app.sh
In terminal 2 I ran: API_URL=http://0.0.0.0:8000/generate/ API_URL_FROM_FEATURES=http://0.0.0.0:8000/gnerate_from_features/ python app.py
When I submit a YouTube URL, I get this error:

File "/home/ubuntu/whisper-jax/app/app.py", line 72, in forward
    outputs["tokens"] = np.asarray(outputs["tokens"])
KeyError: 'tokens'

Complete error

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/whisper/lib/python3.9/site-packages/gradio/routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/ubuntu/anaconda3/envs/whisper/lib/python3.9/site-packages/gradio/blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "/home/ubuntu/anaconda3/envs/whisper/lib/python3.9/site-packages/gradio/blocks.py", line 1025, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/ubuntu/anaconda3/envs/whisper/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/ubuntu/anaconda3/envs/whisper/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/ubuntu/anaconda3/envs/whisper/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/ubuntu/whisper-jax/app/app.py", line 185, in transcribe_youtube
    text, runtime = tqdm_generate(inputs, task=task, return_timestamps=return_timestamps, progress=progress)
  File "/home/ubuntu/whisper-jax/app/app.py", line 126, in tqdm_generate
    model_outputs.append(forward(batch, task=task, return_timestamps=return_timestamps))
  File "/home/ubuntu/whisper-jax/app/app.py", line 72, in forward
    outputs["tokens"] = np.asarray(outputs["tokens"])
KeyError: 'tokens'

Words timestamps [HELP]

I'm not able to get a transcription with word-level timestamps, only sentence-level timestamps.

Is this possible with whisper-jax?

Thanks

CUDA out of memory

Trying to load medium or large model, I get out of memory errors. Loading small with float16 precision works but takes all my 24 GB VRAM. Is there any way to limit Jax memory usage? The OpenAI model is far more modest in its requirements. Reducing the model weights to float16 should be a good idea too.
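
On the question of limiting memory: JAX preallocates a large share of GPU memory by default, which may explain why even the small model appears to fill the card. A minimal sketch using the XLA client environment variables follows. Whether this resolves the out-of-memory errors for the medium and large checkpoints is not something this sketch can confirm.

```python
import os

# These must be set before JAX is imported for the first time.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"  # allocate on demand

# Alternatively, keep preallocation but cap its share of GPU memory:
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.50"

# import jax  # import only after the environment is configured
```

Disabling preallocation trades some allocation speed and fragmentation risk for a lower and more readable memory footprint.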

ValueError: ffmpeg was not found but is required to load audio files from filename

I executed the command python app.py and provided a YouTube video link through the web interface, but received the following error message:

Traceback (most recent call last):
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\transformers\pipelines\audio_utils.py", line 34, in ffmpeg_read
    with subprocess.Popen(ffmpeg_command, stdin=subprocess.PIPE, stdout=subprocess.PIPE) as ffmpeg_process:
  File "C:\Users\rosha\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\rosha\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\blocks.py", line 1302, in process_api
    result = await self.call_function(
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\blocks.py", line 1025, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\app\app.py", line 183, in transcribe_youtube
    inputs = ffmpeg_read(inputs, processor.feature_extractor.sampling_rate)
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\transformers\pipelines\audio_utils.py", line 37, in ffmpeg_read
    raise ValueError("ffmpeg was not found but is required to load audio files from filename") from error
ValueError: ffmpeg was not found but is required to load audio files from filename

I have added ffmpeg to the PATH and also installed ffmpeg-python, but the issue persists.

If I select the Microphone tab, record audio and click submit, I get the following error:

C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\pydub\utils.py:198: RuntimeWarning: Couldn't find ffprobe or avprobe - defaulting to ffprobe, but may not work
  warn("Couldn't find ffprobe or avprobe - defaulting to ffprobe, but may not work", RuntimeWarning)
Traceback (most recent call last):
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\processing_utils.py", line 138, in audio_from_file
    audio = AudioSegment.from_file(filename)
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\pydub\audio_segment.py", line 728, in from_file
    info = mediainfo_json(orig_file, read_ahead_limit=read_ahead_limit)
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\pydub\utils.py", line 274, in mediainfo_json
    res = Popen(command, stdin=stdin_parameter, stdout=PIPE, stderr=PIPE)
  File "C:\Users\rosha\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\rosha\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\routes.py", line 401, in run_predict
    output = await app.get_blocks().process_api(
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\blocks.py", line 1300, in process_api
    inputs = self.preprocess_data(fn_index, inputs, state)
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\blocks.py", line 1148, in preprocess_data
    processed_input.append(block.preprocess(inputs[i]))
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\components.py", line 2425, in preprocess
    sample_rate, data = processing_utils.audio_from_file(
  File "C:\Users\rosha\Downloads\Compressed\whisper-jax-main\whisper\lib\site-packages\gradio\processing_utils.py", line 148, in audio_from_file
    raise RuntimeError(msg) from e
RuntimeError: Cannot load audio from file: `ffprobe` not found. Please install `ffmpeg` in your system to use non-WAV audio file formats and make sure `ffprobe` is in your PATH.

Model distillation

Hi, thanks for the JAX code. Are there any plans for distilling the existing/original model?

ffmpeg was not found but is required to load audio files from filename

While running on Kaggle I encounter this error:
ffmpeg was not found but is required to load audio files from filename.
Code:

import os

import docx
from tqdm import tqdm

def process_doc(file):
    wav_path = os.path.join("/kaggle/input/upwork-calls/CSG_CALLS", f"{file}")
    doc_path = os.path.join("/kaggle/working/docs", f"doc_{file}")

    # Create the output folder for this batch of calls if needed
    if not os.path.exists(doc_path):
        os.mkdir(doc_path)

    # Transcribe each audio file and save the text as a .docx document
    for files in tqdm(os.listdir(wav_path)):
        filename = files.split(".")[0] + ".docx"
        result = pipeline(os.path.join(wav_path, files), task="transcribe")
        mydoc = docx.Document()
        mydoc.add_paragraph(result['text'])
        mydoc.save(os.path.join(doc_path, filename))
    print("------------------- Saved in path ---------------- : ", doc_path)

I tried to install ffmpeg using

!apt-get install -y ffmpeg > /dev/null

It failed with the error: E: Package 'ffmpeg' has no installation candidate
Can anyone please help me with this issue?

on mac m2 this issue locally - for longer videos (english and hindi both)

  File "/code/code.py", line 82, in <module>
    result = transcribe(video_converted,language)
  File "/code/codeTranscript.py", line 10, in transcribe
    return transcribe_jax(audio,language=None)
  File "/code/codeTranscript.py", line 25, in transcribe_jax
    pipeline = FlaxWhisperPipline("models/whisper/large-v2.pt", batch_size=8)
  File "/code/venv-3.10/lib/python3.10/site-packages/whisper_jax/pipeline.py", line 84, in __init__
    self.processor = WhisperProcessor.from_pretrained(self.checkpoint)
  File "/code/venv-3.10/lib/python3.10/site-packages/transformers/processing_utils.py", line 184, in from_pretrained
    args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/code/venv-3.10/lib/python3.10/site-packages/transformers/processing_utils.py", line 228, in _get_arguments_from_pretrained
    args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
  File "/code/venv-3.10/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 329, in from_pretrained
    feature_extractor_dict, kwargs = cls.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
  File "/code/venv-3.10/lib/python3.10/site-packages/transformers/feature_extraction_utils.py", line 457, in get_feature_extractor_dict
    text = reader.read()
  File "/opt/homebrew/Cellar/python@3.10/3.10.11/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

How to choose language for translation and transcription?

Let's say I want to translate English into German, for example.
How can I select the language for both translation and transcription?

What should be the AWS Machine configuration for Whisper Large model deployment

Thank you, @sanchit-gandhi, for your fantastic work. I would appreciate your opinion on configuring an AWS machine for deploying Hugging Face's Whisper large model (JAX version) and data storage for both audio and streamed textual data.

My end goal is to deploy a model with streaming output, but for now I am setting up the current model without streaming functionality. What would be the optimal AWS configuration, taking the future scope of the project into account?

If I decide to use your Whisper version, what would be the best configuration for large, taking into account the future streaming component?
If I choose to use other implementations with streaming support, what would be the optimal configuration for large?

How can we extract the logits from audio?

Is there anyone who uses whisper-jax to extract logits from audio?

cannot import name 'dot_product_attention_weights' from 'flax.linen.attention'

Windows 10
Miniconda3
Python3.9
jaxlib-0.3.25
jax-0.3.25
numpy-1.20.3

When I try to import using: from whisper_jax import FlaxWhisperPipline

I get this error. I am new to JAX, so any help is welcome.

cannot import name 'dot_product_attention_weights' from 'flax.linen.attention'

The model is not fast compared to transformers Whisper

Hi,
I couldn't get faster results: in my tests, the Transformers Whisper implementation is faster than the JAX implementation.

SystemInfo

jax ==0.4.8
jaxlib==0.4.7+cuda11.cudnn82
transformers==4.28.1
CUDA Version: 11.7
Python 3.9.16
GPU: RTX 3090 Ti

Transformers Implementation:

from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)
# sound_16k holds the 16 kHz audio input
text = pipe(
    sound_16k,
    return_timestamps=True,
    chunk_length_s=30.0,
    stride_length_s=[6, 0],
    batch_size=8,
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

JAX Implementation:

from whisper_jax import FlaxWhisperPipline
import jax.numpy as jnp

MODEL_NAME = "openai/whisper-large-v2"

pipeline = FlaxWhisperPipline(MODEL_NAME, dtype=jnp.float16)
# sound_16k holds the 16 kHz audio input
text = pipeline(
    sound_16k,
    return_timestamps=True,
    chunk_length_s=30.0,
    stride_length_s=[6, 0],
    batch_size=8,
    generate_kwargs={"language": "<|tr|>", "task": "transcribe"},
)

I ran this 3-4 times, but the computation time did not decrease.

No GPU detected, but stock OpenAI/whisper does (WSL2)

Hi! As the title says, my GPU is not being recognized: "No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)" However, any other CUDA code (including OpenAI/whisper) does detect my GPU.

Thank you for the help!
