Comments (11)
Btw, please copy-paste text rather than posting screenshots. It's easier to read and searchable. You can wrap it in triple backticks so it's formatted as code.
like this
from whisper_streaming.
You mean a tokenizer for sentence segmentation, right?
- wrap it in an object that has a split function that works like the one of MosesTokenizer
- pass it to OnlineASRProcessor
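The two steps above can be sketched as follows. The exact interface is assumed from the comment (an object with a split method, like MosesTokenizer's); segment_fn is a placeholder for any sentence segmenter, and the period-based one below is for illustration only:

```python
class SentenceSplitterWrapper:
    """Adapt any sentence-segmentation function to an object with a
    split() method, like MosesTokenizer provides.
    A sketch, not the actual whisper_streaming API."""

    def __init__(self, segment_fn):
        self.segment_fn = segment_fn  # callable: text -> list of sentences

    def split(self, text):
        return self.segment_fn(text)

# usage with a naive period-based segmenter, for illustration only
naive = lambda t: [s.strip() + "." for s in t.split(".") if s.strip()]
splitter = SentenceSplitterWrapper(naive)
print(splitter.split("Hello world. Second sentence."))
# → ['Hello world.', 'Second sentence.']
```

You would then pass such an object to OnlineASRProcessor in place of the Moses tokenizer.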
from whisper_streaming.
Thank you for your quick response.
Actually no. My question is: how can I use a Hugging Face-formatted tokenizer?
I have a Hugging Face-formatted tokenizer with files such as vocab.json, vocabulary.json, added_tokens.json, normalizer.json, special_tokens_map.json, and merges.txt, along with my CTranslate2-converted Whisper model. However, faster-whisper only accepts a tokenizer.json file. How can I use my Hugging Face-formatted tokenizer? Is it possible to convert it to tokenizer.json?
Thank you
from whisper_streaming.
I used convert_slow_tokenizer to convert it to tokenizer.json:
from transformers import WhisperTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

TOKENIZER_DIR = "medium-tokenizer"
tokenizer = WhisperTokenizer.from_pretrained(TOKENIZER_DIR,
                                             language="Mongolian",
                                             task="transcribe", use_fast=False)
fast_tokenizer = convert_slow_tokenizer(tokenizer)
fast_tokenizer.save("tokenizer.json")
but I encountered the following error:
python3 whisper_online.py converted_file.wav --model_dir whisper_model_ct --language mn --min-chunk-size 1 > out.txt
Audio duration is: 8.58 seconds
Loading Whisper medium model for mn... done. It took 1.98 seconds.
/home/khangal/.local/lib/python3.10/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
warnings.warn(
Traceback (most recent call last):
File "/home/khangal/PycharmProjects/whisper_streaming/whisper_online.py", line 547, in <module>
asr.transcribe(a)
File "/home/khangal/PycharmProjects/whisper_streaming/whisper_online.py", line 123, in transcribe
return list(segments)
File "/home/khangal/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 452, in generate_segments
) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
File "/home/khangal/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 660, in generate_with_fallback
result = self.model.generate(
TypeError: generate(): incompatible function arguments. The following argument types are supported:
1. (self: ctranslate2._ext.Whisper, features: ctranslate2._ext.StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_no_speech_prob: bool = False, max_initial_timestamp_index: int = 50, suppress_blank: bool = True, suppress_tokens: Optional[List[int]] = [-1], sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[ctranslate2._ext.WhisperGenerationResult], List[ctranslate2._ext.WhisperGenerationResultAsync]]
Invoked with: <ctranslate2._ext.Whisper object at 0x7f5a41c326b0>, <ctranslate2._ext.StorageView object at 0x7f5a06402fb0>, [[None, 220, None]]; kwargs: length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, max_length=448, return_scores=True, return_no_speech_prob=True, suppress_blank=True, suppress_tokens=[-1], max_initial_timestamp_index=50, beam_size=5, patience=1
My model_dir contains model.bin and tokenizer.json files.
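For what it's worth, the `[[None, 220, None]]` prompt in the invocation above suggests the converted tokenizer could not map the prompt's special tokens (e.g. <|startoftranscript|> or the language token) to ids. A stdlib-only diagnostic sketch that checks the converted tokenizer.json for those entries (field names follow the Hugging Face tokenizers JSON format; this is not part of faster-whisper):

```python
import json

def missing_special_tokens(tokenizer_json_text, specials):
    """Return the special tokens that tokenizer.json does not declare
    in its added_tokens list (those would map to None/unknown ids)."""
    data = json.loads(tokenizer_json_text)
    known = {t["content"] for t in data.get("added_tokens", [])}
    return [s for s in specials if s not in known]

# tiny in-memory example instead of reading a real file
doc = json.dumps(
    {"added_tokens": [{"id": 50258, "content": "<|startoftranscript|>"}]}
)
print(missing_special_tokens(doc, ["<|startoftranscript|>", "<|mn|>"]))
# → ['<|mn|>']
```

If any required special token is reported missing, the conversion dropped it, which would explain the None entries in the prompt.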
from whisper_streaming.
Sorry, I don't know. But I think that a recent version of faster-whisper was not working with a model that I downloaded from Hugging Face long ago. Did you try to reinstall it?
from whisper_streaming.
If we use faster-whisper as the backend, it takes the tokenizer.json file from the directory where the model is located. If there isn't a tokenizer.json file in that directory, it defaults to using the one from openai/whisper-tiny.en, as specified in , right?
If so, I want to pass my own tokenizer.json, but I currently only have a Hugging Face-formatted tokenizer. Do I need to convert it to tokenizer.json? Actually, my tokenizer works well when used through the transformers WhisperTokenizer.
I decided to convert our current Hugging Face-formatted tokenizer to tokenizer.json, but whisper_online.py raised the error above when it attempted to load it.
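The fallback behavior described above can be sketched like this (the logic is paraphrased from this discussion, not lifted from the faster-whisper source; the repo name openai/whisper-tiny.en is as mentioned above):

```python
import os

def pick_tokenizer(model_dir):
    """Use tokenizer.json next to the model if it exists, otherwise
    fall back to the openai/whisper-tiny.en tokenizer (sketch)."""
    local = os.path.join(model_dir, "tokenizer.json")
    if os.path.isfile(local):
        return local
    return "openai/whisper-tiny.en"  # would be fetched from the Hub

print(pick_tokenizer("/nonexistent-dir"))
# → openai/whisper-tiny.en
```

So as long as a valid tokenizer.json sits next to model.bin, it should be the one used.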
from whisper_streaming.
sklearn tells you: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.1. Maybe downgrade your sklearn?
You can also check whether you are able to run faster-whisper in offline mode. If not, consult its authors. If yes, it should work with whisper-streaming. As in #27.
Moreover, I don't understand why you have your own tokenizer. If it is part of your fine-tuned model, then please follow #27.
Good luck!
from whisper_streaming.
Yes, I fine-tuned the Whisper model for my language. whisper_online works well if the model directory doesn't contain any tokenizer, but then the result is not what we wanted, as shown below. Therefore, I believe I have to use my own tokenizer, just like when I trained the Whisper model.
## last processed 10.54 s, now is 11.09, the latency is 0.55
11091.6247 0 7720 loseiny бол fem�ushing option том heavy�uel zero простоring birth fold tom
from whisper_streaming.
Also, I tried injecting some custom lines into your process_iter function. It works, but the result only appears via the print statement.
def process_iter(self):
    """Runs on the current audio buffer.
    Returns: a tuple (beg_timestamp, end_timestamp, "text"), or (None, None, "").
    The non-empty text is confirmed (committed) partial transcript.
    """
    prompt, non_prompt = self.prompt()
    res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
    # transform to [(beg,end,"word1"), ...]
    # load the custom tokenizer (better done once in __init__, not on every call)
    tokenizer = WhisperTokenizer.from_pretrained('whisper_model_ct',
                                                 language="Mongolian",
                                                 task="transcribe")
    print("TS:", res, file=self.logfile)
    if len(res) > 0:
        segment = res[0]
        # decode the whole token id list in one call; decoding ids one by
        # one and concatenating breaks multi-byte UTF-8 characters into "�"
        text = tokenizer.decode(segment.tokens)
        print(text, file=self.logfile)
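Incidentally, the "�" characters in the garbled output earlier are typical of decoding byte-level BPE tokens one id at a time: Cyrillic letters span two UTF-8 bytes, and a single token can end mid-character. A minimal stdlib illustration of the effect, with raw bytes standing in for token fragments:

```python
text = "бол"                      # 3 Cyrillic letters, 6 UTF-8 bytes
raw = text.encode("utf-8")

# decoding each byte fragment separately, as per-id decoding can do:
per_fragment = "".join(
    bytes([b]).decode("utf-8", errors="replace") for b in raw
)
print(per_fragment)               # → ������

# decoding the whole sequence at once recovers the text:
print(raw.decode("utf-8"))        # → бол
```

This is why passing the whole segment.tokens list to a single tokenizer.decode(...) call is preferable to concatenating per-id decodes.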
from whisper_streaming.
Maybe you can make the faster-whisper transcribe function return token ids instead of tokens, and then run your own tokenizer on them?
from whisper_streaming.
I'm closing it because it is an issue of your extension code, and because of inactivity. Good luck!
from whisper_streaming.