
Comments (11)

Gldkslfmsd commented on July 24, 2024

By the way, please copy-paste text rather than posting screenshots. It's more readable and searchable. You can wrap it in triple backticks so it's formatted as code,

like this


Gldkslfmsd commented on July 24, 2024

A tokenizer for sentence segmentation, right?

  1. Wrap it in an object that has a split function working like the one of MosesTokenizer (see the sketch below this list).
  2. Pass it to OnlineASRProcessor.
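
For example, a minimal sketch of such a wrapper (SentenceSplitterWrapper and my_segmenter are hypothetical names, not part of whisper_streaming):

# hypothetical wrapper: expose any sentence segmenter through a
# Moses-style split() method, which is what OnlineASRProcessor calls
class SentenceSplitterWrapper:
    def __init__(self, segmenter):
        # segmenter: any callable mapping a text string to a list of sentences
        self.segmenter = segmenter

    def split(self, text):
        return self.segmenter(text)

# usage sketch, assuming asr is already set up as in whisper_online.py:
# online = OnlineASRProcessor(asr, SentenceSplitterWrapper(my_segmenter))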


kenaii commented on July 24, 2024

Thank you for your quick response.

Actually no. My question is: how can I use a Hugging Face-formatted tokenizer?

I have a Hugging Face-formatted tokenizer, with files such as vocab.json, vocabulary.json, added_tokens.json, normalizer.json, special_tokens_map.json, and merges.txt, along with my ct2-converted Whisper model. However, faster-whisper only takes a tokenizer.json file. How can I use my Hugging Face-formatted tokenizer? Is it possible to convert it to tokenizer.json?

Thank you


kenaii commented on July 24, 2024

I used convert_slow_tokenizer to convert it to tokenizer.json

from transformers import WhisperTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

TOKENIZER_DIR = "medium-tokenizer"
tokenizer = WhisperTokenizer.from_pretrained(TOKENIZER_DIR,
                                             language="Mongolian",
                                             task="transcribe", use_fast=False)
# convert the slow tokenizer to a fast one, then save tokenizer.json
fast_tokenizer = convert_slow_tokenizer(tokenizer)
fast_tokenizer.save("tokenizer.json")
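
If convert_slow_tokenizer gives trouble, another option worth trying (assuming transformers ships a fast Whisper tokenizer class that can read these files) is to load the fast class directly and let save_pretrained write tokenizer.json:

from transformers import WhisperTokenizerFast

fast = WhisperTokenizerFast.from_pretrained("medium-tokenizer",
                                            language="Mongolian",
                                            task="transcribe")
# save_pretrained writes tokenizer.json next to the other tokenizer files
fast.save_pretrained("medium-tokenizer")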

but encountered the following error.

python3 whisper_online.py converted_file.wav --model_dir whisper_model_ct --language mn --min-chunk-size 1 > out.txt
Audio duration is: 8.58 seconds
Loading Whisper medium model for mn... done. It took 1.98 seconds.
/home/khangal/.local/lib/python3.10/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Traceback (most recent call last):
  File "/home/khangal/PycharmProjects/whisper_streaming/whisper_online.py", line 547, in <module>
    asr.transcribe(a)
  File "/home/khangal/PycharmProjects/whisper_streaming/whisper_online.py", line 123, in transcribe
    return list(segments)
  File "/home/khangal/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 452, in generate_segments
    ) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
  File "/home/khangal/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 660, in generate_with_fallback
    result = self.model.generate(
TypeError: generate(): incompatible function arguments. The following argument types are supported:
    1. (self: ctranslate2._ext.Whisper, features: ctranslate2._ext.StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_no_speech_prob: bool = False, max_initial_timestamp_index: int = 50, suppress_blank: bool = True, suppress_tokens: Optional[List[int]] = [-1], sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[ctranslate2._ext.WhisperGenerationResult], List[ctranslate2._ext.WhisperGenerationResultAsync]]

Invoked with: <ctranslate2._ext.Whisper object at 0x7f5a41c326b0>, <ctranslate2._ext.StorageView object at 0x7f5a06402fb0>, [[None, 220, None]]; kwargs: length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, max_length=448, return_scores=True, return_no_speech_prob=True, suppress_blank=True, suppress_tokens=[-1], max_initial_timestamp_index=50, beam_size=5, patience=1


My model_dir contains model.bin and tokenizer.json files.
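
One thing worth checking, judging from the [[None, 220, None]] prompt in the traceback: faster-whisper builds the prompt by looking up Whisper's special tokens in tokenizer.json, and None there suggests the converted file is missing them. A quick sketch of that check (the token strings follow Whisper's convention; mn is Mongolian):

import tokenizers

tok = tokenizers.Tokenizer.from_file("whisper_model_ct/tokenizer.json")
# each of these should print an integer id; None would reproduce the error above
for t in ("<|startoftranscript|>", "<|mn|>", "<|transcribe|>"):
    print(t, tok.token_to_id(t))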


Gldkslfmsd commented on July 24, 2024

Sorry, I don't know. But I think a recent version of faster-whisper was not working with a model that I had downloaded from Hugging Face long ago. Did you try to reinstall it?


kenaii commented on July 24, 2024

If we use faster-whisper in the backend, it takes the tokenizer.json file from the directory where our model is located. If there isn't a tokenizer.json file in that directory, it defaults to using the one from openai/whisper-tiny.en, as specified in , right?

If so, I want to pass my own tokenizer.json, but I currently only have a Hugging Face-formatted tokenizer. Do I need to convert it to tokenizer.json? My tokenizer works well when used with the transformers WhisperTokenizer.

I decided to convert my current Hugging Face-formatted tokenizer to tokenizer.json, but encountered the error above when whisper_online.py attempted to load it.


Gldkslfmsd commented on July 24, 2024

sklearn is telling you: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.1. Maybe downgrade your sklearn (e.g. pip install scikit-learn==1.2.2)?

You can also check whether you are able to run faster-whisper in offline mode. If not, consult its authors. If yes, it should work with whisper_streaming. See #27.
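
A minimal offline check along these lines, assuming the standard faster-whisper API and the paths from earlier in this thread:

from faster_whisper import WhisperModel

model = WhisperModel("whisper_model_ct")  # local CTranslate2 model directory
segments, info = model.transcribe("converted_file.wav", language="mn")
for s in segments:
    print(s.start, s.end, s.text)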

Moreover, I don't understand why you have your own tokenizer. If it is part of your fine-tuned model, then please follow #27.

Good luck!


kenaii commented on July 24, 2024

Yes, I fine-tuned the Whisper model for my language. whisper_online works when the model directory contains no tokenizer, but then the result is not what we want, as shown below. Therefore, I believe I have to use my own tokenizer, just as when I trained the Whisper model.

## last processed 10.54 s, now is 11.09, the latency is 0.55
11091.6247 0 7720  loseiny бол fem�ushing option том heavy�uel zero простоring birth fold tom


kenaii commented on July 24, 2024

Also, I tried injecting some custom lines into your process_iter function. It works, but the result is only shown via the print statement.

def process_iter(self):
    """Runs on the current audio buffer.
    Returns: a tuple (beg_timestamp, end_timestamp, "text"), or (None, None, "").
    The non-empty text is the confirmed (committed) partial transcript.
    """
    prompt, non_prompt = self.prompt()
    # print("PROMPT:", prompt, file=self.logfile)
    # print("CONTEXT:", non_prompt, file=self.logfile)
    # print(
    #     f"transcribing {len(self.audio_buffer) / self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}",
    #     file=self.logfile)
    res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
    # transform to [(beg,end,"word1"), ...]

    # decode with my own tokenizer (requires: from transformers import WhisperTokenizer;
    # loading it on every call is slow, better to load it once in __init__)
    tokenizer = WhisperTokenizer.from_pretrained('whisper_model_ct',
                                                 language="Mongolian",
                                                 task="transcribe")
    print("TS:", res, file=self.logfile)
    if len(res) > 0:
        segment = res[0]
        tokens_list = segment.tokens
        text = ""
        for token_id in tokens_list:
            text += tokenizer.decode(token_id)
        print(text, file=self.logfile)


Gldkslfmsd commented on July 24, 2024

Maybe you can make the faster-whisper transcribe function return token ids rather than decoded tokens, and then run your own tokenizer on them?
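
A sketch of that idea, assuming faster-whisper's Segment already carries the raw token ids in its tokens field and that the fine-tuned tokenizer loads with WhisperTokenizer:

from faster_whisper import WhisperModel
from transformers import WhisperTokenizer

model = WhisperModel("whisper_model_ct")
tokenizer = WhisperTokenizer.from_pretrained("whisper_model_ct",
                                             language="Mongolian",
                                             task="transcribe")
segments, info = model.transcribe("converted_file.wav", language="mn")
for segment in segments:
    # decode the raw token ids with the custom tokenizer instead of
    # relying on the text string faster-whisper produced
    text = tokenizer.decode(segment.tokens, skip_special_tokens=True)
    print(segment.start, segment.end, text)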


Gldkslfmsd commented on July 24, 2024

I'm closing this because it is an issue in your extension code, and because of inactivity. Good luck!

