
Comments (11)

Gldkslfmsd commented on July 24, 2024

By the way, please copy-paste text rather than posting screenshots. It's more readable and searchable. You can wrap it in triple backticks so it's formatted as code,

like this


Gldkslfmsd commented on July 24, 2024

A tokenizer for sentence segmentation, right?

  1. Wrap it in an object that has a split function working like the one of MosesTokenizer (see the sketch below this list).
  2. Pass it to OnlineASRProcessor.
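
For example, a minimal sketch of such a wrapper (SentenceSplitterWrapper and my_segmenter are hypothetical names, not part of whisper_streaming):

# hypothetical wrapper: expose any sentence segmenter through a
# Moses-style split() method, which is what OnlineASRProcessor calls
class SentenceSplitterWrapper:
    def __init__(self, segmenter):
        # segmenter: any callable mapping a text string to a list of sentences
        self.segmenter = segmenter

    def split(self, text):
        return self.segmenter(text)

# usage sketch, assuming asr is already set up as in whisper_online.py:
# online = OnlineASRProcessor(asr, SentenceSplitterWrapper(my_segmenter))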


kenaii commented on July 24, 2024

Thank you for your quick response.

Actually no. My question is: how can I use a Hugging Face-formatted tokenizer?

I have a Hugging Face-formatted tokenizer, with files such as vocab.json, vocabulary.json, added_tokens.json, normalizer.json, special_tokens_map.json, and merges.txt, along with my ct2-converted Whisper model. However, faster-whisper only takes a tokenizer.json file. How can I use my Hugging Face-formatted tokenizer? Is it possible to convert it to tokenizer.json?

Thank you


kenaii commented on July 24, 2024

I used convert_slow_tokenizer to convert it to tokenizer.json

from transformers import WhisperTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

TOKENIZER_DIR = "medium-tokenizer"
tokenizer = WhisperTokenizer.from_pretrained(TOKENIZER_DIR,
                                             language="Mongolian",
                                             task="transcribe", use_fast=False)
# convert the slow tokenizer to a fast one, then save tokenizer.json
fast_tokenizer = convert_slow_tokenizer(tokenizer)
fast_tokenizer.save("tokenizer.json")
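
If convert_slow_tokenizer gives trouble, another option worth trying (assuming transformers ships a fast Whisper tokenizer class that can read these files) is to load the fast class directly and let save_pretrained write tokenizer.json:

from transformers import WhisperTokenizerFast

fast = WhisperTokenizerFast.from_pretrained("medium-tokenizer",
                                            language="Mongolian",
                                            task="transcribe")
# save_pretrained writes tokenizer.json next to the other tokenizer files
fast.save_pretrained("medium-tokenizer")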

but encountered the following error.

python3 whisper_online.py converted_file.wav --model_dir whisper_model_ct --language mn --min-chunk-size 1 > out.txt
Audio duration is: 8.58 seconds
Loading Whisper medium model for mn... done. It took 1.98 seconds.
/home/khangal/.local/lib/python3.10/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Traceback (most recent call last):
  File "/home/khangal/PycharmProjects/whisper_streaming/whisper_online.py", line 547, in <module>
    asr.transcribe(a)
  File "/home/khangal/PycharmProjects/whisper_streaming/whisper_online.py", line 123, in transcribe
    return list(segments)
  File "/home/khangal/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 452, in generate_segments
    ) = self.generate_with_fallback(encoder_output, prompt, tokenizer, options)
  File "/home/khangal/.local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 660, in generate_with_fallback
    result = self.model.generate(
TypeError: generate(): incompatible function arguments. The following argument types are supported:
    1. (self: ctranslate2._ext.Whisper, features: ctranslate2._ext.StorageView, prompts: Union[List[List[str]], List[List[int]]], *, asynchronous: bool = False, beam_size: int = 5, patience: float = 1, num_hypotheses: int = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, max_length: int = 448, return_scores: bool = False, return_no_speech_prob: bool = False, max_initial_timestamp_index: int = 50, suppress_blank: bool = True, suppress_tokens: Optional[List[int]] = [-1], sampling_topk: int = 1, sampling_temperature: float = 1) -> Union[List[ctranslate2._ext.WhisperGenerationResult], List[ctranslate2._ext.WhisperGenerationResultAsync]]

Invoked with: <ctranslate2._ext.Whisper object at 0x7f5a41c326b0>, <ctranslate2._ext.StorageView object at 0x7f5a06402fb0>, [[None, 220, None]]; kwargs: length_penalty=1, repetition_penalty=1, no_repeat_ngram_size=0, max_length=448, return_scores=True, return_no_speech_prob=True, suppress_blank=True, suppress_tokens=[-1], max_initial_timestamp_index=50, beam_size=5, patience=1


My model_dir contains model.bin and tokenizer.json files.
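
One thing worth checking, judging from the [[None, 220, None]] prompt in the traceback: faster-whisper builds the prompt by looking up Whisper's special tokens in tokenizer.json, and None there suggests the converted file is missing them. A quick sketch of that check (the token strings follow Whisper's convention; mn is Mongolian):

import tokenizers

tok = tokenizers.Tokenizer.from_file("whisper_model_ct/tokenizer.json")
# each of these should print an integer id; None would reproduce the error above
for t in ("<|startoftranscript|>", "<|mn|>", "<|transcribe|>"):
    print(t, tok.token_to_id(t))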


Gldkslfmsd commented on July 24, 2024

Sorry, I don't know. But I think a recent version of faster-whisper was not working with a model that I had downloaded from Hugging Face long ago. Did you try to reinstall it?


kenaii commented on July 24, 2024

If we use faster-whisper in the backend, it takes the tokenizer.json file from the directory where our model is located. If there isn't a tokenizer.json file in that directory, it defaults to using the one from openai/whisper-tiny.en, as specified in , right?

If so, I want to pass my own tokenizer.json, but I currently only have a Hugging Face-formatted tokenizer. Do I need to convert it to tokenizer.json? My tokenizer works well when used with the transformers WhisperTokenizer.

I decided to convert my current Hugging Face-formatted tokenizer to tokenizer.json, but encountered the error above when whisper_online.py attempted to load it.


Gldkslfmsd commented on July 24, 2024

sklearn is telling you: Trying to unpickle estimator LogisticRegression from version 1.2.2 when using version 1.3.1. Maybe downgrade your sklearn (e.g. pip install scikit-learn==1.2.2)?

You can also check whether you are able to run faster-whisper in offline mode. If not, consult its authors. If yes, it should work with whisper_streaming. See #27.
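
A minimal offline check along these lines, assuming the standard faster-whisper API and the paths from earlier in this thread:

from faster_whisper import WhisperModel

model = WhisperModel("whisper_model_ct")  # local CTranslate2 model directory
segments, info = model.transcribe("converted_file.wav", language="mn")
for s in segments:
    print(s.start, s.end, s.text)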

Moreover, I don't understand why you have your own tokenizer. If it is part of your fine-tuned model, then please follow #27.

Good luck!


kenaii commented on July 24, 2024

Yes, I fine-tuned the Whisper model for my language. whisper_online works when the model directory contains no tokenizer, but then the result is not what we want, as shown below. Therefore, I believe I have to use my own tokenizer, just as when I trained the Whisper model.

## last processed 10.54 s, now is 11.09, the latency is 0.55
11091.6247 0 7720  loseiny бол fem�ushing option том heavy�uel zero простоring birth fold tom


kenaii commented on July 24, 2024

Also, I tried injecting some custom lines into your process_iter function. It works, but the result is only shown via the print statement.

def process_iter(self):
    """Runs on the current audio buffer.
    Returns: a tuple (beg_timestamp, end_timestamp, "text"), or (None, None, "").
    The non-empty text is the confirmed (committed) partial transcript.
    """
    prompt, non_prompt = self.prompt()
    # print("PROMPT:", prompt, file=self.logfile)
    # print("CONTEXT:", non_prompt, file=self.logfile)
    # print(
    #     f"transcribing {len(self.audio_buffer) / self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}",
    #     file=self.logfile)
    res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
    # transform to [(beg,end,"word1"), ...]

    # decode with my own tokenizer (requires: from transformers import WhisperTokenizer;
    # loading it on every call is slow, better to load it once in __init__)
    tokenizer = WhisperTokenizer.from_pretrained('whisper_model_ct',
                                                 language="Mongolian",
                                                 task="transcribe")
    print("TS:", res, file=self.logfile)
    if len(res) > 0:
        segment = res[0]
        tokens_list = segment.tokens
        text = ""
        for token_id in tokens_list:
            text += tokenizer.decode(token_id)
        print(text, file=self.logfile)


Gldkslfmsd commented on July 24, 2024

Maybe you can make the faster-whisper transcribe function return token ids rather than decoded tokens, and then run your own tokenizer on them?
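
A sketch of that idea, assuming faster-whisper's Segment already carries the raw token ids in its tokens field and that the fine-tuned tokenizer loads with WhisperTokenizer:

from faster_whisper import WhisperModel
from transformers import WhisperTokenizer

model = WhisperModel("whisper_model_ct")
tokenizer = WhisperTokenizer.from_pretrained("whisper_model_ct",
                                             language="Mongolian",
                                             task="transcribe")
segments, info = model.transcribe("converted_file.wav", language="mn")
for segment in segments:
    # decode the raw token ids with the custom tokenizer instead of
    # relying on the text string faster-whisper produced
    text = tokenizer.decode(segment.tokens, skip_special_tokens=True)
    print(segment.start, segment.end, text)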


Gldkslfmsd commented on July 24, 2024

I'm closing this because it is an issue in your extension code, and because of inactivity. Good luck!

