"Thanks for watching" shows up repeatedly (faster-whisper issue, 6 comments, closed)

ENDERFUN2 commented on September 22, 2024

So, I have this issue: when there is silence in the recorded audio, whisper, instead o…

Comments (6)

arunman1kandan commented on September 22, 2024

Hey @ENDERFUN2,
Assume you have a box full of Legos and want to build a spaceship. To be used, all of the Legos should be loose in the box. However, sometimes a smaller box inside the larger box holds the Legos. This extra box is merely there; it adds no Legos of its own, it only wraps them. The sounddevice.rec function in this code is like getting a Lego box that comes with this additional inner box (the extra dimension). Using squeeze is like opening the large box, tipping the Legos out of the inner box, and throwing the inner box away, so that all you have left to work with is the audio data, or Legos.

This matters because the WhisperModel.transcribe method, which you use to build the spaceship, only works with loose Legos, not with boxes inside boxes. Squeezing removes the excess box so that everything works as intended.

This is how I understand squeeze works.
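In NumPy terms, the inner box is an extra axis of length 1. Here is a minimal illustration. It uses a zero-filled array as a stand-in for what sd.rec returns with channels=1, so it runs without a microphone.

```python
import numpy as np

# Stand-in for sd.rec(..., channels=1): 10 s at 16 kHz, mono.
# Zeros instead of a real recording, so no audio device is needed.
sample_rate = 16000
duration = 10
recording = np.zeros((int(sample_rate * duration), 1), dtype=np.float32)
print(recording.shape)  # (160000, 1): the "box inside the box"

# squeeze() drops every axis of length 1, leaving just the samples.
audio = recording.squeeze()
print(audio.shape)  # (160000,)

# squeeze() returns a view of the same data rather than a copy.
print(np.shares_memory(recording, audio))  # True
```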

trungkienbkhn commented on September 22, 2024

@ENDERFUN2, hello. To handle silence in recorded audio, you can try using the vad_filter option.
To avoid saving the audio to a temp file, you can pass the audio data to the FW model directly as a NumPy ndarray. Below is my example:

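```python
import numpy as np
import sounddevice as sd

from faster_whisper import WhisperModel

print("Recording started")
duration = 10  # seconds
sample_rate = 16000  # Whisper models expect 16 kHz audio
# sd.rec() returns immediately and fills a (frames, channels) float32 array
audio_data = sd.rec(
    int(sample_rate * duration), samplerate=sample_rate, channels=1, dtype=np.float32
)
sd.wait()  # block until the recording has finished
audio_data = audio_data.squeeze()  # (160000, 1) -> (160000,): FW needs a 1D array
print("Recording stopped")

model = WhisperModel("tiny", device="cpu")
# Passing the ndarray directly avoids writing a temporary audio file
segments, info = model.transcribe(audio_data, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```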
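The example above does not pass vad_filter yet. Below is a minimal sketch of the same call with the VAD filter switched on. It is wrapped in a hypothetical helper, `transcribe_array`. The `vad_filter` and `vad_parameters` (with `min_silence_duration_ms`) arguments are faster-whisper `transcribe()` options. The 500 ms default is only an illustrative value, not a recommendation.

```python
import numpy as np


def transcribe_array(model, audio: np.ndarray, min_silence_ms: int = 500) -> list[str]:
    """Transcribe a 16 kHz mono recording with the VAD filter enabled.

    `model` is assumed to be a faster_whisper.WhisperModel (or any object
    with the same transcribe() interface). This helper is hypothetical.
    """
    # Accept (frames,) or (frames, 1) input and hand FW a 1D float32 array.
    audio = np.asarray(audio, dtype=np.float32).squeeze()
    if audio.ndim != 1:
        raise ValueError(f"expected mono audio, got shape {audio.shape}")
    segments, _info = model.transcribe(
        audio,
        vad_filter=True,  # skip non-speech parts instead of decoding them
        vad_parameters=dict(min_silence_duration_ms=min_silence_ms),
    )
    # segments is a generator: transcription actually runs while iterating.
    return [segment.text for segment in segments]
```

You can call it with the model and recording from the example above: `transcribe_array(model, audio_data)` returns the segment texts. On a recording of pure silence, the VAD should leave little or nothing to decode. This is what fixed the "Thanks for watching" output later in this thread.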

arunman1kandan commented on September 22, 2024

Also, along with that, @ENDERFUN2, make sure you use 16000 as the sampling rate and not any other. As @trungkienbkhn pointed out, FW expects audio passed in as an array to be sampled at 16000 Hz.
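Sometimes audio can only be captured at another rate, for example on a device that only offers 48 kHz. In that case it has to be brought down to 16 kHz before it is handed to the model as an array. Below is a minimal sketch that assumes 1D mono input. `resample_linear` is a hypothetical helper. It uses plain linear interpolation with no anti-aliasing filter, so it is only a rough stand-in for a proper resampler such as `scipy.signal.resample_poly(audio, 1, 3)`. Recording at 16000 Hz directly, as in the example above, is simpler.

```python
import numpy as np


def resample_linear(audio: np.ndarray, src_rate: int, dst_rate: int = 16000) -> np.ndarray:
    """Crude resampler: linear interpolation, no anti-aliasing low-pass filter."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1:
        raise ValueError("expected a 1D mono array")
    if src_rate == dst_rate:
        return audio
    n_out = int(round(len(audio) * dst_rate / src_rate))
    # Time (in seconds) of every input sample and every output sample.
    t_in = np.arange(len(audio)) / src_rate
    t_out = np.arange(n_out) / dst_rate
    # Read the signal off at the new sample times.
    return np.interp(t_out, t_in, audio).astype(np.float32)
```

For example, 1 s of 48 kHz audio (48000 samples) comes back as 16000 samples. Because 48000 / 16000 = 3 exactly, each output sample lands on every third input sample.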

ENDERFUN2 commented on September 22, 2024

Okay, so vad_filter works perfectly. I don't know why I hadn't found it before...
Also, that sample code from @trungkienbkhn turned out to be a game changer. But, as a curious man, why is squeeze required? It's my first serious project in Python, because all my previous ones were in Java or C++, so I don't really understand it. And why does the sample rate have to be set to 16000? When I pass an audio file with a 48000 sample rate, it transcribes the audio perfectly. I would be grateful for an explanation.

trungkienbkhn commented on September 22, 2024

@ENDERFUN2, FYI, you can see this comment to better understand why you should use sample_rate=16000. If I use sr=48000 in my example here, it obviously doesn't work.
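A raw array carries no sample-rate information, so the model can only assume the samples are at 16 kHz. Below is a small arithmetic sketch of what happens to a hypothetical 10 s recording made at 48 kHz and passed in as-is, with zeros standing in for the audio.

A file path is handled differently. As far as I can tell from faster-whisper's source, `transcribe()` decodes a file with `faster_whisper.audio.decode_audio`, and that function resamples to 16 kHz. This would explain why the 48 kHz file in the question transcribes fine.

```python
import numpy as np

recorded_rate = 48000  # rate the (hypothetical) recording was made at
assumed_rate = 16000  # rate Whisper assumes for raw array input
duration_s = 10

# Stand-in for a real 10 s recording at 48 kHz.
audio = np.zeros(int(recorded_rate * duration_s), dtype=np.float32)

# The model only sees sample positions and converts them to time at 16 kHz.
apparent_duration_s = len(audio) / assumed_rate
slowdown = recorded_rate / assumed_rate

print(f"{len(audio)} samples read as {apparent_duration_s:.1f} s of audio")
print(f"speech appears {slowdown:.0f}x slower and pitched down")
```

This prints `480000 samples read as 30.0 s of audio` followed by `speech appears 3x slower and pitched down`. That is far from the speech the model was trained on.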

As for why squeeze() is needed: sd.rec() returns an array of shape (duration * sample_rate, 1). Because it records mono audio (channels=1), the result is a 2D array whose second dimension has size 1. However, FW requires a 1D array as input, so we need squeeze() to reshape it.
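squeeze() only removes axes of size 1. That is enough for channels=1, but a channels=2 recording would stay at shape (frames, 2). Below is a minimal sketch of a more defensive conversion. `to_mono_1d` is a hypothetical helper, and averaging the channels is one simple downmix choice among several.

```python
import numpy as np


def to_mono_1d(recording: np.ndarray) -> np.ndarray:
    """Turn a (frames,) or (frames, channels) array into a 1D float32 array."""
    recording = np.asarray(recording, dtype=np.float32)
    if recording.ndim == 1:
        return recording
    if recording.ndim != 2:
        raise ValueError(f"expected (frames,) or (frames, channels), got {recording.shape}")
    if recording.shape[1] == 1:
        # Mono: drop the channel axis. Indexing keeps a 1-frame input 1D,
        # whereas squeeze() would turn shape (1, 1) into a 0D scalar.
        return recording[:, 0]
    # Two or more channels: squeeze() would change nothing, so average them.
    return recording.mean(axis=1)
```

With this helper, `to_mono_1d(np.zeros((160000, 1))).shape` is `(160000,)`. A two-channel array with one channel all ones and the other all zeros becomes a 1D array of 0.5s.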

ENDERFUN2 commented on September 22, 2024

Well, it now makes a lot of sense. Thank you both, @trungkienbkhn and @arunman1kandan, for your service. Abstracting it to Legos wasn't necessary, though; I just didn't understand ndarrays. Also, after some refactoring I noticed that my code is one big pile of garbage and I should rewrite it ASAP. Your answers gave me an important insight.
