Comments (6)
Hey @ENDERFUN2,
Imagine you have a box full of Legos and want to build a spaceship. To use them, all the Legos need to be loose in the box. Sometimes, though, the Legos arrive inside a smaller box that sits within the larger one. That inner box adds nothing; it is just an extra layer of packaging.
The sounddevice.rec function in this code is like getting such a box of Legos: the result may come with that extra layer (the extra dimension). Using squeeze is like opening the large box and throwing away the inner one, so that all you are left with is the Legos, i.e. the audio data.
This matters because the WhisperModel.transcribe method, which you use to build the spaceship, only works with loose Legos, not with boxes inside boxes. Squeezing removes the extra box so that everything works as intended.
This is how I understand squeeze to work.
from faster-whisper.
@ENDERFUN2, hello. To handle silence in the recorded audio, you can try the vad_filter option.
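As a rough sketch of what enabling it can look like: `vad_filter` and `vad_parameters` are keyword arguments of `WhisperModel.transcribe`, while the helper name `transcribe_without_silence` and the 500 ms silence threshold are illustrative assumptions, not something from this thread.

```python
def transcribe_without_silence(model, audio, min_silence_ms=500):
    """Transcribe `audio` with voice activity detection (VAD) enabled.

    `model` is expected to be a faster_whisper.WhisperModel. `audio` can be
    a file path or a 1D float32 NumPy array sampled at 16 kHz.
    The helper name and the default threshold are hypothetical.
    """
    segments, info = model.transcribe(
        audio,
        vad_filter=True,  # skip stretches detected as non-speech
        vad_parameters={"min_silence_duration_ms": min_silence_ms},
    )
    # `segments` is a lazy generator; iterating it runs the transcription.
    return "".join(segment.text for segment in segments).strip()
```

With a real model this would be called as `transcribe_without_silence(WhisperModel("tiny", device="cpu"), audio_data)`.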
To avoid saving the audio to a temp file, pass the audio data to the FW model as a NumPy ndarray. Here is my example:
import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

print("Recording started")
duration = 10  # seconds
sample_rate = 16000  # Hz
audio_data = sd.rec(
    int(sample_rate * duration), samplerate=sample_rate, channels=1, dtype=np.float32
)
sd.wait()  # block until the recording has finished
audio_data = audio_data.squeeze()  # shape (N, 1) -> (N,)
print("Recording stopped")

# Transcribe the in-memory array directly, without writing a temp file
model = WhisperModel("tiny", device="cpu")
segments, info = model.transcribe(audio_data, word_timestamps=True)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Also, @ENDERFUN2, make sure you use a sampling rate of 16000 and nothing else. As @trungkienbkhn said, FW doesn't support any other sample rate here.
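If a recording was made at another rate (48000, say), one option is to resample the array to 16000 before passing it on. The sketch below is a minimal NumPy-only version under stated assumptions: the function name `resample_to_16k` is hypothetical, and plain linear interpolation is used for brevity even though it applies no anti-aliasing filter. A dedicated resampler such as `scipy.signal.resample_poly` would likely give cleaner audio.

```python
import numpy as np

TARGET_RATE = 16000  # the rate the thread says FW needs


def resample_to_16k(audio, source_rate):
    """Resample a 1D audio array to 16 kHz by linear interpolation.

    Hypothetical helper for illustration only; no low-pass filtering is
    applied, so some aliasing is possible when downsampling.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if source_rate == TARGET_RATE:
        return audio
    duration = audio.shape[0] / source_rate
    target_len = int(round(duration * TARGET_RATE))
    # Timestamps (in seconds) of the original and the resampled samples
    source_times = np.arange(audio.shape[0]) / source_rate
    target_times = np.arange(target_len) / TARGET_RATE
    return np.interp(target_times, source_times, audio).astype(np.float32)
```

For example, one second recorded at 48000 Hz (48000 samples) comes back as 16000 samples.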
Okay, so vad_filter works perfectly. I don't know why I hadn't found it before...
Also, that sample code from @trungkienbkhn turned out to be a game changer. But, being curious, why is squeeze required? This is my first serious project in Python, since all my previous ones were in Java or C++, so I don't really understand it. And why does the sample rate have to be set to 16000? When I pass an audio file with a 48000 sample rate, it transcribes the audio 100% perfectly. I would be so grateful for an explanation.
@ENDERFUN2, FYI, you can see this comment to better understand why you should use sample_rate=16000. If I use sr=48000 in my example here, it doesn't work.
As for squeeze(): the sd.rec() function returns an array with shape (duration * sample_rate, 1) because it records mono audio, which gives a 2D array where one dimension has size 1. However, FW requires a 1D array as input, so we need squeeze() to drop that extra dimension.
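The shape change can be checked with NumPy alone. In this small sketch a zero-filled array stands in for the result of sd.rec() (an assumption made so that it runs without a microphone):

```python
import numpy as np

sample_rate = 16000
duration = 10

# Stand-in for what sd.rec(..., channels=1) returns: one column per channel
recorded = np.zeros((duration * sample_rate, 1), dtype=np.float32)
print(recorded.shape)  # (160000, 1)

# squeeze() drops every dimension of size 1, leaving a 1D array
mono = recorded.squeeze()
print(mono.shape)  # (160000,)

# Selecting the single column or flattening gives the same result
assert np.array_equal(mono, recorded[:, 0])
assert np.array_equal(mono, recorded.reshape(-1))
```

Note that squeeze() without arguments removes all size-1 dimensions, so `recorded.squeeze(axis=1)` or `recorded[:, 0]` is a bit more explicit about which one goes away.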
Well, it now makes a lot of sense. Thank you both, @trungkienbkhn and @arunman1kandan, for your help. Abstracting it to Legos wasn't necessary, though; I just didn't understand ndarrays. Also, after some refactoring I noticed that my code is one big pile of garbage and I should reformat it asap. Your answers gave me an important insight.