exphat / swiftwhisper
🎤 The easiest way to transcribe audio in Swift
License: MIT License
First of all, thank you for your work!
I have a question: when I transcribe an audio file as PCM [Float], I receive a [Segment] array as the result.
I noticed that each Segment may contain a whole sentence rather than a separate word.
How can I split a sentence into separate words, with a timestamp for each?
I tried using the WhisperParams fields, but the result is always the same.
The only thing that helps me reduce the number of words per segment is using the beamSearch strategy, but I still get sentences instead of separate words.
My code:
let params = WhisperParams(strategy: .beamSearch)
params.max_len = 1
params.split_on_word = true
whisper = Whisper(fromFileURL: modelUrl, withParams: params)
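For reference, in upstream whisper.cpp the max_len and split_on_word parameters only take effect when token-level timestamps are enabled. A minimal sketch, assuming WhisperParams forwards whisper.cpp's token_timestamps flag the same way it forwards max_len (which may not hold for every version of this package):

```swift
import SwiftWhisper

let params = WhisperParams(strategy: .greedy)
params.token_timestamps = true  // whisper.cpp requires this for max_len/split_on_word to apply
params.max_len = 1              // cap each segment at roughly one word
params.split_on_word = true     // split on word boundaries instead of raw tokens

// modelUrl and audioFrames are assumed to be defined elsewhere.
let whisper = Whisper(fromFileURL: modelUrl, withParams: params)
let segments = try await whisper.transcribe(audioFrames: audioFrames)
// Each segment should now carry a single word with its own start/end times.
```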
I tried it, but there is no callback.
If you check the cpp code, it also doesn't propagate anything up to the Objective-C or Swift layer.
// main loop
while (true) {
const int progress_cur = (100*(seek - seek_start))/(seek_end - seek_start);
while (progress_cur >= progress_prev + progress_step) {
progress_prev += progress_step;
if (params.print_progress) {
fprintf(stderr, "%s: progress = %3d%%\n", __func__, progress_prev);
}
}
How does it work? Does it even work?
First of all: Thank you for coding this Swift Package — it’s terrific! 🙏
What I'm missing: I'd love to get word-level timestamps, as mentioned in the Whisper API.
From my understanding, this would require being able to set --word_timestamps to true. (Maybe WhisperParams would be a good place for that?)
Keep up the great work!
Best, Martin
If I try to initialize a Segment struct inside my app, I get the error: "'Segment' initializer is inaccessible due to 'internal' protection level"
See: https://docs.swift.org/swift-book/documentation/the-swift-programming-language/accesscontrol/
"The default memberwise initializer for a structure type is considered private if any of the structure’s stored properties are private. Likewise, if any of the structure’s stored properties are file private, the initializer is file private. Otherwise, the initializer has an access level of internal."
So the built-in memberwise initializer is only available within the package. If you don't provide a public initializer, you won't be able to create the struct from outside the module. (https://stackoverflow.com/questions/54673224/public-struct-in-framework-init-is-inaccessible-due-to-internal-protection-lev)
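A minimal illustration of the rule quoted above; the property names here are hypothetical stand-ins, not necessarily Segment's actual fields:

```swift
public struct Segment {
    public let startTime: Int
    public let endTime: Int
    public let text: String

    // Without this explicit public initializer, the compiler-synthesized
    // memberwise init is internal, so code outside the module can't call it.
    public init(startTime: Int, endTime: Int, text: String) {
        self.startTime = startTime
        self.endTime = endTime
        self.text = text
    }
}
```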
When I use the large-v3 model in my SwiftUI app, the app crashes. The model loads correctly, but then it fails with:
Assertion failed: (mel_inp.n_mel == n_mels), function whisper_encode_internal, file whisper.cpp, line 1430.
Everything works as expected with the medium model.
Is there an interface exposed to control the number of threads used? It uses 4 threads and always takes a really long time to translate a couple of seconds of audio. If we could use more threads and better leverage the chip's capability, would that be faster?
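whisper.cpp's whisper_full_params does include an n_threads field. A sketch, assuming WhisperParams forwards it the same way it forwards fields like max_len elsewhere in these issues (not verified against this package's API):

```swift
import Foundation
import SwiftWhisper

let params = WhisperParams(strategy: .greedy)
// Use as many threads as the device currently has active cores.
params.n_threads = Int32(ProcessInfo.processInfo.activeProcessorCount)

// modelUrl is assumed to point at a downloaded ggml model file.
let whisper = Whisper(fromFileURL: modelUrl, withParams: params)
```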
One approach that comes to mind is to record the current timestamp and re-truncate the audio for transcription, but that's not very elegant.
Need watchOS support.
Best Regards
I have experimented with the actual whisper.cpp library and this library, setting max-len to the same value so that I can control the number of words per segment. It does not work as expected in SwiftWhisper: it effectively ignores the value of --max-len, whereas the original cpp library does not.
Hi!!
First of all great library!
So thanks for that!
Second, is there an API for diarization?
How do I release it if I don't want it anymore, or want to initialize another model? It seems the memory is never freed.
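In whisper.cpp the context is released with whisper_free(). Assuming the Whisper class calls it in deinit (an assumption, not verified against this package), dropping the last strong reference should free the model memory:

```swift
import SwiftWhisper

// oldModelUrl and newModelUrl are assumed to be defined elsewhere.
var whisper: Whisper? = Whisper(fromFileURL: oldModelUrl)
// ... transcribe with the first model ...

// Release the old context before loading another model. If Whisper frees
// its whisper.cpp context in deinit, this returns the weights to the OS.
whisper = nil
whisper = Whisper(fromFileURL: newModelUrl)
```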
Firstly, thanks for sharing your library, it's great.
I'm trying to get a transcription that splits on each word. I understand (perhaps incorrectly) that to do this I need to set max_len=1 and split_on_word=true. Found here: https://github.com/ggerganov/whisper.cpp#word-level-timestamp
However I see no change in the segments in that they always seem to be split on the default/automatic settings. Please let me know if I'm doing something wrong. Here's my code:
let params = WhisperParams()
params.language = .english
params.max_len = 1
params.split_on_word = true
let whisper = Whisper(fromFileURL: Bundle.main.url(forResource: "ggml-tiny.en", withExtension: "bin")!, withParams: params)
let segments = try await whisper.transcribe(audioFrames: audioFrames)
transcription = segments.map(\.text).joined()
For example initial_prompt, max_len, split_on_word, etc.
public init(strategy: WhisperSamplingStrategy = .greedy) {
self.whisperParams = whisper_full_default_params(whisper_sampling_strategy(rawValue: strategy.rawValue))
self.language = .auto
}
Hi @exPHAT, do you have more example code with all the boilerplate to get started with using the CoreML model, the delegates, and the PCM Array to whisper.cpp STT? Thanks, @shyamalschandra!
I can do trial-and-error but it is much easier to have boilerplate.
Any tips/tricks on how to tie in live microphone data into this library? Similar to the dictation system on macOS.
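One possible approach is an AVAudioEngine tap that resamples the microphone input to the 16 kHz mono Float32 frames Whisper expects. This is only a sketch under that assumption; buffering, error handling, and chunking are simplified:

```swift
import AVFoundation
import SwiftWhisper

let engine = AVAudioEngine()
let input = engine.inputNode
let inputFormat = input.outputFormat(forBus: 0)

// Whisper models expect 16 kHz mono Float32 PCM.
let whisperFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000,
                                  channels: 1,
                                  interleaved: false)!
let converter = AVAudioConverter(from: inputFormat, to: whisperFormat)!

var frames: [Float] = []

input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { buffer, _ in
    let ratio = whisperFormat.sampleRate / inputFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
    guard let converted = AVAudioPCMBuffer(pcmFormat: whisperFormat,
                                           frameCapacity: capacity) else { return }
    var error: NSError?
    converter.convert(to: converted, error: &error) { _, status in
        status.pointee = .haveData
        return buffer
    }
    if let channel = converted.floatChannelData {
        frames.append(contentsOf: UnsafeBufferPointer(start: channel[0],
                                                      count: Int(converted.frameLength)))
    }
}

try engine.start()
// Periodically hand accumulated `frames` to whisper.transcribe(audioFrames:) in chunks.
```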
When Whisper.init(fromFileURL:) is called with a URL pointing to a file that exists but is not a valid model file, the error condition from the underlying whisper.cpp library is not handled.
Specifically:
self.whisperContext = fileURL.relativePath.withCString { whisper_init_from_file($0) }
whisper_init_from_file will return nullptr in this case. The attempted assignment produces the following error, which crashes the program using the library:
whisper_init_from_file_no_state: loading model from '.'
whisper_model_load: loading model
whisper_model_load: invalid model data (bad magic)
whisper_init_no_state: failed to load model
SwiftWhisper/Whisper.swift:16: Fatal error: Unexpectedly found nil while implicitly unwrapping an Optional value
I'd like to add a check to this initializer so my program can catch and safely handle this case. In theory, my program could attempt to figure out if this was a valid model file, but this would involve re-implementing the detection code from whisper.cpp that I am trying to wrap. Letting that code that is already doing the error handling just pass the error through seems like a better arrangement.
Doing so would probably require changing the init signature to a throwing or failable one. I understand that this would involve an API change here. Is there a way to handle this that would be likely to be accepted as a PR? Is there a more general plan for handling this sort of error case?
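A sketch of the throwing variant described above; the error type and property names are illustrative, not SwiftWhisper's actual API:

```swift
import Foundation

enum WhisperError: Error {
    case modelLoadFailed(URL)
}

final class Whisper {
    private let whisperContext: OpaquePointer

    init(fromFileURL fileURL: URL) throws {
        // whisper_init_from_file returns nullptr for invalid model data,
        // so surface that as a Swift error instead of force-unwrapping.
        guard let context = fileURL.relativePath.withCString({ whisper_init_from_file($0) }) else {
            throw WhisperError.modelLoadFailed(fileURL)
        }
        self.whisperContext = context
    }
}
```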
Is it possible to get time data for every word in each segment?
I know your repo is very new, but I was wondering if you could add a license so I know whether I can use it in my project?
OpenAI provides an API to use Whisper on their platform (https://platform.openai.com/docs/api-reference/audio/createTranscription) for a cheap price, and it's much faster than running locally. I think it'd be a good addition to support it.
Are there any new updates for the tiny and medium models? Large doesn't work very well on mobile, and I don't expect v3 to improve things much.
Do you have more boilerplate code (examples) to show an example of some application where the speech recognition is working?
Is it possible to get the confidence for an individual segment/word as part of the results?
Thanks
I noticed that Whisper cpp has coreml support:
https://github.com/ggerganov/whisper.cpp/tree/master#core-ml-support
Does SwiftWhisper support CoreML and if not, is this something I can setup to do in my project or does it require a change to SwiftWhisper?
Hey, awesome package!
I wanted to ask how one could use this for on-device realtime transcription with microphone audio, similar to the objc example from the whisper.cpp package.
Is there a demo example showing how to download the models on demand in the app and then use those models to transcribe? Thank you.
I've been following the whisper.cpp project to create the mlmodelc file. However, I've encountered an issue where the weights/weight.bin file, which is required by SwiftWhisper, is not being created.
So when I run the project with SwiftWhisper CoreML, the exact error message I'm receiving is:
Could not open .../ggml-base-encoder.mlmodelc/weights/weight.bin
I'm not sure what I might be missing or doing incorrectly. Any guidance or suggestions would be greatly appreciated.
Loading the base model with the CoreML model takes 9.7 s, but only 1 s without it.
Hello,
I'm relatively new to Swift, and I got confused by the AudioKit convertAudioFileToPCMArray helper.
Does anyone have a working code example I might be able to refer to?
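For what it's worth, here is a sketch of such a helper using AudioKit's FormatConverter to produce the 16 kHz mono Float frames Whisper expects, based on the pattern in the SwiftWhisper README; details such as skipping a fixed 44-byte WAV header are simplifying assumptions:

```swift
import AudioKit
import Foundation

// Converts any audio file to 16 kHz mono 16-bit WAV, then maps the raw
// samples to [-1, 1] Floats. Skipping the first 44 bytes assumes a
// canonical WAV header, which is a simplification.
func convertAudioFileToPCMArray(fileURL: URL,
                                completionHandler: @escaping (Result<[Float], Error>) -> Void) {
    var options = FormatConverter.Options()
    options.format = .wav
    options.sampleRate = 16000
    options.bitDepth = 16
    options.channels = 1
    options.isInterleaved = false

    let tempURL = URL(fileURLWithPath: NSTemporaryDirectory())
        .appendingPathComponent(UUID().uuidString)
    let converter = FormatConverter(inputURL: fileURL, outputURL: tempURL, options: options)
    converter.start { error in
        if let error {
            completionHandler(.failure(error))
            return
        }
        do {
            let data = try Data(contentsOf: tempURL)
            // Interpret each little-endian Int16 sample as a normalized Float.
            let floats = stride(from: 44, to: data.count, by: 2).map { offset -> Float in
                data[offset..<offset + 2].withUnsafeBytes { raw in
                    let sample = Int16(littleEndian: raw.load(as: Int16.self))
                    return max(-1.0, min(Float(sample) / 32767.0, 1.0))
                }
            }
            try? FileManager.default.removeItem(at: tempURL)
            completionHandler(.success(floats))
        } catch {
            completionHandler(.failure(error))
        }
    }
}
```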
Thank you!
Is it possible to use this on either the CPU or GPU, specifically on macOS Apple Silicon machines? Is this configurable, automatic, or not available?
Thanks
I expected that since the Swift package uses the C++ code through interop, it would be just as fast. I did a test transcription using the same wav file and the base.en model. Running the main example from whisper.cpp directly takes 2.7 s to complete; the Swift package takes >10 s for the same model and wav file. I have no idea why this is happening. Can someone explain?