exphat / swiftwhisper
🎤 The easiest way to transcribe audio in Swift
License: MIT License
First of all, thank you for your work!
I have a question: when I transcribe an audio file as PCM [Float], I receive a [Segment] array as the result.
I noticed that each Segment may contain a whole sentence rather than a separate word.
How can I split a sentence into separate words, with a timestamp for each?
I tried using the WhisperParams fields, but the result is always the same.
The only thing that helps me reduce the number of words per segment is using the beamSearch strategy, but I still get sentences instead of separate words.
My code:
let params = WhisperParams(strategy: .beamSearch)
params.max_len = 1
params.split_on_word = true
whisper = Whisper(fromFileURL: modelUrl, withParams: params)
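For reference, in upstream whisper.cpp the max_len and split_on_word parameters only take effect when token-level timestamps are enabled. A minimal sketch, assuming WhisperParams forwards whisper.cpp's token_timestamps flag the same way it forwards max_len (which may not hold for every version of this package):

```swift
import SwiftWhisper

let params = WhisperParams(strategy: .greedy)
params.token_timestamps = true  // whisper.cpp requires this for max_len/split_on_word to apply
params.max_len = 1              // cap each segment at roughly one word
params.split_on_word = true     // split on word boundaries instead of raw tokens

// modelUrl and audioFrames are assumed to be defined elsewhere.
let whisper = Whisper(fromFileURL: modelUrl, withParams: params)
let segments = try await whisper.transcribe(audioFrames: audioFrames)
// Each segment should now carry a single word with its own start/end times.
```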
I tried it, but there is no callback.
If you check the cpp code, it also doesn't propagate anything up to the Objective-C or Swift layer.
// main loop
while (true) {
const int progress_cur = (100*(seek - seek_start))/(seek_end - seek_start);
while (progress_cur >= progress_prev + progress_step) {
progress_prev += progress_step;
if (params.print_progress) {
fprintf(stderr, "%s: progress = %3d%%\n", __func__, progress_prev);
}
}
How does it work? Does it even work?
First of all: Thank you for coding this Swift Package — it’s terrific! 🙏
What I'm missing: I'd love to get word-level timestamps, as mentioned in the Whisper API.
From my understanding, this would require being able to set --word_timestamps to true. (Maybe WhisperParams would be a good place for that?)
Keep up the great work!
Best, Martin
If I try to initialize a Segment struct inside my app, I get the error: "'Segment' initializer is inaccessible due to 'internal' protection level"
See: https://docs.swift.org/swift-book/documentation/the-swift-programming-language/accesscontrol/
"The default memberwise initializer for a structure type is considered private if any of the structure’s stored properties are private. Likewise, if any of the structure’s stored properties are file private, the initializer is file private. Otherwise, the initializer has an access level of internal."
So the built-in memberwise initializer is only available within the package. If you don't provide a public initializer, you won't be able to create the struct from outside the module. (https://stackoverflow.com/questions/54673224/public-struct-in-framework-init-is-inaccessible-due-to-internal-protection-lev)
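A minimal illustration of the rule quoted above; the property names here are hypothetical stand-ins, not necessarily Segment's actual fields:

```swift
public struct Segment {
    public let startTime: Int
    public let endTime: Int
    public let text: String

    // Without this explicit public initializer, the compiler-synthesized
    // memberwise init is internal, so code outside the module can't call it.
    public init(startTime: Int, endTime: Int, text: String) {
        self.startTime = startTime
        self.endTime = endTime
        self.text = text
    }
}
```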
When I use the large-v3 model in my SwiftUI app, the app crashes. The model loads correctly, but then it fails with:
Assertion failed: (mel_inp.n_mel == n_mels), function whisper_encode_internal, file whisper.cpp, line 1430.
Everything works as expected with the medium model.
Is there an interface exposed to control the number of threads used? It uses 4 threads and always takes a really long time to translate a couple of seconds of audio. If we could use more threads and better leverage the chip's capability, would that be faster?
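whisper.cpp's whisper_full_params does include an n_threads field. A sketch, assuming WhisperParams forwards it the same way it forwards fields like max_len elsewhere in these issues (not verified against this package's API):

```swift
import Foundation
import SwiftWhisper

let params = WhisperParams(strategy: .greedy)
// Use as many threads as the device currently has active cores.
params.n_threads = Int32(ProcessInfo.processInfo.activeProcessorCount)

// modelUrl is assumed to point at a downloaded ggml model file.
let whisper = Whisper(fromFileURL: modelUrl, withParams: params)
```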
One approach that comes to mind is to record the current timestamp and re-truncate the audio for transcription, but that's not very elegant.
Need watchOS support.
Best Regards
I have experimented with the actual whisper.cpp library and this library, setting max-len to the same value so that I can control the number of words per segment. It does not work as expected in SwiftWhisper: it effectively ignores the value of --max-len, whereas the original cpp library does not.
Hi!!
First of all great library!
So thanks for that!
Second, is there an API for diarization?
How do I release it if I don't want it anymore, or want to initialize another model? It seems the memory is never freed.
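In whisper.cpp the context is released with whisper_free(). Assuming the Whisper class calls it in deinit (an assumption, not verified against this package), dropping the last strong reference should free the model memory:

```swift
import SwiftWhisper

// oldModelUrl and newModelUrl are assumed to be defined elsewhere.
var whisper: Whisper? = Whisper(fromFileURL: oldModelUrl)
// ... transcribe with the first model ...

// Release the old context before loading another model. If Whisper frees
// its whisper.cpp context in deinit, this returns the weights to the OS.
whisper = nil
whisper = Whisper(fromFileURL: newModelUrl)
```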
Firstly, thanks for sharing your library, it's great.
I'm trying to get a transcription that splits on each word. I understand (perhaps incorrectly) that to do this I need to set max_len=1 and split_on_word=true. Found here: https://github.com/ggerganov/whisper.cpp#word-level-timestamp
However I see no change in the segments in that they always seem to be split on the default/automatic settings. Please let me know if I'm doing something wrong. Here's my code:
let params = WhisperParams()
params.language = .english
params.max_len = 1
params.split_on_word = true
let whisper = Whisper(fromFileURL: Bundle.main.url(forResource: "ggml-tiny.en", withExtension: "bin")!, withParams: params)
let segments = try await whisper.transcribe(audioFrames: audioFrames)
transcription = segments.map(\.text).joined()
For example initial_prompt, max_len, split_on_word, etc.
public init(strategy: WhisperSamplingStrategy = .greedy) {
self.whisperParams = whisper_full_default_params(whisper_sampling_strategy(rawValue: strategy.rawValue))
self.language = .auto
}
Hi @exPHAT, do you have more example code with all the boilerplate to get started with using the CoreML model, the delegates, and the PCM Array to whisper.cpp STT? Thanks, @shyamalschandra!
I can do trial-and-error but it is much easier to have boilerplate.
Any tips/tricks on how to tie in live microphone data into this library? Similar to the dictation system on macOS.
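One possible approach is an AVAudioEngine tap that resamples the microphone input to the 16 kHz mono Float32 frames Whisper expects. This is only a sketch under that assumption; buffering, error handling, and chunking are simplified:

```swift
import AVFoundation
import SwiftWhisper

let engine = AVAudioEngine()
let input = engine.inputNode
let inputFormat = input.outputFormat(forBus: 0)

// Whisper models expect 16 kHz mono Float32 PCM.
let whisperFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000,
                                  channels: 1,
                                  interleaved: false)!
let converter = AVAudioConverter(from: inputFormat, to: whisperFormat)!

var frames: [Float] = []

input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { buffer, _ in
    let ratio = whisperFormat.sampleRate / inputFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
    guard let converted = AVAudioPCMBuffer(pcmFormat: whisperFormat,
                                           frameCapacity: capacity) else { return }
    var error: NSError?
    converter.convert(to: converted, error: &error) { _, status in
        status.pointee = .haveData
        return buffer
    }
    if let channel = converted.floatChannelData {
        frames.append(contentsOf: UnsafeBufferPointer(start: channel[0],
                                                      count: Int(converted.frameLength)))
    }
}

try engine.start()
// Periodically hand accumulated `frames` to whisper.transcribe(audioFrames:) in chunks.
```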
When Whisper.init(fromFileURL:) is called with a URL pointing to a file that exists but is not a valid model file, the error condition from the underlying whisper.cpp library is not handled.
Specifically:
self.whisperContext = fileURL.relativePath.withCString { whisper_init_from_file($0) }
whisper_init_from_file will return nullptr in this case. The attempted assignment produces the following error, which crashes the program using the library:
whisper_init_from_file_no_state: loading model from '.'
whisper_model_load: loading model
whisper_model_load: invalid model data (bad magic)
whisper_init_no_state: failed to load model
SwiftWhisper/Whisper.swift:16: Fatal error: Unexpectedly found nil while implicitly unwrapping an Optional value
I'd like to add a check to this initializer so my program can catch and safely handle this case. In theory, my program could attempt to figure out if this was a valid model file, but this would involve re-implementing the detection code from whisper.cpp that I am trying to wrap. Letting that code that is already doing the error handling just pass the error through seems like a better arrangement.
Doing so would probably require changing the init signature to a throwing or failable one. I understand that this would involve an API change here. Is there a way to handle this that would be likely to be accepted as a PR? Is there a more general plan for handling this sort of error case?
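A sketch of the throwing variant described above; the error type and property names are illustrative, not SwiftWhisper's actual API:

```swift
import Foundation

enum WhisperError: Error {
    case modelLoadFailed(URL)
}

final class Whisper {
    private let whisperContext: OpaquePointer

    init(fromFileURL fileURL: URL) throws {
        // whisper_init_from_file returns nullptr for invalid model data,
        // so surface that as a Swift error instead of force-unwrapping.
        guard let context = fileURL.relativePath.withCString({ whisper_init_from_file($0) }) else {
            throw WhisperError.modelLoadFailed(fileURL)
        }
        self.whisperContext = context
    }
}
```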
Is it possible to get time data for every word in each segment?
I know your repo is very new, but I was wondering if you could add a license so I know whether I can use it in my project?
OpenAI provides an API to use Whisper on their platform (https://platform.openai.com/docs/api-reference/audio/createTranscription) for a cheap price, and it's much faster than running locally. I think it'd be a good addition to support it.
Are there any new updates for the tiny and medium models? Large doesn't work very well on mobile, and I don't expect v3 to improve things much.
Do you have more boilerplate code (examples) to show an example of some application where the speech recognition is working?
Is it possible to get the confidence for an individual segment/word as part of the results?
Thanks
I noticed that Whisper cpp has coreml support:
https://github.com/ggerganov/whisper.cpp/tree/master#core-ml-support
Does SwiftWhisper support CoreML and if not, is this something I can setup to do in my project or does it require a change to SwiftWhisper?
Hey, awesome package!
I wanted to ask how one could use this for on-device realtime transcription with microphone audio, similar to the objc example from the whisper.cpp package.
Is there a demo example showing how to download the models on demand in the app and then use those models to transcribe? Thank you.
I've been following the whisper.cpp project to create the mlmodelc file. However, I've encountered an issue where the weights/weight.bin file, which is required by SwiftWhisper, is not being created.
So when I run the project with SwiftWhisper CoreML, the exact error message I'm receiving is:
Could not open .../ggml-base-encoder.mlmodelc/weights/weight.bin
I'm not sure what I might be missing or doing incorrectly. Any guidance or suggestions would be greatly appreciated.
Loading the base model with the CoreML model takes 9.7 s, but only 1 s without it.
Hello,
I'm relatively new to Swift, and I got confused by the AudioKit convertAudioFileToPCMArray helper.
Does anyone have a working code example I might be able to refer to?
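For what it's worth, here is a sketch of such a helper using AudioKit's FormatConverter to produce the 16 kHz mono Float frames Whisper expects, based on the pattern in the SwiftWhisper README; details such as skipping a fixed 44-byte WAV header are simplifying assumptions:

```swift
import AudioKit
import Foundation

// Converts any audio file to 16 kHz mono 16-bit WAV, then maps the raw
// samples to [-1, 1] Floats. Skipping the first 44 bytes assumes a
// canonical WAV header, which is a simplification.
func convertAudioFileToPCMArray(fileURL: URL,
                                completionHandler: @escaping (Result<[Float], Error>) -> Void) {
    var options = FormatConverter.Options()
    options.format = .wav
    options.sampleRate = 16000
    options.bitDepth = 16
    options.channels = 1
    options.isInterleaved = false

    let tempURL = URL(fileURLWithPath: NSTemporaryDirectory())
        .appendingPathComponent(UUID().uuidString)
    let converter = FormatConverter(inputURL: fileURL, outputURL: tempURL, options: options)
    converter.start { error in
        if let error {
            completionHandler(.failure(error))
            return
        }
        do {
            let data = try Data(contentsOf: tempURL)
            // Interpret each little-endian Int16 sample as a normalized Float.
            let floats = stride(from: 44, to: data.count, by: 2).map { offset -> Float in
                data[offset..<offset + 2].withUnsafeBytes { raw in
                    let sample = Int16(littleEndian: raw.load(as: Int16.self))
                    return max(-1.0, min(Float(sample) / 32767.0, 1.0))
                }
            }
            try? FileManager.default.removeItem(at: tempURL)
            completionHandler(.success(floats))
        } catch {
            completionHandler(.failure(error))
        }
    }
}
```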
Thank you!
Is it possible to use this on either the CPU or GPU, specifically on macOS Apple Silicon machines? Is this configurable, automatic, or not available?
Thanks
I expected that since the Swift package uses the C++ code through interop, it would be just as fast. I did a test transcription using the same wav file and the base.en model. Running the main example from whisper.cpp directly takes 2.7 s to complete; the Swift package takes >10 s for the same model and wav file. I have no idea why this is happening. Can someone explain?