Speech Recognition Experiments

Experiments to try out different ASR/STT systems and evaluate their integration into the SEPIA STT Server.

ASR engines:

Whisper org - The original Whisper version by OpenAI
Whisper TFlite - A Tensorflow Lite compatible Whisper port
Whisper Cpp - A small C++ port of Whisper
Whisper CT2 - An efficient and fast CTranslate2 port of Whisper
Sherpa ncnn - Next-gen Kaldi implementation for streaming ASR
Nvidia NeMo - A toolkit for various end-to-end ASR models and languages
Vosk - Fast, small, accurate (for clear audio), easy to customize. Works with classic Kaldi models. One of the core engines of SEPIA STT Server.

Wake-Word detection:

OpenWakeWord - A robust, NN-based, open-source wake-word detection framework with a focus on performance and simplicity.

Other great ASR engines already included in SEPIA:

Coqui STT - Successor of Mozilla's Deep Speech project. End-to-end ASR with CTC decoder and "optional" LMs.

Installation

Each ASR experiment folder contains an install script. Run it with bash install.sh.
Some folders include additional scripts to download models. These are mentioned during installation.
After a successful installation, use bash run-test.sh to run a default test. If the script uses Python, activate the right virtual environment first: source venv/bin/activate.

Comments and Impressions

Whisper:
- Whisper, in any form, is very accurate, but the missing streaming support is its biggest drawback.
- RTF is not constant across input lengths. Unfortunately, short files (<4s) take almost as long to transcribe as longer ones (>10s).
- On Raspberry Pi 4 based voice assistants you usually have to wait >3s after finishing your input to get a result (bad UX).
- An Orange Pi 5 with an optimized Whisper version is fast enough to run the 'tiny' model with good UX (usually <1.5s inference time for any input <30s).
- Whisper CT2 currently seems to be the best version for Arm64/Aarch64 systems (RPi4 etc.). It is as fast as the TFlite version or even faster, is smaller in size, works better with non-English languages and has a cleaner API.
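
The non-constant RTF can be illustrated with a rough two-point fit on the Raspberry Pi 400 numbers from the benchmark section (Whisper CT2: 3.9s for file 1, 3.2s for file 2). This is only a sketch: the linear model `time = overhead + slope * duration` is an assumption, `fit_overhead` is a hypothetical helper, and the duration of file 2 (about 3.6s) is derived from the time/RTF ratios in the table rather than stated directly.

```python
# Rough two-point fit: time = overhead + slope * duration (assumed model).
# Each tuple is (audio duration in s, transcription time in s)
# for Whisper CT2 on the Raspberry Pi 400.
# The 3.6s duration of file 2 is an assumption derived from time / RTF.
FILE_1 = (11.0, 3.9)
FILE_2 = (3.6, 3.2)


def fit_overhead(p_long, p_short):
    """Return (overhead_s, slope) of the line through two (duration, time) points."""
    (d1, t1), (d2, t2) = p_long, p_short
    slope = (t1 - t2) / (d1 - d2)
    overhead = t2 - slope * d2
    return overhead, slope


overhead, slope = fit_overhead(FILE_1, FILE_2)
print(f"fixed overhead ~{overhead:.1f}s, ~{slope:.2f}s per second of audio")
```

With these inputs most of the time is a fixed cost that does not shrink with shorter audio. A likely reason is that Whisper pads inputs to fixed 30s windows, so the encoder work is about the same for short and long clips.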

Sherpa ncnn:
- Sherpa is very fast and supports streaming audio, but without a language model the WER is a bit high at the moment. Results look very promising though.
- Example result (file 1, JFK speech): "AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY".
- UPDATED 2023.04.29: Included better English model.
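
To put a number on a result like the JFK example above, WER can be computed as the word-level edit distance divided by the number of reference words. The sketch below is a plain dynamic-programming implementation, not code from this repository; `wer` is a hypothetical helper name and the reference transcript is an assumption based on the well-known quote.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference text for file 1 (punctuation removed).
REFERENCE = ("AND SO MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU "
             "ASK WHAT YOU CAN DO FOR YOUR COUNTRY")
SHERPA = ("AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU "
          "ASK WHAT YOU CAN DO FOR YOUR COUNTRY")

print(f"WER: {wer(REFERENCE, SHERPA):.3f}")
```

Under that assumed reference the Sherpa output differs by a single substitution ("SAW" for "SO") in 22 words, which is a WER of about 4.5%.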

Nvidia NeMo:
- Nvidia NeMo small models (e.g. 'en_conformer_ctc_small') are very fast and precise for clear and simple audio files.
- Unfortunately, NeMo has no pre-trained models for streaming conformer yet (2023.03.07).
- Non-streaming NeMo is a bit faster than Sherpa ncnn and much more precise.
- The test results below currently suggest the quality is as good as Whisper's, but more complicated vocabulary and noisy audio quickly show that Whisper still performs much better, especially compared to larger NeMo models.
- NeMo can be tuned easily using (phoneme free!) language models. Depending on your beam parameters (width, alpha, beta) accuracy for your LM vocabulary can increase dramatically, while it will drop for out-of-vocabulary words.
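
The effect of alpha and beta can be sketched with the scoring rule commonly used for CTC beam search with an n-gram LM: acoustic log-probability, plus alpha times the LM log-probability, plus beta times the word count. The snippet below is a toy rescoring of two made-up hypotheses, not NeMo's decoder API; all scores and the names `Hypothesis`, `fused_score` and `best` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # log-probability from the CTC model (made-up value)
    lm_logp: float        # log-probability from the language model (made-up value)


def fused_score(hyp: Hypothesis, alpha: float, beta: float) -> float:
    """Combined score: acoustic + alpha * LM + beta * word count."""
    return hyp.acoustic_logp + alpha * hyp.lm_logp + beta * len(hyp.text.split())


def best(hyps, alpha: float, beta: float) -> Hypothesis:
    """Pick the hypothesis with the highest combined score."""
    return max(hyps, key=lambda h: fused_score(h, alpha, beta))


candidates = [
    # Acoustically slightly better, but unlikely under a smart-home LM.
    Hypothesis("set the lights to seventy present", -4.0, -14.0),
    # Acoustically slightly worse, but in-vocabulary for the LM.
    Hypothesis("set the lights to seventy percent", -4.5, -6.0),
]

print(best(candidates, alpha=0.0, beta=0.0).text)  # LM ignored
print(best(candidates, alpha=0.5, beta=1.0).text)  # LM weighted in
```

A larger alpha pulls results toward what the LM considers likely, which helps in-vocabulary phrases and hurts out-of-vocabulary words. Beta acts as a per-word bonus that offsets the LM's preference for shorter outputs.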

Vosk:
- Vosk is very small, fast, supports streaming audio and you can convert most of the classic Kaldi models to work with it.
- The small models are only ~50MB and surprisingly good, even for general dictation tasks ... if your input audio isn't too noisy and your vocabulary not too complicated.
- The larger models are solid, but I never really use them because they are much slower, need more RAM and don't offer much better results in my everyday tests with the SEPIA assistant.
- If you want good accuracy in a specific domain you should train your own language model. The Vosk homepage has some documentation, but for SEPIA I use the kaldi-adapt-lm repo.
- Vosk with a custom LM is probably your best open-source ASR choice on low-end hardware.
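
Streaming with Vosk roughly means feeding small PCM chunks to a recognizer and reading partial results in between. The sketch below assumes the `vosk` Python package is installed and that a model folder has been downloaded; `read_chunks` and `stream_transcribe` are hypothetical helper names, and the input is assumed to be a 16-bit mono WAV file.

```python
import json
import wave


def read_chunks(wav_path, frames_per_chunk=4000):
    """Yield (sample_rate, raw PCM bytes) chunks from a 16-bit mono WAV file."""
    with wave.open(wav_path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        rate = wf.getframerate()
        while True:
            data = wf.readframes(frames_per_chunk)
            if not data:
                break
            yield rate, data


def stream_transcribe(model_dir, wav_path):
    """Feed audio chunk by chunk and return the joined transcript."""
    # Imported lazily so the chunk reader works without Vosk installed.
    from vosk import KaldiRecognizer, Model

    model = Model(model_dir)
    recognizer = None
    segments = []
    for rate, data in read_chunks(wav_path):
        if recognizer is None:
            recognizer = KaldiRecognizer(model, rate)
        if recognizer.AcceptWaveform(data):
            # The recognizer detected the end of an utterance.
            segments.append(json.loads(recognizer.Result())["text"])
        else:
            print("partial:", json.loads(recognizer.PartialResult())["partial"])
    if recognizer is not None:
        # Flush whatever audio is still buffered.
        segments.append(json.loads(recognizer.FinalResult())["text"])
    return " ".join(s for s in segments if s)
```

A call would look like `stream_transcribe("path/to/model", "en_sh_lights_70pct_4s.wav")`, where the model path is a placeholder. In a live assistant the chunks would come from a microphone buffer instead of a file.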

Benchmarks

Test notes:

File 1 is en_speech_jfk_11s.wav
File 2 is en_sh_lights_70pct_4s.wav
All Whisper tests are done without language detection!
Whisper TFlite (slim) is the tflite_runtime package built with Bazel (faster than default!)
Whisper Cpp is built with default settings ('NEON = 1', 'BLAS = 0'), Whisper Cpp (BLAS) is built with OpenBLAS
Whisper CT2 uses the 'int8' model
Quality is a subjective impression of the transcribed result (TODO: replace with WER)
Sherpa model small-2023-01-09 full name is conv-emformer-transducer-small-2023-01-09
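
RTF in the tables below is the transcription time divided by the audio duration, so values below 1 mean faster than real time. The following is a minimal sketch of how such a number can be measured, assuming WAV input; `transcribe` stands for any engine call and is a placeholder, as are the helper names.

```python
import time
import wave


def audio_duration_s(wav_path: str) -> float:
    """Length of a WAV file in seconds."""
    with wave.open(wav_path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def timed_rtf(transcribe, wav_path: str):
    """Run transcribe(wav_path) and return (result, seconds, RTF)."""
    start = time.perf_counter()
    result = transcribe(wav_path)
    elapsed = time.perf_counter() - start
    return result, elapsed, real_time_factor(elapsed, audio_duration_s(wav_path))


# Check against a table row: Whisper original, file 1 (11s audio), 5.9s.
print(round(real_time_factor(5.9, 11.0), 2))  # 0.54, as in the table
```

Because the listed times are rounded to one decimal, recomputing RTF from them can differ from the table by about 0.01 in some rows.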

Raspberry Pi 400 - Aarch64 - Debian Bullseye

Test date: 2023.02.17

Engine	Model	File	Threads	Stream	Time	RTF	Quality
Whisper original	tiny	1	4	-	5.9s	0.54	perfect
Whisper original	tiny	2	4	-	4.3s	1.19	perfect
Whisper TFlite	tiny.en	1	4	-	4.1s	0.37	perfect
Whisper TFlite	tiny.en	2	4	-	3.4s	0.94	perfect
Whisper TFlite (slim)	tiny.en	1	4	-	3.9s	0.36	perfect
Whisper TFlite (slim)	tiny.en	2	4	-	3.2s	0.90	perfect
Whisper TFlite (slim)	tiny	1	4	-	4.7s	0.43	perfect
Whisper TFlite (slim)	tiny	2	4	-	3.8s	1.06	perfect
Whisper Cpp	ggml-tiny	1	4	-	9.1s	0.83	perfect
Whisper Cpp	ggml-tiny	2	4	-	8.6s	2.39	perfect
Whisper Cpp (BLAS)	ggml-tiny	1	4	-	8.4s	0.76	perfect
Whisper Cpp (BLAS)	ggml-tiny	2	4	-	8.0s	2.22	perfect
Whisper CT2	whisper-tiny-ct2	1	4	-	3.9s	0.36	perfect
Whisper CT2	whisper-tiny-ct2	2	4	-	3.2s	0.90	perfect
Sherpa ncnn	small-2023-01-09	1	4	+	2.0s	0.18	okayish
Sherpa ncnn	small-2023-01-09	2	4	+	0.6s	0.18	low

Test date: 2023.03.07

Engine	Model	File	Threads	Stream	Time	RTF	Quality
Nvidia NeMo	en_conformer_ctc_small	1	4	-	1.1s	0.10	perfect
Nvidia NeMo	en_conformer_ctc_small	2	4	-	0.5s	0.14	perfect

Orange Pi 5 8GB - Aarch64 - Armbian Bullseye (Kernel 5.10.110-rockchip-rk3588)

Test date: 2023.02.19

Engine	Model	File	Threads	Stream	Time	RTF	Quality
Whisper original	tiny	1	4	-	3.0s	0.27	perfect
Whisper original	tiny	2	4	-	1.9s	0.53	perfect
Whisper TFlite (slim)	tiny	1	4	-	1.4s	0.13	perfect
Whisper TFlite (slim)	tiny	2	4	-	1.4s	0.39	perfect
Whisper Cpp (BLAS)	ggml-tiny	1	4	-	3.7s	0.34	perfect
Whisper Cpp (BLAS)	ggml-tiny	2	4	-	3.5s	0.97	perfect
Whisper CT2	whisper-tiny-ct2	1	4	-	1.3s	0.12	perfect
Whisper CT2	whisper-tiny-ct2	2	4	-	1.4s	0.39	perfect

Test date: 2023.03.07

Engine	Model	File	Threads	Stream	Time	RTF	Quality
Sherpa ncnn	small-2023-01-09	1	4	+	0.6s	0.05	okayish
Sherpa ncnn	small-2023-01-09	2	4	+	0.2s	0.06	low
Nvidia NeMo	en_conformer_ctc_small	1	4	-	0.4s	0.03	perfect
Nvidia NeMo	en_conformer_ctc_small	2	4	-	0.2s	0.06	perfect
