fquirin / speech-recognition-experiments

Experiments to test different speech recognition systems for SEPIA Framework

License: MIT License

Shell 14.07% Python 85.93%
asr coqui speech-recognition stt vosk whisper kaldi speech-to-text raspberry-pi nemo

Speech Recognition Experiments

Experiments to check out different ASR/STT systems and evaluate integration into SEPIA STT-Server.

ASR engines:

  • Whisper org - The original Whisper version by OpenAI
  • Whisper TFlite - A TensorFlow Lite compatible Whisper port
  • Whisper Cpp - A small C++ port of Whisper
  • Whisper CT2 - An efficient and fast CTranslate2 port of Whisper
  • Sherpa ncnn - Next-gen Kaldi implementation for streaming ASR
  • Nvidia NeMo - A toolkit for various end-to-end ASR models and languages
  • Vosk - Fast, small, accurate (for clear audio), easy to customize. Works with classic Kaldi models. One of the core engines of SEPIA STT Server.

Wake-Word detection:

  • OpenWakeWord - A robust, NN-based, open-source wake-word detection framework with a focus on performance and simplicity.

Other great ASR engines already included in SEPIA:

  • Coqui STT - Successor of Mozilla's DeepSpeech project. End-to-end ASR with CTC decoder and "optional" LMs.

Installation

  • Each ASR experiment folder has an install script. Simply run bash install.sh.
  • Sometimes you will find additional scripts to download models. They should be mentioned during installation.
  • After a successful installation, use bash run-test.sh to run a default test. If the script uses Python, you need to activate the right virtual environment first: source venv/bin/activate.

Comments and Impressions

  • Whisper:
    • Whisper, in any form, is very accurate, but the missing streaming support is its biggest drawback.
    • Transcription time does not scale linearly with audio length, so the RTF (real-time factor) is not constant. Unfortunately, short files (<4s) take almost as long to transcribe as longer ones (>10s).
    • For Raspberry Pi 4 based voice assistants you usually have to wait >3s after finishing your input to get a result (bad UX).
    • An Orange Pi 5 with the fastest Whisper ports is fast enough to run the 'tiny' model and get good UX (usually <1.5s inference time for every input <30s).
    • Whisper CT2 seems to be the best version right now for Arm64/Aarch64 systems (RPi4 etc.). It is as fast as the TFlite version or even faster, is smaller in size, works better with non-English languages and has a cleaner API.
  • Sherpa ncnn:
    • Sherpa is very fast and supports streaming audio, but without a language model the WER (word error rate) is a bit high at the moment. Results look very promising though.
    • Example result (file 1, JFK speech): "AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY".
    • UPDATED 2023.04.29: Included better English model.
  • Nvidia NeMo:
    • Nvidia NeMo small models (e.g. 'en_conformer_ctc_small') are very fast and precise for clear and simple audio files.
    • Unfortunately, NeMo has no pre-trained streaming Conformer models yet (2023.03.07).
    • Non-streaming NeMo is a bit faster than Sherpa ncnn and much more precise.
    • The test results below currently suggest that quality is as good as Whisper's, but more complicated vocabulary and noisy audio quickly show that Whisper still performs much better, especially compared to larger NeMo models.
    • NeMo can be tuned easily using (phoneme-free!) language models. Depending on your beam parameters (width, alpha, beta), accuracy for your LM vocabulary can increase dramatically, while it will drop for out-of-vocabulary words.
  • Vosk:
    • Vosk is very small, fast, supports streaming audio and you can convert most of the classic Kaldi models to work with it.
    • The small models are only ~50MB and surprisingly good, even for general dictation tasks ... if your input audio isn't too noisy and your vocabulary not too complicated.
    • The larger models are solid, but I never really use them, because they are much slower, need more RAM and don't offer much better results in my everyday tests with SEPIA assistant.
    • If you want good accuracy in a specific domain you should train your own language model. The Vosk homepage has some documentation, but for SEPIA I use the kaldi-adapt-lm repo.
    • Vosk with a custom LM is probably your best open-source ASR choice on low-end hardware.
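
The effect of the beam parameters mentioned for NeMo can be illustrated with the score that is commonly used when a CTC beam search is combined with an n-gram LM: acoustic log-probability, plus alpha times the LM log-probability, plus beta times the word count. The following is only a toy sketch of that idea, not NeMo's implementation. The candidate texts, their probabilities and the function names are hypothetical.

```python
import math


def fused_score(acoustic_logp, lm_logp, word_count, alpha, beta):
    """Common LM fusion score: acoustic + alpha * LM + beta * word count."""
    return acoustic_logp + alpha * lm_logp + beta * word_count


# Hypothetical beam candidates: (text, acoustic log-prob, LM log-prob).
# The second candidate sounds slightly "better" to the acoustic model,
# but a smart-home LM would consider it very unlikely.
CANDIDATES = [
    ("turn on the lights", math.log(0.30), math.log(0.20)),
    ("turn on the lice", math.log(0.35), math.log(0.001)),
]


def best_candidate(candidates, alpha, beta):
    """Return the text of the candidate with the highest fused score."""
    def score(candidate):
        text, acoustic_logp, lm_logp = candidate
        return fused_score(acoustic_logp, lm_logp, len(text.split()), alpha, beta)

    return max(candidates, key=score)[0]


if __name__ == "__main__":
    # alpha = 0 ignores the LM, so the acoustic score alone decides.
    print("alpha=0.0:", best_candidate(CANDIDATES, alpha=0.0, beta=0.0))
    # A larger alpha lets the LM pull the result towards its vocabulary.
    print("alpha=0.5:", best_candidate(CANDIDATES, alpha=0.5, beta=0.0))
```

The same mechanism explains the drop for out-of-vocabulary words: a correct hypothesis that the LM has never seen gets a very low LM log-probability and loses against a wrong but familiar one as alpha grows.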

Benchmarks

Test notes:

  • File 1 is en_speech_jfk_11s.wav
  • File 2 is en_sh_lights_70pct_4s.wav
  • All Whisper tests are done without language detection!
  • Whisper TFlite (slim) is the tflite_runtime package built with Bazel (faster than default!)
  • Whisper Cpp is built with default settings ('NEON = 1', 'BLAS = 0') and Whisper Cpp (BLAS) with OpenBLAS
  • Whisper CT2 uses the 'int8' model
  • Quality is a subjective impression of the transcribed result (TODO: replace with WER)
  • The full name of the Sherpa model small-2023-01-09 is conv-emformer-transducer-small-2023-01-09

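
Since the quality column is still subjective, here is a minimal sketch of how a word error rate could be computed instead. It uses a plain word-level edit distance after simple normalization (lowercase, punctuation removed). The reference text is an assumption: it is the well-known JFK quote, which may differ slightly from what is actually spoken in file 1. The hypothesis is the Sherpa ncnn result quoted in the comments above. Function and variable names are made up for this example.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference transcript for file 1 (not taken from the repository).
REFERENCE = ("And so my fellow Americans, ask not what your country can do "
             "for you, ask what you can do for your country.")
# Sherpa ncnn result as quoted in the comments section.
SHERPA_OUTPUT = ("AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO "
                 "FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY")

if __name__ == "__main__":
    print(f"WER: {word_error_rate(REFERENCE, SHERPA_OUTPUT):.3f}")
```

Under these assumptions the Sherpa result contains one substitution ("SAW" for "so") in 22 reference words.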
Raspberry Pi 400 - Aarch64 - Debian Bullseye

Test date: 2023.02.17

| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Whisper original | tiny | 1 | 4 | - | 5.9s | 0.54 | perfect |
| Whisper original | tiny | 2 | 4 | - | 4.3s | 1.19 | perfect |
| Whisper TFlite | tiny.en | 1 | 4 | - | 4.1s | 0.37 | perfect |
| Whisper TFlite | tiny.en | 2 | 4 | - | 3.4s | 0.94 | perfect |
| Whisper TFlite (slim) | tiny.en | 1 | 4 | - | 3.9s | 0.36 | perfect |
| Whisper TFlite (slim) | tiny.en | 2 | 4 | - | 3.2s | 0.90 | perfect |
| Whisper TFlite (slim) | tiny | 1 | 4 | - | 4.7s | 0.43 | perfect |
| Whisper TFlite (slim) | tiny | 2 | 4 | - | 3.8s | 1.06 | perfect |
| Whisper Cpp | ggml-tiny | 1 | 4 | - | 9.1s | 0.83 | perfect |
| Whisper Cpp | ggml-tiny | 2 | 4 | - | 8.6s | 2.39 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 1 | 4 | - | 8.4s | 0.76 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 2 | 4 | - | 8.0s | 2.22 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 1 | 4 | - | 3.9s | 0.36 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 2 | 4 | - | 3.2s | 0.90 | perfect |
| Sherpa ncnn | small-2023-01-09 | 1 | 4 | + | 2.0s | 0.18 | okayish |
| Sherpa ncnn | small-2023-01-09 | 2 | 4 | + | 0.6s | 0.18 | low |

Test date: 2023.03.07

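| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Nvidia NeMo | en_conformer_ctc_small | 1 | 4 | - | 1.1s | 0.10 | perfect |
| Nvidia NeMo | en_conformer_ctc_small | 2 | 4 | - | 0.5s | 0.14 | perfect |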
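
The RTF values are consistent with the usual definition of the real-time factor: transcription time divided by audio duration, where values below 1 mean faster than real time. A minimal sketch follows, assuming file 1 is about 11s long as its name suggests. The helper names are hypothetical and the repository's own test scripts may measure this differently. Because the listed times are rounded, recomputed values can differ slightly from the tables.

```python
import wave


def audio_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Times for file 1 on the Raspberry Pi 400, taken from the tables.
    # The 11.0s duration is an assumption based on the file name.
    file1_seconds = 11.0
    for engine, seconds in [("Whisper original", 5.9),
                            ("Whisper Cpp", 9.1),
                            ("Sherpa ncnn", 2.0)]:
        rtf = real_time_factor(seconds, file1_seconds)
        print(f"{engine}: RTF {rtf:.2f}")
```

For a real measurement, audio_duration_seconds("en_speech_jfk_11s.wav") would replace the assumed 11.0s.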

Orange Pi 5 8GB - Aarch64 - Armbian Bullseye (Kernel 5.10.110-rockchip-rk3588)

Test date: 2023.02.19

| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Whisper original | tiny | 1 | 4 | - | 3.0s | 0.27 | perfect |
| Whisper original | tiny | 2 | 4 | - | 1.9s | 0.53 | perfect |
| Whisper TFlite (slim) | tiny | 1 | 4 | - | 1.4s | 0.13 | perfect |
| Whisper TFlite (slim) | tiny | 2 | 4 | - | 1.4s | 0.39 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 1 | 4 | - | 3.7s | 0.34 | perfect |
| Whisper Cpp (BLAS) | ggml-tiny | 2 | 4 | - | 3.5s | 0.97 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 1 | 4 | - | 1.3s | 0.12 | perfect |
| Whisper CT2 | whisper-tiny-ct2 | 2 | 4 | - | 1.4s | 0.39 | perfect |

Test date: 2023.03.07

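| Engine | Model | File | Threads | Stream | Time | RTF | Quality |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sherpa ncnn | small-2023-01-09 | 1 | 4 | + | 0.6s | 0.05 | okayish |
| Sherpa ncnn | small-2023-01-09 | 2 | 4 | + | 0.2s | 0.06 | low |
| Nvidia NeMo | en_conformer_ctc_small | 1 | 4 | - | 0.4s | 0.03 | perfect |
| Nvidia NeMo | en_conformer_ctc_small | 2 | 4 | - | 0.2s | 0.06 | perfect |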
