Coder Social home page Coder Social logo

deutscheki / tevr-asr-tool Goto Github PK

View Code? Open in Web Editor NEW
406.0 14.0 18.0 296 KB

State-of-the-art (ranked #1 Aug 2022) German Speech Recognition in 284 lines of C++. This is a 100% private 100% offline 100% free CLI tool.

License: MIT License

CMake 0.40% C 96.95% C++ 2.62% Shell 0.03%
c-plus-plus command-line-tool speech speech-recognition kenlm no-cloud offline private speech-to-text tensorflow-lite wave deep-learning language-model machine-learning pretrained-models tensorflow debian debian-package asr german-speech-recognition

tevr-asr-tool's Introduction

TEVR ASR Tool

  • state-of-the-art performance
  • no GPU needed
  • 100% offline
  • 100% private
  • 100% free
  • MIT license
  • Linux x86_64
  • command-line tool
  • easy to understand
    • only 284 lines of C++ code
    • AI model on HuggingFace

High Transcription Quality

In August 2022, we ranked #1 on "Speech Recognition on Common Voice German (using extra training data)" with a 3.64% word error rate. Accordingly, the performance of this tool is considered to be the best of what's currently possible in German speech recognition: PWC

How does this work?

L175-L185 load the WAV file. L189-L229 execute the acoustic AI model. L260-L275 convert the predicted token logits into string snippets. L73-L162 implement the Beam search re-scoring based on a KenLM language model.

If you're curious how the acoustic AI model works and why I designed it that way, here's the paper: https://arxiv.org/abs/2206.12693 and here's a pre-trained HuggingFace transformers model: https://huggingface.co/fxtentacle/wav2vec2-xls-r-1b-tevr

Install the Debian/Ubuntu package

Download tevr_asr_tool-1.0.0-Linux-x86_64.deb from GitHub and extract the multipart ZIP:

wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/download/v1.0.0/tevr_asr_tool-1.0.0-Linux-x86_64.zip.001"
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/download/v1.0.0/tevr_asr_tool-1.0.0-Linux-x86_64.zip.002"
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/download/v1.0.0/tevr_asr_tool-1.0.0-Linux-x86_64.zip.003"
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/download/v1.0.0/tevr_asr_tool-1.0.0-Linux-x86_64.zip.004"
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/download/v1.0.0/tevr_asr_tool-1.0.0-Linux-x86_64.zip.005"
cat tevr_asr_tool-1.0.0-Linux-x86_64.zip.00* > tevr_asr_tool-1.0.0-Linux-x86_64.zip
unzip tevr_asr_tool-1.0.0-Linux-x86_64.zip

Install it:

sudo dpkg -i tevr_asr_tool-1.0.0-Linux-x86_64.deb

Install from Source Code

Download submodules:

git submodule update --init

CMake configure and build:

cmake -DCMAKE_BUILD_TYPE=MinSizeRel -DCPACK_CMAKE_GENERATOR=Ninja -S . -B build
cmake --build build --target tevr_asr_tool -j 16

Create debian package:

(cd build && cpack -G DEB)

Install it:

sudo dpkg -i build/tevr_asr_tool-1.0.0-Linux-x86_64.deb

Usage

tevr_asr_tool --target_file=test_audio.wav 2>log.txt

should display the correct transcription mückenstiche sollte man nicht aufkratzen. And log.txt will contain the diagnostics and progress that was logged to stderr during execution.

GPU Acceleration for Developers

I plan to release a Vulkan & OpenGL-accelerated real-time low-latency transcription software for developers soon. It'll run 100% private + 100% offline just like this tool, but instead of processing a WAV file on CPU it'll stream the real-time GPU transcription of your microphone input through a WebRTC-capable REST API so that you can easily integrate it with your own voice-controlled projects. For example, that'll enable hackable voice typing together with pynput.keyboard.

If you want to get notified when it launches, please enter your email at https://madmimi.com/signups/f0da3b13840d40ce9e061cafea6280d5/join

Commercial Customization

This tool itself is free to use also for commercial use. And of course it comes with no warranty of any kind.

But if you have an idea for a commercial use-case for a customized version of this tool or for similar technology - ideally something that helps small and medium-sized businesses in northern Germany become more competitive - then please contact me at [email protected]

Research Citation

If you use this for research, please cite:

@misc{https://doi.org/10.48550/arxiv.2206.12693,
  doi = {10.48550/ARXIV.2206.12693},
  url = {https://arxiv.org/abs/2206.12693},
  author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},  
  keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},  
  title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},  
  publisher = {arXiv},  
  year = {2022}, 
  copyright = {Creative Commons Attribution 4.0 International}
}

Replace the AI Model

The German AI model and my training scripts can be found on HuggingFace: https://huggingface.co/fxtentacle/wav2vec2-xls-r-1b-tevr

The model has undergone XLS-R cross-language pre-training. You can directly fine-tune it with a different language dataset - for example CommonVoice English - and then re-export the files in the tevr-asr-data folder.

Alternatively, you can donate roughly 2 weeks of A100 GPU credits to me and I'll train a suitable recognition model and upload it to HuggingFace.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.