This repository contains:
A) the files needed to perform forced decoding, given a piece of audio and
the "ground truth" words that the decoding is expected to obtain;
B) the files needed to turn this into an API.
Disclaimer: All of the API work here is very primitive and should only be used as a proof of concept.
For the backend / decoding part, I tested everything here on an Ubuntu 20.04 VM on Azure.
To set up the backend / decoding, follow these steps on the Ubuntu machine (as root):
- Install Docker.
- Pull the Kaldi image:
    docker pull kaldiasr/kaldi
- Clone this repository.
- Start a container with the repository's server directory mounted:
    docker run -it -v <PATH_TO_REPO>/server:/opt/kaldi/egs/wsj/s5/forced_vit kaldiasr/kaldi /bin/bash
- In the container, run:
    cd egs/wsj/s5/
    cp -r forced_vit/* .
    ./stage.sh
    ./make_forced.sh
    ./setup_speech.sh
- Detach from the container without stopping it (Ctrl-P followed by Ctrl-Q); it must stay running.
- Run the server script app.py, passing the ID of the running container (listed by docker ps):
    Usage: python3 app.py <docker-container-id>
No setup is needed on the client side; just run the client script detect_ans.py
with the IP address of the server.
Usage: python3 detect_ans.py <api-ip>
If you just want to check that forced decoding works after setting
up the backend, place the audio file here inside the Docker container:
/opt/kaldi/egs/wsj/s5/client_sound.wav
Then, from /opt/kaldi/egs/wsj/s5, run: ./forced_single.sh "<WORDS-TO-DECODE>"
Running this for the first time will take a while.
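
The manual test above can also be driven from the host instead of from a shell inside the container. The following is only a minimal sketch, not the code in app.py: it assumes the container is still running, that forced_single.sh sits in /opt/kaldi/egs/wsj/s5 as described above, and that the script prints its result to standard output. The function names build_commands and run_forced_decode are hypothetical.

```python
import subprocess

# Paths taken from this README; adjust them if your container layout differs.
RECIPE_DIR = "/opt/kaldi/egs/wsj/s5"
CONTAINER_WAV = RECIPE_DIR + "/client_sound.wav"


def build_commands(container_id, wav_path, words):
    """Return two docker command lines: copy the audio in, then decode it."""
    # "docker cp" copies a host file to a path inside the container.
    copy_cmd = ["docker", "cp", wav_path, f"{container_id}:{CONTAINER_WAV}"]
    # "docker exec -w" runs a command in the container from the given directory.
    # Passing the words as one list element keeps them a single quoted argument.
    decode_cmd = ["docker", "exec", "-w", RECIPE_DIR, container_id,
                  "./forced_single.sh", words]
    return copy_cmd, decode_cmd


def run_forced_decode(container_id, wav_path, words):
    """Run both commands on the host and return the script's stdout as text."""
    copy_cmd, decode_cmd = build_commands(container_id, wav_path, words)
    subprocess.run(copy_cmd, check=True)
    result = subprocess.run(decode_cmd, check=True,
                            capture_output=True, text=True)
    return result.stdout
```

For example, run_forced_decode("<docker-container-id>", "sample.wav", "<WORDS-TO-DECODE>") would copy sample.wav into the container and return whatever forced_single.sh prints. No assumption is made here about the format of that output.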
Also, client/split_test_data.py
is included to show roughly how I split audio according to the YouTube transcript when testing the decoder for resilience.
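
To illustrate the general idea of cutting audio by transcript timing, here is a minimal sketch using only Python's standard wave module. It is not the code in client/split_test_data.py. It assumes the transcript has already been parsed into (start_sec, end_sec, text) tuples and that the audio is an uncompressed WAV file; the function name split_wav and the output naming scheme are hypothetical.

```python
import wave


def split_wav(src_path, segments, out_prefix):
    """Write one WAV file per (start_sec, end_sec, text) segment.

    Returns a list of (output_path, text) pairs, so each clip stays
    paired with the words it is expected to contain.
    """
    outputs = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for i, (start, end, text) in enumerate(segments):
            # Convert seconds to frame indices, clamped to the file length.
            first = min(int(start * rate), total)
            last = min(int(end * rate), total)
            src.setpos(first)
            frames = src.readframes(max(last - first, 0))
            out_path = f"{out_prefix}_{i:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            outputs.append((out_path, text))
    return outputs
```

Each returned pair gives a clip and its expected words, which is the kind of input the forced decoder takes (an audio file plus the words to decode).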