uiuc-sst / asr24

24-hour Automatic Speech Recognition

License: GNU General Public License v3.0


asr24's Introduction

Well within 24 hours, transcribe 40 hours of recorded speech in a surprise language.

Build an ASR for a surprise language L from a pre-trained acoustic model, an L pronunciation dictionary, and an L language model. This approach converts phones directly to L words. This is less noisy than using multiple cross-trained ASRs to make English words from which phone strings are extracted, merged by PTgen, and reconstituted into L words.

A full description with performance measurements is on arXiv, and in:
M Hasegawa-Johnson, L Rolston, C Goudeseune, GA Levow, and K Kirchhoff,
Grapheme-to-phoneme transduction for cross-language ASR, Stat. Lang. Speech Proc., pp. 3–19, 2020.

Install software:

Kaldi

If you don't already have a version of Kaldi newer than 2016 Sep 30, get and build it following the instructions in its INSTALL files.

    git clone https://github.com/kaldi-asr/kaldi
    cd kaldi/tools; make -j $(nproc)
    cd ../src; ./configure --shared && make depend -j $(nproc) && make -j $(nproc)

brno-phnrec

Put Brno U. of Technology's phoneme recognizer next to the usual s5 directory.

    sudo apt-get install libopenblas-dev libopenblas-base
    cd kaldi/egs/aspire
    git clone https://github.com/uiuc-sst/brno-phnrec.git
    cd brno-phnrec/PhnRec
    make

This repo

Put this next to the usual s5 directory.
(The package nodejs is for ./sampa2ipa.js.)

    sudo apt-get install nodejs
    cd kaldi/egs/aspire
    git clone https://github.com/uiuc-sst/asr24.git
    cd asr24

Extension of ASpIRE

    cd kaldi/egs/aspire/asr24
    wget -qO- http://dl.kaldi-asr.org/models/0001_aspire_chain_model.tar.gz | tar xz
    steps/online/nnet3/prepare_online_decoding.sh \
      --mfcc-config conf/mfcc_hires.conf \
      data/lang_chain exp/nnet3/extractor \
      exp/chain/tdnn_7b exp/tdnn_7b_chain_online
    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph_pp

In exp/tdnn_7b_chain_online this builds the files phones.txt, tree, final.mdl, conf/, etc.
(The downloaded tarball provides the subdirectories data and exp.) The last command, mkgraph.sh, can take 45 minutes (30 for CVTE Mandarin) and use a lot of memory, because it calls fstdeterminizestar on a large language model, as Dan Povey explains.

  • Verify that it can transcribe English, in mono 16-bit 8 kHz .wav format. Either use the provided 8khz.wav, or sox MySpeech.wav -r 8000 8khz.wav, or ffmpeg -i MySpeech.wav -acodec pcm_s16le -ac 1 -ar 8000 8khz.wav.

(The scripts cmd.sh and path.sh say where to find kaldi/src/online2bin/online2-wav-nnet3-latgen-faster.)

    . cmd.sh && . path.sh
    online2-wav-nnet3-latgen-faster \
      --online=false  --do-endpointing=false \
      --frame-subsampling-factor=3 \
      --config=exp/tdnn_7b_chain_online/conf/online.conf \
      --max-active=7000 \
      --beam=15.0  --lattice-beam=6.0  --acoustic-scale=1.0 \
      --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
      exp/tdnn_7b_chain_online/final.mdl \
      exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
      'ark:echo utterance-id1 utterance-id1|' \
      'scp:echo utterance-id1 8khz.wav|' \
      'ark:/dev/null'

CVTE Mandarin

  • Get the Mandarin chain model (3.4 GB, about 10 minutes). This makes a subdir cvte/s5, containing a words.txt, HCLG.fst, and final.mdl.
    wget -qO- http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz | tar xz
    steps/online/nnet3/prepare_online_decoding.sh \
      --mfcc-config conf/mfcc_hires.conf \
      data/lang_chain exp/nnet3/extractor \
      exp/chain/tdnn_7b cvte/s5/exp/chain/tdnn
    utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
      cvte/s5/exp/chain/tdnn cvte/s5/exp/chain/tdnn/graph_pp

For each language L, build an ASR:

Get raw text.

  • Into $L/train_all/text put word strings in L (scraped from wherever), roughly 10 words per line, at most 500k lines. This text may be quite noisy, because the build scripts clean it up.
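The reflow into roughly-10-word lines can be sketched with awk; here scraped.txt is a hypothetical stand-in for the raw scraped text, and in practice the output would go to $L/train_all/text:

```shell
# Stand-in for real scraped text in L (hypothetical file):
printf 'one two three four five six seven eight nine ten eleven twelve\n' > scraped.txt
# Emit ~10 words per line and cap at 500k lines:
awk '{ for (i = 1; i <= NF; ++i) printf "%s%s", $i, (++n % 10 ? " " : "\n") }
     END { if (n % 10) print "" }' scraped.txt | head -n 500000 > text
```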

Get a G2P.

  • Into $L/train_all/g2aspire.txt put a G2P, a few hundred lines each containing grapheme(s), whitespace, and space-delimited Aspire-style phones.
    If it has CR line terminators, convert them to standard ones in vi with %s/^M/\r/g, typing control-V before the ^M.
    If it starts with a BOM, remove it: vi -b g2aspire.txt, and just x that character away.
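A non-interactive alternative to those vi edits, as a sketch assuming GNU sed (for its \xHH escapes) and a file with CR line terminators plus a leading UTF-8 BOM:

```shell
# Stand-in for a downloaded g2aspire.txt with a BOM and CR line terminators:
printf '\xef\xbb\xbfa AH\rb B\r' > g2aspire.txt
# Convert CRs to newlines, then strip the leading UTF-8 BOM:
tr '\r' '\n' < g2aspire.txt | sed '1s/^\xef\xbb\xbf//' > g2aspire.tmp
mv g2aspire.tmp g2aspire.txt
```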

  • If you need to build the G2P, ./g2ipa2asr.py $L_wikipedia_symboltable.txt aspire2ipa.txt phoibletable.csv > $L/train_all/g2aspire.txt.

Build an ASR.

  • ./run.sh $L makes an L-customized HCLG.fst.
  • To instead use a prebuilt LM, ./run_from_wordlist.sh $L. See that script for usage.

Transcribe speech:

Get recordings.

On ifp-serv-03.ifp.illinois.edu, get LDC speech and convert it to a flat dir of 8 kHz .wav files:

    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Russian/LDC2016E111/RUS_20160930
    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Tamil/TAM_EVAL_20170601/TAM_EVAL_20170601
    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Uzbek/LDC2016E66/UZB_20160711

    mkdir /tmp/8k
    for f in */AUDIO/*.flac; do sox "$f" -r 8000 -c 1 /tmp/8k/"$(basename "${f%.*}").wav"; done
    tar cf /workspace/ifp-53_1-data/eval/8k.tar -C /tmp 8k
    rm -rf /tmp/8k

For BABEL .sph files:

    cd /ws/ifp-serv-03_1/workspace/fletcher/fletcher1/speech_data1/Assamese/LDC2016E02/conversational/training/audio
    tar cf /tmp/foo.tar BABEL*.sph
    scp /tmp/foo.tar ifp-53:/tmp

On ifp-53,

    mkdir ~/kaldi/egs/aspire/asr24/$L-8khz
    cd myTmpSphDir
    tar xf /tmp/foo.tar
    for f in *.sph; do ~/kaldi/tools/sph2pipe_v2.5/sph2pipe -p -f rif "$f" /tmp/a.wav; \
        sox /tmp/a.wav -r 8000 -c 1 ~/kaldi/egs/aspire/asr24/$L-8khz/"$(basename "${f%.*}").wav"; done

On the host that will run the transcribing, e.g. ifp-53:

    cd kaldi/egs/aspire/asr24
    wget -qO- http://www.ifp.illinois.edu/~camilleg/e/8k.tar | tar xf -
    mv 8k $L-8khz
  • ./mkscp.rb $L-8khz $(nproc) $L splits the ASR tasks into one job per CPU core, each job with roughly the same audio duration.
    It reads $L-8khz, the dir of 8 kHz speech files.
    It makes $L-submit.sh.
  • ./$L-submit.sh launches these jobs in parallel.
  • After those jobs complete, collect the transcriptions with
    grep -h -e '^TAM_EVAL' $L/lat/*.log | sort > $L-scrips.txt (or ...^RUS_, ^BABEL_, etc.).
  • To sftp transcriptions to Jon May as elisa.tam-eng.eval-asr-uiuc.y3r1.v8.xml.gz, with timestamp June 11 and version 8,
    grep -h -e '^TAM_EVAL' tamil/lat/*.log | sort | sed -e 's/ /\t/' | ./hyp2jonmay.rb /tmp/jon-tam tam 20180611 8
    (If UTF-8 errors occur, simplify letters by appending to the sed command args such as -e 's/Ñ/N/g'.)
  • Collect each .wav file's n best transcriptions with
    cat $L/lat/*.ascii | sort > $L-nbest.txt.

Special postprocessing.

If your transcriptions used nonsense English words, convert them to phones and then, via a trie or longest common substring, into L-words:

  • ./trie-$L.rb < trie1-scrips.txt > $L-trie-scrips.txt.
  • make multicore-$L; wait; grep ... > $L-lcs-scrips.txt.

Typical results.

RUS_20160930 was transcribed in 67 minutes, 13 MB/min, 12x faster than real time.

A 3.1 GB subset of Assamese LDC2016E02 was transcribed in 440 minutes, 7 MB/min, 6.5x real time. (This may have been slower because it exhausted ifp-53's memory.)

Arabic/NEMLAR_speech/NMBCN7AR, 2.2 GB (40 hours), was transcribed in 147 minutes, 14 MB/min, 16x real time. (This may have been faster because it was a few long (half-hour) files instead of many brief ones.)

TAM_EVAL_20170601 was transcribed in 45 minutes, 21 MB/min, 19x real time.

Generating the lattices $L/lat/* took 1.04x as long for Russian, 0.93x as long (!) for Arabic, and 1.7x as long for Tamil.
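As a rough consistency check on those MB/min figures (assuming the sizes refer to 8 kHz, 16-bit, mono .wav data): one minute of such audio is about 0.92 MiB, so a throughput of N MB/min corresponds to roughly N times real time.

```shell
# 8000 samples/s * 2 bytes/sample * 60 s = bytes per minute of speech.
awk 'BEGIN {
    bytes_per_min = 8000 * 2 * 60
    printf "%.2f MiB of wav per minute of speech\n", bytes_per_min / (1024 * 1024)
}'
```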

asr24's People

Contributors

camilleg, jhasegaw

