xubuild / lm_build

This project forked from srvk/lm_build

Adapting your own Language Model for Kaldi

Home Page: http://speechkitchen.org/kaldi-language-model-building/

Shell 61.43% Perl 35.26% Python 3.31%

lm_build's Introduction

Kaldi Language Model Building

Adapting Your Own Language Model

Instructions for building a Kaldi language model from your own text.

When you clone this code into a Kaldi experiment directory such as …/kaldi-trunk/egs/tedlium/s5, you get a folder lm_build/ containing tools and examples that show how to adapt and train a language model from your own training text file.

Adding New Vocabulary Words to the Lexicon

The new script run_adapt.sh makes LM adaptation much easier.

  • Method 1: Manually create a file newwords.txt in the lm_build working folder and place in it any new words that are not already in the lexicon (TEDLIUM.152k.dic). Pronunciations are generated automatically and added to the dictionary.
  • Method 2: Running run_adapt.sh automatically writes candidate OOV (out-of-vocabulary) words to the file candidate_oovs.txt. This list contains every word in the training text that is not already in the dictionary and that appears more than once. Rename this file to newwords.txt and run run_adapt.sh again to use all of these words. Alternatively, edit newwords.txt by hand, consulting oov-counts.txt for the word frequency counts, to refine the dictionary iteratively.
  • (Optional) Add to the example_txt training text file some example sentences that use the new words. Hint: you may need to repeat these LM adaptation sentences 50 to 100 times before the transcriber recognizes the new words and produces them as output.
  • Run the script run_adapt.sh. It performs several steps; the end result is a new composed decoding graph, TLG.fst, in the output folder data/lang_phn_test/.
  • Point your Eesen Transcriber setup at the resulting graph, for example by setting this value in /vagrant/Makefile.options:

GRAPH_DIR?=$(EESEN_ROOT)/asr_egs/tedlium/v2-30ms/lm_build/data/lang_phn_test
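
The candidate list described in Method 2 can be pictured with a small Python sketch. This is not the code in run_adapt.sh; it only illustrates the idea of collecting words that are missing from the dictionary and counting them. The function names, the whitespace tokenization, the lowercasing, and the `min_count` parameter are all assumptions made for illustration. The dictionary line format (a word, optionally followed by a variant marker such as `(2)`, then its phones) is taken from the zydeco entries in this README.

```python
import re
from collections import Counter


def load_vocab(dic_lines):
    """Collect headwords from dictionary lines such as 'zydeco(2) Z IH D AH K OW'."""
    vocab = set()
    for line in dic_lines:
        parts = line.split()
        if not parts:
            continue
        # Strip a trailing variant marker like "(2)" so all variants map to one word.
        vocab.add(re.sub(r"\(\d+\)$", "", parts[0]))
    return vocab


def candidate_oovs(text_lines, vocab, min_count=2):
    """Return (word, count) pairs for words missing from vocab, most frequent first.

    min_count=2 mirrors "appears more than once"; it is a hypothetical parameter.
    """
    counts = Counter(
        word
        for line in text_lines
        for word in line.lower().split()  # assumed tokenization: lowercase, split on whitespace
        if word not in vocab
    )
    return [(word, count) for word, count in counts.most_common() if count >= min_count]


if __name__ == "__main__":
    dic = ["zydeco Z AY D EH K OW", "zydeco(2) Z IH D AH K OW"]
    text = ["zydeco bayou", "bayou accordion"]  # made-up training text
    for word, count in candidate_oovs(text, load_vocab(dic)):
        print(word, count)
```

With the made-up text above, only "bayou" is reported, because "accordion" appears just once and "zydeco" is already in the dictionary. Lowering `min_count` to 1 would list every out-of-vocabulary word, which is closer to what a frequency file such as oov-counts.txt is described as showing.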

Adding your own pronunciations

This process uses the CMU Lexicon Tool to generate dictionary entries with phonetic pronunciations for unseen words. These pronunciations may not always be correct. An alternative approach (a possible Method 3) is to add your own words and pronunciations directly to TEDLIUM.152k.dic first, perhaps by pattern-matching parts of pronunciations from similar words. A word can also have more than one pronunciation, e.g.:

zydeco Z AY D EH K OW
zydeco(2) Z IH D AH K OW
zydeco(3) Z AY D AH K OW
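
When editing entries by hand, it can help to check that the variant numbering stays consistent. The following Python sketch groups entries by word and regenerates the `(2)`, `(3)` suffixes. It assumes only the line format shown above (headword, optional `(n)` marker, then space-separated phones); the helper names `parse_entries` and `format_entries` are hypothetical and are not part of lm_build.

```python
import re
from collections import defaultdict

# Headword with an optional trailing variant marker, e.g. "zydeco" or "zydeco(2)".
VARIANT = re.compile(r"^(?P<word>.+?)(?:\((?P<n>\d+)\))?$")


def parse_entries(lines):
    """Map each word to its list of pronunciations (each a list of phones), in file order."""
    prons = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and entries without phones
        word = VARIANT.match(parts[0]).group("word")
        prons[word].append(parts[1:])
    return dict(prons)


def format_entries(word, pronunciations):
    """Render dictionary lines: the first variant is bare, later ones get (2), (3), ..."""
    lines = []
    for index, phones in enumerate(pronunciations, start=1):
        head = word if index == 1 else f"{word}({index})"
        lines.append(f"{head} {' '.join(phones)}")
    return lines


if __name__ == "__main__":
    entries = [
        "zydeco Z AY D EH K OW",
        "zydeco(2) Z IH D AH K OW",
        "zydeco(3) Z AY D AH K OW",
    ]
    parsed = parse_entries(entries)
    print("\n".join(format_entries("zydeco", parsed["zydeco"])))
```

Parsing the three zydeco lines and formatting them again reproduces the same three lines, so the sketch can serve as a quick round-trip check after adding a new variant.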

FAQ

How some of the scripts work
Deterministic (tiny) LM Building
Adding Technical Words to Dictionary
More Details About LM Building
