subword-kaldi's Introduction

Create a subword Lexicon FST for Kaldi

This is the code belonging to the paper Improved subword modeling for WFST-based speech recognition.

For each subword marking style (word boundary marker, left-right marked, left-marked, right-marked) a separate script exists in local/ that can create an L.fst.
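
To make the four styles concrete, here is a minimal Python sketch of how a sentence that has already been split into subwords could be written in each style. The `<w>` boundary token matches the symbol that appears in the error log further down this page; the `+` marker, the function name `mark_sentence`, and the style keys are assumptions chosen for illustration, not necessarily what the scripts in this repository use.

```python
def mark_sentence(words, style, marker="+", boundary="<w>"):
    """Return the token sequence for a sentence under one marking style.

    `words` is a list of words, each given as a list of subword pieces.
    `style` is one of "wb", "lr", "l", "r" (hypothetical keys).
    """
    if style not in ("wb", "lr", "l", "r"):
        raise ValueError("unknown marking style: {}".format(style))

    tokens = [boundary] if style == "wb" else []
    for pieces in words:
        last = len(pieces) - 1
        for i, piece in enumerate(pieces):
            # A marker on the left means "glued to the previous piece",
            # a marker on the right means "glued to the next piece".
            left = marker if style in ("lr", "l") and i > 0 else ""
            right = marker if style in ("lr", "r") and i < last else ""
            tokens.append(left + piece + right)
        if style == "wb":
            # Word boundaries are separate tokens between words.
            tokens.append(boundary)
    return tokens


sentence = [["two"], ["slipp", "er", "s"]]
for style in ("wb", "lr", "l", "r"):
    print(style, " ".join(mark_sentence(sentence, style)))
```

In this sketch only the word boundary style introduces an extra token; the other three styles encode the boundary information in the subword labels themselves.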

The standard way to use these scripts is:

extra=3
utils/prepare_lang.sh --phone-symbol-table data/lang/phones.txt --num-extra-phone-disambig-syms $extra data/subword_dict "<UNK>" data/subword_lang/local data/subword_lang

dir=data/subword_lang
tmpdir=data/subword_lang/local

# Overwrite L_disambig.fst
common/make_lfst_wb.py $(tail -n$extra $dir/phones/disambig.txt) < $tmpdir/lexiconp_disambig.txt | fstcompile --isymbols=$dir/phones.txt --osymbols=$dir/words.txt --keep_isymbols=false --keep_osymbols=false | fstaddselfloops  $dir/phones/wdisambig_phones.int $dir/phones/wdisambig_words.int | fstarcsort --sort_type=olabel > $dir/L_disambig.fst 

For the other scripts (l-, r-, and lr-marked), the number of extra disambiguation symbols can be reduced to 1.
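
The `make_lfst_*.py` scripts write their output in the OpenFst text format that `fstcompile` reads: one arc per line as `source destination input-label output-label`, optionally followed by a weight. The sketch below shows how a single lexicon entry could be spelled out as a chain of such arcs. It is not the repository's implementation; the function name `word_arcs` and the state numbering are hypothetical.

```python
def word_arcs(word, phones, start, first_free_state, end):
    """Build text-format arcs that consume `phones` and emit `word`.

    The word label is placed on the first arc; the remaining arcs
    output <eps>. Returns the arc lines and the next unused state id.
    """
    arcs = []
    cur = start
    next_free = first_free_state
    for i, phone in enumerate(phones):
        if i == len(phones) - 1:
            dst = end            # last phone leads to the given end state
        else:
            dst = next_free      # allocate a fresh intermediate state
            next_free += 1
        olabel = word if i == 0 else "<eps>"
        arcs.append("{}\t{}\t{}\t{}".format(cur, dst, phone, olabel))
        cur = dst
    return arcs, next_free


arcs, next_state = word_arcs("cat", ["k", "ae", "t"],
                             start=0, first_free_state=5, end=1)
print("\n".join(arcs))
```

Every input and output label written this way has to be present in the symbol tables passed to `fstcompile` via `--isymbols` and `--osymbols`.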

What type of marking style is the best?

Unfortunately, this depends on your language and dataset. We have seen different marking styles perform best on different datasets and languages.

Limitations

  • The lexicon files are not updated in the lang directory, so lexicon-based alignment of lattices will not work (fix in progress)
  • At the moment, all pronunciations have probability 1 (which is common anyway for grapheme-based systems). If custom probabilities are required, the local/make_lfst_*.py files should be updated to include them.
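
As a rough illustration of what including pronunciation probabilities would involve: Kaldi's lexiconp-style files put the probability in the second column, and an FST arc weight is usually the negative log of that probability, so probability 1 corresponds to a cost of 0. The helper below is a hypothetical sketch, not part of the scripts in this repository.

```python
import math


def parse_lexiconp_line(line):
    """Split a 'word prob phone1 phone2 ...' line into (word, phones, cost)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected 'word prob phone...': {!r}".format(line))
    word, prob, phones = fields[0], float(fields[1]), fields[2:]
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability out of range: {}".format(prob))
    # Negative log probability; handle 1.0 explicitly to avoid printing -0.0.
    cost = 0.0 if prob == 1.0 else -math.log(prob)
    return word, phones, cost


print(parse_lexiconp_line("hello 1.0 h e l l o"))
print(parse_lexiconp_line("hello 0.5 h e l o"))
```

The resulting cost would then be written as the optional fifth column of an arc in the text format passed to `fstcompile`.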

Help

Feel free to open an issue or send me an email at [email protected] if you have trouble getting these scripts to work.

subword-kaldi's People

Contributors

gastron, psmit

subword-kaldi's Issues

How to generate data/subword_dict

Hi,

I am trying to use subword units with the Kaldi LibriSpeech recipe. I have added the code snippet from the README to stage 3 of the LibriSpeech recipe:

if [ $stage -le 3 ]; then

   local/prepare_dict.sh --stage 3 --nj 30 --cmd "$train_cmd" data/local/lm data/local/lm data/subword_dict

   utils/prepare_lang.sh --phone-symbol-table data/lang/phones.txt --num-extra-phone-disambig-syms $extra data/subword_dict   "<UNK>" data/subword_lang/local data/subword_lang
    
   subdir=data/subword_lang
   tmpdir=data/subword_lang/local

   local/make_lfst_wb.py $(tail -n$extra $subdir/phones/disambig.txt) < $tmpdir/lexiconp_disambig.txt | fstcompile  --isymbols=$subdir/phones.txt --osymbols=$subdir/words.txt --keep_isymbols=false --keep_osymbols=false | fstaddselfloops  $dir/phones/wdisambig_phones.int $subdir/phones/wdisambig_words.int | fstarcsort --sort_type=olabel > $subdir/L_disambig.fst
fi

Please let me know whether I need to prepare data/subword_dict separately or whether this is correct.
Currently I get the following error:

FATAL: FstCompiler: Symbol "<w>" is not mapped to any integer arc olabel, symbol table = data/subword_lang/words.txt, source = standard input, line = 1
ERROR: FstHeader::Read: Bad FST header: -
ERROR (fstaddselfloops[5.5.971~1-07043]:ReadFstKaldi():kaldi-fst-io.cc:35) Reading FST: error reading FST header from standard input

[ Stack-Trace: ]
/home/chitralekha/kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0xb42) [0x7f81e210b742]
fstaddselfloops(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x21) [0x557e2ee630cf]
/home/chitralekha/kaldi/src/lib/libkaldi-fstext.so(fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x1ba) [0x7f81e25685db]
fstaddselfloops(main+0x123) [0x557e2ee62afd]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f81e1794bf7]
fstaddselfloops(_start+0x2a) [0x557e2ee628fa]

kaldi::KaldiFatalError
ERROR: FstHeader::Read: Bad FST header: standard input
Traceback (most recent call last):
  File "local/make_lfst_wb.py", line 65, in <module>
    print_word(word, phones, False, True, 3, 0)
  File "local/make_lfst_wb.py", line 40, in print_word
    print("{}\t{}\t{}\t{}".format(cur_state,next_state,phones[0],word))
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='ANSI_X3.4-1968'>
BrokenPipeError: [Errno 32] Broken pipe
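
The first FATAL line reports that `<w>` occurs as an output label but has no entry in data/subword_lang/words.txt; the later header and broken-pipe errors follow from `fstcompile` stopping at that point. One way to list all such symbols before compiling is a small check like the sketch below. The function names are hypothetical, and it assumes the usual `symbol id` layout of an OpenFst symbol table.

```python
def load_symbols(lines):
    """Collect the symbol names from 'symbol id' lines of a symbol table."""
    symbols = set()
    for line in lines:
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def missing_olabels(arc_lines, symbols):
    """Return output labels used on arcs that are absent from `symbols`."""
    missing = set()
    for line in arc_lines:
        fields = line.split()
        # Arc lines have at least 4 columns; final-state lines have fewer.
        if len(fields) >= 4 and fields[3] not in symbols:
            missing.add(fields[3])
    return sorted(missing)


words_txt = ["<eps> 0", "<UNK> 1", "hello 2"]
fst_text = ["0\t1\tSIL\t<w>", "1\t2\th\thello", "2"]
print(missing_olabels(fst_text, load_symbols(words_txt)))
```

In practice the two inputs would be the lines of words.txt and the text output of the `make_lfst_*.py` script, saved to a file instead of being piped directly into `fstcompile`.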

Documentation

The documentation is slightly out of date. It should cover the new paths and the new SentencePiece L.fst.

Fix align lexicon

The align_lexicon files should also be changed. Otherwise it is not possible to do phone alignment.

sil symbol

Hi,
Thanks for your code.
In the code, the SIL symbol is hard-coded:

print("{}\t{}\t{}\t{}\t{}".format(0,4,"SIL","<eps>", -math.log(0.5)))

If we use a different silence symbol, do we have to change these lines?

best regards
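
For illustration, the hard-coded line could be wrapped so that the silence phone becomes a parameter. This is only a sketch of one possible change, not how the repository's scripts are written; the function name `optional_silence_arc` and its defaults are assumptions.

```python
import math


def optional_silence_arc(src, dst, sil_phone="SIL", prob=0.5):
    """Format one silence arc in fstcompile text format.

    `sil_phone` has to match the silence phone name in phones.txt.
    """
    return "{}\t{}\t{}\t{}\t{}".format(
        src, dst, sil_phone, "<eps>", -math.log(prob))


# Same arc as the hard-coded line quoted above:
print(optional_silence_arc(0, 4))
# With a differently named silence phone (hypothetical name "sil"):
print(optional_silence_arc(0, 4, sil_phone="sil"))
```

Whatever name is passed must exist in the phone symbol table given to `fstcompile`, otherwise compilation fails with an unmapped-symbol error.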
