srvk / eesen

824 stars · 82 watchers · 342 forks · 6 MB

The official repository of the Eesen project

Home Page: http://arxiv.org/abs/1507.08240

License: Apache License 2.0

Shell 9.72% Perl 5.01% Python 0.42% Makefile 0.94% C++ 70.73% C 10.40% Cuda 2.78%
tensorflow ctc-loss asr ctc kaldi speech-recognition speech-to-text

eesen's Introduction

Eesen

Eesen aims to simplify the existing complicated, expertise-intensive ASR pipeline into a straightforward sequence learning problem. Acoustic modeling in Eesen involves training a single recurrent neural network (RNN) to model the mapping from speech to text. Eesen abandons the following elements required by the existing ASR pipeline:

  • Hidden Markov models (HMMs)
  • Gaussian mixture models (GMMs)
  • Decision trees and phonetic questions
  • Dictionary, if characters are used as the modeling units
  • ...

Eesen was created by Yajie Miao with inspiration from the Kaldi toolkit. Thank you, Yajie!

Key Components

Eesen contains 4 key components to enable end-to-end ASR:

  • Acoustic Model -- Bi-directional RNNs with LSTM units.
  • Training -- Connectionist temporal classification (CTC) as the training objective.
  • WFST Decoding -- A principled decoding approach based on Weighted Finite-State Transducers (WFSTs), or
  • RNN-LM Decoding -- Decoding based on (character) RNN language models, when using Tensorflow (currently its own branch)
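
To make the flow concrete, here is a rough command-level sketch of how these components are chained together, assembled from the invocations quoted in the issues further down this page (paths and option values are illustrative; the real example setups wrap these calls in steps/ and utils/ scripts):

# 1. Feature extraction: filterbank features for each utterance (see steps/make_fbank.sh, quoted in full below)
compute-fbank-feats --verbose=2 --config=conf/fbank.conf scp,p:data/train/wav.scp ark:- | \
  copy-feats --compress=true ark:- ark,scp:fbank/raw_fbank_train.1.ark,fbank/raw_fbank_train.1.scp

# 2. CTC training of the stacked (B)LSTM acoustic model, several utterances in parallel
train-ctc-parallel --report-step=1000 --num-sequence=10 --frame-limit=25000 \
  --learn-rate=0.00004 --momentum=0.9 --verbose=1 \
  'ark,s,cs:copy-feats scp:exp/train_phn_l4_c320/train_local.scp ark:- | add-deltas ark:- ark:- |' \
  'ark:gunzip -c exp/train_phn_l4_c320/labels.tr.gz|' \
  exp/train_phn_l4_c320/nnet/nnet.iter0 exp/train_phn_l4_c320/nnet/nnet.iter1

# 3. WFST decoding: the trained network's frame-level label posteriors are decoded
#    against the composed token/lexicon/grammar graph (TLG.fst) with latgen-faster,
#    and the best hypothesis is read off the lattices (see lattice-1best and the
#    acoustic-weight discussion among the issues below).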

Highlights of Eesen

  • The WFST-based decoding approach can incorporate lexicons and language models into CTC decoding in an effective and efficient way.
  • The RNN-LM decoding approach does not require a fixed lexicon.
  • GPU implementation of LSTM model training and CTC learning, now also using Tensorflow.
  • Multiple utterances are processed in parallel for training speed-up.
  • Fully-fledged example setups to demonstrate end-to-end system building, with both phonemes and characters as labels, following Kaldi recipes and conventions.

Experimental Results

Refer to RESULTS under each example setup.

References

For more information, please refer to the following paper(s):

Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding," in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, AZ; U.S.A., December 2015. IEEE.

eesen's People

Contributors

abuccts, csukuangfj, efosler, fmetze, jb1999, julianslzr, naxingyu, qiukun, riebling, shuang777, standy66, yajiemiao


eesen's Issues

Trying tedlium demo (v1): crash during training on the first epoch

I used the tedlium (110 hours) data and tried to run the run_ctc_phn.sh and run_ctc_char.sh scripts,
but I got the following error during the first epoch.

CUDA version is 7.5
GPU is NVIDIA GeForce GTX 970

The relevant part of the error log is as follows:

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 35000 sequences (53.5094Hr): Obj(log[Pzx]) = -228.534   TokenAcc = 54.2553%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 36000 sequences (56.0337Hr): Obj(log[Pzx]) = -235.923   TokenAcc = 54.2852%
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:658) Allocation of 18460 rows, each of size 2560 bytes failed,  releasing cached memory and retrying.
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:665) Allocation failed for the second time.    Printing device memory usage and exiting
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:334) Memory used: 4142407680 bytes.
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:668) Memory allocation failure
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp/train_char_l5_c320/train_local.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:668) Memory allocation failure

[stack trace: ]
eesen::KaldiGetStackTrace()
eesen::KaldiErrorMessage::~KaldiErrorMessage()

Looking forward to your suggestions.

live mode recognition

I would like to test the model with live utterances (live recording) instead of batch mode.

  1. Does Eesen support live mode?
  2. If not, how can I load a model once and then decode each utterance?

lattice-1best is not aligned

Hi Yajie

Hope you are well.

I am very interested in obtaining the phone sequence correctly aligned to each word, but currently Eesen's lattice-1best.cc can only generate an approximate alignment between a word and its phone sequence.
For example, here is some output from lattice-1best:
0
0 1 21 3.79004,4.6815,1_1_1_1_1_1_1_2_2_2_2_2_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_9_1_1_1_1_1_1_1_1_39_1
1 2 62 0.0800781,0.134042,1_3_1_1_1_1_1_1_1_1_26_1_1_1_1_1_1_1_34_1_1
2 3 57 1.21191,8.98855,1_8_1_1_1_1_1_1_1_1_1_1_10_1_1_3_1_1_1_1_1_1_1_1_1_1_1_34_1_1_1_1_1_8_1_1_1_24_24_24_1_1
3 4 33 0.113281,0.192038,1_5_1_1_1_1_1_1_1_1_1_38_1_1_1_1_1_1_1_1
4 5 60 2.41016,8.13873,1_4_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_10_1_1_32_1_1_1_1_1_1_1_1_8_1_1_1_1_24_24_1_1_37_1_1_1_1_1_1_1_1_1_1_34_1_1_1_1_1_1_1_1_1_1_1_1_31_31_1_1_1_1_1_37_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_10_1_1_1_21_1_1_1_1_1_1_1_1_1_1_1_31_1_1_1_1_1_1_1_1_1_1_1_14_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_12_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1_1
5 3.00586,0,2
...............

The phone sequences are not exactly aligned to the words (represented by ints).
I am very new to Kaldi and ASR. From what I know, there is a Kaldi tool, lattice-word-align.cc, that can be used if we want an exact word-to-phone-sequence alignment, but it requires a final.mdl, which CTC doesn't have.

Could you please give me some hints on how to generate an exact word-to-phone-sequence alignment?

Best

Build failure: error: 'cuda_apply_heaviside' was not declared in this scope

I followed the install instructions and when running configure, satisfied all dependencies:
ben@ben:~/eesen/src$ ./configure --use-cuda=yes --cudatk-dir=/usr/local/cuda-7.5
Configuring ...
Checking OpenFST library in /home/ben/eesen/tools/openfst ...
Checking OpenFst library was patched.
Backing up config.mk to config.mk.bak
Doing OS specific configurations ...
On Linux: Checking for linear algebra header files ...
Using ATLAS as the linear algebra library.
... no libatlas.so in /usr/lib
... no libatlas.so in /usr/lib/atlas
... no libatlas.so in /usr/lib/atlas-sse2
... no libatlas.so in /usr/lib/atlas-sse3
... no libatlas.so in /usr/lib64
... no libatlas.so in /usr/lib64/atlas
... no libatlas.so in /usr/lib64/atlas-sse2
... no libatlas.so in /usr/lib64/atlas-sse3
... no libatlas.so in /usr/local/lib
... no libatlas.so in /usr/local/lib/atlas
... no libatlas.so in /usr/local/lib/atlas-sse2
... no libatlas.so in /usr/local/lib/atlas-sse3
... no libatlas.so in /usr/local/lib64
... no libatlas.so in /usr/local/lib64/atlas
... no libatlas.so in /usr/local/lib64/atlas-sse2
... no libatlas.so in /usr/local/lib64/atlas-sse3
... no libatlas.so in /home/ben/eesen/src/../tools/ATLAS/build/install/lib/
... no libatlas.so in /home/ben/eesen/tools/ATLAS/lib
Could not find libatlas.so in any of the obvious places, will most likely try static:
Could not find libatlas.a in any of the generic-Linux places, but we'll try other stuff...

Successfully configured for Debian 7 [dynamic libraries] with ATLASLIBS =/usr/lib/atlas-base/libatlas.so.3.0 /usr/lib/atlas-base/libf77blas.so.3.0 /usr/lib/atlas-base/libcblas.so.3 /usr/lib/atlas-base/liblapack_atlas.so.3
Using CUDA toolkit /usr/local/cuda-7.5 (nvcc compiler and runtime libraries)
Successfully configured with Speex at /home/ben/eesen/src/../tools/extras/speex-1.2rc1, (static=[false])

But during 'make', it reports the following error:
cuda-matrix.cc: In instantiation of 'void eesen::CuMatrixBase::ApplyHeaviside() [with Real = float]':
cuda-matrix.cc:1227:16: required from here
cuda-matrix.cc:1075:57: error: 'cuda_apply_heaviside' was not declared in this scope
cuda_apply_heaviside(dimGrid, dimBlock, data_, Dim());
^
cuda-matrix.cc: In instantiation of 'void eesen::CuMatrixBase::ApplyHeaviside() [with Real = double]':
cuda-matrix.cc:1228:16: required from here
cuda-matrix.cc:1075:57: error: 'cuda_apply_heaviside' was not declared in this scope
make[1]: *** [cuda-matrix.o] Error 1
make[1]: Leaving directory `/home/ben/eesen/src/gpucompute'
make: *** [gpucompute] Error 2

I did some research and it seems that cuda_apply_heaviside comes from Kaldi, but the earlier dependency check didn't mention installing Kaldi. So what could have gone wrong?

script for TEDLIUM release2

Hello. First of all, thank you very much for the great work!
I'm interested in training on TEDLIUM, especially release 2.
I guess there is no proper language model for it, so I need to build my own language model from the corpus and dictionary using IRSTLM or SRILM. Could you tell me how to do that? Or is there any plan to add it?
Thanks!

latgen-faster speed improvement

Hi Yajie

Could you suggest some ways to improve the speed of latgen-faster?
I tested some utterances recorded in a noisy environment, and decoding can be extremely slow (more than 10 seconds).

I wonder whether any parallel method can be applied here to speed up decoding of a single utterance?

Best

BLAS alternatives

Is it possible to use an alternative to ATLAS?

Running the install for ATLAS I hit the "CPU Throttling apparently enabled!" error/abort. I'm reluctant to disable CPU dynamic frequency scaling, especially when I have a number of other linear algebra libraries already installed (cuda/cublas, Openblas, boost/ublas, scipy/blas). Wouldn't the cuda libcublas be acceptable, or even preferable?

tedlium char based training scripts

Hi

Thank you very much for the great work!

I've tried the tedlium phone-based scripts and they work fine. Now I am trying the char-based scripts; from what I saw, I think they might not have been tested, right?

local/tedlium_prepare_char_dict.sh will produce lexicon1.txt, which breaks <UNK> into < U N K >;
therefore units-nosil.txt will have A B C E G .. U [ ] < > as tokens, which I assume should be kept together.

I noticed that wsj/run_ctc_char.sh is a bit different from tedlium/v1/run_ctc_char.sh. I am wondering whether the tedlium scripts have been tested, and whether I can resort to the wsj script to train a proper char-based system.

Thanks a million!

Best

Eesen runs on GridEngine; qstat shows Eqw

Hi, I have a GridEngine cluster with two nodes (node1, node2); each node has 3 GPUs.
I also have another node (node3) serving NFS, and all the wav data is on node3.
I then mount the data dir on node1 and node2 (node1 and node2 can only read the mounted dir, not write to it).

On node1 and node2 I built Eesen separately, and modified cmd.sh to the queue format as follows:

export train_cmd="queue.pl -q all.q -l arch=*64"
export decode_cmd="queue.pl -q all.q -l arch=*64,mem_free=2G,ram_free=2G"
export mkgraph_cmd="queue.pl -q all.q -l arch=*64,ram_free=4G,mem_free=4G"
export big_memory_cmd="queue.pl -q all.q -l arch=*64,ram_free=8G,mem_free=8G"
export cuda_cmd="queue.pl -q all.q -l gpu=1"

But when I ran ./run_ctc_phn.sh on node1, it stops at the make_fbank step as follows:

steps/make_fbank.sh --cmd queue.pl -q all.q -l arch=*64 --nj 20 data/train exp/make_fbank/train fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_fbank.sh [info]: segments file exists: using that.

But no GPU or CPU is busy, and running qstat shows that some error happened, like this:

28 0.50000 make_fbank kaldi        Eqw   07/29/2016 15:57:54                                    1 1-5:1,7-19:2

But no error log is found.
Please give me some suggestions, thanks very much.

ctc loss output is weird when training on CPU only

Hi,
Since the GPU nodes are in maintenance, I need to compile the code with the --use-cuda=no option.
However, there were a few compile errors, especially related to ExpA, so I changed them to exp.
When EESEN enters the training phase, the log looks like this:
train-ctc-parallel --report-step=1000 --num-sequence=10 --frame-limit=25000 --learn-rate=0.00004 --momentum=0.9 --verbose=1 'ark,s,cs:copy-feats scp:exp/train_phn_l4_c320/train_local.scp ark:- | add-deltas ark:- ark:- |' 'ark:gunzip -c exp/train_phn_l4_c320/labels.tr.gz|' exp/train_phn_l4_c320/nnet/nnet.iter0 exp/train_phn_l4_c320/nnet/nnet.iter1
copy-feats scp:exp/train_phn_l4_c320/train_local.scp ark:-
add-deltas ark:- ark:-
LOG (train-ctc-parallel:main():train-ctc-parallel.cc:112) TRAINING STARTED
VLOG1 After 1010 sequences (0.695456Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -36.2847%
VLOG1 After 2020 sequences (1.62006Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -30.1322%
VLOG1 After 3030 sequences (2.67991Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -25.7895%
VLOG1 After 4040 sequences (3.84653Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -23.9529%
VLOG1 After 5050 sequences (5.10758Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -23.4412%
VLOG1 After 6060 sequences (6.45321Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -22.1382%
VLOG1 After 7070 sequences (7.87719Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -20.9931%
VLOG1 After 8080 sequences (9.37634Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -20.5369%
VLOG1 After 9090 sequences (10.944Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = -20.2233%

Is this normal? What does the negative accuracy mean?
Thank you.

Jinserk

Token Accuracy drops drastically for Tedlium database training

Hi,

In the 14th training iteration, the token accuracy drops drastically:

VLOG1 After 62000 sequences (33.1789Hr): Obj(log[Pzx]) = -8.45854 TokenAcc = 97.4986%
VLOG1 After 63000 sequences (34.1167Hr): Obj(log[Pzx]) = -8.56525 TokenAcc = 97.4942%
VLOG1 After 64000 sequences (35.0689Hr): Obj(log[Pzx]) = -8.71485 TokenAcc = 97.4921%
VLOG1 After 65000 sequences (36.0363Hr): Obj(log[Pzx]) = -8.6e+29 TokenAcc = 19.0128%
VLOG1 After 66000 sequences (37.0192Hr): Obj(log[Pzx]) = -1e+30 TokenAcc = 4.44885%

The only change I made was in train_ctc_parallel_x3.sh, where I increased:
num_sequence=20 valid_num_sequence=40 frame_num_limit=4000000
Is this causing the error?

More nnet1 components

Can I just import the original nnet1 components (conv, maxpool, etc.) into the Eesen src/nnet code? Do they still fit into the framework?

ctc label and token label mismatch

Hi,

I noticed that in the CTC labels, index 0 denotes <blk>, but in the token FST index 0 becomes <eps> and index 1 is <blk>. How do you solve this mismatch when decoding? I mean, when you feed the CTC outputs to TLG.fst, the labels with the same index are different. Or do you pass the character/phone output instead of the label index into TLG.fst?

Thanks!

decode

For CTC, are there differences between uni- and bidirectional models in the decoding process?

Integrating silence phone to phone-ctc

It seems to me that it would be really healthy to add a <SIL> phone to the phone set (like <SPACE>), since ideologically the <blk> label corresponds to "I'm not sure what to output right now" or "the phone is changing". And it seems to me that it is not good practice to force it to do different tasks at the same time.

The problem with this approach is that nobody marks silence. That's why we would need some basic alignments, extract silences from them, and only then train the system. Are you going to develop these types of approaches?

Dropout for LSTM

Hi Yajie,

I've seen that you implemented dropout for the LSTM. According to your code, dropout is applied not only at the training stage but also at the cross-validation and test stages. Is that right?

Thanks

The TIMIT data performs badly. Is this normal?

I tested TIMIT data using Eesen.

I first converted the TIMIT data to the tedlium format,
but for the acoustic model dictionary I didn't use TIMIT's;
instead, in prepare_phn_dict.sh I used cantab-TEDLIUM.dct,
and in tedlium_prepare_data.sh (local/join_suffix.py) I used TEDLIUM.150K.dic as the script's parameter.

train result

Some results are OK, like:

mrjb1_sa2 recognition result is 'ask me to carry an oily rag like'

but most of the results are bad, as follows:

mrhl0_si1274-0000000-0000175 ph
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_si1274-0000000-0000175 is -0.489169 over 173 frames.
mrhl0_si1521-0000000-0000490 how do you think
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_si1521-0000000-0000490 is 0.126054 over 488 frames.
mrhl0_si2151-0000000-0000380 so much
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_si2151-0000000-0000380 is 0.174863 over 378 frames.
mrhl0_sx171-0000000-0000457 how do you think
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_sx171-0000000-0000457 is 0.257338 over 455 frames.
mrhl0_sx261-0000000-0000216 how much
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_sx261-0000000-0000216 is -0.253791 over 214 frames.
mrhl0_sx351-0000000-0000526 how do you think
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_sx351-0000000-0000526 is -0.56767 over 524 frames.
mrhl0_sx441-0000000-0000386 which
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_sx441-0000000-0000386 is -0.52401 over 384 frames.
mrhl0_sx81-0000000-0000304 so much
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrhl0_sx81-0000000-0000304 is 0.0374106 over 302 frames.

I don't know whether these results are OK or not.
Could you please give me some suggestions?

Token Accuracy Drops Obj(log[Pzx])=nan

Hi all,

We have an issue when training on more than 9000 hours of our speech data. We use the train_ctc_parallel.sh recipe with num_seq=10 and frame_num_limit=12500. However, after training on 183 hours of data, the token accuracy drops drastically and Obj(log[Pzx]) = nan.

Does anyone have any pointers to solve this? Thanks

VLOG1 After 201933 sequences (163.131Hr): Obj(log[Pzx]) = -41.6014 TokenAcc = 59.2475%
VLOG1 After 202939 sequences (163.96Hr): Obj(log[Pzx]) = -42.6817 TokenAcc = 58.5093%
VLOG1 After 203946 sequences (164.793Hr): Obj(log[Pzx]) = -42.561 TokenAcc = 59.4519%
VLOG1 After 204951 sequences (165.623Hr): Obj(log[Pzx]) = -43.8504 TokenAcc = 57.6078%
VLOG1 After 205951 sequences (166.449Hr): Obj(log[Pzx]) = -1e+27 TokenAcc = 58.6802%
VLOG1 After 206958 sequences (167.246Hr): Obj(log[Pzx]) = -40.0838 TokenAcc = 59.9717%
VLOG1 After 207965 sequences (168.075Hr): Obj(log[Pzx]) = -41.7963 TokenAcc = 60.6595%
VLOG1 After 208967 sequences (168.907Hr): Obj(log[Pzx]) = -43.781 TokenAcc = 58.9324%
VLOG1 After 209976 sequences (169.669Hr): Obj(log[Pzx]) = -37.7623 TokenAcc = 60.3672%
VLOG1 After 210979 sequences (170.439Hr): Obj(log[Pzx]) = -39.8231 TokenAcc = 58.9923%
VLOG1 After 211979 sequences (171.212Hr): Obj(log[Pzx]) = -1e+27 TokenAcc = 58.5249%
VLOG1 After 212985 sequences (172.007Hr): Obj(log[Pzx]) = -39.8087 TokenAcc = 61.0912%
VLOG1 After 213992 sequences (172.773Hr): Obj(log[Pzx]) = -39.9797 TokenAcc = 59.2739%
VLOG1 After 214996 sequences (173.55Hr): Obj(log[Pzx]) = -9.96016e+26 TokenAcc = 59.5201%
VLOG1 After 216005 sequences (174.351Hr): Obj(log[Pzx]) = -38.7279 TokenAcc = 61.0642%
VLOG1 After 217009 sequences (175.156Hr): Obj(log[Pzx]) = -40.5695 TokenAcc = 60.692%
VLOG1 After 218017 sequences (175.937Hr): Obj(log[Pzx]) = -38.693 TokenAcc = 59.3732%
VLOG1 After 219017 sequences (176.733Hr): Obj(log[Pzx]) = -40.4078 TokenAcc = 60.4888%
VLOG1 After 220017 sequences (177.531Hr): Obj(log[Pzx]) = -40.1273 TokenAcc = 60.5241%
VLOG1 After 221017 sequences (178.363Hr): Obj(log[Pzx]) = -43.3316 TokenAcc = 59.4678%
VLOG1 After 222020 sequences (179.173Hr): Obj(log[Pzx]) = -42.5057 TokenAcc = 59.6424%
VLOG1 After 223023 sequences (180.003Hr): Obj(log[Pzx]) = -40.4493 TokenAcc = 60.753%
VLOG1 After 224025 sequences (180.791Hr): Obj(log[Pzx]) = -40.4711 TokenAcc = 59.7552%
VLOG1 After 225032 sequences (181.603Hr): Obj(log[Pzx]) = -39.235 TokenAcc = 60.5423%
VLOG1 After 226035 sequences (182.43Hr): Obj(log[Pzx]) = -42.8734 TokenAcc = 59.7315%
VLOG1 After 227041 sequences (183.209Hr): Obj(log[Pzx]) = -38.7436 TokenAcc = 60.1154%
VLOG1 After 228042 sequences (183.989Hr): Obj(log[Pzx]) = nan TokenAcc = 36.4706%
VLOG1 After 229048 sequences (184.82Hr): Obj(log[Pzx]) = nan TokenAcc = 2.26617%
VLOG1 After 230053 sequences (185.631Hr): Obj(log[Pzx]) = nan TokenAcc = 2.17376%
VLOG1 After 231054 sequences (186.464Hr): Obj(log[Pzx]) = nan TokenAcc = 2.23086%
VLOG1 After 232055 sequences (187.262Hr): Obj(log[Pzx]) = nan TokenAcc = 2.24668%
VLOG1 After 233062 sequences (188.122Hr): Obj(log[Pzx]) = nan TokenAcc = 2.20169%
VLOG1 After 234063 sequences (188.93Hr): Obj(log[Pzx]) = nan TokenAcc = 2.36458%

decoding without cmvn

Hi Yajie

Thank you very much for your kind reply.

I want to ask another question: currently Eesen performs best with CMVN and a BiLSTM, but in a real scenario it would take too much time to wait for the whole utterance to finish, and CMVN statistics might not be accessible (e.g. the user changes location, or the user changes altogether). What is the best strategy in such a situation?

All the best

Xiaofeng

Convert eesen lattices to SLF lattices

Hi,

How could one convert Eesen lattices to SLF lattices, as is done in Kaldi using lattice-align-words and lattice-to-phone-lattice? Would it be possible?

Thanks!

setting output neurons

We are trying Eesen for OCR.
We have a dataset with labels numbered 1 to 320, but some of the labels are not seen in the training and test sets; only 218 labels are seen.
The question is:
what should the number of output neurons be, 218 or 320?

Output

1. How do I get the output of the network in the decoder, e.g. the softmax probability versus time stamp?
2. How do I get the phone-level output?

Question about eesen training code

Hi,

I have a question from reading the Eesen code,
at lines 66 to 75 of net/ctc_loss.cc.
When back-propagating the errors through the softmax layer, as far as I can tell from the code,
the formula is ctc_error * yk - Row_mul(yk * ColSum(ctc_error * yk)).
But the softmax derivative is yk * (1 - yk).
So as far as I can see, the difference from using the softmax-derivative formula is the ColSum and Row_mul.
Why? Is there something I missed?
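
For reference, a short derivation of that expression (nothing Eesen-specific, just the softmax Jacobian). Writing a_k for the pre-softmax activation, y_k = exp(a_k) / sum_j exp(a_j), and e_k = dE/dy_k for the CTC error with respect to the softmax output, the chain rule over all outputs gives

$$\frac{\partial E}{\partial a_k} \;=\; \sum_j e_j \frac{\partial y_j}{\partial a_k} \;=\; \sum_j e_j\, y_j(\delta_{jk} - y_k) \;=\; e_k y_k \;-\; y_k \sum_j e_j y_j$$

which is exactly the element-wise product ctc_error * yk minus yk times the row-wise sum of ctc_error * yk. The familiar yk * (1 - yk) is only the diagonal (j = k) entry of the Jacobian; it is not enough on its own because the softmax couples all of the outputs.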

Looking forward to a reply!

Cluster runs OK on one node but errors on two or more nodes

Hi, I have a GridEngine cluster with three nodes (node1, node2, node3); each node has 3 GPUs.
node1 is the master node, a submit node, and an execution node.
node2 is a submit node and an execution node.
node3 is a submit node.
node3 also runs an NFS service; all the wav data and txt data is on node3.
node1 and node2 mount node3's data, and they can only read, not write, the mounted data.

node1, node2, and node3 have the same user name (kaldi) and password.

Then I built Eesen separately in the directory /home/kaldi/git/eesen on node1, node2, and node3.
I created uname.sh containing uname -a and submitted it from node3 (the submit-only node) several times;
the job was distributed to node1 and node2 and ran OK.

Then I created a file named touch.sh containing the command 'touch ok.fst' and submitted it from node3 several times;
the job was again distributed to node1 and node2 and ran OK.

Then I went to the timit dir and changed cmd.sh to use queue.pl on all three nodes.
On node3 I ran run_ctc_phn.sh, then ran qstat -j 54 and got the following result:

==============================================================
job_number:                 54
exec_file:                  job_scripts/54
submission_time:            Wed Aug  3 10:04:05 2016
owner:                      kaldi
uid:                        1006
group:                      kaldi
gid:                        1006
sge_o_home:                 /home/kaldi
sge_o_log_name:             kaldi
sge_o_path:                 /home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/kaldi/git/eesen/asr_egs/timit/vc1
sge_o_host:                 cluster000
account:                    sge
cwd:                        /home/kaldi/git/eesen/asr_egs/timit/vc1
merge:                      y
hard resource_list:         arch=*64
mail_list:                  [email protected]
notify:                     FALSE
job_name:                   make_fbank_train.sh
stdout_path_list:           NONE:NONE:exp/make_fbank/train/q/make_fbank_train.log
jobshare:                   0
hard_queue_list:            all.q
shell_list:                 NONE:/bin/bash
env_list:                   PATH=/home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
script_file:                /home/kaldi/git/eesen/asr_egs/timit/vc1/exp/make_fbank/train/q/make_fbank_train.sh
job-array tasks:            1-20:1
error reason    1:          08/03/2016 10:04:20 [1003:10662]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    2:          08/03/2016 10:04:20 [1003:10663]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    3:          08/03/2016 10:04:20 [1003:10674]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    4:          08/03/2016 10:04:20 [1003:10666]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    5:          08/03/2016 10:04:20 [1003:10665]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    6:          08/03/2016 10:04:20 [1003:10669]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    7:          08/03/2016 10:04:20 [1003:10671]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    8:          08/03/2016 10:04:20 [1003:10664]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    9:          08/03/2016 10:04:20 [1003:10667]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   10:          08/03/2016 10:04:20 [1003:10670]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   11:          08/03/2016 10:04:23 [1004:29627]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   12:          08/03/2016 10:04:20 [1003:10679]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   13:          08/03/2016 10:04:23 [1004:29629]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   14:          08/03/2016 10:04:20 [1003:10678]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   15:          08/03/2016 10:04:23 [1004:29630]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   16:          08/03/2016 10:04:20 [1003:10676]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   17:          08/03/2016 10:04:23 [1004:29631]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   18:          08/03/2016 10:04:20 [1003:10680]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   19:          08/03/2016 10:04:23 [1004:29628]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   20:          08/03/2016 10:04:20 [1003:10675]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
scheduling info:            queue instance "[email protected]" dropped because it is disabled
                            Job is in error state

Then I went to node1 and ran run_ctc_phn.sh, and got a similar result; half of the commands ran OK:

error reason    1:          08/03/2016 10:04:20 [1003:10662]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    3:          08/03/2016 10:04:20 [1003:10674]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    5:          08/03/2016 10:04:20 [1003:10665]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    7:          08/03/2016 10:04:20 [1003:10671]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    9:          08/03/2016 10:04:20 [1003:10667]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   11:          08/03/2016 10:04:23 [1004:29627]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   13:          08/03/2016 10:04:23 [1004:29629]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   15:          08/03/2016 10:04:23 [1004:29630]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   17:          08/03/2016 10:04:23 [1004:29631]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   19:          08/03/2016 10:04:23 [1004:29628]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit

Then I disabled node2 and ran run_ctc_phn.sh on node1 (the cluster now has only one execution node, node1), and Eesen ran OK.

What is the problem? Please help me find out where it comes from. Thanks.

Is the TIMIT fbank result OK? And how do I add features such as delta-deltas?

Hi, I tested TIMIT data using Eesen, but the result is not good, as follows:

training process

EPOCH 11 RUNNING ... ENDS [2016-Jun-6 17:02:47]: lrate 4e-05, TRAIN ACCURACY 23.4300%, VALID ACCURACY 17.3147%
EPOCH 12 RUNNING ... ENDS [2016-Jun-6 17:07:02]: lrate 4e-05, TRAIN ACCURACY 25.2924%, VALID ACCURACY 16.1223%
EPOCH 13 RUNNING ... ENDS [2016-Jun-6 17:11:18]: lrate 4e-05, TRAIN ACCURACY 26.1150%, VALID ACCURACY 18.4033%
EPOCH 14 RUNNING ... ENDS [2016-Jun-6 17:15:33]: lrate 4e-05, TRAIN ACCURACY 26.6806%, VALID ACCURACY 19.5179%
EPOCH 15 RUNNING ... ENDS [2016-Jun-6 17:19:51]: lrate 4e-05, TRAIN ACCURACY 27.1350%, VALID ACCURACY 18.6625%
EPOCH 16 RUNNING ... ENDS [2016-Jun-6 17:24:07]: lrate 2e-05, TRAIN ACCURACY 27.4092%, VALID ACCURACY 20.1400%
EPOCH 17 RUNNING ... ENDS [2016-Jun-6 17:28:23]: lrate 1e-05, TRAIN ACCURACY 27.5363%, VALID ACCURACY 20.2177%
finished, too small rel. improvement .0777
Training succeeded. The final model exp/train_phn_l5_c320/final.nnet
Removing features tmpdir exp/train_phn_l5_c320/ptrXL @ pingan-nlp-001
cv.ark  train.ark

testing process

rjb1_sx64-0000000-0000248 out-moded 
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrjb1_sx64-0000000-0000248 is 0.454562 over 246 frames.
mrjh0_sa1-0000000-0000385 she had your dark suit in greasy wash water all 
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrjh0_sa1-0000000-0000385 is 0.577131 over 383 frames.
mrjh0_sa2-0000000-0000317 how ask me to carry an oily rag like 
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrjh0_sa2-0000000-0000317 is 0.483511 over 315 frames.
mrjh0_si1145-0000000-0000487 how unauthentic 
LOG (latgen-faster:RebuildRepository():determinize-lattice-pruned.cc:294) Rebuilding repository.
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrjh0_si1145-0000000-0000487 is 0.258022 over 485 frames.
mrjh0_si1775-0000000-0000306 how unauthentic 
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrjh0_si1775-0000000-0000306 is 0.384129 over 304 frames.
mrjh0_si515-0000000-0000296 out-moded 
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mrjh0_si515-0000000-0000296 is 0.429838 over 294 frames.
mrjh0_sx155-0000000-0000394 how unauthentic

I checked the ark files of the TIMIT and tedlium data and found some differences, but I do not know where they come from.
The tedlium ark file looks like this:

AlGore_2009  [
  510340.6 586395.1 608272.1 621239.9 642546.4 653072.2 651401.9 651305.8 653922.6 659371.4 654681.1 652654.5 646230.6 645681.9 650887.6 655483.5 666377.6 671666.1 672115.6 669366.7 669373.2 681050.7 703447.4 715073.2 709013.8 702928.3 713154.4 718430.6 711170 688705.3 658752.9 641324.2 630078.5 628411.7 623944.6 627934.9 639849.6 641777.4 643522.4 627100.5 39020
  6946354 9087419 9763794 1.018412e+07 1.091917e+07 1.127568e+07 1.123372e+07 1.124698e+07 1.134869e+07 1.154412e+07 1.137156e+07 1.1279e+07 1.104712e+07 1.103819e+07 1.121266e+07 1.137482e+07 1.174376e+07 1.193318e+07 1.195034e+07 1.184173e+07 1.18378e+07 1.224044e+07 1.303755e+07 1.345864e+07 1.321482e+07 1.298168e+07 1.337751e+07 1.358918e+07 1.332344e+07 1.25133e+07 1.146994e+07 1.091216e+07 1.056395e+07 1.05361e+07 1.041939e+07 1.053006e+07 1.088328e+07 1.093435e+07 1.097803e+07 1.044419e+07 0 ]

And the TIMIT one looks like this:

fadg0_sa1  [
  3077.437 3576.837 3893.808 4497.17 4646.433 4888.595 5084.933 5245.375 5266.312 5316.513 5304.906 5279.905 5159.947 5092.513 5093.656 5096.891 5198.106 5342.096 5525.816 5622.102 5590.077 5587.714 5621.955 5658.111 5640.733 5684.978 5922.412 6028.531 5843.909 5494.285 5123.665 4873.254 4768.456 4619.075 4454.212 4446.68 4533.783 4809.863 5073.438 5097.519 372
  28369.65 38061.98 44509.9 59787.96 63547.87 70383.9 75846.95 80695.33 82071.57 83632.43 82730.72 81498.48 78174.86 76341.12 75977.55 75682.39 78059.57 82118.61 87383.28 90191.3 89340.34 89230.34 90614.35 91722.68 90768.06 91814.1 99787.2 103876.6 97762.07 85880.71 74550.43 67565.18 64682.43 60528.35 56254.4 56227.67 58352.52 65413.38 72421.16 72856.04 0 ]

But the script is the same as tedlium's (I just modified the existing code); the diff looks like this:

91c91
<      || exit 208;

---
>      || exit 1;
106c106
<      || exit 209;

---
>      || exit 1;

And the script is like this:

#!/bin/bash 

# Copyright 2012  Karel Vesely  Johns Hopkins University (Author: Daniel Povey)
# Apache 2.0
# To be run from .. (one directory up from here)
# see ../run.sh for example

# Begin configuration section.
nj=4
cmd=run.pl
fbank_config=conf/fbank.conf
compress=true
# End configuration section.

echo "$0 $@"  # Print the command line for logging

if [ -f path.sh ]; then . ./path.sh; fi
. parse_options.sh || exit 1;

if [ $# != 3 ]; then
   echo "usage: make_fbank.sh [options] <data-dir> <log-dir> <path-to-fbankdir>";
   echo "options: "
   echo "  --fbank-config <config-file>                      # config passed to compute-fbank-feats "
   echo "  --nj <nj>                                        # number of parallel jobs"
   echo "  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs."
   exit 1;
fi

data=$1
logdir=$2
fbankdir=$3


# make $fbankdir an absolute pathname.
fbankdir=`perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = "$pwd/$dir"; } print $dir; ' $fbankdir ${PWD}`

# use "name" as part of name of the archive.
name=`basename $data`

mkdir -p $fbankdir || exit 1;
mkdir -p $logdir || exit 1;

if [ -f $data/feats.scp ]; then
  mkdir -p $data/.backup
  echo "$0: moving $data/feats.scp to $data/.backup"
  mv $data/feats.scp $data/.backup
fi

scp=$data/wav.scp

required="$scp $fbank_config"

for f in $required; do
  if [ ! -f $f ]; then
    echo "make_fbank.sh: no such file $f"
    exit 1;
  fi
done

utils/validate_data_dir.sh --no-text --no-feats $data || exit 1;

if [ -f $data/spk2warp ]; then
  echo "$0 [info]: using VTLN warp factors from $data/spk2warp"
  vtln_opts="--vtln-map=ark:$data/spk2warp --utt2spk=ark:$data/utt2spk"
elif [ -f $data/utt2warp ]; then
  echo "$0 [info]: using VTLN warp factors from $data/utt2warp"
  vtln_opts="--vtln-map=ark:$data/utt2warp"
fi

for n in $(seq $nj); do
  # the next command does nothing unless $fbankdir/storage/ exists, see
  # utils/create_data_link.pl for more info.
  utils/create_data_link.pl $fbankdir/raw_fbank_$name.$n.ark  
done

if [ -f $data/segments ]; then
  echo "$0 [info]: segments file exists: using that."
  split_segments=""
  for n in $(seq $nj); do
    split_segments="$split_segments $logdir/segments.$n"
  done

  utils/split_scp.pl $data/segments $split_segments || exit 1;
  rm $logdir/.error 2>/dev/null

  $cmd JOB=1:$nj $logdir/make_fbank_${name}.JOB.log \
    extract-segments scp,p:$scp $logdir/segments.JOB ark:- \| \
    compute-fbank-feats $vtln_opts --verbose=2 --config=$fbank_config ark:- ark:- \| \
    copy-feats --compress=$compress ark:- \
     ark,scp:$fbankdir/raw_fbank_$name.JOB.ark,$fbankdir/raw_fbank_$name.JOB.scp \
     || exit 208;

else
  echo "$0: [info]: no segments file exists: assuming wav.scp indexed by utterance."
  split_scps=""
  for n in $(seq $nj); do
    split_scps="$split_scps $logdir/wav.$n.scp"
  done

  utils/split_scp.pl $scp $split_scps || exit 1;

  $cmd JOB=1:$nj $logdir/make_fbank_${name}.JOB.log \
    compute-fbank-feats $vtln_opts --verbose=2 --config=$fbank_config scp,p:$logdir/wav.JOB.scp ark:- \| \
    copy-feats --compress=$compress ark:- \
     ark,scp:$fbankdir/raw_fbank_$name.JOB.ark,$fbankdir/raw_fbank_$name.JOB.scp \
     || exit 209;

fi


if [ -f $logdir/.error.$name ]; then
  echo "Error producing fbank features for $name:"
  tail $logdir/make_fbank_${name}.1.log
  exit 1;
fi

# concatenate the .scp files together.
for n in $(seq $nj); do
  cat $fbankdir/raw_fbank_$name.$n.scp || exit 1;
done > $data/feats.scp

rm $logdir/wav.*.scp  $logdir/segments.* 2>/dev/null

nf=`cat $data/feats.scp | wc -l` 
nu=`cat $data/utt2spk | wc -l` 
if [ $nf -ne $nu ]; then
  echo "It seems not all of the feature files were successfully ($nf != $nu);"
  echo "consider using utils/fix_data_dir.sh $data"
fi

echo "Succeeded creating filterbank features for $name"

Is there something wrong?
And what do "out-moded" and "journalese" mean?

LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mbns0_sx340-0000000-0000242 is 0.466467 over 240 frames.
mbns0_sx430-0000000-0000343 out-moded 
LOG (latgen-faster:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:111) Log-like per frame for utterance mbns0_sx430-0000000-0000343 is 0.430763 over 341 frames.
mbns0_sx70-0000000-0000119 journalese 
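
On the delta-delta part of the title: in the training pipelines quoted elsewhere on this page, deltas are not written into the fbank archives at all; they are appended on the fly when the features are read, e.g. (rspecifier taken from one of the training logs above, shown only as an illustration):

'ark,s,cs:copy-feats scp:exp/train_phn_l4_c320/train_local.scp ark:- | add-deltas ark:- ark:- |'

add-deltas produces first- and second-order (delta and delta-delta) coefficients by default, so the stored raw_fbank_*.ark files stay plain filterbanks and make_fbank.sh itself should not need to change.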

Training TIMIT data following the tedlium v1 recipe gives a "mtrc0_si479.wav: Permission denied" error

Hi

I want to test Eesen on TIMIT data, but I got some errors when extracting fbank features.
The errors look like this:

copy-feats --compress=true ark:- ark,scp:/home/zhangjl/git/asr/eesen/asr_egs/timit/v1/fbank/raw_fbank_train.20.ark,/home/zhangjl/git/asr/eesen/asr_egs/timit/v1/fbank/raw_fbank_train.20.scp
extract-segments scp,p:data/train/wav.scp exp/make_fbank/train/segments.20 ark:-
compute-fbank-feats --verbose=2 --config=conf/fbank.conf ark:- ark:-
sh: /home/zhangjl/dataCenter/asr/timitTedFormat/train/wav/mtrc0_si479.wav: Permission denied
ERROR (extract-segments:Read4ByteTag():wave-reader.cc:74) WaveData: expected 4-byte chunk-name, got read errror
WARNING (extract-segments:Read():feat/wave-reader.h:149) Exception caught in WaveHolder object (reading).
WARNING (extract-segments:HasKeyInternal():util/kaldi-table-inl.h:1370) RandomAccessTableReader: error reading object from stream '/home/zhangjl/dataCenter/asr/timitTedFormat/train/wav/mtrc0_si479.wav |'
WARNING (extract-segments:main():extract-segments.cc:125) Could not find recording mtrc0_si479, skipping segment mtrc0_si479-0000000-0000436
WARNING (extract-segments:Close():kaldi-io.cc:446) Pipe /home/zhangjl/dataCenter/asr/timitTedFormat/train/wav/mtrc0_si479.wav | had nonzero return status 32256
sh: /home/zhangjl/dataCenter/asr/timitTedFormat/train/wav/mtrc0_sx119.wav: Permission denied
ERROR (extract-segments:Read4ByteTag():wave-reader.cc:74) WaveData: expected 4-byte chunk-name, got read errror

I processed the data as follows.

I converted the TIMIT audio to wav format as follows.

Running the command:

file mtrc0_si479.wav

and I got the result:

si2005.wav: NIST SPHERE file

So I used sox to convert the SPHERE format to wav format; the result is:

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz

I transformed the TIMIT data into the tedlium data format; the only difference is the wav.scp file.

The tedlium wav.scp looks like:

MichaelSpecter_2010 /home/zhangjl/git/asr/eesen/asr_egs/tedlium/v1/../../../tools/sph2pipe_v2.5/sph2pipe -f wav -p /home/zhangjl/git/asr/eesen/asr_egs/tedlium/v1/db/TEDLIUM_release1/test/sph/MichaelSpecter_2010.sph |

My wav.scp looks like:

faem0_si1823 /home/zhangjl/dataCenter/asr/timitTedFormat/test/wav/faem0_si1823.wav |
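
A note on the two entry styles (this is standard Kaldi/Eesen wav.scp behaviour, and it matches the "Permission denied" lines in the log above): an entry that ends with a pipe character is treated as a shell command whose standard output is read as the audio, which is why the tedlium entry is an sph2pipe command. A plain wav file is normally listed without the trailing pipe, e.g.:

faem0_si1823 /home/zhangjl/dataCenter/asr/timitTedFormat/test/wav/faem0_si1823.wav

With the trailing pipe kept, the shell tries to execute the .wav file itself, which is exactly what the "sh: ... Permission denied" messages show.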

Looking forward to your suggestions. Thanks.

Support for multi-GPU

I know that multi-GPU training is not supported yet. I'm interested in knowing whether supporting it would be straightforward or not, so that we know whether it's worth it for us to work on it.
We're interested in using Eesen on more than 1000 hours of audio, so obviously that would be helpful (the 110-hour TEDLIUM dataset takes about 3 days with a single K40).
Thanks!

How to prevent the "Epsilon loops exist in your decoding" error

Dear Yajie

I created an artificial ARPA language model (our use case has no real corpus, so I used some patterns to generate the possible n-grams; and because we don't assume anything, we set most of the n-grams to the same probability and backoff weights).

With the FST generated from this artificial ARPA language model, some of the utterances failed to decode with:
{{{
"Epsilon loops exist in your decoding" which comes from:KALDI_ASSERT(loop_count < max_loop && "Epsilon loops exist in your decoding "
"graph (this is not allowed!)");
}}}
in lattice-faster-decoder.cc

I tried enlarging max_loop to 10 million and still get this error.

Could you please give me some hints as to what could be causing this error?

Best

Running BLSTM without CTC

We found the script train-ce in the Eesen netbin. Is there a recipe file using this script to run only a BLSTM?

about <eps>:<eps>

Hi Yajie

I made a toy example to study the TLG.fst following the link

I end up with the following TLG graph.
TLG.pdf

From the pdf I noticed an interesting thing: to go from state 6 to state 14 you have to pass through state 10, but the arc from 6 to 10 and the arc from 10 to 14 are both "eps:eps". I think "eps:eps" means there is no input and no output. The reason Eesen has to insert the intermediate state 10 there is that T.fst
T.pdf

is built following the rule that there has to be an "eps:eps" arc at both the beginning and the end of each phone.

I think there won't be much speed loss from inserting this intermediate state; I just wonder what would happen if we deleted it.
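
One way to experiment with this, sketched with the standard OpenFst command-line tools that Eesen builds under tools/openfst (not something the recipes do by default, just a way to poke at the graph; T_noeps.fst is a made-up name):

# remove epsilon-only transitions from the token FST and compare graph sizes
fstrmepsilon T.fst T_noeps.fst
fstinfo T.fst | grep -E 'of (states|arcs)'
fstinfo T_noeps.fst | grep -E 'of (states|arcs)'

To see the effect on decoding you would then have to rebuild TLG.fst from the modified T.fst and compare the resulting lattices.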

Thank you very much!

Best

CTC backprop gradient efficiency

Hi Yajie,
I notice that in the CTC code, the gradient of the output is calculated and then multiplied with the network output to get the backprop error signal. Referring to Alex's thesis, it seems that you implemented Eq. (7.31) first and then Eq. (7.32). Is it possible to implement Eq. (7.34) directly, so that there is one matrix multiplication less?

Consistent GPU crashes in nnet training

I'm having issues trying to get train_ctc_parallel.sh to run successfully (I'm using the one from the wsj example, but I'm running it on a different data set, Fisher English). I've tried three times, each time reducing --num-sequence and --frame-limit, but I still continue to get out-of-memory errors. It looks like it dies either right on the last utterance set (I have 871960 utterances in my training set) or as it's trying to start training the second epoch...

CUDA version is 7.0, GPU is NVIDIA GRID K520 with 4 GB

Snippets from the latest tr.iter1.log follow.

train-ctc-parallel --report-step=1000 --num-sequence=5 --frame-limit=12500 --learn-rate=0.00004 --momentum=0.9 --verbose=1 'ark,s,cs:copy-feats scp:exp/train_char_l4_c320/train_local.scp ark:- | add-deltas ark:- ark:- |' 'ark:gunzip -c exp/train_char_l4_c320/labels.tr.gz|' exp/train_char_l4_c320/nnet/nnet.iter0 exp/train_char_l4_c320/nnet/nnet.iter1
LOG (train-ctc-parallel:SelectGpuIdAuto():cuda-device.cc:262) Selecting from 1 GPUs
LOG (train-ctc-parallel:SelectGpuIdAuto():cuda-device.cc:277) cudaSetDevice(0): GRID K520       free:4044M, used:51M, total:4095M, free/total:0.987462
LOG (train-ctc-parallel:SelectGpuIdAuto():cuda-device.cc:310) Selected device: 0 (automatically)
LOG (train-ctc-parallel:FinalizeActiveGpu():cuda-device.cc:194) The active GPU is [0]: GRID K520        free:4031M, used:64M, total:4095M, free/total:0.984288 version 3.0
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:334) Memory used: 0 bytes.
LOG (train-ctc-parallel:DisableCaching():cuda-device.cc:731) Disabling caching of GPU memory.
copy-feats scp:exp/train_char_l4_c320/train_local.scp ark:-
add-deltas ark:- ark:-
LOG (train-ctc-parallel:main():train-ctc-parallel.cc:112) TRAINING STARTED
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 1005 sequences (0.0697528Hr): Obj(log[Pzx]) = -22.6206   TokenAcc = -0.88588%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 2010 sequences (0.148333Hr): Obj(log[Pzx]) = -13.5552   TokenAcc = 0%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 3015 sequences (0.233011Hr): Obj(log[Pzx]) = -13.292   TokenAcc = 0%

---- Lots of log lines skipped ----

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 871630 sequences (899.055Hr): Obj(log[Pzx]) = -214.826   TokenAcc = 73.8465%
LOG (copy-feats:main():copy-feats.cc:100) Copied 871960 feature matrices.
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:658) Allocation of 202970 rows, each of size 8960 bytes failed,  releasing cached memory and retrying.
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:665) Allocation failed for the second time.    Printing device memory usage and exiting
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:334) Memory used: 3582853120 bytes.
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:668) Memory allocation failure
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:668) Memory allocation failure

[stack trace: ]
eesen::KaldiGetStackTrace()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::CuAllocator::MallocInternal(unsigned long, unsigned long, unsigned long*)
eesen::CuAllocator::MallocPitch(unsigned long, unsigned long, unsigned long*)
eesen::CuDevice::MallocPitch(unsigned long, unsigned long, unsigned long*)
.
.
.
eesen::Layer::Propagate(eesen::CuMatrixBase<float> const&, eesen::CuMatrix<float>*)
eesen::Net::Propagate(eesen::CuMatrixBase<float> const&, eesen::CuMatrix<float>*)
train-ctc-parallel(main+0xe56) [0x492eaf]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f4974463af5]
train-ctc-parallel() [0x490fa9]

I've also tried running it on parallel GPUs (using train_ctc_parallel_h.sh) and I get a very similar error regardless of settings; the latest example was:

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 183915 sequences (184.251Hr): Obj(log[Pzx]) = -72.0027   TokenAcc = 58.031%
LOG (train-ctc-parallel:comm_avg_weights():net/communicator.h:98) Reading averaged model from exp/train_char_l4_c320/nnet/nnet.iter1.job1.262
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:742) Allocation of 101485 rows, each of size 8960 bytes failed,  releasing cached memory and retrying.
WARNING (train-ctc-parallel:MallocInternal():cuda-device.cc:749) Allocation failed for the second time.    Printing device memory usage and exiting
LOG (train-ctc-parallel:PrintMemoryUsage():cuda-device.cc:418) Memory used: 3861626880 bytes.
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:752) Memory allocation failure
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe apply-cmvn --norm-vars=true --utt2spk=ark:data/train_tr95/utt2spk scp:data/train_tr95/cmvn.scp scp:exp/train_char_l4_c320/feats_tr.3.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:MallocInternal():cuda-device.cc:752) Memory allocation failure

[stack trace: ]
eesen::KaldiGetStackTrace()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::CuAllocator::MallocInternal(unsigned long, unsigned long, unsigned long*)
eesen::CuAllocator::MallocPitch(unsigned long, unsigned long, unsigned long*)
eesen::CuDevice::MallocPitch(unsigned long, unsigned long, unsigned long*)
.
.
.
eesen::Layer::Propagate(eesen::CuMatrixBase<float> const&, eesen::CuMatrix<float>*)
eesen::Net::Propagate(eesen::CuMatrixBase<float> const&, eesen::CuMatrix<float>*)
train-ctc-parallel(main+0xe5e) [0x492f37]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff010f85af5]
train-ctc-parallel() [0x491029]

Any thoughts or suggestions?

performance

Hi,

On swbd, for a Bi-LSTM, CTC (using EESEN) and CE can reach similar performance.
But for a Uni-LSTM, compared to CE, CTC gets terrible performance. In RESULTS,
CTC phonemes on the complete set (with 5 BiLSTM layers) can reach a 15.0%
WER on swbd. I want to know the WER in your experiments (with Uni-LSTM layers).

Thanks.

nan Obj value

Hi guys,
I'm trying CTC on a big dataset of more than 2000 hours, using steps/train_ctc_parallel_h.sh --nj 3.
job 1

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 280196 sequences (343.467Hr): Obj(log[Pzx]) = -34.12   TokenAcc = 65.5557%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290214 sequences (356.577Hr): Obj(log[Pzx]) = -35.0829   TokenAcc = 66.0633%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300228 sequences (369.041Hr): Obj(log[Pzx]) = -35.5174   TokenAcc = 64.8673%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310238 sequences (382.234Hr): Obj(log[Pzx]) = -35.9657   TokenAcc = 65.8774%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320254 sequences (394.763Hr): Obj(log[Pzx]) = -33.7356   TokenAcc = 66.6929%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330265 sequences (407.316Hr): Obj(log[Pzx]) = -32.8957   TokenAcc = 67.0366%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340269 sequences (420.524Hr): Obj(log[Pzx]) = -36.0733   TokenAcc = 66.5803%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350272 sequences (433.553Hr): Obj(log[Pzx]) = -3.06908e+29   TokenAcc = 13.6926%

job 2

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290240 sequences (361.23Hr): Obj(log[Pzx]) = -34.8453   TokenAcc = 65.5767%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300251 sequences (374.192Hr): Obj(log[Pzx]) = -34.6744   TokenAcc = 65.9575%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310255 sequences (386.489Hr): Obj(log[Pzx]) = -32.8376   TokenAcc = 66.4874%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320258 sequences (399.27Hr): Obj(log[Pzx]) = -34.0779   TokenAcc = 66.6884%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330274 sequences (411.385Hr): Obj(log[Pzx]) = -31.8291   TokenAcc = 67.3521%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340285 sequences (423.992Hr): Obj(log[Pzx]) = -32.077   TokenAcc = 67.5502%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350304 sequences (437.536Hr): Obj(log[Pzx]) = -3.17996e+29   TokenAcc = 18.6188%

job 3

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 290221 sequences (367.77Hr): Obj(log[Pzx]) = -33.8591   TokenAcc = 65.0273%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 300232 sequences (380.856Hr): Obj(log[Pzx]) = -36.8064   TokenAcc = 65.3367%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 310232 sequences (393.478Hr): Obj(log[Pzx]) = -33.5105   TokenAcc = 65.6533%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 320235 sequences (406.032Hr): Obj(log[Pzx]) = -9.997e+25   TokenAcc = 66.9717%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 330242 sequences (418.355Hr): Obj(log[Pzx]) = -32.2745   TokenAcc = 66.3963%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 340247 sequences (433.924Hr): Obj(log[Pzx]) = -41.3272   TokenAcc = 67.6148%
VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350257 sequences (448.286Hr): Obj(log[Pzx]) = nan   TokenAcc = 29.8061%

job2 crash

VLOG[1] (train-ctc-parallel:EvalParallel():ctc-loss.cc:182) After 350304 sequences (437.536Hr): Obj(log[Pzx]) = -3.17996e+29   TokenAcc = 18.6188%
LOG (train-ctc-parallel:comm_avg_weights():net/communicator.h:106) Waiting for averaged model at exp/train_phn_l3_c320/nnet/nnet.iter1.avg500
ERROR (train-ctc-parallel:Check():net.cc:397) 'nan' in network parameters
WARNING (train-ctc-parallel:Close():kaldi-io.cc:446) Pipe apply-cmvn --norm-vars=true --utt2spk=ark:data_fbank/train_nodup_tr/utt2spk scp:data_fbank/train_nodup_tr/cmvn.scp scp:exp/train_phn_l3_c320/feats_tr.2.scp ark:- | add-deltas ark:- ark:- | had nonzero return status 36096
ERROR (train-ctc-parallel:Check():net.cc:397) 'nan' in network parameters

[stack trace: ]
eesen::KaldiGetStackTrace()
eesen::KaldiErrorMessage::~KaldiErrorMessage()
eesen::Net::Check() const
eesen::Net::Write(std::ostream&, bool) const
eesen::Net::Write(std::string const&, bool) const
comm_avg_weights(eesen::Net&, int const&, int const&, int const&, std::string const&, std::string const&)
train-ctc-parallel(main+0x1223) [0x48b1e7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f777206eec5]
train-ctc-parallel() [0x488aa9]

I'll spend some time on this.

setting the parameters

What would be the optimal values of --beam, --lattice_beam, --max-active, and --acwt for decoding a digit corpus with only 10 words?

Is there any rule of thumb, i.e. vocabulary size vs. these parameters?
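
For context, these options end up on the latgen-faster command line, and the acoustic/language-model weight is additionally swept at the lattice re-scoring stage (see the acoustic-weight question further down). The values below are only illustrative placeholders in the usual Kaldi-style argument order, not a recommendation for a 10-word digit task; check latgen-faster --help in your build for the exact usage:

latgen-faster --max-active=7000 --beam=17.0 --lattice-beam=8.0 --acoustic-scale=0.6 \
  TLG.fst '<network-posteriors-rspecifier>' 'ark:|gzip -c > lat.1.gz'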


acoustic weight in latgen-faster

Hi Yajie

I noticed that the final results of Eesen are produced by searching the acoustic weight from 5 to 10 in lattice-1best. Could you give a little explanation of why acoustic-scale in latgen-faster is always set to 0.6? Why does it not need to be optimized?

thank you very much

Best
