opensource-spraakherkenning-nl / kaldi_nl

Code related to the Dutch instance and user groups of the KALDI speech recognition toolkit

Home Page: http://www.opensource-spraakherkenning.nl

License: Apache License 2.0

Shell 74.96% Perl 12.93% Python 8.02% Dockerfile 4.09%
kaldi speech-recognition speech-recognition-model dutch

kaldi_nl's Introduction

Kaldi NL

Introduction

These scripts may be used to convert speech contained in audio files into text using the Kaldi open-source speech recognition system.

Installation

The software is run from the directory into which you cloned this repository. Kaldi NL depends on a working installation of Kaldi (http://kaldi-asr.org/), Java, Perl, Python 3, and SoX (http://sox.sourceforge.net/). The scripts have been tested on Ubuntu Linux 16.10, 18.04 LTS, and 20.04 LTS, but are expected to run on other Linux distributions as well. A container image is provided that contains all of the dependencies and models.

Before running the decoder for the first time, or whenever you need to change its configuration, run configure.sh. It will ask for the location of your Kaldi installation and for the location in which to put the models.

A decode.sh script is dynamically generated based on the selected models, as are the decoding graphs needed for Kaldi. This last step may take a while (but on a 16GB machine usually no more than an hour or so).

It is also possible to install a completely pre-made decoder with models as supplied by certain partners. In that case, specify one or more of the following models as a parameter to configure.sh:

  • utwente - Starter Pack - The Dutch models and decoder graphs originally provided with Kaldi_NL
  • radboud_OH - Oral History - Dutch models and decoder graphs trained on oral history interviews
  • radboud_PR - Parliamentary Talks - Dutch models and decoder graphs trained on parliamentary talks
  • radboud_GN - Daily Conversation - Dutch models and decoder graphs trained on daily conversations

A decode script is supplied for each of these, respectively named decode_OH.sh, decode_PR.sh, and decode_GN.sh.

The use of these scripts under macOS is not supported, but we have been able to make them work. As with the default Kaldi recipes, most issues stem from the use of the standard GNU tools, so use gcp, gawk, gsed, gtime, gfile, etc. in place of cp, awk, sed, time, and file. If you encounter any other issues with these scripts on macOS, please let us know, especially if you've been able to fix them :-)
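Rather than remembering which tool to prefix, a small wrapper can pick the GNU variant when it exists. This is only a sketch, not part of Kaldi_NL; the gnu helper name is our own invention:

```shell
# gnu TOOL [ARGS...]: run the GNU version of TOOL (gsed, gawk, ...) when it
# exists (e.g. installed via Homebrew on macOS), else the system version.
gnu() {
  tool=$1; shift
  if command -v "g$tool" >/dev/null 2>&1; then
    "g$tool" "$@"
  else
    "$tool" "$@"
  fi
}

gnu awk 'BEGIN { print "ok" }'   # prints "ok" on Linux and macOS alike
```

For example, `gnu sed -i 's/foo/bar/' file` runs gsed on macOS (with Homebrew's gnu-sed installed) and plain sed on Linux.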

Container with Web Interface

For end-users and hosting partners, we provide a web interface offering easy access to Automatic Speech Recognition for Dutch (asr_nl), containing all the models from Radboud University. A container image is available to deploy this webservice locally; please see Automatic Speech Recognition for Dutch for the webservice source and further instructions.

Container without Web Interface

A prebuilt image containing all the models (but not the webservice) can be pulled from the Docker Hub registry as follows:

$ docker pull proycon/kaldi_nl

You can also build the container image yourself using a tool like docker build, which is the recommended option if you are deploying this in your own infrastructure. To build a container image with the specified models included in the image:

$ docker build -t proycon/kaldi -f kaldi.Dockerfile .
$ docker build -t proycon/kaldi_nl --build-arg MODELS="utwente radboud_OH radboud_PR radboud_GN" .

If you want the models downloaded at run-time onto an external data volume, rather than at build time into the image, then add --build-arg MODELPATH=/models and later at runtime set -v /path/to/your/models:/models.

Run the container as follows; you may want to replace decode_OH.sh with another decode script corresponding to your desired model.

$ docker run -t -i -v /your/data/path:/data proycon/kaldi_nl

The decode.sh command (or rather one of its variants) from the next section can be appended directly to the docker run line, e.g.:

$ docker run -t -i -v /your/data/path:/data proycon/kaldi_nl ./decode_OH.sh /data/yourinput.wav /data

Usage

The decode script is called with:

./decode.sh [options] <speech-dir>|<speech-file>|<txt-file containing list of source material> <output-dir>

If you want to use one of the pre-built models, use decode_OH.sh or any of the other options instead of the generic decode.sh.

All parameters before the last one are automatically interpreted as one of the three types listed above. After the process is done, the main results are produced in <output-dir>/1Best.ctm. This file contains a list of all words that were recognised in the audio, with one word per line. The lines follow the standard .ctm format:

<source file> 1 <start time> <duration> <word hypothesis> <posterior probability>
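For a quick plain-text transcript, the word column can be extracted with standard tools; a minimal sketch, assuming the field layout above and an output directory named output:

```shell
# Join the word hypotheses (field 5) of 1Best.ctm into one line of text.
awk '{print $5}' output/1Best.ctm | paste -sd' ' -
```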

In addition, some simple text files may be generated, as well as performance metrics in case the source material contains a suitable reference transcription in .stm format. There's also the option of using a .uem file in order to provide a pre-segmentation or to limit the amount of audio to transcribe.
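For reference, a .uem file is plain text with one region per line: recording name, channel, and start/end times in seconds (assuming the NIST-style UEM layout; the recording name below is illustrative). A hypothetical line limiting decoding to the first two minutes of a recording:

```
yourinput 1 0.00 120.00
```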

As part of the transcription process, the LIUM speech diarization toolkit is utilized. This produces a directory <output-dir>/liumlog, which contains .seg files that provide information about the speaker diarization. For more information on the content of these files, please visit http://www-lium.univ-lemans.fr/diarization/.

Details

Due to the nature of decoding with Kaldi, its use of FSTs, and the size of the models in the starterpack, a machine with less than 8GB of memory will probably not be able to compile the graphs or provide very useful ASR performance. In any case, make sure the number of jobs does not crush your machine (use the --nj parameter). Also be advised that building the docker image requires at least 60GiB of available disk space.

In the starterpack of Dutch models, the best current performance can be expected when using:

  • AM: NL/UTwente/HMI/AM/CGN_all/nnet3_online/tdnn
    • (slightly better, but much slower: NL/UTwente/HMI/AM/CGN_all/nnet3/tdnn_lstm)
  • LM: v1.0/KrantenTT.3gpr.kn.int.arpa.gz
  • Rescore LM: NL/UTwente/HMI/LM/KrantenTT & v1.0/KrantenTT.4gpr.kn.int.arpa.gz

Contribute your own models!

Please see the contribution guidelines and contribute your own models and decoding pipelines!

Licensing

Kaldi-NL is licensed under the Apache 2.0 license. This concerns only the scripts directly included in this repository, and only where not explicitly noted otherwise. Note that the various models that can be obtained through Kaldi-NL are not covered by this license by default and may often be licensed differently.

The models for Dutch (asr_nl) that are installable through this Kaldi_NL distribution are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike license (4.0).

kaldi_nl's People

Contributors

axelvugts, laurensw75, proycon, roelandordelman


kaldi_nl's Issues

Prepare for a release

There's only been one pre-release, long ago; we need to move to a proper semantically versioned release procedure. I'm starting preparations for a first release.

Issues to be solved for the first one:

  • Improve logging and error handling #17

Model size too big for Android

We tried to use the Dutch model with the Vosk demo for Android, which is based on Kaldi, and everything worked smoothly except for the Dutch language.
I asked the owner of the Vosk demo what could be the reason for the failure.
He told me the model size is too big for Android to process fast enough.
Is there any way to reduce the size of the Dutch model?

No speech found, exiting.

Hi,

I've successfully (I think) built the model and generated decode.sh using configure.sh. However, upon running decode.sh on a speech folder in CGN, I got an error like this:

    ./decode.sh /home/jerome/Documents/20151207_CGN_2_0_3/CGN_2.0.3/data/audio/wav/comp-o/nl output
    Argument /home/jerome/Documents/20151207_CGN_2_0_3/CGN_2.0.3/data/audio/wav/comp-o/nl is a directory, copying contents.. done
    Diarization completed in 0:00:11 (CPU: 0:00:09), Memory used: 6 MB
    Split 561 source files into 0 segment
    cat: output/intermediate/data/ALL/spk2utt: No such file or directory
    No speech found, exiting.
Any idea what went wrong?

Also, if I want to see the WER of the decoding, how can I pass in the ground truth speech transcription, and in what format?

Thanks!

missing instruction for installation

This instruction was somehow removed from the readme, while it is still essential information for installation. Maybe put it back in?

Before running the decoder for the first time, make sure to set KALDI_ROOT to the proper value in path.sh.
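Concretely, that amounts to a single line in path.sh; the path below is a placeholder, point it at your own Kaldi checkout:

```shell
# in path.sh -- /opt/kaldi is an example location, not a Kaldi_NL default
export KALDI_ROOT=/opt/kaldi
```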

Update the vocabulary / Lexicon

Hi there,

First of all, thanks for the amazing open sourced project.
I would like to update the vocabulary or Lexicon with domain specific terms and people's names. Do you perhaps have any help on how to do this?

I look forward to your reply!

Kind regards,

Decode script does not generate 1Best.ctm

Hi, thank you for open sourcing this project.

I am trying to run the nnet3_online tdnn model. On running the decode.sh script, I get the following output

Argument filename.wav is a sound file, using it as audio
Diarization completed in 0:00:47 (CPU: 0:00:40), Memory used: 117 MB                
Split 1 source file into 103 segments                              
Duration of speech: 0h:10m:23s
NNet3 decoding completed in 0:00:00 (CPU: 0:00:00), Memory used: 4 MB                
Rescoring completed in 0:00:00 (CPU: 0:00:00), Memory used: 0 MB                
Rescoring completed in 0:00:00 (CPU: 0:00:00), Memory used: 2 MB                
Done

However, the output directory does not contain the 1Best.ctm results file. Any help is appreciated.

XML conversion only implemented for single wav files, skipping...

It seems that when a single file over a certain size is provided, the file is split and the following error/warning message is printed, even though there is only a single input file.

==========================
Conversion to XML
==========================
XML conversion only implemented for single wav files, skipping...

Weird problem on Ubuntu 18.04 in file local/flist2scp.sh

Running decode.sh on a Fedora installation, there is no problem.

Running decode.sh on an Ubuntu 18.04 installation gives the following error:

    local/flist2scp.sh: line 35: ${#lines[@]-1}: bad substitution

I got the code working simply substituting line 35 in file local/flist2scp.sh by:

    numjobs=${#lines[@]}

which gives me a value of "1" if I have only 1 job (it is odd that on Fedora you have to subtract 1 from the result, as the original code does, since there this also gives "1" with only 1 job...)
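For reference, bash rejects ${#lines[@]-1} because the length expansion cannot be combined with arithmetic inside a single ${...}; subtracting one requires arithmetic expansion. A minimal illustration:

```shell
#!/usr/bin/env bash
lines=(one two three)
echo "${#lines[@]}"             # array length: prints 3
echo "$(( ${#lines[@]} - 1 ))"  # length minus one: prints 2
```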

set number of speakers for speaker diarisation

as requested by @HenkvdHeuvel :

Even now, a file with speaker (spk) information is delivered alongside the output, but it would be nice if the user could indicate how many speakers there are in the audio file. For interviews this is typically two, and since the service is most often used for interviews, that would be a good default value, but it would be even better if the user could set this as a parameter themselves.

I mean that, in principle, a speaker ID is tagged onto each utterance/word. That is, I think, your option 2.

Basic installation instructions

When trying to set up Kaldi_NL I have the following observations.

When running ./configure.sh I am prompted for KALDI_ROOT. I would have expected this to be the prefix folder where Kaldi itself is installed, for example /opt/kaldi; it seems to be the sources folder of Kaldi instead. The next step is 'models'; naively I continue, and then the script terminates with:

ls: cannot access 'models////AM/////.mdl': No such file or directory
Something went wrong with Model Selection, Choose Acoustic Model: : empty list

java remains active after decode has finished

4 processes remain active after decode.sh has finished.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27249 asr 20 0 3364720 54908 11660 S 100.3 0.2 1366:02 java
27251 asr 20 0 3364720 54948 11656 S 100.3 0.2 1366:09 java
27253 asr 20 0 3364720 52992 11628 S 100.3 0.2 1366:08 java
27245 asr 20 0 3364720 52392 11732 S 100.0 0.2 1366:09 java
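Until the decode scripts clean up after themselves, the stray processes can be terminated by command-line pattern; this is only a sketch, and the 'lium' pattern is an assumption about what the diarization jar's command line contains, so check ps aux | grep java first:

```shell
# Terminate java processes whose command line matches the (assumed) pattern.
# pkill exits non-zero when nothing matches, hence the || true.
pkill -f 'java.*lium' || true
```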

Re-containerize Kaldi_NL without LaMachine

In light of proycon/LaMachine#214 a new container build that is standalone (no longer using LaMachine) needs to be implemented.

Kaldi poses two extra challenges:

  • The image is huge (30GB) due to all the models (and a fair amount of cruft)
  • I'd like to use Alpine as a basis, though it's still a bit unclear whether all Kaldi components compile nicely against musl. Moreover, this is only for production scenarios that do not require a GPU (e.g. no training).

Something went wrong: models were not installed.

After installing Kaldi and all the prerequisites I ran ./configure.sh.
After entering my Kaldi root and src in the script, I told it to store the models at
/home/rick/Documents/Kaldi_NL/models/
(no ~)

The script then starts downloading and doing a bunch of stuff; after a while I get the message: "Something went wrong: models were not installed"

Last few lines were:

Models/NL/UTwente/HMI/AM/CGN_large/nnet3_online/tdnn_cleaned/v1.0/frame_subsampling_factor
Models/NL/UTwente/HMI/AM/CGN_nomis/nnet_bn/dnn8f_pretrain_dbn_dnn_smbr/v1.0/final.feature_transform
Models/NL/UTwente/HMI/AM/CGN_large/nnet2_online/nnet_ms_a_tri4_ami_smbr_online/v1.0/conf/.fixed
Models/NL/UTwente/HMI/AM/CGN_nomis/fmllr_bn/dnn8a_bn-feat/v1.0/nnet/nnet_iter13_learnrate1.5625e-05_tr2.0319_cv2.1793_final_
Models/NL/UTwente/HMI/AM/CGN_large/nnet3_online/tdnn_cleaned/v1.0/ivector_extractor/online_cmvn.conf
Something went wrong: models were not installed.

If I open the folder, it is empty.
No messages showed anything going wrong prior to the above.
The tar file seems to have been downloaded successfully according to my shell.
What went wrong and how can I fix this?
I'd like to try out the model with a small benchmark to assess usability for a project.

training pre-trained model

Is it possible to improve the model with this script, starting from the pre-trained model, for example by training on Common Voice?
And of these models, which one is best for noisy audio?

change file_types setting in decode.sh

The Linux distribution I use (RHEL7) doesn't seem to have a proper mp3 plugin for SoX available, so during installation I ignored the warning and continued. Now, in the decode.sh script the setting is as follows:

    file_types="wav mp3" # file types to include for transcription

Can I safely edit this file and change the value to "wav"? Or will trouble come when some mp3 file is put in the input folder later?
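Assuming decode.sh treats file_types purely as a filter on the input listing (which the comment suggests, though we have not verified it), mp3 files would simply be ignored rather than cause trouble. A sketch of the edit, assuming the line appears exactly as quoted (on macOS, BSD sed needs -i ''):

```shell
# Restrict transcription input to wav files only.
sed -i 's/^file_types="wav mp3"/file_types="wav"/' decode.sh
```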

Add speaker id to sentence in output text file

I have been trying to figure out how to add a speaker id to the output text file, but have come up empty. I can tinker with the settings so that when there are 2 speakers these get identified and added to the 1Best.ctm file at the end of each line, but I want this added to the output text file with sentences. In other Kaldi examples this is done in multiple ways, but they mostly involve adding the speaker next to the vector as an argument for vector_extract.
Any hints on where to do this with KaldiNL?
Thanks!

decode.sh is empty

I figured decode.sh may not have been written because the configuration wasn't completed. After choosing configuration options in the dialog triggered by ./configure.sh, the script doesn't get past the screen with the text:

Const LM generation
Creating ConstLM for rescore, This may take a while
Creating ConstLM

Any ideas what went wrong?

Technical information:
OS: Red Hat Enterprise Linux Server release 7.2 (Maipo)
CPU: 4 x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (2 cpu cores)
MP3 plugin for Sox wasn't installed.
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
Memory: 32GB

Models chosen:
AM: NL/UTwente/HMI/AM/CGN_large/nnet3_online/tdnn_cleaned
LM: v1.0/KrantenTT.3gpr.kn.int.arpa.gz
Rescore LM: NL/UTwente/HMI/LM/KrantenTT & v1.0/KrantenTT.4gpr.kn.int.arpa.gz

lium fails to find speech segments

The LIUM speaker diarization results in 0 segments of speech, which is probably caused by a Java exception.

Could it be that the Kaldi_NL configuration depends on a different Java distribution than the one I have installed, which is java version "1.8.0_111"?

This is from the liumlog segmentation:


02:34.648                CONFIG| cmdLine: --fInputDesc=sphinx,1:3:2:0:0:0,13,0:0:0:0 --fInputMask=../OUT//intermediate/data/ALL/liumlog/%s.mfcc --sInputMask=../OUT//intermediate/data/ALL/liumlog/%s.i.seg --sOutputMask=../OUT//intermediate/data/ALL/liumlog/TONY_VAN_VERR-AEN560690VB.pms.seg --dPenality=10,10,50 --tInputMask=lib/models_es/sms.gmms TONY_VAN_VERR-AEN560690VB
02:34.678 MDecode        INFO  | fast decoding, Number of GMM=3	{make() / 1}
02:34.683 MDecode        FINE  | 	 decoder.get result	{make() / 1}
Exception in thread "main" java.lang.NullPointerException
	at fr.lium.spkDiarization.libDecoder.FastDecoderWithDuration.getClusterSet(FastDecoderWithDuration.java:872)
	at fr.lium.spkDiarization.programs.MDecode.make(MDecode.java:93)
	at fr.lium.spkDiarization.programs.MDecode.main(MDecode.java:121)

How can I have numbers recognized as digits instead of words?

KaldiNL works fine on my CentOS system. However, I want numbers (e.g. 152,200) to be written in the transcription as digits, rather than as honderdtweeenvijftigduizend...

What is the best way to do that? Are there additions to the LM available, or expected, that would make this possible?

Thanks!

Certificate expired on nlspraak.ewi.utwente.nl

Downloading the models fails because of an expired certificate:

wget https://nlspraak.ewi.utwente.nl/open-source-spraakherkenning-NL/Models_Starterpack.tar.gz
--2021-01-20 19:07:40--  https://nlspraak.ewi.utwente.nl/open-source-spraakherkenning-NL/Models_Starterpack.tar.gz
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving nlspraak.ewi.utwente.nl (nlspraak.ewi.utwente.nl)... 130.89.10.22
Connecting to nlspraak.ewi.utwente.nl (nlspraak.ewi.utwente.nl)|130.89.10.22|:443... connected.
ERROR: The certificate of ‘nlspraak.ewi.utwente.nl’ is not trusted.
ERROR: The certificate of ‘nlspraak.ewi.utwente.nl’ has expired.
The certificate has expired
