mozillaitalia / deepspeech-italian-model
Tooling for producing an Italian model (public release available) for DeepSpeech and a text corpus.
License: GNU General Public License v3.0
I think we have a lot of issues here.
There are issues related to the syntax users adopt in chat rooms, messages with fancy characters from chatbots, and symbols used for text graphics. I'll leave some recurring examples here.
Chat room info messages:
DCC CHAT ip <ip.address.format.here>
CoTiDie is [email protected] * CoTiDieMoRi
U:\>tracert 195.94.177.137
Tracing route to TARAS [195.94.177.137]
over a maximum of 30 hops:
e questa è la sua stazione
NetBIOS Remote Machine Name Table
Name Type Status
TARAS UNIQUE Registered\
EULOGOS GROUP Registered
TARAS UNIQUE Registered
TARAS UNIQUE Registered
EULOGOS GROUP Registered
FRANCESCO UNIQUE Registered
MAC Address = 52-54-AB-DD-22-98
wolvie is AWAY since Mon Oct 13 11:00:29 1997 Reason: 4OKKUPATO: lavoro
El_Diablo ----==>>>>------>12,11 EL-GRECO ----==>>>>------>
] [Time/0h 0m] [Log/On] [Page/On]
free-join 1,15 -The Most Advanced Script Ever Seen-
free-join 15,15 14,14 15,15
free-join 15,15 14,14 14,1-16=14º15 14°14S15ho16wD15ow14N 14P15r16O14°15,1 14º16=14-14,14 15,15
free-join 15,15 14,14 15,15
E_D-away Set Away: Tuesday 10/14/97 Pager: On MsgLog: Off Beeper: Off Reason: not chatting
Type /ctcp E_D-away PAGE REASON to get my attention
CI-WUGY (AWAY:scripting...) gØne since: 4:34pm
<U+0081> -=º °ShowDowN v6.5 PrO° º=- <U+0081>
Junes [^?Auto-Set Back^?] at (2:47:51pm) Away for: 46mins 28secs -LOG OFF-
/ctcp nextphase This_Is_Not_A_Fucking_CTCP___This_Is_A_CoCoNuts_Island_CTCP_ :D
Deth got [<U+008D>(8Lemon)<U+008D>] [<U+008D>(8Lemon)<U+008D>] [<U+008D>(--2-7---)<U+008D>]
Symbols and text noise produced by nicknames or by chatting:
ciao a tutti ;9
.... ....
\\olverin pensa che sia il caso di sganciare
c e qualcuno????????????????????????????,,
kimy [Pitch], usate le query please
/msg drago ciao
adios agua.......................- - ->
Pannella usa la tromba§
Mannaccia **§§°ç°ç§é*ç°é*ç°é°çé°ç°
etupensicheseiofossiilpresidentedellajamaicastareiacazzeggiarequicontuttelefighechecisonola'???????????????' :> [08]
/ ___| |_ _| / \ / _ \ | |
| | | | / _ \ | | | | | |
\____| |___| /_/ \_\ \___/ (_)
italia .'..'.
Que|o io avrei il crack del kali
/Msg LiveFast, sono il suo agente
che ca^?^?o hai capito
Different languages:
nothing je t ai dit que je suis la
If Anyone Speaks Too Long Texts He Will Be Kicked
hello th^?^?e girls
Just align with https://github.com/Common-Voice/commonvoice-fr/pull/97/files
Hi there,
(do I have to write in English, or can I write in Italian?)
I have tested the Docker instance and I've found these errors:
wget https://lingualibre.fr/datasets/Q385-ita-Italian.zip -O /mnt/source/lingua_libre_Q385-ita-Italian_train.zip
change "/mnt/source/" into "/mnt/sources/"
sed -i s/#//g '/mnt/extracted/data/*test.csv'
sed: can't read /mnt/extracted/data/*test.csv: No such file or directory
My workaround was to specify the directories like this:
sed -i 's/#//g' /mnt/extracted/data/cv-it/clips/*test.csv
sed -i 's/#//g' /mnt/extracted/data/cv-it/clips/*train.csv
sed -i 's/#//g' /mnt/extracted/data/cv-it/clips/*dev.csv
sed -i 's/#//g' /mnt/extracted/data/lingualibre/*test.csv
sed -i 's/#//g' /mnt/extracted/data/lingualibre/*train.csv
sed -i 's/#//g' /mnt/extracted/data/lingualibre/*dev.csv
+ rm /mnt/lm/lm.arpa
+ '[' '!' -f /mnt/lm/trie ']'
+ curl -sSL https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.master.ba56407376f1e1109be33ac87bcb6eb9709b18be.cpu/artifacts/public/native_client.tar.xz
+ pixz -d
+ tar -xf -
can not seek in input: Illegal seek
Not an XZ file
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Browsing that URL gives: ResourceNotFound
P.S.
Every time, pixz returns:
can not seek in input: Illegal seek
I hope this is only a warning.
P.P.S.
I've tried to format this thread as best as possible, but it seems I can't... sorry if it is too chaotic.
Hope this helps in some way.
Regards
Massimo
I Initializing variables...
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:42 | Steps: 45 | Loss: 184.312358
Epoch 0 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 165.173247 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 165.173247 to: /mnt/checkpoints/best_dev-45
Epoch 1 | Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 171.796456
Epoch 1 | Validation | Elapsed Time: 0:00:07 | Steps: 31 | Loss: 162.115896 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 1 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
WARNING:tensorflow:From /home/trainer/ds-train/lib/python3.6/site-packages/tensorflow/python/training/saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
W0925 18:24:51.043290 140513397847872 deprecation.py:323] From /home/trainer/ds-train/lib/python3.6/site-packages/tensorflow/python/training/saver.py:960: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I Saved new best validating model with loss 162.115896 to: /mnt/checkpoints/best_dev-90
Epoch 2 | Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 150.745584
Epoch 2 | Validation | Elapsed Time: 0:00:07 | Steps: 31 | Loss: 136.187367 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 2 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 136.187367 to: /mnt/checkpoints/best_dev-135
Epoch 3 | Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 127.614623
Epoch 3 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 123.730088 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 3 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 123.730088 to: /mnt/checkpoints/best_dev-180
Epoch 4 | Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 114.725798
Epoch 4 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 115.417479 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 4 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Saved new best validating model with loss 115.417479 to: /mnt/checkpoints/best_dev-225
Epoch 5 | Training | Elapsed Time: 0:00:39 | Steps: 45 | Loss: 104.686398
Epoch 5 | Validation | Elapsed Time: 0:00:08 | Steps: 31 | Loss: 136.464502 | Dataset: /mnt/extracted/data/cv-it/clips/dev.csv
Epoch 5 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /mnt/extracted/data/lingualibre/lingua_libre_Q385-ita-Italian_dev.csv
I Early stop triggered as (for last 4 steps) validation loss: 136.464502 with standard deviation: 8.535361 and mean: 125.111644
I FINISHED optimization in 0:04:52.388711
E While processing /mnt/extracted/data/cv-it/clips/common_voice_it_17894238.wav:
E "ERROR: Your transcripts contain characters (e.g. '#') which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your [train,dev,test].csv transcripts, and then add all these to data/alphabet.txt."
As things stand, during development or debugging, but also for normal usage, it can be handy to execute the various model-generation scripts manually.
https://github.com/mozilla/TTS is a completely different project, but we can release a model for it as well.
Hi all,
thanks for the hard work that you've put into creating this fork. I'm in the process of creating a reduced dictionary to control a robot, but I'm having some issues with the trie file generation. I've tried to generate the trie file using the generate_trie script of the 0.7.0a1 native client without success. I've even tried to run it on the lm_binary that comes with the "2020.03.13" release, but it still fails to recognize words. This is, for instance, the result if I try to get "cinque sei sette otto nove dieci" recognized:
Obviously everything works fine if I use the lm.binary and trie provided together with the model.
The idea is a script that executes the others and, at the end, generates a single txt file and does some sanitization on it, like:
Respira.
[^A-Za-z0-9àÈèìéòù,;:'.! ]
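A minimal sketch of that wrapper in Python, assuming the single importers have already produced plain-text outputs (the file names here are hypothetical):

import re
from pathlib import Path

# Characters outside this set are stripped, per the class above.
ALLOWED = re.compile(r"[^A-Za-z0-9àÈèìéòù,;:'.! ]")

def merge_and_sanitize(sources, output="corpus.txt"):
    with open(output, "w", encoding="utf-8") as out:
        for source in sources:
            for line in Path(source).read_text(encoding="utf-8").splitlines():
                cleaned = ALLOWED.sub("", line).strip()
                if cleaned:
                    out.write(cleaned + "\n")

# hypothetical outputs of the single importer scripts
merge_and_sanitize(["ted_output.txt", "wiki_output.txt"])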
The exporter currently has two problems that severely limit its parallelization:
Align the repo
I think it is important to define some rules for processing sentences from all importers.
These checks can be done either in the wrapper script or in sanitize.py (the latter can be more efficient).
My proposal is:
[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;]
If a sentence matches this regex it contains invalid characters, so it should be discarded. The discard will be done after trying to clean the sentence (like removing trailing dashes or unescaping HTML).
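A sketch of how these checks could look in sanitize.py, under the assumption that sentences are lowercased beforehand (the character class above only lists lowercase letters):

import html
import re

# Any character this matches is invalid, per the regex proposed above.
INVALID = re.compile(r"[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;]")

def clean(sentence):
    # Try to rescue the sentence first: unescape HTML entities
    # and strip trailing dashes and whitespace.
    return html.unescape(sentence).strip().strip("-").strip()

def keep(sentence):
    # Discard only after cleaning; assumes lowercased input.
    return not INVALID.search(clean(sentence).lower())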
After discussing in the community (especially with @paolo-losi), the model needs a better text corpus; the ones available right now have licensing issues for our needs.
CC0 or public domain, or a CC license that allows commercial use; but we want to release just the scripts and not the final dataset, to avoid any trouble.
The point is to create a corpus built not from Wikipedia, encyclopedic sources, manuals and so on, but from colloquial resources like chats, discussions/emails, and quotes, which are closer to what voice recognition actually needs.
So we need to replace https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/DeepSpeech/it/build_lm.sh#L12 with something else.
So our idea is to generate a static txt file on the fly with a billion words from this kind of text.
We need material from after 1920 to get a more modern Italian.
Every resource needs to be sanitized and cleaned to remove symbols and other unneeded content.
MITADS
= Mozilla Italia DeepSpeech (we can change the codename of the corpus, refer to #65). Considering that the DeepSpeech model is executed on Linux machines we could use Bash, but it is not very fast, so we have to use Python.
Also, this corpus doesn't need to be generated at every model generation; we generate it once for all of them.
Just write in the readme about how we are releasing the model.
My idea is 2019.2-0.1: this means using the whole set of scripts with the 2nd CV Italian dataset, but generated as version 0.1 because of different testing or other reasons.
What do you think? @astrastefania @mone27
There is an Italian project of Wikipedia pages read aloud in Italian.
The link for all the various audio and pages: https://it.wikipedia.org/wiki/Categoria:Voci_parlate
Some of these recordings are in the public domain, like https://it.wikipedia.org/wiki/File:Itwiki-Barile_(unit%C3%A0_di_misura).ogg
Another problem is that those recordings are of old versions of the pages, so we need to recover the revision that was read and associate it with the recording.
List of resources we can implement to add more datasets for DeepSpeech (maybe generating a custom dataset based on the Common Voice dataset organization, of which there is a sample in the readme, or generating one on the fly to avoid license issues):
Check also: #34
Otherwise we can evaluate these tools to generate a dataset based on YouTube:
Another solution is to use https://github.com/srinivr/kaldi-long-audio-alignment with the Italian model to automatically split text+audio into small fragments to speed things up.
The most important part is that the data needs to be aggregated to avoid license issues; this means the files must be merged together so that it is not possible to recreate the original files.
The purpose of this ticket is to find a name for the text corpus we are working on; it will be used as a reference everywhere, probably also outside this project.
Basically it will be like a brand name, so it is important that the origin/author is easy to recognize (at least for me).
I personally don't like the MITADS name because it is difficult to pronounce and to understand what it means.
Here I propose some other random ideas:
Looking forward to hearing your input.
Originally posted by @mone27 in #36 (comment)
I am wondering if we can speed up the text corpus generation scripts using Python threads.
It is something we can do once we have all the scripts working; then we can hack each of them to split its work between reading and cleaning data.
Our estimate is that it can take around 4 hours now.
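One caveat: because of the CPython GIL, threads won't speed up CPU-bound regex cleaning, but processes will. A minimal sketch with multiprocessing, where clean_line is a placeholder for the real cleaning rules of each script:

from multiprocessing import Pool

def clean_line(line):
    # placeholder for the real cleaning rules
    return line.strip().lower()

def clean_file(path, workers=4):
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    with Pool(workers) as pool:
        return pool.map(clean_line, lines, chunksize=1000)

if __name__ == "__main__":
    cleaned = clean_file("corpus_raw.txt")  # hypothetical input file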
https://github.com/MozillaItalia/commonvoice-it/blob/master/DeepSpeech/import_lingualibre.sh needs a way to automatically download and unzip https://lingualibre.fr/datasets/Q385-ita-Italian.zip if the file doesn't exist, like https://github.com/MozillaItalia/commonvoice-it/blob/master/DeepSpeech/import_cvit.sh does.
This way execution will be faster, because we save the downloaded text instead of querying the website, which has a delay, every time.
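A minimal Python sketch of that download-if-missing logic; the paths mirror the ones used in the scripts but are assumptions here:

import os
import urllib.request
import zipfile

URL = "https://lingualibre.fr/datasets/Q385-ita-Italian.zip"
ZIP_PATH = "/mnt/sources/lingua_libre_Q385-ita-Italian_train.zip"
EXTRACT_DIR = "/mnt/extracted/data/lingualibre"  # assumed layout

# Download only if the archive is not already cached locally.
if not os.path.exists(ZIP_PATH):
    urllib.request.urlretrieve(URL, ZIP_PATH)

with zipfile.ZipFile(ZIP_PATH) as archive:
    archive.extractall(EXTRACT_DIR)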
deepspeech version 0.5.1 installed with pip in a fresh virtualenv cannot properly load the Italian model. It seems that the KenLM version used to train the model is more recent than the version linked into deepspeech 0.5.1.
$ deepspeech --model italian/output_graph.pbmm --audio test.wav --lm italian/lm.binary --trie italian/trie --alphabet italian/alphabet.txt
Loading model from file italian/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-11-13 11:02:34.378997: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-13 11:02:34.387460: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-11-13 11:02:34.387495: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-11-13 11:02:34.387508: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-11-13 11:02:34.387609: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0109s.
Loading language model from files italian/lm.binary italian/trie
Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
Loaded language model in 0.0439s.
Running inference.
Error running session: Not found: PruneForTargets: Some target nodes not found: initialize_state
Segmentation fault (core dumped)
We need to migrate all the Bash scripts and the Docker file, replacing the hard-coded French references in the files with parameters pointing to Italian resources.
Right now, localizing the readme in that folder into English or Italian is not a priority: https://github.com/MozillaItalia/commonvoice-it/tree/master/DeepSpeech
The scripts in that folder download various packages from other resources, like lingualibre, to add more data to the model generation, then package and generate the model for DeepSpeech.
Until this is done we cannot generate the Italian model to use with DeepSpeech.
The PRs to add to this:
As the title says, we need to explain the new scripts better at https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/MITADS/README.md
Currently in build_lm.sh there is:
curl -sSL https://github.com/Common-Voice/commonvoice-fr/releases/download/lm-0.1/wiki.txt.xz | pixz -d | tr '[:upper:]' '[:lower:]' > wiki_it_lower.txt
pointing at the French version of the wiki.
If it can be helpful, I have used:
https://raw.githubusercontent.com/alesarrett/CostituzioneItaliana/master/costituzione.txt
edit: change batch size to 128. edit 2: never mind, it crashes.
I think it is better to define a training pipeline, as the official DeepSpeech releases do.
We don't have the same amount of hours and video cards as the DeepSpeech folks, so let's start with the 0.6 release hyperparameters.
I was thinking of some kind of pipeline to apply when training a model from scratch or when starting from a pretrained checkpoint (transfer learning). What do you think?
generate the scorer with LM_ALPHA and LM_BETA = 0
EPOCHS=30
BATCH_SIZE=64
N_HIDDEN=2048
LEARNING_RATE=0.0001
DROPOUT=0.4
EARLY_STOP
ES_EPOCHS (early stop after)=10
MAX_TO_KEEP=3 (we can keep more checkpoint when we will have more disk space)
DROP_SOURCE_LAYERS=1 (if using transfer learning)
USE_AUTOMATIC_MIXED_PRECISION (if training from scratch)
or:
generate the scorer with LM_ALPHA and LM_BETA = 0
EPOCHS=100
BATCH_SIZE=64
N_HIDDEN=2048
LEARNING_RATE=0.0001
DROPOUT=0.4
EARLY_STOP
ES_EPOCHS (early stop after)=25 (default value)
MAX_TO_KEEP=3
REDUCE_LR_ON_PLATEAU=1 (when learning got stuck, LR will be reduced)
PLATEAU_EPOCHS=10 (default; number of epochs to consider for RLROP, smaller than ES_EPOCHS)
DROP_SOURCE_LAYERS=1 (if using transfer learning)
USE_AUTOMATIC_MIXED_PRECISION (if training from scratch)
Ref: http://www.clips.unina.it/it/index.jsp
Tasks:
We need to parse the txt of every recording to generate a single CSV, package this CSV with all the WAV files, and remove the rest of the files.
New package name: Clips-Mitads, just as a reference.
wav_filename,wav_filesize,transcript
common_voice_it_19574474.wav,175148,ben degna di ammirazione
common_voice_it_19574387.wav,291884,noi possiamo benissimo non ritrovarci in quello che facciamo
Scripts unfinished: https://gist.github.com/Mte90/116e5d8a17973b7bd9bd9050662736dd
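A sketch of the CSV generation step in the format shown above, assuming each recording comes as a .wav file plus a .txt transcript with the same basename (the layout and paths are assumptions):

import csv
import os
from pathlib import Path

def build_csv(wav_dir, out_csv):
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav in sorted(Path(wav_dir).glob("*.wav")):
            txt = wav.with_suffix(".txt")  # assumed transcript location
            if txt.exists():
                transcript = txt.read_text(encoding="utf-8").strip()
                writer.writerow([wav.name, os.path.getsize(wav), transcript])

build_csv("clips", "clips_mitads.csv")  # hypothetical paths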
We want to document the variables in https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/DeepSpeech/Dockerfile_it.train and also the 4 env files we provide now.
As of now some variables are not very clear, and others, like only_export, were just added by us,
so @lissyx can help us:
english_compatible
amp
max_to_keep
n_hidden
lm_beta
lm_alpha
These are some problems that I've found looking at the ted_importer.py output. I'll write them down starting from the most serious, at least for me :)
code issues:
output issues:
symbols: ♪ ; ♫ ; T ∇ Sτ ; E=mc² ; 31¼%
HTML escapes (sanitize.py's escapehtml() could be useful):
&
"
Nessuno aveva mai studiato l'�involucro
L'�immagine alle mie spalle mostra
Ed ecco come userò il mio premio di 10
00 dollari
falo´
È un infuso creato
E'il drone
E'l'applicazione
E'stata
realta'virtuale
E'più
L'immagine che venne un po'dopo aveva una spiegazione semplice
La natura nel senso del IXX secolo, giusto
VV: Sì, tre persone sono scese sul fondo dell'Oceano Pacifico
and about this last one: a lot of sentences start with 2 letters followed by ':'; a possible fix is sketched after the examples below.
AC: Se dimagrisci un po'
AG: Ci sono certamente implicazioni tecniche
ZF: Scendi tu dal palco
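A sketch of a rule that could strip those speaker tags, assuming they are always two or three capital letters followed by a colon at the start of the sentence:

import re

# Leading speaker tag such as "AC:" or "VV:" in the TED transcripts.
SPEAKER = re.compile(r"^[A-Z]{2,3}:\s*")

def strip_speaker(sentence):
    return SPEAKER.sub("", sentence)

assert strip_speaker("AC: Se dimagrisci un po'") == "Se dimagrisci un po'"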
Missing files:
Originally posted by @alex179ohm in #2 (comment)
The idea is to add audiobooks (not poetry/prose) written after 1930 from Librivox (they are released as CC0), whose author has been dead for at least 70 years.
Also, the recordings should be longer than 1 hour.
We can also accept books released earlier, as long as they are in modern Italian.
The big point is that we also need a DeepSpeech importer for them, like https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/DeepSpeech/it/import_m-ailabs.sh.
List (books from D'Annunzio, Pascoli, Pirandello, Verga):
We need to generate a dataset for https://www.comune.borgomanero.no.it/audio/audio.aspx
In order to simplify development of the Italian model, I propose to remove the lingualibre dataset.
The reason is that the Italian lingualibre data is only 4 minutes long, so it does little to improve the dataset.
The main issue, as pointed out in #17, is that the maximum test batch size is 16 due to the small lingualibre dataset, and I have not found an easy way to specify it correctly in the Docker image (making test_batch_size and batch_size different). Moreover it would force a suboptimal test batch size for the other datasets.
There is a dataset where the same Italian text is read in various different ways: www.mspkacorpus.it/
We probably need to write an adapter like https://github.com/MozillaItalia/DeepSpeech-Italian-Model/blob/master/DeepSpeech/it/import_m-ailabs.sh
Found these issues:
e` a` o` i` u`
eg: sai dire il nome della squadra che giochera` contro il Pine`
Those could be fixed by adding some regex rules to the mapping_normalization list, which currently only removes the text in square brackets:
[re.compile('\[.*?\]'), u''],
and then:
[re.compile('a`'), u'à'],
[re.compile('u`'), u'ù'],
[re.compile('i`'), u'ì'],
[re.compile('o`'), u'ò'],
and about e`:
[re.compile('perche`'), u'perché'],
[re.compile(' ne`'), u'né'],
... list of other words that need é instead of è ...
[re.compile(' e`'), u'è'],
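Put together, and assuming mapping_normalization is a list of [pattern, replacement] pairs applied in order (word-specific fixes before the generic rule, and with the leading space kept in the replacements so words are not glued together), the rules would look like:

import re

mapping_normalization = [
    [re.compile(r'\[.*?\]'), u''],
    [re.compile(u'a`'), u'à'],
    [re.compile(u'u`'), u'ù'],
    [re.compile(u'i`'), u'ì'],
    [re.compile(u'o`'), u'ò'],
    [re.compile(u'perche`'), u'perché'],
    [re.compile(u' ne`'), u' né'],  # leading space kept in the replacement
    [re.compile(u' e`'), u' è'],    # generic rule applied last
]

def normalize(text):
    for pattern, replacement in mapping_normalization:
        text = pattern.sub(replacement, text)
    return text

assert normalize('giochera` e` tardi') == 'giocherà è tardi'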
I tried this code:
but it failed; I resolved it by following this:
There are different scripts in CommonVoice-Data that are used to download CC0 material and test the generated model:
We need to replace them because they don't exist in Italian, so we can create similar aggregators from:
In https://github.com/MozillaItalia/commonvoice-it/blob/master/DeepSpeech/import_cvit.sh the script looks for the dataset.
It would be more useful to check for it and, if it is not available, download the package from the URL https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-3/it.tar.gz
We also need to check the SHA1 and update it in the script.
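A Python sketch of that check; the SHA1 value and destination path are placeholders that have to be updated together with the corpus release:

import hashlib
import os
import urllib.request

URL = ("https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4"
       ".s3.amazonaws.com/cv-corpus-3/it.tar.gz")
TARGET = "/mnt/sources/it.tar.gz"  # assumed destination
EXPECTED_SHA1 = "0" * 40  # placeholder, update with the real checksum

def sha1_of(path):
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if not os.path.exists(TARGET):
    urllib.request.urlretrieve(URL, TARGET)
if sha1_of(TARGET) != EXPECTED_SHA1:
    raise SystemExit("checksum mismatch: delete the file and retry")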
As written at #45 (comment)
Requires a computer with an NVIDIA card.
Hi everyone,
I would like to know whether there is a forum, chat, or any other official or unofficial channel where I can discuss the development of Italian DeepSpeech and give/get support during training or while developing a custom language model.
Thanks
It would be great if the instructions in the README were dumb-proof.
I just tried to follow them and the results were nonsensical.
It may clearly be due to an error on our side or to the environment (WSL), but looking at the release I suspect that some data is missing (I just followed strictly what's in the README).
This is what happens with parlareitaliano:
...17066%, 0 MB, 58052 KB/s, 0 seconds passedDownloading in ./parsing/parlareitaliano/b01f001f.hsw
...12047%, 0 MB, 45343 KB/s, 0 seconds passedDownloading in ./parsing/parlareitaliano/b01f003f.hsw
...24824%, 0 MB, 59074 KB/s, 0 seconds passedDownloading in ./parsing/parlareitaliano/b01f005f.hsw
We need to test the status bars of all the importers, because sometimes they don't work flawlessly.
We have the issue that the text corpus includes Roman numerals; we need to convert them to ordinary numbers, while also spotting false positives and so on.
We need a way to detect Roman numerals without matching other text that contains the same letters.
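A sketch of a strict detector, which only accepts well-formed numerals; note that short Italian words like DI, MI, or VI are themselves valid numerals, so a whitelist or context check would still be needed on top of this:

import re

# Only well-formed Roman numerals; uppercase input assumed.
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

VALUES = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}

def roman_to_int(token):
    total = 0
    for current, following in zip(token, token[1:] + " "):
        value = VALUES[current]
        # Subtract when a smaller symbol precedes a bigger one (e.g. IX).
        total += -value if VALUES.get(following, 0) > value else value
    return total

assert ROMAN.match("XIX") and roman_to_int("XIX") == 19
assert not ROMAN.match("CIAO")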
We need to migrate https://github.com/MozillaItalia/commonvoice-it/blob/master/CommonVoice-Data/names.py
We need a list of generic street names.