common-voice / commonvoice-fr

Tooling for producing French dataset for Common Voice

Python 81.75% Shell 18.25%

commonvoice-fr's Introduction

Common Voice

This is the web app for Mozilla Common Voice, a platform for collecting speech donations in order to create public domain datasets for training voice recognition-related tools.

Upcoming releases

Type	Release Cadence	More info
Platform code & sentences	Monthly, or as needed	Release notes
Dataset	Quarterly	Dataset metadata

How to contribute

🎉 First off, thanks for taking the time to contribute! This project would not be possible without people like you. 🎉

There are many ways to get involved with Common Voice - you don't have to know how to code to contribute!

To add or correct the translation of the web interface, please use the Mozilla localization platform Pontoon. Please note, we do not accept any direct pull requests for changing localization content.
For information on how to add or edit sentences to Common Voice, see SENTENCES.md
For instructions on setting up a local development environment, see DEVELOPMENT.md
For information on how to add a new language to Common Voice, see LANGUAGE.md
For information on how to get in contact with existing language communities, see COMMUNITIES.md

For more general guidance on building your own language community using Mozilla voice tools, please refer to the Mozilla Voice Community Playbook.

Discussion

For general discussion (feedback, ideas, random musings), head to our Discourse Category.

For bug reports or specific feature requests, please use the GitHub issue tracker.

For live chat, join us on Matrix.

Licensing and content source

This repository is released under MPL (Mozilla Public License) 2.0.

The majority of our sentence text in /server/data either comes directly from user submissions in our Sentence Collector or is scraped from Wikipedia using our extractor tool, and is released under a CC0 public domain Creative Commons license.

Any files that follow the pattern europarl-VERSION-LANG.txt (such as europarl-v7-de.txt) were extracted with our thanks from the Europarl Corpus, which features transcripts from proceedings in the European parliament.

Citation

If you use the data in a published academic work, we would appreciate it if you cited the following article:

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M. and Weber, G. (2020) "Common Voice: A Massively-Multilingual Speech Corpus". Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). pp. 4211-4215

The BibTeX entry is:

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}

Cross Browser Testing

This project is tested with BrowserStack.

commonvoice-fr's Issues

Deprecate our language-specific LM production and use `generate_lm.py`

Texts on Framabook

Framabook hosts a few texts under the CC0 license (example).

They could be used to enrich the text corpus.

Write an ePub parser?
Extract the texts
Validate against an existing alphabet
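
The alphabet validation step could look roughly like the sketch below. It assumes the alphabet is simply a set of allowed characters; `ALPHABET`, `is_valid_sentence` and `filter_sentences` are illustrative names and are not taken from this repository.

```python
import unicodedata

# Hypothetical alphabet: lowercase French letters, apostrophe, hyphen and space.
ALPHABET = set("abcdefghijklmnopqrstuvwxyzàâæçéèêëîïôœùûüÿ'- ")


def is_valid_sentence(sentence, alphabet=ALPHABET):
    """Return True if every character of the normalized sentence is in the alphabet."""
    # NFC keeps accented letters as single code points ("é" rather than "e" + accent).
    normalized = unicodedata.normalize("NFC", sentence).lower()
    return bool(normalized.strip()) and all(ch in alphabet for ch in normalized)


def filter_sentences(sentences, alphabet=ALPHABET):
    """Keep only the sentences that pass the alphabet check."""
    return [s for s in sentences if is_valid_sentence(s, alphabet)]
```

Sentences extracted from an ePub would be passed through `filter_sentences` before being added to the corpus; anything containing digits or unexpected symbols is dropped rather than repaired.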

Update KenLM (+Eigen) version

Newer versions with -v required

Overwrite the Common Voice TSVs if provided

Allow a tar.gz that contains a different view of the Common Voice data (regenerated by Corpora Creator).

Docker pipeline for DeepSpeech training

Formalize the current pipeline for French:

Clone DeepSpeech/master
Import the Common Voice FR dataset
Import the TrainingSpeech dataset by @nicolaspanel
Import the Lingua Libre dataset
Training
Export to the various formats (.pb, .pbmm, .tflite)

It seems that many contributors would find a Docker image more convenient for this.

Security Vulnerability in requests <= 2.19.1

Documentation of the utils module

I think it would be useful to document the functions of the utils module by writing docstrings for them, to make them easier to use. The docstrings should include recommendations, for example when it is preferable to use the nlp argument of extract_sentences().

Unnatural pronunciation

Hello,

I am one of the people who listen to sentences at https://voice.mozilla.org/fr/listen

but most of the sentences I hear are read aloud, and you can hear that they are being read; the pronunciation is not natural:

pauses between words
choppy reading

Should these recordings be accepted or rejected? (If we reject them, very few will remain.)

I think the causes are:

the speaker is reading ahead or trying to understand the rest of the sentence (as if they had not read the sentence once beforehand)
the sentence to read is not simple
- a word (most likely a number) was removed from the middle of the sentence, making the sentence odd
- complicated / uncommon words

If a model is trained on this data, I am afraid it will only recognize sentences that are read aloud, not the ones we pronounce naturally.

Maybe we could have some kind of tutorial with advice:

read the sentence once before recording
other advice

and examples:

of sentences to accept:
- nominal case
- very strong accent
- noise
- poor-quality microphone
- other cases
of sentences to reject:
- missing words
- inaudible
- one word in place of another
  - which does not change the meaning of the sentence but does not exactly match the written sentence
- other cases

I do not know where to report this.

Create a language model

Documented in https://github.com/mozilla/DeepSpeech/blob/master/data/lm/README.md

We should write code to be able to reproduce an LM from scratch:

Export a Wikipedia FR dump
Convert the XML to TXT
Clean the TXT (according to which criteria?)
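
Since the cleaning criteria are left open, here is a minimal sketch of one possible set, purely as an assumption to start the discussion: lowercase everything, replace punctuation with spaces, collapse whitespace, and drop lines that contain digits. `clean_line` and `clean_corpus` are hypothetical helpers.

```python
import re

# Assumed cleaning criteria (the issue leaves them open):
# lowercase, replace punctuation with spaces, collapse whitespace,
# and skip lines that contain digits.
_PUNCTUATION = re.compile(r"[^\w\s'-]", re.UNICODE)
_WHITESPACE = re.compile(r"\s+")


def clean_line(line):
    """Return a normalized line, or None if the line should be dropped."""
    if any(ch.isdigit() for ch in line):
        return None
    text = _PUNCTUATION.sub(" ", line.lower())
    text = _WHITESPACE.sub(" ", text).strip()
    return text or None


def clean_corpus(lines):
    """Yield cleaned lines, skipping the ones rejected by clean_line()."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned is not None:
            yield cleaned
```

Dropping lines with digits avoids having to decide how numbers are spoken; spelling them out would be an alternative criterion.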

Use the transfer-learning branch

The goal is to be able to train from an existing English checkpoint while (notably) using a French alphabet.

Keep a transfer-learning branch up to date with https://github.com/mozilla/DeepSpeech/tree/transfer-learning2
Integrate it into the Dockerfile

Output that makes no sense

Hello,
I downloaded the latest model and tried it on this audio file:

conférence.wav.zip

The result is the following:

il est bien diciottenni n'est pas en mesure de garantir carrément aux diogens personne fusicaudus entreprise son plebejus n'aura aucune des rentes qui ne soit subuda une tonsure personnel a pas un mesaton qui a fait des voiles au panama son entreprise par ce peu la loi dit que le sujet disponible aux personnes cuisiniers morale ou estainbeche e di aujuste entame assure le plus de duquesne finirent pas en plus ou suprotection puolanto busseau tout de noces et cooper qui a dumaisniel possibilite pour le penetrans il est difficile dans miseration de l'autobot de la cote sofrance les utes de cotonoises importe tous au monde par des personnes plusieurs française ou des sociétés de tangeance seront desormais depuis au bout pour celui par la france nantenin insos openfeint a francais putot qu'au contrebalancera ce une bonne chose mais pour y arriver y a une condition retable incontounable et ca c'est le boece la france et europe

The text seems to be just a string of random words.
Is it my recording in particular that is unusable, or is this a "normal" error rate for the French DeepSpeech model at this point?

Organizing a CV/DS meetup

Poll to pick a date: https://framadate.org/foGyOuwuwlXVCYnB

Set `uid/gid` for inside Docker

Currently, the Docker image hardcodes uid/gid 999/999. This makes using a mount point somewhat painful, since the content might not be owned by the user running Docker.

Adding a build arg for that would make usage smoother.
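
As a rough sketch of how such build args might be supplied, the helper below assembles a `docker build` command that forwards the caller's uid/gid. The arg names `uid` and `gid` are hypothetical: the Dockerfile would need matching `ARG` declarations, which it does not have today.

```python
import os


def docker_build_command(image_tag, context_dir="."):
    """Build a `docker build` command that forwards the caller's uid/gid."""
    # `uid` and `gid` are hypothetical ARG names; the Dockerfile would need
    # matching `ARG uid` / `ARG gid` declarations for this to have any effect.
    return [
        "docker", "build",
        "--build-arg", "uid={}".format(os.getuid()),
        "--build-arg", "gid={}".format(os.getgid()),
        "-t", image_tag,
        context_dir,
    ]
```

The returned list can be handed to `subprocess.call` as-is; files written to the mount point would then belong to the user who built the image.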

Common Voice v3 FR comparison -- duplicates

Compare the WER on training between:

Common Voice v3 FR """baseline""", i.e., no duplicates allowed
Common Voice v3 FR with 2/4 duplicates,
Common Voice v3 FR with 8 duplicates

We want to check:

how much additional data we recover at each step
how much this improves the model, starting from the assumption that, for an equal amount of data, duplicates degrade it
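
To measure how much data each setting recovers, the duplicate cap can be sketched as below. This is not how Corpora Creator does it; it only assumes the TSV has a `sentence` column, and `limit_duplicates` / `load_tsv` are illustrative names.

```python
import csv
from collections import defaultdict


def limit_duplicates(rows, max_per_sentence=1, key="sentence"):
    """Keep at most `max_per_sentence` rows for each distinct sentence."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        if seen[row[key]] < max_per_sentence:
            seen[row[key]] += 1
            kept.append(row)
    return kept


def load_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: quotation marks inside sentences are ordinary characters.
        return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Calling `len(limit_duplicates(rows, n))` for n in 1, 2, 4 and 8 gives the number of clips available at each step, which can then be set against the WER of the corresponding training run.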

Node.js DeepSpeech & French-language model

Hello!

I am trying to use DeepSpeech on a Node.js server in order to build an STT solution. Unfortunately, I am not getting very far.

For my experiments, I used release 0.3.4 so that I could work from a French model.

I am using the PlayStation Eye camera plugged in over USB, with arecord:

**** Liste des Périphériques Matériels CAPTURE ****
carte 0: PCH [HDA Intel PCH], périphérique 0: CX8200 Analog [CX8200 Analog]
  Sous-périphériques: 1/1
  Sous-périphérique #0: subdevice #0
carte 1: CameraB409241 [USB Camera-B4.09.24.1], périphérique 0: USB Audio [USB Audio]
  Sous-périphériques: 1/1
  Sous-périphérique #0: subdevice #0

I started from the example provided by Mozilla in the DeepSpeech repo and modified the microphone configuration:

const playstationEye = {
  rate: '32000',
  channels: '4',
  debug: true,
  fileType: 'wav',
  device: 'plughw:1,0',
}

I also changed the path to the model used by DeepSpeech:

  DEEPSPEECH_MODEL = path.join(__dirname, 'model_tensorflow_fr')

When I start the Node.js server, the microphone does listen, but I only manage to detect a few words / sentences, and they are quite far from what I say:
recognized: { text: 'ou je mavet', recogTime: 456, audioLength: 1875 }

As a newcomer to the world of speech recognition, I imagine I am missing something and that there are other things to configure?
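
One thing worth checking is the audio format: the configuration above records at 32 kHz with 4 channels, whereas DeepSpeech models are typically trained on 16 kHz, mono, 16-bit audio (the expected rate is an assumption here and should be checked against the model in use). A small sketch using Python's standard `wave` module to inspect a recorded file; `wav_format_problems` is a hypothetical helper.

```python
import wave


def wav_format_problems(path, expected_rate=16000):
    """List the ways a WAV file differs from 16-bit mono audio at `expected_rate` Hz."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected 1 channel, got {}".format(wav.getnchannels()))
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples, got {}-bit".format(8 * wav.getsampwidth()))
        if wav.getframerate() != expected_rate:
            problems.append("expected {} Hz, got {} Hz".format(expected_rate, wav.getframerate()))
    return problems
```

An empty list means the file already has the expected format; otherwise the audio should be resampled and downmixed before being passed to the model.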

Install deepspeech as a package, as is now required

Training on 0.7 will require a proper pip install of the deepspeech codebase.

Use and package the `best_dev` checkpoint

Currently, the code causes us to lose the best_dev checkpoint, which would be the most useful one for users.

Model v0.4

Blockers:

Update the DeepSpeech fork to master / v0.7.0 #89
Integrate Common Voice French v3 (release fr_412h_2019-12-10) #90
Improve the filtering of Common Voice v3 #99
Optimize the language model by limiting it to the most frequent words #96
~~Generate the LM #83~~
Share the best_dev #94
Exports from best_dev #95 (ready but waiting on a DeepSpeech patch)
Common Voice v3 FR comparison -- duplicates #102

Non-blockers:

Expand the audio corpus with new datasets #91
Experiment with augmentations (pitch, noise) #92
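
The blocker about limiting the language model to the most frequent words (#96) essentially needs a frequency count over the text corpus. A minimal sketch, assuming the corpus is already cleaned and whitespace-tokenized; `top_words` is an illustrative helper, not the code used in this repository.

```python
from collections import Counter


def top_words(lines, top_k):
    """Return the `top_k` most frequent whitespace-separated words in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return [word for word, _ in counts.most_common(top_k)]
```

The resulting list could serve as the vocabulary the LM is restricted to; the right value of `top_k` would have to be found experimentally.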

M-AILABS French

https://github.com/JRMeyer/open-speech-corpora#bsd-3-clause-license

~190h advertised from French LibriVox, with some overlap with Training Speech.

Working importer
Importer merged upstream

Download the dataset automatically

I am trying to build the Italian version but, apart from the READMEs being in French, it is not clear to me how the datasets are downloaded.
The scripts include importers for several datasets but no URLs to download them automatically.
Would it be possible to add that, or at least some instructions? That would make setup easier.

urllib3 needs updating

Repository reorganization

Write and use language-specific validate-label outside of DeepSpeech fork

Support for --validate_label_locale has landed in the DeepSpeech master repo. We can rely on it to reduce the need to fork the DeepSpeech repo, and write a locale-dedicated validate_label function directly in the Docker instance.

Use the LM optimizer

mozilla/DeepSpeech#2783

Split training / evaluation

It can be useful to re-run evaluation without a training phase.

Integrate Common Voice French v3 (release fr_412h_2019-12-10)

Invalid import from Wikisource

https://discourse.mozilla.org/t/grosse-proportion-de-mots-accentues-decoupes/47333

cc @hellosct1

Produce LM as a packaged scorer

DeepSpeech 0.7 and newer package the LM as a scorer file, instead of an LM plus a trie plus manually set alpha / beta values.

Example of extraction from an audiobook

The audio and the EPUB were obtained from https://www.atramenta.net/lire/contes-du-jour-et-de-la-nuit/6473

The resulting "segmentation" is the file Contes_du_jour_et_de_la_nuit.json (see contes du jour.zip); it was generated with

/usr/bin/python -m aeneas.tools.execute_task Contes_du_jour_et_de_la_nuit.mp3 Contes_du_jour_et_de_la_nuit.txt "task_language=fra|os_task_file_format=json|is_text_type=plain"  Contes_du_jour_et_de_la_nuit.json

NB:

The file Contes_du_jour_et_de_la_nuit.txt is an extraction/normalization of the sentences from the EPUB file (see )

To obtain the audio sub-files, use:

import json
import subprocess

with open('Contes_du_jour_et_de_la_nuit.json') as f:
  data = json.load(f)
input_audio = 'Contes_du_jour_et_de_la_nuit.wav'
for item in data['fragments']:
  # shift both bounds by 0.2 s to correct a systematic alignment error
  from_ = max(float(item['begin']) - 0.2, 0.)
  to = float(item['end']) - 0.2
  output_audio = '/tmp/{}.wav'.format(item['id'])
  # pass the arguments as a list so that paths containing spaces are not split
  subprocess.call(['ffmpeg', '-i', input_audio, '-ss', str(from_), '-to', str(to), '-y', '-c', 'copy', output_audio])

I used $ ffmpeg -i Contes_du_jour_et_de_la_nuit.mp3 -ac 1 -ar 16000 Contes_du_jour_et_de_la_nuit.wav to convert from MP3 to WAV.
See https://github.com/readbeyond/aeneas for more information.

I will let you look through all of this, but it looks pretty good to me.

African Accented French

http://www.openslr.org/57/

Working importer
Importer merged upstream

Fail to run import_trainingspeech.sh

Hi, I'm running the Docker image but I get stuck at the import_trainingspeech.sh step, where I get the following error:

+ fr/import_trainingspeech.sh
+ pushd /home/trainer/ds/
~/ds ~
+ pip install Unidecode==1.0.23
Collecting Unidecode==1.0.23
  Downloading https://files.pythonhosted.org/packages/31/39/53096f9217b057cb049fe872b7fc7ce799a1a89b76cf917d9639e7a558b5/Unidecode-1.0.23-py2.py3-none-any.whl (237kB)
     |████████████████████████████████| 245kB 9.4MB/s 
Installing collected packages: Unidecode
Successfully installed Unidecode-1.0.23
+ '[' 0 = 1 ']'
+ '[' '!' -f /mnt/extracted/data/trainingspeech/ts_2019-04-11_fr_FR_train.csv ']'
+ python bin/import_ts.py /mnt/extracted/data/trainingspeech
No path "/mnt/extracted/data/trainingspeech" - creating ...
No archive "/mnt/extracted/data/trainingspeech/ts_2019-04-11_fr_FR.zip" - downloading...
Progress |#                                                                                       | 100% completedTraceback (most recent call last):
  File "bin/import_ts.py", line 197, in <module>
    _download_and_preprocess_data(cli_args.target_dir, cli_args.english_compatible)
  File "bin/import_ts.py", line 44, in _download_and_preprocess_data
    archive_path = maybe_download('ts_' + ARCHIVE_NAME + '.zip', target_dir, ARCHIVE_URL)
  File "/home/trainer/ds/bin/../util/downloader.py", line 26, in maybe_download
    bar.update(done)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/progressbar/bar.py", line 629, in update
    return self.update(value, force=force, **kwargs)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/progressbar/bar.py", line 641, in update
    % (value, self.min_value, self.max_value))
ValueError: Value 278 is out of range, should be between 0 and 0
Progress |#

I build and run the Docker image on a GCP VM (Debian 9), but I get the same error when testing locally. For now, I have commented out import_trainingspeech.sh in ./fr/run.sh to skip this step.
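
The ValueError suggests the progress bar was created with a maximum of 0, which could happen if the download size was not known when the bar was set up. That is a guess from the traceback, not a confirmed diagnosis. Under that assumption, a defensive sketch would compute the expected number of chunks only when the size is known; `expected_chunks` and `clamp_progress` are hypothetical helpers, and `headers` is any mapping of response headers.

```python
import math


def expected_chunks(headers, chunk_size=1024 * 1024):
    """Return the number of chunks to expect, or None when the size is unknown."""
    # A plain dict is case-sensitive; real HTTP header mappings usually are not.
    raw = headers.get("Content-Length")
    if raw is None or not str(raw).isdigit() or int(raw) == 0:
        return None
    return math.ceil(int(raw) / chunk_size)


def clamp_progress(done, total):
    """Clamp `done` so it never exceeds a known `total`."""
    return done if total is None else min(done, total)
```

With `None` as the total, the caller can skip the progress bar or use an indeterminate one instead of failing halfway through the download.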

Improve the filtering of Common Voice v3

First results on Common Voice v3:

a lot of junk in the alphabet
roughly similar loss
very high error rates (WER and CER)

I think the dataset should be cleaned up to at least have a """normal""" alphabet.

Miscellaneous fixes to Corpora Creator

cf. the work of @nicolaspanel common-voice/CorporaCreator#87

A few items:

Identify incorrect sequences / sentences
Adapt CorporaCreator to fix them (if possible) or reject them (otherwise): #21
Report / fix the Common Voice source text upstream

Training stops after 3 epochs (wrong gast version?)

Hello there,
My training seems to stop after 3 epochs, and it seems related to the version of gast installed:

Epoch 3 | Validation | Elapsed Time: 0:00:17 | Steps: 13 | Loss: 48.163957 | Dataset: /mnt/extracted/data/African_Accented_French/African_Accented_French/African_Accented_French_dev.csv
I Early stop triggered as (for last 4 steps) validation loss: 37.242888 with standard deviation: 0.310681 and mean: 36.303010
I FINISHED optimization in 1:22:21.241841
WARNING:tensorflow:Entity <bound method LSTMBlockWrapper.call of <tensorflow.contrib.rnn.python.ops.lstm_ops.LSTMBlockFusedCell object at 0x7f3a003c8cf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method LSTMBlockWrapper.call of <tensorflow.contrib.rnn.python.ops.lstm_ops.LSTMBlockFusedCell object at 0x7f3a003c8cf8>>: AttributeError: module 'gast' has no attribute 'Num'
W0116 14:20:37.469725 139891645953856 ag_logging.py:145] Entity <bound method LSTMBlockWrapper.call of <tensorflow.contrib.rnn.python.ops.lstm_ops.LSTMBlockFusedCell object at 0x7f3a003c8cf8>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method LSTMBlockWrapper.call of <tensorflow.contrib.rnn.python.ops.lstm_ops.LSTMBlockFusedCell object at 0x7f3a003c8cf8>>: AttributeError: module 'gast' has no attribute 'Num'
INFO:tensorflow:Restoring parameters from /mnt/checkpoints/best_dev-18787
I0116 14:20:37.636511 139891645953856 saver.py:1280] Restoring parameters from /mnt/checkpoints/best_dev-18787
I Restored variables from best validation checkpoint at /mnt/checkpoints/best_dev-18787, step 18787
Testing model on /mnt/extracted/data/M-AILABS/fr_FR/fr_FR_test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                                     Fatal Python error: Segmentation fault

Thread 0x00007f3a01f0a700 (most recent call first):

Thread 0x00007f3a0380d700 (most recent call first):
  File "/usr/lib/python3.6/threading.py", line 295 in wait
  File "/usr/lib/python3.6/queue.py", line 164 in get
  File "/home/trainer/ds-train/lib/python3.6/site-packages/tensorflow/python/summary/writer/event_file_wrifr/train.sh: line 49:    51 Segmentation fault      (core dumped) python -u DeepSpeech.py --show_progressbar True --use_cudnn_rnn True --automatic_mixed_precision True --alphabet_config_path /mnt/models/alphabet.txt --lm_binary_path /mnt/lm/lm.binary --lm_trie_path /mnt/lm/trie --feature_cache /mnt/sources/feature_cache --train_files ${all_train_csv} --dev_files ${all_dev_csv} --test_files ${all_test_csv} --train_batch_size ${BATCH_SIZE} --dev_batch_size ${BATCH_SIZE} --test_batch_size ${BATCH_SIZE} --n_hidden ${N_HIDDEN} --epochs ${EPOCHS} --learning_rate ${LEARNING_RATE} --dropout_rate ${DROPOUT} --lm_alpha ${LM_ALPHA} --lm_beta ${LM_BETA} ${EARLY_STOP_FLAG} --checkpoint_dir /mnt/checkpoints/ --export_dir /mnt/models/ --export_language "fra"

This seems related to this TensorFlow issue, which is why I suspect a wrong version of gast. Has anyone come across this issue already, and is there a proper way to fix it (maybe by pinning the right gast version in the Dockerfile)? Or maybe it is something else?

Hyperparameter tuning

The goal is to find "good" values for the network's various hyperparameters in order to get correct results: number of epochs, learning rate, dropout, lm alpha/beta
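
One simple way to explore these values is a grid search. The sketch below only generates the flag combinations; the flag names are the ones that appear in the DeepSpeech.py invocation quoted in the gast issue above, while the candidate values are placeholders and not recommendations.

```python
import itertools

# Hypothetical candidate values; only the flag names come from the training command.
GRID = {
    "--epochs": [30],
    "--learning_rate": [0.0001, 0.00005],
    "--dropout_rate": [0.2, 0.3],
    "--lm_alpha": [0.75],
    "--lm_beta": [1.85],
}


def training_flag_sets(grid=GRID):
    """Yield one flat list of command-line flags per hyperparameter combination."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        flags = []
        for name, value in zip(names, values):
            flags.extend([name, str(value)])
        yield flags
```

Each yielded list can be appended to the base training command, with one run per combination and the dev-set loss or WER recorded for comparison.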

Experiment with using DSAlign with the current model

Check whether the current model is good enough to be used by https://github.com/mozilla/DSAlign and enable automatic alignment.

This would let us draw on CC-0 and other compatibly licensed sources that provide an exact transcription but still require segmentation and alignment work.

Generate the language model with `generate_lm.py`

Selecting data sources for the French model

We can already start with:

Common Voice fr, obviously
Training Speech by @nicolaspanel https://gitlab.com/nicolaspanel/TrainingSpeech
Lingua Libre https://lingualibre.fr/datasets/ : mozilla/DeepSpeech#2067

Make scripts more generalized

Currently we have too much under each language.

Integrate Corpora-Creator for easier experimenting

For several languages, Common Voice data releases contain much more in validated.tsv than is available from train.tsv, dev.tsv and test.tsv. This is a (good) conservative approach to make sure the default usable data does not contain the same sentence spoken multiple times by several people.

Some experiments show, however, that including these repeated sentences can improve the model.

Integrating Corpora-Creator and allowing the data to be regenerated from validated.tsv would help people experiment with that (French dataset included).

Fail to run ./bin/run-tc-ldc93s1_new.sh

Hello,

I'm trying to build and run the Docker train image but it keeps failing at the last step of checks.sh:
./bin/run-tc-ldc93s1_new.sh 2

I always get the following error:

+ ./bin/run-tc-ldc93s1_new.sh 2
+ ldc93s1_dir=./data/ldc93s1-tc
+ ldc93s1_csv=./data/ldc93s1-tc/ldc93s1.csv
+ epoch_count=2
+ [ ! -f ./data/ldc93s1-tc/ldc93s1.csv ]
+ echo Downloading and preprocessing LDC93S1 example data, saving in ./data/ldc93s1-tc.
Downloading and preprocessing LDC93S1 example data, saving in ./data/ldc93s1-tc.
+ python -u bin/import_ldc93s1.py ./data/ldc93s1-tc
No path "./data/ldc93s1-tc" - creating ...
No archive "./data/ldc93s1-tc/LDC93S1.wav" - downloading...
Traceback (most recent call last):
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/connectionpool.py", line 376, in _make_request
    self._validate_conn(conn)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/connection.py", line 394, in connect
    ssl_context=context,
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 370, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib/python3.6/ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "/usr/lib/python3.6/ssl.py", line 817, in __init__
    self.do_handshake()
  File "/usr/lib/python3.6/ssl.py", line 1077, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib/python3.6/ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/trainer/ds-train/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/trainer/ds-train/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='catalog.ldc.upenn.edu', port=443): Max retries exceeded with url: /desc/addenda/LDC93S1.wav (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "bin/import_ldc93s1.py", line 28, in <module>
    _download_and_preprocess_data(sys.argv[1])
  File "bin/import_ldc93s1.py", line 18, in _download_and_preprocess_data
    local_file = maybe_download(LDC93S1_BASE + ".wav", data_dir, LDC93S1_BASE_URL + LDC93S1_BASE + ".wav")
  File "/home/trainer/ds/bin/../util/downloader.py", line 18, in maybe_download
    req = requests.get(archive_url, stream=True)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/requests/api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/home/trainer/ds-train/lib/python3.6/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='catalog.ldc.upenn.edu', port=443): Max retries exceeded with url: /desc/addenda/LDC93S1.wav (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

Do you know what may cause this and how to resolve?
Thanks!

Install CorporaCreator in the Docker image
Download the Common Voice MP3 clips
Use this "unofficial" Common Voice dataset

Rejecting abbreviations

To avoid ambiguous situations, it would be effective to reject certain abbreviations. The current code only handles those written entirely in uppercase, possibly separated by periods.

We might want to add things specific to French: mr, mme, melle, mgr, ch, rte, etc.: https://github.com/Common-Voice/sentence-collector/blob/master/shared/validation/languages/en.js#L16-L20

Identify the existing abbreviations in Common Voice and in https://github.com/Common-Voice/commonvoice-fr/tree/master/CommonVoice-Data/data
Propose code for Sentence Collector to reject them
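
The rule could be prototyped as in the sketch below before porting it to Sentence Collector, whose validation code is JavaScript. The uppercase pattern is an approximation of the behaviour described above, not a copy of the existing code, and the word list only contains the abbreviations named in this issue.

```python
import re

# Uppercase abbreviations, optionally dot-separated: "SNCF", "S.N.C.F.", "U.S.A".
UPPERCASE_ABBREVIATION = re.compile(r"\b(?:[A-Z]\.?){2,}")

# French abbreviations listed in the issue; extend as needed.
FRENCH_ABBREVIATIONS = {"mr", "mme", "melle", "mgr", "ch", "rte"}

# Runs of letters only (no digits, no underscore), accented letters included.
_WORD = re.compile(r"[^\W\d_]+", re.UNICODE)


def contains_abbreviation(sentence):
    """Return True if the sentence contains an abbreviation that should be rejected."""
    if UPPERCASE_ABBREVIATION.search(sentence):
        return True
    words = (word.lower() for word in _WORD.findall(sentence))
    return any(word in FRENCH_ABBREVIATIONS for word in words)
```

Running `contains_abbreviation` over the existing sentence files would also give a first list of the abbreviations already present in the data.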