Multilingual end-to-end ASR system with phonological syllables as subwords

The aim of this project is to build a multilingual ASR system trained with phonological syllables as subwords.

Syllables can be described broadly as linguistic units that represent sound organization patterns in human speech.
They are relevant in speech production and perception and are considered a linguistic universal: they are found in all documented languages, and languages with similar phonetic inventories share most of their syllable inventory.
According to the phonological definition, each syllable consists of at least a nucleus, namely an element characterized by a high degree of sonority (in most cases a vowel); the nucleus can be surrounded by less sonorant elements that constitute syllable onset and coda. The sonority within the syllable increases before the nucleus, in which the peak is reached, and decreases after it.
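The sonority profile described above can be illustrated with a minimal sketch. The six-level sonority scale and the single-character phoneme symbols below are assumptions made for illustration only; they are not the scale used in this repo, and real sonority scales are language-specific.

```python
# Toy sonority scale (hypothetical): stops < fricatives < nasals < liquids < glides < vowels.
SONORITY = {}
for level, phones in enumerate(["ptkbdg", "fvsz", "mn", "lr", "jw", "aeiou"], start=1):
    for phone in phones:
        SONORITY[phone] = level


def has_single_sonority_peak(syllable):
    """Return True if sonority strictly rises up to one peak and strictly falls after it."""
    levels = [SONORITY[p] for p in syllable]
    peak = levels.index(max(levels))  # position of the nucleus
    rising = all(a < b for a, b in zip(levels[:peak], levels[1:peak + 1]))
    falling = all(a > b for a, b in zip(levels[peak:], levels[peak + 1:]))
    return rising and falling


print(has_single_sonority_peak(list("tra")))   # onset t-r rises toward the nucleus a
print(has_single_sonority_peak(list("rta")))   # r-t falls before the nucleus
```

Because the check is strict, this sketch rejects sequences of two equally sonorous segments, such as two adjacent vowels.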

Syllables convey acoustic information, because the arrangement of segments within a syllable mirrors the variation of energy in the signal.
Including syllables in the vocabulary on which the model is trained should therefore strengthen the association between each audio frame and its textual label, which we expect to benefit recognition.

To obtain syllables as subwords, we build a custom tokenizer, based on the class Wav2Vec2PhonemeCTCTokenizer, that applies the two main syllabification rules: the Sonority Sequencing Principle and the Maximal Onset Principle.
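As a rough illustration of how the two principles interact, here is a minimal sketch of a syllabifier. It is not the code in syllabifier.ipynb or in the custom tokenizer: the sonority scale, the vowel set, and the function names are hypothetical, and each phoneme is assumed to be a single character.

```python
# Hypothetical toy inventory; the real tokenizer works on WebMAUS phoneme symbols.
VOWELS = set("aeiou")
SONORITY = {
    phone: level
    for level, phones in enumerate(["ptkbdg", "fvsz", "mn", "lr", "jw", "aeiou"], start=1)
    for phone in phones
}


def is_legal_onset(cluster):
    """Sonority Sequencing Principle: sonority must strictly rise through the onset."""
    levels = [SONORITY[p] for p in cluster]
    return all(a < b for a, b in zip(levels, levels[1:]))


def syllabify(phonemes):
    """Split a list of phonemes into syllables (each syllable is a list of phonemes)."""
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not nuclei:
        return [list(phonemes)]
    boundaries = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        # Maximal Onset Principle: try the longest onset first, then shrink it
        # by moving consonants into the coda of the previous syllable.
        split = prev + 1
        while split < nxt and not is_legal_onset(phonemes[split:nxt]):
            split += 1
        boundaries.append(split)
    boundaries.append(len(phonemes))
    return [list(phonemes[a:b]) for a, b in zip(boundaries, boundaries[1:])]


print(syllabify(list("patrja")))  # "tr" + glide is a rising onset, so it stays together
print(syllabify(list("parte")))   # "rt" is not a rising onset, so "r" becomes a coda
```

This sketch treats every vowel as its own nucleus and does not handle special cases such as /s/ + consonant onsets, which a full syllabifier for Italian, Spanish and French would need to address.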

The dataset is automatically transcribed into phonemes so that we can work at the phonological level. This is done with the tool WebMAUS Basic, provided by the Bavarian Archive for Speech Signals of the Institute of Phonetics and Speech Processing at the Ludwig-Maximilians-Universität (München, Germany).

To build the ASR we fine-tune the pre-trained model WavLM-large (Chen et al., 2021) on multilingual speech data extracted from the Mozilla Common Voice dataset.

The languages considered within this project are Italian, Spanish and French.

The performance of the model is evaluated with two metrics: the Token Error Rate (TER) and the Phoneme Error Rate (PER).
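Both metrics can be understood as an edit distance normalized by the reference length, computed over syllable tokens for the Token Error Rate and over phonemes for the Phoneme Error Rate. The sketch below assumes this standard definition (substitutions, deletions and insertions divided by the number of reference units); the function names are hypothetical, and evaluation.py may compute the scores differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of units (tokens or phonemes)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses):
    """Total edit distance divided by total reference length over a corpus."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Token level: units are syllables.
ter = error_rate([["par", "te"]], [["pa", "te"]])
# Phoneme level: units are single phonemes.
per = error_rate([list("parte")], [list("pate")])
print(ter, per)
```

In this toy example a single missing phoneme costs one of five phonemes at the phoneme level but one of two tokens at the syllable level, which is why the two scores are reported separately.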

This repo contains:

multilingual_corpus.ipynb
-> multilingual corpus preparation (Common Voice data, WebMAUS Basic transcriptions)
transcriptions
-> folder with pkl files with phonetic and phonological transcriptions of Italian, French and Spanish data
expMLhyb20.py
-> fine-tuning of the model WavLM-large on multilingual data with a custom tokenizer and syllable-based vocabulary
syllabifier.ipynb
-> syllabification algorithm used to generate the syllabic vocabulary
HybridML_ITESFRPhoCTCTokenizer
-> custom tokenizer that acts according to syllabification rules
tokenizerMLT_ITESFR_hybPhoSyl246
-> folder with multilingual syllabic vocabulary
evaluation.py
-> evaluation of the trained model on the test dataset. Generates a csv file with a sample of predictions
back2words.py
-> auxiliary script that adjusts the format of the predicted sentences to calculate PER, TER and WER scores
