
klaam's Introduction

Motivation

Machine learning has proven its importance in many fields, such as computer vision, NLP, reinforcement learning, and adversarial learning. Unfortunately, little work has been done to make machine learning accessible to Arabic-speaking people.

Goal

Our goal is to enrich Arabic content by creating open-source projects and to open the community's eyes to the significance of machine learning. We want to create interactive applications that allow Arabic-speaking novices to learn more about machine learning and appreciate its advances.

Challenges

The Arabic language has many complicated features compared to other languages. First, Arabic is written right to left. Second, it contains many letters that most foreigners cannot pronounce, like ض ، غ ، ح ، خ، ظ. Moreover, Arabic contains special characters called diacritics, which help readers pronounce words correctly. For instance, the statement السَّلامُ عَلَيْكُمْ وَرَحْمَةُ اللَّهِ وَبَرَكَاتُهُ contains such characters after most of the letters. Diacritics follow special rules that determine which one is assigned to a given character; these rules constitute an entire field called النَّحْوُ الْعَرَبِيُّ. Compared to English, the letters of an Arabic word are mostly connected, as in اللغة, and writing them disconnected, as in ا ل ل غ ة, makes them difficult to read. Finally, there are as many as half a billion Arabic speakers, which has resulted in many dialects across different countries.

Procedure

Our procedure is general and can be applied to many language models, not just Arabic. This standardized approach consists of multiple steps, starting from training on Colab and ending with porting the models to the web.

Models

Name - Description
Arabic Diacritization - Simple RNN model ported from Shakkala
Arabic2English Translation - seq2seq with Attention
Arabic Poem Generation - CharRNN model with multinomial distribution
Arabic Words Embedding - N-Grams model ported from Aravec
Arabic Sentiment Classification - RNN with Bidirectional layer
Arabic Image Captioning - Encoder-Decoder architecture with attention
Arabic Word Similarity - Embedding layers using cosine similarity
Arabic Digits Classification - Basic RNN model with classification head
Arabic Speech Recognition - Basic signal processing and classification
Arabic Object Detection - SSD Object detection model
Arabic Poems Meter Classification - Bidirectional GRU
Arabic Font Classification - CNN
Arabic Text Detection - Optical Character Recognition (OCR)

Datasets

Name - Description
Arabic Digits - 70,000 images (28x28) converted to binary from Digits
Arabic Letters - 16,759 images (32x32) converted to binary from Letters
Arabic Poems - 146,604 poems scraped from aldiwan
Arabic Translation - 100,000 parallel Arabic-to-English translations ported from OpenSubtitles
Product Reviews - 1,648 product reviews ported from Large Arabic Resources For Sentiment Analysis
Image Captions - 30,000 image paths with captions extracted and translated from COCO 2014
Arabic Wiki - 4,670,509 words cleaned and processed from Wikipedia Monolingual Corpora
Arabic Poem Meters - 55,440 verses with their associated meters collected from aldiwan
Arabic Fonts - 516 images (100x100) for two classes

Tools

To make models easily accessible to contributors, developers, and novice users, we use two approaches:

Google Colab

Google Colaboratory is a free service offered by Google for research purposes. The interface of a Colab notebook is very similar to Jupyter notebooks, with slight differences. Google offers three hardware accelerators for speeding up training: CPU, GPU, and TPU. We use the GPU most of the time because it is easy to work with and achieves good results in a reasonable time. Check this great tutorial on Medium.
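For example, a quick sanity check (assuming a TensorFlow 2.x runtime) confirms that the notebook is actually attached to a GPU before training:

import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means the runtime is
# CPU-only and the accelerator should be changed in the notebook settings.
gpus = tf.config.list_physical_devices("GPU")
print("GPU available:", len(gpus) > 0, gpus)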

TensorFlow.js

TensorFlow.js is part of the TensorFlow ecosystem that supports training and inference of machine learning models in the browser. Please follow these steps if you want to port models to the web (a minimal Python sketch follows the steps):

  1. Use Keras to train the model, then save it with model.save('keras.h5')

  2. Install the TensorFlow.js converter using pip install tensorflowjs

  3. Run the converter: tensorflowjs_converter --input_format keras keras.h5 model/

  4. The model directory will contain model.json along with weight files such as group1-shard1of1

  5. Finally, you can load the model in the browser using TensorFlow.js
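Below is a minimal sketch of steps 1 to 3 in Python, assuming TensorFlow 2.x and the tensorflowjs pip package are installed; the toy model and file names are placeholders.

import tensorflow as tf
import tensorflowjs as tfjs

# 1. Build (and normally train) a Keras model, then save it as keras.h5.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.save("keras.h5")

# 2-3. Convert directly from Python; this writes model.json and the
# group1-shard* weight files into model/, equivalent to running the
# tensorflowjs_converter CLI shown above.
tfjs.converters.save_keras_model(model, "model/")

In the browser, the converted model can then be loaded with tf.loadLayersModel('model/model.json').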

Check this tutorial that I made for the complete procedure.

Website

We developed many models that run directly in the browser. Using TensorFlow.js, the models run on the client's GPU. Since the webpage is static, there are no privacy or security concerns. You can visit the website here. Here is the main interface of the website:

The models added so far

Poems Generation

English Translation

Words Embedding

Sentiment Classification

Image Captioning

Diacritization

Contribution

Check the CONTRIBUTING.md for a detailed explanation of how to contribute.

Resources

As a start, we will use GitHub to host the website, models, datasets, and other content. Unfortunately, there is a limitation on storage space that will haunt us in the future. Please let us know what you suggest on this matter.

Contributors

Thanks goes to these wonderful people (emoji key):


MagedSaeed

🎨 🤔 📦

March Works

🤔

Mahmoud Aslan

🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Citation

@inproceedings{alyafeai-al-shaibani-2020-arbml,
    title = "{ARBML}: Democritizing {A}rabic Natural Language Processing Tools",
    author = "Alyafeai, Zaid  and
      Al-Shaibani, Maged",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.2",
    pages = "8--13",
}

klaam's People

Contributors

ahmed-ashraf-marzouk, ma7dev, magedsaeed, mustafa0x, zaidalyafeai


klaam's Issues

Improving the classification model

The classification model needs improvement. The accuracy on the test set is around 62% on the five classes. Here is the model used

import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2PreTrainedModel


class Wav2Vec2ClassificationModel(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.wav2vec2 = Wav2Vec2Model(config)

        # Each 1024-dim encoder frame is projected to inner_dim, then the
        # flattened (inner_dim * feature_size) vector is mapped to 5 classes.
        self.inner_dim = 128
        self.feature_size = 999  # number of encoder frames (20 s of 16 kHz audio)

        self.tanh = nn.Tanh()
        self.linear1 = nn.Linear(1024, self.inner_dim)
        self.linear2 = nn.Linear(self.inner_dim * self.feature_size, 5)
        self.init_weights()

    def freeze_feature_extractor(self):
        self.wav2vec2.feature_extractor._freeze_parameters()

    def forward(
        self,
        input_values,
        attention_mask=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        labels=None,  # accepted but unused; the loss is computed outside forward()
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.wav2vec2(
            input_values,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        x = self.linear1(outputs[0])  # (batch, frames, inner_dim)
        x = self.tanh(x)
        x = self.linear2(x.view(-1, self.inner_dim * self.feature_size))
        return {"logits": x}
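For reference, a hedged usage sketch (not from the repo): the head above expects exactly feature_size = 999 encoder frames, which for this model corresponds to 20 seconds of 16 kHz audio (320,000 samples), and the loss is computed outside forward() since labels are ignored there.

import torch
import torch.nn.functional as F

# Assumption: the base checkpoint is the XLSR model used in the training scripts.
model = Wav2Vec2ClassificationModel.from_pretrained("facebook/wav2vec2-large-xlsr-53")
model.freeze_feature_extractor()

waveform = torch.randn(2, 20 * 16_000)   # two dummy 20-second clips at 16 kHz
labels = torch.tensor([0, 3])            # five dialect classes, indices 0-4

logits = model(waveform)["logits"]       # shape: (2, 5)
loss = F.cross_entropy(logits, labels)   # cross-entropy computed externally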

argparse.ArgumentError appears when trying to train the module

I tried to train the module with both scripts that are in the readme file, and both resulted in argparse.ArgumentError
I tried running:

python run_mgb3.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --freeze_feature_extractor \
    --num_train_epochs="50" \
    --per_device_train_batch_size="32" \
    --preprocessing_num_workers="1" \
    --learning_rate="3e-5" \
    --warmup_steps="20" \
    --evaluation_strategy="steps"\
    --save_steps="100" \
    --eval_steps="100" \
    --save_total_limit="1" \
    --logging_steps="100" \
    --do_eval \
    --do_train \

and also

python run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="ar" \
    --output_dir=/path/to/output/ \
    --cache_dir=/path/to/cache \
    --overwrite_output_dir \
    --num_train_epochs="1" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --evaluation_strategy="steps" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --fp16 \
    --freeze_feature_extractor \
    --save_steps="10" \
    --eval_steps="10" \
    --save_total_limit="1" \
    --logging_steps="10" \
    --group_by_length \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --do_train --do_eval \
    --max_train_samples 100 --max_val_samples 100

and both resulted in this error:

2022-04-24 19:02:16.824403: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-04-24 19:02:16.824670: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_mgb3.py", line 523, in
main()
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_mgb3.py", line 263, in main
parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\transformers\hf_argparser.py", line 71, in init
self._add_dataclass_arguments(dtype)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\transformers\hf_argparser.py", line 166, in _add_dataclass_arguments
self._parse_dataclass_field(parser, field)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\transformers\hf_argparser.py", line 137, in _parse_dataclass_field
parser.add_argument(field_name, **kwargs)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\argparse.py", line 1440, in add_argument
return self._add_action(action)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\argparse.py", line 1805, in _add_action
self._optionals._add_action(action)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\argparse.py", line 1642, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\argparse.py", line 1454, in _add_action
self._check_conflict(action)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\argparse.py", line 1591, in _check_conflict
conflict_handler(action, confl_optionals)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.3312.0_x64__qbz5n2kfra8p0\lib\argparse.py", line 1600, in handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --gradient_checkpointing: conflicting option string: --gradient_checkpointing

I reproduced the error by running it on another machine and still got it.
Any suggestions on how to fix it?
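For what it's worth, here is a minimal reproduction of the underlying argparse behaviour (an assumption about the cause: HfArgumentParser registers one CLI flag per dataclass field, so the error occurs when two of the dataclasses, e.g. the script's own arguments and a newer transformers TrainingArguments, both declare gradient_checkpointing):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--gradient_checkpointing", action="store_true")
# Registering the same option a second time raises the error from the traceback:
# argparse.ArgumentError: argument --gradient_checkpointing:
#   conflicting option string: --gradient_checkpointing
parser.add_argument("--gradient_checkpointing", action="store_true")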

How to capture voice from audio device

Hi,
I had a look at how to get text from an audio file, but could not figure out how to extract the voice directly from the audio device while speaking, i.e. without saving the voice to a wave file.
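One hedged option (not part of klaam) is to record straight into a NumPy array with the third-party sounddevice package instead of saving a wave file first; how the array is then fed to the recognizer depends on klaam's API (if it only accepts file paths, the array can still be kept in memory via an io.BytesIO buffer and soundfile).

import sounddevice as sd

DURATION = 5          # seconds to record
SAMPLE_RATE = 16_000  # wav2vec2-based models expect 16 kHz mono audio

# Record from the default input device into a float32 NumPy array.
recording = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
sd.wait()                     # block until the recording is finished
speech = recording.squeeze()  # 1-D array, ready for preprocessing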

Why doesn't the Arabic TTS work for some Arabic test samples?

Thanks for this awesome work.
I was using this notebook https://github.com/ARBML/klaam/blob/main/notebooks/demo.ipynb to test some samples.

This is the code I tried:

from klaam import TextToSpeech
from IPython.display import Audio

root_path = "./"
prepare_tts_model_path = "./cfgs/FastSpeech2/config/Arabic/preprocess.yaml"
model_config_path = "./cfgs/FastSpeech2/config/Arabic/model.yaml"
train_config_path = "./cfgs/FastSpeech2/config/Arabic/train.yaml"
vocoder_config_path = "./cfgs/FastSpeech2/model_config/hifigan/config.json"
speaker_pre_trained_path = "./data/model_weights/hifigan/generator_universal.pth.tar"

model = TextToSpeech(prepare_tts_model_path, model_config_path, train_config_path, vocoder_config_path, speaker_pre_trained_path,root_path)

text = 'وہ ابو بکر کو صلاہ کی رہنمائی کیا جاتا ہے ہمارے لئے یہ ایک بڑی سوال ہے۔ بہت سوال ہے۔ یہ ایک قیمتی سوال ہے جو کسی کو صلاہ کی رہنمائی کیا جاتا ہے جب وہ زندگی ہے اور وہ مسجد میں ہے اور وہ کماند ہے اور وہ کہتا ہے اللہ اور اس کی رسول کو کوئی باقر سے درمی نہیں دے اور جب وہ ابو بکر کو نہیں جانتے ہیں اور امر کو بھی جانتا ہے۔'
model.synthesize(text)
Audio("sample.wav")

and I get this error:


Downloading...
From: https://drive.google.com/uc?id=1J7ZP_q-6mryXUhZ-8j9-RIItz2nJGOIX
To: /content/klaam/model.pth.tar
100%|██████████| 418M/418M [00:06<00:00, 62.4MB/s]
Removing weight norm...
skipped
['b', 'r']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-6-a135184f6c06>](https://localhost:8080/#) in <module>
     12 
     13 text = 'وہ ابو بکر کو صلاہ کی رہنمائی کیا جاتا ہے ہمارے لئے یہ ایک بڑی سوال ہے۔ بہت سوال ہے۔ یہ ایک قیمتی سوال ہے جو کسی کو صلاہ کی رہنمائی کیا جاتا ہے جب وہ زندگی ہے اور وہ مسجد میں ہے اور وہ کماند ہے اور وہ کہتا ہے اللہ اور اس کی رسول کو کوئی باقر سے درمی نہیں دے اور جب وہ ابو بکر کو نہیں جانتے ہیں اور امر کو بھی جانتا ہے۔'
---> 14 model.synthesize(text)
     15 Audio("sample.wav")

3 frames
[/content/klaam/klaam/external/FastSpeech2/phonetise/phonetise_arabic.py](https://localhost:8080/#) in phonetise(text)
    612                 for pronunciation in pronunciations:
    613                     stressIndex = findStressIndex(pronunciation)
--> 614                     if stressIndex < len(pronunciation) and stressIndex != -1:
    615                         pronunciation[stressIndex] += "'"
    616                     else:

TypeError: '<' not supported between instances of 'str' and 'int'


Instead of throwing errors, I expected the Arabic TTS to discard unknown characters automatically, like this TTS does: https://tts.readthedocs.io/en/latest/

Can you suggest any Arabic text preprocessing technique that I can apply before calling model.synthesize(text), so that the model doesn't throw errors like TypeError: '<' not supported between instances of 'str' and 'int' for such samples? Thanks in advance.
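A rough pre-processing sketch (an assumption, not part of klaam): keep only standard Arabic letters and diacritics before calling model.synthesize(text), so characters outside the phonetiser's vocabulary (including the Urdu-specific letters in the sample above) are dropped instead of causing errors.

import re

# U+0621-U+064A: Arabic letters, U+064B-U+0652: harakat (diacritics).
NON_ARABIC = re.compile(r"[^\u0621-\u0652\s]")

def keep_arabic(text: str) -> str:
    cleaned = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

raw_text = "السَّلامُ عَلَيْكُمْ! (hello) 123"
print(keep_arabic(raw_text))  # -> السَّلامُ عَلَيْكُمْ
# then call model.synthesize(keep_arabic(your_text)) as in the snippet above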

assert batch_size * group_size < len(dataset) AssertionError when I train the model

hello everyone,

@zaidalyafeai @mustafa0x @elgeish @MagedSaeed

I tried to train the model on my dataset and this error came out. Could you please help me?

Traceback (most recent call last):
File "/content/drive/MyDrive/FastSpeech2/train.py", line 198, in
main(args, configs)
File "/content/drive/MyDrive/FastSpeech2/train.py", line 32, in main
assert batch_size * group_size < len(dataset)
AssertionError

thank you
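For context, a sketch of what the assertion checks (the numbers below are hypothetical; the real batch_size and group_size come from the train config): FastSpeech2's training loop needs the dataset to contain more than batch_size * group_size samples, so with a small custom dataset the usual fix is to lower batch_size or group_size, or to add more samples.

# Hypothetical values for illustration; read yours from the train config.
batch_size = 16
group_size = 4        # train.py roughly groups this many batches together
dataset_len = 50      # e.g. a small custom dataset

# This is the check at train.py line 32 that fails:
assert batch_size * group_size < dataset_len, (
    f"need more than {batch_size * group_size} samples, got {dataset_len}"
)
# 16 * 4 = 64 >= 50, so the assertion fails for this example.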

Missing file

Hello Mustafa & Ziad,

I have checked your awesome work, which is really helpful to me, but I have a question please ,,I am new to this field, so could you please share with me a good reference to understand the difference between hifi-GAN and Mel-GAN?, I have checked a lot of references over the internet, but they were not that helpful!
Also I have another question related to the vocoder and speaker used, when I tried different combinations I have listened and was able to know that the vocoder HiFi-GAN and the speaker universal is the best combination,,but when I tried combination LJSpeech & HIFI-GAN , I received error that the file generator_LJSpeech.pth.tar does not exist, and when I checked the files and the code, I can see the code points to this directoryFastSpeech2/hifigan/generator_LJSpeech.pth.tar but , this file does not exist "generator_LJSpeech.pth.tar"

Error opening training file, File contains data in an unknown format.

Hi Ziad,
I tried running this script, which is available in the readme file, to train the MSA model:

python run_common_voice.py --model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="ar" --output_dir=/path/to/output/ --cache_dir=/path/to/cache --overwrite_output_dir="yes" --num_train_epochs="1" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --evaluation_strategy="steps" --learning_rate="3e-4" --warmup_steps="500" --fp16="no" --freeze_feature_extractor="yes" --save_steps="10" --eval_steps="10" --save_total_limit="1" --logging_steps="10" --group_by_length="no" --feat_proj_dropout="0.0" --layerdrop="0.1" --do_train="yes" --do_eval="yes" --max_train_samples 100 --max_val_samples 100

And I got this message:

Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 511, in
main()
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 400, in main
train_dataset = train_dataset.map(
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1955, in map
return self._map_single(
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 520, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 487, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\fingerprint.py", line 458, in wrapper
out = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2320, in map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2220, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1915, in decorated
result = f(decorated_item, *args, **kwargs)
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 394, in speech_file_to_array_fn
speech_array, sampling_rate = torchaudio.load(batch["path"])
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torchaudio\backend\soundfile_backend.py", line 197, in load
with soundfile.SoundFile(filepath, "r") as file
:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 629, in init
self._file = self._open(file, mode_int, closefd)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1183, in _open
_error_check(_snd.sf_error(file_ptr),
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1357, in error_check
raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '/path/to/cache\downloads\extracted\31455a499a0212b1751dd0c1547b0d360037f6a8c0a69178647a45a577d0ff67\cv-corpus-6.1-2020-12-11/ar/clips/common_voice_ar_19225971.mp3': File contains data in an unknown format
.

I think the reason behind it is that the training files are in .mp3 format instead of .wav.
Any suggestions on how I can tackle this problem?
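One possible workaround (an assumption, not from the repo): convert the Common Voice .mp3 clips to 16 kHz mono .wav files before training, e.g. with the third-party pydub package (which requires ffmpeg), so that the soundfile backend can open them; alternatively, switching torchaudio to a backend with mp3 support would also avoid the conversion.

from pathlib import Path
from pydub import AudioSegment

clips_dir = Path("cv-corpus-6.1-2020-12-11/ar/clips")  # hypothetical location
out_dir = Path("clips_wav")
out_dir.mkdir(exist_ok=True)

for mp3_path in clips_dir.glob("*.mp3"):
    audio = AudioSegment.from_mp3(str(mp3_path))
    audio = audio.set_frame_rate(16_000).set_channels(1)
    audio.export(str(out_dir / (mp3_path.stem + ".wav")), format="wav")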

Add license

Maybe it's better to add a license to the repo?

ASR transcribe() works only for the first 8 seconds

transcribe works for the first 8 seconds of the audio only

meaning if the text should've been:

........ السلام عليكم ورحمة الله وبركاته ......
the ASR outputs:
........ السلام عليكم ورحمة

assuming the 8 second mark is right after ورحمة
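A rough workaround sketch, under two assumptions: the recognizer exposes transcribe(path) on a 16 kHz wav file as in the demo notebook, and the cutoff comes from how much audio is fed to the model at once. Split the recording into fixed-length chunks, transcribe each, and join the text (a smarter split on silence boundaries would avoid cutting words in half):

import soundfile as sf

def transcribe_long(model, wav_path, chunk_seconds=8):
    """Transcribe a long recording chunk by chunk and join the results."""
    speech, sr = sf.read(wav_path)
    step = int(chunk_seconds * sr)
    texts = []
    for start in range(0, len(speech), step):
        sf.write("chunk.wav", speech[start:start + step], sr)  # temp file
        texts.append(model.transcribe("chunk.wav"))
    return " ".join(texts)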

Speech Recognition Error

Speech Recognition

OSError: Can't load config for 'Zaid/wav2vec2-large-xlsr-dialect-classification'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Zaid/wav2vec2-large-xlsr-dialect-classification' is the correct path to a directory containing a config.json file

[bug] conda: command not found

Installing in a Python pip environment that was not set up via conda throws an error.

logs :

Setting conda init...
./install.sh: line 40: conda: command not found
./install.sh: line 41: /etc/profile.d/conda.sh: No such file or directory
Setting environment... (envs/environment.yml)
./install.sh: line 48: conda: command not found
Activating environment... (klaam)
./install.sh: line 51: conda: command not found
Upgrading pip...
./install.sh: line 55: python: command not found
Updating poetry config...
./install.sh: line 59: python: command not found
Installing dependencies using poetry...
./install.sh: line 62: python: command not found

[Error] Module installation error while running in Colab.

I have tested the klaam/notebook/demo.ipynb file on Google Colab. It raised an error about missing modules.

Error Message:
klaamissue

When I install the missing modules manually using !pip install <module-name> it works well.
So, I think there is a problem with the !pip install -r requirements.txt.

I printed the output of the installation in a separate file:
klaamissue2

After some search, I wasn't able to solve the problem. I will be thankful if you can advise.

Timestamps

Thank you for this work -- شكرا! I tested this briefly and found the results to be quite good. Is there any way to get time-stamped results? (My use case is forced alignment)

[Proposal] Codebase refactoring

To organize the code and introduce testing and continuous integration, it would be beneficial to refactor the entire codebase.

TL;DR

  • Re-organizing the codebase to follow best practices and to introduce testing and continuous integration.
  • Separating the code so the logic can be imported as a separate package, scripts to hold the scripts used for training/inference, notebooks to hold demos and simple scripts written as notebooks, and tests to test the logic
  • Adding GitHub Actions to test the build and the package logic, auto-generate docs, and publish the package to PyPI
  • Moving from a pip and requirements.txt setup to conda for environment management and poetry for package management. This will ease development as the project scales.

Codebase refactoring

Mapping

FastSpeech2/* → moved to klaam/external/FastSpeech2/*
dialect_speech_corpus → moved to klaam/speech_corpus/dialect.py
egy_speech_corpus → moved to klaam/speech_corpus/egy.py
mor_speech_corpus → moved to klaam/speech_corpus/mor.py
samples → moved to samples
.gitignore → moved to .gitignore
LICENSE → moved to LICENSE
README.md → moved to README.md
audio_utils.py → moved to klaam/utils/audio.py
demo.ipynb → moved to notebooks/demo.ipynb
demo_with_mic.ipynb → moved to notebooks/demo_with_mix.ipynb
inference.ipynb → moved to notebooks/inference.ipynb
klaam.py → moved to klaam/run.py
klaam_logo.PNG → moved to misc/klaam_logo.png
models.py → moved to klaam/models/wav2vec.py
processors.py → moved to klaam/processors/custom_wave2vec.py
requirements.txt → removed
run.sh → moved to scripts/run.sh
run_classifier.py → moved to scripts/run_classifier.py
run_common_voice.py → moved to scripts/run_common_voice.py
run_mgb3.py → moved to scripts/run_mgb3.py
run_mgb5.py → moved to scripts/run_mgb5.py
sample_run.sh → moved to scripts/sample_run.sh
utils.py → moved to klaam/utils/utils.py
Added: docs, tests, .github, output, environment.yml, install.sh, mypy.ini, pyproject.toml, pytest.ini, ckpts
Tree Structure

.github/                      GitHub stuff (e.g. issue templates, GitHub Actions workflows, etc.)
    workflows/
        build.yml             to test building of the package
        publish.yml           to publish the package to PyPI
        tests.yml             to run tests
        docs.yml              to generate documentation
klaam/                        the logic for the package
    utils/
        audio.py
        utils.py
    models/
        wav2vec.py
    processors/
        wave2vec.py
    external/
        FastSpeech2/*
    speech_corpus/
        dialect.py
        egy.py
        mor.py
    run.py
notebooks/
    demo.ipynb
    demo_with_mix.ipynb
    inference.ipynb
scripts/                      scripts used to train/evaluate or anything external to the package logic
    run.sh
    run_classifier.py
    run_common_voice.py
    run_mgb3.py
    run_mgb5.py
    sample_run.sh
tests/                        tests for the logic within klaam
    test_*.py
    conftest.py
misc/
    klaam_logo.png
samples/
    demo.wav
ckpts/                        checkpoints of pre-trained models that were downloaded
docs/                         documentation files
output/
environment.yml               conda environment definition
install.sh                    installation script to set up the conda environment and install dependencies using poetry
mypy.ini                      mypy configuration
pyproject.toml                package definition and list of dependencies to be installed
pytest.ini                    pytest configuration
LICENSE
README.md
.gitignore

Environment/dependencies packages

  • conda is used to manage the environment and install essential libraries that are big/core to the package, e.g. TensorFlow, PyTorch, cudatoolkit, etc.
  • poetry is used to manage dependencies and setup the package
  • pytest is used to enable unit/integration testing of the codebase

Commands

  • poetry add PACKAGE - to add a package (this will append it to pyproject.toml)
    • If the package installation fails and you can't find another way to add it, install it using conda and add it to environment.yml manually (leave a comment next to the line)
    • Check the web for the right channels when installing packages with conda
  • poetry install - to install the package (package_name)
  • pytest tests - to run all tests manually
  • pytest tests/TEST_PATH - to run a specific test file (check the pytest documentation for more information)

Edit - added the following sections: env/dep packages and commands

A question

Hello everyone,

Does your implementation use FastSpeech (2s) or not?

I just want to make sure. Thank you for your work.

Error

When I run the training/final step, I get this error. Can you advise?
^CTraceback (most recent call last):
File "train.py", line 198, in
main(args, configs)
File "train.py", line 93, in main
nn.utils.clip_grad_norm_(model.parameters(), grad_clip_thresh)
File "/home/layan/.local/lib/python3.6/site-packages/torch/nn/utils/clip_grad.py", line 36, in clip_grad_norm_
total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)
File "/home/layan/.local/lib/python3.6/site-packages/torch/nn/utils/clip_grad.py", line 36, in
total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)
File "/home/layan/.local/lib/python3.6/site-packages/torch/functional.py", line 1293, in norm
return _VF.norm(input, p, dim=_dim, keepdim=keepdim) # type: ignore
File "/home/layan/.local/lib/python3.6/site-packages/torch/_VF.py", line 25, in getattr
def getattr(self, attr):


Error loading model

404 Client Error: Not Found for url: https://huggingface.co/Zaid/wav2vec2-large-xlsr-53-arabic-egyptian/resolve/main/tf_model.h5

OSError: Can't load weights for 'Zaid/wav2vec2-large-xlsr-53-arabic-egyptian'. Make sure that:

  • 'Zaid/wav2vec2-large-xlsr-53-arabic-egyptian' is a correct model identifier listed on 'https://huggingface.co/models'

  • or 'Zaid/wav2vec2-large-xlsr-53-arabic-egyptian' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

These 2 errors appear when I run it; however, I modified the code from:
if lang == 'egy':
    model_dir = 'Zaid/wav2vec2-large-xlsr-53-arabic-egyptian'
elif lang == 'msa':
    model_dir = 'elgeish/wav2vec2-large-xlsr-53-arabic'

to:

if lang == "egy":
    model_dir = Wav2Vec2ForCTC.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
elif lang == "msa":
    model_dir = Wav2Vec2ForCTC.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic")
    self.bw = True

as it's written on the Hugging Face site, but it's still not working. Thanks in advance.

Sampling rate modifications

Hello @zaidalyafeai

For our bachelor thesis, a friend and I started working on dialect classification a while ago. We recently came across your repo, and you are working with the same corpus as we did. We want to investigate how the length of the provided samples influences the trained classifier when using wav2vec-xlsr as the base model.

After some investigation of your code, we were wondering why you read only the first 20 seconds of each file. Is this not somewhat counterproductive, as we lose a lot of training data through that?

import librosa
import numpy as np
import soundfile as sf

def speech_file_to_array_fn(batch):
    start = 0
    stop = 20            # only the first 20 seconds of each file are read
    srate = 16_000
    speech_array, sampling_rate = sf.read(batch["file"], start=start * srate, stop=stop * srate)
    batch["speech"] = librosa.resample(np.asarray(speech_array), sampling_rate, srate)
    batch["sampling_rate"] = srate
    batch["parent"] = batch["label"]
    return batch

Did you preprocess your data by cutting it into smaller pieces so that each is at most 20 seconds long? Or is it possible to read in the whole files and generate batches according to the length of each file? The whole thing is not quite straightforward to implement.

FileNotFoundError: [Errno 2] No such file or directory: 'model.pth.tar'

Inference worked a few days ago but is not working anymore because of broken Google Drive weight file links.

Access denied with the following error:

Cannot retrieve the public link of the file. You may need to change
the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

 https://drive.google.com/uc?id=1J7ZP_q-6mryXUhZ-8j9-RIItz2nJGOIX 

FileNotFoundError Traceback (most recent call last)
in
49 speaker_pre_trained_path = "./klaam/data/model_weights/hifigan/generator_universal.pth.tar"
50
---> 51 ar_model = TextToSpeech(prepare_tts_model_path, model_config_path, train_config_path, vocoder_config_path, speaker_pre_trained_path,root_path)
52
53

5 frames
/content/./klaam/klaam/run.py in init(self, prepare_tts_model_path, model_config_path, train_config_path, vocoder_config_path, speaker_pre_trained_path, root_path)
67 self.vocoder_config_path = vocoder_config_path
68 self.speaker_pre_trained_path = speaker_pre_trained_path
---> 69 self.model, self.vocoder, self.configs = prepare_tts_model(
70 self.configs, self.vocoder_config_path, self.speaker_pre_trained_path
71 )

/content/./klaam/klaam/external/FastSpeech2/inference.py in prepare_tts_model(configs, vocoder_config_path, speaker_pre_trained_path)
57
58 # Get model
---> 59 model = get_model_inference(configs, DEVICE, train=False)
60
61 # Load vocoder

/content/./klaam/klaam/external/FastSpeech2/utils/model.py in get_model_inference(configs, device, train)
42 if not os.path.exists(ckpt_path):
43 gdown.download(url, ckpt_path, quiet=False)
---> 44 ckpt = torch.load(ckpt_path, map_location=torch.device("cpu"))
45 model.load_state_dict(ckpt["model"])
46

/usr/local/lib/python3.8/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
697 pickle_load_args['encoding'] = 'utf-8'
698
--> 699 with _open_file_like(f, 'rb') as opened_file:
700 if _is_zipfile(opened_file):
701 # The zipfile reader is going to advance the current file position.

/usr/local/lib/python3.8/dist-packages/torch/serialization.py in _open_file_like(name_or_buffer, mode)
228 def _open_file_like(name_or_buffer, mode):
229 if _is_path(name_or_buffer):
--> 230 return _open_file(name_or_buffer, mode)
231 else:
232 if 'w' in mode:

/usr/local/lib/python3.8/dist-packages/torch/serialization.py in init(self, name, mode)
209 class _open_file(_opener):
210 def init(self, name, mode):
--> 211 super(_open_file, self).init(open(name, mode))
212
213 def exit(self, *args):

FileNotFoundError: [Errno 2] No such file or directory: 'model.pth.tar'

Functionality to split/align audio segments for training

The audio in two of the datasets we are using (MGB3 and MGB5) comes in long sequences of tens of minutes. This is impractical for training on any GPU. Longer audio sequences will result in out-of-memory errors on GPUs even with a small batch size.

The solution is to split the audio into smaller audio segments of 15 to 30 seconds depending on the hardware used (GPU memory to a large extent).

This issue is to track adding a functionality to split the audio into smaller chunks that can fit into a GPU.
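A minimal sketch of one way to do this (an assumption, not the repo's implementation): use librosa's silence detection to find natural break points, then write out segments capped at roughly 20 seconds each.

import librosa
import soundfile as sf

def split_long_audio(path, out_prefix, max_seconds=20, top_db=30):
    speech, sr = librosa.load(path, sr=16_000)
    intervals = librosa.effects.split(speech, top_db=top_db)  # non-silent spans
    max_len = int(max_seconds * sr)
    idx = 0
    for start, end in intervals:
        for s in range(start, end, max_len):  # cap each piece at max_len samples
            sf.write(f"{out_prefix}_{idx:04d}.wav",
                     speech[s:min(s + max_len, end)], sr)
            idx += 1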

ASR outputs "v" instead of "ث"

I'm not sure where the problem occurs exactly, but this is the only letter affected: ث is always transcribed as "v".

Perhaps check the vocabulary files
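A quick check sketch (assuming the Egyptian model id mentioned elsewhere in this repo is still available): print the CTC vocabulary to see which symbol ث maps to. If the vocabulary stores a Buckwalter-style transliteration, ث is written as "v", which would explain the output.

from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("Zaid/wav2vec2-large-xlsr-53-arabic-egyptian")
vocab = processor.tokenizer.get_vocab()
print(sorted(vocab.keys()))
print("'v' in vocab:", "v" in vocab, "| 'ث' in vocab:", "ث" in vocab)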
