tts-arabic-pytorch

TTS models (Tacotron2, FastPitch), trained on Nawar Halabi's Arabic Speech Corpus, including the HiFi-GAN vocoder for direct TTS inference.

Papers:

Tacotron2 | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (arXiv)

FastPitch | FastPitch: Parallel Text-to-speech with Pitch Prediction (arXiv)

HiFi-GAN | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (arXiv)

Audio Samples

You can listen to some audio samples here.

Multispeaker model (in progress)

Multispeaker weights are available for the FastPitch model. Currently, another male voice and two female voices have been added. Audio samples can be found here. Download the weights here. An ONNX version of this model is also available.

The multispeaker dataset was created by synthesizing data with Coqui's XTTS-v2 model and a mix of voices from the Tunisian_MSA dataset.
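
A minimal sketch of selecting a voice with the multispeaker weights (the checkpoint filename below is hypothetical; use the file from the download link above):

from models.fastpitch import FastPitch2Wave

# The checkpoint filename is hypothetical; use the multispeaker
# weights downloaded from the link above.
model = FastPitch2Wave('pretrained/fastpitch_ar_ms.pth')
model = model.cuda()

# speaker_id selects among the available voices (see the tts options below).
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.", speaker_id=1)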

Quick Setup

The models were trained with the MSE loss as described in the papers. I also trained the models using an additional adversarial loss (adv). The difference is not large, but I think the (adv) version often sounds a bit clearer. You can compare them yourself.

Running python download_files.py will download all pretrained weights. Alternatively:

Download the pretrained weights for the Tacotron2 model (mse | adv).

Download the pretrained weights for the FastPitch model (mse | adv).

Download the HiFi-GAN vocoder weights (link). Either put them into pretrained/hifigan-asc-v1 or edit the following lines in configs/basic.yaml.

# vocoder
vocoder_state_path: pretrained/hifigan-asc-v1/hifigan-asc.pth
vocoder_config_path: pretrained/hifigan-asc-v1/config.json

This repo includes the diacritization models Shakkala and Shakkelha.

The weights can be downloaded here. There also exists a separate repo and package.

-> Alternatively, download all models and put the contents of the zip file into the pretrained folder.

Required packages:

torch torchaudio pyyaml

~ for training: librosa matplotlib tensorboard

~ for the demo app: fastapi "uvicorn[standard]"
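
Install the core packages with: pip install torch torchaudio pyyaml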

Using the models

The Tacotron2/FastPitch classes from models.tacotron2/models.fastpitch are wrappers that simplify text-to-mel inference. The Tacotron2Wave/FastPitch2Wave models include the HiFi-GAN vocoder for direct text-to-speech inference.

Inference options

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."

wave = model.tts(
    text_input = text, # input text
    speed = 1, # speaking speed
    denoise = 0.005, # HiFi-GAN denoiser strength
    speaker_id = 0, # speaker id
    batch_size = 2, # batch size for batched inference
    vowelizer = None, # vowelizer model
    pitch_mul = 1, # pitch multiplier (for FastPitch)
    pitch_add = 0, # pitch offset (for FastPitch)
    return_mel = False # return mel spectrogram?
)

Inferring the Mel spectrogram

from models.tacotron2 import Tacotron2
model = Tacotron2('pretrained/tacotron2_ar_adv.pth')
model = model.cuda()
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

from models.fastpitch import FastPitch
model = FastPitch('pretrained/fastpitch_ar_adv.pth')
model = model.cuda()
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

End-to-end Text-to-Speech

from models.tacotron2 import Tacotron2Wave
model = Tacotron2Wave('pretrained/tacotron2_ar_adv.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

wave_list = model.tts(["صِفر" ,"واحِد" ,"إِثنان", "ثَلاثَة" ,"أَربَعَة" ,"خَمسَة", "سِتَّة" ,"سَبعَة" ,"ثَمانِيَة", "تِسعَة" ,"عَشَرَة"])

from models.fastpitch import FastPitch2Wave
model = FastPitch2Wave('pretrained/fastpitch_ar_adv.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

wave_list = model.tts(["صِفر" ,"واحِد" ,"إِثنان", "ثَلاثَة" ,"أَربَعَة" ,"خَمسَة", "سِتَّة" ,"سَبعَة" ,"ثَمانِيَة", "تِسعَة" ,"عَشَرَة"])
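
To save a synthesized waveform, soundfile can be used (a sketch, assuming wave converts to a 1-D NumPy array; 22050 Hz is the models' sample rate):

import soundfile as sf

# Save the wave returned by model.tts() above;
# the models generate audio at 22050 Hz.
sf.write('audio.wav', wave, 22050)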

By default, Arabic letters are converted using the Buckwalter transliteration, which can also be used directly.

wave = model.tts(">als~alAmu Ealaykum yA Sadiyqiy.")
wave_list = model.tts(["Sifr", "wAHid", "<i^nAn", "^alA^ap", ">arbaEap", "xamsap", "sit~ap", "sabEap", "^amAniyap", "tisEap", "Ea$arap"])

Unvocalized text

text_unvoc = "اللغة العربية هي أكثر اللغات السامية تحدثا، وإحدى أكثر اللغات انتشارا في العالم"
wave_shakkala = model.tts(text_unvoc, vowelizer='shakkala')
wave_shakkelha = model.tts(text_unvoc, vowelizer='shakkelha')

Inference from text file

python inference.py
# default parameters:
python inference.py --list data/infer_text.txt --out_dir samples/results --model fastpitch --checkpoint pretrained/fastpitch_ar_adv.pth --batch_size 2 --denoise 0

Testing the model

To test the model run:

python test.py
# default parameters:
python test.py --model fastpitch --checkpoint pretrained/fastpitch_ar_adv.pth --out_dir samples/test

Processing details

This repo uses Nawar Halabi's Arabic-Phonetiser but simplifies the result such that different contexts are ignored (see text/symbols.py). Further, a doubled consonant is represented as consonant + doubling-token.
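
To illustrate the doubling convention (a standalone sketch, not the repo's code; the token name is hypothetical, see text/symbols.py for the actual symbols):

# Standalone illustration: a doubled consonant is represented as the
# consonant followed by a doubling token instead of being repeated.
# The token name is hypothetical; see text/symbols.py for the real one.
DOUBLING_TOKEN = '<dbl>'

def encode_doubling(phonemes):
    out = []
    for p in phonemes:
        # Replace the second of two identical consecutive phonemes.
        out.append(DOUBLING_TOKEN if out and out[-1] == p else p)
    return out

print(encode_doubling(['s', 's', 'a']))  # -> ['s', '<dbl>', 'a']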

The Tacotron2 model can sometimes struggle to pronounce the last phoneme of a sentence when it ends in an unvocalized consonant. The pronunciation is more reliable if one appends a word-separator token at the end and cuts it off using the alignment weights (details in models.networks). This option is implemented as a default postprocessing step that can be disabled by setting postprocess_mel=False.
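
For example, a sketch of disabling this step (using the Tacotron2 model from above; the flag is passed with the inference call):

# Keep the raw mel output by skipping the default postprocessing step.
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.", postprocess_mel=False)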

Training the model

Before training, the audio files must be resampled. The model was trained after preprocessing the files using scripts/preprocess_audio.py.
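
A minimal sketch of what the resampling does (paths are hypothetical, and the 22050 Hz target matches the sample rate used elsewhere in this repo; scripts/preprocess_audio.py is the authoritative version):

import torchaudio

# Resample one corpus file to 22050 Hz; paths are hypothetical, and
# scripts/preprocess_audio.py performs the actual preprocessing.
wave, sr = torchaudio.load('data/arabic-speech-corpus/wav/example.wav')
wave = torchaudio.functional.resample(wave, orig_freq=sr, new_freq=22050)
torchaudio.save('data/arabic-speech-corpus/wav_new/example.wav', wave, 22050)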

To train the model with options specified in the config file run:

python train.py
# default parameters:
python train.py --config configs/nawar.yaml

Web app

The web app uses the FastAPI library. To run the app you need the following packages:

fastapi: for the backend api | uvicorn: for serving the app

Install with: pip install fastapi "uvicorn[standard]"

Run with: python app.py


Acknowledgements

I referred to NVIDIA's Tacotron2 implementation for details on model training.

The FastPitch files stem from NVIDIA's DeepLearningExamples.


tts-arabic-pytorch's Issues

Question about the additional adversarial loss

Good job! Looking forward to your repo!
I also think that the (adv) version often sounds a bit clearer, but I have some questions:

  1. Is it joint training of Tacotron2 and HiFi-GAN, or are there other details? I'm curious about the fastpitch_ar_adv.pth model.
  2. When it is convenient to open-source more code details, I would like to further verify the quality in other languages.

I'm looking forward to your reply.

Python.net

Hi,

Amazing principle and application, but I tried to use python app.py and got many errors on CPU!

Can you please make a sample using pythonnet (this is the lib)?
I can help with C# and the web app.

Thank you

Issues with Tacotron2 Training on Nawar Halabi's Arabic Speech Corpus

Hello,

I've been attempting to train Tacotron2 using Nawar Halabi's Arabic Speech Corpus; I'm trying to make sure everything works fine with this corpus before moving on to my own data. Despite following the instructions for preprocessing and training with the provided configurations, I'm experiencing issues with the quality of the synthesized speech.

Training Details

  • Preprocessed the audio files using scripts/preprocess_audio.py (after fixing minor issues).
  • Trained the model using the configs/nawar.yaml configuration file.
  • Tried with both train_tc2.py and train_tc2_adv.py.
  • I'm using a single GeForce RTX 3090 GPU and Ubuntu 22.04.2 LTS.

Configuration Files

nawar.yaml

restore_model: ''

log_dir: logs/exp2
checkpoint_dir: checkpoints/exp2

train_wavs_path: /media/hayder/Disk2/development/tts-arabic-pytorch_test_2/data/arabic-speech-corpus/wav_new
train_labels: ./data/train_phon.txt

test_wavs_path: /media/hayder/Disk2/development/tts-arabic-pytorch_test_2/data/arabic-speech-corpus/test set/wav_new
test_labels: ./data/test_phon.txt

balanced_sampling: False
sampler_weights_file: ./data/sampler/sampler_weights

cache_dataset: False

n_save_states_iter: 10
n_save_backup_iter: 1000

basic.yaml

# training
epochs: 500
decoder_max_step: 3000

random_seed: False

batch_size: 8
learning_rate: 1.0e-3
weight_decay: 1.0e-6
grad_clip_thresh: 1.0

cache_dataset: True
use_cuda_if_available: True

balanced_sampling: False

# vocoder
vocoder_state_path: pretrained/hifigan-asc-v1/hifigan-asc.pth
vocoder_config_path: pretrained/hifigan-asc-v1/config.json

# diacritizers
shakkala_path: pretrained/diacritizers/shakkala_second_model6.pth
shakkelha_path: pretrained/diacritizers/shakkelha_rnn_3_big_20.pth

Testing Script

from models.tacotron2 import Tacotron2Wave
import soundfile as sf
import playsound

model = Tacotron2Wave('/media/hayder/Disk2/development/tts-arabic-pytorch_test_2/checkpoints/exp2/states.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي")

sf.write('audio.wav', wave, 22050)
playsound.playsound('audio.wav')

Inaccurate Synthesized Output

The generated speech does not align with the expected pronunciation. I've included a WAV file example of the synthesized speech for the sentence "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي" here.

Could you please assist me in identifying the possible issues? If you need additional information I'll promptly provide it.

Thanks.

Train configs

Good afternoon. Could you please describe in more detail how to train the model from scratch?

If possible, in the form of a notebook.

Phonetising another dataset

Hello

Thank you for your fast reply.

Another question: my dataset is Egyptian Arabic, and the transcript is written in Arabic graphemes without diacritization, so I used your script to convert the Arabic to Buckwalter and then Buckwalter to phonemes. I didn't make any changes to the symbols.

After how many epochs should I expect a satisfying result?

Thank you

Question Regarding pitch_dict.pt File

Hello.

There's a file named pitch_dict.pt used in FastPitch training. What is this file, and where can I get it?

I'm trying to train FastPitch to check whether the issue in #10 is also present here, or whether it only affects Tacotron2.

Thank you.

New Dataset

Hello

I have a new dataset that I want to train on, so I need your advice on what should be changed in this repo.

The dataset is Egyptian Arabic.

AttributeError: 'float' object has no attribute 'items'

Hello,

First and foremost, thank you for an incredible repository for a language like Arabic, where such resources are sorely lacking. I have a problem: I got this error when I created a new dataset. The error occurs when I have more than 7 lines in train_phon.txt. Please help.

python train.py
[==============================] 100.0%
[==============================] 100.0%
Epoch: 0
C:\Users\alghr\Dev\Python\tts-arabic-pytorch\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
loss: 48.50394821166992, grad_norm: 27.04533576965332
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    main()
  File "train.py", line 195, in main
    training_loop(model,
  File "train.py", line 102, in training_loop
    writer.add_training_data(loss.item(), grad_norm.item(),
  File "C:\Users\alghr\Dev\Python\tts-arabic-pytorch\arabic\utils\logging.py", line 12, in add_training_data
    for k, v in meta.items():
AttributeError: 'float' object has no attribute 'items'

How can I implement the code with a different dataset?

Hello,

First, thank you for this amazing repository, which fills a real gap for a language like Arabic. I have a question: can I train the model with a different dataset rather than the Arabic Speech Corpus? If yes, could you please help me with the steps and what I should change in the code?

Thank you in advance.

Arabic Dataset

Hello.
I created my own Arabic dataset, including texts and voices. What is the right way to fine-tune on it, please?

KeyError: 'i0i0' during inference

Hello,
Thank you for sharing this awesome Arabic TTS model. I need some help from you, please: this model can't read samples like the one below.

sample = 'وَتَعَاوَنُوا عَلَى البِرِّ وَالتَّقْوَى'
wave = ar_model.tts(text_buckw = sample)

It gives me an error like this:


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-c5e258748198> in <module>
      1 sample = 'وَتَعَاوَنُوا عَلَى البِرِّ وَالتَّقْوَى'
----> 2 wave = ar_model.tts(text_buckw = sample)
      3 wave = wave * 32768.0
      4 wave

6 frames
/content/tts-arabic-tacotron2/model/networks.py in tts(self, text_buckw, batch_size, speed, postprocess_mel, return_mel)
    286             return self.tts_single(text_buckw, speed=speed,
    287                                    postprocess_mel=postprocess_mel,
--> 288                                    return_mel=return_mel)
    289 
    290         # input: list

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25         def decorate_context(*args, **kwargs):
     26             with self.clone():
---> 27                 return func(*args, **kwargs)
     28         return cast(F, decorate_context)
     29 

/content/tts-arabic-tacotron2/model/networks.py in tts_single(self, text_buckw, speed, postprocess_mel, return_mel)
    242                    return_mel=False):
    243 
--> 244         mel_spec = self.model.ttmel_single(text_buckw, postprocess_mel)
    245         if speed is not None:
    246             mel_spec = resize_mel(mel_spec, rate=speed)

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25         def decorate_context(*args, **kwargs):
     26             with self.clone():
---> 27                 return func(*args, **kwargs)
     28         return cast(F, decorate_context)
     29 

/content/tts-arabic-tacotron2/model/networks.py in ttmel_single(self, utterance, postprocess_mel)
    115             process_mel = True
    116 
--> 117         token_ids = text.tokens_to_ids(tokens)
    118         ids_batch = torch.LongTensor(token_ids).unsqueeze(0).to(self.device)
    119 

/content/tts-arabic-tacotron2/text/__init__.py in tokens_to_ids(phonemes)
     23 
     24 def tokens_to_ids(phonemes):
---> 25     return [phon_to_id[phon] for phon in phonemes]
     26 
     27 

/content/tts-arabic-tacotron2/text/__init__.py in <listcomp>(.0)
     23 
     24 def tokens_to_ids(phonemes):
---> 25     return [phon_to_id[phon] for phon in phonemes]
     26 
     27 

KeyError: 'i0i0'

How do I deal with such unknown phonemes? Do you have a quick solution for this issue? Thanks in advance.

Vocoder

I am asking about the vocoder: should I use the vocoder that you mention (i.e. download it), or should I change the vocoder for a new dataset?

No CUDA - New torch.nn.utils.parametrizations.weight_norm

\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
Traceback (most recent call last):
  \app.py", line 13, in <module>
    tts_manager = TTSManager('app/static')

  Python310\lib\site-packages\torch\serialization.py", line 258, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
