open-speech-ekstep / vakyansh-tts

Text to Speech for Indic languages

License: MIT License

Python 69.42% Cython 0.24% Shell 0.79% Jupyter Notebook 29.55%

vakyansh-tts

1. Installation and Setup for training

Clone the repo. Note: for multispeaker glow-tts training, use the multispeaker branch.

git clone https://github.com/Open-Speech-EkStep/vakyansh-tts

Build conda virtual environment

cd ./vakyansh-tts
conda create --name <env_name> python=3.7
conda activate <env_name>
pip install -r requirements.txt

Install apex (commit 37cdaf4) for mixed-precision training

Note: used only for glow-tts

cd ..
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 37cdaf4
pip install -v --disable-pip-version-check --no-cache-dir ./
cd ../vakyansh-tts

Build Monotonic Alignment Search Code (Cython)

Note: used only for glow-tts

bash install.sh

2. Data Resampling

The dataset should consist of a folder containing all the .wav files and a text file listing each filename with its sentence.

Directory structure:

language_folder_name
|-- ./wav/*.wav
|-- ./text_file_name.txt

The format of text_file_name.txt (the text file is needed only for glow-tts training):

( audio1.wav "Sentence1." )
( audio2.wav "Sentence2." )
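
Before running data preparation, it can help to confirm that every line of the text file follows the format above. The following is a minimal sketch, not part of the repository: the function name `parse_transcript_lines` and the regular expression are assumptions based only on the two example lines shown.

```python
import re
from typing import List, Tuple

# Matches lines such as: ( audio1.wav "Sentence1." )
# Assumption: the filename contains no spaces and the sentence is wrapped in double quotes.
LINE_PATTERN = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)$')


def parse_transcript_lines(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (wav_filename, sentence) pairs; raise ValueError on a malformed line."""
    pairs = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(f"line {line_no} does not match the expected format: {line!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


sample = ['( audio1.wav "Sentence1." )', '( audio2.wav "Sentence2." )']
print(parse_transcript_lines(sample))
# → [('audio1.wav', 'Sentence1.'), ('audio2.wav', 'Sentence2.')]
```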

To resample the .wav files to a 22050 Hz sample rate, change the following parameters in vakyansh-tts/scripts/data/resample.sh:

input_wav_path : absolute path to wav file folder in vakyansh_tts/data/
output_wav_path : absolute path to vakyansh_tts/data/resampled_wav_folder_name
output_sample_rate : 22050 (or any other desired sample rate)

To run:

cd scripts/data/
bash resample.sh
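
After resampling, a quick check that every file ended up at the target rate can save a failed training run. The sketch below is an illustration only and is not part of the repository: the function name `find_wrong_sample_rates` is hypothetical, and it relies on the standard-library `wave` module, which reads PCM .wav files only.

```python
import wave
from pathlib import Path
from typing import Dict


def find_wrong_sample_rates(wav_dir: str, expected_rate: int = 22050) -> Dict[str, int]:
    """Map each .wav file whose sample rate differs from expected_rate to its actual rate."""
    mismatches = {}
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        # Only the header is read here; the audio frames are not loaded.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            mismatches[wav_path.name] = rate
    return mismatches
```

Calling `find_wrong_sample_rates("/path/to/resampled_wav_folder_name")` returns an empty dict when every file is at 22050 Hz; any entries it does return name the files that still need resampling.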

3. Spectrogram Training (glow-tts)

3.1 Data Preparation

To prepare the data, edit the vakyansh-tts/scripts/glow/prepare_data.sh file and change the following parameters:

input_text_path : absolute path to vakyansh_tts/data/text_file_name.txt
input_wav_path : absolute path to vakyansh_tts/data/resampled_wav_folder_name
gender : female or male voice

To run:

cd scripts/glow/
bash prepare_data.sh

3.2 Training glow-tts

To start spectrogram training, edit the vakyansh-tts/scripts/glow/train_glow.sh file and change the following parameter:

gender : female or male voice

Make sure the gender matches the one set in prepare_data.sh.

To start the training, run:

cd scripts/glow/
bash train_glow.sh

4. Vocoder Training (hifi-gan)

4.1 Data Preparation

To prepare the data, edit the vakyansh-tts/scripts/hifi/prepare_data.sh file and change the following parameters:

input_wav_path : absolute path to vakyansh_tts/data/resampled_wav_folder_name
gender : female or male voice

To run:

cd scripts/hifi/
bash prepare_data.sh

4.2 Training hifi-gan

To start vocoder training, edit the vakyansh-tts/scripts/hifi/train_hifi.sh file and change the following parameter:

gender : female or male voice

Make sure the gender matches the one set in prepare_data.sh.

To start the training, run:

cd scripts/hifi/
bash train_hifi.sh

5. Inference

5.1 Using Gradio

To launch the Gradio interface, edit the following parameters in the vakyansh-tts/scripts/inference/gradio.sh file:

gender : female or male voice
device : cpu or cuda
lang : language code

To run:

cd scripts/inference/
bash gradio.sh

5.2 Using FastAPI

To serve the model through FastAPI, edit the parameters in the vakyansh-tts/scripts/inference/api.sh file as described in section 5.1.

To run:

cd scripts/inference/
bash api.sh

5.3 Direct Inference using text

To run inference directly, edit the parameters in the vakyansh-tts/scripts/inference/infer.sh file as described in section 5.1 and assign the input text to the text variable.

To run:

cd scripts/inference/
bash infer.sh

An advanced inference script exposes further parameters. Additional parameters:

noise_scale : noise factor; can vary from 0 to 1
length_scale : changes the speed of the generated audio; can vary from 0 to 2
transliteration : whether to switch on/off transliteration. 1: ON, 0: OFF
number_conversion : whether to switch on/off number to words conversion. 1: ON, 0: OFF
split_sentences : whether to switch on/off splitting of sentences. 1: ON, 0: OFF

To run:

cd scripts/inference/
bash advanced_infer.sh
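
The ranges listed above can be checked before the values are written into the script. This is a minimal sketch under stated assumptions: the function `validate_inference_params` is hypothetical and not part of the repository, and treating the range endpoints as inclusive is a guess, since the text only says "from 0 to 1" and "from 0 to 2".

```python
def validate_inference_params(noise_scale: float, length_scale: float,
                              transliteration: int, number_conversion: int,
                              split_sentences: int) -> None:
    """Raise ValueError if a value falls outside the documented ranges."""
    if not 0.0 <= noise_scale <= 1.0:
        raise ValueError(f"noise_scale must be between 0 and 1, got {noise_scale}")
    if not 0.0 <= length_scale <= 2.0:
        raise ValueError(f"length_scale must be between 0 and 2, got {length_scale}")
    # The three switches are documented as 1 (ON) or 0 (OFF).
    flags = (("transliteration", transliteration),
             ("number_conversion", number_conversion),
             ("split_sentences", split_sentences))
    for name, flag in flags:
        if flag not in (0, 1):
            raise ValueError(f"{name} must be 0 (OFF) or 1 (ON), got {flag}")


# Illustrative values only; they are not defaults taken from the repository.
validate_inference_params(noise_scale=0.5, length_scale=1.0,
                          transliteration=1, number_conversion=1, split_sentences=0)
```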

5.4 Installation of tts_infer package

The tts_infer package currently has two components:

1. Transliteration (AI4Bharat's open-source models) (Languages supported: {'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'pa', 'gom', 'mai', 'ml', 'sd', 'si', 'ur'} )

2. Num to Word (Languages supported: {'en', 'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'or', 'pa'} )

git clone https://github.com/Open-Speech-EkStep/vakyansh-tts
cd vakyansh-tts
bash install.sh
python setup.py bdist_wheel
pip install -e .
cd tts_infer
wget https://storage.googleapis.com/vakyansh-open-models/translit_models.zip && unzip -q translit_models.zip

Usage: refer to the example file in tts_infer/

from tts_infer.tts import TextToMel, MelToWav
from tts_infer.transliterate import XlitEngine
from tts_infer.num_to_word_on_sent import normalize_nums

import re
from scipy.io.wavfile import write

text_to_mel = TextToMel(glow_model_dir='/path/to/glow-tts/checkpoint/dir', device='cuda')
mel_to_wav = MelToWav(hifi_model_dir='/path/to/hifi/checkpoint/dir', device='cuda')

def translit(text, lang):
    reg = re.compile(r'[a-zA-Z]')
    engine = XlitEngine(lang)
    words = [engine.translit_word(word, topk=1)[lang][0] if reg.match(word) else word for word in text.split()]
    updated_sent = ' '.join(words)
    return updated_sent
    
def run_tts(text, lang):
    text = text.replace('।', '.') # only for hindi models
    text_num_to_word = normalize_nums(text, lang) # converting numbers to words in lang
    text_num_to_word_and_transliterated = translit(text_num_to_word, lang) # transliterating english words to lang
    
    mel = text_to_mel.generate_mel(text_num_to_word_and_transliterated)
    audio, sr = mel_to_wav.generate_wav(mel)
    write(filename='temp.wav', rate=sr, data=audio) # for saving wav file, if needed
    return (sr, audio)
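
The `translit` helper above decides per token whether to transliterate: `reg.match` anchors at the start of the word, so only tokens that begin with a Latin letter are sent to the engine. The sketch below isolates that routing so it can be tried without the models; `translit_with` and the marker lambda are stand-ins introduced here for illustration and are not part of tts_infer.

```python
import re
from typing import Callable

# Same pattern as in translit(): a single Latin letter, matched at the start of the token.
LATIN_START = re.compile(r'[a-zA-Z]')


def translit_with(text: str, translit_word: Callable[[str], str]) -> str:
    """Apply translit_word to each whitespace-separated token that starts with a Latin letter."""
    return ' '.join(translit_word(word) if LATIN_START.match(word) else word
                    for word in text.split())


# Stand-in for the real engine: wrap the tokens that would be transliterated.
marked = translit_with('मेरा phone 2nd है', lambda word: f'<{word}>')
print(marked)
# → मेरा <phone> 2nd है
```

Note that a token such as `2nd` starts with a digit, so it is passed through unchanged even though it contains Latin letters.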

vakyansh-tts's Issues

Missing files while inference

default_lineup.json is not available
File "/home/vivek/ml/tushar/indic_tts/vakyansh-tts/utils/inference/transliterate.py", line 720, in __init__
lineup = json.load(open(os.path.join(F_DIR, config_path), encoding="utf-8"))

Error in setting up the environment

When I try to run "pip install -r requirements.txt" in a virtual environment in WSL, I get this error:
Failed to build mosestokenizer ffmpy flask-cachebuster docopt toolwrapper uctools

Could you help me resolve this?
(Note: this does not happen in Google Colab.)

gpu usage during inference

Is it compulsory to use a GPU during inference when using your models?

Cannot start multispeaker training

Can you please advise which parameters need to be changed to start multispeaker training? I have a dataset of 4 speakers and have transformed the data into the required format using your script, but when I start training I get this error:
Traceback (most recent call last):
File "../src/glow_tts/init.py", line 82, in <module>
main()
File "../src/glow_tts/init.py", line 69, in main
_ = generator(x, x_lengths, y, y_lengths, gen=False, g=sid)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/models.py", line 356, in forward
z, logdet = self.decoder(y, z_mask, g=g, reverse=False)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/models.py", line 198, in forward
x, logdet = f(x, x_mask, g=g, reverse=reverse)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/attentions.py", line 128, in forward
x = self.wn(x, x_mask, g)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/modules.py", line 145, in forward
g = self.cond_layer(g)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'WN' object has no attribute 'cond_layer'
INFO:root:{'train': {'use_cuda': True, 'log_interval': 20, 'seed': 1234, 'epochs': 10000, 'learning_rate': 1.0, 'betas': [0.9, 0.98], 'eps': 1e-09, 'warmup_steps': 4000, 'scheduler': 'noam', 'batch_size': 16, 'ddi': True, 'fp16_run': True, 'save_epoch': 1}, 'data': {'load_mel_from_disk': False, 'training_files': '/home/saad/hindi_glow/data/training/train.txt', 'validation_files': '/home/saad/hindi_glow/data/training/valid.txt', 'chars': '/home/saad/hindi_glow/data/training/chars.txt', 'punc': '/home/saad/hindi_glow/data/training/punc.txt', 'text_cleaners': ['basic_indic_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 80.0, 'mel_fmax': 7600.0, 'add_noise': True, 'add_blank': True}, 'model': {'hidden_channels': 192, 'filter_channels': 768, 'filter_channels_dp': 256, 'kernel_size': 3, 'p_dropout': 0.1, 'n_blocks_dec': 12, 'n_layers_enc': 6, 'n_heads': 2, 'p_dropout_dec': 0.05, 'dilation_rate': 1, 'kernel_size_dec': 5, 'n_block_layers': 4, 'n_sqz': 2, 'prenet': True, 'mean_only': True, 'hidden_channels_enc': 192, 'hidden_channels_dec': 192, 'window_size': 4}, 'model_dir': '/home/saad/hindi_glow/results/', 'log_dir': '/home/saad/hindi_glow/logs/'}
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/saad/hindi_glow/src/glow_tts/train.py", line 109, in train_and_eval
utils.latest_checkpoint_path(hps.model_dir, "G_.pth"),
File "/home/saad/hindi_glow/src/glow_tts/utils.py", line 83, in latest_checkpoint_path
x = f_list[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/saad/hindi_glow/src/glow_tts/train.py", line 120, in train_and_eval
os.path.join(hps.model_dir, "ddi_G.pth"), generator, optimizer_g
File "/home/saad/hindi_glow/src/glow_tts/utils.py", line 27, in load_checkpoint
optimizer.load_state_dict(checkpoint_dict["optimizer"])
File "/home/saad/hindi_glow/src/glow_tts/commons.py", line 176, in load_state_dict
self._optim.load_state_dict(d)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/optim/optimizer.py", line 116, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Fine tuning

Hi,
Great work. I am a bit confused: can we do fine-tuning with a custom single-speaker dataset? If so, please advise how.

Model can't speak decimal numbers in Hindi Model

When a float number is given in the text, the model crashes.
It shows this error:
assert digit in all_num["en"], "Give proper input" AssertionError: Give proper input

Finetuning the model from pretrained English TTS checkpoint causing issues

Hi,

When I try to fine-tune from the pre-trained English Glow model checkpoint, the gradients become inf/NaN in spite of the gradient clipping defined in the code.

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
INFO:male:Train Epoch: 1 [0/4714 (0%)]  Loss: 9.467967
INFO:male:[8.355452537536621, 1.1125144958496094, 0, 5.70544330734548e-07]
grad_norm:
nan
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
...

In addition, soon afterwards the Adam optimizer throws the error below:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/vakyansh-tts/src/glow_tts/train.py", line 125, in train_and_eval
    rank, epoch, hps, generator, optimizer_g, train_loader, logger, writer
  File "/home/vakyansh-tts/src/glow_tts/train.py", line 186, in train
    optimizer_g.step()
  File "/home/vakyansh-tts/src/glow_tts/commons.py", line 169, in step
    self._optim.step()
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/optim/adam.py", line 119, in step
    group['eps']
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/optim/functional.py", line 86, in adam
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: The size of tensor a (35) must match the size of tensor b (34) at non-singleton dimension 0

On checking further, it seems that the difference of 1 in the tensor size is as expected for the Adam optimizer.
Would it be possible to help look into this?

Thanks,
Aalisha

Error in TTS generation for the number '1998' in Tamil

TTS generation for the number '1998' in Tamil is wrong; it should speak "ஓர் ஆயிரத்து தொள்ளாயிரத்து தொன்னூற்றெட்டு" as in
https://tamilpesu.us/en/number/1998/?type=IN

The current implementation in vakyansh-tts speaks it incorrectly.

Voice Cloning in TTS

Is there a way to add voice cloning while synthesizing speech, something similar to SV2TTS? If I have a trained speaker encoder, is it possible to do real-time voice cloning using your trained glow-tts mel synthesizer and hifi-gan vocoder?
