vakyansh-tts's Introduction

vakyansh-tts

Text to Speech for Indic languages

1. Installation and Setup for training

Clone the repo. Note: for multispeaker glow-tts training, use the multispeaker branch.

git clone https://github.com/Open-Speech-EkStep/vakyansh-tts

Build conda virtual environment

cd ./vakyansh-tts
conda create --name <env_name> python=3.7
conda activate <env_name>
pip install -r requirements.txt

Install apex (commit 37cdaf4) for mixed-precision training

Note : used only for glow-tts

cd ..
git clone https://github.com/NVIDIA/apex
cd apex
git checkout 37cdaf4
pip install -v --disable-pip-version-check --no-cache-dir ./
cd ../vakyansh-tts

Build Monotonic Alignment Search Code (Cython)

Note : used only for glow-tts

bash install.sh

2. Data Resampling

The dataset should consist of a folder containing all the .wav files and, for glow-tts, a text file that maps each filename to its sentence.

Directory structure:

language_folder_name
|-- ./wav/*.wav
|-- ./text_file_name.txt

The format for text_file_name.txt (the text file is only needed for glow-tts training):

( audio1.wav "Sentence1." )
( audio2.wav "Sentence2." )
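
As a quick sanity check before training, the transcript format above can be parsed with a few lines of Python. This is a minimal sketch and not part of the repository: the names `parse_transcript_line` and `parse_transcript`, and the regular expression itself, are assumptions based only on the two example lines shown.

```python
import re

# Assumed line shape, taken from the example above: ( <file>.wav "<sentence>" )
LINE_RE = re.compile(r'^\(\s*(\S+\.wav)\s+"(.*)"\s*\)$')

def parse_transcript_line(line):
    """Return (wav_name, sentence), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

def parse_transcript(text):
    """Parse a whole transcript string; raise ValueError on a malformed line."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        parsed = parse_transcript_line(line)
        if parsed is None:
            raise ValueError(f"line {number} is not in the expected format: {line!r}")
        entries.append(parsed)
    return entries

sample = '( audio1.wav "Sentence1." )\n( audio2.wav "Sentence2." )\n'
print(parse_transcript(sample))
```

Running the parser over the real text file and comparing the returned filenames against the contents of the wav folder is one way to catch missing or misnamed audio early.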

To resample the .wav files to a 22050 Hz sample rate, change the following parameters in vakyansh-tts/scripts/data/resample.sh:

input_wav_path : absolute path to the wav file folder in vakyansh-tts/data/
output_wav_path : absolute path to vakyansh-tts/data/resampled_wav_folder_name
output_sample_rate : 22050 (or any other desired sample rate)

To run:

cd scripts/data/
bash resample.sh
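
After resampling, it can be useful to confirm that every file really is at the target rate. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; the function name `find_wrong_sample_rates` is hypothetical and not part of the repository.

```python
import wave
from pathlib import Path

TARGET_RATE = 22050  # matches output_sample_rate above

def find_wrong_sample_rates(wav_dir, target_rate=TARGET_RATE):
    """Return {file_name: sample_rate} for .wav files that are not at target_rate."""
    mismatched = {}
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target_rate:
            mismatched[wav_path.name] = rate
    return mismatched
```

Calling `find_wrong_sample_rates` on the resampled folder should return an empty dictionary when every file has been converted.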

3. Spectrogram Training (glow-tts)

3.1 Data Preparation

To prepare the data, edit the vakyansh-tts/scripts/glow/prepare_data.sh file and change the following parameters:

input_text_path : absolute path to vakyansh-tts/data/text_file_name.txt
input_wav_path : absolute path to vakyansh-tts/data/resampled_wav_folder_name
gender : female or male voice

To run:

cd scripts/glow/
bash prepare_data.sh

3.2 Training glow-tts

To start spectrogram training, edit the vakyansh-tts/scripts/glow/train_glow.sh file and change the following parameter:

gender : female or male voice

Make sure the gender is the same as in the prepare_data.sh file.
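
The consistency requirement above can be checked mechanically. The following is only a sketch: it assumes both scripts set the value with a plain shell assignment such as `gender='male'`, which may not match the actual files, and the helper names `read_gender` and `genders_match` are hypothetical.

```python
import re

# Assumption: the scripts contain a plain shell assignment such as
#   gender='male'
# Adjust the pattern if the scripts in your checkout look different.
GENDER_RE = re.compile(r"^\s*gender=['\"]?(\w+)['\"]?\s*(?:#.*)?$", re.MULTILINE)

def read_gender(script_text):
    """Return the value assigned to `gender` in a shell script, or None."""
    match = GENDER_RE.search(script_text)
    return match.group(1) if match else None

def genders_match(prepare_text, train_text):
    """True only if both scripts set `gender` and the two values are equal."""
    prepare_gender = read_gender(prepare_text)
    train_gender = read_gender(train_text)
    return prepare_gender is not None and prepare_gender == train_gender
```

Reading the two script files with `open(...).read()` and passing their contents to `genders_match` gives a yes/no answer before a long training run is started.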

To start the training, run:

cd scripts/glow/
bash train_glow.sh

4. Vocoder Training (hifi-gan)

4.1 Data Preparation

To prepare the data, edit the vakyansh-tts/scripts/hifi/prepare_data.sh file and change the following parameters:

input_wav_path : absolute path to vakyansh-tts/data/resampled_wav_folder_name
gender : female or male voice

To run:

cd scripts/hifi/
bash prepare_data.sh

4.2 Training hifi-gan

To start vocoder training, edit the vakyansh-tts/scripts/hifi/train_hifi.sh file and change the following parameter:

gender : female or male voice

Make sure the gender is the same as in the prepare_data.sh file.

To start the training, run:

cd scripts/hifi/
bash train_hifi.sh

5. Inference

5.1 Using Gradio

To use the Gradio link, edit the following parameters in the vakyansh-tts/scripts/inference/gradio.sh file:

gender : female or male voice
device : cpu or cuda
lang : language code

To run:

cd scripts/inference/
bash gradio.sh

5.2 Using FastAPI

To use the FastAPI link, edit the parameters in the vakyansh-tts/scripts/inference/api.sh file as described in section 5.1.

To run:

cd scripts/inference/
bash api.sh

5.3 Direct Inference using text

To infer, edit the parameters in the vakyansh-tts/scripts/inference/infer.sh file as described in section 5.1 and assign the input text to the text variable.

To run:

cd scripts/inference/
bash infer.sh

An advanced inference script exposes additional parameters:

noise_scale : noise factor, from 0 to 1
length_scale : speed of the generated audio, from 0 to 2
transliteration : whether to switch on/off transliteration. 1: ON, 0: OFF
number_conversion : whether to switch on/off number to words conversion. 1: ON, 0: OFF
split_sentences : whether to switch on/off splitting of sentences. 1: ON, 0: OFF
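
To give some intuition for the first two parameters: in the original Glow-TTS formulation, `length_scale` multiplies the predicted per-token durations (so larger values produce slower speech) and `noise_scale` scales the random noise added when sampling the latent. The sketch below illustrates that arithmetic on made-up numbers, assuming this repository follows the same convention; the function names and the duration values are hypothetical, and this is not the repository's inference code.

```python
import math
import random

def scale_durations(durations, length_scale=1.0):
    """Multiply per-token durations (in mel frames) by length_scale and round up."""
    if not 0 <= length_scale <= 2:
        raise ValueError("length_scale is expected to be between 0 and 2")
    return [math.ceil(d * length_scale) for d in durations]

def sample_latent(mean, std, noise_scale=0.5, rng=random):
    """Draw mean + std * noise * noise_scale; noise_scale=0 returns the mean."""
    if not 0 <= noise_scale <= 1:
        raise ValueError("noise_scale is expected to be between 0 and 1")
    return mean + std * rng.gauss(0.0, 1.0) * noise_scale

durations = [3.2, 5.0, 1.5]  # hypothetical predicted durations, in frames
print(scale_durations(durations, 1.0))  # baseline speed
print(scale_durations(durations, 1.5))  # more frames per token, slower speech
print(scale_durations(durations, 0.5))  # fewer frames per token, faster speech
```

With `noise_scale` set to 0 the sampled value equals the mean, which makes the output deterministic; larger values add more variation.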

To run:

cd scripts/inference/
bash advanced_infer.sh

5.4 Installation of tts_infer package

In tts_infer package, we currently have two components:

1. Transliteration (AI4bharat's open sourced models) (Languages supported: {'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'pa', 'gom', 'mai', 'ml', 'sd', 'si', 'ur'} )

2. Num to Word (Languages supported: {'en', 'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'or', 'pa'} )
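
Because the two components cover different language sets, it helps to check a language code against both before enabling them. A minimal sketch, with the sets copied from the list above; the function name `supported_components` is hypothetical and not part of the tts_infer package.

```python
# Language sets copied from the list above.
TRANSLITERATION_LANGS = {'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'pa',
                         'gom', 'mai', 'ml', 'sd', 'si', 'ur'}
NUM_TO_WORD_LANGS = {'en', 'hi', 'gu', 'mr', 'bn', 'te', 'ta', 'kn', 'or', 'pa'}

def supported_components(lang):
    """Report which tts_infer components list the given language code."""
    return {
        'transliteration': lang in TRANSLITERATION_LANGS,
        'num_to_word': lang in NUM_TO_WORD_LANGS,
    }

print(supported_components('hi'))
print(supported_components('ml'))
print(sorted(TRANSLITERATION_LANGS & NUM_TO_WORD_LANGS))
```

The last line prints the codes for which both components are listed.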

Install the package and download the transliteration models:

git clone https://github.com/Open-Speech-EkStep/vakyansh-tts
cd vakyansh-tts
bash install.sh
python setup.py bdist_wheel
pip install -e .
cd tts_infer
wget https://storage.googleapis.com/vakyansh-open-models/translit_models.zip && unzip -q translit_models.zip

Usage: Refer to example file in tts_infer/

import re

from scipy.io.wavfile import write

from tts_infer.tts import TextToMel, MelToWav
from tts_infer.transliterate import XlitEngine
from tts_infer.num_to_word_on_sent import normalize_nums

# Load the glow-tts (text to mel) and hifi-gan (mel to waveform) checkpoints once.
text_to_mel = TextToMel(glow_model_dir='/path/to/glow-tts/checkpoint/dir', device='cuda')
mel_to_wav = MelToWav(hifi_model_dir='/path/to/hifi/checkpoint/dir', device='cuda')

def translit(text, lang):
    # Transliterate only the words that start with a Latin letter; keep the rest as-is.
    reg = re.compile(r'[a-zA-Z]')
    engine = XlitEngine(lang)
    words = [engine.translit_word(word, topk=1)[lang][0] if reg.match(word) else word for word in text.split()]
    return ' '.join(words)

def run_tts(text, lang):
    text = text.replace('।', '.')  # only for Hindi models
    text_num_to_word = normalize_nums(text, lang)  # convert numbers to words in lang
    text_num_to_word_and_transliterated = translit(text_num_to_word, lang)  # transliterate English words to lang

    mel = text_to_mel.generate_mel(text_num_to_word_and_transliterated)
    audio, sr = mel_to_wav.generate_wav(mel)
    write(filename='temp.wav', rate=sr, data=audio)  # save a wav file, if needed
    return (sr, audio)
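
The word-selection rule in `translit` can be tried without downloading any models. The sketch below swaps in a hypothetical `StubEngine` that mimics only the return shape used above (`{lang: [candidates]}`) and wraps each word in angle brackets instead of transliterating it; `StubEngine` and `translit_with` are illustrative names, not part of tts_infer.

```python
import re

LATIN_START = re.compile(r'[a-zA-Z]')

class StubEngine:
    """Hypothetical stand-in for XlitEngine: returns {lang: [candidates]}."""

    def __init__(self, lang):
        self.lang = lang

    def translit_word(self, word, topk=1):
        # A real engine would return transliterations; the stub just marks the word.
        return {self.lang: [f"<{word}>"] * topk}

def translit_with(engine, text, lang):
    """Same selection rule as translit(): only words starting with a Latin letter are sent to the engine."""
    words = [
        engine.translit_word(word, topk=1)[lang][0] if LATIN_START.match(word) else word
        for word in text.split()
    ]
    return ' '.join(words)

print(translit_with(StubEngine('hi'), 'मेरा नाम John है', 'hi'))
```

Because `re.match` anchors at the start of the word, a token that begins with a digit or punctuation mark is passed through unchanged even if it contains Latin letters later on.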

vakyansh-tts's People

Contributors

ankit-thoughtworks, ankurdhuriya, harveenchadha, neerajchhimwal, priyanshi-shah

vakyansh-tts's Issues

Missing files while inference

default_lineup.json is not available:
File "/home/vivek/ml/tushar/indic_tts/vakyansh-tts/utils/inference/transliterate.py", line 720, in __init__
lineup = json.load(open(os.path.join(F_DIR, config_path), encoding="utf-8"))

Error in setting up the environment

When I try to run "pip install -r requirements.txt" in a virtual environment in WSL, I get this error:
Failed to build mosestokenizer ffmpy flask-cachebuster docopt toolwrapper uctools

Could you help me resolve this?
(Note: this does not happen in Google Colab.)

Cannot start multispeaker training

Can you please guide me on which parameters need to be changed to start multispeaker training? I have a dataset of 4 speakers. I have transformed the data into the required format using your script, but when I start training I get this error:
Traceback (most recent call last):
File "../src/glow_tts/init.py", line 82, in <module>
main()
File "../src/glow_tts/init.py", line 69, in main
_ = generator(x, x_lengths, y, y_lengths, gen=False, g=sid)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/models.py", line 356, in forward
z, logdet = self.decoder(y, z_mask, g=g, reverse=False)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/models.py", line 198, in forward
x, logdet = f(x, x_mask, g=g, reverse=reverse)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/attentions.py", line 128, in forward
x = self.wn(x, x_mask, g)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/saad/hindi_glow/src/glow_tts/modules.py", line 145, in forward
g = self.cond_layer(g)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
type(self).__name__, name))
AttributeError: 'WN' object has no attribute 'cond_layer'
INFO:root:{'train': {'use_cuda': True, 'log_interval': 20, 'seed': 1234, 'epochs': 10000, 'learning_rate': 1.0, 'betas': [0.9, 0.98], 'eps': 1e-09, 'warmup_steps': 4000, 'scheduler': 'noam', 'batch_size': 16, 'ddi': True, 'fp16_run': True, 'save_epoch': 1}, 'data': {'load_mel_from_disk': False, 'training_files': '/home/saad/hindi_glow/data/training/train.txt', 'validation_files': '/home/saad/hindi_glow/data/training/valid.txt', 'chars': '/home/saad/hindi_glow/data/training/chars.txt', 'punc': '/home/saad/hindi_glow/data/training/punc.txt', 'text_cleaners': ['basic_indic_cleaners'], 'max_wav_value': 32768.0, 'sampling_rate': 16000, 'filter_length': 1024, 'hop_length': 256, 'win_length': 1024, 'n_mel_channels': 80, 'mel_fmin': 80.0, 'mel_fmax': 7600.0, 'add_noise': True, 'add_blank': True}, 'model': {'hidden_channels': 192, 'filter_channels': 768, 'filter_channels_dp': 256, 'kernel_size': 3, 'p_dropout': 0.1, 'n_blocks_dec': 12, 'n_layers_enc': 6, 'n_heads': 2, 'p_dropout_dec': 0.05, 'dilation_rate': 1, 'kernel_size_dec': 5, 'n_block_layers': 4, 'n_sqz': 2, 'prenet': True, 'mean_only': True, 'hidden_channels_enc': 192, 'hidden_channels_dec': 192, 'window_size': 4}, 'model_dir': '/home/saad/hindi_glow/results/', 'log_dir': '/home/saad/hindi_glow/logs/'}
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/saad/hindi_glow/src/glow_tts/train.py", line 109, in train_and_eval
utils.latest_checkpoint_path(hps.model_dir, "G_*.pth"),
File "/home/saad/hindi_glow/src/glow_tts/utils.py", line 83, in latest_checkpoint_path
x = f_list[-1]
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/saad/hindi_glow/src/glow_tts/train.py", line 120, in train_and_eval
os.path.join(hps.model_dir, "ddi_G.pth"), generator, optimizer_g
File "/home/saad/hindi_glow/src/glow_tts/utils.py", line 27, in load_checkpoint
optimizer.load_state_dict(checkpoint_dict["optimizer"])
File "/home/saad/hindi_glow/src/glow_tts/commons.py", line 176, in load_state_dict
self._optim.load_state_dict(d)
File "/home/saad/anaconda3/envs/indtts/lib/python3.7/site-packages/torch/optim/optimizer.py", line 116, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Fine tuning

Hi,
Great work. I am a bit confused: can we do fine-tuning with a custom single-speaker dataset? If so, please guide me on how.

Finetuning the model from pretrained English TTS checkpoint causing issues

Hi,

When I tried to fine-tune from the pre-trained English Glow checkpoint, the gradients became inf/NaN in spite of the gradient clipping defined in the code.

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
INFO:male:Train Epoch: 1 [0/4714 (0%)]  Loss: 9.467967
INFO:male:[8.355452537536621, 1.1125144958496094, 0, 5.70544330734548e-07]
grad_norm:
nan
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
...

In addition, soon after, the Adam optimizer throws the error below:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/vakyansh-tts/src/glow_tts/train.py", line 125, in train_and_eval
    rank, epoch, hps, generator, optimizer_g, train_loader, logger, writer
  File "/home/vakyansh-tts/src/glow_tts/train.py", line 186, in train
    optimizer_g.step()
  File "/home/vakyansh-tts/src/glow_tts/commons.py", line 169, in step
    self._optim.step()
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/apex/amp/_initialize.py", line 242, in new_step
    output = old_step(*args, **kwargs)
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/optim/adam.py", line 119, in step
    group['eps']
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/torch/optim/functional.py", line 86, in adam
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
  File "/home/.conda/envs/vakyansh-tts/lib/python3.7/site-packages/apex/amp/wrap.py", line 101, in wrapper
    return orig_fn(arg0, *args, **kwargs)
RuntimeError: The size of tensor a (35) must match the size of tensor b (34) at non-singleton dimension 0

Checking further it seems the difference in the size of tensor by 1 is as expected for the Adam optimizer.
Will it be possible to help look into this?

Thanks,
Aalisha

Voice Cloning in TTS

Is there a way to add voice cloning while synthesizing speech, similar to SV2TTS? If I have a trained speaker encoder, is it possible to do real-time voice cloning using your trained glow-tts mel synthesizer and hifi-gan vocoder?
