as-ideas / transformertts

1.1K stars · 33 watchers · 220 forks · 25.99 MB

🤖💬 Transformer TTS: Implementation of a non-autoregressive Transformer based neural network for text to speech.

Home Page: https://as-ideas.github.io/TransformerTTS/

License: Other

Python 100.00%
deep-learning axelspringerai text-to-speech python tensorflow tts

transformertts's Introduction



A Text-to-Speech Transformer in TensorFlow 2

Implementation of a non-autoregressive Transformer based neural network for Text-to-Speech (TTS).
This repo is based, among others, on the following papers:

Our pre-trained LJSpeech model is compatible with the pre-trained MelGAN and HiFiGAN vocoders

(older versions are also available for WaveRNN)

For quick inference with these vocoders, check out the Vocoding branch

Non-Autoregressive

Being non-autoregressive, this Transformer model is:

  • Robust: No repeats or failed attention modes on challenging sentences.
  • Fast: With no autoregression, predictions take a fraction of the time.
  • Controllable: It is possible to control the speed and pitch of the generated utterance.

🔈 Samples

Can be found here.

The spectrograms for these samples were converted to audio using the pre-trained MelGAN vocoder.

Try it out on Colab:

Open In Colab

Updates

  • 11/20: Added pitch prediction. The autoregressive model is now specialized as an Aligner, and Forward is now the only TTS model. Changed model architectures. Discontinued WaveRNN support. Improved duration extraction with a Dijkstra algorithm.
  • 06/20: Added normalisation and pre-trained models compatible with the faster MelGAN vocoder.
  • 03/20: Vocoding branch.

📖 Contents

Installation

Make sure you have:

  • Python >= 3.6

Install espeak as the phonemizer backend (for macOS use brew):

sudo apt-get install espeak
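
On macOS the equivalent, assuming Homebrew is installed, is:

brew install espeak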

Then install the rest with pip:

pip install -r requirements.txt

Read the individual scripts for more command line arguments.

Pre-Trained LJSpeech API

Use our pre-trained model (with Griffin-Lim) from the command line with:

python predict_tts.py -t "Please, say something."

Or in a Python script:

from data.audio import Audio
from model.factory import tts_ljspeech

model = tts_ljspeech()
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with griffin lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
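
To save the resulting waveform to disk you can, for example, use scipy (a sketch; the 22050 Hz rate matches the LJSpeech audio configuration shown elsewhere on this page):

from scipy.io.wavfile import write
write('sample.wav', 22050, wav)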

You can specify the model step with the --step flag (CL) or step parameter (script).
Steps from 60000 to 100000 are available at a frequency of 5K steps (60000, 65000, ..., 95000, 100000).
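
For example, to use the weights at step 95000 from the command line:

python predict_tts.py -t "Please, say something." --step 95000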

IMPORTANT: make sure to checkout the correct repository version to use the API.
Currently 493be6345341af0df3ae829de79c2793c9afd0ec

Dataset

You can directly use LJSpeech to create the training dataset.

Configuration

  • If training on LJSpeech, or if unsure, simply use config/training_config.yaml to create MelGAN- or HiFiGAN-compatible models
    • swap in the content of data_config_wavernn.yaml in config/training_config.yaml to create models compatible with WaveRNN
  • EDIT PATHS: in config/training_config.yaml, edit the paths to point at your dataset and log folders

Custom dataset

Prepare a folder containing your metadata and wav files, for instance:

|- dataset_folder/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...

If metadata.csv has the format wav_file_name|transcription, you can use the ljspeech preprocessor in data/metadata_readers.py; otherwise, add your own reader to the same file.
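
For example, a compatible metadata.csv might look like this (file names and transcriptions are illustrative):

file1|This is the transcription of the first audio file.
file2|And this is the transcription of the second one.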

Make sure that:

  • the metadata reader function name is the same as the data_name field in training_config.yaml.
  • the metadata file (it can be named anything) is specified under metadata_path in training_config.yaml (see the sketch below).
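
A minimal sketch of the relevant training_config.yaml entries (values are placeholders):

data_name: ljspeech                                   # must match the metadata reader function name
metadata_path: /path/to/dataset_folder/metadata.csv   # the metadata file itself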

Training

Change the --config argument based on the configuration of your choice.

Train Aligner Model

Create training dataset

python create_training_data.py --config config/training_config.yaml

This will populate the training data directory (default transformer_tts_data.ljspeech).

Training

python train_aligner.py --config config/training_config.yaml

Train TTS Model

Compute alignment dataset

First use the aligner model to create the durations dataset:

python extract_durations.py --config config/training_config.yaml

This will add the durations.<session name> folder, as well as the char-wise pitch folders, to the training data directory.

Training

python train_tts.py --config config/training_config.yaml

Training & Model configuration

  • Training and model settings can be configured in training_config.yaml

Resume or restart training

  • To resume training, simply use the same configuration files.
  • To restart training, delete the weights and/or the logs from the logs folder, or use the training flags --reset_dir (both), --reset_logs or --reset_weights (see the example below).
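
For example, to restart TTS training with clean weights and logs (a sketch using the flags listed above):

python train_tts.py --config config/training_config.yaml --reset_dir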

Monitor training

tensorboard --logdir /logs/directory/

Tensorboard Demo

Prediction

With model weights

From the command line with:

python predict_tts.py -t "Please, say something." -p /path/to/weights/

Or in a Python script:

from model.models import ForwardTransformer
from data.audio import Audio
model = ForwardTransformer.load_model('/path/to/weights/')
audio = Audio.from_config(model.config)
out = model.predict('Please, say something.')

# Convert spectrogram to wav (with griffin lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)

Model Weights

Access the pre-trained models via the API call described above.

Old weights

Model                                      Commit    Vocoder Commit
ljspeech_tts_model                         0cd7d33   aca5990
ljspeech_melgan_forward_model              1c1cb03   aca5990
ljspeech_melgan_autoregressive_model_v2    1c1cb03   aca5990
ljspeech_wavernn_forward_model             1c1cb03   3595219
ljspeech_wavernn_autoregressive_model_v2   1c1cb03   3595219
ljspeech_wavernn_forward_model             d9ccee6   3595219
ljspeech_wavernn_autoregressive_model_v2   d9ccee6   3595219
ljspeech_wavernn_autoregressive_model_v1   2f3a1b5   3595219

Maintainers

Special thanks

MelGAN and WaveRNN: data normalization and samples' vocoders are from these repos.

Erogol and the Mozilla TTS team for the lively exchange on the topic.

Copyright

See LICENSE for details.

transformertts's People

Contributors

cfrancesco, conprogramming, cschaefer26, datitran, dependabot[bot]


transformertts's Issues

Suggestion

Hey, thanks so much for the code.
I don't have any contact to ask you, so I thought to ask here. Extremely sorry for this.
I am new to TTS and want to implement the latest multispeaker TTS system paper here.

The speaker module here passes outputs to the encoder and the pre-decoder. My issue is that, since my sequence lengths will be different for the encoder and pre-decoder outputs, how do I concatenate the speaker output to the two in TensorFlow?
More specifically, since the encoder output will be (batch, inp_seq_len, d) and the pre-decoder output (batch, out_seq_len, d), how do I concatenate different speaker IDs onto each sequence in parallel?
Currently I am thinking of creating two speaker inputs of size (batch, inp_seq, speaker_id) and passing them through the speaker module to get a shape which can be concatenated with the encoder output, with the speaker module's training set to true. Then I would create another speaker input compatible with out_seq_len and pass it through the speaker module to get an output compatible with the pre-decoder, with training set to false.
I don't have anyone else to ask, hence asking you.
Thanks
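
A minimal sketch of one common approach (an assumption, not code from this repo): learn one embedding per speaker, tile it along the time axis to match each sequence length, and concatenate on the feature axis. Names and dimensions below are illustrative.

import tensorflow as tf

batch, inp_seq_len, out_seq_len, d, d_spk, n_speakers = 8, 50, 200, 256, 64, 10
enc_out = tf.random.normal([batch, inp_seq_len, d])        # encoder output
dec_in = tf.random.normal([batch, out_seq_len, d])         # pre-decoder output
speaker_ids = tf.random.uniform([batch], maxval=n_speakers, dtype=tf.int32)

speaker_embedding = tf.keras.layers.Embedding(n_speakers, d_spk)
spk = speaker_embedding(speaker_ids)                       # (batch, d_spk)

def concat_speaker(seq, spk_emb):
    # Tile the per-utterance embedding along the time axis, then concat on features.
    t = tf.shape(seq)[1]
    tiled = tf.tile(spk_emb[:, tf.newaxis, :], [1, t, 1])  # (batch, t, d_spk)
    return tf.concat([seq, tiled], axis=-1)                # (batch, t, d + d_spk)

enc_with_spk = concat_speaker(enc_out, spk)   # (batch, inp_seq_len, d + d_spk)
dec_with_spk = concat_speaker(dec_in, spk)    # (batch, out_seq_len, d + d_spk)

This avoids building a separate speaker input per sequence length: the same (batch, d_spk) embedding broadcasts to any number of time steps.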

where is self.loss coming from?

In weighted_sum_losses...

def _gta_forward(self, inp, tar, stop_prob, training):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]
    tar_stop_prob = stop_prob[:, 1:]

    mel_len = int(tf.shape(tar_inp)[1])
    tar_mel = tar_inp[:, 0::self.r, :]

    with tf.GradientTape() as tape:
        model_out = self.__call__(inputs=inp,
                                  targets=tar_mel,
                                  training=training)
        loss, loss_vals = weighted_sum_losses((tar_real,
                                               tar_stop_prob,
                                               tar_real),
                                              (model_out['final_output'][:, :mel_len, :],
                                               model_out['stop_prob'][:, :mel_len, :],
                                               model_out['mel_linear'][:, :mel_len, :]),
                                              self.loss,  # <- this
                                              self.loss_weights)
    model_out.update({'loss': loss})
    model_out.update({'losses': {'output': loss_vals[0], 'stop_prob': loss_vals[1], 'mel_linear': loss_vals[2]}})
    model_out.update({'reduced_target': tar_mel})
    return model_out, tape

Invalid array shape in extract_durations

Hi thanks for your help with storing predictions in extract_durations. Now that I can run the script, the next error occurs. Any ideas why the shape does not fit? Thanks again!

Extracting training alignments: : 156it [06:40,  2.57s/it]
Traceback (most recent call last):
  File "extract_durations.py", line 230, in <module>
    np.save(str(train_target_dir / f'{sample_idx}_mel_phon_dur.npy'), sample)
  File "<__array_function__ internals>", line 6, in save
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/numpy/lib/npyio.py", line 527, in save
    arr = np.asanyarray(arr)
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not broadcast input array from shape (224,80) into shape (224)

Issues replicating the examples

My predict.py:

from utils.config_manager import ConfigManager
from utils.audio import Audio
from scipy.io.wavfile import write

config_loader = ConfigManager('ljspeech_autoregressive_transformer\standard', model_kind='autoregressive')
audio = Audio(config_loader.config)
model = config_loader.load_model()
was = 'President Trump met with other leaders at the Group of twenty conference..'
out = model.predict(was)

# Convert spectrogram to wav (with griffin lim)
wav = audio.reconstruct_waveform(out['mel'].numpy().T)
#print(wav)
samplerate = 22050
was = "".join(x for x in was if x.isalnum())
write(was + ".wav", samplerate, wav)

I changed the Architecture section of autoregressive_config.yaml to:

# ARCHITECTURE 
decoder_model_dimension: 256
encoder_model_dimension: 512
decoder_num_heads: [4, 4, 4, 4]  # the length of this defines the number of layers
encoder_num_heads: [4, 4, 4, 4]  # the length of this defines the number of layers
encoder_feed_forward_dimension: 1024
decoder_feed_forward_dimension: 1024
decoder_prenet_dimension: 256
encoder_prenet_dimension: 512
encoder_max_position_encoding: 1000
decoder_max_position_encoding: 10000
postnet_conv_filters: 256
postnet_conv_layers: 5                  
with_stress: true
postnet_kernel_size: 5
encoder_dense_blocks: 4
decoder_dense_blocks: 4
normalizer: 'WaveRNN'
encoder_attention_conv_filters: 512
decoder_attention_conv_filters: 512
encoder_attention_conv_kernel: 3
decoder_attention_conv_kernel: 3

And I got the attached wave file PresidentTrumpmetwithotherleadersattheGroupoftwentyconference.zip. Not the same as https://as-ideas.github.io/TransformerTTS/.

What is missing?

extract_durations.py slowly filling up RAM and gets killed

Hi, I found time again to train a model on a larger German dataset. Creating the dataset works well, but when I execute extract_durations.py the process takes more and more RAM until the python process ends with a simple "killed". I have 32GB of RAM.

This is how I call it:
(TransformerTTS) [user1@localhost TransformerTTS]$ python extract_durations.py --config config/melgan --binary --fix_jumps --fill_mode_next --store_predictions

These are the last three lines I see:

2020-10-26 08:43:29.819269: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Processing validation set: : 13it [00:10,  1.25it/s]
Processing training set: : 1444it [05:54,  1.38it/s]Getötet

How do you use TransformerTTS?

My intended purpose is to have new voices for book-to-audiobook conversion for private consumption.

How do you use TransformerTTS?

Problem computing alignment dataset

Hi,

Thank you for sharing your implementation, which I find great. I used your repository to train the first autoregressive model, which worked fine, and I was able to synthesize speech for my sentences. However, when I try to use the autoregressive model to create the durations dataset, the folders inside forward_data are created but are empty. When I subsequently try to train the forward model, it obviously fails:

python train_forward.py --config config/melgan

TRAINING
Traceback (most recent call last):
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1986, in execution_mode
yield
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 655, in _next_internal
output_shapes=self._flat_output_shapes)
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2363, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 670, in next
return self._next_internal()
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 661, in _next_internal
return structure.from_compatible_tensor_list(self._element_spec, ret)
File "anaconda3/envs/tfenv2/lib/python3.7/contextlib.py", line 130, in exit
self.gen.throw(type, value, traceback)
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1989, in execution_mode
executor_new.wait()
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train_forward.py", line 122, in
test_batch = val_dataset.next_batch()
File "TTS/TransformerTTS/preprocessing/data_handling.py", line 35, in next_batch
return next(self.data_iter)
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 631, in next
return self.next()
File "anaconda3/envs/tfenv2/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 672, in next
raise StopIteration
StopIteration

Here is the output during the extraction of alignments where, after checking the extract_durations.py file, I can see that some loops further down the file are not entered (load_files returns full objects for both train and val samples):

python extract_durations.py --config config/melgan --binary --fix_jumps --fill_mode_next

2020-07-02 13:08:56.878490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10368 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:68:00.0, compute capability: 6.1)
2020-07-02 13:09:09.360962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-02 13:09:10.055624: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Processing validation set: : 2it [00:13, 6.65s/it]
0it [00:00, ?it/s]
1 Physical GPUs, 1 Logical GPUs
DurationExtraction_weighted_binary_filled(next)_fix_jumps
restored weights from /LJSpeech/logdir_newTTS_en/melgan/autoregressive_weights/ckpt-90 at step 900000
Extracting attention from layer Decoder_DenseBlock4_CrossAttention
Done.

My data configuration for MelGAN is:
data_directory: '/LJSpeech/LJSpeech-1.1' # path to wavs and metafile directory
log_directory: '/LJSpeech/logdir_newTTS_en' # weights and logs are stored here
train_data_directory: None # optional: alternative directory where to store processed data (default is data_dir)
wav_subdir_name: 'wavs' # subfolder in data_directory containing wavs files
metadata_filename: 'metadata.csv' # name of metadata file under data_directory
session_name: None # session naming, can be specified in command line

# DATA
n_samples: 100000
n_test: 100
mel_start_value: 4
mel_end_value: -4

# AUDIO
sampling_rate: 22050
n_fft: 1024
mel_channels: 80
hop_length: 256
win_length: 1024
f_min: 0
f_max: 8000
normalizer: MelGAN # which mel normalization to use from utils.audio.py [MelGAN or WaveRNN]

# TOKENIZER
phoneme_language: 'en'

Could you please help me figure out what could be going wrong here?

Audio Not Generating At All

Hey!

I was playing around with the notebook and ran the cells one by one. When I tried listening to the audio generated by the model using the IPython display function, the audio was a four-second stream of soft booming noises and didn't sound anything like the input sentence.

Would love to know how to proceed.

Cheers :D

Mandarin DataSet

Hi guys, my dataset is in Mandarin, and when I change the language to 'zh' in data_config.yaml I get the error 'ValueError: language must be either "en" or "de", not zh.'. Could you tell me how to handle this? I'm looking forward to your reply. Thank you!

Demos in colab not working

Hi there. Thanks for the fantastic work you've done. I'm new to the TTS area, and when I tried out the demos in Colab some errors appeared.
When I execute this cell in the synthesize_autoregressive_melgan notebook:

# Load pretrained models
from utils.config_manager import ConfigManager
from utils.audio import Audio

import IPython.display as ipd

config_loader = ConfigManager(str(config_path), model_kind='autoregressive')
audio = Audio(config_loader.config)
model = config_loader.load_model(str(config_path / 'autoregressive_weights/ckpt-90'))

I got this error:

WARNING: could not retrieve git hash. Command '['git', 'describe', '--always']' returned non-zero exit status 128.
WARNING: could not check git hash. Command '['git', 'describe', '--always']' returned non-zero exit status 128.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-a5cfa4a2749f> in <module>()
      7 config_loader = ConfigManager(str(config_path), model_kind='autoregressive')
      8 audio = Audio(config_loader.config)
----> 9 model = config_loader.load_model(str(config_path / 'autoregressive_weights/ckpt-90'))

1 frames
/content/drive/My Drive/TTS/TransformerTTS/utils/config_manager.py in load_model(self, checkpoint_path, verbose)
    192 
    193     def load_model(self, checkpoint_path: str = None, verbose=True):
--> 194         model = self.get_model()
    195         self.compile_model(model)
    196         ckpt = tf.train.Checkpoint(net=model)

/content/drive/My Drive/TTS/TransformerTTS/utils/config_manager.py in get_model(self, ignore_hash)
    111                                              decoder_prenet_dimension=self.config['decoder_prenet_dimension'],
    112                                              encoder_prenet_dimension=self.config['encoder_prenet_dimension'],
--> 113                                              encoder_attention_conv_kernel=self.config['encoder_attention_conv_kernel'],
    114                                              decoder_attention_conv_kernel=self.config['decoder_attention_conv_kernel'],
    115                                              encoder_attention_conv_filters=self.config['encoder_attention_conv_filters'],

KeyError: 'encoder_attention_conv_kernel'

I didn't change any code except mounting it to my own drive. What could be the reason for this error?
Thanks in advance.

About loss used

Is your TTS loss an average of (absolute error, cross entropy, mean squared error, ...)?
Can you tell me what the actual TTS loss used in the literature is? I was looking but was unable to find it.
Thanks

Unsuitable location for the new_adam static method

As noted in a TODO comment, the new_adam static method is not well suited to the ConfigManager object in utils/config_manager.py and would make a lot more sense in model/models.py.

The method in question:

@staticmethod
def new_adam(learning_rate):
    return tf.keras.optimizers.Adam(learning_rate,
                                    beta_1=0.9,
                                    beta_2=0.98,
                                    epsilon=1e-9)

Other than the learning_rate, the parameters of the Adam optimizer are hard-coded, so perhaps part of this change could also include configuration options for beta_1, beta_2, and epsilon.
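
A minimal sketch of what a configurable version might look like (the adam_* config keys are hypothetical):

import tensorflow as tf

def new_adam(config: dict) -> tf.keras.optimizers.Adam:
    # Fall back to the currently hard-coded values when keys are absent.
    return tf.keras.optimizers.Adam(config['learning_rate'],
                                    beta_1=config.get('adam_beta_1', 0.9),
                                    beta_2=config.get('adam_beta_2', 0.98),
                                    epsilon=config.get('adam_epsilon', 1e-9))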

Missing words in audio, improper pauses

Hey,

I have been custom-training the models on a custom (non-English) dataset, ensuring the language is supported by the phonemizer https://github.com/espeak-ng/espeak-ng.

I am using non-English characters in the training set; the phonemes generated are long due to the richness of the language, and the dataset is well cleaned and seems suitable for the task.

However, TTS autoregressive training is not coming out as expected. At times the synthesis misses a few words, or there are improper pauses after commas and full stops.

The MelGAN is training very well and has converged properly. For the TTS, however, the losses are stagnant and don't seem to improve further, and I am still getting the issue.

Can anyone help identify what could be going wrong here? Is it a dataset issue or a phoneme-level issue?
What to look at and what not? Any pointer would be helpful.

PS: I have already replicated results on English with a custom dataset, and those are very good.

Thanks

Shapes in forward model do not match

Hi there,
I'm trying to use the Forward model on my own dataset, but in extract_durations.py I run into this error.
Any ideas why the shape does not fit?
It works fine with the autoregressive model.

1 Physical GPUs, 1 Logical GPUs
DurationExtraction_weighted_binary_filled(next)_fix_jumps_layer-1
fatal: not a git repository (or any of the parent directories): .git
WARNING: could not retrieve git hash. Command '['git', 'describe', '--always']' returned non-zero exit status 128.

CONFIGURATION ljspeech.melgan.autoregressive
- decoder_model_dimension : 256
- encoder_model_dimension : 512
- decoder_num_heads : [4, 4, 4, 4]
- encoder_num_heads : [4, 4, 4, 4]
- encoder_feed_forward_dimension : 1024
- decoder_feed_forward_dimension : 1024
- decoder_prenet_dimension : 256
- encoder_prenet_dimension : 512
- encoder_attention_conv_filters : 512
- decoder_attention_conv_filters : 512
- encoder_attention_conv_kernel : 3
- decoder_attention_conv_kernel : 3
- encoder_max_position_encoding : 1000
- decoder_max_position_encoding : 10000
- postnet_conv_filters : 256
- postnet_conv_layers : 5
- postnet_kernel_size : 5
- encoder_dense_blocks : 4
- decoder_dense_blocks : 4
- stop_loss_scaling : 8
- dropout_rate : 0.1
- decoder_prenet_dropout_schedule : [[0, 0.0], [25000, 0.0], [35000, 0.5]]
- learning_rate_schedule : [[0, 0.0001]]
- head_drop_schedule : [[0, 0], [15000, 1]]
- reduction_factor_schedule : [[0, 10], [80000, 5], [150000, 3], [250000, 1]]
- max_steps : 900000
- bucket_boundaries : [200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200]
- bucket_batch_sizes : [64, 42, 32, 25, 21, 18, 16, 14, 12, 11, 1]
- debug : False
- validation_frequency : 1000
- prediction_frequency : 10000
- weights_save_frequency : 10000
- train_images_plotting_frequency : 1000
- keep_n_weights : 2
- keep_checkpoint_every_n_hours : 12
- n_steps_avg_losses : [100, 500, 1000, 5000]
- n_predictions : 2
- prediction_start_step : 20000
- audio_start_step : 40000
- audio_prediction_frequency : 10000
- data_directory : /content/content/data
- log_directory : /content/logdir
- metadata_filename : 12400_V2.csv
- train_metadata_filename : train_metafile.txt
- valid_metadata_filename : valid_metafile.txt
- session_name : melgan
- data_name : ljspeech
- n_samples : 100000
- n_test : 100
- mel_start_value : 0.5
- mel_end_value : -0.5
- max_mel_len : 1200
- min_mel_len : 80
- sampling_rate : 22050
- n_fft : 1024
- mel_channels : 80
- hop_length : 256
- win_length : 1024
- f_min : 0
- f_max : 8000
- normalizer : MelGAN
- phoneme_language : en-us
- with_stress : False
fatal: not a git repository (or any of the parent directories): .git
WARNING: could not check git hash. Command '['git', 'describe', '--always']' returned non-zero exit status 128.
WARNING: could not find weights file. Trying to load from 
 /content/logdir/ljspeech.melgan.autoregressive/weights.
Edit data_config.yaml to point at the right log directory.
restored weights from None at step 0
ERROR: model's reduction factor is greater than 1, check config. (r=10
Extracting attention from layer Decoder_DenseBlock4_CrossAttention
Processing dataset: : 0it [00:00, ?it/s]2020-10-22 11:35:05.933335: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-10-22 11:35:07.492170: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Traceback (most recent call last):
  File "/content/TransformerTTS/extract_durations.py", line 114, in <module>
    pred_mel = tf.expand_dims(1 - tf.squeeze(create_mel_padding_mask(mel_batch[:, 1:, :])), -1) * pred_mel
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 1125, in binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 1457, in _mul_dispatch
    return multiply(x, y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py", line 509, in multiply
    return gen_math_ops.mul(x, y, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 6166, in mul
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [32,397,1] vs. [32,400,80] [Op:Mul]

Error converting audio using melgan

Hi, I try to use MelGAN to produce wav files, but I get a TransformerTTS/utils/audio.py:96: RuntimeWarning: overflow encountered in exp.

The Colab notebook is not clear on how to do the conversion, since the last lines are:

if torch.cuda.is_available():
    vocoder = vocoder.cuda()
    mel = mel.cuda()

with torch.no_grad():
    audio = vocoder.inference(mel)

I'm expecting to do something like:

wav = audio.reconstruct_waveform(audio.cpu().numpy().T)
librosa.output.write_wav('melgan1.wav', wav, 22050)

But it gives me the error in audio.py. Can you help me out?

All elements in a batch must have the same rank

Hi,

I'm trying to train the model on another dataset, for the Persian language. I've created the dataset using create_training_data.py and my dataset is just like LJSpeech. But when I run train.py I see the following error:

starting training from scratch

TRAINING
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2102, in execution_mode
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 758, in _next_internal
output_shapes=self._flat_output_shapes)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2610, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: All elements in a batch must have the same rank as the padded shape for component2: expected rank 1 but got element with rank 2 [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mahdi/Temp/TTS/TransformerTTS/train_forward.py", line 96, in
test_mel, test_phonemes, test_durs, test_fname = valid_dataset.next_batch()
File "/home/mahdi/Temp/TTS/TransformerTTS/preprocessing/datasets.py", line 199, in next_batch
return next(self.data_iter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 736, in next
return self.next()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 772, in next
return self._next_internal()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 764, in _next_internal
return structure.from_compatible_tensor_list(self._element_spec, ret)
File "/usr/lib/python3.6/contextlib.py", line 99, in exit
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2105, in execution_mode
executor_new.wait()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: All elements in a batch must have the same rank as the padded shape for component2: expected rank 1 but got element with rank 2

Process finished with exit code 1

I don't know what the problem is. Can somebody help me, please?

Problem with sentence cleaners/phonemizer Pipeline

When using the default cleaner for LibriTTS, I came across a strange behavior of the preprocessing pipeline.

I decided to replace '...' with a comma in the cleaner, to avoid buggy punctuation (though I realize it might be what causes the subsequent problem).

This leads to the presence of the substring ' , ?' at the end of sentence n°4 of the batch shown below. (Initially, it was ' . . . ?')

When processed with the phonemizer, it appears that said substring (comma-question mark) affects the processing, which goes very wrong: the batch goes from length 10 to length 9 by fusing sentences 4 and 5 into a single output, then n°6 gets cut, the question marks from later sentences end up in the wrong sentences, and everything afterwards is shifted and does not behave as normal.

This is the batch I feed in (I reduced the batch size in order to localize where it happens).
(Pay attention to the number before each sentence)

0.'What risks would you run in a job like that? Ned Land said. Swallowing a few gulps of salt water? ',
1.'Whatever you say, Ned. Then, trying to imitate Captain Nemos carefree tone, I asked, By the way, gallant Ned, are you afraid of sharks? ',
2.'Me? the Canadian replied ',
3.'Im a professional harpooner! Its my job to make a mockery of them! ',
4.'So its an issue of , ? ',
5.'In the water? ',
6.'You see, sir, these sharks are badly designed ',
7.'They have to roll their bellies over to snap you up, and in the meantime ',
8.'What are your feelings about these man eaters? ',
9.'Im afraid I must be frank with master ',

output:

0.'wɒt ɹɪsks wʊd juː ɹʌn ɪn ə dʒɒb laɪk ðat? nɛd land sɛd. swɒləʊɪŋ ə fjuː ɡʌlps ɒv sɒlt wɔːtə? ',
1.'wɒtɛvə juː seɪ, nɛd. ðɛn, tɹaɪɪŋ tʊ ɪmɪteɪt kaptɪn niːməʊz keəfɹiː təʊn, aɪ askt, baɪ ðə weɪ, ɡalənt nɛd, ɑː juː əfɹeɪd ɒv ʃɑːks? ',
2.'miː? ðə kəneɪdiən ɹɪplaɪd',
3.'ɪm ə pɹəfɛʃənəl hɑːpuːnə! ɪts maɪ dʒɒb tə meɪk ə mɒkəɹi ɒv ðɛm! ',
4.'səʊ ɪts ən ɪʃuː ɒv , ɪnðə wɔːtə? ', (n°4 and 5 are "glued together")
5.'juː siː? ', (start of n°6 with unexpected question mark)
6.'sɜː, ðiːz ʃɑːks ɑː badli dɪzaɪnd, ðeɪ hav tə ɹəʊl ðeə bɛlɪz əʊvə tə snap juː ʌp', (n°6 and start of n°7 without the end)
7.'and ɪnðə miːntaɪm, wɒt ɑː jɔː fiːlɪŋz əbaʊt ðiːz man iːtəz', (end of n°7, plus n°8)
8.'ɪm əfɹeɪd aɪ mʌst biː fɹaŋk wɪð mastə? ' (n°9 with unexpected question mark)

Sorry for the clumsy report; it is quite hard to explain what happens. It is the only sentence from "train-clean-100" which leads to this problem. It is also the only one containing the substring ' . . . ?'.

Maybe replacing the dot-dot-dot with a comma was not the best workaround, but I think there is something that goes wrong beyond that here...

Any suggestions to introduce pauses (up or down) in the produced speech?

First of all, Great Work! Thanks for sharing the repo!

I have trained the autoregressive model on the LJ dataset. The output is quite good for short sentences. I seek some advice on manipulating pauses between words in the produced speech. Let's say the produced speech is 'This is a Text to Speech model.' I want to increase (or decrease) the pause between the words 'Speech' and 'model' a little bit.

Any Suggestions?

Possible to fine-tune on new dataset?

Is it possible to fine-tune the pretrained LJSpeech model on a new dataset (English, but potentially also other languages)?
If so, would we have to resume training on both the autoregressive model and the forward model, or just the forward model?

Wrong filename extracted when file ends with .wav

In create_dataset.py there is a line that is supposed to extract the filename without the .wav extension. This line is filename = filename.split('.')[-1], but the index must be -2, otherwise the filename is wav for all files. Took me a while to find this one.
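
A more robust alternative to indexing on '.' (a sketch, not the repo's code) is os.path.splitext, which also handles file names that contain extra dots:

import os

stem, ext = os.path.splitext('file.name.wav')  # ('file.name', '.wav')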

How to change the phonemizer?

Hi, I tried this code and it's really good and fast. But I have a problem with the phonemizer. Since I don't use English for my model, the phonemizer is a little bit off: some words have the wrong pronunciation. Is there any way to change the phonemizer or turn that option off?
Thank you for your feedback.

why a reduced version of the mel spec for decoder training?

In models.py:

def _gta_forward(self, inp, tar, stop_prob, training):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]
    tar_stop_prob = stop_prob[:, 1:]
    mel_len = int(tf.shape(tar_inp)[1])
    tar_mel = tar_inp[:, 0::self.r, :]

Why do you take a reduced version of tar_mel for the decoder? Is it for faster training? Why is self.r = 10? Is there any reason?

Error while training

When running the following command: python train_autoregressive.py --config config/wavernn
I get the error:

starting training from scratch

TRAINING
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\eager\context.py", line 1986, in execution_mode
    yield
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 655, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2363, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\framework\ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 670, in next
    return self._next_internal()
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 661, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\eager\context.py", line 1989, in execution_mode
    executor_new.wait()
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\eager\executor.py", line 67, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_autoregressive.py", line 119, in <module>
    _ = train_dataset.next_batch()
  File "E:\dev\TransformerTTS-master\preprocessing\data_handling.py", line 35, in next_batch
    return next(self.data_iter)
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 631, in __next__
    return self.next()
  File "C:\Users\user\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 672, in next
    raise StopIteration
StopIteration

Cannot train forward model (StopIteration)

Hi, I successfully trained an autoregressive model, executed extract_durations.py, and now want to train the forward model. Unfortunately, executing python train_forward.py --config config/melgan gives me a StopIteration.

Edit: I found out that the folder "forward_data" in my train_data_directory contains four folders: "train", "train_predictions_melgan", "val" and "val_predictions_melgan". But all four are empty. I guess extract_durations.py should fill them in some way?

This is the end of the stack trace:

starting training from scratch


TRAINING
Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1986, in execution_mode
    yield
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 655, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2363, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 670, in next
    return self._next_internal()
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 661, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 1989, in execution_mode
    executor_new.wait()
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_forward.py", line 122, in <module>
    test_batch = val_dataset.next_batch()
  File "/home/user1/TransformerTTS/preprocessing/data_handling.py", line 35, in next_batch
    return next(self.data_iter)
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 631, in __next__
    return self.next()
  File "/home/user1/anaconda3/envs/TransformerTTS/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 672, in next
    raise StopIteration
StopIteration

RuntimeError: CUDA out of memory

Hey guys!
I get the following error when trying to convert my spectrograms to audio using MelGAN:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 4.00 GiB total capacity; 16.61 MiB already allocated; 0 bytes free; 26.00 MiB reserved in total by PyTorch)

This is the code that's causing it:

if torch.cuda.is_available():
    vocoder = vocoder.cuda()
    mel = mel.cuda()

with torch.no_grad():
    audio = vocoder.inference(mel)

It's taken straight from the notebook. Any solutions to this?
The entire error message looks like this:

RuntimeError                              Traceback (most recent call last)
<ipython-input-9-d382b2c2e31f> in <module>
      4 
      5 with torch.no_grad():
----> 6     audio = vocoder.inference(mel)

~/.cache\torch\hub\seungwonpark_melgan_master\model\generator.py in inference(self, mel)
     70         mel = torch.cat((mel, zero), dim=2)
     71 
---> 72         audio = self.forward(mel)
     73         audio = audio.squeeze() # collapse all dimension except time axis
     74         audio = audio[:-(hop_length*10)]

~/.cache\torch\hub\seungwonpark_melgan_master\model\generator.py in forward(self, mel)
     46     def forward(self, mel):
     47         mel = (mel + 5.0) / 5.0 # roughly normalize spectrogram
---> 48         return self.generator(mel)
     49 
     50     def eval(self, inference=False):

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\site-packages\torch\nn\modules\container.py in forward(self, input)
     98     def forward(self, input):
     99         for module in self:
--> 100             input = module(input)
    101         return input
    102 

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\site-packages\torch\nn\modules\conv.py in forward(self, input, output_size)
    645         return F.conv_transpose1d(
    646             input, self.weight, self.bias, self.stride, self.padding,
--> 647             output_padding, self.groups, self.dilation)
    648 
    649 

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 4.00 GiB total capacity; 16.61 MiB already allocated; 0 bytes free; 26.00 MiB reserved in total by PyTorch)

Best way to reduce GPU memory usage

Using around 70k wav files (18GB), I get an OOM from TensorFlow at iteration ~80k. What is the best way to reduce the memory usage of my GPU (TITAN RTX with ~22GB RAM)? Would it be the batch size in autoregressive_config.yaml, or would you recommend reducing the training data?
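
One knob that directly controls per-step memory is the bucketed batch size list in the config (keys as shown in the configuration dump elsewhere on this page; the reduced values below are illustrative):

# in autoregressive_config.yaml
bucket_boundaries: [200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200]
bucket_batch_sizes: [32, 21, 16, 12, 10, 9, 8, 7, 6, 5, 1]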

Unequal input array dimensions in create_dataset.py

I'm currently trying to execute the create_dataset script. It fails with an invalid array size in line 65 of create_dataset.py.

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 3515 and the array at index 1 has size 3514

I checked the metadata.csv and the wav files in the folder waves. The numbers are the same. The error occurs even with a different dataset.

Do you have a clue where this error comes from?

Regarding Model architecture

Hey, I am new to TTS and transformers.
I looked into your code and found that you are using mel_linear to compute the stop prob, while in the paper they use the decoder output to predict the stop prob. Can you explain? Please correct me if I am wrong; basically I saw the Postnet code and am inferring from there.

ERROR install espeak

Hello. I have problems when I try to install espeak on my PC (Windows 10).

The package doesn't exist in Python. Some websites tell me to try another package such as espeakng. What do you recommend? I modified tokenizer.py to use espeakng, but the system results in a runtime error:

RuntimeError: espeakng is not a supported backend, choose in espeak, festival, segments.

Thanks a lot

any idea how to generate speech for long text?

Hi,
thanks for your contribution. I trained the autoregressive model with a Chinese voice corpus, BIAOBEI, and got satisfying results. But I noticed that if this model processes a long text, e.g. more than 20 Chinese characters, the generated voice ends with strange repeated phonemes for a long time.
I think this is because of the coverage and memory ability of the transformer model, but I don't know which parameter influences it. Could you give me some advice? Thank you.
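
A common workaround (a sketch, not part of this repo) is to split the text into sentences, synthesize each one separately, and concatenate the waveforms with short pauses in between:

import re
import numpy as np

def synthesize_long(text, model, audio, pause_s=0.3, sr=22050):
    # Naive split on Chinese/Western end-of-sentence punctuation; adjust to your data.
    sentences = [s for s in re.split(r'(?<=[。！？.!?])\s*', text) if s]
    pause = np.zeros(int(pause_s * sr), dtype=np.float32)
    wavs = []
    for sentence in sentences:
        out = model.predict(sentence)
        wavs.append(audio.reconstruct_waveform(out['mel'].numpy().T))
        wavs.append(pause)
    return np.concatenate(wavs)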

ModuleNotFoundError: No module named 'model.generator'

I'm trying to run the audio synthesis with autoregressive transformer tts and melgan vocoder notebook, and I'm running into issues when I try to load the vocoder model. Specifically here:

sys.path.append(MelGAN_path)
import torch
import numpy as np

vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
vocoder.eval()

mel = torch.tensor(out['mel'].numpy().T[np.newaxis,:,:])

vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')

is the line that generates the following error:

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-aa82fe46f3ca> in <module>
      3 import numpy as np
      4 
----> 5 vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
      6 vocoder.eval()
      7 

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\site-packages\torch\hub.py in load(github, model, *args, **kwargs)
    363     sys.path.insert(0, repo_dir)
    364 
--> 365     hub_module = import_module(MODULE_HUBCONF, repo_dir + '/' + MODULE_HUBCONF)
    366 
    367     entry = _load_entry_from_hubconf(hub_module, model)

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\site-packages\torch\hub.py in import_module(name, path)
     73         spec = importlib.util.spec_from_file_location(name, path)
     74         module = importlib.util.module_from_spec(spec)
---> 75         spec.loader.exec_module(module)
     76         return module
     77     elif sys.version_info >= (3, 0):

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\importlib\_bootstrap_external.py in exec_module(self, module)

~\AppData\Local\Continuum\anaconda3\envs\Transformer TTS\lib\importlib\_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

~/.cache\torch\hub\seungwonpark_melgan_master/hubconf.py in <module>
      1 dependencies = ['torch']
      2 import torch
----> 3 from model.generator import Generator
      4 
      5 model_params = {

ModuleNotFoundError: No module named 'model.generator'

I've followed all the steps exactly. I have no idea what's causing this error, and any sort of help would be much appreciated.

Error during Training

During training, I ran the command !python train_autoregressive.py --config config/melgan
and got the following error:

starting training from scratch

TRAINING
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2102, in execution_mode
    yield
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 758, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2610, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 772, in next
    return self._next_internal()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 764, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2105, in execution_mode
    executor_new.wait()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/executor.py", line 67, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_autoregressive.py", line 116, in <module>
    _ = train_dataset.next_batch()
  File "/content/TransformerTTS/preprocessing/data_handling.py", line 33, in next_batch
    return next(self.data_iter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 736, in __next__
    return self.next()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 774, in next
    raise StopIteration
StopIteration

I made some changes to the create_dataset.py file, since my metadata.csv file has two columns, one being the filename and the other the audio caption, while the audio files themselves are in .flac format.
Here is a sample row from the metadata.csv file: 84-121123-0000 GO DO YOU HEAR

Multiple Vocal voices

It seems that the current implementation is designed for a single voice, like the LJSpeech dataset you used; that dataset is 24 hours of audio recordings of a single speaker.

I have a dataset of hundreds of speakers, each with less than an hour of audio recordings.

Can you train a multi-speaker model?

If not, is it possible to fine-tune your pre-trained model on a single speaker with less than 1 hour of audio recordings?

Multi-Gpu improper utilization

Hey, I am training the autoregressive MelGAN model on a system with two RTX 2080 Ti GPUs, CUDA 11 and TensorFlow GPU 2.2. The model gets initialized and both GPUs are detected properly:

TRAINING
2020-06-19 14:47:24.630599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-19 14:47:24.630609: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-19 14:47:24.630621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-19 14:47:24.630654: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.630991: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.631327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.631662: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.631971: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-19 14:47:24.631992: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-19 14:47:24.633164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-19 14:47:24.633173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1
2020-06-19 14:47:24.633178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N Y
2020-06-19 14:47:24.633181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   Y N
2020-06-19 14:47:24.633255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.633604: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.633945: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.634312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10186 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-06-19 14:47:24.634655: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 14:47:24.634981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10210 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
2 Physical GPUs, 2 Logical GPUs

Out of the total 11GB of memory, one GPU uses up to 10GB. However, the other GPU only has 155MB occupied throughout the training process. If I increase the batch size, it throws an OOM error without using the other GPU, which is free.

[Screenshot: GPU utilization, 2020-06-19 7:54 PM]

Can you please help with what could be going wrong here?

Regarding mel start and end token

Hey, I saw you were using mel start/end values of 4/-4; now I think it's 0.5/-0.5. You are normalizing the mel spectrogram in the range -4 to +4, so don't you think using the above token values will cause problems?

I trained on VCTK today with r=1 from the start, but my predictions are usually the pad values. Did you have any similar issue at some point?
I am attaching some images if you can help me in some way...

[attached: test and train prediction images]

I padded the wav files with 12.5 ms of value 0.

AssertionError when running extract_durations.py

Traceback (most recent call last):
  File "extract_durations.py", line 160, in <module>
    fix_jumps=fix_jumps)
  File "/home/ubuntu/TransformerTTS/utils/alignments.py", line 131, in get_durations_from_alignment
    binary_attn, binary_score = binary_attention(ref_attention_weights)
  File "/home/ubuntu/TransformerTTS/utils/alignments.py", line 82, in binary_attention
    np.sum(attention_weights.T == attention_peak_per_phoneme, axis=0) != 1) == 0  # single peak per mel step
AssertionError

Happens when running
python extract_durations.py --config ../ljspeech_melgan_autoregressive_transformer/melgan --binary --fix_jumps --fill_mode_next

on an autoregressive model trained to step 1,110,000 on a new dataset (restored from checkpoint 900k from the released model weights, commit 1c1cb03).

It also happens when using just the released 900k checkpoint with no training on the new dataset.

Any ideas what might be wrong? Does it need more training?

Average training time in Google Colab with GPU

I am working in Colab and, for now, I'm trying to train the model with the LJSpeech dataset (just for a trial; later I will use custom data).

I used the parameters from the config files, with max_steps: 900_000, for melgan/autoregressive. This is my first TTS model experience, so I wanted to ask about training time. How many minutes/hours are expected for the total training time of the models?

Audio Alignment

Hey, what steps should we use to align the audios (non-English)? I see there is something called "Compute alignment dataset" which you use for the forward model.

What exactly does that help with? And there are two types of mel: one is predicted and the other is GT. If we are training from scratch, I assume we should use use_GT when running extract_durations.py.

RuntimeError: main thread is not in main loop

Hi, I frequently get the following error message during training: RuntimeError: main thread is not in main loop, which appears in tkinter's __init__.py. "Frequently" means roughly every 20000 steps.

Are you familiar with this error? I'm training on CentOS 7 with CUDA 10.1, using a single NVIDIA Titan RTX.
