comprehensive-transformer-tts's Introduction

Comprehensive-Transformer-TTS - PyTorch Implementation

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS. Any suggestions toward the best Non-AR TTS are welcome :)

Transformers

Prosody Modelings (WIP)

Supervised Duration Modelings

Unsupervised Duration Modelings

  • One TTS Alignment To Rule Them All (Badlani et al., 2021): We are finally freed from external aligners such as MFA! Validation alignments for LJ014-0329 up to 70K steps are shown below as an example.

Transformer Performance Comparison on LJSpeech (1 TITAN RTX 24G / 16 batch size)

Model Memory Usage Training Time (1K steps)
Fastformer (lucidrains') 10531MiB / 24220MiB 4m 25s
Fastformer (wuch15's) 10515MiB / 24220MiB 4m 45s
Long-Short Transformer 10633MiB / 24220MiB 5m 26s
Conformer 18903MiB / 24220MiB 7m 4s
Reformer 10293MiB / 24220MiB 10m 16s
Transformer 7909MiB / 24220MiB 4m 51s
Transformer_fs2 11571MiB / 24220MiB 4m 53s

Toggle the type of building block by

# In the model.yaml
block_type: "transformer_fs2" # ["transformer_fs2", "transformer", "fastformer", "lstransformer", "conformer", "reformer"]

Toggle the type of prosody modeling by

# In the model.yaml
prosody_modeling:
  model_type: "none" # ["none", "du2021", "liu2021"]

Toggle the type of duration modeling by

# In the model.yaml
duration_modeling:
  learn_alignment: True # True for unsupervised modeling, and False for supervised modeling

Quickstart

In the following sections, DATASET refers to the name of a dataset such as LJSpeech or VCTK.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Also, a Dockerfile is provided for Docker users.
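For example, a minimal usage sketch (the image tag and mount path below are illustrative, not prescribed by the repo):

docker build -t comprehensive-tts .
docker run --gpus all -it -v "$(pwd)":/workspace comprehensive-tts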

Inference

You have to download the pretrained models and put them in output/ckpt/DATASET/. The models are trained under unsupervised duration modeling with the "transformer_fs2" building block.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be placed in output/result/.
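For reference, a minimal sketch for looking up a speaker id, assuming speakers.json maps speaker names to the integer ids expected by --speaker_id (the dataset and speaker names below are just examples):

import json

# Load the name-to-id mapping produced during preprocessing
with open("preprocessed_data/VCTK/speakers.json") as f:
    speakers = json.load(f)
print(speakers["p225"])  # pass the printed integer as --speaker_id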

Batch Inference

Batch inference is also supported; try

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
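A pitch ratio can be passed in the same way, assuming the --pitch_control flag mirrors the other control flags:

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --pitch_control 1.2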

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.

Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following LJSpeech or VCTK, respectively. Moreover, your own language and dataset can be adapted following here.

Preprocessing

  • For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

  • Run

    python3 prepare_align.py --dataset DATASET
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DATASET
    

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

  • To use Automatic Mixed Precision, append the --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, prepend CUDA_VISIBLE_DEVICES=<GPU_IDs> to the above command, as in the example below.
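For example, to train on GPUs 0 and 1 with mixed precision:

CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DATASET --use_amp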

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.

LJSpeech

VCTK

Ablation Study

ID Model Block Type Pitch Conditioning
1 LJSpeech_transformer_fs2_cwt transformer_fs2 continuous wavelet transform
2 LJSpeech_transformer_cwt transformer continuous wavelet transform
3 LJSpeech_transformer_frame transformer frame-level f0
4 LJSpeech_transformer_ph transformer phoneme-level f0

Observations from

  1. changing building block (ID 1~2): "transformer_fs2" seems to be more optimized in terms of memory usage and model size, so the training time and mel losses decrease. However, the output quality is not improved dramatically, and the "transformer" block sometimes generates speech with an even more stable pitch contour than "transformer_fs2".
  2. changing pitch conditioning (ID 2~4): There is a trade-off between audio quality (pitch stability) and expressiveness.
    • audio quality: "ph" >= "frame" > "cwt"
    • expressiveness: "cwt" > "frame" > "ph"

Notes

  • Both phoneme-level and frame-level variances are supported in both supervised and unsupervised duration modeling.
  • Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
  • Unsupervised duration modeling at the phoneme level takes longer than at the frame level, since the additional computation of phoneme-level variance happens at runtime.
  • There are two speaker-embedding options for the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle it in the config (between 'none' and 'DeepSpeaker'); see the sketch after this list.
  • DeepSpeaker on the VCTK dataset shows clear identification among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.

  • For vocoder, HiFi-GAN and MelGAN are supported.
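A hedged sketch of what the speaker-embedder toggle could look like; the key names below are illustrative only and should be checked against the shipped config files:

# In the config (key names are illustrative, not taken from the repo)
multi_speaker: True
speaker_embedder: "DeepSpeaker" # or "none" to train a speaker embedder from scratch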

Updates Log

  • Mar.05, 2022 (v0.2.1): Fix and update codebase & pre-trained models with demo samples

    1. Fix variance adaptor to make it work with all combinations of building block and variance type/level
    2. Update pre-trained models with demo samples of LJSpeech and VCTK under "transformer_fs2" building block and "cwt" pitch conditioning
    3. Share the result of ablation studies of comparing "transformer" vs. "transformer_fs2" paired among three types of pitch conditioning ("frame", "ph", and "cwt")
  • Feb.18, 2022 (v0.2.0): Update data preprocessor and variance adaptor & losses following keonlee9420's DiffSinger / Add various prosody modeling methods

    1. Prepare two different types of data pipeline in the preprocessor to support both unsupervised and supervised duration modeling
    2. Adopt wavelet for pitch modeling & loss
    3. Add fine-trained duration loss
    4. Apply var_start_steps for better model convergence, especially under unsupervised duration modeling
    5. Remove dependency of energy modeling on pitch variance
    6. Add "transformer_fs2" building block, which is more close to the original FastSpeech2 paper
    7. Add two types of prosody modeling methods
    8. Loss comparison on validation set:
    • LJSpeech - blue: v0.1.1 / green: v0.2.0

    • VCTK - skyblue: v0.1.1 / orange: v0.2.0

  • Sep.21, 2021 (v0.1.1): Initialize with ming024's FastSpeech2

Citation

Please cite this repository using the "Cite this repository" button in the About section (top right of the main page).


comprehensive-transformer-tts's Issues

Gibberish synthesized speech from my own model

Hi,
I am training a model on the ryanspeech dataset. Currently it is at 125k+ steps, and I tried to synthesize speech with the checkpoint, but the result is rather hard to understand.

output.mp4

I tried adding --duration_control 1.3 to the command, but I got

Traceback (most recent call last):
  File "synthesize.py", line 231, in <module>
    synthesize(device, model, args, configs, vocoder, batchs, control_values)
  File "synthesize.py", line 95, in synthesize
    output = model(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Comprehensive-Transformer-TTS/model/CompTransTTS.py", line 112, in forward
    ) = self.variance_adaptor(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Comprehensive-Transformer-TTS/model/modules.py", line 1088, in forward
    pitch_prediction, pitch_embedding = self.get_pitch_embedding(
  File "/root/Comprehensive-Transformer-TTS/model/modules.py", line 933, in get_pitch_embedding
    f0_denorm = denorm_f0(f0, uv, self.preprocess_config["preprocessing"]["pitch"], pitch_padding=pitch_padding)
  File "/root/Comprehensive-Transformer-TTS/utils/pitch_tools.py", line 79, in denorm_f0
    f0[uv > 0] = 0
IndexError: The shape of the mask [1, 154] at index 1 does not match the shape of the indexed tensor [1, 173] at index 1

My config is

block_type: "transformer_fs2"

duration_modeling:
  learn_alignment: False
  aligner_temperature: 0.0005

prosody_modeling:
  model_type: "liu2021"

What am I missing?
Thank you!

Prosody Loss

Hi, I am adding your MDN prosody modeling code segment to my Tacotron, but I ran into several questions about the prosody modeling code. First, the prosody loss is added to the total loss only after prosody_loss_enable_steps, but in the training steps before prosody_loss_enable_steps the prosody representation is already added to the text encoding. Does that mean that, in the steps before prosody_loss_enable_steps, the prosody representation is optimized without the prosody loss?
Second, during training, the gradient flowing back from the prosody predictor should act like a "stop gradient", but there seems to be little code for this.
Thanks!


weird sounding voices with MelGAN

Hello,

Audio samples generated with the multi-speaker MelGAN (I haven't tried single-speaker) sound unnatural.

I know worse quality is expected, but all samples sound as if the pitch is significantly too high.

Maybe there is a bug in the implementation ported from FastSpeech?

Weird sound in LONG sentence

Hi, this is really nice work!
In my experiment, the speech goes weird after 10s (short sentences are all good). Losses decrease normally, and I checked the predicted duration/pitch/energy; they were all good as well. Only the mel goes weird.
Have you ever encountered this kind of problem with long sentences?

Unvoiced loss is too high for me.

Hello, I'm trying to train a TTS model with frame-level pitch prediction (not cwt) for my Spanish dataset.

First, I made a small modification for training: before var_start_steps, I just detach the encoder input from the variance predictor instead of setting init_losses, like below.

if step < self.config.var_start_steps:
    x_pitch = x.detach()       
else:
    x_pitch = x

pitch_prediction, pitch_embed = self.get_pitch_embedding(x_pitch, pitch_target, uv_target, pitch_control)

When I do this and observe the losses, the mel spectrogram synthesized from the ground-truth pitch, uv, and duration is perfect, and the pitch and duration losses descend satisfactorily even before var_start_steps.

But only the unvoiced (uv) loss starts at 0.9 and descends too slowly in my case (and does not descend before var_start_steps), over 50k-100k steps. And the mel synthesized in eval mode has no pitch (uv is almost 1).

Has anyone had the same problem?
Any help would be appreciated.

The preprocessed data in my dataset is shown in the image below. I think the ground-truth unvoiced segments have no problem.
image

Thank you.

Bug in calculating the energy in FastSpeechSTFT

I think there is a bug in audio/stft.py:252:
energy = np.sqrt(np.exp(mel) ** 2).sum(-1)
This code does nothing but sum the absolute values of np.exp(mel), whereas we expect the sum to be taken before the sqrt.
The correct code should be
energy = np.sqrt((np.exp(mel) ** 2).sum(-1))
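To illustrate the difference, a minimal NumPy check with toy values (not taken from the repo):

import numpy as np

mel = np.log(np.array([[3.0, 4.0]]))  # one toy mel frame with two bins

wrong = np.sqrt(np.exp(mel) ** 2).sum(-1)    # sums |3| + |4| -> [7.]
right = np.sqrt((np.exp(mel) ** 2).sum(-1))  # sqrt(9 + 16)   -> [5.]
print(wrong, right)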

About the duration predictor

In "learn_alignment: True" mode, the input of the duration predictor is "x.detach() + self.predictor_grad * (x - x.detach())".

  1. Why do we need detach()?
  2. Why do we need to add self.predictor_grad * (x - x.detach()), since it is always zero?
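For reference, a minimal PyTorch sketch with toy values showing what this expression does: the forward value is identical to x, while the gradient flowing back into x is scaled by predictor_grad, so it is not a plain detach.

import torch

g = 0.1  # stands in for self.predictor_grad
x = torch.tensor([1.0, 2.0], requires_grad=True)

y = x.detach() + g * (x - x.detach())  # numerically equal to x
y.sum().backward()

print(y)       # tensor([1., 2.], grad_fn=...)
print(x.grad)  # tensor([0.1000, 0.1000]) -- gradient scaled by g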

Preprocess error

█████████| 137/137 [18:49:05<00:00, 494.49s/it]
Computing statistic quantities ...
Traceback (most recent call last):
  File "preprocess.py", line 19, in <module>
    preprocessor.build_from_path()
  File "/GPUFS/sysu_hpcedu_123/Comprehensive-Transformer-TTS/preprocessor/preprocessor.py", line 267, in build_from_path
    f0s_sup_stats = compute_f0_stats(f0s_sup)
  File "/GPUFS/sysu_hpcedu_123/Comprehensive-Transformer-TTS/preprocessor/preprocessor.py", line 145, in compute_f0_stats
    return (f0_mean, f0_std)
UnboundLocalError: local variable 'f0_mean' referenced before assignment

It crashed after running preprocess for a long time.
For the dataset, I use one similar to VCTK but in Chinese, and it did not throw any error before this step.
Could anyone help me?

requirements fail to install

It seems some packages have updated names, a specific Python version appears to be required, and C++ build tools are needed.
Suggested updates:

python~3.8.0 and <3.9

praat-parselmouth==0.3.3
g2p-en==2.1.0
scikit-learn==0.22.2.post1

if possible, since it will not install 1.7:
torch>=1.7.0 (==2.0.0)

RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1

Hi,
Thanks for the great work.
I encountered an error when training on the ryanspeech dataset:

Traceback (most recent call last):
  File "train.py", line 254, in <module>
    train(0, args, configs, batch_size, num_gpus)
  File "train.py", line 110, in train
    losses = Loss(batch, output, step=step)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Comprehensive-Transformer-TTS/model/loss.py", line 334, in forward
    pitch_loss = self.get_pitch_loss(pitch_predictions, pitch_targets)
  File "/root/Comprehensive-Transformer-TTS/model/loss.py", line 197, in get_pitch_loss
    losses["uv"] = (F.binary_cross_entropy_with_logits(uv_pred, uv, reduction="none") * nonpadding) \
RuntimeError: The size of tensor a (1191) must match the size of tensor b (1000) at non-singleton dimension 1

I printed the shape of both uv_pred and uv, and they were both [16, 1191].

My configuration is

 ---> Automatic Mixed Precision: True
 ---> Number of used GPU: 1
 ---> Batch size per GPU: 16
 ---> Batch size in total: 16
 ---> Type of Building Block: conformer
 ---> Type of Duration Modeling: supervised
 ---> Type of Prosody Modeling: liu2021

This happened at around 50k+ steps.
What am I missing? Thank you!

Multi-GPU training does not work normally?

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel;
As suggested, I modified the model by adding find_unused_parameters=True, as follows: model = DistributedDataParallel(model, device_ids=[rank], find_unused_parameters=True).to(device), but I still get the same error. Were you able to train normally with multiple GPUs? Any suggestions to fix this?
Many thanks.

Problem with Utterance-level Prosody extractor of DelightfulTTS

I've recently been experimenting with your implementation of DelightfulTTS and the voice quality is awesome. However, I found that the embedding vector output of the utterance-level prosody extractor is very small, which makes that of the utterance-level prosody predictor small as well (the L2 norm is roughly 12 and each element of the vector is roughly 0.2 to 0.3). A vector with elements close to zero means this layer mostly doesn't add any information at all. Have you found any solution to this?

Mixture density network

Hello, first of all, thank you for sharing the code.
The loss on the MDN side becomes NaN or infinity, so I applied clipping and trained; when I looked at the results, it seems no sound is generated at all. May I ask what problem you would expect here? I'm having difficulty modifying and training it myself, so I would appreciate your help.

Reason for std and input scaling in cwt?

Hey, I have some questions about your pitch predictor in the cwt domain:

decoder_inp = decoder_inp.detach() + self.predictor_grad * (decoder_inp - decoder_inp.detach())
pitch_padding = mel2ph == 0


if self.pitch_type == "cwt":
    pitch_padding = None
    cwt = cwt_out = self.cwt_predictor(decoder_inp) * control
    stats_out = self.cwt_stats_layers(encoder_out[:, 0, :])  # [B, 2]
    mean = f0_mean = stats_out[:, 0]
    std = f0_std = stats_out[:, 1]
    cwt_spec = cwt_out[:, :, :10]
    if f0 is None:
        std = std * self.cwt_std_scale
        f0 = cwt2f0_norm(

I have three questions:

  1. What is the reason for the first line? Isn't the right side always zero and therefore no gradients flow back?
  2. Why do you scale inputs by 0.1?
  3. Why did you scale ground truth std by 0.8?

Thanks for any help in advance!

Errors when running preprocess.py

I'm trying to preprocess the VCTK dataset and I'm stuck at the 'Computing statistic quantities' step. When I copy the preprocessed_data files from the repo instead, training runs successfully.

Firstly, there is a runtime warning:

preprocessor.py

625: cont_lf0_lpf_norm = (cont_lf0_lpf - logf0s_mean_org) / logf0s_std_org
RuntimeWarning: invalid value encountered in true_divide

After applying a simple crutch to fix the value of logf0s_std_org, the next error appears:

165: energy_mean = energy_scaler.mean_[0]
'StandardScaler' object has no attribute 'mean_'

Windows 10
conda Python 3.6.15
all packages from requirements.txt are installed
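For reference, a minimal scikit-learn sketch showing that mean_ only exists after the scaler has seen data; the error above therefore suggests the energy scaler was never fed any frames during preprocessing (an assumption about the cause, not a confirmed fix):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# scaler.mean_  # would raise AttributeError: the scaler has not been fitted yet
scaler.partial_fit([[1.0], [2.0]])
print(scaler.mean_)  # [1.5]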

unsupervised learn_alignment inference error

Dataset: LJSpeech
I use lstransformer and learn_alignment: True.
I do not use any prosody or variance embeddings, with settings like:
loss:
  lambda_uv: 0.0
  lambda_ph_dur: 0.0
  lambda_word_dur: 0.0
  lambda_sent_dur: 0.0

variance_embedding:
  use_pitch_embed: False
  use_energy_embed: False

But at inference, the alignment learned by the LengthRegulator was very incorrect, often only 4-5 frames.
Below are training TensorBoard screenshots:

Screenshot 2023-03-01 at 15 05 19

Screenshot 2023-03-01 at 15 05 40

Screenshot 2023-03-01 at 15 05 44
