xcmyz / FastSpeech
The implementation of FastSpeech based on PyTorch.
License: MIT License
Why are the WAV results in the results folder so bad? Is it because FastSpeech simply can't produce results as good as Tacotron2's? Would the results from the paper's authors be better than yours?
Can FastSpeech synthesize in real time on a CPU?
Hello! Can you explain what the role of alignment.py is, and how to use it?
Hello, I would like to ask: after how many training steps can we get a good result? Right now the output file is all noise.
After reproducing this, did you compare the synthesis speed with Tacotron? My reproduction is actually a bit slower than Tacotron at synthesis.
```python
class Decoder(nn.Module):
    """ Decoder """

    def forward(self, enc_seq, enc_pos, return_attns=False):
        # ...
        # -- Prepare masks
        slf_attn_mask = get_attn_key_pad_mask(seq_k=enc_pos, seq_q=enc_pos)
        non_pad_mask = get_non_pad_mask(enc_pos)
```
Why is enc_pos used to compute the masks? Shouldn't it be enc_seq?
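For what it's worth, in the Transformer codebase this implementation is based on, padding is marked by 0 in both the token sequence and the 1-based position sequence, so both yield the same padding mask. A minimal sketch under that assumption:

```python
import torch

PAD = 0  # padding id shared by token ids (enc_seq) and positions (enc_pos)

def get_non_pad_mask(seq):
    # Works on either enc_seq or enc_pos: padded entries are 0 in both,
    # so the resulting masks are identical.
    return seq.ne(PAD).type(torch.float).unsqueeze(-1)
```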
Hi, I downloaded the pre-trained FastSpeech model, checkpoint_112000.pth.tar. When I run `tar -xf checkpoint_112000.pth.tar` on Linux, I get this error:

```
tar: This does not look like a tar archive
tar: Skipping to next header
tar: Exiting with failure status due to previous errors
```

So how do I decompress it?
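For reference, the `.pth.tar` suffix is just a naming convention inherited from the PyTorch example code; the file is an ordinary PyTorch checkpoint, not a tar archive, so it is loaded with `torch.load` rather than `tar`. A sketch (the checkpoint key name is a guess; inspect `checkpoint.keys()` to confirm):

```python
import torch

# Regular PyTorch checkpoint despite the ".tar" in the name; no extraction needed.
checkpoint = torch.load("checkpoint_112000.pth.tar", map_location="cpu")
print(checkpoint.keys())                       # inspect what the checkpoint holds

# model = FastSpeech()                         # instantiate the network first
# model.load_state_dict(checkpoint["model"])   # "model" is a guessed key name
```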
alignment.py returns the attention weights as a float matrix; should I convert them to int, like the alignment_targets/0.npy files?
Hi xcmyz,
I have trained a model to step 172000 using train.py, and I want to use this model for synthesis, but when I run synthesis.py I get a TypeError during synthesis:
"TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_pos'"
Did I miss something during training or synthesis? Thanks.
```
(Tacotron) [wann31828@glogin1 FastSpeech]$ python synthesis.py
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
Model Have Been Loaded.
Traceback (most recent call last):
  File "synthesis.py", line 91, in <module>
    synthesis_griffin_lim(words, model, alpha=1.0, mode="normal")
  File "synthesis.py", line 45, in synthesis_griffin_lim
    mel, mel_postnet = model(text, pos, alpha=alpha)
  File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/wann31828/anaconda3/envs/Tacotron/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'src_seq' and 'src_pos'
```
My PyTorch version is 1.0.1. Could a PyTorch version mismatch be the reason the model is wrong?
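For reference, this kind of TypeError is often reported when an `nn.DataParallel`-wrapped model's inputs fail to scatter across the GPU replicas as expected, so some replica's `forward()` is invoked without its positional arguments. A sketch of two workarounds commonly suggested in DataParallel-related issues (not necessarily the repo author's intended fix):

```python
import torch

# Option 1: bypass DataParallel for single-sample inference.
net = model.module if isinstance(model, torch.nn.DataParallel) else model
mel, mel_postnet = net(text, pos, alpha=1.0)

# Option 2: make only one GPU visible before launching:
#   CUDA_VISIBLE_DEVICES=0 python synthesis.py
```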
Hello,
Thanks a lot for publishing this amazing repo.
I just have a little concern that I would like to let you know about.
I see that under the 'paper' directory you have uploaded some papers.
I don't think that is a safe thing to do regarding copyright.
The copyright terms for each paper may differ, but just to be safe, how about replacing the papers with links to them, or just the names of the papers?
I ran train_accelerated.py and started training. However, the estimated time is more than 9,400,000 s (about 100 days). Could you describe the training process and the training time? After how many steps can we get a good result? I ran the full training on 2 GPUs (Tesla V100).
Are these scripts used to generate mels and wavs once the model is ready?
I cannot access your pretrained model, which is on Baidu. Is there another way to get that file?
Can FastSpeech + WaveRNN run in real time? How fast is it?
It says the file is broken. Would it be possible for you to re-upload it?
Thanks!
Hi, thanks for your good implementation. Such nice work!
I have some questions about the alignments (used when extracting durations from the pretrained model).
As I understand it, the proposed d_i (duration) can only be applied when the attended index increases monotonically. Since the alignment is not a perfect diagonal, the argmax index does not increase monotonically. How did you solve this case?
Thank you.
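For context, a common duration extractor in FastSpeech reimplementations (I can't confirm it is exactly what alignment.py does) sidesteps the monotonicity issue by counting, per encoder position, how many decoder frames attend to it most strongly; the counts form a valid integer duration vector even when the argmax sequence wobbles. This also answers the earlier question about float vs. int targets:

```python
import numpy as np

def alignment_to_duration(attn):
    # attn: (decoder_steps, encoder_steps) Tacotron2 attention weights.
    # For each decoder frame, take the argmax encoder index, then count how
    # many frames landed on each encoder position. Non-monotonic argmax
    # sequences still give a valid duration vector, because only the
    # per-position counts are kept.
    idx = np.argmax(attn, axis=1)                      # (decoder_steps,)
    duration = np.bincount(idx, minlength=attn.shape[1])
    return duration.astype(np.int64)                   # sums to decoder_steps
```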
Can you try the BiaoBei Chinese dataset?
Hello, I want to ask: is it real-time on a CPU? Thank you.
In train.py:

```python
if args.frozen_learning_rate:
    scheduled_optim.step_and_update_lr_frozen(
        args.learning_rate_frozen)
else:
    scheduled_optim.step_and_update_lr()
```
I am now trying to improve the quality of the generated waves by training for more steps, but: the public BiaoBei Chinese TTS dataset has .interval information; can these data be used to train the alignment?
In the results folder there are no WaveGlow output samples. I saw there is a function synthesis_waveglow. Can you please output WaveGlow-synthesized results too?
@xcmyz Hello, when I run alignment.py I get this error output:

```
sequence size (1, 52)
alignment size (276, 52)
[[8.4814453e-01 4.4107437e-04 2.3097992e-03 ... 6.3121319e-05
  6.3180923e-04 6.0485840e-02]
 [5.4394531e-01 3.4637451e-02 4.3609619e-02 ... 8.9359283e-04
  1.2283325e-03 7.8308105e-02]
 [7.6904297e-01 5.2764893e-02 1.9561768e-02 ... 9.4366074e-04
  2.1877289e-03 2.1026611e-02]
 ...
 [9.5081329e-04 7.7486038e-07 4.0590763e-05 ... 2.9706955e-04
  1.7822266e-02 8.7695312e-01]
 [8.5830688e-04 5.9604645e-07 3.7193298e-05 ... 1.8274784e-04
  8.3541870e-03 8.7597656e-01]
 [7.3289871e-04 6.5565109e-07 3.7550926e-05 ... 1.7142296e-04
  5.7220459e-03 8.7060547e-01]]
```

How can I solve that? Thank you.
Hi, thank you for the excellent source!
You mentioned "Run alignment.py, it will spend 7 hours training on NVIDIA RTX2080ti" in README.md.
But this command yields only one alignment, for the text "I want to go to CMU to do research on deep learning." How do I fix alignment.py?
Hello, how exactly is the alignment from Tacotron2 used? After exporting it, is it passed directly into the existing model as alignment_target? I read the original paper, and there seems to be a step where a duration extractor converts the alignment into a phoneme duration sequence. Which part of your code does this correspond to?
Hello, I looked at the lr update algorithm: the learning rate keeps increasing. What is the reason for this design? Usually the learning rate is designed to decrease as training progresses.
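For context, the schedule in question looks like the Noam warm-up schedule from "Attention Is All You Need": the learning rate rises linearly for the first warm-up steps and only then decays with the inverse square root of the step, so it keeps growing only early in training. A sketch (the d_model and warmup values are assumptions, not the repo's exact hparams):

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    # step >= 1. Rises linearly until warmup_steps, then decays as 1/sqrt(step);
    # the peak learning rate is reached exactly at step == warmup_steps.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```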
Hello,
I tried training FastSpeech with alignment_target and the dataset on a Tesla V100, but I got the following error:
```
Traceback (most recent call last):
  File "train_accelerated.py", line 191, in <module>
    main(args)
  File "train_accelerated.py", line 109, in main
    length_target=alignment_target)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/fastspeech/FastSpeech/FastSpeech.py", line 33, in forward
    decoder_output = self.decoder(length_regulator_output, decoder_pos)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/fastspeech/FastSpeech/transformer/Models.py", line 141, in forward
    slf_attn_mask=slf_attn_mask)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/fastspeech/FastSpeech/transformer/Layers.py", line 125, in forward
    enc_input, enc_input, enc_input, mask=slf_attn_mask)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/fastspeech/FastSpeech/transformer/SubLayers.py", line 60, in forward
    output, attn = self.attention(q, k, v, mask=mask)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/fastspeech/FastSpeech/transformer/Modules.py", line 21, in forward
    attn = attn.masked_fill(mask, -np.inf)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 337, in masked_fill
    return self.clone().masked_fill_(mask, value)
RuntimeError: CUDA out of memory. Tried to allocate 316.50 MiB (GPU 0; 15.75 GiB total capacity; 14.15 GiB already allocated; 146.88 MiB free; 317.24 MiB cached)
```
Could you explain the reason to me?
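For what it's worth, the failing allocation is inside the decoder self-attention, whose score and mask tensors grow with the square of the expanded mel length. A rough back-of-the-envelope estimate (batch size, head count, and mel length here are illustrative assumptions, not the repo's settings):

```python
# Size of one (batch * n_heads, mel_len, mel_len) float32 attention-score tensor:
batch, n_heads, mel_len = 32, 2, 1800
bytes_needed = batch * n_heads * mel_len * mel_len * 4
print(f"{bytes_needed / 2**20:.0f} MiB")  # ~791 MiB for a single layer
```

So lowering the batch size (or capping the maximum mel length per batch) is the usual first mitigation.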
Where can I find the model ./model_new/checkpoint_148000.pth.tar?
I noticed that this repository removed license.txt (MIT).
Do you plan to change the license?
@xcmyz
There has been a great improvement since I tried this branch in June. Great work!
There is one thing I am completely confused about. You use the log of the mel as the feature, while Tacotron2 uses the dB of the mel with normalization to [-4, 4]. What are the differences? Does a Tacotron-style feature work for FastSpeech? Is the log mel better for FastSpeech?
After analyzing, I found the reason for the slow training speed: the code is not optimized for parallel data training. I optimized the distributed data-parallel loader and will create a pull request later.
BTW, the authors of the FastSpeech paper should publish their code soon; they say they will publish it once the paper is accepted. This repo is a very good resource, but it needs more maintenance and optimization.
I hope the author can take part in more of the discussions on GitHub and Zhihu. Thank you very much.
I tried TTS with WaveGlow as follows, but I got a noisy result. Could you explain the reason to me?
```python
def synthesis_waveglow(text_seq, model, waveglow, alpha=1.0, mode=""):
    denoiser = Denoiser(waveglow)

    text = text_to_sequence(text_seq, hp.text_cleaners)
    text = text + [0]
    text = np.stack([np.array(text)])
    text = torch.from_numpy(text).long().to(device)

    pos = torch.stack([torch.Tensor([i + 1 for i in range(text.size(1))])])
    pos = pos.long().to(device)

    model.eval()
    with torch.no_grad():
        _, mel_postnet = model(text, pos, alpha=alpha)
    with torch.no_grad():
        # wav = waveglow.infer(mel_postnet, sigma=0.666)
        wav = waveglow.infer(
            torch.transpose(mel_postnet, 1, 2).type(torch.cuda.HalfTensor),
            sigma=0.666)
    print("Wav Have Been Synthesized.")

    if not os.path.exists("results"):
        os.mkdir("results")
    wav_denoised = denoiser(wav, strength=0.01)[:, 0]
    # audio.save_wav(wav[0].data.cpu().numpy(), os.path.join(
    #     "results", text_seq + mode + ".wav"))
    audio.save_wav(wav_denoised[0].cpu().numpy(), os.path.join(
        "results", text_seq + mode + ".wav"))
```
Thank you
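One thing worth checking here (my assumption, not a confirmed diagnosis): NVIDIA's pre-trained WaveGlow expects 22050 Hz, 80-band log-mel spectrograms computed with Tacotron2's STFT settings, so any mismatch in sample rate or mel scaling between this repo's features and those settings can come out as noise. A quick sanity check on the feature range:

```python
# Tacotron2-style log-mels lie roughly in [-11.5, 2] (natural log, clamped
# at 1e-5); values far outside that range suggest the features don't match
# what the pre-trained WaveGlow was trained on.
print(mel_postnet.min().item(), mel_postnet.max().item())
```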
First of all, thanks for your quick and great implementation.
The default sample rate is 22050 for the LJSpeech-1.1 wavs, for Tacotron2/hparams.py, and presumably for the pre-trained Tacotron2 model published by NVIDIA.
So, should we change sample_rate=20000 to 22050 in FastSpeech/hparams.py? Other related params may also need to change correspondingly; see the sketch below.
Please check and help with this issue, thank you.
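As an illustration of "other related params": time-domain STFT settings are usually specified in samples and therefore scale with the sample rate. A sketch (the frame-shift/length values and variable names are assumptions, not the repo's actual hparams):

```python
sample_rate = 22050
frame_shift_ms, frame_length_ms = 12.5, 50.0
hop_length = int(sample_rate * frame_shift_ms / 1000)   # 275 samples
win_length = int(sample_rate * frame_length_ms / 1000)  # 1102 samples
```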
After decompressing it, what corresponding changes still need to be made to the code?
Hi, I am implementing FastSpeech with TensorFlow. When rewriting the length regulator module, I ran into a problem: the TF graph is static, so I cannot get the real tensor shape before feeding data, since I define the inputs with placeholders of shape=[None, None, None].
My code is:

```python
def len_regulator(self, phoneme_seqs, duration_seqs, alpha=1.0, max_mel_length=None):
    D = tf.keras.backend.round(tf.scalar_mul(alpha, duration_seqs))
    # grouping
    pho_splits = tf.split(phoneme_seqs, num_or_size_splits=phoneme_seqs.shape.as_list()[-1], axis=0)
    dur_splits = tf.split(D, num_or_size_splits=D.shape.as_list()[-1], axis=0)
    repeats = [tf.ones(tf.cast(r, tf.int32), dtype=tf.float32) for r in dur_splits]
    expanded = list()
    for i, j in zip(pho_splits, repeats):
        expanded.append(tf.multiply(i, j))
    expanded = tf.concat(expanded, axis=0)
    return expanded
```

Can anybody help? Thanks in advance!
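Not the repo's code, but one way to sidestep the static-shape problem is tf.repeat (available in TF >= 1.15), which accepts the per-row repeat counts as a tensor, so no None dimension needs to be known at graph-build time. A single-example sketch:

```python
import tensorflow as tf

def length_regulator(phonemes, durations, alpha=1.0):
    # phonemes:  (T, D) encoder outputs; durations: (T,) predicted durations.
    # tf.repeat takes the counts as a tensor, avoiding shape.as_list() on
    # dimensions that are None at graph-construction time.
    counts = tf.cast(tf.round(alpha * tf.cast(durations, tf.float32)), tf.int32)
    return tf.repeat(phonemes, repeats=counts, axis=0)  # (sum(counts), D)
```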
In the original paper, the authors propose adding breaks between words, and verify that FastSpeech can add breaks between adjacent words by lengthening the duration of the space characters in a sentence, which can improve the prosody of the voice.
So, how do you add breaks between words in your code?
I would appreciate it if you could help me.
I've trained a new model and wrote a script to synthesize new speech using
`mel_output, mel_output_postnet = model(src_seq, src_pos)`.
But it takes about 4-6 seconds; is this an expected result? Thanks!
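In case it helps the measurement: GPU kernel launches are asynchronous, so a wall-clock timing should bracket the forward pass with synchronization and exclude model loading. A minimal sketch (variable names follow the snippet above):

```python
import time
import torch

torch.cuda.synchronize()                 # flush pending GPU work first
start = time.perf_counter()
with torch.no_grad():
    mel_output, mel_output_postnet = model(src_seq, src_pos)
torch.cuda.synchronize()                 # wait for the forward pass to finish
print(f"synthesis took {time.perf_counter() - start:.3f} s")
```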
Hello Zhengxi,
Will you provide the pre-trained model?
Hi, the training time for each step is slower when using multiple GPUs.
I'm running:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train_accelerated.py
```

and for a single GPU:

```
CUDA_VISIBLE_DEVICES=0 python3 train_accelerated.py
```

I can fit batch_size=16 on a single GPU, so I tried batch_size=16 for both the single- and multi-GPU runs. I also tried batch_size=64 for the multi-GPU case.
Any result samples?
I ran alignment.py, and the test.wav returned by the get_tacotron2_alignment_test() function is inaudible. Printing the mel outputs gives:

```
print(mel_outputs)
None
tensor([[[nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan],
         ...,
         [nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan],
         [nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0')
```
Thanks for the author's work.
But I find the default learning_rate_frozen (1e-3) is too big, causing the loss not to converge.
What learning rate did you use when training the model?
Good job! Can you provide some result samples?