jik876 / hifi-gan


HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

License: MIT License

Python 100.00%
speech-synthesis gan text-to-speech tts deep-learning hifi-gan pytorch vocoder

hifi-gan's Introduction

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract: Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install Python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset, then move all wav files to LJSpeech-1.1/wavs.
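
A typical setup might look like the following (a sketch; virtual-environment tooling and the dataset download step are up to you):

git clone https://github.com/jik876/hifi-gan.git
cd hifi-gan
pip install -r requirements.txt
# download and extract LJSpeech-1.1, then move its wav files into LJSpeech-1.1/wavs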

Training

python train.py --config config_v1.json

To train the V2 or V3 generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path by adding the --checkpoint_path option.
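
For example, to train the V2 generator and write checkpoints to a custom directory (the directory name here is illustrative):

python train.py --config config_v2.json --checkpoint_path cp_hifigan_v2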

Validation loss during training with the V1 generator is shown in validation_loss.png (included in this repository).

Pretrained Model

You can also use the pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

| Folder Name  | Generator | Dataset   | Fine-Tuned      |
|--------------|-----------|-----------|-----------------|
| LJ_V1        | V1        | LJSpeech  | No              |
| LJ_V2        | V2        | LJSpeech  | No              |
| LJ_V3        | V3        | LJSpeech  | No              |
| LJ_FT_T2_V1  | V1        | LJSpeech  | Yes (Tacotron2) |
| LJ_FT_T2_V2  | V2        | LJSpeech  | Yes (Tacotron2) |
| LJ_FT_T2_V3  | V3        | LJSpeech  | Yes (Tacotron2) |
| VCTK_V1      | V1        | VCTK      | No              |
| VCTK_V2      | V2        | VCTK      | No              |
| VCTK_V3      | V3        | VCTK      | No              |
| UNIVERSAL_V1 | V1        | Universal | No              |

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
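
A minimal loading sketch in Python (assuming the downloaded folder, e.g. UNIVERSAL_V1, contains a generator checkpoint such as g_02500000 and its config.json; exact file names vary by folder):

import json
import torch

from env import AttrDict
from models import Generator

config_path = "UNIVERSAL_V1/config.json"      # hypothetical paths; adjust to your download
checkpoint_path = "UNIVERSAL_V1/g_02500000"

with open(config_path) as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load(checkpoint_path, map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()   # for inference only; keep weight norm when fine-tuning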

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing (see the sketch after this list).
    The file name of each generated mel-spectrogram should match its audio file, and the extension should be .npy.
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create ft_dataset folder and copy the generated mel-spectrogram files into it.
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.
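
A minimal sketch of step 1, assuming an NVIDIA-style Tacotron2 whose teacher-forced forward pass returns (mel, mel_postnet, gate, alignments); adapt the unpacking and input preparation to whatever implementation you use:

import numpy as np
import torch

def save_gta_mel(tacotron2, inputs, out_path):
    # Teacher-forced (ground-truth-aligned) forward pass; `inputs` is a parsed
    # training batch for a single utterance in your Tacotron2's input format.
    with torch.no_grad():
        _, mel_postnet, _, _ = tacotron2(inputs)
    mel = mel_postnet.squeeze(0).cpu().numpy()   # shape: (num_mels, frames)
    np.save(out_path, mel)                       # e.g. ft_dataset/LJ001-0001.npy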

Inference from wav file

  1. Make test_files directory and copy wav files into the directory.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path by adding --output_dir option.
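
For example (paths are illustrative):

python inference.py --checkpoint_file cp_hifigan/g_02500000 --input_wavs_dir test_files --output_dir generated_files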

Inference for end-to-end speech synthesis

  1. Make test_mel_files directory and copy generated mel-spectrogram files into the directory.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS and so forth.
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path by adding --output_dir option.

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.

hifi-gan's People

Contributors

edresson, jik876


hifi-gan's Issues

Machine noise at 4 kHz

I encountered a bad case: there is a machine-like sound (a horizontal line) at 4 kHz in the attached spectrogram, during a breathing part. I have tried to fix it without success. Can you give some advice? Thanks.

(attached spectrogram image)

Tracing to torchscript

Has anyone been able to successfully convert the generator model to torchscript?

I receive a bizarre error. Tracing itself works:

import json
import torch

import env     # from the hifi-gan repo
import models  # from the hifi-gan repo

# Dummy mel input: log-mel silence is roughly -11.52 with the repo's settings
zero = torch.full((1, 80, 10), -11.52).cuda()
with open("hifi-gan/config.json") as f:
    data = f.read()
h = env.AttrDict(json.loads(data))
vocoder = models.Generator(h).cuda()
vocoder.load_state_dict(
    torch.load("hifi-gan/pretrained_universal/g_02500000")["generator"]
)
vocoder.remove_weight_norm()
vocoder.eval()
with torch.no_grad():
    traced_vocoder = torch.jit.trace(vocoder, zero)
    torch.jit.save(traced_vocoder, "vocoder.pth")

Trying to then load the model gives a weird error:

traced_vocoder = torch.jit.load("vocoder.pth")
/opt/conda/lib/python3.8/site-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
    159     cu = torch._C.CompilationUnit()
    160     if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 161         cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
    162     else:
    163         cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: Found character '45' in string, strings must be qualified Python identifiers

ExponentialLR and fine tune

Hi.
I started training the model from scratch and found that the optimizer uses a decaying learning rate. If I train the model for 2.5 million steps, then according to my calculations the learning rate drops to about 3e-7. Not only is this a very low learning rate that can cause floating-point issues, it also makes it impossible to adapt other speakers from such a checkpoint, because the learning rate is too small.
Does this mean that it is better to set lr_decay = 1.0?
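
For reference, the decayed rate can be estimated as below (a sketch, assuming the config_v1.json defaults learning_rate = 0.0002 and lr_decay = 0.999 with the scheduler stepped once per epoch; the epoch count here is hypothetical):

initial_lr = 2e-4
lr_decay = 0.999
epochs = 6500                             # hypothetical; depends on dataset and batch size
print(initial_lr * lr_decay ** epochs)    # ~3e-7, matching the estimate above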

Can the MEL spectrum be resampled?

Hello, the model you provided generalizes well, but its sampling rate differs from my project's. Is there any mel resampling algorithm that can convert the mel-spectrogram directly to the target sampling rate, such as 48000 Hz?

Window size is hardcoded (44 kHz HiFi-GAN)

Congratulations and thank you for this great work.
We were able to easily adapt it (doubling the FFT parameters and changing the upsample rates to [8,8,4,2]) to produce 44 kHz, 16-bit audio of very high quality.

The line below has the window size hardcoded to 1024; it should be replaced with win_size.

hann_window[str(y.device)] = torch.hann_window(1024).to(y.device)
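
With win_size, the corrected line would read:

hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)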

Can't train with output of Tacotron 2

The shape of the mel output of Tacotron2 is bigger than that of the mel extracted from audio, and training fails with:

 File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 8192 and 8119 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612

[Question] Dataset preprocessing

I've attempted to preprocess my dataset to meet the mel-spectrogram requirements, but I either wind up with incorrectly packed spectrogram files, a wrong header, or wrong data. I don't think any of the Tacotron2 implementations I can get my hands on will output the data in the required format, or I'm overlooking something obvious (which is equally likely 😌).

Could one of you helpful people provide a link to a working piece of code that takes care of this properly or could this repository be fleshed out more so that there is a working preprocessor for training datasets? 🤔

What is the input format?

I've seen a lot of general discussion about feeding generated mels into HiFi-GAN, and of course we can see the mel-spectrogram hparams in each config file, but nothing that actually says what the format of the input x to Generator(x) is. Is it (1, n_mels, frames)? Is normalization expected? Nothing I've tried works.
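
For what it's worth, the repository's inference.py builds the input with meldataset.mel_spectrogram, so the expected shape is (1, num_mels, frames) with log-compressed mel magnitudes as computed by that function. A sketch, assuming h and generator have been loaded from a checkpoint's config.json (e.g. as in the pretrained-model sketch earlier on this page):

import torch
from meldataset import MAX_WAV_VALUE, load_wav, mel_spectrogram

wav, sr = load_wav("test_files/LJ001-0001.wav")               # path is illustrative
wav = torch.FloatTensor(wav / MAX_WAV_VALUE).unsqueeze(0)     # (1, samples), floats in [-1, 1]
x = mel_spectrogram(wav, h.n_fft, h.num_mels, h.sampling_rate,
                    h.hop_size, h.win_size, h.fmin, h.fmax)   # (1, 80, frames)
y_g_hat = generator(x)                                        # (1, 1, samples)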

[Question] How to apply on 16k data?

Hi, thanks for sharing your impressive code.
I tried to apply HiFi-GAN to 16 kHz data with this config:
"upsample_rates": [8,5,5],
"upsample_kernel_sizes": [16,10,10],
"segment_size": 6400,
"hop_size": 200,
"win_size": 800,
"sampling_rate": 16000,

And it reports an error like:
Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 267, in main
    train(0, a, h)
  File "train.py", line 149, in train
    loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
  File "models.py", line 255, in feature_loss
    loss += torch.mean(torch.abs(rl - gl))
RuntimeError: The size of tensor a (1067) must match the size of tensor b (1068) at non-singleton dimension 2

Is there anything wrong with the modified config? Is it padding related?
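
For what it's worth, the usual consistency checks (a sketch, not part of the repo) pass for this config, which makes the one-frame mismatch more likely padding-related, e.g. the hardcoded 1024 window discussed in another issue on this page:

upsample_rates = [8, 5, 5]
hop_size = 200
segment_size = 6400

prod = 1
for r in upsample_rates:
    prod *= r
assert prod == hop_size                # 8 * 5 * 5 = 200: OK
assert segment_size % hop_size == 0    # 6400 / 200 = 32 frames: OK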

pretrained model

Hi, thanks for your work.
I only found pretrained models for the generator; are there any pretrained models for the discriminator?
Thanks!

Buzzing noise

Hi, thanks for your work.
I have trained 50k steps from scratch with the v1 config, and there is a buzzing noise in the generated audio. Can I get rid of the noise if I continue training? Do you have any suggestions?
Thanks for your help.

Training times v1 vs. v2 vs. v3?

Hello, I see it took you this amount of time to train v1:

It took about 13-14 days to train the model up to 2,500k steps with two V100 GPUs.

Since v2 and v3 have faster inference, does that mean training them would be faster too?

Errors when trying to load pretrained Universal model

Hi. I'm working in a torch==1.4.0 environment, as specified in requirements.txt.
I can successfully run inference with your pretrained generator_v1 and fine-tuned generator_v1.

However, when I try to load your universal generator model, the following error occurs:

RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fec128e9193 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7fec15c799eb in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7fec15c7ac04 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c53a6 (0x7fec5e13b3a6 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x2961c4 (0x7fec5dd0c1c4 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #38: __libc_start_main + 0xf5 (0x7fecbc751445 in /lib64/libc.so.6)
frame #39: python3() [0x400c40]

I think your universal model was trained and saved with a torch version of 1.6.0 or higher.
Could you please check this error?

Thank you.

Logging validation loss to tensorboard

Currently the code does not log validation loss to tensorboard.
This way we do not know which optimization regime we are in, e.g. underfitting or overfitting.
Can you please add this feature?

Fine-tuning HiFi-GAN with Glow-TTS npy files

Hello!
I'm trying to fine-tune HiFi-GAN with Glow-TTS npy files.
I generate the npy files with this code:

def TTS(tst_stn, path):
    if getattr(hps.data, "add_blank", False):
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
        text_norm = commons.intersperse(text_norm, len(symbols))
    else:
        tst_stn = " " + tst_stn.strip() + " "
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
    sequence = np.array(text_norm)[None, :]
    x_tst = torch.autograd.Variable(torch.from_numpy(sequence)).cuda().long()
    x_tst_lengths = torch.tensor([x_tst.shape[1]]).cuda()

    with torch.no_grad():
        noise_scale = 0.667
        length_scale = 1.0
        (y_gen_tst, *_), *_, (attn_gen, *_) = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)

    np.save("hf/ft_dataset/" + path.split('/')[1] + '.npy', y_gen_tst.cpu().detach().numpy())

Next, I make a metafile:
wavs/x.wav | ft_dataset/x.npy

And I get the following error:
RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6623] at entry 6

HiFi-GAN generates wavs from these npy files fine in inference mode with Glow-TTS.

Batch synthesis noise at the end

When doing batch synthesis (inference), I zero-pad the mel inputs so that they are the same length, which causes a harsh, buzzing sound to be generated by HiFi-GAN.

Assuming that batching is required for my application's performance purposes, what is the advised approach to dealing with this issue? I don't see support for passing in any sort of mask argument. Should I just try to heuristically cut the resulting wav audio so as to eliminate the noise at the end?
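
One pragmatic workaround is to trim each generated waveform back to its true length (a sketch; generator, padded_mels, and mel_lengths are hypothetical names, and hop_size is 256 in the provided configs):

hop_size = 256
audio = generator(padded_mels)                       # (batch, 1, padded_frames * hop_size)
for wav, n_frames in zip(audio, mel_lengths):
    trimmed = wav.squeeze(0)[: n_frames * hop_size]  # drop the zero-padded tail
    # write `trimmed` out; the buzzing tail caused by the padding is discarded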

error when fine-tuning

I want to use FastSpeech 2 + HiFi-GAN, but some generated audios sound a little noisy, so I took the generated mel-spectrograms from FastSpeech 2 to retrain HiFi-GAN, but I hit the following error when fine-tuning:

Loading 'cp_hifigan/g_00320000'
Complete.
Loading 'cp_hifigan/do_00320000'
Complete.
Epoch: 1227
Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main()
  File "train.py", line 280, in main
    mp.spawn(train, nprocs=h.num_gpus, args=(a, h,))
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/data/glusterfs_speech_tts_v2/public_data/tts_public_data/11104653/vocoder/hifi-gan/train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 84, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6712] at entry 10

Minimum hours of data required for fine-tuning for a single unseen speaker

Thank you for your amazing work!!
For the TTS task, assuming that the synthesizer(Tacotron2) + vocoder has already been trained on a significant number of speakers, what would be the minimum amount of data that would be required to fine-tune the vocoder to a new unseen speaker? Would 5-10 hours be sufficient? Would be helpful to have an approximate amount. Just to add more details, this is for TTS in Hindi and I plan to train Tacotron2 + HifiGAN on ~150 hours of Hindi data with several 100s of speakers before fine-tuning on a new unseen speaker. Thanks!

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

I've encountered the following error when trying to fine-tune on mel-specs from Tacotron2:

Traceback (most recent call last):
  File "train.py", line 276, in <module>
    main()
  File "train.py", line 272, in main
    train(0, a, h)
  File "train.py", line 127, in train
    y_g_hat = generator(x)
  File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home2/nguyenlm/Projects/hifi-gan-clone/models.py", line 101, in forward
    x = self.conv_pre(x)
  File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

Why use those mel spectrogram configurations?

I'm particularly curious about the choices to use:

fmin=0
n_fft = win_length = 1024
mulaw.bits = 9

Were these values chosen arbitrarily or are there some pros/cons (tradeoffs) to these values?

Question about the fine-tuning data

In the readme you say that for fine-tuning one should "Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing",
but the number of frames in the generated mel-spectrograms is not equal to that of the original mels. In meldataset.py the audio is cropped like this:

if audio.size(1) >= self.segment_size:
    mel_start = random.randint(0, mel.size(2) - frames_per_seg - 1)
    mel = mel[:, :, mel_start:mel_start + frames_per_seg]
    audio = audio[:, mel_start * self.hop_size:(mel_start + frames_per_seg) * self.hop_size]

In this way, if we use the original audio but the generated mel, the training data will be mismatched, right?

HiFi-GAN TFLite Model

Hi @jik876 @Edresson

We (@sayakpaul) converted the pre-trained PyTorch models of HiFi-GAN to TFLite format. Thanks for providing the pre-trained models. You can use this notebook to convert all the available pre-trained models to TFLite format. If you are interested in any other models, you can visit this repository. We also provide benchmark results of the HiFi-GAN TFLite model against other models like Parallel WaveGAN, MelGAN, and MB-MelGAN.

Models will be soon published to TensorFlow Hub.

Glow-tts + hifi-gan inference issue

I have trained the model using Glow-TTS and was trying to infer the texts using the jupyter notebook file given in the hifi-gan directory.

When I tried running this command:

#### Use finetuned HiFi-GAN with Tacotron 2, which is provided in the repo of HiFi-GAN.
!python ./hifi-gan/inference_e2e.py --checkpoint_file  ~/glow-tts/logs/base/G_100.pth

It gives an error:

Initializing Inference Process..
~/glow-tts/logs/base/config_v1.json
Loading '~/glow-tts/logs/base/G_100.pth'
Complete.
Traceback (most recent call last):
  File "./hifi-gan/inference_e2e.py", line 100, in <module>
    main()
  File "./hifi-gan/inference_e2e.py", line 96, in main
    inference(a)
  File "./hifi-gan/inference_e2e.py", line 46, in inference
    generator.load_state_dict(state_dict_g['generator'])
KeyError: 'generator'

Can anyone tell me why this error occurs and how I can solve it?

How to set segment_size for different sampling rates

Hi,
I saw that everyone changes their segment_size when using a different sampling_rate.
I want to know:
How should segment_size be set for different sampling rates? Is there a mathematical formula for this?
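
One common heuristic (not an official formula): keep segment_size a multiple of hop_size (which itself must equal the product of upsample_rates) and roughly the same duration as the default 8192 samples at 22.05 kHz. For example:

sampling_rate = 16000
hop_size = 200                               # must equal the product of upsample_rates
duration = 8192 / 22050                      # duration of the default segment, in seconds
segment_size = int(round(duration * sampling_rate / hop_size)) * hop_size
print(segment_size)                          # -> 6000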

Mel-spectrogram loss vs. STFT loss

Hi,
thanks for your great work. In the paper you mention using the mel-spectrogram loss to obtain more stable and efficient training. The multi-resolution STFT loss used in Parallel WaveGAN / MB-MelGAN seems to achieve the same goal.
My question is: have you tried the STFT loss instead of the mel loss? If so, did you observe any differences?

Thanks.

Some questions

Hi, thanks for sharing the code, it is well appreciated. Some questions:

  • Do you train with mean-var normalization? If not, what is the range normalization?
  • I tried to plug in the models using a spectrogram generated by Mozilla TTS, but had no luck (waveform is generated, but sound is very distorted). Do you have any idea why this happens? Is there any difference in which the spectrograms are computed from hifiGAN's side? The training attributes (win, hop, fmin, fmax) are otherwise the same.
  • When finetuning for TTS, how do you acquire your ground truth mels? Using the TTS model you want to use the GAN with?
  • How many steps do you train for?

Thanks again, these results are impressive for a GAN.

Padding mismatch from Tacotron2 pre-processing

Hello, I ran into similar trouble as #52 while trying to fine-tune the Universal_V1 checkpoint using ground-truth aligned TC2 outputs.

In my case, the fine-tuning breaks as soon as it reaches the validation phase. The error message is something along the lines of:

Using a target size (torch.Size([1, 80, 132])) that is different to the input size (torch.Size([1, 80, 131])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()

I found that the padding performed in this line is different than the padding performed in Tacotron2.

I dug deeper and this does introduce a consistent difference of 1 frame between the mel-specs generated by the trained TC2 model and the hifi-gan loss mel-specs.

Technically I could simply edit the padding code in this repo and get the mels to align perfectly but I wonder if this would cause issues since the model is trained using differently padded mel spectrograms. What are your thoughts?

the loss doesn't decrease

Hi, thanks for your work.
I have trained 10k steps from scratch; the generator loss is 8.6, the discriminator loss is 4.5, and the learning rate is 1e-5. The loss doesn't decrease and I don't know why.
Thanks

Buzzing sound when using Tacotron2+HiFi-GAN

Hi @jik876 @Edresson

I have been trying to integrate Tacotron2 and HiFi-GAN to create a fully end-to-end TTS pipeline. But when I feed Tacotron2 output to your fine-tuned HiFi-GAN model, the output audio is just a buzzing sound. To make sure the Tacotron2 output is correct, I fed it to the Parallel WaveGAN model, and it works as expected. So I believe there is some incompatibility in how the Tacotron2 output is fed to HiFi-GAN. I created a Colab notebook with which you can reproduce the output.

Also, I created an end-to-end Colab notebook to run the HiFi-GAN model. If you want me to add this to your repo, I will go ahead and create a PR.

Also, I plan to convert this HiFi-GAN model to TFLite format to help mobile developers. We (@sayakpaul) have already converted a few models to TFLite format. You can find more details about our TFLite repo here.

MPD Vs Filter-bank discriminator

Hi,
Thanks for sharing this great work. I have one theoretical question regarding the usage of the multi-period discriminator (MPD):

As I understand the MPD, the input waveform is reshaped according to the target period p in order to obtain a 2D map that models different periods of the signal. When applying this simple reshaping, the subbands of the signal actually overlap with each other when you plot their spectrograms. If that's correct, what do you think of using a simple filter bank to decompose the speech waveform into subbands, avoiding such an overlapping issue?

I would really appreciate it if you already have some experiments on that, or at least if you could explain the difference between having multiple periods of the signal versus multiple subbands.

Why does the mel .npy file produced by my Tacotron2 model not match the dimensions in hifi-gan?

I trained the Tacotron2 model to produce mel_**.npy, but this model reports a dimension mismatch error:

python3 inference_e2e.py --checkpoint_file cp_hifigan-1208-test/g_00036000
Initializing Inference Process..
cp_hifigan-1208-test/g_00036000
Loading 'cp_hifigan-1208-test/g_00036000'
Complete.
Removing weight norm...
Traceback (most recent call last):
  File "inference_e2e.py", line 90, in <module>
    main()
  File "inference_e2e.py", line 86, in main
    inference(a)
  File "inference_e2e.py", line 51, in inference
    y_g_hat = generator(x)
  File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhchen/hifi-gan-master/models.py", line 101, in forward
    x = self.conv_pre(x)
  File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 200, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected 3-dimensional input for 3-dimensional weight 128 80, but got 2-dimensional input of size [288, 80] instead

RuntimeError when finetuning the model

Hi. Thank you very much for your implementation.
I extracted mels following the instructions, and the frame counts of the mels and audios match.
However, when I tried to finetune hifi-gan, I still got the following problem.
Could you tell me how to solve this? Thank you very much!

    main()
  File "train.py", line 267, in main
    train(0, a, h)
  File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 118 and 167 in dimension 1 at /opt/conda/conda-bld/pytorch_1579027003190/work/aten/src/TH/generic/THTensor.cpp:612

The training time

Hi, thanks for your work. I want to know the training time; how long does it take?
Thanks!

Specific environment version

When training I get a runtime error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Could you please provide the specific versions of your environment, including Python, CUDA, and cuDNN? Thank you very much.

Sorry, a newbie here asking basic setup questions

I tried to set up a proper env to generate the demos,
but somehow I get some errors.

(Hifigan384test) D:\Coding\PYFastCache\PYVenv\Hifigan384test\hifi-gan-master>python inference.py --checkpoint_file vctk_v2\generator_v2 --input_wavs_dir test_files
Initializing Inference Process..
Loading 'vctk_v2\generator_v2'
Complete.
Removing weight norm...
D:\Coding\PYFastCache\PYVenv\Hifigan384test\hifi-gan-master\meldataset.py:15: WavFileWarning: Chunk (non-data) not understood, skipping it.
  sampling_rate, data = read(full_path)
Traceback (most recent call last):
  File "inference.py", line 94, in <module>
    main()
  File "inference.py", line 90, in main
    inference(a)
  File "inference.py", line 54, in inference
    x = get_mel(wav.unsqueeze(0))
  File "inference.py", line 26, in get_mel
    return mel_spectrogram(x, h.n_fft, h.num_mels, h.sampling_rate, h.hop_size, h.win_size, h.fmin, h.fmax)
  File "D:\Coding\PYFastCache\PYVenv\Hifigan384test\hifi-gan-master\meldataset.py", line 61, in mel_spectrogram
    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
  File "D:\Coding\PYFastCache\PYVenv\Hifigan384test\lib\site-packages\torch\nn\functional.py", line 2877, in pad
    assert len(pad) == 4, '4D tensors expect 4 values for padding'
AssertionError: 4D tensors expect 4 values for padding

Not quite sure what went wrong. Maybe the audio?


here's the layout.
02/12/2020  18:04    <DIR>          .
02/12/2020  18:04    <DIR>          ..
30/11/2020  23:03               762 config_v1.json
30/11/2020  23:03               762 config_v2.json
30/11/2020  23:03               752 config_v3.json
30/11/2020  23:03               394 env.py
02/12/2020  18:00    <DIR>          generated_files
02/12/2020  16:59    <DIR>          generated_files_from_mel
30/11/2020  23:03             2,652 inference.py
30/11/2020  23:03             2,444 inference_e2e.py
30/11/2020  23:03             1,067 LICENSE
30/11/2020  23:03    <DIR>          LJSpeech-1.1
30/11/2020  23:03             6,314 meldataset.py
30/11/2020  23:03             9,905 models.py
30/11/2020  23:03             4,767 README.md
30/11/2020  23:03               113 requirements.txt
02/12/2020  18:04    <DIR>          test_files
30/11/2020  23:03            12,153 train.py
30/11/2020  23:03             1,377 utils.py
30/11/2020  23:03            10,995 validation_loss.png
02/12/2020  16:53    <DIR>          vctk_v2
02/12/2020  16:59    <DIR>          __pycache__
              14 File(s)         54,457 bytes
               8 Dir(s)   7,409,783,296 bytes free

By the way, for the audio I just recorded my own voice. I'm not quite sure what is needed for input.
I assume it just needs some wav audio?

Output audio duration does not exactly match input audio.

Running your pre-trained models, I found that the generated audio does not exactly match the input audio in duration. For example,

wav, sr = load_wav(os.path.join(a.input_wavs_dir, filname))
wav = wav / MAX_WAV_VALUE
wav = torch.FloatTensor(wav).to(device)  # wav shape is torch.Size([71334])
x = get_mel(wav.unsqueeze(0))  # x shape is torch.Size([1, 80, 278])
y_g_hat = generator(x)  # y_g_hat shape is torch.Size([1, 1, 71168])

As you can see, there is a mismatch of 71334 and 71168. What is happening, and why is this the case? Is there a way that I can change it so that the input and output shapes match?

Thank you.

Edit: I checked training, and if the target segment_size is a multiple of 256 (hop_size), then y_g_hat = generator(x) also has exactly that number of samples.
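
If exact length matching matters, one option (a sketch, continuing the snippet above) is to pad the input waveform to a multiple of hop_size before computing the mel, so that the generator output has the same number of samples as the padded input:

import torch.nn.functional as F

hop_size = 256                         # hop_size from the config
pad = (-wav.shape[-1]) % hop_size
wav = F.pad(wav, (0, pad))             # length is now a multiple of hop_size
x = get_mel(wav.unsqueeze(0))
y_g_hat = generator(x)                 # output length matches the padded input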

When fine-tuning with the mel_**.npy files produced by the Tacotron2 model, a RuntimeError occurred.

Traceback (most recent call last):
  File "train.py", line 272, in <module>
    main()
  File "train.py", line 268, in main
    train(0, a, h)
  File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/hifi/meldataset.py", line 163, in __getitem__
    center=False)
  File "/data/hifi/meldataset.py", line 50, in mel_spectrogram
    if torch.min(y) < -1.:
RuntimeError: invalid argument 1: cannot perform reduction function min on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:345

break point problem

(attached spectrogram image)
The spectrogram has a break point; what do you think is the reason for this problem? I have enlarged the receptive field ([1,3,5] -> [1,3,5,7]), but the problem still exists.

And why is your leaky ReLU slope 0.1 instead of 0.2?

Some questions about training

Hey.
Why does the readme say that you need to use GTA mels for fine-tuning? I used real spectrograms to train WaveGlow and Parallel WaveGAN, whose authors indicated that this achieves acceptable quality together with Tacotron 2. Is the GTA training procedure mandatory for this vocoder to achieve the best possible quality?

And another question about the multi-speaker model. Do I need any modifications to train a multi-speaker model, or is it enough to generate spectrograms of different speakers and train on them? Also, what is the minimum number of speakers required for the model to reproduce speakers unseen during training with good quality?
