jik876 / hifi-gan


HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

License: MIT License

Python 100.00%
speech-synthesis gan text-to-speech tts deep-learning hifi-gan pytorch vocoder

hifi-gan's Introduction

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our paper, we proposed HiFi-GAN: a GAN-based model capable of generating high fidelity speech efficiently.
We provide our implementation and pretrained models as open source in this repository.

Abstract: Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Visit our demo website for audio samples.

Pre-requisites

  1. Python >= 3.6
  2. Clone this repository.
  3. Install Python requirements. Please refer to requirements.txt.
  4. Download and extract the LJ Speech dataset, then move all wav files to LJSpeech-1.1/wavs.
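
A typical setup might look like the following (a sketch; virtual-environment tooling and the dataset download step are up to you):

git clone https://github.com/jik876/hifi-gan.git
cd hifi-gan
pip install -r requirements.txt
# download and extract LJSpeech-1.1, then move its wav files into LJSpeech-1.1/wavs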

Training

python train.py --config config_v1.json

To train the V2 or V3 generator, replace config_v1.json with config_v2.json or config_v3.json.
Checkpoints and a copy of the configuration file are saved in the cp_hifigan directory by default.
You can change the path by adding the --checkpoint_path option.
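
For example, to train the V2 generator and write checkpoints to a custom directory (the directory name here is illustrative):

python train.py --config config_v2.json --checkpoint_path cp_hifigan_v2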

Validation loss during training with the V1 generator is shown in validation_loss.png (included in this repository).

Pretrained Model

You can also use the pretrained models we provide.
Download pretrained models
Details of each folder are as follows:

| Folder Name  | Generator | Dataset   | Fine-Tuned      |
|--------------|-----------|-----------|-----------------|
| LJ_V1        | V1        | LJSpeech  | No              |
| LJ_V2        | V2        | LJSpeech  | No              |
| LJ_V3        | V3        | LJSpeech  | No              |
| LJ_FT_T2_V1  | V1        | LJSpeech  | Yes (Tacotron2) |
| LJ_FT_T2_V2  | V2        | LJSpeech  | Yes (Tacotron2) |
| LJ_FT_T2_V3  | V3        | LJSpeech  | Yes (Tacotron2) |
| VCTK_V1      | V1        | VCTK      | No              |
| VCTK_V2      | V2        | VCTK      | No              |
| VCTK_V3      | V3        | VCTK      | No              |
| UNIVERSAL_V1 | V1        | Universal | No              |

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
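
A minimal loading sketch in Python (assuming the downloaded folder, e.g. UNIVERSAL_V1, contains a generator checkpoint such as g_02500000 and its config.json; exact file names vary by folder):

import json
import torch

from env import AttrDict
from models import Generator

config_path = "UNIVERSAL_V1/config.json"      # hypothetical paths; adjust to your download
checkpoint_path = "UNIVERSAL_V1/g_02500000"

with open(config_path) as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load(checkpoint_path, map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()   # for inference only; keep weight norm when fine-tuning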

Fine-Tuning

  1. Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing (see the sketch after this list).
    The file name of each generated mel-spectrogram should match its audio file, and the extension should be .npy.
    Example:
    Audio File : LJ001-0001.wav
    Mel-Spectrogram File : LJ001-0001.npy
    
  2. Create ft_dataset folder and copy the generated mel-spectrogram files into it.
  3. Run the following command.
    python train.py --fine_tuning True --config config_v1.json
    
    For other command line options, please refer to the training section.
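
A minimal sketch of step 1, assuming an NVIDIA-style Tacotron2 whose teacher-forced forward pass returns (mel, mel_postnet, gate, alignments); adapt the unpacking and input preparation to whatever implementation you use:

import numpy as np
import torch

def save_gta_mel(tacotron2, inputs, out_path):
    # Teacher-forced (ground-truth-aligned) forward pass; `inputs` is a parsed
    # training batch for a single utterance in your Tacotron2's input format.
    with torch.no_grad():
        _, mel_postnet, _, _ = tacotron2(inputs)
    mel = mel_postnet.squeeze(0).cpu().numpy()   # shape: (num_mels, frames)
    np.save(out_path, mel)                       # e.g. ft_dataset/LJ001-0001.npy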

Inference from wav file

  1. Make test_files directory and copy wav files into the directory.
  2. Run the following command.
    python inference.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files by default.
You can change the path by adding --output_dir option.
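
For example (paths are illustrative):

python inference.py --checkpoint_file cp_hifigan/g_02500000 --input_wavs_dir test_files --output_dir generated_files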

Inference for end-to-end speech synthesis

  1. Make test_mel_files directory and copy generated mel-spectrogram files into the directory.
    You can generate mel-spectrograms using Tacotron2, Glow-TTS and so forth.
  2. Run the following command.
    python inference_e2e.py --checkpoint_file [generator checkpoint file path]
    

Generated wav files are saved in generated_files_from_mel by default.
You can change the path by adding --output_dir option.

Acknowledgements

We referred to WaveGlow, MelGAN and Tacotron2 to implement this.

hifi-gan's People

Contributors

edresson, jik876


hifi-gan's Issues

Machine noise at 4 kHz

I encountered a bad case: there is a machine-like sound (a horizontal line) at 4 kHz in the attached spectrogram, during a breathing part. I have tried to fix it without success. Can you give some advice? Thanks.

(attached spectrogram image)

Tracing to torchscript

Has anyone been able to successfully convert the generator model to torchscript?

I receive a bizarre error. Tracing itself works:

import json
import torch

import env     # from the hifi-gan repo
import models  # from the hifi-gan repo

# Dummy mel input: log-mel silence is roughly -11.52 with the repo's settings
zero = torch.full((1, 80, 10), -11.52).cuda()
with open("hifi-gan/config.json") as f:
    data = f.read()
h = env.AttrDict(json.loads(data))
vocoder = models.Generator(h).cuda()
vocoder.load_state_dict(
    torch.load("hifi-gan/pretrained_universal/g_02500000")["generator"]
)
vocoder.remove_weight_norm()
vocoder.eval()
with torch.no_grad():
    traced_vocoder = torch.jit.trace(vocoder, zero)
    torch.jit.save(traced_vocoder, "vocoder.pth")

Trying to then load the model gives a weird error:

traced_vocoder = torch.jit.load("vocoder.pth")
/opt/conda/lib/python3.8/site-packages/torch/jit/_serialization.py in load(f, map_location, _extra_files)
    159     cu = torch._C.CompilationUnit()
    160     if isinstance(f, str) or isinstance(f, pathlib.Path):
--> 161         cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
    162     else:
    163         cpp_module = torch._C.import_ir_module_from_buffer(

RuntimeError: Found character '45' in string, strings must be qualified Python identifiers

ExponentialLR and fine tune

Hi.
I started training the model from scratch and found that the optimizer uses a decaying learning rate. If I train the model for 2.5 million steps, then according to my calculations the learning rate drops to about 3e-7. Not only is this a very low learning rate that can cause floating-point issues, it also makes it impossible to adapt other speakers from such a checkpoint, because the learning rate is too small.
Does this mean that it is better to set lr_decay = 1.0?
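
For reference, the decayed rate can be estimated as below (a sketch, assuming the config_v1.json defaults learning_rate = 0.0002 and lr_decay = 0.999 with the scheduler stepped once per epoch; the epoch count here is hypothetical):

initial_lr = 2e-4
lr_decay = 0.999
epochs = 6500                             # hypothetical; depends on dataset and batch size
print(initial_lr * lr_decay ** epochs)    # ~3e-7, matching the estimate above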

Can the MEL spectrum be resampled?

Hello, the model you provided generalizes well, but its sampling rate differs from my project's. Is there any mel resampling algorithm that can convert the mel-spectrogram directly to the target sampling rate, such as 48000 Hz?

Window size is hardcoded (44 kHz HiFi-GAN)

Congratulations and thank you for this great work.
We were able to easily adapt it (doubling the FFT parameters and changing the upsample rates to [8,8,4,2]) to produce 44 kHz, 16-bit audio of very high quality.

The line below has the window size hardcoded to 1024; it should be replaced with win_size.

hann_window[str(y.device)] = torch.hann_window(1024).to(y.device)
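
With win_size, the corrected line would read:

hann_window[str(y.device)] = torch.hann_window(win_size).to(y.device)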

Can't train with output of Tacotron 2

The shape of the mel output of Tacotron2 is bigger than that of the mel extracted from audio, and training fails with:

 File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/data/tuong/Yen/hifi-gan/env/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 8192 and 8119 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612

[Question] Dataset preprocessing

I've attempted to preprocess my dataset to meet the mel-spectrogram requirements, but I either wind up with incorrectly packed spectrogram files, a wrong header, or wrong data. I don't think any of the Tacotron2 implementations I can get my hands on will output the data in the required format, or I'm overlooking something obvious (which is equally likely 😌).

Could one of you helpful people provide a link to a working piece of code that takes care of this properly or could this repository be fleshed out more so that there is a working preprocessor for training datasets? 🤔

What is the input format?

I've seen a lot of general discussion about feeding generated mels into HiFi-GAN, and of course we can see the mel-spectrogram hparams in each config file, but nothing that actually says what the format of the input x to Generator(x) is. Is it (1, n_mels, frames)? Is normalization expected? Nothing I've tried works.
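
For what it's worth, the repository's inference.py builds the input with meldataset.mel_spectrogram, so the expected shape is (1, num_mels, frames) with log-compressed mel magnitudes as computed by that function. A sketch, assuming h and generator have been loaded from a checkpoint's config.json (e.g. as in the pretrained-model sketch earlier on this page):

import torch
from meldataset import MAX_WAV_VALUE, load_wav, mel_spectrogram

wav, sr = load_wav("test_files/LJ001-0001.wav")               # path is illustrative
wav = torch.FloatTensor(wav / MAX_WAV_VALUE).unsqueeze(0)     # (1, samples), floats in [-1, 1]
x = mel_spectrogram(wav, h.n_fft, h.num_mels, h.sampling_rate,
                    h.hop_size, h.win_size, h.fmin, h.fmax)   # (1, 80, frames)
y_g_hat = generator(x)                                        # (1, 1, samples)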

[Question] How to apply on 16k data?

Hi, thanks for sharing your impressive code.
I tried to apply HiFi-GAN to 16 kHz data with this config:
"upsample_rates": [8,5,5],
"upsample_kernel_sizes": [16,10,10],
"segment_size": 6400,
"hop_size": 200,
"win_size": 800,
"sampling_rate": 16000,

And it reports an error like:
Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 267, in main
    train(0, a, h)
  File "train.py", line 149, in train
    loss_fm_f = feature_loss(fmap_f_r, fmap_f_g)
  File "models.py", line 255, in feature_loss
    loss += torch.mean(torch.abs(rl - gl))
RuntimeError: The size of tensor a (1067) must match the size of tensor b (1068) at non-singleton dimension 2

Is there anything wrong with the modified config? Is it padding related?
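
For what it's worth, the usual consistency checks (a sketch, not part of the repo) pass for this config, which makes the one-frame mismatch more likely padding-related, e.g. the hardcoded 1024 window discussed in another issue on this page:

upsample_rates = [8, 5, 5]
hop_size = 200
segment_size = 6400

prod = 1
for r in upsample_rates:
    prod *= r
assert prod == hop_size                # 8 * 5 * 5 = 200: OK
assert segment_size % hop_size == 0    # 6400 / 200 = 32 frames: OK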

pretrained model

Hi, thanks for your work.
I only found pretrained models for the generator; are there any pretrained models for the discriminator?
Thanks!

Buzzing noise

Hi, thanks for your work.
I have trained 50k steps from scratch with the v1 config, and there is a buzzing noise in the generated audio. Can I get rid of the noise if I continue training? Do you have any suggestions?
Thanks for your help.

Training times v1 vs. v2 vs. v3?

Hello, I see it took you this amount of time to train v1:

It took about 13-14 days to train the model up to 2,500k steps with two V100 GPUs.

Since v2 and v3 have faster inference, does that mean training them would be faster too?

Errors when trying to load pretrained Universal model

Hi. I'm working in a torch==1.4.0 environment, as specified in requirements.txt.
I can successfully run inference with your pretrained generator_v1 and fine-tuned generator_v1.

However, when I try to load your universal generator model, the following error occurs:

RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fec128e9193 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7fec15c799eb in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7fec15c7ac04 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c53a6 (0x7fec5e13b3a6 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x2961c4 (0x7fec5dd0c1c4 in /home/kwkim/six/lib64/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #38: __libc_start_main + 0xf5 (0x7fecbc751445 in /lib64/libc.so.6)
frame #39: python3() [0x400c40]

I think your universal model was trained and saved with a torch version of 1.6.0 or higher.
Could you please check this error?

Thank you.

Logging validation loss to tensorboard

Currently the code does not log validation loss to tensorboard.
This way we do not know which optimization regime we are in, e.g. underfitting or overfitting.
Can you please add this feature?

Fine-tuning HiFi-GAN with Glow-TTS npy files

Hello!
I'm trying to fine-tune HiFi-GAN with Glow-TTS npy files.
I generate the npy files with this code:

def TTS(tst_stn, path):
    if getattr(hps.data, "add_blank", False):
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
        text_norm = commons.intersperse(text_norm, len(symbols))
    else:
        tst_stn = " " + tst_stn.strip() + " "
        text_norm = text_to_sequence(tst_stn.strip(), ['english_cleaners'], cmu_dict)
    sequence = np.array(text_norm)[None, :]
    x_tst = torch.autograd.Variable(torch.from_numpy(sequence)).cuda().long()
    x_tst_lengths = torch.tensor([x_tst.shape[1]]).cuda()

    with torch.no_grad():
        noise_scale = 0.667
        length_scale = 1.0
        (y_gen_tst, *_), *_, (attn_gen, *_) = model(x_tst, x_tst_lengths, gen=True, noise_scale=noise_scale, length_scale=length_scale)

    np.save("hf/ft_dataset/" + path.split('/')[1] + '.npy', y_gen_tst.cpu().detach().numpy())

Next, I make a metafile:
wavs/x.wav | ft_dataset/x.npy

And I get the following error:
RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6623] at entry 6

HiFi-GAN generates wavs from these npy files fine in inference mode with Glow-TTS.

Batch synthesis noise at the end

When doing batch synthesis (inference), I zero-pad the mel inputs so that they are the same length, which causes a harsh, buzzing sound to be generated by HiFi-GAN.

Assuming that batching is required for my application's performance purposes, what is the advised approach to dealing with this issue? I don't see support for passing in any sort of mask argument. Should I just try to heuristically cut the resulting wav audio so as to eliminate the noise at the end?
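
One pragmatic workaround is to trim each generated waveform back to its true length (a sketch; generator, padded_mels, and mel_lengths are hypothetical names, and hop_size is 256 in the provided configs):

hop_size = 256
audio = generator(padded_mels)                       # (batch, 1, padded_frames * hop_size)
for wav, n_frames in zip(audio, mel_lengths):
    trimmed = wav.squeeze(0)[: n_frames * hop_size]  # drop the zero-padded tail
    # write `trimmed` out; the buzzing tail caused by the padding is discarded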

error when fine-tuning

I want to use FastSpeech 2 + HiFi-GAN, but some generated audios sound a little noisy, so I took the generated mel-spectrograms from FastSpeech 2 to retrain HiFi-GAN, but I hit the following error when fine-tuning:

Loading 'cp_hifigan/g_00320000'
Complete.
Loading 'cp_hifigan/do_00320000'
Complete.
Epoch: 1227
Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main()
  File "train.py", line 280, in main
    mp.spawn(train, nprocs=h.num_gpus, args=(a, h,))
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/data/glusterfs_speech_tts_v2/public_data/tts_public_data/11104653/vocoder/hifi-gan/train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 84, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 84, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/espnet/tools/venv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [8192] at entry 0 and [6712] at entry 10

Minimum hours of data required for fine-tuning for a single unseen speaker

Thank you for your amazing work!!
For the TTS task, assuming that the synthesizer(Tacotron2) + vocoder has already been trained on a significant number of speakers, what would be the minimum amount of data that would be required to fine-tune the vocoder to a new unseen speaker? Would 5-10 hours be sufficient? Would be helpful to have an approximate amount. Just to add more details, this is for TTS in Hindi and I plan to train Tacotron2 + HifiGAN on ~150 hours of Hindi data with several 100s of speakers before fine-tuning on a new unseen speaker. Thanks!

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

I've encountered the following error when trying to fine-tune on mel-specs from Tacotron2:

Traceback (most recent call last):
  File "train.py", line 276, in <module>
    main()
  File "train.py", line 272, in main
    train(0, a, h)
  File "train.py", line 127, in train
    y_g_hat = generator(x)
  File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home2/nguyenlm/Projects/hifi-gan-clone/models.py", line 101, in forward
    x = self.conv_pre(x)
  File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kynh/anaconda3/envs/nguyenlm_hifigan/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

Why use those mel spectrogram configurations?

I'm particularly curious about the choices to use:

fmin=0
n_fft = win_length = 1024
mulaw.bits = 9

Were these values chosen arbitrarily or are there some pros/cons (tradeoffs) to these values?

Question about the fine-tuning data

In the readme you say that for fine-tuning one should "Generate mel-spectrograms in numpy format using Tacotron2 with teacher-forcing",
but the number of frames in the generated mel-spectrograms is not equal to that of the original mels. In meldataset.py the audio is cropped like this:

if audio.size(1) >= self.segment_size:
    mel_start = random.randint(0, mel.size(2) - frames_per_seg - 1)
    mel = mel[:, :, mel_start:mel_start + frames_per_seg]
    audio = audio[:, mel_start * self.hop_size:(mel_start + frames_per_seg) * self.hop_size]

In this way, if we use the original audio but the generated mel, the training data will be mismatched, right?

HiFi-GAN TFLite Model

Hi @jik876 @Edresson

We (@sayakpaul) converted the pre-trained PyTorch models of HiFi-GAN to TFLite format. Thanks for providing the pre-trained models. You can use this notebook to convert all the available pre-trained models to TFLite format. If you are interested in any other models, you can visit this repository. We also provide benchmark results of the HiFi-GAN TFLite model against other models like Parallel WaveGAN, MelGAN, and MB-MelGAN.

Models will be soon published to TensorFlow Hub.

Glow-tts + hifi-gan inference issue

I have trained the model using Glow-TTS and was trying to infer the texts using the jupyter notebook file given in the hifi-gan directory.

When I tried running this command:

#### Use finetuned HiFi-GAN with Tacotron 2, which is provided in the repo of HiFi-GAN.
!python ./hifi-gan/inference_e2e.py --checkpoint_file  ~/glow-tts/logs/base/G_100.pth

It gives an error:

Initializing Inference Process..
~/glow-tts/logs/base/config_v1.json
Loading '~/glow-tts/logs/base/G_100.pth'
Complete.
Traceback (most recent call last):
  File "./hifi-gan/inference_e2e.py", line 100, in <module>
    main()
  File "./hifi-gan/inference_e2e.py", line 96, in main
    inference(a)
  File "./hifi-gan/inference_e2e.py", line 46, in inference
    generator.load_state_dict(state_dict_g['generator'])
KeyError: 'generator'

Can anyone tell me why this error occurs and how I can solve it?

How to set segment_size for different sampling rates

Hi,
I saw that everyone changes their segment_size when using a different sampling_rate.
I want to know:
How should segment_size be set for different sampling rates? Is there a mathematical formula for this?
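
One common heuristic (not an official formula): keep segment_size a multiple of hop_size (which itself must equal the product of upsample_rates) and roughly the same duration as the default 8192 samples at 22.05 kHz. For example:

sampling_rate = 16000
hop_size = 200                               # must equal the product of upsample_rates
duration = 8192 / 22050                      # duration of the default segment, in seconds
segment_size = int(round(duration * sampling_rate / hop_size)) * hop_size
print(segment_size)                          # -> 6000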

Mel-spectrogram loss vs. STFT loss

Hi,
thanks for your great work. In the paper you mention using the mel-spectrogram loss to obtain more stable and efficient training. The multi-resolution STFT loss used in Parallel WaveGAN / MB-MelGAN seems to achieve the same goal.
My question is: have you tried the STFT loss instead of the mel loss? If so, did you observe any differences?

Thanks.

Some questions

Hi, thanks for sharing the code, it is well appreciated. Some questions:

  • Do you train with mean-var normalization? If not, what is the range normalization?
  • I tried to plug in the models using a spectrogram generated by Mozilla TTS, but had no luck (waveform is generated, but sound is very distorted). Do you have any idea why this happens? Is there any difference in which the spectrograms are computed from hifiGAN's side? The training attributes (win, hop, fmin, fmax) are otherwise the same.
  • When finetuning for TTS, how do you acquire your ground truth mels? Using the TTS model you want to use the GAN with?
  • How many steps do you train for?

Thanks again, these results are impressive for a GAN.

Padding mismatch from Tacotron2 pre-processing

Hello, I ran into similar trouble as #52 while trying to fine-tune the Universal_V1 checkpoint using ground-truth aligned TC2 outputs.

In my case, the fine-tuning breaks as soon as it reaches the validation phase. The error message is something along the lines of:

Using a target size (torch.Size([1, 80, 132])) that is different to the input size (torch.Size([1, 80, 131])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()

I found that the padding performed in this line is different than the padding performed in Tacotron2.

I dug deeper and this does introduce a consistent difference of 1 frame between the mel-specs generated by the trained TC2 model and the hifi-gan loss mel-specs.

Technically I could simply edit the padding code in this repo and get the mels to align perfectly but I wonder if this would cause issues since the model is trained using differently padded mel spectrograms. What are your thoughts?

the loss doesn't decrease

Hi, thanks for your work.
I have trained 10k steps from scratch; the generator loss is 8.6, the discriminator loss is 4.5, and the learning rate is 1e-5. The loss doesn't decrease and I don't know why.
Thanks

Buzzing sound when using Tacotron2+HiFi-GAN

Hi @jik876 @Edresson

I have been trying to integrate Tacotron2 and HiFi-GAN to create a fully end-to-end TTS pipeline. But when I feed Tacotron2 output to your fine-tuned HiFi-GAN model, the output audio is just a buzzing sound. To make sure the Tacotron2 output is correct, I fed it to the Parallel WaveGAN model, and it works as expected. So I believe there is some incompatibility in how the Tacotron2 output is fed to HiFi-GAN. I created a Colab notebook with which you can reproduce the output.

Also, I created an end-to-end Colab notebook to run the HiFi-GAN model. If you want me to add this to your repo, I will go ahead and create a PR.

Also, I plan to convert this HiFi-GAN model to TFLite format to help mobile developers. We (@sayakpaul) have already converted a few models to TFLite format. You can find more details about our TFLite repo here.

MPD Vs Filter-bank discriminator

Hi,
Thanks for sharing this great work. I have one theoretical question regarding the usage of the multi-period discriminator (MPD):

As I understand the MPD, the input waveform is reshaped according to the target period p in order to obtain a 2D map that models different periods of the signal. When applying this simple reshaping, the subbands of the signal actually overlap with each other when you plot their spectrograms. If that's correct, what do you think of using a simple filter bank to decompose the speech waveform into subbands, avoiding such an overlapping issue?

I would really appreciate it if you already have some experiments on that, or at least if you could explain the difference between having multiple periods of the signal versus multiple subbands.

Why does the mel .npy file produced by my Tacotron2 model not match the dimensions in hifi-gan?

I trained the Tacotron2 model to produce mel_**.npy, but this model reports a dimension mismatch error:

python3 inference_e2e.py --checkpoint_file cp_hifigan-1208-test/g_00036000
Initializing Inference Process..
cp_hifigan-1208-test/g_00036000
Loading 'cp_hifigan-1208-test/g_00036000'
Complete.
Removing weight norm...
Traceback (most recent call last):
  File "inference_e2e.py", line 90, in <module>
    main()
  File "inference_e2e.py", line 86, in main
    inference(a)
  File "inference_e2e.py", line 51, in inference
    y_g_hat = generator(x)
  File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhchen/hifi-gan-master/models.py", line 101, in forward
    x = self.conv_pre(x)
  File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhchen/python3/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 200, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected 3-dimensional input for 3-dimensional weight 128 80, but got 2-dimensional input of size [288, 80] instead

RuntimeError when finetuning the model

Hi. Thank you very much for your implementation.
I extracted mels following the instructions, and the frame counts of the mels and audios match.
However, when I tried to finetune hifi-gan, I still got the following problem.
Could you tell me how to solve this? Thank you very much!

    main()
  File "train.py", line 267, in main
    train(0, a, h)
  File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 118 and 167 in dimension 1 at /opt/conda/conda-bld/pytorch_1579027003190/work/aten/src/TH/generic/THTensor.cpp:612

The training time

Hi, thanks for your work. I want to know the training time; how long does it take?
Thanks!

Specific environment version

When training I get a runtime error: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. Could you please provide the specific versions of your environment, including Python, CUDA, and cuDNN? Thank you very much.

Sorry, a newbie here asking basic setup questions

I tried to set up a proper env to generate the demos,
but somehow I get some errors.

(Hifigan384test) D:\Coding\PYFastCache\PYVenv\Hifigan384test\hifi-gan-master>python inference.py --checkpoint_file vctk_v2\generator_v2 --input_wavs_dir test_files
Initializing Inference Process..
Loading 'vctk_v2\generator_v2'
Complete.
Removing weight norm...
D:\Coding\PYFastCache\PYVenv\Hifigan384test\hifi-gan-master\meldataset.py:15: WavFileWarning: Chunk (non-data) not understood, skipping it.
  sampling_rate, data = read(full_path)
Traceback (most recent call last):
  File "inference.py", line 94, in <module>
    main()
  File "inference.py", line 90, in main
    inference(a)
  File "inference.py", line 54, in inference
    x = get_mel(wav.unsqueeze(0))
  File "inference.py", line 26, in get_mel
    return mel_spectrogram(x, h.n_fft, h.num_mels, h.sampling_rate, h.hop_size, h.win_size, h.fmin, h.fmax)
  File "D:\Coding\PYFastCache\PYVenv\Hifigan384test\hifi-gan-master\meldataset.py", line 61, in mel_spectrogram
    y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
  File "D:\Coding\PYFastCache\PYVenv\Hifigan384test\lib\site-packages\torch\nn\functional.py", line 2877, in pad
    assert len(pad) == 4, '4D tensors expect 4 values for padding'
AssertionError: 4D tensors expect 4 values for padding

Not quite sure what went wrong. Maybe the audio?


here's the layout.
02/12/2020  18:04    <DIR>          .
02/12/2020  18:04    <DIR>          ..
30/11/2020  23:03               762 config_v1.json
30/11/2020  23:03               762 config_v2.json
30/11/2020  23:03               752 config_v3.json
30/11/2020  23:03               394 env.py
02/12/2020  18:00    <DIR>          generated_files
02/12/2020  16:59    <DIR>          generated_files_from_mel
30/11/2020  23:03             2,652 inference.py
30/11/2020  23:03             2,444 inference_e2e.py
30/11/2020  23:03             1,067 LICENSE
30/11/2020  23:03    <DIR>          LJSpeech-1.1
30/11/2020  23:03             6,314 meldataset.py
30/11/2020  23:03             9,905 models.py
30/11/2020  23:03             4,767 README.md
30/11/2020  23:03               113 requirements.txt
02/12/2020  18:04    <DIR>          test_files
30/11/2020  23:03            12,153 train.py
30/11/2020  23:03             1,377 utils.py
30/11/2020  23:03            10,995 validation_loss.png
02/12/2020  16:53    <DIR>          vctk_v2
02/12/2020  16:59    <DIR>          __pycache__
              14 File(s)         54,457 bytes
               8 Dir(s)   7,409,783,296 bytes free

By the way, for the audio I just recorded my own voice. I'm not quite sure what is needed for input.
I assume it just needs some wav audio?

Output audio duration does not exactly match input audio.

Running your pre-trained models, I found that the generated audio does not exactly match the input audio in duration. For example,

wav, sr = load_wav(os.path.join(a.input_wavs_dir, filname))
wav = wav / MAX_WAV_VALUE
wav = torch.FloatTensor(wav).to(device)  # wav shape is torch.Size([71334])
x = get_mel(wav.unsqueeze(0))  # x shape is torch.Size([1, 80, 278])
y_g_hat = generator(x)  # y_g_hat shape is torch.Size([1, 1, 71168])

As you can see, there is a mismatch of 71334 and 71168. What is happening, and why is this the case? Is there a way that I can change it so that the input and output shapes match?

Thank you.

Edit: I checked training, and if the target segment_size is a multiple of 256 (hop_size), then y_g_hat = generator(x) also has exactly that number of samples.
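
If exact length matching matters, one option (a sketch, continuing the snippet above) is to pad the input waveform to a multiple of hop_size before computing the mel, so that the generator output has the same number of samples as the padded input:

import torch.nn.functional as F

hop_size = 256                         # hop_size from the config
pad = (-wav.shape[-1]) % hop_size
wav = F.pad(wav, (0, pad))             # length is now a multiple of hop_size
x = get_mel(wav.unsqueeze(0))
y_g_hat = generator(x)                 # output length matches the padded input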

When fine-tuning with the mel_**.npy files produced by the Tacotron2 model, a RuntimeError occurred.

Traceback (most recent call last):
  File "train.py", line 272, in <module>
    main()
  File "train.py", line 268, in main
    train(0, a, h)
  File "train.py", line 113, in train
    for i, batch in enumerate(train_loader):
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/.conda/envs/yzh/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/hifi/meldataset.py", line 163, in __getitem__
    center=False)
  File "/data/hifi/meldataset.py", line 50, in mel_spectrogram
    if torch.min(y) < -1.:
RuntimeError: invalid argument 1: cannot perform reduction function min on tensor with no elements because the operation does not have an identity at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:345

break point problem

(attached spectrogram image)
The spectrogram has a break point; what do you think is the reason for this problem? I have enlarged the receptive field ([1,3,5] -> [1,3,5,7]), but the problem still exists.

And why is your leaky ReLU slope 0.1 instead of 0.2?

Some questions about training

Hey.
Why does the readme say that you need to use GTA mels for fine-tuning? I used real spectrograms to train WaveGlow and Parallel WaveGAN, whose authors indicated that this achieves acceptable quality together with Tacotron 2. Is the GTA training procedure mandatory for this vocoder to achieve the best possible quality?

And another question about the multi-speaker model. Do I need any modifications to train a multi-speaker model, or is it enough to generate spectrograms of different speakers and train on them? Also, what is the minimum number of speakers required for the model to reproduce speakers unseen during training with good quality?
