seungwonpark / melgan Goto Github PK

MelGAN vocoder (compatible with NVIDIA/tacotron2)

Home Page: http://swpark.me/melgan/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

tts neural-vocoder gan pytorch

melgan's Introduction

MelGAN

Unofficial PyTorch implementation of MelGAN vocoder

Key Features

MelGAN is lighter, faster, and better at generalizing to unseen speakers than WaveGlow.
This repository uses the same mel-spectrogram function as NVIDIA/tacotron2, so it can be used directly to convert the output of NVIDIA's tacotron2 into raw audio.
Pretrained model on LJSpeech-1.1 via PyTorch Hub.

Prerequisites

Tested on Python 3.6

pip install -r requirements.txt

Prepare Dataset

Download a dataset for training. Any set of wav files with a sample rate of 22050 Hz will work (e.g. LJSpeech was used in the paper).
Preprocess: python preprocess.py -c config/default.yaml -d [data's root path]
Edit the configuration yaml file.
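Because the dataset is expected to be sampled at 22050 Hz, it can help to check the files before preprocessing. The following is a minimal sketch, not part of this repository: the function name `find_wrong_sample_rate` is hypothetical, and it relies on Python's standard `wave` module, which only reads PCM-encoded wav files.

```python
import wave
from pathlib import Path

EXPECTED_SAMPLE_RATE = 22050  # Hz, the rate stated in the README


def find_wrong_sample_rate(root, expected=EXPECTED_SAMPLE_RATE):
    """Return (path, sample_rate) for every PCM .wav under root whose rate differs."""
    mismatches = []
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected:
            mismatches.append((path, rate))
    return mismatches
```

Any file reported by this helper would need to be resampled (for example with ffmpeg or sox) before running preprocess.py.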

Train & Tensorboard

python trainer.py -c [config yaml file] -n [name of the run]
- cp config/default.yaml config/config.yaml and then edit config.yaml
- Write the root paths of the train/validation files on the 2nd/3rd lines.
- Each path should contain pairs of *.wav files with their corresponding (preprocessed) *.mel files.
- The data loader parses the list of files within each path recursively.
tensorboard --logdir logs/
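Since each data path must contain *.wav files paired with preprocessed *.mel files, a quick check before training can catch files that were skipped during preprocessing. This is a rough sketch under one assumption: that the mel file sits next to the wav file and differs only in its extension. The helper name `wavs_missing_mel` is hypothetical.

```python
from pathlib import Path


def wavs_missing_mel(root):
    """Return every .wav under root (searched recursively) with no sibling .mel file."""
    return [
        wav_path
        for wav_path in sorted(Path(root).rglob("*.wav"))
        if not wav_path.with_suffix(".mel").exists()
    ]
```

An empty list for both the train and validation roots suggests the paths in config.yaml point at fully preprocessed data.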

Pretrained model

Try with Google Colab: TODO

import torch
vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
vocoder.eval()
mel = torch.randn(1, 80, 234) # use your own mel-spectrogram here

if torch.cuda.is_available():
    vocoder = vocoder.cuda()
    mel = mel.cuda()

with torch.no_grad():
    audio = vocoder.inference(mel)

Inference

python inference.py -p [checkpoint path] -i [input mel path]

Results

See audio samples at: http://swpark.me/melgan/. The model was trained on a V100 GPU for 14 days using LJSpeech-1.1.

Implementation Authors

Seungwon Park @ MINDsLab Inc. ([email protected], [email protected])
Myunchul Joe @ MINDsLab Inc.
Rishikesh @ DeepSync Technologies Pvt Ltd.

License

BSD 3-Clause License.

utils/stft.py by Prem Seetharaman (BSD 3-Clause License)
datasets/mel2samp.py from https://github.com/NVIDIA/waveglow (BSD 3-Clause License)
utils/hparams.py from https://github.com/HarryVolek/PyTorch_Speaker_Verification (No License specified)

Useful resources

How to Train a GAN? Tips and tricks to make GANs work by Soumith Chintala
Official MelGAN implementation by original authors
Reproduction of MelGAN - NeurIPS 2019 Reproducibility Challenge (Ablation Track) by Yifei Zhao, Yichao Yang, and Yang Gao
- "replacing the average pooling layer with max pooling layer and replacing reflection padding with replication padding improves the performance significantly, while combining them produces worse results"

melgan's Issues

This repository use identical mel-spectrogram function from NVIDIA/tacotron2, so this can be directly used to convert output from NVIDIA's tacotron2 into raw-audio.

how?

Could you advise why I get artifacts?

https://drive.google.com/open?id=1EgYUmMaWS56_6G3zcXWQVBIzAL6JBmYO

Report an error

ok
from .res_stack import ResStack
#from res_stack import ResStack

Use mel-gan as a universal vocoder

Hello, thanks for your nice implementation of mel-gan.

I guess mel-gan can be used as a universal vocoder, and I thought there was a mention of a multi-speaker training scheme in the original paper. Have you ever tried a multi-speaker setting? It would be really useful if it could serve as a universal vocoder similar to this.

Does this project support non-English languages?

I want to build Vietnamese text-to-speech.
My native language is Vietnamese.
I want to record my own voice and make a dataset.
Is this project applicable?

Why do you add 2 to self.mel_segment_length in line 29 of dataloader.py?

Error in training: tensor a must match the size of tensor b

Wrong implementation of Generator

The last layer should be:

nn.utils.weight_norm(nn.Conv1d(32, 1, kernel_size=7, stride=1, padding=3)),

not:

nn.utils.weight_norm(nn.ConvTranspose1d(32, 1, kernel_size=7, stride=1, padding=3)),

omg...
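One reason such a swap can go unnoticed is that, with kernel_size=7, stride=1, and padding=3, both layer types produce an output of the same length, so no shape error is raised. The sketch below checks this with the standard output-length formulas for 1-D convolution and transposed convolution, assuming dilation 1 and output_padding 0. The function names are illustrative only.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def conv_transpose1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D transposed convolution (dilation 1, output_padding 0)."""
    return (length - 1) * stride - 2 * padding + kernel_size


length = 16000
print(conv1d_out_len(length, kernel_size=7, stride=1, padding=3))            # 16000
print(conv_transpose1d_out_len(length, kernel_size=7, stride=1, padding=3))  # 16000
```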

Click sound artifact at the end of each sample

In each resulting audio sample, there is an audible clicking sound at the end.
This is also visible at the raw audio visualization in tensorboard.

Target:

Predicted:

Pretrained model

@seungwonpark upload the pretrained model so that we can test

strange noises in your samples && error when running inference.py

Your samples at epoch 3200 have strange noises at unvoiced segments, while there is no such phenomenon in samples at epoch 1600.

Besides, when running inference.py, an error occurs, pointing to

melgan/model/generator.py

Line 68 in 8af1e9c

mel = torch.cat((mel, zero), axis=2)

torch.cat() has a parameter "dim" rather than "axis"

How to Edit default.yaml

Hi, I am at the stage in your instructions where I am supposed to edit default.yaml, but I am not sure which paths to put in, or where exactly to put them in the text when I open the yaml file in Notepad.

multi gpu training

Is it possible to train it on multiple GPUs? On multiple nodes? Thanks.

Can I do the google colab implementation for you?

Hello Seungwon Park @seungwonpark,

You said:

"Try with Google Colab: TODO..."

This is my second time implementing an ML model on Google Colab. I am a beginner in ML, but I learn best hands-on.

Thanks for being awesome
Can I do the Google Colab implementation?
Have you attempted it? If so did you face any obstacles?
Do you want a fork or pull request?

Sound artifact at the end of the sample

First of all, thank you for the wonderful repository.

I know that this issue has been discussed in a previous issue, but I wanted to know if the artifact that appears at the end of the inferred sentence can be solved by the current repository. I am experiencing artifacts at the end of a sample (and also in between sentences) no matter what I try, so I was hoping that someone could point me in the right direction to address this issue.

Also, I was wondering where -11.5129 came from in the following line:

melgan/model/generator.py

Line 69 in aca5990

zero = torch.full((1, self.mel_channel, 10), -11.5129).to(mel.device)

Thanks in advance.
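A plausible origin for the constant, offered as an assumption rather than a statement of the author's intent: if mel magnitudes are clamped to a floor of 1e-5 before taking the natural log (the dynamic range compression used in NVIDIA/tacotron2), then a silent frame takes the value log(1e-5). The name `CLIP_VAL` below is illustrative.

```python
import math

CLIP_VAL = 1e-5  # assumed floor applied to mel magnitudes before the log

# Log-mel value of a frame whose magnitude sits at the floor, i.e. silence.
silence_log_mel = math.log(CLIP_VAL)
print(round(silence_log_mel, 4))  # -11.5129
```

Under that reading, the quoted line builds 10 frames of "silence" to append to the input mel-spectrogram.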

Better audio quality with larger resnet

Hi, great repo!

I found that the audio quality improves considerably with a slightly increased ResNet as suggested in https://arxiv.org/pdf/2005.05106.pdf. The shaky and metallic artefacts are reduced a lot.

Here is a comparison of your pretrained LJSpeech with a current model I am still training (for TTS I used https://github.com/as-ideas/ForwardTacotron)

Original (6400 epochs):
https://drive.google.com/file/d/1LOIB9B7LDX9g-kVu_p1anGJgJ5vjE27s/view?usp=sharing

Larger ResNet (2000 epochs):
https://drive.google.com/file/d/19_d2SQU1xZi-o90MJ8NcKhIS6AFwliH-/view?usp=sharing

If you are interested I could open a PR making the layers more flexible.

Use this implementation for TTS engine

Could you create a separate branch for a TTS implementation? That's the ultimate goal for every neural vocoder. I will try to use this implementation with NVIDIA's Tacotron2, as the preprocessing for both networks is the same.

Note: I am already working on it, and will post the output samples here by tomorrow.

Is concurrent inference possible?

Train on 2 or more GPUs

Awesome work!
How should I change this code to train on 2 or more GPUs?

Unable to resume training from official checkpoint

Hi @seungwonpark

Thanks for all this. I am using your official checkpoint nvidia_tacotron2_LJ11_epoch6400.pt

When I try to resume training from that checkpoint I get the following error:

2020-01-07 00:26:35,386 - INFO - Resuming from checkpoint: ./nvidia_tacotron2_LJ11_epoch6400.pt
Traceback (most recent call last):
  File "trainer.py", line 52, in <module>
    train(args, pt_dir, args.checkpoint_path, trainloader, valloader, writer, logger, hp, hp_str)
  File "/delip/workspace/melgan/utils/train.py", line 34, in train
    model_d.load_state_dict(checkpoint['model_d'])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 816, in load_state_dict
    state_dict = state_dict.copy()
AttributeError: 'NoneType' object has no attribute 'copy'

Looks like a mismatch between my PyTorch version and the one used for this checkpoint? My PyTorch version is 1.3.0a0+24ae9b5.

Do you have a checkpoint for the latest PyTorch? Or alternatively, what was the PyTorch version used with this checkpoint?

PS: md5sum for my copy of the checkpoint is 1cb89dc08401770fa9e2dd7d5c704bf5
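The traceback ends in `state_dict.copy()` being called on `None`, which means `checkpoint['model_d']` itself was `None` when it was passed to `load_state_dict`. That points to the checkpoint contents rather than the PyTorch version. One possibility, stated here as an assumption, is that the released file does not store the discriminator. A small sketch for inspecting a loaded checkpoint dict follows; the key `model_g` and the helper name are hypothetical, while `model_d` comes from the traceback.

```python
def missing_state_dicts(checkpoint, keys=("model_g", "model_d")):
    """Return the keys whose entry is absent or None in a loaded checkpoint dict."""
    return [key for key in keys if checkpoint.get(key) is None]
```

With `checkpoint = torch.load(path, map_location="cpu")`, a result of `["model_d"]` would confirm that only the generator weights are available, in which case training cannot be resumed from that file without reinitializing the discriminator.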

Optimize Network to remove click like sound artifact

After 2000 epochs the sound quality reaches a usable level, but one problem remains: a metallic, click-like noise artifact at the end of each generated sample. More optimization and R&D are needed to remove this kind of artifact.

Loading without cuda results in an error

I am getting the following error while trying to load using torch hub

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

To fix this, apply the following snippet to this line:

state_dict = torch.hub.load_state_dict_from_url(params['model_url'],
                                                        progress=progress,
                                                        map_location=torch.device('cpu'))

How to inference using MelGAN given a tacotron mel spec output?

When I trained MelGAN with mel-spectrograms computed from the original wav files, the result was good.

But when I fed the mel-spectrogram output of Tacotron into the trained MelGAN model, the output was nothing but a buzzing sound. Would you mind sharing some advice? Thanks a lot. @seungwonpark

Torch hub command not working

Hello,
Thanks a lot for your implementation.
I am slightly confused because I am trying to run your torch hub command but it does not work.

Also if I look in /home/user/.cache/torch/checkpoints/, I cannot find the checkpoint even though I do get the logging: Downloading: "https://github.com/seungwonpark/melgan/releases/download/v0.1-alpha/nvidia_tacotron2_LJ11_epoch3200.pt" to /home/user/.cache/torch/checkpoints/nvidia_tacotron2_LJ11_epoch3200.pt

Batch size = 16?

Hi,
Thank you for your nice implementation. I have a question about the batch size selection. It looks like the network is small enough for a bigger batch size, for example 32 or 64 on a GTX 1080Ti. Is the batch size of 16 a kind of regularization?
Another question is related to the G/D updates. For your generated samples, did you use a 1:1 update ratio?
Thanks.

remove weight normalization at inference phase

I'm thinking of writing a code snippet that removes weight normalization from the .pt file and strips the discriminator components to make the checkpoint file smaller.

Augmentation

Hi,
I've noticed you add random noise to the audio. I just wondered if it would make more sense to add random noise to the Mel spectra? Considering that this is what you might get from something like tacotron, so noisy mels.

torchscript implementation

I noticed a torchscript branch. Is there a torchscript implementation of just the model conversion and inference?

Why remove weight norm?

Why is weight norm removed in eval?
Should weight norm be kept or removed at inference time?
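As background for this question: weight normalization reparameterizes a weight as w = g * v / ||v||. During training, g and v are learned separately. At inference the product can be computed once and stored as a plain weight, which is what PyTorch's `torch.nn.utils.remove_weight_norm` does. The output is unchanged; only the per-forward-pass normalization is saved. The sketch below illustrates the folding for a single output unit in plain Python; the function names are illustrative.

```python
import math


def fold_weight_norm(g, v):
    """Collapse weight-norm parameters (g, v) into the plain weight w = g * v / ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


g, v = 2.0, [3.0, 4.0]       # ||v|| = 5
w = fold_weight_norm(g, v)   # approximately [1.2, 1.6]
x = [1.0, 1.0]

# The folded weight gives the same result as normalizing on the fly.
on_the_fly = g * dot(v, x) / math.sqrt(dot(v, v))
print(abs(dot(w, x) - on_the_fly) < 1e-12)
```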

inference

inference.py -p [checkpoint path] -i [input mel path]
Which files do I use to generate speech?
I found the file "nvidia_tacotron2_LJ11_epoch6400.pt". What is it for?

hop_length

Hi!
You once commented that "the model architecture upsamples the mel-spectrogram by 256 times, so the hop_length can't be changed." May I know which model? And if I change the upsampling factor, can I change hop_length?
Thanks!
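The constraint in the quoted comment is arithmetic: each mel frame covers hop_length audio samples, so the generator's total upsampling factor has to equal hop_length. The sketch below assumes the upsampling stack described in the MelGAN paper (factors 8, 8, 2, 2); whether this repository uses exactly these factors is an assumption here.

```python
from functools import reduce
from operator import mul

UPSAMPLE_FACTORS = (8, 8, 2, 2)  # assumed: per-layer upsampling from the MelGAN paper
HOP_LENGTH = 256                 # value the config says to leave unchanged


def total_upsampling(factors=UPSAMPLE_FACTORS):
    """Product of the per-layer upsampling factors."""
    return reduce(mul, factors, 1)


def samples_for_frames(n_frames, factors=UPSAMPLE_FACTORS):
    """Number of audio samples generated from n_frames mel frames."""
    return n_frames * total_upsampling(factors)


print(total_upsampling() == HOP_LENGTH)  # True
print(samples_for_frames(234))           # 59904
```

Changing hop_length would therefore require changing the factors so that their product matches, and retraining the model.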

Some notable differences with official implementation

Hi,
Just FYI, the authors of the official MelGAN repo used hinge losses, whereas the paper describes an L2 loss. This repo is consistent with the paper! I am setting up some experiments with the hinge loss to see the difference. Another note: the default segment_length in the official repo is 8192 (vs. 16k in this repo).
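For readers unfamiliar with the two objectives mentioned here, this is a minimal plain-Python sketch of the standard least-squares (L2) and hinge discriminator losses, computed over lists of discriminator scores. It illustrates the textbook formulas and is not code from either repository.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: push real scores to 1 and fake scores to 0."""
    real_term = mean([(s - 1.0) ** 2 for s in real_scores])
    fake_term = mean([s ** 2 for s in fake_scores])
    return real_term + fake_term


def hinge_d_loss(real_scores, fake_scores):
    """Hinge discriminator loss: penalize real scores below 1 and fake scores above -1."""
    real_term = mean([max(0.0, 1.0 - s) for s in real_scores])
    fake_term = mean([max(0.0, 1.0 + s) for s in fake_scores])
    return real_term + fake_term
```

One practical difference: with real score 2.0 and fake score -2.0, the hinge loss is already zero, while the least-squares loss still penalizes the overshoot.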

Random crashes with custom dataset, tensor size mismatch

I created a small test dataset that you can replicate by downloading this podcast and following these steps.

I then used ffmpeg to convert it to a mono 22050 Hz wav file with ffmpeg -i input.mp3 -ac 1 -ar 22050 output.wav

I used sox to split on silence to have many smaller pieces into a split_files output folder with sox -V3 output.wav split_files/output.wav silence -l 0 3.0 1.0 5% : newfile : restart

There should be 240 pieces.

The last 24 pieces were used for validation.

Here are two separate errors (note that the dataloader shuffling was set to False for both of these runs, even though they crash at different steps):

[eric@eric-pc melgan]$ python trainer.py -c config/default.yaml -n test4
2019-10-24 23:06:54,795 - INFO - Starting new training run.
Validation loop: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:03<00:00,  6.90it/s]
g 31.2470 d 56.5574 | step 13: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:08<00:00,  1.51it/s]
2019-10-24 23:07:10,354 - INFO - Saved checkpoint to: chkpt/test4/test4_df8b090_0000.pt
g 29.4583 d 55.8972 | step 26: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:06<00:00,  1.91it/s]
g 29.3384 d 55.7414 | step 39: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:06<00:00,  1.90it/s]
g 31.0743 d 55.8826 | step 52: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:06<00:00,  1.87it/s]
g 30.2437 d 55.5219 | step 65: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:06<00:00,  1.89it/s]
Validation loop: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:03<00:00,  6.98it/s]
g 32.9035 d 58.3628 | step 78: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:06<00:00,  1.88it/s]
g 32.2074 d 55.6909 | step 91: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:06<00:00,  1.87it/s]
g 30.4200 d 55.2120 | step 93:  15%|██████████████████████████▏                                                                                                                                           	| 2/13 [00:01<00:09,  1.20it/s]2019-10-24 23:07:59,489 - INFO - Exiting due to exception: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
	data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
	return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
	return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15986 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689

Traceback (most recent call last):
  File "/home/eric/Documents/repos/melgan/utils/train.py", line 64, in train
	for (melG, audioG), (melD, audioD) in loader:
  File "/usr/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1060, in __iter__
	for obj in iterable:
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
	return self._process_data(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
	data.reraise()
  File "/usr/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
	raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
	data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
	return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
	return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15986 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689

g 30.4200 d 55.2120 | step 93:  15%|██████████████████████████▏                                                                                                                                           	| 2/13 [00:01<00:08,  1.28it/s]
[eric@eric-pc melgan]$ python trainer.py -c config/default.yaml -n test5
2019-10-24 23:11:19,808 - INFO - Starting new training run.
Validation loop: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:03<00:00,  6.96it/s]
g 31.1410 d 56.5434 | step 13: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:08<00:00,  1.51it/s]
2019-10-24 23:11:35,537 - INFO - Saved checkpoint to: chkpt/test5/test5_df8b090_0000.pt
g 30.1641 d 56.2416 | step 21:  62%|████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                             	| 8/13 [00:04<00:02,  1.93it/s]2019-10-24 23:11:39,845 - INFO - Exiting due to exception: Caught RuntimeError in DataLoader worker process 8.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
	data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
	return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
	return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15958 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689

Traceback (most recent call last):
  File "/home/eric/Documents/repos/melgan/utils/train.py", line 64, in train
	for (melG, audioG), (melD, audioD) in loader:
  File "/usr/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1060, in __iter__
	for obj in iterable:
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
	return self._process_data(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
	data.reraise()
  File "/usr/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
	raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 8.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
	data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
	return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
	return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
	return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15958 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689

g 30.1641 d 56.2416 | step 21:  62%|████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                                             	| 8/13 [00:04<00:02,  1.81it/s]
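The mismatched sizes in both tracebacks (16000 vs. 15986 and 16000 vs. 15958) are consistent with some clips being shorter than the 16000-sample training segment, so they cannot be stacked into one batch. This is a reading of the logs, not a confirmed diagnosis. A minimal sketch of one workaround, assuming zero-padding short clips is acceptable, is shown below; the function name `fix_length` is hypothetical.

```python
def fix_length(samples, segment_length=16000, pad_value=0.0):
    """Crop or right-pad a 1-D sequence of samples to exactly segment_length."""
    if len(samples) >= segment_length:
        return list(samples[:segment_length])
    return list(samples) + [pad_value] * (segment_length - len(samples))
```

If audio is padded this way, the corresponding mel-spectrogram would need matching padding to stay aligned. An alternative is to drop clips shorter than the segment length when splitting on silence.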

There is some noise in the gap position

Hello, thank you very much for the good work!
I ran experiments on Chinese datasets and found some noise in the gaps between speech. May I ask if this is the best achievable result?
Here are the samples: syn.zip

I applied it to a Tacotron2 model I trained myself, but the sound cuts off at the end

Do you happen to know why?
epoch1750.wav.zip

How about the inference speed?

Thanks for your implementation of MelGan. As is introduced, MelGAN is lighter, faster, and better at generalizing speech. How about the inference speed?

Validation data

Quick question: where do I get / generate the validation data required (I'm using the LJSpeech set)?

If you can point to other/better training data with the corresponding validation data, also let me know please.

Error in inference

Hello. When I run inference.py, I get this error:
0it [00:00, ?it/s]
How can I solve that? Thank you.

feeding mel features into discriminator

That gives a ~10x training speed boost. The network is usable after 100k iterations.
新建文件夹.zip

32bit?

Discriminator dominates over generator

I could get audible results at epoch>350, but they don't look good.
Also, d_loss gets too low and g_loss gets too high.

Perhaps this could be caused by:

The discriminator may have learned that real data have discrete values (np.int16). Adding Gaussian noise may help. See soumith/ganhacks#14
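A rough sketch of the idea, assuming real audio is stored as int16 and scaled to [-1, 1) before reaching the discriminator: adding Gaussian noise with a standard deviation of about one quantization step blurs the discrete levels. The function name, the noise level, and the use of Python's `random` module are illustrative choices, not taken from this repository.

```python
import random

MAX_WAV_VALUE = 32768.0  # int16 full scale


def dequantize(int16_samples, noise_std=1.0 / MAX_WAV_VALUE, seed=None):
    """Scale int16 samples to [-1, 1) and add small Gaussian noise to each one."""
    rng = random.Random(seed)
    return [s / MAX_WAV_VALUE + rng.gauss(0.0, noise_std) for s in int16_samples]
```

In a training loop the same operation would normally be applied to the audio tensor, with the noise level treated as a hyperparameter.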

Strange inference results with pretrained 6400epoch model

Thank you for your implementation and effort. I have a question about inference and about getting test samples from your pretrained model. Training, preprocessing, and inference all run without problems on my Ubuntu machine, but the results are strange: I cannot reproduce your samples from the model trained for 6400 epochs.

Am I missing something crucial? Can this model generate unconditional audio? What is the expected mel input for inference? Can your implementation generate audio translation?

My generated test samples and config files are in folder:https://drive.google.com/drive/folders/1zRhTFP7GepXrm_DPHkF1Nt94LZXMBX4Z?usp=sharing

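I have a question about preprocessing the MelGAN training data.

Hello Seungwon. First of all, thank you for releasing such good code.
I have little experience with speech synthesis, so I apologize for asking what may be a basic question.

I have trained a WaveRNN vocoder before.
The repo I used as a reference trained on mel-spectrograms generated by the mel-prediction network that sits in front of the vocoder.

If the upstream model is fixed anyway, I suspect that training on the mels the model actually generates would give better results than training on mels produced by a signal-processing pipeline. I would like to ask your opinion on this.

One more question: a vocoder can be trained on a single speaker, but it can also be trained on multiple speakers. Would multi-speaker training degrade the audio quality the model produces?

Thank you.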

LJSpeech Checkpoints

Hello,

I can't find pretrained LJSpeech checkpoints in this repo. Is it possible to train for a different language starting from the LJSpeech checkpoint?

Could you share a link to the checkpoint from the last LJSpeech training step?

Thank you.

Is this repo completely dead?

mel_fmax does not cover all frequency

It looks like waveglow's default configuration does not let the mel-spectrogram cover the full frequency range (0 to 11025 Hz): https://github.com/NVIDIA/waveglow/blob/master/config.json

This is a plot of librosa.filters.mel(22050, 1024, 80, fmin=0.0, fmax=8000.0).

I think this is the reason why waveglow and our implementation of melgan don't seem to generate high-frequency audio.
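The size of the gap follows from the numbers quoted above: at a 22050 Hz sample rate the Nyquist frequency is 11025 Hz, so a filterbank with fmax=8000 Hz leaves the band above 8000 Hz unrepresented. A short calculation (variable names are illustrative):

```python
SAMPLE_RATE = 22050   # Hz
MEL_FMAX = 8000.0     # Hz, the fmax quoted above

nyquist = SAMPLE_RATE / 2
uncovered_hz = nyquist - MEL_FMAX
uncovered_fraction = uncovered_hz / nyquist

print(nyquist, uncovered_hz, round(uncovered_fraction, 3))  # 11025.0 3025.0 0.274
```

So roughly 27% of the linear frequency range carries no information in the mel input, and the vocoder has nothing to condition on there.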

Questions to use melgan on my own dataset

Hi, I have run into some problems while trying to use melgan on my own dataset.
First, you comment in default.yaml that hop_length should be left at 256. Why can't I change the value? Is this a limitation of the model structure?
Second, in the MelFromDisk class you use a mapping in __getitem__ for the training set. What is this mapping used for? I think the input idx lies in [0, len(wav_list)) and the mapping has the same range.

Report an error

ok
from .res_stack import ResStack
#from res_stack import ResStack

New error
#from .res_stack import ResStack
from res_stack import ResStack

Text2Mel input to MelGan outputs noisy audio file without any speech

Hey!

I've retrained the text2mel model after cutting out the mel reduction part of the preprocessor and changing the hparams to:

hop_length = 256
win_length = 1024
max_N = 180 # Maximum number of characters.
max_T = 210 # Maximum number of mel frames.
e = 512 # embedding dimension
d = 256 # Text2Mel hidden unit dimension

I'm trying to feed the generated mels to MelGAN, but the output audio file is just a noisy honk.
Any ideas?

EOFError: Ran out of input with num_workers>0 in windows

There seems to be a related issue: https://discuss.pytorch.org/t/pytorch-windows-eoferror-ran-out-of-input-when-num-workers-0/25918/18. But I haven't found a workaround so far.

python trainer.py -c config/config.yaml -n firstrun
2020-04-16 06:20:54,376 - INFO - Starting new training run.
Validation loop: 0%| | 0/1283 [00:00<?, ?it/s]
2020-04-16 06:20:54,386 - INFO - Exiting due to exception: 'getstate'
Traceback (most recent call last):
File "C:\Users\susinder\PycharmProjects\melgan_seungwonpark\utils\train.py", line 60, in train
validate(hp, args, model_g, model_d, valloader, writer, step)
File "C:\Users\susinder\PycharmProjects\melgan_seungwonpark\utils\validation.py", line 13, in validate
for mel, audio in loader:
File "C:\Users\susinder\Anaconda3\envs\test\lib\site-packages\tqdm\std.py", line 1119, in iter
for obj in iterable:
File "C:\Users\susinder\Anaconda3\envs\test\lib\site-packages\torch\utils\data\dataloader.py", line 279, in iter
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\susinder\Anaconda3\envs\test\lib\site-packages\torch\utils\data\dataloader.py", line 719, in init
w.start()
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\popen_spawn_win32.py", line 89, in init
reduction.dump(process_obj, to_child)
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\reduction.py", line 60, in dump

Validation loop: 0%| | 0/1283 [00:00<?, ?it/s]

(test) C:\Users\susinder\PycharmProjects\melgan_seungwonpark>Traceback (most recent call last):
File "", line 1, in
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "C:\Users\susinder\Anaconda3\envs\test\lib\multiprocessing\spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
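On Windows, DataLoader workers are started with the spawn method, which requires the dataset object to be pickled and sent to each worker. The traceback shows that this pickling step fails. A common way to sidestep the problem, rather than fix it, is to load data in the main process by setting num_workers to 0 on Windows. The helper below is a hypothetical sketch of that choice, not code from this repository.

```python
import platform


def safe_num_workers(requested, system=None):
    """Return 0 on Windows (single-process loading), otherwise the requested worker count."""
    system = platform.system() if system is None else system
    return 0 if system == "Windows" else requested
```

The value would then be passed as `num_workers=safe_num_workers(4)` when constructing the DataLoader, or set directly in the config file. Data loading will be slower, but no worker processes are spawned.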
