bshall / UniversalVocoding
A PyTorch implementation of "Robust Universal Neural Vocoding"
Home Page: https://bshall.github.io/UniversalVocoding/
License: MIT License
I'd like to share my results from the Universal Vocoder on other datasets.
Thanks for your great library and the impressive results/demo.
Since you seem interested in other datasets (#2), I'll share my results (if not, please feel free to ignore!).
I forked this repository and trained it on another dataset, JSUT (a single female Japanese speaker, about 10 hours in total).
Although the model was trained on a single female speaker, it works very well even on out-of-domain test data (other female speakers, male speakers, and even English speakers).
Below are the results/demo:
https://tarepan.github.io/UniversalVocoding
My impression is that RNN_MS (the Universal Vocoder) learns the characteristics of human speech production (mouth/vocal tract), which are independent of language. Very interesting.
I'd be glad if my results are useful for your further experiments.
Again, thanks for your great library.
I was playing with the preprocessing parameters and managed to change the sound of the synthesized voice a bit.
I was wondering if there is a principled way to do this in terms of pitch, energy, style, timbre, etc.
Thanks!
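(For anyone experimenting along these lines: one simple thing to try, outside of this repo's own preprocessing, is to pitch-shift the waveform before extracting the mel spectrogram. A minimal sketch using librosa; the file name and parameter values are only examples, not this repo's settings:)

import librosa

wav, sr = librosa.load("sample.wav", sr=16000)
# shift pitch up by two semitones before computing the mel spectrogram
shifted = librosa.effects.pitch_shift(y=wav, sr=sr, n_steps=2)
mel = librosa.feature.melspectrogram(y=shifted, sr=sr, n_fft=2048, n_mels=80)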
Hi,
This repo is really great. May I ask how many training steps (with batch_size 32) were required for your demo samples? Given the amount of training data used here (around 26 hours of recordings), I guess the 100k num_steps provided in config.json is not enough, right?
Many thanks!
Hi, I have a question about the mel preprocessing function. When I use the following preprocessing method, the generated audio file is silent.
def melspectrogram(wav, hparams):
    D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
    S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db
    if hparams.signal_normalization:
        return _normalize(S, hparams)
    return S

def _stft(y, hparams):
    if hparams.use_lws:  # False in my config
        return _lws_processor(hparams).stft(y).T
    else:
        return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)

def _linear_to_mel(spectrogram, hparams):
    global _mel_basis
    if _mel_basis is None:
        _mel_basis = _build_mel_basis(hparams)
    return np.dot(_mel_basis, spectrogram)

def _amp_to_db(x, hparams):
    # with min_level_db = -100 this is min_level = 10 ** (-100 / 20)
    min_level = np.exp(hparams.min_level_db / 20 * np.log(10))
    return 20 * np.log10(np.maximum(min_level, x))

def _normalize(S, hparams):
    if hparams.allow_clipping_in_normalization:  # True
        if hparams.symmetric_mels:  # True
            return np.clip(
                (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value,
                -hparams.max_abs_value, hparams.max_abs_value)
        else:
            return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value)
The main differences are the line S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db and the _normalize step, with hparams.ref_level_db = 20 and hparams.max_abs_value = 4.
My data is in [-4, 4], while your preprocessing produces data in [0, 1]. Does the data range really have such a great influence on the model? I don't understand; I'm asking for your help. Thank you.
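(A minimal sketch of mapping a symmetric [-4, 4] mel back to the [0, 1] range, assuming the symmetric normalization above with max_abs_value = 4; note this does not account for any difference in the ref_level_db offset:)

import numpy as np

def to_unit_range(mel, max_abs_value=4.0):
    # invert the symmetric scaling: [-max_abs_value, max_abs_value] -> [0, 1]
    return np.clip((mel + max_abs_value) / (2 * max_abs_value), 0.0, 1.0)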
Hi bshall,
I have a question about the mu-law encoding function. I wondered why mulaw_encode() returns np.floor((fx + 1) / 2 * mu + 0.5) instead of returning fx directly.
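(For context, a sketch of standard mu-law encoding: the floor step quantizes the companded signal fx in [-1, 1] into integer class indices in [0, mu], which is what a categorical softmax output needs. This is the textbook formulation, not necessarily this repo's exact code:)

import numpy as np

def mulaw_encode(x, mu=2**10 - 1):  # 10-bit: 1024 classes
    # companding: x in [-1, 1] -> fx in [-1, 1]
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # quantize fx to an integer class index in [0, mu]
    return np.floor((fx + 1) / 2 * mu + 0.5).astype(np.int64)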
audio_slice_frames seems to be deprecated in v0.2. Was the 10-bit model trained with this version?
The conditioning network (rnn1) and the autoregressive network (rnn2) used different numbers of sample frames (#12). This was controlled in VocoderDataset by sample_frames and audio_slice_frames.
In v0.2 there seems to be no audio_slice_frames. Is it deprecated?
And was the 10-bit model (the LJSpeech model) trained without this different frame usage?
If I want to change the model's parameters, what do I need to modify?
I want to change sampling_rate, num_fft, num_mels, hop_length, and win_length.
That's pretty much every parameter.
Is there anything to change besides config.json?
Hello.
In preprocess.py, line 17:
wav /= np.abs(wav).max() * 0.999
I'm wondering why you chose to use * 0.999. It leads wav to take values slightly over 1.0. Is this a bug or intended?
Thanks.
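(To illustrate the concern: dividing by np.abs(wav).max() * 0.999 puts the peak at about 1.001, whereas multiplying by 0.999 after normalizing would keep it just under 1.0. A sketch of the latter, in case that was the intent:)

import numpy as np

# as written: peak becomes 1 / 0.999 ~= 1.001
# wav /= np.abs(wav).max() * 0.999

# alternative that keeps the peak at 0.999:
wav = wav / np.abs(wav).max() * 0.999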
"A PyTorch implementation of Robust Universal Neural Vocoding. Audio samples can be found here."
For the samples at that link: were they generated from ground-truth spectrograms or from spectrograms predicted by an acoustic model?
Hello,
I'm trying to figure out what I need to do so that my numpy array can be vocoded by the UniversalVocoder.
Attached is a sample npy file.
The output is from a modified https://github.com/Tomiinek/Multilingual_Text_to_Speech
import os
import numpy

def main():
    import torch
    import soundfile as sf
    from univoc import Vocoder

    cwd: str = os.getcwd()
    # download pretrained weights (and optionally move to GPU)
    vocoder: Vocoder = Vocoder.from_pretrained(
        "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
    ).cuda()
    # load log-Mel spectrogram from file or from tts (see https://github.com/bshall/Tacotron for example)
    mel = numpy.load(os.path.join(cwd, "tmp.npy"))
    # generate waveform
    with torch.no_grad():
        wav, sr = vocoder.generate(mel)
    # save output
    sf.write(os.path.join(cwd, "tmp.wav"), wav, sr)

if __name__ == "__main__":
    main()
Traceback (most recent call last):
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 29, in <module>
    main()
  File "/home/muksihs/git/Cherokee-TTS/tts-wrapper/uv.py", line 22, in main
    wav, sr = vocoder.generate(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/univoc/model.py", line 102, in generate
    mel, _ = self.rnn1(mel)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/muksihs/miniconda3/envs/UniversalVocoding/lib/python3.9/site-packages/torch/nn/modules/rnn.py", line 821, in forward
    max_batch_size = input.size(0) if self.batch_first else input.size(1)
TypeError: 'int' object is not callable
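(The error suggests mel is still a numpy array when it reaches the GRU: input.size(0) is a method call on torch tensors, but .size on a numpy array is an int attribute, hence 'int' object is not callable. A sketch of a likely fix, assuming the model expects a float tensor with a leading batch dimension on the same device; the transpose is only needed if the file stores (mel_bins, frames):)

import numpy
import torch

mel = numpy.load("tmp.npy")
# mel = mel.T  # uncomment if stored as (mel_bins, frames)
mel = torch.from_numpy(mel).float().unsqueeze(0).cuda()  # (1, frames, mel_bins)
with torch.no_grad():
    wav, sr = vocoder.generate(mel)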
Hey, thanks for your work on this project, it is really good.
I'm trying to use this vocoder to generate wavs from magnitude spectrograms produced by another neural network. Griffin-Lim gives me decent but somewhat robotic audio, so I think your vocoder would improve it a lot.
The biggest difference between the two networks' parameters is n_fft: my spectrograms use 1024 while your network uses 2048. If I use your pre-trained model and change only n_fft, the resulting audio is sped up a bit and the voice gets really high-pitched.
I tried retraining the network changing only n_fft, but the results were not good; there was a lot of noise.
Any leads on what I might try next?
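(One workaround worth trying, sketched below: reconstruct audio from your magnitude spectrograms with Griffin-Lim, then re-extract mel spectrograms using the vocoder's own parameters so it sees features consistent with its training data. The sr/hop/mel values below are placeholders; use whatever the two configs actually specify:)

import librosa

sr, hop_length, num_mels = 16000, 200, 80  # placeholders: take these from the vocoder's config.json

# mag: magnitude spectrogram from the other network (n_fft = 1024)
wav = librosa.griffinlim(mag, hop_length=256, win_length=1024)

# re-extract a mel spectrogram with the vocoder's parameters
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=2048, hop_length=hop_length, n_mels=num_mels)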
Hi, I trained your model on my own dataset for 1000k iterations; it sounds stable, with only a little background noise. But the loss stays around 2.6, and the noise didn't disappear after another 1000k steps. I have tried reducing the batch size to 2 and the learning rate to 5e-5, but it doesn't help. How can I deal with this?
samples.zip
Hi, recently I've been trying different mel preprocessing methods. The mel is in [-4, 4], and the audio processing stays the same (mulaw_encode and mulaw_decode are unchanged). But the generated audio contains a lot of noise, even though the mel looks normal.
How can I deal with this? Looking forward to your response, thank you. @bshall
Hello,
I saw that you used pad, audio_slice_frames, and sample_frames, but I can't understand the usage of those params. Can you explain what they mean?
Also, the WaveRNN model uses the padded mel input in the first GRU layer, whereas you slice out the padding after the first layer. Is it important to use the padded mel in the first GRU?
Thanks.
@bshall - First of all, thank you for this implementation. In this issue, you pointed out that you generated a sample from a mel spectrogram produced by a VQ-VAE. It sounds pretty good.
My question is: how would one go about generating audio from mel spectrograms? Do we need to preprocess the mel spectrogram if that's the only thing we're given?
Thank you for sharing your great work.
Since I have changed many parameters (n_mels, fft, hop, window, etc.), I am training this model from scratch on the VCTK dataset.
Could you tell me what environment you trained on and how long it took?
I have a GeForce RTX 2080 Ti, and it looks like it will take a whole month :(
Hi @bshall, I was wondering how to set the path to the downloaded waveform directory when preprocessing, as it isn't exposed as a parameter.
Hi! Could you share some details about the inference speed compared to Griffin-Lim/WaveNet/WaveRNN?
Hello,
In the original implementation of this model, the authors employed a one-hot audio vector of dimension 1024. Unfortunately, they did not say much about this one-hot vector in the paper and did not explain its purpose in the model. Given that its dimension is 1024 = 2^10, and that the authors use 10-bit audio samples, I assume this vector is related to the prediction of each bit in each audio sample. But that's just a guess.
So, I have two (actually three) questions:
Thank you very much
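(One plausible reading, sketched below: 10-bit mu-law quantization gives 2^10 = 1024 discrete values per sample, so the 1024-dimensional one-hot vector would encode the previous sample's quantized value as a single class, rather than individual bits. This is an assumption based on standard WaveRNN-style models, not a statement about the authors' exact code:)

import numpy as np

def one_hot(sample_class, num_classes=2**10):
    # encode one mu-law-quantized sample (an integer in [0, 1023]) as a one-hot vector
    v = np.zeros(num_classes, dtype=np.float32)
    v[sample_class] = 1.0
    return v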
Following the original paper, train a model on 24 kHz audio with 10-bit mu-law encoding.
Hello,
It takes 25 seconds to generate three seconds of audio (sample_rate 22050, about 15 words). Do you have any ideas for performance optimization? We can discuss it. Thank you.
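(Two generic things worth checking before deeper changes, sketched below: run generation under torch.inference_mode() or no_grad with the model in eval mode on the GPU. Anything beyond that, such as batched sampling or kernel fusion, would need changes inside the model itself; this sketch only covers the call site:)

import torch

vocoder = vocoder.cuda().eval()
with torch.inference_mode():  # disables autograd bookkeeping during sampling
    wav, sr = vocoder.generate(mel.cuda())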
What is the maximum number of speakers during training? The paper uses 17 speakers. What happens if the number of speakers is larger than 17?