
cookieppp / vocodercomparisons

6 stars · 4 watchers · 1 fork · 790 KB

Train/test a variety of open source vocoders using the same input features and dataset. Then infer together for easy side-by-side comparisons.

License: MIT License

Python 100.00%

vocodercomparisons's Introduction

VocoderComparisons

Train/test a variety of open source vocoders using the same input features and dataset. Then infer together for easy side-by-side comparisons.

vocodercomparisons's People

Contributors

cookieppp

Forkers

iamgoofball

vocodercomparisons's Issues

The right way to generate mel-spectrogram

I found your repo through this issue: jik876/hifi-gan#63

I am still confused about the mismatch in mel-spectrogram generation between repos. I collected several methods from different TTS repos; the shared setup I assume for them is sketched just after this list. They differ in a few ways, such as:

  • STFT from torch vs. librosa
  • Log-mel with base e vs. base 10
  • Differences in padding
  • Whether center is used
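
For reference, the snippets below share a handful of globals that this issue does not show. Here is a minimal sketch of what they might look like, assuming typical LJSpeech-style settings (these exact values are my assumption, not taken from the repo):

# assumed shared setup for the snippets below (hypothetical values)
import numpy as np
import torch
import librosa

device = 'cpu'
max_wav_value = 32768.0  # 16-bit PCM full scale
sampling_rate = 22050
num_mels = 80
fft_size = 1024
hop_size = 256
win_length = 1024
fmin, fmax = 0, 8000
eps = 1e-10
window_librosa = 'hann'
window_torch = torch.hann_window(win_length).to(device)
# shared mel filterbank, shape (num_mels, fft_size // 2 + 1)
melbasis = librosa.filters.mel(sr=sampling_rate, n_fft=fft_size, n_mels=num_mels, fmin=fmin, fmax=fmax)
# TacotronSTFT is assumed to be imported from the NVIDIA tacotron2 repo (layers.py)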

def get_mel_librosa1(wave):
    # librosa power mel-spectrogram (no log applied)
    wave = wave / max_wav_value
    wave = wave.astype('float32')
    mel = librosa.feature.melspectrogram(y=wave, sr=sampling_rate, n_mels=num_mels, n_fft=fft_size, hop_length=hop_size, win_length=win_length, window=window_librosa)  # , center=True, pad_mode='constant', power=2.0
    return mel

def get_mel_librosa2(wave):
    # mel of the STFT magnitude, converted to dB relative to the minimum value
    wave = wave / max_wav_value
    wave = wave.astype('float32')
    sgram = librosa.stft(wave, n_fft=fft_size, hop_length=hop_size, win_length=win_length, window=window_librosa)
    sgram_mag, _ = librosa.magphase(sgram)
    mel_scale_sgram = librosa.feature.melspectrogram(S=sgram_mag, sr=sampling_rate, n_mels=num_mels, n_fft=fft_size, hop_length=hop_size, win_length=win_length, window=window_librosa)
    mel_sgram = librosa.amplitude_to_db(mel_scale_sgram, ref=np.min)
    return mel_sgram

def get_mel_parallelwavegan(wave):
    # get amplitude spectrogram, then base-10 log of the mel projection
    wave = wave / max_wav_value
    wave = wave.astype('float32')
    x_stft = librosa.stft(wave, n_fft=fft_size, hop_length=hop_size, win_length=win_length, window=window_librosa, center=True, pad_mode="reflect")
    spc = np.abs(x_stft).T  # (#frames, #bins)
    mel = np.maximum(eps, np.dot(spc, melbasis.T))
    return np.log10(mel).T

def get_mel_tacotron2(wave):
    # NVIDIA Tacotron 2 style: natural-log mel via TacotronSTFT
    wave = torch.FloatTensor(wave)
    audio_norm = wave / max_wav_value
    audio_norm = audio_norm.unsqueeze(0)
    audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)

    _stft = TacotronSTFT(fft_size, hop_size, fft_size, num_mels, sampling_rate, fmin, fmax)

    melspec = _stft.mel_spectrogram(audio_norm)
    melspec = torch.squeeze(melspec, 0)
    return melspec.cpu().detach().numpy()

def get_mel_hifigan_origin(y):
    # original HiFi-GAN recipe: manual reflect pad of (n_fft - hop) / 2 per side,
    # center=False, natural log of the clamped mel amplitude
    y = y / max_wav_value
    y = torch.FloatTensor([y]).to(device)
    y = torch.nn.functional.pad(y.unsqueeze(1), (int((fft_size - hop_size) / 2), int((fft_size - hop_size) / 2)), mode='reflect').squeeze(1)
    spec = torch.stft(y, fft_size, hop_length=hop_size, win_length=win_length, window=window_torch, center=False, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
    spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
    mel_basis = torch.from_numpy(melbasis).float().to(device)
    spec = torch.matmul(mel_basis, spec)
    spec = torch.log(torch.clamp(spec, min=1e-5) * 1)
    return spec.cpu().detach().numpy()[0]

def get_mel_hifigan_center(y):
    # same as above, but let torch.stft do the padding (center=True)
    y = y / max_wav_value
    y = torch.FloatTensor([y]).to(device)
    # y = torch.nn.functional.pad(y.unsqueeze(1), (int((fft_size - hop_size) / 2), int((fft_size - hop_size) / 2)), mode='reflect').squeeze(1)
    spec = torch.stft(y, fft_size, hop_length=hop_size, win_length=win_length, window=window_torch, center=True, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
    spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
    mel_basis = torch.from_numpy(melbasis).float().to(device)
    spec = torch.matmul(mel_basis, spec)
    spec = torch.log(torch.clamp(spec, min=1e-5) * 1)
    return spec.cpu().detach().numpy()[0]

def get_mel_hifigan_change_pad(y):
    # https://github.com/jik876/hifi-gan/issues/63
    # manual reflect pad of n_fft / 2 per side instead of (n_fft - hop) / 2
    y = y / max_wav_value
    y = torch.FloatTensor([y]).to(device)
    y = torch.nn.functional.pad(y.unsqueeze(1), (int(fft_size / 2), int(fft_size / 2)), mode='reflect').squeeze(1)
    spec = torch.stft(y, fft_size, hop_length=hop_size, win_length=win_length, window=window_torch, center=False, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
    spec = torch.sqrt(spec.pow(2).sum(-1) + (1e-9))
    mel_basis = torch.from_numpy(melbasis).float().to(device)
    spec = torch.matmul(mel_basis, spec)
    spec = torch.log(torch.clamp(spec, min=1e-5) * 1)
    return spec.cpu().detach().numpy()[0]
mel0 = get_mel_librosa1(wave)
mel1 = get_mel_librosa2(wave)
mel2 = get_mel_parallelwavegan(wave)
mel3 = get_mel_tacotron2(wave)
mel4 = get_mel_hifigan_origin(wave)
mel5 = get_mel_hifigan_center(wave)
mel6 = get_mel_hifigan_change_pad(wave)

Printed shapes (mel0 through mel6):

(80, 487)
(80, 487)
(80, 487)
(80, 487)
(80, 486)
(80, 487)
(80, 487)

Only the original HiFi-GAN method, get_mel_hifigan_origin, gives a different shape.
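
If it helps, my understanding of the framing math (an assumption based on how torch.stft counts frames, not something stated in this repo) is that the original HiFi-GAN recipe pads only (fft_size - hop_size) / 2 samples per side with center=False, which yields exactly one fewer frame than the center=True variants, while padding fft_size / 2 per side restores the same count:

# rough frame-count check; torch.stft with center=False yields 1 + (padded_len - n_fft) // hop frames
def n_frames(wav_len, n_fft=1024, hop=256, pad_each_side=0, center=False):
    padded = wav_len + 2 * pad_each_side + (n_fft if center else 0)
    return 1 + (padded - n_fft) // hop

wav_len = 124544  # hypothetical waveform length, not from the issue
print(n_frames(wav_len, center=True))                      # 487, matches the center=True variants
print(n_frames(wav_len, pad_each_side=(1024 - 256) // 2))  # 486, matches get_mel_hifigan_origin
print(n_frames(wav_len, pad_each_side=1024 // 2))          # 487, matches get_mel_hifigan_change_pad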

Do you have any comments on this? When I compare element values, there is no exact match between these methods.
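
For what it's worth, part of the value mismatch is expected simply from the different log conventions. A rough sanity check I would try (my own sketch using the assumed variables above, not something from the repo) is to rescale the log10-based ParallelWaveGAN mel to the natural-log convention before comparing:

# log_e(x) = log10(x) * ln(10), so rescale before comparing against the natural-log HiFi-GAN mel;
# padding, eps/clamping and window details still prevent an exact element-wise match
mel2_ln = mel2 * np.log(10)    # parallelwavegan mel rescaled to natural log
diff = np.abs(mel2_ln - mel5)  # mel5: hifigan center=True variant, same (80, 487) shape
print(diff.mean(), diff.max())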

One more question: is there any benchmark for these vocoders?

Vocoder Quality

Hello!

You seem to have done quite a few vocoder comparisons. I have two questions based on your personal experience.

  • Which vocoder do you feel has the best overall quality (ignoring inference speed) when fine-tuned on mels (such as those from tacotron2)?

  • Does adding a speaker embedding improve overall synthesis quality when using a multi-speaker model?

Thank you for your time!
