WaveRNN + VQ-VAE

This is a Pytorch implementation of WaveRNN. Currently 3 top-level networks are provided:

  • A VQ-VAE implementation with a WaveRNN decoder. Trained on a multispeaker dataset of speech, it can demonstrate speech reconstruction and speaker conversion.
  • A vocoder implementation. Trained on a single-speaker dataset, it can turn a mel spectrogram into raw waveform.
  • An unconditioned WaveRNN. Trained on a single-speaker dataset, it can generate random speech.

Audio samples.

It has been tested with the following datasets.

Multispeaker datasets:

  • VCTK

Single-speaker datasets:

  • LJ Speech

Preparation

Requirements

  • Python 3.6 or newer
  • PyTorch with CUDA enabled
  • librosa
  • apex if you want to use FP16 (it probably doesn't work that well).

Create config.py

cp config.py.example config.py

Preparing VCTK

You can skip this section if you don't need a multi-speaker dataset.

  1. Download and uncompress the VCTK dataset.
  2. python preprocess_multispeaker.py /path/to/dataset/VCTK-Corpus/wav48 /path/to/output/directory
  3. In config.py, set multi_speaker_data_path to point to the output directory.

Preparing LJ-Speech

You can skip this section if you don't need a single-speaker dataset.

  1. Download and uncompress the LJ speech dataset.
  2. python preprocess16.py /path/to/dataset/LJSpeech-1.1/wavs /path/to/output/directory
  3. In config.py, set single_speaker_data_path to point to the output directory.
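After preprocessing, the two paths in config.py would look something like this (the directories below are placeholders; use your own output directories):

multi_speaker_data_path = '/path/to/output/vctk'
single_speaker_data_path = '/path/to/output/ljspeech'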

Usage

wavernn.py is the entry point:

$ python wavernn.py

By default, it trains the VQ-VAE model. Use the -m option to tell the script to train a different model.

Trained models are saved under the model_checkpoints directory.

By default, the script picks up the latest snapshot and continues training from there. To train a new model from scratch, use the --scratch option.

Every 50k steps, the model is run to generate test audio outputs. The output goes under the model_outputs directory.

When the -g option is given, the script produces the output using the saved model, rather than training it.
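
For example (the model name passed to -m below is illustrative; check wavernn.py for the accepted names):

$ python wavernn.py -m vocoder --scratch
$ python wavernn.py -m vocoder -g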

Deviations from the papers

I deviated from the papers in some details, sometimes because I was lazy, and sometimes because I was unable to get good results without the change. Below is a (probably incomplete) list of deviations.

All models:

  • The sampling rate is 22.05kHz.

VQ-VAE:

  • I normalize each latent embedding vector so that it lies on the unit 128-dimensional sphere. Without this change, I was unable to get good utilization of the embedding vectors.
  • In the early stages of training, I scale the penalty term applied to the input of the VQ layer by a small factor. Without this, the input very often collapses into a degenerate distribution that always selects the same embedding vector. (The first sketch after this list illustrates this and the previous point.)
  • During training, the target audio signal (which is also the input signal) is translated along the time axis by a random amount, uniformly chosen from [-128, 127] samples. Less importantly, some additive and multiplicative Gaussian noise is also applied to each audio sample. Without these types of noise, the features captured by the model tended to be very sensitive to small perturbations of the input, and the subjective quality of the model output kept decreasing after a certain point in training. (The second sketch after this list illustrates this augmentation.)
  • The decoder is based on WaveRNN instead of WaveNet. See the next section for details about this network.
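
To make the first two points concrete, here is a minimal, hypothetical sketch of a VQ layer with unit-sphere codebook normalization and a ramped-up commitment penalty. This is not the code from this repository; the class name, codebook size, warmup schedule, and loss weights are all illustrative.

import torch
import torch.nn.functional as F
from torch import nn

class UnitSphereVQ(nn.Module):
    # Hypothetical sketch, not this repo's VQ layer: codebook vectors are kept
    # on the unit 128-dimensional sphere, and the commitment penalty (the term
    # applied to the input of the VQ layer) is ramped up from a small value
    # during early training.
    def __init__(self, num_codes=512, dim=128, warmup_steps=10000, beta=0.25):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.warmup_steps = warmup_steps
        self.beta = beta

    def forward(self, z, step):
        # z: (batch, time, dim) encoder output; step: global training step
        codes = F.normalize(self.codebook, dim=1)        # unit-sphere normalization
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, codes).argmin(dim=-1)    # nearest code per frame
        q = codes[idx].view_as(z)
        scale = min(1.0, step / self.warmup_steps)       # small penalty early on
        commit_loss = self.beta * scale * F.mse_loss(z, q.detach())
        codebook_loss = F.mse_loss(q, z.detach())
        q = z + (q - z).detach()                         # straight-through estimator
        return q, commit_loss + codebook_loss, idx.view(z.shape[:-1])

During training, the quantized output q would be fed to the decoder, and the returned loss term would be added to the reconstruction loss.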
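And a similarly hypothetical sketch of the input/target augmentation from the third point (the noise level is a guess, and torch.roll is used only for brevity; cropping a shifted window from a longer clip would avoid the wrap-around):

import torch

def augment_audio(wav, max_shift=128, noise_std=1e-3):
    # Hypothetical sketch: random time shift uniformly drawn from [-128, 127]
    # samples, plus small additive and multiplicative Gaussian noise per sample.
    shift = int(torch.randint(-max_shift, max_shift, (1,)))
    wav = torch.roll(wav, shifts=shift, dims=-1)             # wrap-around shift for simplicity
    wav = wav * (1.0 + noise_std * torch.randn_like(wav))    # multiplicative noise
    wav = wav + noise_std * torch.randn_like(wav)            # additive noise
    return wav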

Context stacks

The VQ-VAE implementation uses a WaveRNN-based decoder instead of the WaveNet-based decoder found in the paper. The decoder is a WaveRNN network augmented with a context stack that extends its receptive field. This network is defined in layers/overtone.py.

The network has 6 convolutions with stride 2 that produce a 64x-downsampled 'summary' of the waveform, followed by 4 layers of upsampling RNNs, the last of which is the WaveRNN layer. It also has U-Net-like skip connections between layers that operate at the same temporal resolution.
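
As an illustration only (this is not layers/overtone.py; the layer sizes, upsampling factors, and skip wiring are guesses), the overall shape of such a network could be sketched like this:

import torch
from torch import nn

class ContextStackSketch(nn.Module):
    # Hypothetical sketch: six stride-2 convolutions build a 64x-downsampled
    # summary of the waveform; four upsampling GRU stages bring it back to the
    # sample rate; U-Net-like skips join layers at the same time resolution.
    # In the real model the last upsampling stage is the WaveRNN itself.
    def __init__(self, channels=128):
        super().__init__()
        self.downs = nn.ModuleList(
            [nn.Conv1d(1 if i == 0 else channels, channels, kernel_size=4, stride=2, padding=1)
             for i in range(6)]
        )
        self.up_factors = [4, 4, 2, 2]  # 4 * 4 * 2 * 2 = 64 (guessed split)
        self.rnns = nn.ModuleList([nn.GRU(channels, channels, batch_first=True)
                                   for _ in self.up_factors])

    def forward(self, x):
        # x: (batch, 1, T) waveform with T a multiple of 64
        skips = {}
        h = x
        for conv in self.downs:
            h = torch.relu(conv(h))
            skips[h.size(-1)] = h.transpose(1, 2)    # remember features by time resolution
        h = h.transpose(1, 2)                         # (batch, T/64, channels)
        for factor, rnn in zip(self.up_factors, self.rnns):
            h = h.repeat_interleave(factor, dim=1)    # nearest-neighbour upsampling in time
            if h.size(1) in skips:                    # U-Net-like skip at the matching rate
                h = h + skips[h.size(1)]
            h, _ = rnn(h)
        return h                                      # per-sample features for the decoder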

Acknowledgement

The code is based on fatchord/WaveRNN.
