WaveODE

An ODE-based generative neural vocoder using Rectified Flow

Introduction

ODE-based generative models have recently become a hot topic in machine learning and image generation, where they achieve remarkable performance. However, because waveforms and images follow very different data distributions, it is not yet clear how well these models perform on speech tasks. In this project, I implement an ODE-based generative neural vocoder called WaveODE, using Rectified Flow [4] as the backbone, and hope to contribute to the generalization of ODE-based generative models to speech tasks.

Pre-requisites

The testdata folder contains some example files that allow the project to run directly.
If you want to run with your own dataset:
1. Replace the feature_dirs and fileid_list in config.json with your own dataset.
2. Modify the acoustic parameters to match the data you are using and adjust the batch size to the number you need.

Training and inference

Generate MELs

python3 -u generate_mels.py --output testdata/train/ --wav_folder testdata/train/wavs/ --mel_folder testdata/train/mels/
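The internals of generate_mels.py are not shown here. As background only, the mel scale that such features are built on can be sketched as follows. This assumes the common HTK-style formula; the function names and the example values (80 bands, 0 to 8000 Hz) are hypothetical and are not taken from the project's config.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, fmin, fmax):
    """Center frequencies (Hz) of n_mels bands equally spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Hypothetical settings, for illustration only.
centers = mel_center_frequencies(n_mels=80, fmin=0.0, fmax=8000.0)
```

The band centers are dense at low frequencies and sparse at high ones, which is why the number of mel bands and the frequency range need to match the data being used.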

Train WaveODE with 1-Rectified Flow from scratch

python3 -u train.py -c config.yaml -l logdir -m waveode_1-rectified_flow
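For orientation, the 1-Rectified Flow objective from [4] can be sketched in a few lines. This is a toy illustration in plain Python lists, not the code in train.py: `interpolate` and `rectified_flow_loss` are hypothetical names, and the mel conditioning that a vocoder would pass to the network is omitted.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def rectified_flow_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted velocity at x_t and x1 - x0."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)


# Toy usage: a stand-in "waveform" and independently drawn Gaussian noise.
rng = random.Random(0)
x1 = [0.5, -0.25, 0.1, 0.0]
x0 = [rng.gauss(0.0, 1.0) for _ in x1]
t = rng.random()
zero_model = lambda xt, t: [0.0] * len(xt)  # placeholder for a neural network
loss = rectified_flow_loss(zero_model, x0, x1, t)
```

In training, this loss would be averaged over many (noise, data, t) draws and minimized with respect to the network parameters.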

Inference

RK45 solver:

python3 inference.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/M_0.pth --input test_mels_dir  --output synthesized_eval_rk45 --sampling_method rk45

python3 inference_mel.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/M_12.pth --input test_mels_dir  --output synthesized_eval_rk45_mels --sampling_method rk45

Euler solver:

python3 inference.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/M_0.pth --input test_mels_dir  --output synthesized_eval_euler --sampling_method euler --sampling_steps 20

python3 inference_mel.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/M_12.pth --input test_mels_dir  --output synthesized_eval_euler_mels --sampling_method euler --sampling_steps 20
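As a rough picture of what the Euler option does, the sketch below integrates dx/dt = v(x, t) from t = 0 to t = 1 with a fixed number of equal steps. It is a minimal illustration, not the code in inference.py: `euler_sample` is a hypothetical name, and it is an assumption that --sampling_steps corresponds to `num_steps`. RK45, by contrast, is an adaptive-step Runge-Kutta method that chooses its own step sizes.

```python
def euler_sample(velocity_fn, x0, num_steps=20):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy usage: a constant velocity field describes a perfectly straight flow.
straight_flow = lambda x, t: [1.0, -2.0]
sample_20 = euler_sample(straight_flow, [0.0, 0.0], num_steps=20)
sample_1 = euler_sample(straight_flow, [0.0, 0.0], num_steps=1)
```

For a perfectly straight flow a single Euler step already lands on the endpoint, which is the motivation for straightening trajectories with Rectified Flow.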

Train WaveODE with 2-Rectified Flow

Generate (noise, audio) tuples using 1-Rectified Flow:

python3 inference.py --hparams config.yaml --checkpoint logdir/waveode_1-rectified_flow/M_105.pth --input testdata/train/mels  --output testdata/generate

Train 2-Rectified Flow using the generated data:

python3 -u train_reflow.py -c config_reflow.yaml -l logdir -m waveode_2-rectified_flow
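The two steps above can be sketched as follows. This is a toy illustration, not the code in inference.py or train_reflow.py: `euler_sample` and `make_reflow_pairs` are hypothetical names, a fixed-step Euler solver stands in for whichever solver is actually used, and mel conditioning is omitted.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=20):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_reflow_pairs(velocity_fn, num_pairs, dim, num_steps=20, seed=0):
    """Draw noise z0, run the trained 1-Rectified Flow ODE, keep (z0, z1) together."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        z0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z1 = euler_sample(velocity_fn, z0, num_steps)
        pairs.append((z0, z1))
    return pairs


# Toy stand-in for a trained 1-Rectified Flow model.
toy_model = lambda x, t: [1.0] * len(x)
pairs = make_reflow_pairs(toy_model, num_pairs=3, dim=4)
```

2-Rectified Flow then fits the same straight-line velocity objective, but on these coupled (noise, audio) pairs rather than on independently drawn noise and data, which is what straightens the trajectories.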

Todo

Upload demos of WaveODE on open-source speech corpora such as LJSpeech and VCTK.

Q&A

What are ODE-based generative models?

ODE-based generative models (also known as continuous normalizing flows) are a family of generative models in which the trajectory from an initial distribution, such as a Gaussian, to the target data distribution follows an ordinary differential equation (ODE).
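In symbols, one common way to write this is shown below. The notation is an assumption chosen for illustration (v_θ for the learned velocity field, x_0 for noise, x_1 for data) and is not taken from this repository; the last line is the Rectified Flow objective from [4].

```latex
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v_\theta(x_t, t), \qquad
x_0 \sim \mathcal{N}(0, I), \qquad t \in [0, 1]

x_1 = x_0 + \int_0^1 v_\theta(x_t, t)\,\mathrm{d}t

\min_\theta \; \mathbb{E}_{x_0,\, x_1,\, t}
\left\| (x_1 - x_0) - v_\theta(x_t, t) \right\|^2,
\qquad x_t = t\,x_1 + (1 - t)\,x_0
```

Sampling amounts to drawing x_0 and solving the ODE numerically up to t = 1.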

There are some relevant papers:

[1] Neural ordinary differential equations (Chen et al. 2018) Paper

[2] FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models (Grathwohl et al. 2018) Paper

[3] Score-Based Generative Modeling through Stochastic Differential Equations (Song et al. 2021) Paper

[3] Flow Matching for Generative Modeling (Lipman et al. 2023) Paper

[4] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (Liu et al. 2023) Paper

[5] Stochastic Interpolants: A Unifying Framework for Flows and Diffusions (Albergo et al. 2023) Paper

[6] Action Matching: Learning Stochastic Dynamics From Samples (Neklyudov et al. 2022) Paper

[7] Riemannian Flow Matching on General Geometries (Chen et al. 2023) Paper

[8] Conditional Flow Matching: Simulation-Free Dynamic Optimal Transport (Tong et al. 2023) Paper

[9] Minimizing Trajectory Curvature of ODE-based Generative Models (Lee et al. 2023) Paper

Why choose an ODE-based model instead of SDE-based diffusion models or denoising diffusion models?

ODE-based models are simpler in both theory and implementation, which is why they have become very popular recently.

Why do artifacts and glitches exist in the generated samples?

Rectified Flow was proposed for image generation, so it may need to be modified or improved for speech tasks. In addition, glitches in image generation (e.g., unnatural hands) are less likely to affect overall image quality, whereas glitches in speech are easy to perceive.

How to improve Rectified Flow?

[5] argues that the loss function of Rectified Flow is biased, and [9] argues that Rectified Flow estimates an upper bound on the degree of intersection of the independent coupling but does not actually minimize it. Improvements to the loss function might therefore improve sample quality.

Reference

https://github.com/gnobitab/RectifiedFlow

