Coder Social home page Coder Social logo

jxzhanggg / nonparaseq2seqvc_code Goto Github PK

View Code? Open in Web Editor NEW
247.0 15.0 57.0 220 KB

Implementation code of non-parallel sequence-to-sequence VC

License: MIT License

Python 99.15% Shell 0.85%
voice-conversion deep-learning text-to-speech pytorch-implementation

nonparaseq2seqvc_code's Introduction

Non-parallel Seq2seq Voice Conversion

Implementation code of Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations.

For audio samples, please visit our demo page.

The structure overview of the model

Dependencies

  • Python 3.6
  • PyTorch 1.0.1
  • CUDA 10.0

Data

It is recommended you download the VCTK and CMU-ARCTIC datasets.

Usage

Installation

Install Python dependencies.

$ pip install -r requirements.txt

Feature Extraction

Extract Mel-Spectrograms, Spectrograms and Phonemes

You can use extract_features.py

Customize data reader

Write a snippet of code to walk through the dataset for generating list file for train, valid and test set.

Then you will need to modify the data reader to read your training data. The following are scripts you will need to modify.

For pre-training:

For fine-tuning:

Pre-train the model

Add correct paths to your local data, and run the bash script:

$ cd pre-train
$ bash run.sh

Run the inference code to generate audio samples on multi-speaker dataset. During inference, our model can be run on either TTS (using text inputs) or VC (using Mel-spectrogram inputs) mode.

$ python inference.py

Fine-tune the model

Fine-tune the model and generate audio samples on conversion pair. During inference, our model can be run on either TTS (using text inputs) or VC (using Mel-spectrogram inputs) mode.

$ cd fine-tune
$ bash run.sh

Training Time

On a single NVIDIA 1080 Ti GPU, with a batch size of 32, pre-training on VCTK takes approximately 64 hours of wall-clock time. Fine-tuning on two speakers (500 utterances each speaker) with a batch size of 8 takes approximately 6 hours of wall-clock time.

Citation

If you use this code, please cite:

@article{zhangnonpara2020, 
author={Jing-Xuan {Zhang} and Zhen-Hua {Ling} and Li-Rong {Dai}}, 
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
title={Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations}, 
year={2020}, 
volume={28}, 
number={1}, 
pages={540-552}}

Acknowledgements

Part of code was adapted from the following project:

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.