w4-jonghoon / vqvae-speech

This project is forked from jeremycchsu/vqvae-speech.


Tensorflow implementation of the speech model described in Neural Discrete Representation Learning (a.k.a. VQ-VAE)

License: MIT License


vqvae-speech's Introduction

This is an implementation of the VQ-VAE model for voice conversion described in Neural Discrete Representation Learning. So far, the results are not yet as impressive as DeepMind's (you can find their results here). My estimate is that voice quality is 2 - 3 and intelligibility is 3 - 4 on a 5-point Mean Opinion Score scale. Contributions are welcome.
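
The core of the model is a vector-quantization bottleneck: each encoder output frame is replaced by its nearest exemplar (codebook vector). A minimal NumPy sketch of that lookup (shapes are illustrative; this is not the repo's actual code):

import numpy as np

def quantize(z_e, codebook):
    # z_e: [T, D] encoder output frames; codebook: [K, D] exemplars.
    # Pairwise squared distances between every frame and every exemplar: [T, K]
    d = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = np.argmin(d, axis=1)   # index of the nearest exemplar per frame
    return codebook[idx], idx    # quantized frames z_q: [T, D], indices: [T]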

Current Results

Audio Samples

Results after training for 500k steps (about 2 days):

Source 1: p227_363 (We're encouraged by the news)
Target 1: converted into p231

Source 2: p240_341 (Who was the mystery MP?)
Target 2: converted into p227

Source 3: p243_359 (Under Alex Ferguson, Aberdeen showed it could be done.)
Target 3: converted into p231

Source 4: p231_430 (It was a breathtaking moment.)
Target 4: converted into p227

Note:

  1. Format: [speaker]_[sentence]
  2. The authors did not specify the target speaker on their demo website.


Speaker Space

speaker-space
PCA-2D of the speaker space learned by VQ-VAE (TensorBoard screenshot). Note that the genders separate naturally, as pointed out in Fig. 4 of Deep Voice 2. Interestingly, the gender of p280 is not specified in the speaker-info.txt file released with VCTK, but from the figure we can make a confident guess that p280 is female.
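
The projection above comes from TensorBoard, but it can also be reproduced offline. A rough sketch, assuming the learned speaker embedding matrix has been exported to a NumPy file (the file name here is hypothetical):

import numpy as np

emb = np.load('speaker_embedding.npy')        # [num_speakers, emb_dim], hypothetical export
emb = emb - emb.mean(axis=0, keepdims=True)   # center before PCA
_, _, vt = np.linalg.svd(emb, full_matrices=False)
xy = emb @ vt[:2].T                           # coordinates on the first two principal components
# Scatter xy[:, 0] vs. xy[:, 1] and color by gender to reproduce the figure.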

Output Frequency of Exemplars (VQ Centroids)

exemplars
All exemplars are used with frequencies of roughly the same order of magnitude (the x-axis is the exemplar index).
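
The histogram can be recomputed from the quantization indices. A small sketch, assuming idx collects the chosen exemplar index for every frame in the corpus:

import numpy as np

def exemplar_usage(idx, num_exemplars):
    # idx: 1-D integer array of exemplar indices; returns a count per exemplar.
    return np.bincount(idx, minlength=num_exemplars)

# e.g. counts = exemplar_usage(all_indices, num_exemplars=codebook.shape[0])
# Plotting counts on a log scale makes the "same order of magnitude" claim easy to check.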


Dependency

  • Ubuntu 16.04
    • ffmpeg
    • Python 3.6
      • Tensorflow 1.5.0


Usage

Prepare the VCTK dataset inside the project dir (download it as below, or create a soft link to an existing copy):

git clone https://github.com/JeremyCCHsu/vqvae-speech.git
cd vqvae-speech
mkdir dataset
cd dataset
wget http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz
tar -zxvf VCTK-Corpus.tar.gz
mv VCTK-Corpus VCTK
cd ..

# Skip the next 2 lines if you already have a working environment
# conda create -n vqvae -y python=3.6
# source activate vqvae

pip install -r requirements

# Convert wav into mu-law encoded sequences
# The double quotation marks around the file pattern are necessary
# WARNING: without ffmpeg, this script hangs in an infinite loop
python wav2tfr.py   \
  --fs 16000 \
  --output_dir dataset/VCTK/tfr \
  --speaker_list etc/speakers.tsv \
  --file_pattern "dataset/VCTK/wav48/*/*.wav" 

# [Optional] Generate mu-law encoded wav
python tfr2wav.py \
  --output_dir dataset/VCTK/mulaw \
  --speaker_list etc/speakers.tsv \
  --file_pattern "dataset/VCTK/tfr/*/*.tfr"

# Training script
python main.py \
  --speaker_list etc/speakers.tsv \
  --arch architecture.json \
  --file_pattern "dataset/VCTK/tfr/*/*.tfr" \

# Generation script
# Please specify the logdir argument 
# Please specify e.g. `--period 45` for periodic generation
python generate.py \
  --logdir logdir/train/[dir]

Training usually takes days on a Titan Xp. Progress is significant during the first 24 hours; afterwards, the cross-entropy loss saturates at around 1.7.
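
The wav2tfr.py step above stores 8-bit mu-law codes rather than raw samples. A minimal sketch of the standard mu-law companding (the script's exact implementation may differ):

import numpy as np

def mu_law_encode(x, mu=255):
    # x: float waveform in [-1, 1]; returns integer codes in [0, mu].
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)   # quantize to 256 levels

def mu_law_decode(code, mu=255):
    y = 2.0 * code / mu - 1.0                                   # back to [-1, 1]
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu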


Dataset

The experiments were conducted on the CSTR VCTK corpus. Download it here.
Note:

  1. One of the speakers (p280) is missing in VCTK's speaker-info.txt file.
  2. One of the sound files (p376_295.raw) isn't in wav format. I simply ignored that file.
  3. One of the speakers (p315) has no accompanying transcriptions, though this doesn't matter in our task.


Misc.

  1. The code for generation is naively implemented (it is not fast WaveNet), so generation is very slow.
  2. Exact specifications, such as the encoder architecture, are not provided in their paper.
  3. Whether they use a one-hot representation for the wav input is unclear.
  4. Initialization of the exemplars is crucial, but how the authors initialized them is unclear. I chose exemplars from the encoder output because it is the least expensive and most reasonable option (see the sketch after this list). Improper initialization (a normal/uniform distribution with the wrong variance/range) can be detrimental, leading to unused exemplars and reduced speech intelligibility.
  5. The data loader does not explicitly pad the input because the initial second of each wav file is always silent.
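
A hedged sketch of the initialization mentioned in item 4: pick exemplars directly from encoder outputs so every code starts inside the encoder's output distribution (function name and shapes are illustrative, not the repo's code):

import numpy as np

def init_codebook_from_encoder(z_e, num_exemplars, seed=0):
    # z_e: [N, D] encoder outputs collected from a warm-up pass over the data.
    rng = np.random.default_rng(seed)
    pick = rng.choice(z_e.shape[0], size=num_exemplars, replace=False)
    return z_e[pick].copy()   # [num_exemplars, D] initial codebook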

Reference:

This repo is inspired by ibab's WaveNet repo.

