NoVoc Djokovic - raw waveform speech synthesis data preparation

What?

These are the baseline scripts developed for my Master's dissertation on the MSc Speech & Language Processing course at the University of Edinburgh (2015-2016).

My project researched and developed methods of modelling raw waveforms in a deep neural network speech synthesis system, instead of predicting vocoder parameters as in other systems...hence the (terrible) name, NoVoc Djokovic.

In this repository are the scripts for creating the training data to pass to a neural speech synthesis system, and the scripts for concatenating the resultant waveforms. As they stand, the pipeline can be run end-to-end as a proof of concept. This will split the original audio file, and reconstruct it as it would after synthesis time.

For this project, I used CSTR's Neural Network based Speech Synthesis System (https://github.com/CSTR-Edinburgh/merlin/).

How to run

Data Preparation

First, you will need the following command line tools:

REAPER: Robust Epoch And Pitch EstimatoR - https://github.com/google/REAPER
SoX - Sound eXchange - http://sox.sourceforge.net/

The scripts should be run from the top directory, NoVoc_Djokovic

cd NoVoc_Djokovic

./scripts/create_data/process_training_data.sh

This runs the following steps:

Downsamples your audio files to 16kHz, using SoX
Extracts pitchmarks from each downsampled audio file, using REAPER
Uses the pitchmark information to extract waveform segments from each audio file. Each segment is a window centred either on each pitchmark (for voiced segments of speech), or using a constant frame size (for unvoiced segments of speech).
Resamples each segment so they are frames of equal length to pass to the neural network. A stretch factor is also appended as a feature to be able to resample to the correct size at synthesis time.
Saves a (binary) .mgc file for each audio file, containing all extracted frames. .mgc is used as this is an extension Merlin expects for training data.

Synthesis time

To produce the resultant audio file from the generated waveform frames at synthesis time, the following script performs an overlap and add to concatenate the frames together into speech.

./scripts/synthesis_time/baseline_OLA.sh

The above command will currently take the .mgc produced by the first data preparation step to re-concatenate into an audio file of speech. In practice, this script will take the generated .mgc file, but for now it points to the original training data so you can try it out.

andrew-watson / novoc_djokovic Goto Github PK

novoc_djokovic's Introduction

NoVoc Djokovic - raw waveform speech synthesis data preparation

What?

How to run

Data Preparation

Synthesis time

novoc_djokovic's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent