
speakerboxxx

Clean data, extract features, and train the backend of a deep LSTM-based speech synthesizer

Overview

  • Pre-process data by extracting linguistic input features, duration target features (phone durations), and acoustic target features
  • Duration model
  • Acoustic model
  • Test
    • Loss: compute the loss for each model on the test set
    • Full pipeline: pass the linguistic features of the test set into the duration model, then use its output plus the linguistic features as input to the acoustic model; the spectral features generated by the acoustic model are then fed into the WORLD vocoder to synthesize audio
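The full-pipeline test can be sketched in a few lines. This is a hypothetical stand-in, not the repo's actual API: the real models are Torch networks and the vocoder call goes through the Julia interface to WORLD.

```python
# Sketch of the full-pipeline test: duration model -> upsample to frames
# -> acoustic model -> vocoder. All callables here are stand-ins.
def run_pipeline(linguistic_feats, duration_model, acoustic_model, vocoder):
    # 1. Predict a per-phone frame count from the linguistic features
    durations = duration_model(linguistic_feats)
    # 2. Upsample linguistic features to frame level using the predicted
    #    durations, then predict frame-level acoustic features
    frame_feats = []
    for phone_feats, n_frames in zip(linguistic_feats, durations):
        frame_feats.extend([phone_feats] * n_frames)
    acoustic_feats = acoustic_model(frame_feats)
    # 3. Hand the acoustic features to the vocoder to synthesize audio
    return vocoder(acoustic_feats)
```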

Notable dependencies

  • Python: data munging
  • Lua and Torch: data munging, training neural nets
  • Julia: interface to C++ WORLD vocoder, extracting acoustic features
  • HTK, Prosodylab-Aligner, textgrid: forced alignment on prompt and wav to get phone and word timings

Datasets

  • CMUArctic
    • 4 single-speaker sets of 1 hour each (2 male, 2 female)
    • phonetic timing labels provided; word timing labels not provided
  • Blizzard2013
    • 19 hours single speaker, female
    • phonetic and word timing labels extracted using Prosodylab-Aligner

Using each dataset

  • To generate and save input features / targets:
    • Blizzard2013:
      python BlizzardFeatureExtractor.py 
      
    • CMUArctic:
      python ArcticFeatureExtractor.py
      julia ArcticAcousticFeatureExtractor.jl
      
  • To create train, valid, test splits:
    • Blizzard2013:
      th DataSplitter.lua -dataset blizzard2013
      
    • CMUArctic:
      th DataSplitter.lua -dataset cmuarctic
      
  • When training / testing with main.lua, pass "cmuarctic" or "blizzard2013" to the 'dataset' flag

Force-aligning to get phone and word times

This is used for Blizzard2013 and future datasets where only wav files and prompts are provided.

Installing HTK, Prosodylab-Aligner, textgrid

NOTE on lexicon when force-aligning

  • ProsodyLab-Aligner uses CMUDict to do the grapheme-to-phoneme (g2p) conversion. As such, when there are OOV words in the corpus being aligned, it produces an OOV.txt file and fails.
  • To train on Blizzard2013, which does contain OOV words, there are two options:
      1. Avoid all prompts that have OOV words
      2. ProsodyLab-Aligner allows one to provide a dictionary of g2p mappings for the OOV words (check its README for instructions)
    • I have chosen option 1, since a) there are still 7438 prompts, down from the original 9734, and b) the LSM speech corpus is being created using FestVox, which also uses CMUDict
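Option 1 amounts to a simple filter over the prompts. A minimal sketch, assuming the lexicon is available as a plain dict of word-to-phones mappings (the real aligner loads CMUDict itself, and real prompts would need punctuation stripping first):

```python
# Hypothetical sketch of option 1: drop any prompt containing a word
# the lexicon cannot map to phones. `lexicon` is {word: [phone, ...]}.
def filter_oov_prompts(prompts, lexicon):
    kept = []
    for prompt in prompts:
        words = prompt.lower().split()
        # Keep the prompt only if every word has a pronunciation
        if all(w in lexicon for w in words):
            kept.append(prompt)
    return kept
```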

NOTE on silences

  • Force-alignments include silent phones and silent words
  • These occur both in the beginning/end and in the middle
  • Currently, these are handled by
    • ignoring the beginning/end silences (e.g. when extracting target durations, these are skipped)
    • splitting the middle silences (e.g. the time for that silence is split between the previous and the next phone/word)
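The two silence rules above can be sketched as one pass over the aligned segments. This is a hypothetical helper, assuming segments arrive as (label, start_ms, end_ms) tuples with 'sil' marking silences:

```python
# Sketch of the silence handling: drop edge silences, split interior
# silences between the neighbouring phones/words.
def handle_silences(segments, sil_label='sil'):
    # Ignore leading/trailing silences
    while segments and segments[0][0] == sil_label:
        segments = segments[1:]
    while segments and segments[-1][0] == sil_label:
        segments = segments[:-1]
    out = []
    i = 0
    while i < len(segments):
        label, start, end = segments[i]
        if label == sil_label and out and i + 1 < len(segments):
            # Split the silence at its midpoint between the neighbours
            mid = (start + end) / 2.0
            plabel, pstart, _ = out[-1]
            out[-1] = (plabel, pstart, mid)       # extend previous segment
            nlabel, _, nend = segments[i + 1]
            out.append((nlabel, mid, nend))        # pull next segment back
            i += 2
        else:
            out.append((label, start, end))
            i += 1
    return out
```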

Features

  • Linguistic Input
    • Dimension: 39 (phone), etc.
  • Duration Target
    • Dimension: (# of phonemes in seq) x 1
  • Acoustic Target
    • Dimension: (# of frames) x (1 + 1 + sp + ap)
      • 1 for voiced/unvoiced, 1 for f0, sp = TODO, ap = TODO
    • # of frames = (length of spoken phonemes in ms) / 5ms
      • Approximately: in reality, each phoneme's length is divided by 5 and rounded
    • where (length of spoken phonemes) excludes silent phonemes
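The frame-count computation above can be sketched directly (the 5 ms frame shift is the value stated above; the per-phone rounding is why the total is only approximately duration/5):

```python
# Sketch of the frame-count computation: each non-silent phone's
# duration (ms) is divided by the 5 ms frame shift and rounded,
# then the per-phone counts are summed.
FRAME_SHIFT_MS = 5.0

def num_frames(phone_durations_ms):
    return sum(int(round(d / FRAME_SHIFT_MS)) for d in phone_durations_ms)
```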

Running

These commands are also saved as command-line examples in main.lua

Training (best parameters)

Duration:

th main.lua  -gpuid 0 -model duration -notes linear254to256_linear256to128_lstm128to128_linear128to1 -save_model_every_epoch 10 -maxepochs 100 -lr 0.001 -method adam

Acoustic:

th main.lua  -gpuid 1 -maxepochs 300 -save_model_every_epoch 10 -lr 0.0005 -method adam -model acoustic -notes linear254to512_linear512to512_lstm512to256_lstm256to256_linear256to84__QUINPHONE_f0INTERPOLATE

Testing

Shannon, with GPU:

th main.lua -gpuid 0 -mode test -load_duration_model_path models/duration/2016_8_3___15_5_38/net_e100.t7 -load_acoustic_model_path models/acoustic/2016_8_3___17_24_13/net_e270.t7

Local, no GPU:

th main.lua -mode test -load_duration_model_path models/duration/2016_7_20___5_16_19/net_e9.t7 -load_acoustic_model_path models/acoustic/2016_7_20___14_30_32/net_e1.t7

Current best model, features, and run parameters, etc.

Features:

  • Quinphone identities
  • Interpolating F0
  • Silencing / not silencing

Run parameters:

  • Adam

TODO

To try improving performance

  • Feature normalization
    • Acoustic
    • Linguistic in [0.01, 0.99]
  • More linguistic features
    • morpheme-level
    • lexical stress
    • distance from stressed/accented syllable
    • position of syllable in utterance (as opposed to just position of syllable in word)
    • POS of current/preceding/following word
  • Mel-cepstral distortion (MCD) loss instead of MSE
  • Parameter generation
  • Implement Adadec (mentioned in a few papers)
  • Remove silence frames (mentioned in a paper or two)
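The proposed linguistic normalization into [0.01, 0.99] is plain min-max scaling; a sketch for one feature dimension (the range keeps inputs slightly away from 0 and 1, a common trick to avoid activation saturation):

```python
# Sketch of min-max scaling one feature dimension into [0.01, 0.99].
def minmax_scale(values, lo=0.01, hi=0.99):
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Constant feature: map everything to the range midpoint
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]
```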

Experiments

  • Multiple speakers
    • Add binary feature to loss that predicts which speaker
    • Can also be used to test deep density mixture
    • Pass speaker id as one hot encoded vector
  • phone2vec
  • How does 1 hour vs 5 hour vs 10 hour affect performance?
  • Softmax classification for Log F0 values instead of regression (inspired by PixelRNN)
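The speaker-id idea above is just concatenating a one-hot vector onto each input row; a sketch with a hypothetical helper:

```python
# Sketch: append a one-hot speaker id to each input feature vector.
def append_speaker_onehot(features, speaker_idx, num_speakers):
    onehot = [0.0] * num_speakers
    onehot[speaker_idx] = 1.0
    return [list(f) + onehot for f in features]
```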

For production on Android

  • Port models to TensorFlow for easy Android integration
  • Write sp2mc for WORLD in C++
  • Build phone-syllable-word contexts from raw text, not just on datasets with forced alignment
    • Requires use of g2p model
  • Shrink, quantize models

SE & other

  • config for dataset and which paths to use
  • better way to keep track of how different features affect performance, e.g. storing features/outputs trained with different features in a folder whose name includes those new features (now that quinphones in the linguistic inputs are definitively better, it shouldn't really be called linguistic_inputs_plus anymore)
  • some scripts for copying files to and from local and Shannon
    • For example, testing full pipeline is done on Shannon, but generation of wav files is done locally because installing Julia requires some upgrades I don't want to make (plus we don't need to take up space on Shannon)

Other notes

Phoneset: ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY', 'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW', 'V', 'W', 'Y', 'Z', 'ZH']

POS set: ['VERB', 'NOUN', 'PRON', 'ADJ', 'ADV', 'ADP', 'CONJ', 'DET', 'NUM', 'PRT', 'X', '.']

Contributors

  • sosuperic
