In this project, we implement several state-of-the-art Text-to-Speech (TTS) architectures for Gronings, a Low Saxon language spoken in the province of Groningen and in the adjacent border areas of Drenthe and Friesland in the Netherlands.
To build the TTS systems for Gronings, ESPnet2, an end-to-end speech processing toolkit, has been used. The setup configuration and installation steps, which follow the original ESPnet documentation, are described below.
For the experiments, the following setup has been used. This exact configuration is not required; however, compatibility between the different versions must be ensured (a quick check is shown after the list).
- Ubuntu 20.04 LTS
- Python 3.8.12
- CUDA version 11.1 (run nvcc -V to check it)
- CUDA Driver version 470.103.01 (run nvidia-smi to check it)
- CUDA version 11.4 as reported by nvidia-smi (the highest CUDA version supported by the driver)
- PyTorch 1.10.1+cu111
- cmake
- sox
- sndfile
- ffmpeg
- flac
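Once PyTorch has been installed (see the ESPnet installation steps below), the version compatibility can be verified with a one-liner; the first two values should match the versions listed above (1.10.1+cu111 and 11.1), and the last should be True if the driver is set up correctly.
$ python3 -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'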
The following command installs the system packages listed above (cmake, sox, sndfile, ffmpeg, and flac).
$ sudo apt-get install cmake sox libsndfile1-dev ffmpeg flac
- Git clone the ESPnet repo
$ cd <any-place>
$ git clone https://github.com/espnet/espnet
- Setup Anaconda Environment
You have to create <espnet-root>/tools/activate_python.sh to specify the Python interpreter used in ESPnet recipes. To do so:
$ cd <espnet-root>/tools
$ ./setup_anaconda.sh [output-dir-name|default=venv] [conda-env-name|default=root] [python-version|default=none]
# e.g.
$ ./setup_anaconda.sh anaconda espnet 3.8
- Install ESPnet
The Makefile installs ESPnet and all of its dependencies, including PyTorch. You can specify the PyTorch version (it must be compatible with your CUDA version), for example:
$ cd <espnet-root>/tools
$ make TH_VERSION=1.10.1+cu111
Note that the CUDA version is derived from the nvcc command. Alternatively, you can also specify the CUDA version explicitly:
$ cd <espnet-root>/tools
$ make TH_VERSION=1.10.1+cu111 CUDA_VERSION=11.1
Note that not all of the packages need to be installed for TTS development. You can check which packages were installed correctly with:
$ cd <espnet-root>/tools
$ . ./activate_python.sh; python3 check_install.py
The following architectures and neural vocoders have been implemented for Gronings:
- Architecture
  - Tacotron 2
  - FastSpeech 2
  - Conformer FastSpeech 2
- Neural Vocoder
  - Parallel WaveGAN
  - HiFi-GAN
FastSpeech 2 has been implemented in two ways:
- Using Tacotron 2 as the Teacher Forced Aligner
- Using the Montreal Forced Aligner (MFA) to get the alignments
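The architectures above are trained with ESPnet2-style recipes. As a rough, generic sketch of how an ESPnet2 TTS recipe is typically invoked (the Gronings recipe directory name below is a placeholder, not the actual path in this repository):
$ cd <espnet-root>/egs2/<gronings-recipe>/tts1
$ ./run.sh --ngpu 1
Running run.sh without stage options goes through all stages; individual stages (data preparation, feature extraction, statistics collection, training, and decoding) can typically be selected with the --stage and --stop_stage options.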
The full procedure for training the architectures and vocoders can be found in recipe and neural vocoder.
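For the variant that uses Tacotron 2 as the Teacher Forced Aligner, ESPnet2 typically proceeds in two steps: the training data are first decoded with the trained Tacotron 2 model in teacher-forcing mode to obtain durations, and FastSpeech 2 is then trained on those durations. A rough sketch follows; the option names are taken from the generic tts1 template and may differ between ESPnet versions, and the set names, config, and paths are placeholders.
# 1. obtain durations by decoding with teacher forcing using the trained Tacotron 2 model
$ ./run.sh --stage 7 --inference_args "--use_teacher_forcing true" --test_sets "<train-set> <dev-set> <eval-set>"
# 2. train FastSpeech 2 on the teacher-forced outputs
$ ./run.sh --stage 6 --train_config conf/tuning/train_fastspeech2.yaml --teacher_dumpdir <teacher-forced-decoding-output-dir>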
Results
You can listen to the generated samples here.
Dataset | Architecture | Vocoder | Mean Opinion Score (MOS) |
---|---|---|---|
Gronings | Ground Truth | - | - |
Gronings | Tacotron 2 | Parallel WaveGAN | - |
Gronings | FastSpeech 2 | Parallel WaveGAN | - |
Gronings | Conformer FastSpeech 2 | Parallel WaveGAN | - |
Gronings | Tacotron 2 | HiFi-GAN | - |
Gronings | FastSpeech 2 | HiFi-GAN | - |
Gronings | Conformer FastSpeech 2 | HiFi-GAN | - |
Online Demo
The real-time demo is available on HuggingFace!
FastSpeech 2 (using Tacotron 2 as the Teacher Forced Aligner) and a pre-trained Parallel WaveGAN vocoder are used in the demo. The vocoder is pre-trained on English data, since the current ESPnet+HuggingFace integration does not allow using a vocoder trained on custom data.
Pre-trained Models
The following models are trained on approximately 2 hours of Gronings speech data and are available on HuggingFace!
- FastSpeech 2 (using Tacotron 2 as the Teacher Forced Aligner)
- Tacotron 2
- Parallel WaveGAN vocoder
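For reference, a pre-trained ESPnet2 TTS model published on HuggingFace can be loaded for local inference via the espnet_model_zoo integration. The snippet below is only a sketch: the model tag and the input sentence are placeholders, not the actual repository names.
$ pip install espnet_model_zoo
$ python3 <<'EOF'
# Minimal inference sketch; replace the placeholder tag with the actual
# HuggingFace model repository name and the text with a Gronings sentence.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("<huggingface-model-tag>")
output = tts("<Gronings input text>")
sf.write("sample.wav", output["wav"].numpy(), tts.fs)
EOF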