Coder Social home page Coder Social logo

persian-text-to-speech's Introduction

Persian text to speech

A convolutional sequence to sequence model for Persian text to speech based on Tachibana et al with a few modifications:

  1. The article didn’t mention position embedding, but in order to give the model a sense of position awareness I added it (it turned out to be useful in the original convolutional seq2seq paper

  2. In the paper they trained both networks with a combination of the L1 loss and an additional binary cross entropy loss , they claim it was beneficial, I found it to be an odd choice of loss function.to validate their idea I trained networks with and without binary cross entropy loss , but adding binary cross entropy loss didn’t make much difference.

  3. In the original paper they used a fixed learning rate of 0.001, but I decayed it.according to the ADAM paper Good default settings for the tested machine learning problems are α = 0.001,β1 = 0.9, β2 = 0.999 but in my case it didn't work!(gradients keep exploding)

  4. I implemented a standard scaled dot-product attention but it mostly failed to converge. Guided attention is a simple but good idea such that the model converge way faster than a standard scaled dot-product attention. Training the attention part was a bottleneck.

In the following figure a schematic of the model architecture is presented:

text to mel

Dataset: (a Persian single speaker speech dataset that last more than 30 hours [narrated by Maryam Mahboub])

There aren’t any available datasets for Persian text to speech so I decided to make my own. I chose to use audio books from navar (I couldn’t find any websites for free public domain audiobooks) then I checked for text availability, if text is not available the audio book is excluded. None of them were available so I $$$$ a few online bookstores like Fidibo to get them. All audio files are sampled at 44 kHz which means there are 44100 values stored for every second of audio, too much information, WAY TOO MUCH! So I down sampled the pcm/wav audio samples from 44 kHz to 22 kHz. I also discarded the stereo channel as it contains highly redundant information.

The audio from navar comes in large files which don’t suit the text to speech task. At first I decided to use automatic force-alignment technique (given an audio clip containing speech [without environmental noise] and the corresponding transcript, computing a forced alignment is the process of determining, for each fragment of the transcript, the time interval containing the spoken text of the fragment) to align the text corpus with audio clips but it didn’t guarantee the correct alignments because texts were slightly different than speech. I finally ended up pragmatically splitting large chunk of audio files into smaller parts by using silence detection and manually aligned text with audio segments.

The distribution of audio lengths for my dataset is shown in the following figure: (90 percentage of the samples have a duration between 2 and 7 seconds)

text to mel

To speed up the training speech and reduce CPU usage I first preprocessed all audio files (using Fourier’s Transform to convert our audio data to the frequency domain). Extracting the audio spectrogram on the fly is expensive.

I implemented the models in Tensorflow and they were trained on a single Nvidia GTX 1050Ti with 4GB memory (a batch size of 32 for text to spectrogram and 8 for super resolution network ).

In the following figure the learned character embeddings is shown:

text to mel

Some generated samples:(NOTE:I use the Griffin Lim algorithm to invert the prediction back to audio signal.i found that's the main source of audible artifacts)

Woman-Maryam Mahboub

Man-Arman Soltan zadeh

Pre-trained model:

You can download pre-trained weights from here

Script files

if you want to train and test your own datasets: Modify train.ipynb including args and data path

demo.ipynb:Enter your sentences and listen to the generated audio

#TODO:Implement mononotic attention

persian-text-to-speech's People

Contributors

alisterta avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.