
ndtuan10 / speechrecognition-tacotron2


Project contents for Speech Recognition / Speech Synthesis (CS535): two pre-trained models, Tacotron2 and WaveGlow.


SpeechRecognition-Tacotron2



Introduction to data

Training the model

Evaluating the model

Score | 1 | 2 | 3 | 4 | 5
Label | Very bad | Bad | Medium | Good | Excellent
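
A mean opinion score (MOS) on this scale is the arithmetic mean of the individual 1 to 5 ratings. The sketch below shows the calculation under stated assumptions: the function names and the example ratings are hypothetical and are not taken from the project's evaluation data.

```python
# Score-to-label mapping from the table above.
LABELS = {1: "Very bad", 2: "Bad", 3: "Medium", 4: "Good", 5: "Excellent"}


def mean_opinion_score(ratings):
    """Return the arithmetic mean of a list of 1-5 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in LABELS:
            raise ValueError(f"rating {rating!r} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)


def nearest_label(score):
    """Return the label of the scale point closest to a (possibly fractional) score."""
    closest = min(LABELS, key=lambda point: abs(point - score))
    return LABELS[closest]


# Hypothetical ratings given by one listener to five synthesized sentences.
example_ratings = [4, 4, 3, 5, 4]
example_mos = mean_opinion_score(example_ratings)
print(example_mos, nearest_label(example_mos))
```

A fractional MOS does not correspond exactly to a label, so `nearest_label` is only a reading aid.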

Introduction to testing data

No. Content
1 My love for you is like the raging sea. So powerful and deep it will forever be. Through storm, wind, and heavy rain. It will withstand every pain.
2 Amazing good job, you!
3 Manchester City kept their hopes of winning a fourth consecutive Carabao Cup alive after overcoming Manchester United 2-0 in their semi-final at Old Trafford.
4 Roger Federer led 4-1, and 30-0 in the second set.
5 Hanoi capital continue to get cold air which flows from the North, the temperature may drop under 15 degrees, so citizens should keep the bodies warm.
... ...
... ...
23 And the memories bring back, memories bring back you.
24 I woke up early this morning. So I went to school on time.
25 And all my love, I’m holding on forever. Reaching for the love that seems so far.

Text synthesis

  • We enter English text in any “TEXT” field, for example: “My love for you is like the raging sea. So powerful and deep it will forever be. Through storm, wind, and heavy rain, it will withstand every pain”. In plain terms, the passage says that the speaker's love is like a raging sea, so strong and deep that it will last forever and endure storms, wind, heavy rain, and every pain.

  • This text reads like a verse, so we expect the Tacotron2 and WaveGlow models to deliver a soothing, soulful reading.

  • Then we convert that text into a mel spectrogram and plot it using the matplotlib library.

  • We then produce the output audio by converting the generated mel spectrogram to a waveform. We run WaveGlow in inference mode on the mel spectrogram that comes out of the Tacotron2 post-net, with sigma = 0.666 and a sampling rate of 22,050 Hz (22.05 kHz). In WaveGlow, sigma is the standard deviation of the Gaussian noise the model samples from at inference time; it is not itself a denoising step.

  • As a result, we get an 11-second audio clip of a female voice reading the English text entered above. The audio can be downloaded and listened to here: https://github.com/ndtuan10/SpeechRecognition-Tacotron2/blob/main/result.wav
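
The relationship between the sampling rate and the clip length in the steps above can be checked with a few lines of standard-library Python. This is only a sketch: the helper names are hypothetical, the sine tone is a stand-in for real WaveGlow output, and 16-bit mono PCM is an assumed output format, not something the project states.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 22050  # Hz (22.05 kHz), the rate quoted in the text


def samples_for(seconds, sample_rate=SAMPLE_RATE):
    """Number of audio samples needed for a clip of the given length."""
    return round(seconds * sample_rate)


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Clip length in seconds for a given number of samples."""
    return num_samples / sample_rate


def write_wav(stream, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM (assumed format)."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(stream, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Stand-in signal: 0.1 s of a 440 Hz tone instead of synthesized speech.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(samples_for(0.1))]
buffer = io.BytesIO()
write_wav(buffer, tone)

print(samples_for(11))  # number of samples in an 11-second clip
```

At this rate, an 11-second clip holds 11 × 22,050 = 242,550 samples.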

Introduction to evaluation criteria and experimental results table

System MOS
Tacotron2 + WaveGlow (1st person) 3.82
Tacotron2 + WaveGlow (2nd person) 3.65
Tacotron2 + WaveGlow (average) 3.735
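
The last row of the table is consistent with an unweighted mean of the two per-person scores. The short check below reproduces it; the dictionary keys and variable names are illustrative only.

```python
# Per-person MOS values copied from the table above.
rater_mos = {"1st person": 3.82, "2nd person": 3.65}

# Unweighted mean of the per-person scores.
overall_mos = sum(rater_mos.values()) / len(rater_mos)

print(f"{overall_mos:.3f}")
```

An unweighted mean of per-person scores matches the mean over all individual ratings only when each person rated the same number of sentences, which is assumed here.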

In conclusion

  • In our assessment, after training the two pre-trained Tacotron2 and WaveGlow models, the system can handle different types of text (weather forecasts, stories, news, and so on) with a voice that sounds natural and fluent, close to a human voice. In particular, for exclamations and questions the voice tends to raise its intonation at the end of the sentence, and contractions such as "you're" are read correctly.
  • Together, the Tacotron 2 and WaveGlow models form a TTS (text-to-speech) system that synthesizes natural-sounding speech from raw text without any additional information. Tacotron 2 generates mel spectrograms from the text, and WaveGlow produces high-quality speech from those mel spectrograms. WaveGlow combines ideas from Glow and WaveNet to enable fast, efficient speech synthesis with a simple model that is easy to train.
  • Despite its clear, high-quality voice, the system still does not meet our expectations when reading poetry: the voice is not yet expressive or gentle.
  • In addition, when reading sports news, the model does not read specialized sports terms correctly. For example, in football the score 2-0 should be read "two nil" but is pronounced "two zero"; in tennis, 30-0 should be read "thirty love" but is pronounced "thirty zero".
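
The expected readings in the last point can be made concrete with a small text-normalization sketch that spells out scores before the text reaches the model. This is not part of the original project: the function names, the regular expression, and the rule that only scores up to 10 (football) or the points 0, 15, 30, 40 (tennis) are rewritten are all assumptions made for illustration.

```python
import re

SMALL_NUMBERS = ["zero", "one", "two", "three", "four", "five",
                 "six", "seven", "eight", "nine", "ten"]
TENNIS_POINTS = {"0": "love", "15": "fifteen", "30": "thirty", "40": "forty"}

# Matches scores written as "<digits>-<digits>", for example "2-0" or "30-0".
SCORE_PATTERN = re.compile(r"\b(\d{1,2})-(\d{1,2})\b")


def spell_football_scores(text):
    """Spell out football scores, reading 0 as "nil"."""
    def replace(match):
        home, away = int(match.group(1)), int(match.group(2))
        if home > 10 or away > 10:
            return match.group(0)  # leave unusual scores untouched
        words = ["nil" if n == 0 else SMALL_NUMBERS[n] for n in (home, away)]
        return " ".join(words)
    return SCORE_PATTERN.sub(replace, text)


def spell_tennis_points(text):
    """Spell out tennis game points, reading 0 as "love"."""
    def replace(match):
        first, second = match.group(1), match.group(2)
        if first not in TENNIS_POINTS or second not in TENNIS_POINTS:
            return match.group(0)  # e.g. a set score such as 4-1 stays as-is
        return f"{TENNIS_POINTS[first]} {TENNIS_POINTS[second]}"
    return SCORE_PATTERN.sub(replace, text)


print(spell_football_scores("overcoming Manchester United 2-0 in their semi-final"))
print(spell_tennis_points("Roger Federer led 4-1, and 30-0 in the second set."))
```

Choosing which function to apply still requires knowing the sport, which this sketch leaves to the caller.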
