
open-tts-tracker's Introduction

πŸ—£οΈ Open TTS Tracker

A one-stop shop to track all open-access/open-source TTS models as they come out. Feel free to make a PR for any that aren't linked here.

This is intended as a resource to raise awareness of these models and to make it easier for researchers, developers, and enthusiasts to stay informed about the latest advancements in the field.

Note

This repo tracks only TTS models with open-source/open-access codebases. More motivation for everyone to open-source! 🤗

| Name | GitHub | Weights | License | Fine-tune | Languages | Paper | Demo | Issues |
|---|---|---|---|---|---|---|---|---|
| Amphion | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| AI4Bharat | Repo | 🤗 Hub | MIT | Yes | Indic | Paper | Demo | |
| Bark | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| EmotiVoice | Repo | GDrive | Apache 2.0 | Yes | ZH + EN | Not Available | Not Available | Separate GUI agreement |
| Glow-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| GPT-SoVITS | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | |
| HierSpeech++ | Repo | GDrive | MIT | No | KR + EN | Paper | 🤗 Space | |
| IMS-Toucan | Repo | GH release | Apache 2.0 | Yes | Multilingual | Paper | 🤗 Space | |
| MahaTTS | Repo | 🤗 Hub | Apache 2.0 | No | English + Indic | Not Available | Recordings, Colab | |
| Matcha-TTS | Repo | GDrive | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| MetaVoice-1B | Repo | 🤗 Hub | Apache 2.0 | Yes | Multilingual | Not Available | 🤗 Space | |
| Neural-HMM TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| OpenVoice | Repo | 🤗 Hub | CC-BY-NC 4.0 | No | ZH + EN | Paper | 🤗 Space | Non Commercial |
| OverFlow TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| Parler TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Not Available | Not Available | |
| pflowTTS | Unofficial Repo | GDrive | MIT | Yes | English | Paper | Not Available | GPL-licensed phonemizer |
| Piper | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | GPL-licensed phonemizer |
| Pheme | Repo | 🤗 Hub | CC-BY | Yes | English | Paper | 🤗 Space | |
| RAD-MMM | Repo | GDrive | MIT | Yes | Multilingual | Paper | Jupyter Notebook, Webpage | |
| RAD-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| Silero | Repo | GH links | CC BY-NC-SA | No | EN + DE + ES + EA | Not Available | Not Available | Non Commercial |
| StyleTTS 2 | Repo | 🤗 Hub | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| Tacotron 2 | Unofficial Repo | GDrive | BSD-3 | Yes | English | Paper | Webpage | |
| TorToiSe TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Technical report | 🤗 Space | |
| TTTS | Repo | 🤗 Hub | MPL 2.0 | No | ZH | Not Available | Colab, 🤗 Space | |
| VALL-E | Unofficial Repo | Not Available | MIT | Yes | NA | Paper | Not Available | |
| VITS/ MMS-TTS | Repo | 🤗 Hub / MMS | Apache 2.0 | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| WhisperSpeech | Repo | 🤗 Hub | MIT | No | English, Polish | Not Available | 🤗 Space, Recordings, Colab | |
| XTTS | Repo | 🤗 Hub | CPML | Yes | Multilingual | Technical notes | 🤗 Space | Non Commercial |
| xVASynth | Repo | 🤗 Hub | GPL-3.0 | Yes | Multilingual | Paper | 🤗 Space | Copyrighted materials used for training |

Capability specifics

| Name | Processor ⚡ | Phonetic alphabet 🔤 | Insta-clone 👥 | Emotional control 🎭 | Prompting 📖 | Speech control 🎚 | Streaming support 🌊 | S2S support 🦜 | Longform synthesis |
|---|---|---|---|---|---|---|---|---|---|
| Amphion | CUDA | | 👥 | 🎭👥 | ❌ | | | | |
| Bark | CUDA | ❌ | | 🎭 tags | ❌ | | | | |
| EmotiVoice | | | | | | | | | |
| Glow-TTS | | | | | | | | | |
| GPT-SoVITS | | | | | | | | | |
| HierSpeech++ | | ❌ | 👥 | 🎭👥 | ❌ | speed / stability 🎚 | | 🦜 | |
| IMS-Toucan | CUDA | ❌ | ❌ | ❌ | ❌ | | | | |
| MahaTTS | | | | | | | | | |
| Matcha-TTS | | IPA | ❌ | ❌ | ❌ | speed / stability 🎚 | | | |
| MetaVoice-1B | CUDA | | 👥 | 🎭👥 | ❌ | stability / similarity 🎚 | | | Yes |
| Neural-HMM TTS | | | | | | | | | |
| OpenVoice | CUDA | ❌ | 👥 | 6-type 🎭 😑😃😭😯🤫😊 | ❌ | | | | |
| OverFlow TTS | | | | | | | | | |
| pflowTTS | | | | | | | | | |
| Piper | | | | | | | | | |
| Pheme | CUDA | ❌ | 👥 | 🎭👥 | ❌ | stability 🎚 | | | |
| RAD-TTS | | | | | | | | | |
| Silero | | | | | | | | | |
| StyleTTS 2 | CPU / CUDA | IPA | 👥 | 🎭👥 | ❌ | | 🌊 | | Yes |
| Tacotron 2 | | | | | | | | | |
| TorToiSe TTS | | ❌ | ❌ | ❌ | 📖 | | 🌊 | | |
| TTTS | CPU / CUDA | ❌ | 👥 | | | | | | |
| VALL-E | | | | | | | | | |
| VITS/ MMS-TTS | CUDA | ❌ | ❌ | ❌ | ❌ | speed 🎚 | | | |
| WhisperSpeech | CUDA | ❌ | 👥 | 🎭👥 | ❌ | speed 🎚 | | | |
| XTTS | CUDA | ❌ | 👥 | 🎭👥 | ❌ | speed / stability 🎚 | 🌊 | ❌ | |
| xVASynth | CPU / CUDA | ARPAbet+ | ❌ | 4-type 🎭 😑😃😭😯 per-phoneme | ❌ | speed / pitch / energy / 🎭 🎚 per-phoneme | ❌ | 🦜 | |
  • Processor - CPU/CUDA/ROCm (single/multi GPU used for inference; the real-time factor should be below 2.0 to qualify for CPU, though some leeway can be given if the model supports audio streaming)
  • Phonetic alphabet - None/IPA/ARPAbet (phonetic transcription that allows controlling the pronunciation of certain words during inference)
  • Insta-clone - Yes/No (zero-shot model for quick voice cloning)
  • Emotional control - Yes 🎭 / Strict (Strict meaning no ability to blend between states; 🎭👥 denotes an insta-clone switch)
  • Prompting - Yes/No (a side effect of narrator-based datasets and a way to affect the emotional state; see the ElevenLabs docs)
  • Streaming support - Yes/No (whether it is possible to play back audio that is still being generated)
  • Speech control - speed/pitch/… (ability to change the pitch, duration, energy and/or emotion of generated speech)
  • Speech-To-Speech support - Yes/No (streaming support implies real-time S2S; S2T => T2S pipelines do not count)
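The Processor criterion above hinges on the real-time factor (RTF): synthesis wall-clock time divided by the duration of the generated audio. A minimal sketch of that check, assuming hypothetical helper names (the 3.0 leeway cutoff for streaming models is an arbitrary illustration, not a value from this list):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    return synthesis_seconds / audio_seconds

def qualifies_for_cpu(rtf: float, supports_streaming: bool = False) -> bool:
    # Below 2.0 qualifies outright; streaming support earns "some leeway"
    # (the 3.0 cutoff here is an illustrative assumption).
    return rtf < 2.0 or (supports_streaming and rtf < 3.0)

# 4 s to synthesize 10 s of audio -> RTF 0.4, comfortably real-time.
print(real_time_factor(4.0, 10.0))  # 0.4
print(qualifies_for_cpu(2.5, supports_streaming=True))  # True
```

An RTF of 1.0 means audio is produced exactly as fast as it plays; streaming makes a higher RTF tolerable because playback can begin before synthesis finishes.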

How can you help?

Help make this list more complete. Create demos on the Hugging Face Hub and link them here :) Got any questions? Drop me a DM on Twitter @reach_vb.

open-tts-tracker's People

Contributors

acidbubbles, aroraakshit, drishyakarki, fakerybakery, pendrokar, sharathadavanne, thisisashukla, vaibhavs10, zoq


open-tts-tracker's Issues

About the License of HierSpeech++

@Vaibhavs10

Thanks for sharing our work!

We have changed the license to the MIT License (so please update our license information in the README!).

Now you can use it in commercial products.

Please enjoy high-quality zero-shot speech synthesis with fast inference speed!

Thanks!

Links order

I was curious how the order was determined. Alphabetically? Stars? It would be nice to have this clearly defined. GitHub stars might be the most relevant (though you'd need to review and update them periodically); otherwise, alphabetical order would be fair. Eventually, an "Active" column (e.g. no commits for 2+ months) could also help show which projects are no longer being actively worked on.

Add realtime info

Hi, I came across your list, thanks for sharing it.

I thought it would be useful if the table included information about whether each model can run in real time, and on what hardware. I might do a PR if I find the time.

Best,
H. A

Add dataset size column

Add dataset size in hours used to train the latest model.

I'm curious how it affects quality. ElevenLabs claims to have trained theirs on ~680k hours of speech audio!

What do you think?

Even more capability columns

Suggest adding more columns that would describe capabilities. Please comment on which of these you see as notable enough.

  1. GPU acceleration - Yes/No (CUDA/ROCm, single/multi GPU)
  2. Word pronunciation adjustment - None/IPA/ARPAbet/<other>
  3. Insta-clone - Yes/No (quick voice cloning from a few audio samples, though already implied by TTS models that do not require fine-tuning)
  4. Emotional control - Yes/Strict/No (Strict meaning no ability to blend between states)
  5. Prompting* - Yes/No (often a side effect of narrator-based datasets and a way to affect the emotional state)
  6. Streaming support - Yes/No (is it possible to play back audio that is still being generated)
  7. Audio control - Yes/No (speed/<other>) (ability to change the pitch, duration, energy and/or emotion of generated speech)
  8. Per-phoneme control - Yes/No (speed/<other>) (ability to change the pitch, duration, energy and/or emotion of each uttered phoneme)
  9. Speech-To-Speech support - Yes/No (S2S capability has lately tended to come alongside TTS)

*Prompting as mentioned in ElevenLabs docs:
https://elevenlabs.io/docs/speech-synthesis/prompting

Add piper

Piper (https://github.com/rhasspy/piper) :

A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4

  • MIT license
  • 2.4k stars on GitHub.
  • Weights: https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main
  • Supported languages:
    • Arabic (ar_JO)
    • Catalan (ca_ES)
    • Czech (cs_CZ)
    • Danish (da_DK)
    • German (de_DE)
    • Greek (el_GR)
    • English (en_GB, en_US)
    • Spanish (es_ES, es_MX)
    • Finnish (fi_FI)
    • French (fr_FR)
    • Hungarian (hu_HU)
    • Icelandic (is_IS)
    • Italian (it_IT)
    • Georgian (ka_GE)
    • Kazakh (kk_KZ)
    • Luxembourgish (lb_LU)
    • Nepali (ne_NP)
    • Dutch (nl_BE, nl_NL)
    • Norwegian (no_NO)
    • Polish (pl_PL)
    • Portuguese (pt_BR, pt_PT)
    • Romanian (ro_RO)
    • Russian (ru_RU)
    • Serbian (sr_RS)
    • Swedish (sv_SE)
    • Swahili (sw_CD)
    • Turkish (tr_TR)
    • Ukrainian (uk_UA)
    • Vietnamese (vi_VN)
    • Chinese (zh_CN)

Generate the table from another file

Hi, currently every modification has to be made directly in Markdown, which is pretty unwieldy. So I propose moving the data to a separate file such as CSV, YAML, or JSON. This would have the added benefit of making the data easily machine-readable.

If you accept this proposal, I would probably implement it myself if all goes well. The generation part should be trivial with a GitHub Action.
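The generation step the proposal describes can be sketched in a few lines of Python. The column names and sample rows below are illustrative, not the repo's actual schema:

```python
import csv
import io

# Illustrative sample data; the real file would hold the full model list.
CSV_DATA = """\
Name,License,Fine-tune,Languages
Bark,MIT,No,Multilingual
Piper,MIT,Yes,Multilingual
"""

def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a GitHub-flavored markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

print(csv_to_markdown(CSV_DATA))
```

A GitHub Action would run a script like this on push and commit the regenerated README section, keeping the CSV as the single source of truth.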

Add training script availability

It would be useful to have a section on training scripts, indicating whether:

  • a full training script is available
  • a fine-tuning script is available

Additionally, it would be useful to know whether the training data is publicly available.
