
open-tts-tracker's Introduction

πŸ—£οΈ Open TTS Tracker

A one-stop shop to track all open-access/open-source TTS models as they come out. Feel free to make a PR for any that aren't linked here.

This is intended as a resource to raise awareness of these models and to make it easier for researchers, developers, and enthusiasts to stay informed about the latest advancements in the field.

Note

This repo tracks only TTS models with open-source/open-access codebases. More motivation for everyone to open-source! 🤗

| Name | GitHub | Weights | License | Fine-tune | Languages | Paper | Demo | Issues |
|---|---|---|---|---|---|---|---|---|
| Amphion | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| AI4Bharat | Repo | 🤗 Hub | MIT | Yes | Indic | Paper | Demo | |
| Bark | Repo | 🤗 Hub | MIT | No | Multilingual | Paper | 🤗 Space | |
| EmotiVoice | Repo | GDrive | Apache 2.0 | Yes | ZH + EN | Not Available | Not Available | Separate GUI agreement |
| Glow-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| GPT-SoVITS | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | |
| HierSpeech++ | Repo | GDrive | MIT | No | KR + EN | Paper | 🤗 Space | |
| IMS-Toucan | Repo | GH release | Apache 2.0 | Yes | Multilingual | Paper | 🤗 Space | |
| MahaTTS | Repo | 🤗 Hub | Apache 2.0 | No | English + Indic | Not Available | Recordings, Colab | |
| Matcha-TTS | Repo | GDrive | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| MetaVoice-1B | Repo | 🤗 Hub | Apache 2.0 | Yes | Multilingual | Not Available | 🤗 Space | |
| Neural-HMM TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| OpenVoice | Repo | 🤗 Hub | CC-BY-NC 4.0 | No | ZH + EN | Paper | 🤗 Space | Non Commercial |
| OverFlow TTS | Repo | GitHub | MIT | Yes | English | Paper | GH Pages | |
| Parler TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Not Available | Not Available | |
| pflowTTS | Unofficial Repo | GDrive | MIT | Yes | English | Paper | Not Available | GPL-licensed phonemizer |
| Piper | Repo | 🤗 Hub | MIT | Yes | Multilingual | Not Available | Not Available | GPL-licensed phonemizer |
| Pheme | Repo | 🤗 Hub | CC-BY | Yes | English | Paper | 🤗 Space | |
| RAD-MMM | Repo | GDrive | MIT | Yes | Multilingual | Paper | Jupyter Notebook, Webpage | |
| RAD-TTS | Repo | GDrive | MIT | Yes | English | Paper | GH Pages | |
| Silero | Repo | GH links | CC BY-NC-SA | No | EN + DE + ES + EA | Not Available | Not Available | Non Commercial |
| StyleTTS 2 | Repo | 🤗 Hub | MIT | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| Tacotron 2 | Unofficial Repo | GDrive | BSD-3 | Yes | English | Paper | Webpage | |
| TorToiSe TTS | Repo | 🤗 Hub | Apache 2.0 | Yes | English | Technical report | 🤗 Space | |
| TTTS | Repo | 🤗 Hub | MPL 2.0 | No | ZH | Not Available | Colab, 🤗 Space | |
| VALL-E | Unofficial Repo | Not Available | MIT | Yes | NA | Paper | Not Available | |
| VITS/ MMS-TTS | Repo | 🤗 Hub / MMS | Apache 2.0 | Yes | English | Paper | 🤗 Space | GPL-licensed phonemizer |
| WhisperSpeech | Repo | 🤗 Hub | MIT | No | English, Polish | Not Available | 🤗 Space, Recordings, Colab | |
| XTTS | Repo | 🤗 Hub | CPML | Yes | Multilingual | Technical notes | 🤗 Space | Non Commercial |
| xVASynth | Repo | 🤗 Hub | GPL-3.0 | Yes | Multilingual | Paper | 🤗 Space | Copyrighted materials used for training |

Capability specifics

| Name | Processor ⚡ | Phonetic alphabet 🔤 | Insta-clone 👥 | Emotional control 🎭 | Prompting 📖 | Speech control 🎚 | Streaming support 🌊 | S2S support 🦜 | Longform synthesis |
|---|---|---|---|---|---|---|---|---|---|
| Amphion | CUDA | | 👥 | 🎭👥 | ❌ | | | | |
| Bark | CUDA | ❌ | | 🎭 tags | ❌ | | | | |
| EmotiVoice | | | | | | | | | |
| Glow-TTS | | | | | | | | | |
| GPT-SoVITS | | | | | | | | | |
| HierSpeech++ | | ❌ | 👥 | 🎭👥 | ❌ | speed / stability 🎚 | | 🦜 | |
| IMS-Toucan | CUDA | ❌ | ❌ | ❌ | ❌ | | | | |
| MahaTTS | | | | | | | | | |
| Matcha-TTS | | IPA | ❌ | ❌ | ❌ | speed / stability 🎚 | | | |
| MetaVoice-1B | CUDA | | 👥 | 🎭👥 | ❌ | stability / similarity 🎚 | | | Yes |
| Neural-HMM TTS | | | | | | | | | |
| OpenVoice | CUDA | ❌ | 👥 | 6-type 🎭 😑😃😭😯🤫😊 | ❌ | | | | |
| OverFlow TTS | | | | | | | | | |
| pflowTTS | | | | | | | | | |
| Piper | | | | | | | | | |
| Pheme | CUDA | ❌ | 👥 | 🎭👥 | ❌ | stability 🎚 | | | |
| RAD-TTS | | | | | | | | | |
| Silero | | | | | | | | | |
| StyleTTS 2 | CPU / CUDA | IPA | 👥 | 🎭👥 | ❌ | | 🌊 | | Yes |
| Tacotron 2 | | | | | | | | | |
| TorToiSe TTS | | ❌ | ❌ | ❌ | 📖 | | 🌊 | | |
| TTTS | CPU / CUDA | ❌ | 👥 | | | | | | |
| VALL-E | | | | | | | | | |
| VITS/ MMS-TTS | CUDA | ❌ | ❌ | ❌ | ❌ | speed 🎚 | | | |
| WhisperSpeech | CUDA | ❌ | 👥 | 🎭👥 | ❌ | speed 🎚 | | | |
| XTTS | CUDA | ❌ | 👥 | 🎭👥 | ❌ | speed / stability 🎚 | 🌊 | ❌ | |
| xVASynth | CPU / CUDA | ARPAbet+ | ❌ | 4-type 🎭 😑😃😭😯 per-phoneme | ❌ | speed / pitch / energy / 🎭 🎚 per-phoneme | ❌ | 🦜 | |
  • Processor - CPU/CUDA/ROCm (single/multi GPU used for inference; the real-time factor should be below 2.0 to qualify for CPU, though some leeway can be given if the model supports audio streaming)
  • Phonetic alphabet - None/IPA/ARPAbet (phonetic transcription that allows controlling the pronunciation of certain words during inference)
  • Insta-clone - Yes/No (zero-shot model for quick voice cloning)
  • Emotional control - Yes 🎭 / Strict (Strict meaning no ability to blend between states; 🎭👥 denotes an insta-clone switch)
  • Prompting - Yes/No (a side effect of narrator-based datasets and a way to affect the emotional state; see the ElevenLabs docs)
  • Streaming support - Yes/No (whether it is possible to play back audio that is still being generated)
  • Speech control - speed/pitch/… (ability to change the pitch, duration, energy and/or emotion of generated speech)
  • Speech-To-Speech support - Yes/No (streaming support implies real-time S2S; S2T => T2S pipelines do not count)
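The Processor criterion above hinges on the real-time factor (RTF): synthesis wall-clock time divided by the duration of the generated audio. A minimal sketch of that check, assuming hypothetical helper names (the 3.0 leeway cutoff for streaming models is an arbitrary illustration, not a value from this list):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    return synthesis_seconds / audio_seconds

def qualifies_for_cpu(rtf: float, supports_streaming: bool = False) -> bool:
    # Below 2.0 qualifies outright; streaming support earns "some leeway"
    # (the 3.0 cutoff here is an illustrative assumption).
    return rtf < 2.0 or (supports_streaming and rtf < 3.0)

# 4 s to synthesize 10 s of audio -> RTF 0.4, comfortably real-time.
print(real_time_factor(4.0, 10.0))  # 0.4
print(qualifies_for_cpu(2.5, supports_streaming=True))  # True
```

An RTF of 1.0 means audio is produced exactly as fast as it plays; streaming makes a higher RTF tolerable because playback can begin before synthesis finishes.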

How can you help?

Help make this list more complete. Create demos on the Hugging Face Hub and link them here :) Got any questions? Drop me a DM on Twitter @reach_vb.

open-tts-tracker's People

Contributors

acidbubbles, aroraakshit, drishyakarki, fakerybakery, pendrokar, sharathadavanne, thisisashukla, vaibhavs10, zoq


open-tts-tracker's Issues

About the License of HierSpeech++

@Vaibhavs10

Thanks for sharing our work!

We have changed the license to the MIT License (so please update our license information in the README!).

Now you can use it in commercial products.

Please enjoy high-quality zero-shot speech synthesis with fast inference speed!

Thanks!

Links order

I was curious how the order was determined. Alphabetically? Stars? It would be nice to have this clearly defined. GitHub stars might be the most relevant (though you'd need to review and update them periodically); otherwise, alphabetical order would be fair. Eventually, an "Active" column (e.g. no commits for 2+ months) could also help show which projects are no longer being actively worked on.

Add realtime info

Hi, I came across your list, thanks for sharing it.

I thought it would be useful if the table included information about whether each model can run in real time, and on what hardware. I might do a PR if I find the time.

Best,
H. A

Add dataset size column

Add dataset size in hours used to train the latest model.

I'm curious how it affects quality. ElevenLabs claims to have trained theirs on ~680k hours of speech audio!

What do you think?

Even more capability columns

Suggest adding more columns that would describe capabilities. Please comment on which of these you see as notable enough.

  1. GPU acceleration - Yes/No (CUDA/ROCm, single/multi GPU)
  2. Word pronunciation adjustment - None/IPA/ARPAbet/<other>
  3. Insta-clone - Yes/No (quick voice cloning from a few audio samples, though already implied by TTS models that do not require fine-tuning)
  4. Emotional control - Yes/Strict/No (Strict meaning no ability to blend between states)
  5. Prompting* - Yes/No (often a side effect of narrator-based datasets and a way to affect the emotional state)
  6. Streaming support - Yes/No (is it possible to play back audio that is still being generated)
  7. Audio control - Yes/No (speed/<other>) (ability to change the pitch, duration, energy and/or emotion of generated speech)
  8. Per-phoneme control - Yes/No (speed/<other>) (ability to change the pitch, duration, energy and/or emotion of each uttered phoneme)
  9. Speech-To-Speech support - Yes/No (S2S capability has lately tended to come alongside TTS)

*Prompting as mentioned in ElevenLabs docs:
https://elevenlabs.io/docs/speech-synthesis/prompting

Add piper

Piper (https://github.com/rhasspy/piper) :

A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4

  • MIT license
  • 2.4k stars on GitHub.
  • Weights: https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main
  • Supported languages:
    • Arabic (ar_JO)
    • Catalan (ca_ES)
    • Czech (cs_CZ)
    • Danish (da_DK)
    • German (de_DE)
    • Greek (el_GR)
    • English (en_GB, en_US)
    • Spanish (es_ES, es_MX)
    • Finnish (fi_FI)
    • French (fr_FR)
    • Hungarian (hu_HU)
    • Icelandic (is_IS)
    • Italian (it_IT)
    • Georgian (ka_GE)
    • Kazakh (kk_KZ)
    • Luxembourgish (lb_LU)
    • Nepali (ne_NP)
    • Dutch (nl_BE, nl_NL)
    • Norwegian (no_NO)
    • Polish (pl_PL)
    • Portuguese (pt_BR, pt_PT)
    • Romanian (ro_RO)
    • Russian (ru_RU)
    • Serbian (sr_RS)
    • Swedish (sv_SE)
    • Swahili (sw_CD)
    • Turkish (tr_TR)
    • Ukrainian (uk_UA)
    • Vietnamese (vi_VN)
    • Chinese (zh_CN)

Generate the table from another file

Hi, currently every modification has to be made directly in Markdown, which is pretty unwieldy. So I propose moving the data to a separate file such as CSV, YAML, or JSON. This would have the added benefit of making the data easily machine-readable.

If you accept this proposal, I would probably implement it myself if all goes well. The generation part should be trivial with a GitHub Action.
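The generation step the proposal describes can be sketched in a few lines of Python. The column names and sample rows below are illustrative, not the repo's actual schema:

```python
import csv
import io

# Illustrative sample data; the real file would hold the full model list.
CSV_DATA = """\
Name,License,Fine-tune,Languages
Bark,MIT,No,Multilingual
Piper,MIT,Yes,Multilingual
"""

def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a GitHub-flavored markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

print(csv_to_markdown(CSV_DATA))
```

A GitHub Action would run a script like this on push and commit the regenerated README section, keeping the CSV as the single source of truth.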

Add training script availability

It would be useful to have a section on training scripts, indicating whether:

  • a full training script is available
  • a fine-tuning script is available

Additionally, it would be useful to know whether the training data is publicly available.
