
parler-tts's Introduction

Parler-TTS

Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.). It is a reproduction of the work from the paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations by Dan Lyth and Simon King, of Stability AI and the University of Edinburgh respectively.

Unlike other TTS models, Parler-TTS is a fully open-source release. All of the datasets, pre-processing, training code, and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.

This repository contains the inference and training code for Parler-TTS. It is designed to accompany the Data-Speech repository for dataset annotation.

Important

We're proud to release Parler-TTS Mini v0.1, our first 600M parameter model, trained on 10.5K hours of audio data. In the coming weeks, we'll be working on scaling up to 50k hours of data, in preparation for the v1 model.

📖 Quick Index

Installation

Parler-TTS has lightweight dependencies and can be installed in one line:

pip install git+https://github.com/huggingface/parler-tts.git

Usage

Tip

You can directly try it out in an interactive demo here!

Using Parler-TTS is as simple as "bonjour". Use the following inference snippet:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
(Demo video: Yoach.Lacombe.s.Video.-.Apr.10.2024.1.mp4)

Training


The training folder contains all the information needed to train or fine-tune your own Parler-TTS model.

Important

TL;DR: After following the installation steps, you can reproduce the Parler-TTS Mini v0.1 training recipe with the following command line:

accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
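Note: this assumes Hugging Face Accelerate has already been configured on the machine. If it hasn't, run the one-time setup first (standard Accelerate usage, not a Parler-TTS-specific step):

accelerate config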

Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!


Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
@misc{lyth2024natural,
      title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
      author={Dan Lyth and Simon King},
      year={2024},
      eprint={2402.01912},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:

  • Datasets:
    • Train on more data
    • Add more features such as accents
  • Training:
    • Add PEFT compatibility for LoRA fine-tuning.
    • Add the possibility to train without a description column.
    • Add notebook training.
    • Explore multilingual training.
    • Explore mono-speaker finetuning.
    • Explore more architectures.
  • Optimization:
    • Compilation and static cache
    • Support for FA2 and SDPA
  • Evaluation:
    • Add more evaluation metrics

parler-tts's People

Contributors

sanchit-gandhi, ylacombe



parler-tts's Issues

Model stumbling on its words

Running the following code:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A male speaker with a low-pitched voice delivering his words at a slow pace in a small, confined space with a very clear audio and an animated tone."
prompt = "In the annals of history, the ink that drafted peace often dried under the shadow of future conflicts. Today, we dive deep into the bottom 10 worst peace treaties ever signed, the naive hopes and the grim repercussions they bore, unraveling a tapestry of unintended consequences that would haunt nations for generations. From agreements that sowed the seeds of resentments leading to catastrophic wars, to those that carved up continents disregarding the people who lived there, we explore how peace can sometimes lead to anything but."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)

Outputs the .wav file posted at this link: http://sndup.net/vzyp

How can I get it to correctly output the prompt text? Is my prompt too large? Am I using the model incorrectly? Thank you!
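One hedged workaround sketch (not an official recommendation): very long prompts may exceed what Parler-TTS Mini v0.1 handles reliably, so splitting the text into sentences and concatenating the audio often helps. The snippet below reuses model, tokenizer, device, description, and prompt from the code above.

import re
import numpy as np
import soundfile as sf

# Hedged workaround: synthesize sentence-by-sentence and concatenate the audio.
sentences = re.split(r"(?<=[.!?])\s+", prompt)
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

chunks = []
for sentence in sentences:
    prompt_input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    chunks.append(generation.cpu().numpy().squeeze())

sf.write("output.wav", np.concatenate(chunks), model.config.sampling_rate)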

error regarding some tokenizer issue

When I run the sample script, I keep getting this error message among others... I'm not sure how dire it is or whether it even impacts performance:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers

Training on a NEW language

Suppose we want to train this TTS model on a new language whose tokens are not in the Flan-T5 tokenizer. Can I simply change the name of the tokenizer in config.json, or do I have to make code changes as well? Note: the new tokenizer will not be a FLAN-T5 one.

Won't work

First of all, congrats on your accomplishments!

I must be doing something wrong, because I can't get it to work. I want to install it in my textgenwebui environment, but I get this error:

C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\torch\nn\utils\weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Using the model-agnostic default `max_length` (=2580) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
Calling `sample` directly is deprecated and will be removed in v4.41. Use `generate` or a custom generation loop instead.
--- Logging error ---
Traceback (most recent call last):
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 1110, in emit
    msg = self.format(record)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 953, in format
    return fmt.format(record)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 687, in format
    record.message = record.getMessage()
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 377, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\text-generation-webui-snapshot-2024-04-21\snippet.py", line 17, in <module>
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 2608, in generate
    outputs = self.sample(
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2584, in sample
    return self._sample(*args, **kwargs)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2730, in _sample
    logger.warning_once(
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\utils\logging.py", line 329, in warning_once
    self.warning(*args, **kwargs)
Message: '`eos_token_id` is deprecated in this function and will be removed in v4.41, use `stopping_criteria=StoppingCriteriaList([EosTokenCriteria(eos_token_id=eos_token_id)])` instead. Otherwise make sure to set `model.generation_config.eos_token_id`'
Arguments: (<class 'FutureWarning'>,)

It is super vague, and I don't know where to look next.
My current versions: Python 3.11, torch 2.2.1+cu121, transformers 4.40.0.

can anyone point me in the right direction?
thanks for your time!

Running produces output like this... failure

(base) gwen@GwenSeidr:~/2/parler-tts$ virtualenv parler_tts_env
created virtual environment CPython3.10.12.final.0-64 in 328ms
creator CPython3Posix(dest=/home/gwen/2/parler-tts/parler_tts_env, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/gwen/.local/share/virtualenv)
added seed packages: GitPython==3.1.43, Jinja2==3.1.3, Markdown==3.6, MarkupSafe==2.1.5, PyYAML==6.0.1, absl_py==2.1.0, accelerate==0.29.2, aiohttp==3.9.4, aiosignal==1.3.1, appdirs==1.4.4, argbind==0.3.7, asttokens==2.4.1, async_timeout==4.0.3, attrs==23.2.0, audioread==3.0.1, certifi==2024.2.2, cffi==1.16.0, charset_normalizer==3.3.2, click==8.1.7, contourpy==1.2.1, cycler==0.12.1, datasets==2.18.0, decorator==5.1.1, descript_audio_codec==1.0.0, descript_audiotools==0.7.2, dill==0.3.8, docker_pycreds==0.4.0, docstring_parser==0.16, einops==0.7.0, evaluate==0.4.1, exceptiongroup==1.2.0, executing==2.0.1, ffmpy==0.3.2, filelock==3.13.4, fire==0.6.0, flatten_dict==0.4.2, fonttools==4.51.0, frozenlist==1.4.1, fsspec==2024.2.0, future==1.0.0, gitdb==4.0.11, grpcio==1.62.1, huggingface_hub==0.22.2, idna==3.7, importlib_resources==6.4.0, ipython==8.23.0, jedi==0.19.1, jiwer==3.0.3, joblib==1.4.0, julius==0.2.7, kiwisolver==1.4.5, lazy_loader==0.4, librosa==0.10.1, llvmlite==0.42.0, markdown2==2.4.13, markdown_it_py==3.0.0, matplotlib==3.8.4, matplotlib_inline==0.1.6, mdurl==0.1.2, mpmath==1.3.0, msgpack==1.0.8, multidict==6.0.5, multiprocess==0.70.16, networkx==3.3, numba==0.59.1, numpy==1.26.4, nvidia_cublas_cu12==12.1.3.1, nvidia_cuda_cupti_cu12==12.1.105, nvidia_cuda_nvrtc_cu12==12.1.105, nvidia_cuda_runtime_cu12==12.1.105, nvidia_cudnn_cu12==8.9.2.26, nvidia_cufft_cu12==11.0.2.54, nvidia_curand_cu12==10.3.2.106, nvidia_cusolver_cu12==11.4.5.107, nvidia_cusparse_cu12==12.1.0.106, nvidia_nccl_cu12==2.19.3, nvidia_nvjitlink_cu12==12.4.127, nvidia_nvtx_cu12==12.1.105, packaging==24.0, pandas==2.2.2, parler_tts==0.1, parso==0.8.4, pexpect==4.9.0, pillow==10.3.0, pip==24.0, platformdirs==4.2.0, pooch==1.8.1, prompt_toolkit==3.0.43, protobuf==3.19.6, psutil==5.9.8, ptyprocess==0.7.0, pure_eval==0.2.2, pyarrow==15.0.2, pyarrow_hotfix==0.6, pycparser==2.22, pygments==2.17.2, pyloudnorm==0.1.1, pyparsing==3.1.2, pystoi==0.4.1, python_dateutil==2.9.0.post0, pytz==2024.1, randomname==0.2.1, rapidfuzz==3.8.1, regex==2023.12.25, requests==2.31.0, responses==0.18.0, rich==13.7.1, safetensors==0.4.2, scikit_learn==1.4.2, scipy==1.13.0, sentencepiece==0.2.0, sentry_sdk==1.45.0, setproctitle==1.3.3, setuptools==69.2.0, six==1.16.0, smmap==5.0.1, soundfile==0.12.1, soxr==0.3.7, stack_data==0.6.3, sympy==1.12, tensorboard==2.16.2, tensorboard_data_server==0.7.2, termcolor==2.4.0, threadpoolctl==3.4.0, tokenizers==0.15.2, torch==2.2.2, torch_stoi==0.2.1, torchaudio==2.2.2, tqdm==4.66.2, traitlets==5.14.2, transformers==4.39.3, triton==2.2.0, typing_extensions==4.11.0, tzdata==2024.1, urllib3==2.2.1, wandb==0.16.6, wcwidth==0.2.13, werkzeug==3.0.2, wheel==0.43.0, xxhash==3.4.1, yarl==1.9.4
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
(base) gwen@GwenSeidr:~/2/parler-tts$ source parler_tts_env/bin/activate
(parler_tts_env) (base) gwen@GwenSeidr:~/2/parler-tts$ source parler_tts_env/bin/activate
(parler_tts_env) (base) gwen@GwenSeidr:~/2/parler-tts$ python helpers/model_init_scripts/init_model_600M.py ./parler-tts-untrained-600M --text_model "google/flan-t5-base" --audio_model "parler-tts/dac_44khZ_8kbps"
num_codebooks 9
/home/gwen/2/parler-tts/parler_tts_env/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Removed shared tensor {'text_encoder.encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading

Zero-Shot Voice Cloning

Hi,
I know this library is primarily for text -> voice but do you know if it would be possible to modify it to accept a speaker embedding and perform zero-shot voice cloning?
Thanks!

How could I make this work in Spanish?

It would be very nice if it also worked in Spanish. I know it should be possible, but I don't know how to do it with my technical knowledge...

Streaming support?

Is there any streaming support for this model? If there is a way to do it, I would love to get involved and help out!

[show and tell] apple mps support

With newer PyTorch (2.4 nightly) we get bfloat16 support in MPS.

I tested this:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "mps:0"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device=device, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "welcome to huggingface"
description = "An old man."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device=device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device=device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Feature improvements

Hello, I tried the Parler-TTS Mini model and it exceeded my expectations with very good results.

However, I have some questions and possible suggestions for improvement:

  1. Will there be a multi-lingual version available, such as support for Mandarin?
  2. Currently, the accuracy of numbers and punctuation marks is not very good, and there are instances where words are dropped in sentences. Will these issues be addressed in future versions?

Trouble pronouncing dates

I found that the model (here: the Jenny model, but I found the same issue with ParlerTTS mini) seems to have trouble pronouncing years and numbers. For example:

"The Crusaders marched through Eastern Europe, gathering support and supplies along the way, before reaching Constantinople in 1097."

(Audio attachment: TTS_stumbles_on_numbers_00001-audio.mp4)
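A hedged workaround sketch for this class of problem: normalize digits to words before synthesis. The num2words package used here is a third-party assumption, not part of Parler-TTS.

import re
from num2words import num2words  # third-party: pip install num2words

def normalize_numbers(text: str) -> str:
    # Replace each run of digits with its spoken form; years could instead use
    # num2words(n, to="year") for "ten ninety-seven"-style readings.
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(normalize_numbers("reaching Constantinople in 1097"))  # e.g. "... one thousand and ninety-seven"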

[Question] "I got strongly recommended to pass the `sampling_rate` argument to this function... " Is this expected?

Hi,

I'm trying to retrain the model, and I got this message during the "Encode the audio samples" step:

"It is strongly recommended to pass the sampling_rate argument to this function." Is this expected?

Then, after about 8%, it stopped with the error Signal 11 (SIGSEGV) received.

Is there any clue?

really appreciate your help 🙏


[rank1]: RuntimeError: DataLoader worker (pid 66405) is killed by signal: Segmentation fault.
W0507 11:19:08.585000 134185846527040 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 65464 closing signal SIGTERM
E0507 11:19:09.022000 134185846527040 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 1 (pid: 65465) of binary: /home/ys/anaconda3/envs/parler/bin/python
Traceback (most recent call last):

...
...

=======================================================
training/run_parler_tts_training.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-07_11:19:08
  host      : trainer
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 65465)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 65465
=======================================================

Save checkpoints as usable models

Hey everyone,

I am trying to fine-tune a model. I ran into overfitting after some training. Now I want to save a previous checkpoint as my model. As far as I can see, you are using safetensors models when using the ParlerTTSForConditionalGeneration.from_pretrained() method.

I cannot find an easy way to load and save a checkpoint without starting a new training run. Do you have any suggestions?

Thank you :)
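A minimal sketch of one way to do this, assuming the checkpoint directory contains the model weights and config written during training (the paths below are hypothetical):

from parler_tts import ParlerTTSForConditionalGeneration

# Load the intermediate checkpoint, then re-save it as a standalone model.
model = ParlerTTSForConditionalGeneration.from_pretrained("./output_dir/checkpoint-5000")
model.save_pretrained("./my_finetuned_model")  # reloadable later with from_pretrained()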

Need the ability to save/re-use a generated voice

We use TTS in an eLearning environment where we generate hundreds of videos per year. All of these videos must use the same exact voice for consistency.

To use Parler-TTS I'd need to be able to generate a voice (based upon a description), save it, then use it across multiple TTS sessions. We currently use Google's TTS api which allows me to select from a list of voices so that all of my TTS audio sounds exactly like the same speaker.

Unable to get it to run on the gpu.

Hey, I was trying to run the code in a virtual Python env, and the TTS doesn't seem to use the GPU on my system.

Do we need to have the CUDA toolkit installed for the GPU to be used?
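A quick sanity check, assuming a standard PyTorch install: the sample script silently falls back to CPU whenever CUDA is unavailable, which happens with a CPU-only PyTorch build even if the CUDA toolkit itself is installed.

import torch

print(torch.cuda.is_available())  # False -> the snippet's device falls back to "cpu"
print(torch.version.cuda)         # None indicates a CPU-only PyTorch build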

pls use .mp3 in soundfile output

Please put the .mp3 extension in the Usage soundfile output example, i.e.

sf.write("parler-tts.mp3",...

The soundfile docs say MP3 has been supported since 2022-06, and it doesn't seem to be responsible for the ** when the prompt is longer than a sentence or two.
Regards
G.

sampling rate issue

Great work!

When running the DAC token extraction stage of the training script with the default hyperparams, I got this warning:

It is strongly recommended to pass the sampling_rate argument to this function. Failing to do so can result in silent errors that might be hard to debug.

I checked the feature_extractor.sampling_rate that gets passed to load_multiple_datasets, and it's indeed 44100 Hz.

Just want to make sure this is expected.

Thanks!
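For context, a minimal sketch of the call the warning refers to; the extractor is constructed directly here purely for illustration (the training script loads its own from the DAC checkpoint), and passing sampling_rate explicitly is what silences the warning:

import numpy as np
from transformers import EncodecFeatureExtractor

feature_extractor = EncodecFeatureExtractor(sampling_rate=44_100)
audio = np.zeros(44_100, dtype=np.float32)  # one second of silence as a stand-in
# Passing sampling_rate lets the extractor check for rate mismatches instead of warning.
inputs = feature_extractor(audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")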

Add to HF Pipeline

Hi,
It would be nice to be able to use this via the text-to-speech pipeline.
Thanks!

Poor quality when batch inferencing

code as below:

prompt1 = "Hey, how are you doing today?"
prompt2 = "Hey, good."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer([description, description], return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer([prompt1, prompt2], padding=True, truncation=True,  return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
for i in range(2):
    sf.write(f"{i}_parler_tts_out.wav", audio_arr[i].squeeze(), model.config.sampling_rate)

The results seem unstable.
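One hedged guess at the cause: without attention masks, the padded prompt tokens are attended to during generation. A sketch of passing them (the prompt_attention_mask parameter name is taken from the modeling code and should be double-checked):

# Hedged sketch: pass attention masks so padding is ignored during generation.
inputs = tokenizer([description, description], padding=True, return_tensors="pt").to(device)
prompts = tokenizer([prompt1, prompt2], padding=True, truncation=True, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    prompt_input_ids=prompts.input_ids,
    prompt_attention_mask=prompts.attention_mask,  # name assumed; verify against modeling_parler_tts.py
)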

Descript Audio Codec selection

The currently selected model is the 44 kHz one, which has 9 codebooks. However, according to the tests at Codec-SUPERB, the decoder of the 24 kHz model with 32 codebooks performs better. Is it possible to replace the 9 codebooks with 32? If so, would it be difficult to train a decoder-only model?

How to work with datasets (a request for examples)

Hello. I wanted to ask:

The recipe says a lot about the requirements for the dataset and, as I understand it, a fairly advanced technology stack is used to assemble the dataset and train the model.

But I think it will generally be difficult for novice users (like me) to understand how to compose a dataset and how to feed it to the script in order to build their own model, or one based on yours.

There is no clear instruction or tool that would help people deal with their wav or mp3 files automatically, without unnecessary intervention.

Not everyone can use this technology stack, and I wish there were an easier step-by-step recipe, or examples of the steps on Google Colab showing how you do it.

It's difficult for me to immediately understand what needs to be done, because I personally, like many who have looked here, have not used Parquet tables or Data-Speech, nor much else that members of the Hugging Face community use.
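For readers with the same question, a hedged sketch of one possible starting point (not the official recipe): the datasets library's audiofolder loader turns a folder of audio files plus a metadata.csv into a Hugging Face dataset. The folder layout, column names, and repo id below are hypothetical, and Data-Speech is still needed to produce the descriptions.

from datasets import Audio, load_dataset

# Hypothetical layout: ./my_voice_data/ holds wav/mp3 files plus a metadata.csv
# with columns: file_name, text (the transcript for each file).
dataset = load_dataset("audiofolder", data_dir="./my_voice_data")
dataset = dataset.cast_column("audio", Audio(sampling_rate=44_100))
dataset.push_to_hub("my-username/my-voice-dataset")  # hypothetical repo id; requires `huggingface-cli login`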

Is there a way to create consistent voices?

I want to make an app that reads long texts in chunks. For this I need to get the same voice for the same speaker prompt. Right now I get similar, but still not identical, voices on each generation. Is it possible to somehow fix the voice?
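A hedged workaround (there is no official saved-voice API at the moment): seed the RNGs before each generation so sampling is deterministic. This makes output repeatable for identical inputs on the same hardware, but it is not a true persisted voice. The snippet reuses model and the tokenized inputs from the Usage example.

from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and torch RNGs before sampling
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)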

Does it support French?

I recently stumbled upon your project and I'm excited about its potential.
I'm wondering if there are any plans to add French language support in the future.

Benchmarks of parler-tts, the emergence of TTS!

Hey @sanchit-gandhi, I like the repo. Excited to see this being worked on. Here's a benchmark against WhisperSpeech: I used your sample script on the exact same text snippet, and it finished processing in 16.04 seconds. However, this repo runs in float32, while I think WhisperSpeech runs in float16. Can you provide me with the modification to run in float16, or even bfloat16? I'm going to do a comparison of this, Bark, and WhisperSpeech:

(Screenshot: WhisperSpeech benchmark)

I want to add that this says nothing about the quality, only speed. I'll evaluate quality next after I ensure comparable testing procedures regarding compute time. Here's the script I used:

import time
import sounddevice as sd
import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

# Setup device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# Prepare input
prompt = "This script processes a body of text one sentence at a time and plays them consecutively. This enables the audio playback to begin sooner instead of waiting for the entire body of text to be processed. The script uses the threading and queue modules that are part of the standard Python library. It also uses the sound device library, which is fairly reliable across different platforms. I hope you enjoy, and feel free to modify or distribute at your pleasure."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Start timer
start_time = time.time()

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# End timer
end_time = time.time()
processing_time = end_time - start_time

# Print processing time in green
print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")

sampling_rate = model.config.sampling_rate
sd.play(audio_arr, samplerate=sampling_rate)
sd.wait()

Lastly, let me know what other speedups I can use, such as BetterTransformer, which I think is part of torch now unless I'm mistaken. I can't test FA2 unless you help me install it. I've tried.
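A hedged sketch of half-precision loading (torch_dtype is a standard transformers from_pretrained argument; any speed or quality impact on Parler-TTS is untested here):

import torch
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1",
    torch_dtype=torch.float16,  # or torch.bfloat16 on Ampere+ GPUs
).to("cuda:0")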
