
parler-tts's Introduction

Parler-TTS

Parler-TTS is a lightweight text-to-speech (TTS) model that can generate high-quality, natural-sounding speech in the style of a given speaker (gender, pitch, speaking style, etc.). It is a reproduction of the work from the paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations by Dan Lyth and Simon King, of Stability AI and the University of Edinburgh respectively.

Unlike other TTS models, Parler-TTS is a fully open-source release. All of the datasets, pre-processing, training code, and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.

This repository contains the inference and training code for Parler-TTS. It is designed to accompany the Data-Speech repository for dataset annotation.

Important

We're proud to release Parler-TTS Mini v0.1, our first 600M parameter model, trained on 10.5K hours of audio data. In the coming weeks, we'll be working on scaling up to 50k hours of data, in preparation for the v1 model.

📖 Quick Index

Installation

Parler-TTS has lightweight dependencies and can be installed in one line:

pip install git+https://github.com/huggingface/parler-tts.git

Usage

Tip

You can directly try it out in an interactive demo here!

Using Parler-TTS is as simple as "bonjour". Use the following inference snippet:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "Hey, how are you doing today?"
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
(Demo video: Yoach.Lacombe.s.Video.-.Apr.10.2024.1.mp4)

Training


The training folder contains all the information needed to train or fine-tune your own Parler-TTS model.

Important

TL;DR: After following the installation steps, you can reproduce the Parler-TTS Mini v0.1 training recipe with the following command line:

accelerate launch ./training/run_parler_tts_training.py ./helpers/training_configs/starting_point_0.01.json
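Note: this assumes Hugging Face Accelerate has already been configured on the machine. If it hasn't, run the one-time setup first (standard Accelerate usage, not a Parler-TTS-specific step):

accelerate config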

Acknowledgements

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!


Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
@misc{lyth2024natural,
      title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
      author={Dan Lyth and Simon King},
      year={2024},
      eprint={2402.01912},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Contribution

Contributions are welcome, as the project offers many possibilities for improvement and exploration.

Namely, we're looking at ways to improve both quality and speed:

  • Datasets:
    • Train on more data
    • Add more features such as accents
  • Training:
    • Add PEFT compatibility for LoRA fine-tuning.
    • Add the possibility to train without a description column.
    • Add notebook training.
    • Explore multilingual training.
    • Explore mono-speaker finetuning.
    • Explore more architectures.
  • Optimization:
    • Compilation and static cache
    • Support for FA2 and SDPA
  • Evaluation:
    • Add more evaluation metrics

parler-tts's People

Contributors

sanchit-gandhi, ylacombe



parler-tts's Issues

Model stumbling on its words

Running the following code:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

description = "A male speaker with a low-pitched voice delivering his words at a slow pace in a small, confined space with a very clear audio and an animated tone."
prompt = "In the annals of history, the ink that drafted peace often dried under the shadow of future conflicts. Today, we dive deep into the bottom 10 worst peace treaties ever signed, the naive hopes and the grim repercussions they bore, unraveling a tapestry of unintended consequences that would haunt nations for generations. From agreements that sowed the seeds of resentments leading to catastrophic wars, to those that carved up continents disregarding the people who lived there, we explore how peace can sometimes lead to anything but."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("output.wav", audio_arr, model.config.sampling_rate)

Outputs the .wav file posted at this link: http://sndup.net/vzyp

How can I get it to correctly output the prompt text? Is my prompt too large? Am I using the model incorrectly? Thank you!
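One hedged workaround sketch (not an official recommendation): very long prompts may exceed what Parler-TTS Mini v0.1 handles reliably, so splitting the text into sentences and concatenating the audio often helps. The snippet below reuses model, tokenizer, device, description, and prompt from the code above.

import re
import numpy as np
import soundfile as sf

# Hedged workaround: synthesize sentence-by-sentence and concatenate the audio.
sentences = re.split(r"(?<=[.!?])\s+", prompt)
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

chunks = []
for sentence in sentences:
    prompt_input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    chunks.append(generation.cpu().numpy().squeeze())

sf.write("output.wav", np.concatenate(chunks), model.config.sampling_rate)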

error regarding some tokenizer issue

When I run the sample script, I keep getting this error message among others... I'm not sure how dire it is or whether it even impacts performance:

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers

Training on a NEW language

Suppose we want to train this TTS model on a new language whose tokens are not in the Flan-T5 tokenizer. Can I simply change the name of the tokenizer in config.json, or do I have to make code changes as well? Note: the new tokenizer will not be a FLAN-T5 one.

Won't work

First of all, congrats on your accomplishments!

I must be doing something wrong, because I can't get it to work. I want to install it in my textgenwebui environment, but I get this error:

C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\torch\nn\utils\weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Using the model-agnostic default `max_length` (=2580) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
Calling `sample` directly is deprecated and will be removed in v4.41. Use `generate` or a custom generation loop instead.
--- Logging error ---
Traceback (most recent call last):
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 1110, in emit
    msg = self.format(record)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 953, in format
    return fmt.format(record)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 687, in format
    record.message = record.getMessage()
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\logging\__init__.py", line 377, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\text-generation-webui-snapshot-2024-04-21\snippet.py", line 17, in <module>
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 2608, in generate
    outputs = self.sample(
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2584, in sample
    return self._sample(*args, **kwargs)
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2730, in _sample
    logger.warning_once(
  File "C:\text-generation-webui-snapshot-2024-04-21\installer_files\env\Lib\site-packages\transformers\utils\logging.py", line 329, in warning_once
    self.warning(*args, **kwargs)
Message: '`eos_token_id` is deprecated in this function and will be removed in v4.41, use `stopping_criteria=StoppingCriteriaList([EosTokenCriteria(eos_token_id=eos_token_id)])` instead. Otherwise make sure to set `model.generation_config.eos_token_id`'
Arguments: (<class 'FutureWarning'>,)

It is super vague, and I don't know where to look next.
My current versions: Python 3.11, torch 2.2.1+cu121, transformers 4.40.0.

can anyone point me in the right direction?
thanks for your time!

Running produces output like this... failure

(base) gwen@GwenSeidr:~/2/parler-tts$ virtualenv parler_tts_env
created virtual environment CPython3.10.12.final.0-64 in 328ms
creator CPython3Posix(dest=/home/gwen/2/parler-tts/parler_tts_env, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/gwen/.local/share/virtualenv)
added seed packages: GitPython==3.1.43, Jinja2==3.1.3, Markdown==3.6, MarkupSafe==2.1.5, PyYAML==6.0.1, absl_py==2.1.0, accelerate==0.29.2, aiohttp==3.9.4, aiosignal==1.3.1, appdirs==1.4.4, argbind==0.3.7, asttokens==2.4.1, async_timeout==4.0.3, attrs==23.2.0, audioread==3.0.1, certifi==2024.2.2, cffi==1.16.0, charset_normalizer==3.3.2, click==8.1.7, contourpy==1.2.1, cycler==0.12.1, datasets==2.18.0, decorator==5.1.1, descript_audio_codec==1.0.0, descript_audiotools==0.7.2, dill==0.3.8, docker_pycreds==0.4.0, docstring_parser==0.16, einops==0.7.0, evaluate==0.4.1, exceptiongroup==1.2.0, executing==2.0.1, ffmpy==0.3.2, filelock==3.13.4, fire==0.6.0, flatten_dict==0.4.2, fonttools==4.51.0, frozenlist==1.4.1, fsspec==2024.2.0, future==1.0.0, gitdb==4.0.11, grpcio==1.62.1, huggingface_hub==0.22.2, idna==3.7, importlib_resources==6.4.0, ipython==8.23.0, jedi==0.19.1, jiwer==3.0.3, joblib==1.4.0, julius==0.2.7, kiwisolver==1.4.5, lazy_loader==0.4, librosa==0.10.1, llvmlite==0.42.0, markdown2==2.4.13, markdown_it_py==3.0.0, matplotlib==3.8.4, matplotlib_inline==0.1.6, mdurl==0.1.2, mpmath==1.3.0, msgpack==1.0.8, multidict==6.0.5, multiprocess==0.70.16, networkx==3.3, numba==0.59.1, numpy==1.26.4, nvidia_cublas_cu12==12.1.3.1, nvidia_cuda_cupti_cu12==12.1.105, nvidia_cuda_nvrtc_cu12==12.1.105, nvidia_cuda_runtime_cu12==12.1.105, nvidia_cudnn_cu12==8.9.2.26, nvidia_cufft_cu12==11.0.2.54, nvidia_curand_cu12==10.3.2.106, nvidia_cusolver_cu12==11.4.5.107, nvidia_cusparse_cu12==12.1.0.106, nvidia_nccl_cu12==2.19.3, nvidia_nvjitlink_cu12==12.4.127, nvidia_nvtx_cu12==12.1.105, packaging==24.0, pandas==2.2.2, parler_tts==0.1, parso==0.8.4, pexpect==4.9.0, pillow==10.3.0, pip==24.0, platformdirs==4.2.0, pooch==1.8.1, prompt_toolkit==3.0.43, protobuf==3.19.6, psutil==5.9.8, ptyprocess==0.7.0, pure_eval==0.2.2, pyarrow==15.0.2, pyarrow_hotfix==0.6, pycparser==2.22, pygments==2.17.2, pyloudnorm==0.1.1, pyparsing==3.1.2, pystoi==0.4.1, python_dateutil==2.9.0.post0, pytz==2024.1, randomname==0.2.1, rapidfuzz==3.8.1, regex==2023.12.25, requests==2.31.0, responses==0.18.0, rich==13.7.1, safetensors==0.4.2, scikit_learn==1.4.2, scipy==1.13.0, sentencepiece==0.2.0, sentry_sdk==1.45.0, setproctitle==1.3.3, setuptools==69.2.0, six==1.16.0, smmap==5.0.1, soundfile==0.12.1, soxr==0.3.7, stack_data==0.6.3, sympy==1.12, tensorboard==2.16.2, tensorboard_data_server==0.7.2, termcolor==2.4.0, threadpoolctl==3.4.0, tokenizers==0.15.2, torch==2.2.2, torch_stoi==0.2.1, torchaudio==2.2.2, tqdm==4.66.2, traitlets==5.14.2, transformers==4.39.3, triton==2.2.0, typing_extensions==4.11.0, tzdata==2024.1, urllib3==2.2.1, wandb==0.16.6, wcwidth==0.2.13, werkzeug==3.0.2, wheel==0.43.0, xxhash==3.4.1, yarl==1.9.4
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
(base) gwen@GwenSeidr:~/2/parler-tts$ source parler_tts_env/bin/activate
(parler_tts_env) (base) gwen@GwenSeidr:~/2/parler-tts$ source parler_tts_env/bin/activate
(parler_tts_env) (base) gwen@GwenSeidr:~/2/parler-tts$ python helpers/model_init_scripts/init_model_600M.py ./parler-tts-untrained-600M --text_model "google/flan-t5-base" --audio_model "parler-tts/dac_44khZ_8kbps"
num_codebooks 9
/home/gwen/2/parler-tts/parler_tts_env/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Removed shared tensor {'text_encoder.encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading

Zero-Shot Voice Cloning

Hi,
I know this library is primarily for text -> voice but do you know if it would be possible to modify it to accept a speaker embedding and perform zero-shot voice cloning?
Thanks!

How could I make this work in Spanish?

It would be very nice if it also worked in Spanish. I know it should be possible, but I don't know how to do it with my technical knowledge...

Streaming support?

Is there any streaming support for this model? If there is a way to do it, I would love to get involved and help out!

[show and tell] apple mps support

With newer PyTorch (2.4 nightly) we get bfloat16 support in MPS.

I tested this:

from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
import torch

device = "mps:0"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device=device, dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

prompt = "welcome to huggingface"
description = "An old man."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device=device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device=device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.to(torch.float32).cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

Feature improvements

Hello, I tried the Parler-TTS Mini model and it exceeded my expectations with very good results.

However, I have some questions and possible suggestions for improvement:

  1. Will there be a multi-lingual version available, such as support for Mandarin?
  2. Currently, the accuracy of numbers and punctuation marks is not very good, and there are instances where words are dropped in sentences. Will these issues be addressed in future versions?

Trouble pronouncing dates

I found that the model (here: the Jenny model, but I found the same issue with ParlerTTS mini) seems to have trouble pronouncing years and numbers. For example:

"The Crusaders marched through Eastern Europe, gathering support and supplies along the way, before reaching Constantinople in 1097."

(Audio attachment: TTS_stumbles_on_numbers_00001-audio.mp4)
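A hedged workaround sketch for this class of problem: normalize digits to words before synthesis. The num2words package used here is a third-party assumption, not part of Parler-TTS.

import re
from num2words import num2words  # third-party: pip install num2words

def normalize_numbers(text: str) -> str:
    # Replace each run of digits with its spoken form; years could instead use
    # num2words(n, to="year") for "ten ninety-seven"-style readings.
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), text)

print(normalize_numbers("reaching Constantinople in 1097"))  # e.g. "... one thousand and ninety-seven"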

[Question] "I got strongly recommended to pass the `sampling_rate` argument to this function... " Is this expected?

Hi,

I'm trying to retrain the model, and I got this message during the "Encode the audio samples" step:

"It is strongly recommended to pass the sampling_rate argument to this function." Is this expected?

Then, after about 8%, it stopped with the error Signal 11 (SIGSEGV) received.

Is there any clue?

really appreciate your help 🙏


[rank1]: RuntimeError: DataLoader worker (pid 66405) is killed by signal: Segmentation fault.
W0507 11:19:08.585000 134185846527040 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 65464 closing signal SIGTERM
E0507 11:19:09.022000 134185846527040 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 1 (pid: 65465) of binary: /home/ys/anaconda3/envs/parler/bin/python
Traceback (most recent call last):

...
...

=======================================================
training/run_parler_tts_training.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-07_11:19:08
  host      : trainer
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 65465)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 65465
=======================================================

Save checkpoints as usable models

Hey everyone,

I am trying to fine-tune a model. I ran into overfitting after some training. Now I want to save a previous checkpoint as my model. As far as I can see, you are using safetensors models when using the ParlerTTSForConditionalGeneration.from_pretrained() method.

I cannot find an easy way to load and save a checkpoint without starting a new training run. Do you have any suggestions?

Thank you :)
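A minimal sketch of one way to do this, assuming the checkpoint directory contains the model weights and config written during training (the paths below are hypothetical):

from parler_tts import ParlerTTSForConditionalGeneration

# Load the intermediate checkpoint, then re-save it as a standalone model.
model = ParlerTTSForConditionalGeneration.from_pretrained("./output_dir/checkpoint-5000")
model.save_pretrained("./my_finetuned_model")  # reloadable later with from_pretrained()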

Need the ability to save/re-use a generated voice

We use TTS in an eLearning environment where we generate hundreds of videos per year. All of these videos must use the same exact voice for consistency.

To use Parler-TTS I'd need to be able to generate a voice (based upon a description), save it, then use it across multiple TTS sessions. We currently use Google's TTS api which allows me to select from a list of voices so that all of my TTS audio sounds exactly like the same speaker.

Unable to get it to run on the gpu.

Hey, I was trying to run the code in a virtual Python env, and the TTS doesn't seem to use the GPU on my system.

Do we need to have the CUDA toolkit installed for the GPU to be used?
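A quick sanity check, assuming a standard PyTorch install: the sample script silently falls back to CPU whenever CUDA is unavailable, which happens with a CPU-only PyTorch build even if the CUDA toolkit itself is installed.

import torch

print(torch.cuda.is_available())  # False -> the snippet's device falls back to "cpu"
print(torch.version.cuda)         # None indicates a CPU-only PyTorch build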

pls use .mp3 in soundfile output

Please put the .mp3 extension in the Usage soundfile output example, i.e.

sf.write("parler-tts.mp3",...

The soundfile docs say MP3 has been supported since 2022-06, and it doesn't seem to be responsible for the ** when the prompt is longer than a sentence or two.
Regards
G.

sampling rate issue

Great work!

When running the DAC token extraction stage of the training script with the default hyperparams, I got this warning:

It is strongly recommended to pass the sampling_rate argument to this function. Failing to do so can result in silent errors that might be hard to debug.

I checked the feature_extractor.sampling_rate that gets passed to load_multiple_datasets, and it's indeed 44100 Hz.

Just want to make sure this is expected.

Thanks!
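For context, a minimal sketch of the call the warning refers to; the extractor is constructed directly here purely for illustration (the training script loads its own from the DAC checkpoint), and passing sampling_rate explicitly is what silences the warning:

import numpy as np
from transformers import EncodecFeatureExtractor

feature_extractor = EncodecFeatureExtractor(sampling_rate=44_100)
audio = np.zeros(44_100, dtype=np.float32)  # one second of silence as a stand-in
# Passing sampling_rate lets the extractor check for rate mismatches instead of warning.
inputs = feature_extractor(audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")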

Add to HF Pipeline

Hi,
It would be nice to be able to use this via the text-to-speech pipeline.
Thanks!

Poor quality when batch inferencing

code as below:

prompt1 = "Hey, how are you doing today?"
prompt2 = "Hey, good."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

input_ids = tokenizer([description, description], return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer([prompt1, prompt2], padding=True, truncation=True,  return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
for i in range(2):
    sf.write(f"{i}_parler_tts_out.wav", audio_arr[i].squeeze(), model.config.sampling_rate)

The results seem unstable.
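One hedged guess at the cause: without attention masks, the padded prompt tokens are attended to during generation. A sketch of passing them (the prompt_attention_mask parameter name is taken from the modeling code and should be double-checked):

# Hedged sketch: pass attention masks so padding is ignored during generation.
inputs = tokenizer([description, description], padding=True, return_tensors="pt").to(device)
prompts = tokenizer([prompt1, prompt2], padding=True, truncation=True, return_tensors="pt").to(device)

generation = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    prompt_input_ids=prompts.input_ids,
    prompt_attention_mask=prompts.attention_mask,  # name assumed; verify against modeling_parler_tts.py
)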

Descript Audio Codec selection

The currently selected model is the 44 kHz one, which has 9 codebooks. However, according to the tests at Codec-SUPERB, the decoder of the 24 kHz model with 32 codebooks performs better. Is it possible to replace the 9 codebooks with 32? If so, would it be difficult to train a decoder-only model?

How to work with datasets (a request for examples)

Hello. I wanted to ask:

The recipe says a lot about the requirements for the dataset and, as I understand it, a fairly advanced technology stack is used to assemble the dataset and train the model.

But I think it will generally be difficult for novice users (like me) to understand how to compose a dataset and how to feed it to the script in order to build their own model, or one based on yours.

There is no clear instruction or tool that would help people deal with their wav or mp3 files automatically, without unnecessary intervention.

Not everyone can use this technology stack, and I wish there were an easier step-by-step recipe, or examples of the steps on Google Colab showing how you do it.

It's difficult for me to immediately understand what needs to be done, because I personally, like many who have looked here, have not used Parquet tables or Data-Speech, nor much else that members of the Hugging Face community use.
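For readers with the same question, a hedged sketch of one possible starting point (not the official recipe): the datasets library's audiofolder loader turns a folder of audio files plus a metadata.csv into a Hugging Face dataset. The folder layout, column names, and repo id below are hypothetical, and Data-Speech is still needed to produce the descriptions.

from datasets import Audio, load_dataset

# Hypothetical layout: ./my_voice_data/ holds wav/mp3 files plus a metadata.csv
# with columns: file_name, text (the transcript for each file).
dataset = load_dataset("audiofolder", data_dir="./my_voice_data")
dataset = dataset.cast_column("audio", Audio(sampling_rate=44_100))
dataset.push_to_hub("my-username/my-voice-dataset")  # hypothetical repo id; requires `huggingface-cli login`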

Is there a way to create consistent voices?

I want to make an app that reads long texts in chunks. For this I need to get the same voice for the same speaker prompt. Right now I get similar, but still not identical, voices on each generation. Is it possible to somehow fix the voice?
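A hedged workaround (there is no official saved-voice API at the moment): seed the RNGs before each generation so sampling is deterministic. This makes output repeatable for identical inputs on the same hardware, but it is not a true persisted voice. The snippet reuses model and the tokenized inputs from the Usage example.

from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and torch RNGs before sampling
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)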

Does it support French?

I recently stumbled upon your project and I'm excited about its potential.
I'm wondering if there are any plans to add French language support in the future.

Benchmarks of parler-tts, the emergence of TTS!

Hey @sanchit-gandhi, I like the repo. Excited to see this being worked on. Here's a benchmark against WhisperSpeech: I used your sample script on the exact same text snippet, and it finished processing in 16.04 seconds. However, this repo runs in float32, while I think WhisperSpeech runs in float16. Can you provide me with the modification to run in float16, or even bfloat16? I'm going to do a comparison of this, Bark, and WhisperSpeech:

(Screenshot: WhisperSpeech benchmark)

I want to add that this says nothing about the quality, only speed. I'll evaluate quality next after I ensure comparable testing procedures regarding compute time. Here's the script I used:

import time
import sounddevice as sd
import torch
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

# Setup device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# Prepare input
prompt = "This script processes a body of text one sentence at a time and plays them consecutively. This enables the audio playback to begin sooner instead of waiting for the entire body of text to be processed. The script uses the threading and queue modules that are part of the standard Python library. It also uses the sound device library, which is fairly reliable across different platforms. I hope you enjoy, and feel free to modify or distribute at your pleasure."
description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality. She speaks very fast."

# Start timer
start_time = time.time()

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# End timer
end_time = time.time()
processing_time = end_time - start_time

# Print processing time in green
print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")

sampling_rate = model.config.sampling_rate
sd.play(audio_arr, samplerate=sampling_rate)
sd.wait()

Lastly, let me know what other speedups I can use, such as BetterTransformer, which I think is part of torch now unless I'm mistaken. I can't test FA2 unless you help me install it. I've tried.
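A hedged sketch of half-precision loading (torch_dtype is a standard transformers from_pretrained argument; any speed or quality impact on Parler-TTS is untested here):

import torch
from parler_tts import ParlerTTSForConditionalGeneration

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler_tts_mini_v0.1",
    torch_dtype=torch.float16,  # or torch.bfloat16 on Ampere+ GPUs
).to("cuda:0")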
