rhasspy / piper

A fast, local neural text to speech system

Home Page: https://rhasspy.github.io/piper-samples/

License: MIT License

Dockerfile 0.12% Makefile 0.03% CMake 0.32% C++ 72.95% C 0.11% Shell 0.15% Python 18.58% Cython 0.08% Jupyter Notebook 7.66%
speech-synthesis text-to-speech tts

piper's Introduction

Piper logo

A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4. Piper is used in a variety of projects.

echo 'Welcome to the world of speech synthesis!' | \
  ./piper --model en_US-lessac-medium.onnx --output_file welcome.wav

Listen to voice samples, and check out a video tutorial by Thorsten Müller.

Voices are trained with VITS and exported for use with onnxruntime.

This is a project of the Open Home Foundation.

Voices

Our goal is to support Home Assistant and the Year of Voice.

Download voices for the supported languages:

  • Arabic (ar_JO)
  • Catalan (ca_ES)
  • Czech (cs_CZ)
  • Danish (da_DK)
  • German (de_DE)
  • Greek (el_GR)
  • English (en_GB, en_US)
  • Spanish (es_ES, es_MX)
  • Finnish (fi_FI)
  • French (fr_FR)
  • Hungarian (hu_HU)
  • Icelandic (is_IS)
  • Italian (it_IT)
  • Georgian (ka_GE)
  • Kazakh (kk_KZ)
  • Luxembourgish (lb_LU)
  • Nepali (ne_NP)
  • Dutch (nl_BE, nl_NL)
  • Norwegian (no_NO)
  • Polish (pl_PL)
  • Portuguese (pt_BR, pt_PT)
  • Romanian (ro_RO)
  • Russian (ru_RU)
  • Serbian (sr_RS)
  • Swedish (sv_SE)
  • Swahili (sw_CD)
  • Turkish (tr_TR)
  • Ukrainian (uk_UA)
  • Vietnamese (vi_VN)
  • Chinese (zh_CN)

You will need two files per voice:

  1. A .onnx model file, such as en_US-lessac-medium.onnx
  2. A .onnx.json config file, such as en_US-lessac-medium.onnx.json

The MODEL_CARD file for each voice contains important licensing information. Piper is intended for text to speech research, and does not impose any additional restrictions on voice models. Some voices may have restrictive licenses, however, so please review them carefully!

Installation

You can run Piper with Python or download a binary release:

  • amd64 (64-bit desktop Linux)
  • arm64 (64-bit Raspberry Pi 4)
  • armv7 (32-bit Raspberry Pi 3/4)

If you want to build from source, see the Makefile and C++ source. You must download and extract piper-phonemize to lib/Linux-$(uname -m)/piper_phonemize before building. For example, lib/Linux-x86_64/piper_phonemize/lib/libpiper_phonemize.so should exist for AMD/Intel machines (as well as everything else from libpiper_phonemize-amd64.tar.gz).

Usage

  1. Download a voice and extract the .onnx and .onnx.json files
  2. Run the piper binary with text on standard input, --model /path/to/your-voice.onnx, and --output_file output.wav

For example:

echo 'Welcome to the world of speech synthesis!' | \
  ./piper --model en_US-lessac-medium.onnx --output_file welcome.wav

For multi-speaker models, use --speaker <number> to change speakers (default: 0).

See piper --help for more options.

Streaming Audio

Piper can stream raw audio to stdout as it's produced:

echo 'This sentence is spoken first. This sentence is synthesized while the first sentence is spoken.' | \
  ./piper --model en_US-lessac-medium.onnx --output-raw | \
  aplay -r 22050 -f S16_LE -t raw -

This is raw audio and not a WAV file, so make sure your audio player is set to play 16-bit mono PCM samples at the correct sample rate for the voice.
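
If you want to consume the raw stream programmatically instead of piping it to aplay, here is a minimal Python sketch (not part of Piper itself); it assumes a 22,050 Hz, 16-bit mono voice and uses illustrative paths, wrapping the chunks into a WAV file as they arrive:

import subprocess
import wave

# Start piper with --output-raw so 16-bit mono PCM is written to stdout as it is produced.
proc = subprocess.Popen(
    ["./piper", "--model", "en_US-lessac-medium.onnx", "--output-raw"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
proc.stdin.write("The first sentence is available while later ones are still being synthesized.".encode("utf-8"))
proc.stdin.close()

with wave.open("streamed.wav", "wb") as wav_file:
    wav_file.setnchannels(1)      # mono
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(22050)  # must match the voice's sample rate
    while True:
        chunk = proc.stdout.read(4096)
        if not chunk:
            break
        wav_file.writeframes(chunk)  # a real-time audio player could be fed here instead

proc.wait()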

JSON Input

The piper executable can accept JSON input when using the --json-input flag. Each line of input must be a JSON object with a text field. For example:

{ "text": "First sentence to speak." }
{ "text": "Second sentence to speak." }

Optional fields include:

  • speaker - string
    • Name of the speaker to use from speaker_id_map in config (multi-speaker voices only)
  • speaker_id - number
    • Id of speaker to use from 0 to number of speakers - 1 (multi-speaker voices only, overrides "speaker")
  • output_file - string
    • Path to output WAV file

The following example writes two sentences with different speakers to different files:

{ "text": "First speaker.", "speaker_id": 0, "output_file": "/tmp/speaker_0.wav" }
{ "text": "Second speaker.", "speaker_id": 1, "output_file": "/tmp/speaker_1.wav" }

People using Piper

Piper has been used in the following projects/papers:

Training

See the training guide and the source code.

Pretrained checkpoints are available on Hugging Face.

Running in Python

See src/python_run

Install with pip:

pip install piper-tts

and then run:

echo 'Welcome to the world of speech synthesis!' | piper \
  --model en_US-lessac-medium \
  --output_file welcome.wav

This will automatically download voice files the first time they're used. Use --data-dir and --download-dir to adjust where voices are found/downloaded.

If you'd like to use a GPU, install the onnxruntime-gpu package:

.venv/bin/pip3 install onnxruntime-gpu

and then run piper with the --cuda argument. You will need to have a functioning CUDA environment, such as what's available in NVIDIA's PyTorch containers.

piper's People

Contributors

evuraan, guysie, hubertlepicki, jarbasal, marty1885, mudler, mush42, pengzhendong, phyce, rishubil, rmcpantoja, set-soft, shulyaka, synesthesiam, t-mat, xxnessuxx


piper's Issues

Add Swiss German in multiple dialects

As a Swiss-German, I would love to have text spoken in it. If possible, even in one of the various dialects.
There is a freely available dataset with 3 hours of high quality speech with transcript in each of the most used swiss-german dialects (8 as of now). They can be found here: https://mtc.ethz.ch/publications/open-source/swiss-dial.html
And here is a sample implementation: https://stt4sg.fhnw.ch/tts
I will try to find out if we could use the dataset for piper and then might need some help building the models.

How to use zh-cn-huayan-low.onnx

I generated a Windows project in Visual Studio; the en-us-ryan-low.onnx model works properly, but zh-cn-huayan-low.onnx generates incorrect speech. How can I make the text encoding work properly?
Sound sample attached: sound.zip

docker ERROR:asyncio:Task exception was never retrieved

docker run -it -p 10200:10200 -v ./data:/data rhasspy/wyoming-piper --voice en-us-lessac-low

INFO:__main__:Downloading en-us-lessac-low to /data
INFO:__main__:Ready
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='Task-6' coro=<AsyncEventHandler.run() done, defined at /usr/local/lib/python3.9/dist-packages/wyoming/server.py:26> exception=JSONDecodeError('Expecting value: line 1 column 1 (char 0)')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/wyoming/server.py", line 28, in run
    event = await async_read_event(self.reader)
  File "/usr/local/lib/python3.9/dist-packages/wyoming/event.py", line 48, in async_read_event
    event_dict = json.loads(json_line)
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Glitch in German samples

The pronunciation of Phänomen in the German samples is way off; the stress should be on -men (with a long vowel).

CUDA out of memory

Hi, thanks for your great work! But when I trained the model on a V100 GPU, I encountered this error message:

RuntimeError: CUDA out of memory. Tried to allocate 1.56 GiB (GPU 0; 15.90 GiB total capacity; 13.66 GiB already allocated; 236.75 MiB free; 14.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I then changed the batch size to 16, 8, 4, and even 1, but the error still occurred.
Can you check this error?

Is there trouble in piper_train.preprocess?

Thank you for this project!
It is so important.

I am trying to check it out and use it, but I ran into the same problem on two machines.
piper_train.preprocess appears to freeze when I run the command below.
It has been running for more than 24 hours.

~/piper/src/python$ python3 -m piper_train.preprocess \
  --language en-us \
  --input-dir /home/hayq/piper/data/LJSpeech-1.1/ \
  --output-dir /home/hayq/piper/train/ \
  --dataset-format ljspeech \
  --sample-rate 22050
INFO:preprocess:13059 speakers detected
INFO:preprocess:Wrote dataset config
INFO:preprocess:Processing 13100 utterance(s) with 6 worker(s)

The worker processes appear to be defunct:

 200794 pts/1    Sl+    0:09 python3 -m piper_train.preprocess --language en-us --input-dir /home/hayq/piper/data/LJSpeech-1.1/ --output-dir /home/hayq/piper/train/ --dataset-format ljspeech --sample-rate 22050
 200806 pts/1    Z+     0:00 [python3] <defunct>
 200809 pts/1    Z+     0:00 [python3] <defunct>
 200812 pts/1    Z+     0:00 [python3] <defunct>
 200815 pts/1    Z+     0:00 [python3] <defunct>
 200816 pts/1    Z+     0:00 [python3] <defunct>
 200817 pts/1    Z+     0:00 [python3] <defunct>

Could you help me understand the reason for this problem?
Thank you in advance!

train fails with LJSpeech-1.1

When I run training, it fails with this error:
assert utt.speaker_id is not None, "Missing speaker id"
AssertionError: Missing speaker id
I thought it should work on the dataset referenced in the README.
The dataset was prepared with the scripts provided in the README.

2 notes!

  1. The piper_train.export_onnx script does not work properly as documented. You need to pass the parameters with a leading "--":
    parser.add_argument("--checkpoint", help="Path to model checkpoint (.ckpt)")
    parser.add_argument("--output", help="Path to output model (.onnx)")

  2. If you run piper with another model, add --config pointing to your config.json file:
    echo 'Մենք ողջունում ենք Ձեզ' | ./piper --model hymodel.onnx --config config.json --output_file test.wav

Feature: Support abbreviations

(I'm using the released standalone binary, so this may not be an issue when used with HA.)

Sensors in HA often have some form of unit text available (e.g. kph or °C), and when using the cloud TTS services, it usually expands them to their spoken version (kilometers per hour). Piper just says the letters themselves (k p h).

Intent guessing would be great, but realistically it would be nice to be able to provide a list of string replacements. That could also help with pronunciation of people or place names, by "invisibly" replacing them with a better string.

It could be handled by updating the text earlier in the pipeline, but it seems pretty specific to the speech output (and even per-voice in some cases.)
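
Until something like this exists in Piper itself, a rough workaround sketch (illustrative only, not a Piper feature) is to expand the replacements in Python before handing the text to piper; the replacement table and paths below are made up:

import re
import subprocess

# Illustrative replacement table; extend it per voice or per domain as needed.
REPLACEMENTS = {
    "kph": "kilometers per hour",
    "°C": "degrees Celsius",
}

def expand_abbreviations(text: str) -> str:
    for short, spoken in REPLACEMENTS.items():
        # Whole-token replacement so "kph" inside another word is left untouched.
        text = re.sub(rf"(?<!\w){re.escape(short)}(?!\w)", spoken, text)
    return text

text = expand_abbreviations("Wind speed is 12 kph at 21 °C.")
subprocess.run(
    ["./piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "speed.wav"],
    input=text.encode("utf-8"),
    check=True,
)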

Can I continue training from an ONNX model?

Hi there,
I have an ONNX model that I would like to continue training on, but I'm not sure if this is possible or if I need to convert it to CKPT format first. I understand that I can continue training from a CKPT checkpoint file, but I'm not sure if the same can be done with ONNX. Can someone please advise if it's possible to continue training from an existing ONNX model or if I should convert it to CKPT format before proceeding? Thank you in advance for your help.

Provide tensorboard logs

Could you please share your tensorboard logs for some trained representative models, like voice-en-us_lessac.tar.gz or voice-en-us_libritts.tar.gz (e.g. via https://tensorboard.dev), so that we can align our training progress with yours and know how much training time we may need?

Multi-speaker models

How do we know whether a model is a multi-speaker model (other than just trying to use it as one)? Is there any documentation for a given model stating whether it is single- or multi-speaker and, if multi-speaker, how many speakers it has?

Thanks!
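
One place to look is the voice's .onnx.json config file. A minimal sketch, assuming the config contains a num_speakers field and the speaker_id_map mentioned in the JSON-input section of the README (field names are an assumption and may differ per voice):

import json

# Path is illustrative; point it at the config that ships with your voice.
with open("your-voice.onnx.json", "r", encoding="utf-8") as f:
    config = json.load(f)

num_speakers = config.get("num_speakers", 1)
speaker_map = config.get("speaker_id_map", {})

print("num_speakers:", num_speakers)
if num_speakers > 1:
    print("speaker names:", ", ".join(sorted(speaker_map)))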

issue on training

Dear Michael,
Thanks for the reply. Which Debian Bullseye did you use for testing, 64-bit or 32-bit?
Also, I want to train Larynx 2 on Ubuntu 18.04. Is the training procedure the same as you described in your post?

[ASK] How to Train With Own Voice

I tried to train using my own voice from Mimic Recording Studio recordings.

Screenshot from 2023-04-17 13-41-21

but when I try this command

python3 -m piper_train.preprocess --language id-id --input-dir ./f72022ff-93d3-6492-3f58-0a7eabfa98db/ --output-dir ./training_dir/ --dataset-format mycroft --sample-rate 22050

This error appears

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/galaxeus/Documents/hexaminate/app/piper/src/python/piper_train/preprocess.py", line 320, in <module>
    main()
  File "/home/galaxeus/Documents/hexaminate/app/piper/src/python/piper_train/preprocess.py", line 94, in main
    assert num_utterances > 0, "No utterances found"
AssertionError: No utterances found

How do I solve this? What is wrong? Sorry, I'm still new to Python.

Generate accented speech

Is it possible to generate accented speech using larynx2?

This was a briefly mentioned feature in larynx and gruut, but I haven't seen any examples of it and was wondering if it's still supported.

Why can this program run in a plain Docker Ubuntu container?

The espeak-ng library depends on other libraries (like libsndfile and libpcaudio). I extracted the release files into an Ubuntu 18.04 Docker container, didn't install anything else, just ran the binary, and no error occurred.
Why? Can you tell me how you build libespeak-ng?

issue on training

hi,
another issue comes up during training:
ModuleNotFoundError: No module named 'larynx_train.vits.monotonic_align.monotonic_align'
Thanks for your help to resolve this issue.

Hello, is it possible to support Windows or Mac?

Many users like me want a local client on Windows; it would be very useful if it could be built on Windows!

(I would suggest letting users download onnxruntime from the official releases themselves and then link against it.)

I tried a bit: portaudio cannot be built on Windows, and the way the ONNX inference is written does not work on Windows:

model.hpp(46,78): error C2440: '<function-style-cast>': cannot convert from 'initializer list' to 'Ort::Session'

Same issue:

microsoft/onnxruntime#9001

Build fails with onnxruntime 1.13.1

Hi, wanted to check out larynx2 through nixpkgs and noticed that it doesn't compile with onnxruntime 1.13.1, but works with 1.12.1. Just a heads up.

larnyx> [ 50%] Building CXX object CMakeFiles/larynx.dir/main.cpp.o
larnyx> In file included from /build/source/src/cpp/larynx.hpp:13,
larnyx>                  from /build/source/src/cpp/main.cpp:15:
larnyx> /build/source/src/cpp/model.hpp: In function 'void larynx::loadModel(std::string, larynx::ModelSession&)':
larnyx> /build/source/src/cpp/model.hpp:57:22: error: 'struct Ort::Session' has no member named 'GetInputName'
larnyx>    57 |         session.onnx.GetInputName(i, session.allocator));
larnyx>       |                      ^~~~~~~~~~~~
larnyx> /build/source/src/cpp/model.hpp:62:22: error: 'struct Ort::Session' has no member named 'GetOutputName'
larnyx>    62 |         session.onnx.GetOutputName(i, session.allocator));
larnyx>       |                      ^~~~~~~~~~~~~
larnyx> make[2]: *** [CMakeFiles/larynx.dir/build.make:76: CMakeFiles/larynx.dir/main.cpp.o] Error 1
larnyx> make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/larynx.dir/all] Error 2
larnyx> make: *** [Makefile:91: all] Error 2

Am I correct to assume that the models are tied to the onnxruntime version?

What is the reason for unsqueezing audio in export_onnx.py script?

Hi, I'm diving into larynx2 and encountered an inconsistency in the shape of the audio generated by the torch and onnx scripts.

In infer_generator.py, the audio shape is something like (1, 1, 413696), which I interpret as (batch, channels, samples) (is that right?).

In infer_onnx.py, the audio shape is something like (1, 1, 1, 421376), which I can no longer interpret, so I looked into the export script and saw the line below. Can you help me understand the reason for unsqueezing the audio in this script?

https://github.com/rhasspy/larynx2/blob/acc3068176feb18f399b95502e7ea5ad01ed6275/src/python/larynx_train/export_onnx.py#L67

And since the time axis of the output is 2, I wonder if this is a mistake:
https://github.com/rhasspy/larynx2/blob/acc3068176feb18f399b95502e7ea5ad01ed6275/src/python/larynx_train/export_onnx.py#L99

larynx2 version: latest 25f3f89 commit

Any way to see progress?

It would be great, when synthesizing a long portion of text, to be able to see the current progress. Is this possible somehow?

Windows support

Would be nice if this would officially support Windows. So far, there's only a release for "desktop Linux" and "Raspberry Pi 4", not for Windows.

Conflicting onnx OPSET_VERSION between export_onnx.py and requirements.txt

When exporting the PyTorch generator model to ONNX, I encountered this error:

  File "/home/trungle/opt/anaconda3/envs/larynx2/lib/python3.10/site-packages/torch/onnx/__init__.py", line 305, in export
    return utils.export(model, args, f, export_params, verbose, training,
  File "/home/trungle/opt/anaconda3/envs/larynx2/lib/python3.10/site-packages/torch/onnx/utils.py", line 118, in export
    _export(model, args, f, export_params, verbose, training, input_names, output_names,
  File "/home/trungle/opt/anaconda3/envs/larynx2/lib/python3.10/site-packages/torch/onnx/utils.py", line 699, in _export
    _set_opset_version(opset_version)
  File "/home/trungle/opt/anaconda3/envs/larynx2/lib/python3.10/site-packages/torch/onnx/symbolic_helper.py", line 853, in _set_opset_version
    raise ValueError("Unsupported ONNX opset version: " + str(opset_version))
ValueError: Unsupported ONNX opset version: 16

The reason is that torch~=1.11.0 does not actually support OPSET_VERSION = 16:

https://github.com/rhasspy/larynx2/blob/25f3f89bd8e6904cf1ee75649b675e795b024add/src/python/larynx_train/export_onnx.py#L13

https://github.com/pytorch/pytorch/blob/v1.11.0/torch/onnx/symbolic_helper.py#L839

my pytorch version: 1.11.0+cu102
larynx2: current latest 25f3f89

Adding new language

Hi, thanks for a great project!

Anything that can be run locally is a great step for privacy.

Could you write up a how-to on what it would take to train a new language?

The questions that come up at the moment are:

  • What kind of recording should it be? Is as little noise as possible desirable, or is "normal" background noise acceptable?
  • What is the minimum length of the training recording?
  • What kind of content would be best?

Is there a way to use this in python?

I see in the readme.md that this can be run in Python, but the example given at the end just uses a bash script which runs python3 -m piper "$@". In the src/python_run directory I ran scripts/setup.sh, got that running, and then I can start python and do >>> import piper, which works. But from there, what do I do to use the imported piper?

(Mostly unrelated: also, do you realize that running it like this makes it the py'd piper? ha! :-) I'm sure you already got that joke and maybe that had something to do with the name change from larynx 2?)
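
Without relying on Piper's internal Python API (which may change between releases), one option is simply to drive the piper entry point from Python; a minimal sketch with illustrative model and output names:

import subprocess

def synthesize(text: str, model: str, output_file: str) -> None:
    # Pipe the text on stdin, exactly like the shell examples in the README.
    subprocess.run(
        ["piper", "--model", model, "--output_file", output_file],
        input=text.encode("utf-8"),
        check=True,
    )

synthesize("Welcome to the world of speech synthesis!",
           "en_US-lessac-medium.onnx", "welcome.wav")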

Streaming the output (not waiting for whole generation to finish)

Is it possible to use piper in a way that you don't need to "enter text, wait for the whole generation to finish, get a .wav file, and then play the wav file", but instead get a "live" streaming output as it's being generated, so that playback can start as soon as the first word, or the first sentence, has finished generating?

How do I extract an .onnx file from a .pt file?

Hi,

I want to use this project on a GPU server.

You say to download a voice from here and extract the .onnx and .onnx.json files from it.
I chose the generator-en-us_blizzard_lessac.tar.gz file, but there is no .onnx file in it; its contents are:

blizzard_lessac-medium.pt
blizzard_lessac-medium.pt.json

Now, how do I convert these files to .onnx?

I asked ChatGPT and it suggested this Python code:

import torch
import onnx

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = torch.load('blizzard_lessac-medium.pt', map_location=device)
input_shape = (1, 80, 1000)  # Change the shape to match your model's input shape
input_names = ['input']
output_names = ['output']

# Export the model to ONNX format
onnx_model = onnx.export(model, args=(torch.zeros(input_shape),), f='blizzard_lessac-medium.onnx',
                         input_names=input_names, output_names=output_names)

onnx.save_as_json(onnx_model, 'blizzard_lessac-medium.onnx.json')

and this code gives me this error:
ModuleNotFoundError: No module named 'larynx_train'

In the end, I didn't understand how to use your program!

usage of larynx in a python code

Dear Michael,

I used larynx in Python code as:
os.system("echo 'Welcome to the world of speech synthesis!' | ./larynx --model ./larynx/en-us-blizzard_lessac-medium.onnx --output_file ./welcome.wav")

It works properly; however, it just saves a .wav file to the specified folder. Then I use the "aplay" command to play it.
When I modified the code to:
str='Welcome to the world of speech synthesis!'
os.system("echo $str | ./larynx --model ./larynx/en-us-blizzard_lessac-medium.onnx --output_file ./welcome.wav")
it did not save anything. What is wrong with the modification?

Could the usage of larynx be modified so that it directly accepts a string as input and plays the result without saving a .wav file?
I mean a command like this:
./larynx --inputstr 'Welcome to the world of speech synthesis!' --model ./larynx/en-us-blizzard_lessac-medium.onnx
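
As a side note on why the modified os.system call likely saved nothing: $str is expanded by the shell, not by Python, and no shell variable named str exists, so echo pipes an empty string into larynx. A sketch of one way to pass the Python string directly (paths taken from the question, otherwise illustrative):

import subprocess

text = "Welcome to the world of speech synthesis!"

# Passing the text on stdin avoids shell variable expansion and quoting issues.
subprocess.run(
    ["./larynx", "--model", "./larynx/en-us-blizzard_lessac-medium.onnx",
     "--output_file", "./welcome.wav"],
    input=text.encode("utf-8"),
    check=True,
)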

Incorrect pronunciation of Ukrainian voice Lada

Hello/Pryvit to all!

I am a native speaker of Ukrainian and the author of the initiative that brought us the Lada's voice.

I made some tests with piper and have some thoughts to share. In short: it sounds incorrect; it seems that libespeak-ng mixes up Russian and Ukrainian letters.

I'd like to open this issue to have a discussion about it.

We have a community in the Telegram messenger - https://t.me/speech_synthesis_uk - where we're developing open source voices for synthesis; we can talk more quickly there.

Supplemental materials:

Audio:
https://user-images.githubusercontent.com/7875085/230715900-21535afa-4406-4002-a2cb-7181e16eb876.mp4

Text in Ukrainian:
світе, привіт! я хочу протестувати цей голос

Translation:
the world, hello! I want to test this voice

Different results from C++ and Python versions

Hi Michael,

just played around a bit with piper and it looks pretty good 👍.

I noticed 2 things when comparing the C++ version (v0.0.2 release) and the Python version (python_run from current master) on my Raspberry Pi 4. One I could fix, the other I could not:

1: The Python version is slower. One could assume this is to be expected ^^, but it is actually just a thread issue. The C++ version uses all 4 threads while the Python version uses only 2. I think ONNX runtime has a default limit of max-cores/2. To fix this I tried the OMP_NUM_THREADS environment variable, but it didn't do anything (not even for threads = 1). What worked for me, though, was to modify the onnxruntime setup slightly:

self.onnx_options = onnxruntime.SessionOptions()
self.onnx_options.intra_op_num_threads = 4
self.model = onnxruntime.InferenceSession(
    str(model_path),
    sess_options=self.onnx_options,
    providers=["CPUExecutionProvider"]
    if not use_cuda
    else ["CUDAExecutionProvider"],
)

This actually made the Python version faster than the C++ version 🤔, except for start-up time ofc ;-).

2: The results of the C++ build are slightly better than those of the Python version, using the same voice with the exact same model. There doesn't seem to be any randomness involved; at least in my quick test I always got the same output on several tries. The voice was en_us_amy_low. I've attached two files for comparison: piper_cpp_py_compare_en_us_amy_low.zip
Is there anything different, some default parameters maybe?

Ty for building this, thinking about a SEPIA integration already 😁, cu soon,
Florian

featurerequest: Prosody or rate control

Dear @synesthesiam,

Thank you so much for creating this and making it available publicly. I apologize if this is not the appropriate place to contact you regarding this, and I should clarify that this is not a true feature request. I have simply tagged it that way so that it has a nicer name in the issue tracker.

I was curious to know if there is any way currently to either tune the network during training, or to adjust larynx2 during operation, to support different rates of speech? I imagine if I were to record an entire speaker's worth of audio at a different rate and train using that, it would create a model that spoke at that different speed. I was just wondering if rate of speech is essentially fixed in these neural model implementations once trained. I am very new to TTS and RNN, and so am not really certain what is technically easy vs feasible but challenging vs impossible.

Thanks to you sharing this code, I am also now seeing what training an RNN might be like and am making headway in setting up training on the LJSpeech dataset to get an idea of how to go about training a voice. How long does your training take per epoch, so that I may have a ballpark? At the moment it seems that it will take about 1 hr/epoch for the LJSpeech-1.1 dataset, but I have no frame of reference to determine whether that is fast, slow, or about expected. 1000 hours to do 1000 epochs seems like it would be quite a long time to leave this running, though. Of course, I know that all of this is relative to the hardware on which things are being run. I am currently running the training through WSL2, with CUDA through a GeForce GTX (not RTX) 1650 on an Intel i7-9750. I had to reduce the batch_size down to 4 from the suggested 32 just to enable the training steps to run without PyTorch complaining about RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.37 GiB already allocated; 0 bytes free; 3.45 GiB reserved in total by PyTorch), so I know that there are probably some significant performance hits in general from the WSL2 indirection and from the under-spec'ed GPU.

Thank you for any information you have the time to share,
@d-r-a-b

cannot fine tune from libritts checkpoint

Hi,
I downloaded the pretrained model from Google Drive.

I'm attempting to fine-tune but am getting an error. Here is my fine-tuning command (this works with the LJSpeech pretrained model, by the way):

python -m piper_train \
    --resume_from_checkpoint 'checkpoints/libritts/high/epoch=418-step=1582960.ckpt' \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.01 \
    --num-test-examples 0 \
    --max-phoneme-ids 400 \
    --max_epochs 10000 \
    --dataset-dir training_24000 \
    --checkpoint-epochs 1  \
    --quality high

Here is the error:

Restoring states from the checkpoint path at checkpoints/libritts/high/epoch=418-step=1582960.ckpt
DEBUG:fsspec.local:open file: /home/admin/piper/src/python/checkpoints/libritts/high/epoch=418-step=1582960.ckpt
Traceback (most recent call last):
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/admin/piper/src/python/piper_train/__main__.py", line 95, in <module>
    main()
  File "/home/admin/piper/src/python/piper_train/__main__.py", line 88, in main
    trainer.fit(model)
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1056, in _run
    self._restore_modules_and_callbacks(ckpt_path)
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1000, in _restore_modules_and_callbacks
    self._checkpoint_connector.restore_model()
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 261, in restore_model
    self.trainer.strategy.load_model_state_dict(self._loaded_checkpoint)
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 363, in load_model_state_dict
    self.lightning_module.load_state_dict(checkpoint["state_dict"])
  File "/home/admin/miniconda3/envs/piper/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VitsModel:
	Unexpected key(s) in state_dict: "model_g.emb_g.weight", "model_g.dec.cond.weight", "model_g.dec.cond.bias", "model_g.enc_q.enc.cond_layer.bias", "model_g.enc_q.enc.cond_layer.weight_g", "model_g.enc_q.enc.cond_layer.weight_v", "model_g.flow.flows.0.enc.cond_layer.bias", "model_g.flow.flows.0.enc.cond_layer.weight_g", "model_g.flow.flows.0.enc.cond_layer.weight_v", "model_g.flow.flows.2.enc.cond_layer.bias", "model_g.flow.flows.2.enc.cond_layer.weight_g", "model_g.flow.flows.2.enc.cond_layer.weight_v", "model_g.flow.flows.4.enc.cond_layer.bias", "model_g.flow.flows.4.enc.cond_layer.weight_g", "model_g.flow.flows.4.enc.cond_layer.weight_v", "model_g.flow.flows.6.enc.cond_layer.bias", "model_g.flow.flows.6.enc.cond_layer.weight_g", "model_g.flow.flows.6.enc.cond_layer.weight_v", "model_g.dp.cond.weight", "model_g.dp.cond.bias".

How to run with GPU

First, thanks for the high-performance TTS tool.
I've tried it on the CPU and it is really fast, but I don't know how to run it on the GPU to make it even faster.
Does this tool support GPU/CUDA?

Provide voice samples to showcase piper

This is an awesome project.

Could you please provide some samples (ideally for each voice, but just one would already be great) in the readme to showcase the quality of this TTS?

Context: I only recently learned about this library from a reddit post. Some people asked for an example to see how good the generated voices sound, and another redditor posted one. I was really impressed by the quality.

Adding Arabic language voice

Hello @synesthesiam

Thanks for the effort you have put into this.

I'm a native Arabic language speaker.

I'm interested in adding an Arabic language voice to Larynx2.

This involves the following:

  1. There is a free, high-quality speech corpus for Arabic, available from this site:

http://en.arabicspeechcorpus.com/

  2. Arabic text phonemization is handled through:

https://github.com/nawarhalabi/Arabic-Phonetiser

  3. Modern Standard Arabic (MSA) text does not include soft vowels (diacritics), so a text preprocessing step is required to add soft vowels before converting the text into phonemes. I developed a package to handle this preprocessing step using an ONNX model. The package is available at:

https://github.com/mush42/tashkeel

Best
Musharraf

multiple languages at the same time

I did a little test with speech-dispatcher, using the generic modules.
The purpose was to test whether I can send a few German, French, and English text lines via spd-say. The text file is multilingual and
uses the syntax "!-!SET SELF LANGUAGE xx" before the few words of the next line.
This worked, but between the output of text in different languages there is noise. Is it possible to eliminate this?
The French pronunciation is better than with larynx or mimic3, but there are some small problems with "liaisons".

LJspeech Checkpoint for fine-tuning

I wonder if there's a checkpoint from the pretrained model (LJSpeech dataset) to continue training from. That would greatly save time and computation, especially for low-resource datasets.

How to turn a .pth into .onnx?

/piper/src/python/piper_train$ python export_onnx.py checkpoint G_latest.pth output G_latest.onnx
Traceback (most recent call last):
  File "/www/piper/src/python/piper_train/export_onnx.py", line 9, in <module>
    from .vits.lightning import VitsModel
ImportError: attempted relative import with no known parent package

Running on M1 mac

Is there any chance this would run on Apple silicon (M1 or M2)? I have an M1. I first tried the Raspberry Pi binary, but it didn't run. Then I tried compiling from source, but for some reason I get an error that espeak-ng is not available. I have espeak-ng installed and can use it from the command line, but when trying to compile piper it doesn't find it for some reason. I installed espeak-ng through MacPorts.

Has anyone done this? Do you have any pointers for me?

Piper runs fine for me on the Raspberry Pi, but I was hoping to get it running on Apple silicon for faster TTS.

Thanks!

error on execution

Hi,

Thanks for your awesome library.
I ran this command on a Raspberry Pi 4 (Buster, 64-bit Debian 10) after downloading a model:
echo 'Welcome to the world of speech synthesis!' |
./larynx --model blizzard_lessac-medium.onnx --output_file welcome.wav

but there is an error relating to GLIBC:
./larynx: /lib/aarch64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /home/pi/speech/larynx/larynx/libespeak-ng.so.1)

I tried to upgrade my GLIBC from 2.28 to 2.29, but I could not find that version on this website: https://packages.debian.org/buster/

How can I solve the issue?
Best,

ONNX model runtime

Hey, do you have any runtime comparisons between the ONNX and PyTorch models? I wonder if it is worth trying for 🐸 models, if you don't mind.
