vall-e-x's Introduction

VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning 🔊

Discord
English | 中文
An open source implementation of Microsoft's VALL-E X zero-shot TTS model.
We release our trained model to the public for research or application usage.

vallex-framework

VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially published it in a research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge of reproducing the results and training our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS! 🎧

More details about the model are presented in the model card.

📖 Quick Index

🚀 Updates

2023.09.10

  • Added AR decoder batch decoding for more stable generation results.

2023.08.30

  • Replaced EnCodec decoder with Vocos decoder, improved audio quality. (Thanks to @v0xie)

2023.08.23

  • Added long text generation.

2023.08.20

2023.08.14

  • Pretrained VALL-E X checkpoint is now released. Download it here

💻 Installation

Requirements: Python 3.10, CUDA 11.7 ~ 12.0, PyTorch 2.0+. Install with pip:

git clone https://github.com/Plachtaa/VALL-E-X.git
cd VALL-E-X
pip install -r requirements.txt

Note: If you want to make a prompt, you need to install ffmpeg and add its folder to the PATH environment variable.

When you run the program for the first time, it will automatically download the corresponding model.

If the download fails and reports an error, please follow the steps below to manually download the model.

(Please pay attention to the capitalization of folder names. A small checking sketch follows the steps below.)

  1. Check whether there is a checkpoints folder in the installation directory. If not, manually create a checkpoints folder (./checkpoints/) in the installation directory.

  2. Check whether there is a vallex-checkpoint.pt file in the checkpoints folder. If not, please manually download the vallex-checkpoint.pt file from here and put it in the checkpoints folder.

  3. Check whether there is a whisper folder in the installation directory. If not, manually create a whisper folder (./whisper/) in the installation directory.

  4. Check whether there is a medium.pt file in the whisper folder. If not, please manually download the medium.pt file from here and put it in the whisper folder.
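If anything still goes wrong, here is a small hypothetical helper (not part of the repository) that checks whether the manually downloaded files are where the program expects them:

# Hypothetical check script, not part of the repository: verify the manual download layout.
from pathlib import Path

expected = [Path("./checkpoints/vallex-checkpoint.pt"), Path("./whisper/medium.pt")]
for path in expected:
    path.parent.mkdir(exist_ok=True)  # creates ./checkpoints/ or ./whisper/ if missing
    status = "found" if path.is_file() else "MISSING - please download it manually"
    print(f"{path}: {status}")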

🎧 Demos

Not ready to set up the environment on your local machine just yet? No problem! We've got you covered with our online demos. You can try out VALL-E X directly on Hugging Face or Google Colab, experiencing the model's capabilities hassle-free!
Open in Spaces Open In Colab

📢 Features

VALL-E X comes packed with cutting-edge functionalities:

  1. Multilingual TTS: Speak in three languages - English, Chinese, and Japanese - with natural and expressive speech synthesis.

  2. Zero-shot Voice Cloning: Enroll a short 3~10 seconds recording of an unseen speaker, and watch VALL-E X create personalized, high-quality speech that sounds just like them!

see example
prompt.webm
output.webm
  3. Speech Emotion Control: Experience the power of emotions! VALL-E X can synthesize speech with the same emotion as the acoustic prompt provided, adding an extra layer of expressiveness to your audio.
see example
sleepy-prompt.mp4
sleepy-output.mp4
  4. Zero-shot Cross-Lingual Speech Synthesis: Take monolingual speakers on a linguistic journey! VALL-E X can produce personalized speech in another language without compromising fluency or accent. Below, a Japanese speaker talks in Chinese & English. 🇯🇵 🗣
see example
jp-prompt.webm
en-output.webm
zh-output.webm
  5. Accent Control: Get creative with accents! VALL-E X allows you to experiment with different accents, like speaking Chinese with an English accent or vice versa. 🇨🇳 💬
see example
en-prompt.webm
zh-accent-output.webm
en-accent-output.webm
  6. Acoustic Environment Maintenance: No need for perfectly clean audio prompts! VALL-E X adapts to the acoustic environment of the input, making speech generation feel natural and immersive.
see example
noise-prompt.webm
noise-output.webm

Explore our demo page for a lot more examples!

🐍 Usage in Python

🪑 Basics

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)

# play audio in notebook
Audio(audio_array, rate=SAMPLE_RATE)
hamburger.webm

🌎 Foreign Language


This VALL-E X implementation also supports Chinese and Japanese. All three languages have equally awesome performance!
text_prompt = """
    チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(text_prompt)
vallex_japanese.webm

Note: VALL-E X controls accent perfectly even when synthesizing code-switched text. However, you need to manually denote the language of each sentence (since our g2p tool is rule-based).

text_prompt = """
    [EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
    [ZH]这是历史的开始。 如果您想听更多,请继续。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')
vallex_codeswitch.webm

📼 Voice Presets

VALL-E X provides tens of speaker voices which you can directly use for inference! Browse all voices in the code.

VALL-E X tries to match the tone, pitch, emotion and prosody of a given preset. The model also attempts to preserve music, ambient noise, etc.

text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(text_prompt, prompt="dingzhen")
smoky.webm

🎙Voice Cloning

VALL-E X supports voice cloning! You can make a voice prompt from any person, character, or even your own voice, and use it like any other voice preset.
To make a voice prompt, you need to provide a speech clip 3~10 seconds long, along with its transcript. You can also leave the transcript blank and let the Whisper model generate it.

VALL-E X tries to match the tone, pitch, emotion and prosody of a given prompt. The model also attempts to preserve music, ambient noise, etc.

from utils.prompt_making import make_prompt

### Use given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
                transcript="Just, what was that? Paimon thought we were gonna get eaten.")

### Alternatively, use whisper
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")

Now let's try out the prompt we've just made!

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# download and load all models
preload_models()

text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="paimon")

write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)
paimon_prompt.webm
paimon_cloned.webm

🎢User Interface

Not comfortable with code? No problem! We've also created a user-friendly graphical interface for VALL-E X. It allows you to interact with the model effortlessly, making voice cloning and multilingual speech synthesis a breeze.
You can launch the UI with the following command:

python -X utf8 launch-ui.py

🛠️ Hardware and Inference Speed

VALL-E X works well on both CPU and GPU (PyTorch 2.0+, CUDA 11.7 and CUDA 12.0).

6GB of GPU VRAM is enough to run VALL-E X without offloading.
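As a rough illustration (plain PyTorch calls, not part of the VALL-E X API), you can check which device will be used and how much VRAM is available before loading the models:

# Rough sketch using standard PyTorch calls only; not part of VALL-E X itself.
import torch

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"CUDA device: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
    if vram_gb < 6:
        print("Less than 6GB of VRAM: generation may run out of memory.")
else:
    print("No CUDA device found: VALL-E X will run on the CPU (slower).")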

⚙️ Details

VALL-E X is similar to Bark, VALL-E, and AudioLM: it generates audio GPT-style by predicting audio tokens quantized by EnCodec.
Compared to Bark:

  • Lightweight: 3× smaller
  • Efficient: 4× faster
  • Better quality on Chinese & Japanese
  • Cross-lingual speech without a foreign accent
  • Easy voice cloning
  • Fewer supported languages
  • No special tokens for music / sound effects

Supported Languages

  • English (en)
  • Japanese (ja)
  • Chinese, simplified (zh)

❓ FAQ

Where is code for training?

  • lifeiteng's vall-e has almost everything you need. There is no plan to release our training code, since it is essentially the same as lifeiteng's implementation.

Where can I download the model checkpoint?

  • We use wget to download the model to directory ./checkpoints/ when you run the program for the first time.
  • If the download fails on the first run, please manually download from this link, and put the file under directory ./checkpoints/.

How much VRAM do I need?

  • 6GB GPU VRAM - Almost all NVIDIA GPUs satisfy the requirement.

Why does the model fail to generate long text?

  • The Transformer's computational complexity grows quadratically with sequence length, so all training samples were kept under 22 seconds. Please make sure the total length of the audio prompt plus the generated audio stays under 22 seconds to ensure acceptable quality. For longer inputs, split the text into sentences and synthesize them one by one, as in the sketch below.
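A minimal sketch of this chunking idea, using only the generate_audio API shown above plus nltk (assumed to be installed); it is illustrative and is not the project's built-in long text mode:

# Minimal chunking sketch: split long text into sentences, synthesize each chunk
# (keeping prompt + output under ~22 seconds), and join them with short pauses.
import nltk
import numpy as np
from utils.generation import SAMPLE_RATE, generate_audio, preload_models

nltk.download('punkt')
preload_models()

long_text = "First sentence of a long passage. Second sentence. Third sentence."
silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter-second pause between chunks

pieces = []
for sentence in nltk.sent_tokenize(long_text):
    pieces += [generate_audio(sentence), silence.copy()]

audio_array = np.concatenate(pieces)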

MORE TO BE ADDED...

🧠 TODO

  • Add Chinese README
  • Long text generation
  • Replace Encodec decoder with Vocos decoder
  • Fine-tuning for better voice adaptation
  • .bat scripts for non-python users
  • To be added...

🙏 Appreciation

⭐️ Show Your Support

If you find VALL-E X interesting and useful, give us a star on GitHub! ⭐️ It encourages us to keep improving the model and adding exciting features.

📜 License

VALL-E X is licensed under the MIT License.


Have questions or need assistance? Feel free to open an issue or join our Discord

Happy voice cloning! 🎤

vall-e-x's People

Contributors

eltociear, fakerybakery, hkzbiyx, linyueqian, mutoe, nilwurtz, plachtaa, tideillusion, v0xie


vall-e-x's Issues

A note on the implementation.

Great implementation. The paper says phoneme and multilingual acoustic token pairs <S,A> are concatenated. I'm assuming there is no prompt for the AR model? In the original VALL-E they train on a prompt <x, C~, C<1> for both the NAR and AR models.

In addition, are language IDs added element-wise to the acoustic tokens of both the AR and NAR models during training?

How do I change the configuration to use the GPU?

With everything at default, it prints "Use 20 cpu cores for computing". Watching performance in Task Manager, the GPU is not used at all.
My machine is a 3070 with 8GB; that should be enough to run it, right?
Could someone point me to how to change the configuration?

How to make it run on macOS

I got the following error:
"NotImplementedError: cannot instantiate 'WindowsPath' on your system"

To fix this I changed line number 12 in file launch-ui.py
FROM:
elif platform.system().lower() == 'linux':
TO:
else:

It runs well on M2.

Is there a way to implement long text support?

I don't know much about deep learning, so please forgive any silly mistakes.
I looked at Bark's long-form generation implementation:
https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb
and then I modified the UI code as follows:
import nltk
nltk.download('punkt')

@torch.no_grad()
def infer_from_prompt(text, language, accent, prompt_file):
    clear_prompts()
    model.to(device)
    # text to synthesize
    lang_token = langdropdown2token[language]
    lang = token2lang[lang_token]
    sentences = nltk.sent_tokenize(text)
    silence = np.zeros(int(0.25 * SAMPLE_RATE))  # quarter second of silence
    pieces = []
    for sentence in sentences:
        print("\nGenerating speech: " + sentence + "\n")
        audio_array, t = get_voice(accent, lang, lang_token, prompt_file, sentence)
        pieces += [audio_array, silence.copy()]
    normalized_pieces = np.concatenate(pieces)
    normalized_pieces = (normalized_pieces * 32767 / np.max(np.abs(normalized_pieces))).astype(np.int16)
    # save audio to disk
    model.to('cuda')
    torch.cuda.empty_cache()
    message = f"synthesized text: {text}"
    return message, (24000, normalized_pieces)

def get_voice(accent, lang, lang_token, prompt_file, text):
    text = lang_token + text + lang_token
    print("lang:" + text)
    # load prompt
    prompt_data = np.load(prompt_file.name)
    audio_prompts = prompt_data['audio_tokens']
    text_prompts = prompt_data['text_tokens']
    lang_pr = prompt_data['lang_code']
    lang_pr = code2lang[int(lang_pr)]
    # numpy to tensor
    audio_prompts = torch.tensor(audio_prompts).type(torch.int32).to(device)
    text_prompts = torch.tensor(text_prompts).type(torch.int32)
    enroll_x_lens = text_prompts.shape[-1]
    logging.info(f"synthesize text: {text}")
    phone_tokens, langs = text_tokenizer.tokenize(text=f"_{text}".strip())
    print("langs:" + str(langs))
    text_tokens, text_tokens_lens = text_collater(
        [
            phone_tokens
        ]
    )
    text_tokens = torch.cat([text_prompts, text_tokens], dim=-1)
    text_tokens_lens += enroll_x_lens
    # accent control
    lang = lang if accent == "no-accent" else token2lang[langdropdown2token[accent]]
    else_lang = langs if accent == "no-accent" else lang
    print("else_lang:" + str(else_lang))
    encoded_frames = model.inference(
        text_tokens.to(device),
        text_tokens_lens.to(device),
        audio_prompts,
        enroll_x_lens=enroll_x_lens,
        top_k=-100,
        temperature=1,
        prompt_language=lang_pr,
        text_language=else_lang,
    )
    samples = audio_tokenizer.decode(
        [(encoded_frames.transpose(2, 1), None)]
    )
    audio_data = samples[0][0].cpu().numpy()
    return audio_data, text
However, the sentence splitting does not seem to work. Could someone help figure this out, so that the project can support long text soon?

Dependencies Missing

The dependencies in requirements.txt should also include nltk and sudachipy.
The punkt package is also needed for splitting long text into sentences.

Reporting the problems I ran into

Environment: WSL2 + a conda environment.

After installing the requirements:

ModuleNotFoundError: No module named 'phonemizer'
AttributeError: module 'gradio' has no attribute 'Blocks'
ImportError: Numba needs NumPy 1.24 or less

All of these could be resolved by installing or force-reinstalling the packages.
I also ran into a cuDNN version problem, which I solved with conda.

Thanks for your hard work.

Error happens sometimes

return self.codec.decode(frames)

File "encodec\model.py", line 175, in decode
File "encodec\model.py", line 184, in _decode_frame
File "torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "encodec\modules\seanet.py", line 237, in forward
File "torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "torch\nn\modules\container.py", line 217, in forward
input = module(input)
File "torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "encodec\modules\conv.py", line 210, in forward
File "torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "encodec\modules\conv.py", line 120, in forward
File "torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "torch\nn\modules\conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "torch\nn\modules\conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size

'NoneType' object has no attribute 'inference'

When I try to use my custom prompt I get this error:

Traceback (most recent call last):
  File "/home/mike/Documents/ai/VALL-E-X/test_prompt.py", line 7, in <module>
    audio_array = generate_audio(text_prompt, prompt="trump")
  File "/home/mike/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mike/Documents/ai/VALL-E-X/utils/generation.py", line 125, in generate_audio
    encoded_frames = model.inference(
AttributeError: 'NoneType' object has no attribute 'inference'

here is my code:

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="custom_prompt")

write_wav("inputAudio/custom_prompt_result.wav", SAMPLE_RATE, audio_array)

[Windows] UnicodeEncodeError: 'charmap' codec can't encode characters

Any other windows users dealing with this error? I don't have this bug if I run it through WSL or google colab.

If I try to clone any japanese voice, I get this error. I assume it might happen with chinese voices too

It works fine with English voices.

Full error

File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4-43: character maps to <undefined>

[Bug] [nltk_data] Error loading punkt: <urlopen error [Errno 11004]

On the first launch, the nltk data is downloaded:

(base) D:\GitHub_pro\VALL-E-X>python launch-ui.py
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.

On the second launch, it fails to load:

(base) D:\GitHub_pro\VALL-E-X>python launch-ui.py
[nltk_data] Error loading punkt: <urlopen error [Errno 11004]
[nltk_data]     getaddrinfo failed>

The local file does exist (see screenshot).

Updated phonemizer broke install with requirements.txt

After installing, it fails with:

File "/Users/ahekot/PycharmProjects/Python3.10_test/VALL-E-X/launch-ui.py", line 29, in <module>
    from data.tokenizer import (
File "/Users/ahekot/PycharmProjects/Python3.10_test/VALL-E-X/data/tokenizer.py", line 26, in <module>
    from phonemizer.backend.espeak.language_switch import LanguageSwitch
ModuleNotFoundError: No module named 'phonemizer.backend.espeak.language_switch'; 'phonemizer.backend.espeak' is not a package

You can fix this by downgrading phonemizer with pip install phonemizer==3.2.0

Unexpected token '<', " <!DOCTYPE "... is not valid JSON

I'm SSH'd from a Mac into a Linux server. Launching with app.launch(share=True) works fine, but when I click Generate I get the error: Unexpected token '<', " <!DOCTYPE "... is not valid JSON. It seems the frontend expects JSON but receives HTML instead.

Error when running the demo; I don't know how to resolve it, please help.

Test code:
~/VALL-E-X# cat test.py
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio
preload_models()
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)
Audio(audio_array, rate=SAMPLE_RATE)

Error output:
/VALL-E-X# python3 test.py
Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from utils.generation import SAMPLE_RATE, generate_audio, preload_models
  File "/root/VALL-E-X/utils/generation.py", line 4, in <module>
    from vocos import Vocos
  File "/root/miniconda3/lib/python3.8/site-packages/vocos/__init__.py", line 1, in <module>
    from vocos.pretrained import Vocos
  File "/root/miniconda3/lib/python3.8/site-packages/vocos/pretrained.py", line 7, in <module>
    from vocos.feature_extractors import FeatureExtractor, EncodecFeatures
  File "/root/miniconda3/lib/python3.8/site-packages/vocos/feature_extractors.py", line 8, in <module>
    from vocos.modules import safe_log
  File "/root/miniconda3/lib/python3.8/site-packages/vocos/modules.py", line 89, in <module>
    class ResBlock1(nn.Module):
  File "/root/miniconda3/lib/python3.8/site-packages/vocos/modules.py", line 109, in ResBlock1
    dilation: tuple[int] = (1, 3, 5),
TypeError: 'type' object is not subscriptable

I'm confused by this error message and don't know how to fix it. Any help is appreciated, thanks.

How do I get a good npz prompt file?

8.14.1.-888.mp4

As in the video above, how can I get a good npz file? I found a 6-second video online, exported the audio, and generated an npz file from it, but the generated voice is quite noisy.
How do you make a good npz file that produces a clearer voice?
Are there requirements on the audio length?
Are there requirements on the audio sample rate?
Is a quiet recording environment required?
I would still like some ambient sound to be present; is that possible?

openjtalk build issue

In my testing, openjtalk builds successfully on Python 3.10 (Windows) but fails to build on 3.11.
(This could be noted in the README.)

How can I fine-tune voice cloning further?

A clone produced from a single short audio clip has noticeable background noise and an electrical hum. How can I fine-tune further, using more audio of a specific voice, to get a cleaner cloned voice?

I ran into this problem

I have an 8GB GPU, but I still get the following error:

G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\whisper\timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def backtrace(trace: np.ndarray):
Use 8 cpu cores for computing
Traceback (most recent call last):
  File "G:\AI\VALL\VALL-E-X-master\VALL-E-X-master\ui.py", line 84, in <module>
    whisper_model = whisper.load_model("medium").cpu()
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\whisper\__init__.py", line 154, in load_model
    return model.to(device)
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "G:\AI\VALL\VALL-E-X-master\myenv\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.05 GiB already allocated; 0 bytes free; 3.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Training setup

I saw that the model was adapted from lifeiteng's implementation, and I have a few questions:

  1. I didn't see the changes to the bin/ dataset preparation, nor any examples of it.
  2. During training, did you put source and target text in different languages (i.e. translated pairs)?
  3. Could you share the arguments you passed to python3 bin/trainer.py?
  4. Your demo is close to the official one in quality; do you think adding more data would improve it further?

Tried to train VALL-E-X but the results are weird

Hi, thank you for this work.
I tried to train VALL-E X with LibriTTS and AISHELL-1, but got weird results. Do you have any idea why? Thanks a lot.
I used the training code from lifeiteng/vall-e; the tensorboard curves are shown in the attached screenshots.

Training scripts:

python train.py \
        --train-stage 1 \
        --world-size 2 \
        --num-workers 2 \
        --max-duration 3750 \
        --dtype "float32" \
        --save-every-n 10000 \
        --valid-interval 20000 \
        --model-name valle \
        --share-embedding true \
        --norm-first true \
        --add-prenet true \
        --decoder-dim 1024 \
        --nhead 16 \
        --num-decoder-layers 12 \
        --prefix-mode 1 \
        --base-lr 0.05 \
        --warmup-steps 200 \
        --average-period 0 \
        --num-epochs 20 \
        --start-epoch 15 \
        --start-batch 0 \
        --accumulate-grad-steps 8 \
        --exp-dir "${exp_dir}"
python train.py \
        --train-stage 2 \
        --world-size 2 \
        --num-workers 2 \
        --max-duration 3750 \
        --dtype "float32" \
        --save-every-n 10000 \
        --valid-interval 20000 \
        --model-name valle \
        --share-embedding true \
        --norm-first true \
        --add-prenet true \
        --decoder-dim 1024 \
        --nhead 16 \
        --num-decoder-layers 12 \
        --prefix-mode 1 \
        --base-lr 0.05 \
        --warmup-steps 200 \
        --average-period 0 \
        --num-epochs 100 \
        --start-epoch 41 \
        --start-batch 0 \
        --accumulate-grad-steps 4 \
        --exp-dir "${exp_dir}"

requests.exceptions.ConnectionError: HTTPSConnectionPool(host='drive.google.com', port=443): Max retries exceeded with url:

My OS is Ubuntu 22.04.3 LTS (I actually switched to it temporarily because pyopenjtalk would not install on Windows). Python 3.10.12.
When I run the example code:

from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)

# play audio in notebook
Audio(audio_array, rate=SAMPLE_RATE)

I get the error shown in these screenshots:
Screenshot from 2023-08-22 15-33-22
Screenshot from 2023-08-22 15-33-54

Training for a new language

Hi,
Is it possible to use the pre-trained model to train a new language?
I'm specifically interested to train a low resource language.
I've done it with Coqui, with IMS-Toucan, and with few-shot-transformer-tts.
First two gave ok results and the 3rd one was better. But I'm hoping for even better quality, and with the great features of VALL-E-X it would be great if it would be possible to fine-tune it to additional languages.
Thanks!

voice cloning problem

Great work!
I've run into a problem: when the text_prompt gets a bit longer, the quality drops sharply. What could be the cause?

Should dependency files be bundled?

I'm planning to move the encodec and other dependency files from .cache into the project directory. Bundling them directly in the project would reduce the impact of unstable networks, but would add about 140MB. Is it better to bundle them, or just change the path?

Adding a new language

What is the possibility of adding a new language, e.g. Arabic? Is there a guide or something I can read?
Thanks in advance.

Training resources

Hi,

Thank you for the awesome implementation.

I have a couple of questions; if you can shed a little light, it would be helpful.

  1. How many GPUs (16GB?) did you need to train the pretrained model, and how long did it take?
  2. Did you do any kind of preprocessing on the LibriLight dataset?

no such option: --no-error-on-external

pip --version
Output:
pip 23.2.1 from C:\Users\XXX\AppData\Local\Programs\Python\Python310\lib\site-packages\pip (python 3.10)

pip install --no-error-on-external -r requirements.txt
Output:
Usage:
  pip install [options] <requirement specifier> [package-index-options] ...
  pip install [options] -r <requirements file> [package-index-options] ...
  pip install [options] [-e] <vcs project url> ...
  pip install [options] [-e] <local project path> ...
  pip install [options] <archive url/path> ...

no such option: --no-error-on-external

python --version
Output:
Python 3.10.11

Dataset Copyright

Your demo page states that you have "self-gathered" data.
I would like to know the details of the data breakdown.
Does this include data that may be problematic for commercial use, such as animation, movies, YouTube, etc.?

Help: I still can't get it to launch

1. I read through the Chinese README repeatedly. First I spent a day following a tutorial to set up a deep learning environment: Anaconda + Python 3.10 + PyCharm + CUDA + cuDNN + PyTorch.
2. The docs say to install ffmpeg, so I followed a CSDN tutorial and installed it too (reference: https://blog.csdn.net/kakangel/article/details/128160587).
3. I had already downloaded the source from GitHub and placed it at E:\AI\VALL-E-X-master.
4. I also downloaded the pretrained model and put it under E:\AI\VALL-E-X-master. Since the instructions say to put it in a checkpoints folder, I created a checkpoints folder in the project root and moved the pretrained model into it.
5. I added E:\AI\VALL-E-X-master\ffmpeg\bin to the system PATH variable.
6. Then, following the local installation instructions, I opened Anaconda, activated the Python 3.10 environment with conda activate py3.10, and ran pip install -r E:\AI\VALL-E-X-master\requirements.txt. It downloaded and installed a bunch of packages, and after that it was time to launch the web UI.
7. In the py3.10 environment I ran python E:\AI\VALL-E-X-master\launch-ui.py and got the following output:

(base) C:\Users\PFP>conda activate py3.10

(py3.10) C:\Users\PFP>python E:\AI\VALL-E-X-master\launch-ui.py
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\PFP\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
C:\Users\PFP.conda\envs\py3.10\lib\site-packages\whisper\timing.py:58: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def backtrace(trace: np.ndarray):
Use 32 cpu cores for computing
Traceback (most recent call last):
  File "E:\AI\VALL-E-X-master\launch-ui.py", line 52, in <module>
    text_tokenizer = PhonemeBpeTokenizer(tokenizer_path="./utils/g2p/bpe_69.json")
  File "E:\AI\VALL-E-X-master\utils\g2p\__init__.py", line 13, in __init__
    self.tokenizer = Tokenizer.from_file(tokenizer_path)
Exception: The system cannot find the specified path. (os error 3)

(py3.10) C:\Users\PFP>

It just will not start. I'm hoping someone can point me in the right direction.

How to get vallex-checkpoint.pt?

File "D:\codes\VALL-E-X\py\lib\site-packages\torch\serialization.py", line 252, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './vallex-checkpoint.pt'

Synthesis quality still leaves something to be desired

Even though it is zero-shot, quality still matters. One more option could be offered: increase the amount and duration of training data to provide better synthesis quality.

When localhost is not accessible

Running on local URL: http://127.0.0.1:7860
Traceback (most recent call last):
  File "C:\VALL-E-X-master\launch-ui.py", line 571, in <module>
    main()
  File "C:\VALL-E-X-master\launch-ui.py", line 564, in main
    app.launch()
  File "C:\Users***\AppData\Local\Programs\Python\Python311\Lib\site-packages\gradio\blocks.py", line 1974, in launch
    raise ValueError(
ValueError: When localhost is not accessible, a shareable link must be created. Please set share=True or check your proxy settings to allow access to localhost.

How do I fix this?
