ntt123 / light-speed

A modified VITS that utilizes phoneme duration's ground truth for better robustness

License: MIT License

Python 60.14% Jupyter Notebook 39.86%

light-speed's Issues

Closed

Could you please share your contact information?

Questions About VITS Code Modifications and Model Performance

Hi,
Thanks for your great work!
I'm curious to understand your thought process as a learner. May I ask why you decided to make modifications to the original VITS code?

You mentioned 'robust,' but I'm not quite clear on its exact meaning. Does it refer to the model's performance in different aspects, such as WER (Word Error Rate) or talking speed?
When you talk about 'speech quality,' are you referring to the sound quality of the generated speech? Is it similar to audio quality metrics like PESQ?
Regarding the 'expanding the receptive field of the Wavenet Flow module' modification, how did you analyze the need for this change, and in what ways does it enhance the quality of synthesized speech?
I noticed that the original VITS was trained using PyTorch, but you chose to rewrite some code in TensorFlow. What motivated this decision? Are there specific advantages or requirements that led to this change in the tech stack?

The engine skips text

The engine skips text quite often: sometimes a sentence, sometimes half a paragraph, after which it continues with the next paragraph. The male voice is very natural; if this error can be fixed, it will be almost perfect.
Thanks to the author!

About training

Hello!

I am using your public dataset https://huggingface.co/datasets/ntt123/viet-tts-dataset for training and got this error:

2024-04-05 14:36:50:     return forward_call(*args, **kwargs)
2024-04-05 14:36:50:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-05 14:36:50:   File "/data/light-speed/models.py", line 425, in forward
2024-04-05 14:36:50:     z_slice, ids_slice = commons.rand_slice_segments(
2024-04-05 14:36:51:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-05 14:36:51:   File "/data/light-speed/commons.py", line 64, in rand_slice_segments
2024-04-05 14:36:51:     ret = slice_segments(x, ids_str, segment_size)
2024-04-05 14:36:51:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-05 14:36:51:   File "/data/light-speed/commons.py", line 54, in slice_segments
2024-04-05 14:36:51:     ret[i] = x[i, :, idx_str:idx_end]
2024-04-05 14:36:51:     ~~~^^^
2024-04-05 14:36:51: RuntimeError: The expanded size of the tensor (32) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [192, 32].  Tensor sizes: [192, 0]

Have you encountered this error before? Is there a solution? Could the issue be related to how the dataset files are named?

The directory structure of my data is as follows:

Hoping for your reply!
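
A note on reading the trace: "Target sizes: [192, 32]" against "Tensor sizes: [192, 0]" says that the slice x[i, :, idx_str:idx_end] came back with zero frames where 32 were expected. The trace alone does not say why; plausible causes include an item whose time axis is shorter than the segment, or a start index that lies past the end of the item. The sketch below is not code from this repository. It uses plain Python slice arithmetic, which tensor slicing with a step of 1 also follows, to flag items that would produce a short or empty slice. The helper names slice_width and find_bad_items are hypothetical, and the example lengths are made up.

```python
def slice_width(time_len, idx_str, segment_size=32):
    """Frames returned by x[..., idx_str:idx_str + segment_size]
    when the time axis has time_len frames (Python slice semantics)."""
    idx_end = idx_str + segment_size
    return len(range(time_len)[idx_str:idx_end])


def find_bad_items(time_lens, starts, segment_size=32):
    """Indices of batch items whose slice is not exactly segment_size wide."""
    return [
        i
        for i, (t, s) in enumerate(zip(time_lens, starts))
        if slice_width(t, s, segment_size) != segment_size
    ]


# Made-up example: item 0 is fine, item 1 starts past its end,
# item 2 has an empty time axis.
time_lens = [100, 20, 0]
starts = [10, 40, 0]
bad = find_bad_items(time_lens, starts)
print(bad)
```

If a check like this flags items in the real data, logging the per-item lengths right before the call to rand_slice_segments would show whether very short clips or mismatched lengths are involved.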

Which dataset do you use for VN - Male voice?

First, thanks for this great repo.
I have a question.
Are you using this viet-tts-dataset? If so, do you have the preprocessing code that is applied before the data is fed to the training model?

Share model weight

Can you share your model weights?

Duration model
Models for the male voice and the female voice

Thank you

44.1 Khz training config

Hi,
This is the greatest TTS project for Vietnamese I have found so far. Thanks for your work.

I have successfully trained this model at 44.1 kHz by changing sampling_rate in config.json (all other settings unchanged). However, the quality of the synthesized speech is worse than that of the 16 kHz version: it contains a lot of hissing ("tiếng rè" in Vietnamese). Do I need to modify anything else to get better quality at 44.1 kHz, or is there a way to upsample from 16 kHz to 44.1 kHz after inference?

Any help would be appreciated!!
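
On the second part of the question, a minimal sketch of upsampling a 16 kHz waveform to 44.1 kHz after inference, assuming NumPy and SciPy are installed. It uses scipy.signal.resample_poly with the rational ratio 441/160. The function name upsample and the test tone are illustrative only and are not part of this repository. Note that resampling changes the sample rate but cannot restore content above 8 kHz that the 16 kHz model never generated.

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly


def upsample(wave, src_rate=16000, dst_rate=44100):
    """Polyphase resampling from src_rate to dst_rate."""
    ratio = Fraction(dst_rate, src_rate)  # 44100/16000 reduces to 441/160
    return resample_poly(wave, ratio.numerator, ratio.denominator)


# Stand-in for model output: one second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

out = upsample(tone)
print(len(tone), len(out))
```

Whether this sounds better than a native 44.1 kHz model depends on the listener and the material; it only sidesteps the hissing by staying with the 16 kHz configuration that already works.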

Training took forever to finish

For testing purposes, I extracted only 200 files (100 pairs) from the VietBibleVox zip data. I then ran the prepare_vbx_tfdata.ipynb notebook, which resulted in the following:

JSON files in the "./data/VietBibleVox" directory.
The "./data/tfdata/test" directory was created with one file named "part_000.tfrecords" that is approximately 56 MB in size.
The "./data/tfdata/train" directory was created with 256 files named "part_*.tfrecords", but all of them are empty (0 bytes).
The files "lexicon.dict", "lexicon.txt", "phone_set.json", and "vbx_mfa.zip" are non-empty files.
A directory named "MFA" was created in the "$HOME/Documents" directory, with a total size of 86 MB.

Afterwards, I attempted to run "python3 train.py", but the process repeatedly prints "0it [00:00, ?it/s]" to the screen. I waited for approximately 1 hour before interrupting the process. I believe this is an excessively long time for such a small dataset.

Since the tfrecords files should not be empty, according to the discussion here: #2 (comment), I suspect that something went wrong during the preparation process, but I am unable to identify the specific issue.

My setup:

OS: Debian testing, Wayland session.
CPU: Intel i5-6300HQ.
RAM: 12 GB.
GPU: GTX 950M.
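
Since every shard under "./data/tfdata/train" is reported as 0 bytes, the training loop most likely has nothing to iterate over, which would match the repeated "0it" progress line. A quick way to confirm this before starting train.py is to list the zero-byte shards. The sketch below uses only the standard library; find_empty_shards is a hypothetical helper name, and the demo runs against a temporary directory rather than the real data path.

```python
import tempfile
from pathlib import Path


def find_empty_shards(directory, pattern="part_*.tfrecords"):
    """Sorted names of shard files in directory that are 0 bytes."""
    return sorted(
        p.name for p in Path(directory).glob(pattern) if p.stat().st_size == 0
    )


# Demo on a throwaway directory: one empty shard, one non-empty shard.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part_000.tfrecords").write_bytes(b"")
    (Path(tmp) / "part_001.tfrecords").write_bytes(b"data")
    empty = find_empty_shards(tmp)

print(empty)
```

On the real setup the call would be find_empty_shards("./data/tfdata/train"). If it returns all 256 names, the problem lies in the preparation notebook (for example in how the small subset was split between train and test) rather than in training speed.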
