light-speed's People

Contributors

ntt123

light-speed's Issues

Closed

Could you please share your contact information?

Questions About VITS Code Modifications and Model Performance

Hi,
Thanks for your great work!
I'm curious to understand your thought process as a learner. May I ask why you decided to make modifications to the original VITS code?

  1. You mentioned 'robust,' but I'm not quite clear on its exact meaning. Does it refer to the model's performance in different aspects, such as WER (Word Error Rate) or talking speed?

  2. When you talk about 'speech quality,' are you referring to the sound quality of the generated speech? Is it similar to audio quality metrics like PESQ?

  3. Regarding the 'expanding the receptive field of the Wavenet Flow module' modification, how did you analyze the need for this change, and in what ways does it enhance the quality of synthesized speech?

  4. I noticed that the original VITS was trained using PyTorch, but you chose to rewrite some code in TensorFlow. What motivated this decision? Are there specific advantages or requirements that led to this change in the tech stack?
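For context on question 3, the receptive field of a stack of stride-1 dilated 1-D convolutions can be worked out with simple arithmetic. The sketch below is illustrative only: the kernel size, layer count, and dilation rates are assumed values, not the settings of this repository or of the original VITS code.

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field (in frames) of stacked stride-1 dilated 1-D convolutions."""
    # Each layer widens the field by (kernel_size - 1) * dilation frames.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical settings, chosen only for illustration.
no_dilation = receptive_field(kernel_size=5, dilations=[1, 1, 1, 1])
with_dilation = receptive_field(kernel_size=5, dilations=[1, 2, 4, 8])
print(no_dilation, with_dilation)
```

Under these assumed values, growing the dilation per layer widens the context each output frame can see without adding layers.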

The engine skips text

The engine skips text quite often: sometimes it skips a sentence, and sometimes it skips half a paragraph and then reads the next paragraph. The male voice is very natural; if this error can be fixed, it will be almost perfect.
Thanks to the author!

About training

Hello!

I am using your public dataset https://huggingface.co/datasets/ntt123/viet-tts-dataset for training, and I got this error:

2024-04-05 14:36:50:     return forward_call(*args, **kwargs)
2024-04-05 14:36:50:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-05 14:36:50:   File "/data/light-speed/models.py", line 425, in forward
2024-04-05 14:36:50:     z_slice, ids_slice = commons.rand_slice_segments(
2024-04-05 14:36:51:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-05 14:36:51:   File "/data/light-speed/commons.py", line 64, in rand_slice_segments
2024-04-05 14:36:51:     ret = slice_segments(x, ids_str, segment_size)
2024-04-05 14:36:51:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-05 14:36:51:   File "/data/light-speed/commons.py", line 54, in slice_segments
2024-04-05 14:36:51:     ret[i] = x[i, :, idx_str:idx_end]
2024-04-05 14:36:51:     ~~~^^^
2024-04-05 14:36:51: RuntimeError: The expanded size of the tensor (32) must match the existing size (0) at non-singleton dimension 1.  Target sizes: [192, 32].  Tensor sizes: [192, 0]

Have you encountered this error before? Is there a solution? Could the issue be related to how the dataset is named?

The directory structure of my data is as follows:

[Image: Screenshot 2024-04-05 174334]

I look forward to your reply!
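The traceback shows that the slice taken in slice_segments came back with 0 frames where 32 were expected. One possible cause, which is an assumption and not confirmed in this thread, is that some clips are shorter than the segment window. A minimal sketch for flagging such clips before training follows; the function names, the hop length of 256 samples, and the sample counts are all hypothetical.

```python
from typing import Dict, List


def num_frames(num_samples: int, hop_length: int) -> int:
    """Approximate spectrogram frame count for a clip (floor division)."""
    return num_samples // hop_length


def find_short_clips(
    clip_samples: Dict[str, int],
    segment_frames: int = 32,   # taken from "Target sizes: [192, 32]" in the log
    hop_length: int = 256,      # assumed value, not read from the repo config
) -> List[str]:
    """Return names of clips that have fewer frames than the segment window."""
    return [
        name
        for name, n in clip_samples.items()
        if num_frames(n, hop_length) < segment_frames
    ]


# Hypothetical clip lengths in samples.
clips = {"a.wav": 16000, "b.wav": 4000, "c.wav": 8192}
short = find_short_clips(clips)
print(short)
```

If a check like this reports any clips, filtering them out before training would be one way to test the hypothesis.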

Share model weights

Can you share your model weights?

  • Model duration
  • Model for male voice and model for female voice

Thank you

44.1 kHz training config

Hi,
This is the greatest TTS project for Vietnamese I have found so far. Thanks for your work.

I have successfully trained this model at 44.1 kHz by changing sampling_rate in config.json (all other settings unchanged). However, the quality of the synthesized speech is worse than that of the 16 kHz version: it contains a lot of hissing (tiếng rè). Do I need to modify anything else to get better quality at 44.1 kHz, or is there a way to upsample from 16 kHz to 44.1 kHz after inference?

Any help would be appreciated!!
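One thing that can be checked with arithmetic alone: if the hop length (in samples) stays fixed while sampling_rate changes, each spectrogram frame covers a shorter stretch of time. The sketch below shows this; the hop length of 256 is an assumed value, not read from this repository's config.json, and whether this explains the hissing is not established here.

```python
def frame_duration_ms(hop_length: int, sampling_rate: int) -> float:
    """Time covered by one spectrogram hop, in milliseconds."""
    return 1000.0 * hop_length / sampling_rate


HOP_LENGTH = 256  # assumed value for illustration only

at_16k = frame_duration_ms(HOP_LENGTH, 16000)
at_44k = frame_duration_ms(HOP_LENGTH, 44100)
print(f"{at_16k:.2f} ms per frame at 16 kHz, {at_44k:.2f} ms per frame at 44.1 kHz")
```

If the frame-related settings are meant to describe a fixed time span, they would need to scale with the sampling rate as well.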

Training took forever to finish

For testing purposes, I extracted only 200 files (100 pairs) from the VietBibleVox zip data. I then ran the prepare_vbx_tfdata.ipynb notebook, which resulted in the following:

  • The JSON files in "./data/VietBibleVox" directory.
  • The "./data/tfdata/test" directory was created with one file named "part_000.tfrecords" that is approximately 56 MB in size.
  • The "./data/tfdata/train" directory was created with 256 files named "part_*.tfrecords", but all of them are empty (0 bytes).
  • The files "lexicon.dict", "lexicon.txt", "phone_set.json", and "vbx_mfa.zip" are non-empty files.
  • A directory named "MFA" was created in the "$HOME/Documents" directory, with a total size of 86 MB.

Afterwards, I attempted to run "python3 train.py", but the process repeatedly prints "0it [00:00, ?it/s]" to the screen. I waited for approximately 1 hour before interrupting the process. I believe this is an excessively long time for such a small dataset.

Since the tfrecords files should not be empty, according to the discussion here: #2 (comment), I suspect that something went wrong during the preparation process, but I am unable to identify the specific issue.
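A quick way to confirm which shards are empty is to list zero-byte files. This is a minimal sketch using only the Python standard library; the helper name find_empty_shards is hypothetical, and the directories and file pattern are taken from the description above.

```python
from pathlib import Path
from typing import List


def find_empty_shards(directory: str, pattern: str = "part_*.tfrecords") -> List[str]:
    """Return the sorted names of zero-byte files matching pattern in directory."""
    root = Path(directory)
    return sorted(
        p.name
        for p in root.glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    for split in ("train", "test"):
        empty = find_empty_shards(f"./data/tfdata/{split}")
        print(f"{split}: {len(empty)} empty shard(s)")
```

If every training shard is reported as empty, the training loop has nothing to iterate over, which would be consistent with the repeated "0it" progress output.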

My equipment:

  • OS: Debian testing, Wayland session.
  • CPU: Intel i5-6300HQ.
  • RAM: 12 GB.
  • GPU: GTX 950M.
