Hello, this project is so nice, and thank you for sharing it! I've trained English and

The problem of voice quality and voice conversion about multilingual_text_to_speech HOT 16 CLOSED

tomiinek commented on May 28, 2024

The problem of voice quality and voice conversion

from multilingual_text_to_speech.

Comments (16)

jayzhu02 commented on May 28, 2024

> Hello @zj19980122, I also want to train it on three languages. Could you please tell me the workflow for setting this up?

Hello.

1. Generate a txt file in the same format as the files in /data//.txt, then use prepare_css_spectrograms.py to generate the spectrograms.
2. Generate the config JSON.
3. Train.

For the first two steps you may need to write your own code. You can find most usage details, including how to pass parameters, in each .py file. Follow the process in README.md.

Tomiinek commented on May 28, 2024

Hello 🙂

The parameters seem good to me.

I have a few questions:

1. The sound is OK during training, but the results are much worse during inference. Are they just noisier, or do they have problems with attention/stability and things like that?
2. You have hundreds of speakers per language. How many samples do you have per speaker? You can try to visualize the embedding space to figure out what is going on. I would expect some clusters grouping male/female voices etc. and no distinction across languages.
3. How do you convert the texts into phonemes? How do you manage tones? 👀 Look into this thread #27 (especially at the end).

jayzhu02 commented on May 28, 2024

Hello, thx for your reply!

1. For the first question, I listened to the audio generated in TensorBoard and it sounds OK. But during inference, all the audio has background noise, no matter which speaker it is.
2. The speaker distribution is here. Most of the speakers have 50-100 samples. Since I use some public datasets, it may be hard for me to cluster male/female voices. I'll give it a try.
3. Actually, I looked into this issue before. I use a package called phonemize with some changes of my own. For Chinese, e.g. 你好, I first convert to pinyin with tones (ni3 hao3) and then phonemize it (ni3 xɑu3). For English, I directly use the original function to convert to phonemes, e.g. good evening -> ɡʊd iːvnɪŋ.
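
The Chinese step described here (tone-numbered pinyin in, phonemes with the tone digit kept out) could look roughly like the sketch below. It is only an illustration: the `TOY_IPA` table and the function name are hypothetical, a real pipeline would call a phonemizer backend instead of a hand-written lookup, and tones are assumed to be written as a trailing digit 1-5.

```python
import re

# Toy syllable-to-IPA table, for illustration only (hypothetical values
# taken from the example in this comment); not a real phonemizer.
TOY_IPA = {"ni": "ni", "hao": "xɑu"}

# One pinyin syllable followed by a tone digit (assumed range 1-5).
SYLLABLE = re.compile(r"([a-zü]+)([1-5])")


def phonemize_numbered_pinyin(pinyin: str) -> str:
    """Phonemize each syllable but keep its tone digit attached."""
    out = []
    for syllable in pinyin.split():
        match = SYLLABLE.fullmatch(syllable)
        if match is None:
            raise ValueError(f"not a tone-numbered syllable: {syllable!r}")
        base, tone = match.groups()
        # Look up the toneless syllable, then re-attach the tone number.
        out.append(TOY_IPA[base] + tone)
    return " ".join(out)


print(phonemize_numbered_pinyin("ni3 hao3"))  # ni3 xɑu3
```

Keeping the tone as a separate digit means the digits themselves become symbols the model has to know about.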

Tomiinek commented on May 28, 2024

Hello, I am sorry for a late response.

1. Regarding the noise, it is really weird. Unfortunately, I have no idea what could be the cause 😢
2. The link to the image is broken. Do not bother with the clustering, it was just a debugging idea ...
3. OK, I see. Just to make sure, did you add all the characters you use to the character set in the config file?
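
A quick way to check the last point is to compare the symbols that actually occur in the training texts against the configured character set. The following is a minimal sketch, not code from the project: `find_missing_characters` is a hypothetical helper, and the character set is passed in as a plain string rather than read from the real config file.

```python
def find_missing_characters(texts, charset):
    """Return the sorted characters used in `texts` but absent from `charset`."""
    allowed = set(charset)
    used = set("".join(texts))
    return sorted(used - allowed)


# Hypothetical example: tone digits were left out of the character set.
phoneme_lines = ["ni3 xɑu3"]
configured_charset = "nixɑu "
print(find_missing_characters(phoneme_lines, configured_charset))  # ['3']
```

An empty result means every symbol in the data is covered; anything else lists the symbols that are missing from the set.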

jayzhu02 commented on May 28, 2024

Hello, thx for your reply!

At first, I actually forgot to add the tone numbers to the phoneme character set. After some trials, I found this problem and fixed it. It can pronounce correctly now.
I have another question: would having too many speakers influence other speakers' tone? In fact, I just want to use 4 or 5 speakers, but these speakers have only about 200 samples each. So I tried adding other speakers to make sure they can speak well. But the result shows that the model speaks well, yet the voice doesn't sound like the speaker himself/herself.
Regarding the noise, I am now trying to solve it by denoising the inference audio or retraining a WaveRNN.

Tomiinek commented on May 28, 2024

Hello again 🙂

> I have another question: would having too many speakers influence other speakers' tone? In fact, I just want to use 4 or 5 speakers, but these speakers have only about 200 samples each. So I tried adding other speakers to make sure they can speak well. But the result shows that the model speaks well, yet the voice doesn't sound like the speaker himself/herself.

Well, 200 samples per speaker is not much. I was also not able to get a stable and accurate voice for speakers with few examples (I used these low-resource speakers just to help the model disentangle language and voice).
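
To see which speakers fall into this low-resource range, one can count samples per speaker from the metadata. This is a minimal sketch, assuming the pipe-separated train.txt layout quoted elsewhere in this thread, with the speaker ID in the second field. The function names and the threshold are illustrative, not part of the project.

```python
from collections import Counter


def samples_per_speaker(metadata_lines):
    """Count utterances per speaker; the speaker ID is assumed to be field 2."""
    counts = Counter()
    for line in metadata_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        counts[fields[1]] += 1
    return counts


def low_resource_speakers(counts, min_samples):
    """Return the speakers with fewer than `min_samples` utterances."""
    return sorted(s for s, n in counts.items() if n < min_samples)


# Hypothetical metadata lines, shortened for illustration.
lines = [
    "utt1|spk_a|zh|a1.wav|a1.npy|a1_lin.npy|text",
    "utt2|spk_a|zh|a2.wav|a2.npy|a2_lin.npy|text",
    "utt3|spk_b|en|b1.wav|b1.npy|b1_lin.npy|text",
]
counts = samples_per_speaker(lines)
print(low_resource_speakers(counts, min_samples=2))  # ['spk_b']
```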

stale commented on May 28, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

krigeta commented on May 28, 2024

Hello @zj19980122, I also want to train it on three languages. Could you please tell me the workflow for setting this up?

krigeta commented on May 28, 2024

Hello, so far I have been able to set up the directories, and for now I am also using the CSS10 dataset to get an understanding of how it works.

Now, before running the prepare_css_spectrograms.py file, I have some questions in mind:

1. I set up the comvoi.zip data, which is in five languages. Why are we setting that up to create the spectrograms? I think these are the languages we later use to get the output. Please correct me on this.
2. I am using 3 datasets (Japanese, German and Chinese) from the CSS10 dataset, and this dataset is used to train the model, right?
3. How does the workflow work? For example, if I have a Japanese speaker with their accent, how is that Japanese speaker able to speak Chinese, and which accent will they use?

I asked a lot, sorry. I am new to this, but this project is so awesome that I want to learn. Please help.

jayzhu02 commented on May 28, 2024

> Hello, so far I have been able to set up the directories, and for now I am also using the CSS10 dataset to get an understanding of how it works.
>
> Now, before running the prepare_css_spectrograms.py file, I have some questions in mind:
>
> 1. I set up the comvoi.zip data, which is in five languages. Why are we setting that up to create the spectrograms? I think these are the languages we later use to get the output. Please correct me on this.
> 2. I am using 3 datasets (Japanese, German and Chinese) from the CSS10 dataset, and this dataset is used to train the model, right?
> 3. How does the workflow work? For example, if I have a Japanese speaker with their accent, how is that Japanese speaker able to speak Chinese, and which accent will they use?
>
> I asked a lot, sorry. I am new to this, but this project is so awesome that I want to learn. Please help.

Sorry, I can't clearly understand your first question.

If you want to use CSS10 as the dataset, you should create the corresponding txt files and the spectrograms, since the model needs these to train.

For the third question: an audio recording contains both the speaker's accent and the pronunciation of the words, and the MTS model can separate them to train both a speaker embedding and a language embedding. So if your dataset has multiple speakers and languages, you can easily do the voice cloning your question mentions. You can find an example in notebooks/code_switching_demo.ipynb.

krigeta commented on May 28, 2024

I am confused about how to set up those files, sir. If possible, may I contact you on another channel, like Discord or email, so I can explain every bit of it? Or should I share it here?

jayzhu02 commented on May 28, 2024

> I am confused about how to set up those files, sir. If possible, may I contact you on another channel, like Discord or email, so I can explain every bit of it? Or should I share it here?

If you're asking about train.txt, here is an example:
css10_ja-css10_ja-meian_1015|css10_ja|ja|/data/css10_ja/meian/meian_1015.wav|../spectrograms/css10_ja-css10_ja-meian_1015.npy|../linear_spectrograms/css10_ja-css10_ja-meian_1015.npy|小林は覗き込むように見て云った。僕もそっちへ行くよ。彼らの行く方角には|ko̞bäjäɕi hä no̞zo̞ki ko̞mɯᵝ jo̞ɯᵝni mite̞ iʔ tä 。 bo̞kɯᵝ mo̞so̞ttɕihe̞ ikɯᵝ jo̞。 käɽe̞ɽä no̞ jɯᵝkɯᵝe̞ käkɯᵝ nihä

spectrogram_path and linear_spectrogram_path will be automatically generated by prepare_css_spectrograms.py, and the phoneme field is optional. So you should make sure your metadata has a similar format.
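
To check that a metadata file follows this layout, each line can be split on `|` and validated. The sketch below is an assumption-laden illustration: apart from `spectrogram_path` and `linear_spectrogram_path`, the field names are guesses read off the example line, `MetadataEntry` and `parse_metadata_line` are hypothetical, and whether the project's own loader accepts a line without the final phoneme field is not confirmed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetadataEntry:
    utterance_id: str
    speaker: str
    language: str
    audio_path: str
    spectrogram_path: str
    linear_spectrogram_path: str
    text: str
    phonemes: Optional[str] = None  # optional last field


def parse_metadata_line(line: str) -> MetadataEntry:
    """Split one pipe-separated metadata line into named fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) not in (7, 8):
        raise ValueError(f"expected 7 or 8 fields, got {len(fields)}")
    entry = MetadataEntry(*fields)
    if entry.phonemes == "":
        entry.phonemes = None  # treat an empty phoneme field as missing
    return entry


# Hypothetical, shortened line in the same layout as the example above.
entry = parse_metadata_line("utt1|css10_ja|ja|a.wav|a.npy|a_lin.npy|text|fonemes")
print(entry.speaker, entry.language)  # css10_ja ja
```

Note that this simple split assumes the transcript itself never contains a `|` character.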

If you are confused about the parameters in the config JSON, I suggest you contact the author. 🙂

krigeta commented on May 28, 2024

Thank you so much for the explanation. I sent an email to the author and he replied to the issues, hurray! ~~even bring up the issues but no response from anyone~~ I was going through the issues and found that you were also able to train the model and make it work, so I thought it would be a great idea to ask you here. And yes, you cleared up things I didn't even know, so that's why I am here.

I will try to set things up as far as the step you described. May I ask you again if I get stuck on some steps?

My goal is to make a Japanese speaker able to speak English and Hindi.

jayzhu02 commented on May 28, 2024

> Thank you so much for the explanation. I sent an email to the author and even brought up the issues, but got no response from anyone. I was going through the issues and found that you were also able to train the model and make it work, so I thought it would be a great idea to ask you here. And yes, you cleared up things I didn't even know, so that's why I am here.
>
> I will try to set things up as far as the step you described. May I ask you again if I get stuck on some steps?
>
> My goal is to make a Japanese speaker able to speak English and Hindi.

No problem. Feel free to ask🙂.

Tomiinek commented on May 28, 2024

Thank you @zj19980122 for your help. Would you mind continuing the discussion in #48 to keep similar topics together?

jayzhu02 commented on May 28, 2024

> Thank you @zj19980122 for your help. Would you mind continuing the discussion in #48 to keep similar topics together?

My pleasure.
