Comments (15)
Hi,
I listened to your results; they didn't sound good. The decoder alignment hasn't converged successfully, so the model can't generate intelligible speech. I believe mean-standard normalization is an important step to facilitate convergence of the model, so it's advisable to add it. For your reference, in my experiments the alignment converges within the first 3k training steps.
Here are some of my samples from the inference stage.
samples.zip
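The mean-standard normalization mentioned above might look like the sketch below. This is a plain-Python illustration, not code from the repo; the function and variable names are hypothetical.

```python
# Illustrative sketch of mean-variance (z-score) normalization of mel features.
# Names here are hypothetical, not from the nonparaSeq2seqVC repo.

def mean_std_normalize(frames, mean, std, eps=1e-8):
    """Normalize each frame (a list of per-dimension values) with global mean/std."""
    return [
        [(v - m) / (s + eps) for v, m, s in zip(frame, mean, std)]
        for frame in frames
    ]

# toy example: 2 frames of 3-dimensional "mel" features
frames = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
mean = [2.0, 3.0, 4.0]
std = [1.0, 1.0, 1.0]
normalized = mean_std_normalize(frames, mean, std)
```

The same global mean and std would be applied to every utterance, and the inverse transform applied to the model's output before vocoding.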
from nonparaseq2seqvc_code.
Hi, I used a model at step 59000, and the total VC loss had dropped to around 1.2, but all inference samples came out almost null. They looked like this:
My question is: how many steps does it take to train the model, and what level should the loss reach?
I get almost the same results as yours, @JRMeyer. Have you solved the problem?
My test results:
test_samples.zip
I recommend filtering out some long sentences from your training dataset and trying the suggestion from here.
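Filtering long utterances from a training list could be sketched as below. The list format and frame threshold are assumptions for illustration, not the repo's actual format.

```python
# Hypothetical sketch of filtering a training file list by utterance length.
# The (utterance_id, n_frames) format and the threshold are assumed.

MAX_FRAMES = 800  # drop utterances longer than this many mel frames

def filter_long_utterances(file_list, max_frames=MAX_FRAMES):
    """Keep only entries whose frame count fits the budget."""
    return [entry for entry in file_list if entry[1] <= max_frames]

train_list = [("utt_0001", 420), ("utt_0002", 1350), ("utt_0003", 799)]
kept = filter_long_utterances(train_list)  # utt_0002 is dropped
```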
In the pre-train folder, I use a decay rate of 0.95 per epoch and discard training samples whose frame length exceeds 800. The inferred results are beginning to make sense but still sound unnatural.
samples_55000_loss0.94.zip
What might the problem be?
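The per-epoch decay of 0.95 described above is a simple exponential schedule; a minimal sketch, with an assumed starting learning rate:

```python
# Sketch of per-epoch exponential learning-rate decay (factor 0.95).
# The initial learning rate is an assumed placeholder, not a repo value.

def lr_at_epoch(initial_lr, decay=0.95, epoch=0):
    """Learning rate after `epoch` epochs of multiplicative decay."""
    return initial_lr * (decay ** epoch)

lr0 = 1e-3  # assumed starting learning rate
schedule = [lr_at_epoch(lr0, 0.95, e) for e in range(3)]
```

In PyTorch, the same effect is usually obtained with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)` stepped once per epoch.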
Hi, the alignment didn't converge, so the model is unable to generate meaningful sounds. That's strange. Have you trained the model on a sufficiently large training set? The batch size should also be large enough (preferably >= 32) to help the alignment converge. Could you provide more details about your training?
> In the pre-train folder, I use a decay rate of 0.95 per epoch and discard training samples whose frame length exceeds 800. The inferred results are beginning to make sense but still sound unnatural.
> samples_55000_loss0.94.zip
Hi, in the feature extraction process, I trimmed silence using librosa.effects.trim, and I used 80-dimensional mel-spectrograms as specified in hparams.py.
The text looks like this:
The mean and standard deviation are calculated with a running method, so they are global.
The mean and variance look like this:
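The "running" global statistics described above can be sketched by accumulating sums and squared sums over all frames of all utterances, then deriving mean and variance at the end. This is one common way to do it; the repo's exact method may differ.

```python
# Sketch: global running mean/variance over all frames of all utterances.
# Shown for a single feature dimension; per-dimension stats work the same way.

def global_mean_var(utterances):
    """utterances: list of utterances, each a list of per-frame feature values."""
    n, total, total_sq = 0, 0.0, 0.0
    for utt in utterances:
        for v in utt:
            n += 1
            total += v
            total_sq += v * v
    mean = total / n
    var = total_sq / n - mean * mean  # E[x^2] - (E[x])^2
    return mean, var

mean, var = global_mean_var([[1.0, 2.0], [3.0]])
```

Because the accumulators run across the whole corpus, the resulting mean and variance are global rather than per-utterance.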
In pre-train/model/layers.py, lines 353-354, I changed the code to

    self.initialize_decoder_states(memory,
        mask=(1 - get_mask_from_lengths(memory_lengths)))

because I found that ~ is a bitwise NOT, so ~1 gives 254 (on a uint8 mask).
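The wrap-around behind this bug can be reproduced with plain Python: ~1 is -2 in two's complement, and reinterpreted as an unsigned 8-bit value (like an element of an old-style torch ByteTensor mask) that reads back as 254, while the 1 - mask form stays in {0, 1}.

```python
import ctypes

# ~ is bitwise NOT: on a signed int, ~1 == -2; reinterpreted as an unsigned
# 8-bit value (as in an old torch uint8/ByteTensor mask), -2 wraps to 254.
byte_not = ctypes.c_uint8(~1).value   # 254, not 0

# subtracting from 1 keeps a 0/1 mask valid regardless of integer dtype
arith_not = [1 - m for m in [0, 1, 1, 0]]
```

On a genuine boolean tensor (torch.bool, available in newer PyTorch), ~ performs logical negation and the original code is fine; the problem only arises when the mask is an integer dtype.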
I can't spot any other difference from the source code, so would you please send me a copy of your training text and phn files? @jxzhanggg
> In pre-train/model/layers.py, lines 353-354, I changed the code to
>
>     self.initialize_decoder_states(memory,
>         mask=(1 - get_mask_from_lengths(memory_lengths)))
>
> because I found that ~ is a bitwise NOT, so ~1 gives 254 (on a uint8 mask).
I tested the code with Python 3.5 and torch 1.4, and that's true.
It's a strange bug, because ~1 gives the correct 0 when using Python 2.7 and torch 1.0.1.
I'm not sure whether it's caused by the Python version or the torch version; what is your experiment environment? A wrong mask like this will definitely put the model out of order.
So I suspect there are other unrecognized bugs like this causing the experiments to fail.
As for the training lists, I'm glad to provide them, but I can't access those files these days (I'm not at the university now). And I believe it's not the phn files' fault.
I conducted the experiment on Ubuntu 16.04, using PyTorch 1.3.1 and Python 3.7.
For a boolean tensor, ~True gives False and ~False gives True.
I will debug further.
Please send me a copy of the text and phn files; [email protected] and [email protected] are both fine. Thank you.
After modifying the bitwise NOT (~), the model began to converge to reasonable speech.
One remaining problem is that the inferred result doesn't keep the speaking style from the speaker embeddings, which means the style is not well disentangled. I will try some experiments to disentangle style and content based on your work. Thank you very much for your patient replies.
> After modifying the bitwise NOT (~), the model began to converge to reasonable speech.
> One remaining problem is that the inferred result doesn't keep the speaking style from the speaker embeddings, which means the style is not well disentangled. I will try some experiments to disentangle style and content based on your work. Thank you very much for your patient replies.
Hi, I'm glad you got it working.
Have you tried a VAE loss to further disentangle the content embedding from the speaker embedding?
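For reference, the KL term of a VAE loss, which pushes the content posterior toward a standard normal and is one common way to encourage disentanglement, is the standard closed-form expression below. This is a generic sketch, not code from the repo.

```python
import math

# Standard VAE KL divergence between N(mu, exp(log_var)) and N(0, 1),
# summed over embedding dimensions (sketch only, plain Python).

def kl_divergence(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian posterior."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )

kl_zero = kl_divergence([0.0, 0.0], [0.0, 0.0])  # posterior equals the prior
```

This term would be added to the reconstruction loss with a weight, as in a beta-VAE, so the content embedding carries as little speaker information as the reconstruction allows.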