Comments (9)
I have tried to reproduce this paper.
My training mel MSE loss reaches 0.18, while the dev mel loss plateaus at 0.75.
After training the 'Decoder' model, I used it to do GTA fine-tuning of the HiFi-GAN model you provide.
Below is the HiFi-GAN fine-tuning loss.
After that, I tried to control speaker identity by simply switching the speaker embedding to that of the target speaker, which is the method described in the paper.
I used an audio clip of the CSD female speaker from the training set as the reference audio (link below).
https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used speaker PAMR from the NUS-48E dataset as the target speaker.
https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is:
https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing
I found that the lyrics are hard to make out.
My dataset config:
devset:
CSD speaker: the three songs en48, en49, and en50 were chosen;
NUS-48E speakers: ADIZ's 13 and JLEE's 05 were chosen.
trainset: the other songs in CSD and NUS-48E.
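The split listed above can be written down as a small filter. This is a minimal sketch, assuming each clip is identified by a (dataset, speaker, song) tuple; the tuple layout and the `split_clips` helper are hypothetical and not taken from the assem-vc code. CSD is treated as a single speaker, so its speaker field is left as `None`.

```python
# Held-out songs, as listed in the comment above.
DEV_SONGS = {
    ("CSD", None, "en48"),
    ("CSD", None, "en49"),
    ("CSD", None, "en50"),
    ("NUS-48E", "ADIZ", "13"),
    ("NUS-48E", "JLEE", "05"),
}


def split_clips(clips):
    """Split (dataset, speaker, song) tuples into train and dev lists.

    Hypothetical helper: a clip goes to dev if its tuple is in DEV_SONGS,
    otherwise to train. Input order is preserved.
    """
    train, dev = [], []
    for clip in clips:
        (dev if clip in DEV_SONGS else train).append(clip)
    return train, dev
```

Keeping the held-out set in one place makes it easy to check that no dev song leaks into the training list.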
My speaker embedding dimension is 256. (Perhaps 256 is too large?)
What could be the problem with my model?
Also, could you share the train/dev loss of your Decoder model?
My Decoder model has a considerably larger mel MSE loss on the dev set than on the training set.
from assem-vc.
Is there any detail about the speaker embedding? For example, which model is used to generate it, whether it is pre-trained, and which dataset is used?
@AK391
Thanks for your interest!
Currently, we don't have a specific plan to release the code of that paper.
We will add links to the paper and demo page to the README soon.
@980202006
We just used nn.Embedding without pre-training. Thanks!
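A speaker lookup table of this kind can be sketched as below. This is a minimal illustration, not the authors' code: the speaker count is a made-up value, the 256-dimensional size is the one mentioned earlier in this thread, and `speaker_vectors` is a hypothetical helper.

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 13  # hypothetical number of speakers in the training set
EMBED_DIM = 256    # dimension mentioned earlier in the thread

# Randomly initialized table with one learnable vector per speaker.
# It would be trained jointly with the rest of the model (no pre-training).
speaker_table = nn.Embedding(NUM_SPEAKERS, EMBED_DIM)


def speaker_vectors(speaker_ids):
    """Look up one embedding vector per integer speaker index."""
    ids = torch.as_tensor(speaker_ids, dtype=torch.long)
    return speaker_table(ids)


source_vec = speaker_vectors([0])  # embedding of the source speaker
target_vec = speaker_vectors([5])  # embedding of a different, target speaker
```

With a table like this, switching the speaker at inference time amounts to passing a different index, which is the kind of speaker control discussed in this thread.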
Thank you!
@iehppp2010
Hi. Your alignment encoder, 'Cotatron', doesn't seem to be working properly.
As explained in the paper, we initialized Cotatron from pre-trained weights, which were trained on LibriTTS and VCTK.
Did you transfer from those weights?
You can find pre-trained weights in this Google Drive link.
@wookladin
Thanks for your quick reply.
I did use the pre-trained weights.
When I train the 'Decoder' model, the 'Cotatron' aligner model is frozen.
I found that the plotted alignment is not as good as that of other TTS models, e.g. Tacotron2.
Do I need to fine-tune the 'Cotatron' model on the singing dataset to get a better alignment?
Looking forward to your reply.
@iehppp2010
Yes.
You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in the singing dataset.
That would give better alignment and sample quality.
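The two stages discussed in this exchange (fine-tune the aligner on singing data, then keep it frozen while training the decoder) can be sketched in PyTorch as follows. The modules are stand-in `nn.Linear` layers, the learning rate is an arbitrary placeholder, and `set_trainable` is a hypothetical helper; none of these come from the assem-vc code.

```python
import torch
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


aligner = nn.Linear(4, 4)  # stand-in for the Cotatron aligner
decoder = nn.Linear(4, 4)  # stand-in for the Decoder

# Stage 1: fine-tune the aligner on the singing dataset.
set_trainable(aligner, True)
aligner_optim = torch.optim.Adam(aligner.parameters(), lr=1e-4)
# ... aligner fine-tuning loop would go here ...

# Stage 2: freeze the fine-tuned aligner and train only the decoder.
set_trainable(aligner, False)
aligner.eval()  # also switches off dropout and similar train-time behaviour
decoder_optim = torch.optim.Adam(decoder.parameters(), lr=1e-4)
# ... decoder training loop would go here ...
```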
@wookladin
Thanks for your quick reply.
After I fine-tuned the Cotatron model, train.loss_reconstruction converged at about 0.2, while val.loss_reconstruction reached its minimum of about 0.5 at step 3893.
I used that checkpoint to train the Decoder model and fine-tune the HiFi-GAN vocoder.
I found that when I test with audio that the fine-tuned Cotatron model has never seen, I can't get good sample quality.
I guess the reason is that the Cotatron model does not produce a good alignment...
So, how can I get the Cotatron model to produce better alignments on unseen singing audio?
Besides, can you provide more training details?
@iehppp2010, I am also trying to reproduce the results of this paper. I have one question about dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds". Did you do this manually for both CSD and NUS-48E, or in some other way? Thanks!!
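The thread does not say how the split was done. One possible automatic approach, offered only as a sketch and not as the authors' method, is to detect voiced intervals first (for example with a silence detector) and then greedily merge neighbouring intervals into segments of 1 to 12 seconds. The `group_intervals` helper and its `max_gap` parameter are hypothetical.

```python
def group_intervals(voiced, min_len=1.0, max_len=12.0, max_gap=0.5):
    """Greedily merge consecutive voiced (start, end) intervals, in seconds.

    An interval joins the current segment only if the merged segment stays
    within max_len and the pause before it is at most max_gap. Segments
    shorter than min_len or longer than max_len are dropped, so a single
    over-long interval is discarded rather than cut mid-phrase.
    """
    segments = []
    cur_start = cur_end = None
    for start, end in voiced:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len and start - cur_end <= max_gap:
            cur_end = end  # extend the current segment
        else:
            segments.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        segments.append((cur_start, cur_end))
    return [(s, e) for s, e in segments if min_len <= e - s <= max_len]
```

The input intervals are assumed to be sorted by start time and non-overlapping. Segments that are dropped would still need manual review if the data is scarce.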