
Comments (9)

iehppp2010 commented on September 21, 2024

@wookladin

I have tried to reproduce this paper.
My training mel MSE loss reaches 0.18, while the dev mel loss plateaus at 0.75.
[image: train/dev mel loss curves]

After training the 'Decoder' model, I used it to do GTA fine-tuning on the HiFi-GAN model you provide.
Below is the HiFi-GAN fine-tuning loss:
[image: HiFi-GAN fine-tuning loss curves]
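(For readers unfamiliar with GTA fine-tuning, here is a minimal sketch of the idea; `decoder` and the batch layout are assumptions, not the actual assem-vc entry points. The decoder runs teacher-forced so its predicted mels stay time-aligned with the ground-truth waveform, and those mels become the vocoder's fine-tuning inputs.)

```python
# Hedged sketch of GTA (ground-truth-aligned) mel generation; names
# are illustrative, not the actual assem-vc API.
import torch

@torch.no_grad()
def dump_gta_mels(decoder, dataloader, out_dir):
    decoder.eval()
    for text, gt_mel, speaker_id, names in dataloader:
        # Teacher forcing keeps pred_mel frame-aligned with gt_mel,
        # so HiFi-GAN can be fine-tuned against the original waveform.
        pred_mel = decoder(text, gt_mel, speaker_id)
        for mel, name in zip(pred_mel, names):
            torch.save(mel.cpu(), f"{out_dir}/{name}.pt")
```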

After that, I tried to control speaker identity by simply switching the speaker embedding to the target speaker, which is the way described in the paper.
I used a training audio clip of the CSD female speaker as the reference audio (link below):
https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used speaker PAMR from the NUS-48E dataset as the target speaker:
https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is:
https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing

I found that the lyrics are hard to hear clearly.
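(For context, a minimal sketch of the conversion step being described, assuming hypothetical function signatures rather than the actual assem-vc API: the source audio's linguistic features are kept, and only the speaker embedding index is swapped.)

```python
# Hedged sketch of speaker-identity switching at inference time.
import torch

@torch.no_grad()
def convert(cotatron, decoder, vocoder, src_text, src_mel, target_speaker_id):
    # Speaker-independent linguistic features from the source audio
    # (signature is an assumption for illustration).
    ling_features, alignment = cotatron(src_text, src_mel)
    # Decode conditioned on the *target* speaker's embedding index.
    converted_mel = decoder(ling_features, speaker_id=target_speaker_id)
    return vocoder(converted_mel)  # waveform
```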

My dataset config:
devset:
CSD speaker: the three audio files en48/en49/en50 were chosen;
NUS-48E speakers: ADIZ's 13 and JLEE's 05 were chosen.
trainset: all other songs in CSD and NUS-48E.

My speaker embedding dimension is 256. (Could 256 be too large?)

What could be the problem with my model?
Also, could you share your Decoder model's train/dev loss?
My Decoder model gets a noticeably larger mel MSE loss on the devset than on the trainset.


980202006 commented on September 21, 2024

Are there any details about the speaker embedding? For example: what model is used to generate it, whether it is pre-trained, and what dataset is used.


wookladin commented on September 21, 2024

@AK391
Thanks for your interest!
Currently, we don't have a specific plan to release the code for that paper.
We will add links to the paper and demo page to the README soon.

@980202006
We just used nn.Embedding without pre-training. Thanks!
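(A minimal sketch of what that looks like, with an assumed speaker count; the table is simply trained jointly with the rest of the model, with no pre-trained speaker encoder. The 256-dim size matches what @iehppp2010 reported using.)

```python
# Hedged sketch: a plain learned speaker-embedding lookup table.
import torch
import torch.nn as nn

n_speakers = 20                    # assumption, e.g. CSD + NUS-48E singers
spk_emb = nn.Embedding(n_speakers, 256)

speaker_ids = torch.tensor([0, 3, 7])   # batch of speaker indices
e = spk_emb(speaker_ids)                # (3, 256), learned end-to-end
```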


980202006 commented on September 21, 2024

Thank you!


wookladin commented on September 21, 2024

@iehppp2010
Hi. Your alignment encoder, 'Cotatron', doesn't seem to be working properly.
As explained in the paper, we transferred Cotatron from pre-trained weights, which were trained on LibriTTS and VCTK.
Did you transfer from those weights?
You can find the pre-trained weights in this Google Drive link.


iehppp2010 commented on September 21, 2024

@wookladin
Thanks for your quick reply.
I did use the pre-trained weights.
When I train the 'Decoder' model, the 'Cotatron' aligner is frozen.
I found the plotted alignment is not as good as that of other TTS models, e.g., Tacotron 2.
[image: alignment plot]
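(A minimal sketch of the freezing step mentioned above; `cotatron` and `decoder` are illustrative handles, not the actual assem-vc code.)

```python
# Hedged sketch: freeze the Cotatron aligner while the Decoder trains.
import torch

cotatron.eval()                    # fix BatchNorm/Dropout behavior
for p in cotatron.parameters():
    p.requires_grad_(False)        # exclude from gradient updates

# Optimize only the Decoder's parameters.
optimizer = torch.optim.Adam(decoder.parameters(), lr=3e-4)
```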

Do I need to fine-tune the 'Cotatron' model on the singing dataset to get better alignment?
Looking forward to your reply.


wookladin commented on September 21, 2024

@iehppp2010
Yes.
You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data.
That should give better alignment and sample quality.
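(A minimal sketch of warm-starting from the speech-pretrained checkpoint before training on singing data; the checkpoint filename and keys are assumptions, not the actual assem-vc scripts.)

```python
# Hedged sketch: load the LibriTTS/VCTK-pretrained Cotatron weights,
# then continue training on the singing dataset.
import torch

ckpt = torch.load("cotatron_pretrained.ckpt", map_location="cpu")
cotatron.load_state_dict(ckpt["state_dict"], strict=False)

# A reduced learning rate is a common choice when fine-tuning, so the
# attention adapts to long sustained phonemes without drifting too far.
optimizer = torch.optim.Adam(cotatron.parameters(), lr=1e-4)
```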


iehppp2010 commented on September 21, 2024

> @iehppp2010 Yes. You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data. That should give better alignment and sample quality.

@wookladin
Thanks for your quick reply.
After I fine-tuned the Cotatron model, train.loss_reconstruction converged to about 0.2, while val.loss_reconstruction reached its minimum of about 0.5 at step 3893.

[image: Cotatron reconstruction loss curves]

I used that checkpoint to train the Decoder model and fine-tune the HiFi-GAN vocoder.

I found that when testing with audio the fine-tuned Cotatron model has never seen, I can't get good sample quality.
I guess the reason is that the Cotatron model produces poor alignment...

So, how can I get the Cotatron model to produce better alignment on unseen singing audio?
Also, could you provide more training details?
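(One way to diagnose this is to plot the attention matrix on held-out singing clips; a clean, mostly monotonic band usually correlates with intelligible output. A minimal sketch, assuming the alignment is available as a 2-D array:)

```python
# Hedged sketch: visualize an attention alignment matrix.
import matplotlib.pyplot as plt

def plot_alignment(alignment, path="alignment.png"):
    # alignment: (text_len, mel_len) attention weights from the aligner
    plt.figure(figsize=(8, 4))
    plt.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
    plt.xlabel("Decoder timestep (mel frames)")
    plt.ylabel("Encoder timestep (phonemes)")
    plt.colorbar()
    plt.savefig(path)
    plt.close()
```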


betty97 commented on September 21, 2024

@iehppp2010, I am also trying to reproduce the results of this paper. I have one doubt regarding the dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds"; did you do that manually for both CSD and NUS-48E, or in some other way? Thanks!!
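(One plausible way to produce 1-12 s segments is sketched below, under the assumption of silence-based splitting with librosa; this is not the authors' confirmed procedure.)

```python
# Hedged sketch: split at silences, merging chunks toward [min_s, max_s].
import librosa

def split_1_to_12s(path, sr=22050, min_s=1.0, max_s=12.0):
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=40)  # non-silent regions
    segments, cur = [], None
    for start, end in intervals:
        if cur is None:
            cur = [start, end]
        elif (end - cur[0]) / sr <= max_s:
            cur[1] = end                   # merge into the current segment
        else:
            if (cur[1] - cur[0]) / sr >= min_s:
                segments.append(y[cur[0]:cur[1]])
            cur = [start, end]
    if cur is not None and (cur[1] - cur[0]) / sr >= min_s:
        segments.append(y[cur[0]:cur[1]])
    return segments
```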

