
Comments (9)

iehppp2010 commented on September 21, 2024

@wookladin

I have tried to reproduce this paper.
My training mel MSE loss reaches 0.18, while the dev mel loss plateaus at 0.75.
[image: train/dev mel loss curves]

After training the 'Decoder' model, I used it to do GTA fine-tuning on the HiFi-GAN model you provide.
Below is the HiFi-GAN fine-tuning loss:
[image: HiFi-GAN fine-tuning loss curves]
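(For readers unfamiliar with GTA fine-tuning, here is a minimal sketch of the idea; `decoder` and the batch layout are assumptions, not the actual assem-vc entry points. The decoder runs teacher-forced so its predicted mels stay time-aligned with the ground-truth waveform, and those mels become the vocoder's fine-tuning inputs.)

```python
# Hedged sketch of GTA (ground-truth-aligned) mel generation; names
# are illustrative, not the actual assem-vc API.
import torch

@torch.no_grad()
def dump_gta_mels(decoder, dataloader, out_dir):
    decoder.eval()
    for text, gt_mel, speaker_id, names in dataloader:
        # Teacher forcing keeps pred_mel frame-aligned with gt_mel,
        # so HiFi-GAN can be fine-tuned against the original waveform.
        pred_mel = decoder(text, gt_mel, speaker_id)
        for mel, name in zip(pred_mel, names):
            torch.save(mel.cpu(), f"{out_dir}/{name}.pt")
```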

After that, I tried to control speaker identity by simply switching the speaker embedding to the target speaker, which is the way described in the paper.
I used a training audio clip of the CSD female speaker as the reference audio (link below):
https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used speaker PAMR from the NUS-48E dataset as the target speaker:
https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is:
https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing

I found that the lyrics are hard to hear clearly.
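(For context, a minimal sketch of the conversion step being described, assuming hypothetical function signatures rather than the actual assem-vc API: the source audio's linguistic features are kept, and only the speaker embedding index is swapped.)

```python
# Hedged sketch of speaker-identity switching at inference time.
import torch

@torch.no_grad()
def convert(cotatron, decoder, vocoder, src_text, src_mel, target_speaker_id):
    # Speaker-independent linguistic features from the source audio
    # (signature is an assumption for illustration).
    ling_features, alignment = cotatron(src_text, src_mel)
    # Decode conditioned on the *target* speaker's embedding index.
    converted_mel = decoder(ling_features, speaker_id=target_speaker_id)
    return vocoder(converted_mel)  # waveform
```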

My dataset config:
devset:
CSD speaker: the three audio files en48/en49/en50 were chosen;
NUS-48E speakers: ADIZ's 13 and JLEE's 05 were chosen.
trainset: all other songs in CSD and NUS-48E.

My speaker embedding dimension is 256. (Could 256 be too large?)

What could be the problem with my model?
Also, could you share your Decoder model's train/dev loss?
My Decoder model gets a noticeably larger mel MSE loss on the devset than on the trainset.


980202006 commented on September 21, 2024

Are there any details about the speaker embedding? For example: what model is used to generate it, whether it is pre-trained, and what dataset is used.


wookladin commented on September 21, 2024

@AK391
Thanks for your interest!
Currently, we don't have a specific plan to release the code for that paper.
We will add links to the paper and demo page to the README soon.

@980202006
We just used nn.Embedding without pre-training. Thanks!
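(A minimal sketch of what that looks like, with an assumed speaker count; the table is simply trained jointly with the rest of the model, with no pre-trained speaker encoder. The 256-dim size matches what @iehppp2010 reported using.)

```python
# Hedged sketch: a plain learned speaker-embedding lookup table.
import torch
import torch.nn as nn

n_speakers = 20                    # assumption, e.g. CSD + NUS-48E singers
spk_emb = nn.Embedding(n_speakers, 256)

speaker_ids = torch.tensor([0, 3, 7])   # batch of speaker indices
e = spk_emb(speaker_ids)                # (3, 256), learned end-to-end
```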


980202006 commented on September 21, 2024

Thank you!


wookladin commented on September 21, 2024

@iehppp2010
Hi. Your alignment encoder, 'Cotatron', doesn't seem to be working properly.
As explained in the paper, we transferred Cotatron from pre-trained weights, which were trained on LibriTTS and VCTK.
Did you transfer from those weights?
You can find the pre-trained weights in this Google Drive link.


iehppp2010 commented on September 21, 2024

@wookladin
Thanks for your quick reply.
I did use the pre-trained weights.
When I train the 'Decoder' model, the 'Cotatron' aligner is frozen.
I found the plotted alignment is not as good as that of other TTS models, e.g., Tacotron 2.
[image: alignment plot]
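(A minimal sketch of the freezing step mentioned above; `cotatron` and `decoder` are illustrative handles, not the actual assem-vc code.)

```python
# Hedged sketch: freeze the Cotatron aligner while the Decoder trains.
import torch

cotatron.eval()                    # fix BatchNorm/Dropout behavior
for p in cotatron.parameters():
    p.requires_grad_(False)        # exclude from gradient updates

# Optimize only the Decoder's parameters.
optimizer = torch.optim.Adam(decoder.parameters(), lr=3e-4)
```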

Do I need to fine-tune the 'Cotatron' model on the singing dataset to get better alignment?
Looking forward to your reply.


wookladin commented on September 21, 2024

@iehppp2010
Yes.
You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data.
That should give better alignment and sample quality.
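(A minimal sketch of warm-starting from the speech-pretrained checkpoint before training on singing data; the checkpoint filename and keys are assumptions, not the actual assem-vc scripts.)

```python
# Hedged sketch: load the LibriTTS/VCTK-pretrained Cotatron weights,
# then continue training on the singing dataset.
import torch

ckpt = torch.load("cotatron_pretrained.ckpt", map_location="cpu")
cotatron.load_state_dict(ckpt["state_dict"], strict=False)

# A reduced learning rate is a common choice when fine-tuning, so the
# attention adapts to long sustained phonemes without drifting too far.
optimizer = torch.optim.Adam(cotatron.parameters(), lr=1e-4)
```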


iehppp2010 commented on September 21, 2024

> @iehppp2010 Yes. You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data. That should give better alignment and sample quality.

@wookladin
Thanks for your quick reply.
After I fine-tuned the Cotatron model, train.loss_reconstruction converged to about 0.2, while val.loss_reconstruction reached its minimum of about 0.5 at step 3893.

[image: Cotatron reconstruction loss curves]

I used that checkpoint to train the Decoder model and fine-tune the HiFi-GAN vocoder.

I found that when testing with audio the fine-tuned Cotatron model has never seen, I can't get good sample quality.
I guess the reason is that the Cotatron model produces poor alignment...

So, how can I get the Cotatron model to produce better alignment on unseen singing audio?
Also, could you provide more training details?
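(One way to diagnose this is to plot the attention matrix on held-out singing clips; a clean, mostly monotonic band usually correlates with intelligible output. A minimal sketch, assuming the alignment is available as a 2-D array:)

```python
# Hedged sketch: visualize an attention alignment matrix.
import matplotlib.pyplot as plt

def plot_alignment(alignment, path="alignment.png"):
    # alignment: (text_len, mel_len) attention weights from the aligner
    plt.figure(figsize=(8, 4))
    plt.imshow(alignment, aspect="auto", origin="lower", interpolation="none")
    plt.xlabel("Decoder timestep (mel frames)")
    plt.ylabel("Encoder timestep (phonemes)")
    plt.colorbar()
    plt.savefig(path)
    plt.close()
```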


betty97 commented on September 21, 2024

@iehppp2010, I am also trying to reproduce the results of this paper. I have one doubt regarding the dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds"; did you do that manually for both CSD and NUS-48E, or in some other way? Thanks!!
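(One plausible way to produce 1-12 s segments is sketched below, under the assumption of silence-based splitting with librosa; this is not the authors' confirmed procedure.)

```python
# Hedged sketch: split at silences, merging chunks toward [min_s, max_s].
import librosa

def split_1_to_12s(path, sr=22050, min_s=1.0, max_s=12.0):
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=40)  # non-silent regions
    segments, cur = [], None
    for start, end in intervals:
        if cur is None:
            cur = [start, end]
        elif (end - cur[0]) / sr <= max_s:
            cur[1] = end                   # merge into the current segment
        else:
            if (cur[1] - cur[0]) / sr >= min_s:
                segments.append(y[cur[0]:cur[1]])
            cur = [start, end]
    if cur is not None and (cur[1] - cur[0]) / sr >= min_s:
        segments.append(y[cur[0]:cur[1]])
    return segments
```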

