Comments (7)
Hello @bkumardevan07, did you solve the problem?
from transformertts.
No, but by increasing the batch size I have seen this problem fade away, though I am still trying to get reasonable output.
Please note that I had modified the architecture to include multiple speakers, and the pad value may not necessarily be the reason for the above plots.
Hope it helps
Thanks for your reply. Someone told me guided attention can help; have you tried it yet?
Hi @bkumardevan07
If you start with r=1 you will most likely not get the alignment between text and audio. You can observe this in TensorBoard in the last layer: if the heads (or at least one head) in the last layer do not look roughly diagonal, then the model will fail to produce any reasonable output. You should see some initial alignments in the first 6K to 30K steps or so, depending on the config.
Hi @sanghuynh1501
Yes, it does, massively. When training on a new dataset (very clean, well curated) I wasn't getting any alignments until I forced a diagonal loss. You can find the code for this under the dev branch. I'm still messing around with a lot of new things, so it might take a while before I bring this to master, but I definitely recommend trying it out.
Generally, to both of you, I recommend using the autoregressive model to extract the durations you need for the forward model. In the next version of the repo the autoregressive prediction will likely be removed entirely.
I hope this helps! If you have any other questions feel free to ask, or open a new discussion. @bkumardevan07 I am also about to include multispeaker support (in dev there is pitch prediction too, forward model only); maybe it would be useful to open a discussion on the topic under the discussion panel.
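For reference, a diagonal loss of the kind mentioned above can be sketched in a few lines. This is a minimal NumPy version of the guided-attention penalty from the DCTTS paper (Tachibana et al.), not the actual code in the dev branch; the function names and the width parameter `g` are my own choices:

```python
import numpy as np

def guided_attention_weights(text_len, mel_len, g=0.2):
    """Penalty matrix in the style of the DCTTS guided-attention loss:
    near zero on the diagonal, approaching 1 far away from it."""
    n = np.arange(text_len) / text_len   # normalized text positions
    t = np.arange(mel_len) / mel_len     # normalized mel positions
    return 1.0 - np.exp(-((n[:, None] - t[None, :]) ** 2) / (2.0 * g ** 2))

def diagonal_attention_loss(attention, g=0.2):
    """Mean penalty over one attention map of shape (text_len, mel_len).
    Roughly diagonal maps give a small loss; off-diagonal maps a large one."""
    w = guided_attention_weights(*attention.shape, g=g)
    return float(np.mean(attention * w))
```

Adding this term (with a small weight) to the training loss pushes the encoder-decoder attention toward a monotonic, roughly diagonal alignment early in training.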
You can observe this in TensorBoard in the last layer: if the heads (or at least one head) in the last layer do not look roughly diagonal, then the model will fail to produce any reasonable output.
That's a really interesting finding. But I also observe that when training an autoregressive TTS model with multiple encoder-decoder attentions (either multi-layer, multi-headed, or both), diagonal alignment tends to appear in the shallower layers first; the deeper layers may not learn any diagonal alignments at all. Though by other methods like dropping heads, we can induce more diagonal alignments (sometimes up to 12/16 of the attention heads are diagonal).
But do you have some intuitive idea why the last attention layer is "critical"? Maybe an attention layer that learns no alignment at all brings messy information from the encoder outputs to the decoder and hurts its performance?
And I am curious: since a stable alignment is desired, and more heads or more layers of attention introduce uncertainty, would it be a better idea to use a single attention in the model, like DCTTS or Tacotron? In those models, diagonal alignment is always learned in the last attention layer, since there is only one attention.
Thank you!
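To make the "roughly diagonal" judgment quantitative when counting heads like this, one could score each head with a simple heuristic, e.g. how far each decoder step's argmax strays from a linear alignment. This is a sketch of my own, not a metric used by the repo:

```python
import numpy as np

def diagonality(attention):
    """Rough diagonality score for one head's attention map of shape
    (decoder_len, encoder_len): 1.0 when every decoder step attends to
    the encoder position expected under a linear text-audio alignment,
    lower as the argmax drifts off that diagonal. A heuristic only."""
    dec_len, enc_len = attention.shape
    expected = np.arange(dec_len) * enc_len / dec_len  # ideal diagonal
    actual = attention.argmax(axis=1)                  # where each step looks
    return 1.0 - np.mean(np.abs(actual - expected)) / enc_len

def count_diagonal_heads(heads, threshold=0.9):
    """How many of the given heads (a list of 2-D maps) look roughly diagonal."""
    return sum(diagonality(a) >= threshold for a in heads)
```

Scoring all heads per layer this way would let the "12/16 heads are diagonal" observation be tracked automatically over training instead of eyeballed in TensorBoard.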
In the Neural speech ... paper, the ablation study mentions that, because of the residual connections, the layers act like terms of a Taylor expansion approximating the final function. Hence the initial layers learn the low-frequency information/structure, and as the layers go deeper, the model tries to learn the higher-order/finer details of the function. Recent literature has shown that some models find it difficult to learn high-frequency information, or that a model first learns the low-frequency information and only learns the high-frequency information as it starts converging. In that context, I think the model failing to learn alignments in the last layer could be due to convergence. It might also be difficult to learn because of noisy data.
Regarding your question on why the last layer is critical, I am also unsure, because we are anyway using skip connections that should take care of this. I also have one question: why is query concatenation necessary? We are anyway using skip connections, which will add the query information.
About your question on the number of attention heads, I think you are right (based on the experimental results): adding more heads might bring uncertainty. In recent papers, authors have started using only two attention heads (because of the additional speaker vector concatenated with the spectrogram).
Looking forward to more insights from someone.
I also have one question: why is query concatenation necessary? We are anyway using skip connections, which will add the query information.
In the Transformer architecture, multi-head attention is wrapped in a residual connection and layer normalization, so theoretically the query and the attention output are already fused together by addition.
Concatenating the query and (Attention * Values) before applying the output affine layer is somewhat more expressive than simply adding. As soobinseo/Transformer-TTS mentions, concatenating the query is very important; maybe that choice is based on experimental results. In Deep Voice 3, however, the attention output and the query are fused by simply adding.
Shallower layers learning low-frequency information, and the Taylor-expansion analogy, are intuitive. Thank you.
Nvidia has a model, Centaur, which removes self-attention and uses only cross-attention (with a single cross-attention head). https://github.com/cpuimage/Transformer-TTS
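The two fusion schemes being compared can be illustrated in a few lines of NumPy. Dimensions are toy-sized and the weights are random stand-ins for learned affine layers (`w_add`, `w_cat` are my own names, not from either codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # model dimension (toy size)
query = rng.standard_normal((1, d))     # decoder query vector
context = rng.standard_normal((1, d))   # attention output, softmax(QK^T) V

# Residual-add fusion (Deep Voice 3 style): query and context are summed,
# so the affine layer is forced to treat them symmetrically.
w_add = rng.standard_normal((d, d))
fused_add = (query + context) @ w_add           # shape (1, d)

# Query-concat fusion (soobinseo/Transformer-TTS style): the affine layer
# sees query and context separately and can weight them independently,
# which is why concatenation is somewhat more expressive than adding.
w_cat = rng.standard_normal((2 * d, d))
fused_cat = np.concatenate([query, context], axis=-1) @ w_cat  # shape (1, d)
```

Note that add-fusion is a special case of concat-fusion (stack the same weight matrix twice), so the concat variant can only be at least as expressive, at the cost of a larger projection.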