Comments (7)
Hello @bkumardevan07, did you solve the problem?
from transformertts.
No, but by increasing the batch size I have seen this problem fade away, though I am still trying to get reasonable output.
Please note that I had modified the architecture to include multiple speakers, and the pad value may not necessarily be the reason for the above plots.
Hope it helps
Thanks for your reply. Someone told me guided attention can help; have you tried it yet?
Hi @bkumardevan07
If you start with r=1 you will most likely not get the alignment between text and audio. You can observe this in TensorBoard in the last layer: if the heads (or at least one head) in the last layer do not look roughly diagonal, then the model will fail to produce any reasonable output. You should see some initial alignments in the first 6K to 30K steps or so, depending on the config.
Hi @sanghuynh1501
Yes, it does, massively. When training on a new dataset (very clean, well curated) I wasn't getting any alignments until I forced a diagonal loss. You can find the code for this under the dev branch. I'm still messing around with a lot of new things, so it might take a while before I bring this to master, but I definitely recommend trying it out.
Generally, to both of you, I recommend using the autoregressive model to extract the durations you need for the forward model. In the next version of the repo the autoregressive prediction will likely be removed entirely.
I hope this helps! If you have any other questions feel free to ask, or open a new discussion. @bkumardevan07 I am also about to include multispeaker support (in dev there is pitch prediction too, forward model only); maybe it would be useful to open a discussion on the topic under the discussion panel.
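For reference, a diagonal loss of the kind mentioned above can be sketched in a few lines. This is a minimal NumPy version of the guided-attention penalty from the DCTTS paper (Tachibana et al.), not the actual code in the dev branch; the function names and the width parameter `g` are my own choices:

```python
import numpy as np

def guided_attention_weights(text_len, mel_len, g=0.2):
    """Penalty matrix in the style of the DCTTS guided-attention loss:
    near zero on the diagonal, approaching 1 far away from it."""
    n = np.arange(text_len) / text_len   # normalized text positions
    t = np.arange(mel_len) / mel_len     # normalized mel positions
    return 1.0 - np.exp(-((n[:, None] - t[None, :]) ** 2) / (2.0 * g ** 2))

def diagonal_attention_loss(attention, g=0.2):
    """Mean penalty over one attention map of shape (text_len, mel_len).
    Roughly diagonal maps give a small loss; off-diagonal maps a large one."""
    w = guided_attention_weights(*attention.shape, g=g)
    return float(np.mean(attention * w))
```

Adding this term (with a small weight) to the training loss pushes the encoder-decoder attention toward a monotonic, roughly diagonal alignment early in training.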
You can observe this in TensorBoard in the last layer: if the heads (or at least one head) in the last layer do not look roughly diagonal, then the model will fail to produce any reasonable output.
That's a really interesting finding. But I also observe that when training an autoregressive TTS model with multiple encoder-decoder attentions (either multi-layer, multi-headed, or both), diagonal alignment tends to appear in the shallower layers first; the deeper layers may not learn any diagonal alignments at all. Though by other methods like dropping heads, we can induce more diagonal alignments (sometimes up to 12/16 of the attention heads are diagonal).
But do you have some intuitive idea why the last attention layer is "critical"? Maybe an attention layer that learns no alignment at all brings messy information from the encoder outputs to the decoder and hurts its performance?
And I am curious: since a stable alignment is desired, and more heads or more layers of attention introduce uncertainty, would it be a better idea to use a single attention in the model, like DCTTS or Tacotron? In those models, diagonal alignment is always learned in the last attention layer, since there is only one attention.
Thank you!
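To make the "roughly diagonal" judgment quantitative when counting heads like this, one could score each head with a simple heuristic, e.g. how far each decoder step's argmax strays from a linear alignment. This is a sketch of my own, not a metric used by the repo:

```python
import numpy as np

def diagonality(attention):
    """Rough diagonality score for one head's attention map of shape
    (decoder_len, encoder_len): 1.0 when every decoder step attends to
    the encoder position expected under a linear text-audio alignment,
    lower as the argmax drifts off that diagonal. A heuristic only."""
    dec_len, enc_len = attention.shape
    expected = np.arange(dec_len) * enc_len / dec_len  # ideal diagonal
    actual = attention.argmax(axis=1)                  # where each step looks
    return 1.0 - np.mean(np.abs(actual - expected)) / enc_len

def count_diagonal_heads(heads, threshold=0.9):
    """How many of the given heads (a list of 2-D maps) look roughly diagonal."""
    return sum(diagonality(a) >= threshold for a in heads)
```

Scoring all heads per layer this way would let the "12/16 heads are diagonal" observation be tracked automatically over training instead of eyeballed in TensorBoard.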
In the Neural speech ... paper, the ablation study mentions that, because of the residual connections, the layers act like terms of a Taylor expansion approximating the final function. Hence the initial layers learn the low-frequency information/structure, and as the layers go deeper, the model tries to learn the higher-order/finer details of the function. Recent literature has shown that some models find it difficult to learn high-frequency information, or that a model first learns the low-frequency information and only learns the high-frequency information as it starts converging. In that context, I think the model failing to learn alignments in the last layer could be due to convergence. It might also be difficult to learn because of noisy data.
Regarding your question on why the last layer is critical, I am also unsure, because we are anyway using skip connections that should take care of this. I also have one question: why is query concatenation necessary? We are anyway using skip connections, which will add the query information.
About your question on the number of attention heads, I think you are right (based on the experimental results): adding more heads might bring uncertainty. In recent papers, authors have started using only two attention heads (because of the additional speaker vector concatenated with the spectrogram).
Looking forward to more insights from someone.
I also have one question: why is query concatenation necessary? We are anyway using skip connections, which will add the query information.
In the Transformer architecture, multi-head attention is wrapped in a residual connection and layer normalization, so theoretically the query and the attention output are already fused together by addition.
Concatenating the query and (Attention * Values) before applying the output affine layer is somewhat more expressive than simply adding. As soobinseo/Transformer-TTS mentions, concatenating the query is very important; maybe that choice is based on experimental results. In Deep Voice 3, however, the attention output and the query are fused by simply adding.
Shallower layers learning low-frequency information, and the Taylor-expansion analogy, are intuitive. Thank you.
Nvidia has a model, Centaur, which removes self-attention and uses only cross-attention (with a single cross-attention head). https://github.com/cpuimage/Transformer-TTS
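The two fusion schemes being compared can be illustrated in a few lines of NumPy. Dimensions are toy-sized and the weights are random stand-ins for learned affine layers (`w_add`, `w_cat` are my own names, not from either codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # model dimension (toy size)
query = rng.standard_normal((1, d))     # decoder query vector
context = rng.standard_normal((1, d))   # attention output, softmax(QK^T) V

# Residual-add fusion (Deep Voice 3 style): query and context are summed,
# so the affine layer is forced to treat them symmetrically.
w_add = rng.standard_normal((d, d))
fused_add = (query + context) @ w_add           # shape (1, d)

# Query-concat fusion (soobinseo/Transformer-TTS style): the affine layer
# sees query and context separately and can weight them independently,
# which is why concatenation is somewhat more expressive than adding.
w_cat = rng.standard_normal((2 * d, d))
fused_cat = np.concatenate([query, context], axis=-1) @ w_cat  # shape (1, d)
```

Note that add-fusion is a special case of concat-fusion (stack the same weight matrix twice), so the concat variant can only be at least as expressive, at the cost of a larger projection.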