
Comments (4)

ricardorei commented on June 20, 2024

@emjotde nothing like a re-implementation challenge to find bugs 😄... I just confirmed and you are right: it's defaulting to softmax instead of sparsemax.

>>> from comet import download_model, load_from_checkpoint
>>> model = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))
>>> model.layerwise_attention.transform_fn
<built-in method softmax of type object at 0x7fda5cbd2460>
>>> model.layerwise_attention.layer_norm
False

Same thing for the XCOMET models.

Regarding Roberta-XL and XXL, I was aware of the change from post-norm to pre-norm, but I did not realise the impact on the embeddings returned from HF. HF actually took a very long time to integrate Roberta-XL/XXL because of this issue... but I never inspected the magnitudes across layers.
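For reference, those magnitudes can be inspected directly from HF (a sketch, assuming the checkpoint id facebook/xlm-roberta-xl; the same check works on any encoder that exposes output_hidden_states):

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> name = "facebook/xlm-roberta-xl"  # pre-norm XL checkpoint (large download); any smaller encoder works the same way
>>> tokenizer = AutoTokenizer.from_pretrained(name)
>>> encoder = AutoModel.from_pretrained(name)
>>> batch = tokenizer("a quick sanity check", return_tensors="pt")
>>> with torch.no_grad():
...     out = encoder(**batch, output_hidden_states=True)
>>> # average L2 norm of the token vectors at each layer (embeddings + every transformer block)
>>> [round(h.norm(dim=-1).mean().item(), 1) for h in out.hidden_states]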

Btw, the rationale for using sparsemax instead of softmax was not performance related. Our goal when integrating sparsemax was to study whether all layers are relevant or not. The performance between sparsemax and softmax is usually the same. Yet, for wmt22-comet-da, because of sparsemax, we can clearly observe which layers are relevant:

e.g.:

>>> import torch
>>> model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0849, 0.0738, 0.0504, 0.0463, 0.0166, 0.0125, 0.0103, 0.0027, 0.0000,
        0.0000, 0.0007, 0.0088, 0.0151, 0.0463, 0.0591, 0.0466, 0.0516, 0.0552,
        0.0581, 0.0621, 0.0666, 0.0609, 0.0621, 0.0645, 0.0448],
       grad_fn=<SparsemaxFunctionBackward>)

Here we can see that some layers are set to 0 and thus ignored. This provides some level of interpretability... Ideally, the model would ignore the top layers and we could prune them after training (unfortunately, this usually does not happen).
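To make the contrast concrete, here is a minimal sketch on made-up logits, using the entmax package (assumed available, since it provides the sparsemax used for this):

>>> import torch
>>> from entmax import sparsemax
>>> logits = torch.tensor([2.0, 1.5, 0.1, -1.0, -3.0])
>>> torch.softmax(logits, dim=0)   # every layer keeps some probability mass
tensor([0.5517, 0.3346, 0.0825, 0.0275, 0.0037])
>>> sparsemax(logits, dim=0)       # low-scoring layers are set to exactly zero
tensor([0.7500, 0.2500, 0.0000, 0.0000, 0.0000])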

With XCOMET, the learned weights are all very similar... but, like you said, probably because of the different norms?

>>> model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0285, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267,
        0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0268, 0.0268,
        0.0268, 0.0268, 0.0268, 0.0269, 0.0270, 0.0271, 0.0271, 0.0272, 0.0273,
        0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0272,
        0.0287], grad_fn=<SoftmaxBackward0>)

Also, not sure if you noticed, but we only use the layerwise attention for creating the sentence embeddings that are used for regression. The embeddings used for classifying the individual tokens as error spans are those from the word_layer (model.hparams.word_layer). We have not played a lot with this hyper-parameter, but our goal was to make an individual layer more specialised on that task (usually a top layer, because it's closer to the MLM objective), while for regression we would like to pool information from all layers.
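Roughly, the two paths look like this (a sketch with hypothetical tensor names: hidden_states is the tuple of per-layer encoder outputs and normed_weights the mixing weights from above; the learned gamma scaling, layer norm options and padding masks are omitted):

>>> import torch
>>> # hidden_states: tuple of (batch, seq_len, dim) tensors, one per encoder layer (hypothetical)
>>> stacked = torch.stack(hidden_states, dim=0)                      # (n_layers, batch, seq_len, dim)
>>> mix = (normed_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # layerwise attention over all layers
>>> sentence_emb = mix[:, 0]                                         # pooled representation for the regression head
>>> word_feats = hidden_states[model.hparams.word_layer]             # a single layer feeds the word-level tagger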

"I am wondering if that wouldn't give you a really hard time during training the xComet-XXL models and skew the weighting during layer mixing?"

It did not... I was actually surprised, but training was very stable from the get-go. I had some issues with distributed training and pytorch-lightning and ended up implementing something without Lightning, but after that was done, training was smooth.

emjotde commented on June 20, 2024

Follow-up on that... I am also wondering if you realized that Roberta-XL and Roberta-XXL are pre-norm, while the base model you used for Comet-KIWI is post-norm, yet you treat them the same during training/inference. The Hugging Face implementation collects the hidden states without normalization for the XL models, with the exception of the very last hidden state, which is normed.

That seems to mean that the hidden states you use for your layer mixing have wildly different magnitudes across layers: the first and the last one (the most important one?) have very small norms, while the ones in between are unnormed. I am wondering if that wouldn't give you a really hard time during training the xComet-XXL models and skew the weighting during layer mixing?
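As a toy illustration of the concern (made-up magnitudes, not taken from the actual model), near-uniform mixing weights let a single unnormed high-magnitude layer dominate the mixture:

>>> import torch
>>> # three "layers" with very different magnitudes: first/last small (normed), middle unnormed
>>> layers = [torch.ones(4) * scale for scale in (1.0, 50.0, 1.0)]
>>> weights = torch.softmax(torch.zeros(3), dim=0)        # near-uniform weights, as observed above
>>> mix = sum(w * h for w, h in zip(weights, layers))
>>> mix.norm() / layers[0].norm()                          # the mixed vector is dominated by the big layer
tensor(17.3333)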

emjotde commented on June 20, 2024

Yeah, I am currently not looking at the word-level predictions yet; I stopped at the regressor implementation.

Regarding the weights above, the fact that they are near-uniform after softmax, despite the norms over the hidden states being so different, is what made me wonder whether proper learning happens or rather some form of saturation (always hard to tell with these neural models).

I would have expected the model to strongly push down the weights for the layers with high norms. On the other hand, if this becomes basically an unweighted arithmetic average, then the two very small vectors pull everything down by a lot, considering that averages reward outliers. Who knows...
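One quick way to probe that (a sketch; it assumes the scalar mixing parameters are initialised at zero, which would give exactly uniform weights after softmax) is to look at the raw scalars rather than the normalised ones:

>>> weights = torch.cat([p for p in model.layerwise_attention.scalar_parameters])
>>> weights.std().item(), weights.abs().max().item()   # values near zero would mean the scalars barely moved from init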

ricardorei commented on June 20, 2024

It's the black magic art of NNs 🙂
