
Comments (4)

ricardorei commented on June 20, 2024

@emjotde nothing like a re-implementation challenge to find bugs 😄... I just confirmed and you are right: it's defaulting to softmax instead of sparsemax.

>>> from comet import download_model, load_from_checkpoint
>>> model = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))
>>> model.layerwise_attention.transform_fn
<built-in method softmax of type object at 0x7fda5cbd2460>
>>> model.layerwise_attention.layer_norm
False

Same thing for the XCOMET models.

Regarding Roberta-XL and XXL, I was aware of the change from post-norm to pre-norm, but I did not realise the impact on the embeddings returned from HF. HF actually took a very long time to integrate Roberta-XL/XXL because of this issue... but I never inspected the magnitudes across layers.
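For reference, those magnitudes can be inspected directly from HF (a sketch, assuming the checkpoint id facebook/xlm-roberta-xl; the same check works on any encoder that exposes output_hidden_states):

>>> import torch
>>> from transformers import AutoModel, AutoTokenizer
>>> name = "facebook/xlm-roberta-xl"  # pre-norm XL checkpoint (large download); any smaller encoder works the same way
>>> tokenizer = AutoTokenizer.from_pretrained(name)
>>> encoder = AutoModel.from_pretrained(name)
>>> batch = tokenizer("a quick sanity check", return_tensors="pt")
>>> with torch.no_grad():
...     out = encoder(**batch, output_hidden_states=True)
>>> # average L2 norm of the token vectors at each layer (embeddings + every transformer block)
>>> [round(h.norm(dim=-1).mean().item(), 1) for h in out.hidden_states]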

Btw, the rationale for using sparsemax instead of softmax was not performance related. Our goal when integrating sparsemax was to study whether all layers are relevant or not. The performance between sparsemax and softmax is usually the same. Yet, for wmt22-comet-da, because of sparsemax, we can clearly observe which layers are relevant:

e.g.:

>>> import torch
>>> model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0849, 0.0738, 0.0504, 0.0463, 0.0166, 0.0125, 0.0103, 0.0027, 0.0000,
        0.0000, 0.0007, 0.0088, 0.0151, 0.0463, 0.0591, 0.0466, 0.0516, 0.0552,
        0.0581, 0.0621, 0.0666, 0.0609, 0.0621, 0.0645, 0.0448],
       grad_fn=<SparsemaxFunctionBackward>)

Here we can see that some layers are set to 0 and thus ignored. This provides some level of interpretability... Ideally, the model would ignore the top layers and we could prune them after training (unfortunately, this usually does not happen).
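To make the contrast concrete, here is a minimal sketch on made-up logits, using the entmax package (assumed available, since it provides the sparsemax used for this):

>>> import torch
>>> from entmax import sparsemax
>>> logits = torch.tensor([2.0, 1.5, 0.1, -1.0, -3.0])
>>> torch.softmax(logits, dim=0)   # every layer keeps some probability mass
tensor([0.5517, 0.3346, 0.0825, 0.0275, 0.0037])
>>> sparsemax(logits, dim=0)       # low-scoring layers are set to exactly zero
tensor([0.7500, 0.2500, 0.0000, 0.0000, 0.0000])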

With XCOMET, the learned weights are all very similar... but, like you said, probably because of the different norms?

>>> model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
>>> weights = torch.cat([parameter for parameter in model.layerwise_attention.scalar_parameters])
>>> normed_weights = model.layerwise_attention.transform_fn(weights, dim=0)
>>> normed_weights
tensor([0.0285, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267,
        0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0267, 0.0268, 0.0268,
        0.0268, 0.0268, 0.0268, 0.0269, 0.0270, 0.0271, 0.0271, 0.0272, 0.0273,
        0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0273, 0.0272,
        0.0287], grad_fn=<SoftmaxBackward0>)

Also, not sure if you noticed, but we only use the layerwise attention for creating the sentence embeddings that are used for regression. The embeddings used for classifying the individual tokens as error spans are those from the word_layer (model.hparams.word_layer). We have not played a lot with this hyper-parameter, but our goal was to make an individual layer more specialised on that task (usually a top layer, because it's closer to the MLM objective), while for regression we would like to pool information from all layers.
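Roughly, the two paths look like this (a sketch with hypothetical tensor names: hidden_states is the tuple of per-layer encoder outputs and normed_weights the mixing weights from above; the learned gamma scaling, layer norm options and padding masks are omitted):

>>> import torch
>>> # hidden_states: tuple of (batch, seq_len, dim) tensors, one per encoder layer (hypothetical)
>>> stacked = torch.stack(hidden_states, dim=0)                      # (n_layers, batch, seq_len, dim)
>>> mix = (normed_weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # layerwise attention over all layers
>>> sentence_emb = mix[:, 0]                                         # pooled representation for the regression head
>>> word_feats = hidden_states[model.hparams.word_layer]             # a single layer feeds the word-level tagger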

"I am wondering if that wouldn't give you a really hard time during training the xComet-XXL models and skew the weighting during layer mixing?"

It did not... I was actually surprised, but training was very stable from the get-go. I had some issues with distributed training and pytorch-lightning and ended up implementing something without Lightning, but after that was done, training was smooth.

emjotde commented on June 20, 2024

Follow-up on that... I am also wondering if you realized that Roberta-XL and Roberta-XXL are pre-norm, while the base model you used for Comet-KIWI is post-norm, yet you treat them the same during training/inference. The Hugging Face implementation collects the hidden states without normalization for the XL models, with the exception of the very last hidden state, which is normed.

That seems to mean that the hidden states you use for your layer mixing have wildly different magnitudes across layers: the first and the last one (the most important one?) have very small norms, while the ones in between are unnormed. I am wondering if that wouldn't give you a really hard time during training the xComet-XXL models and skew the weighting during layer mixing?
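As a toy illustration of the concern (made-up magnitudes, not taken from the actual model), near-uniform mixing weights let a single unnormed high-magnitude layer dominate the mixture:

>>> import torch
>>> # three "layers" with very different magnitudes: first/last small (normed), middle unnormed
>>> layers = [torch.ones(4) * scale for scale in (1.0, 50.0, 1.0)]
>>> weights = torch.softmax(torch.zeros(3), dim=0)        # near-uniform weights, as observed above
>>> mix = sum(w * h for w, h in zip(weights, layers))
>>> mix.norm() / layers[0].norm()                          # the mixed vector is dominated by the big layer
tensor(17.3333)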

emjotde commented on June 20, 2024

Yeah, I am currently not looking at the word-level predictions yet; I stopped at the regressor implementation.

Regarding the weights above, the fact that they are near-uniform after softmax, despite the norms over the hidden states being so different, is what made me wonder whether proper learning happens or rather some form of saturation (always hard to tell with these neural models).

I would have expected the model to strongly push down the weights for the layers with high norms. On the other hand, if this becomes basically an unweighted arithmetic average, then the two very small vectors pull everything down by a lot, considering that averages reward outliers. Who knows...
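One quick way to probe that (a sketch; it assumes the scalar mixing parameters are initialised at zero, which would give exactly uniform weights after softmax) is to look at the raw scalars rather than the normalised ones:

>>> weights = torch.cat([p for p in model.layerwise_attention.scalar_parameters])
>>> weights.std().item(), weights.abs().max().item()   # values near zero would mean the scalars barely moved from init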

ricardorei commented on June 20, 2024

It's the black magic art of NNs 🙂
