Hello, Our system is being validated against both "wmt-large-da-esti

[QUESTION] Can Different COMET Metrics Give Opposing Results for Same MT System

unbabel commented on May 26, 2024

[QUESTION] Can Different COMET Metrics Give Opposing Results for Same MT System

from comet.

Comments (7)

ricardorei commented on May 26, 2024

OK, the scores make sense! HTER and DA have different scales. HTER is a measure that you want to minimize: it reflects the effort required to "correct" the translation output so that it is semantically equivalent to the reference (a higher HTER means more effort).

DA is a continuous scale of "how good is a translation" (a high DA score means that the translation is good).

Both models are telling you that your MT is not good. For a SOTA MT system, you should expect your HTER score to be close to 0, while the DA score should be between 0.6 and 1.

It's all here: https://unbabel.github.io/COMET/html/models.html
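To make the opposite directions concrete, here is a minimal sketch of a direction-aware comparison between two systems. The `HIGHER_IS_BETTER` table and the `better_system` helper are hypothetical names introduced for illustration and are not part of COMET; the scores in the usage lines are made up.

```python
# Direction of each metric as described above:
# DA is higher-is-better, HTER is lower-is-better.
# (Hypothetical lookup table, not part of the COMET API.)
HIGHER_IS_BETTER = {
    "wmt-large-da-estimator-1719": True,
    "wmt-large-hter-estimator": False,
}


def better_system(model_name, score_a, score_b):
    """Return "A", "B", or "tie" according to the metric's direction."""
    if score_a == score_b:
        return "tie"
    if HIGHER_IS_BETTER[model_name]:
        a_wins = score_a > score_b
    else:
        a_wins = score_a < score_b
    return "A" if a_wins else "B"


# Made-up system-level scores, for illustration only.
print(better_system("wmt-large-da-estimator-1719", 0.62, 0.35))
print(better_system("wmt-large-hter-estimator", 0.08, 0.21))
```

With these made-up numbers both metrics prefer system A, even though A has the higher DA score and the lower HTER score.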

ricardorei commented on May 26, 2024

If you want to read more about HTER: Snover et al., 2006

and about DA: Graham et al., 2013

ricardorei commented on May 26, 2024

This issue label is exactly for this type of question! I am happy to help.

What are the scores exactly?

Sometimes, when comparing two systems of similar quality, these two models (wmt-large-da-estimator-1719 and wmt-large-hter-estimator) can disagree about which system is better. Yet, when scoring a single MT system, the scores should point in the same direction...

ricardorei commented on May 26, 2024

You are testing the model with 70k translations? Can you compute the Pearson correlation between the wmt-large-da-estimator-1719 and wmt-large-hter-estimator scores?
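A minimal sketch of that check, using only the Python standard library. The `pearson` helper is an illustrative name, and the two score lists are made-up placeholders, not values from this issue; in practice they would hold the segment-level scores that each model produced for the same translations, in the same order.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up segment-level scores, for illustration only (not from this issue).
da_scores = [0.8, 0.1, -0.4, 0.5]
hter_scores = [0.05, 0.30, 0.55, 0.15]

r = pearson(da_scores, hter_scores)
print(f"Pearson r = {r:.3f}")
```

Because DA is higher-is-better and HTER is lower-is-better, the two models agree when the correlation is strongly negative; a value near zero or a positive value would suggest they rank the segments differently.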

george2seven commented on May 26, 2024

> This issue label is exactly for this type of question! I am happy to help.
>
> What are the scores exactly?
>
> Sometimes, when comparing two systems of similar quality, these two models (wmt-large-da-estimator-1719 and wmt-large-hter-estimator) can disagree about which system is better. Yet, when scoring a single MT system, the scores should point in the same direction...

Please find below the results:

| | wmt-large-da-estimator-1719 | wmt-large-hter-estimator | emnlp-base-da-ranker |
|---|---|---|---|
| Score | -0.21418807 | 0.212977027 | 0.145221945 |
| Translations count (same MT) | 70544 | 70544 | 70544 |

Thanks for the support!

george2seven commented on May 26, 2024

> You are testing the model with 70k translations? Can you compute the Pearson correlation between the wmt-large-da-estimator-1719 and wmt-large-hter-estimator scores?

Unfortunately, our team has no experience with this type of computation, but I will ask our engineers to have a look.

george2seven commented on May 26, 2024

Thank you very much Ricardo! Makes sense now.
