unbabel / comet

A Neural Framework for MT Evaluation

Home Page: https://unbabel.github.io/COMET/html/index.html

License: Apache License 2.0

machine-translation evaluation-metrics natural-language-processing machine-learning artificial-intelligence nlp

comet's Introduction




NEWS:

  1. AfriCOMET released: a new model covering under-resourced African languages.
  2. We released our new eXplainable COMET models (XCOMET-XL and -XXL), which, along with quality scores, detect which errors in a translation are minor, major, or critical according to the MQM typology.
  3. We released CometKiwi-XL (3.5B) and -XXL (10.7B) QE models. These were the best-performing QE models in the WMT23 QE shared task.

Please check all available models here

Quick Installation

COMET requires Python 3.8 or above. Simple installation from PyPI:

pip install --upgrade pip  # ensures that pip is current 
pip install unbabel-comet

Note: To use some COMET models, such as Unbabel/wmt22-cometkiwi-da, you must acknowledge their license on the Hugging Face Hub and log in to the Hugging Face Hub.
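If you prefer to authenticate from Python rather than running huggingface-cli login in a terminal, here is a minimal sketch using the huggingface_hub library (the token string below is a placeholder for your own access token):

from huggingface_hub import login

# Paste a Hugging Face access token with read permissions; "hf_..." is a placeholder.
login(token="hf_...")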

To develop locally, run the following commands:

git clone https://github.com/Unbabel/COMET
cd COMET
pip install poetry
poetry install

For development, you can run the CLI tools directly, e.g.,

PYTHONPATH=. ./comet/cli/score.py

Scoring MT outputs:

CLI Usage:

Test examples:

echo -e "10 到 15 分钟可以送到吗\nPode ser entregue dentro de 10 a 15 minutos?" >> src.txt
echo -e "Can I receive my food in 10 to 15 minutes?\nCan it be delivered in 10 to 15 minutes?" >> hyp1.txt
echo -e "Can it be delivered within 10 to 15 minutes?\nCan you send it for 10 to 15 minutes?" >> hyp2.txt
echo -e "Can it be delivered between 10 to 15 minutes?\nCan it be delivered between 10 to 15 minutes?" >> ref.txt

Basic scoring command:

comet-score -s src.txt -t hyp1.txt -r ref.txt

You can set the number of GPUs using --gpus (use 0 to run on CPU).

For better error analysis, you can use XCOMET models such as Unbabel/XCOMET-XL and export the identified errors using the --to_json flag:

comet-score -s src.txt -t hyp1.txt -r ref.txt --model Unbabel/XCOMET-XL --to_json output.json

Scoring multiple systems:

comet-score -s src.txt -t hyp1.txt hyp2.txt -r ref.txt

WMT test sets via SacreBLEU:

comet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS

If you are only interested in a system-level score use the following command:

comet-score -s src.txt -t hyp1.txt -r ref.txt --quiet --only_system

Reference-free evaluation:

comet-score -s src.txt -t hyp1.txt --model Unbabel/wmt22-cometkiwi-da

Note: To use the Unbabel/wmt22-cometkiwi-da-xl you first have to acknowledge its license on Hugging Face Hub.
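Reference-free models can also be used from Python with the same API shown in the Scoring within Python section below; a minimal sketch (note that the data entries carry only src and mt, no ref):

from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

# Reference-free data: only source and hypothesis.
data = [
    {"src": "10 到 15 分钟可以送到吗", "mt": "Can I receive my food in 10 to 15 minutes?"},
    {"src": "Pode ser entregue dentro de 10 a 15 minutos?", "mt": "Can you send it for 10 to 15 minutes?"},
]
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output.system_score)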

Comparing multiple systems:

When comparing multiple MT systems, we encourage you to run the comet-compare command to get statistical significance with a paired t-test and bootstrap resampling (Koehn, 2004).

comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en
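To get a feel for what such a comparison involves, here is a minimal sketch (not the comet-compare implementation) that applies a paired t-test and bootstrap resampling to two lists of per-segment scores you have already computed, e.g. the model_output.scores of two systems:

import numpy as np
from scipy import stats

def compare_systems(scores_a, scores_b, n_boot=1000, seed=12):
    """Paired t-test plus bootstrap win rate over per-segment scores."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    _, p_value = stats.ttest_rel(a, b)
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(a), len(a))  # resample segments with replacement
        wins += a[idx].mean() > b[idx].mean()
    return p_value, wins / n_boot

p_value, win_rate = compare_systems([0.84, 0.79, 0.91], [0.80, 0.77, 0.88])
print(f"p-value={p_value:.3f}, bootstrap win rate for system A={win_rate:.2f}")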

Minimum Bayes Risk Decoding:

The MBR command allows you to rank translations and select the best one according to COMET metrics. For more details you can read our paper on Quality-Aware Decoding for Neural Machine Translation.

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt

If working with a very large candidate list, you can use the --rerank_top_k flag to prune to the top-k most promising candidates according to a reference-free metric.

Example for a candidate list of 1000 samples:

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt23-cometkiwi-da-xl

Your source and samples files should be formatted in this way.

COMET Models

Within COMET, there are several evaluation models available. You can refer to the MODELS page for a comprehensive list of all available models. Here is a concise list of the main reference-based and reference-free models:

  • Default Model: Unbabel/wmt22-comet-da - This model employs a reference-based regression approach and is built upon the XLM-R architecture. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 signifies a perfect translation.
  • Reference-free Model: Unbabel/wmt22-cometkiwi-da - This reference-free model employs a regression approach and is built on top of InfoXLM. It has been trained using direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Similar to other models, it generates scores ranging from 0 to 1. For those interested, we also offer larger versions of this model: Unbabel/wmt23-cometkiwi-da-xl with 3.5 billion parameters and Unbabel/wmt23-cometkiwi-da-xxl with 10.7 billion parameters.
  • eXplainable COMET (XCOMET): Unbabel/XCOMET-XXL - Our latest model is trained to identify error spans and assign a final quality score, resulting in an explainable neural metric. We offer this version in XXL with 10.7 billion parameters, as well as the XL variant with 3.5 billion parameters (Unbabel/XCOMET-XL). These models have demonstrated the highest correlation with MQM and are our best performing evaluation models.

Please be aware that different models may be subject to varying licenses. To learn more, kindly refer to the LICENSES.models and model licenses sections.

If you intend to compare your results with papers published before 2022, it's likely that they used older evaluation models. In such cases, please refer to Unbabel/wmt20-comet-da and Unbabel/wmt20-comet-qe-da, which were the primary checkpoints used in previous versions (<2.0) of COMET.

Also, the UniTE metric, developed by the NLP2CT Lab at the University of Macau and Alibaba Group, can be used directly through COMET; check here for more details.

Interpreting Scores:

New: An excellent reference for learning how to interpret machine translation metrics is the analysis paper by Kocmi et al. (2024), available at this link.

When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.

In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.

However, since 2022 we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance. Also, with the introduction of XCOMET models we can now analyse which text spans are part of minor, major or critical errors according to the MQM typology.

It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the comet-compare command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. This is an important step to ensure that any differences in scores between systems are statistically significant.

Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using comet-compare, make COMET a valuable tool for evaluating machine translation.

Languages Covered:

All the above-mentioned models are built on top of XLM-R variants, which cover the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Thus, results for language pairs containing uncovered languages are unreliable!

COMET for African Languages:

If you are interested in COMET metrics for African languages, please visit AfriCOMET.

Scoring within Python:

from comet import download_model, load_from_checkpoint

# Choose your model from Hugging Face Hub
model_path = download_model("Unbabel/XCOMET-XL")
# or for example:
# model_path = download_model("Unbabel/wmt22-comet-da")

# Load the model checkpoint:
model = load_from_checkpoint(model_path)

# Data must be in the following format:
data = [
    {
        "src": "10 到 15 分钟可以送到吗",
        "mt": "Can I receive my food in 10 to 15 minutes?",
        "ref": "Can it be delivered between 10 to 15 minutes?"
    },
    {
        "src": "Pode ser entregue dentro de 10 a 15 minutos?",
        "mt": "Can you send it for 10 to 15 minutes?",
        "ref": "Can it be delivered between 10 to 15 minutes?"
    }
]
# Call predict method:
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output)
print(model_output.scores) # sentence-level scores
print(model_output.system_score) # system-level score

# Not all COMET models return metadata with detected errors.
print(model_output.metadata.error_spans) # detected error spans

Train your own Metric:

Instead of using pretrained models, you can train your own model with the following command:

comet-train --cfg configs/models/{your_model_config}.yaml

You can then use your own metric to score:

comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT

You can also upload your model to the Hugging Face Hub, using Unbabel/wmt22-comet-da as an example. You can then use your model directly from the Hub.

unittest:

To run the toolkit tests, run the following commands:

poetry run coverage run --source=comet -m unittest discover
poetry run coverage report -m # Expected coverage 76%

Note: Testing on CPU takes a long time

Publications

If you use COMET please cite our work and don't forget to say which model you used!

comet's People

Contributors

alvations, arturnn, bramvanroy, coderpat, dependabot[bot], devrimcavusoglu, eltociear, erip, gpengzhi, hennerm, joao-maria-janeiro, kocmitom, mbtech, mjpost, new5558, phen0menon, pks, remorax, ricardorei, samuellarkin, ymoslem


comet's Issues

[QUESTION] Does COMET work on Windows?

COMET installation is failing on Windows. Could you please take a look?

(base) C:\>conda create --name comet_windows_3_7 python=3.7
(base) C:\>conda activate comet_windows_3_7
(comet_windows_3_7) C:\>pip install unbabel-comet

Using cached test_tube-0.7.4.tar.gz (21 kB)
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\test\Anaconda3\envs\comet_windows_3_7\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\test\\AppData\\Local\\Temp\\pip-install-68_b6icx\\test-tube_f70f8dd226d64a01b89decde5fae3cab\\setup.py'"'"'; __file__='"'"'C:\\Users\\test\\AppData\\Local\\Temp\\pip-install-68_b6icx\\test-tube_f70f8dd226d64a01b89decde5fae3cab\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\test\AppData\Local\Temp\pip-pip-egg-info-ko3bque_'
         cwd: C:\Users\test\AppData\Local\Temp\pip-install-68_b6icx\test-tube_f70f8dd226d64a01b89decde5fae3cab\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\test\AppData\Local\Temp\pip-install-68_b6icx\test-tube_f70f8dd226d64a01b89decde5fae3cab\setup.py", line 28, in <module>
        install_requires=load_requirements(PATH_ROOT),
      File "C:\Users\test\AppData\Local\Temp\pip-install-68_b6icx\test-tube_f70f8dd226d64a01b89decde5fae3cab\setup.py", line 10, in load_requirements
        with open(os.path.join(path_dir, 'requirements.txt'), 'r') as file:
    FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\test\\AppData\\Local\\Temp\\pip-install-68_b6icx\\test-tube_f70f8dd226d64a01b89decde5fae3cab\\requirements.txt'


TypeError: 'NoneType' object is not subscriptable when calling comet-score

🐛 Bug

To Reproduce

Using comet 1.0.1 with Python 3.9, I get the following (using testset in /data of mt-telescope):

$ comet-score -s newstest2020-ruen.src.ru.txt -t newstest2020-ruen.OnlineA.txt -r newstest2020-ruen.ref.en.txt
Global seed set to 12
wmt20-comet-da is already in cache.
Traceback (most recent call last):
  File ".../telescope-venv/bin/comet-score", line 8, in <module>
    sys.exit(score_command())
  File ".../telescope-venv/lib/python3.9/site-packages/comet/cli/score.py", line 180, in score_command
    model = load_from_checkpoint(model_path)
  File ".../telescope-venv/lib/python3.9/site-packages/comet/models/__init__.py", line 57, in load_from_checkpoint
    model_class = str2model[hparams["class_identifier"]]
TypeError: 'NoneType' object is not subscriptable

Expected behaviour

Output comet score.

Environment

OS: Linux
Packaging Pip
Version 1.0.1

[QUESTION] Training my own metric using Comet regression model

Hi Ricardo, I read all your code implementations. From my understanding, when I use comet-train to train my own evaluation metric, I will not load in any pretrained Comet weights except the initialization of XLM-Roberta weights. I just want to double check if this is correct. Currently, I set "resume_from_checkpoint" in the trainer.yaml in the config to be "null".

Thank you in advance! This is a great work.

Is there a theoretical range of values for the COMET regressor?

Is there a theoretical range of values for the COMET regressor?

Since the final estimator layer is an FFN https://github.com/Unbabel/COMET/blob/master/comet/models/regression/regression_metric.py#L95

Which goes to

modules.append(nn.Linear(hidden_sizes[-1], int(out_dim)))

Is the theoretical range (-inf, inf), https://pytorch.org/docs/1.9.1/generated/torch.nn.Linear.html?

Is there a table of practical ranges that the COMET owners/contributors/users have found for varying languages and lengths?
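For intuition, here is a minimal PyTorch sketch (an illustration, not the actual COMET estimator): a bare nn.Linear head is unbounded, and only an explicit activation such as a sigmoid squeezes the output into (0, 1).

import torch
import torch.nn as nn

head = nn.Linear(1024, 1)                          # bare regression head: unbounded output
bounded_head = nn.Sequential(head, nn.Sigmoid())   # a sigmoid bounds the output to (0, 1)

x = torch.randn(4, 1024) * 10                      # exaggerated feature vectors
print(head(x).squeeze())                           # values can fall well outside [0, 1]
print(bounded_head(x).squeeze())                   # values always inside (0, 1)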

Requirements conflicts

Hi,
I am using your tool in my Python pipelines, but I ran into a problem: your requirements are too strict and I get conflicts with many other tools. Could you please reinvestigate whether you must pin all packages to exact versions?
Additionally, are you planning to move to transformers v3?
Thank you,
TK

Segmentation fault error

Hello!

🐛 Bug

Whenever I try to run your demo, using:

from comet.models import download_model

I get:

Segmentation fault (core dumped)

To Reproduce

I'm using Amazon EC2 instances with ubuntu 18.04 and I've also tried with ubuntu 20.04 and got the same errors.
I log into the instance and do:

sudo su
apt-get update
apt-get install python3-pip
pip3 install unbabel-comet

If I'm on 18.04 I have to upgrade pip because otherwise sentencepiece breaks.

I've tried building from source and installing from pip; nothing seems to be working. It installs and I can even do:

import comet

But whenever :

from comet.models import download_model

I get a segmentation fault. The same happens if I use the CLI command:

comet download -d apequest --saving_path data/

again the same error!

Environment

OS: Linux
Packaging pip
Version : pip 20.0.2

I've tried multiple instances and virtual environments, but nothing seems to be effective! With Ubuntu 20.04 I used Python 3.8 (I've seen that the recommendation is Python 3.6). On Ubuntu 18.04 I used the default python3, which is 3.6.9, and still had the same issues!

Here is the output of my pip freeze


absl-py==0.11.0
cachetools==4.1.1
certifi==2020.11.8
cffi==1.14.3
chardet==3.0.4
click==7.1.2
Cython==0.29.15
fairseq==0.9.0
fastBPE==0.1.0
filelock==3.0.12
fsspec==0.8.4
future==0.18.2
google-auth==1.23.0
google-auth-oauthlib==0.4.2
grpcio==1.33.2
idna==2.10
joblib==0.17.0
Markdown==3.3.3
numpy==1.19.4
oauthlib==3.1.0
pandas==1.0.5
portalocker==2.0.0
protobuf==3.14.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
python-dateutil==2.8.1
pytorch-lightning==1.0.7
pytorch-nlp==0.5.0
pytz==2020.4
PyYAML==5.3.1
regex==2020.11.13
requests==2.25.0
requests-oauthlib==1.3.0
rsa==4.6
sacrebleu==1.4.14
sacremoses==0.0.43
scikit-learn==0.23.1
scipy==1.5.4
sentencepiece==0.1.94
six==1.15.0
sphinx-markdown-tables==0.0.15
tensorboard==2.2.0
tensorboard-plugin-wit==1.7.0
threadpoolctl==2.1.0
tokenizers==0.7.0
torch==1.4.0
tqdm==4.52.0
transformers==2.10.0
-e git+https://github.com/Unbabel/COMET@c9ac4c9cbdb8484aa5ee286c9cbe13002c16a193#egg=unbabel_comet
urllib3==1.26.2
Werkzeug==1.0.1
wget==3.2

If you could help me figure this thing out it would be wonderful!

Thank you for your time and willingness to share your tool! I'm eager to try it :-)

Error related to incomplete model downloads in cache

When a model download is halted before it completes, and a new command referring to the same model is then issued (e.g. the default comet score -s src.de -h hyp.en -r ref.en), the script will try to retrieve the cached (incomplete) download and will fail with an error:
Exception: [meta_tags.csv|hparams.yaml is missing from the checkpoint folder.

It is resolved if the cache is cleared.

Full error trace:


 Traceback (most recent call last):
  File "/home/chryssa/anaconda3/bin/comet", line 11, in <module>
    load_entry_point('unbabel-comet==0.0.7', 'console_scripts', 'comet')()
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/unbabel_comet-0.0.7-py3.7.egg/comet/cli.py", line 121, in score
    model = load_checkpoint(model) if os.path.exists(model) else download_model(model)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/unbabel_comet-0.0.7-py3.7.egg/comet/models/__init__.py", line 98, in download_model
    return load_checkpoint(checkpoint_path)
  File "/home/chryssa/anaconda3/lib/python3.7/site-packages/unbabel_comet-0.0.7-py3.7.egg/comet/models/__init__.py", line 136, in load_checkpoint
    "[meta_tags.csv|hparams.yaml is missing from the checkpoint folder."

ImportError: cannot import name 'container_abcs' from 'torch._six'

🐛 Bug

If I have apex installed, this library throws ImportError: cannot import name 'container_abcs' from 'torch._six'.

To reproduce

When I try to run this example from the readme:

echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en
comet-score -s src.de -t hyp1.en -r ref.en

...I get: ImportError: cannot import name 'container_abcs' from 'torch._six', but if I uninstall apex, comet works again.

Torch versions:

  • torch==1.10.0
  • torchmetrics==0.6.0
  • torchtext==0.5.0
  • apex @ git+git://github.com/NVIDIA/apex.git@700d6825e205732c1d6be511306ca4e595297070

Traceback

Traceback (most recent call last):
  File "/home/scarrion/anaconda3/envs/mltests/bin/comet-score", line 5, in <module>
    from comet.cli.score import score_command
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/__init__.py", line 19, in <module>
    from .download_utils import download_model
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/download_utils.py", line 26, in <module>
    from comet.models import available_metrics
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/models/__init__.py", line 17, in <module>
    from .regression.regression_metric import RegressionMetric
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/models/regression/regression_metric.py", line 26, in <module>
    from comet.models.base import CometModel
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/comet/models/base.py", line 29, in <module>
    import pytorch_lightning as ptl
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
    from pytorch_lightning.metrics.utils import deprecated_metrics
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 29, in <module>
    from pytorch_lightning.utilities import rank_zero_deprecation
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
    from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 73, in <module>
    _APEX_AVAILABLE = _module_available("apex.amp")
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/pytorch_lightning/utilities/imports.py", line 36, in _module_available
    return find_spec(module_path) is not None
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/importlib/util.py", line 94, in find_spec
    parent = __import__(parent_name, fromlist=['__path__'])
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/__init__.py", line 8, in <module>
    from . import amp
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/__init__.py", line 1, in <module>
    from .amp import init, half_function, float_function, promote_function,\
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/amp.py", line 1, in <module>
    from . import compat, rnn_compat, utils, wrap
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/rnn_compat.py", line 1, in <module>
    from . import utils, wrap
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/wrap.py", line 3, in <module>
    from ._amp_state import _amp_state
  File "/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/apex/amp/_amp_state.py", line 14, in <module>
    from torch._six import container_abcs
ImportError: cannot import name 'container_abcs' from 'torch._six' (/home/scarrion/anaconda3/envs/mltests/lib/python3.8/site-packages/torch/_six.py)

Environment

OS: Ubuntu 20.04
Packaging: pip
Version: 1.0.1

proposed python version (3.6) throws error when installing requirements (3.7 works)

Running pip install -r requirements.txt with the proposed Python 3.6 results in RuntimeError: Python version >= 3.7 required when trying to install the fairseq module. Installation runs smoothly with 3.7.

Full error trace below:

Collecting fairseq==0.9.0 (from -r requirements.txt (line 7))
Cache entry deserialization failed, entry ignored
Cache entry deserialization failed, entry ignored
Downloading https://files.pythonhosted.org/packages/67/bf/de299e082e7af010d35162cb9a185dc6c17db71624590f2f379aeb2519ff/fairseq-0.9.0.tar.gz (306kB)
  100% |████████████████████████████████| 307kB 2.0MB/s 
  Complete output from command python setup.py egg_info:
  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 154, in save_modules
      yield saved
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
      yield
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 250, in run_setup
      _execfile(setup_script, ns)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 45, in _execfile
      exec(code, globals, locals)
    File "/tmp/easy_install-s58jkscl/numpy-1.20.1/setup.py", line 30, in <module>
      self.__include_dirs = []
  RuntimeError: Python version >= 3.7 required.
  
  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-build-5bklbiad/fairseq/setup.py", line 161, in <module>
      zip_safe=False,
    File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 128, in setup
      _install_setup_requires(attrs)
    File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 123, in _install_setup_requires
      dist.fetch_build_eggs(dist.setup_requires)
    File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 513, in fetch_build_eggs
      replace_conflicting=True,
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 774, in resolve
      replace_conflicting=replace_conflicting
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1057, in best_match
      return self.obtain(req, installer)
    File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1069, in obtain
      return installer(requirement)
    File "/usr/lib/python3/dist-packages/setuptools/dist.py", line 580, in fetch_build_egg
      return cmd.easy_install(req)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 698, in easy_install
      return self.install_item(spec, dist.location, tmpdir, deps)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 724, in install_item
      dists = self.install_eggs(spec, download, tmpdir)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 909, in install_eggs
      return self.build_and_install(setup_script, setup_base)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 1177, in build_and_install
      self.run_setup(setup_script, setup_base, args)
    File "/usr/lib/python3/dist-packages/setuptools/command/easy_install.py", line 1163, in run_setup
      run_setup(setup_script, args)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 253, in run_setup
      raise
    File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
      self.gen.throw(type, value, traceback)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
      yield
    File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
      self.gen.throw(type, value, traceback)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 166, in save_modules
      saved_exc.resume()
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 141, in resume
      six.reraise(type, exc, self._tb)
    File "/usr/lib/python3/dist-packages/setuptools/_vendor/six.py", line 685, in reraise
      raise value.with_traceback(tb)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 154, in save_modules
      yield saved
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 195, in setup_context
      yield
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 250, in run_setup
      _execfile(setup_script, ns)
    File "/usr/lib/python3/dist-packages/setuptools/sandbox.py", line 45, in _execfile
      exec(code, globals, locals)
    File "/tmp/easy_install-s58jkscl/numpy-1.20.1/setup.py", line 30, in <module>
      self.__include_dirs = []
  RuntimeError: Python version >= 3.7 required.

[QUESTION] Is comet download still a supported command?

❓ Questions and Help

What is your question?

Is comet download still a supported command? If not, what is the best way to download the data needed to reproduce the results?

Code

$ comet download --help
=>  command not found: comet
$ comet-download --help
=>  command not found: comet-download

What have you tried?

I tried running comet download as specified in data/README.md, and I get the error command not found: comet. However, comet-score and comet-compare work, so I know that I did install it. I also tried comet-download and still get the same error.

What's your environment?

  • OS: macOS Big Sur 11.5.2
  • Packaging: conda
  • Version 4.8.3
  • Python 3.9.7

Cannot disable the progress bar

🐛 Bug

Hi,

I'm using version 1.0.0rc6 and scoring within Python, seg_scores, sys_score = model.predict(data, gpus=1).

However, I cannot disable the progress bar using show_progress=False. It would be nice to have this option.

Thanks!

Protect example inside main

Dear authors,
Thanks a lot for COMET and for open-sourcing your code.

🐛 Bug

Running word-level QE estimation triggers the following error:

Traceback (most recent call last):
File "", line 1, in
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 263, in run_path
return _run_module_code(code, init_globals, run_name,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/Ebenge/Desktop/quality_estimation/examples/word_level/wmt_2018/de_en/microtransquest.py", line 52, in
sources_tags, targets_tags = model.predict(test_sentences[:1], split_on_space=True)
File "/Users/Ebenge/Desktop/quality_estimation/transquest/algo/word_level/microtransquest/run_model.py", line 991, in predict
eval_dataset = self.load_and_cache_examples(None, to_predict=predict_examples)
File "/Users/Ebenge/Desktop/quality_estimation/transquest/algo/word_level/microtransquest/run_model.py", line 1203, in load_and_cache_examples
features = convert_examples_to_features(
File "/Users/Ebenge/Desktop/quality_estimation/transquest/algo/word_level/microtransquest/utils.py", line 345, in convert_examples_to_features
with Pool(process_count) as p:
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 212, in init
self._repopulate_pool()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 303, in _repopulate_pool
return self._repopulate_pool_static(self._ctx, self.Process,
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static
w.start()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
Traceback (most recent call last):
File "", line 1, in
return Popen(process_obj)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
super().init(process_obj)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
exitcode = _main(fd, parent_sentinel)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
_check_not_importing_main()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
prepare(preparation_data)
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
_fixup_main_from_path(data['init_main_from_path'])

To Reproduce

python microtransquest.py (the one in examples/word_level/wmt_2018/de_en)

Expected behaviour

No error

Fix

Wrap everything inside a main 👍
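A minimal sketch of that fix for the example script (the function body here is illustrative only):

def main():
    # build the model and run model.predict(...) here
    ...

if __name__ == "__main__":
    main()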

Environment

OS: OS
certifi==2021.10.8
charset-normalizer==2.0.7
click==8.0.3
configparser==5.1.0
cycler==0.11.0
docker-pycreds==0.4.0
filelock==3.4.0
flatbuffers==2.0
fonttools==4.28.2
gitdb==4.0.9
GitPython==3.1.24
huggingface-hub==0.1.2
idna==3.3
joblib==1.1.0
kiwisolver==1.3.2
matplotlib==3.5.0
numpy==1.21.4
onnxruntime==1.9.0
packaging==21.3
pandas==1.3.4
pathtools==0.1.2
Pillow==8.4.0
promise==2.3
protobuf==3.19.1
psutil==5.8.0
pyparsing==3.0.6
python-dateutil==2.8.2
pytz==2021.3
PyYAML==6.0
regex==2021.11.10
requests==2.26.0
sacremoses==0.0.46
scikit-learn==1.0.1
scipy==1.7.2
sentencepiece==0.1.96
sentry-sdk==1.5.0
seqeval==1.2.2
setuptools-scm==6.3.2
shortuuid==1.0.8
six==1.16.0
smmap==5.0.0
subprocess32==3.5.4
tensorboardX==2.4.1
termcolor==1.1.0
threadpoolctl==3.0.0
tokenizers==0.10.3
tomli==1.2.2
torch==1.10.0
tqdm==4.62.3
transformers==4.12.5
typing-extensions==4.0.0
urllib3==1.26.7
wandb==0.12.7
yaspin==2.1.0

Additional context

The fix solves the problem.

How can I use QE model with HTER? (wmt20-comet-qe-hter)

I'm using wmt20-comet-qe-da for MT quality estimation. I wanted to use the HTER-based QE model for better interpretability. Is that model supported yet? If not, can you guide me a bit to understand DA system scores? E.g., I have a source and an MT output and get a DA score: what threshold should be considered good or bad?

fastBPE installation error

🐛 Bug

When installing either with pip install unbabel-comet or directly with pip install -r requirements.txt, I get an error: error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Running setup.py install for fastBPE ... error ERROR: Command errored out with exit status 1: command: /home/ubuntu/cometenv/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"'; __file__='"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-hfrq684c/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/cometenv/include/site/python3.7/fastBPE cwd: /tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/ Complete output (15 lines): running install running build running build_py package init file 'fastBPE/__init__.py' not found (or not a regular file) running build_ext building 'fastBPE' extension creating build creating build/temp.linux-x86_64-3.7 creating build/temp.linux-x86_64-3.7/fastBPE x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -IfastBPE -I/usr/include/python3.7m -I/home/ubuntu/cometenv/include/python3.7m -c fastBPE/fastBPE.cpp -o build/temp.linux-x86_64-3.7/fastBPE/fastBPE.o -std=c++11 -Ofast -pthread fastBPE/fastBPE.cpp:28:10: fatal error: Python.h: No such file or directory #include "Python.h" ^~~~~~~~~~ compilation terminated. error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 ---------------------------------------- ERROR: Command errored out with exit status 1: /home/ubuntu/cometenv/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"'; __file__='"'"'/tmp/pip-install-eeqbscuj/fastbpe_45baab04cd16456fb32b018392790726/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-hfrq684c/install-record.txt --single-version-externally-managed --compile --install-headers /home/ubuntu/cometenv/include/site/python3.7/fastBPE Check the logs for full command output.

Environment

(Ubuntu 18.04) Version 34.0
Python 3.7

Calling xlm-roberta-large via HuggingFace

🚀 Feature

I would like to use this library for COMET scores, but I want to host xlm-roberta-large elsewhere, e.g. on HuggingFace. How can we enable this in this library? Would modifying xlmr.py be enough to call an XLM model that is deployed to a remote GPU server?

Motivation

I would like to place the computationally intensive encoder of the model on a GPU server (for faster batch-inference) that is shared by multiple COMET scorers but also maybe with other applications that may also be benefiting from xlm-roberta-large.

Progress bar should go to stderr not stdout

🐛 Bug

I'm running

for i in *.output.txt; do
 comet-score -s src.txt -r newstest2021.en-de.ref.ref-A.de -t $i >$i.comet
done

and the progress bar is going into $i.comet.

To Reproduce

comet-score -s src.txt -r ref.txt -t hyp.txt >output
less output

Expected behaviour

It would be better to:

  1. Send progress to stderr instead of stdout.
  2. Sense if the progress bar is going to a terminal and suppress if not.

Screenshots

less output does this

^MPredicting: 0it [00:00, ?it/s]^MPredicting:   1%|          | 1/126 [00:01<02:30,  1.20s/it]^MPredicting:   2%|▏         | 2/126 [00:01<01:30,  1.37it/s]^MPredicting:   2%|▏         | 3/126 [00:01<01:13,  1.67it/s]^MPredicting:   3%|▎         | 4/126 [00:02<01:07,  1.81it/s]^MPredicting:   4%|▍         | 5/126 [00:02<01:02,  1.94it/s]^MPredicting:   5%|▍         | 6/126 [00:02<00:58,  2.05it/s]^MPredicting:   6%|▌         | 7/126 [00:03<00:55,  2.14it/s]^MPredicting:   6%|▋         | 8/126 [00:03<00:52,  2.26it/s]^MPredicting:   7%|▋         | 9/126 [00:03<00:50,  2.30it/s]^MPredicting:   8%|▊         | 10/126 [00:04<00:48,  2.40it/s]^MPredicting:   9%|▊         | 11/126 [00:04<00:46,  2.45it/s]^MPredicting:  10%|▉         | 12/126 [00:04<00:46,  2.47it/s]^MPredicting:  10%|█         | 13/126 [00:05<00:46,  2.45it/s]^MPredicting:  11%|█         | 14/126 [00:05<00:45,  2.48it/s]^MPredicting:  12%|█▏        | 15/126 [00:05<00:43,  2.53it/s]^MPredicting:  13%|█▎        | 16/126 [00:06<00:43,  2.54it/s]^MPredicting:  13%|█▎        | 17/126 [00:06<00:42,  2.58it/s]^MPredicting:  14%|█▍        | 18/126 [00:06<00:41,  2.62it/s]^MPredicting:  15%|█▌        | 19/126 [00:07<00:40,  2.65it/s]^MPredicting:  16%|█▌        | 20/126 [00:07<00:39,  2.66it/s]^MPredicting:  17%|█▋        | 21/126 [00:07<00:39,  2.66it/s]^MPredicting:  17%|█▋        | 22/126 [00:08<00:39,  2.63it/s]^MPredicting:  18%|█▊        | 23/126 [00:08<00:38,  2.64it/s]^MPredicting:  19%|█▉        | 24/126 [00:09<00:38,  2.65it/s]^MPredicting:  20%|█▉        | 25/126 [00:09<00:37,  2.68it/s]^MPredicting:  21%|██        | 26/126 [00:09<00:36,  2.71it/s]^MPredicting:  21%|██▏       | 27/126 [00:09<00:36,  2.74it/s]^MPredicting:  22%|██▏       | 28/126 [00:10<00:35,  2.76it/s]^MPredicting:  23%|██▎       | 29/126 [00:10<00:35,  2.77it/s]^MPredicting:  24%|██▍       | 30/126 [00:10<00:34,  2.76it/s]^MPredicting:  25%|██▍       | 31/126 [00:11<00:34,  2.76it/s]^MPredicting:  25%|██▌       | 32/126 [00:11<00:34,  2.76it/s]^MPredicting:  26%|██▌       | 33/126 [00:11<00:33,  2.77it/s]^MPredicting:  27%|██▋       | 34/126 [00:12<00:32,  2.80it/s]^MPredicting:  28%|██▊       | 35/126 [00:12<00:32,  2.83it/s]^MPredicting:  29%|██▊       | 36/126 [00:12<00:31,  2.85it/s]^MPredicting:  29%|██▉       | 37/126 [00:12<00:31,  2.86it/s]^MPredicting:  30%|███       | 38/126 [00:13<00:30,  2.87it/s]^MPredicting:  31%|███       | 39/126 [00:13<00:30,  2.88it/s]^MPredicting:  32%|███▏      | 40/126 [00:13<00:29,  2.89it/s]^MPredicting:  33%|███▎      | 41/126 [00:14<00:29,  2.89it/s]^MPredicting:  33%|███▎      | 42/126 [00:14<00:28,  2.90it/s]^MPredicting:  34%|███▍      | 43/126 [00:14<00:28,  2.92it/s]^MPredicting:  35%|███▍      | 44/126 [00:14<00:27,  2.94it/s]^MPredicting:  36%|███▌      | 45/126 [00:15<00:27,  2.93it/s]^MPredicting:  37%|███▋      | 46/126 [00:15<00:27,  2.93it/s]^MPredicting:  37%|███▋      | 47/126 [00:16<00:27,  2.89it/s]^MPredicting:  38%|███▊      | 48/126 [00:16<00:27,  2.88it/s]^MPredicting:  39%|███▉      | 49/126 [00:16<00:26,  2.91it/s]^MPredicting:  40%|███▉      | 50/126 [00:17<00:25,  2.93it/s]^MPredicting:  40%|████      | 51/126 [00:17<00:25,  2.93it/s]^MPredicting:  41%|████▏     | 52/126 [00:17<00:25,  2.94it/s]^MPredicting:  42%|████▏     | 53/126 [00:18<00:24,  2.94it/s]^MPredicting:  43%|████▎     | 54/126 [00:18<00:24,  2.95it/s]^MPredicting:  44%|████▎     | 55/126 [00:18<00:24,  2.92it/s]^MPredicting:  44%|████▍     | 56/126 [00:19<00:24,  2.91it/s]^MPredicting:  45%|████▌     | 57/126 [00:19<00:23,  
2.91it/s]^MPredicting:  46%|████▌     | 58/126 [00:19<00:23,  2.91it/s]^MPredicting:  47%|████▋     | 59/126 [00:20<00:23,  2.91it/s]^MPredicting:  48%|████▊     | 60/126 [00:20<00:22,  2.90it/s]^MPredicting:  48%|████▊     | 61/126 [00:20<00:22,  2.91it/s]^MPredicting:  49%|████▉     | 62/126 [00:21<00:22,  2.91it/s]^MPredicting:  50%|█████     | 63/126 [00:21<00:21,  2.91it/s]^MPredicting:  51%|█████     | 64/126 [00:22<00:21,  2.89it/s]^MPredicting:  52%|█████▏    | 65/126 [00:22<00:21,  2.90it/s]^MPredicting:  52%|█████▏    | 66/126 [00:22<00:20,  2.88it/s]^MPredicting:  53%|█████▎    | 67/126 [00:23<00:20,  2.85it/s]^MPredicting:  54%|█████▍    | 68/126 [00:23<00:20,  2.84it/s]^MPredicting:  55%|█████▍    | 69/126 [00:24<00:20,  2.85it/s]^MPredicting:  56%|█████▌    | 70/126 [00:24<00:19,  2.86it/s]^MPredicting:  56%|█████▋    | 71/126 [00:24<00:19,  2.87it/s]^MPredicting:  57%|█████▋    | 72/126 [00:25<00:18,  2.87it/s]^MPredicting:  58%|█████▊    | 73/126 [00:25<00:18,  2.87it/s]^MPredicting:  59%|█████▊    | 74/126 [00:25<00:18,  2.87it/s]^MPredicting:  60%|█████▉    | 75/126 [00:25<00:17,  2.89it/s]^MPredicting:  60%|██████    | 76/126 [00:26<00:17,  2.90it/s]^MPredicting:  61%|██████    | 77/126 [00:26<00:16,  2.90it/s]^MPredicting:  62%|██████▏   | 78/126 [00:26<00:16,  2.91it/s]^MPredicting:  63%|██████▎   | 79/126 [00:27<00:16,  2.91it/s]^MPredicting:  63%|██████▎   | 80/126 [00:27<00:15,  2.91it/s]^MPredicting:  64%|██████▍   | 81/126 [00:27<00:15,  2.90it/s]^MPredicting:  65%|██████▌   | 82/126 [00:28<00:15,  2.90it/s]^MPredicting:  66%|██████▌   | 83/126 [00:28<00:14,  2.92it/s]^MPredicting:  67%|██████▋   | 84/126 [00:28<00:14,  2.93it/s]^MPredicting:  67%|██████▋   | 85/126 [00:28<00:13,  2.93it/s]^MPredicting:  68%|██████▊   | 86/126 [00:29<00:13,  2.93it/s]^MPredicting:  69%|██████▉   | 87/126 [00:29<00:13,  2.91it/s]^MPredicting:  70%|██████▉   | 88/126 [00:30<00:13,  2.91it/s]^MPredicting:  71%|███████   | 89/126 [00:30<00:12,  2.90it/s]^MPredicting:  71%|███████▏  | 90/126 [00:30<00:12,  2.90it/s]^MPredicting:  72%|███████▏  | 91/126 [00:31<00:12,  2.91it/s]^MPredicting:  73%|███████▎  | 92/126 [00:31<00:11,  2.90it/s]^MPredicting:  74%|███████▍  | 93/126 [00:32<00:11,  2.90it/s]^MPredicting:  75%|███████▍  | 94/126 [00:32<00:11,  2.89it/s]^MPredicting:  75%|███████▌  | 95/126 [00:32<00:10,  2.89it/s]^MPredicting:  76%|███████▌  | 96/126 [00:33<00:10,  2.88it/s]^MPredicting:  77%|███████▋  | 97/126 [00:33<00:10,  2.88it/s]^MPredicting:  78%|███████▊  | 98/126 [00:34<00:09,  2.88it/s]^MPredicting:  79%|███████▊  | 99/126 [00:34<00:09,  2.87it/s]^MPredicting:  79%|███████▉  | 100/126 [00:34<00:09,  2.88it/s]^MPredicting:  80%|████████  | 101/126 [00:35<00:08,  2.88it/s]^MPredicting:  81%|████████  | 102/126 [00:35<00:08,  2.88it/s]^MPredicting:  82%|████████▏ | 103/126 [00:35<00:07,  2.88it/s]^MPredicting:  83%|████████▎ | 104/126 [00:36<00:07,  2.88it/s]^MPredicting:  83%|████████▎ | 105/126 [00:36<00:07,  2.88it/s]^MPredicting:  84%|████████▍ | 106/126 [00:36<00:06,  2.87it/s]^MPredicting:  85%|████████▍ | 107/126 [00:37<00:06,  2.87it/s]^MPredicting:  86%|████████▌ | 108/126 [00:37<00:06,  2.87it/s]^MPredicting:  87%|████████▋ | 109/126 [00:38<00:05,  2.86it/s]^MPredicting:  87%|████████▋ | 110/126 [00:38<00:05,  2.86it/s]^MPredicting:  88%|████████▊ | 111/126 [00:38<00:05,  2.86it/s]^MPredicting:  89%|████████▉ | 112/126 [00:39<00:04,  2.86it/s]^MPredicting:  90%|████████▉ | 113/126 [00:39<00:04,  2.85it/s]^MPredicting:  90%|█████████ | 114/126 [00:39<00:04,  
2.85it/s]^MPredicting:  91%|█████████▏| 115/126 [00:40<00:03,  2.85it/s]^MPredicting:  92%|█████████▏| 116/126 [00:40<00:03,  2.85it/s]^MPredicting:  93%|█████████▎| 117/126 [00:41<00:03,  2.85it/s]^MPredicting:  94%|█████████▎| 118/126 [00:41<00:02,  2.85it/s]^MPredicting:  94%|█████████▍| 119/126 [00:41<00:02,  2.84it/s]^MPredicting:  95%|█████████▌| 120/126 [00:42<00:02,  2.81it/s]^MPredicting:  96%|█████████▌| 121/126 [00:43<00:01,  2.80it/s]^MPredicting:  97%|█████████▋| 122/126 [00:43<00:01,  2.79it/s]^MPredicting:  98%|█████████▊| 123/126 [00:44<00:01,  2.79it/s]^MPredicting:  98%|█████████▊| 124/126 [00:44<00:00,  2.80it/s]^MPredicting:  99%|█████████▉| 125/126 [00:44<00:00,  2.80it/s]^MPredicting: 100%|██████████| 126/126 [00:44<00:00,  2.81it/s]^MPredicting: 100%|██████████| 126/126 [00:45<00:00,  2.79it/s]

Environment

OS: Ubuntu 20.04 x86_64
Packaging pip
Version pip install unbabel-comet==1.0.0rc2


Add WMT test sets via sacrebleu

[Extracting from #30]

It would be nice to add support for sacrebleu-style built-in test sets, e.g.,

# one option
$ cat system.txt | comet -t wmt20 -l de-en [other args]

# another option
$ cat system.txt | comet --sacrebleu-testset wmt20/de-en
$ cat system.txt | comet --sacrebleu-testset mtedx/valid/pt-es

You could accomplish this by just using sacrebleu as a library. It’s pretty easy:

from sacrebleu.utils import get_source, get_references, get_files

# trigger sacrebleu test set
# make these optional: nargs="?" for argparse
if args.source is None and args.references is None:
    if args.sacrebleu_dataset is None:
        # throw error
        raise ValueError("--sacrebleu_dataset is required when no source/references are given")

    # some test sets are hierarchical, e.g., "mtedx/valid"
    test_set, langpair = args.sacrebleu_dataset.rsplit("/", maxsplit=1)
    source = get_source(test_set, langpair)
    ref = get_references(test_set, langpair)

    # alternative
    source, ref, _ = get_files(test_set, langpair)

Originally posted by @mjpost in #30 (comment)

Read from STDIN

🚀 Feature

It would be really nice if COMET could read input from STDIN, e.g.,

# three fields triggers comet-ref
$ paste source.txt hyps.txt ref.txt | comet [args]

# two fields -> comet-src
$ paste source.txt hyps.txt | comet [args]

Motivation

This is consistent with standard UNIX usage. It is also slightly less cumbersome, and allows comet to be used in settings without writing files to disk.
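As an illustration of the requested behaviour, a rough wrapper can already be written on top of the existing Python API (a sketch for the three-field src/mt/ref case only; this is not a COMET feature):

import sys
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

# Each stdin line is expected to be: source <TAB> hypothesis <TAB> reference
data = [dict(zip(("src", "mt", "ref"), line.rstrip("\n").split("\t"))) for line in sys.stdin]

model_output = model.predict(data, batch_size=8, gpus=0)
print(model_output.system_score)

Invoked, for example, as paste source.txt hyps.txt ref.txt | python score_stdin.py (the script name is hypothetical).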

Shortened version of comet-score output

🚀 Feature

Comet-score should be able to print a shortened score, as an average of all segment scores, when passed a particular flag (maybe something like --quiet). To the best of my knowledge, this does not seem possible currently (though of course, I could be wrong as I am new to this package)

Motivation

Currently, Comet-score prints a line-by-line score for each segment. This can be quite an overkill, especially if one is only interested in the score for the whole test set (which is currently calculated as an average for each segment score). Displaying only the average would be useful in these cases.

Additional context

This is the current output when I run comet-score from the CLI:

Screenshot 2022-02-06 at 11 02 53 pm

Add a check if language is supported by model

Hello,
In my ongoing evaluation of metrics, I have found that COMET (especially the source-based variant) is very unpredictable when it evaluates a language that is not supported by the XLM model. This can easily happen because there is no list of supported languages for COMET on your Git page, and users would need to dig into the XLM paper/repository to find it.

Would it be possible to add a check that language is supported (and thus ask for the language code when evaluating)?

Here I share my findings with you. The graph below is for source-based COMET (QE): the Y-axis shows human deltas and the X-axis shows COMET deltas. The green dots around COMET delta 0 (it isn't exactly 0) are for language pairs where one of the languages is not supported by XLM; you can see that humans did find a difference for those languages, but COMET was chaotic (other metrics don't have this problem).

image
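A minimal sketch of the kind of guard proposed here, assuming the caller passes language names and a hand-maintained set derived from the XLM-R list in the README (both the helper and the truncated set below are illustrative, not part of COMET):

# Truncated for brevity; in practice this would hold the full XLM-R language list.
XLMR_LANGUAGES = {"English", "German", "Czech", "Chinese (Simplified)", "Portuguese"}

def check_language_support(*languages):
    unsupported = [lang for lang in languages if lang not in XLMR_LANGUAGES]
    if unsupported:
        raise ValueError(f"Not covered by XLM-R, scores may be unreliable: {unsupported}")

check_language_support("English", "German")   # passes silently
# check_language_support("Klingon")           # would raise ValueError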

Warnings like "Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel"

❓ Questions and Help

What is your question?

When I run "comet-score -s test.en-zh.en -t decoder-out -r test.en-zh.zh", I got the following warnings. Is that normal? or am I missing something?

/root/.cache/torch/unbabel_comet/wmt20-comet-da//checkpoints/model.ckpt
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Encoder model frozen.
/usr/local/python3/lib/python3.8/site-packages/torch/nn/modules/container.py:435: UserWarning: Setting attributes on ParameterList is not supported.

warnings.warn("Setting attributes on ParameterList is not supported.")
GPU available: True, used: True

What's your environment?

  • Linux
  • python 3.8
  • Version

[QUESTION] Can Different COMET Metrics Give Opposing Results for Same MT System

Hello,

Our system is being validated against both "wmt-large-da-estimator-1719" and "wmt-large-hter-estimator" estimators with the same translations dataset, of course (70k+ translations).

The two estimators give completely opposite results.
The "da" estimator is placing our MT system in "...the bottom 25%" while the "HTER" estimator returns a "top 25%" score.

I know this is not a technical issue, but can you please provide some additional information on how we might be able to interpret those types of results?

Thank you very much

[QUESTION] Does COMET support multiple references?

Thanks for the tool! I'm wondering if COMET supports multiple references, or if we can just score each sentence with all the references and take the maximum value? Sorry if this has already been mentioned somewhere.
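There is no multi-reference mode documented in this README, but the workaround described in the question can be sketched with the Python API (whether taking the maximum over references is the right aggregation is a separate judgement call):

from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

src = "Pode ser entregue dentro de 10 a 15 minutos?"
mt = "Can it be delivered within 10 to 15 minutes?"
refs = [
    "Can it be delivered between 10 to 15 minutes?",
    "Could it be delivered in 10 to 15 minutes?",
]

# Score the hypothesis once against each reference and keep the best score.
data = [{"src": src, "mt": mt, "ref": ref} for ref in refs]
scores = model.predict(data, batch_size=8, gpus=0).scores
print(max(scores))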

[QUESTION] About HTER models in download list.

❓ Questions and Help


What is your question?

Hi, I found that the HTER models are missing from the download list in the current code.
https://github.com/Unbabel/COMET/blob/master/comet/models/__init__.py
I wonder whether they are still supported in the current version.

I used version 1.0.0rc9, and it reports this:
"Exception: wmt-large-hter-estimator is not in the availale_metrics or is a valid checkpoint folder."
Is that normal or should I use the previous version?
Thanks.

What's your environment?

  • OS: Linux
  • Packaging pip
  • Version 1.0.0rc9

Get list of models

🚀 Feature

COMET should output a list of available models if -m is used with an invalid model name.

Motivation

It's a bit of a pain to figure out the available models from the CLI. I had to come to the GitHub page.

Alternatives

Additional context

Benchmark tests?

I've installed COMET on two different machines (Python 3.8.10 on Ubuntu, running on GPU, and Python 3.9.10 on macOS, running on CPU) and am getting vastly different results on each machine for the same model and the same texts. Clearly, they can't both be correct.

I'm wondering whether there are any benchmark results that I can compare to, for example the simple examples in the installation instructions, so I can try to figure out what's going on and validate the installation. Also, I'm wondering whether the fact that I'm running COMET on Chinese characters might have something to do with the different results. Are there any benchmark results for EN<>CN?

Thanks.

What scale is COMET on? [0,1]?

Is COMET on a 0-1 scale? Is it normalized to [0,1]? In the paper on uncertainty-aware COMET I saw a COMET score bigger than 1; how is that possible? Is that normal/frequent?

Why is COMET_DA default model?

Hi Unbabel,

I wanted to ask: why is wmt20-comet-da the default model when using COMET? I am a bit worried that people using it off the shelf won't understand the underlying difference and will mistakenly report COMET scores on the QE metric.

Why not set the reference-based COMET as the default, since it seems to be outperforming comet-da? (I also fear that comet-da will have more biases and potential problems than the reference-based one.)

Thank you for the answer,
Tom

[QUESTION] How to calculate corpus level COMET?

I understand COMET returns a single score for each sentence it evaluates. I was wondering whether there is any way to report a corpus-level metric, and what that would be, similar to how BLEU is reported.
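
As far as I understand, the system-level number that comet-score reports is simply the average of the segment-level scores, so a corpus-level COMET score can be obtained by averaging. A minimal Python sketch, following the (seg_scores, sys_score) return form used elsewhere in these issues; the model name and data are placeholders:

from comet import download_model, load_from_checkpoint

# Minimal sketch: a corpus-level COMET score as the plain mean of segment scores.
model = load_from_checkpoint(download_model("wmt20-comet-da"))
data = [
    {"src": "Dem Feuer konnte Einhalt geboten werden",
     "mt": "The fire could be stopped",
     "ref": "They were able to stop the fire"},
    {"src": "Schulen und Kindergärten wurden eröffnet.",
     "mt": "Schools and kindergartens were open",
     "ref": "Schools and kindergartens opened"},
]
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)

corpus_score = sum(seg_scores) / len(seg_scores)  # matches the reported system score
print(corpus_score, sys_score)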

Model outputs error right after finishing training

🐛 Bug

Hello! I've tried to train a COMET model using my own data. I want to train using HTER as the metric, and I used the configuration present in the repo: https://github.com/Unbabel/COMET/blob/master/configs/xlmr/base/hter-estimator.yaml

To Reproduce

Python 3.6.9

python3 -m venv comet
source comet/bin/activate  # activate the virtual environment before installing
pip install unbabel-comet
comet train -f config.yml

Where config.yml is the configuration I mentioned above, with alterations to the training data paths.
It does not seem to be an issue with the data, as I have the correct column names and the model did train through the 2 epochs set in the configuration file.

Expected behaviour

A trained model that could be loaded via Python.

Screenshots

Here's the output from my logs.

Epoch 2: 100%|██████████| 25000/25000 [1:16:17<00:00,  5.46it/s, loss=0.056, v_num=4-54, pearson=0.924, kendall=0.81, spearman=0.946, avg_loss=0.0621] 
Traceback (most recent call last):                            
  File "/home/ubuntu/comet/bin/comet", line 33, in <module>
    sys.exit(load_entry_point('unbabel-comet==0.0.6', 'console_scripts', 'comet')())
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/comet/cli.py", line 63, in train
    trainer.fit(model)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 453, in fit
    self.call_hook('on_fit_end')
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 835, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/trainer/callback_hook.py", line 57, in on_fit_end
    callback.on_fit_end(self, self.get_model())
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 35, in wrapped_fn
    return fn(*args, **kwargs)
TypeError: on_fit_end() takes 2 positional arguments but 3 were given

Environment

OS: Linux
Packaging: pip
Version: latest

Thank you for your time!

Regards,

Jose :-)

Reimplementing results in COMET EMNLP'20 paper

Hi,

Recently I have been reproducing the results in your COMET EMNLP'20 paper. I also carefully referred to the documentation for more details. However, when reproducing the experiments on the wmt-metrics data, I found something unexpected. Here are my preparation steps:

  1. I create a virtual environment with conda:
conda create -n comet python=3.8
conda activate comet
pip install unbabel-comet
  2. Download the wmt-metrics data via:
comet download -d wmt-metrics --saving_path data/wmt-metrics/

After these steps I continue the implementation with the released model:

  1. I want to run a test on the language pair de-en, so I first split the test19-relative-ranks.csv file into multiple files, storing the source, reference, positive_hypothesis, and negative_hypothesis line by line. Here is the content of the Python script script_language_filter.py:
from argparse import ArgumentParser
from csv import reader

parser = ArgumentParser()

parser.add_argument('--input_file', type=str, required=True)
parser.add_argument('--language', type=str, required=True)
parser.add_argument('--output_src', type=str, required=True)
parser.add_argument('--output_ref', type=str, required=True)
parser.add_argument('--output_pos', type=str, required=True)
parser.add_argument('--output_neg', type=str, required=True)

args = parser.parse_args()


def main():
    with open(args.input_file, mode='r', encoding='utf-8') as f1, \
            open(args.output_src, mode='w', encoding='utf-8') as f2, \
            open(args.output_ref, mode='w', encoding='utf-8') as f3, \
            open(args.output_pos, mode='w', encoding='utf-8') as f4, \
            open(args.output_neg, mode='w', encoding='utf-8') as f5:
        csv_reader = reader(f1)
        next(csv_reader)  # skip the header row
        for row in csv_reader:
            # csv_file title:
            # data, lp, src, ref, pos, neg, pos.model, neg.model, bestmodel
            # indexes of our interest:
            #       1 , 2  , 3  , 4  , 5
            if row[1] != args.language:
                continue

            f2.write(row[2].strip() + '\n')
            f3.write(row[3].strip() + '\n')
            f4.write(row[4].strip() + '\n')
            f5.write(row[5].strip() + '\n')
    return


if __name__ == '__main__':
    main()

Then, run this command:

python script_language_filter.py \
    --input_file test19-relative-ranks.csv \
    --language "de-en" \
    --output_src test19-relative-ranks.src \
    --output_ref test19-relative-ranks.ref \
    --output_pos test19-relative-ranks.pos \
    --output_neg test19-relative-ranks.neg

After this step, I get 4 more files test19-relative-ranks.{src,ref,pos,neg}, each yielding 17,073 lines.

  2. Scoring each sentence pair:
    For positive_hypothesis-reference:
comet score -s test19-relative-ranks.src \
    -h test19-relative-ranks.pos \
    -r test19-relative-ranks.ref \
    --batch_size 16 \
    --to_json test19-relative-ranks.pos.json \
    --model emnlp-base-da-ranker

For negative_hypothesis-reference:

comet score -s test19-relative-ranks.src \
    -h test19-relative-ranks.neg \
    -r test19-relative-ranks.ref \
    --batch_size 16 \
    --to_json test19-relative-ranks.neg.json \
    --model emnlp-base-da-ranker

Then I get two files storing predicted scores test19-relative-ranks.{pos,neg}.json.

  3. I didn't find a provided script for directly computing the WMT DARR Kendall score, so I simply wrote one:
from argparse import ArgumentParser
from json import load

parser = ArgumentParser()
parser.add_argument('--pos_json', type=str, required=True)
parser.add_argument('--neg_json', type=str, required=True)

args = parser.parse_args()


def main():
    with open(args.pos_json, mode='r', encoding='utf-8') as f1, \
            open(args.neg_json, mode='r', encoding='utf-8') as f2:
        pos_data = load(f1)
        neg_data = load(f2)

    concor = 0
    discor = 0

    for pos, neg in zip(pos_data, neg_data):
        if pos['predicted_score'] > neg['predicted_score']:
            concor += 1
        else:
            discor += 1

    print('%d items in total. Concor: %d, Discor: %d, WMTKendall: %f' % (concor + discor, concor, discor, (concor - discor) / (concor + discor)))
    return


if __name__ == '__main__':
    main()

And I run:

python script_compute_rr.py \
    --pos_json test19-relative-ranks.pos.json \
    --neg_json test19-relative-ranks.neg.json 

The results are:

17073 items in total. Concor: 11244, Discor: 5829, WMTKendall: 0.317167

So here I find that my result is considerably higher than the reported result of 0.202 in your paper (column de-en, row COMET-RANK in Table 2). I'm not sure at which step I went wrong. Besides, I also want to know whether the model tagged emnlp-base-da-ranker is exactly the trained model corresponding to the reported results in Table 2 of your paper.

Could you answer these questions for me? Many thanks!

  • OS: [Ubuntu 20.0]
  • Packaging [conda]
  • Version [0.1.0]

certificate verify failed: unable to get local issuer certificate

🐛 Bug

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>

Error when running comet-score with Python >3.7 on macOS.

To Reproduce

Install python via homebrew, create a virtual environment with comet and then run comet-score.

Screenshots

Environment

OS: macOS Mojave 10.14.6

Additional context

It seems that, for some reason, Brew has not run the Install Certificates.command that comes in the Python3 bundle for Mac.

COMET not working with python 3.8

🐛 Bug

COMET works with python 3.6 but not with python 3.8

Environment

OS: MacOS
Packaging: pip
Version 1.0.0rc6

To Reproduce

I pip-installed COMET in my Python 3.8.12 virtual environment and then tested the "Scoring with Python" example provided in the README:

seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)

But I get the following error:

/usr/local/Cellar/[email protected]/3.8.12/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py in _launch(self, process_obj)
     45         try:
     46             reduction.dump(prep_data, fp)
---> 47             reduction.dump(process_obj, fp)
     48         finally:
     49             set_spawning_popen(None)

/usr/local/Cellar/[email protected]/3.8.12/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

AttributeError: Can't pickle local object 'CometModel.predict.<locals>.<lambda>'

I tested the same code on python 3.6 and it did work, so thanks a lot :)

I expected COMET to work with Python >= 3.5.
Is there any plan to make it work with Python 3.8?

Thanks again.

Model download error

❓ Questions and Help

I tried to download the model using:
model = download_model("wmt-large-da-estimator-1719")

But I get the following error:

'''
AttributeError Traceback (most recent call last)
in

----> 4 model = download_model("wmt-large-da-estimator-1719")

7 frames
/proj/tools/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in __setattr__(self, name, value)
817 buffers[name] = value
818 else:
--> 819 object.__setattr__(self, name, value)
820
821 def __delattr__(self, name):

AttributeError: can't set attribute
'''

  • OS: [Ubuntu 18.04]
  • Packaging [pip]
  • Version [20.2.4]

[QUESTION] Train my own metrics without source sentences

Hi Ricardo, thank you so much for your previous answers.
I have a follow-up question regarding training COMET myself without using any source sentences.

So the input to the system will be translated sentences, reference sentences and the rating of the translations.
Is there a way to do this in the current code base? Thank you in advance.

GPU support information into README

COMET takes 30-40 minutes to evaluate a 400-sentence test set on CPU, so a GPU is practically necessary, but it took me some time to find out that there is a "cuda" flag. Can you add an example with the cuda parameter to the README?
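
For what it's worth, recent versions expose the number of GPUs through the gpus argument of predict in the Python API (and a corresponding --gpus option on the CLI). A minimal sketch, with the model name and data as placeholders:

from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("wmt20-comet-da"))
data = [{"src": "10 到 15 分钟可以送到吗",
         "mt": "Can it be delivered within 10 to 15 minutes?",
         "ref": "Can it be delivered between 10 to 15 minutes?"}]

# gpus=1 runs scoring on a single GPU; gpus=0 falls back to CPU.
seg_scores, sys_score = model.predict(data, batch_size=8, gpus=1)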

Problems with refless model

🐛 Bug

There are several issues with the refless (reference-free) models:

Actually, I solved all the bugs in my local code.

To Reproduce

Before reporting a bug, make sure that the bug can be reproduced with a minimal example and add your relevant changes, to see if the issue persists.

If the test is failing, please add your test cases to the issue (as a draft PR, or simply paste the code to the issue description here).

Environment

OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8

Comet QE asks for the reference

Using comet score with the QE model (wmt-large-qe-estimator-1719) through the command line asks for a reference, even though the reference shouldn't be used in the calculation (if I am not mistaken).

pip install conflict

When I tried a clean install of the current COMET version via pip, I got an error about conflicting versions. I managed to resolve it by manually downgrading PyYAML to 3.3.*

Could you check it, please?

Multi-GPU support?

🚀 Feature

Multi-GPU support would be nice.

Motivation

Scoring larger test sets takes ages on a single GPU :)

Refless example doesn't work with 1.0.0rc4

🐛 Bug

To Reproduce

Following the README:

> echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
> echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp.en
> comet-score -s src.de -t hyp.en --model wmt20-comet-qe-da

results in the following output:

Global seed set to 12
usage: comet-score [-h] [-s SOURCES] [-t TRANSLATIONS] [-r REFERENCES] [--batch_size BATCH_SIZE] [--gpus GPUS]
                   [--to_json TO_JSON]
                   [--model {emnlp20-comet-rank,wmt20-comet-da,wmt20-comet-qe-da,wmt21-cometinho-da}]
                   [--mc_dropout MC_DROPOUT] [--seed_everything SEED_EVERYTHING]
comet-score: error: wmt20-comet-qe-da requires -r/--references.

Looking at the code:

COMET/comet/cli/score.py

Lines 79 to 80 in 61caa5a

if (cfg.references is None) and ("refless" not in cfg.model):
    parser.error("{} requires -r/--references.".format(cfg.model))

it seems that the model has to have "refless" in its name, which none of the available models do:

comet-score: error: argument --model: invalid choice: [...] (choose from 'emnlp20-comet-rank', 'wmt20-comet-da', 'wmt20-comet-qe-da', 'wmt21-cometinho-da')

Expected behaviour

I'd hope to get a score for my hypotheses given the sources.

Screenshots

n/a

Environment

OS: MacOS
Packaging: pip
Version: unbabel-comet==1.0.0rc4

Additional context

TypeError: 'type' object is not subscriptable

🐛 Bug

After installation, whether via pip, poetry, or direct usage (./comet/cli/score.py), I get the following error:

$ PYTHONPATH=. python3 ./comet/cli/score.py
Traceback (most recent call last):
  File "./comet/cli/score.py", line 56, in <module>
    from comet.download_utils import download_model
  File "/home/mattpost/src/COMET/comet/__init__.py", line 19, in <module>
    from .download_utils import download_model
  File "/home/mattpost/src/COMET/comet/download_utils.py", line 26, in <module>
    from comet.models import available_metrics
  File "/home/mattpost/src/COMET/comet/models/__init__.py", line 17, in <module>
    from .regression.regression_metric import RegressionMetric
  File "/home/mattpost/src/COMET/comet/models/regression/regression_metric.py", line 26, in <module>
    from comet.models.base import CometModel
  File "/home/mattpost/src/COMET/comet/models/base.py", line 41, in <module>
    class OrderedSampler(Sampler[int]):
TypeError: 'type' object is not subscriptable

This is with Python 3.8.10, so relatively recent.

Environment

OS: Linux (ubuntu 20.04.3)
Packaging: all
Version: 1.0.1 (latest source)

Add Poetry

Use a dependency manager such as Poetry to put an end to the problems with requirements.

small score difference for identical outputs

🐛 Bug

Using comet-compare
I noticed that the exact same output from two different systems receives different scores, although the difference is almost negligible.
But, although negligibly different, these scores are not considered a tie, and hence there is an impact on the number of wins/losses reported.

    "ties (%)": 0.0,
    "x_wins (%)": 1.0,
    "y_wins (%)": 0.0

{
    "src": "Nedávno prohrál s Raonicem v Brisbane Open.",
    "system_x": {
        "mt": "He recently lost to Raonic at the Brisbane Open.",
        "score": 0.8726277947425842
    },
    "system_y": {
        "mt": "He recently lost to Raonic at the Brisbane Open.",
        "score": 0.872564971446991
    },
    "ref": "He recently lost against Raonic in the Brisbane Open."
},

To Reproduce

comet-compare -s SRC -r REF -x SysX -y SysY --to_json JJJ

Actually, I am directly calling the compare_command() function included in cli/compare.py.

Expected behaviour

Either an identical score for identical outputs, or a somewhat more flexible counting of wins/losses/ties (see the sketch below).
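
One possible reading of the more flexible counting is a tolerance-based tie check; this is a sketch only, where the epsilon value and names are illustrative and not the actual comet-compare logic:

# Sketch of tolerance-based tie counting (illustrative; not the comet-compare code).
# Scores closer than EPS are counted as ties instead of wins/losses.
EPS = 1e-3

def outcome(score_x: float, score_y: float) -> str:
    if abs(score_x - score_y) < EPS:
        return "tie"
    return "x_wins" if score_x > score_y else "y_wins"

print(outcome(0.8726277947425842, 0.872564971446991))  # prints "tie"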

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8

Error when running inference on newly trained metric

While inference (comet-score) using the provided metrics works well, if I train a new metric with comet-train and then use it to predict quality scores, I get the following error:

comet-score: error: argument --model: invalid choice: 'lightning_logs/version_19/checkpoints/epoch=1-step=129339.ckpt' (choose from 'emnlp20-comet-rank', 'wmt20-comet-da', 'wmt20-comet-qe-da', 'wmt21-cometinho-da')

The error disappears once I comment out line 64: choices=available_metrics.keys(), in comet/cli/score.py

parser.add_argument(
    "--model",
    type=Union[str, Path_fr],
    required=False,
    default="wmt20-comet-da",
    choices=available_metrics.keys(),
    help="COMET model to be used.",
)

wrong output of comet-compare

🐛 Bug

comet-compare outputs wrong info in the JSON file:

  • the same src, ref, x and y text is reported for every entry
  • the scores are probably correct

I think that the problems are due to this line

"src": system_x[0]["src"],

and the 3 lines that follow it.

It seems that entry 0 is always output, while entry "i" should be output instead (a sketch of the suspected fix follows below).
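
A sketch of the suspected fix, assuming the JSON entries are built in a loop over the segments; the surrounding structure and variable names here are assumptions, not the actual compare.py code:

def build_entries(system_x, system_y):
    # Index every field with the loop variable i instead of a hard-coded 0,
    # so each JSON entry shows its own src/ref/mt texts.
    entries = []
    for i in range(len(system_x)):
        entries.append({
            "src": system_x[i]["src"],   # was system_x[0]["src"]
            "system_x": {"mt": system_x[i]["mt"], "score": system_x[i]["score"]},
            "system_y": {"mt": system_y[i]["mt"], "score": system_y[i]["score"]},
            "ref": system_x[i]["ref"],
        })
    return entries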

To Reproduce

comet-compare -s SRC.txt -r REF.txt -x SysX.txt -y SysY.txt --to_json JSON.txt

SRC.txt, REF.txt, SysX.txt and SysY.txt must have more than one line.

Screenshots

If applicable, add screenshots to help explain your problem.

Environment

OS: ubuntu
Packaging: pip3
Version: unbabel-comet==1.0.0rc8
