obss / jury
Comprehensive NLP Evaluation System
License: MIT License
Meteor for multiple languages
It would be nice if the METEOR implementation supported multiple languages (there is a Java library, but I could not find any Python implementation).
Meta information
Hey, why does computing the BLEU score more than once affect the keys of the score dict?
e.g. 'bleu_1', 'bleu_1_1', 'bleu_1_1_1'
Overall I find the library quite user-friendly, but I am unsure about this behavior.
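For reference, a minimal repro sketch of what I mean (assuming the default metric set; output keys abbreviated):
from jury import Jury
scorer = Jury()
p = [["the cat is on the mat"]]
r = [["the cat is playing on the mat"]]
print(scorer(predictions=p, references=r).keys())  # ..., 'bleu_1', 'bleu_2', ...
print(scorer(predictions=p, references=r).keys())  # ..., 'bleu_1_1', ... (the surprising part)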
nltk recently released newer versions 3.6.6, 3.6.7, and 3.7. According to the dependabot alerts, the currently pinned version is inefficient in several particular cases, and the new versions eliminate these known performance issues.
Due to the nature of the Jury API, all input strings must be whole (not tokenized); the current implementation of the BLEU score tokenizes by whitespace. However, one might want results for smaller tokens, morphemes, or even the character level rather than a word-level BLEU score. Thus, it'd be great to support this by adding tokenizer support to the BLEU score computation.
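A hedged sketch of one possible interface; the tokenizer key in compute_kwargs is hypothetical, not an existing parameter:
from jury import Jury
from jury.metrics import load_metric
# Hypothetical: pass a callable that splits a string into the desired units.
char_bleu = load_metric("bleu", compute_kwargs={"tokenizer": list})  # character-level BLEU
scorer = Jury(metrics=[char_bleu])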
Currently, jury allows the input metrics passed in Jury(metrics=metrics) to be either a list of jury.metrics.Metric objects or a list of str, but it does not allow using both str and Metric objects together, as
from jury import Jury
from jury.metrics import load_metric
metrics = ["bleu", load_metric("meteor")]
jury = Jury(metrics=metrics)
raises an error, as the metrics parameter expects a NestedSingleType of object which is either list<str> or list<jury.metrics.Metric>.
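Until mixed input is supported, a minimal workaround sketch is to normalize the str entries yourself before constructing Jury:
from jury import Jury
from jury.metrics import load_metric
metrics = ["bleu", load_metric("meteor")]
# Map plain strings through load_metric so the list becomes a homogeneous list of Metric objects.
metrics = [load_metric(m) if isinstance(m, str) else m for m in metrics]
jury = Jury(metrics=metrics)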
Describe the bug
Failure when loading the bleu metric, probably due to the loose version-range dependency on datasets through evaluate.
To Reproduce
load_metric("bleu")
Exception Traceback (if available)
AttributeError Traceback (most recent call last)
[<ipython-input-15-2fcc283dab87>](https://localhost:8080/#) in <cell line: 2>()
1 MT_METRICS = [
----> 2 load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
3 load_metric("bleu", resulting_name="bleu_2", compute_kwargs={"max_order": 2}),
4 load_metric("meteor"),
5 load_metric("rouge"),
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/auto.py](https://localhost:8080/#) in load_metric(path, resulting_name, task, compute_kwargs, use_jury_only, **kwargs)
---> 55 return AutoMetric.load(
56 path=path,
57 resulting_name=resulting_name,
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/auto.py](https://localhost:8080/#) in load(cls, path, task, resulting_name, compute_kwargs, use_jury_only, **kwargs)
114 factory_class = module.__main_class__
115 klass = getattr(module, factory_class)
--> 116 metric = klass.construct(task=task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
117 return metric
118
[/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu.py](https://localhost:8080/#) in construct(cls, task, resulting_name, compute_kwargs, **kwargs)
22 subclass = cls._get_subclass()
23 resulting_name = resulting_name or cls._get_path(compute_kwargs=compute_kwargs)
---> 24 return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
25
26 @classmethod
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py](https://localhost:8080/#) in _construct(cls, resulting_name, compute_kwargs, **kwargs)
235 cls, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs
236 ):
--> 237 return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
238
239 @staticmethod
[/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu_for_language_generation.py](https://localhost:8080/#) in __init__(self, resulting_name, compute_kwargs, **kwargs)
119 self.should_change_resulting_name = resulting_name is None
120 self.tokenizer = DefaultTokenizer()
--> 121 super().__init__(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
122
123 def _info(self):
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py](https://localhost:8080/#) in __init__(self, resulting_name, compute_kwargs, **kwargs)
220 def __init__(self, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs):
221 compute_kwargs = self._validate_compute_kwargs(compute_kwargs)
--> 222 super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
223
224 def _validate_compute_kwargs(self, compute_kwargs: Dict[str, Any]) -> Dict[str, Any]:
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py](https://localhost:8080/#) in __init__(self, task, resulting_name, compute_kwargs, config_name, keep_in_memory, cache_dir, num_process, process_id, seed, experiment_id, max_concurrent_cache_files, timeout, **kwargs)
100 self.resulting_name = resulting_name if resulting_name is not None else self.name
101 self.compute_kwargs = compute_kwargs or {}
--> 102 self.download_and_prepare()
103
104 @abstractmethod
[/usr/local/lib/python3.9/dist-packages/evaluate/module.py](https://localhost:8080/#) in download_and_prepare(self, download_config, dl_manager)
649 )
650
--> 651 self._download_and_prepare(dl_manager)
652
653 def _download_and_prepare(self, dl_manager):
[/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu_for_language_generation.py](https://localhost:8080/#) in _download_and_prepare(self, dl_manager)
150 nmt_source = "https://raw.githubusercontent.com/tensorflow/nmt/0be864257a76c151eef20ea689755f08bc1faf4e/nmt/scripts/bleu.py"
--> 151 self.external_module_path = dl_manager.download(nmt_source)
152
153 def _preprocess(self, predictions: Collator, references: Collator) -> Tuple[Collator, Collator]:
[/usr/local/lib/python3.9/dist-packages/datasets/download/download_manager.py](https://localhost:8080/#) in download(self, url_or_urls)
425
426 start_time = datetime.now()
--> 427 downloaded_path_or_paths = map_nested(
428 download_func,
429 url_or_urls,
[/usr/local/lib/python3.9/dist-packages/datasets/utils/py_utils.py](https://localhost:8080/#) in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, parallel_min_length, types, disable_tqdm, desc)
433 # Singleton
434 if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 435 return function(data_struct)
436
437 disable_tqdm = disable_tqdm or not logging.is_progress_bar_enabled()
[/usr/local/lib/python3.9/dist-packages/datasets/download/download_manager.py](https://localhost:8080/#) in _download(self, url_or_filename, download_config)
451 # append the relative path to the base_path
452 url_or_filename = url_or_path_join(self._base_path, url_or_filename)
--> 453 return cached_path(url_or_filename, download_config=download_config)
454
455 def iter_archive(self, path_or_buf: Union[str, io.BufferedReader]):
[/usr/local/lib/python3.9/dist-packages/datasets/utils/file_utils.py](https://localhost:8080/#) in cached_path(url_or_filename, download_config, **download_kwargs)
193 use_auth_token=download_config.use_auth_token,
194 ignore_url_params=download_config.ignore_url_params,
--> 195 storage_options=download_config.storage_options,
196 download_desc=download_config.download_desc,
197 )
AttributeError: 'DownloadConfig' object has no attribute 'storage_options'
Environment Information:
Currently, most of the metrics that require downloading from a source use the download() helper function under utils; these implementations need to be refactored to use the dl_manager available in the underlying base class.
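A minimal sketch of the target pattern, following the _download_and_prepare usage visible in the traceback above (the URL here is a placeholder):
def _download_and_prepare(self, dl_manager):
    # Prefer the dl_manager provided by the underlying evaluate base class
    # over the ad-hoc utils download() helper.
    source = "https://example.com/external_metric_module.py"  # placeholder URL
    self.external_module_path = dl_manager.download(source)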
If you give the BLEU metric hypothesis/reference pairs that consist of a single hypothesis and a single reference, the results don't make sense. If you duplicate either or both of the elements, then it calculates the expected results.
Reproduction code can be found in the following gist:
https://gist.github.com/Sophylax/2f70729a8ecb669c98898c65f7aed679
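For convenience, a minimal inline repro along the lines of the gist (assumed to exhibit the same behavior):
from jury import Jury
scorer = Jury(metrics=["bleu"])
# One hypothesis paired with one reference: the scores come out wrong.
broken = scorer(predictions=[["the cat is on the mat"]],
                references=[["the cat is playing on the mat"]])
# Duplicating the hypothesis yields the expected results.
ok = scorer(predictions=[["the cat is on the mat"] * 2],
            references=[["the cat is playing on the mat"]])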
Describe the bug
I was running RobertaForQuestionAnswering on HuggingFace's squad-v2 train set (~86k examples). The Accuracy metric threw a division-by-zero error at AccuracyForLanguageGeneration._compute_single_pred_single_ref.
To Reproduce
Run the datasets squad-v2 train set through pipeline("question-answering", ...).
Expected behavior
Run without error.
Exception Traceback (if available)
If applicable, add full traceback to help explain your problem.
ration.py:107, in AccuracyForLanguageGeneration._compute_single_pred_single_ref(self, predictions, references, reduce_fn, **kwargs)
105 if token in ref_counts:
106 score += min(pred_count, ref_counts[token]) # Intersection count
--> 107 scores.append(score / max(len(pred), len(ref)))
108 avg_score = sum(scores) / len(scores)
109 return {"score": avg_score}
ZeroDivisionError: division by zero
Environment Information:
evaluate==0.2.2
datasets==2.11.0
Thanks, I appreciate that jury exists. I could patch this by cloning and doing an in-depth trace analysis, but I wanted to know if there is a better way to patch it.
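Not the official fix, but a minimal defensive-guard sketch around the failing division (per the traceback, pred and ref are the tokenized prediction and reference):
def safe_length_ratio(score: float, pred: list, ref: list) -> float:
    # Avoid ZeroDivisionError when both the prediction and the reference
    # tokenize to zero tokens (e.g. empty answers in squad-v2).
    denom = max(len(pred), len(ref))
    return score / denom if denom else 0.0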
Citation info / a badge on the repo would be good for the project, for those looking for how to cite it (but unfortunately unable to find it for now).
BERTScore had a new release, 0.3.11, which added support for DeBERTa v3 and ByT5 models.
The codebase needs a refactor, mainly for docstrings and constant strings (like _CITATIONS, etc. used in metric info). Also, import statements need to be reviewed (e.g. imports for no-longer-supported/legacy versions should be removed). Update the README accordingly.
Starting with v2.3, jury no longer supports python_version < 3.8.
There are (timeout) issues with the current Prism model source. We should upload the model-related resource files to a publicly available source for consistent connection and throughput.
Describe the bug
jury-v2 raises the following error on import
ModuleNotFoundError: No module named 'validators'
Potential fix
Add validators to requirements.txt.
CER is currently available in HF datasets and hence implicitly available in jury as well. Yet, to fully utilize jury, it needs to be added explicitly.
HF Implementation: https://github.com/huggingface/datasets/blob/master/metrics/cer/cer.py
A newer version of nltk changed the METEOR computation; it now requires input strings to be pre-tokenized.
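For illustration, a minimal sketch of the new interface (nltk >= 3.6.6; the required resource downloads may vary by version):
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score
nltk.download("wordnet")  # lexical resource required by METEOR
nltk.download("punkt")    # required by word_tokenize
reference = "the cat is playing on the mat"
hypothesis = "the cat is on the mat"
# Newer nltk expects pre-tokenized inputs (lists of tokens); raw strings raise a TypeError.
score = meteor_score([word_tokenize(reference)], word_tokenize(hypothesis))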
Referring to thompsonb/prism#13: since it seems like no active maintenance is going on, we can add this support on a public fork.
BARTScore was published at NeurIPS 2021. We can add it to jury with a minimal requirement set.
Repo: https://github.com/neulab/BARTScore
Paper: https://arxiv.org/abs/2106.11520
We need to add a demo notebook with some well-known dataset and a pretrained model from huggingface to showcase the usage of the jury package.
load_metric currently takes a path; however, a relative path may be problematic, and the path also points to the script (old style) rather than the enclosing metric folder.
Old usage,
my_metric = jury.load_metric("my_project/my_custom_metrics/my_metric/my_metric.py") # Rather than this
Suggested replacement,
my_metric = jury.load_metric("my_project/my_custom_metrics/my_metric") # this should be the correct usage
my_metric = jury.load_metric("my_project.my_custom_metrics.my_metric") # Another alternative for pathlike string
There is currently a bug in the Squad F1 and EM metrics when calculating with multiple predictions.
Currently, the test cases operate entirely without fixtures, which hurts performance while testing. Implementing fixtures can dramatically speed up testing.
License information is currently missing for the BLEURT and Prism implementations and needs to be added.
Prism is an automatic MT metric which uses a sequence-to-sequence paraphraser to score MT system outputs conditioned on their respective human references.
Repo: https://github.com/thompsonb/prism
Paper: https://aclanthology.org/2020.emnlp-main.8/
BLEURT is currently supported in HF datasets, and hence implicitly in jury; however, to fully utilize jury for this metric, it needs to be explicitly added.
Repo: https://github.com/google-research/bleurt
Paper: https://arxiv.org/abs/2004.04696
HF Implementation: https://github.com/huggingface/datasets/tree/master/metrics/bleurt
Hello,
After downloading the contents from git and instantiating the object, I get this error:
/content/image-captioning-bottom-up-top-down
Traceback (most recent call last):
File "eval.py", line 11, in <module>
from jury import Jury
File "/usr/local/lib/python3.7/dist-packages/jury/__init__.py", line 1, in <module>
from jury.core import Jury
File "/usr/local/lib/python3.7/dist-packages/jury/core.py", line 6, in <module>
from jury.metrics import EvaluationInstance, Metric, load_metric
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/__init__.py", line 1, in <module>
from jury.metrics._core import (
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/__init__.py", line 1, in <module>
from jury.metrics._core.auto import AutoMetric, load_metric
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/auto.py", line 23, in <module>
from jury.metrics._core.base import Metric
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/base.py", line 28, in <module>
from datasets.utils.logging import get_logger
ModuleNotFoundError: No module named 'datasets.utils'; 'datasets' is not a package
Can you please check what the issue could be?
Enrich the default metric definitions in definitions.py, and add a method for altering (adding, deleting, changing) definitions.
Hey, when computing the BLEU score (snippet below), I am facing a reshape error in _compute_single_pred_single_ref. Could you assist with this?
from jury import Jury
scorer = Jury()
# [2, 5/5]
p = [
['dummy text', 'dummy text', 'dummy text', 'dummy text', 'dummy text'],
['dummy text', 'dummy text', 'dummy text', 'dummy text', 'dummy text']
]
# [2, 4/2]
r = [['be looking for a certain office in the building ',
' ask the elevator operator for directions ',
' be a trained detective ',
' be at the scene of a crime'],
['leave the room ',
' transport the notebook']]
scores = scorer(predictions=p, references=r)
Output:
Traceback (most recent call last):
File "/home/axe/Projects/VisComSense/del.py", line 22, in <module>
scores = scorer(predictions=p, references=r)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/core.py", line 78, in __call__
score = self._compute_single_score(inputs)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/core.py", line 137, in _compute_single_score
score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/datasets/metric.py", line 404, in compute
output = self._compute(predictions=predictions, references=references, **kwargs)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/_core/base.py", line 325, in _compute
result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 241, in evaluate
return eval_fn(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 195, in _compute_multi_pred_multi_ref
score = self._compute_single_pred_multi_ref(
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 176, in _compute_single_pred_multi_ref
return self._compute_single_pred_single_ref(
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 165, in _compute_single_pred_single_ref
predictions = predictions.reshape(
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/collator.py", line 35, in reshape
return Collator(_seq.reshape(args).tolist(), keep=True)
ValueError: cannot reshape array of size 20 into shape (10,)
Process finished with exit code 1
We need a more elaborate Usage section in README.md.
Brief metric info
The Word Error Rate (WER) metric is a derivation of edit distance.
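WER is the word-level edit distance (substitutions + deletions + insertions) normalized by the number of reference words. A minimal reference sketch, not jury's implementation:
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)
print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words = 0.33...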
Meta information
Please fill below if applicable:
Brief metric info
chrF++ is a tool for automatic evaluation of machine translation output based on character n-gram precision and recall enhanced with word n-grams. The tool calculates the F-score averaged on all character and word n-grams, where the default character n-gram order is 6 and word n-gram order is 2. The arithmetic mean is used for n-gram averaging.
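For illustration, a hedged sketch using sacrebleu (assumed backend; word_order=2 turns chrF into chrF++):
from sacrebleu.metrics import CHRF
chrf_pp = CHRF(word_order=2)  # char_order defaults to 6
score = chrf_pp.corpus_score(
    ["the cat is on the mat"],            # hypotheses
    [["the cat is playing on the mat"]],  # one stream of references
)
print(score.score)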
Meta information
Please fill below if applicable:
Brief metric info
TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation.
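For illustration, a hedged sketch using sacrebleu (assumed backend); the score is the number of edits normalized by the average number of reference words (lower is better):
from sacrebleu.metrics import TER
ter = TER()
score = ter.corpus_score(
    ["the cat is on the mat"],
    [["the cat is playing on the mat"]],
)
print(score.score)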
Meta information
Please fill below if applicable:
Currently, the CLI supports reading from a tsv or txt file, but for multiple inputs, predictions/ and references/ folders can be used.
Add a reduce_fn parameter to the Jury.evaluate() method, which can take either a string (looked up as a numpy function) or a callable.
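A hedged sketch of the proposed interface (reduce_fn would collapse per-reference scores for multi-reference inputs):
import numpy as np
from jury import Jury
scorer = Jury()
p = [["the cat is on the mat"]]
r = [["the cat is playing on the mat", "a cat sits on the mat"]]
scores_max = scorer(predictions=p, references=r, reduce_fn="max")    # string resolved on numpy
scores_avg = scorer(predictions=p, references=r, reduce_fn=np.mean)  # or any callable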
CLI implementation for the package that reads from txt files.
Draft Usage:
jury evaluate --predictions predictions.txt --references references.txt
NLGEval handles a single prediction with multiple references by having you specify multiple references.txt files, and similarly in the API.
My idea is to have a single prediction file and a single reference file that include multiple predictions or multiple references. Within a single txt file, maybe we can use some sort of special separator like "<sep>" instead of a special char like [",", ";", ":", "\t"]; maybe tab-separated would be OK. Wdyt? @fcakyon @cemilcengiz
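To make the idea concrete, a purely illustrative draft of a single references.txt with the <sep> separator, one sample per line:
Today is a wonderful day<sep>The weather outside is wonderful.
the cat is playing on the mat.<sep>The cat plays on the mat.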
Describe the bug
CI failing due to the comet tests -> https://github.com/obss/jury/actions/runs/4607786676/jobs/8142742832?pr=126
Related to Unbabel/COMET#125.
NOTE: Interestingly, tests are passing on Python 3.7. @ricardorei
ERROR tests/jury/metrics/test_comet.py::test_basic - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
ERROR tests/jury/metrics/test_comet.py::test_multiple_ref - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
ERROR tests/jury/metrics/test_comet.py::test_multiple_pred_multiple_ref - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
To Reproduce
Run tests.
Exception Traceback (if available)
self = <Response [401]>
def raise_for_status(self):
"""Raises :class:`HTTPError`, if one occurred."""
http_error_msg = ""
if isinstance(self.reason, bytes):
# We attempt to decode utf-8 first because some servers
# choose to localize their reason strings. If the string
# isn't utf-8, we fall back to iso-8859-1 for all other
# encodings. (See PR #3538)
try:
reason = self.reason.decode("utf-8")
except UnicodeDecodeError:
reason = self.reason.decode("iso-8859-1")
else:
reason = self.reason
if 400 <= self.status_code < 500:
http_error_msg = (
f"{self.status_code} Client Error: {reason} for url: {self.url}"
)
elif 500 <= self.status_code < 600:
http_error_msg = (
f"{self.status_code} Server Error: {reason} for url: {self.url}"
)
if http_error_msg:
> raise HTTPError(http_error_msg, response=self)
E requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/wmt21-cometinho-da/revision/main
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/requests/models.py:1021: HTTPError
The above exception was the direct cause of the following exception:
@pytest.fixture(scope="module")
def jury_comet():
> metric = AutoMetric.load(
"comet",
config_name="wmt21-cometinho-da",
compute_kwargs={"gpus": 0, "num_workers": 0, "progress_bar": False, "batch_size": 2},
)
tests/jury/metrics/test_comet.py:11:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
jury/metrics/_core/auto.py:116: in load
metric = klass.construct(task=task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/auxiliary.py:101: in construct
return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:237: in _construct
return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:[222](https://github.com/obss/jury/actions/runs/4607786676/jobs/8142742832?pr=126#step:10:223): in __init__
super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:102: in __init__
self.download_and_prepare()
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/evaluate/module.py:651: in download_and_prepare
self._download_and_prepare(dl_manager)
jury/metrics/comet/comet_for_language_generation.py:107: in _download_and_prepare
checkpoint_path = comet.download_model(self.config_name)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/comet/models/__init__.py:40: in download_model
model_path = snapshot_download(repo_id=model, cache_dir=saving_directory)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py:163: in snapshot_download
repo_info = _api.repo_info(
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:1817: in repo_info
return method(
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:1626: in model_info
hf_raise_for_status(r)
Environment Information:
Hey, how are the different BLEU scores calculated?
For the given snippet, why are all bleu(n) scores identical?
And how does this relate to nltk's sentence_bleu (weights)?
from jury import Jury
scorer = Jury()
predictions = [
["the cat is on the mat", "There is cat playing on the mat"],
["Look! a wonderful day."]
]
references = [
["the cat is playing on the mat.", "The cat plays on the mat."],
["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
Output:
{'empty_predictions': 0,
'total_items': 2,
'bleu_1': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'bleu_2': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'bleu_3': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'bleu_4': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'meteor': {'score': 0.5420511682934044},
'rouge': {'rouge1': 0.7783882783882783,
'rouge2': 0.5925324675324675,
'rougeL': 0.7426739926739926,
'rougeLsum': 0.7426739926739926}}
We should add some basic usage to the README.
The sklearn package name on the PyPI index is deprecated; update it to scikit-learn.
Metric implementation with arrow tables should be adopted for the multiple-references & multiple-predictions cases.
Currently, a corpus like the one below causes problems, as the metrics do not expect an empty list:
import jury
p = [["a b c"], []]
r = [["a b d e f"], ["a g h i"]]
scorer = jury.Jury()
scores = scorer(predictions=p, references=r)
The code above throws an exception when it encounters an empty list:
Traceback (most recent call last):
File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 202, in <module>
scores = scorer(predictions=p, references=r)
File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 79, in __call__
score = self._compute_single_score(inputs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 148, in _compute_single_score
score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn)
File "/home/devrimcavusoglu/lab/gh/jury/venv/lib/python3.8/site-packages/datasets/metric.py", line 402, in compute
output = self._compute(predictions=predictions, references=references, **kwargs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/base.py", line 325, in _compute
result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/bleu/bleu_for_language_generation.py", line 262, in evaluate
return super().evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/base.py", line 279, in evaluate
return eval_fn(predictions=predictions, references=references, **kwargs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/bleu/bleu_for_language_generation.py", line 216, in _compute_multi_pred_multi_ref
adjusted_prediction_length += get_token_lengths(preds, reduce_fn=max)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/utils.py", line 58, in get_token_lengths
return int(reduce_fn(token_lengths))
ValueError: max() arg is an empty sequence
Process finished with exit code 1
I was trying to check the same example mentioned in the README file for BLEURT. It is failing by throwing an error. Please let me know the issue.
Error:
ImportError Traceback (most recent call last)
<ipython-input-16-ed14e2ab4c7e> in <module>
----> 1 bleurt = Bleurt.construct()
2 score = bleurt.compute(predictions=predictions, references=references)
~\anaconda3\lib\site-packages\jury\metrics\_core\auxiliary.py in construct(cls, task, resulting_name, compute_kwargs, **kwargs)
99 subclass = cls._get_subclass()
100 resulting_name = resulting_name or cls._get_path()
--> 101 return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
102
103 @classmethod
~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in _construct(cls, resulting_name, compute_kwargs, **kwargs)
235 cls, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs
236 ):
--> 237 return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
238
239 @staticmethod
~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in __init__(self, resulting_name, compute_kwargs, **kwargs)
220 def __init__(self, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs):
221 compute_kwargs = self._validate_compute_kwargs(compute_kwargs)
--> 222 super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
223
224 def _validate_compute_kwargs(self, compute_kwargs: Dict[str, Any]) -> Dict[str, Any]:
~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in __init__(self, task, resulting_name, compute_kwargs, config_name, keep_in_memory, cache_dir, num_process, process_id, seed, experiment_id, max_concurrent_cache_files, timeout, **kwargs)
100 self.resulting_name = resulting_name if resulting_name is not None else self.name
101 self.compute_kwargs = compute_kwargs or {}
--> 102 self.download_and_prepare()
103
104 @abstractmethod
~\anaconda3\lib\site-packages\evaluate\module.py in download_and_prepare(self, download_config, dl_manager)
649 )
650
--> 651 self._download_and_prepare(dl_manager)
652
653 def _download_and_prepare(self, dl_manager):
~\anaconda3\lib\site-packages\jury\metrics\bleurt\bleurt_for_language_generation.py in _download_and_prepare(self, dl_manager)
120 global bleurt
121 try:
--> 122 from bleurt import score
123 except ModuleNotFoundError:
124 raise ModuleNotFoundError(
ImportError: cannot import name 'score' from 'bleurt' (unknown location)
There are several different task types; jury has mainly implemented NLG metrics (including precision, recall, etc. as modified n-gram precision and the like). It'd be nice to have other types of metrics (e.g. precision for sequence labeling).
Brief metric info
COMET is proposed primarily as an MT evaluation metric through a trained model.
COMET is an open-source framework for MT evaluation that can be used for two purposes:
- To evaluate MT systems with our currently available high-performing metrics (check: COMET Metrics).
- To train and develop new metrics.
Meta information
Please fill below if applicable: