
obss / jury

Stars: 182 | Watchers: 5 | Forks: 20 | Size: 292 KB

Comprehensive NLP Evaluation System

License: MIT License

Python 100.00%
natural-language-processing evaluation huggingface transformers datasets metrics pytorch python nlp machine-learning

jury's People

Contributors

cemilcengiz, devrimcavusoglu, eltociear, fcakyon, kennethenevoldsen, nish1001, sophylax, zafercavdar


jury's Issues

Meteor for multiple languages

It would be nice if the METEOR implementation supported multiple languages (there is a Java library, but I could not find any Python implementation).

Meta information

Computing BLEU more than once

Hey, why does computing the BLEU score more than once affect the key names in the score dict?
e.g. 'bleu_1', 'bleu_1_1', 'bleu_1_1_1'

Overall I find the library quite user-friendly, but I'm unsure about this behavior.
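
A minimal repro sketch of what I mean (based on my understanding of the API):

from jury import Jury
from jury.metrics import load_metric

bleu_1 = load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1})
scorer = Jury(metrics=[bleu_1])

predictions = [["the cat is on the mat"]]
references = [["the cat is on the mat"]]

# The first call yields a key like 'bleu_1'; further calls on the same scorer
# seem to rename it to 'bleu_1_1', 'bleu_1_1_1', and so on.
print(scorer(predictions=predictions, references=references).keys())
print(scorer(predictions=predictions, references=references).keys())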

Nltk version upgrade for >=3.6.6

nltk recently released newer versions 3.6.6, 3.6.7 and 3.7. According to the Dependabot alerts, the currently pinned version is inefficient in several particular cases, and the newer versions eliminate these known performance issues.

Add support for custom tokenizer for BLEU

Due to the nature of the Jury API, all input strings must be whole (not pre-tokenized), and the current BLEU implementation tokenizes by whitespace. However, one might want scores over smaller units, morphemes, or even characters rather than word-level BLEU. Thus, it would be great to support this by adding a tokenizer option to the BLEU score computation, e.g. as sketched below.
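
A purely hypothetical usage sketch of the request (the tokenizer option below does not exist in the current API; it only illustrates the idea):

from jury.metrics import load_metric

# Hypothetical: character-level BLEU via a user-supplied tokenizer.
# "tokenizer" is NOT an existing compute kwarg; this is the proposed interface.
def char_tokenizer(text):
    return list(text.replace(" ", ""))

bleu_char = load_metric("bleu", compute_kwargs={"tokenizer": char_tokenizer})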

Bug: Metric object and string cannot be used together in input.

Currently, jury allows the metrics passed in Jury(metrics=metrics) to be either a list of jury.metrics.Metric or a list of str, but it does not allow mixing str and Metric objects, as in:

from jury import Jury
from jury.metrics import load_metric

metrics = ["bleu", load_metric("meteor")]
jury = Jury(metrics=metrics)

raises an error, as the metrics parameter expects a NestedSingleType object that is either list<str> or list<jury.metrics.Metric>.
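
A workaround sketch until mixing is supported (normalize everything to Metric objects first; assumes load_metric accepts the same string names):

from jury import Jury
from jury.metrics import load_metric

# Convert plain metric names to Metric objects so the list is homogeneous.
raw_metrics = ["bleu", load_metric("meteor")]
metrics = [load_metric(m) if isinstance(m, str) else m for m in raw_metrics]
jury = Jury(metrics=metrics)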

[BUG] AttributeError: 'DownloadConfig' object has no attribute 'storage_options'

Describe the bug
Failure when loading the bleu metric, probably due to the loose version range dependency on datasets through evaluate.

To Reproduce
load_metric("bleu")

Exception Traceback (if available)

AttributeError                            Traceback (most recent call last)
<ipython-input-15-2fcc283dab87> in <cell line: 2>()
      1 MT_METRICS = [
----> 2     load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
      3     load_metric("bleu", resulting_name="bleu_2", compute_kwargs={"max_order": 2}),
      4     load_metric("meteor"),
      5     load_metric("rouge"),

/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/auto.py in load_metric(path, resulting_name, task, compute_kwargs, use_jury_only, **kwargs)
---> 55     return AutoMetric.load(
     56         path=path,
     57         resulting_name=resulting_name,

/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/auto.py in load(cls, path, task, resulting_name, compute_kwargs, use_jury_only, **kwargs)
    114             factory_class = module.__main_class__
    115             klass = getattr(module, factory_class)
--> 116             metric = klass.construct(task=task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    117         return metric
    118 

/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu.py in construct(cls, task, resulting_name, compute_kwargs, **kwargs)
     22         subclass = cls._get_subclass()
     23         resulting_name = resulting_name or cls._get_path(compute_kwargs=compute_kwargs)
---> 24         return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
     25 
     26     @classmethod

/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py in _construct(cls, resulting_name, compute_kwargs, **kwargs)
    235         cls, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs
    236     ):
--> 237         return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    238 
    239     @staticmethod

/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu_for_language_generation.py in __init__(self, resulting_name, compute_kwargs, **kwargs)
    119         self.should_change_resulting_name = resulting_name is None
    120         self.tokenizer = DefaultTokenizer()
--> 121         super().__init__(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    122 
    123     def _info(self):

/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py in __init__(self, resulting_name, compute_kwargs, **kwargs)
    220     def __init__(self, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs):
    221         compute_kwargs = self._validate_compute_kwargs(compute_kwargs)
--> 222         super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    223 
    224     def _validate_compute_kwargs(self, compute_kwargs: Dict[str, Any]) -> Dict[str, Any]:

/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py in __init__(self, task, resulting_name, compute_kwargs, config_name, keep_in_memory, cache_dir, num_process, process_id, seed, experiment_id, max_concurrent_cache_files, timeout, **kwargs)
    100         self.resulting_name = resulting_name if resulting_name is not None else self.name
    101         self.compute_kwargs = compute_kwargs or {}
--> 102         self.download_and_prepare()
    103 
    104     @abstractmethod

/usr/local/lib/python3.9/dist-packages/evaluate/module.py in download_and_prepare(self, download_config, dl_manager)
    649             )
    650 
--> 651         self._download_and_prepare(dl_manager)
    652 
    653     def _download_and_prepare(self, dl_manager):

/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu_for_language_generation.py in _download_and_prepare(self, dl_manager)
    150         nmt_source = "https://raw.githubusercontent.com/tensorflow/nmt/0be864257a76c151eef20ea689755f08bc1faf4e/nmt/scripts/bleu.py"
--> 151         self.external_module_path = dl_manager.download(nmt_source)
    152 
    153     def _preprocess(self, predictions: Collator, references: Collator) -> Tuple[Collator, Collator]:

/usr/local/lib/python3.9/dist-packages/datasets/download/download_manager.py in download(self, url_or_urls)
    425 
    426         start_time = datetime.now()
--> 427         downloaded_path_or_paths = map_nested(
    428             download_func,
    429             url_or_urls,

/usr/local/lib/python3.9/dist-packages/datasets/utils/py_utils.py in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, parallel_min_length, types, disable_tqdm, desc)
    433     # Singleton
    434     if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 435         return function(data_struct)
    436 
    437     disable_tqdm = disable_tqdm or not logging.is_progress_bar_enabled()

/usr/local/lib/python3.9/dist-packages/datasets/download/download_manager.py in _download(self, url_or_filename, download_config)
    451             # append the relative path to the base_path
    452             url_or_filename = url_or_path_join(self._base_path, url_or_filename)
--> 453         return cached_path(url_or_filename, download_config=download_config)
    454 
    455     def iter_archive(self, path_or_buf: Union[str, io.BufferedReader]):

/usr/local/lib/python3.9/dist-packages/datasets/utils/file_utils.py in cached_path(url_or_filename, download_config, **download_kwargs)
    193             use_auth_token=download_config.use_auth_token,
    194             ignore_url_params=download_config.ignore_url_params,
--> 195             storage_options=download_config.storage_options,
    196             download_desc=download_config.download_desc,
    197         )

AttributeError: 'DownloadConfig' object has no attribute 'storage_options'

Environment Information:

  • OS: Linux 5.10.147+
  • jury version: 2.2.3
  • evaluate version: 0.2.2
  • datasets version: 2.11.0

Refactor all metrics to use dl_manager internally

Currently, most metrics requiring a download from an external source use the download() helper function under utils; these implementations need to be refactored to use the dl_manager available in the underlying base class.

ZeroDivisionError: division by zero in AccuracyForLanguageGeneration._compute_single_pred_single_ref

Describe the bug
I was running RobertaForQuestionAnswering on HuggingFace's squad-v2 train set (~86k samples).
The Accuracy metric threw a division-by-zero error in AccuracyForLanguageGeneration._compute_single_pred_single_ref.


To Reproduce

  • Use datasets squad-v2 train set.
  • Run the samples through pipeline("question-answering", ...)

Expected behavior
Run without error.

Exception Traceback (if available)
If applicable, add full traceback to help explain your problem.

ration.py:107, in AccuracyForLanguageGeneration._compute_single_pred_single_ref(self, predictions, references, reduce_fn, **kwargs)
    105         if token in ref_counts:
    106             score += min(pred_count, ref_counts[token])  # Intersection count
--> 107     scores.append(score / max(len(pred), len(ref)))
    108 avg_score = sum(scores) / len(scores)
    109 return {"score": avg_score}

ZeroDivisionError: division by zero

Environment Information:

  • OS: Mac OS 13.2.1 (22D68)
  • jury version: 2.2.3
  • evaluate version: evaluate==0.2.2
  • datasets version: datasets==2.11.0

Thanks, I appreciate that jury exists. I could patch this by cloning the repo and doing an in-depth trace analysis, but I wanted to know if there is a better way to patch it.
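
For context, a standalone sketch of the kind of guard I had in mind (not necessarily the right fix for the library): skip the division when both token lists are empty.

from collections import Counter

def token_overlap_score(pred_tokens, ref_tokens):
    # Mirrors the intersection-count accuracy above, but guards the
    # denominator: if both sequences are empty, return 0.0 instead of
    # dividing by zero.
    denominator = max(len(pred_tokens), len(ref_tokens))
    if denominator == 0:
        return 0.0
    ref_counts = Counter(ref_tokens)
    score = sum(min(count, ref_counts[token]) for token, count in Counter(pred_tokens).items())
    return score / denominator

print(token_overlap_score([], []))             # 0.0 instead of ZeroDivisionError
print(token_overlap_score(["a", "b"], ["a"]))  # 0.5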

Refactor docstrings, typo-check, review imports

The codebase needs a refactor, mainly for docstrings and constant strings (like _CITATIONS, etc. used in metric info). Import statements also need to be reviewed (e.g. imports for no-longer-supported/legacy versions should be removed). Update the README accordingly.

New source for Prism model

There are issues (timeouts) with the current Prism model source. We should upload the model-related resource files to a publicly available source for consistent connectivity and throughput.

Upgrade nltk>=3.6.4

A newer version of nltk changed the METEOR score computation; it now requires the input strings to be pre-tokenized, as in the sketch below.
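
A small sketch of the behavioral change (assuming nltk >= 3.6.4, where meteor_score expects token lists rather than raw strings):

import nltk
from nltk.translate.meteor_score import meteor_score

# Resources needed by tokenization and METEOR.
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# Newer nltk versions require pre-tokenized input; raw strings raise an error.
reference = nltk.word_tokenize("the cat is on the mat")
hypothesis = nltk.word_tokenize("the cat sat on the mat")
print(meteor_score([reference], hypothesis))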

demo notebook

We need to add a demo notebook with a well-known dataset and a pretrained model from Hugging Face to showcase the usage of the jury package.

load_metric by path (relative/absolute)

load_metric currently takes a path; however, a relative path may be problematic, and the path also points to the metric script (old style) rather than the enclosing metric folder.

Old usage,

my_metric = jury.load_metric("my_project/my_custom_metrics/my_metric/my_metric.py")  # Rather than this

Suggested replacement,

my_metric = jury.load_metric("my_project/my_custom_metrics/my_metric")  # this should be the correct usage
my_metric = jury.load_metric("my_project.my_custom_metrics.my_metric")   # Another alternative for pathlike string

Implementing fixtures for test cases.

Currently, the test cases operate entirely without fixtures, which hurts performance while testing. Implementing fixtures (see the sketch below) can dramatically speed up the test suite.
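
A minimal sketch of the intended pattern (a module-scoped fixture so metric construction runs once per test module rather than once per test):

import pytest
from jury import Jury

@pytest.fixture(scope="module")
def jury_bleu():
    # Constructed once per module and reused by every test in it.
    return Jury(metrics=["bleu"])

def test_basic(jury_bleu):
    scores = jury_bleu(predictions=[["the cat is on the mat"]],
                       references=[["the cat is on the mat"]])
    assert scores["total_items"] == 1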

Facing datasets error

Hello,
After downloading the contents from git and instantiating the object, I get this error:

/content/image-captioning-bottom-up-top-down
Traceback (most recent call last):
  File "eval.py", line 11, in <module>
   from jury import Jury 
  File "/usr/local/lib/python3.7/dist-packages/jury/__init__.py", line 1, in <module>
    from jury.core import Jury
  File "/usr/local/lib/python3.7/dist-packages/jury/core.py", line 6, in <module>
    from jury.metrics import EvaluationInstance, Metric, load_metric
  File "/usr/local/lib/python3.7/dist-packages/jury/metrics/__init__.py", line 1, in <module>
    from jury.metrics._core import (
  File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/__init__.py", line 1, in <module>
    from jury.metrics._core.auto import AutoMetric, load_metric
  File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/auto.py", line 23, in <module>
    from jury.metrics._core.base import Metric
  File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/base.py", line 28, in <module>
    from datasets.utils.logging import get_logger
ModuleNotFoundError: No module named 'datasets.utils'; 'datasets' is not a package

Can you please check what the issue could be?

Improve metric definitions

Enrich the default metric definitions in definitions.py, and add a method for altering (adding, deleting, changing) definitions.

BLEU: ndarray reshape error

Hey, when computing the BLEU score (snippet below), I'm facing a reshape error in _compute_single_pred_single_ref.

Could you assist with this?

from jury import Jury

scorer = Jury()

# [2, 5/5]
p = [
        ['dummy text', 'dummy text', 'dummy text', 'dummy text', 'dummy text'],
        ['dummy text', 'dummy text', 'dummy text', 'dummy text', 'dummy text']
    ]

# [2, 4/2]
r = [['be looking for a certain office in the building ',
      ' ask the elevator operator for directions ',
      ' be a trained detective ',
      ' be at the scene of a crime'],
     ['leave the room ',
      ' transport the notebook']]

scores = scorer(predictions=p, references=r)

Output:

Traceback (most recent call last):
  File "/home/axe/Projects/VisComSense/del.py", line 22, in <module>
    scores = scorer(predictions=p, references=r)
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/core.py", line 78, in __call__
    score = self._compute_single_score(inputs)
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/core.py", line 137, in _compute_single_score
    score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn)
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/datasets/metric.py", line 404, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/_core/base.py", line 325, in _compute
    result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 241, in evaluate
    return eval_fn(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 195, in _compute_multi_pred_multi_ref
    score = self._compute_single_pred_multi_ref(
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 176, in _compute_single_pred_multi_ref
    return self._compute_single_pred_single_ref(
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 165, in _compute_single_pred_single_ref
    predictions = predictions.reshape(
  File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/collator.py", line 35, in reshape
    return Collator(_seq.reshape(args).tolist(), keep=True)
ValueError: cannot reshape array of size 20 into shape (10,)

Process finished with exit code 1

Add CHRF

Brief metric info
chrF++ is a tool for automatic evaluation of machine translation output based on character n-gram precision and recall enhanced with word n-grams. The tool calculates the F-score averaged on all character and word n-grams, where the default character n-gram order is 6 and word n-gram order is 2. The arithmetic mean is used for n-gram averaging.

Meta information
Please fill below if applicable:

Read support from a folder

Currently, the CLI supports reading from a tsv or txt file, but for multiple inputs a predictions/ and a references/ folder could be used.

Add reduce_fn parameter

Add a reduce_fn parameter to the Jury.evaluate() method, which can take either a string (looked up as a numpy function) or a callable, as sketched below.
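
A hypothetical usage sketch of what is being requested (not implemented yet):

import numpy as np
from jury import Jury

scorer = Jury(metrics=["bleu"])
predictions = [["the cat is on the mat", "a cat sits on the mat"]]
references = [["the cat is on the mat"]]

# Proposed: either a numpy function name or a callable should be accepted.
scores_by_name = scorer(predictions=predictions, references=references, reduce_fn="max")
scores_by_callable = scorer(predictions=predictions, references=references, reduce_fn=np.mean)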

CLI Implementation

CLI implementation for the package, reading from txt files.

Draft Usage:
jury evaluate --predictions predictions.txt --references references.txt

NLGEval handles a single prediction with multiple references by having you specify multiple references.txt files (and similarly in its API).

My idea is to have a single predictions file and a single references file, each able to include multiple predictions or multiple references per line. Within a single txt file, maybe we can use some sort of special separator like "<sep>" instead of a special character like [",", ";", ":", "\t"]; maybe tab-separated would be OK (see the sketch below). What do you think? @fcakyon @cemilcengiz
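
A quick sketch of how reading such files could work (assuming one item per line and "<sep>" between multiple predictions/references; purely illustrative):

def read_items(path):
    # Each line is one item; "<sep>" separates multiple predictions/references
    # within that item (hypothetical file format).
    with open(path, encoding="utf-8") as f:
        return [[part.strip() for part in line.split("<sep>")] for line in f if line.strip()]

predictions = read_items("predictions.txt")
references = read_items("references.txt")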

[BUG] requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/wmt21-cometinho-da/revision/main

Describe the bug

CI failing due to the comet tests -> https://github.com/obss/jury/actions/runs/4607786676/jobs/8142742832?pr=126
Related to Unbabel/COMET#125.

NOTE: Interestingly, the tests pass on Python 3.7. @ricardorei

ERROR tests/jury/metrics/test_comet.py::test_basic - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
ERROR tests/jury/metrics/test_comet.py::test_multiple_ref - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44) 
ERROR tests/jury/metrics/test_comet.py::test_multiple_pred_multiple_ref - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)

To Reproduce
Run tests.

Exception Traceback (if available)

self = <Response [401]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""
    
        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason
    
        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )
    
        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )
    
        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/wmt21-cometinho-da/revision/main

/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/requests/models.py:1021: HTTPError

The above exception was the direct cause of the following exception:

    @pytest.fixture(scope="module")
    def jury_comet():
>       metric = AutoMetric.load(
            "comet",
            config_name="wmt21-cometinho-da",
            compute_kwargs={"gpus": 0, "num_workers": 0, "progress_bar": False, "batch_size": 2},
        )

tests/jury/metrics/test_comet.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
jury/metrics/_core/auto.py:116: in load
    metric = klass.construct(task=task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/auxiliary.py:101: in construct
    return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:237: in _construct
    return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:222: in __init__
    super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:102: in __init__
    self.download_and_prepare()
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/evaluate/module.py:651: in download_and_prepare
    self._download_and_prepare(dl_manager)
jury/metrics/comet/comet_for_language_generation.py:107: in _download_and_prepare
    checkpoint_path = comet.download_model(self.config_name)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/comet/models/__init__.py:40: in download_model
    model_path = snapshot_download(repo_id=model, cache_dir=saving_directory)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
    return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py:163: in snapshot_download
    repo_info = _api.repo_info(
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
    return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:1817: in repo_info
    return method(
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
    return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:1626: in model_info
    hf_raise_for_status(r)

Environment Information:

  • OS: Windows 10
  • jury version: 2.2.3
  • evaluate version: 0.2.2
  • datasets version: 2.9.0

Understanding BLEU Score ('bleu_n')

Hey, how are the different BLEU scores calculated?

For the given snippet, why are all bleu_n scores identical?
And how does this relate to nltk's sentence_bleu (weights)?

from jury import Jury

scorer = Jury()
predictions = [
    ["the cat is on the mat", "There is cat playing on the mat"], 
    ["Look!    a wonderful day."]
]
references = [
    ["the cat is playing on the mat.", "The cat plays on the mat."], 
    ["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)

Output:

{'empty_predictions': 0,
 'total_items': 2,
 'bleu_1': {'score': 0.42370250917168295,
  'precisions': [0.8823529411764706,
   0.6428571428571429,
   0.45454545454545453,
   0.125],
  'brevity_penalty': 1.0,
  'length_ratio': 1.0,
  'translation_length': 11,
  'reference_length': 11},
 'bleu_2': {'score': 0.42370250917168295,
  'precisions': [0.8823529411764706,
   0.6428571428571429,
   0.45454545454545453,
   0.125],
  'brevity_penalty': 1.0,
  'length_ratio': 1.0,
  'translation_length': 11,
  'reference_length': 11},
 'bleu_3': {'score': 0.42370250917168295,
  'precisions': [0.8823529411764706,
   0.6428571428571429,
   0.45454545454545453,
   0.125],
  'brevity_penalty': 1.0,
  'length_ratio': 1.0,
  'translation_length': 11,
  'reference_length': 11},
 'bleu_4': {'score': 0.42370250917168295,
  'precisions': [0.8823529411764706,
   0.6428571428571429,
   0.45454545454545453,
   0.125],
  'brevity_penalty': 1.0,
  'length_ratio': 1.0,
  'translation_length': 11,
  'reference_length': 11},
 'meteor': {'score': 0.5420511682934044},
 'rouge': {'rouge1': 0.7783882783882783,
  'rouge2': 0.5925324675324675,
  'rougeL': 0.7426739926739926,
  'rougeLsum': 0.7426739926739926}}
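
For reference, this is how nltk's sentence_bleu controls the n-gram order through the weights tuple (independent of jury's internals), which is roughly what I would expect bleu_1 vs bleu_4 to correspond to:

from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "is", "playing", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "playing", "on", "mat"]

# Uniform weights over the first n orders give BLEU-n in nltk.
bleu_1 = sentence_bleu(reference, candidate, weights=(1.0,))
bleu_4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu_1, bleu_4)  # the two values differ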

Discard empty corpus before computation

Currently, a corpus like the one below causes problems, as the metrics do not expect an empty list:

import jury

p = [["a b c"], []]
r = [["a b d e f"], ["a g h i"]]

scorer = jury.Jury()
scores = scorer(predictions=p, references=r)

The code above throws an exception when it encounters the empty list:

Traceback (most recent call last):
  File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 202, in <module>
    scores = scorer(predictions=p, references=r)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 79, in __call__
    score = self._compute_single_score(inputs)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 148, in _compute_single_score
    score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn)
  File "/home/devrimcavusoglu/lab/gh/jury/venv/lib/python3.8/site-packages/datasets/metric.py", line 402, in compute
    output = self._compute(predictions=predictions, references=references, **kwargs)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/base.py", line 325, in _compute
    result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/bleu/bleu_for_language_generation.py", line 262, in evaluate
    return super().evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/base.py", line 279, in evaluate
    return eval_fn(predictions=predictions, references=references, **kwargs)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/bleu/bleu_for_language_generation.py", line 216, in _compute_multi_pred_multi_ref
    adjusted_prediction_length += get_token_lengths(preds, reduce_fn=max)
  File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/utils.py", line 58, in get_token_lengths
    return int(reduce_fn(token_lengths))
ValueError: max() arg is an empty sequence

Process finished with exit code 1
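
Until this is handled internally, a pre-filtering sketch like the following avoids the crash (at the cost of silently dropping the empty items):

import jury

p = [["a b c"], []]
r = [["a b d e f"], ["a g h i"]]

# Drop prediction/reference pairs where either side is empty before scoring.
kept = [(pred, ref) for pred, ref in zip(p, r) if pred and ref]
p_filtered = [pred for pred, _ in kept]
r_filtered = [ref for _, ref in kept]

scorer = jury.Jury()
scores = scorer(predictions=p_filtered, references=r_filtered)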

BLEURT is failing to produce results

I was trying the same example mentioned in the README file for BLEURT, and it fails with an error. Please let me know what the issue is.

Error:

ImportError                               Traceback (most recent call last)
<ipython-input-16-ed14e2ab4c7e> in <module>
----> 1 bleurt = Bleurt.construct()
      2 score = bleurt.compute(predictions=predictions, references=references)

~\anaconda3\lib\site-packages\jury\metrics\_core\auxiliary.py in construct(cls, task, resulting_name, compute_kwargs, **kwargs)
     99         subclass = cls._get_subclass()
    100         resulting_name = resulting_name or cls._get_path()
--> 101         return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    102 
    103     @classmethod

~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in _construct(cls, resulting_name, compute_kwargs, **kwargs)
    235         cls, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs
    236     ):
--> 237         return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    238 
    239     @staticmethod

~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in __init__(self, resulting_name, compute_kwargs, **kwargs)
    220     def __init__(self, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs):
    221         compute_kwargs = self._validate_compute_kwargs(compute_kwargs)
--> 222         super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
    223 
    224     def _validate_compute_kwargs(self, compute_kwargs: Dict[str, Any]) -> Dict[str, Any]:

~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in __init__(self, task, resulting_name, compute_kwargs, config_name, keep_in_memory, cache_dir, num_process, process_id, seed, experiment_id, max_concurrent_cache_files, timeout, **kwargs)
    100         self.resulting_name = resulting_name if resulting_name is not None else self.name
    101         self.compute_kwargs = compute_kwargs or {}
--> 102         self.download_and_prepare()
    103 
    104     @abstractmethod

~\anaconda3\lib\site-packages\evaluate\module.py in download_and_prepare(self, download_config, dl_manager)
    649             )
    650 
--> 651         self._download_and_prepare(dl_manager)
    652 
    653     def _download_and_prepare(self, dl_manager):

~\anaconda3\lib\site-packages\jury\metrics\bleurt\bleurt_for_language_generation.py in _download_and_prepare(self, dl_manager)
    120         global bleurt
    121         try:
--> 122             from bleurt import score
    123         except ModuleNotFoundError:
    124             raise ModuleNotFoundError(

ImportError: cannot import name 'score' from 'bleurt' (unknown location)

Metrics for different tasks

There are several different tasks; jury mainly implements NLG metrics (including precision, recall, etc. adapted as modified n-gram precision and so on). It would be nice to have other types of metrics as well (e.g. precision for sequence labeling).

Add COMET

Brief metric info
COMET is proposed primarily as an MT evaluation metric through a trained model.

COMET is an open-source framework for MT evaluation that can be used for two purposes:

  • To evaluate MT systems with our currently available high-performing metrics (check: COMET Metrics).
  • To train and develop new metrics.

Meta information
Please fill below if applicable:
