obss / jury
Comprehensive NLP Evaluation System
License: MIT License
Meteor for multiple languages
It would be nice if the METEOR implementation supported multiple languages (there is a Java library, but I could not find any Python implementation).
Meta information
Hey, why does computing the BLEU score more than once affect the keys of the score dict?
e.g. 'bleu_1', 'bleu_1_1', 'bleu_1_1_1'
Overall I find the library quite user-friendly, but I am unsure about this behavior.
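For reference, a minimal repro sketch of what I mean (assuming the default metric set; output keys abbreviated):
from jury import Jury
scorer = Jury()
p = [["the cat is on the mat"]]
r = [["the cat is playing on the mat"]]
print(scorer(predictions=p, references=r).keys())  # ..., 'bleu_1', 'bleu_2', ...
print(scorer(predictions=p, references=r).keys())  # ..., 'bleu_1_1', ... (the surprising part)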
nltk recently released newer versions 3.6.6, 3.6.7, and 3.7. According to the dependabot alerts, the currently pinned version is inefficient in several particular cases, and the new versions eliminate these known performance issues.
Due to the nature of the Jury API, all input strings must be whole (not tokenized); the current implementation of the BLEU score tokenizes by whitespace. However, one might want results for smaller tokens, morphemes, or even the character level rather than a word-level BLEU score. Thus, it'd be great to support this by adding tokenizer support to the BLEU score computation.
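A hedged sketch of one possible interface; the tokenizer key in compute_kwargs is hypothetical, not an existing parameter:
from jury import Jury
from jury.metrics import load_metric
# Hypothetical: pass a callable that splits a string into the desired units.
char_bleu = load_metric("bleu", compute_kwargs={"tokenizer": list})  # character-level BLEU
scorer = Jury(metrics=[char_bleu])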
Currently, jury allows the input metrics passed in Jury(metrics=metrics) to be either a list of jury.metrics.Metric objects or a list of str, but it does not allow using both str and Metric objects together, as
from jury import Jury
from jury.metrics import load_metric
metrics = ["bleu", load_metric("meteor")]
jury = Jury(metrics=metrics)
raises an error, as the metrics parameter expects a NestedSingleType of object which is either list<str> or list<jury.metrics.Metric>.
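Until mixed input is supported, a minimal workaround sketch is to normalize the str entries yourself before constructing Jury:
from jury import Jury
from jury.metrics import load_metric
metrics = ["bleu", load_metric("meteor")]
# Map plain strings through load_metric so the list becomes a homogeneous list of Metric objects.
metrics = [load_metric(m) if isinstance(m, str) else m for m in metrics]
jury = Jury(metrics=metrics)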
Describe the bug
Failure when loading the bleu metric, probably due to the loose version-range dependency on datasets through evaluate.
To Reproduce
load_metric("bleu")
Exception Traceback (if available)
AttributeError Traceback (most recent call last)
[<ipython-input-15-2fcc283dab87>](https://localhost:8080/#) in <cell line: 2>()
1 MT_METRICS = [
----> 2 load_metric("bleu", resulting_name="bleu_1", compute_kwargs={"max_order": 1}),
3 load_metric("bleu", resulting_name="bleu_2", compute_kwargs={"max_order": 2}),
4 load_metric("meteor"),
5 load_metric("rouge"),
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/auto.py](https://localhost:8080/#) in load_metric(path, resulting_name, task, compute_kwargs, use_jury_only, **kwargs)
---> 55 return AutoMetric.load(
56 path=path,
57 resulting_name=resulting_name,
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/auto.py](https://localhost:8080/#) in load(cls, path, task, resulting_name, compute_kwargs, use_jury_only, **kwargs)
114 factory_class = module.__main_class__
115 klass = getattr(module, factory_class)
--> 116 metric = klass.construct(task=task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
117 return metric
118
[/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu.py](https://localhost:8080/#) in construct(cls, task, resulting_name, compute_kwargs, **kwargs)
22 subclass = cls._get_subclass()
23 resulting_name = resulting_name or cls._get_path(compute_kwargs=compute_kwargs)
---> 24 return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
25
26 @classmethod
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py](https://localhost:8080/#) in _construct(cls, resulting_name, compute_kwargs, **kwargs)
235 cls, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs
236 ):
--> 237 return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
238
239 @staticmethod
[/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu_for_language_generation.py](https://localhost:8080/#) in __init__(self, resulting_name, compute_kwargs, **kwargs)
119 self.should_change_resulting_name = resulting_name is None
120 self.tokenizer = DefaultTokenizer()
--> 121 super().__init__(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
122
123 def _info(self):
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py](https://localhost:8080/#) in __init__(self, resulting_name, compute_kwargs, **kwargs)
220 def __init__(self, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs):
221 compute_kwargs = self._validate_compute_kwargs(compute_kwargs)
--> 222 super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
223
224 def _validate_compute_kwargs(self, compute_kwargs: Dict[str, Any]) -> Dict[str, Any]:
[/usr/local/lib/python3.9/dist-packages/jury/metrics/_core/base.py](https://localhost:8080/#) in __init__(self, task, resulting_name, compute_kwargs, config_name, keep_in_memory, cache_dir, num_process, process_id, seed, experiment_id, max_concurrent_cache_files, timeout, **kwargs)
100 self.resulting_name = resulting_name if resulting_name is not None else self.name
101 self.compute_kwargs = compute_kwargs or {}
--> 102 self.download_and_prepare()
103
104 @abstractmethod
[/usr/local/lib/python3.9/dist-packages/evaluate/module.py](https://localhost:8080/#) in download_and_prepare(self, download_config, dl_manager)
649 )
650
--> 651 self._download_and_prepare(dl_manager)
652
653 def _download_and_prepare(self, dl_manager):
[/usr/local/lib/python3.9/dist-packages/jury/metrics/bleu/bleu_for_language_generation.py](https://localhost:8080/#) in _download_and_prepare(self, dl_manager)
150 nmt_source = "https://raw.githubusercontent.com/tensorflow/nmt/0be864257a76c151eef20ea689755f08bc1faf4e/nmt/scripts/bleu.py"
--> 151 self.external_module_path = dl_manager.download(nmt_source)
152
153 def _preprocess(self, predictions: Collator, references: Collator) -> Tuple[Collator, Collator]:
[/usr/local/lib/python3.9/dist-packages/datasets/download/download_manager.py](https://localhost:8080/#) in download(self, url_or_urls)
425
426 start_time = datetime.now()
--> 427 downloaded_path_or_paths = map_nested(
428 download_func,
429 url_or_urls,
[/usr/local/lib/python3.9/dist-packages/datasets/utils/py_utils.py](https://localhost:8080/#) in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, parallel_min_length, types, disable_tqdm, desc)
433 # Singleton
434 if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 435 return function(data_struct)
436
437 disable_tqdm = disable_tqdm or not logging.is_progress_bar_enabled()
[/usr/local/lib/python3.9/dist-packages/datasets/download/download_manager.py](https://localhost:8080/#) in _download(self, url_or_filename, download_config)
451 # append the relative path to the base_path
452 url_or_filename = url_or_path_join(self._base_path, url_or_filename)
--> 453 return cached_path(url_or_filename, download_config=download_config)
454
455 def iter_archive(self, path_or_buf: Union[str, io.BufferedReader]):
[/usr/local/lib/python3.9/dist-packages/datasets/utils/file_utils.py](https://localhost:8080/#) in cached_path(url_or_filename, download_config, **download_kwargs)
193 use_auth_token=download_config.use_auth_token,
194 ignore_url_params=download_config.ignore_url_params,
--> 195 storage_options=download_config.storage_options,
196 download_desc=download_config.download_desc,
197 )
AttributeError: 'DownloadConfig' object has no attribute 'storage_options'
Environment Information:
Currently, most of the metrics that require downloading from a source use the download() helper function under utils; these implementations need to be refactored to use the dl_manager available in the underlying base class.
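A minimal sketch of the target pattern, following the _download_and_prepare usage visible in the traceback above (the URL here is a placeholder):
def _download_and_prepare(self, dl_manager):
    # Prefer the dl_manager provided by the underlying evaluate base class
    # over the ad-hoc utils download() helper.
    source = "https://example.com/external_metric_module.py"  # placeholder URL
    self.external_module_path = dl_manager.download(source)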
If you give the BLEU metric hypothesis/reference pairs that consist of a single hypothesis and a single reference, the results don't make sense. If you duplicate either or both of the elements, then it calculates the expected results.
Reproduction code can be found in the following gist:
https://gist.github.com/Sophylax/2f70729a8ecb669c98898c65f7aed679
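For convenience, a minimal inline repro along the lines of the gist (assumed to exhibit the same behavior):
from jury import Jury
scorer = Jury(metrics=["bleu"])
# One hypothesis paired with one reference: the scores come out wrong.
broken = scorer(predictions=[["the cat is on the mat"]],
                references=[["the cat is playing on the mat"]])
# Duplicating the hypothesis yields the expected results.
ok = scorer(predictions=[["the cat is on the mat"] * 2],
            references=[["the cat is playing on the mat"]])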
Describe the bug
I was running RobertaForQuestionAnswering on HuggingFace's squad-v2 train set (~86k examples). The Accuracy metric threw a division-by-zero error at AccuracyForLanguageGeneration._compute_single_pred_single_ref.
To Reproduce
Run the datasets squad-v2 train set through pipeline("question-answering", ...).
Expected behavior
Run without error.
Exception Traceback (if available)
If applicable, add full traceback to help explain your problem.
ration.py:107, in AccuracyForLanguageGeneration._compute_single_pred_single_ref(self, predictions, references, reduce_fn, **kwargs)
105 if token in ref_counts:
106 score += min(pred_count, ref_counts[token]) # Intersection count
--> 107 scores.append(score / max(len(pred), len(ref)))
108 avg_score = sum(scores) / len(scores)
109 return {"score": avg_score}
ZeroDivisionError: division by zero
Environment Information:
evaluate==0.2.2
datasets==2.11.0
Thanks, I appreciate that jury exists. I could patch this by cloning and doing an in-depth trace analysis, but I wanted to know if there is a better way to patch it.
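Not the official fix, but a minimal defensive-guard sketch around the failing division (per the traceback, pred and ref are the tokenized prediction and reference):
def safe_length_ratio(score: float, pred: list, ref: list) -> float:
    # Avoid ZeroDivisionError when both the prediction and the reference
    # tokenize to zero tokens (e.g. empty answers in squad-v2).
    denom = max(len(pred), len(ref))
    return score / denom if denom else 0.0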
Citation info / a badge on the repo would be good for the project, for those looking for how to cite it (but unfortunately unable to find it for now).
BERTScore had a new release, 0.3.11, which added support for DeBERTa v3 and ByT5 models.
The codebase needs a refactor, mainly for docstrings and constant strings (like _CITATIONS, etc. used in metric info). Also, import statements need to be reviewed (e.g. imports for no-longer-supported/legacy versions should be removed). Update the README accordingly.
Starting with v2.3, jury no longer supports python_version < 3.8.
There are (timeout) issues with the current Prism model source. We should upload the model-related resource files to a publicly available source for consistent connection and throughput.
Describe the bug
jury-v2 raises the following error on import
ModuleNotFoundError: No module named 'validators'
Potential fix
Add validators to requirements.txt.
CER is currently available in HF datasets and hence implicitly available in jury as well. Yet, to fully utilize jury, it needs to be added explicitly.
HF Implementation: https://github.com/huggingface/datasets/blob/master/metrics/cer/cer.py
A newer version of nltk changed the METEOR computation; it now requires input strings to be pre-tokenized.
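For illustration, a minimal sketch of the new interface (nltk >= 3.6.6; the required resource downloads may vary by version):
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.meteor_score import meteor_score
nltk.download("wordnet")  # lexical resource required by METEOR
nltk.download("punkt")    # required by word_tokenize
reference = "the cat is playing on the mat"
hypothesis = "the cat is on the mat"
# Newer nltk expects pre-tokenized inputs (lists of tokens); raw strings raise a TypeError.
score = meteor_score([word_tokenize(reference)], word_tokenize(hypothesis))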
Referring to thompsonb/prism#13: since it seems like no active maintenance is going on, we can add this support on a public fork.
BARTScore was published at NeurIPS 2021. We can add it to jury with a minimal requirement set.
Repo: https://github.com/neulab/BARTScore
Paper: https://arxiv.org/abs/2106.11520
We need to add a demo notebook with some well-known dataset and a pretrained model from huggingface to showcase the usage of the jury package.
load_metric currently takes a path; however, a relative path may be problematic, and the path also points to the script (old style) rather than the enclosing metric folder.
Old usage,
my_metric = jury.load_metric("my_project/my_custom_metrics/my_metric/my_metric.py") # Rather than this
Suggested replacement,
my_metric = jury.load_metric("my_project/my_custom_metrics/my_metric") # this should be the correct usage
my_metric = jury.load_metric("my_project.my_custom_metrics.my_metric") # Another alternative for pathlike string
There is currently a bug in the Squad F1 and EM metrics when calculating with multiple predictions.
Currently, the test cases operate entirely without fixtures, which hurts performance while testing. Implementing fixtures can dramatically speed up testing.
License information is currently missing for the BLEURT and Prism implementations and needs to be added.
Prism is an automatic MT metric which uses a sequence-to-sequence paraphraser to score MT system outputs conditioned on their respective human references.
Repo: https://github.com/thompsonb/prism
Paper: https://aclanthology.org/2020.emnlp-main.8/
BLEURT is currently supported in HF datasets, and hence implicitly in jury; however, to fully utilize jury for this metric, it needs to be explicitly added.
Repo: https://github.com/google-research/bleurt
Paper: https://arxiv.org/abs/2004.04696
HF Implementation: https://github.com/huggingface/datasets/tree/master/metrics/bleurt
Hello,
After downloading the contents from git and instantiating the object, I get this error:
/content/image-captioning-bottom-up-top-down
Traceback (most recent call last):
File "eval.py", line 11, in <module>
from jury import Jury
File "/usr/local/lib/python3.7/dist-packages/jury/__init__.py", line 1, in <module>
from jury.core import Jury
File "/usr/local/lib/python3.7/dist-packages/jury/core.py", line 6, in <module>
from jury.metrics import EvaluationInstance, Metric, load_metric
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/__init__.py", line 1, in <module>
from jury.metrics._core import (
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/__init__.py", line 1, in <module>
from jury.metrics._core.auto import AutoMetric, load_metric
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/auto.py", line 23, in <module>
from jury.metrics._core.base import Metric
File "/usr/local/lib/python3.7/dist-packages/jury/metrics/_core/base.py", line 28, in <module>
from datasets.utils.logging import get_logger
ModuleNotFoundError: No module named 'datasets.utils'; 'datasets' is not a package
Can you please check what the issue could be?
Enrich the default metric definitions in definitions.py, and add a method for altering (adding, deleting, changing) definitions.
Hey, when computing the BLEU score (snippet below), I am facing a reshape error in _compute_single_pred_single_ref. Could you assist with this?
from jury import Jury
scorer = Jury()
# [2, 5/5]
p = [
['dummy text', 'dummy text', 'dummy text', 'dummy text', 'dummy text'],
['dummy text', 'dummy text', 'dummy text', 'dummy text', 'dummy text']
]
# [2, 4/2]
r = [['be looking for a certain office in the building ',
' ask the elevator operator for directions ',
' be a trained detective ',
' be at the scene of a crime'],
['leave the room ',
' transport the notebook']]
scores = scorer(predictions=p, references=r)
Output:
Traceback (most recent call last):
File "/home/axe/Projects/VisComSense/del.py", line 22, in <module>
scores = scorer(predictions=p, references=r)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/core.py", line 78, in __call__
score = self._compute_single_score(inputs)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/core.py", line 137, in _compute_single_score
score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/datasets/metric.py", line 404, in compute
output = self._compute(predictions=predictions, references=references, **kwargs)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/_core/base.py", line 325, in _compute
result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 241, in evaluate
return eval_fn(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 195, in _compute_multi_pred_multi_ref
score = self._compute_single_pred_multi_ref(
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 176, in _compute_single_pred_multi_ref
return self._compute_single_pred_single_ref(
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/metrics/bleu/bleu_for_language_generation.py", line 165, in _compute_single_pred_single_ref
predictions = predictions.reshape(
File "/home/axe/VirtualEnvs/pyenv3_8/lib/python3.8/site-packages/jury/collator.py", line 35, in reshape
return Collator(_seq.reshape(args).tolist(), keep=True)
ValueError: cannot reshape array of size 20 into shape (10,)
Process finished with exit code 1
We need a more elaborate Usage section in README.md.
Brief metric info
The Word Error Rate (WER) metric is a derivation of edit distance.
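WER is the word-level edit distance (substitutions + deletions + insertions) normalized by the number of reference words. A minimal reference sketch, not jury's implementation:
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)
print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words = 0.33...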
Meta information
Please fill below if applicable:
Brief metric info
chrF++ is a tool for automatic evaluation of machine translation output based on character n-gram precision and recall enhanced with word n-grams. The tool calculates the F-score averaged on all character and word n-grams, where the default character n-gram order is 6 and word n-gram order is 2. The arithmetic mean is used for n-gram averaging.
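For illustration, a hedged sketch using sacrebleu (assumed backend; word_order=2 turns chrF into chrF++):
from sacrebleu.metrics import CHRF
chrf_pp = CHRF(word_order=2)  # char_order defaults to 6
score = chrf_pp.corpus_score(
    ["the cat is on the mat"],            # hypotheses
    [["the cat is playing on the mat"]],  # one stream of references
)
print(score.score)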
Meta information
Please fill below if applicable:
Brief metric info
TER (Translation Edit Rate, also called Translation Error Rate) is a metric to quantify the edit operations that a hypothesis requires to match a reference translation.
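For illustration, a hedged sketch using sacrebleu (assumed backend); the score is the number of edits normalized by the average number of reference words (lower is better):
from sacrebleu.metrics import TER
ter = TER()
score = ter.corpus_score(
    ["the cat is on the mat"],
    [["the cat is playing on the mat"]],
)
print(score.score)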
Meta information
Please fill below if applicable:
Currently, the CLI supports reading from a tsv or txt file, but for multiple inputs, predictions/ and references/ folders can be used.
Add a reduce_fn parameter to the Jury.evaluate() method, which can take either a string (looked up as a numpy function) or a callable.
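A hedged sketch of the proposed interface (reduce_fn would collapse per-reference scores for multi-reference inputs):
import numpy as np
from jury import Jury
scorer = Jury()
p = [["the cat is on the mat"]]
r = [["the cat is playing on the mat", "a cat sits on the mat"]]
scores_max = scorer(predictions=p, references=r, reduce_fn="max")    # string resolved on numpy
scores_avg = scorer(predictions=p, references=r, reduce_fn=np.mean)  # or any callable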
CLI implementation for the package that reads from txt files.
Draft Usage:
jury evaluate --predictions predictions.txt --references references.txt
NLGEval handles a single prediction with multiple references by having you specify multiple references.txt files, and similarly in the API.
My idea is to have a single prediction file and a single reference file that include multiple predictions or multiple references. Within a single txt file, maybe we can use some sort of special separator like "<sep>" instead of a special char like [",", ";", ":", "\t"]; maybe tab-separated would be OK. Wdyt? @fcakyon @cemilcengiz
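To make the idea concrete, a purely illustrative draft of a single references.txt with the <sep> separator, one sample per line:
Today is a wonderful day<sep>The weather outside is wonderful.
the cat is playing on the mat.<sep>The cat plays on the mat.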
Describe the bug
CI failing due to the comet tests -> https://github.com/obss/jury/actions/runs/4607786676/jobs/8142742832?pr=126
Related to Unbabel/COMET#125.
NOTE: Interestingly, tests are passing on Python 3.7. @ricardorei
ERROR tests/jury/metrics/test_comet.py::test_basic - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
ERROR tests/jury/metrics/test_comet.py::test_multiple_ref - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
ERROR tests/jury/metrics/test_comet.py::test_multiple_pred_multiple_ref - huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-642c0cf6-01a418f1433996a919d94c44)
To Reproduce
Run tests.
Exception Traceback (if available)
self = <Response [401]>
def raise_for_status(self):
"""Raises :class:`HTTPError`, if one occurred."""
http_error_msg = ""
if isinstance(self.reason, bytes):
# We attempt to decode utf-8 first because some servers
# choose to localize their reason strings. If the string
# isn't utf-8, we fall back to iso-8859-1 for all other
# encodings. (See PR #3538)
try:
reason = self.reason.decode("utf-8")
except UnicodeDecodeError:
reason = self.reason.decode("iso-8859-1")
else:
reason = self.reason
if 400 <= self.status_code < 500:
http_error_msg = (
f"{self.status_code} Client Error: {reason} for url: {self.url}"
)
elif 500 <= self.status_code < 600:
http_error_msg = (
f"{self.status_code} Server Error: {reason} for url: {self.url}"
)
if http_error_msg:
> raise HTTPError(http_error_msg, response=self)
E requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/models/wmt21-cometinho-da/revision/main
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/requests/models.py:1021: HTTPError
The above exception was the direct cause of the following exception:
@pytest.fixture(scope="module")
def jury_comet():
> metric = AutoMetric.load(
"comet",
config_name="wmt21-cometinho-da",
compute_kwargs={"gpus": 0, "num_workers": 0, "progress_bar": False, "batch_size": 2},
)
tests/jury/metrics/test_comet.py:11:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
jury/metrics/_core/auto.py:116: in load
metric = klass.construct(task=task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/auxiliary.py:101: in construct
return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:237: in _construct
return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:[222](https://github.com/obss/jury/actions/runs/4607786676/jobs/8142742832?pr=126#step:10:223): in __init__
super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
jury/metrics/_core/base.py:102: in __init__
self.download_and_prepare()
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/evaluate/module.py:651: in download_and_prepare
self._download_and_prepare(dl_manager)
jury/metrics/comet/comet_for_language_generation.py:107: in _download_and_prepare
checkpoint_path = comet.download_model(self.config_name)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/comet/models/__init__.py:40: in download_model
model_path = snapshot_download(repo_id=model, cache_dir=saving_directory)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py:163: in snapshot_download
repo_info = _api.repo_info(
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:1817: in repo_info
return method(
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py:124: in _inner_fn
return fn(*args, **kwargs)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/huggingface_hub/hf_api.py:1626: in model_info
hf_raise_for_status(r)
Environment Information:
Hey, how are the different BLEU scores calculated?
For the given snippet, why are all bleu(n) scores identical?
And how does this relate to nltk's sentence_bleu (weights)?
from jury import Jury
scorer = Jury()
predictions = [
["the cat is on the mat", "There is cat playing on the mat"],
["Look! a wonderful day."]
]
references = [
["the cat is playing on the mat.", "The cat plays on the mat."],
["Today is a wonderful day", "The weather outside is wonderful."]
]
scores = scorer(predictions=predictions, references=references)
Output:
{'empty_predictions': 0,
'total_items': 2,
'bleu_1': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'bleu_2': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'bleu_3': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'bleu_4': {'score': 0.42370250917168295,
'precisions': [0.8823529411764706,
0.6428571428571429,
0.45454545454545453,
0.125],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 11,
'reference_length': 11},
'meteor': {'score': 0.5420511682934044},
'rouge': {'rouge1': 0.7783882783882783,
'rouge2': 0.5925324675324675,
'rougeL': 0.7426739926739926,
'rougeLsum': 0.7426739926739926}}
We should add some basic usage to the README.
The sklearn package name on the PyPI index is deprecated; update it to scikit-learn.
Metric implementation with arrow tables should be adopted for the multiple-references & multiple-predictions cases.
Currently, a corpus like the one below causes problems, as the metrics do not expect an empty list:
import jury
p = [["a b c"], []]
r = [["a b d e f"], ["a g h i"]]
scorer = jury.Jury()
scores = scorer(predictions=p, references=r)
The code above throws an exception when it encounters an empty list:
Traceback (most recent call last):
File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 202, in <module>
scores = scorer(predictions=p, references=r)
File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 79, in __call__
score = self._compute_single_score(inputs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/core.py", line 148, in _compute_single_score
score = metric.compute(predictions=predictions, references=references, reduce_fn=reduce_fn)
File "/home/devrimcavusoglu/lab/gh/jury/venv/lib/python3.8/site-packages/datasets/metric.py", line 402, in compute
output = self._compute(predictions=predictions, references=references, **kwargs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/base.py", line 325, in _compute
result = self.evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **eval_params)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/bleu/bleu_for_language_generation.py", line 262, in evaluate
return super().evaluate(predictions=predictions, references=references, reduce_fn=reduce_fn, **kwargs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/base.py", line 279, in evaluate
return eval_fn(predictions=predictions, references=references, **kwargs)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/bleu/bleu_for_language_generation.py", line 216, in _compute_multi_pred_multi_ref
adjusted_prediction_length += get_token_lengths(preds, reduce_fn=max)
File "/home/devrimcavusoglu/lab/gh/jury/jury/metrics/_core/utils.py", line 58, in get_token_lengths
return int(reduce_fn(token_lengths))
ValueError: max() arg is an empty sequence
Process finished with exit code 1
I was trying to check the same example mentioned in the README file for BLEURT. It is failing by throwing an error. Please let me know the issue.
Error:
ImportError Traceback (most recent call last)
<ipython-input-16-ed14e2ab4c7e> in <module>
----> 1 bleurt = Bleurt.construct()
2 score = bleurt.compute(predictions=predictions, references=references)
~\anaconda3\lib\site-packages\jury\metrics\_core\auxiliary.py in construct(cls, task, resulting_name, compute_kwargs, **kwargs)
99 subclass = cls._get_subclass()
100 resulting_name = resulting_name or cls._get_path()
--> 101 return subclass._construct(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
102
103 @classmethod
~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in _construct(cls, resulting_name, compute_kwargs, **kwargs)
235 cls, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs
236 ):
--> 237 return cls(resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
238
239 @staticmethod
~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in __init__(self, resulting_name, compute_kwargs, **kwargs)
220 def __init__(self, resulting_name: Optional[str] = None, compute_kwargs: Optional[Dict[str, Any]] = None, **kwargs):
221 compute_kwargs = self._validate_compute_kwargs(compute_kwargs)
--> 222 super().__init__(task=self._task, resulting_name=resulting_name, compute_kwargs=compute_kwargs, **kwargs)
223
224 def _validate_compute_kwargs(self, compute_kwargs: Dict[str, Any]) -> Dict[str, Any]:
~\anaconda3\lib\site-packages\jury\metrics\_core\base.py in __init__(self, task, resulting_name, compute_kwargs, config_name, keep_in_memory, cache_dir, num_process, process_id, seed, experiment_id, max_concurrent_cache_files, timeout, **kwargs)
100 self.resulting_name = resulting_name if resulting_name is not None else self.name
101 self.compute_kwargs = compute_kwargs or {}
--> 102 self.download_and_prepare()
103
104 @abstractmethod
~\anaconda3\lib\site-packages\evaluate\module.py in download_and_prepare(self, download_config, dl_manager)
649 )
650
--> 651 self._download_and_prepare(dl_manager)
652
653 def _download_and_prepare(self, dl_manager):
~\anaconda3\lib\site-packages\jury\metrics\bleurt\bleurt_for_language_generation.py in _download_and_prepare(self, dl_manager)
120 global bleurt
121 try:
--> 122 from bleurt import score
123 except ModuleNotFoundError:
124 raise ModuleNotFoundError(
ImportError: cannot import name 'score' from 'bleurt' (unknown location)
There are several different task types; jury has mainly implemented NLG metrics (including precision, recall, etc. as modified n-gram precision and the like). It'd be nice to have other types of metrics (e.g. precision for sequence labeling).
Brief metric info
COMET is proposed primarily as an MT evaluation metric through a trained model.
COMET is an open-source framework for MT evaluation that can be used for two purposes:
- To evaluate MT systems with our currently available high-performing metrics (check: COMET Metrics).
- To train and develop new metrics.
Meta information
Please fill below if applicable: