scandeval / scandeval

Evaluation of language models on mono- or multilingual tasks.

Home Page: https://scandeval.com

License: MIT License

Python 99.25% Makefile 0.75%
danish dutch english evaluation faroese german icelandic llm nlp norwegian swedish

scandeval's Introduction

Evaluation of pretrained language models on mono- or multilingual language tasks.



Installation

To install the package, simply run the following command in your favorite terminal:

$ pip install scandeval[all]

This will install the ScandEval package with all extras. You can also install the minimal version by leaving out the [all], in which case the package will let you know when an evaluation requires a certain extra dependency, and how to install it.
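
For example, the minimal installation is just:

$ pip install scandeval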

Quickstart

Benchmarking from the Command Line

The easiest way to benchmark pretrained models is via the command line interface. After having installed the package, you can benchmark your favorite model like so:

$ scandeval --model <model-id>

Here <model-id> is the HuggingFace model ID, which can be found on the HuggingFace Hub. By default this will benchmark the model on all available tasks. If you want to benchmark on a particular task, then use the --task argument:

$ scandeval --model <model-id> --task sentiment-classification

We can also narrow down which languages we would like to benchmark on, by setting the --language argument. Here we benchmark the model on the Danish sentiment classification task:

$ scandeval --model <model-id> --task sentiment-classification --language da

Multiple models, datasets and/or languages can be specified by just attaching multiple arguments. Here is an example with two models:

$ scandeval --model <model-id1> --model <model-id2>
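
Similarly, multiple languages can be specified by repeating the --language argument; for instance (assuming 'sv' is the Swedish language code used by ScandEval):

$ scandeval --model <model-id> --language da --language sv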

The specific model version/revision to use can also be specified by appending '@' followed by the revision to the model ID:

$ scandeval --model <model-id>@<commit>

This can be a branch name, a tag name, or a commit ID, and defaults to 'main', i.e. the latest version.

See all the arguments and options available for the scandeval command by typing

$ scandeval --help

Benchmarking from a Script

In a script, the syntax is similar to the command line interface. You simply initialise an object of the Benchmarker class, and call this benchmark object with your favorite model:

>>> from scandeval import Benchmarker
>>> benchmark = Benchmarker()
>>> benchmark(model="<model>")

To benchmark on a specific task and/or language, you simply specify the task or language arguments, shown here with the same example as above:

>>> benchmark(model="<model>", task="sentiment-classification", language="da")

If you want to benchmark a subset of all the models on the Hugging Face Hub, you can simply leave out the model argument. In this example, we're benchmarking all Danish models on the Danish sentiment classification task:

>>> benchmark(task="sentiment-classification", language="da")

Benchmarking from Docker

A Dockerfile is provided in the repo, which can be downloaded and run without needing to clone the repo and install from source. It can be fetched programmatically by running the following:

$ wget https://raw.githubusercontent.com/ScandEval/ScandEval/main/Dockerfile.cuda

Next, to be able to build the Docker image, first ensure that the NVIDIA Container Toolkit is installed and configured. Ensure that the CUDA version stated at the top of the Dockerfile matches the CUDA version installed (which you can check using nvidia-smi). After that, we build the image as follows:

$ docker build --pull -t scandeval -f Dockerfile.cuda .

With the Docker image built, we can now evaluate any model as follows:

$ docker run -e args="<scandeval-arguments>" --gpus 1 --name scandeval --rm scandeval

Here <scandeval-arguments> consists of the arguments passed to the scandeval CLI. This could for instance be --model <model-id> --task sentiment-classification.
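
For example, to run the Danish sentiment classification benchmark from earlier inside the container:

$ docker run -e args="--model <model-id> --task sentiment-classification --language da" --gpus 1 --name scandeval --rm scandeval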

Special Thanks 🙏

  • Thanks to UWV and KU Leuven for sponsoring the Azure OpenAI credits used to evaluate GPT-4-turbo in Dutch.
  • Thanks to Miðeind for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
  • Thanks to CHC for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in German.

Citing ScandEval

If you want to cite the framework then feel free to use this:

@inproceedings{nielsen2023scandeval,
  author = {Nielsen, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
  title = {{ScandEval: A Benchmark for Scandinavian Natural Language Processing}},
  year = {2023}
}

Remarks

The image used in the logo has been created by the amazing Scandinavia and the World team. Go check them out!

scandeval's People

Contributors

ajders, kennethenevoldsen, peregilk, peter-sk, saattrupdan, thomaskluiters, versae


scandeval's Issues

A change in batch size should trigger a change in gradient accumulation

Since different batch sizes tend to make the models perform differently when finetuning, a change in batch size should be compensated for by changing the gradient accumulation accordingly, keeping the effective batch size constant.

Perhaps set a pre-specified list of possible batch sizes, like (1,2,4,8,16,32), and apply corresponding gradient accumulation (32,16,8,4,2,1).
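
A minimal sketch of this idea (not the actual ScandEval implementation): keep the effective batch size constant at 32 by pairing each allowed batch size with a corresponding number of gradient accumulation steps.

# Sketch: pair batch sizes with gradient accumulation steps so that
# batch_size * gradient_accumulation == 32 for every allowed batch size.
BATCH_SIZES = (1, 2, 4, 8, 16, 32)
TARGET_EFFECTIVE_BATCH_SIZE = 32

def gradient_accumulation_for(batch_size: int) -> int:
    """Return the gradient accumulation steps paired with `batch_size`."""
    if batch_size not in BATCH_SIZES:
        raise ValueError(f"Unsupported batch size: {batch_size}")
    return TARGET_EFFECTIVE_BATCH_SIZE // batch_size

assert gradient_accumulation_for(8) == 4  # 8 * 4 == 32 examples per update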

Add mT5

Currently, there is no evaluation of the mT5 models.

Flag --model_id does not work, but -m does

$ scandeval --model_id KennethEnevoldsen/dfm-bert-base  --language da
Usage: scandeval [OPTIONS]
Try 'scandeval --help' for help.

Error: No such option: --model_id

Following the suggestion we get:


  Benchmark language models on Scandinavian language tasks.

Options:
  -m TEXT                         The HuggingFace model ID of the model(s) to
                                  be benchmarked. If not specified then all
[...]

This indicates that while -m is an option, the --model_id flag is not.

Environment

scandeval v. 3.0.0, installed using pip install scandeval[all]

Set up extras in setup

It should be possible to use the package without installing every framework (tensorflow, pytorch, flax and spacy currently). Instead, set up these as extras (and set up an all extra, including all of them) and raise an Exception if a benchmark is attempted when the framework is not installed.
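
A minimal sketch of how such a check could look (the helper and extra names here are hypothetical, not ScandEval's actual implementation):

# Sketch: raise a helpful error when a framework needed for a benchmark
# is not installed, pointing the user to the matching extra.
import importlib.util

def require_framework(framework: str, extra: str) -> None:
    """Raise an informative error if `framework` is not importable."""
    if importlib.util.find_spec(framework) is None:
        raise ModuleNotFoundError(
            f"This benchmark requires {framework!r}, which is not installed. "
            f"Install it with `pip install scandeval[{extra}]`."
        )

require_framework("spacy", extra="spacy")  # hypothetical extra name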

Add Swedish benchmark datasets

The superlim datasets could be used: https://spraakbanken.gu.se/en/resources/superlim.

Further, the UniversalDependencies repo has dependency parsing for Swedish.

Potentially:

Dataset wrapper

Aside from benchmarks, a slim version of the package should be available, which just loads in the datasets without benchmarking.

Benchmarking is not shown properly in notebooks

When using the Benchmark class to benchmark models in a notebook, a lot of output appears:

  • The config of the model
  • The architecture of the model
  • Caching logs
  • A block of text starting with ***** Running training *****
  • A message after each finetuning: Training completed. Do not forget to share your model on huggingface.co/models =)
  • A block of text starting with ***** Running evaluation *****

All of these should be removed. It seems like set_verbosity_error from the transformers library does not work in notebooks.
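
For reference, one way to silence this logging (whether it is sufficient inside notebooks is exactly what this issue questions) is a sketch along these lines, assuming recent versions of transformers and datasets:

# Sketch: lower the logging verbosity of transformers and datasets.
from transformers import logging as hf_logging
from datasets.utils import logging as ds_logging

hf_logging.set_verbosity_error()
ds_logging.set_verbosity_error()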

Evaluating private HuggingFace models

It would be great if ScandEval could also be used for evaluating private HuggingFace models. This could probably be done by passing along the flag "use_auth_token=True" to the .from_pretrained method. This allows the user to log in with huggingface-cli, which stores the auth token locally. It can also be passed on directly, but personally I find this less convenient.
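
A minimal sketch of the suggestion (model names are placeholders; this is not ScandEval's actual loading code):

# Sketch: pass the locally stored Hugging Face token (created via
# `huggingface-cli login`) along when loading a private model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<private-model-id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, use_auth_token=True)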

Invalid model name does not raise an error or a warning

When running the following line of code:

CUDA_VISIBLE_DEVICES=2 scandeval --model-id nota/model --language da

I would expect the model to throw an error like "could not find model in hub", but instead it simply skips all the benchmarks as seen here:

2023-07-09 09:02:42,441 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the truncated version of AngryTweets
2023-07-09 09:02:44,589 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the truncated version of AngryTweets. Skipping.
2023-07-09 09:02:44,589 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the truncated version of DaNE
2023-07-09 09:02:46,616 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the truncated version of DaNE. Skipping.
2023-07-09 09:02:46,616 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the Danish part of ScaLA
2023-07-09 09:02:48,447 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the Danish part of ScaLA. Skipping.
2023-07-09 09:02:48,447 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the Danish part of the truncated version of ScandiQA
2023-07-09 09:02:50,898 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the Danish part of the truncated version of ScandiQA. Skipping.
2023-07-09 09:02:50,898 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the speed estimation benchmark
2023-07-09 09:02:51,259 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the speed estimation benchmark. Skipping.
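
A minimal sketch of how the model ID could be validated up front, assuming a recent version of huggingface_hub (this is not ScandEval's actual code):

# Sketch: fail fast with a clear error if the model ID does not exist on
# the Hugging Face Hub, instead of silently skipping every benchmark.
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

def check_model_exists(model_id: str) -> None:
    """Raise a ValueError if `model_id` cannot be found on the Hub."""
    try:
        HfApi().model_info(model_id)
    except RepositoryNotFoundError:
        raise ValueError(
            f"The model {model_id!r} was not found on the Hugging Face Hub."
        )

check_model_exists("nota/model")  # raises ValueError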

Add support for zero and few-shot evaluation

To evaluate large language models with ScandEval there should be a possibility to benchmark the zero-shot performance and few-shot performance of tasks.

Maybe these should be included in the other benchmarks, to allow comparison with smaller finetuned models, or maybe they should have their own benchmark. If they are included in the other benchmarks, then we need a clear way to see whether a given performance arises from zero-shot, few-shot or finetuned evaluation. This could be indicated simply using asterisks.
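
A minimal sketch of how a few-shot prompt could be built for, e.g., sentiment classification; the prompt format and label names here are hypothetical, not ScandEval's actual prompts:

# Sketch: build a few-shot classification prompt from (text, label) examples.
def build_few_shot_prompt(examples: list[tuple[str, str]], document: str) -> str:
    """Concatenate labelled examples followed by the unlabelled document."""
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Text: {document}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    examples=[("Sikke en dejlig dag!", "positive"), ("Det var forfærdeligt.", "negative")],
    document="Maden var helt okay.",
)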

Language flag does not work as intended

When setting the language flag to da, the model still seems to be training on non-Danish datasets:

$ scandeval -m KennethEnevoldsen/dfm-bert-base  --language da
[...] 
↳ Finished finetuning and evaluation of KennethEnevoldsen/dfm-bert-base on DKHate.                                                               
[...]
↳ Benchmarking KennethEnevoldsen/dfm-bert-base on LCC:
[...]
↳ Benchmarking KennethEnevoldsen/dfm-bert-base on NoReC:
[...]

Scandeval version: 3.0.0

Write proper README

Should include:

  1. Installation
  2. Quick start, both to script usage and CLI usage
  3. Bibtex reference

Add support for SLU evaluation

As an analogue to the NLU benchmark, we should add an SLU benchmark.

An obvious task these could solve is automatic speech recognition, but there might be other desirable tasks to include as well.

This issue includes both the ability to benchmark acoustic models as well as setting up a leaderboard for them.

Add multi-GPU support for finetuning

Currently, when using multiple GPUs, the progress bars are not moving and it doesn't seem like anything is happening, despite the fact that the GPUs are in use. If they're indeed benchmarking the models then this progress should be displayed.

Cache interference issues when testing two models at once

It seems like the cache system interferes with other running sessions of scandeval, as it appears to delete the checkpoints of other runs.
This does not allow for a workflow like:

CUDA_VISIBLE_DEVICES=1 scandeval --model-id MODEL1 --language da --raise-errors
# on a separate instance:
CUDA_VISIBLE_DEVICES=2 scandeval --model-id MODEL2 --language da --raise-errors

Naturally I could use multiple --model-id flags, but I wanted them to run each on their own GPUs.

A simple solution is naturally to run them in different folders and thus create two scandeval caches (which is what I am doing now).
However, I would suggest nesting the cache folder such that the structure is:

.scandeval_cache/model_id/checkpoint-270

as opposed to:

.scandeval_cache/checkpoint-270

I might have missed other important considerations, though.
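
A minimal sketch of the suggested per-model cache layout (not ScandEval's actual code; the slash replacement is an extra precaution for org/model IDs):

# Sketch: nest checkpoints under a per-model directory so that concurrent
# runs do not delete each other's checkpoints.
from pathlib import Path

def checkpoint_dir(cache_dir: str, model_id: str, step: int) -> Path:
    """E.g. .scandeval_cache/MODEL1/checkpoint-270 rather than .scandeval_cache/checkpoint-270."""
    safe_model_id = model_id.replace("/", "--")  # avoid nested dirs for org/model IDs
    return Path(cache_dir) / safe_model_id / f"checkpoint-{step}"

print(checkpoint_dir(".scandeval_cache", "MODEL1", 270))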

Add support for seq2seq models

See if it's possible to rephrase the classification tasks as text generation tasks, enabling seq2seq models (such as mT5) to be benchmarked as well.
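
A minimal sketch of what such a rephrasing could look like; the exact prompt and label wording here is hypothetical:

# Sketch: rephrase a sentiment classification example as a text-to-text
# task, so a seq2seq model such as mT5 can generate the label as text.
def to_text2text(text: str, label: str) -> dict[str, str]:
    """Convert a (text, label) classification example to an input/target pair."""
    return {
        "input_text": f"Classify the sentiment of the following text: {text}",
        "target_text": label,  # e.g. "positive", "neutral" or "negative"
    }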

Handle Memory issues

When finetuning a large model, it might be that the standard batch size of 16 is too large. The individual benchmark classes could simply be instantiated with a smaller batch size, but the Benchmark class doesn't expose this option.

One way to deal with this could be to catch the error and cut the batch size in half for that particular evaluation, and continue halving until it works. Another way could be to include batch size as an argument in the Benchmark class. Needs to be in the CLI as well in that case.
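
A minimal sketch of the first approach (not ScandEval's actual error handling; `train_fn` is a hypothetical training callable):

# Sketch: halve the batch size and double gradient accumulation on CUDA
# out-of-memory errors, keeping the effective batch size constant.
import torch

def finetune_with_fallback(train_fn, batch_size: int = 16, grad_accum: int = 2):
    """Call `train_fn` with progressively smaller batch sizes until it fits in memory."""
    while batch_size >= 1:
        try:
            return train_fn(batch_size=batch_size, gradient_accumulation_steps=grad_accum)
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise
            torch.cuda.empty_cache()
            batch_size //= 2
            grad_accum *= 2
    raise RuntimeError("Could not find a batch size that fits in GPU memory.")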

HuggingFace Hub web scraper

We need a web scraper which compiles a list of all the currently available models in the Scandinavian languages.

This needs to fetch the model ID (for instance, the model ID for this model is flax-community/roberta-base-danish), as well as the metadata in the "tags" below the headline. Notably, we need to know which framework the model has been built in, so we can load it correctly.

The idea is then that this script should be run automatically every x days/hours/minutes, compiling a list of current models, which could be stored as a jsonl file in the repository.
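
A minimal sketch of such a scraper, assuming a recent version of huggingface_hub (the exact attribute names on the returned model objects vary between versions):

# Sketch: list Hub models tagged with a Scandinavian language and store
# their IDs, tags and library as JSONL.
import json

from huggingface_hub import HfApi

api = HfApi()
with open("scandinavian_models.jsonl", "w") as f:
    for language in ["da", "sv", "no", "nb", "nn", "is", "fo"]:
        for model in api.list_models(language=language):
            record = {
                "model_id": model.id,           # e.g. "flax-community/roberta-base-danish"
                "tags": model.tags,             # metadata tags from the Hub
                "library": model.library_name,  # e.g. "transformers" or "spacy"
            }
            f.write(json.dumps(record) + "\n")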

gpt-neox models cannot be benchmarked

I have trouble benchmarking GPT-NeoX models:
scandeval --model-id Isotonic/gpt_neox_225M --raise-errors
2023-04-12 12:13:42,013 [INFO] <scandeval.benchmarker>
↳ Benchmarking Isotonic/gpt_neox_225M on the truncated version of SweReC
Downloading (…)lve/main/config.json: 100% 677/677 [00:00<00:00, 79.3kB/s]
Downloading pytorch_model.bin: 100% 867M/867M [02:01<00:00, 7.11MB/s]
Downloading (…)okenizer_config.json: 100% 261/261 [00:00<00:00, 222kB/s]
Downloading (…)olve/main/vocab.json: 100% 799k/799k [00:00<00:00, 1.12MB/s]
Downloading (…)olve/main/merges.txt: 100% 457k/457k [00:00<00:00, 1.41MB/s]
Downloading (…)/main/tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 2.56MB/s]
Downloading (…)cial_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 69.7kB/s]
2023-04-12 12:15:54,792 [INFO] <scandeval.benchmark_dataset>
↳ The model has 152,445,952 parameters, a vocabulary size of 50,432 and a maximum sequence length of 2,048.
Benchmarking: 0%| | 0/10 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/scandeval", line 8, in <module>
    sys.exit(benchmark())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/scandeval/cli.py", line 209, in benchmark
    benchmarker(model_id=model_ids, dataset=datasets)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 382, in __call__
    return self.benchmark(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 189, in benchmark
    record = self._benchmark_single(
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 376, in _benchmark_single
    raise e
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 333, in _benchmark_single
    results, metadata_dict = dataset(model_id)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmark_dataset.py", line 619, in __call__
    return self.benchmark(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmark_dataset.py", line 320, in benchmark
    bs, ga = handle_error(
  File "/usr/local/lib/python3.9/site-packages/scandeval/utils.py", line 212, in handle_error
    raise InvalidBenchmark(str(e))
scandeval.exceptions.InvalidBenchmark: shape '[32, 1050, 12, 255]' is invalid for input of size 103219200

When you misspecify the model name, you don't get an error

When you misspecify the model name, you don't get an error, but it simply runs. This is naturally a user error but it would be nice if it stated that it couldn't load the model.

$ scandeval -m this_is_not_a_HF_model
2022-06-06 10:35:07,485 [INFO] <scandeval.benchmark>
↳ Updating the list of benchmark datasets
Downloading builder script: 6.33kB [00:00, 3.00MB/s]                                                                                                                                                                                                                                                                  
Downloading builder script: 5.27kB [00:00, 4.48MB/s]                                                                                                                                                                                                                                                                  
2022-06-06 10:35:23,881 [INFO] <scandeval.benchmark>
↳ Benchmarking this_is_not_a_HF_model on DaNE:
2022-06-06 10:35:24,890 [INFO] <scandeval.benchmark>
↳ this_is_not_a_HF_model could not be benchmarked on DaNE. Skipping.
[...]

Scandeval Version: 3.0.0

Unit tests

Currently there aren't any tests; these should be implemented using pytest.
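
A minimal pytest sketch to get started (the specific assertions here are only illustrative):

# Sketch: a first unit test for the public Benchmarker class.
from scandeval import Benchmarker

def test_benchmarker_instantiation():
    """The Benchmarker class should be instantiable with default settings."""
    benchmark = Benchmarker()
    assert benchmark is not None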

Add support for NLG evaluation

The current language model benchmark does not include any generative tasks, such as abstractive summarisation and translation. We can't simply add these to the existing LM benchmark, as the encoder-only models won't be able to do these tasks at all, so they could be included in a separate benchmark instead.

Add inference speed evaluation

It would be very informative to have an evaluation of inference speed for the models, as there's no clear-cut correspondence between the number of model parameters and inference speed. For instance, the DeBERTaV3 models are slower than other models with the same number of parameters, as the disentangled attention mechanism slows down inference.

A challenge here is coming up with a measurement that is consistent. This might not be possible, in which case we could simply implement a speed evaluation and require that the speed benchmarks put on the leaderboard come from the same hardware.
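
A minimal sketch of a repeatable latency measurement (not ScandEval's actual speed benchmark; the model ID is a placeholder and the results are only comparable across models on the same hardware):

# Sketch: time several forward passes over a fixed dummy batch.
import statistics
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<model-id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

inputs = tokenizer(["Dette er en test."] * 32, return_tensors="pt", padding=True)

timings = []
with torch.no_grad():
    for _ in range(10):
        start = time.perf_counter()
        model(**inputs)
        timings.append(time.perf_counter() - start)

print(f"Mean batch latency: {statistics.mean(timings):.3f}s "
      f"(std {statistics.stdev(timings):.3f}s)")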

Dependency parsing is not being trained optimally

The dependency parsing scores are substantially lower than the ones achieved by e.g. spaCy's training procedure. See if such a procedure can be implemented in line with the basic training script in base.py.
