scandeval / scandeval

Evaluation of language models on mono- or multilingual tasks.

Home Page: https://scandeval.com

License: MIT License

Python 99.25% Makefile 0.75%
danish dutch english evaluation faroese german icelandic llm nlp norwegian swedish

scandeval's Introduction

Evaluation of pretrained language models on mono- or multilingual language tasks.



Installation

To install the package, simply run the following command in your favorite terminal:

$ pip install scandeval[all]

This will install the ScandEval package with all extras. You can also install the minimal version by leaving out the [all], in which case the package will let you know when an evaluation requires a certain extra dependency, and how to install it.
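
For example, the minimal installation is just:

$ pip install scandeval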

Quickstart

Benchmarking from the Command Line

The easiest way to benchmark pretrained models is via the command line interface. After having installed the package, you can benchmark your favorite model like so:

$ scandeval --model <model-id>

Here <model-id> is the HuggingFace model ID, which can be found on the HuggingFace Hub. By default this will benchmark the model on all available tasks. If you want to benchmark on a particular task, then use the --task argument:

$ scandeval --model <model-id> --task sentiment-classification

We can also narrow down which languages we would like to benchmark on, by setting the --language argument. Here we benchmark the model on the Danish sentiment classification task:

$ scandeval --model <model-id> --task sentiment-classification --language da

Multiple models, datasets and/or languages can be specified by just attaching multiple arguments. Here is an example with two models:

$ scandeval --model <model-id1> --model <model-id2>
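
Similarly, multiple languages can be specified by repeating the --language argument; for instance (assuming 'sv' is the Swedish language code used by ScandEval):

$ scandeval --model <model-id> --language da --language sv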

The specific model version/revision to use can also be specified by appending '@' followed by the revision to the model ID:

$ scandeval --model <model-id>@<commit>

This can be a branch name, a tag name, or a commit ID, and defaults to 'main', i.e. the latest version.

See all the arguments and options available for the scandeval command by typing

$ scandeval --help

Benchmarking from a Script

In a script, the syntax is similar to the command line interface. You simply initialise an object of the Benchmarker class, and call this benchmark object with your favorite model:

>>> from scandeval import Benchmarker
>>> benchmark = Benchmarker()
>>> benchmark(model="<model>")

To benchmark on a specific task and/or language, you simply specify the task or language arguments, shown here with the same example as above:

>>> benchmark(model="<model>", task="sentiment-classification", language="da")

If you want to benchmark a subset of all the models on the Hugging Face Hub, you can simply leave out the model argument. In this example, we're benchmarking all Danish models on the Danish sentiment classification task:

>>> benchmark(task="sentiment-classification", language="da")

Benchmarking from Docker

A Dockerfile is provided in the repo, which can be downloaded and run without needing to clone the repo and install from source. It can be fetched programmatically by running the following:

$ wget https://raw.githubusercontent.com/ScandEval/ScandEval/main/Dockerfile.cuda

Next, to be able to build the Docker image, first ensure that the NVIDIA Container Toolkit is installed and configured. Ensure that the CUDA version stated at the top of the Dockerfile matches the CUDA version installed (which you can check using nvidia-smi). After that, we build the image as follows:

$ docker build --pull -t scandeval -f Dockerfile.cuda .

With the Docker image built, we can now evaluate any model as follows:

$ docker run -e args="<scandeval-arguments>" --gpus 1 --name scandeval --rm scandeval

Here <scandeval-arguments> consists of the arguments passed to the scandeval CLI. This could for instance be --model <model-id> --task sentiment-classification.
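
For example, to run the Danish sentiment classification benchmark from earlier inside the container:

$ docker run -e args="--model <model-id> --task sentiment-classification --language da" --gpus 1 --name scandeval --rm scandeval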

Special Thanks 🙏

  • Thanks to UWV and KU Leuven for sponsoring the Azure OpenAI credits used to evaluate GPT-4-turbo in Dutch.
  • Thanks to Miðeind for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
  • Thanks to CHC for sponsoring the OpenAI credits used to evaluate GPT-4-turbo in German.

Citing ScandEval

If you want to cite the framework then feel free to use this:

@inproceedings{nielsen2023scandeval,
  author = {Nielsen, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
  title = {{ScandEval: A Benchmark for Scandinavian Natural Language Processing}},
  year = {2023}
}

Remarks

The image used in the logo has been created by the amazing Scandinavia and the World team. Go check them out!

scandeval's People

Contributors

ajders, kennethenevoldsen, peregilk, peter-sk, saattrupdan, thomaskluiters, versae


scandeval's Issues

A change in batch size should trigger a change in gradient accumulation

Since different batch sizes tend to make the models perform differently when finetuning, a change in batch size should be compensated for by changing the gradient accumulation accordingly, keeping the effective batch size constant.

Perhaps set a pre-specified list of possible batch sizes, like (1,2,4,8,16,32), and apply corresponding gradient accumulation (32,16,8,4,2,1).
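
A minimal sketch of this idea (not the actual ScandEval implementation): keep the effective batch size constant at 32 by pairing each allowed batch size with a corresponding number of gradient accumulation steps.

# Sketch: pair batch sizes with gradient accumulation steps so that
# batch_size * gradient_accumulation == 32 for every allowed batch size.
BATCH_SIZES = (1, 2, 4, 8, 16, 32)
TARGET_EFFECTIVE_BATCH_SIZE = 32

def gradient_accumulation_for(batch_size: int) -> int:
    """Return the gradient accumulation steps paired with `batch_size`."""
    if batch_size not in BATCH_SIZES:
        raise ValueError(f"Unsupported batch size: {batch_size}")
    return TARGET_EFFECTIVE_BATCH_SIZE // batch_size

assert gradient_accumulation_for(8) == 4  # 8 * 4 == 32 examples per update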

Add mT5

Currently, there is no evaluation of the mT5 models.

Flag --model_id does not work, but -m does

$ scandeval --model_id KennethEnevoldsen/dfm-bert-base  --language da
Usage: scandeval [OPTIONS]
Try 'scandeval --help' for help.

Error: No such option: --model_id

Following the suggestion we get:


  Benchmark language models on Scandinavian language tasks.

Options:
  -m TEXT                         The HuggingFace model ID of the model(s) to
                                  be benchmarked. If not specified then all
[...]

This indicates that while -m is an option, the --model_id flag is not.

Environment

scandeval v. 3.0.0, installed using pip install scandeval[all]

Set up extras in setup

It should be possible to use the package without installing every framework (tensorflow, pytorch, flax and spacy currently). Instead, set up these as extras (and set up an all extra, including all of them) and raise an Exception if a benchmark is attempted when the framework is not installed.
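
A minimal sketch of how such a check could look (the helper and extra names here are hypothetical, not ScandEval's actual implementation):

# Sketch: raise a helpful error when a framework needed for a benchmark
# is not installed, pointing the user to the matching extra.
import importlib.util

def require_framework(framework: str, extra: str) -> None:
    """Raise an informative error if `framework` is not importable."""
    if importlib.util.find_spec(framework) is None:
        raise ModuleNotFoundError(
            f"This benchmark requires {framework!r}, which is not installed. "
            f"Install it with `pip install scandeval[{extra}]`."
        )

require_framework("spacy", extra="spacy")  # hypothetical extra name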

Add Swedish benchmark datasets

The superlim datasets could be used: https://spraakbanken.gu.se/en/resources/superlim.

Further, the UniversalDependencies repo has dependency parsing for Swedish.

Potentially:

Dataset wrapper

Aside from benchmarks, a slim version of the package should be available, which just loads in the datasets without benchmarking.

Benchmarking is not shown properly in notebooks

When using the Benchmark class to benchmark models in a notebook, a lot of output appears:

  • The config of the model
  • The architecture of the model
  • Caching logs
  • A block of text starting with ***** Running training *****
  • A message after each finetuning: Training completed. Do not forget to share your model on huggingface.co/models =)
  • A block of text starting with ***** Running evaluation *****

All of these should be removed. It seems like set_verbosity_error from the transformers library does not work in notebooks.
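
For reference, one way to silence this logging (whether it is sufficient inside notebooks is exactly what this issue questions) is a sketch along these lines, assuming recent versions of transformers and datasets:

# Sketch: lower the logging verbosity of transformers and datasets.
from transformers import logging as hf_logging
from datasets.utils import logging as ds_logging

hf_logging.set_verbosity_error()
ds_logging.set_verbosity_error()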

Evaluating private HuggingFace models

It would be great if ScandEval could also be used for evaluating private HuggingFace models. This could probably be done by passing along the flag "use_auth_token=True" to the .from_pretrained method. This allows the user to log in with huggingface-cli, which stores the auth token locally. It can also be passed on directly, but personally I find this less convenient.
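
A minimal sketch of the suggestion (model names are placeholders; this is not ScandEval's actual loading code):

# Sketch: pass the locally stored Hugging Face token (created via
# `huggingface-cli login`) along when loading a private model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<private-model-id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
model = AutoModelForSequenceClassification.from_pretrained(model_id, use_auth_token=True)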

Invalid model name does not raise an error or a warning

When running the following line of code:

CUDA_VISIBLE_DEVICES=2 scandeval --model-id nota/model --language da

I would expect the model to throw an error like "could not find model in hub", but instead it simply skips all the benchmarks as seen here:

2023-07-09 09:02:42,441 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the truncated version of AngryTweets
2023-07-09 09:02:44,589 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the truncated version of AngryTweets. Skipping.
2023-07-09 09:02:44,589 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the truncated version of DaNE
2023-07-09 09:02:46,616 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the truncated version of DaNE. Skipping.
2023-07-09 09:02:46,616 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the Danish part of ScaLA
2023-07-09 09:02:48,447 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the Danish part of ScaLA. Skipping.
2023-07-09 09:02:48,447 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the Danish part of the truncated version of ScandiQA
2023-07-09 09:02:50,898 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the Danish part of the truncated version of ScandiQA. Skipping.
2023-07-09 09:02:50,898 [INFO] <scandeval.benchmarker>
↳ Benchmarking nota/model on the speed estimation benchmark
2023-07-09 09:02:51,259 [INFO] <scandeval.benchmarker>
↳ nota/model could not be benchmarked on the speed estimation benchmark. Skipping.
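
A minimal sketch of how the model ID could be validated up front, assuming a recent version of huggingface_hub (this is not ScandEval's actual code):

# Sketch: fail fast with a clear error if the model ID does not exist on
# the Hugging Face Hub, instead of silently skipping every benchmark.
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

def check_model_exists(model_id: str) -> None:
    """Raise a ValueError if `model_id` cannot be found on the Hub."""
    try:
        HfApi().model_info(model_id)
    except RepositoryNotFoundError:
        raise ValueError(
            f"The model {model_id!r} was not found on the Hugging Face Hub."
        )

check_model_exists("nota/model")  # raises ValueError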

Add support for zero and few-shot evaluation

To evaluate large language models with ScandEval there should be a possibility to benchmark the zero-shot performance and few-shot performance of tasks.

Maybe these should be included in the other benchmarks, to allow comparison with smaller finetuned models, or maybe they should have their own benchmark. If they are included in the other benchmarks, then we need a clear way to see whether a given performance arises from zero-shot, few-shot or finetuned evaluation. This could be indicated simply using asterisks.
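
A minimal sketch of how a few-shot prompt could be built for, e.g., sentiment classification; the prompt format and label names here are hypothetical, not ScandEval's actual prompts:

# Sketch: build a few-shot classification prompt from (text, label) examples.
def build_few_shot_prompt(examples: list[tuple[str, str]], document: str) -> str:
    """Concatenate labelled examples followed by the unlabelled document."""
    blocks = [f"Text: {text}\nSentiment: {label}" for text, label in examples]
    blocks.append(f"Text: {document}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(
    examples=[("Sikke en dejlig dag!", "positive"), ("Det var forfærdeligt.", "negative")],
    document="Maden var helt okay.",
)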

Language flag does not work as intended

When setting the language flag to da, the model still seems to be training on non-Danish datasets:

$ scandeval -m KennethEnevoldsen/dfm-bert-base  --language da
[...] 
↳ Finished finetuning and evaluation of KennethEnevoldsen/dfm-bert-base on DKHate.                                                               
[...]
↳ Benchmarking KennethEnevoldsen/dfm-bert-base on LCC:
[...]
↳ Benchmarking KennethEnevoldsen/dfm-bert-base on NoReC:
[...]

Scandeval version: 3.0.0

Write proper README

Should include:

  1. Installation
  2. Quick start, both to script usage and CLI usage
  3. Bibtex reference

Add support for SLU evaluation

As an analogue to the NLU benchmark, we should add an SLU benchmark.

An obvious task these could solve is automatic speech recognition, but there might be other desirable tasks to include as well.

This issue includes both the ability to benchmark acoustic models as well as setting up a leaderboard for them.

Add multi-GPU support for finetuning

Currently, when using multiple GPUs, the progress bars are not moving and it doesn't seem like anything is happening, despite the fact that the GPUs are in use. If they're indeed benchmarking the models then this progress should be displayed.

Cache interference issues when testing two models at once

It seems like the cache system interferes with other running sessions of scandeval, as it appears to delete the checkpoints of other runs.
This does not allow for a workflow like:

CUDA_VISIBLE_DEVICES=1 scandeval --model-id MODEL1 --language da --raise-errors
# on a separate instance:
CUDA_VISIBLE_DEVICES=2 scandeval --model-id MODEL2 --language da --raise-errors

Naturally I could use multiple --model-id flags, but I wanted them to run each on their own GPUs.

A simple solution is naturally to run them in different folders and thus create two scandeval caches (which is what I am doing now).
However, I would suggest nesting the cache folder such that the structure is:

.scandeval_cache/model_id/checkpoint-270

as opposed to:

.scandeval_cache/checkpoint-270

I might have missed other important considerations, though.
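
A minimal sketch of the suggested per-model cache layout (not ScandEval's actual code; the slash replacement is an extra precaution for org/model IDs):

# Sketch: nest checkpoints under a per-model directory so that concurrent
# runs do not delete each other's checkpoints.
from pathlib import Path

def checkpoint_dir(cache_dir: str, model_id: str, step: int) -> Path:
    """E.g. .scandeval_cache/MODEL1/checkpoint-270 rather than .scandeval_cache/checkpoint-270."""
    safe_model_id = model_id.replace("/", "--")  # avoid nested dirs for org/model IDs
    return Path(cache_dir) / safe_model_id / f"checkpoint-{step}"

print(checkpoint_dir(".scandeval_cache", "MODEL1", 270))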

Add support for seq2seq models

See if it's possible to rephrase the classification tasks as text generation tasks, enabling seq2seq models (such as mT5) to be benchmarked as well.
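
A minimal sketch of what such a rephrasing could look like; the exact prompt and label wording here is hypothetical:

# Sketch: rephrase a sentiment classification example as a text-to-text
# task, so a seq2seq model such as mT5 can generate the label as text.
def to_text2text(text: str, label: str) -> dict[str, str]:
    """Convert a (text, label) classification example to an input/target pair."""
    return {
        "input_text": f"Classify the sentiment of the following text: {text}",
        "target_text": label,  # e.g. "positive", "neutral" or "negative"
    }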

Handle Memory issues

When finetuning a large model, it might be that the standard batch size of 16 is too large. The individual benchmark classes could simply be instantiated with a smaller batch size, but the Benchmark class doesn't expose this option.

One way to deal with this could be to catch the error and cut the batch size in half for that particular evaluation, and continue halving until it works. Another way could be to include batch size as an argument in the Benchmark class. Needs to be in the CLI as well in that case.
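
A minimal sketch of the first approach (not ScandEval's actual error handling; `train_fn` is a hypothetical training callable):

# Sketch: halve the batch size and double gradient accumulation on CUDA
# out-of-memory errors, keeping the effective batch size constant.
import torch

def finetune_with_fallback(train_fn, batch_size: int = 16, grad_accum: int = 2):
    """Call `train_fn` with progressively smaller batch sizes until it fits in memory."""
    while batch_size >= 1:
        try:
            return train_fn(batch_size=batch_size, gradient_accumulation_steps=grad_accum)
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise
            torch.cuda.empty_cache()
            batch_size //= 2
            grad_accum *= 2
    raise RuntimeError("Could not find a batch size that fits in GPU memory.")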

HuggingFace Hub web scraper

We need a web scraper which compiles a list of all the currently available models in the Scandinavian languages.

This needs to fetch the model ID (for instance, the model ID for this model is flax-community/roberta-base-danish), as well as the metadata in the "tags" below the headline. Notably, we need to know which framework the model has been built in, so we can load it correctly.

The idea is then that this script should be run automatically every x days/hours/minutes, compiling a list of current models, which could be stored as a jsonl file in the repository.
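
A minimal sketch of such a scraper, assuming a recent version of huggingface_hub (the exact attribute names on the returned model objects vary between versions):

# Sketch: list Hub models tagged with a Scandinavian language and store
# their IDs, tags and library as JSONL.
import json

from huggingface_hub import HfApi

api = HfApi()
with open("scandinavian_models.jsonl", "w") as f:
    for language in ["da", "sv", "no", "nb", "nn", "is", "fo"]:
        for model in api.list_models(language=language):
            record = {
                "model_id": model.id,           # e.g. "flax-community/roberta-base-danish"
                "tags": model.tags,             # metadata tags from the Hub
                "library": model.library_name,  # e.g. "transformers" or "spacy"
            }
            f.write(json.dumps(record) + "\n")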

gpt-neox models cannot be benchmarked

I have trouble benchmarking GPT-NeoX models:
scandeval --model-id Isotonic/gpt_neox_225M --raise-errors
2023-04-12 12:13:42,013 [INFO] <scandeval.benchmarker>
↳ Benchmarking Isotonic/gpt_neox_225M on the truncated version of SweReC
Downloading (…)lve/main/config.json: 100% 677/677 [00:00<00:00, 79.3kB/s]
Downloading pytorch_model.bin: 100% 867M/867M [02:01<00:00, 7.11MB/s]
Downloading (…)okenizer_config.json: 100% 261/261 [00:00<00:00, 222kB/s]
Downloading (…)olve/main/vocab.json: 100% 799k/799k [00:00<00:00, 1.12MB/s]
Downloading (…)olve/main/merges.txt: 100% 457k/457k [00:00<00:00, 1.41MB/s]
Downloading (…)/main/tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 2.56MB/s]
Downloading (…)cial_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 69.7kB/s]
2023-04-12 12:15:54,792 [INFO] <scandeval.benchmark_dataset>
↳ The model has 152,445,952 parameters, a vocabulary size of 50,432 and a maximum sequence length of 2,048.
Benchmarking: 0%| | 0/10 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/scandeval", line 8, in <module>
    sys.exit(benchmark())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/scandeval/cli.py", line 209, in benchmark
    benchmarker(model_id=model_ids, dataset=datasets)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 382, in __call__
    return self.benchmark(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 189, in benchmark
    record = self._benchmark_single(
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 376, in _benchmark_single
    raise e
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmarker.py", line 333, in _benchmark_single
    results, metadata_dict = dataset(model_id)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmark_dataset.py", line 619, in __call__
    return self.benchmark(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/scandeval/benchmark_dataset.py", line 320, in benchmark
    bs, ga = handle_error(
  File "/usr/local/lib/python3.9/site-packages/scandeval/utils.py", line 212, in handle_error
    raise InvalidBenchmark(str(e))
scandeval.exceptions.InvalidBenchmark: shape '[32, 1050, 12, 255]' is invalid for input of size 103219200

When you misspecify the model name, you don't get an error

When you misspecify the model name, you don't get an error, but it simply runs. This is naturally a user error but it would be nice if it stated that it couldn't load the model.

$ scandeval -m this_is_not_a_HF_model
2022-06-06 10:35:07,485 [INFO] <scandeval.benchmark>
↳ Updating the list of benchmark datasets
Downloading builder script: 6.33kB [00:00, 3.00MB/s]                                                                                                                                                                                                                                                                  
Downloading builder script: 5.27kB [00:00, 4.48MB/s]                                                                                                                                                                                                                                                                  
2022-06-06 10:35:23,881 [INFO] <scandeval.benchmark>
↳ Benchmarking this_is_not_a_HF_model on DaNE:
2022-06-06 10:35:24,890 [INFO] <scandeval.benchmark>
↳ this_is_not_a_HF_model could not be benchmarked on DaNE. Skipping.
[...]

Scandeval Version: 3.0.0

Unit tests

Currently there aren't any tests; these should be implemented using pytest.
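
A minimal pytest sketch to get started (the specific assertions here are only illustrative):

# Sketch: a first unit test for the public Benchmarker class.
from scandeval import Benchmarker

def test_benchmarker_instantiation():
    """The Benchmarker class should be instantiable with default settings."""
    benchmark = Benchmarker()
    assert benchmark is not None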

Add support for NLG evaluation

The current language model benchmark does not include any generative tasks, such as abstractive summarisation and translation. We can't simply add these to the existing LM benchmark, as the encoder-only models won't be able to do these tasks at all, so they could be included in a separate benchmark instead.

Add inference speed evaluation

It would be very informative to have an evaluation of inference speed for the models, as there's no clear-cut correspondence between the number of model parameters and inference speed. For instance, the DeBERTaV3 models are slower than other models with the same number of parameters, as the disentangled attention mechanism slows down inference.

A challenge here is coming up with a measurement that is consistent. This might not be possible, in which case we could simply implement a speed evaluation and require that the speed benchmarks put on the leaderboard come from the same hardware.
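
A minimal sketch of a repeatable latency measurement (not ScandEval's actual speed benchmark; the model ID is a placeholder and the results are only comparable across models on the same hardware):

# Sketch: time several forward passes over a fixed dummy batch.
import statistics
import time

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "<model-id>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

inputs = tokenizer(["Dette er en test."] * 32, return_tensors="pt", padding=True)

timings = []
with torch.no_grad():
    for _ in range(10):
        start = time.perf_counter()
        model(**inputs)
        timings.append(time.perf_counter() - start)

print(f"Mean batch latency: {statistics.mean(timings):.3f}s "
      f"(std {statistics.stdev(timings):.3f}s)")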

Dependency parsing is not being trained optimally

The dependency parsing scores are substantially lower than the ones achieved by e.g. spaCy's training procedure. See if such a procedure can be implemented in line with the basic training script in base.py.
