
benchmark-suites's Introduction

OpenML Benchmark Suites

Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. Therefore, we advocate the use of curated, comprehensive suites of machine learning datasets to standardize the setup, execution, and reporting of benchmarks. We enable this through platform-independent software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated into the OpenML platform, and accessible through interfaces in Python, Java, and R.

OpenML benchmarking suites are:

  • easy to use through standardized data formats, APIs, and client libraries (see the short example below)
  • machine-readable, with extensive meta-information on the included datasets
  • easy to share and reuse in future studies
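
For example, a whole suite can be retrieved and benchmarked with a few lines of the Python client (a minimal sketch assuming a recent openml-python and scikit-learn; the R and Java clients offer equivalent functionality):

```python
# Minimal sketch (assumes a recent openml-python client, scikit-learn, and a
# configured OpenML API key for publishing); exact call names may differ
# between client versions.
import openml
from sklearn.ensemble import RandomForestClassifier

suite = openml.study.get_suite("OpenML-CC18")        # fetch the benchmark suite
clf = RandomForestClassifier(n_estimators=100, random_state=1)

for task_id in suite.tasks[:3]:                      # first few tasks, for illustration
    task = openml.tasks.get_task(task_id)            # task = dataset + splits + evaluation setup
    run = openml.runs.run_model_on_task(clf, task)   # cross-validate on the official splits
    print(run)                                       # run.publish() would share the results on OpenML
```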

Documentation

Detailed documentation on how to create and use OpenML benchmark suites is available. It also includes a list of current benchmark suites, such as the OpenML-CC18.

Notebooks

We provide a set of notebooks to explore existing benchmark suites, and create your own:

  • Automated benchmark suite generator: Allows you to specify a list of constraints and additional tests, and retrieve all datasets that adhere to them
  • CC18 score overview: Overview of shared results on the CC18 benchmark suite
  • CC18 benchmark analysis: A deeper analysis of existing results in R (note: this was done for an older benchmark set)
  • Mini-Benchmark of R algorithms on the CC18: http://rpubs.com/giuseppec/OpenML100
  • Mini-Benchmark of WEKA algorithms on the CC18
  • Tutorials for OpenML in R and Python

benchmark-suites's People

Contributors

berndbischl, giuseppec, janvanrijn, joaquinvanschoren, mfeurer


benchmark-suites's Issues

Missing Descriptions

The following datasets have missing descriptions:

  • 23517 - numerai28.6
  • 40670 - dna
  • 40705 - tokyo1

Discussion items about the paper

The same holds for other repositories, such as LIBSVM (Chang and Lin, 2011)

If we can only give a single other repository as an example, maybe we shouldn't say 'other repositories'?

However, none of the above tools allows users to add new datasets or to easily share and compare benchmarking results online.

PMLB does allow pull requests to the GitHub repository. Not sure how to phrase this in the paper, though.

  • Also, I just realized that in the pseudocode the API key is located in a rather odd place. I'll change this if you agree.
  • Maybe we can reduce the space between the references instead of making formatting changes to the main paper, what do you think?

Artificial vs Simulated datasets

Currently, we have some simulated datasets in our list of datasets, but we also removed several simulated datasets as being "artificial". However, it is unclear where to draw the line, and based on what criteria we would include a dataset as simulated or exclude it as artificial.

Examples of simulated datasets in our list:

  • MagicTelescope
  • higgs

Examples of artificial datasets in our list:

  • waveform-5000

define rules to filter out data which is too simple

Current suggestion:

  • run 1-NN, NB, and rpart, cross-validated on the tasks exactly as OpenML defines them
  • impute missing values: numeric --> median, categorical --> new level
  • measure the balanced error rate (BER)

If BER >= 0.99 for any classifier --> remove the dataset from the suite. A rough sketch of this filter is given below.
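
A rough sketch of the proposed filter (scikit-learn stand-ins: KNeighborsClassifier with k=1 for 1-NN, GaussianNB for NB, DecisionTreeClassifier for rpart; plain 10-fold CV replaces the official OpenML splits, and the BER threshold is applied exactly as stated above):

```python
# Rough sketch of the proposed "too simple" filter, not the actual generator
# code. scikit-learn stand-ins: KNeighborsClassifier(n_neighbors=1) for 1-NN,
# GaussianNB for NB, DecisionTreeClassifier for rpart; plain 10-fold CV
# replaces the official OpenML splits for brevity.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


def keep_dataset(X, y, numeric_cols, categorical_cols, threshold=0.99):
    """Return False if any baseline reaches BER >= threshold (rule as stated above)."""
    preprocess = ColumnTransformer(
        [
            # numeric --> median imputation
            ("num", SimpleImputer(strategy="median"), numeric_cols),
            # categorical --> impute as a new level, then one-hot encode
            ("cat", make_pipeline(
                SimpleImputer(strategy="constant", fill_value="missing"),
                OneHotEncoder(handle_unknown="ignore")), categorical_cols),
        ],
        sparse_threshold=0.0,  # keep the output dense so GaussianNB accepts it
    )
    for clf in (KNeighborsClassifier(n_neighbors=1),
                GaussianNB(),
                DecisionTreeClassifier(random_state=0)):
        scores = cross_val_score(make_pipeline(preprocess, clf), X, y,
                                 cv=10, scoring="balanced_accuracy")
        ber = 1.0 - scores.mean()  # balanced error rate
        if ber >= threshold:
            return False
    return True
```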

Fix speeddating

As raised by @mfeurer and discussed in the Skype call this morning, the speed dating dataset should be fixed.

assigned myself for obvious reasons.

Check back with uploader

  • steel-plates-fault (1) -> Rafael G. Mantovani: where is the description from, especially the statement that "The latter is commonly used as a binary classification target ('common' or 'other' fault)"?
  • Bioresponse -> Boehringer Ingelheim, because it is a Kaggle dataset

Add links

Go through all 100 datasets and make sure that the link to the original data source is present. Do this only after the faulty datasets have been replaced by new versions.

Update Notebook

The notebook doesn't seem to be up to date with the latest version of the paper.

Fix datasets

  • Australian (3)
  • cmc (1)
  • ada_agnostic (1)
  • climate-model-simulation-crashes (1)
  • car (1)
  • segment (1)
  • sylva_agnostic (1)
  • SpeedDating (1)
  • Internet-Advertisements (1)
  • MiceProtein (1)
  • mfeat-pixel (1)

Benchmark results and Overfitting

Maybe I'm thinking too far ahead, but there are a few obvious criticisms that we may get from reviewers related to how this benchmark is going to be used (assuming we want this to be the reference benchmark for the field).

Two important issues here that are typical of benchmarking studies:

  • Cheating: people can look at the training sets and just publish the correct predictions. This requires some hacking of the OpenML APIs, but it's not impossible. How about an algorithm that queries OpenML for the best flow for each task and just runs that? I guess we need a more explicit and visible way to report cheating/issues on the result page, and a switch to only show results without issues, if only to discourage people from doing this. Do we want people to give their real names when they submit results?

Maybe we can - to some extent - run the flows on the server and try to reproduce the results, and then add a special label to those runs.

  • Overfitting: on a single task, it is quite easy to submit many results until, by chance, they overfit on the entire 10-fold CV. For the benchmark, the results are aggregated over multiple tasks, so overfitting is less likely, but not impossible.

Some ways to alleviate this problem:

  • Use 10x10fold CV instead of normal 10fold CV tasks, maybe as an OpenML-CC18x10 benchmark that people can choose to use instead (assuming that the results here are more authoritative). Maybe even an OpenML-CC18x100.
  • Have an 'evolving' benchmark: as new datasets are added in CC19, CC20 etc., we can show how results/rankings change over time. Overfitted flows on CC18 will likely perform worse in CC19 etc.
  • Other ideas? Differential privacy?

In addition, we should also show an aggregated view of the scores on the individual tasks (e.g. violin plots?) and run statistical tests. We could do the typical Friedman-Nemenyi test, but I'm not sure how well that works on 'only' 80-something datasets.
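
For reference, a minimal sketch of such a test over a per-task score matrix (scipy provides the Friedman test; the Nemenyi post-hoc shown here assumes the third-party scikit-posthocs package):

```python
# Sketch: Friedman test across tasks, plus a Nemenyi post-hoc.
# Assumes a pandas DataFrame `scores` with one row per task and one column per
# flow, and the third-party scikit-posthocs package for the post-hoc test.
import pandas as pd
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare


def compare_flows(scores: pd.DataFrame) -> pd.DataFrame:
    # Friedman test: do the flows' per-task scores come from the same distribution?
    stat, p_value = friedmanchisquare(*[scores[col] for col in scores.columns])
    print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

    # Nemenyi post-hoc: pairwise comparisons (rows = tasks/blocks, columns = flows).
    return sp.posthoc_nemenyi_friedman(scores)
```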

We could of course wave our hands and say 'yes, but we are only solving the problem of non-standardized benchmark tests and these issues apply to any benchmarking study' but in a way these issues are connected...

More datasets

Try to get more datasets! (It would be good to keep the 'OML100' principle.) For example, @joaquinvanschoren mentioned that he has several datasets that comply with the requirements. Related to #13

Document the used tags

Although these tags are not 'protected' (none of the tags are), it would be good to have documentation of some agreed-upon tags, such as 'artificial', 'label_leakage', etc. (as suggested by @berndbischl).

data_status only keeps the last applicable tested status of each dataset

The way the filters are set up, only the last known 'reason for exclusion' is kept for each dataset. To me it makes sense to instead keep track of the full set of test results (roughly as sketched below). That way it becomes easier to identify which constraints to relax if you would like a larger study.

I'll go ahead and implement it myself either way, so I will just add a PR when it's done. I opened the issue to see if this was considered, if there are good reasons not to do this and/or if there are any additional related features that make sense to add.
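
A rough illustration of the idea (hypothetical names, not the actual generator code): keep every failed check per dataset instead of overwriting a single status field.

```python
# Hypothetical sketch: record all failed checks per dataset instead of only the
# last one, so it is easy to see which constraints to relax for a larger study.
from collections import defaultdict

data_status = defaultdict(set)          # dataset ID -> set of failed checks


def record_failure(data_id: int, check_name: str) -> None:
    data_status[data_id].add(check_name)


def excluded_only_by(check_name: str) -> list:
    """Datasets that would re-enter the suite if this one constraint were relaxed."""
    return [did for did, checks in data_status.items() if checks == {check_name}]
```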

segment dataset

This new version was created by Jann:
https://www.openml.org/d/40984

The difference is that the position of the 3x3 pixel sample in the image is removed.
Are we sure that this is correct? If I want to classify 'sky', is it not useful to know the position in the image?

I'm leaving both versions as active for now.

New dataset list

Credits to @ArlindKadra, who compiled the latest list of dataset IDs as uploaded by Jann.

1489, 15, 40981, 1462, 1471, 151, 469, 23512, 1464, 1480, 40982, 182, 18, 11, 29, 23, 37, 40983, 1120, 307, 1050, 1590, 1049, 40993, 4538, 4534, 1461, 1466, 40989, 1475, 1497, 23381, 38, 60, 1510, 40975, 50, 22, 40984, 40668, 1063, 1053, 1068, 1067, 1494, 188, 31, 32, 54, 6, 28, 14, 16, 3, 1487, 40992, 1486, 44, 46, 24, 6332, 40536, 40499, 12, 40971, 554, 1038, 4134, 40978, 1501, 42, 1485, 1478, 300, 1515, 1468, 40966, 1491, 1492, 1493, 40979

I have a Java program to automatically grab or generate the associated tasks, and can easily create a new study when necessary.
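
The same lookup can also be done from Python (a rough sketch assuming the openml-python client; column names such as 'did' and 'tid' are those currently returned by list_tasks and may differ between client versions):

```python
# Sketch: find existing supervised classification tasks for a list of dataset
# IDs with openml-python. This downloads the full task listing, which can take
# a moment; column names may differ between client versions.
import openml

dataset_ids = [1489, 15, 40981, 1462]   # first few IDs from the list above

tasks = openml.tasks.list_tasks(
    task_type=openml.tasks.TaskType.SUPERVISED_CLASSIFICATION,
    output_format="dataframe",
)
selected = tasks[tasks["did"].isin(dataset_ids)]
print(selected[["tid", "did", "name"]])
```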

Question: higgs sampling

As raised by @mfeurer and followed up on in the call this morning:

The higgs dataset is a subset of the full UCI Higgs dataset. The crucial question is: is this subset randomly sampled, or is it the same subset used across the literature? I added @joaquinvanschoren as he is the official uploader and might remember how he acquired it :)

paper on arxiv must be updated or changed

That is a sensitive issue, as people have already:

  • read it
  • cited it
  • used the data with the tag oml100

My suggestion:

Update the first page of the arXiv paper, tell people that it is "outdated" and was somewhat of a "preliminary" attempt, and link to our new paper.

Dataset Ipums - 378

Tagged as 'unspecified_target', but this is not the case.

However, it seems to be a subsample of a bigger dataset.

Update Wiki

  • vowel (2)
  • adult (2)
  • ada_agnostic (2)
  • eucalyptus (1)
  • tamilnadu_electricity (1)
  • semeion (1)

Decide about data stream data

At the moment we have several data stream datasets in our list (I think four). We need to decide whether to keep them and argue that, because of their featurization, they are valid to use as regular classification datasets.

Make it easier for users to create benchmark suites

  1. We need to describe how to create a suite better. Currently, we have this: https://www.openml.org/guide/benchmark . Maybe we can add something like the following (still improvable):

a) To create a benchmark suite, we need to use tasks (not datasets). That is, if there is no task for the corresponding dataset, you first have to create a task for it (see https://www.openml.org/new/task , which is currently only possible through the web interface).
b) You have to create a study at https://www.openml.org/new/study (I think this is currently also only possible through the web interface) and remember the study ID after creating it; you will need the ID for step c). If you set an alias string when creating the study, the alias can also be used to retrieve the benchmark suite (alternatively the study ID can be used, see step d).
c) You should add a tag called "study_X", where X is your study ID, to the tasks (and datasets); this should be possible through the clients (e.g. R) or through the web interface.
d) Now you have your benchmark suite. In R, you can get the information using getOMLStudy(IDofStudy) or getOMLStudy("your-alias-string"); see also the Python sketch after this list. Study information can be found online at https://www.openml.org/s/IDofStudy

  2. We may have to simplify some steps for users. Still, many things are only possible through the web interface. Examples:
    a) We need a better way to create tasks out of datasets; imagine you want to add your own benchmark suite but have to create 100 tasks manually through the web. See openml/OpenML#325
    b) If a task is tagged, the underlying data should also be tagged. Likewise, if a run is tagged, then the underlying task, data, and flow should be tagged with the same tag. See openml/OpenML#530 . If the server does not do this automatically, at least the client should do it.
    c) Maybe we should also accept tagging tasks by alias string in step (c) above.
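
For completeness, a minimal sketch of step d) using the Python client instead of R (assumes a recent openml-python, where benchmark suites, i.e. task-based studies, are fetched via openml.study.get_suite; older client versions expose this differently):

```python
# Sketch of step d) in Python rather than R. Assumes a recent openml-python
# client; "your-alias-string" is the alias chosen when creating the study.
import openml

suite = openml.study.get_suite("your-alias-string")   # or the numeric study ID
print(suite.name)
print(suite.tasks)    # task IDs that make up the suite
print(suite.data)     # corresponding dataset IDs
```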
