
benchmark-suites's Introduction

OpenML Benchmark Suites

Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. Therefore, we advocate the use of curated, comprehensive suites of machine learning datasets to standardize the setup, execution, and reporting of benchmarks. We enable this through platform-independent software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated into the OpenML platform, and accessible through interfaces in Python, Java, and R.

OpenML benchmarking suites are:

  • easy to use through standardized data formats, APIs, and client libraries (see the short example below)
  • machine-readable, with extensive meta-information on the included datasets
  • easy to share and reuse in future studies
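
For example, a whole suite can be retrieved and benchmarked with a few lines of the Python client (a minimal sketch assuming a recent openml-python and scikit-learn; the R and Java clients offer equivalent functionality):

```python
# Minimal sketch (assumes a recent openml-python client, scikit-learn, and a
# configured OpenML API key for publishing); exact call names may differ
# between client versions.
import openml
from sklearn.ensemble import RandomForestClassifier

suite = openml.study.get_suite("OpenML-CC18")        # fetch the benchmark suite
clf = RandomForestClassifier(n_estimators=100, random_state=1)

for task_id in suite.tasks[:3]:                      # first few tasks, for illustration
    task = openml.tasks.get_task(task_id)            # task = dataset + splits + evaluation setup
    run = openml.runs.run_model_on_task(clf, task)   # cross-validate on the official splits
    print(run)                                       # run.publish() would share the results on OpenML
```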

Documentation

Detailed documentation on how to create and use OpenML benchmark suites is available. It also includes a list of current benchmark suites, such as the OpenML-CC18.

Notebooks

We provide a set of notebooks to explore existing benchmark suites, and create your own:

  • Automated benchmark suite generator: Allows you to specify a list of constraints and additional tests, and retrieve all datasets that adhere to them
  • CC18 score overview: Overview of shared results on the CC18 benchmark suite
  • CC18 benchmark analysis: A deeper analysis of existing results in R (note: this was done for an older benchmark set)
  • Mini-Benchmark of R algorithms on the CC18: http://rpubs.com/giuseppec/OpenML100
  • Mini-Benchmark of WEKA algorithms on the CC18
  • Tutorials for OpenML in R and Python

benchmark-suites's People

Contributors

berndbischl, giuseppec, janvanrijn, joaquinvanschoren, mfeurer


benchmark-suites's Issues

Missing Descriptions

The following datasets have missing descriptions:

  • 23517 - numerai28.6
  • 40670 - dna
  • 40705 - tokyo1

Discussion items about the paper

The same holds for other repositories, such as LIBSVM (Chang and Lin, 2011)

If we can only give a single other repository as an example, maybe we shouldn't say 'other repositories'?

However, none of the above tools allows users to add new datasets or to easily share and compare benchmarking results online.

PMLB does allow pull requests to the GitHub repository. Not sure how to phrase this in the paper, though.

  • Also, I just realized that in the pseudocode the API key is located in a rather odd place. I'll change this if you agree.
  • Maybe we can reduce the space between the references instead of making formatting changes to the main paper, what do you think?

Artificial vs Simulated datasets

Currently, we have some simulated datasets in our list of datasets, but we also removed several simulated datasets as being "artificial". However, it is unclear where to draw the line, and based on what criteria we would include a dataset as simulated or exclude it as artificial.

Examples of simulated datasets in our list:

  • MagicTelescope
  • higgs

Examples of artificial datasets in our list:

  • waveform-5000

define rules to filter out data which is too simple

Current suggestion:

  • run 1-NN, NB, and rpart, cross-validated on the tasks exactly as OpenML defines them
  • impute missing values: numeric --> median, categorical --> new level
  • measure the balanced error rate (BER)

If BER >= 0.99 for any classifier --> remove the dataset from the suite. A rough sketch of this filter is given below.
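
A rough sketch of the proposed filter (scikit-learn stand-ins: KNeighborsClassifier with k=1 for 1-NN, GaussianNB for NB, DecisionTreeClassifier for rpart; plain 10-fold CV replaces the official OpenML splits, and the BER threshold is applied exactly as stated above):

```python
# Rough sketch of the proposed "too simple" filter, not the actual generator
# code. scikit-learn stand-ins: KNeighborsClassifier(n_neighbors=1) for 1-NN,
# GaussianNB for NB, DecisionTreeClassifier for rpart; plain 10-fold CV
# replaces the official OpenML splits for brevity.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


def keep_dataset(X, y, numeric_cols, categorical_cols, threshold=0.99):
    """Return False if any baseline reaches BER >= threshold (rule as stated above)."""
    preprocess = ColumnTransformer(
        [
            # numeric --> median imputation
            ("num", SimpleImputer(strategy="median"), numeric_cols),
            # categorical --> impute as a new level, then one-hot encode
            ("cat", make_pipeline(
                SimpleImputer(strategy="constant", fill_value="missing"),
                OneHotEncoder(handle_unknown="ignore")), categorical_cols),
        ],
        sparse_threshold=0.0,  # keep the output dense so GaussianNB accepts it
    )
    for clf in (KNeighborsClassifier(n_neighbors=1),
                GaussianNB(),
                DecisionTreeClassifier(random_state=0)):
        scores = cross_val_score(make_pipeline(preprocess, clf), X, y,
                                 cv=10, scoring="balanced_accuracy")
        ber = 1.0 - scores.mean()  # balanced error rate
        if ber >= threshold:
            return False
    return True
```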

Fix speeddating

As raised by @mfeurer and discussed in the Skype call this morning, the speed dating dataset should be fixed.

assigned myself for obvious reasons.

Check back with uploader

  • steel-plates-fault (1) -> Rafael G. Mantovani: where is the description from, especially the statement that "The latter is commonly used as a binary classification target ('common' or 'other' fault)"?
  • Bioresponse -> Boehringer Ingelheim, because it is a Kaggle dataset

Add links

Go through all 100 datasets and make sure that the link to the original data source is present. Do this only after the faulty datasets have been replaced by new versions.

Update Notebook

The notebook doesn't seem to be up to date with the latest version of the paper.

Fix datasets

  • Australian (3)
  • cmc (1)
  • ada_agnostic (1)
  • climate-model-simulation-crashes (1)
  • car (1)
  • segment (1)
  • sylva_agnostic (1)
  • SpeedDating (1)
  • Internet-Advertisements (1)
  • MiceProtein (1)
  • mfeat-pixel (1)

Benchmark results and Overfitting

Maybe I'm thinking too far ahead, but there are a few obvious criticisms that we may get from reviewers related to how this benchmark is going to be used (assuming we want this to be the reference benchmark for the field).

Two important issues here that are typical of benchmarking studies:

  • Cheating: people can look at the training sets and just publish the correct predictions. This requires some hacking of the OpenML APIs, but it's not impossible. How about an algorithm that queries OpenML for the best flow for each task and just runs that? I guess we need a more explicit and visible way to report cheating/issues on the result page, and a switch to only show results without issues, if only to discourage people from doing this. Do we want people to give their real names when they submit results?

Maybe we can - to some extent - run the flows on the server and try to reproduce the results, and then add a special label to those runs.

  • Overfitting: on a single task, it is quite easy to submit many results until, by chance, they overfit on the entire 10-fold CV. For the benchmark, the results are aggregated over multiple tasks, so overfitting is less likely, but not impossible.

Some ways to alleviate this problem:

  • Use 10x10fold CV instead of normal 10fold CV tasks, maybe as an OpenML-CC18x10 benchmark that people can choose to use instead (assuming that the results here are more authoritative). Maybe even an OpenML-CC18x100.
  • Have an 'evolving' benchmark: as new datasets are added in CC19, CC20 etc., we can show how results/rankings change over time. Overfitted flows on CC18 will likely perform worse in CC19 etc.
  • Other ideas? Differential privacy?

In addition, we should also show an aggregated view of the scores on the individual tasks (e.g. violin plots?) and run statistical tests. We could do the typical Friedman-Nemenyi test, but I'm not sure how well that works on 'only' 80-something datasets.
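
For reference, a minimal sketch of such a test over a per-task score matrix (scipy provides the Friedman test; the Nemenyi post-hoc shown here assumes the third-party scikit-posthocs package):

```python
# Sketch: Friedman test across tasks, plus a Nemenyi post-hoc.
# Assumes a pandas DataFrame `scores` with one row per task and one column per
# flow, and the third-party scikit-posthocs package for the post-hoc test.
import pandas as pd
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare


def compare_flows(scores: pd.DataFrame) -> pd.DataFrame:
    # Friedman test: do the flows' per-task scores come from the same distribution?
    stat, p_value = friedmanchisquare(*[scores[col] for col in scores.columns])
    print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

    # Nemenyi post-hoc: pairwise comparisons (rows = tasks/blocks, columns = flows).
    return sp.posthoc_nemenyi_friedman(scores)
```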

We could of course wave our hands and say 'yes, but we are only solving the problem of non-standardized benchmark tests and these issues apply to any benchmarking study' but in a way these issues are connected...

More datasets

Try to get more datasets! (It would be good to keep the 'OML100' principle.) For example, @joaquinvanschoren mentioned that he has several datasets that comply with the requirements. Related to #13

Document the used tags

Although these tags are not 'protected' (none of the tags are), it would be good to have documentation of some agreed-upon tags, such as 'artificial', 'label_leakage', etc. (as suggested by @berndbischl).

data_status only keeps the last applicable tested status of each dataset

The way the filters are set up, only the last known 'reason for exclusion' is kept for each dataset. To me it makes sense to instead keep track of the full set of test results (roughly as sketched below). That way it becomes easier to identify which constraints to relax if you would like a larger study.

I'll go ahead and implement it myself either way, so I will just add a PR when it's done. I opened the issue to see if this was considered, if there are good reasons not to do this and/or if there are any additional related features that make sense to add.
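
A rough illustration of the idea (hypothetical names, not the actual generator code): keep every failed check per dataset instead of overwriting a single status field.

```python
# Hypothetical sketch: record all failed checks per dataset instead of only the
# last one, so it is easy to see which constraints to relax for a larger study.
from collections import defaultdict

data_status = defaultdict(set)          # dataset ID -> set of failed checks


def record_failure(data_id: int, check_name: str) -> None:
    data_status[data_id].add(check_name)


def excluded_only_by(check_name: str) -> list:
    """Datasets that would re-enter the suite if this one constraint were relaxed."""
    return [did for did, checks in data_status.items() if checks == {check_name}]
```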

segment dataset

This new version was created by Jann:
https://www.openml.org/d/40984

The difference is that the position of the 3x3 pixel sample in the image is removed.
Are we sure that this is correct? If I want to classify 'sky', is it not useful to know the position in the image?

I'm leaving both versions as active for now.

New dataset list

Credits to @ArlindKadra, who compiled the latest list of dataset IDs as uploaded by Jann.

1489, 15, 40981, 1462, 1471, 151, 469, 23512, 1464, 1480, 40982, 182, 18, 11, 29, 23, 37, 40983, 1120, 307, 1050, 1590, 1049, 40993, 4538, 4534, 1461, 1466, 40989, 1475, 1497, 23381, 38, 60, 1510, 40975, 50, 22, 40984, 40668, 1063, 1053, 1068, 1067, 1494, 188, 31, 32, 54, 6, 28, 14, 16, 3, 1487, 40992, 1486, 44, 46, 24, 6332, 40536, 40499, 12, 40971, 554, 1038, 4134, 40978, 1501, 42, 1485, 1478, 300, 1515, 1468, 40966, 1491, 1492, 1493, 40979

I have a Java program to automatically grab or generate the associated tasks, and can easily create a new study when necessary.
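
The same lookup can also be done from Python (a rough sketch assuming the openml-python client; column names such as 'did' and 'tid' are those currently returned by list_tasks and may differ between client versions):

```python
# Sketch: find existing supervised classification tasks for a list of dataset
# IDs with openml-python. This downloads the full task listing, which can take
# a moment; column names may differ between client versions.
import openml

dataset_ids = [1489, 15, 40981, 1462]   # first few IDs from the list above

tasks = openml.tasks.list_tasks(
    task_type=openml.tasks.TaskType.SUPERVISED_CLASSIFICATION,
    output_format="dataframe",
)
selected = tasks[tasks["did"].isin(dataset_ids)]
print(selected[["tid", "did", "name"]])
```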

Question: higgs sampling

As raised by @mfeurer and followed up on in the call this morning:

The higgs dataset is a subset of the full UCI Higgs dataset. The crucial question is: is this subset randomly sampled, or is it the same subset used across the literature? I added @joaquinvanschoren as he is the official uploader and might remember how he acquired it :)

paper on arxiv must be updated or changed

That is a sensitive issue, as people have already:

  • read it
  • cited it
  • used the data with the tag oml100

My suggestion:

Update the first page of the arXiv paper, tell people that it is "outdated" and was somewhat of a "preliminary" attempt, and link to our new paper.

Dataset Ipums - 378

Tagged as 'unspecified_target', but this is not the case.

However, it seems to be a subsample of a bigger dataset.

Update Wiki

  • vowel (2)
  • adult (2)
  • ada_agnostic (2)
  • eucalyptus (1)
  • tamilnadu_electricity (1)
  • semeion (1)

Decide about data stream data

At the moment we have several data stream datasets in our list (I think four). We need to decide whether to keep them and argue that, because of their featurization, they are valid to use as regular classification datasets.

Make it easier for users to create benchmark suites

  1. We need to describe how to create a suite better. Currently, we have this: https://www.openml.org/guide/benchmark . Maybe we can add something like the following (still improvable):

a) To create a benchmark suite, we need to use tasks (not datasets). That is, if there is no task for the corresponding dataset, you first have to create a task for it (see https://www.openml.org/new/task , which is currently only possible through the web interface).
b) You have to create a study at https://www.openml.org/new/study (I think this is currently also only possible through the web interface) and remember the study ID after creating it; you will need the ID for step c). If you set an alias string when creating the study, the alias can also be used to retrieve the benchmark suite (alternatively the study ID can be used, see step d).
c) You should add a tag called "study_X", where X is your study ID, to the tasks (and datasets); this should be possible through the clients (e.g. R) or through the web interface.
d) Now you have your benchmark suite. In R, you can get the information using getOMLStudy(IDofStudy) or getOMLStudy("your-alias-string"); see also the Python sketch after this list. Study information can be found online at https://www.openml.org/s/IDofStudy

  2. We may have to simplify some steps for users. Still, many things are only possible through the web interface. Examples:
    a) We need a better way to create tasks out of datasets; imagine you want to add your own benchmark suite but have to create 100 tasks manually through the web. See openml/OpenML#325
    b) If a task is tagged, the underlying data should also be tagged. Likewise, if a run is tagged, then the underlying task, data, and flow should be tagged with the same tag. See openml/OpenML#530 . If the server does not do this automatically, at least the client should do it.
    c) Maybe we should also accept tagging tasks by alias string in step (c) above.
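
For completeness, a minimal sketch of step d) using the Python client instead of R (assumes a recent openml-python, where benchmark suites, i.e. task-based studies, are fetched via openml.study.get_suite; older client versions expose this differently):

```python
# Sketch of step d) in Python rather than R. Assumes a recent openml-python
# client; "your-alias-string" is the alias chosen when creating the study.
import openml

suite = openml.study.get_suite("your-alias-string")   # or the numeric study ID
print(suite.name)
print(suite.tasks)    # task IDs that make up the suite
print(suite.data)     # corresponding dataset IDs
```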
