
omlbots's Introduction

OMLbots

A bot that executes (random) experiments on OpenML datasets and uploads the results to the OpenML platform.

The main function of the bot can be executed via runBot.

To add a new algorithm, include it together with its hyperparameter ranges in the file R/botSetLearnerParamPairs.R.
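
A minimal sketch of what such a learner/parameter-set pair could look like, assuming the bot builds on mlr learners and ParamHelpers parameter sets; the learner and ranges below are illustrative only, not taken from the actual file:

# Sketch only: illustrative learner/parameter-set pair in mlr + ParamHelpers style;
# the actual entries in R/botSetLearnerParamPairs.R may be structured differently.
library(mlr)
library(ParamHelpers)

lrn = makeLearner("classif.ranger")
par.set = makeParamSet(
  makeIntegerParam("num.trees", lower = 1, upper = 2000),
  makeNumericParam("mtry", lower = 0, upper = 1),  # sampled as a fraction, transformed later
  makeNumericParam("sample.fraction", lower = 0.1, upper = 1),
  makeLogicalParam("replace")
)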

See the executed runs on the bot's openml.org profile: https://www.openml.org/u/2702

OpenML Identification

Name: OpenML_Bot R

ID: 2702

Tags

General Tag: mlrRandomBot

Extra-Tag for the random hyperparameter runs (without RF default runs): botV1

Extra-Tag for the reference runs (random forest with defaults): referenceV1

Downloading

The fixed subset of 2.5 million results

A fixed subset of the results of the random bot can be downloaded easily from figshare:

https://figshare.com/articles/OpenML_R_Bot_Benchmark_Data_final_subset_/5882230

This dataset will be described in a forthcoming paper.

All results via the nightly database snapshot

Alternatively, all results can be downloaded via the nightly database snapshot. The snapshot is available at: https://docs.openml.org/developers/

After setting up the SQL database (for example, via a terminal on Linux), the data can be extracted with this script: https://github.com/ja-thomas/OMLbots/blob/master/snapshot_database/database_extraction.R
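
As a rough orientation, the extraction boils down to querying the restored MySQL database from R, for example with DBI/RMySQL. The database name, credentials and table name below are assumptions; the linked database_extraction.R script is the authoritative version:

# Sketch only: query a locally restored OpenML snapshot from R.
# Database name, credentials and table name are assumptions -- adapt to your setup.
library(DBI)
library(RMySQL)

con = dbConnect(MySQL(), dbname = "openml", host = "localhost",
  user = "root", password = "")
runs = dbGetQuery(con, "SELECT * FROM run LIMIT 10")
dbDisconnect(con)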

Using the R-API

If you want to download results via the OpenML R package, you can use the following code. (Currently under review; it does not work yet.)

https://github.com/ja-thomas/OMLbots/blob/master/GetResultsR-API.R
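
Until then, a minimal sketch of how the bot's runs could be listed with the OpenML R package, using the tags given above (argument names may differ slightly between package versions):

# Sketch only: list the bot's runs and evaluations via the OpenML R package.
library(OpenML)

runs  = listOMLRuns(tag = "mlrRandomBot", limit = 1000)
evals = listOMLRunEvaluations(tag = "mlrRandomBot", limit = 1000)
head(evals)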

omlbots's People

Contributors

philipppro, danielkuehn87, ja-thomas


omlbots's Issues

Make model with ranks instead of measure

Otherwise the results do not make any sense, as measure outcomes can differ widely across datasets (e.g., on one dataset AUC lies between 0 and 0.2, on others between 0.9 and 1), especially when using the randomForest algorithm.
Maybe we have to scale the ranks so that they lie between 0 and 1 (for each dataset); we should discuss this.
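
A minimal sketch of this per-dataset rank scaling, assuming a results table with columns data.id and auc (names are illustrative):

# Sketch only: rank performances within each dataset and rescale to [0, 1].
library(data.table)

results = data.table(
  data.id = rep(1:2, each = 3),
  auc     = c(0.05, 0.15, 0.20, 0.91, 0.95, 0.99)
)
results[, rank.scaled := (rank(auc) - 1) / (.N - 1), by = data.id]
results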

GetData changes

  1. Add setup.id to getMlrRandomBotResults()
  2. Add task.name to getMlrRandomBotResults()
  3. Change getMetaFeatures to use randomBot tags

Conversion of Hyperparameters

I just had a look at the hyperparameters extracted from the OpenML platform, and there are some problems when using them for surrogate models.

We have some/a lot of NA values. To use them in the surrogate models we have to convert them. My suggestion: -1 for numeric variables and an "NA" level for factor variables.

Making a hierarchical model structure (e.g., if the booster in xgboost is linear, then use this model, etc.) is too complicated.
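
A minimal sketch of the suggested NA handling, assuming the hyperparameters are collected in a plain data.frame:

# Sketch only: replace NA with -1 for numeric hyperparameters and add an
# explicit "NA" level for factor hyperparameters.
convertNAs = function(df) {
  for (col in names(df)) {
    x = df[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] = -1
    } else {
      x = addNA(factor(x))
      levels(x)[is.na(levels(x))] = "NA"
    }
    df[[col]] = x
  }
  df
}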

getOverview

Currently displays task.id; it should also display data.id and the dataset name.

Prevent runs of same configurations on same dataset

Since it is currently a bit complicated to scan the parameter configurations already used I am not sure, but it looks like you are not preventing repeated evaluations of the same parameter configuration on the same dataset.

Of course this only matters for learners that do not have any continuous parameters, such as knn.
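
A possible guard, sketched here with a simple lookup key per (task, configuration) pair; the function and variable names are made up for illustration, and 'evaluated' would have to be initialised from the runs already uploaded:

# Sketch only: skip a sampled configuration if the identical setting was
# already evaluated on the same task.
library(digest)

evaluated = character(0)

isNewConfig = function(task.id, par.vals) {
  key = digest(list(task.id = task.id, par.vals = par.vals))
  if (key %in% evaluated) return(FALSE)
  evaluated <<- c(evaluated, key)
  TRUE
}

isNewConfig(3, list(k = 5))  # TRUE on the first draw
isNewConfig(3, list(k = 5))  # FALSE on a repeat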

Error runs

Are not uploaded to OpenML, are they? That is somewhat problematic...

Literature

I did a literature search (also for my other paper).
Here are the papers divided by topic; maybe you can add yours to the relevant topic if you know more.

Tuning in general:

  • Tuning with Iterated F-Racing: Automatic model selection for high-dimensional survival analysis
  • (iterated) F-Racing: F-Race and iterated F-Race: An overview, Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle
  • AutoML: Efficient and Robust Automated Machine Learning, Feurer et al.
  • mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions, Bischl et al.
  • mlrMBO: Faster Model-Based Optimization through Resource-Aware Scheduling Strategies, Richter et al.
  • Hyperopt: a Python library for model selection and hyperparameter optimization (Python's analogue of MBO)
  • Sequential MBO: Sequential Model-Based Optimization for General Algorithm Configuration, Hutter et al.
  • Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms, Thornton et al.
  • Practical Bayesian Optimization of Machine Learning Algorithms

Meta-Learning:

  • Multi-Task Bayesian Optimization, Swersky et al.
  • To tune or not to tune: recommending when to adjust SVM hyper-parameters via metalearning
  • Collaborative hyperparameter tuning, Bardenet et al.
  • Hyperparameter Optimization Machines, Wistuba et al.
  • Using meta-learning to initialize Bayesian optimization of hyperparameters, Feurer et al.
  • Scalable Hyperparameter Optimization with Products of Gaussian Process Experts, Schilling et al.
  • Sequential Model-Free Hyperparameter Tuning, Wistuba et al.
  • Two-Stage Transfer Surrogate Model for Automatic Hyperparameter Optimization, Wistuba et al.
  • Learning hyperparameter optimization initializations, Wistuba et al.

Hyperparameter Importance:

  • An Efficient Approach for Assessing Hyperparameter Importance
  • Identifying key algorithm parameters and instance features using forward selection
  • Analysing differences between algorithm configurations through ablation
  • Efficient Parameter Importance Analysis via Ablation with Surrogates
  • Hyper-parameter Tuning of a Decision Tree Induction Algorithm

Surrogate Models:

  • Surrogate benchmarks for hyperparameter optimization, Eggensperger et al. (RF is best)
  • Efficient Benchmarking of Hyperparameter Optimizers via Surrogates

Time Prediction:

  • Prediction of Classifier Training Time Including Parameter Optimization, Reif et al.

Time AND Performance:

  • Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, Klein et al.

min.node.size in ranger

We did not include it as a parameter, and I am a bit sad about this. Maybe we could include it in further experiments... A transformation would be necessary here, as with mtry.

n does not equal n in resampling

Some hyperparameters are set according to the number of observations (currently only min.node.size):

n = nrow(task$task$input$data.set$data)
par$min.node.size = round(2^(log(n, 2) * par$min.node.size))

This does not respect the resampling. If we have 10-fold CV, it should be, e.g.:

n = nrow(task$task$input$data.set$data)/10
par$min.node.size = round(2^(log(n, 2) * par$min.node.size)) 

Theoretically, all performances for min.node.size in [log(n/10)/log(n), 1] should be the same, because min.node.size is then set to a value larger than the actual n.

Data conversion for xgboost

xgboost only accepts numerical features.
We should decide how to convert factor variables. For now I implemented an automatic conversion to numeric: the factor is ordered by its levels and turned into a numerical variable. Alternatively, each factor could be converted to several binary features, but the feature space can get large for factors with many levels.
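
The two options in a minimal sketch (column names are illustrative):

# Sketch only: the two conversion options discussed above.
df = data.frame(color = factor(c("blue", "green", "red")), x = c(1, 2, 3))

# Option 1: one numeric column per factor, ordered by its levels
df.int = df
df.int$color = as.integer(df.int$color)

# Option 2: binary indicator columns (feature space grows with the number of levels)
df.onehot = model.matrix(~ . - 1, data = df)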

runTime not downloadable for some runs

For certain runs this does not work:

> getRunTime(1886569)
Error in function (type, msg, asError = TRUE)  : Could not resolve host: ������
Additionally: Warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
   run.id run.time sci.mark
1 1886569       NA       NA

If we wait a little longer (openml/OpenML#426), all of this will be available via listOMLRunEvaluations; then we will no longer need this annoying workaround and it will be faster.

document all functions at least briefly

many functions do not even have a single-line header that documents what they do

you don't have to use full roxygen style, but not documenting at all is really not good

To-dos

Write functions to:

  1. merge the tables
  2. evaluate the results with models
  3. sample new prediction observations, create pareto front
  4. write recommendation function

Shiny app for the visualisation of the results -> Quay
Write database backend
