
omlbots's Introduction

OMLbots

A bot that executes (random) experiments on OpenML datasets and uploads the results to the OpenML platform.

The main function of the bot can be executed via runBot.

To add a new algorithm, include it together with its hyperparameter ranges in the file R/botSetLearnerParamPairs.R.
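
A minimal sketch of what such a learner/parameter-set pair could look like, assuming the bot builds on mlr learners and ParamHelpers parameter sets; the learner and ranges below are illustrative only, not taken from the actual file:

# Sketch only: illustrative learner/parameter-set pair in mlr + ParamHelpers style;
# the actual entries in R/botSetLearnerParamPairs.R may be structured differently.
library(mlr)
library(ParamHelpers)

lrn = makeLearner("classif.ranger")
par.set = makeParamSet(
  makeIntegerParam("num.trees", lower = 1, upper = 2000),
  makeNumericParam("mtry", lower = 0, upper = 1),  # sampled as a fraction, transformed later
  makeNumericParam("sample.fraction", lower = 0.1, upper = 1),
  makeLogicalParam("replace")
)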

See the executed runs on the bot's openml.org profile: https://www.openml.org/u/2702

OpenML Identification

Name: OpenML_Bot R

ID: 2702

Tags

General Tag: mlrRandomBot

Extra-Tag for the random hyperparameter runs (without RF default runs): botV1

Extra-Tag for the reference runs (random forest with defaults): referenceV1

Downloading

The fixed subset of 2.5 million results

A fixed subset of the results of the random bot can be downloaded easily from figshare:

https://figshare.com/articles/OpenML_R_Bot_Benchmark_Data_final_subset_/5882230

This dataset will be described in a forthcoming paper.

All results via the nightly database snapshot

Alternatively, all results can be downloaded via the nightly database snapshot. The snapshot is available at: https://docs.openml.org/developers/

After setting up the SQL database (for example, via a terminal on Linux), the data can be extracted with this script: https://github.com/ja-thomas/OMLbots/blob/master/snapshot_database/database_extraction.R
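
As a rough orientation, the extraction boils down to querying the restored MySQL database from R, for example with DBI/RMySQL. The database name, credentials and table name below are assumptions; the linked database_extraction.R script is the authoritative version:

# Sketch only: query a locally restored OpenML snapshot from R.
# Database name, credentials and table name are assumptions -- adapt to your setup.
library(DBI)
library(RMySQL)

con = dbConnect(MySQL(), dbname = "openml", host = "localhost",
  user = "root", password = "")
runs = dbGetQuery(con, "SELECT * FROM run LIMIT 10")
dbDisconnect(con)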

Using the R-API

If you want to download results via the OpenML R package, you can use the following code. (Currently under review; it does not work yet.)

https://github.com/ja-thomas/OMLbots/blob/master/GetResultsR-API.R
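
Until then, a minimal sketch of how the bot's runs could be listed with the OpenML R package, using the tags given above (argument names may differ slightly between package versions):

# Sketch only: list the bot's runs and evaluations via the OpenML R package.
library(OpenML)

runs  = listOMLRuns(tag = "mlrRandomBot", limit = 1000)
evals = listOMLRunEvaluations(tag = "mlrRandomBot", limit = 1000)
head(evals)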

omlbots's People

Contributors

philipppro, danielkuehn87, ja-thomas


omlbots's Issues

Make model with ranks instead of measure

Otherwise the results do not make any sense, as measure outcomes can differ widely across datasets (e.g., on one dataset AUC lies between 0 and 0.2, on others between 0.9 and 1), especially when using the randomForest algorithm.
Maybe we have to scale the ranks so that they lie between 0 and 1 (for each dataset); we should discuss this.
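
A minimal sketch of this per-dataset rank scaling, assuming a results table with columns data.id and auc (names are illustrative):

# Sketch only: rank performances within each dataset and rescale to [0, 1].
library(data.table)

results = data.table(
  data.id = rep(1:2, each = 3),
  auc     = c(0.05, 0.15, 0.20, 0.91, 0.95, 0.99)
)
results[, rank.scaled := (rank(auc) - 1) / (.N - 1), by = data.id]
results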

GetData changes

  1. Add setup.id to getMlrRandomBotResults()
  2. Add task.name to getMlrRandomBotResults()
  3. Change getMetaFeatures to use randomBot tags

Conversion of Hyperparameters

I just had a look at the hyperparameters extracted from the OpenML platform, and there are some problems when using them for surrogate models.

We have some/a lot of NA values. To use them in the surrogate models we have to convert them. My suggestion: -1 for numeric variables and an "NA" level for factor variables.

Making a hierarchical model structure (e.g., if the booster in xgboost is linear, then use this model, etc.) is too complicated.
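
A minimal sketch of the suggested NA handling, assuming the hyperparameters are collected in a plain data.frame:

# Sketch only: replace NA with -1 for numeric hyperparameters and add an
# explicit "NA" level for factor hyperparameters.
convertNAs = function(df) {
  for (col in names(df)) {
    x = df[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] = -1
    } else {
      x = addNA(factor(x))
      levels(x)[is.na(levels(x))] = "NA"
    }
    df[[col]] = x
  }
  df
}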

getOverview

Currently displays task.id; it should also display data.id and the dataset name.

Prevent runs of same configurations on same dataset

Since it is currently a bit complicated to scan the parameter configurations already used I am not sure, but it looks like you are not preventing repeated evaluations of the same parameter configuration on the same dataset.

Of course this only matters for learners that do not have any continuous parameters, such as knn.
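
A possible guard, sketched here with a simple lookup key per (task, configuration) pair; the function and variable names are made up for illustration, and 'evaluated' would have to be initialised from the runs already uploaded:

# Sketch only: skip a sampled configuration if the identical setting was
# already evaluated on the same task.
library(digest)

evaluated = character(0)

isNewConfig = function(task.id, par.vals) {
  key = digest(list(task.id = task.id, par.vals = par.vals))
  if (key %in% evaluated) return(FALSE)
  evaluated <<- c(evaluated, key)
  TRUE
}

isNewConfig(3, list(k = 5))  # TRUE on the first draw
isNewConfig(3, list(k = 5))  # FALSE on a repeat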

Error runs

Are not uploaded to OpenML, are they? That is somewhat problematic...

Literature

I did a literature search (also for my other paper).
Here are the papers divided by topic; maybe you can add yours to the relevant topic if you know more.

Tuning in general:

  • Tuning with Iterated F-Racing: Automatic model selection for high-dimensional survival analysis
  • (iterated) F-Racing: F-Race and iterated F-Race: An overview, Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle
  • AutoML: Efficient and Robust Automated Machine Learning, Feurer et al.
  • mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions, Bischl et al.
  • mlrMBO: Faster Model-Based Optimization through Resource-Aware Scheduling Strategies, Richter et al.
  • Hyperopt: a Python library for model selection and hyperparameter optimization (Python's analogue of MBO)
  • Sequential MBO: Sequential Model-Based Optimization for General Algorithm Configuration, Hutter et al.
  • Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms, Thornton et al.
  • Practical Bayesian Optimization of Machine Learning Algorithms

Meta-Learning:

  • Multi-Task Bayesian Optimization, Swersky et al.
  • To tune or not to tune: recommending when to adjust SVM hyper-parameters via metalearning
  • Collaborative hyperparameter tuning, Bardenet et al.
  • Hyperparameter Optimization Machines, Wistuba et al.
  • Using meta-learning to initialize Bayesian optimization of hyperparameters, Feurer et al.
  • Scalable Hyperparameter Optimization with Products of Gaussian Process Experts, Schilling et al.
  • Sequential Model-Free Hyperparameter Tuning, Wistuba et al.
  • Two-Stage Transfer Surrogate Model for Automatic Hyperparameter Optimization, Wistuba et al.
  • Learning hyperparameter optimization initializations, Wistuba et al.

Hyperparameter Importance:

  • An Efficient Approach for Assessing Hyperparameter Importance
  • Identifying key algorithm parameters and instance features using forward selection
  • Analysing differences between algorithm configurations through ablation
  • Efficient Parameter Importance Analysis via Ablation with Surrogates
  • Hyper-parameter Tuning of a Decision Tree Induction Algorithm

Surrogate Models:

  • Surrogate benchmarks for hyperparameter optimization, Eggensperger et al. (RF is best)
  • Efficient Benchmarking of Hyperparameter Optimizers via Surrogates

Time Prediction:

  • Prediction of Classifier Training Time Including Parameter Optimization, Reif et al.

Time AND Performance:

  • Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, Klein et al.

min.node.size in ranger

We did not include it as a parameter, and I am a bit sad about this. Maybe we could include it in further experiments... A transformation would be necessary here, as with mtry.

n does not equal n in resampling

Some hyperparameters are set according to the number of observations (currently only min.node.size):

n = nrow(task$task$input$data.set$data)
par$min.node.size = round(2^(log(n, 2) * par$min.node.size))

This does not respect the resampling. If we have 10-fold CV, it should be, e.g.:

n = nrow(task$task$input$data.set$data)/10
par$min.node.size = round(2^(log(n, 2) * par$min.node.size)) 

Theoretically, all performances for min.node.size in [log(n/10)/log(n), 1] should be the same, because min.node.size is then set to a value larger than the actual n.

Data conversion for xgboost

xgboost only accepts numerical features.
We should decide how to convert factor variables. For now I implemented an automatic conversion to numeric: the factor is ordered by its levels and turned into a numerical variable. Alternatively, each factor could be converted to several binary features, but the feature space can get large for factors with many levels.
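
The two options in a minimal sketch (column names are illustrative):

# Sketch only: the two conversion options discussed above.
df = data.frame(color = factor(c("blue", "green", "red")), x = c(1, 2, 3))

# Option 1: one numeric column per factor, ordered by its levels
df.int = df
df.int$color = as.integer(df.int$color)

# Option 2: binary indicator columns (feature space grows with the number of levels)
df.onehot = model.matrix(~ . - 1, data = df)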

runTime not downloadable for some runs

For certain runs this does not work:

> getRunTime(1886569)
Error in function (type, msg, asError = TRUE)  : Could not resolve host: ������
Additionally: Warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
   run.id run.time sci.mark
1 1886569       NA       NA

If we wait a little longer (openml/OpenML#426), all of this will be available via listOMLRunEvaluations; then we will no longer need this annoying workaround and it will be faster.

document all functions at least briefly

many functions do not even have a single-line header that documents what they do

you don't have to use full roxygen style, but not documenting at all is really not good

To-dos

Write functions to:

  1. merge the tables
  2. evaluate the results with models
  3. sample new prediction observations, create pareto front
  4. write recommendation function

Shiny app for the visualisation of the results -> Quay
Write database backend
