OMLbots Issues
uploaded results should be tagged
To-dos
Write functions to:
- merge the tables
- evaluate the results with models
- sample new prediction observations and create a Pareto front
- write a recommendation function
Shiny app for the visualisation of the results -> Quay
Write database backend
Conversion of Hyperparameters
I just had a look at the hyperparameters extracted from the OpenML platform, and there are some problems when using them for surrogate models.
We have some (actually quite a lot of) NA values. To use them in the surrogate models we have to convert them. My suggestion: -1 for numeric variables and an "NA" level for factor variables.
Making a hierarchical model structure (e.g. if the booster in xgboost is linear, then use this model, etc.) is too complicated.
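A minimal sketch of that conversion, assuming the hyperparameters live in a plain data.frame (the function name and column handling are illustrative, not the bot's actual code):
convertNAHyperPars = function(pars) {
  for (col in names(pars)) {
    x = pars[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] = -1              # NA in numeric hyperparameters becomes -1
    } else {
      x = addNA(factor(x))          # make NA an explicit factor level
      levels(x)[is.na(levels(x))] = "NA"
    }
    pars[[col]] = x
  }
  pars
}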
runTime not downloadable for some runs
For certain runs this does not work:
> getRunTime(1886569)
Error in function (type, msg, asError = TRUE) : Could not resolve host: <garbled>
Additionally, a warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
run.id run.time sci.mark
1 1886569 NA NA
If we wait a little longer (openml/OpenML#426), all of this will be available via listOMLRunEvaluations; then we no longer need this annoying workaround and it will be faster.
Data conversion for xgboost
xgboost only accepts numerical features.
We should decide how we convert factor variables. For now I implemented an automatic conversion to numeric: the factor is ordered according to its levels and turned into a numerical variable. Alternatively we could convert each factor into several binary features, but the feature space can get big for factors with many levels.
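Both options sketched on a toy factor (illustrative only; this is not the code in the bot):
x = factor(c("low", "medium", "high", "medium"))
# Option 1 (currently implemented in spirit): encode by level order
x_numeric = as.numeric(x)
# Option 2: one-hot/dummy encoding; one binary column per level, so the
# feature space grows quickly for factors with many levels
x_dummy = model.matrix(~ x - 1)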
min.node.size in ranger
We did not include it as a parameter and I am a bit sad about this. Maybe we could include it in further experiments... A transformation would be necessary here, like the one for mtry.
getOverview
Currently displays task.id; it should also display data.id and the data set name.
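A hedged sketch of how the extra columns could be joined in via listOMLTasks() (column names assumed from the OpenML package output; "ov" is a stand-in for the current overview table):
library(OpenML)
ov = data.frame(task.id = c(3, 6))        # placeholder for the current overview table
tasks = listOMLTasks(limit = 10000)
ov = merge(ov, tasks[, c("task.id", "data.id", "name")], by = "task.id", all.x = TRUE)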
GetData changes
- Add setup.id to getMlrRandomBotResults()
- Add task.name to getMlrRandomBotResults()
- Change getMetaFeatures to use randomBot tags
n does not equal n in resampling
Some hyperparameters are set according to the number of observations (currently only min.node.size; see lines 36 to 37 in 7f02825).
This does not respect the resampling. With 10-fold CV it should be, e.g.:
n = nrow(task$task$input$data.set$data)/10
par$min.node.size = round(2^(log(n, 2) * par$min.node.size))
Theoretically, all performances for min.node.size in [log(n/10)/log(n), 1] should be the same, because min.node.size is then set to a value bigger than the actual n.
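A sketch following exactly the formula suggested above (the function name and the folds argument are illustrative):
transformMinNodeSize = function(par, data, folds = 10) {
  n = nrow(data) / folds
  par$min.node.size = round(2^(log(n, 2) * par$min.node.size))
  par
}
transformMinNodeSize(list(min.node.size = 0.4), iris)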
Error runs
They are not uploaded to OpenML, are they? Kind of problematic...
first experiments should always be to run the base learner with its defaults
randombot needs to use its own OML account
Create a data.frame that can be used to train a surrogate model
Find a way to upload errors
Should work according to here: openml/OpenML#424
Use OpenML Snapshot database
Use this openml/OpenML#415 (comment) instead of what I did earlier.
please link from the OML user account to this repo
Prevent runs of same configurations on same dataset
As it is currently a bit complicated to scan for already-used parameter configurations I am not sure, but it looks like you are not preventing repeated evaluations of the same parameter configuration on the same data set.
This of course only matters for learners that do not have any continuous parameters, like knn.
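One possible way to filter duplicates before running, sketched under the assumption that the sampled and the already-evaluated configurations are available as data.frames with identical columns (names are hypothetical):
filterDuplicateConfigs = function(new.configs, done.configs) {
  key.new = apply(new.configs, 1, paste, collapse = "|")
  key.done = apply(done.configs, 1, paste, collapse = "|")
  new.configs[!(key.new %in% key.done), , drop = FALSE]
}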
Use random search to "optimize" the surrogate model (given a task & learner)
Also benchmark this against defaults.
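A rough sketch of what that could look like; the surrogate here is a ranger model trained on a made-up results table, so all column names and sizes are placeholders:
library(ranger)
# toy surrogate: predicts AUC from hyperparameters of one learner on one task
past = data.frame(eta = runif(200), max_depth = sample(1:10, 200, replace = TRUE), auc = runif(200))
surrogate = ranger(auc ~ ., data = past)
# random search: sample many candidate configurations, predict, keep the best
candidates = data.frame(eta = runif(1000), max_depth = sample(1:10, 1000, replace = TRUE))
best = candidates[which.max(predict(surrogate, data = candidates)$predictions), ]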
Implement running the defaults, in case the defaults have not been run yet
Make model with ranks instead of measure
Otherwise the results do not make any sense, as measure outcomes can differ wildly across datasets (e.g., on one dataset AUC lies between 0 and 0.2, on others between 0.9 and 1), especially when using the randomForest algorithm.
Maybe we have to scale the ranks so they lie between 0 and 1 (for each dataset); we should discuss it.
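A small sketch of per-dataset ranks scaled to [0, 1], using a made-up results table:
res = data.frame(data.id = rep(1:3, each = 4), auc = runif(12))
res$scaled.rank = ave(res$auc, res$data.id,
  FUN = function(x) (rank(x) - 1) / (length(x) - 1))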
make used tasks a bit more configurable
we need to be able to easily switch between 2 things:
a) study_14 (realistic)
b) something EXTREMELY small: few observations, few data sets, just holdout, etc. (see the sketch below)
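A hedged sketch of such a switch; the tag and the "tiny" filter values are made up and would need to match whatever we actually use:
library(OpenML)
getBotTasks = function(mode = c("study_14", "tiny")) {
  mode = match.arg(mode)
  tasks = listOMLTasks(tag = "study_14")
  if (mode == "tiny") {
    # only a handful of very small data sets for quick test runs with holdout
    tasks = head(tasks[tasks$number.of.instances < 500, ], 3)
  }
  tasks
}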
getResult functions: at least document in 1 sentence what data object they return
Regression datasets
I filtered out some regression datasets on which we could do the analysis.
First I did some automatic filtering; afterwards I looked at the datasets manually and filtered out more.
Overall I got 103 datasets that I saved here.
Code: https://github.com/ja-thomas/OMLbots/blob/master/regression_datasets/regression_datasets.R
@DanielKuehn87 (or @berndbischl), could you maybe look over / recheck the selection?
Can we give them a tag in OpenML?
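If yes, something along these lines should work via the OpenML package (the data IDs and the tag name here are placeholders):
library(OpenML)
reg.data.ids = c(189, 215, 227)   # placeholder IDs of the selected regression data sets
for (id in reg.data.ids) {
  tagOMLObject(id, object = "data", tags = "OMLbots_regression")
}
# later they can be retrieved via listOMLDataSets(tag = "OMLbots_regression")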
Rpart & svm fails?
https://www.openml.org/u/2702/flows
https://www.openml.org/search?q=uploader_id%3A2702&type=flow&sort=runs&order=desc
Looks like we have almost no uploads for SVM and rpart. I have a limit of 3.4 GB RAM and 2 hours runtime on Azure per run... but this should be enough at least for the smaller datasets. Not sure what the issue is...
Do not hardcode cluster functions
Cluster functions are defined by your .batchtools.conf.R file.
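Hedged example of a .batchtools.conf.R that keeps the backend out of the bot code (the socket backend with 4 CPUs is just an illustration; everyone sets their own):
# content of ~/.batchtools.conf.R
cluster.functions = batchtools::makeClusterFunctionsSocket(ncpus = 4)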
maybe show a little bit more logging in sampleTasks
- how many tasks initially listed?
- filtered down to how many?
don't use sapply
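For example, vapply (or the typed helpers in BBmisc) would be the usual replacement, since sapply's return type depends on its input:
x = list(1:3, 4:10)
lens = vapply(x, length, FUN.VALUE = integer(1))   # always an integer vector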
runBot: add an option so that the upload does not happen (for trying things out)
That option should be enabled by default.
please clean up which functions go into which file
The "sorting" of functions into files really does not seem to make sense in many cases; please clean this up.
Bot runs disappear on cluster.
Needs to be fixed asap, since we need more data tomorrow.
get result table, for one learner and one measure
How to handle runtime on different devices?
Currently the rscimark score is not included in the runs. We should add this ASAP, since we might need it in the future to adjust for different hardware settings.
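A hedged sketch of how the score could later be used to normalize runtimes across machines, assuming rscimark() returns a named vector with a "Composite" entry (the reference value is arbitrary):
library(rscimark)
sci.mark = rscimark()[["Composite"]]   # benchmark score of the machine running the bot
run.time = 12.3                        # measured runtime of one run, in seconds
reference.score = 1000                 # arbitrary reference machine
# scale the measured time to roughly what the reference machine would need
run.time.adjusted = run.time * sci.mark / reference.score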
function docs: at least add one sentence to explain what each function roughly does
do not use print(sprintf(...))
In all instances where you use this, you want messagef.
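Illustration of the replacement, assuming messagef from BBmisc (which is what mlr-style code usually uses):
library(BBmisc)
task.id = 3896L
# instead of print(sprintf("Selected OML task: %i", task.id)):
messagef("Selected OML task: %i", task.id)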
Literature
I did a literature search (also for my other paper).
Here are the papers divided by topic; maybe you can add yours to the respective topic if you know some more.
Tuning in general:
- Tuning with Iterated F-Racing: Automatic model selection for high-dimensional survival analysis
- (Iterated) F-Racing: F-Race and iterated F-Race: An overview, Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle
- AutoML: Efficient and Robust Automated Machine Learning, Feurer et al.
- mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions, Bischl et al.
- mlrMBO: Faster Model-Based Optimization through Resource-Aware Scheduling Strategies, Richter et al.
- Hyperopt: a Python library for model selection and hyperparameter optimization (Python's analogue of MBO)
- Sequential MBO: Sequential Model-Based Optimization for General Algorithm Configuration, Hutter et al.
- Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms, Thornton et al.
- Practical Bayesian Optimization of Machine Learning Algorithms
Meta-Learning:
- Multi-Task Bayesian Optimization, Swersky et al.
- To tune or not to tune: recommending when to adjust SVM hyper-parameters via meta-learning
- Collaborative hyperparameter tuning, Bardenet et al.
- Hyperparameter Optimization Machines, Wistuba et al.
- Using meta-learning to initialize Bayesian optimization of hyperparameters, Feurer et al.
- Scalable Hyperparameter Optimization with Products of Gaussian Process Experts, Schilling et al.
- Sequential Model-Free Hyperparameter Tuning, Wistuba et al.
- Two-Stage Transfer Surrogate Model for Automatic Hyperparameter Optimization, Wistuba et al.
- Learning hyperparameter optimization initializations, Wistuba et al.
Hyperparameter Importance:
- An Efficient Approach for Assessing Hyperparameter Importance
- Identifying key algorithm parameters and instance features using forward selection
- Analysing differences between algorithm configurations through ablation
- Efficient Parameter Importance Analysis via Ablation with Surrogates
- Hyper-parameter Tuning of a Decision Tree Induction Algorithm
Surrogate Models:
- Surrogate benchmarks for hyperparameter optimization, Eggensperger et al. (RF is best)
- Efficient Benchmarking of Hyperparameter Optimizers via Surrogates
Time Prediction:
- Prediction of Classifier Training Time Including Parameter Optimization, Reif et al.
Time AND Performance:
- Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, Klein et al.
don't use dots in file names, e.g. lrn.ps.parset.R; use underscores
Why is there a number behind the flows?
Do we need to remove the (number)?
mlr.classif.glmnet(2)
mlr.classif.glmnet(3)
logging output: can you display the readable data set name?
[1] "Selected OML task: 3896"
like here somewhere
document all functions at least briefly
Many functions do not even have a single-line header that documents what they do.
You don't have to use full roxygen style, but not documenting at all really is not good.
create overview function that displays the current state of experiments on the server
so all uploaded runs from the bot.
possibly in 2 stages:
a) create a data.frame, or multiple ones, that contains all relevant info (at least a rough approximation of that); see the sketch below
b) possibly do the display in shiny? because you really need to drill down into the info and click a bit
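A hedged sketch of stage a); the tag name and the returned column names are assumptions about the OpenML package output and the bot's tagging:
library(OpenML)
getBotOverview = function(tag = "mlrRandomBot") {
  evals = listOMLRunEvaluations(tag = tag, limit = 100000)
  # rough state of the experiments: number of uploaded runs per task and flow
  aggregate(run.id ~ task.id + flow.name, data = evals, FUN = length)
}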