OMLbots Issues
uploaded results should be tagged
To-dos
Write functions to:
- merge the tables
- evaluate the results with models
- sample new prediction observations and create a Pareto front
- write a recommendation function
Shiny app for the visualisation of the results -> Quay
Write database backend
Conversion of Hyperparameters
I just had a look at the hyperparameters extracted from the OpenML platform, and there are some problems when using them for surrogate models.
We have some (actually quite a lot of) NA values. To use them in the surrogate models we have to convert them. My suggestion: -1 for numeric variables and an "NA" level for factor variables.
Making a hierarchical model structure (e.g. if the booster in xgboost is linear, then use this model, etc.) is too complicated.
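A minimal sketch of that conversion, assuming the hyperparameters live in a plain data.frame (the function name and column handling are illustrative, not the bot's actual code):
convertNAHyperPars = function(pars) {
  for (col in names(pars)) {
    x = pars[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] = -1              # NA in numeric hyperparameters becomes -1
    } else {
      x = addNA(factor(x))          # make NA an explicit factor level
      levels(x)[is.na(levels(x))] = "NA"
    }
    pars[[col]] = x
  }
  pars
}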
runTime not downloadable for some runs
For certain runs this does not work:
> getRunTime(1886569)
Error in function (type, msg, asError = TRUE) : Could not resolve host: <garbled>
Additionally, a warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale
run.id run.time sci.mark
1 1886569 NA NA
If we wait a little longer (openml/OpenML#426), all of this will be available via listOMLRunEvaluations; then we no longer need this annoying workaround and it will be faster.
Data conversion for xgboost
xgboost only accepts numerical features.
We should decide how we convert factor variables. For now I implemented an automatic conversion to numeric: the factor is ordered according to its levels and turned into a numerical variable. Alternatively we could convert each factor into several binary features, but the feature space can get big for factors with many levels.
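Both options sketched on a toy factor (illustrative only; this is not the code in the bot):
x = factor(c("low", "medium", "high", "medium"))
# Option 1 (currently implemented in spirit): encode by level order
x_numeric = as.numeric(x)
# Option 2: one-hot/dummy encoding; one binary column per level, so the
# feature space grows quickly for factors with many levels
x_dummy = model.matrix(~ x - 1)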
min.node.size in ranger
We did not include it as a parameter and I am a bit sad about this. Maybe we could include it in further experiments... A transformation would be necessary here, like the one for mtry.
getOverview
Currently displays task.id; it should also display data.id and the data set name.
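A hedged sketch of how the extra columns could be joined in via listOMLTasks() (column names assumed from the OpenML package output; "ov" is a stand-in for the current overview table):
library(OpenML)
ov = data.frame(task.id = c(3, 6))        # placeholder for the current overview table
tasks = listOMLTasks(limit = 10000)
ov = merge(ov, tasks[, c("task.id", "data.id", "name")], by = "task.id", all.x = TRUE)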
GetData changes
- Add setup.id to getMlrRandomBotResults()
- Add task.name to getMlrRandomBotResults()
- Change getMetaFeatures to use randomBot tags
n does not equal n in resampling
Some hyperparameters are set according to the number of observations (currently only min.node.size; see lines 36 to 37 in 7f02825).
This does not respect the resampling. With 10-fold CV it should be, e.g.:
n = nrow(task$task$input$data.set$data)/10
par$min.node.size = round(2^(log(n, 2) * par$min.node.size))
Theoretically, all performances for min.node.size in [log(n/10)/log(n), 1] should be the same, because min.node.size is then set to a value bigger than the actual n.
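A sketch following exactly the formula suggested above (the function name and the folds argument are illustrative):
transformMinNodeSize = function(par, data, folds = 10) {
  n = nrow(data) / folds
  par$min.node.size = round(2^(log(n, 2) * par$min.node.size))
  par
}
transformMinNodeSize(list(min.node.size = 0.4), iris)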
Error runs
They are not uploaded to OpenML, are they? Kind of problematic...
first experiments should always be to run the base learner with its defaults
randombot needs to use its own OML account
Create a data.frame that can be used to train a surrogate model
Find a way to upload errors
Should work according to here: openml/OpenML#424
Use OpenML Snapshot database
Use this openml/OpenML#415 (comment) instead of what I did earlier.
please link from the OML user account to this repo
Prevent runs of same configurations on same dataset
As it is currently a bit complicated to scan for already-used parameter configurations I am not sure, but it looks like you are not preventing repeated evaluations of the same parameter configuration on the same data set.
This of course only matters for learners that do not have any continuous parameters, like knn.
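One possible way to filter duplicates before running, sketched under the assumption that the sampled and the already-evaluated configurations are available as data.frames with identical columns (names are hypothetical):
filterDuplicateConfigs = function(new.configs, done.configs) {
  key.new = apply(new.configs, 1, paste, collapse = "|")
  key.done = apply(done.configs, 1, paste, collapse = "|")
  new.configs[!(key.new %in% key.done), , drop = FALSE]
}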
Use random search to "optimize" the surrogate model (given a task & learner)
Also benchmark this against defaults.
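A rough sketch of what that could look like; the surrogate here is a ranger model trained on a made-up results table, so all column names and sizes are placeholders:
library(ranger)
# toy surrogate: predicts AUC from hyperparameters of one learner on one task
past = data.frame(eta = runif(200), max_depth = sample(1:10, 200, replace = TRUE), auc = runif(200))
surrogate = ranger(auc ~ ., data = past)
# random search: sample many candidate configurations, predict, keep the best
candidates = data.frame(eta = runif(1000), max_depth = sample(1:10, 1000, replace = TRUE))
best = candidates[which.max(predict(surrogate, data = candidates)$predictions), ]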
Implement running the defaults, in case the defaults have not been run yet
Make model with ranks instead of measure
Otherwise the results do not make any sense, as measure outcomes can differ wildly across datasets (e.g., on one dataset AUC lies between 0 and 0.2, on others between 0.9 and 1), especially when using the randomForest algorithm.
Maybe we have to scale the ranks so they lie between 0 and 1 (for each dataset); we should discuss it.
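A small sketch of per-dataset ranks scaled to [0, 1], using a made-up results table:
res = data.frame(data.id = rep(1:3, each = 4), auc = runif(12))
res$scaled.rank = ave(res$auc, res$data.id,
  FUN = function(x) (rank(x) - 1) / (length(x) - 1))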
make used tasks a bit more configurable
we need to be able to easily switch between 2 things:
a) study_14 (realistic)
b) something EXTREMELY small: few observations, few data sets, just holdout, etc. (see the sketch below)
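A hedged sketch of such a switch; the tag and the "tiny" filter values are made up and would need to match whatever we actually use:
library(OpenML)
getBotTasks = function(mode = c("study_14", "tiny")) {
  mode = match.arg(mode)
  tasks = listOMLTasks(tag = "study_14")
  if (mode == "tiny") {
    # only a handful of very small data sets for quick test runs with holdout
    tasks = head(tasks[tasks$number.of.instances < 500, ], 3)
  }
  tasks
}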
getResult functions: at least document in 1 sentence what data object they return
Regression datasets
I filtered out some regression datasets on which we could do the analysis.
First I did some automatic filtering; afterwards I looked at the datasets manually and filtered out more.
Overall I got 103 datasets that I saved here.
Code: https://github.com/ja-thomas/OMLbots/blob/master/regression_datasets/regression_datasets.R
@DanielKuehn87 (or @berndbischl), could you maybe look over / recheck the selection?
Can we give them a tag in OpenML?
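If yes, something along these lines should work via the OpenML package (the data IDs and the tag name here are placeholders):
library(OpenML)
reg.data.ids = c(189, 215, 227)   # placeholder IDs of the selected regression data sets
for (id in reg.data.ids) {
  tagOMLObject(id, object = "data", tags = "OMLbots_regression")
}
# later they can be retrieved via listOMLDataSets(tag = "OMLbots_regression")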
Rpart & svm fails?
https://www.openml.org/u/2702/flows
https://www.openml.org/search?q=uploader_id%3A2702&type=flow&sort=runs&order=desc
Looks like we have almost no uploads for SVM and rpart. I have a limit of 3.4 GB RAM and 2 hours runtime on Azure per run... but this should be enough at least for the smaller datasets. Not sure what the issue is...
Do not hardcode cluster functions
Cluster functions are defined by your .batchtools.conf.R file.
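Hedged example of a .batchtools.conf.R that keeps the backend out of the bot code (the socket backend with 4 CPUs is just an illustration; everyone sets their own):
# content of ~/.batchtools.conf.R
cluster.functions = batchtools::makeClusterFunctionsSocket(ncpus = 4)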
maybe show a little bit more logging in sampleTasks
- how many tasks initially listed?
- filtered down to how many?
don't use sapply
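For example, vapply (or the typed helpers in BBmisc) would be the usual replacement, since sapply's return type depends on its input:
x = list(1:3, 4:10)
lens = vapply(x, length, FUN.VALUE = integer(1))   # always an integer vector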
runBot: add an option so that the upload does not happen (for trying things out)
That option should be enabled by default.
please clean up which functions go into which file
The "sorting" of functions into files really does not seem to make sense in many cases; please clean this up.
Bot runs disappear on cluster.
Needs to be fixed asap, since we need more data tomorrow.
get result table, for one learner and one measure
How to handle runtime on different devices?
Currently the rscimark score is not included in the runs. We should add this ASAP, since we might need it in the future to adjust for different hardware settings.
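A hedged sketch of how the score could later be used to normalize runtimes across machines, assuming rscimark() returns a named vector with a "Composite" entry (the reference value is arbitrary):
library(rscimark)
sci.mark = rscimark()[["Composite"]]   # benchmark score of the machine running the bot
run.time = 12.3                        # measured runtime of one run, in seconds
reference.score = 1000                 # arbitrary reference machine
# scale the measured time to roughly what the reference machine would need
run.time.adjusted = run.time * sci.mark / reference.score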
function docs: at least add one sentence to explain what each function roughly does
do not use print(sprintf(...))
In all instances where you use this, you want messagef.
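Illustration of the replacement, assuming messagef from BBmisc (which is what mlr-style code usually uses):
library(BBmisc)
task.id = 3896L
# instead of print(sprintf("Selected OML task: %i", task.id)):
messagef("Selected OML task: %i", task.id)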
Literature
I did a literature search (also for my other paper).
Here are the papers divided by topic; maybe you can add yours to the respective topic if you know some more.
Tuning in general:
- Tuning with Iterated F-Racing: Automatic model selection for high-dimensional survival analysis
- (Iterated) F-Racing: F-Race and iterated F-Race: An overview, Mauro Birattari, Zhi Yuan, Prasanna Balaprakash, and Thomas Stützle
- AutoML: Efficient and Robust Automated Machine Learning, Feurer et al.
- mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions, Bischl et al.
- mlrMBO: Faster Model-Based Optimization through Resource-Aware Scheduling Strategies, Richter et al.
- Hyperopt: a Python library for model selection and hyperparameter optimization (Python's analogue of MBO)
- Sequential MBO: Sequential Model-Based Optimization for General Algorithm Configuration, Hutter et al.
- Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms, Thornton et al.
- Practical Bayesian Optimization of Machine Learning Algorithms
Meta-Learning:
- Multi-Task Bayesian Optimization, Swersky et al.
- To tune or not to tune: recommending when to adjust SVM hyper-parameters via meta-learning
- Collaborative hyperparameter tuning, Bardenet et al.
- Hyperparameter Optimization Machines, Wistuba et al.
- Using meta-learning to initialize Bayesian optimization of hyperparameters, Feurer et al.
- Scalable Hyperparameter Optimization with Products of Gaussian Process Experts, Schilling et al.
- Sequential Model-Free Hyperparameter Tuning, Wistuba et al.
- Two-Stage Transfer Surrogate Model for Automatic Hyperparameter Optimization, Wistuba et al.
- Learning hyperparameter optimization initializations, Wistuba et al.
Hyperparameter Importance:
- An Efficient Approach for Assessing Hyperparameter Importance
- Identifying key algorithm parameters and instance features using forward selection
- Analysing differences between algorithm configurations through ablation
- Efficient Parameter Importance Analysis via Ablation with Surrogates
- Hyper-parameter Tuning of a Decision Tree Induction Algorithm
Surrogate Models:
- Surrogate benchmarks for hyperparameter optimization, Eggensperger et al. (RF is best)
- Efficient Benchmarking of Hyperparameter Optimizers via Surrogates
Time Prediction:
- Prediction of Classifier Training Time Including Parameter Optimization, Reif et al.
Time AND Performance:
- Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, Klein et al.
don't use dots in file names, e.g. lrn.ps.parset.R; use underscores
Why is there a number behind the flows?
Do we need to remove the (number)?
mlr.classif.glmnet(2)
mlr.classif.glmnet(3)
logging output: can you display the readable data set name?
[1] "Selected OML task: 3896"
like here somewhere
document all functions at least briefly
Many functions do not even have a single-line header that documents what they do.
You don't have to use full roxygen style, but not documenting at all really is not good.
create overview function that displays the current state of experiments on the server
so all uploaded runs from the bot.
possibly in 2 stages:
a) create a data.frame, or multiple ones, that contains all relevant info (at least a rough approximation of that); see the sketch below
b) possibly do the display in shiny? because you really need to drill down into the info and click a bit
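A hedged sketch of stage a); the tag name and the returned column names are assumptions about the OpenML package output and the bot's tagging:
library(OpenML)
getBotOverview = function(tag = "mlrRandomBot") {
  evals = listOMLRunEvaluations(tag = tag, limit = 100000)
  # rough state of the experiments: number of uploaded runs per task and flow
  aggregate(run.id ~ task.id + flow.name, data = evals, FUN = length)
}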