openml / openml
Open Machine Learning
Home Page: https://openml.org
License: BSD 3-Clause "New" or "Revised" License
Some of the tables use reserved names like "class" and "type". This is bad practice because it can cause conflicts in some programming languages. In Ruby, for example, the column "class" in the "Algorithm" table shadows the built-in "Class".
My request: rename the "class" columns to something like "algorithm_class", and the "type" columns (tables: cvrun, task_type_estimation_procedure, math_function, experiment_variable, queries, task_type_prediction_feature) to something else.
Hi,
I'm having trouble uploading a run. I keep on getting error 207:
"File upload failed. One of the files uploaded has a problem"
Is it possible to provide more information? E.g., which file has a problem, or even what the problem is. I think it's the output_files (in my case, a single .arff). Which format is expected?
Thanks in advance,
Dominik
These should probably be removed very soon. I think they are also flagged as "original".
did  name                                                 NumberOfFeatures  NumberOfInstances
28   cl2_view2_combined_and_view3                         0                 0
30   cl2_view3_names                                      0                 0
35   cl3_view2_combined_and_view3                         0                 0
37   cl3_view3                                            0                 0
53   CoEPrA-2006_Classification_001_Calibration_Peptides  0                 0
55   CoEPrA-2006_Classification_002_Calibration_Peptides  0                 0
57   CoEPrA-2006_Classification_003_Calibration_Peptides  0                 0
The already emailed example queries for the web interface should be collected in the wiki and also extended! We can only design the interface in a good way, if we collect reasonable things that people want to do with it.
There are 43 data sets (for isOriginal = 'true') that have neither features nor instances. What's up with them? They have names like these:
cl1_view2, cl1_view2_combined, cl1_view2_combined_and_raw_data, cl1_view2_combined_and_view3, CoEPrA-2006_Classification_001_Calibration_Data, ...
Can they be deleted or labeled as "not original"?
There is another data set without features and instances called "eucalyptus". Well, at least this is what the server tells me.
http://expdb.cs.kuleuven.be/expdb/api/?f=openml.evaluation.measures
For quite a lot of the measures it is unclear what they mean exactly, or it is clear but it does not make sense to ask the client to optimize them in a task.
Examples:
a) How is kohavi_wolpert_bias_squared defined exactly?
b) How is the client supposed to optimize for "confusion_matrix"?
Solution:
Right now, we have user-defined versioning for datasets and implementations, which means that users have to keep track of versions and have to select/invent a versioning system which will lead to a variety of versioning schemes on the server.
It would be better if OpenML could take care of versioning.
We can then remove the version field altogether. The user just provides a name for his dataset/implementation, the server then checks if that name exists, and if not, assigns version number 1 and stores a hash computed on the uploaded code. If the dataset/implementation is uploaded again, and the hash has changed, a new version number is assigned and the new id is returned.
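The hash-based versioning described above could be sketched as follows. This is a minimal illustration in Python, not the actual server code; the function and registry names are hypothetical.

```python
import hashlib

# Hypothetical in-memory stand-in for the server database:
# name -> ordered list of content hashes seen so far.
_registry = {}

def register(name, code):
    """Return (version, is_new) for an uploaded dataset/implementation."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    hashes = _registry.setdefault(name, [])
    if digest in hashes:
        # Unchanged content: reuse the existing version number.
        return hashes.index(digest) + 1, False
    # Changed (or first) content: assign the next version number.
    hashes.append(digest)
    return len(hashes), True

print(register("my_algo", "print('v1')"))  # (1, True)
print(register("my_algo", "print('v1')"))  # (1, False) - same hash, same version
print(register("my_algo", "print('v2')"))  # (2, True)  - changed code, new version
```

The user never picks a version; re-uploading identical content is idempotent, and any change to the code automatically yields a new version number.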
Comments, please :)
Allow to return a model built on the input data. This is useful for people actually interested in what is hidden in the input data. We don't want to force people to use PMML, so a model can be anything, such as a WEKA model file (.model) or an R data object (binary). Ideally, we can catalogue commonly used model formats (i.e. 'Weka model', 'R data object', ...) and describe then on the webpage, so that people know what to do with these model files.
I would propose to make this an optional output for the classification/regression task, thus:
'model' -> POSTed file with the model.
'model_format' -> a string with the model format. Can be free text, people can add a description afterwards on the website.
You can see for yourself, e.g., tasks don't even appear in it.
When I am searching for tasks I see these:
iris-weka.RemovePercentage-P:20
What are they? Should they be removed?
Best,
Bernd
R cannot parse this, fix pls
http://expdb.cs.kuleuven.be/expdb/api/?f=openml.data.description&data_id=61
0000-00-00 00:00:00
See subject ;)
See:
http://expdb.cs.kuleuven.be/expdb/api/index.php#openml_evaluation_measures
Naming format is wrong here, measure names should be lower case.
Also: why have this twice anyway? It's probably best to remove the example output and just provide a link to the API call; that gives all the needed info.
Also: area_under_ROC_curve
should probably be area_under_roc_curve
Currently, all results shown in both the implementation and dataset detail pages implicitly belong to the 'Supervised classification' task with 10x10 CV. It would be good to show that.
Maybe we should add a dropdown box showing the different tasks for which results can be returned? It is possible that the same dataset is used in more than one task.
Bernd mentioned he would like to store the features selected in a run.
I would like to start a discussion about how to do this.
Thanks,
Joaquin
should be 2 reps of 10CV but is:
, , = TRAIN
1 2 3 4 5 6 7 8 9 10
1 142 135 126 135 135 135 135 135 135 135
2 11 135 126 135 135 135 135 135 135 270
, , = TEST
1 2 3 4 5 6 7 8 9 10
1 15 15 15 15 15 15 15 15 15 15
2 2 15 13 15 15 15 15 15 15 30
Hi,
I just downloaded all implementations that are stored on the server at the moment. To do so, I made an SQL query and downloaded a .csv table with the names and versions of all implementations. Here are some issues/questions:
The second point is obviously the most problematic one. Should it be forbidden to use "<" and ">", or are there ways to parse an XML file that contains these in its contents?
Hi all,
We're working on the WEKA-plugin and had the following question: Say you have an ensemble method, such as Bagging, and a base-learner like a decision tree.
It is currently possible to store this either as:
I believe KNIME and Rapidminer would store these as separate subcomponents of the workflow. How are things currently handled in R? Do you use option 1 or 2?
I have a slight preference for the first method, mainly because it becomes easier to compare implementations (e.g. Bagging_J48 vs Bagging_OneR), even between environments (weka.Bagging_J48 vs KNIME.Bagging_J48_workflow), and to track the effect of parameters: I can track the effect of a J48 parameter easily without having to interpret strings.
This is indeed harder for us to implement because WEKA is kind of quirky in this area, but overall I think it makes things easier and more comparable.
Thanks,
Joaquin
Make Datasets searchable for
And/or provide a table with the most essential data characteristics for each available dataset. This would also be my preferred overview when clicking Search -> Datasets.
Download an implementation
"The implementation is returned by the server hosting it. This can be OpenML, but also any other code repository. Try it now"
"Try it now" links to http://expdb.cs.kuleuven.be/expdb/data/uci/nominal/anneal.arff
which is a data set.
Hi,
during a test today I simply uploaded the same run (exactly the same object) three times and this was possible.
Do we really want this? I haven't fully thought this through; I'm mainly posting this as a question. But in 99% of cases this is a user error that we should catch, I would suggest...
Hi,
the problem seems to be the comments in the oml:description tag.
Probably because there can be all kinds of weird characters in there - and apparently there already are.
R does not parse the whole XML at all, but tells me:
xmlParseEntityRef: no name
xmlParseEntityRef: no name
xmlParseEntityRef: no name
Error: 1: xmlParseEntityRef: no name
If I nearly completely remove the contents of oml:description I can parse again, so the problem is definitely located there.
Any ideas?
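In libxml2-based parsers (which R's XML package uses), "xmlParseEntityRef: no name" is exactly what a raw, unescaped "&" in element content produces. A sketch in Python showing the failure and the fix (the description text here is made up for illustration):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A raw '&' in element content makes the XML not well-formed.
bad = ("<oml:description xmlns:oml='http://openml.org/openml'>"
       "cats & dogs</oml:description>")
try:
    ET.fromstring(bad)
except ET.ParseError as e:
    print("parse failed:", e)

# Escaping the text content before building the XML fixes it:
good = ("<oml:description xmlns:oml='http://openml.org/openml'>"
        + escape("cats & dogs <html>") +
        "</oml:description>")
elem = ET.fromstring(good)
print(elem.text)  # cats & dogs <html>
```

So the server should escape (or CDATA-wrap) description text on output; the parser then sees the original characters after unescaping.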
Data set description contains:
<oml:upload_date>0000-00-00 00:00:00</oml:upload_date>
In R this produces:
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
Discovered while unit testing all tasks.
In the JSON output of the data qualities, no type information for the columns is given when we query directly through the API / SQL.
Every column has an undefined type and every value is encoded as a string, even if it is a number.
Can this be corrected?
Currently we use a trick in R so we do not have to convert manually.
Here is the API call
"http://www.openml.org/api_query/?q=SELECT%20d.name%20AS%20dataset,%20MAX(IF(dq.quality='NumberOfFeatures',%20dq.value,%20NULL))%20AS%20NumberOfFeatures,MAX(IF(dq.quality='NumberOfInstances',%20dq.value,%20NULL))%20AS%20NumberOfInstances,MAX(IF(dq.quality='NumberOfClasses',%20dq.value,%20NULL))%20AS%20NumberOfClasses,MAX(IF(dq.quality='MajorityClassSize',%20dq.value,%20NULL))%20AS%20MajorityClassSize,MAX(IF(dq.quality='MinorityClassSize',%20dq.value,%20NULL))%20AS%20MinorityClassSize,MAX(IF(dq.quality='NumberOfInstancesWithMissingValues',%20dq.value,%20NULL))%20AS%20NumberOfInstancesWithMissingValues,MAX(IF(dq.quality='NumberOfMissingValues',%20dq.value,%20NULL))%20AS%20NumberOfMissingValues,MAX(IF(dq.quality='NumberOfNumericFeatures',%20dq.value,%20NULL))%20AS%20NumberOfNumericFeatures,MAX(IF(dq.quality='NumberOfSymbolicFeatures',%20dq.value,%20NULL))%20AS%20NumberOfSymbolicFeatures%20FROM%20dataset%20d,%20data_quality%20dq%20WHERE%20d.did%20=%20dq.data%20AND%20d.isOriginal%20=%20'true'%20GROUP%20BY%20dataset"
I want to see how many observations, features, types of features, NAs and so on are in a data set, so I can choose the correct sets for my study.
I also want to query that table in R to "compute" on it.
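Until the server emits typed JSON, clients have to coerce the string-encoded values themselves. A minimal sketch of that workaround (the payload below mimics the shape of the qualities output; it is not real server data):

```python
import json

# Every value arrives as a string, numeric or not:
raw = json.loads('{"NumberOfInstances": "150", "NumberOfFeatures": "5", '
                 '"MajorityClassSize": "50", "name": "iris"}')

def coerce(value):
    """Turn numeric-looking strings into int/float; leave the rest alone."""
    try:
        f = float(value)
    except (TypeError, ValueError):
        return value
    return int(f) if f.is_integer() else f

qualities = {k: coerce(v) for k, v in raw.items()}
print(qualities)
# {'NumberOfInstances': 150, 'NumberOfFeatures': 5, 'MajorityClassSize': 50, 'name': 'iris'}
```

If the API returned proper JSON numbers, this whole conversion step (and the R trick mentioned above) would be unnecessary.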
Every implementation needs a name, a version and a description, but there are many implementations that do not contain all of these (most have name = version = ""). I only checked the first few algorithms, however.
Additionally, the implementation "weka.AODE(1.8.2.3)" is not parseable.
Implementations can currently be uploaded in many different ways. While this makes it easier for users to upload implementations, it makes it harder for other users to download and use those implementations. Hence, it would be good to define an interface for uploaded implementations that is simple enough for uploaders to provide, and that will allow downloaders to easily run the algorithm. It also allows us to provide further services on OpenML, such as automatically running implementations on the server.
We won't enforce this interface, but suggest it as a 'best practice', and state it as a prerequisite for more advanced OpenML services. We should adhere to it for our own plugins and provide clear examples for users to look at.
As usual, in what follows an implementation can be a script, program, or workflow, depending on its environment.
The interface:
For the common case of running well-known library algorithms, an implementation will be a wrapper/adapter that handles the conversion from an OpenML task to the required inputs for the library algorithm and interprets its (intermediate) outputs to produce the expected outputs.
I believe it is also best that the implementation description lists the task_types that it supports. Bernd also previously suggested that implementations report which types of data they can/cannot handle.
Comments, please :)
Evaluation measures in ExpDB are CamelCase. Should become lower_case.
Hi,
Search - Datasets - select one - select a run / impl
If you click on "General information" of the implementation it would be nice to see the uploader displayed.
Yes, minor point for now.
This is probably also relevant for similar displays of other objects?
When thinking about uploading our first experiments, I noticed that sometimes I may not want to upload either a source file or a binary file.
This mainly concerns applying "standard methods" from libraries. E.g., when I apply the libsvm implementation in the R package e1071, I only need to know the package name and the version number. Uploading the package itself (in binary or source form) makes no sense, this is hosted on the official CRAN R package server.
I could upload a very short piece of code that uses this package and produces the desired predictions. Actually, there are a few more subtle questions involved here and it might be easier to discuss them briefly on Skype; I would like to hear your opinions on this.
The question basically is how much we want to enable users who download implementations to rerun the experiments in a convenient fashion.
The data set description seems to be wrong. E.g., it says there are 798 instances but the data set has 898 rows.
Here you can find the same inconsistencies:
http://mldata.org/repository/data/viewslug/datasets-uci-anneal/
(tabs "summary" vs. "data")
I think this is what Bernd meant when he said someone should check all the data sets. Actually, the correctness of the data characteristics is way more important than the description. Let's check it:
[...]
NumberOfInstancesWithMissingValues: 0
NumberOfMissingValues: 0
[...]
This is obviously wrong. I think we have to add a slot in the data set description for how missing values are signified. Also, the server should transform them into the desired representation (e.g., "NA") before computing the data qualities.
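The normalization step suggested above could look like this. A sketch with made-up rows, assuming the dataset description declares "?" as its missing-value token (as ARFF does):

```python
# Token declared in the (proposed) data set description slot:
MISSING_TOKEN = "?"

rows = [
    ["1.0", "?",   "red"],
    ["2.5", "3.1", "?"],
    ["0.7", "1.2", "blue"],
]

# Server-side normalization before computing data qualities:
normalized = [[None if v == MISSING_TOKEN else v for v in row] for row in rows]

NumberOfMissingValues = sum(v is None for row in normalized for v in row)
NumberOfInstancesWithMissingValues = sum(
    any(v is None for v in row) for row in normalized)
print(NumberOfMissingValues, NumberOfInstancesWithMissingValues)  # 2 2
```

Computing the qualities on the raw strings instead (where "?" looks like an ordinary value) is exactly how the counts end up wrongly reported as 0.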
http://openml.org/learn
Sharing a run
Both links in: (Response / XSD Schema)
Returned file: Response
The response file returned depends on the task type. For supervised classification, the API will also compute evaluation measures based on the uploaded predictions and return them as part of the response file. See the XSD Schema for details.
Currently, an uploaded result could be the result of running an implementation with default parameters, running an implementation that does internal parameter optimisation, running an implementation many times in a parameter sweep, or running an implementation with 'magically optimised' parameters.
When ranking implementations based on their evaluations, an unfair advantage will be given to parameter sweeps (data leakage).
Thus, it has been suggested that, during upload, users should flag the run with one of the following cases:
With the latter, a short notice should indicate that this optimization must have been done internally using only the training set(s).
I do think that, even with default parameter settings, the parameter settings should be uploaded with the run.
Comments, please :).
The way I understand it:
Impl ID = name + version (both user-chosen)
When uploading, the server tells me whether this combo is already in use and therefore not possible.
Could we please specify somewhere in the docs what chars are actually allowed for id and version? Do we really allow:
name = "Jörg's cool algorithm^2" ?
Some clarification on how we are reimplementing code uploads/checks:
There will be 2 API calls:
'implementation.upload' (exists)
This call has as a required argument POST description: an XML file containing the implementation meta data. Currently, this XML file contains a field 'version', but this was ignored at upload time. The reason for this was that we don't want to force the user to provide a version number. Therefore, the server would pick a version number (1,2,3,...).
However, it often makes sense for users to include some kind of versioning. For instance, if I maintain my code on GitHub I may want to add the version hash so I can revisit the code as it was at the time of upload.
Therefore, we will do the following. The description XML will have the following fields:
Plugins can decide freely how to handle this: if there is a good versioning system already, use that; if not, maybe take a hash of the source code. As long as changes to the code correspond to changes in the version number.
What will happen is that the server will store this info, and then associate a 'label' with each upload (1,2,3,...) linked one-to-one to the user version number/hash. This label is merely aesthetic: in the web interface, you will see both the upload counter and the user-defined version number/hash value. If no version number is given, the server will compute a hash based on the uploaded code. The library-name-version combo will be linked to a unique implementation id.
If you try to upload an implementation with the same library, name, and version, the server will say that there is already an implementation with those keys and return its id.
When you want to check what the id is of an already uploaded implementation, there will be the following api:
'implementation.getid' (or 'implementation.check')
arguments:
GET library_name
GET implementation_name
GET version (user-defined version number, e.g. the hash value)
Based on that info, the server will return the corresponding implementation id. If no match can be found, it will tell you that that implementation is unknown.
Sounds ok?
Cheers,
Joaquin
Client programmers want to (and should) check their parsers through unit tests with different examples of task.xml, dataset.xml, and so on.
Therefore, the server needs to provide examples of different complexity for each of these.
It is probably best to have the server provide them through the standard API calls and, for now, just tell the client programmers how to access them. We might reserve special IDs for these "testing calls" for now, e.g.
(???) task_id = 100001 to 100005 are examples to test tasks for now (???)
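A client-side parser test could then look like the sketch below. The fixture XML here is hypothetical and inlined; with reserved test IDs, the test would instead fetch the fixture from the server via the normal API call.

```python
import unittest
import xml.etree.ElementTree as ET

# Hypothetical minimal task.xml fixture (structure assumed, not taken
# from the real schema):
SAMPLE_TASK = """<oml:task xmlns:oml="http://openml.org/openml">
  <oml:task_id>100001</oml:task_id>
  <oml:task_type>Supervised Classification</oml:task_type>
</oml:task>"""

NS = {"oml": "http://openml.org/openml"}

class TaskParserTest(unittest.TestCase):
    def test_parse_task(self):
        root = ET.fromstring(SAMPLE_TASK)
        self.assertEqual(root.find("oml:task_id", NS).text, "100001")
        self.assertEqual(root.find("oml:task_type", NS).text,
                         "Supervised Classification")

if __name__ == "__main__":
    unittest.main()
```

The value of server-provided fixtures is that every client (R, WEKA, Java, ...) tests against the same canonical examples rather than hand-rolled ones.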
I have a problem uploading implementations to the server.
I downloaded weka.AODE(1.8.2.3) from the server, changed the name and version in the XML file, and tried to upload it again. This doesn't work yet; I always get this error:
"Problem validating uploaded description file
XML does not correspond to XSD schema."
The XML looks like this now:
<oml:implementation xmlns:oml="http://openml.org/openml">
<oml:name>testestest</oml:name><oml:version>1.0</oml:version><oml:description>test</oml:description></oml:implementation>
What is wrong?
We need to be able to at least specify:
Parameter name
Parameter data type
Bonus points (need not be done at once)
Simple constraints like box-constraints
My CSV file is always empty, no matter which query I run.
I tried:
select * from implementation
Hey,
I discovered a few data sets that have special characters (";", "?", ...) or spaces in some of their column names. Some names even start with a number, which is also not okay in R.
It would be great if the server could check for those problems and resolve them somehow.
Thanks in advance,
Dominik
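A server-side cleanup like the one requested above could be sketched as follows. This is an illustration, not the actual server logic; the replacement rules mirror what R's make.names does.

```python
import re

def sanitize(name):
    """Make a column name a valid R identifier (sketch)."""
    # Replace anything that is not a letter, digit, '_' or '.'
    # (';', '?', spaces, ...) with an underscore:
    cleaned = re.sub(r"[^A-Za-z0-9_.]", "_", name)
    # R identifiers may not start with a digit; prefix with 'X',
    # which is also R's make.names convention:
    if re.match(r"^[0-9]", cleaned):
        cleaned = "X" + cleaned
    return cleaned

for raw in ["petal width", "income;eur", "2ndFeature", "what?"]:
    print(raw, "->", sanitize(raw))
# petal width -> petal_width
# income;eur -> income_eur
# 2ndFeature -> X2ndFeature
# what? -> what_
```

One open design question is whether the server should rewrite names on upload or only expose a sanitized view, since renaming silently could surprise uploaders.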
I'm a bit confused. Task 1 used to be based on the Iris data set. Now it's annealing? Did you guys change the tasks? So,... the results that are provided by the new API call (http://www.openml.org/api/?f=openml.task.results&task_id=1) don't belong to the displayed data descriptions, right?
For a given task, I would like to get (on the client):
a) What runs / implementations are available?
b) What are their performance metric values?
c) The complete predictions. It would be sufficient to get them for just a selected implementation / run, because I could always loop through this.
Tried to list all tasks on openml.org
Search -> Tasks -> Supervised Classif
Hit Search to list all tasks: Server error
I then typed "iris" in "Datasets": Server error
It should be clear from an uploaded run how parameters were chosen. We previously agreed on the following three cases:
We should add a field/flag to report this, e.g.
parameter_setting_type = [manual, sweep, optimized]
In cases 1 and 2, the parameter settings should be uploaded with the run. This is already supported.
In case 3, the optimized parameters are fold/repeat specific, and should thus be added to the predictions file. This can simply be an extra column in the predictions arff file. I propose a simple key-value format, maybe json, that can then be stored as a string:
{"parameter_name_1":0.4, "parameter_name_2":123}
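Producing and consuming such a key-value cell is straightforward; a sketch using the parameter names from the example above:

```python
import json

# Fold/repeat-specific optimized parameters for one row of the
# predictions ARFF (names taken from the example above):
fold_params = {"parameter_name_1": 0.4, "parameter_name_2": 123}

# Serialize to a single string cell for the extra ARFF column:
cell = json.dumps(fold_params)
print(cell)  # {"parameter_name_1": 0.4, "parameter_name_2": 123}

# The server (or an evaluating client) reads it back per row:
restored = json.loads(cell)
assert restored["parameter_name_1"] == 0.4
```

One caveat for the ARFF side: the string would need proper quoting/escaping of the embedded double quotes, which standard ARFF writers handle.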
We can thus extend the classification/regression task with the following:
For at least 1-2 projects I would like to have larger data sets on OpenML.
So with more than 10K-50K observations.
Some are available here:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
http://mldata.org/
http://www.cs.ubc.ca/labs/beta/Projects/autoweka/datasets/
Issue: We might need to support another data format, especially w.r.t. sparse data.
There is HDF5.
There is also a converter:
http://mldata.org/about/hdf5/
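To illustrate the sparse-data concern: for large, mostly-zero data, dense storage wastes space, so formats like sparse ARFF and libsvm store only the non-zero entries. A minimal sketch of that idea with a made-up matrix:

```python
# Mostly-zero toy matrix:
dense = [
    [0, 0,   3.5, 0],
    [0, 1.0, 0,   0],
    [0, 0,   0,   0],
]

# Keep only the non-zero entries as (row, col, value) triplets - the
# coordinate representation underlying sparse ARFF / libsvm files:
triplets = [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0]
print(triplets)  # [(0, 2, 3.5), (1, 1, 1.0)]

# Reconstructing the dense matrix from the triplets:
n_rows, n_cols = len(dense), len(dense[0])
rebuilt = [[0] * n_cols for _ in range(n_rows)]
for i, j, v in triplets:
    rebuilt[i][j] = v
assert rebuilt == dense
```

For the data sets linked above, supporting such a format natively (or via the HDF5 converter) would avoid forcing dense ARFF on data that is 99% zeros.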
Can you look into the general data format issue server-side? Then we can upload some data sets.
This is more of a very general design question.
Would it make sense to have a general OpenML Java base library that contains all the common objects as Java classes and offers common functionality like downloading, parsing, and uploading?
This would make it very simple for the next person to connect another Java-based toolkit to OpenML.
Or do you guys already do that?
In some data sets there are factor variables that have only one level. Sometimes there are two or more levels but all examples belong to the same level. I'm not quite sure where we should fix this. For machine learning, such a factor is useless and might lead to errors. Either the server deletes those factors or we do it locally. What do you think is better?
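Detecting such degenerate factors is cheap on either side; a sketch with made-up columns:

```python
# Toy factor columns (made-up data):
columns = {
    "color": ["red", "red", "red"],        # one observed level -> useless
    "size":  ["small", "large", "small"],  # informative, keep
}

# A factor is degenerate if all observed values fall in a single level
# (this also covers declared-but-unused extra levels):
constant = [name for name, values in columns.items()
            if len(set(values)) <= 1]
print(constant)  # ['color']

kept = {name: v for name, v in columns.items() if name not in constant}
```

Doing this server-side would fix it once for all clients, but it changes the data set; doing it client-side keeps the data intact and leaves the decision to the learner.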
There are 2 implementation schemas.
a) https://raw.github.com/openml/OpenML/master/XML/Schemas/implementation_upload.xsd
b)
https://raw.github.com/openml/OpenML/master/XML/Schemas/implementation.xsd
I understand that a) is uploaded by the user, and b) is returned by the server when you request an implementation.
The problem is that they share about 90% of their XML fields, but the schemas are already inconsistent. Could they be made consistent?
Also we noticed this:
<xs:element name="version" minOccurs="0" type="xs:string"/>
minOccurs="0" is wrong, isn't it?
Things are coming together nicely, but there are also many new things planned. Bernd suggested we define what features should be in a 1.0 version, and finish that as soon as possible, making sure it works so that we can really start spreading the word.
I'm just making a list here, most of which is already done. Feel free to add/remove. Paraphrasing Linus Torvalds, 'suggestions are welcome, but we won't promise we'll implement them :-)'.
https://github.com/openml/OpenML
Then click:
Service: openml.authenticate
Service: openml.data.upload
There are probably a few more!