openml / openml-java

Java library to interface with OpenML

Java 100.00%

openml-java's Introduction

OpenML: Open Machine Learning

Welcome to the OpenML GitHub page! 🎉

Contents:

Who are we?
What is OpenML?
Get involved

Who are we?

We are a group of people who are excited about open science, open data and machine learning. We want to make machine learning and data analysis simple, accessible, collaborative and open with an optimal division of labour between computers and humans.

What is OpenML?

Want to learn about OpenML or get involved? Please do and get in touch in case of questions or comments! 📨

Getting started:
- Check out the OpenML Website to get a first impression of what OpenML is.
- The OpenML Documentation page gives a detailed introduction to OpenML's features, as well as to OpenML's different APIs and integrations, so that everyone can work with their favorite tool.
- How to contribute: https://github.com/openml/OpenML/blob/master/CONTRIBUTING.md
- Citation and Honor Code: https://www.openml.org/terms
- Communication / Contact: https://github.com/openml/OpenML/wiki/Communication-Channels

OpenML is an online machine learning platform for sharing and organizing data, machine learning algorithms and experiments. It is designed to create a frictionless, networked ecosystem that you can readily integrate into your existing processes, code and environments, allowing people all over the world to collaborate and build directly on each other’s latest ideas, data and results, irrespective of the tools and infrastructure they happen to use.

As an open science platform, OpenML provides important benefits for the science community and beyond.

Benefits for Science

Many sciences have made significant breakthroughs by adopting online tools that help organize, structure and analyze scientific data. Indeed, any shared idea, question, observation or tool may be noticed by someone who has just the right expertise to spark new ideas, answer open questions, reinterpret observations or reuse data and tools in unexpected new ways. Therefore, sharing research results and collaborating online as a (possibly cross-disciplinary) team enables scientists to quickly build on and extend the results of others, fostering new discoveries.

Moreover, ever larger studies become feasible as a lot of data are already available. Questions such as “Which hyperparameter is important to tune?”, “Which is the best known workflow for analyzing this data set?” or “Which data sets are similar in structure to my own?” can be answered in minutes by reusing prior experiments, instead of spending days setting up and running new experiments.

Benefits for Scientists

Scientists can also benefit personally from using OpenML. For example, they can save time, because OpenML assists in many routine and tedious duties: finding data sets, tasks, flows and prior results, setting up experiments and organizing all experiments for further analysis. Moreover, new experiments are immediately compared to the state of the art without always having to rerun other people’s experiments.

Another benefit is that linking one’s results to those of others has a large potential for new discoveries (see, for instance, Feurer et al. 2015; Post et al. 2016; Probst et al. 2017), leading to more publications and more collaboration with other scientists all over the world.

Finally, OpenML can help scientists to reinforce their reputation by making their work (published or not) visible to a wide group of people and by showing how often one’s data, code and experiments are downloaded or reused in the experiments of others.

Benefits for Society

OpenML also provides a useful learning and working environment for students, citizen scientists and practitioners. Students and citizen scientists can easily explore the state of the art and work together with top minds by contributing their own algorithms and experiments. Teachers can challenge their students by letting them compete on OpenML tasks or by reusing OpenML data in assignments. Finally, machine learning practitioners can explore and reuse the best solutions for specific analysis problems, interact with the scientific community or efficiently try out many possible approaches.

Get involved

OpenML has grown into quite a big project. We could use many more hands to help us out 🔧.

You want to contribute?: Awesome! Check out our wiki page on how to contribute or get in touch. There may be unexpected ways you could help. We are open to any ideas.
You want to support us financially?: YES! Getting funding through conventional channels is very competitive, and we are happy about every small contribution. Please send an email to [email protected]!

GitHub organization structure

OpenML's code is distributed over different repositories to simplify development. Please see their individual readmes and issue trackers if you would like to contribute. These are the most important ones:

openml/OpenML: The OpenML web application, including the REST API.
openml/openml-python: The Python API, to talk to OpenML from Python scripts (including scikit-learn).
openml/openml-r: The R API, to talk to OpenML from R scripts (including mlr).
openml/java: The Java API, to talk to OpenML from Java scripts.
openml/openml-weka: The WEKA plugin, to talk to OpenML from the WEKA toolbox.

openml-java's Issues

Improving dependencies in flows

From @DraXus on November 26, 2014 12:17

Currently, flow dependencies are limited to the software version. However, some flows also require additional dependencies. For example, flow 191 also requires installing multiBoostAB from the package manager.

Copied from original issue: openml/OpenML#161

ignore_attribute in DataSetDescription null.

The value of ignore_attribute in org.openml.apiconnector.xml.DataSetDescription is always null, because the XML parser looks for an item named "ignore_attribute" instead of a tag named "oml:ignore_attribute".
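The effect of looking up the wrong name can be illustrated with a minimal sketch that uses only the JDK's DOM parser rather than the library's actual XStream mapping. The XML snippet, the namespace URI and the class name TagLookupSketch are assumptions made for illustration, not the real OpenML response or API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class TagLookupSketch {
    // Hypothetical, trimmed-down document; a real response contains many more elements.
    static final String XML =
        "<oml:data_set_description xmlns:oml=\"http://openml.org/openml\">"
      + "<oml:ignore_attribute>id</oml:ignore_attribute>"
      + "</oml:data_set_description>";

    // Returns the text of the first element with the given tag name, or null if none exists.
    static String firstText(String xml, String tagName) {
        try {
            // The default factory is not namespace-aware, so tag names are
            // matched literally, prefix included.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(XML, "ignore_attribute"));     // unprefixed name: no match
        System.out.println(firstText(XML, "oml:ignore_attribute")); // prefixed name: match
    }
}
```

The unprefixed lookup finds nothing, which mirrors the always-null field described in this issue; the actual fix in the library would be made in its XStream alias configuration.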

apiconnector lib in Weka project differs from the actual version used

From @DraXus on December 6, 2014 12:18

I was getting some errors when compiling the Weka package. Apparently, the apiconnector.jar file provided in the lib folder is a different version than the one used in the project.

I've generated a new jar file from the current code and it is working fine so far. I can update the jar file in the repository, although I'm not sure that is the best way to maintain compatibility between versions. What do you think?

Copied from original issue: openml/OpenML#163

Handle potential empty tags

Data qualities can have empty tags, and XStream doesn't like those.

Broken method

org.openml.apiconnector.xml.RunList.getRuns() returns null even if the list is well formed and contains runs.

Changing "runs" to "run" at lines 50 & 53 seems to solve the issue. Just an XStream typo?

Test Data qualities interval

Unit test missing since the function is not supported by backend atm

Add static code analysis to CI

Sonarcloud.io hosts SonarQube, which is a static code analysis tool.

Using a plugin for Travis and Maven, we could integrate our projects with sonarcloud.io to benefit from the code analysis. Sonarcloud.io helps improve overall code quality and find bugs (e.g. leaking resources), security-critical issues, and code duplicates. Furthermore, it can give summaries of test coverage.

It supports various languages, including Python, R, and Java, and can easily be set up together with Travis. I have already tried configuring it for a fork. To get a feel for it, have a look at this report: https://sonarcloud.io/dashboard?id=org.openml%3Aapiconnector

In GitHub, one can also set quality gates to prevent "bad" code from getting into the dev/master branch.

Null value for number_missing_values

When using the data features, while passing through each feature, I found a null value for number_missing_values. The following code can replicate the problem:

// Connect with caching disabled so the features are fetched from the server.
OpenmlConnector connector = new OpenmlConnector("https://www.openml.org/", "9ed41f60b87fbe17054397936b96212d");
Settings.CACHE_ALLOWED = false;
DataFeature dataFeatures = connector.dataFeatures(2);
for (Feature feature : dataFeatures.getFeatures()) {
	// Throws on the first feature whose number_missing_values is null.
	if (feature.getNumberOfMissingValues() != null && feature.getNumberOfMissingValues() instanceof Integer) {
		continue;
	} else {
		throw new IllegalArgumentException();
	}
}

Names of functions inconsistent and against conventions

OpenmlConnector:
Almost all methods start with openMl. This is generally not recommended, as it should be clear from the object type itself what is being operated on.
It would be like a class Person having methods like getPersonName.

Methods also usually start with a noun and then a verb,
e.g. dataUpload. This could lead one to think that a precooked data upload object is being created (especially when the return type is called DataUpload).
It is usually recommended to start with the verb, e.g. uploadData.
That would also be consistent with some other methods in the class that start with a verb.

There are multiple ways to go: we could break the API, or we could still support the old API and mark the old methods as obsolete.
Ref: http://www.iwombat.com/standards/JavaStyleGuide.html
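The second option, keeping the old names while steering callers to the new ones, could look roughly like the sketch below. The class ConnectorNamingSketch and both method signatures are hypothetical stand-ins; the real OpenmlConnector methods take different arguments and return library types.

```java
class ConnectorNamingSketch {
    // New verb-first name, as recommended above (hypothetical signature).
    String uploadData(String description) {
        return "uploaded:" + description;
    }

    /**
     * Old noun-first name, kept so existing callers still compile.
     *
     * @deprecated use {@link #uploadData(String)} instead.
     */
    @Deprecated
    String dataUpload(String description) {
        // Delegate so both names share one implementation.
        return uploadData(description);
    }
}
```

Callers of dataUpload then get a compiler deprecation warning but no behavior change, which leaves room to remove the old names in a later major release.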

Weka page for RapidMiner extension

From @DraXus on December 6, 2014 15:47

Add a roadmap and further documentation to RapidMiner extension wiki page: https://github.com/openml/OpenML/wiki/RapidMiner-extension

Copied from original issue: openml/OpenML#165

OpenMl connector tightly coupled

OpenMlConnector:
It relies heavily on the method HttpConnector.doApiRequest, which is static.
This is hard to test and mock.
IMHO it would be better to call an interface, and maybe even use dependency injection.
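A minimal sketch of that idea, assuming a hypothetical ApiTransport interface and RunClient class (neither exists in the library, and the "run/" path is illustrative only): the client receives its transport through the constructor, so a test can pass in a fake instead of making real HTTP calls.

```java
// Hypothetical abstraction over the HTTP layer.
interface ApiTransport {
    String get(String path);
}

// Hypothetical client that depends on the interface rather than a static method.
class RunClient {
    private final ApiTransport transport;

    RunClient(ApiTransport transport) {
        this.transport = transport;
    }

    // Builds the request path and delegates to whichever transport was injected.
    String fetchRunXml(int runId) {
        return transport.get("run/" + runId);
    }
}

class RunClientDemo {
    public static void main(String[] args) {
        // In a test, a lambda stands in for the real HTTP transport.
        RunClient client = new RunClient(path -> "<fake path='" + path + "'/>");
        System.out.println(client.fetchRunXml(7));
    }
}
```

Production code would inject an implementation that wraps the existing HTTP logic, so the change can be introduced without altering what is sent over the wire.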

Rename to openml-java

Would it make sense to rename this repo to 'openml-java' to make it consistent with the others?

Display links to OpenML when using the WEKA plugin

Student remark: it is annoying that, when you run an experiment in WEKA, you have to search for it on the website. Would it be possible to have a backlink to OpenML? E.g. in the WEKA Run window, when a run finishes, we could display a link to the run (http://www.openml.org/r/123456).
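Building such a link only needs the run id returned after upload. A small sketch, assuming the URL pattern quoted above; the class and method names are hypothetical and not part of the WEKA plugin.

```java
class RunLinkSketch {
    // Builds the web link for a finished run, following the pattern http://www.openml.org/r/<id>.
    static String runUrl(int runId) {
        if (runId <= 0) {
            throw new IllegalArgumentException("run id must be positive");
        }
        return "http://www.openml.org/r/" + runId;
    }

    public static void main(String[] args) {
        System.out.println(runUrl(123456));
    }
}
```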

Improve caching mechanism (consistent with Python)

also unit test caching

Error while connecting using API

Hi everyone.
I'm trying to download a dataset from my Java application, following the Java API documentation, but I'm getting an error. Can someone please figure out why it's throwing an error?

This is the error.

ELKI OpenML add-on for clustering and anomaly detection

ELKI is an open-source Java tool for clustering and outlier detection.

You could also use the ELKI cluster evaluation functionality for ranking results.

OpenMlConnector - some apis not referenced

Missing: getRun by id.
Commented out: openmlData(), which gets lists of datasets. Not working; a change to the XML mapper is needed.

Investigate whether these are all the missing API calls.

[MOA] NullPointerException using OpenmlTaskEvaluator

From @DraXus on February 15, 2015 18:20

I got the following error when running tasks in MOA (latest version from the OpenML website).

I tried different tasks and configurations:
openml.OpenmlDataStreamClassification -t 2177 -e openml.OpenmlTaskEvaluator
openml.OpenmlDataStreamClassification -l functions.NoChange -t 2172 -e openml.OpenmlTaskEvaluator

Failure reason: null
*** STACK TRACE ***java.lang.NullPointerException
    at java.util.Arrays$ArrayList.<init>(Arrays.java:2842)
    at java.util.Arrays.asList(Arrays.java:2828)
    at moa.evaluation.LearningEvaluation.<init>(LearningEvaluation.java:53)
    at moa.tasks.openml.OpenmlDataStreamClassification.doMainTask(OpenmlDataStreamClassification.java:175)
    at moa.tasks.MainTask.doTaskImpl(MainTask.java:50)
    at moa.tasks.AbstractTask.doTask(AbstractTask.java:57)
    at moa.tasks.TaskThread.run(TaskThread.java:76)

In addition, the console log output looks fine without errors:

[15-02-2015 18:11:27] [OK] [Authenticate] Authentication successfull. 
[15-02-2015 18:11:28] [INFO] [ARFF Cache] Stored dataset dataset_4_labor.arff to cache. 
[15-02-2015 18:11:28] [OK] [Download] Obtained Stream Header.

However, it works if BasicClassificationPerformanceEvaluator is selected instead.

Copied from original issue: openml/OpenML#173
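The stack trace shows the exception being raised inside Arrays.asList, which happens when the array reference passed to it is null. Which value was null in MOA cannot be determined from this report; the sketch below only reproduces the mechanism and shows one defensive alternative, using hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NullArraySketch {
    // Reproduces the mechanism: Arrays.asList rejects a null array reference.
    static boolean throwsOnNullArray() {
        String[] measurements = null; // stand-in for whatever array was null
        try {
            Arrays.asList(measurements);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Defensive variant: treat a missing array as an empty list.
    static List<String> asListOrEmpty(String[] values) {
        return values == null ? Collections.<String>emptyList() : Arrays.asList(values);
    }

    public static void main(String[] args) {
        System.out.println(throwsOnNullArray());
        System.out.println(asListOrEmpty(null));
    }
}
```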

[Weka] IndexOutOfBoundsException when obtaining folds

From @DraXus on February 14, 2015 15:17

The following error is shown when trying to run task 17 in Weka 3.7.12 using Naive Bayes as classifier.

[14-02-2015 13:41:27] [INFO] [ARFF Cache] Stored dataset 17 to cache.
[14-02-2015 13:41:27] [INFO] [ARFF Cache] Stored splits 17 to cache.
[14-02-2015 13:41:27] [INFO] [Splits] Obtaining folds for Task 17 (bridges) with weka.classifiers.bayes.NaiveBayes - Repeat 0
java.lang.IndexOutOfBoundsException: Index: 107, Size: 107
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at weka.core.Instances.instance(Instances.java:768)
    at org.openml.weka.experiment.TaskResultProducer.doRun(TaskResultProducer.java:248)
    at org.openml.weka.experiment.TaskBasedExperiment.nextIteration(TaskBasedExperiment.java:173)
    at org.openml.weka.gui.OpenmlRunPanel$ExperimentRunner.run(OpenmlRunPanel.java:197)

Copied from original issue: openml/OpenML#172

Using OpenML Java apiconnector in Matlab

I tried to use Java apiconnector in Matlab R2012a but got the following error:

Java exception occurred:
java.lang.NoSuchMethodError:
com.thoughtworks.xstream.io.xml.DomDriver.<init>(Ljava/lang/String;Lcom/thoughtworks/xstream/io/naming/NameCoder;)V
 at org.openml.apiconnector.xstream.XstreamXmlMapping.getInstance(XstreamXmlMapping.java:63)
 at org.openml.apiconnector.io.HttpConnector.doApiRequest(HttpConnector.java:30)
 at org.openml.apiconnector.io.ApiSessionHash.openmlAuthenticate(ApiSessionHash.java:162)
 at org.openml.apiconnector.io.ApiSessionHash.update(ApiSessionHash.java:84)
 at org.openml.apiconnector.io.ApiSessionHash.set(ApiSessionHash.java:71)
 at org.openml.apiconnector.io.OpenmlConnector.<init>(OpenmlConnector.java:89)

This problem occurs because an older version of xstream is loaded by Matlab in the static java class path and therefore it's not using the xstream library provided in the dynamic java class path. I've been googling and couldn't find any elegant solution. Apparently, the only workaround would be to replace the xstream.jar file in the "jarext" Matlab folder with the new version, but that could lead to internal Matlab problems.

So at the moment it's not possible to use the Java apiconnector in Matlab, but I would like to leave this comment here for future reference.
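When diagnosing this kind of class path conflict, it can help to ask the JVM where a class was actually loaded from. The following is a sketch using only standard JDK calls; the class name ClassOriginSketch is hypothetical. In the Matlab case one would pass the XStream class instead of String.class, which is not done here so that the example runs without the dependency.

```java
import java.security.CodeSource;

class ClassOriginSketch {
    // Returns the location (typically a jar or directory URL) a class was loaded from.
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // JDK core classes have no code source, and some loaders report no location.
        if (source == null || source.getLocation() == null) {
            return "unknown (bootstrap or in-memory class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Replace String.class with the conflicting library class to see which jar wins.
        System.out.println(originOf(String.class));
    }
}
```

If the printed location points into Matlab's own jar folder rather than the jar added to the dynamic class path, that supports the explanation given above.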

Java docs

It seems that (outdated) Javadocs are currently hosted directly on the webserver:
https://www.openml.org/docs/

Wouldn't it be better to somehow host these on Maven Central and link to them there, if possible? That is where a java-doc.jar is available.

Incomplete Stats in OpenML Features using Java API

Using the latest Java API (ver. 1.0.13 from Maven), we are facing an issue with the dataFeatures class methods that return statistics about the features in a dataset. Whenever we call one of these methods (e.g. getNumberOfDistinctValues()), we get a null value. This happens, for example, with dataset_id = 967 or 21.

Are those methods to retrieve such statistics about features fully implemented in the current version of the API or are they still under development and shouldn't be used?

Rename repo to openml-java?

Similar to openml-python and openml-r?