openml-data's Issues

tecator dataset has three targets: OpenML version has two targets added as predictors

Hi There,

The tecator dataset is part of OpenML-Reg19, a (work in progress) suite of Regression datasets.

The dataset on OpenML has the fat variable as target. It turns out that moisture and protein are included in the dataset as predictors, which otherwise only contains absorbances from a spectrometer. I found that moisture and protein are highly predictive of fat; there is no need to include the absorbances at all for optimal prediction.

Curious, I checked the documentation of the dataset. It turns out that, as used in the literature, this dataset contains three targets for prediction, with the idea of using only the absorbances as predictors.

The original publication for this dataset is here (behind a paywall).

https://pubs.acs.org/doi/pdf/10.1021/ac00029a018

I checked, and there fat was predicted using only the absorbances.

So, to be able to compare with the published literature for this dataset, it makes sense to leave moisture and protein out of the predictors.

Any thoughts on how to incorporate this in the OpenML framework? Can we remove the two other targets from tecator? Or would this make it a new dataset? But if every subset of variables of a dataset must be added to OpenML as a new dataset a lot of duplication would occur, right?

PS: here is the summary documentation for caret (https://rdrr.io/cran/caret/man/tecator.html), where tecator is also included:

"For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry." 
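The literature setup described above can be sketched in pandas; the column names and values below are illustrative, not the actual OpenML schema:

```python
import pandas as pd

# Hypothetical sketch: to match the published setup, keep only the absorbance
# channels as predictors for 'fat', dropping the two other targets
# ('moisture', 'protein') before modelling. Data here is made up.
df = pd.DataFrame({
    "absorbance_1": [2.6, 2.8, 3.1],
    "absorbance_2": [2.7, 2.9, 3.2],
    "moisture": [60.0, 46.0, 71.0],
    "protein": [16.0, 14.0, 20.0],
    "fat": [22.5, 40.1, 8.4],
})
X = df.drop(columns=["moisture", "protein", "fat"])  # absorbances only
y = df["fat"]
```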

Regards,
Gertjan

Migration ARFF to Parquet on the OpenML server

This is a centralised discussion about the server-side changes (being) made to the datasets in their conversion from ARFF to Parquet. Related ongoing discussions that reference the server state of different datasets:

Let's keep the relevant information about the migration as it relates to server data in this thread.
This is not for connector specific discussions (for example, how openml-python handles this).
@joaquinvanschoren @prabhant @sebffischer

Standard datasets for benchmarking regression

Are there any other known curated subsets of benchmarking datasets besides the study_14 datasets? Those only contain classification tasks, but I would also like to have datasets for regression.

Add more datasets from kaggle?

There are many interesting datasets on Kaggle (in the datasets section, not the competitions):
https://www.kaggle.com/datasets

Unfortunately, most of these don't qualify for CC-18 because they are missing a publication. But they are quite interesting, and I think we need more interesting datasets.

Attribute description mistakes

The following datasets have suspicious attribute types:

  • 298 - several attributes should be nominal instead of numerical.
  • 345 - there should be 3 numeric attributes according to the description, but none of the available attributes is numeric
  • 504 - how can data from a book called Analyzing Categorical Data be all continuous?
  • 532 - same here...
  • 458 - same here... at least the Book ID should be categorical
  • 516 - day should be numerical, not categorical.
  • 1169 - flight number should not be numerical, but categorical.
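Fixing a wrongly numeric column on the user side is a one-liner in pandas; a minimal sketch with a made-up column, assuming a flight-number case like dataset 1169:

```python
import pandas as pd

# Hypothetical sketch: re-declare a flight number stored as numeric as
# categorical, so models don't treat it as an ordered quantity.
# Column name and values are illustrative.
df = pd.DataFrame({"flight_number": [101, 202, 101, 303]})
df["flight_number"] = df["flight_number"].astype("category")
```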

Import the outlier detection benchmark results from http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/

http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/

This is a repository of outlier detection benchmark data and results.

Every data set comes with a downloadable "raw algorithm results" package containing the results of a few hundred (algorithm, parameter) combinations on these data sets, and there is a separate file with generated evaluation results, too. Alternatively, you could also import only the best-of results.

As mentioned in openml/openml-java#6, it would also be nice to have a "submit to OpenML" function in ELKI; on the other hand, OpenML could use ELKI for evaluating outlier and clustering results (ELKI has 19 supervised evaluation measures for clustering and 9 internal evaluation measures, with 3 different strategies for handling noise objects. For outlier evaluation, it has 4 measures plus adjustment for chance, which yields 7 interesting measures in total). Except for the internal cluster evaluation measures (which may need O(n^2) memory and pairwise distances), they are all very fast to compute.

I don't have the capacity right now to do the integration myself, but I can assist, e.g., with adapting the scripts used to generate the above results. Or we could simply transfer the data as ASCII for submission?
From the API documentation, I do not understand how to format result data for submission. Are arbitrary file types allowed, or only ARFF? How are evaluation results uploaded?
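To make the evaluation side concrete, here is a minimal sketch of one such measure: ROC AUC over outlier scores, with a simple adjustment for chance that rescales a random scorer's expected 0.5 to 0. The labels and scores are made up; this is not ELKI's implementation:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical example data: 1 = outlier (ground truth), higher score =
# more outlying according to some detector.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.9, 0.3, 0.8]

auc = roc_auc_score(labels, scores)
adjusted = (auc - 0.5) / 0.5  # adjustment for chance: random scorer -> 0
```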

Fairness Data Sources

Got these fairness data sources recommended by a colleague. Opening this issue so that at some point we can:

  1. upload these datasets
  2. create a specific fairness task

fairness_survey.pdf
Section 5 describes the datasets

Dataset 40978: should have missing values.

Description of the dataset states (highlight is mine):

There are : 3 continuous attributes. The others are binary. This is the "STANDARD encoding" mentioned in the [Kushmerick, 99] (see below). One or more of the three continuous features are missing in 28% of the instances. Missing values should be interpreted as "unknown".

However, the dataset on OpenML does not have missing values (as seen in the "Qualities").

The original dataset as hosted by UCI has missing values indicated by "?". In the OpenML dataset, the corresponding cells are 0, instead.

Note the dataset is tagged as OpenML-CC18.
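A minimal repair sketch, assuming one re-parses the UCI source: treating "?" as NaN (instead of letting it silently become 0) preserves the ~28% missingness the description promises. The column name and values below are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: parse the "?" missing-value marker from the UCI file
# as NaN rather than 0.
raw = pd.DataFrame({"f1": ["1.2", "?", "3.4", "?"]})
fixed = raw["f1"].replace("?", np.nan).astype(float)
```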

ValueError on retrieving Penguins data

When I try to get the Penguins dataset using the Python API (openml.datasets.get_dataset(dataset_id=42585)), I get a ValueError originating from pandas, because there are some missing values.

In the scikit-learn API (sklearn.datasets.fetch_openml) it is possible to use the as_frame argument to control whether pandas is used. I'm not sure whether I've just missed it, but I couldn't find a similar option in the openml Python API.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-42cadb8ab205> in <module>
----> 1 dataset = get_dataset(dataset_id=42585)

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/functions.py in get_dataset(dataset_id, download_data, version, error_if_multiple)
    527                                      did_cache_dir)
    528 
--> 529     dataset = _create_dataset_from_description(
    530         description, features, qualities, arff_file
    531     )

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/functions.py in _create_dataset_from_description(description, features, qualities, arff_file)
    995         Dataset object from dict and ARFF.
    996     """
--> 997     return OpenMLDataset(
    998         description["oml:name"],
    999         description.get("oml:description"),

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in __init__(self, name, description, format, data_format, dataset_id, version, creator, contributor, collection_date, upload_date, language, licence, url, default_target_attribute, row_id_attribute, ignore_attribute, version_label, citation, tag, visibility, original_data_url, paper_url, update_comment, md5_checksum, data_file, features, qualities, dataset)
    181 
    182         if data_file is not None:
--> 183             self.data_pickle_file = self._create_pickle_in_cache(data_file)
    184         else:
    185             self.data_pickle_file = None

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in _create_pickle_in_cache(self, data_file)
    421         # At this point either the pickle file does not exist, or it had outdated formatting.
    422         # We parse the data from arff again and populate the cache with a recent pickle file.
--> 423         X, categorical, attribute_names = self._parse_data_from_arff(data_file)
    424 
    425         with open(data_pickle_file, "wb") as fh:

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in _parse_data_from_arff(self, arff_file_path)
    387                 if attribute_dtype[column_name] in ('categorical',
    388                                                     'boolean'):
--> 389                     col.append(self._unpack_categories(
    390                         X[column_name], categories_names[column_name]))
    391                 else:

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in _unpack_categories(series, categories)
    531         # We require two lines to create a series of categories as detailed here:
    532         # https://pandas.pydata.org/pandas-docs/version/0.24/user_guide/categorical.html#series-creation  # noqa E501
--> 533         raw_cat = pd.Categorical(col, ordered=True, categories=categories)
    534         return pd.Series(raw_cat, index=series.index, name=series.name)
    535 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    314     ):
    315 
--> 316         dtype = CategoricalDtype._from_values_or_dtype(
    317             values, categories, ordered, dtype
    318         )

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in _from_values_or_dtype(cls, values, categories, ordered, dtype)
    328             # Note: This could potentially have categories=None and
    329             # ordered=None.
--> 330             dtype = CategoricalDtype(categories, ordered)
    331 
    332         return dtype

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in __init__(self, categories, ordered)
    220 
    221     def __init__(self, categories=None, ordered: Ordered = False):
--> 222         self._finalize(categories, ordered, fastpath=False)
    223 
    224     @classmethod

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in _finalize(self, categories, ordered, fastpath)
    367 
    368         if categories is not None:
--> 369             categories = self.validate_categories(categories, fastpath=fastpath)
    370 
    371         self._categories = categories

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
    541 
    542             if categories.hasnans:
--> 543                 raise ValueError("Categorial categories cannot be null")
    544 
    545             if not categories.is_unique:

ValueError: Categorial categories cannot be null

No dataset qualities in a few OpenML datasets

Error while fetching these datasets:

>>> openml.dataset.get_dataset(202)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'openml' has no attribute 'dataset'
>>> openml.datasets.get_dataset(202)
Traceback (most recent call last):
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 1113, in _get_dataset_qualities_file
    with io.open(qualities_file, encoding="utf8") as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/home/prabhant/.cache/openml/org/openml/www/datasets/202/qualities.xml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 438, in get_dataset
    raise e
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 416, in get_dataset
    qualities_file = _get_dataset_qualities_file(did_cache_dir, dataset_id)
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 1117, in _get_dataset_qualities_file
    qualities_xml = openml._api_calls._perform_api_call(url_extension, "get")
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 65, in _perform_api_call
    response = __read_url(url, request_method, data)
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 204, in __read_url
    request_method=request_method, url=url, data=data, md5_checksum=md5_checksum
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 235, in _send_request
    __check_response(response=response, url=url, file_elements=files)
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 273, in __check_response
    raise __parse_server_exception(response, url, file_elements=file_elements)
openml.exceptions.OpenMLServerException: https://www.openml.org/api/v1/xml/data/qualities/202 returned code 362: No qualities found - None

FOREX_eurrub-hour-Close fails to load

Trying to load this dataset using the Python API (openml==0.10.1), I get the following error:

  File ".../site-packages/openml/datasets/dataset.py", line 574, in get_data
    data, categorical, attribute_names = self._load_data()
  File ".../site-packages/openml/datasets/dataset.py", line 438, in _load_data
    self.data_pickle_file = self._create_pickle_in_cache(self.data_file)
  File ".../site-packages/openml/datasets/dataset.py", line 421, in _create_pickle_in_cache
    X, categorical, attribute_names = self._parse_data_from_arff(data_file)
  File ".../site-packages/openml/datasets/dataset.py", line 314, in _parse_data_from_arff
    data = self._get_arff(self.format)
  File ".../site-packages/openml/datasets/dataset.py", line 293, in _get_arff
    return decode_arff(fh)
  File ".../site-packages/openml/datasets/dataset.py", line 286, in decode_arff
    return_type=return_type)
  File ".../site-packages/arff.py", line 895, in decode
    raise e
  File ".../site-packages/arff.py", line 892, in decode
    matrix_type=return_type)
  File ".../site-packages/arff.py", line 822, in _decode
    attr = self._decode_attribute(row)
  File ".../site-packages/arff.py", line 764, in _decode_attribute
    raise BadAttributeType()
arff.BadAttributeType: Bad @ATTRIBUTE type, at line 2.

Python 3.6 on Linux.

thyroid datasets don't match with literature

I'm trying to understand the UCI thyroid dataset(s?):
https://www.openml.org/search?q=thyroid&type=data

I can't reconcile the data on OpenML with the data on UCI and the description. thyroid should have 6832 or 5473 samples depending on the version; none of the datasets on OpenML has that.
There is a standard split into training and test sets, and it looks like OpenML only has the training sets, but I'm not sure.
This dataset pretty explicitly gives the number of training samples, so it is missing the test set and the correct split:
https://www.openml.org/d/40497

Several potentially undeclared ID columns in datasets

Hi, I found several more potentially undeclared ID columns in a few datasets:

  • 275: openml_task_id - also, should it really be a numeric attribute?
  • 276: openml_task_id - also, should it really be a numeric attribute?
  • 277: openml_task_id - also, should it really be a numeric attribute?
  • 278: openml_task_id - also, should it really be a numeric attribute?
  • 372: who is most likely an identifier, also, age most likely should be numerical with the value not_say being encoded as missing?
  • 458: BookID
  • 481: hospital_identification_number probably shouldn't be numeric?
  • 524: case_number
  • 565: I am not really sure here, but there are four suspicious attributes here with a very high number of different categories
  • 674: col1
  • 694: col3
  • 695: col1
  • 691: col1
  • 692: col1
  • 693: col1
  • 815: col1
  • 817: col1
  • 818: col3
  • 825: OBS
  • 820: col1
  • 857: RUN
  • 880: LABEL
  • 874: col1
  • 897: FICE is deactivated in version 1
  • 890: PERIOD is deactivated in version 1
  • 885: different target than 544 - obs looks suspiciously like an ID
  • 930: FICE is deactivated in version 1
  • 939: col7
  • 940: same as 565
  • 967: name an ID? - has version 1 twice
  • 987: has version 1 twice - Counter should probably be ignored (as in version 1)
  • 1044: lineNo
  • 1076: recordnumber
  • 1117: project
  • 1115: ID
  • 1217 & 1220: query_id and user_id are IDs, but they are probably useful here? - actually, almost everything in all click prediction datasets is an ID - and they should probably be used to look up information in some other file which is not on OpenML.
  • 1467: V2 looks like an ID? V3 should be in [0, 1]
  • 1483: v3 is a timestamp
  • 1559: weird dataset, as many attributes have 105 values (for 106 samples)
  • 4531: ID
  • 4541: encounter_id seems to be an ID, not sure about patient_id
  • 4545: description says not to use URL and timedelta
  • 4550: MouseID is labeled as a row_id in the frontend, but not in the XML
  • 4552: there are IDs not marked as IDs
  • 4545: URL is a weird attribute - maybe it should rather be a string?
  • 4533: first attribute has quite a lot of values (~63,000 nominal for 65,000 samples)
  • 5587: IDs are used contrary to https://www.kaggle.com/c/comet-track-recognition-mlhep-2015/data; same for 23394, 23395, 23396, 23397
  • 23394: should be v2 of the dataset, should have a target? contains a lot of IDs?
  • 23396: same here
  • 6331: ID
  • 40665: conformation_name seems to be an ID, molucule_name is suspicious - but at least neither should be numeric, right?
  • 40666: same here
  • 4138: there's a numeric feature called task_id, but not sure what it's good for in a clustering dataset
  • 40698: sample_code_number
  • 40701: should phone_number really be numerical?
  • 40869: has no target, has a column called ID
  • 23383: sensor dataset, has a timestamp and is not complete

Having seen some more datasets with a timestamp, I wonder whether it would be a good idea to actually introduce the concept of a timestamp in the database. Then we'd easily see that those datasets shouldn't simply be used for regular classification, and @janvanrijn could use them for his work on data streams.

It would be good if someone could double-check these datasets to make sure that what I identified as IDs are actually IDs.
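For double-checking, a simple heuristic helps: a column whose values are (almost) unique per row is a likely undeclared ID. A minimal sketch with made-up data and an arbitrary 95% uniqueness threshold:

```python
import pandas as pd

# Hypothetical heuristic: flag columns where nearly every row has a distinct
# value. Column names, data, and the 0.95 threshold are illustrative.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "x": [0.1, 0.1, 0.5, 0.5],
})
suspects = [c for c in df.columns if df[c].nunique() >= 0.95 * len(df)]
```

This catches exact IDs but not hash-like or composite identifiers, so a manual pass over the flagged columns is still needed.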

OpenML Sparse Dataset support

This issue tracks the progress of sparse dataset support on the OpenML MinIO backend.
Currently, MinIO does not have the OpenML sparse datasets because pandas can't write sparse data to Parquet by default.
Example:

did = 42379
d = openml.datasets.get_dataset(did, download_qualities=False)
df, *_ = d.get_data(dataset_format="dataframe", include_row_id=True, include_ignore_attribute=True)
df.to_parquet(f'dataset_{d.id}.pq')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-42ca2d7c4839> in <module>
      7                                       target=d.default_target_attribute)
      8     df = pd.concat([X,y], axis=1)
----> 9     df.to_parquet(f'dataset_{d.id}.pq')
     10     client.make_bucket(f"dataset{did}")
     11     client.fput_object(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2453         from pandas.io.parquet import to_parquet
   2454 
-> 2455         return to_parquet(
   2456             self,
   2457             path,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    388     path_or_buf: FilePathOrBuffer = io.BytesIO() if path is None else path
    389 
--> 390     impl.write(
    391         df,
    392         path_or_buf,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    150             from_pandas_kwargs["preserve_index"] = index
    151 
--> 152         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    153 
    154         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551      index_columns,
    552      columns_to_convert,
--> 553      convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
    554                                                columns)
    555 

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _get_columns_to_convert(df, schema, preserve_index, columns)
    357 
    358         if _pandas_api.is_sparse(col):
--> 359             raise TypeError(
    360                 "Sparse pandas data (column {}) not supported.".format(name))
    361 

TypeError: Sparse pandas data (column FCFP6_1024_0) not supported.
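One possible workaround sketch, assuming densification is acceptable memory-wise: pyarrow rejects pandas sparse extension dtypes, so converting sparse columns to dense before writing avoids the TypeError. The data below is made up; only the column name is taken from the traceback:

```python
import pandas as pd

# Hypothetical workaround: densify sparse columns before to_parquet.
df = pd.DataFrame({"FCFP6_1024_0": pd.arrays.SparseArray([0, 0, 1, 0])})
dense = df.sparse.to_dense()  # .sparse accessor requires all columns sparse
# dense.to_parquet("dataset.pq")  # would now succeed (pyarrow required)
```

The downside is that this loses the storage benefit of sparsity on disk, which is presumably why a proper server-side solution is being tracked here.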

Wrongly annotated target / ignore columns

Overview of datasets with known issues regarding special column types:

Nominal target but indicated as numerical. This should probably be checked automatically.
Also check whether any 'wrong' tasks were created on these.

Missing target column:

Bad row_id / ignore columns:

Data set 301 incorrect

The nominal variables in this data set are not actually nominal; they contain some "?" values, which should probably be missing values.

https://www.openml.org/d/301

library("OpenML")
#> Loading required package: mlr
#> Loading required package: ParamHelpers
dt <- getOMLDataSet(data.id = 301)
#> Data '301' file 'description.xml' found in cache.
#> Data '301' file 'dataset.arff' found in cache.
#> Loading required package: readr
summary(dt$data[, 1:5])
#>       WSR0           WSR1           WSR2           WSR3     
#>  ?      : 299   ?      : 292   ?      : 294   ?      : 292  
#>  0.4    : 128   0.4    : 136   0.4    : 138   0.4    : 147  
#>  0.8    : 107   0.8    : 119   0.8    : 133   0.8    : 118  
#>  0.3    : 101   0.3    : 113   0.3    : 115   0.3    : 112  
#>  1.3    :  99   0.2    :  95   1.3    :  99   1.3    : 105  
#>  1.7    :  90   1.3    :  95   0.1    :  90   0.2    : 103  
#>  (Other):1712   (Other):1686   (Other):1667   (Other):1659  
#>       WSR4     
#>  ?      : 293  
#>  0.4    : 139  
#>  0.8    : 117  
#>  0.2    : 107  
#>  0.3    : 100  
#>  1.3    :  98  
#>  (Other):1682

I was not able to raise an issue on the website (see #708)

mldata

I started sifting through mldata datasets about a year ago but never had time to finish.
This is a dump of my progress.

I ignored all datasets for which all descriptive features match, and those which were not valid ARFF.

The following datasets have matching names, but differ in either instances, features, or missing values:

  • datasets-numeric-autompg : [196, 831]
  • datasets-numeric-sleep : [205, 739]
  • letter : [6, 74, 247, 977, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386]
  • regression-datasets-autompg : [196, 831]
  • satimage : [182, 1183]
  • shuttle : [40685]
  • splice-ida : [1579]
  • splice_scale : [1579]
  • statlib-20050214-cars : [40700]
  • statlib-20050214-hip : [490, 898]
  • svmguide3 : [1589]
  • uci-20070111-arrhythmia : [5, 1017]
  • uci-20070111-autompg : [196, 831]
  • uci-20070111-dermatology : [35, 129, 263, 1010]
  • uci-20070111-hayes-roth_test : [329, 974]
  • uci-20070111-kdd_el_nino-small: [839]
  • uci-20070111-sleep : [205, 739]
  • uci-20070111-spectf_test : [1181]
  • uci-20070111-spectf_train : [1181]
  • uci-20070111-spect_test : [1180]
  • uci-20070111-spect_train : [1180]
  • vowel : [307, 1016]

The following datasets have matching names, but differ in more than one way:

  • breast-cancer : [13, 77, 1434, 23499]
  • breast-cancer_scale : [13]
  • cadata : [41156]
  • cpusmall : [561, 796]
  • cpusmall_scale : [561, 796]
  • global-earthquakes : [209, 550, 772]
  • image-ida : [40592]
  • mauna-loa-atmospheric-co2 : [41187]
  • mg : [1433, 1589]
  • natural-scenes-data : [312, 40595]
  • statlib-20050214-disclosure_z: [40713]
  • statlib-20050214-papir_1 : [486, 487]

The following datasets do not have matching names, but have the same number of instances, features, and missing values:

  • australian : [40981]
  • australian_scale : [40981]
  • cadata : [537, 823]
  • cpusmall : [227, 562, 735]
  • cpusmall_scale : [227, 562, 735]
  • datasets-arie_ben_david-era : [1029]
  • datasets-arie_ben_david-lev : [1030]
  • datasets-arie_ben_david-swd : [593, 595, 606, 608, 623, 740, 751, 845, 910, 913]
  • datasets-numeric-autoprice : [195, 745]
  • datasets-numeric-housing : [531, 853]
  • datasets-numeric-mbagrade : [380]
  • diabetes-ida : [37]
  • drug-datasets-chang : [418]
  • drug-datasets-garrat : [413, 436]
  • drug-datasets-mtp : [405]
  • drug-datasets-penning : [404, 417, 423, 425]
  • drug-datasets-phen : [412]
  • drug-datasets-phenetyl1 : [419]
  • drug-datasets-qsabr1 : [440]
  • drug-datasets-qsabr2 : [441]
  • drug-datasets-rosowky : [413, 437]
  • drug-datasets-siddiqi : [436, 437]
  • drug-datasets-strupcz : [439]
  • drug-datasets-svensson : [404, 417, 425]
  • drug-datasets-tsutumi : [404, 423, 425]
  • drug-datasets-yokohoma1 : [417, 423, 425]
  • friedman-datasets-fri_c0_1000_10: [593, 606, 608, 623, 740, 751, 910, 913, 1028]
  • friedman-datasets-fri_c0_1000_25: [586, 589, 592, 620, 715, 723, 903, 917]
  • friedman-datasets-fri_c0_1000_5 : [599, 612, 628, 743, 813, 912]
  • friedman-datasets-fri_c0_1000_50: [583, 607, 618, 622, 797, 806, 837, 866]
  • friedman-datasets-fri_c0_100_10 : [585, 591, 634, 640, 762, 783, 789, 878]
  • friedman-datasets-fri_c0_100_25 : [625, 629, 639, 655, 768, 775, 812, 868]
  • friedman-datasets-fri_c0_100_5 : [594, 611, 656, 726, 829, 916, 1463]
  • friedman-datasets-fri_c0_100_50 : [587, 630, 636, 642, 716, 876, 922, 932]
  • friedman-datasets-fri_c0_250_10 : [602, 615, 647, 657, 793, 830, 863, 935]
  • friedman-datasets-fri_c0_250_25 : [605, 614, 644, 658, 746, 794, 832, 933]
  • friedman-datasets-fri_c0_250_5 : [596, 601, 613, 730, 744, 911]
  • friedman-datasets-fri_c0_250_50 : [619, 632, 638, 648, 769, 873, 877, 918]
  • friedman-datasets-fri_c0_500_10 : [604, 627, 641, 646, 824, 855, 869, 936]
  • friedman-datasets-fri_c0_500_25 : [581, 582, 584, 643, 779, 838, 879, 896]
  • friedman-datasets-fri_c0_500_5 : [597, 617, 631, 749, 792, 870]
  • friedman-datasets-fri_c0_500_50 : [616, 626, 637, 645, 766, 805, 920, 937]
  • friedman-datasets-fri_c1_1000_10: [595, 606, 608, 623, 740, 751, 845, 913, 1028]
  • friedman-datasets-fri_c1_1000_25: [586, 589, 592, 598, 715, 723, 849, 903]
  • friedman-datasets-fri_c1_1000_5 : [599, 609, 628, 799, 813, 912]
  • friedman-datasets-fri_c1_1000_50: [590, 607, 618, 622, 797, 806, 866, 904]
  • friedman-datasets-fri_c1_100_10 : [585, 621, 634, 640, 762, 783, 808, 878]
  • friedman-datasets-fri_c1_100_25 : [625, 639, 651, 655, 768, 775, 868, 889]
  • friedman-datasets-fri_c1_100_5 : [594, 611, 624, 726, 754, 916, 1463]
  • friedman-datasets-fri_c1_100_50 : [587, 600, 630, 642, 716, 850, 922, 932]
  • friedman-datasets-fri_c1_250_10 : [602, 615, 635, 657, 763, 793, 830, 863]
  • friedman-datasets-fri_c1_250_25 : [605, 644, 653, 658, 773, 794, 832, 933]
  • friedman-datasets-fri_c1_250_5 : [579, 596, 613, 744, 776, 911]
  • friedman-datasets-fri_c1_250_50 : [603, 619, 632, 638, 732, 873, 877, 918]
  • friedman-datasets-fri_c1_500_10 : [604, 627, 646, 654, 855, 869, 936, 943]
  • friedman-datasets-fri_c1_500_25 : [581, 584, 633, 643, 838, 879, 896, 926]
  • friedman-datasets-fri_c1_500_5 : [597, 617, 649, 749, 792, 884]
  • friedman-datasets-fri_c1_500_50 : [616, 626, 645, 650, 805, 888, 920, 937]
  • friedman-datasets-fri_c2_1000_10: [593, 595, 608, 623, 740, 751, 845, 910, 1028]
  • friedman-datasets-fri_c2_1000_25: [586, 592, 598, 620, 715, 723, 849, 917]
  • friedman-datasets-fri_c2_1000_5 : [609, 612, 628, 743, 799, 813]
  • friedman-datasets-fri_c2_1000_50: [583, 590, 607, 618, 797, 806, 837, 904]
  • friedman-datasets-fri_c2_100_10 : [585, 591, 621, 640, 783, 789, 808, 878]
  • friedman-datasets-fri_c2_100_25 : [625, 629, 639, 651, 768, 812, 868, 889]
  • friedman-datasets-fri_c2_100_5 : [611, 624, 656, 754, 829, 916, 1463]
  • friedman-datasets-fri_c2_100_50 : [587, 600, 636, 642, 716, 850, 876, 932]
  • friedman-datasets-fri_c2_250_10 : [602, 615, 635, 647, 763, 793, 863, 935]
  • friedman-datasets-fri_c2_250_25 : [614, 644, 653, 658, 746, 773, 832, 933]
  • friedman-datasets-fri_c2_250_5 : [579, 601, 613, 730, 744, 776]
  • friedman-datasets-fri_c2_250_50 : [603, 619, 632, 648, 732, 769, 873, 918]
  • friedman-datasets-fri_c2_500_10 : [604, 641, 646, 654, 824, 855, 936, 943]
  • friedman-datasets-fri_c2_500_25 : [581, 582, 584, 633, 779, 838, 896, 926]
  • friedman-datasets-fri_c2_500_5 : [617, 631, 649, 749, 870, 884]
  • friedman-datasets-fri_c2_500_50 : [616, 637, 645, 650, 766, 805, 888, 937]
  • friedman-datasets-fri_c3_1000_10: [593, 595, 606, 623, 751, 845, 910, 913, 1028]
  • friedman-datasets-fri_c3_1000_25: [589, 592, 598, 620, 723, 849, 903, 917]
  • friedman-datasets-fri_c3_1000_5 : [599, 609, 612, 743, 799, 912]
  • friedman-datasets-fri_c3_1000_50: [583, 590, 607, 622, 797, 837, 866, 904]
  • friedman-datasets-fri_c3_100_10 : [591, 621, 634, 640, 762, 789, 808, 878]
  • friedman-datasets-fri_c3_100_25 : [625, 629, 651, 655, 775, 812, 868, 889]
  • friedman-datasets-fri_c3_100_5 : [594, 624, 656, 726, 754, 829, 1463]
  • friedman-datasets-fri_c3_100_50 : [600, 630, 636, 642, 850, 876, 922, 932]
  • friedman-datasets-fri_c3_250_10 : [615, 635, 647, 657, 763, 830, 863, 935]
  • friedman-datasets-fri_c3_250_25 : [605, 614, 644, 653, 746, 773, 794, 933]
  • friedman-datasets-fri_c3_250_5 : [579, 596, 601, 730, 776, 911]
  • friedman-datasets-fri_c3_250_50 : [603, 619, 638, 648, 732, 769, 877, 918]
  • friedman-datasets-fri_c3_500_10 : [604, 627, 641, 654, 824, 855, 869, 943]
  • friedman-datasets-fri_c3_500_25 : [582, 584, 633, 643, 779, 838, 879, 926]
  • friedman-datasets-fri_c3_500_5 : [597, 631, 649, 792, 870, 884]
  • friedman-datasets-fri_c3_500_50 : [616, 626, 637, 650, 766, 805, 888, 920]
  • friedman-datasets-fri_c4_1000_10: [593, 595, 606, 608, 740, 845, 910, 913, 1028]
  • friedman-datasets-fri_c4_1000_25: [586, 589, 598, 620, 715, 849, 903, 917]
  • friedman-datasets-fri_c4_1000_50: [583, 590, 618, 622, 806, 837, 866, 904]
  • friedman-datasets-fri_c4_100_10 : [585, 591, 621, 634, 762, 783, 789, 808]
  • friedman-datasets-fri_c4_100_25 : [629, 639, 651, 655, 768, 775, 812, 889]
  • friedman-datasets-fri_c4_100_50 : [587, 600, 630, 636, 716, 850, 876, 922]
  • friedman-datasets-fri_c4_250_10 : [602, 635, 647, 657, 763, 793, 830, 935]
  • friedman-datasets-fri_c4_250_25 : [605, 614, 653, 658, 746, 773, 794, 832]
  • friedman-datasets-fri_c4_250_50 : [603, 632, 638, 648, 732, 769, 873, 877]
  • friedman-datasets-fri_c4_500_10 : [627, 641, 646, 654, 824, 869, 936, 943]
  • friedman-datasets-fri_c4_500_25 : [581, 582, 633, 643, 779, 879, 896, 926]
  • friedman-datasets-fri_c4_500_50 : [626, 637, 645, 650, 766, 888, 920, 937]
  • german-ida : [31, 1547]
  • germannumer : [1436, 1572]
  • germannumer_scale : [1436, 1572]
  • heart_scale : [53]
  • housing : [531, 853]
  • housing_scale : [531, 853]
  • iris : [1099, 1413]
  • mpg : [40700]
  • mpg_scale : [40700]
  • regression-datasets-2dplanes : [344, 564, 881, 901]
  • regression-datasets-ailerons : [296]
  • regression-datasets-auto_price : [207, 756]
  • regression-datasets-bank32nh : [308, 752]
  • regression-datasets-bank8fm : [189, 225, 807, 816]
  • regression-datasets-cal_housing : [537, 823]
  • regression-datasets-fried : [215, 344, 727, 881]
  • regression-datasets-housing : [531, 853]
  • regression-datasets-kin8nm : [225, 572, 725, 816]
  • regression-datasets-puma32h : [558, 833]
  • regression-datasets-puma8nh : [189, 572, 725, 807]
  • ringnorm-ida : [1507]
  • statlib-20050214-chatfield_4 : [695, 820]
  • statlib-20050214-chscase_census3: [670, 671, 672, 673, 906, 907, 908, 909]
  • statlib-20050214-chscase_census5: [670, 671, 672, 673, 906, 907, 908, 909]
  • statlib-20050214-chscase_geyser1: [712, 895]
  • statlib-20050214-csb_ch2 : [668, 692, 787, 874, 1096]
  • statlib-20050214-diggle_table_a1: [485, 693, 817, 835]
  • statlib-20050214-diggle_table_a2: [694, 818]
  • statlib-20050214-disclosure_z : [676, 699, 704, 709, 774, 795, 827, 931]
  • statlib-20050214-hutsof99_logis : [681, 804]
  • statlib-20050214-no2 : [522, 750, 40496]
  • statlib-20050214-pm10 : [547, 886, 40496]
  • statlib-20050214-prnn_synth : [464]
  • statlib-20050214-rabe_131 : [668, 692, 787, 874, 1096]
  • statlib-20050214-rabe_148 : [710, 894]
  • statlib-20050214-rabe_166 : [684, 919]
  • statlib-20050214-rabe_176 : [698, 929]
  • statlib-20050214-rabe_265 : [660, 780]
  • statlib-20050214-rabe_266 : [663, 782]
  • statlib-20050214-rabe_97 : [697, 928]
  • statlib-20050214-sleuth_case1202: [706, 891]
  • statlib-20050214-sleuth_case1501: [711, 946]
  • statlib-20050214-sleuth_case2002: [665, 902]
  • statlib-20050214-sleuth_ex1605 : [687, 755]
  • statlib-20050214-sleuth_ex1714 : [659, 777]
  • statlib-20050214-sleuth_ex2012 : [663, 782]
  • statlib-20050214-sleuth_ex2015 : [683, 864]
  • statlib-20050214-sleuth_ex2016 : [682, 862]
  • thyroid-ida : [40682]
  • twonorm-ida : [1496]
  • uci-20070111-2dplanes : [344, 564, 881, 901]
  • uci-20070111-ailerons : [296]
  • uci-20070111-autoprice : [195, 745]
  • uci-20070111-auto_price : [207, 756]
  • uci-20070111-bank32nh : [308, 752]
  • uci-20070111-bank8fm : [189, 225, 807, 816]
  • uci-20070111-cal_housing : [537, 823]
  • uci-20070111-fried : [215, 344, 727, 881]
  • uci-20070111-housing : [531, 853]
  • uci-20070111-kin8nm : [225, 572, 725, 816]
  • uci-20070111-mbagrade : [380]
  • uci-20070111-puma32h : [558, 833]
  • uci-20070111-puma8nh : [189, 572, 725, 807]
  • usps : [41082]
  • waveform-ida : [4551]

The following datasets do not match any of the above criteria:

None

MNIST is called mnist_784

It's weird for an sklearn user to have fetch_openml("MNIST") return "no dataset MNIST". There is an MNIST dataset on OpenML, but it is "in preparation".

[550] Update description Quake dataset

Quake has a poor dataset description: it is generic to the book it was taken from, and the provided link to the online book is dead. The actual dataset description in the book lists:

[image: dataset description excerpt from the book]

Many classification tasks seem to have numeric targets

Several datasets have numeric targets, but are clearly classification tasks (a non-exhaustive list):
https://www.openml.org/d/23513
https://www.openml.org/d/4532
https://www.openml.org/d/5587 (https://www.openml.org/d/5648, …)
https://www.openml.org/d/1575
https://www.openml.org/d/1577

and probably also https://www.openml.org/d/296

It should be easy to find most of them programmatically by looking at the number of unique values of the target variable. Of course one would have to be careful not to accidentally identify a dataset with ordinal discrete values (e.g. counts) as classification.
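The heuristic above can be sketched as a small helper. This is a minimal sketch, not an official OpenML tool: the function name, the `max_unique` threshold, and the consecutive-integer check are my own assumptions, chosen to flag the ambiguous ordinal-count case the issue warns about.

```python
import numpy as np

def looks_like_classification(y, max_unique=10):
    """Heuristically flag a numeric target as a likely class label.

    Returns (few_values, ordinal_like):
      few_values   -- True if the target has at most max_unique distinct
                      values, suggesting it encodes classes.
      ordinal_like -- True if those values are consecutive integers
                      (e.g. 0, 1, 2, ...), which could instead be ordinal
                      counts and therefore needs a human check.
    """
    values = np.unique(np.asarray(y, dtype=float))
    few_values = len(values) <= max_unique
    # Consecutive integers may be counts rather than class labels,
    # so report this signal separately instead of deciding outright.
    ordinal_like = bool(
        np.all(values == np.round(values))
        and np.all(np.diff(values) == 1)
    )
    return few_values, ordinal_like

# A 0/1-valued numeric target is flagged as few-valued, but also as
# consecutive integers, so it still warrants manual confirmation.
few, ordinal_like = looks_like_classification([0.0, 1.0, 0.0, 1.0, 1.0])
```

Running this over each dataset's target column (e.g. after downloading it with the OpenML client) would shortlist candidates for relabeling, leaving the ordinal-count cases for manual review.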

Data set pbc wrong target info + covariate problem

From @HeidiSeibold on March 13, 2017 11:46

Each version of the pbc data set (version 1: https://www.openml.org/d/200) has problems. Since I can't open issues on the website at the moment (see #106), I'll document them here.

  • For V1 the outcome is clearly right-censored, i.e. both X/class and D (the censoring indicator) are part of the target
  • In V2 the target should be categorical
  • V3 is something I don't think we can deal with right now in OpenML, since class and D would need to be combined into a single target and both are binary.

Due to these problems, some non-sensible tasks also exist.

Copied from original issue: openml/openml.org#109
