openml-data's Issues

tecator dataset has three targets: OpenML version has two targets added as predictors

Hi There,

The tecator dataset is part of OpenML-Reg19, a (work in progress) suite of Regression datasets.

The dataset on OpenML has the fat variable as target. It turns out that moisture and protein are included in the dataset as predictors, which otherwise only contains absorbances from a spectrometer. I found that moisture and protein are highly predictive of fat; there is no need to include the absorbances at all for optimal prediction.

Curious, I checked the documentation of the dataset. It turns out that, as used in the literature, this dataset contains three targets for prediction, with the idea of using only the absorbances as predictors.

The original publication for this dataset is here (behind a paywall).

https://pubs.acs.org/doi/pdf/10.1021/ac00029a018

I checked, and there fat was predicted using only the absorbances.

So, to be able to compare with the published literature for this dataset, it makes sense to leave moisture and protein out of the predictors.

Any thoughts on how to incorporate this in the OpenML framework? Can we remove the two other targets from tecator? Or would this make it a new dataset? But if every subset of variables of a dataset must be added to OpenML as a new dataset a lot of duplication would occur, right?

PS: here is the summary documentation for caret (https://rdrr.io/cran/caret/man/tecator.html), where tecator is also included:

"For each meat sample the data consists of a 100 channel spectrum of absorbances and the contents of moisture (water), fat and protein. The absorbance is -log10 of the transmittance measured by the spectrometer. The three contents, measured in percent, are determined by analytic chemistry." 
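The literature setup described above can be sketched in pandas; the column names and values below are illustrative, not the actual OpenML schema:

```python
import pandas as pd

# Hypothetical sketch: to match the published setup, keep only the absorbance
# channels as predictors for 'fat', dropping the two other targets
# ('moisture', 'protein') before modelling. Data here is made up.
df = pd.DataFrame({
    "absorbance_1": [2.6, 2.8, 3.1],
    "absorbance_2": [2.7, 2.9, 3.2],
    "moisture": [60.0, 46.0, 71.0],
    "protein": [16.0, 14.0, 20.0],
    "fat": [22.5, 40.1, 8.4],
})
X = df.drop(columns=["moisture", "protein", "fat"])  # absorbances only
y = df["fat"]
```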

Regards,
Gertjan

Migration ARFF to Parquet on the OpenML server

This is a centralised discussion about the server-side changes (being) made to the datasets in their conversion from ARFF to Parquet. Related ongoing discussions that reference the server state of different datasets:

Let's keep the relevant information about the migration as it relates to server data in this thread.
This is not for connector specific discussions (for example, how openml-python handles this).
@joaquinvanschoren @prabhant @sebffischer

Standard datasets for benchmarking regression

Are there any other known curated subsets of benchmarking datasets besides the study_14 datasets? Those only contain classification tasks, but I would also like to have datasets for regression.

Add more datasets from kaggle?

There are many interesting datasets on Kaggle (in the datasets section, not the competitions):
https://www.kaggle.com/datasets

Unfortunately, most of these don't qualify for CC-18 because they are missing a publication. But they are quite interesting, and I think we need more interesting datasets.

Attribute description mistakes

The following datasets have suspicious attribute types:

  • 298 - several attributes should be nominal instead of numerical.
  • 345 - there should be 3 numeric attributes according to the description, but none of the available attributes is numeric
  • 504 - how can data from a book called Analyzing Categorical Data be all continuous?
  • 532 - same here...
  • 458 - same here... at least the Book ID should be categorical
  • 516 - day should be numerical, not categorical.
  • 1169 - flight number should not be numerical, but categorical.
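Fixing a wrongly numeric column on the user side is a one-liner in pandas; a minimal sketch with a made-up column, assuming a flight-number case like dataset 1169:

```python
import pandas as pd

# Hypothetical sketch: re-declare a flight number stored as numeric as
# categorical, so models don't treat it as an ordered quantity.
# Column name and values are illustrative.
df = pd.DataFrame({"flight_number": [101, 202, 101, 303]})
df["flight_number"] = df["flight_number"].astype("category")
```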

Import the outlier detection benchmark results from http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/

http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/

This is a repository of outlier detection benchmark data and results.

Every data set comes with a downloadable "raw algorithm results" package containing the results of a few hundred (algorithm, parameter) combinations on these data sets, and there is a separate file with generated evaluation results, too. Alternatively, you could also import only the best-of results.

As mentioned in openml/openml-java#6, it would also be nice to have a "submit to OpenML" function in ELKI; on the other hand, OpenML could use ELKI for evaluating outlier and clustering results (ELKI has 19 supervised evaluation measures for clustering and 9 internal evaluation measures, with 3 different strategies for handling noise objects. For outlier evaluation, it has 4 measures plus adjustment for chance, which yields 7 interesting measures in total). Except for the internal cluster evaluation measures (which may need O(n^2) memory and pairwise distances), they are all very fast to compute.

I don't have the capacity right now to do the integration myself, but I can assist, e.g., with adapting the scripts used to generate the above results. Or we could simply transfer the data as ASCII for submission?
From the API documentation, I do not understand how to format result data for submission. Are arbitrary file types allowed, or only ARFF? How are evaluation results uploaded?
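To make the evaluation side concrete, here is a minimal sketch of one such measure: ROC AUC over outlier scores, with a simple adjustment for chance that rescales a random scorer's expected 0.5 to 0. The labels and scores are made up; this is not ELKI's implementation:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical example data: 1 = outlier (ground truth), higher score =
# more outlying according to some detector.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.9, 0.3, 0.8]

auc = roc_auc_score(labels, scores)
adjusted = (auc - 0.5) / 0.5  # adjustment for chance: random scorer -> 0
```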

Fairness Data Sources

Got these fairness data sources recommended by a colleague. Opening this issue so that at some point we can:

  1. upload these datasets
  2. create a specific fairness task

fairness_survey.pdf
Section 5 describes the datasets

Dataset 40978: should have missing values.

Description of the dataset states (highlight is mine):

There are : 3 continuous attributes. The others are binary. This is the "STANDARD encoding" mentioned in the [Kushmerick, 99] (see below). One or more of the three continuous features are missing in 28% of the instances. Missing values should be interpreted as "unknown".

However, the dataset on OpenML does not have missing values (as seen in the "Qualities").

The original dataset as hosted by UCI has missing values indicated by "?". In the OpenML dataset, the corresponding cells are 0, instead.

Note the dataset is tagged as OpenML-CC18.
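A minimal repair sketch, assuming one re-parses the UCI source: treating "?" as NaN (instead of letting it silently become 0) preserves the ~28% missingness the description promises. The column name and values below are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: parse the "?" missing-value marker from the UCI file
# as NaN rather than 0.
raw = pd.DataFrame({"f1": ["1.2", "?", "3.4", "?"]})
fixed = raw["f1"].replace("?", np.nan).astype(float)
```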

ValueError on retrieving Penguins data

When I try to get the Penguins dataset using the Python API (openml.datasets.get_dataset(dataset_id=42585)), I get a ValueError originating from pandas, because there are some missing values.

In the scikit-learn API (sklearn.datasets.fetch_openml) it is possible to use the as_frame argument to control whether pandas is used. I'm not sure whether I've just missed it, but I couldn't find a similar option in the openml Python API.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-42cadb8ab205> in <module>
----> 1 dataset = get_dataset(dataset_id=42585)

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/functions.py in get_dataset(dataset_id, download_data, version, error_if_multiple)
    527                                      did_cache_dir)
    528 
--> 529     dataset = _create_dataset_from_description(
    530         description, features, qualities, arff_file
    531     )

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/functions.py in _create_dataset_from_description(description, features, qualities, arff_file)
    995         Dataset object from dict and ARFF.
    996     """
--> 997     return OpenMLDataset(
    998         description["oml:name"],
    999         description.get("oml:description"),

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in __init__(self, name, description, format, data_format, dataset_id, version, creator, contributor, collection_date, upload_date, language, licence, url, default_target_attribute, row_id_attribute, ignore_attribute, version_label, citation, tag, visibility, original_data_url, paper_url, update_comment, md5_checksum, data_file, features, qualities, dataset)
    181 
    182         if data_file is not None:
--> 183             self.data_pickle_file = self._create_pickle_in_cache(data_file)
    184         else:
    185             self.data_pickle_file = None

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in _create_pickle_in_cache(self, data_file)
    421         # At this point either the pickle file does not exist, or it had outdated formatting.
    422         # We parse the data from arff again and populate the cache with a recent pickle file.
--> 423         X, categorical, attribute_names = self._parse_data_from_arff(data_file)
    424 
    425         with open(data_pickle_file, "wb") as fh:

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in _parse_data_from_arff(self, arff_file_path)
    387                 if attribute_dtype[column_name] in ('categorical',
    388                                                     'boolean'):
--> 389                     col.append(self._unpack_categories(
    390                         X[column_name], categories_names[column_name]))
    391                 else:

~/opt/anaconda3/lib/python3.8/site-packages/openml/datasets/dataset.py in _unpack_categories(series, categories)
    531         # We require two lines to create a series of categories as detailed here:
    532         # https://pandas.pydata.org/pandas-docs/version/0.24/user_guide/categorical.html#series-creation  # noqa E501
--> 533         raw_cat = pd.Categorical(col, ordered=True, categories=categories)
    534         return pd.Series(raw_cat, index=series.index, name=series.name)
    535 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arrays/categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    314     ):
    315 
--> 316         dtype = CategoricalDtype._from_values_or_dtype(
    317             values, categories, ordered, dtype
    318         )

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in _from_values_or_dtype(cls, values, categories, ordered, dtype)
    328             # Note: This could potentially have categories=None and
    329             # ordered=None.
--> 330             dtype = CategoricalDtype(categories, ordered)
    331 
    332         return dtype

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in __init__(self, categories, ordered)
    220 
    221     def __init__(self, categories=None, ordered: Ordered = False):
--> 222         self._finalize(categories, ordered, fastpath=False)
    223 
    224     @classmethod

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in _finalize(self, categories, ordered, fastpath)
    367 
    368         if categories is not None:
--> 369             categories = self.validate_categories(categories, fastpath=fastpath)
    370 
    371         self._categories = categories

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
    541 
    542             if categories.hasnans:
--> 543                 raise ValueError("Categorial categories cannot be null")
    544 
    545             if not categories.is_unique:

ValueError: Categorial categories cannot be null

No dataset qualities in a few OpenML datasets

Error while fetching these datasets:

>>> openml.dataset.get_dataset(202)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'openml' has no attribute 'dataset'
>>> openml.datasets.get_dataset(202)
Traceback (most recent call last):
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 1113, in _get_dataset_qualities_file
    with io.open(qualities_file, encoding="utf8") as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/home/prabhant/.cache/openml/org/openml/www/datasets/202/qualities.xml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 438, in get_dataset
    raise e
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 416, in get_dataset
    qualities_file = _get_dataset_qualities_file(did_cache_dir, dataset_id)
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/datasets/functions.py", line 1117, in _get_dataset_qualities_file
    qualities_xml = openml._api_calls._perform_api_call(url_extension, "get")
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 65, in _perform_api_call
    response = __read_url(url, request_method, data)
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 204, in __read_url
    request_method=request_method, url=url, data=data, md5_checksum=md5_checksum
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 235, in _send_request
    __check_response(response=response, url=url, file_elements=files)
  File "/home/prabhant/anaconda3/envs/temp/lib/python3.7/site-packages/openml/_api_calls.py", line 273, in __check_response
    raise __parse_server_exception(response, url, file_elements=file_elements)
openml.exceptions.OpenMLServerException: https://www.openml.org/api/v1/xml/data/qualities/202 returned code 362: No qualities found - None

FOREX_eurrub-hour-Close fails to load

Trying to load this dataset using the Python API (openml==0.10.1), I get the following error:

  File ".../site-packages/openml/datasets/dataset.py", line 574, in get_data
    data, categorical, attribute_names = self._load_data()
  File ".../site-packages/openml/datasets/dataset.py", line 438, in _load_data
    self.data_pickle_file = self._create_pickle_in_cache(self.data_file)
  File ".../site-packages/openml/datasets/dataset.py", line 421, in _create_pickle_in_cache
    X, categorical, attribute_names = self._parse_data_from_arff(data_file)
  File ".../site-packages/openml/datasets/dataset.py", line 314, in _parse_data_from_arff
    data = self._get_arff(self.format)
  File ".../site-packages/openml/datasets/dataset.py", line 293, in _get_arff
    return decode_arff(fh)
  File ".../site-packages/openml/datasets/dataset.py", line 286, in decode_arff
    return_type=return_type)
  File ".../site-packages/arff.py", line 895, in decode
    raise e
  File ".../site-packages/arff.py", line 892, in decode
    matrix_type=return_type)
  File ".../site-packages/arff.py", line 822, in _decode
    attr = self._decode_attribute(row)
  File ".../site-packages/arff.py", line 764, in _decode_attribute
    raise BadAttributeType()
arff.BadAttributeType: Bad @ATTRIBUTE type, at line 2.

Python 3.6 on Linux.

thyroid datasets don't match with literature

I'm trying to understand the UCI thyroid dataset(s?):
https://www.openml.org/search?q=thyroid&type=data

I can't reconcile the data on OpenML with the data on UCI and the description. thyroid should have 6832 or 5473 samples depending on the version; none of the datasets on OpenML has that.
There is a standard split into training and test sets, and it looks like OpenML only has the training sets, but I'm not sure.
This dataset pretty explicitly gives the number of training samples, so it is missing the test set and the correct split:
https://www.openml.org/d/40497

Several potentially undeclared ID columns in datasets

Hi, I found several more potentially undeclared ID columns in a few datasets:

  • 275: openml_task_id - also, should it really be a numeric attribute?
  • 276: openml_task_id - also, should it really be a numeric attribute?
  • 277: openml_task_id - also, should it really be a numeric attribute?
  • 278: openml_task_id - also, should it really be a numeric attribute?
  • 372: who is most likely an identifier, also, age most likely should be numerical with the value not_say being encoded as missing?
  • 458: BookID
  • 481: hospital_identification_number probably shouldn't be numeric?
  • 524: case_number
  • 565: I am not really sure here, but there are four suspicious attributes here with a very high number of different categories
  • 674: col1
  • 694: col3
  • 695: col1
  • 691: col1
  • 692: col1
  • 693: col1
  • 815: col1
  • 817: col1
  • 818: col3
  • 825: OBS
  • 820: col1
  • 857: RUN
  • 880: LABEL
  • 874: col1
  • 897: FICE is deactivated in version 1
  • 890: PERIOD is deactivated in version 1
  • 885: different target than 544 - obs looks suspiciously like an ID
  • 930: FICE is deactivated in version 1
  • 939: col7
  • 940: same as 565
  • 967: name an ID? - has version 1 twice
  • 987: has version 1 twice - Counter should probably be ignored (as in version 1)
  • 1044: lineNo
  • 1076: recordnumber
  • 1117: project
  • 1115: ID
  • 1217 & 1220: query_id and user_id are IDs, but they are probably useful here? - actually, almost everything in all click prediction datasets is an ID - and they should probably be used to look up information in some other file which is not on OpenML.
  • 1467: V2 looks like an ID? V3 should be in [0, 1]
  • 1483: v3 is a timestamp
  • 1559: weird dataset, as many attributes have 105 values (for 106 samples)
  • 4531: ID
  • 4541: encounter_id seems to be an ID, not sure about patient_id
  • 4545: description says not to use URL and timedelta
  • 4550: MouseID is labeled as a row_id in the frontend, but not in the XML
  • 4552: there are IDs not marked as IDs
  • 4545: URL is a weird attribute - maybe it should rather be a string?
  • 4533: first attribute has quite a lot of values (~63,000 nominal for 65,000 samples)
  • 5587: IDs are used contrary to https://www.kaggle.com/c/comet-track-recognition-mlhep-2015/data; same for 23394, 23395, 23396, 23397
  • 23394: should be v2 of the dataset, should have a target? contains a lot of IDs?
  • 23396: same here
  • 6331: ID
  • 40665: conformation_name seems to be an ID, molucule_name is suspicious - but at least neither should be numeric, right?
  • 40666: same here
  • 4138: there's a numeric feature called task_id, but not sure what it's good for in a clustering dataset
  • 40698: sample_code_number
  • 40701: should phone_number really be numerical?
  • 40869: has no target, has a column called ID
  • 23383: sensor dataset, has a timestamp and is not complete

Having seen some more datasets with a timestamp, I wonder whether it would be a good idea to actually introduce the concept of a timestamp in the database. Then we'd easily see that those datasets shouldn't simply be used for regular classification, and @janvanrijn could use them for his work on data streams.

It would be good if someone could double-check these datasets to make sure that what I identified as IDs are actually IDs.
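For double-checking, a simple heuristic helps: a column whose values are (almost) unique per row is a likely undeclared ID. A minimal sketch with made-up data and an arbitrary 95% uniqueness threshold:

```python
import pandas as pd

# Hypothetical heuristic: flag columns where nearly every row has a distinct
# value. Column names, data, and the 0.95 threshold are illustrative.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "x": [0.1, 0.1, 0.5, 0.5],
})
suspects = [c for c in df.columns if df[c].nunique() >= 0.95 * len(df)]
```

This catches exact IDs but not hash-like or composite identifiers, so a manual pass over the flagged columns is still needed.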

OpenML Sparse Dataset support

This issue tracks the progress of sparse dataset support on the OpenML MinIO backend.
Currently, MinIO does not have the OpenML sparse datasets because pandas can't write sparse data to Parquet by default.
Example:

did = 42379
d = openml.datasets.get_dataset(did, download_qualities=False)
df, *_ = d.get_data(dataset_format="dataframe", include_row_id=True, include_ignore_attribute=True)
df.to_parquet(f'dataset_{d.id}.pq')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-42ca2d7c4839> in <module>
      7                                       target=d.default_target_attribute)
      8     df = pd.concat([X,y], axis=1)
----> 9     df.to_parquet(f'dataset_{d.id}.pq')
     10     client.make_bucket(f"dataset{did}")
     11     client.fput_object(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/core/frame.py in to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2453         from pandas.io.parquet import to_parquet
   2454 
-> 2455         return to_parquet(
   2456             self,
   2457             path,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    388     path_or_buf: FilePathOrBuffer = io.BytesIO() if path is None else path
    389 
--> 390     impl.write(
    391         df,
    392         path_or_buf,

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pandas/io/parquet.py in write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    150             from_pandas_kwargs["preserve_index"] = index
    151 
--> 152         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    153 
    154         path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    551      index_columns,
    552      columns_to_convert,
--> 553      convert_fields) = _get_columns_to_convert(df, schema, preserve_index,
    554                                                columns)
    555 

~/opt/anaconda3/envs/env/lib/python3.8/site-packages/pyarrow/pandas_compat.py in _get_columns_to_convert(df, schema, preserve_index, columns)
    357 
    358         if _pandas_api.is_sparse(col):
--> 359             raise TypeError(
    360                 "Sparse pandas data (column {}) not supported.".format(name))
    361 

TypeError: Sparse pandas data (column FCFP6_1024_0) not supported.
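One possible workaround sketch, assuming densification is acceptable memory-wise: pyarrow rejects pandas sparse extension dtypes, so converting sparse columns to dense before writing avoids the TypeError. The data below is made up; only the column name is taken from the traceback:

```python
import pandas as pd

# Hypothetical workaround: densify sparse columns before to_parquet.
df = pd.DataFrame({"FCFP6_1024_0": pd.arrays.SparseArray([0, 0, 1, 0])})
dense = df.sparse.to_dense()  # .sparse accessor requires all columns sparse
# dense.to_parquet("dataset.pq")  # would now succeed (pyarrow required)
```

The downside is that this loses the storage benefit of sparsity on disk, which is presumably why a proper server-side solution is being tracked here.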

Wrongly annotated target / ignore columns

Overview of datasets with known issues regarding special column types:

Nominal target but indicated as numerical. This should probably be checked automatically.
Also check whether any 'wrong' tasks were created on these.

Missing target column:

Bad row_id / ignore columns:

Data set 301 incorrect

The nominal variables in this data set are not actually nominal; they contain some "?" values, which should probably be missing values.

https://www.openml.org/d/301

library("OpenML")
#> Loading required package: mlr
#> Loading required package: ParamHelpers
dt <- getOMLDataSet(data.id = 301)
#> Data '301' file 'description.xml' found in cache.
#> Data '301' file 'dataset.arff' found in cache.
#> Loading required package: readr
summary(dt$data[, 1:5])
#>       WSR0           WSR1           WSR2           WSR3     
#>  ?      : 299   ?      : 292   ?      : 294   ?      : 292  
#>  0.4    : 128   0.4    : 136   0.4    : 138   0.4    : 147  
#>  0.8    : 107   0.8    : 119   0.8    : 133   0.8    : 118  
#>  0.3    : 101   0.3    : 113   0.3    : 115   0.3    : 112  
#>  1.3    :  99   0.2    :  95   1.3    :  99   1.3    : 105  
#>  1.7    :  90   1.3    :  95   0.1    :  90   0.2    : 103  
#>  (Other):1712   (Other):1686   (Other):1667   (Other):1659  
#>       WSR4     
#>  ?      : 293  
#>  0.4    : 139  
#>  0.8    : 117  
#>  0.2    : 107  
#>  0.3    : 100  
#>  1.3    :  98  
#>  (Other):1682

I was not able to raise an issue on the website (see #708)

mldata

I started sifting through mldata datasets about a year ago but never had time to finish.
This is a dump of my progress.

I ignored all datasets for which all descriptive features match, and those which were not valid ARFF.

The following datasets have matching names, but differ in either instances, features, or missing values:

  • datasets-numeric-autompg : [196, 831]
  • datasets-numeric-sleep : [205, 739]
  • letter : [6, 74, 247, 977, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386]
  • regression-datasets-autompg : [196, 831]
  • satimage : [182, 1183]
  • shuttle : [40685]
  • splice-ida : [1579]
  • splice_scale : [1579]
  • statlib-20050214-cars : [40700]
  • statlib-20050214-hip : [490, 898]
  • svmguide3 : [1589]
  • uci-20070111-arrhythmia : [5, 1017]
  • uci-20070111-autompg : [196, 831]
  • uci-20070111-dermatology : [35, 129, 263, 1010]
  • uci-20070111-hayes-roth_test : [329, 974]
  • uci-20070111-kdd_el_nino-small: [839]
  • uci-20070111-sleep : [205, 739]
  • uci-20070111-spectf_test : [1181]
  • uci-20070111-spectf_train : [1181]
  • uci-20070111-spect_test : [1180]
  • uci-20070111-spect_train : [1180]
  • vowel : [307, 1016]

The following datasets have matching names, but differ in more than one way:

  • breast-cancer : [13, 77, 1434, 23499]
  • breast-cancer_scale : [13]
  • cadata : [41156]
  • cpusmall : [561, 796]
  • cpusmall_scale : [561, 796]
  • global-earthquakes : [209, 550, 772]
  • image-ida : [40592]
  • mauna-loa-atmospheric-co2 : [41187]
  • mg : [1433, 1589]
  • natural-scenes-data : [312, 40595]
  • statlib-20050214-disclosure_z: [40713]
  • statlib-20050214-papir_1 : [486, 487]

The following datasets do not have matching names, but have the same number of instances, features, and missing values:

  • australian : [40981]
  • australian_scale : [40981]
  • cadata : [537, 823]
  • cpusmall : [227, 562, 735]
  • cpusmall_scale : [227, 562, 735]
  • datasets-arie_ben_david-era : [1029]
  • datasets-arie_ben_david-lev : [1030]
  • datasets-arie_ben_david-swd : [593, 595, 606, 608, 623, 740, 751, 845, 910, 913]
  • datasets-numeric-autoprice : [195, 745]
  • datasets-numeric-housing : [531, 853]
  • datasets-numeric-mbagrade : [380]
  • diabetes-ida : [37]
  • drug-datasets-chang : [418]
  • drug-datasets-garrat : [413, 436]
  • drug-datasets-mtp : [405]
  • drug-datasets-penning : [404, 417, 423, 425]
  • drug-datasets-phen : [412]
  • drug-datasets-phenetyl1 : [419]
  • drug-datasets-qsabr1 : [440]
  • drug-datasets-qsabr2 : [441]
  • drug-datasets-rosowky : [413, 437]
  • drug-datasets-siddiqi : [436, 437]
  • drug-datasets-strupcz : [439]
  • drug-datasets-svensson : [404, 417, 425]
  • drug-datasets-tsutumi : [404, 423, 425]
  • drug-datasets-yokohoma1 : [417, 423, 425]
  • friedman-datasets-fri_c0_1000_10: [593, 606, 608, 623, 740, 751, 910, 913, 1028]
  • friedman-datasets-fri_c0_1000_25: [586, 589, 592, 620, 715, 723, 903, 917]
  • friedman-datasets-fri_c0_1000_5 : [599, 612, 628, 743, 813, 912]
  • friedman-datasets-fri_c0_1000_50: [583, 607, 618, 622, 797, 806, 837, 866]
  • friedman-datasets-fri_c0_100_10 : [585, 591, 634, 640, 762, 783, 789, 878]
  • friedman-datasets-fri_c0_100_25 : [625, 629, 639, 655, 768, 775, 812, 868]
  • friedman-datasets-fri_c0_100_5 : [594, 611, 656, 726, 829, 916, 1463]
  • friedman-datasets-fri_c0_100_50 : [587, 630, 636, 642, 716, 876, 922, 932]
  • friedman-datasets-fri_c0_250_10 : [602, 615, 647, 657, 793, 830, 863, 935]
  • friedman-datasets-fri_c0_250_25 : [605, 614, 644, 658, 746, 794, 832, 933]
  • friedman-datasets-fri_c0_250_5 : [596, 601, 613, 730, 744, 911]
  • friedman-datasets-fri_c0_250_50 : [619, 632, 638, 648, 769, 873, 877, 918]
  • friedman-datasets-fri_c0_500_10 : [604, 627, 641, 646, 824, 855, 869, 936]
  • friedman-datasets-fri_c0_500_25 : [581, 582, 584, 643, 779, 838, 879, 896]
  • friedman-datasets-fri_c0_500_5 : [597, 617, 631, 749, 792, 870]
  • friedman-datasets-fri_c0_500_50 : [616, 626, 637, 645, 766, 805, 920, 937]
  • friedman-datasets-fri_c1_1000_10: [595, 606, 608, 623, 740, 751, 845, 913, 1028]
  • friedman-datasets-fri_c1_1000_25: [586, 589, 592, 598, 715, 723, 849, 903]
  • friedman-datasets-fri_c1_1000_5 : [599, 609, 628, 799, 813, 912]
  • friedman-datasets-fri_c1_1000_50: [590, 607, 618, 622, 797, 806, 866, 904]
  • friedman-datasets-fri_c1_100_10 : [585, 621, 634, 640, 762, 783, 808, 878]
  • friedman-datasets-fri_c1_100_25 : [625, 639, 651, 655, 768, 775, 868, 889]
  • friedman-datasets-fri_c1_100_5 : [594, 611, 624, 726, 754, 916, 1463]
  • friedman-datasets-fri_c1_100_50 : [587, 600, 630, 642, 716, 850, 922, 932]
  • friedman-datasets-fri_c1_250_10 : [602, 615, 635, 657, 763, 793, 830, 863]
  • friedman-datasets-fri_c1_250_25 : [605, 644, 653, 658, 773, 794, 832, 933]
  • friedman-datasets-fri_c1_250_5 : [579, 596, 613, 744, 776, 911]
  • friedman-datasets-fri_c1_250_50 : [603, 619, 632, 638, 732, 873, 877, 918]
  • friedman-datasets-fri_c1_500_10 : [604, 627, 646, 654, 855, 869, 936, 943]
  • friedman-datasets-fri_c1_500_25 : [581, 584, 633, 643, 838, 879, 896, 926]
  • friedman-datasets-fri_c1_500_5 : [597, 617, 649, 749, 792, 884]
  • friedman-datasets-fri_c1_500_50 : [616, 626, 645, 650, 805, 888, 920, 937]
  • friedman-datasets-fri_c2_1000_10: [593, 595, 608, 623, 740, 751, 845, 910, 1028]
  • friedman-datasets-fri_c2_1000_25: [586, 592, 598, 620, 715, 723, 849, 917]
  • friedman-datasets-fri_c2_1000_5 : [609, 612, 628, 743, 799, 813]
  • friedman-datasets-fri_c2_1000_50: [583, 590, 607, 618, 797, 806, 837, 904]
  • friedman-datasets-fri_c2_100_10 : [585, 591, 621, 640, 783, 789, 808, 878]
  • friedman-datasets-fri_c2_100_25 : [625, 629, 639, 651, 768, 812, 868, 889]
  • friedman-datasets-fri_c2_100_5 : [611, 624, 656, 754, 829, 916, 1463]
  • friedman-datasets-fri_c2_100_50 : [587, 600, 636, 642, 716, 850, 876, 932]
  • friedman-datasets-fri_c2_250_10 : [602, 615, 635, 647, 763, 793, 863, 935]
  • friedman-datasets-fri_c2_250_25 : [614, 644, 653, 658, 746, 773, 832, 933]
  • friedman-datasets-fri_c2_250_5 : [579, 601, 613, 730, 744, 776]
  • friedman-datasets-fri_c2_250_50 : [603, 619, 632, 648, 732, 769, 873, 918]
  • friedman-datasets-fri_c2_500_10 : [604, 641, 646, 654, 824, 855, 936, 943]
  • friedman-datasets-fri_c2_500_25 : [581, 582, 584, 633, 779, 838, 896, 926]
  • friedman-datasets-fri_c2_500_5 : [617, 631, 649, 749, 870, 884]
  • friedman-datasets-fri_c2_500_50 : [616, 637, 645, 650, 766, 805, 888, 937]
  • friedman-datasets-fri_c3_1000_10: [593, 595, 606, 623, 751, 845, 910, 913, 1028]
  • friedman-datasets-fri_c3_1000_25: [589, 592, 598, 620, 723, 849, 903, 917]
  • friedman-datasets-fri_c3_1000_5 : [599, 609, 612, 743, 799, 912]
  • friedman-datasets-fri_c3_1000_50: [583, 590, 607, 622, 797, 837, 866, 904]
  • friedman-datasets-fri_c3_100_10 : [591, 621, 634, 640, 762, 789, 808, 878]
  • friedman-datasets-fri_c3_100_25 : [625, 629, 651, 655, 775, 812, 868, 889]
  • friedman-datasets-fri_c3_100_5 : [594, 624, 656, 726, 754, 829, 1463]
  • friedman-datasets-fri_c3_100_50 : [600, 630, 636, 642, 850, 876, 922, 932]
  • friedman-datasets-fri_c3_250_10 : [615, 635, 647, 657, 763, 830, 863, 935]
  • friedman-datasets-fri_c3_250_25 : [605, 614, 644, 653, 746, 773, 794, 933]
  • friedman-datasets-fri_c3_250_5 : [579, 596, 601, 730, 776, 911]
  • friedman-datasets-fri_c3_250_50 : [603, 619, 638, 648, 732, 769, 877, 918]
  • friedman-datasets-fri_c3_500_10 : [604, 627, 641, 654, 824, 855, 869, 943]
  • friedman-datasets-fri_c3_500_25 : [582, 584, 633, 643, 779, 838, 879, 926]
  • friedman-datasets-fri_c3_500_5 : [597, 631, 649, 792, 870, 884]
  • friedman-datasets-fri_c3_500_50 : [616, 626, 637, 650, 766, 805, 888, 920]
  • friedman-datasets-fri_c4_1000_10: [593, 595, 606, 608, 740, 845, 910, 913, 1028]
  • friedman-datasets-fri_c4_1000_25: [586, 589, 598, 620, 715, 849, 903, 917]
  • friedman-datasets-fri_c4_1000_50: [583, 590, 618, 622, 806, 837, 866, 904]
  • friedman-datasets-fri_c4_100_10 : [585, 591, 621, 634, 762, 783, 789, 808]
  • friedman-datasets-fri_c4_100_25 : [629, 639, 651, 655, 768, 775, 812, 889]
  • friedman-datasets-fri_c4_100_50 : [587, 600, 630, 636, 716, 850, 876, 922]
  • friedman-datasets-fri_c4_250_10 : [602, 635, 647, 657, 763, 793, 830, 935]
  • friedman-datasets-fri_c4_250_25 : [605, 614, 653, 658, 746, 773, 794, 832]
  • friedman-datasets-fri_c4_250_50 : [603, 632, 638, 648, 732, 769, 873, 877]
  • friedman-datasets-fri_c4_500_10 : [627, 641, 646, 654, 824, 869, 936, 943]
  • friedman-datasets-fri_c4_500_25 : [581, 582, 633, 643, 779, 879, 896, 926]
  • friedman-datasets-fri_c4_500_50 : [626, 637, 645, 650, 766, 888, 920, 937]
  • german-ida : [31, 1547]
  • germannumer : [1436, 1572]
  • germannumer_scale : [1436, 1572]
  • heart_scale : [53]
  • housing : [531, 853]
  • housing_scale : [531, 853]
  • iris : [1099, 1413]
  • mpg : [40700]
  • mpg_scale : [40700]
  • regression-datasets-2dplanes : [344, 564, 881, 901]
  • regression-datasets-ailerons : [296]
  • regression-datasets-auto_price : [207, 756]
  • regression-datasets-bank32nh : [308, 752]
  • regression-datasets-bank8fm : [189, 225, 807, 816]
  • regression-datasets-cal_housing : [537, 823]
  • regression-datasets-fried : [215, 344, 727, 881]
  • regression-datasets-housing : [531, 853]
  • regression-datasets-kin8nm : [225, 572, 725, 816]
  • regression-datasets-puma32h : [558, 833]
  • regression-datasets-puma8nh : [189, 572, 725, 807]
  • ringnorm-ida : [1507]
  • statlib-20050214-chatfield_4 : [695, 820]
  • statlib-20050214-chscase_census3: [670, 671, 672, 673, 906, 907, 908, 909]
  • statlib-20050214-chscase_census5: [670, 671, 672, 673, 906, 907, 908, 909]
  • statlib-20050214-chscase_geyser1: [712, 895]
  • statlib-20050214-csb_ch2 : [668, 692, 787, 874, 1096]
  • statlib-20050214-diggle_table_a1: [485, 693, 817, 835]
  • statlib-20050214-diggle_table_a2: [694, 818]
  • statlib-20050214-disclosure_z : [676, 699, 704, 709, 774, 795, 827, 931]
  • statlib-20050214-hutsof99_logis : [681, 804]
  • statlib-20050214-no2 : [522, 750, 40496]
  • statlib-20050214-pm10 : [547, 886, 40496]
  • statlib-20050214-prnn_synth : [464]
  • statlib-20050214-rabe_131 : [668, 692, 787, 874, 1096]
  • statlib-20050214-rabe_148 : [710, 894]
  • statlib-20050214-rabe_166 : [684, 919]
  • statlib-20050214-rabe_176 : [698, 929]
  • statlib-20050214-rabe_265 : [660, 780]
  • statlib-20050214-rabe_266 : [663, 782]
  • statlib-20050214-rabe_97 : [697, 928]
  • statlib-20050214-sleuth_case1202: [706, 891]
  • statlib-20050214-sleuth_case1501: [711, 946]
  • statlib-20050214-sleuth_case2002: [665, 902]
  • statlib-20050214-sleuth_ex1605 : [687, 755]
  • statlib-20050214-sleuth_ex1714 : [659, 777]
  • statlib-20050214-sleuth_ex2012 : [663, 782]
  • statlib-20050214-sleuth_ex2015 : [683, 864]
  • statlib-20050214-sleuth_ex2016 : [682, 862]
  • thyroid-ida : [40682]
  • twonorm-ida : [1496]
  • uci-20070111-2dplanes : [344, 564, 881, 901]
  • uci-20070111-ailerons : [296]
  • uci-20070111-autoprice : [195, 745]
  • uci-20070111-auto_price : [207, 756]
  • uci-20070111-bank32nh : [308, 752]
  • uci-20070111-bank8fm : [189, 225, 807, 816]
  • uci-20070111-cal_housing : [537, 823]
  • uci-20070111-fried : [215, 344, 727, 881]
  • uci-20070111-housing : [531, 853]
  • uci-20070111-kin8nm : [225, 572, 725, 816]
  • uci-20070111-mbagrade : [380]
  • uci-20070111-puma32h : [558, 833]
  • uci-20070111-puma8nh : [189, 572, 725, 807]
  • usps : [41082]
  • waveform-ida : [4551]

The following datasets do not match any of the above criteria:

None

MNIST is called mnist_784

It's weird for an sklearn user to have fetch_openml("MNIST") return "no dataset MNIST". There is an MNIST dataset on OpenML, but it is "in preparation".

[550] Update description Quake dataset

Quake has a poor dataset description: it is generic to the book it was taken from, and the provided link to the online book is dead. The actual dataset description in the book lists:

[image: dataset description excerpt from the book]

Many classification tasks seem to have numeric targets

Several datasets have numeric targets, but are clearly classification tasks (a non-exhaustive list):
https://www.openml.org/d/23513
https://www.openml.org/d/4532
https://www.openml.org/d/5587 (https://www.openml.org/d/5648, …)
https://www.openml.org/d/1575
https://www.openml.org/d/1577

and probably also https://www.openml.org/d/296

It should be easy to find most of them programmatically by looking at the number of unique values of the target variable. Of course one would have to be careful not to accidentally identify a dataset with ordinal discrete values (e.g. counts) as classification.
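The heuristic above can be sketched as a small helper. This is a minimal sketch, not an official OpenML tool: the function name, the `max_unique` threshold, and the consecutive-integer check are my own assumptions, chosen to flag the ambiguous ordinal-count case the issue warns about.

```python
import numpy as np

def looks_like_classification(y, max_unique=10):
    """Heuristically flag a numeric target as a likely class label.

    Returns (few_values, ordinal_like):
      few_values   -- True if the target has at most max_unique distinct
                      values, suggesting it encodes classes.
      ordinal_like -- True if those values are consecutive integers
                      (e.g. 0, 1, 2, ...), which could instead be ordinal
                      counts and therefore needs a human check.
    """
    values = np.unique(np.asarray(y, dtype=float))
    few_values = len(values) <= max_unique
    # Consecutive integers may be counts rather than class labels,
    # so report this signal separately instead of deciding outright.
    ordinal_like = bool(
        np.all(values == np.round(values))
        and np.all(np.diff(values) == 1)
    )
    return few_values, ordinal_like

# A 0/1-valued numeric target is flagged as few-valued, but also as
# consecutive integers, so it still warrants manual confirmation.
few, ordinal_like = looks_like_classification([0.0, 1.0, 0.0, 1.0, 1.0])
```

Running this over each dataset's target column (e.g. after downloading it with the OpenML client) would shortlist candidates for relabeling, leaving the ordinal-count cases for manual review.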

Data set pbc wrong target info + covariate problem

From @HeidiSeibold on March 13, 2017 11:46

Each version of the pbc data set (version 1: https://www.openml.org/d/200) has problems. Since I can't open issues on the website at the moment (see #106), I'll document them here.

  • For V1 the outcome is clearly right-censored, i.e. both X/class and D (the censoring indicator) are part of the target
  • In V2 the target should be categorical
  • V3 is something I don't think we can deal with right now in OpenML, since class and D would need to be combined into a single target and both are binary.

Due to these problems, some non-sensible tasks also exist.

Copied from original issue: openml/openml.org#109
