
CleanML

This is the CleanML Benchmark for Joint Data Cleaning and Machine Learning.

The details of the benchmark methodology and design are described in the paper: CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

Basic Usage

Run Experiments

To run experiments, download and unzip the datasets, place them under the project home directory, and execute the following command from the project home directory:

python3 main.py --run_experiments [--dataset <name>] [--cpu <num_cpu>] [--log]

Options:

--dataset: the dataset to run experiments on. If not specified, the program runs experiments on all datasets.
--cpu: the number of CPUs to use. Default is 1.
--log: whether to log the experiment progress.
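
For example, to run experiments on a single dataset with four CPUs and logging enabled (the dataset name here is illustrative; it must match a folder under /data):

python3 main.py --run_experiments --dataset Titanic --cpu 4 --log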

Output:

The experimental results for each dataset will be saved in the /result directory as a JSON file named <dataset name>_result.json. Each result is a key-value pair. The key is a string of the form "<dataset>/<split seed>/<error type>/<clean method>/<ML model>/<random search seed>". The value is a set of key-value pairs mapping each evaluation metric to its result. Our experimental results are provided in result.zip.
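
As a sketch of how a result file can be inspected (assuming a result file produced for a hypothetical dataset named Titanic), the key can be split on "/" to recover the experiment coordinates:

import json

# Load the result file for one dataset (the path follows the naming convention above)
with open("result/Titanic_result.json") as f:
    results = json.load(f)

# Each key encodes the experiment coordinates; split on "/" to recover them
for key, metrics in results.items():
    dataset, split_seed, error_type, clean_method, model, search_seed = key.split("/")
    print(error_type, clean_method, model, metrics)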

Run Analysis

To run the analysis that populates the relations described in the paper, unzip result.zip and execute the following command from the project home directory:

python3 main.py --run_analysis [--alpha <value>]

Options:

--alpha: the significance level for the multiple hypothesis tests. Default is 0.05.
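
For example, to run the analysis with a stricter significance level:

python3 main.py --run_analysis --alpha 0.01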

Output:

The relations R1, R2, and R3 will be saved in the /analysis directory. Our analysis results are provided in analysis.zip.

Extend Domains of Attributes

Add new datasets:

To add a new dataset, first create a folder named after the dataset under /data and create a raw folder inside it. The raw folder must contain the raw data, named raw.csv. For a dataset with inconsistencies, it must also contain the inconsistency-cleaned version of the data, named inconsistency_clean_raw.csv. For a dataset with mislabels, it must also contain the mislabel-cleaned version, named mislabel_clean_raw.csv. The directory structure looks like:

.
└── data
    └── new_dataset
        └── raw
            ├── raw.csv
            ├── inconsistency_clean_raw.csv (for dataset with inconsistencies)
            └── mislabel_clean_raw.csv (for dataset with mislabels)

Then add a dictionary to /schema/dataset.py and append it to the datasets array at the end of the file; a sketch of such an entry follows the key lists below.

The new dictionary must contain the following keys:

data_dir: the name of the dataset.
error_types: a list of error types that the dataset contains.
label: the label of the ML task.

The following keys are optional:

class_imbalance: whether the dataset is class imbalanced.
categorical_variables: a list of categorical attributes.
text_variables: a list of text attributes.
key_columns: a list of key columns used for deduplication.
drop_variables: a list of irrelevant attributes.
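
A minimal sketch of such an entry (the dataset name, error type names, and column names are placeholders for illustration; error type names must match the name fields in /schema/error_type.py):

new_dataset = {
    # required keys
    "data_dir": "new_dataset",
    "error_types": ["missing_values", "outliers"],
    "label": "target",
    # optional keys
    "categorical_variables": ["city"],
    "drop_variables": ["id"],
}
datasets.append(new_dataset)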

Add new error types:

To add a new error type, add a dictionary to /schema/error_type.py and append it to the error_types array at the end of the file; a sketch of such an entry follows the key list below.

The new dictionary must contain the following keys:

name: the name of the error type.
cleaning_methods: a dictionary mapping each cleaning method name to a cleaning method object.
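
A minimal sketch of such an entry (the error type name and the cleaning method classes are placeholders; the objects are instances of classes defined in /schema/cleaning_method.py):

new_error_type = {
    "name": "new_error_type",
    "cleaning_methods": {
        "delete": DeleteRecords(),        # hypothetical cleaning method object
        "impute_mean": MeanImputation(),  # hypothetical cleaning method object
    },
}
error_types.append(new_error_type)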

Add new models:

To add a new ML model, add a dictionary to /schema/model.py and append it to the models array at the end of the file; a sketch of such an entry follows the key list below.

The new dictionary must contain the following keys:

name: the name of the model.
fn: the function that constructs the model.
fixed_params: parameters that are not tuned.
hyperparams: the hyperparameter to be tuned.
hyperparams_type: the type of the hyperparameter, "real" or "int".
hyperparams_range: the search range. For real-valued hyperparameters, specify the range in log scale.
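
A minimal sketch of such an entry, assuming a scikit-learn classifier (the field values are illustrative; in particular, the log-scale range is an assumption about how hyperparams_range is interpreted):

from sklearn.linear_model import LogisticRegression

logistic_regression = {
    "name": "logistic_regression",
    "fn": LogisticRegression,            # the model's constructor
    "fixed_params": {"max_iter": 1000},  # parameters that are not tuned
    "hyperparams": "C",                  # the hyperparameter to tune
    "hyperparams_type": "real",
    "hyperparams_range": [-5, 5],        # log10 range, i.e. C in [1e-5, 1e5]
}
models.append(logistic_regression)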

Add new cleaning methods:

To add a new cleaning method, add a class to /schema/cleaning_method.py.

The class must implement two methods (a skeleton follows the list below):

fit(dataset, dirty_train): takes the dataset dictionary and the dirty training set. Computes statistics or trains models on the training set for data cleaning.
clean(dirty_train, dirty_test): takes the dirty training set and the dirty test set. Cleans the errors in both sets and returns (clean_train, indicator_train, clean_test, indicator_test), the cleaned versions of the datasets together with indicators that mark the locations of errors.
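
A minimal skeleton of such a class (the mean-imputation logic is only an illustration of the expected interface, assuming the training and test sets are pandas DataFrames):

class MeanImputation:
    """Hypothetical cleaning method: impute missing numeric values with training-set means."""

    def fit(self, dataset, dirty_train):
        # Compute per-column means on the dirty training set
        self.means = dirty_train.mean(numeric_only=True)

    def clean(self, dirty_train, dirty_test):
        # Indicators mark where errors (here, missing values) are located
        indicator_train = dirty_train.isnull()
        indicator_test = dirty_test.isnull()
        clean_train = dirty_train.fillna(self.means)
        clean_test = dirty_test.fillna(self.means)
        return clean_train, indicator_train, clean_test, indicator_test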

Add new scenarios:

We consider "BD" and "CD" scenarios in our paper. To investigate other scenarios, add scenarios to /schema/scenario.py.


Issues

Severe bug which randomly changes the assignment between cleaning techniques and result metrics in the output

Hi,

First of all: thanks for creating CleanML, it's a super interesting experimentation framework, and we are using it for our ongoing research on the connection between data quality and fairness.

Unfortunately, we found a severe bug in CleanML, which should be fixed as soon as possible, as it potentially renders a lot of the results computed with CleanML unreliable.

We noticed that we got strange results when running the same experiment twice. The screenshot below shows the outputs from two subsequent runs of the same experiment. The actual accuracy/F1 numbers are the same, but the order and assignment of the keys (the cleaning techniques) are different.

[screenshot: outputs from two subsequent runs, with identical metric values but differently ordered keys]

This comes from a bug which may sometimes randomly reorder the list of cleaning techniques (but not the corresponding cleaned test sets...). The issue is caused by the following line of code:

test_files = list(set(test_files).difference(set(skip_test_files)))

We think that this code is intended to filter the test_files list; however, it converts the list to a set to do so. Sets have no order, so converting a list to a set and back means that the order of the contained items can change randomly.

We confirmed that this happens by printing the list before and after this line:
afbeelding

The subsequent call

result = train_and_evaluate(X_train, y_train, X_test_list, y_test_list, test_files, model, n_jobs=n_jobs, seed=train_seed, hyperparams=hyperparams)

implicitly assumes that the X_test_list, y_test_list, and test_files lists are correctly aligned; however, the last list has been randomly reordered, which means that the cleaning results are also randomly reordered.

We fixed this bug in our fork (and will now rerun several thousand experiments), and we wanted to point it out here as well so that you can fix it in the original code.
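
A sketch of one possible order-preserving fix (not necessarily the one applied in the fork) replaces the set difference with a list comprehension:

# Original line (destroys the list order):
# test_files = list(set(test_files).difference(set(skip_test_files)))

# Order-preserving alternative:
skip = set(skip_test_files)
test_files = [f for f in test_files if f not in skip]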
