
datacleaner's Introduction


datacleaner

Join the chat at https://gitter.im/rhiever/datacleaner

A Python tool that automatically cleans data sets and readies them for analysis.

datacleaner is not magic

datacleaner works with data in pandas DataFrames.

datacleaner is not magic, and it won't take an unorganized blob of text and automagically parse it out for you.

What datacleaner will do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.

Currently, datacleaner does the following:

  • Optionally drops any row with a missing value

  • Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis

  • Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents

We plan to add more cleaning features as the project grows.
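As a rough illustration of the steps above, here is a minimal sketch of the per-column logic using plain pandas and scikit-learn. This is a simplified sketch, not datacleaner's actual implementation, and sketch_autoclean is a hypothetical name:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

def sketch_autoclean(df, drop_nans=False):
    df = df.copy()
    if drop_nans:
        # Optionally drop any row with a missing value
        df.dropna(inplace=True)
    for column in df.columns:
        if df[column].dtype == object:
            # Categorical column: impute missing values with the mode, then label-encode
            if df[column].isnull().any():
                df[column] = df[column].fillna(df[column].mode()[0])
            df[column] = LabelEncoder().fit_transform(df[column].astype(str))
        else:
            # Numerical column: impute missing values with the median
            if df[column].isnull().any():
                df[column] = df[column].fillna(df[column].median())
    return df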

License

Please see the repository license for the licensing and usage information for datacleaner.

Generally, we have licensed datacleaner to make it as widely usable as possible.

Installation

datacleaner is built to use pandas DataFrames and some scikit-learn modules for data preprocessing. As such, we recommend installing the Anaconda Python distribution prior to installing datacleaner.

Once the prerequisites are installed, datacleaner can be installed with a simple pip command:

pip install datacleaner

Usage

datacleaner on the command line

datacleaner can be used on the command line. Use --help to see its usage instructions.

usage: datacleaner [-h] [-cv CROSS_VAL_FILENAME] [-o OUTPUT_FILENAME]
                   [-cvo CV_OUTPUT_FILENAME] [-is INPUT_SEPARATOR]
                   [-os OUTPUT_SEPARATOR] [--drop-nans]
                   [--ignore-update-check] [--version]
                   INPUT_FILENAME

A Python tool that automatically cleans data sets and readies them for analysis

positional arguments:
  INPUT_FILENAME        File name of the data file to clean

optional arguments:
  -h, --help            show this help message and exit
  -cv CROSS_VAL_FILENAME
                        File name for the validation data set if performing
                        cross-validation
  -o OUTPUT_FILENAME    Data file to output the cleaned data set to
  -cvo CV_OUTPUT_FILENAME
                        Data file to output the cleaned cross-validation data
                        set to
  -is INPUT_SEPARATOR   Column separator for the input file(s) (default: \t)
  -os OUTPUT_SEPARATOR  Column separator for the output file(s) (default: \t)
  --drop-nans           Drop all rows that have a NaN in any column (default: False)
  --ignore-update-check
                        Do not check for the latest version of datacleaner
                        (default: False)
  --version             show program's version number and exit

An example command-line call to datacleaner may look like:

datacleaner my_data.csv -o my_clean.data.csv -is , -os ,

which will read the data from my_data.csv (assuming columns are separated by commas), clean the data set, then output the resulting data set to my_clean.data.csv.

datacleaner in scripts

datacleaner can also be used as part of a script. There are two primary functions implemented in datacleaner: autoclean and autoclean_cv.

autoclean(input_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)
    Performs a series of automated data cleaning transformations on the provided data set
    
    Parameters
    ----------
    input_dataframe: pandas.DataFrame
        Data set to clean
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False) 
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed the inferred list of categorical columns (default: None, which falls back to LabelEncoder)
    encoder_kwargs: dict
        Keyword arguments to pass to the encoder when it is constructed (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    ----------
    output_dataframe: pandas.DataFrame
        Cleaned data set
autoclean_cv(training_dataframe, testing_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)
    Performs a series of automated data cleaning transformations on the provided training and testing data sets
    
    Unlike `autoclean()`, this function takes cross-validation into account by learning the data transformations
    from only the training set, then applying those transformations to both the training and testing set.
    By doing so, this function will prevent information leak from the training set into the testing set.
    
    Parameters
    ----------
    training_dataframe: pandas.DataFrame
        Training data set
    testing_dataframe: pandas.DataFrame
        Testing data set
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False)  
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed the inferred list of categorical columns (default: None, which falls back to LabelEncoder)
    encoder_kwargs: dict
        Keyword arguments to pass to the encoder when it is constructed (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    ----------
    output_training_dataframe: pandas.DataFrame
        Cleaned training data set
    output_testing_dataframe: pandas.DataFrame
        Cleaned testing data set

Below is an example of datacleaner performing basic cleaning on a data set.

from datacleaner import autoclean
import pandas as pd

my_data = pd.read_csv('my_data.csv', sep=',')
my_clean_data = autoclean(my_data)
my_clean_data.to_csv('my_clean_data.csv', sep=',', index=False)

Note that because datacleaner works directly on pandas DataFrames, all DataFrame operations are still available to the resulting data sets.
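autoclean_cv is used the same way for train/test splits. Below is a minimal sketch, where training.csv and testing.csv are placeholder file names:

from datacleaner import autoclean_cv
import pandas as pd

training_data = pd.read_csv('training.csv', sep=',')
testing_data = pd.read_csv('testing.csv', sep=',')

# Transformations are learned on the training set only, then applied to both
clean_training_data, clean_testing_data = autoclean_cv(training_data, testing_data)

clean_training_data.to_csv('clean_training_data.csv', sep=',', index=False)
clean_testing_data.to_csv('clean_testing_data.csv', sep=',', index=False)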

Contributing to datacleaner

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to datacleaner, please file a new issue so we can discuss it.

Citing datacleaner

If you use datacleaner as part of your workflow in a scientific publication, please consider citing the datacleaner repository with the following DOI:

DOI

datacleaner's People

Contributors

fndjjx, gitter-badger, rhiever


datacleaner's Issues

Automatically cleaning unicode text

Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. all those invalid unicode characters). Could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.
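A minimal sketch of what such a step could look like, assuming ftfy is installed (fix_text_columns is a hypothetical helper, not part of datacleaner):

import ftfy
import pandas as pd

def fix_text_columns(df):
    # Run ftfy.fix_text over every string value in the object columns
    df = df.copy()
    for column in df.select_dtypes(include=[object]).columns:
        df[column] = df[column].apply(
            lambda value: ftfy.fix_text(value) if isinstance(value, str) else value
        )
    return df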

Integrate unit tests

Test both autoclean() and autoclean_cv(), each with 5 test cases:

  1. Simulated data, no NaNs, all columns numerical

  2. Simulated data, with NaNs, all columns numerical

  3. Simulated data, no NaNs, some columns with strings

  4. Simulated data, with NaNs, some columns with strings

  5. Real data (adult.csv.gz) with some NaNs placed into it
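A sketch of what case 1 might look like as a pytest-style test, assuming that numerical, NaN-free data passes through autoclean unchanged:

import numpy as np
import pandas as pd
from datacleaner import autoclean

def test_autoclean_numerical_no_nans():
    # Case 1: simulated data, no NaNs, all columns numerical
    np.random.seed(42)
    data = pd.DataFrame({'A': np.random.rand(100),
                         'B': np.random.rand(100),
                         'C': np.random.randint(0, 3, 100)})
    cleaned = autoclean(data, copy=True)
    # With no NaNs and no string columns, the data should come back unchanged
    assert cleaned.equals(data)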

CI/CD doesn't work


Context of the issue

CI/CD doesn't work at all.

Process to reproduce the issue

I suggest editing travis.yml so it does not use the virtual environment. I tested this in my own repo and it worked.

Expected result

Editing travis.yml to not use the virtual environment should make CI pass.


Planned functionality

In the immediate future, datacleaner will:

  • Encode all non-numerical variables as numerical variables
  • Replace all NaNs with the median of the column or drop all NaN rows (configurable)

See this tweet chain for more ideas.

If anyone has more ideas, please add them here.

Integrate more encoding options for object columns

It would be nice to be able to pass in an encoding type, to use something more than the default label encoding. I have a library, category_encoders, which does that, and it could easily be added with one extra flag (suggested: -en for the encoder).

I have a not-yet-tested implementation of this at:

https://github.com/wdm0006/datacleaner

which simply carries over the available encoders:

  • backward difference
  • binary
  • hashing
  • helmert
  • one hot (pass through to scikit-learn)
  • ordinal (should be the same as label encoding)
  • polynomial
  • sum coding

A deeper look into the differences between these can be found here and here.

Let me know if you think this fits into your project, or if there are any changes I should make to my implementation or the library; I can work on those and send a PR.
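For illustration, usage could look something like the sketch below, assuming the encoder argument accepts a category_encoders class (as the docstring above suggests) and that datacleaner instantiates it with the inferred list of object columns:

import category_encoders as ce
import pandas as pd
from datacleaner import autoclean

df = pd.read_csv('my_data.csv', sep=',')

# encoder=ce.BinaryEncoder is an assumption about the proposed API, not a confirmed interface
clean_df = autoclean(df, copy=True, encoder=ce.BinaryEncoder)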

ValueError instead of TypeError in Python 2.7

The try/except block starting at line 76 of datacleaner.py raises a ValueError in Python 2.7 when the column is of type object (string). Since the Python 2.7 badge is displayed in the repo README, can you clarify which Python versions are supported?

Feature: %string to numerical value conversion

Some datasets store percentage values as strings, e.g. '95%', '82%', etc.

It would be great if this could be handled automatically. On a pandas DataFrame it can be done with:

df = df.replace('%','',regex=True).astype('float')
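A more defensive, column-wise sketch (convert_percent_columns is a hypothetical helper), which only converts columns whose non-missing values all parse as numbers after stripping '%':

import pandas as pd

def convert_percent_columns(df):
    df = df.copy()
    for column in df.select_dtypes(include=[object]).columns:
        converted = pd.to_numeric(df[column].str.rstrip('%'), errors='coerce')
        # Only replace the column if every non-missing value parsed cleanly
        if converted.notna().sum() == df[column].notna().sum():
            df[column] = converted
    return df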

Index out of bounds error when a column has all distinct values

Hi,
I found an issue in datacleaner. When I used this tool on my dataset, it generated an index out of bounds error. I checked the code and found this line in the autoclean function:
input_dataframe[column].fillna(input_dataframe[column].mode()[0], inplace=True)

When a column has no repeated value, mode() returns an empty result, so indexing it at [0] goes out of bounds.
I think this is the cause; could you confirm it? Thank you!
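One possible guard (a sketch, not the project's actual fix) is to impute only when mode() returns at least one value; input_dataframe and column are the names from the line quoted above:

column_mode = input_dataframe[column].mode()
if len(column_mode) > 0:
    input_dataframe[column].fillna(column_mode[0], inplace=True)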

Replace +/- Infs with Max/Min

Hi there,

datacleaner seems quite interesting. Cleaning data is always annoying, and tools for it are scarce.

If I have seen it right, you impute NaNs. You could also consider replacing +/- infs with the max/min of the respective column.

We have implemented that in the tsfresh impute function. Maybe you can reuse some of the code there.
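A minimal sketch of that idea in pandas/NumPy (replace_infs_with_extremes is a hypothetical helper, independent of the tsfresh code mentioned above):

import numpy as np
import pandas as pd

def replace_infs_with_extremes(df):
    df = df.copy()
    for column in df.select_dtypes(include=[np.number]).columns:
        finite = df[column].replace([np.inf, -np.inf], np.nan)
        # Replace +inf with the column's finite max and -inf with its finite min
        df[column] = df[column].replace(np.inf, finite.max())
        df[column] = df[column].replace(-np.inf, finite.min())
    return df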

'<' not supported between instances of 'str' and 'int'

When running this script:
my_data = pd.read_csv('test2.csv', sep=',',encoding='utf-8')
my_clean_data = autoclean(my_data)
my_data.to_csv('my_clean_data.csv')

I get the error:
'<' not supported between instances of 'str' and 'int'
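One possible workaround, assuming the error comes from a column that mixes strings and numbers during label encoding (this is a guess about the cause, not a confirmed fix):

import pandas as pd
from datacleaner import autoclean

my_data = pd.read_csv('test2.csv', sep=',', encoding='utf-8')

# Cast object columns to str so values of mixed types can be compared/sorted.
# Note: this also turns NaN into the string 'nan', which changes how missing
# values are imputed, so treat it as a diagnostic workaround only.
object_columns = my_data.select_dtypes(include=[object]).columns
my_data[object_columns] = my_data[object_columns].astype(str)

my_clean_data = autoclean(my_data)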
