rhiever / datacleaner

A Python tool that automatically cleans data sets and readies them for analysis.

License: MIT License

Python 90.04% Shell 9.96%

python data-science machine-learning automation

datacleaner's Issues

Integrate more encoding options for object columns

It would be nice to be able to pass in an encoding type, so that something other than the default label encoding can be used. I have a library, category encoders, that does this, and it could easily be added with one extra flag (suggested: -en for encoder).

I have a not-yet-tested implementation of this at:

https://github.com/wdm0006/datacleaner

Which just carries over the available encoders:

backward difference
binary
hashing
helmert
one hot (pass through to scikit-learn)
ordinal (should be the same as label encoding)
polynomial
sum coding

A deeper look into the differences between these can be found here and here.

Let me know if you think that fits into your project, or if there is any change I can make to my implementation or the library, I can work on those and send a PR.
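To make the difference between two of the listed schemes concrete, here is a minimal sketch contrasting ordinal (label-style) encoding with one-hot encoding. It uses only pandas, not the category encoders library or the fork linked above, and the toy `color` column is an assumption made for illustration.

```python
import pandas as pd

# Toy column standing in for an object (string) column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Ordinal / label-style encoding: one integer code per category,
# assigned in sorted category order.
codes, categories = pd.factorize(df["color"], sort=True)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")
```

Ordinal encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that at the cost of one extra column per category.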

Planned functionality

In the immediate future, datacleaner will:

Encode all non-numerical variables as numerical variables
Replace all NaNs with the median of the column or drop all NaN rows (configurable)
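The two planned steps above could look roughly like the following sketch. It is not datacleaner's actual code: `basic_clean` and its `drop_nans` parameter are hypothetical names, and giving missing values in string columns the code -1 is an assumption of this example.

```python
import numpy as np
import pandas as pd


def basic_clean(df, drop_nans=False):
    """Hypothetical helper illustrating the two planned steps."""
    df = df.copy()
    if drop_nans:
        # Configurable alternative: drop every row that contains a NaN.
        df = df.dropna()
    for col in list(df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical column: fill gaps with the column median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Non-numerical column: map each distinct value to an integer.
            # Assumption: pd.factorize marks missing values with -1.
            codes, _ = pd.factorize(df[col], sort=True)
            df[col] = codes
    return df


raw = pd.DataFrame({"age": [20.0, np.nan, 40.0], "city": ["b", "a", "b"]})
cleaned = basic_clean(raw)
```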

See this tweet chain for more ideas.

If anyone has more ideas, please add them here.

CI/CD doesn't work


Context of the issue

CI/CD doesn't work at all.

Process to reproduce the issue

I suggest editing travis.yml so that it runs without virtual.
I tested this in my own repo and the build succeeded.

Expected result

edit travis.yml without virtual


Automatically cleaning unicode text

Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. all those invalid unicode characters). Could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.

Add update_checker

Send PR to update @bboe's server.

Package: https://github.com/bboe/update_checker

Add easy way to write out feature-to-categorical mapping.

Add an easy-to-use handle that saves the mapping between feature values and their categorical labels.
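One possible shape for such a mapping, sketched with pandas and the standard library only. The `mappings` dictionary and the JSON output are assumptions for illustration, not an existing datacleaner feature.

```python
import json

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"], "size": [1, 2, 3, 2]})

mappings = {}
for col in list(df.columns):
    if pd.api.types.is_numeric_dtype(df[col]):
        continue  # numerical columns are left untouched
    codes, categories = pd.factorize(df[col], sort=True)
    df[col] = codes
    # Record which original value each integer label stands for.
    mappings[col] = {str(cat): int(code) for code, cat in enumerate(categories)}

# Serialise the mapping so it can be written next to the cleaned data.
mapping_json = json.dumps(mappings, indent=2, sort_keys=True)
```

Writing `mapping_json` to a file alongside the cleaned data set would let a user translate the integer labels back to the original values later.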

Replace +/- Infs with Max/Min

Hi there,

datacleaner looks quite interesting. Cleaning data is always tedious, and good tools for it are scarce.

If I have read the code correctly, you impute NaNs. You could also consider replacing +/- Infs with the max/min of the respective column.

We have implemented that in the tsfresh impute function. Maybe you can use some of the code there.
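A minimal sketch of the suggested replacement, written with pandas and NumPy. It is not the tsfresh implementation; `replace_infs` is a hypothetical helper name, and skipping columns with no finite values is an assumption of this example.

```python
import numpy as np
import pandas as pd


def replace_infs(df):
    """Hypothetical helper: swap +inf/-inf for the column's finite max/min."""
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        values = df[col]
        finite = values[np.isfinite(values)]
        if finite.empty:
            continue  # nothing finite to borrow a max/min from
        values = values.mask(values == np.inf, finite.max())
        values = values.mask(values == -np.inf, finite.min())
        df[col] = values
    return df


data = pd.DataFrame({"x": [1.0, np.inf, -np.inf, 5.0, np.nan]})
fixed = replace_infs(data)
```

NaNs are left alone here, so this step can run before or after whatever NaN imputation is in place.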

Index out of bounds error when a column has all different values

Hi,
I found an issue in datacleaner. When I use this tool on my dataset, it raises an index out of bounds error. I checked the code and found this line in the function autoclean:

input_dataframe[column].fillna(input_dataframe[column].mode()[0], inplace=True)

When a column has no repeated values, mode() returns an empty result, so indexing it with [0] goes out of bounds.
I think this is the cause. Could you confirm it? Thank you!
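A defensive version of the fill step might check for an empty mode before indexing, as in the sketch below. `fill_with_mode` and its `fallback` parameter are hypothetical names, not part of datacleaner. Whether mode() comes back empty for an all-distinct column depends on the pandas version, but an all-NaN column triggers the same empty result.

```python
import numpy as np
import pandas as pd


def fill_with_mode(series, fallback=None):
    """Hypothetical helper: fill NaNs with the mode, guarding an empty result."""
    modes = series.mode()
    if modes.empty:
        # No mode available (for example an all-NaN column): use the fallback
        # if one was given, otherwise return the column unchanged.
        return series if fallback is None else series.fillna(fallback)
    return series.fillna(modes.iloc[0])


filled = fill_with_mode(pd.Series(["a", "b", "b", np.nan]))
all_missing = fill_with_mode(pd.Series([np.nan, np.nan]), fallback=0)
```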

ValueError instead of TypeError in Python 2.7

The try except block starting at line 76 of datacleaner.py raises a ValueError in Python 2.7 when the column is of type object (string). Since the Python 2.7 icon is displayed in the repo markdown, can you clarify which Python version is supported?

'<' not supported between instances of 'str' and 'int'

When running this script:
import pandas as pd
from datacleaner import autoclean

my_data = pd.read_csv('test2.csv', sep=',', encoding='utf-8')
my_clean_data = autoclean(my_data)
my_data.to_csv('my_clean_data.csv')

I get this error:
'<' not supported between instances of 'str' and 'int'
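The report does not say which column triggers this, but one plausible cause is a column that mixes strings and integers, since Python 3 refuses to order those against each other. The sketch below reproduces that kind of failure on a made-up column and shows one workaround, under the assumption that treating every value as text is acceptable.

```python
import pandas as pd

# Made-up column mixing strings and integers.
mixed = pd.Series(["low", 3, "high", 7], dtype=object)

try:
    sorted(mixed)  # comparing str with int raises TypeError in Python 3
    sort_failed = False
except TypeError:
    sort_failed = True

# Assumption: every value in this column can be treated as text.
as_text = mixed.astype(str)
ordered = sorted(as_text)
```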

Integrate unit tests

Test both autoclean() and autoclean_cv(), each with 5 test cases:

Simulated data, no NaNs, all columns numerical
Simulated data, with NaNs, all columns numerical
Simulated data, no NaNs, some columns with strings
Simulated data, with NaNs, some columns with strings
Real data (adult.csv.gz) with some NaNs placed into it
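The four simulated cases above could be produced by a single generator, sketched here. `make_simulated_data` and its parameters are hypothetical, and the column names, row count, and 10% missing rate are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd


def make_simulated_data(n_rows=50, with_nans=False, with_strings=False, seed=0):
    """Hypothetical generator for the four simulated test cases."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "a": rng.normal(size=n_rows),
        "b": rng.normal(size=n_rows),
    })
    if with_strings:
        df["c"] = rng.choice(["x", "y", "z"], size=n_rows)
    if with_nans:
        # Blank out roughly 10% of the cells at random.
        mask = rng.random(df.shape) < 0.1
        mask[0, 0] = True  # guarantee at least one missing value
        df = df.mask(mask)
    return df


# One frame per (with_nans, with_strings) combination.
cases = {
    (nans, strings): make_simulated_data(with_nans=nans, with_strings=strings)
    for nans in (False, True)
    for strings in (False, True)
}
```

Each frame could then be passed to autoclean() and autoclean_cv(), with assertions that the output has no NaNs and only numerical columns.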

Add scikit-learn compatibility to datacleaner

Write a wrapper for datacleaner that allows it to act as a scikit-learn transformer. See the scikit-learn docs for information on the transformer API.

Feature: %string to numerical value conversion

Some datasets store percentage values as strings, e.g. '95%', '82%', etc.

It would be great if this could be handled automatically. On a pandas DataFrame this can be done with

df = df.replace('%','',regex=True).astype('float')
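The one-liner above assumes every column can be cast to float. A more cautious variant, sketched below, converts only the columns whose non-missing values all look like percentages. `convert_percent_columns` is a hypothetical helper, and the regular expression is an assumption about what counts as a percentage string.

```python
import pandas as pd


def convert_percent_columns(df):
    """Hypothetical helper: turn columns of 'NN%' strings into floats."""
    df = df.copy()
    for col in list(df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        non_null = df[col].dropna().astype(str)
        # Only convert when every non-missing value looks like a percentage.
        if len(non_null) and non_null.str.fullmatch(r"\s*-?\d+(?:\.\d+)?\s*%\s*").all():
            stripped = df[col].astype(str).str.replace("%", "", regex=False).str.strip()
            df[col] = pd.to_numeric(stripped, errors="coerce")
    return df


frame = pd.DataFrame({"score": ["95%", "82%"], "name": ["ann", "bob"]})
converted = convert_percent_columns(frame)
```

As in the original one-liner, '95%' becomes 95.0 rather than 0.95.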
