Comments (5)
Sounds like this is a bug. Would you be willing to write a patch for it?
from datacleaner.
Hi! Mind if I take a shot at this?
Please do! Probably the best starting point is to write a minimal example that reproduces the error; that example can then stand as our first unit test for this patch.
So I took a look at this today and was having trouble reproducing the error. I've included my test script here. Maybe I'm misinterpreting the issue described above?
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder


def autoclean(input_dataframe, drop_nans=False, copy=False, encoder=None,
              encoder_kwargs=None, ignore_update_check=False):
    """Performs a series of automated data cleaning transformations on the provided data set

    Parameters
    ----------
    input_dataframe: pandas.DataFrame
        Data set to clean
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False)
    encoder: category_encoders transformer
        A valid category_encoders transformer class, instantiated with
        encoder_kwargs and fit on each object column (default: None, which uses LabelEncoder)
    encoder_kwargs: dict
        Keyword arguments passed to the encoder's constructor (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    -------
    output_dataframe: pandas.DataFrame
        Cleaned data set

    """
    # Update check commented out for this standalone test script:
    # global update_checked
    # if ignore_update_check:
    #     update_checked = True
    # if not update_checked:
    #     update_check('datacleaner', __version__)
    #     update_checked = True

    if copy:
        input_dataframe = input_dataframe.copy()

    if drop_nans:
        input_dataframe.dropna(inplace=True)

    if encoder_kwargs is None:
        encoder_kwargs = {}

    for column in input_dataframe.columns.values:
        # Replace NaNs with the median or mode of the column depending on the column type
        try:
            print('hit try block')
            input_dataframe[column].fillna(input_dataframe[column].median(), inplace=True)
        except TypeError:
            print('caught type error')
            most_frequent = input_dataframe[column].mode()
            # If the mode can't be computed, use the nearest valid value
            # See https://github.com/rhiever/datacleaner/issues/8
            if len(most_frequent) > 0:
                input_dataframe[column].fillna(input_dataframe[column].mode()[0], inplace=True)
            else:
                input_dataframe[column].fillna(method='bfill', inplace=True)
                input_dataframe[column].fillna(method='ffill', inplace=True)

        # Encode all strings with numerical equivalents
        if str(input_dataframe[column].values.dtype) == 'object':
            if encoder is not None:
                column_encoder = encoder(**encoder_kwargs).fit(input_dataframe[column].values)
            else:
                column_encoder = LabelEncoder().fit(input_dataframe[column].values)

            input_dataframe[column] = column_encoder.transform(input_dataframe[column].values)

    return input_dataframe


def test_type_error():
    # Two string columns with missing values, so median() should raise TypeError
    d = {'A': ['a', np.nan, 'c'], 'B': [np.nan, 'e', np.nan]}
    df = pd.DataFrame(data=d)
    print(df)
    print(df['A'].dtypes)
    cleaned_data = autoclean(df)
    print(cleaned_data)


def main():
    test_type_error()


if __name__ == '__main__':
    main()
which outputs:
     A    B
0    a  NaN
1  NaN    e
2    c  NaN
object
hit try block
caught type error
hit try block
caught type error
   A  B
0  0  0
1  0  0
2  1  0
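The script above only prints its results. One way to turn it into the kind of unit test suggested earlier in the thread is to assert on the filled values directly. What follows is a minimal sketch, not part of datacleaner itself: `fill_missing` is a hypothetical helper that mirrors only the median-then-mode fallback from `autoclean` (it returns a new Series instead of filling in place, and it skips the encoding step), and the expected values assume that `mode()` returns tied values in sorted order.

```python
import numpy as np
import pandas as pd


def fill_missing(series):
    """Hypothetical helper: fill NaNs with the median, or with the mode for non-numeric data."""
    try:
        return series.fillna(series.median())
    except TypeError:
        # median() is not defined for string data, so fall back to the mode
        most_frequent = series.mode()
        if len(most_frequent) > 0:
            return series.fillna(most_frequent.iloc[0])
        # No mode available: use the nearest valid value instead
        return series.bfill().ffill()


def test_type_error():
    df = pd.DataFrame({'A': ['a', np.nan, 'c'], 'B': [np.nan, 'e', np.nan]})

    # 'a' and 'c' tie for the mode of column A; the first (sorted) value is used
    assert fill_missing(df['A']).tolist() == ['a', 'a', 'c']
    # 'e' is the only non-missing value in column B
    assert fill_missing(df['B']).tolist() == ['e', 'e', 'e']
    # Numeric columns never reach the except branch and are filled with the median
    assert fill_missing(pd.Series([1.0, np.nan, 3.0])).tolist() == [1.0, 2.0, 3.0]


if __name__ == '__main__':
    test_type_error()
```

Saved in a file whose name starts with `test_`, the function would also be picked up by pytest because of its `test_` prefix.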
ping @dburns7