pyjanitor-devs / pyjanitor
Clean APIs for data cleaning. Python implementation of the R package janitor
Home Page: https://pyjanitor-devs.github.io/pyjanitor
License: MIT License
For example, after aggregation with multiple functions. For a `df` with columns `['group', 'category', 'value']`:
stats_df = (
    df.groupby(['group', 'category'])
    .agg(['mean', 'median'])
    .reset_index()
)
This produces a `stats_df` whose `.columns` attribute is a `MultiIndex`, where `{'mean', 'median'}` are second-level column names under `value`. It would be nice if `.columns` were just an `Index` instead, for some applications.
Now, to flatten the `MultiIndex` into an `Index` by concatenating the different levels with an underscore:
`stats_df.columns.values` is `array([('group', ''), ('category', ''), ('value', 'mean'), ('value', 'median')], dtype=object)`
stats_df.columns = ['_'.join(tup) if tup[1] != '' else tup[0] for tup in stats_df.columns.values]
`stats_df.columns` is now `['group', 'category', 'value_mean', 'value_median']`
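The whole flattening recipe above can be run end-to-end; a minimal, self-contained version (the toy data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "category": ["x", "y", "x", "y"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

stats_df = (
    df.groupby(["group", "category"])
    .agg(["mean", "median"])
    .reset_index()
)

# Join the two levels with '_', leaving single-level names untouched.
stats_df.columns = [
    "_".join(tup) if tup[1] != "" else tup[0]
    for tup in stats_df.columns.values
]

print(list(stats_df.columns))  # → ['group', 'category', 'value_mean', 'value_median']
```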
Is it possible to "fix" pandas methods using pyjanitor? For example, I would like to validate the parameters for `read_excel` or `read_csv`. I have raised an issue, but there has been no progress.
If janitor accepts a PR to override the default behavior of the read methods, that would be great.
MultiIndex columns raise an "expected str, not tuple" `TypeError`.
If you instead used
df.rename(columns=lambda x: x.lower().replace('_', ''))
this would work for both standard and MultiIndex DataFrames.
Can do a PR if required.
Is this possible with janitor?
df.clean_names(inplace=True)
@zbarry @szuckerman, I would like to invite you to participate in the pyjanitor software manuscript that I am writing.
I am writing it in a branch off master: https://github.com/ericmjl/pyjanitor/blob/whitepaper/paper/manuscript.md
At the moment, I am seeking out input on:
If you would like to participate, please put in a PR against the `whitepaper` branch and add your name!
From a previous pull request, the issue of namespaces arose. I wanted to open this issue to discuss various new namespaces possible for the module.
It appears that the R version is having this issue as well.
There was discussion of a `finance` submodule, which sounds good, but I don't work in finance and would be unfamiliar with many of the items that would need to be included.
I think that a `summary` submodule, or something like that, would be a good place to add `tabyl` or other summary statistics.
Thoughts?
I just tried to run the test suite and had some failures because I don't have some of the Chemistry packages installed. I realized that those packages are only available through conda.
I don't use conda and am finding out that it's not so easy to just "install" conda packages into a previously created virtual environment (I mean, I'll figure it out eventually, but this is just at first glance).
In any event, I think it raises an interesting question: does this create issues for people who don't use conda? Meaning, if I try to use something from a submodule but don't have a dependency like `rdkit` installed, will it tell me to remedy with a `conda install` that's not going to work?
I'm not really sure what the answer is, but I could see either putting functions that rely on conda dependencies in their own package for conda, or just keeping it how it is and let people deal with it (assuming that people using these modules most likely already have conda installed).
Just wanted to throw this out there before the submodules get bigger.
Due to pinning of the sklearn version, it fails on 3.7. Need to change the dep for sklearn in `requirements.txt` from `==` to `>=`.
Example usage:
df = (
    pd.read_csv('blah.csv')  # containing ['col1', 'col2']
    .add_column('col3', 12345)
    .reorder_columns(['col3'])
)
Columns not specified retain their order and follow after the specified columns.
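A minimal plain-pandas sketch of the proposed semantics (the `reorder_columns` helper here illustrates the proposal; it is not an existing API):

```python
import pandas as pd

def reorder_columns(df, column_order):
    """Move the named columns to the front; all other columns keep
    their relative order and follow after. A sketch only."""
    remaining = [c for c in df.columns if c not in column_order]
    return df[list(column_order) + remaining]

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
print(list(reorder_columns(df, ["col3"]).columns))  # → ['col3', 'col1', 'col2']
```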
As per title! Having more than one example can be helpful for getting other users to use the package.
Hi! Decided to have a bit of fun with `pandas-flavor`, which I came across via pyjanitor*. I noticed that the docs here for contributors refer to `register_dataframe_function`, but it looks like the correct name is `register_dataframe_method`.
Not exactly earth-shattering stuff, but it didn't seem like it had been mentioned in another issue so thought I'd just flag it!
* -- just for a little Christmas break project, on which point, compliments of the season!
For myself, mostly:
Chaining implementation:
import pandas_flavor as pf

@pf.register_dataframe_method
def add_columns(df, **kwargs):
    # TODO: error out if a column already exists, or if v is a
    # non-scalar whose length differs from the dataframe's.
    for k, v in kwargs.items():
        df = df.add_column(k, v)
    return df
Example usage for copying repeating rows from one `DataFrame` into another, where `df1.columns` is `{'var1', 'var2', 'var3'}`:
column_order = ['var1', 'var2']
df2.add_columns(**{
    col: vals
    for col, vals in zip(column_order, df1[column_order].iloc[0])
})
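The same effect can be had in plain pandas with `DataFrame.assign`, without registering anything; a self-contained sketch (the frames and column names are invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"var1": [10, 10], "var2": [20, 20], "var3": [30, 30]})
df2 = pd.DataFrame({"other": [1, 2, 3]})

column_order = ["var1", "var2"]
# Broadcast the first row of df1's selected columns across every row
# of df2 -- a plain-pandas equivalent of the add_columns call above.
df2 = df2.assign(**{
    col: val for col, val in zip(column_order, df1[column_order].iloc[0])
})
print(list(df2.columns))  # → ['other', 'var1', 'var2']
```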
This is quite important. I'd like to wrap other packages rather than reinvent the wheel. One possibility is to wrap missingno with a user-friendly API.
Using pyjanitor again today, I realised I didn't want to change capitals to lowercase. I think a case-sensitivity kwarg (default on) would be good.
Happy to do a PR for this if you think it is a good idea. One question: what should the kwarg be called? Some ideas: `lower`, `remove_upper`, `drop_case`, `case_sensitive`, ...
After seeing issue #67, I was curious what people think about adding this capability to all functions. Some of them, like `df.limit_column_characters()`, already operate in place. I don't think it would be hard to extend to the others.
As per a Twitter chat with @twiecki, I think a feature that might come in handy is method chaining for index methods.
Reference: https://twitter.com/twiecki/status/973892601018572800
For example, instead of:
df.index = df.index.drop_level()
we would have:
df.remove_empty().index_drop_level()...
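One possible shape for such chainable index helpers, sketched here with `DataFrame.pipe` rather than pandas-flavor registration; `drop_index_level` is a hypothetical name:

```python
import pandas as pd

def drop_index_level(df, level=0):
    """Chainable sketch: drop one level of a MultiIndex without an
    intermediate `df.index = ...` assignment."""
    out = df.copy()
    out.index = out.index.droplevel(level)
    return out

idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["outer", "inner"])
df = pd.DataFrame({"x": [1, 2]}, index=idx)

# Stays inside a fluent chain.
result = df.pipe(drop_index_level, level=0)
print(result.index.tolist())  # → [1, 2]
```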
The following is a list of functions missing from the PyJanitor library that are implemented in the R version. I think the aggregation and adornment can be put in their own submodules later.
clean_names.R
top_levels.R
I'll commit to a 1.0 release to PyPI when:
Old, re: `reorder_columns`: it does not mutate the original `DataFrame`. I'm thinking about modding it to do so, to be consistent with everything else I implemented.
Edit:
In working on the Jupyter Notebook example walkthrough for pyjanitor, I'm noticing some inconsistencies regarding whether the original `DataFrame` is changed after an operation in the provided example. My notes:
.clean_names() does not mutate
.remove_empty() does
.rename_column() does not
.coalesce() does not
.encode_categorical() does
.convert_excel_date() does
What do we think about this?
It does not seem to be needed anywhere in the source code
Pandas handles this relatively well, but it would be good as a kwarg for `jn.clean_names()`.
The default `False` or `None` could leave whitespace alone; `True` or `'both'` would remove leading and trailing whitespace; and passing `'leading'` or `'trailing'` (or similar) would remove each individually.
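A sketch of how such a kwarg might behave, assuming the value set described above (`False`/`None`, `True`/`'both'`, `'leading'`, `'trailing'`); the function and parameter names are hypothetical:

```python
import pandas as pd

def clean_names_strip(df, strip_whitespace=True):
    """Sketch of the proposed whitespace-stripping kwarg for column
    names. Only illustrates the dispatch, not the rest of clean_names."""
    if strip_whitespace in (False, None):
        return df
    strippers = {
        True: str.strip,
        "both": str.strip,
        "leading": str.lstrip,
        "trailing": str.rstrip,
    }
    # DataFrame.rename accepts a callable applied to each label.
    return df.rename(columns=strippers[strip_whitespace])

df = pd.DataFrame({"  a  ": [1], "b ": [2]})
print(list(clean_names_strip(df, "leading").columns))  # → ['a  ', 'b ']
```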
While renaming the dataframe, I need to preserve the original names. For example:
santandar_data = pd.read_csv(r"train.csv", nrows=40000)
santandar_data.shape
santandar_data.original_names=santandar_data.columns
ndf=santandar_data
ndf.original_names
Index(['ID', 'var3', 'var15', 'imp_ent_var16_ult1', 'imp_op_var39_comer_ult1',
'imp_op_var39_comer_ult3', 'imp_op_var40_comer_ult1',
'imp_op_var40_comer_ult3', 'imp_op_var40_efect_ult1',
'imp_op_var40_efect_ult3',
...
'saldo_medio_var33_hace2', 'saldo_medio_var33_hace3',
'saldo_medio_var33_ult1', 'saldo_medio_var33_ult3',
'saldo_medio_var44_hace2', 'saldo_medio_var44_hace3',
'saldo_medio_var44_ult1', 'saldo_medio_var44_ult3', 'var38', 'TARGET'],
dtype='object', length=371)
The `ndf` dataframe object has a property `original_names` that works correctly. But when I use the `clean_names` function, I lose this functionality.
df=santandar_data.clean_names(case_type="upper", remove_special=True).limit_column_characters(3)
df.original_names
AttributeError: 'DataFrame' object has no attribute 'original_names'
import pandas as pd
import janitor as jn
df = pd.read_excel("dirty_data.xlsx")
df = (
jn.DataFrame(df)
.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time?", "full_time")
.coalesce(["certification", "certification.1"], "certification")
.encode_categorical(["subject", "employee_status", "full_time"])
.convert_excel_date("hire_date")
)
print(df)
Traceback (most recent call last):
File "/home/zachary/projects/pyjanitor/examples/example.py", line 7, in <module>
jn.DataFrame(df)
AttributeError: module 'janitor' has no attribute 'DataFrame'
`DataFrame` is not in the `janitor` namespace. Also, `df` is already a `DataFrame`, so I'm not sure what the intent of doing this would be, anyway.
Note that if I replace `jn.DataFrame(df)` with simply `df`, I get:
Traceback (most recent call last):
File "/home/zachary/miniconda3/envs/hv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'certification.1'
Please make sure to keep it compatible with Pandas dataframes.
Thanks!
@shantanuo this appears to be related to your earlier PR. When I run tests, this shows up consistently.
/Users/maer3/github/software/pyjanitor/janitor/functions.py:143: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df.original_columns = original_column_names
From your reading, do you know of an appropriate way to attach a new dataframe attribute without invoking this warning?
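One option that arrived in later pandas (1.0 and up, so after this issue was filed) is the experimental `DataFrame.attrs` dict, which stores arbitrary metadata without going through attribute-style access, so this `UserWarning` is not triggered. A sketch; note that `attrs` propagation through pandas operations is not guaranteed:

```python
import pandas as pd

df = pd.DataFrame({"Old Name": [1, 2]})

# attrs is a plain dict for arbitrary metadata; no UserWarning, because
# this does not look like attribute-based column creation to pandas.
df.attrs["original_columns"] = list(df.columns)

df.columns = ["old_name"]
print(df.attrs["original_columns"])  # → ['Old Name']
```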
I kind of want to put something together to show usage, talk about philosophy of pyjanitor, etc.
Colleague proposed this function: within a column that houses strings, within each cell, provide a find_replace()
function that finds a substring in the cell, and replaces it with another substring.
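A minimal sketch of what such a function might look like, built on `Series.str.replace`; the name and signature are hypothetical:

```python
import pandas as pd

def find_replace(df, column, find, replace):
    """Sketch: literal substring find-and-replace within each cell of a
    string column, returning a new DataFrame."""
    df = df.copy()
    df[column] = df[column].str.replace(find, replace, regex=False)
    return df

df = pd.DataFrame({"city": ["New York", "Yorkshire"]})
print(find_replace(df, "city", "York", "Amsterdam")["city"].tolist())
# → ['New Amsterdam', 'Amsterdamshire']
```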
Note to self:
Title.
Hi Eric,
Just a thought -- I find myself doing a lot of hand imputation. It would be nice if you could add imputation as a chainable function.
I'm under the gun and can't submit a PR, but I think this would be a great feature.
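A sketch of what a chainable imputation step could look like; the name `impute` and its signature are hypothetical:

```python
import pandas as pd

def impute(df, column, value=None, statistic=None):
    """Sketch: fill NaNs in one column with either a fixed value or a
    named statistic of that column (e.g. 'mean', 'median')."""
    df = df.copy()
    if statistic is not None:
        value = getattr(df[column], statistic)()
    df[column] = df[column].fillna(value)
    return df

df = pd.DataFrame({"score": [1.0, None, 3.0]})
print(df.pipe(impute, "score", statistic="mean")["score"].tolist())  # → [1.0, 2.0, 3.0]
```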
As far as I'm aware, `reset_index()` with `inplace=True` does not return a `DataFrame`.
In:
df = (
    pd.DataFrame(...)
    .remove_column('column1')
    .dropna(subset=['column2', 'column3'])
    .rename_column('column2', 'unicorns')
    .rename_column('column3', 'dragons')
    .add_column('newcolumn', ['iterable', 'of', 'items'])
)
`remove_column()` is not [any longer?] a function. I guess this should be `remove_columns()` instead. Note that the latter apparently only takes list arguments, rather than just a string when you only want to remove one column. It might be useful to support both types of input.
Are the successive chaining operations happening in place, or is the entire dataframe copied every time?
`transform_column(df, col_name: str, function)` could be augmented to include a destination column. An example use case: you want to perform a `log10` transformation, as in the docs example, yet also preserve the non-transformed data for some other purpose.
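A sketch of the augmented signature; the `dest_col_name` parameter name is hypothetical:

```python
import numpy as np
import pandas as pd

def transform_column(df, col_name, function, dest_col_name=None):
    """Sketch: apply `function` to a column, writing the result either
    in place or to `dest_col_name` so the original column survives."""
    df = df.copy()
    dest = dest_col_name if dest_col_name is not None else col_name
    df[dest] = df[col_name].apply(function)
    return df

df = pd.DataFrame({"conc": [1.0, 10.0, 100.0]})
df = transform_column(df, "conc", np.log10, dest_col_name="log_conc")
print(list(df.columns))  # → ['conc', 'log_conc']
```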
In order to keep pyjanitor lightweight, I would like to propose that the submodule dependencies not be installed with the main module.
In order for this to work out in a user-friendly fashion, I think we will need to provide some try/except imports. For example, in the biology submodule:
try:
    from Bio import Seq
except ImportError:
    print('You need to install `biopython`: \n\n    conda install -c conda-forge biopython')
I wanted to discuss naming conventions for the various functions and arguments, for consistency. `expand_column` has a parameter `column`, whereas `add_column` has `col_name`.
Also, is `_column` necessary in each function? Would it be OK to just have an `add()` or `transform()` method? In general I'm more on the side of more verbose function names, but I just wanted to throw the idea out there.
Similarly, following the format of `limit_column_characters`, functions like `change_type` should probably be named `change_column_type`.
I'm not married to any of this (except the function arguments -- those should be consistent), but I wanted to get people's ideas.
I noticed the pyjanitor-dev conda environment is running on Pandas 0.22.x. Pandas has just done a new release, 0.23.0. I have no idea if this is significant, or if you have plans to keep up to date with the most recent Pandas, but I thought it worth mentioning :)
Speaking as a newcomer to Pandas who finds its syntax confusing, pyjanitor is a breath of fresh air. There's a similar, older project which may provide extra inspiration, maybe even code: agate by @onyxfish (who also created the fantastic Quartz guide to bad data). However, he now seems to have moved on to other things, and development has been considerably slower for the past couple of years.
Agate is well worth a look: the documentation is extensive and well-written, and it has a few features which pyjanitor doesn't (yet). Unfortunately much of the code may be hard to port as it relies on its own table implementation rather than using Pandas DataFrame or similar. Even so, it's worth checking out.
With hypothesis providing a pandas dataframe generator, I'm wondering if we could produce better battle-tested code by using hypothesis for property-based testing, instead of our current hacky way of generating tests from simple dataframes.
One of Hypothesis' best traits is the ability to find edge cases that I myself was unable to find. I think this is worth a shot. Will tag this issue as appropriate.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
The `.query()` method opens up a ton of possibilities for us! We can build a series of specific filtering functions on top of it, I believe.
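A sketch of one such thin, chainable wrapper (the name `filter_on` is hypothetical):

```python
import pandas as pd

def filter_on(df, criteria):
    """Sketch: a chainable wrapper around DataFrame.query(), so that
    filtering reads as one step in a method chain."""
    return df.query(criteria)

df = pd.DataFrame({"score": [55, 70, 90], "name": ["a", "b", "c"]})
print(df.pipe(filter_on, "score >= 70")["name"].tolist())  # → ['b', 'c']
```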
I see there is a remove-columns function. I think a `select_columns` function would work nicely. It would be cleaner and easier to understand than `df[['col1', 'col2', 'col3']]`.
This issue is to track creating a `filter_date` function, or whatever else you might want to call it.
The main idea is that even though there are already `filter` functions, date filtering is a bit more complex. For example, when filtering strings you can just write a string, but for dates it usually needs to be a `date` or `datetime` object, and not all of them play together nicely.
More important than the above is that when filtering a range, the lines can get really long, like so:
start_date = date(2017, 1, 1)
end_date = date(2018, 1, 1)
df = df[(df.date_column >= start_date) & (df.date_column <= end_date)]
I think something like the following is nicer:
df = df.filter_date('date_column', start='2017-01-01', end='2018-01-01')
It could also include arguments for `year`, `years`, etc.
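A sketch of what `filter_date` might look like, accepting ISO date strings and converting them internally so callers never build `date` objects by hand; the signature is illustrative only:

```python
import pandas as pd

def filter_date(df, column, start=None, end=None):
    """Sketch: filter rows to a date range given as ISO strings (or
    anything pd.to_datetime accepts). Bounds are inclusive."""
    ser = pd.to_datetime(df[column])
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= ser >= pd.to_datetime(start)
    if end is not None:
        mask &= ser <= pd.to_datetime(end)
    return df[mask]

df = pd.DataFrame({"date_column": ["2016-06-01", "2017-06-01", "2018-06-01"]})
out = filter_date(df, "date_column", start="2017-01-01", end="2018-01-01")
print(out["date_column"].tolist())  # → ['2017-06-01']
```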
Is there any way to limit the length of column names? For example, `this_is_very_long_column_needs_truncated` should be truncated to the leftmost 10 or 20 characters. This may lead to duplicate column names; those should be suffixed with numbers, like `this_is_very_long_1` and `this_is_very_long_2`.
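A sketch of such a truncate-and-dedupe helper; the name matches the function discussed elsewhere in these issues, but the numbering scheme for duplicates here is just one possible choice:

```python
from collections import Counter

import pandas as pd

def limit_column_characters(df, length, separator="_"):
    """Sketch: truncate column names to `length` characters, suffixing
    later duplicates with a counter (name, name_1, name_2, ...)."""
    df = df.copy()
    seen = Counter()
    new_names = []
    for col in df.columns:
        short = str(col)[:length]
        seen[short] += 1
        if seen[short] > 1:
            short = f"{short}{separator}{seen[short] - 1}"
        new_names.append(short)
    df.columns = new_names
    return df

df = pd.DataFrame(columns=["this_is_very_long_column_a", "this_is_very_long_column_b"])
print(list(limit_column_characters(df, 17).columns))
# → ['this_is_very_long', 'this_is_very_long_1']
```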
I find myself loading a lot of files from a folder. I would use the glob library for this, but it is a lot to write out. For example, I would write:
import glob
import pandas as pd

path = "C:/Finance/Month End/2018/CSV Imports YTD"
files_xls = glob.glob(path + "/*.csv")

df = pd.DataFrame()
for f in files_xls:
    data1 = pd.read_csv(f, skiprows=0, low_memory=False, encoding="cp1252")
    data1['File_Name'] = f
    df = df.append(data1, ignore_index=True)
Something like this would be easier:
read_folderfiles(path="", extension="", encoding="", add_filenames=True)
Thoughts?
I can try to create this if you like.
A thought came to my mind: how do we clean key-value stores? Do we have to implicitly assume that there are regularly repeating key-value pairs? Or do we define data types and provide easy commands to clean them?
One low-hanging fruit is possibly changing all key names to lowercase + underscore-separated.
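For that low-hanging fruit, a small stdlib-only sketch that normalises a record's keys to lowercase, underscore-separated form, recursing into nested dicts:

```python
import re

def clean_key(key):
    """Sketch: normalise one key name to lowercase_underscore form."""
    key = key.strip().lower()
    key = re.sub(r"[^\w]+", "_", key)  # runs of non-word chars -> one underscore
    return key.strip("_")

def clean_keys(record):
    """Apply clean_key to every key of a dict, recursing into nested dicts."""
    return {
        clean_key(k): clean_keys(v) if isinstance(v, dict) else v
        for k, v in record.items()
    }

record = {"First Name": "Ada", "Contact Info": {"E-mail": "ada@example.com"}}
print(clean_keys(record))
# → {'first_name': 'Ada', 'contact_info': {'e_mail': 'ada@example.com'}}
```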
Heads-up, @zbarry and @szuckerman.
I have refactored the test suite a little bit (actually, quite a lot). The changes hopefully make the test suite easier to develop against and enable newer contributions. I will be writing a section on how to write tests (as part of the CONTRIBUTING.rst file).
The key changes here are:
Each function now has its own `test_<function_name>.py` file. This structure allows us to create multiple tests for each function, without cluttering up one big `test_functions.py` file.
Big thanks to both of you for your contributions thus far; I really appreciate the work that has gone in!
There's a sanity check I put in there: if the supplied value is a sequence, it makes sure that the length is the same as the number of rows in the `DataFrame`, as long as `fill_remaining` is `False`. This is done by checking for the existence of `__len__`. I need to add logic to exclude `str` objects from this check. Fixing now.
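The check described above might look something like this; a standalone sketch, not the actual pyjanitor code:

```python
def check_column_length(value, nrows, fill_remaining=False):
    """Sketch of the sanity check: a non-string sequence must match the
    number of rows unless fill_remaining is True. Strings have __len__
    but are treated as scalars, hence the explicit exclusion."""
    has_len = hasattr(value, "__len__") and not isinstance(value, str)
    if has_len and not fill_remaining and len(value) != nrows:
        raise ValueError(
            f"Length of sequence ({len(value)}) does not match "
            f"number of rows ({nrows})."
        )

check_column_length("abc", 5)      # str: skipped, no error
check_column_length([1, 2, 3], 3)  # matching length: OK
```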
With Python `3.6.6` and pyjanitor `0.5.0` installed using `pipenv` on Windows 10, I get the following `UserWarning`:
In [1]: import janitor
C:\Users\<USERNAME>\.virtualenvs\<REPO_NAME>\Lib\site-packages\janitor\dataframe.py:24:
UserWarning: Janitor's subclassed DataFrame and Series will be deprecated before
the 1.0 release. Instead of importing the Janitor DataFrame, please instead
`import janitor`, and use the functions directly attached to native pandas
dataframe.
warnings.warn(msg)
This is when using pyjanitor in the recommended way. Shouldn't the warning only appear on import of the deprecated `DataFrame` functions? I even get the warning when importing only `clean_names`:
In [1]: from janitor import clean_names
C:\Users\<USERNAME>\.virtualenvs\<REPO_NAME>\Lib\site-packages\janitor\dataframe.py:24:
UserWarning: Janitor's subclassed DataFrame and Series will be deprecated before
the 1.0 release. Instead of importing the Janitor DataFrame, please instead
`import janitor`, and use the functions directly attached to native pandas
dataframe.
warnings.warn(msg)