pyjanitor-devs / pyjanitor
Clean APIs for data cleaning. Python implementation of the R package janitor
Home Page: https://pyjanitor-devs.github.io/pyjanitor
License: MIT License
For example, after aggregation with multiple functions. For a `df` with columns `['group', 'category', 'value']`:
stats_df = (
    df.groupby(['group', 'category'])
    .agg(['mean', 'median'])
    .reset_index()
)
This produces a `stats_df` whose `.columns` attribute is a `MultiIndex`, where `{'mean', 'median'}` are second-level column names under `value`. It would be nice if `.columns` were just an `Index` instead, for some applications.
Now, to flatten the `MultiIndex` into an `Index` by concatenating the different levels with an underscore:
`stats_df.columns.values` is `array([('group', ''), ('category', ''), ('value', 'mean'), ('value', 'median')], dtype=object)`
stats_df.columns = ['_'.join(tup) if tup[1] != '' else tup[0] for tup in stats_df.columns.values]
`stats_df.columns` is now `['group', 'category', 'value_mean', 'value_median']`
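The whole flattening recipe above can be run end-to-end; a minimal, self-contained version (the toy data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "category": ["x", "y", "x", "y"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

stats_df = (
    df.groupby(["group", "category"])
    .agg(["mean", "median"])
    .reset_index()
)

# Join the two levels with '_', leaving single-level names untouched.
stats_df.columns = [
    "_".join(tup) if tup[1] != "" else tup[0]
    for tup in stats_df.columns.values
]

print(list(stats_df.columns))  # → ['group', 'category', 'value_mean', 'value_median']
```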
Is it possible to "fix" pandas methods using pyjanitor? For example, I would like to validate the parameters for `read_excel` or `read_csv`. I have raised an issue, but there has been no progress.
If janitor accepts a PR to override the default behavior of the read methods, that would be great.
MultiIndex columns raise an "expected str, not tuple" `TypeError`.
If you instead used
df.rename(columns=lambda x: x.lower().replace('_', ''))
this would work for both standard and MultiIndex DataFrames.
Can do a PR if required.
Is this possible with janitor?
df.clean_names(inplace=True)
@zbarry @szuckerman, I would like to invite you to participate in the pyjanitor software manuscript that I am writing.
I am writing it in a branch off master: https://github.com/ericmjl/pyjanitor/blob/whitepaper/paper/manuscript.md
At the moment, I am seeking out input on:
If you would like to participate, please put in a PR against the `whitepaper` branch and add your name!
From a previous pull request, the issue of namespaces arose. I wanted to open this issue to discuss various new namespaces possible for the module.
It appears that the R version is having this issue as well.
There was discussion of a `finance` submodule, which sounds good, but I don't work in finance and would be unfamiliar with many of the items that would need to be included.
I think that a `summary` submodule, or something like that, would be a good place to add `tabyl` or other summary statistics.
Thoughts?
I just tried to run the test suite and had some failures because I don't have some of the Chemistry packages installed. I realized that those packages are only available through conda.
I don't use conda and am finding out that it's not so easy to just "install" conda packages into a previously created virtual environment (I mean, I'll figure it out eventually, but this is just at first glance).
In any event, I think it raises an interesting question: does this create issues for people who don't use conda? Meaning, if I try to use something from a submodule but don't have a dependency like `rdkit` installed, will it tell me to remedy with a `conda install` that's not going to work?
I'm not really sure what the answer is, but I could see either putting functions that rely on conda dependencies in their own package for conda, or just keeping it how it is and let people deal with it (assuming that people using these modules most likely already have conda installed).
Just wanted to throw this out there before the submodules get bigger.
Due to pinning of the sklearn version, it fails on 3.7. Need to change the dep for sklearn in `requirements.txt` from `==` to `>=`.
Example usage:
df = (
    pd.read_csv('blah.csv')  # containing ['col1', 'col2']
    .add_column('col3', 12345)
    .reorder_columns(['col3'])
)
Columns not specified retain their order and follow after the specified columns.
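A minimal plain-pandas sketch of the proposed semantics (the `reorder_columns` helper here illustrates the proposal; it is not an existing API):

```python
import pandas as pd

def reorder_columns(df, column_order):
    """Move the named columns to the front; all other columns keep
    their relative order and follow after. A sketch only."""
    remaining = [c for c in df.columns if c not in column_order]
    return df[list(column_order) + remaining]

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
print(list(reorder_columns(df, ["col3"]).columns))  # → ['col3', 'col1', 'col2']
```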
As per title! Having more than one example can be helpful for getting other users to use the package.
Hi! Decided to have a bit of fun with `pandas-flavor`, which I came across via pyjanitor*. I noticed that the docs here for contributors refer to `register_dataframe_function`, but it looks like the correct name is `register_dataframe_method`.
Not exactly earth-shattering stuff, but it didn't seem like it had been mentioned in another issue so thought I'd just flag it!
* -- just for a little Christmas break project, on which point, compliments of the season!
For myself, mostly:
Chaining implementation:
import pandas_flavor as pf

@pf.register_dataframe_method
def add_columns(df, **kwargs):
    # TODO: error out if a column already exists, or if v is a
    # non-scalar whose length differs from the dataframe's.
    for k, v in kwargs.items():
        df = df.add_column(k, v)
    return df
Example usage for copying repeating rows from one `DataFrame` into another, where `df1.columns` is `{'var1', 'var2', 'var3'}`:
column_order = ['var1', 'var2']
df2.add_columns(**{
    col: vals
    for col, vals in zip(column_order, df1[column_order].iloc[0])
})
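The same effect can be had in plain pandas with `DataFrame.assign`, without registering anything; a self-contained sketch (the frames and column names are invented for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"var1": [10, 10], "var2": [20, 20], "var3": [30, 30]})
df2 = pd.DataFrame({"other": [1, 2, 3]})

column_order = ["var1", "var2"]
# Broadcast the first row of df1's selected columns across every row
# of df2 -- a plain-pandas equivalent of the add_columns call above.
df2 = df2.assign(**{
    col: val for col, val in zip(column_order, df1[column_order].iloc[0])
})
print(list(df2.columns))  # → ['other', 'var1', 'var2']
```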
This is quite important. I'd like to wrap other packages rather than reinvent the wheel. One possibility is to wrap missingno with a user-friendly API.
Using pyjanitor again today, I realised I didn't want to change capitals to lowercase. I think a case-sensitivity kwarg (default on) would be good.
Happy to do a PR for this if you think it is a good idea. One question: what should the kwarg be called? Some ideas: `lower`, `remove_upper`, `drop_case`, `case_sensitive`, ...
After seeing issue #67, I was curious what people think about adding this capability to all functions. Some of them, like `df.limit_column_characters()`, already operate in place. I don't think it would be hard to extend to the others.
As per a Twitter chat with @twiecki, I think a feature that might come in handy is method chaining for index methods.
Reference: https://twitter.com/twiecki/status/973892601018572800
For example, instead of:
df.index = df.index.drop_level()
we would have:
df.remove_empty().index_drop_level()...
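One possible shape for such chainable index helpers, sketched here with `DataFrame.pipe` rather than pandas-flavor registration; `drop_index_level` is a hypothetical name:

```python
import pandas as pd

def drop_index_level(df, level=0):
    """Chainable sketch: drop one level of a MultiIndex without an
    intermediate `df.index = ...` assignment."""
    out = df.copy()
    out.index = out.index.droplevel(level)
    return out

idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["outer", "inner"])
df = pd.DataFrame({"x": [1, 2]}, index=idx)

# Stays inside a fluent chain.
result = df.pipe(drop_index_level, level=0)
print(result.index.tolist())  # → [1, 2]
```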
The following is a list of functions missing from the PyJanitor library that are implemented in the R version. I think the aggregation and adornment can be put in their own submodules later.
clean_names.R
top_levels.R
I'll commit to a 1.0 release to PyPI when:
Old, re: `reorder_columns`: it does not mutate the original `DataFrame`. I'm thinking about modding it to do so, to be consistent with everything else I implemented.
Edit:
In working on the Jupyter Notebook example walkthrough for pyjanitor, I'm noticing some inconsistencies regarding whether the original `DataFrame` is changed after an operation in the provided example. My notes:
.clean_names() does not mutate
.remove_empty() does
.rename_column() does not
.coalesce() does not
.encode_categorical() does
.convert_excel_date() does
What do we think about this?
It does not seem to be needed anywhere in the source code
Pandas handles this relatively well, but it would be good as a kwarg for `jn.clean_names()`.
The default `False` or `None` could leave whitespace alone; `True` or `'both'` would remove leading and trailing whitespace; and passing `'leading'` or `'trailing'` (or similar) would remove each individually.
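A sketch of how such a kwarg might behave, assuming the value set described above (`False`/`None`, `True`/`'both'`, `'leading'`, `'trailing'`); the function and parameter names are hypothetical:

```python
import pandas as pd

def clean_names_strip(df, strip_whitespace=True):
    """Sketch of the proposed whitespace-stripping kwarg for column
    names. Only illustrates the dispatch, not the rest of clean_names."""
    if strip_whitespace in (False, None):
        return df
    strippers = {
        True: str.strip,
        "both": str.strip,
        "leading": str.lstrip,
        "trailing": str.rstrip,
    }
    # DataFrame.rename accepts a callable applied to each label.
    return df.rename(columns=strippers[strip_whitespace])

df = pd.DataFrame({"  a  ": [1], "b ": [2]})
print(list(clean_names_strip(df, "leading").columns))  # → ['a  ', 'b ']
```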
While renaming the dataframe, I need to preserve the original names. For example:
santandar_data = pd.read_csv(r"train.csv", nrows=40000)
santandar_data.shape
santandar_data.original_names=santandar_data.columns
ndf=santandar_data
ndf.original_names
Index(['ID', 'var3', 'var15', 'imp_ent_var16_ult1', 'imp_op_var39_comer_ult1',
'imp_op_var39_comer_ult3', 'imp_op_var40_comer_ult1',
'imp_op_var40_comer_ult3', 'imp_op_var40_efect_ult1',
'imp_op_var40_efect_ult3',
...
'saldo_medio_var33_hace2', 'saldo_medio_var33_hace3',
'saldo_medio_var33_ult1', 'saldo_medio_var33_ult3',
'saldo_medio_var44_hace2', 'saldo_medio_var44_hace3',
'saldo_medio_var44_ult1', 'saldo_medio_var44_ult3', 'var38', 'TARGET'],
dtype='object', length=371)
The `ndf` dataframe object has a property `original_names` that works correctly. But when I use the `clean_names` function, I lose this functionality.
df=santandar_data.clean_names(case_type="upper", remove_special=True).limit_column_characters(3)
df.original_names
AttributeError: 'DataFrame' object has no attribute 'original_names'
import pandas as pd
import janitor as jn
df = pd.read_excel("dirty_data.xlsx")
df = (
jn.DataFrame(df)
.clean_names()
.remove_empty()
.rename_column("%_allocated", "percent_allocated")
.rename_column("full_time?", "full_time")
.coalesce(["certification", "certification.1"], "certification")
.encode_categorical(["subject", "employee_status", "full_time"])
.convert_excel_date("hire_date")
)
print(df)
Traceback (most recent call last):
File "/home/zachary/projects/pyjanitor/examples/example.py", line 7, in <module>
jn.DataFrame(df)
AttributeError: module 'janitor' has no attribute 'DataFrame'
`DataFrame` is not in the `janitor` namespace. Also, `df` is already a `DataFrame`, so I'm not sure what the intent of doing this would be, anyway.
Note that if I replace `jn.DataFrame(df)` with simply `df`, I get:
Traceback (most recent call last):
File "/home/zachary/miniconda3/envs/hv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'certification.1'
Please make sure to keep it compatible with Pandas dataframes.
Thanks!
@shantanuo this appears to be related to your earlier PR. When I run tests, this shows up consistently.
/Users/maer3/github/software/pyjanitor/janitor/functions.py:143: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df.original_columns = original_column_names
From your reading, do you know of an appropriate way to attach a new dataframe attribute without invoking this warning?
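One option that arrived in later pandas (1.0 and up, so after this issue was filed) is the experimental `DataFrame.attrs` dict, which stores arbitrary metadata without going through attribute-style access, so this `UserWarning` is not triggered. A sketch; note that `attrs` propagation through pandas operations is not guaranteed:

```python
import pandas as pd

df = pd.DataFrame({"Old Name": [1, 2]})

# attrs is a plain dict for arbitrary metadata; no UserWarning, because
# this does not look like attribute-based column creation to pandas.
df.attrs["original_columns"] = list(df.columns)

df.columns = ["old_name"]
print(df.attrs["original_columns"])  # → ['Old Name']
```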
I kind of want to put something together to show usage, talk about philosophy of pyjanitor, etc.
Colleague proposed this function: within a column that houses strings, within each cell, provide a find_replace()
function that finds a substring in the cell, and replaces it with another substring.
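A minimal sketch of what such a function might look like, built on `Series.str.replace`; the name and signature are hypothetical:

```python
import pandas as pd

def find_replace(df, column, find, replace):
    """Sketch: literal substring find-and-replace within each cell of a
    string column, returning a new DataFrame."""
    df = df.copy()
    df[column] = df[column].str.replace(find, replace, regex=False)
    return df

df = pd.DataFrame({"city": ["New York", "Yorkshire"]})
print(find_replace(df, "city", "York", "Amsterdam")["city"].tolist())
# → ['New Amsterdam', 'Amsterdamshire']
```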
Note to self:
Title.
Hi Eric,
Just a thought -- I find myself doing a lot of hand imputation. It would be nice if you could add imputation as a chainable function.
I'm under the gun and can't submit a PR, but I think this would be a great feature.
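A sketch of what a chainable imputation step could look like; the name `impute` and its signature are hypothetical:

```python
import pandas as pd

def impute(df, column, value=None, statistic=None):
    """Sketch: fill NaNs in one column with either a fixed value or a
    named statistic of that column (e.g. 'mean', 'median')."""
    df = df.copy()
    if statistic is not None:
        value = getattr(df[column], statistic)()
    df[column] = df[column].fillna(value)
    return df

df = pd.DataFrame({"score": [1.0, None, 3.0]})
print(df.pipe(impute, "score", statistic="mean")["score"].tolist())  # → [1.0, 2.0, 3.0]
```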
As far as I'm aware, `reset_index()` with `inplace=True` does not return a `DataFrame`.
In:
df = (
    pd.DataFrame(...)
    .remove_column('column1')
    .dropna(subset=['column2', 'column3'])
    .rename_column('column2', 'unicorns')
    .rename_column('column3', 'dragons')
    .add_column('newcolumn', ['iterable', 'of', 'items'])
)
`remove_column()` is not [any longer?] a function. I guess this should be `remove_columns()` instead. Note that the latter apparently only takes list arguments, rather than just a string when you only want to remove one column. It might be useful to support both types of input.
Are the successive chaining operations happening in place, or is the entire dataframe copied every time?
`transform_column(df, col_name: str, function)` could be augmented to include a destination column. An example use case: you want to perform a `log10` transformation, as in the docs example, yet also preserve the non-transformed data for some other purpose.
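A sketch of the augmented signature; the `dest_col_name` parameter name is hypothetical:

```python
import numpy as np
import pandas as pd

def transform_column(df, col_name, function, dest_col_name=None):
    """Sketch: apply `function` to a column, writing the result either
    in place or to `dest_col_name` so the original column survives."""
    df = df.copy()
    dest = dest_col_name if dest_col_name is not None else col_name
    df[dest] = df[col_name].apply(function)
    return df

df = pd.DataFrame({"conc": [1.0, 10.0, 100.0]})
df = transform_column(df, "conc", np.log10, dest_col_name="log_conc")
print(list(df.columns))  # → ['conc', 'log_conc']
```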
In order to keep pyjanitor lightweight, I would like to propose that the submodule dependencies not be installed with the main module.
In order for this to work out in a user-friendly fashion, I think we will need to provide some try/except imports. For example, in the biology submodule:
try:
    from Bio import Seq
except ImportError:
    print('You need to install `biopython`: \n\n    conda install -c conda-forge biopython')
I wanted to discuss naming conventions for the various functions and arguments, for consistency. `expand_column` has a parameter `column`, whereas `add_column` has `col_name`.
Also, is `_column` necessary in each function? Would it be OK to just have an `add()` or `transform()` method? In general I'm more on the side of more verbose function names, but I just wanted to throw the idea out there.
Similarly, following the format of `limit_column_characters`, functions like `change_type` should probably be named `change_column_type`.
I'm not married to any of this (except the function arguments -- those should be consistent), but I wanted to get people's ideas.
I noticed the pyjanitor-dev conda environment is running on Pandas 0.22.x. Pandas has just done a new release, 0.23.0. I have no idea if this is significant, or if you have plans to keep up to date with the most recent Pandas, but I thought it worth mentioning :)
Speaking as a newcomer to Pandas who finds its syntax confusing, pyjanitor is a breath of fresh air. There's a similar, older project which may provide extra inspiration, maybe even code: agate by @onyxfish (who also created the fantastic Quartz guide to bad data). However, he now seems to have moved on to other things, and development has been considerably slower for the past couple of years.
Agate is well worth a look: the documentation is extensive and well-written, and it has a few features which pyjanitor doesn't (yet). Unfortunately much of the code may be hard to port as it relies on its own table implementation rather than using Pandas DataFrame or similar. Even so, it's worth checking out.
With hypothesis providing a pandas dataframe generator, I'm wondering if we could produce better battle-tested code by using hypothesis for property-based testing, instead of our current hacky way of generating tests from simple dataframes.
One of Hypothesis' best traits is the ability to find edge cases that I myself was unable to find. I think this is worth a shot. Will tag this issue as appropriate.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
The `.query()` method opens up a ton of possibilities for us! We can build a series of specific filtering functions on top of it, I believe.
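A sketch of one such thin, chainable wrapper (the name `filter_on` is hypothetical):

```python
import pandas as pd

def filter_on(df, criteria):
    """Sketch: a chainable wrapper around DataFrame.query(), so that
    filtering reads as one step in a method chain."""
    return df.query(criteria)

df = pd.DataFrame({"score": [55, 70, 90], "name": ["a", "b", "c"]})
print(df.pipe(filter_on, "score >= 70")["name"].tolist())  # → ['b', 'c']
```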
I see there is a remove-columns function. I think a `select_columns` function would work nicely. It would be cleaner and easier to understand than `df[['col1', 'col2', 'col3']]`.
This issue is to track creating a `filter_date` function, or whatever else you might want to call it.
The main idea is that even though there are already `filter` functions, date filtering is a bit more complex. For example, when filtering strings you can just write a string, but for dates it usually needs to be a `date` or `datetime` object, and not all of them play together nicely.
More important than the above is that when filtering a range, the lines can get really long, like so:
start_date = date(2017, 1, 1)
end_date = date(2018, 1, 1)
df = df[(df.date_column >= start_date) & (df.date_column <= end_date)]
I think something like the following is nicer:
df = df.filter_date('date_column', start='2017-01-01', end='2018-01-01')
It could also include arguments for `year`, `years`, etc.
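A sketch of what `filter_date` might look like, accepting ISO date strings and converting them internally so callers never build `date` objects by hand; the signature is illustrative only:

```python
import pandas as pd

def filter_date(df, column, start=None, end=None):
    """Sketch: filter rows to a date range given as ISO strings (or
    anything pd.to_datetime accepts). Bounds are inclusive."""
    ser = pd.to_datetime(df[column])
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= ser >= pd.to_datetime(start)
    if end is not None:
        mask &= ser <= pd.to_datetime(end)
    return df[mask]

df = pd.DataFrame({"date_column": ["2016-06-01", "2017-06-01", "2018-06-01"]})
out = filter_date(df, "date_column", start="2017-01-01", end="2018-01-01")
print(out["date_column"].tolist())  # → ['2017-06-01']
```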
Is there any way to limit the length of column names? For example, `this_is_very_long_column_needs_truncated` should be truncated to the leftmost 10 or 20 characters. This may lead to duplicate column names; those should be suffixed with numbers, like `this_is_very_long_1` and `this_is_very_long_2`.
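A sketch of such a truncate-and-dedupe helper; the name matches the function discussed elsewhere in these issues, but the numbering scheme for duplicates here is just one possible choice:

```python
from collections import Counter

import pandas as pd

def limit_column_characters(df, length, separator="_"):
    """Sketch: truncate column names to `length` characters, suffixing
    later duplicates with a counter (name, name_1, name_2, ...)."""
    df = df.copy()
    seen = Counter()
    new_names = []
    for col in df.columns:
        short = str(col)[:length]
        seen[short] += 1
        if seen[short] > 1:
            short = f"{short}{separator}{seen[short] - 1}"
        new_names.append(short)
    df.columns = new_names
    return df

df = pd.DataFrame(columns=["this_is_very_long_column_a", "this_is_very_long_column_b"])
print(list(limit_column_characters(df, 17).columns))
# → ['this_is_very_long', 'this_is_very_long_1']
```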
I find myself loading a lot of files from a folder. I would use the glob library for this, but it is a lot to write out. For example, I would write:
import glob
import pandas as pd

path = "C:/Finance/Month End/2018/CSV Imports YTD"
files_xls = glob.glob(path + "/*.csv")

df = pd.DataFrame()
for f in files_xls:
    data1 = pd.read_csv(f, skiprows=0, low_memory=False, encoding="cp1252")
    data1['File_Name'] = f
    df = df.append(data1, ignore_index=True)
Something like this would be easier:
read_folderfiles(path="", extension="", encoding="", add_filenames=True)
Thoughts?
I can try to create this if you like.
A thought came to my mind: how do we clean key-value stores? Do we have to implicitly assume that there are regularly repeating key-value pairs? Or do we define data types and provide easy commands to clean them?
One low-hanging fruit is possibly changing all key names to lowercase + underscore-separated.
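For that low-hanging fruit, a small stdlib-only sketch that normalises a record's keys to lowercase, underscore-separated form, recursing into nested dicts:

```python
import re

def clean_key(key):
    """Sketch: normalise one key name to lowercase_underscore form."""
    key = key.strip().lower()
    key = re.sub(r"[^\w]+", "_", key)  # runs of non-word chars -> one underscore
    return key.strip("_")

def clean_keys(record):
    """Apply clean_key to every key of a dict, recursing into nested dicts."""
    return {
        clean_key(k): clean_keys(v) if isinstance(v, dict) else v
        for k, v in record.items()
    }

record = {"First Name": "Ada", "Contact Info": {"E-mail": "ada@example.com"}}
print(clean_keys(record))
# → {'first_name': 'Ada', 'contact_info': {'e_mail': 'ada@example.com'}}
```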
Heads-up, @zbarry and @szuckerman.
I have refactored the test suite a little bit (actually, quite a lot). The changes hopefully make the test suite easier to develop against and enable newer contributions. I will be writing a section on how to write tests (as part of the CONTRIBUTING.rst file).
The key changes here are:
Each function now has its own `test_<function_name>.py` file. This structure allows us to create multiple tests for each function, without cluttering up one big `test_functions.py` file.
Big thanks to both of you for your contributions thus far; I really appreciate the work that has gone in!
There's a sanity check I put in there: if the supplied value is a sequence, it makes sure that the length is the same as the number of rows in the `DataFrame`, as long as `fill_remaining` is `False`. This is done by checking for the existence of `__len__`. I need to add logic to exclude `str` objects from this check. Fixing now.
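The check described above might look something like this; a standalone sketch, not the actual pyjanitor code:

```python
def check_column_length(value, nrows, fill_remaining=False):
    """Sketch of the sanity check: a non-string sequence must match the
    number of rows unless fill_remaining is True. Strings have __len__
    but are treated as scalars, hence the explicit exclusion."""
    has_len = hasattr(value, "__len__") and not isinstance(value, str)
    if has_len and not fill_remaining and len(value) != nrows:
        raise ValueError(
            f"Length of sequence ({len(value)}) does not match "
            f"number of rows ({nrows})."
        )

check_column_length("abc", 5)      # str: skipped, no error
check_column_length([1, 2, 3], 3)  # matching length: OK
```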
With Python `3.6.6` and pyjanitor `0.5.0` installed using `pipenv` on Windows 10, I get the following `UserWarning`:
In [1]: import janitor
C:\Users\<USERNAME>\.virtualenvs\<REPO_NAME>\Lib\site-packages\janitor\dataframe.py:24:
UserWarning: Janitor's subclassed DataFrame and Series will be deprecated before
the 1.0 release. Instead of importing the Janitor DataFrame, please instead
`import janitor`, and use the functions directly attached to native pandas
dataframe.
warnings.warn(msg)
This is when using pyjanitor in the recommended way. Shouldn't the warning only appear on import of the deprecated `DataFrame` functions? I even get the warning when importing only `clean_names`:
In [1]: from janitor import clean_names
C:\Users\<USERNAME>\.virtualenvs\<REPO_NAME>\Lib\site-packages\janitor\dataframe.py:24:
UserWarning: Janitor's subclassed DataFrame and Series will be deprecated before
the 1.0 release. Instead of importing the Janitor DataFrame, please instead
`import janitor`, and use the functions directly attached to native pandas
dataframe.
warnings.warn(msg)