kieferk / dfply
dplyr-style piping operations for pandas dataframes
License: GNU General Public License v3.0
Hey, not sure if this is an issue for anyone else, but one of my favourite features of dfply is the X symbol. It's awkward, though, because of the scikit-learn convention of using X for the array of predictors ... would it make sense to change X to something else? Maybe D or DF?
I want to be able to apply distinct to the whole dataframe, so that distinct() is equivalent to drop_duplicates().
Just putting this here as a placeholder for now so that I remember; I'm happy to take this on and submit a PR when I get time.
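For reference, a minimal sketch of the requested behavior in plain pandas (this is the semantics being asked for, not the dfply API):

```python
import pandas as pd

# distinct() with no column arguments should act on the whole frame,
# i.e. behave exactly like pandas drop_duplicates().
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
result = df.drop_duplicates().reset_index(drop=True)
```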
The function get_join_parameters in join.py has an error:

if not isinstance(by[0], str):
    left_on = by[0]
    right_in = by[1]

This should be right_on = by[1].
Great library btw - I really missed dplyr when moving to python.
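A hypothetical sketch of the corrected branch (names follow the report above, not necessarily the exact dfply source):

```python
# Hedged reconstruction: when `by` is a list/tuple of pairs, the first
# element names the left keys and the second the right keys; otherwise
# the same names are used on both sides.
def get_join_parameters(by):
    if isinstance(by, (list, tuple)) and not isinstance(by[0], str):
        left_on, right_on = by[0], by[1]  # was: right_in = by[1]
    else:
        left_on = right_on = by
    return left_on, right_on
```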
I'm having this same issue still:
#8
I am using conda to install dfply (which I need to do because conda is the package manager used by the computing cluster I have access to):
conda install -c tallic dfply
That's the command I use to install the package from https://anaconda.org/tallic/dfply.
But when I go to use dfply, it still says the diamonds.csv data is missing.
Traceback (most recent call last):
  File "ACH_nested_anova.py", line 1, in <module>
    import dfply
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/__init__.py", line 11, in <module>
    from .data import diamonds
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/__init__.py", line 5, in <module>
    diamonds = pd.read_csv(os.path.join(root, "diamonds.csv"))
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv' does not exist: b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv'
2019-03-15 13:25:11 ⌚ gateway-03 in ~/ACH_Development/ACH_tests/ACH_quiz3/python_scripts/Analysis
○ → python ACH_nested_anova.py
Traceback (most recent call last):
  File "ACH_nested_anova.py", line 2, in <module>
    from dfply import group_by as group_by, summarize as summarize, select as select
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/__init__.py", line 11, in <module>
    from .data import diamonds
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/__init__.py", line 5, in <module>
    diamonds = pd.read_csv(os.path.join(root, "diamonds.csv"))
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv' does not exist: b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv'
2019-03-15 13:25:41 ⌚ gateway-03 in ~/ACH_Development/ACH_tests/ACH_quiz3/python_scripts/Analysis
○ → pip install dfply
Requirement already satisfied: dfply in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (0.3.1)
Requirement already satisfied: numpy in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from dfply) (1.16.2)
Requirement already satisfied: pandas in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from dfply) (0.24.2)
Requirement already satisfied: python-dateutil>=2.5.0 in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from pandas->dfply) (2.8.0)
Requirement already satisfied: pytz>=2011k in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from pandas->dfply) (2018.9)
Requirement already satisfied: six>=1.5 in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->dfply) (1.12.0)
2019-03-15 13:26:59 ⌚ gateway-03 in ~/ACH_Development/ACH_tests/ACH_quiz3/python_scripts/Analysis
○ → python ACH_nested_anova.py
Traceback (most recent call last):
  File "ACH_nested_anova.py", line 2, in <module>
    from dfply import group_by as group_by, summarize as summarize, select as select
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/__init__.py", line 11, in <module>
    from .data import diamonds
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/__init__.py", line 5, in <module>
    diamonds = pd.read_csv(os.path.join(root, "diamonds.csv"))
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv' does not exist: b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv'
I can substitute the import line with any of the following and the result is still the same:
-import dfply
-from dfply import group_by as group_by, summarize as summarize, select as select
-from dfply import *
Please help. I cannot seem to use git or pip to correct the problem: pip tells me the package is already installed, but I get the same error, and git is not available to me.
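One workaround sketch for this class of failure: since the package was installed without its bundled data/diamonds.csv, `import dfply` fails at import time. Writing a minimal placeholder CSV at the path named in the traceback lets the import succeed (the bundled `diamonds` dataset will then just be a stub). The relative path below is purely illustrative; use the site-packages path from your own traceback.

```python
import os

# Illustrative path only; substitute the dfply/data directory from the
# FileNotFoundError message on your system.
data_dir = os.path.join('dfply', 'data')
os.makedirs(data_dir, exist_ok=True)

# A header-only CSV is enough for pd.read_csv to succeed at import time.
with open(os.path.join(data_dir, 'diamonds.csv'), 'w') as f:
    f.write('carat,cut,color,clarity,depth,table,price,x,y,z\n')
```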
The line
series = signals.loc[(signals.type == sig_type) & (signals.part_number == sig_partnr), 'value']
works in my code, but the line
series = signals >> mask(X.type == sig_type, X.part_number == sig_partnr) >> select('value')
results in the error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "[..]/dfply/base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "[..]/dfply/base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "[..]/dfply/base.py", line 112, in __call__
return self.function(*args, **kwargs)
File "[..]/dfply/base.py", line 179, in __call__
evaluation = self.call_action(args, kwargs)
File "[..]/dfply/base.py", line 253, in call_action
return symbolic.to_callable(symbolic_function)(args[0])
File "[..]/pandas_ply/symbolic.py", line 204, in <lambda>
return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
File "[..]/pandas_ply/symbolic.py", line 142, in _eval
result = evaled_func(*evaled_args, **evaled_kwargs)
File "[..]/dfply/subset.py", line 55, in mask
mask = mask & arg
File "[..]/pandas/core/ops.py", line 915, in wrapper
self, other = _align_method_SERIES(self, other, align_asobject=True)
File "[..]/pandas/core/ops.py", line 629, in _align_method_SERIES
left, right = left.align(right, copy=False)
File "[..]/pandas/core/series.py", line 2411, in align
broadcast_axis=broadcast_axis)
File "[..]/pandas/core/generic.py", line 4937, in align
fill_axis=fill_axis)
File "[..]/pandas/core/generic.py", line 5006, in _align_series
return_indexers=True)
File "[..]/pandas/core/indexes/range.py", line 441, in join
sort)
File "[..]/pandas/core/indexes/base.py", line 3024, in join
return_indexers=return_indexers)
File "[..]/pandas/core/indexes/datetimes.py", line 1069, in join
return_indexers=return_indexers, sort=sort)
File "[..]/pandas/core/indexes/base.py", line 3033, in join
return this.join(other, how=how, return_indexers=return_indexers)
File "[..]/pandas/core/indexes/base.py", line 3046, in join
return_indexers=return_indexers)
File "[..]/pandas/core/indexes/base.py", line 3127, in _join_non_unique
sort=True)
File "[..]/pandas/core/reshape/merge.py", line 982, in _get_join_indexers
llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
File "[..]/pandas/core/reshape/merge.py", line 1412, in _factorize_keys
llab, rlab = _sort_labels(uniques, llab, rlab)
File "[..]/pandas/core/reshape/merge.py", line 1438, in _sort_labels
_, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
File "[..]/pandas/core/algorithms.py", line 483, in safe_sort
ordered = sort_mixed(values)
File "[..]/pandas/core/algorithms.py", line 476, in sort_mixed
nums = np.sort(values[~str_pos])
File "[..]/numpy/core/fromnumeric.py", line 822, in sort
a.sort(axis=axis, kind=kind, order=order)
File "pandas/_libs/tslib.pyx", line 1080, in pandas._libs.tslib._Timestamp.__richcmp__ (pandas/_libs/tslib.c:20281)
TypeError: Cannot compare type 'Timestamp' with type 'int'
What is the reason? My dataframe looks like
part_number type value
timestamps
2017-08-01 00:00:32.651504 91cb9fa3859f4d44853f6200616db619 power1 -0.001651
2017-08-01 00:00:32.652504 91cb9fa3859f4d44853f6200616db619 power2 0.005068
2017-08-01 00:00:32.653504 91cb9fa3859f4d44853f6200616db619 power1 -0.004536
2017-08-01 00:00:32.654504 91cb9fa3859f4d44853f6200616db619 power2 -0.000084
2017-08-01 00:00:32.655504 5535c560ece9415f8f6ad996f1c23f6e power1 -0.001114
2017-08-01 00:00:32.656504 5535c560ece9415f8f6ad996f1c23f6e power2 -0.005621
2017-08-01 00:00:32.657504 5535c560ece9415f8f6ad996f1c23f6e power1 -0.000638
2017-08-01 00:00:32.658504 5535c560ece9415f8f6ad996f1c23f6e power2 -0.006916
2017-08-01 00:00:32.659504 5535c560ece9415f8f6ad996f1c23f6e power1 0.001549
where the index is a DatetimeIndex. I am using dfply version 0.2.4.
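A workaround sketch: combining the boolean masks inside mask() appears to trigger an index alignment that ends up comparing Timestamps with ints. Moving the DatetimeIndex into a column first gives a plain RangeIndex to align on. The frame below is a tiny stand-in for the data shown above.

```python
import pandas as pd

# Tiny stand-in for the signals frame with a DatetimeIndex.
signals = pd.DataFrame(
    {'part_number': ['p1', 'p1'], 'type': ['power1', 'power2'],
     'value': [-0.001651, 0.005068]},
    index=pd.to_datetime(['2017-08-01 00:00:32.651504',
                          '2017-08-01 00:00:32.652504']))

# Workaround: demote the index to a column before filtering.
flat = signals.reset_index()
# dfply form would then be:
#   flat >> mask(X.type == 'power1', X.part_number == 'p1') >> select('value')
series = flat.loc[(flat.type == 'power1') & (flat.part_number == 'p1'), 'value']
```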
The cumcount() function is missing from the embedded column functions. It is especially needed when using the spread() function to handle DataFrames with duplicate identifiers.
This was never an issue. It should be deleted.
So I want to be able to simply do a group-by and count on a column with categorical values. When running the code below
df = pd.DataFrame({"animal": ["cat", "cat", "dog", "dog"],
                   "breed": ["tabby", "short hair", "poodle", "pug"],
                   "age": [1, 2, 3, 4]})
df >> group_by(X.animal) >> summarize(count=n(X.name))
I run into an AttributeError: 'str' object has no attribute 'size' error.
In dplyr, this would be the equivalent of:
df %>% group_by(animal) %>% summarise(count = n())
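A likely explanation, hedged: within each group, `X.name` resolves to the group's `.name` attribute (the group key, a plain string) rather than a column, which is where the AttributeError comes from. Counting any real column works; the plain-pandas equivalent of dplyr's `n()` is:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "cat", "dog", "dog"],
                   "breed": ["tabby", "short hair", "poodle", "pug"],
                   "age": [1, 2, 3, 4]})

# dfply form, counting an actual column:
#   df >> group_by(X.animal) >> summarize(count=n(X.animal))
counts = df.groupby("animal").size().reset_index(name="count")
```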
semi_join and anti_join fail when joining on more than one column.
You can reproduce it with
df1 = pd.DataFrame({'x':[1,2,3,4,5], 'y':[10,20,40,50,100]})
df2 = pd.DataFrame({'x':[3,4], 'y':[40,51], 'z':[600,800]})
df1 >> anti_join(df2, by=['x', 'y'])
# or df1 >> anti_join(df2, by=[['x', 'y'], ['x', 'y']])
left_join works fine with the same construction.
The error message is:
df1 >> anti_join(df2, by =['x','y'])
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/join.py", line 246, in anti_join
other_reduced = other[right_on].drop_duplicates()
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/frame.py", line 2053, in __getitem__
return self._getitem_array(key)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/frame.py", line 2097, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py", line 1217, in _convert_to_indexer
indexer = check = labels.get_indexer(objarr)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py", line 2286, in get_indexer
indexer = self._engine.get_indexer(target._values)
File "pandas/index.pyx", line 300, in pandas.index.IndexEngine.get_indexer (pandas/index.c:6420)
File "pandas/src/hashtable_class_helper.pxi", line 793, in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:14637)
TypeError: unhashable type: 'list'
It looks like the problem is the else block in the code below (it's from the function semi_join):
...
if not right_on:
    right_on = [col_name for col_name in df.columns.values.tolist() if col_name in other.columns.values.tolist()]
    left_on = right_on
else:
    right_on = [right_on]
...
Pandas expects a list of column names, but this block wraps a list passed via by= into a list of lists.
When the else branch is removed, it starts to work.
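A hedged sketch of the fix described above: wrap the key in a list only when a single column name was given, so a list passed via by= stays flat.

```python
# Hypothetical helper mirroring the reported logic; not the exact dfply source.
def normalize_on(by, df_cols, other_cols):
    if not by:
        # No keys given: join on all shared column names.
        on = [c for c in df_cols if c in other_cols]
    elif isinstance(by, str):
        # Single column name: wrap it.
        on = [by]
    else:
        # Already a list/tuple of names: leave it flat.
        on = list(by)
    return on
```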
Hi, in the unite function you have a stray print:
print(to_unite, sep, remove, na_action)
Thanks
The readme has no instructions on how best to install dfply. Could you add this, please? I would especially like to know how to install it into an Anaconda environment.
I checked out the requirements and, except for pandas_ply, I got them installed. However, I am not sure what pandas_ply is supposed to be. Do you mean https://github.com/coursera/pandas-ply? If so: from their GitHub page I understand that pandas-ply is not stable yet. So why would dfply be considered stable if it is built on pandas-ply?
Hi kieferk,
I am an R user learning how to use dfply. I may have spotted an issue: it appears that the Boolean ~ isn't evaluated after the Boolean | when applied in the syntax below.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *
# Create data frame and mask it
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
4 5.0 NaN 1
And here is the result of the piped mask, df2:
a b c
0 NaN 6.0 5
4 5.0 NaN 1
However, I expect this instead:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
I don't understand why the | and ~ operators result in rows in which column "a" is NaN or column "b" is NaN, as if the ~ were never applied.
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask(np.logical_or(X.a.isnull(), ~X.b.isnull())))
print(df)
print(df2)
But this resulted in an error:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
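One workaround sketch: expressing the negation with `.notnull()` instead of `~` sidesteps however the symbolic layer handles the unary operator, and gives the expected four rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# dfply form: df >> mask(X.a.isnull() | X.b.notnull())
df2 = df[df.a.isnull() | df.b.notnull()]
```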
Why don't we have mutate_at in dfply?
This is not an issue, but I am not sure how to contact someone on GitHub just to pass along some information. I wanted to let you know that there is a new grammar for data manipulation: https://github.com/has2k1/plydata
A related question I have is: how is dfply related to it? They seem rather similar, right?
The head operator works well when the dataset is initially loaded.
df >> head(5) returned exactly 5 rows.
I then applied a group_by on the dataframe and saved it to the same variable:
df = df >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())
When printing the head of the updated dataframe,
df >> head(5) prints the entire dataframe instead of just the first five rows.
P.S.: A big shout-out for the amazing work which went into this package; using it saved me a lot of time. Thanks!
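A plausible explanation, hedged: after group_by, subsequent dfply verbs apply within each group, so head(5) can return up to five rows per team, which for a small number of teams looks like the whole frame. A plain-pandas sketch of the mutate plus an ungrouped head (in dfply this would be `df >> ungroup() >> head(5)`, assuming your version provides ungroup):

```python
import pandas as pd

df = pd.DataFrame({'Team_Name': ['A'] * 4 + ['B'] * 4,
                   'Hits': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Bats': [10] * 8})

# Per-team batting average broadcast back to every row (the mutate).
g = df.groupby('Team_Name')
df['bat_avg'] = g['Hits'].transform('sum') / g['Bats'].transform('sum')

# An ungrouped head returns exactly five rows.
top5 = df.head(5)
```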
Last update is from August 2017.
The pip version of the package currently doesn't import the filter_by function, because it doesn't exist. #50
Guys, how do I filter multiple values from the same column? The code below throws an error.
import pandas as pd
from dfply import *
data = pd.DataFrame({"Col1" :["a","b","c","d"],"Col2":[1,2,3,4]})
data >> mask(X.Col1 == ["a","b"])
Error:
ValueError: Arrays were different lengths: 4 vs 2
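`==` against a list does an element-wise comparison, hence the "different lengths: 4 vs 2" error; membership tests use `.isin()` instead. Plain-pandas form, with the dfply pipe shown as a comment:

```python
import pandas as pd

data = pd.DataFrame({"Col1": ["a", "b", "c", "d"], "Col2": [1, 2, 3, 4]})

# dfply form: data >> mask(X.Col1.isin(["a", "b"]))
out = data[data.Col1.isin(["a", "b"])]
```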
I am trying to calculate the summary statistics by grouping variable and then sorting the result in descending order.
#Import Data
import pandas as pd
mydata=pd.read_csv("http://winterolympicsmedals.com/medals.csv")
#2006 Gold Medal Count
mydata >> mask(X.Year==2006 , X.Medal =='Gold') >> group_by(X.NOC) >> summarize(N=n(X.NOC)) >> arrange(X.N, ascending=False)
The gold medal count (i.e. variable N) is not sorted in descending order.
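As a cross-check, the plain-pandas equivalent of the tail of the pipeline (as I understand it, dfply's arrange forwards keyword arguments to sort_values, so `arrange(X.N, ascending=False)` is intended to do the same):

```python
import pandas as pd

# Tiny stand-in for the grouped medal counts.
medals = pd.DataFrame({'NOC': ['AUT', 'GER', 'SWE'], 'N': [9, 11, 7]})
out = medals.sort_values('N', ascending=False).reset_index(drop=True)
```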
This function is pretty essential for data analysis, and selecting by value is also one of the least ergonomic operations in pandas, so this would provide real value.
I think joining on different columns does not work. By that I mean
a_df = pd.DataFrame.from_items([('one', [1,2,3]),('two',['a','b','c'])])
b_df = pd.DataFrame.from_items([('three', [1,2,3]),('four',['d','e','f'])])
a_df >> inner_join(b_df,by=['one','three'])
gives the error
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'one'
and
a_df >> inner_join(b_df,by=[['one'],['three']])
gives
IndexError: list index out of range
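Until the by= handling supports differently named key columns, the underlying pandas merge expresses the intended join directly:

```python
import pandas as pd

a_df = pd.DataFrame({'one': [1, 2, 3], 'two': ['a', 'b', 'c']})
b_df = pd.DataFrame({'three': [1, 2, 3], 'four': ['d', 'e', 'f']})

# Join a_df.one against b_df.three by name.
joined = a_df.merge(b_df, left_on='one', right_on='three', how='inner')
```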
When using mask() it is possible to AND statements, but I don't see a way of OR-ing statements. Could this please be added to the syntax somehow?
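For what it's worth, multiple arguments to mask() are ANDed, but an OR can already be written inside a single expression with the `|` operator (parenthesizing each comparison), as the earlier isnull example in this thread suggests. Plain-pandas form:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [9, 8, 7]})

# dfply form: df >> mask((X.a == 1) | (X.b == 7))
out = df[(df.a == 1) | (df.b == 7)]
```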
I have fixed a typo (right_in -> right_on) and reversed the logic in one if statement in the function that creates join parameters. See the changes here:
master...jankislinger:fix-join-multiple-by
Now it can be used to join tables on columns with different names:
import pandas as pd
from dfply import *
a = pd.DataFrame({
    'x1': ['A', 'B', 'C'],
    'x2': [1, 2, 3]
})
b = pd.DataFrame({
    'x4': ['A', 'B', 'D'],
    'x3': [True, False, True]
})
a >> inner_join(b, by=('x1', 'x4'))
It would also be convenient to be able to use multiple by statements. For example, the expression
a >> inner_join(b, by=['x1', ('x2', 'x3')])
could be used as
a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x3'])
If you agree I would modify the code and create a PR.
I'm running across errors when I try to use numpy or math functions (e.g., sqrt, log, etc) inside dfply verbs. Here's a minimal example:
import pandas as pd
from dfply import *
import numpy as np
df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
df >> mutate(y = np.log(X.x))
This gives the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-f8d61ebf2e20> in <module>()
3 df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
4
----> 5 df >> mutate(y = np.log(X.x))
ValueError: invalid __array_struct__
Is this functionality not implemented? Maybe there's a workaround I'm not seeing?
(I'm on python 3.6.3)
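A likely cause, hedged: the NumPy ufunc tries to evaluate the symbolic X eagerly, hence the invalid __array_struct__ error. If your dfply version exposes make_symbolic, wrapping the ufunc defers it until pipe evaluation; the plain-pandas equivalent of the intended mutate is shown below.

```python
import numpy as np
import pandas as pd

# Possible dfply-side workaround (only if make_symbolic is available):
#   log = make_symbolic(np.log)
#   df >> mutate(y=log(X.x))

# Plain-pandas equivalent:
df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
df['y'] = np.log(df['x'])
```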
A common idiom in dplyr is something like
df = df %>% (stuff)...
which can be simplified with
df %<>% (stuff)
I've looked around to see whether there's a way to overload <>, and I'm not sure there is (I think it gets interpreted as __ne__, but then that would also apply to !=).
At any rate, I think this would be very useful (essentially just doing things in place instead of with copies).
It might be worth noting that Hadley has purposely left this out of dplyr, because (I believe) he's somewhat opposed to doing things in place.
I can confirm that in 0.3.3 the issue is still the same.
Originally posted by @steer629 in #61 (comment)
I have a DataFrame for which
hub2['time'] = pd.to_datetime(hub2.timestamp)
works, but when I write
hub2 >> mutate(time=pd.to_datetime(X.timestamp))
I get the error
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "[...]/lib/python2.7/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "[...]/lib/python2.7/site-packages/pandas/tseries/tools.py", line 419, in to_datetime
elif isinstance(arg, ABCSeries):
File "[...]/lib/python2.7/site-packages/pandas/types/generic.py", line 9, in _check
return getattr(inst, attr, '_typ') in comp
TypeError: __nonzero__ should return bool or int, returned Call
Why is that?
Hi,
I am an avid dplyr user in R and somewhat new to Python. I had been looking for a dplyr-like package in Python for a while when I came across dfply, which looks pretty close to what I was looking for.
Please excuse me if this is not quite the right forum, but I was looking for some help / to request some documentation / to request a feature.
My use case essentially is that I have a function that operates on single elements of data-frame columns, e.g.
my_func(a, b)
where both a and b are single elements from columns of a data frame. I found a Stack Overflow post that shows this for an operation on a single column only:
https://stackoverflow.com/questions/42671168/dfply-mutating-string-column-typeerror
The solution shown there, using X.file.apply for the column X.file in the data frame, seems to work only when you have a single column to operate on.
What I was essentially wondering is: how do you recommend best using dfply in this context? Could you add some documentation on how best to use functions that don't natively understand Series objects?
E.g. could there be an "Intention"-like object that takes a function operating on several parameters, each of which is intended to be a single element from a column, "vectorizes" that function, and then, when passed an Intention object representing a Series, applies it appropriately?
Thanks for your help!
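One way to lift a scalar function of several arguments over columns without special dfply support is np.vectorize (or a row-wise apply). The function and frame below are made-up stand-ins for the use case described above.

```python
import numpy as np
import pandas as pd

# Hypothetical element-wise function of two column values.
def my_func(a, b):
    return a * 10 + b

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Vectorize once, then call it with whole columns.
df['out'] = np.vectorize(my_func)(df['a'], df['b'])

# Inside a pipe, the vectorized callable could then be wrapped with
# make_symbolic, if your dfply version provides it.
```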
Hi,
Please take a look at the following example:
from dfply import *
utime = pd.DataFrame({"u":1,"eventTime":["01-01-1971 01:04:00","01-01-1971 02:07:00","01-01-1971 01:09:00","01-01-1971 01:10:00"]})
print(utime >> arrange(X.eventTime))
utime = utime.set_index("u")
print(utime >> d.arrange(X.eventTime))
In the first case, the result is as expected. After introducing an index, the result is incorrect and contains 4 times as many values as before.
I'm not sure whether this is a bug or expected behavior, as I'm a newbie to pandas and to data frame indices.
output for the code:
eventTime u
0 01-01-1971 01:04:00 1
2 01-01-1971 01:09:00 1
3 01-01-1971 01:10:00 1
1 01-01-1971 02:07:00 1
eventTime
u
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
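A workaround sketch, hedged: the repeated rows look like an index-alignment effect on the non-unique "u" index. Restoring a unique RangeIndex before sorting gives the expected four rows.

```python
import pandas as pd

utime = pd.DataFrame({"u": 1,
                      "eventTime": ["01-01-1971 01:04:00", "01-01-1971 02:07:00",
                                    "01-01-1971 01:09:00", "01-01-1971 01:10:00"]})
utime = utime.set_index("u")

# Demote the non-unique index back to a column before sorting.
out = utime.reset_index().sort_values("eventTime")
```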
Hi,
The latest modifications to the join functions are breaking some usage code.
The offending changes are: bbe03e8...17b3440
The problem is this case:
df >> left_join(df2, by=["column1", "column2", "column3"])
It used to mean "merge based on those 3 columns", but now it means "merge based on column1 for left dataframe, and column2 on the right dataframe", which is quite different!
What's the rationale? Is it because we're now supposed to use tuples and not lists in those cases?
Either way, if we decide this is the way forward, it should be a major version change because it breaks one of the major use cases of dfply. What do you think?
Steps to reproduce:
$ conda create -n volatile-test-dfply -y python=3 jupyter
$ source activate volatile-test-dfply
$ pip install https://github.com/kieferk/dfply/zipball/master
$ jupyter console
>>> from dfply import *
In [1]: from dfply import *
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-1-aa7758fd9eef> in <module>()
----> 1 from dfply import *
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/dfply/__init__.py in <module>()
9 from dfply.summarize import *
10 from dfply.transform import *
---> 11 from dfply.data import *
12 from dfply.summary_functions import *
13 from dfply.window_functions import *
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/dfply/data.py in <module>()
3
4 root = os.path.abspath(os.path.dirname(__file__))
----> 5 diamonds = pd.read_csv(os.path.join(root, '..', 'data', "diamonds.csv"))
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
643 skip_blank_lines=skip_blank_lines)
644
--> 645 return _read(filepath_or_buffer, kwds)
646
647 parser_f.__name__ = name
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
386
387 # Create the parser.
--> 388 parser = TextFileReader(filepath_or_buffer, **kwds)
389
390 if (nrows is not None) and (chunksize is not None):
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
727 self.options['has_index_names'] = kwds['has_index_names']
728
--> 729 self._make_engine(self.engine)
730
731 def close(self):
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
920 def _make_engine(self, engine='c'):
921 if engine == 'c':
--> 922 self._engine = CParserWrapper(self.f, **self.options)
923 else:
924 if engine == 'python':
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1387 kwds['allow_leading_cols'] = self.index_col is not False
1388
-> 1389 self._reader = _parser.TextReader(src, **kwds)
1390
1391 # XXX
pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4019)()
pandas/parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:7967)()
FileNotFoundError: File b'/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/dfply/../data/diamonds.csv' does not exist
With my simple code
resultstatsDF >> mask(X.method == 'MICE')
I get the error
Traceback (most recent call last):
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/code.py", line 91, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 112, in __call__
return self.function(*args, **kwargs)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 179, in __call__
evaluation = self.call_action(args, kwargs)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 253, in call_action
return symbolic.to_callable(symbolic_function)(args[0])
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas_ply/symbolic.py", line 204, in <lambda>
return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas_ply/symbolic.py", line 142, in _eval
result = evaled_func(*evaled_args, **evaled_kwargs)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/subset.py", line 56, in mask
return df[mask.values]
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas/core/frame.py", line 2053, in __getitem__
return self._getitem_array(key)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas/core/frame.py", line 2090, in _getitem_array
(len(key), len(self.index)))
ValueError: Item wrong length 114 instead of 60.
where resultstatsDF.shape is (60, 10). What am I to do?
Could it be that it has something to do with the following?
resultstatsDF.index
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 2, 3, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5],
dtype='int64')
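The repeated index values are the likely culprit, hedged: with non-unique labels, the boolean-mask alignment inside the pipe can expand rows (114 > 60). Giving the frame a unique RangeIndex avoids the mismatch; a tiny stand-in:

```python
import pandas as pd

# Duplicate index labels, as in resultstatsDF.index above.
df = pd.DataFrame({'method': ['MICE', 'other', 'MICE']}, index=[0, 0, 1])

# Workaround: reset to a unique RangeIndex before masking.
flat = df.reset_index(drop=True)
# dfply form: flat >> mask(X.method == 'MICE')
out = flat[flat.method == 'MICE']
```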
A missing group_by column does not raise an exception.
In [15]: pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]}) >> group_by('c')
Out[15]:
a b
0 1 4
1 2 5
2 2 6
In dplyr, the equivalent situation raises an error:
library(tidyverse)
> data.frame(a=c(1,2,2), b=c(4,5,6)) %>% group_by(c)
Error in grouped_df_impl(data, unname(vars), drop) :
Column `c` is unknown
My preference would be for this to raise an exception instead of passing silently.
When doing groupby / summarise actions, the following warning occurs:
/opt/anaconda/envs/Python3/lib/python3.6/site-packages/dfply/base.py:137: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
other_copy._grouped_by = getattr(other, '_grouped_by', None)
An example:
from dfply import *
diamonds >> group_by('carat', 'cut') >> summarize(price=X.price.mean())
I don't know if this is fixable, but it would be nice to get rid of the warning!
For some reason, I have a very complicated Excel file being read into pandas, and some column names contain blanks or hyphens, e.g. "Commercial Project-ID".
target_data_frame= orginal_data_frame >> mask(X.['Commercial Project-ID']==0)
or
target_data_frame= orginal_data_frame >> mask(X.'Commercial Project-ID'==0)
or
target_data_frame= orginal_data_frame >> mask(X.Commercial Project-ID==0)
None of these work.
Of course I can rename the columns before filtering, but I am wondering if there is a better way to do this.
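For what it's worth, attribute access (X.name) can never express a column name containing spaces or hyphens, but item access can, and the X["..."] form does appear elsewhere in dfply usage. A plain-pandas sketch of the same filter (frame and values invented for illustration):

```python
import pandas as pd

original_data_frame = pd.DataFrame({'Commercial Project-ID': [0, 1, 0],
                                    'value': [10, 20, 30]})

# Bracket indexing reaches columns whose names contain spaces or hyphens:
target_data_frame = original_data_frame[
    original_data_frame['Commercial Project-ID'] == 0
]

# The dfply analogue would be bracket indexing on the symbolic X:
# target_data_frame = original_data_frame >> mask(X['Commercial Project-ID'] == 0)
```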
Below is my test data:
In[43]: payment >> head(5)
Out[43]:
date user_name game_id channel \
11165 2016-08-24 06:36:28 000000000000o myfish FB_IS_MA_AG2535_GP
0 2016-08-02 10:14:31 00000025 myfish google-play
8 2016-08-02 13:18:19 00000027 myfish Fanpage_Dailypost_APK
10921 2016-08-23 19:48:21 00000030 myfish in_app
11980 2016-08-25 11:25:29 00000030 myfish in_app
money
11165 3000.0
0 1000.0
8 3000.0
10921 3000.0
11980 3000.0
When I try to groupby:
In[45]: payment >> head(5) >> groupby(X.user_name)
Traceback (most recent call last):
File "C:\Program Files\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-45-cf2fbef85582>", line 1, in <module>
payment >> head(5) >> groupby(X.user_name)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 179, in __call__
evaluation = self.call_action(args, kwargs)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 285, in call_action
return symbolic.to_callable(symbolic_function)(self.df)
File "C:\Program Files\Anaconda2\lib\site-packages\pandas_ply\symbolic.py", line 204, in <lambda>
return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
File "C:\Program Files\Anaconda2\lib\site-packages\pandas_ply\symbolic.py", line 142, in _eval
result = evaled_func(*evaled_args, **evaled_kwargs)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 357, in wrapped
return f(*flat_args, **kwargs)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 456, in wrapped
for arg in args[1:]]
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 437, in _col_ind_to_label
raise Exception("Label not of type str or int.")
Exception: Label not of type str or int.
My data type:
In[46]: payment.dtypes
Out[46]:
date datetime64[ns]
user_name object
game_id object
channel object
money float64
dtype: object
The data was read from a database using sqlalchemy, and user_name is stored as varchar. I rechecked with the diamonds data using the same command; it works for diamonds, and I could not figure out why.
How could I fix the problem?
Kind regards.
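Judging from the exception ("Label not of type str or int"), this dfply version expects column labels rather than symbolic X expressions, so `payment >> head(5) >> groupby('user_name')` should be the working form. A plain-pandas sketch of the same grouping, on invented sample data:

```python
import pandas as pd

payment = pd.DataFrame({
    'user_name': ['00000025', '00000027', '00000030', '00000030'],
    'money': [1000.0, 3000.0, 3000.0, 3000.0],
})

# Grouping by the column label (a plain str) is what the error message asks for:
totals = payment.groupby('user_name')['money'].sum()
```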
Hello all.
I understand this isn't the most bristling-with-activity project, but -- coming at it from the perspective of a Python tinkerer who's fallen in love with R's tidyverse -- this is really clever stuff. I've been reading through the source code and it's clear, pretty, and concise.
It is, however, deeply stitched up to Pandas. One of the neat things about dplyr / the tidyverse in general is that it works (to a greater or lesser degree!) with other sources like DBI connections or Spark. Further, it allows the programmer to bring an almost declarative paradigm where it might be warranted to tasks unrelated to data wrangling. I think some Pythonistas would frown deeply at the suggestion, but something like dfply's pipes with an FP library like toolz could be really nice from my perspective.
All that said: would you be amenable to my taking on the task of trying to generalize dfply a little? Obviously the tidyverse would be too huge an undertaking, but I could see making the pipe code more general or in the medium term something like a proof-of-concept SQL generator or SQLalchemy backend.
If not, totally fine! Though I may fork and go at it on my lonesome.
Hi, thank you for all the good work here, I like this the best of the dplyr clones.
In R I am able to do something like,
df %>% mutate(newcol = ifelse(x > 3 & lead(y) < 2, 'yes', 'no'))
In Python it seems that I should be using the numpy.where function. I also read enough of your documentation to realize I need to wrap this function in another function with the @make_symbolic decorator. So, I have this:
@make_symbolic
def np_where(bools, val_if_true, val_if_false):
return list(np.where(bools, val_if_true, val_if_false))
When I call it like this, it works just fine:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F', 'Punct', 'Not Punct'))
However if I want to make my expression to evaluate to True or False more complex with ands or ors, I get an error:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' & X.CPOS == 'F', 'Punct', 'Not Punct'))
also tried with:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' and X.CPOS == 'F', 'Punct', 'Not Punct'))
I get this error:
TypeError: index returned non-int (type Intention)
I thought that my @make_symbolic decorator took care of this kind of thing. Perhaps I need a logical and that also has the delaying decorator.
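The likely culprit is operator precedence rather than the decorator: in Python, & binds more tightly than ==, so the expression parses as `lead(X.CPOS) == ('F' & (X.CPOS == 'F'))`, and the `and` keyword cannot be overloaded at all, so it forces boolean conversion of the symbolic object. Parenthesizing each comparison should fix it. A plain-pandas sketch, using shift(-1) as a stand-in for dfply's lead():

```python
import numpy as np
import pandas as pd

cpos = pd.Series(['F', 'F', 'N', 'F'])
lead_cpos = cpos.shift(-1)  # stand-in for dfply's lead()

# Parenthesize each comparison before combining with the bitwise & operator:
bools = (lead_cpos == 'F') & (cpos == 'F')
my_val = np.where(bools, 'Punct', 'Not Punct')
```

In dfply, the equivalently parenthesized form `(lead(X.CPOS) == 'F') & (X.CPOS == 'F')` should evaluate without the TypeError.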
Passing through docstrings now works great when using the @pipe
decorator, but not when you use the @dfpipe
decorator (which the dfply functions get defined through).
The problem is that when using @dfpipe
, the function passed to pipe
is actually group_delegation
:
def dfpipe(f):
return pipe(
group_delegation(
symbolic_evaluation(f)
)
)
I think for this to truly work you'd need to pass the docstring through all three of these functions. Then mutate
and other dfply
functions would have the proper docstrings.
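A minimal sketch of the fix, using functools.wraps to carry the docstring through each wrapping layer (the pipe and group_delegation stand-ins below are toy closures for illustration, not dfply's real decorator classes):

```python
import functools
import pandas as pd

def pipe(f):
    @functools.wraps(f)  # copies __doc__ and __name__ onto the wrapper
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

def group_delegation(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

def dfpipe(f):
    return pipe(group_delegation(f))

@dfpipe
def mutate(df, **kwargs):
    """Add new columns to the DataFrame."""
    return df.assign(**kwargs)
```

Since dfply's actual decorators are classes rather than closures, the equivalent change there would be forwarding the docstring in each `__init__` (e.g. `self.__doc__ = getattr(function, '__doc__', None)`).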
Consider the following example:
diamonds >> mutate(rank=min_rank(X.carat)) >> filter_by(X.rank <10)
This fails with
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 142, in __rrshift__
result = self.function(other_copy)
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 149, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 329, in __call__
return self.function(*args, **kwargs)
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 282, in __call__
return self.function(df, *args, **kwargs)
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/subset.py", line 62, in mask
if arg.dtype != bool:
AttributeError: 'NotImplementedType' object has no attribute 'dtype'
but the expression seems legitimate to me.
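A plausible cause: X.rank resolves via attribute access, and DataFrame.rank is a built-in pandas method, so the symbolic expression evaluates against the bound method rather than the newly created column; comparing a method with 10 yields NotImplemented, which then has no dtype. If so, bracket indexing (`filter_by(X['rank'] < 10)`) should avoid the shadowing. The plain-pandas version of the same shadowing:

```python
import pandas as pd

df = pd.DataFrame({'carat': [0.5, 0.3, 0.9]})
df = df.assign(rank=df['carat'].rank(method='min'))

# Attribute access is shadowed by the DataFrame.rank method...
assert callable(df.rank)
# ...while bracket indexing always reaches the column:
top = df[df['rank'] < 2]
```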
On my development machine I installed dfply version 0.2.4. Now I installed dfply with pip install dfply
on the server and suddenly I got NameError: name 'groupby' is not defined
. It took me ages to realise it's not my code that is wrong, but that you had made a new release (which was such a surprise because it had been a while). It seems that your groupby
is now group_by
... maybe you can write a migration tutorial? Tensorflow did it reasonably well.
I am trying to extract the date portion from a datetime by mutating it into a substring, but the resulting dataframe still prints the whole datetime field:
df >>= mask(X["Item Type"] != 'Application') >> mutate(transaction_date = X["Transaction Time"][0:10])
Resulting df.transaction_date prints full string, ex: 2019-02-10 19:21:45 PST
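Slicing a Series with [0:10] selects the first ten rows, not the first ten characters of each value; per-element string slicing goes through the .str accessor (or .dt.date for true datetime columns). A plain-pandas sketch with invented sample values:

```python
import pandas as pd

df = pd.DataFrame({'Transaction Time': ['2019-02-10 19:21:45 PST',
                                        '2019-03-01 08:02:10 PST']})

# Series[0:10] slices ROWS; .str[:10] slices each string element:
df = df.assign(transaction_date=df['Transaction Time'].str[:10])
```

The dfply analogue should be `mutate(transaction_date=X["Transaction Time"].str[:10])`, since the symbolic X proxies attribute and method access.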
Mask doesn't seem to work correctly on grouped dataframes. Consider the three operations below.
diamonds >> mask(X.cut == 'Ideal') >> groupby(X.color) # works as expected
diamonds >> groupby(X.color) >> mask(X.cut == 'Ideal') # doesn't work correctly (strange behavior)
diamonds >> mask(X.cut == 'Ideal') # works as expected
In the first example, the mask is applied before the grouping, so it behaves as expected.
In the second example, the grouping is applied before the mask. The returned dataframe includes cases where X.cut != 'Ideal', and returns about 2500 rows. I'm not sure what determines which rows are returned.
In the third example, there is no grouping, and the data behaves as expected (it returns the same result as the first example).
There are use cases where you might want to use grouping and mask together; for example, to return the min for each group, you might want to do something like:
diamonds >> groupby(X.color) >> mask(X.x == X.x.min()) (1)
diamonds >> mask(X.x == X.x.min()) (2)
For (1), it returns a single row, where x is not a min for any group, but (2) behaves as expected.
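Until the grouped mask behaves correctly, the per-group minimum that example (1) intends can be obtained in plain pandas with groupby().transform (tiny invented data for illustration):

```python
import pandas as pd

diamonds = pd.DataFrame({'color': ['E', 'E', 'I', 'I'],
                         'x': [3.9, 4.1, 3.8, 4.4]})

# transform('min') broadcasts each group's minimum back onto the original rows,
# so the comparison keeps exactly the min row(s) of every group:
group_min = diamonds.groupby('color')['x'].transform('min')
mins = diamonds[diamonds['x'] == group_min]
```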
Hello, in your README examples it would be very useful to show how to mask on null / None, and on not-null.
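Since the symbolic X defers pandas method calls, the pandas isnull/notnull predicates are the natural candidates; the plain-pandas forms below run as written, with the presumed dfply analogues shown in comments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

nulls = df[df['a'].isnull()]        # dfply analogue: df >> mask(X.a.isnull())
not_nulls = df[df['a'].notnull()]   # dfply analogue: df >> mask(X.a.notnull())
```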
Hello users,
I have been working on and off on an upcoming version which will be v1.0.0 due to its incompatibility with previous versions. You can actually view this nearly-complete version in the "feature/collapsed-selection" branch.
Originally I was just working on getting the selection helper functions working, but in order to do that a lot had to change in the base decorators. The selection helper functions (such as contains("ca"), which finds columns containing that string when passed as an argument to the select function) now work. Previously, a variety of different decorators would be stacked together to get different kinds of behavior. In the new conceptualization, the only decorator will be @dfpipe, and it will take keyword arguments that can change its behavior (it can also be used without keyword arguments, in which case it will behave as the current @dfpipe decorator does now).
If you're interested in checking it out and have any questions/comments/concerns, please go ahead. I don't have a timetable for its release but considering it's nearing completion and currently passes all the written unit tests, I don't expect it will be much longer.
Greetings -
I noticed that dfply is missing the dplyr equivalent of matches()
, which allows the user to pass in a regular expression when selecting columns (as opposed to a literal string - i.e., the contains()
function). I've found matches()
to be very useful in the past and would like to know if you're amenable to it being added to dfply? If so, I'd be happy to put together a PR.
Chris
Pandas is constantly throwing warnings like this and it makes the group_by
unusable in my case, since it swamps all other console output.
.../python3.6/site-packages/dfply/base.py:307: FutureWarning: Interpreting tuple 'by' as a list of keys, rather than a single key. Use 'by=[...]' instead of 'by=(...)'. In the future, a tuple will always mean a single key.
When Pandas makes this change it will break the group_by
so it's probably worth fixing now.
I think the way around this is to wrap the RHS with a list(args)
here.
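The suggested fix amounts to passing an explicit list, rather than the *args tuple, when delegating to pandas:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 4], 'v': [10, 20, 30]})

keys = ('a', 'b')                  # group keys arrive as a tuple via *args
grouped = df.groupby(list(keys))   # an explicit list avoids the tuple-'by' ambiguity
```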
should work as:
df >>= mutate(nonans=fillna(X.col_with_nans, value_to_replace))
There's a universal (and justified) dislike in the Python community for * imports.
Now I admit that dfply (great work btw) is a pain without it.
But, it currently has a bunch of things in the user-importable namespace
that we could possibly clean up.
A quick accounting of the 129 exports from dfply by type:
| type | count |
|---|---|
| <class 'dfply.base.Intention'> | 1 |
| <class 'dict'> | 1 |
| <class 'NoneType'> | 1 |
| <class '_frozen_importlib_external.SourceFileL... | 1 |
| <class 'list'> | 1 |
| <class '_frozen_importlib.ModuleSpec'> | 1 |
| <class 'pandas.core.frame.DataFrame'> | 1 |
| <class 'type'> | 5 |
| <class 'str'> | 6 |
| <class 'module'> | 16 |
| <class 'dfply.base.pipe'> | 40 |
| <class 'function'> | 55 |
where presumably only the dfply.base.pipe and a subset of the functions are 'verbs'.
My suggestion would be to introduce additional namespaces
and update the examples to use
from dfply.verbs import *
instead of from dfply import *
This way we would
a) not break anyone's code and
b) have a clean, 'non polluting' module that users can import.
What do you think?