kieferk / dfply
dplyr-style piping operations for pandas dataframes
License: GNU General Public License v3.0
Hey, not sure if this is an issue for anyone else, but one of my favourite features of dfply is the X symbol. It's awkward, though, because of the scikit-learn convention of using X for the array of predictors ... would it make sense to change X to something else? Maybe D or DF?
I want to be able to apply distinct to the whole dataframe, so that distinct() is equivalent to drop_duplicates().
Just putting this here as a placeholder for now so that I remember; I'm happy to take this on and submit a PR when I get time.
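For reference, a minimal sketch of the requested behavior in plain pandas (this is the semantics being asked for, not the dfply API):

```python
import pandas as pd

# distinct() with no column arguments should act on the whole frame,
# i.e. behave exactly like pandas drop_duplicates().
df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
result = df.drop_duplicates().reset_index(drop=True)
```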
The function get_join_parameters in join.py has an error:

if not isinstance(by[0], str):
    left_on = by[0]
    right_in = by[1]

This should be right_on = by[1].
Great library btw - I really missed dplyr when moving to python.
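A hypothetical sketch of the corrected branch (names follow the report above, not necessarily the exact dfply source):

```python
# Hedged reconstruction: when `by` is a list/tuple of pairs, the first
# element names the left keys and the second the right keys; otherwise
# the same names are used on both sides.
def get_join_parameters(by):
    if isinstance(by, (list, tuple)) and not isinstance(by[0], str):
        left_on, right_on = by[0], by[1]  # was: right_in = by[1]
    else:
        left_on = right_on = by
    return left_on, right_on
```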
I'm having this same issue still:
#8
I am using conda to install dfply (which I need to do because conda is the package manager used by the computing cluster I have access to):
conda install -c tallic dfply
That's the command I use to install the package from https://anaconda.org/tallic/dfply.
But when I go to use dfply, it still says the diamonds.csv data is missing.
Traceback (most recent call last):
  File "ACH_nested_anova.py", line 1, in <module>
    import dfply
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/__init__.py", line 11, in <module>
    from .data import diamonds
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/__init__.py", line 5, in <module>
    diamonds = pd.read_csv(os.path.join(root, "diamonds.csv"))
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv' does not exist: b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv'
2019-03-15 13:25:11 ⌚ gateway-03 in ~/ACH_Development/ACH_tests/ACH_quiz3/python_scripts/Analysis
○ → python ACH_nested_anova.py
Traceback (most recent call last):
  File "ACH_nested_anova.py", line 2, in <module>
    from dfply import group_by as group_by, summarize as summarize, select as select
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/__init__.py", line 11, in <module>
    from .data import diamonds
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/__init__.py", line 5, in <module>
    diamonds = pd.read_csv(os.path.join(root, "diamonds.csv"))
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv' does not exist: b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv'
2019-03-15 13:25:41 ⌚ gateway-03 in ~/ACH_Development/ACH_tests/ACH_quiz3/python_scripts/Analysis
○ → pip install dfply
Requirement already satisfied: dfply in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (0.3.1)
Requirement already satisfied: numpy in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from dfply) (1.16.2)
Requirement already satisfied: pandas in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from dfply) (0.24.2)
Requirement already satisfied: python-dateutil>=2.5.0 in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from pandas->dfply) (2.8.0)
Requirement already satisfied: pytz>=2011k in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from pandas->dfply) (2018.9)
Requirement already satisfied: six>=1.5 in /mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->dfply) (1.12.0)
2019-03-15 13:26:59 ⌚ gateway-03 in ~/ACH_Development/ACH_tests/ACH_quiz3/python_scripts/Analysis
○ → python ACH_nested_anova.py
Traceback (most recent call last):
  File "ACH_nested_anova.py", line 2, in <module>
    from dfply import group_by as group_by, summarize as summarize, select as select
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/__init__.py", line 11, in <module>
    from .data import diamonds
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/__init__.py", line 5, in <module>
    diamonds = pd.read_csv(os.path.join(root, "diamonds.csv"))
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv' does not exist: b'/mnt/home/bundyjas/anaconda3/envs/ACH_environment/lib/python3.6/site-packages/dfply/data/diamonds.csv'
I can substitute the import line with any of the following and the result is still the same:
-import dfply
-from dfply import group_by as group_by, summarize as summarize, select as select
-from dfply import *
Please help. I cannot seem to use git or pip to correct the problem: pip tells me the package is already installed, but I get the same error, and git is not available to me.
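One workaround sketch for this class of failure: since the package was installed without its bundled data/diamonds.csv, `import dfply` fails at import time. Writing a minimal placeholder CSV at the path named in the traceback lets the import succeed (the bundled `diamonds` dataset will then just be a stub). The relative path below is purely illustrative; use the site-packages path from your own traceback.

```python
import os

# Illustrative path only; substitute the dfply/data directory from the
# FileNotFoundError message on your system.
data_dir = os.path.join('dfply', 'data')
os.makedirs(data_dir, exist_ok=True)

# A header-only CSV is enough for pd.read_csv to succeed at import time.
with open(os.path.join(data_dir, 'diamonds.csv'), 'w') as f:
    f.write('carat,cut,color,clarity,depth,table,price,x,y,z\n')
```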
The line
series = signals.loc[(signals.type == sig_type) & (signals.part_number == sig_partnr), 'value']
works in my code, but the line
series = signals >> mask(X.type == sig_type, X.part_number == sig_partnr) >> select('value')
results in the error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "[..]/dfply/base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "[..]/dfply/base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "[..]/dfply/base.py", line 112, in __call__
return self.function(*args, **kwargs)
File "[..]/dfply/base.py", line 179, in __call__
evaluation = self.call_action(args, kwargs)
File "[..]/dfply/base.py", line 253, in call_action
return symbolic.to_callable(symbolic_function)(args[0])
File "[..]/pandas_ply/symbolic.py", line 204, in <lambda>
return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
File "[..]/pandas_ply/symbolic.py", line 142, in _eval
result = evaled_func(*evaled_args, **evaled_kwargs)
File "[..]/dfply/subset.py", line 55, in mask
mask = mask & arg
File "[..]/pandas/core/ops.py", line 915, in wrapper
self, other = _align_method_SERIES(self, other, align_asobject=True)
File "[..]/pandas/core/ops.py", line 629, in _align_method_SERIES
left, right = left.align(right, copy=False)
File "[..]/pandas/core/series.py", line 2411, in align
broadcast_axis=broadcast_axis)
File "[..]/pandas/core/generic.py", line 4937, in align
fill_axis=fill_axis)
File "[..]/pandas/core/generic.py", line 5006, in _align_series
return_indexers=True)
File "[..]/pandas/core/indexes/range.py", line 441, in join
sort)
File "[..]/pandas/core/indexes/base.py", line 3024, in join
return_indexers=return_indexers)
File "[..]/pandas/core/indexes/datetimes.py", line 1069, in join
return_indexers=return_indexers, sort=sort)
File "[..]/pandas/core/indexes/base.py", line 3033, in join
return this.join(other, how=how, return_indexers=return_indexers)
File "[..]/pandas/core/indexes/base.py", line 3046, in join
return_indexers=return_indexers)
File "[..]/pandas/core/indexes/base.py", line 3127, in _join_non_unique
sort=True)
File "[..]/pandas/core/reshape/merge.py", line 982, in _get_join_indexers
llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
File "[..]/pandas/core/reshape/merge.py", line 1412, in _factorize_keys
llab, rlab = _sort_labels(uniques, llab, rlab)
File "[..]/pandas/core/reshape/merge.py", line 1438, in _sort_labels
_, new_labels = algos.safe_sort(uniques, labels, na_sentinel=-1)
File "[..]/pandas/core/algorithms.py", line 483, in safe_sort
ordered = sort_mixed(values)
File "[..]/pandas/core/algorithms.py", line 476, in sort_mixed
nums = np.sort(values[~str_pos])
File "[..]/numpy/core/fromnumeric.py", line 822, in sort
a.sort(axis=axis, kind=kind, order=order)
File "pandas/_libs/tslib.pyx", line 1080, in pandas._libs.tslib._Timestamp.__richcmp__ (pandas/_libs/tslib.c:20281)
TypeError: Cannot compare type 'Timestamp' with type 'int'
What is the reason? My dataframe looks like
part_number type value
timestamps
2017-08-01 00:00:32.651504 91cb9fa3859f4d44853f6200616db619 power1 -0.001651
2017-08-01 00:00:32.652504 91cb9fa3859f4d44853f6200616db619 power2 0.005068
2017-08-01 00:00:32.653504 91cb9fa3859f4d44853f6200616db619 power1 -0.004536
2017-08-01 00:00:32.654504 91cb9fa3859f4d44853f6200616db619 power2 -0.000084
2017-08-01 00:00:32.655504 5535c560ece9415f8f6ad996f1c23f6e power1 -0.001114
2017-08-01 00:00:32.656504 5535c560ece9415f8f6ad996f1c23f6e power2 -0.005621
2017-08-01 00:00:32.657504 5535c560ece9415f8f6ad996f1c23f6e power1 -0.000638
2017-08-01 00:00:32.658504 5535c560ece9415f8f6ad996f1c23f6e power2 -0.006916
2017-08-01 00:00:32.659504 5535c560ece9415f8f6ad996f1c23f6e power1 0.001549
where the index is a DatetimeIndex. I am using dfply version 0.2.4.
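A workaround sketch: combining the boolean masks inside mask() appears to trigger an index alignment that ends up comparing Timestamps with ints. Moving the DatetimeIndex into a column first gives a plain RangeIndex to align on. The frame below is a tiny stand-in for the data shown above.

```python
import pandas as pd

# Tiny stand-in for the signals frame with a DatetimeIndex.
signals = pd.DataFrame(
    {'part_number': ['p1', 'p1'], 'type': ['power1', 'power2'],
     'value': [-0.001651, 0.005068]},
    index=pd.to_datetime(['2017-08-01 00:00:32.651504',
                          '2017-08-01 00:00:32.652504']))

# Workaround: demote the index to a column before filtering.
flat = signals.reset_index()
# dfply form would then be:
#   flat >> mask(X.type == 'power1', X.part_number == 'p1') >> select('value')
series = flat.loc[(flat.type == 'power1') & (flat.part_number == 'p1'), 'value']
```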
The cumcount() function is missing from the embedded column functions. It is especially needed when using the spread() function to handle DataFrames with duplicate identifiers.
This was never an issue. It should be deleted.
So I want to be able to simply do a group-by and count on a column with categorical values. When running the code below
df = pd.DataFrame({"animal": ["cat", "cat", "dog", "dog"],
                   "breed": ["tabby", "short hair", "poodle", "pug"],
                   "age": [1, 2, 3, 4]})
df >> group_by(X.animal) >> summarize(count=n(X.name))
I run into an AttributeError: 'str' object has no attribute 'size' error.
In dplyr, this would be the equivalent of:
df %>% group_by(animal) %>% summarise(count = n())
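A likely explanation, hedged: within each group, `X.name` resolves to the group's `.name` attribute (the group key, a plain string) rather than a column, which is where the AttributeError comes from. Counting any real column works; the plain-pandas equivalent of dplyr's `n()` is:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "cat", "dog", "dog"],
                   "breed": ["tabby", "short hair", "poodle", "pug"],
                   "age": [1, 2, 3, 4]})

# dfply form, counting an actual column:
#   df >> group_by(X.animal) >> summarize(count=n(X.animal))
counts = df.groupby("animal").size().reset_index(name="count")
```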
semi_join and anti_join fail when joining on more than one column.
You can reproduce it with
df1 = pd.DataFrame({'x':[1,2,3,4,5], 'y':[10,20,40,50,100]})
df2 = pd.DataFrame({'x':[3,4], 'y':[40,51], 'z':[600,800]})
df1 >> anti_join(df2, by=['x', 'y'])
# or df1 >> anti_join(df2, by=[['x', 'y'], ['x', 'y']])
left_join works fine with the same construction.
The error message is:
df1 >> anti_join(df2, by =['x','y'])
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/dfply/join.py", line 246, in anti_join
other_reduced = other[right_on].drop_duplicates()
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/frame.py", line 2053, in __getitem__
return self._getitem_array(key)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/frame.py", line 2097, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py", line 1217, in _convert_to_indexer
indexer = check = labels.get_indexer(objarr)
File "/Users/anmiko/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py", line 2286, in get_indexer
indexer = self._engine.get_indexer(target._values)
File "pandas/index.pyx", line 300, in pandas.index.IndexEngine.get_indexer (pandas/index.c:6420)
File "pandas/src/hashtable_class_helper.pxi", line 793, in pandas.hashtable.PyObjectHashTable.lookup (pandas/hashtable.c:14637)
TypeError: unhashable type: 'list'
It looks like the problem is the else block in the code below (it's from the function semi_join):
...
if not right_on:
    right_on = [col_name for col_name in df.columns.values.tolist() if col_name in other.columns.values.tolist()]
    left_on = right_on
else:
    right_on = [right_on]
...
Pandas expects a list of column names, but this block wraps a list passed via by= into a list of lists.
When the else branch is removed, it starts to work.
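A hedged sketch of the fix described above: wrap the key in a list only when a single column name was given, so a list passed via by= stays flat.

```python
# Hypothetical helper mirroring the reported logic; not the exact dfply source.
def normalize_on(by, df_cols, other_cols):
    if not by:
        # No keys given: join on all shared column names.
        on = [c for c in df_cols if c in other_cols]
    elif isinstance(by, str):
        # Single column name: wrap it.
        on = [by]
    else:
        # Already a list/tuple of names: leave it flat.
        on = list(by)
    return on
```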
Hi, in the unite function you have a stray print:
print(to_unite, sep, remove, na_action)
Thanks
The readme has no instructions on how best to install dfply. Could you add this, please? I would especially like to know how to install it into an Anaconda environment.
I checked out the requirements and, except for pandas_ply, I got them installed. However, I am not sure what pandas_ply is supposed to be. Do you mean https://github.com/coursera/pandas-ply? If so: from their GitHub page I understand that pandas-ply is not stable yet. So why would dfply be considered stable if it is built on pandas-ply?
Hi kieferk,
I am an R user learning how to use dfply. I may have spotted an issue: it appears that the Boolean ~ isn't evaluated after the Boolean | when applied in the syntax below.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *
# Create data frame and mask it
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
4 5.0 NaN 1
And here is the result of the piped mask, df2:
a b c
0 NaN 6.0 5
4 5.0 NaN 1
However, I expect this instead:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
I don't understand why the | and ~ operators result in rows in which column "a" is NaN or column "b" is NaN, as if the ~ were never applied.
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask(np.logical_or(X.a.isnull(), ~X.b.isnull())))
print(df)
print(df2)
But this resulted in an error:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
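One workaround sketch: expressing the negation with `.notnull()` instead of `~` sidesteps however the symbolic layer handles the unary operator, and gives the expected four rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# dfply form: df >> mask(X.a.isnull() | X.b.notnull())
df2 = df[df.a.isnull() | df.b.notnull()]
```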
Why don't we have mutate_at in dfply?
This is not an issue, but I am not sure how to contact someone on GitHub just to pass along some information. I wanted to let you know that there is a new grammar for data manipulation: https://github.com/has2k1/plydata
A related question I have is: how is dfply related to it? They seem rather similar, right?
The head operator works well when the dataset is initially loaded.
df >> head(5) returned exactly 5 rows.
I then applied a group_by on the dataframe and saved it to the same variable:
df = df >> group_by(X.Team_Name) >> mutate(bat_avg = X.Hits.sum()/X.Bats.sum())
When printing the head of the updated dataframe,
df >> head(5) prints the entire dataframe instead of just the first five rows.
P.S.: A big shout-out for the amazing work which went into this package; using it saved me a lot of time. Thanks!
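A plausible explanation, hedged: after group_by, subsequent dfply verbs apply within each group, so head(5) can return up to five rows per team, which for a small number of teams looks like the whole frame. A plain-pandas sketch of the mutate plus an ungrouped head (in dfply this would be `df >> ungroup() >> head(5)`, assuming your version provides ungroup):

```python
import pandas as pd

df = pd.DataFrame({'Team_Name': ['A'] * 4 + ['B'] * 4,
                   'Hits': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Bats': [10] * 8})

# Per-team batting average broadcast back to every row (the mutate).
g = df.groupby('Team_Name')
df['bat_avg'] = g['Hits'].transform('sum') / g['Bats'].transform('sum')

# An ungrouped head returns exactly five rows.
top5 = df.head(5)
```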
Last update is from August 2017.
The pip version of the package currently doesn't import the filter_by function, because it doesn't exist. #50
Guys, how do I filter multiple values from the same column? The code below throws an error.
import pandas as pd
from dfply import *
data = pd.DataFrame({"Col1" :["a","b","c","d"],"Col2":[1,2,3,4]})
data >> mask(X.Col1 == ["a","b"])
Error:
ValueError: Arrays were different lengths: 4 vs 2
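`==` against a list does an element-wise comparison, hence the "different lengths: 4 vs 2" error; membership tests use `.isin()` instead. Plain-pandas form, with the dfply pipe shown as a comment:

```python
import pandas as pd

data = pd.DataFrame({"Col1": ["a", "b", "c", "d"], "Col2": [1, 2, 3, 4]})

# dfply form: data >> mask(X.Col1.isin(["a", "b"]))
out = data[data.Col1.isin(["a", "b"])]
```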
I am trying to calculate the summary statistics by grouping variable and then sorting the result in descending order.
#Import Data
import pandas as pd
mydata=pd.read_csv("http://winterolympicsmedals.com/medals.csv")
#2006 Gold Medal Count
mydata >> mask(X.Year==2006 , X.Medal =='Gold') >> group_by(X.NOC) >> summarize(N=n(X.NOC)) >> arrange(X.N, ascending=False)
The gold medal count (i.e. variable N) is not sorted in descending order.
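As a cross-check, the plain-pandas equivalent of the tail of the pipeline (as I understand it, dfply's arrange forwards keyword arguments to sort_values, so `arrange(X.N, ascending=False)` is intended to do the same):

```python
import pandas as pd

# Tiny stand-in for the grouped medal counts.
medals = pd.DataFrame({'NOC': ['AUT', 'GER', 'SWE'], 'N': [9, 11, 7]})
out = medals.sort_values('N', ascending=False).reset_index(drop=True)
```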
This function is pretty essential for data analysis, and selecting by value is also one of the least ergonomic operations in pandas, so this would provide real value.
I think joining on different columns does not work. By that I mean
a_df = pd.DataFrame.from_items([('one', [1,2,3]),('two',['a','b','c'])])
b_df = pd.DataFrame.from_items([('three', [1,2,3]),('four',['d','e','f'])])
a_df >> inner_join(b_df,by=['one','three'])
gives the error
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'one'
and
a_df >> inner_join(b_df,by=[['one'],['three']])
gives
IndexError: list index out of range
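Until the by= handling supports differently named key columns, the underlying pandas merge expresses the intended join directly:

```python
import pandas as pd

a_df = pd.DataFrame({'one': [1, 2, 3], 'two': ['a', 'b', 'c']})
b_df = pd.DataFrame({'three': [1, 2, 3], 'four': ['d', 'e', 'f']})

# Join a_df.one against b_df.three by name.
joined = a_df.merge(b_df, left_on='one', right_on='three', how='inner')
```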
When using mask() it is possible to AND statements, but I don't see a way of OR-ing statements. Could this please be added to the syntax somehow?
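For what it's worth, multiple arguments to mask() are ANDed, but an OR can already be written inside a single expression with the `|` operator (parenthesizing each comparison), as the earlier isnull example in this thread suggests. Plain-pandas form:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [9, 8, 7]})

# dfply form: df >> mask((X.a == 1) | (X.b == 7))
out = df[(df.a == 1) | (df.b == 7)]
```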
I have fixed a typo (right_in -> right_on) and reversed the logic in one if statement in the function that creates join parameters. See the changes here:
master...jankislinger:fix-join-multiple-by
Now it can be used to join tables on columns with different names:
import pandas as pd
from dfply import *
a = pd.DataFrame({
    'x1': ['A', 'B', 'C'],
    'x2': [1, 2, 3]
})
b = pd.DataFrame({
    'x4': ['A', 'B', 'D'],
    'x3': [True, False, True]
})
a >> inner_join(b, by=('x1', 'x4'))
It would also be convenient to be able to use multiple by statements. For example, the expression
a >> inner_join(b, by=['x1', ('x2', 'x3')])
could be used as
a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x3'])
If you agree I would modify the code and create a PR.
I'm running across errors when I try to use numpy or math functions (e.g., sqrt, log, etc) inside dfply verbs. Here's a minimal example:
import pandas as pd
from dfply import *
import numpy as np
df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
df >> mutate(y = np.log(X.x))
This gives the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-f8d61ebf2e20> in <module>()
3 df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
4
----> 5 df >> mutate(y = np.log(X.x))
ValueError: invalid __array_struct__
Is this functionality not implemented? Maybe there's a workaround I'm not seeing?
(I'm on python 3.6.3)
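A likely cause, hedged: the NumPy ufunc tries to evaluate the symbolic X eagerly, hence the invalid __array_struct__ error. If your dfply version exposes make_symbolic, wrapping the ufunc defers it until pipe evaluation; the plain-pandas equivalent of the intended mutate is shown below.

```python
import numpy as np
import pandas as pd

# Possible dfply-side workaround (only if make_symbolic is available):
#   log = make_symbolic(np.log)
#   df >> mutate(y=log(X.x))

# Plain-pandas equivalent:
df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
df['y'] = np.log(df['x'])
```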
A common idiom in dplyr is something like
df = df %>% (stuff)...
which can be simplified with
df %<>% (stuff)
I've looked around to see whether there's a way to overload <>, and I'm not sure there is (I think it gets interpreted as __ne__, but then that would also apply to !=).
At any rate, I think this would be very useful (essentially just doing things in place instead of with copies).
It might be worth noting that Hadley has purposely left this out of dplyr, because (I believe) he's somewhat opposed to doing things in place.
I can confirm that in 0.3.3 the issue is still the same.
Originally posted by @steer629 in #61 (comment)
I have a DataFrame for which
hub2['time'] = pd.to_datetime(hub2.timestamp)
works, but when I write
hub2 >> mutate(time=pd.to_datetime(X.timestamp))
I get the error
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "[...]/lib/python2.7/site-packages/pandas/util/decorators.py", line 91, in wrapper
return func(*args, **kwargs)
File "[...]/lib/python2.7/site-packages/pandas/tseries/tools.py", line 419, in to_datetime
elif isinstance(arg, ABCSeries):
File "[...]/lib/python2.7/site-packages/pandas/types/generic.py", line 9, in _check
return getattr(inst, attr, '_typ') in comp
TypeError: __nonzero__ should return bool or int, returned Call
Why is that?
Hi,
I am an avid dplyr user in R and somewhat new to Python. I had been looking for a dplyr-like package in Python for a while when I came across dfply, which looks pretty close to what I was looking for.
Please excuse me if this is not quite the right forum, but I was looking for some help / to request some documentation / to request a feature.
My use case essentially is that I have a function that operates on single elements of data-frame columns, e.g.
my_func(a, b)
where both a and b are single elements from columns of a data frame. I found a Stack Overflow post that shows this for an operation on a single column only:
https://stackoverflow.com/questions/42671168/dfply-mutating-string-column-typeerror
The solution shown there, using X.file.apply for the column X.file in the data frame, seems to work only when you have a single column to operate on.
What I was essentially wondering is: how do you recommend best using dfply in this context? Could you add some documentation on how best to use functions that don't natively understand Series objects?
E.g. could there be an "Intention"-like object that takes a function operating on several parameters, each of which is intended to be a single element from a column, "vectorizes" that function, and then, when passed an Intention object representing a Series, applies it appropriately?
Thanks for your help!
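One way to lift a scalar function of several arguments over columns without special dfply support is np.vectorize (or a row-wise apply). The function and frame below are made-up stand-ins for the use case described above.

```python
import numpy as np
import pandas as pd

# Hypothetical element-wise function of two column values.
def my_func(a, b):
    return a * 10 + b

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Vectorize once, then call it with whole columns.
df['out'] = np.vectorize(my_func)(df['a'], df['b'])

# Inside a pipe, the vectorized callable could then be wrapped with
# make_symbolic, if your dfply version provides it.
```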
Hi,
Please take a look at the following example:
from dfply import *
utime = pd.DataFrame({"u":1,"eventTime":["01-01-1971 01:04:00","01-01-1971 02:07:00","01-01-1971 01:09:00","01-01-1971 01:10:00"]})
print(utime >> arrange(X.eventTime))
utime = utime.set_index("u")
print(utime >> d.arrange(X.eventTime))
In the first case, the result is as expected. After introducing an index, the result is incorrect and contains 4 times as many values as before.
I'm not sure whether this is a bug or expected behavior, as I'm a newbie to pandas and to data frame indices.
output for the code:
eventTime u
0 01-01-1971 01:04:00 1
2 01-01-1971 01:09:00 1
3 01-01-1971 01:10:00 1
1 01-01-1971 02:07:00 1
eventTime
u
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
1 01-01-1971 01:04:00
1 01-01-1971 02:07:00
1 01-01-1971 01:09:00
1 01-01-1971 01:10:00
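A workaround sketch, hedged: the repeated rows look like an index-alignment effect on the non-unique "u" index. Restoring a unique RangeIndex before sorting gives the expected four rows.

```python
import pandas as pd

utime = pd.DataFrame({"u": 1,
                      "eventTime": ["01-01-1971 01:04:00", "01-01-1971 02:07:00",
                                    "01-01-1971 01:09:00", "01-01-1971 01:10:00"]})
utime = utime.set_index("u")

# Demote the non-unique index back to a column before sorting.
out = utime.reset_index().sort_values("eventTime")
```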
Hi,
The latest modifications to the join functions are breaking some usage code.
The offending changes are: bbe03e8...17b3440
The problem is this case:
df >> left_join(df2, by=["column1", "column2", "column3"])
It used to mean "merge based on those 3 columns", but now it means "merge based on column1 for left dataframe, and column2 on the right dataframe", which is quite different!
What's the rationale? Is it because we're now supposed to use tuples and not lists in those cases?
Either way, if we decide this is the way forward, it should be a major version change because it breaks one of the major use cases of dfply. What do you think?
Steps to reproduce:
$ conda create -n volatile-test-dfply -y python=3 jupyter
$ source activate volatile-test-dfply
$ pip install https://github.com/kieferk/dfply/zipball/master
$ jupyter console
>>> from dfply import *
In [1]: from dfply import *
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-1-aa7758fd9eef> in <module>()
----> 1 from dfply import *
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/dfply/__init__.py in <module>()
9 from dfply.summarize import *
10 from dfply.transform import *
---> 11 from dfply.data import *
12 from dfply.summary_functions import *
13 from dfply.window_functions import *
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/dfply/data.py in <module>()
3
4 root = os.path.abspath(os.path.dirname(__file__))
----> 5 diamonds = pd.read_csv(os.path.join(root, '..', 'data', "diamonds.csv"))
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
643 skip_blank_lines=skip_blank_lines)
644
--> 645 return _read(filepath_or_buffer, kwds)
646
647 parser_f.__name__ = name
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
386
387 # Create the parser.
--> 388 parser = TextFileReader(filepath_or_buffer, **kwds)
389
390 if (nrows is not None) and (chunksize is not None):
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
727 self.options['has_index_names'] = kwds['has_index_names']
728
--> 729 self._make_engine(self.engine)
730
731 def close(self):
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
920 def _make_engine(self, engine='c'):
921 if engine == 'c':
--> 922 self._engine = CParserWrapper(self.f, **self.options)
923 else:
924 if engine == 'python':
/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1387 kwds['allow_leading_cols'] = self.index_col is not False
1388
-> 1389 self._reader = _parser.TextReader(src, **kwds)
1390
1391 # XXX
pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4019)()
pandas/parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:7967)()
FileNotFoundError: File b'/Users/elyase/miniconda3/envs/volatile-test-dfply/lib/python3.5/site-packages/dfply/../data/diamonds.csv' does not exist
With my simple code
resultstatsDF >> mask(X.method == 'MICE')
I get the error
Traceback (most recent call last):
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/code.py", line 91, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 112, in __call__
return self.function(*args, **kwargs)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 179, in __call__
evaluation = self.call_action(args, kwargs)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/base.py", line 253, in call_action
return symbolic.to_callable(symbolic_function)(args[0])
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas_ply/symbolic.py", line 204, in <lambda>
return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas_ply/symbolic.py", line 142, in _eval
result = evaled_func(*evaled_args, **evaled_kwargs)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/dfply/subset.py", line 56, in mask
return df[mask.values]
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas/core/frame.py", line 2053, in __getitem__
return self._getitem_array(key)
File "/Development/Anaconda-Python-Distribution/anaconda2/envs/myenv/lib/python3.5/site-packages/pandas/core/frame.py", line 2090, in _getitem_array
(len(key), len(self.index)))
ValueError: Item wrong length 114 instead of 60.
where resultstatsDF.shape is (60, 10). What am I to do?
Could it be that it has something to do with the following?
resultstatsDF.index
Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 2, 3, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5],
dtype='int64')
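The repeated index values are the likely culprit, hedged: with non-unique labels, the boolean-mask alignment inside the pipe can expand rows (114 > 60). Giving the frame a unique RangeIndex avoids the mismatch; a tiny stand-in:

```python
import pandas as pd

# Duplicate index labels, as in resultstatsDF.index above.
df = pd.DataFrame({'method': ['MICE', 'other', 'MICE']}, index=[0, 0, 1])

# Workaround: reset to a unique RangeIndex before masking.
flat = df.reset_index(drop=True)
# dfply form: flat >> mask(X.method == 'MICE')
out = flat[flat.method == 'MICE']
```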
A missing group_by column does not raise an exception.
In [15]: pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]}) >> group_by('c')
Out[15]:
a b
0 1 4
1 2 5
2 2 6
In dplyr, the equivalent situation raises an error:
library(tidyverse)
> data.frame(a=c(1,2,2), b=c(4,5,6)) %>% group_by(c)
Error in grouped_df_impl(data, unname(vars), drop) :
Column `c` is unknown
My preference would be for this to raise an exception instead of passing silently.
When doing groupby / summarise actions, the following warning occurs:
/opt/anaconda/envs/Python3/lib/python3.6/site-packages/dfply/base.py:137: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
other_copy._grouped_by = getattr(other, '_grouped_by', None)
An example:
from dfply import *
diamonds >> group_by('carat', 'cut') >> summarize(price=X.price.mean())
I don't know if this is fixable, but it would be nice to get rid of the warning!
For some reason, I have a very complicated Excel file being read into pandas, and some column names contain blanks or hyphens, e.g. "Commercial Project-ID".
target_data_frame= orginal_data_frame >> mask(X.['Commercial Project-ID']==0)
or
target_data_frame= orginal_data_frame >> mask(X.'Commercial Project-ID'==0)
or
target_data_frame= orginal_data_frame >> mask(X.Commercial Project-ID==0)
None of these work.
Of course I can rename the columns before filtering, but I am wondering if there is a better way to do this.
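For what it's worth, attribute access (X.name) can never express a column name containing spaces or hyphens, but item access can, and the X["..."] form does appear elsewhere in dfply usage. A plain-pandas sketch of the same filter (frame and values invented for illustration):

```python
import pandas as pd

original_data_frame = pd.DataFrame({'Commercial Project-ID': [0, 1, 0],
                                    'value': [10, 20, 30]})

# Bracket indexing reaches columns whose names contain spaces or hyphens:
target_data_frame = original_data_frame[
    original_data_frame['Commercial Project-ID'] == 0
]

# The dfply analogue would be bracket indexing on the symbolic X:
# target_data_frame = original_data_frame >> mask(X['Commercial Project-ID'] == 0)
```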
Below is my test data:
In[43]: payment >> head(5)
Out[43]:
date user_name game_id channel \
11165 2016-08-24 06:36:28 000000000000o myfish FB_IS_MA_AG2535_GP
0 2016-08-02 10:14:31 00000025 myfish google-play
8 2016-08-02 13:18:19 00000027 myfish Fanpage_Dailypost_APK
10921 2016-08-23 19:48:21 00000030 myfish in_app
11980 2016-08-25 11:25:29 00000030 myfish in_app
money
11165 3000.0
0 1000.0
8 3000.0
10921 3000.0
11980 3000.0
When I try to groupby:
In[45]: payment >> head(5) >> groupby(X.user_name)
Traceback (most recent call last):
File "C:\Program Files\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-45-cf2fbef85582>", line 1, in <module>
payment >> head(5) >> groupby(X.user_name)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 45, in __rrshift__
result = self.function(other_copy)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 52, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 179, in __call__
evaluation = self.call_action(args, kwargs)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 285, in call_action
return symbolic.to_callable(symbolic_function)(self.df)
File "C:\Program Files\Anaconda2\lib\site-packages\pandas_ply\symbolic.py", line 204, in <lambda>
return lambda *args, **kwargs: obj._eval(dict(enumerate(args), **kwargs))
File "C:\Program Files\Anaconda2\lib\site-packages\pandas_ply\symbolic.py", line 142, in _eval
result = evaled_func(*evaled_args, **evaled_kwargs)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 357, in wrapped
return f(*flat_args, **kwargs)
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 456, in wrapped
for arg in args[1:]]
File "C:\Program Files\Anaconda2\lib\site-packages\dfply\base.py", line 437, in _col_ind_to_label
raise Exception("Label not of type str or int.")
Exception: Label not of type str or int.
My data type:
In[46]: payment.dtypes
Out[46]:
date datetime64[ns]
user_name object
game_id object
channel object
money float64
dtype: object
The data was read from a database using sqlalchemy, and user_name is stored as varchar. I rechecked with the diamonds data using the same command; it works for diamonds, and I could not figure out why.
How could I fix the problem?
Kind regards.
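Judging from the exception ("Label not of type str or int"), this dfply version expects column labels rather than symbolic X expressions, so `payment >> head(5) >> groupby('user_name')` should be the working form. A plain-pandas sketch of the same grouping, on invented sample data:

```python
import pandas as pd

payment = pd.DataFrame({
    'user_name': ['00000025', '00000027', '00000030', '00000030'],
    'money': [1000.0, 3000.0, 3000.0, 3000.0],
})

# Grouping by the column label (a plain str) is what the error message asks for:
totals = payment.groupby('user_name')['money'].sum()
```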
Hello all.
I understand this isn't the most bristling-with-activity project, but -- coming at it from the perspective of a Python tinkerer who's fallen in love with R's tidyverse -- this is really clever stuff. I've been reading through the source code and it's clear, pretty, and concise.
It is, however, deeply stitched up to Pandas. One of the neat things about dplyr / the tidyverse in general is that it works (to a greater or lesser degree!) with other sources like DBI connections or Spark. Further, it allows the programmer to bring an almost declarative paradigm where it might be warranted to tasks unrelated to data wrangling. I think some Pythonistas would frown deeply at the suggestion, but something like dfply's pipes with an FP library like toolz could be really nice from my perspective.
All that said: would you be amenable to my taking on the task of trying to generalize dfply a little? Obviously the tidyverse would be too huge an undertaking, but I could see making the pipe code more general or in the medium term something like a proof-of-concept SQL generator or SQLalchemy backend.
If not, totally fine! Though I may fork and go at it on my lonesome.
Hi, thank you for all the good work here, I like this the best of the dplyr clones.
In R I am able to do something like,
df %>% mutate(newcol = ifelse(x > 3 & lead(y) < 2, 'yes', 'no'))
In Python it seems that I should be using the numpy.where function. I also read enough of your documentation to realize I need to wrap this function in another function with the @make_symbolic decorator. So, I have this:
@make_symbolic
def np_where(bools, val_if_true, val_if_false):
return list(np.where(bools, val_if_true, val_if_false))
When I call it like this, it works just fine:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F', 'Punct', 'Not Punct'))
However if I want to make my expression to evaluate to True or False more complex with ands or ors, I get an error:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' & X.CPOS == 'F', 'Punct', 'Not Punct'))
also tried with:
df >>= mutate(my_val = np_where(lead(X.CPOS) == 'F' and X.CPOS == 'F', 'Punct', 'Not Punct'))
I get this error:
TypeError: index returned non-int (type Intention)
I thought that my @make_symbolic decorator took care of this kind of thing. Perhaps I need a logical and that also has the delaying decorator.
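The likely culprit is operator precedence rather than the decorator: in Python, & binds more tightly than ==, so the expression parses as `lead(X.CPOS) == ('F' & (X.CPOS == 'F'))`, and the `and` keyword cannot be overloaded at all, so it forces boolean conversion of the symbolic object. Parenthesizing each comparison should fix it. A plain-pandas sketch, using shift(-1) as a stand-in for dfply's lead():

```python
import numpy as np
import pandas as pd

cpos = pd.Series(['F', 'F', 'N', 'F'])
lead_cpos = cpos.shift(-1)  # stand-in for dfply's lead()

# Parenthesize each comparison before combining with the bitwise & operator:
bools = (lead_cpos == 'F') & (cpos == 'F')
my_val = np.where(bools, 'Punct', 'Not Punct')
```

In dfply, the equivalently parenthesized form `(lead(X.CPOS) == 'F') & (X.CPOS == 'F')` should evaluate without the TypeError.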
Passing through docstrings now works great when using the @pipe
decorator, but not when you use the @dfpipe
decorator (which the dfply functions get defined through).
The problem is that when using @dfpipe
, the function passed to pipe
is actually group_delegation
:
def dfpipe(f):
return pipe(
group_delegation(
symbolic_evaluation(f)
)
)
I think for this to truly work you'd need to pass the docstring through all three of these functions. Then mutate
and other dfply
functions would have the proper docstrings.
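A minimal sketch of the fix, using functools.wraps to carry the docstring through each wrapping layer (the pipe and group_delegation stand-ins below are toy closures for illustration, not dfply's real decorator classes):

```python
import functools
import pandas as pd

def pipe(f):
    @functools.wraps(f)  # copies __doc__ and __name__ onto the wrapper
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

def group_delegation(f):
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

def dfpipe(f):
    return pipe(group_delegation(f))

@dfpipe
def mutate(df, **kwargs):
    """Add new columns to the DataFrame."""
    return df.assign(**kwargs)
```

Since dfply's actual decorators are classes rather than closures, the equivalent change there would be forwarding the docstring in each `__init__` (e.g. `self.__doc__ = getattr(function, '__doc__', None)`).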
Consider the following example:
diamonds >> mutate(rank=min_rank(X.carat)) >> filter_by(X.rank <10)
This fails with
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 142, in __rrshift__
result = self.function(other_copy)
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 149, in <lambda>
return pipe(lambda x: self.function(x, *args, **kwargs))
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 329, in __call__
return self.function(*args, **kwargs)
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/base.py", line 282, in __call__
return self.function(df, *args, **kwargs)
File "/Users/brandl/anaconda3/envs/scikit_playground/lib/python3.6/site-packages/dfply/subset.py", line 62, in mask
if arg.dtype != bool:
AttributeError: 'NotImplementedType' object has no attribute 'dtype'
but the expression seems legitimate to me.
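A plausible cause: X.rank resolves via attribute access, and DataFrame.rank is a built-in pandas method, so the symbolic expression evaluates against the bound method rather than the newly created column; comparing a method with 10 yields NotImplemented, which then has no dtype. If so, bracket indexing (`filter_by(X['rank'] < 10)`) should avoid the shadowing. The plain-pandas version of the same shadowing:

```python
import pandas as pd

df = pd.DataFrame({'carat': [0.5, 0.3, 0.9]})
df = df.assign(rank=df['carat'].rank(method='min'))

# Attribute access is shadowed by the DataFrame.rank method...
assert callable(df.rank)
# ...while bracket indexing always reaches the column:
top = df[df['rank'] < 2]
```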
On my development machine I installed dfply version 0.2.4. Now I installed dfply with pip install dfply
on the server and suddenly I got NameError: name 'groupby' is not defined
. It took me ages to realise it's not my code that is wrong, but that you had made a new release (which was such a surprise because it had been a while). It seems that your groupby
is now group_by
... maybe you can write a migration tutorial? Tensorflow did it reasonably well.
I am trying to extract the date portion from a datetime by mutating it into a substring, but the resulting dataframe still prints the whole datetime field:
df >>= mask(X["Item Type"] != 'Application') >> mutate(transaction_date = X["Transaction Time"][0:10])
Resulting df.transaction_date prints full string, ex: 2019-02-10 19:21:45 PST
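Slicing a Series with [0:10] selects the first ten rows, not the first ten characters of each value; per-element string slicing goes through the .str accessor (or .dt.date for true datetime columns). A plain-pandas sketch with invented sample values:

```python
import pandas as pd

df = pd.DataFrame({'Transaction Time': ['2019-02-10 19:21:45 PST',
                                        '2019-03-01 08:02:10 PST']})

# Series[0:10] slices ROWS; .str[:10] slices each string element:
df = df.assign(transaction_date=df['Transaction Time'].str[:10])
```

The dfply analogue should be `mutate(transaction_date=X["Transaction Time"].str[:10])`, since the symbolic X proxies attribute and method access.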
Mask doesn't seem to work correctly on grouped dataframes. Consider the three operations below.
diamonds >> mask(X.cut == 'Ideal') >> groupby(X.color) # works as expected
diamonds >> groupby(X.color) >> mask(X.cut == 'Ideal') # doesn't work correctly (strange behavior)
diamonds >> mask(X.cut == 'Ideal') # works as expected
In the first example, the mask is applied before the grouping, so it behaves as expected.
In the second example, the grouping is applied before the mask. The returned dataframe includes cases where X.cut != 'Ideal', and returns about 2500 rows. I'm not sure what determines which rows are returned.
In the third example, there is no grouping, and the data behaves as expected (it returns the same result as the first example).
There are use cases where you might want to use grouping and mask together; for example, to return the min for each group, you might want to do something like:
diamonds >> groupby(X.color) >> mask(X.x == X.x.min()) (1)
diamonds >> mask(X.x == X.x.min()) (2)
For (1), it returns a single row, where x is not a min for any group, but (2) behaves as expected.
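Until the grouped mask behaves correctly, the per-group minimum that example (1) intends can be obtained in plain pandas with groupby().transform (tiny invented data for illustration):

```python
import pandas as pd

diamonds = pd.DataFrame({'color': ['E', 'E', 'I', 'I'],
                         'x': [3.9, 4.1, 3.8, 4.4]})

# transform('min') broadcasts each group's minimum back onto the original rows,
# so the comparison keeps exactly the min row(s) of every group:
group_min = diamonds.groupby('color')['x'].transform('min')
mins = diamonds[diamonds['x'] == group_min]
```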
Hello, in your README examples it would be very useful to show how to mask on null / None, and on not-null.
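Since the symbolic X defers pandas method calls, the pandas isnull/notnull predicates are the natural candidates; the plain-pandas forms below run as written, with the presumed dfply analogues shown in comments:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

nulls = df[df['a'].isnull()]        # dfply analogue: df >> mask(X.a.isnull())
not_nulls = df[df['a'].notnull()]   # dfply analogue: df >> mask(X.a.notnull())
```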
Hello users,
I have been working on and off on an upcoming version which will be v1.0.0 due to its incompatibility with previous versions. You can actually view this nearly-complete version in the "feature/collapsed-selection" branch.
Originally I was just working on getting the selection helper functions working, but in order to do that a lot had to change in the base decorators. The selection helper functions (such as contains("ca"), which finds columns containing that string when passed as an argument to the select function) now work. Previously, a variety of different decorators would be stacked together to get different kinds of behavior. In the new conceptualization, the only decorator will be @dfpipe, and it will take keyword arguments that can change its behavior (it can also be used without keyword arguments, in which case it will behave as the current @dfpipe decorator does now).
If you're interested in checking it out and have any questions/comments/concerns, please go ahead. I don't have a timetable for its release but considering it's nearing completion and currently passes all the written unit tests, I don't expect it will be much longer.
Greetings -
I noticed that dfply is missing the dplyr equivalent of matches()
, which allows the user to pass in a regular expression when selecting columns (as opposed to a literal string - i.e., the contains()
function). I've found matches()
to be very useful in the past and would like to know if you're amenable to it being added to dfply? If so, I'd be happy to put together a PR.
Chris
Pandas is constantly throwing warnings like this and it makes the group_by
unusable in my case, since it swamps all other console output.
.../python3.6/site-packages/dfply/base.py:307: FutureWarning: Interpreting tuple 'by' as a list of keys, rather than a single key. Use 'by=[...]' instead of 'by=(...)'. In the future, a tuple will always mean a single key.
When Pandas makes this change it will break the group_by
so it's probably worth fixing now.
I think the way around this is to wrap the RHS with a list(args)
here.
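The suggested fix amounts to passing an explicit list, rather than the *args tuple, when delegating to pandas:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 4], 'v': [10, 20, 30]})

keys = ('a', 'b')                  # group keys arrive as a tuple via *args
grouped = df.groupby(list(keys))   # an explicit list avoids the tuple-'by' ambiguity
```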
should work as:
df >>= mutate(nonans=fillna(X.col_with_nans, value_to_replace))
There's a universal (and justified) dislike in the Python community for * imports.
Now I admit that dfply (great work btw) is a pain without it.
But, it currently has a bunch of things in the user-importable namespace
that we could possibly clean up.
A quick accounting of the 129 exports from dfply by type:
| type | count |
|---|---|
| <class 'dfply.base.Intention'> | 1 |
| <class 'dict'> | 1 |
| <class 'NoneType'> | 1 |
| <class '_frozen_importlib_external.SourceFileL... | 1 |
| <class 'list'> | 1 |
| <class '_frozen_importlib.ModuleSpec'> | 1 |
| <class 'pandas.core.frame.DataFrame'> | 1 |
| <class 'type'> | 5 |
| <class 'str'> | 6 |
| <class 'module'> | 16 |
| <class 'dfply.base.pipe'> | 40 |
| <class 'function'> | 55 |
where presumably only the dfply.base.pipe and a subset of the functions are 'verbs'.
My suggestion would be to introduce additional namespaces
and update the examples to use
from dfply.verbs import *
instead of from dfply import *
This way we would
a) not break anyone's code and
b) have a clean, 'non polluting' module that users can import.
What do you think?