Comments (18)
All done, @samukweku!
from pyjanitor.
Ok to do so! I trust your judgment on whether to do a patch or a minor release.
from pyjanitor.
It will likely be a minor release.
from pyjanitor.
What pandas version are you using @dugarte-vox ?
Sorry for the late response. I'm currently using pandas 2.2.1.
I've tried with a lower version of pandas (2.0.3) as you suggested and the example is working.
from pyjanitor.
We haven't migrated to pyproject.toml yet; I need your infra expertise on the PR.
from pyjanitor.
@dugarte-vox, a new version of pyjanitor has been released. It should work fine on pandas > 2. Please test it and let us know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?
from pyjanitor.
Thanks for the help!
By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?
Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:
import pandas as pd
import janitor
import random
df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]
# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()
# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
    df[['index1', 'index2']].drop_duplicates(),
    pd.DataFrame({'date': date_range}),
    how='cross'
)
# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates,
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]
# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')
As you can see, when I merge df_dates with df_group_dates I create a much bigger DataFrame that is only filtered afterwards. This sometimes causes a memory error when the DataFrame is big.
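To get a rough feel for how much larger the intermediate DataFrame is than the filtered result, the row counts can be worked out without building either one. This is only a sketch on the toy data from the example above (the value column is left out because only sizes matter); the names cross_rows and kept_rows are made up for illustration.

```python
import pandas as pd

# Same group/date layout as the example above
df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01',
    ]),
})

# Rows in the cross join: every group paired with every day in the global range
n_groups = df[['index1', 'index2']].drop_duplicates().shape[0]
n_days = (df['date'].max() - df['date'].min()).days + 1
cross_rows = n_groups * n_days

# Rows that survive the filter: each group's own min..max span, inclusive
spans = df.groupby(['index1', 'index2'])['date'].agg(['min', 'max'])
kept_rows = int(((spans['max'] - spans['min']).dt.days + 1).sum())

print(cross_rows, kept_rows)
```

On this toy data the cross join holds 3656 rows while only 308 survive the filter, so most of the intermediate rows are thrown away. The gap grows with the number of groups and with how far apart the groups' date ranges are.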
Now, I can replace:
# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates,
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]
with:
# Join combinations with min/max date ranges per group
df_dates = (
    df_dates.conditional_join(
        df_group_dates,
        ('index1', 'index1', '=='),
        ('index2', 'index2', '=='),
        ('date', 'date_min', '>='),
        ('date', 'date_max', '<='),
        how='inner'
    )
    .drop([('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']], axis=1)
    .droplevel(0, axis=1)
)
I haven't run any tests to see whether this is more efficient, but at least the code is shorter and more readable.
from pyjanitor.
I tried joining both columns into an auxiliary one and I still get the same error.
df1['index_aux'] = (df1['index1']+'-'+df1['index2'].astype(str)).astype(str).copy()
df2['index_aux'] = (df2['index1']+'-'+df2['index2'].astype(str)).astype(str).copy()
df1.conditional_join(
    df2,
    ('index_aux', 'index_aux', '=='),
    ('date', 'date_min', '>='),
    ('date', 'date_max', '<='),
)
from pyjanitor.
Apologies for the late reply, @dugarte-vox. I'll have a look at this.
from pyjanitor.
What pandas version are you using @dugarte-vox ?
from pyjanitor.
@dugarte-vox I believe this issue pops up when using pandas version >= 2.2, and it has been fixed in pyjanitor. A new release with the fix will be out soon. In the meantime, see if you can use a pandas version < 2.2.
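A quick way to check whether an environment falls in that range is to compare the installed pandas version against 2.2. The snippet below is only a sketch: the helper major_minor is made up for illustration, and the 2.2 threshold is taken from the comment above rather than from any pyjanitor changelog.

```python
import pandas as pd


def major_minor(version):
    """Return (major, minor) as ints from a version string such as '2.2.1'."""
    major, minor = version.split('.')[:2]
    return int(major), int(minor)


# Assumption from the comment above: pandas >= 2.2 triggers the error
# on pyjanitor releases that predate the fix.
affected = major_minor(pd.__version__) >= (2, 2)
print(pd.__version__, 'affected' if affected else 'not affected')
```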
from pyjanitor.
@ericmjl ok to do a release?
from pyjanitor.
@samukweku I think we should increase the cadence of releases; each PR merge can probably be considered a candidate for a new release, I think. Been trialling that with llamabot and I think it'll be ok.
from pyjanitor.
Llamabot! Can't wait to C it in action
from pyjanitor.
@ericmjl any suggestions on this failure? https://github.com/pyjanitor-devs/pyjanitor/actions/runs/8342259392
from pyjanitor.
Yes, I think I know what's happening.
Could you do a hot fix push to main? It would be to change this line in GH Actions:
python setup.py sdist bdist_wheel
To this:
pip install -U build && python -m build -w -s
I added the pip install of build just in case. We should be using pyproject.toml now, is that right?
If not, the hotfix should be to add, one line above the setup.py line:
pip install -U setuptools
And that should do the trick.
If I were in your shoes, I would push to main directly, since this change is infra-related.
from pyjanitor.
No problem! I will take a look at it.
from pyjanitor.
Great response, @dugarte-vox. It seems, though, that complete may be a good fit for you here, since it makes missing rows explicit:
# pip install pyjanitor
(df
    .complete(
        {'date': lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')},
        by=['index1', 'index2'])
)
The by parameter ensures the columns are completed per group. Under the hood it is just a for loop on the groups, which may help with memory issues but may not be great performance-wise. It is on my to-do list to improve the performance for groups; I just haven't had time to think it through, and I really wasn't sure anybody was using the function.
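To illustrate the per-group loop idea in plain pandas, here is a minimal sketch. It is not pyjanitor's implementation; the function complete_dates_by_group and its parameters are made up for this example, and it assumes by is a list of two or more column names.

```python
import pandas as pd


def complete_dates_by_group(df, by, date_col='date', freq='D'):
    """Add the missing dates between each group's min and max date."""
    pieces = []
    for keys, group in df.groupby(by, sort=False):
        full = pd.date_range(group[date_col].min(), group[date_col].max(), freq=freq)
        # Left-merge the group's rows onto its full date range; gaps become NaN
        piece = pd.DataFrame({date_col: full}).merge(group, on=date_col, how='left')
        # Restore the group keys on the rows that were just added
        for col, key in zip(by, keys):
            piece[col] = key
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)


df = pd.DataFrame({
    'index1': ['A', 'A', 'C', 'C'],
    'index2': [1, 1, 1, 1],
    'date': pd.to_datetime(['2024-01-01', '2024-01-04', '2021-10-01', '2021-10-02']),
    'value': [10, 20, 30, 40],
})
completed = complete_dates_by_group(df, by=['index1', 'index2'])
print(completed)
```

Only one group's date range is held in memory at a time before the final concat, which is why this style tends to be lighter on memory than a cross join, at the cost of a Python-level loop over the groups.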
Another option using complete, which may be more performant, is a variant of your solution, where you build a large dataframe and filter it afterwards (if memory allows):
grp = df.groupby(['index1', 'index2'], sort=False)
(df
    .assign(
        date_min=grp.date.transform('min'),
        date_max=grp.date.transform('max'))
    .complete(
        ('index1', 'index2', 'date_min', 'date_max'),
        {'date': lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')})
    .query('date_min<=date<=date_max')
)
I appreciate the feedback. If you have suggestions on how to make the functions better, the dev team is always glad to hear them.
from pyjanitor.