Brief Deion I don't know if I'm doing this correctly but, I

All done, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercar

What pandas version are you using <a class="user-mention notranslate" dat

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Apologies for the late reply <a class="user-mention notranslate" data-hovercard-type="

What pandas version are you using <a class="user-mention notranslate" data-hovercard-t

Problems with equalities in contional_join about pyjanitor HOT 18 CLOSED

dugarte-vox commented on June 9, 2024

Problems with equalities in contional_join

from pyjanitor.

Comments (18)

ericmjl commented on June 9, 2024 2

All done, @samukweku!

from pyjanitor.

ericmjl commented on June 9, 2024 1

Ok to do so! I trust your judgment on whether to do a patch or minor 😄.

from pyjanitor.

samukweku commented on June 9, 2024 1

Likely be a minor release

from pyjanitor.

dugarte-vox commented on June 9, 2024 1

What pandas version are you using @dugarte-vox ?

Sorry for the late response. I'm currently using pandas 2.2.1.
I've tried with a lower version of pandas (2.0.3) as you suggested and the example is working.

from pyjanitor.

samukweku commented on June 9, 2024 1

We haven't migrated to pyproject.yaml yet, need your infra expertise on the PR

from pyjanitor.

samukweku commented on June 9, 2024 1

@dugarte-vox a new version of pyjanitor has been released. It should be fine on pandas > 2. Test and let's know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

from pyjanitor.

dugarte-vox commented on June 9, 2024 1

Thanks for the help!

By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:

import pandas as pd
import janitor
import random

df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]

# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()

# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
    df[['index1', 'index2']].drop_duplicates(),
    pd.DataFrame({'date': date_range}),
    how='cross'
)

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')

As you can see when I merge df_dates with df_group_dates I create a bigger DataFrame that is later filtered. This sometimes causes a memory error when the DataFrame is big.

Now, I can replace:

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

with:

# Join combinations with min/max date ranges per group 
df_dates = df_dates.conditional_join(
    df_group_dates,         
    ('index1', 'index1', '=='),
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
    how = 'inner'
).drop([('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']], axis=1).droplevel(0, axis=1)

I haven't done any test to see if this is more efficient but, at least the code is smaller and more readable.

from pyjanitor.

dugarte-vox commented on June 9, 2024

I tried joining both columns into an auxiliary one and I still get the same error.

df1['index_aux'] = (df1['index1']+'-'+df1['index2'].astype(str)).astype(str).copy()
df2['index_aux'] = (df2['index1']+'-'+df2['index2'].astype(str)).astype(str).copy()

df1.conditional_join(
    df2,         
    ('index_aux', 'index_aux', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
)

from pyjanitor.

samukweku commented on June 9, 2024

Apologies for the late reply @dugarte-vox I'll have a look at this

from pyjanitor.

samukweku commented on June 9, 2024

What pandas version are you using @dugarte-vox ?

from pyjanitor.

samukweku commented on June 9, 2024

@dugarte-vox I believe this issue pops up when using pandas version >= 2.2, and has been fixed in pyjanitor. A new release will be out soon ,with the fix. In the mean time see if you can pandas version < 2.2.

from pyjanitor.

samukweku commented on June 9, 2024

@ericmjl ok to do a release?

from pyjanitor.

ericmjl commented on June 9, 2024

@samukweku I think we should increase the cadence of releases; each PR merge can probably be considered a candidate for a new release, I think. Been trialling that with llamabot and I think it’ll be ok.

from pyjanitor.

samukweku commented on June 9, 2024

Llamabot! Can't wait to C it in action

from pyjanitor.

samukweku commented on June 9, 2024

@ericmjl any suggestions on this failure? https://github.com/pyjanitor-devs/pyjanitor/actions/runs/8342259392

from pyjanitor.

ericmjl commented on June 9, 2024

Yes, I think I know what's happening.

Could you do a hot fix push to main? It would be to change this line in GH Actions:

python setup.py sdist bdist_wheel

To this:

pip install -U build && python -m build -w -s

I added the pip install of build just in case. We should be using pyproject.toml now, is that right?

If not, the hotfix should be to add, one line above the setup.py line:

pip install -U setuptools

And that should do the trick.

If I were in your shoes, I would push to main directly, since this change is infra-related.

from pyjanitor.

ericmjl commented on June 9, 2024

No problem! I will take a look at it.

from pyjanitor.

samukweku commented on June 9, 2024

great response @dugarte-vox. It seems though that complete may fit in here for you - it makes missing rows explicit:

# pip install pyjanitor
(df
.complete(
    {'date':lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')},
    by=['index1','index2'])
)

The by parameter ensures the columns are completed per group. Under the hood it is just a for loop on the groups, which may help with memory issues, but may not be great performance wise. it is on my todo list to improve the performance for groups, i just havent had time to think it through, and really i wasnt sure anybody was using the function.

Another option, using complete, which may be performant, is to use a variant of your solution, where you build a large dataframe and post filter (if memory allows):

grp = df.groupby(['index1','index2'],sort=False)
(df
.assign(
    date_min=grp.date.transform('min'),
    date_max=grp.date.transform('max'))
.complete(
    ('index1','index2','date_min','date_max'),
    {'date':lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')})
.query('date_min<=date<=date_max')
)

appreciate the feedback, and if you have suggestions on how to make the functions better, the dev team are glad and always eager to get feedback.

from pyjanitor.

Problems with equalities in contional_join about pyjanitor HOT 18 CLOSED

Comments (18)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent