Coder Social home page Coder Social logo

Comments (18)

ericmjl avatar ericmjl commented on June 9, 2024 2

All done, @samukweku!

from pyjanitor.

ericmjl avatar ericmjl commented on June 9, 2024 1

Ok to do so! I trust your judgment on whether to do a patch or minor πŸ˜„.

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024 1

Likely be a minor release

from pyjanitor.

dugarte-vox avatar dugarte-vox commented on June 9, 2024 1

What pandas version are you using @dugarte-vox ?

Sorry for the late response. I'm currently using pandas 2.2.1.
I've tried with a lower version of pandas (2.0.3) as you suggested and the example is working.

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024 1

We haven't migrated to pyproject.yaml yet, need your infra expertise on the PR

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024 1

@dugarte-vox a new version of pyjanitor has been released. It should be fine on pandas > 2. Test and let's know if there are any issues. By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

from pyjanitor.

dugarte-vox avatar dugarte-vox commented on June 9, 2024 1

Thanks for the help!

By the way, out of curiosity, what is your use case for conditional_join that regular pandas could not solve?

Basically, I have a function that completes missing dates on a DataFrame with groups. This is an extract of that code:

import pandas as pd
import janitor
import random

df = pd.DataFrame({
    'index1': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C'],
    'index2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1],
    'date': pd.to_datetime([
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2024-01-01', '2024-01-10', '2024-04-01',
        '2021-10-01', '2021-11-01'
    ])
})
df['value'] = [random.randrange(1, 50, 1) for _ in range(df.shape[0])]

# Create min/max dates per group
df_group_dates = df.groupby(['index1', 'index2']).agg({'date': ['min', 'max']})
df_group_dates.columns = ["_".join(col) for col in df_group_dates.columns]
df_group_dates = df_group_dates.reset_index()

# Create a combination between all groups and all possible dates
date_min = df['date'].min()
date_max = df['date'].max()
date_range = pd.date_range(start=date_min, end=date_max, freq='D')
df_dates = pd.merge(
    df[['index1', 'index2']].drop_duplicates(),
    pd.DataFrame({'date': date_range}),
    how='cross'
)

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

# Merge with original DataFrame
df = pd.merge(df, df_dates, on=['date', 'index1', 'index2'], how='right')

As you can see when I merge df_dates with df_group_dates I create a bigger DataFrame that is later filtered. This sometimes causes a memory error when the DataFrame is big.

Now, I can replace:

# Join combinations with min/max dates per group
df_dates = pd.merge(
    df_dates, 
    df_group_dates[['index1', 'index2', 'date_min', 'date_max']],
    how='left',
    on=['index1', 'index2']
)
# Filter out date values outside the min/max range
df_dates = df_dates.loc[
    (df_dates['date'] >= df_dates['date_min']) &                      
    (df_dates['date'] <= df_dates['date_max']),
    ['date', 'index1', 'index2']
]

with:

# Join combinations with min/max date ranges per group 
df_dates = df_dates.conditional_join(
    df_group_dates,         
    ('index1', 'index1', '=='),
    ('index2', 'index2', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
    how = 'inner'
).drop([('right', col) for col in ['index1', 'index2', 'date_min', 'date_max']], axis=1).droplevel(0, axis=1)

I haven't done any test to see if this is more efficient but, at least the code is smaller and more readable.

from pyjanitor.

dugarte-vox avatar dugarte-vox commented on June 9, 2024

I tried joining both columns into an auxiliary one and I still get the same error.

df1['index_aux'] = (df1['index1']+'-'+df1['index2'].astype(str)).astype(str).copy()
df2['index_aux'] = (df2['index1']+'-'+df2['index2'].astype(str)).astype(str).copy()

df1.conditional_join(
    df2,         
    ('index_aux', 'index_aux', '=='),
    ('date', 'date_min', '>='), 
    ('date', 'date_max', '<='), 
)

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

Apologies for the late reply @dugarte-vox I'll have a look at this

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

What pandas version are you using @dugarte-vox ?

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

@dugarte-vox I believe this issue pops up when using pandas version >= 2.2, and has been fixed in pyjanitor. A new release will be out soon ,with the fix. In the mean time see if you can pandas version < 2.2.

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

@ericmjl ok to do a release?

from pyjanitor.

ericmjl avatar ericmjl commented on June 9, 2024

@samukweku I think we should increase the cadence of releases; each PR merge can probably be considered a candidate for a new release, I think. Been trialling that with llamabot and I think it’ll be ok.

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

Llamabot! Can't wait to C it in action

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

@ericmjl any suggestions on this failure? https://github.com/pyjanitor-devs/pyjanitor/actions/runs/8342259392

from pyjanitor.

ericmjl avatar ericmjl commented on June 9, 2024

Yes, I think I know what's happening.

Could you do a hot fix push to main? It would be to change this line in GH Actions:

python setup.py sdist bdist_wheel

To this:

pip install -U build && python -m build -w -s

I added the pip install of build just in case. We should be using pyproject.toml now, is that right?

If not, the hotfix should be to add, one line above the setup.py line:

pip install -U setuptools

And that should do the trick.

If I were in your shoes, I would push to main directly, since this change is infra-related.

from pyjanitor.

ericmjl avatar ericmjl commented on June 9, 2024

No problem! I will take a look at it.

from pyjanitor.

samukweku avatar samukweku commented on June 9, 2024

great response @dugarte-vox. It seems though that complete may fit in here for you - it makes missing rows explicit:

# pip install pyjanitor
(df
.complete(
    {'date':lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')},
    by=['index1','index2'])
)

The by parameter ensures the columns are completed per group. Under the hood it is just a for loop on the groups, which may help with memory issues, but may not be great performance wise. it is on my todo list to improve the performance for groups, i just havent had time to think it through, and really i wasnt sure anybody was using the function.

Another option, using complete, which may be performant, is to use a variant of your solution, where you build a large dataframe and post filter (if memory allows):

grp = df.groupby(['index1','index2'],sort=False)
(df
.assign(
    date_min=grp.date.transform('min'),
    date_max=grp.date.transform('max'))
.complete(
    ('index1','index2','date_min','date_max'),
    {'date':lambda f: pd.date_range(f.date.min(), f.date.max(), freq='D')})
.query('date_min<=date<=date_max')
)

appreciate the feedback, and if you have suggestions on how to make the functions better, the dev team are glad and always eager to get feedback.

from pyjanitor.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.