Comments (5)
Aloha @eli-s-goldberg! 😸
Thanks for pinging in with the feature request. Love it - it means we've got users who are engaged!
Before I go on and try an implementation, I have a few questions. "Imputation" can be a bit nebulous - would you be open to providing some details?
One question I have is: how does this differ from `df.fillna()`?
Also, would you be able to describe a bit more about a desired API? You don't have to worry about the implementation, we can try to figure it out.
Those two specifics would be helpful for me to work out a proper implementation!
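To make the `df.fillna()` comparison concrete, here is a minimal sketch of what plain pandas already covers: imputing a constant is a one-liner, while imputing a summary statistic requires computing it first (the `sales` column name is just illustrative).

```python
import pandas as pd

df = pd.DataFrame({"sales": [1.0, None, 3.0]})

# Plain pandas: imputing a constant is direct.
filled = df.assign(sales=df["sales"].fillna(0.0))
print(filled["sales"].tolist())  # [1.0, 0.0, 3.0]

# Imputing a summary statistic requires computing it separately first.
filled_median = df.assign(sales=df["sales"].fillna(df["sales"].median()))
print(filled_median["sales"].tolist())  # [1.0, 2.0, 3.0]
```

Presumably the value of an `impute()` verb is collapsing that second pattern into one method-chainable call.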
from pyjanitor.
I gave it some thought, and here are my ideas on a generic imputation function:
@pf.register_dataframe_method
def impute(df, column: str, value=None, statistic=None):
    """
    Method-chainable imputation of values in a column.

    Underneath the hood, this function calls the `.fillna()` method available
    to every pandas.Series object.

    Method-chaining example:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            # Impute null values with 0
            .impute(column='sales', value=0.0)
            # Impute null values with median
            .impute(column='score', statistic='median')
        )

    Exactly one of ``value`` or ``statistic`` should be provided.

    If ``value`` is provided, then all null values in the selected column will
    take on the value provided.

    If ``statistic`` is provided, then all null values in the selected column
    will take on the summary statistic value of the other non-null values.

    Currently supported values for ``statistic`` are:

    - ``mean`` (also aliased by ``average``)
    - ``median``
    - ``mode``
    - ``minimum`` (also aliased by ``min``)
    - ``maximum`` (also aliased by ``max``)

    :param df: A pandas DataFrame.
    :param column: The name of the column on which to impute values.
    :param value: (optional) The value to impute.
    :param statistic: (optional) The column statistic to impute.
    """
    # Firstly, we check that only one of `value` or `statistic` is provided.
    if value is not None and statistic is not None:
        raise ValueError(
            'Only one of `value` or `statistic` should be provided'
        )

    # If statistic is provided, then we compute the relevant summary
    # statistic from the non-null data.
    funcs = {
        'mean': np.mean,
        'average': np.mean,  # aliased
        'median': np.median,
        'mode': lambda s: s.mode().iloc[0],  # np.mode does not exist
        'minimum': np.min,
        'min': np.min,  # aliased
        'maximum': np.max,
        'max': np.max,  # aliased
    }
    if statistic is not None:
        # Check that the statistic keyword argument is one of the approved.
        if statistic not in funcs:
            raise KeyError(f'`statistic` must be one of {set(funcs)}')
        value = funcs[statistic](df[column].dropna())
    if value is not None:
        df[column] = df[column].fillna(value)
    return df
What are your thoughts on this?
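As a quick sanity check of the statistic path, independent of the pandas_flavor registration: dispatching through a dict of NumPy reducers over the non-null values should agree with pandas' own reductions. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 1.0, 7.0])

# Dict-dispatch over the non-null values, as in the proposal above.
funcs = {"mean": np.mean, "median": np.median, "min": np.min, "max": np.max}
imputed = {name: s.fillna(func(s.dropna())) for name, func in funcs.items()}

print(imputed["median"].tolist())  # [4.0, 4.0, 1.0, 7.0]
print(imputed["min"].tolist())     # [4.0, 1.0, 1.0, 7.0]
```

Note that the reducers here must be whole-array reductions (`np.min`, `np.max`), not the elementwise binary ufuncs `np.minimum`/`np.maximum`.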
@ericmjl - Thanks for keying me in and I like what's written. That said, I actually need to check this out quickly and use it, which is something that I'm unable to do until tomorrow afternoon.
A more complex imputation task that I find myself doing is fillna with age, gender, and disease mean imputation. Here's a description of the hack as my code is a bit too specific at this point to share/be useful.
First, I iterate through the data to create a dict that links each category combination with its mean value.
Here's some pseudocode:
df = DataFrame(MedicalData)
filtered_groupby = df.groupby(['gender', 'disease'])
ageGenderMatchAverageDict = dict()
valueLabels = ['cat1', 'cat2', 'cat3']
for label in valueLabels:
    for name, group in filtered_groupby:
        ageGenderMatchAverageDict.update(
            {str(label + '_' + '_'.join(name)): group[label].mean()}
        )
Next, I loop through the unique labels, genders, and diseases, using `.get` to select the fill value based on the combination of gender/disease. I use a little helper function `iffillnaval` to make sure that I'm only imputing NaNs and not real data. I know, I know. Hacky.
def iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease):
    if np.isnan(x):
        return ageGenderMatchAverageDict.get(str(label + '_' + gender + '_' + disease))
    else:
        return x

for label in valueLabels:
    for gender in df['gender'].unique():
        for disease in df['disease'].unique():
            df.loc[:, label] = [
                iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease)
                for x in df[label].values
            ]
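For what it's worth, pandas can express this group-wise mean fill without the dict or the triple loop, via `groupby(...).transform('mean')`. A sketch with made-up column names mirroring the pseudocode above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "disease": ["flu", "flu", "flu", "flu"],
    "cat1": [1.0, np.nan, 3.0, 5.0],
})

# transform('mean') broadcasts each group's NaN-skipping mean back onto
# that group's rows; fillna then only touches the NaNs, so real data is
# left alone.
group_means = df.groupby(["gender", "disease"])["cat1"].transform("mean")
df["cat1"] = df["cat1"].fillna(group_means)
print(df["cat1"].tolist())  # [1.0, 1.0, 3.0, 5.0]
```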
With a bit of your patented magic, I'm sure you can turn this hack into something powerful and generic. Thanks again and thanks for pyjanitor!
@eli-s-goldberg now that you've described this, I think the grammar and ontology look something like this:
Potential function signature (ontology):
def grouped_impute(df, columns, mapping=None, statistic=None):
    pass
Another potential function signature, if we wanted to sound more academic, is:
def stratified_impute(df, columns, mapping=None, statistic=None):
    pass
The grammar can actually be quite generic:

- Group by on the `columns` keyword.
- If a mapping (i.e. a dictionary) is provided, use the mapping. The mapping should have the groupby keys that are produced by `df.groupby(columns)`. Naturally, this is not the easiest thing for end-users to use, but I think it's useful to provide the option.
- If a statistic is provided, then we can easily map your mean imputation (or mode or median or minimum or maximum - all the M&Ms) to the groupby keys.
The way you've shown it is pretty good, actually. I'll use that as a jumping board for this.
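The mapping branch could be sketched like this (the column names and mapping values are hypothetical; the idea is that the dict is keyed by the groupby keys and broadcast back onto the rows with `.map`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "score": [np.nan, 2.0, 3.0, np.nan],
})

# User-supplied mapping: groupby key -> fill value for that group.
mapping = {"F": 0.0, "M": 9.0}

# map() builds a per-row Series of fill values; fillna only uses it
# where `score` is NaN, so non-null data is untouched.
df["score"] = df["score"].fillna(df["gender"].map(mapping))
print(df["score"].tolist())  # [0.0, 2.0, 3.0, 9.0]
```

For multi-column groupby keys, the same idea should extend by mapping over tuples of the key columns.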
A bit of an update. This doesn't handle variable statistics or textual data, but it's getting there. I've been using it generically for the past weekend or so. It's not the quickest thing (millions of rows take a few minutes per column), and it will only work with a single column.
def stratified_impute(df, mapping, columns):
    """
    Perform a stratified impute match. Note, mapping cannot contain columns.

    Method chaining usage:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            .stratified_impute(
                mapping=['gender', 'race', 'ethnicity', 'disease'],
                columns=['pack_year'],
            )
        )

    :param df: pandas DataFrame.
    :param mapping: Column(s) on which to map stratified impute.
    :param columns: Column(s) on which to perform stratified imputation.
    """
    if set(columns) & set(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))
    filtered_groupby = df.groupby(mapping)

    # First pass: record each stratum's mean for every target column.
    strat_dict = dict()
    for column in columns:
        for name, group in filtered_groupby:
            group = group.dropna()
            strat_dict.update(
                {str(column + str(name)): group[column].mean()}
            )

    # Second pass: fill the NaN rows from the dict; pass the rest along.
    group_nan_list = []
    group_nonnan_list = []
    for column in columns:
        for name, group in filtered_groupby:
            # nan_data to be filled
            nan_data = group[pd.isna(group[column])]
            # replace nan_data with backfilled data from dict
            nan_data = nan_data.fillna(
                {column: strat_dict.get(str(column + str(name)))}
            )
            group_nan_list.append(nan_data)
            # non nan_data to be passed along
            nonnan_data = group[~pd.isna(group[column])]
            group_nonnan_list.append(nonnan_data)
    group_nan_list.extend(group_nonnan_list)
    df = pd.concat(group_nan_list)
    return df
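On the speed point: the split/fill/concat round-trip is likely the bottleneck, and it also reorders rows. A vectorized equivalent (a sketch, not a drop-in replacement) fills per stratum via `transform` and preserves row order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "disease": ["flu", "flu", "flu", "cold", "cold"],
    "pack_year": [10.0, np.nan, 20.0, np.nan, 30.0],
})

def stratified_impute_fast(df, mapping, columns):
    # Same contract as above: strata defined by `mapping`, fill NaNs in
    # `columns` with the stratum mean. transform('mean') skips NaN and
    # returns a Series aligned to the original index.
    df = df.copy()
    for column in columns:
        stratum_mean = df.groupby(mapping)[column].transform("mean")
        df[column] = df[column].fillna(stratum_mean)
    return df

out = stratified_impute_fast(df, ["gender", "disease"], ["pack_year"])
print(out["pack_year"].tolist())  # [10.0, 10.0, 20.0, 30.0, 30.0]
```

Rows whose stratum has no non-null values stay NaN, since the stratum mean itself is NaN in that case.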