
Comments (5)

ericmjl commented on May 18, 2024

Aloha @eli-s-goldberg! 😸

Thanks for pinging in with the feature request. Love it - it means we've got users who are engaged!

Before I go on and try an implementation, I have a few questions. "Imputation" can be a bit nebulous - would you be open to providing some details?

One question I have is - how does this differ from df.fillna()?

Also, would you be able to describe a bit more about a desired API? You don't have to worry about the implementation, we can try to figure it out.

Those two specifics would be helpful for me to work out a proper implementation!

from pyjanitor.

ericmjl commented on May 18, 2024

I gave it some thought, and here are my ideas on a generic imputation function:

import numpy as np
import pandas_flavor as pf


@pf.register_dataframe_method
def impute(df, column: str, value=None, statistic=None):
    """
    Method-chainable imputation of values in a column.

    Under the hood, this function calls the `.fillna()` method available
    on every pandas.Series object.

    Method-chaining example:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            # Impute null values with 0
            .impute(column='sales', value=0.0)
            # Impute null values with the median
            .impute(column='score', statistic='median')
        )

    Exactly one of ``value`` or ``statistic`` should be provided.

    If ``value`` is provided, then all null values in the selected column
    take on that value.

    If ``statistic`` is provided, then all null values in the selected column
    take on the summary statistic computed from the non-null values.

    Currently supported values for ``statistic`` are:

    - ``mean`` (also aliased by ``average``)
    - ``median``
    - ``mode``
    - ``minimum`` (also aliased by ``min``)
    - ``maximum`` (also aliased by ``max``)

    :param df: A pandas DataFrame.
    :param column: The name of the column on which to impute values.
    :param value: (optional) The value to impute.
    :param statistic: (optional) The column statistic to impute.
    """
    # First, check that only one of `value` or `statistic` is provided.
    if value is not None and statistic is not None:
        raise ValueError(
            'Only one of `value` or `statistic` should be provided.'
        )

    # If `statistic` is provided, compute the relevant summary statistic
    # from the non-null data.
    funcs = {
        'mean': np.mean,
        'average': np.mean,  # aliased
        'median': np.median,
        'mode': lambda s: s.mode().iloc[0],  # np.mode does not exist
        'minimum': np.min,  # np.minimum is the elementwise ufunc, not a reduction
        'min': np.min,  # aliased
        'maximum': np.max,  # likewise, np.maximum is elementwise
        'max': np.max,  # aliased
    }
    if statistic is not None:
        # Check that the statistic keyword argument is one of the approved.
        if statistic not in funcs:
            raise KeyError(f'`statistic` must be one of {list(funcs)}')
        value = funcs[statistic](df[column].dropna())

    if value is not None:
        df[column] = df[column].fillna(value)
    return df
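For concreteness, here is what the two docstring calls reduce to in plain pandas (no pandas_flavor registration needed for the demo; the toy frame below is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales": [1.0, np.nan, 3.0], "score": [10.0, 20.0, np.nan]})

# .impute(column='sales', value=0.0): fill nulls with a fixed value
df["sales"] = df["sales"].fillna(0.0)

# .impute(column='score', statistic='median'): fill nulls with the
# median of the non-null values
df["score"] = df["score"].fillna(df["score"].dropna().median())

print(df["sales"].tolist())  # [1.0, 0.0, 3.0]
print(df["score"].tolist())  # [10.0, 20.0, 15.0]
```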

What are your thoughts on this?


eli-s-goldberg commented on May 18, 2024

@ericmjl - Thanks for keying me in, and I like what's written. That said, I still need to actually try it out, which I won't be able to do until tomorrow afternoon.

A more complex imputation task that I find myself doing is fillna with age-, gender-, and disease-matched mean imputation. Here's a description of the hack, as my code is a bit too specific at this point to share or be useful.

First, I iterate through the data to create a dict that links each category combination to its mean value.
Here's some pseudocode:

# (pseudocode) build a dict mapping each label + (gender, disease)
# combination to that group's mean value
df = pd.DataFrame(MedicalData)
filtered_groupby = df.groupby(['gender', 'disease'])
ageGenderMatchAverageDict = dict()
valueLabels = ['cat1', 'cat2', 'cat3']
for label in valueLabels:
    for name, group in filtered_groupby:
        ageGenderMatchAverageDict[label + '_' + '_'.join(name)] = group[label].mean()

Next, I loop through the unique labels, genders, and diseases, using .get to select the fill value based on the combination of gender/disease. I use a little helper function iffillnaval to make sure that I'm only imputing NaNs and not real data. I know, I know. Hacky.

def iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease):
    # fill only NaNs, leaving real data untouched
    if np.isnan(x):
        return ageGenderMatchAverageDict.get(label + '_' + gender + '_' + disease)
    return x

for label in valueLabels:
    for gender in df['gender'].unique():
        for disease in df['disease'].unique():
            # restrict to rows in this gender/disease stratum; without the
            # mask, every NaN gets the first combination's mean
            mask = (df['gender'] == gender) & (df['disease'] == disease)
            df.loc[mask, label] = [
                iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease)
                for x in df.loc[mask, label].values
            ]

With a bit of your patented magic, I'm sure you can turn this hack into something powerful and generic. Thanks again and thanks for pyjanitor!


ericmjl commented on May 18, 2024

@eli-s-goldberg now that you've described it, I think the grammar and ontology look something like this:

Potential function signature (ontology):

def grouped_impute(df, columns, mapping=None, statistic=None):
    pass

Another potential function signature, if we wanted to sound more academic, is:

def stratified_impute(df, columns, mapping, statistic):
    pass

The grammar can actually be quite generic.

  1. Groupby on the columns keyword.
  2. If a mapping (i.e. dictionary) is provided, use the mapping. Mapping should have the groupby keys that are provided by df.groupby(columns). Naturally, this is not the easiest thing that end-users will commonly use, but I think it's useful to provide the option.
  3. If a statistic is provided, then we can easily map your mean imputation (or mode or median or minimum or maximum - all the M&Ms) to the groupby keys.

The way you've shown it is pretty good, actually. I'll use that as a springboard for this.
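A minimal sketch of that grammar, assuming a hypothetical `target` parameter for the column whose nulls get filled (the signatures above leave it implicit; the toy data is made up):

```python
import numpy as np
import pandas as pd

def grouped_impute(df, columns, target, mapping=None, statistic=None):
    # 1. Group by the `columns` keyword.
    grouped = df.groupby(columns)[target]
    if statistic is not None:
        # 3. Fill each group's nulls with that group's summary statistic.
        fill = grouped.transform(statistic)  # e.g. 'mean', 'median', 'min', 'max'
    else:
        # 2. Explicit mapping: {groupby key -> fill value}, keyed the same
        #    way as df.groupby(columns) keys its groups.
        key = df[columns].apply(tuple, axis=1) if isinstance(columns, list) else df[columns]
        fill = key.map(mapping)
    df[target] = df[target].fillna(fill)
    return df

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "disease": ["flu"] * 4,
    "score": [1.0, np.nan, 3.0, np.nan],
})
out = grouped_impute(df.copy(), ["gender", "disease"], "score", statistic="mean")
print(out["score"].tolist())  # [1.0, 1.0, 3.0, 3.0]

out2 = grouped_impute(df.copy(), ["gender", "disease"], "score",
                      mapping={("F", "flu"): 0.0, ("M", "flu"): 9.0})
print(out2["score"].tolist())  # [1.0, 0.0, 3.0, 9.0]
```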


eli-s-goldberg commented on May 18, 2024

A bit of an update. This doesn't handle variable statistics or textual data, but it's getting there. I've been using it generically for the past weekend or so. It's not the quickest thing (millions of rows take a few minutes per column), and it will only work with a single column.

import pandas as pd


def stratified_impute(df, mapping, columns):
    """
    Perform a stratified impute match. Note: ``mapping`` cannot contain
    any of ``columns``.

    Method chaining usage:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            .stratified_impute(mapping=['gender', 'race', 'ethnicity', 'disease'], columns=['pack_year'])
        )

    :param df: pandas DataFrame.
    :param mapping: Column(s) on which to stratify the imputation.
    :param columns: Column(s) on which to perform stratified imputation.
    """
    # Raise if there is any overlap between `columns` and `mapping`.
    # (`issubset` would only catch the case where *all* columns overlap.)
    if not set(columns).isdisjoint(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))

    filtered_groupby = df.groupby(mapping)
    strat_dict = dict()
    for column in columns:
        for name, group in filtered_groupby:
            # Series.mean() skips NaNs, so there is no need to dropna() the
            # whole group (which would also drop rows that are NaN in
            # unrelated columns).
            strat_dict[column + str(name)] = group[column].mean()

    group_nan_list = []
    group_nonnan_list = []
    for column in columns:
        for name, group in filtered_groupby:
            # NaN rows to be filled
            nan_data = group[pd.isna(group[column])]

            # replace the NaNs with the group mean from the dict
            nan_data = nan_data.fillna({column: strat_dict[column + str(name)]})
            group_nan_list.append(nan_data)

            # non-NaN rows to be passed along unchanged
            nonnan_data = group[~pd.isna(group[column])]
            group_nonnan_list.append(nonnan_data)

    group_nan_list.extend(group_nonnan_list)
    df = pd.concat(group_nan_list)

    return df
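On the speed point: the per-group split-and-concat is likely the bottleneck. A transform-based variant (a sketch with the same intended semantics; the function name and toy data are made up) fills in place without re-concatenating:

```python
import numpy as np
import pandas as pd

def stratified_impute_fast(df, mapping, columns):
    # Hypothetical faster variant: group once, then fill each target
    # column's NaNs with its group mean via a vectorized transform.
    if not set(columns).isdisjoint(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))
    grouped = df.groupby(mapping)
    for column in columns:
        df[column] = df[column].fillna(grouped[column].transform("mean"))
    return df

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "disease": ["flu"] * 4,
    "pack_year": [2.0, np.nan, 10.0, 20.0],
})
df = stratified_impute_fast(df, ["gender", "disease"], ["pack_year"])
print(df["pack_year"].tolist())  # [2.0, 2.0, 10.0, 20.0]
```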

