Comments (5)
Aloha @eli-s-goldberg! 😸
Thanks for pinging in with the feature request. Love it - it means we've got users who are engaged!
Before I go on and try an implementation, I have a few questions. "Imputation" can be a bit nebulous - would you be open to providing some details?
One question I have is: how does this differ from `df.fillna()`?
Also, would you be able to describe a bit more about a desired API? You don't have to worry about the implementation, we can try to figure it out.
Those two specifics would be helpful for me to work out a proper implementation!
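To make the `df.fillna()` comparison concrete, here is a minimal sketch of what plain pandas already covers: imputing a constant is a one-liner, while imputing a summary statistic requires computing it first (the `sales` column name is just illustrative).

```python
import pandas as pd

df = pd.DataFrame({"sales": [1.0, None, 3.0]})

# Plain pandas: imputing a constant is direct.
filled = df.assign(sales=df["sales"].fillna(0.0))
print(filled["sales"].tolist())  # [1.0, 0.0, 3.0]

# Imputing a summary statistic requires computing it separately first.
filled_median = df.assign(sales=df["sales"].fillna(df["sales"].median()))
print(filled_median["sales"].tolist())  # [1.0, 2.0, 3.0]
```

Presumably the value of an `impute()` verb is collapsing that second pattern into one method-chainable call.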
from pyjanitor.
I gave it some thought, and here are my ideas on a generic imputation function:
@pf.register_dataframe_method
def impute(df, column: str, value=None, statistic=None):
    """
    Method-chainable imputation of values in a column.

    Underneath the hood, this function calls the `.fillna()` method available
    to every pandas.Series object.

    Method-chaining example:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            # Impute null values with 0
            .impute(column='sales', value=0.0)
            # Impute null values with median
            .impute(column='score', statistic='median')
        )

    Exactly one of ``value`` or ``statistic`` should be provided.

    If ``value`` is provided, then all null values in the selected column will
    take on the value provided.

    If ``statistic`` is provided, then all null values in the selected column
    will take on the summary statistic value of the other non-null values.

    Currently supported values for ``statistic`` are:

    - ``mean`` (also aliased by ``average``)
    - ``median``
    - ``mode``
    - ``minimum`` (also aliased by ``min``)
    - ``maximum`` (also aliased by ``max``)

    :param df: A pandas DataFrame.
    :param column: The name of the column on which to impute values.
    :param value: (optional) The value to impute.
    :param statistic: (optional) The column statistic to impute.
    """
    # Firstly, we check that only one of `value` or `statistic` is provided.
    if value is not None and statistic is not None:
        raise ValueError(
            'Only one of `value` or `statistic` should be provided'
        )

    # If statistic is provided, then we compute the relevant summary
    # statistic from the non-null data.
    funcs = {
        'mean': np.mean,
        'average': np.mean,  # aliased
        'median': np.median,
        'mode': lambda s: s.mode().iloc[0],  # np.mode does not exist
        'minimum': np.min,
        'min': np.min,  # aliased
        'maximum': np.max,
        'max': np.max,  # aliased
    }
    if statistic is not None:
        # Check that the statistic keyword argument is one of the approved.
        if statistic not in funcs:
            raise KeyError(f'`statistic` must be one of {set(funcs)}')
        value = funcs[statistic](df[column].dropna())
    if value is not None:
        df[column] = df[column].fillna(value)
    return df
What are your thoughts on this?
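As a quick sanity check of the statistic path, independent of the pandas_flavor registration: dispatching through a dict of NumPy reducers over the non-null values should agree with pandas' own reductions. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 1.0, 7.0])

# Dict-dispatch over the non-null values, as in the proposal above.
funcs = {"mean": np.mean, "median": np.median, "min": np.min, "max": np.max}
imputed = {name: s.fillna(func(s.dropna())) for name, func in funcs.items()}

print(imputed["median"].tolist())  # [4.0, 4.0, 1.0, 7.0]
print(imputed["min"].tolist())     # [4.0, 1.0, 1.0, 7.0]
```

Note that the reducers here must be whole-array reductions (`np.min`, `np.max`), not the elementwise binary ufuncs `np.minimum`/`np.maximum`.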
@ericmjl - Thanks for keying me in and I like what's written. That said, I actually need to check this out quickly and use it, which is something that I'm unable to do until tomorrow afternoon.
A more complex imputation task that I find myself doing is fillna with age, gender, and disease mean imputation. Here's a description of the hack as my code is a bit too specific at this point to share/be useful.
First, I iterate through the data to create a dict that links each category combination with its mean value.
Here's some pseudocode:
df = DataFrame(MedicalData)
filtered_groupby = df.groupby(['gender', 'disease'])
ageGenderMatchAverageDict = dict()
valueLabels = ['cat1', 'cat2', 'cat3']
for label in valueLabels:
    for name, group in filtered_groupby:
        ageGenderMatchAverageDict.update(
            {str(label + '_' + '_'.join(name)): group[label].mean()}
        )
Next, I loop through the unique labels, genders, and diseases, using `.get` to select the fill value based on the combination of gender/disease. I use a little helper function `iffillnaval` to make sure that I'm only imputing NaNs and not real data. I know, I know. Hacky.
def iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease):
    if np.isnan(x):
        return ageGenderMatchAverageDict.get(str(label + '_' + gender + '_' + disease))
    else:
        return x

for label in valueLabels:
    for gender in df['gender'].unique():
        for disease in df['disease'].unique():
            df.loc[:, label] = [
                iffillnaval(x, ageGenderMatchAverageDict, label, gender, disease)
                for x in df[label].values
            ]
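For what it's worth, pandas can express this group-wise mean fill without the dict or the triple loop, via `groupby(...).transform('mean')`. A sketch with made-up column names mirroring the pseudocode above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "disease": ["flu", "flu", "flu", "flu"],
    "cat1": [1.0, np.nan, 3.0, 5.0],
})

# transform('mean') broadcasts each group's NaN-skipping mean back onto
# that group's rows; fillna then only touches the NaNs, so real data is
# left alone.
group_means = df.groupby(["gender", "disease"])["cat1"].transform("mean")
df["cat1"] = df["cat1"].fillna(group_means)
print(df["cat1"].tolist())  # [1.0, 1.0, 3.0, 5.0]
```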
With a bit of your patented magic, I'm sure you can turn this hack into something powerful and generic. Thanks again and thanks for pyjanitor!
@eli-s-goldberg now that you've described this, I think the grammar and ontology look something like this:
Potential function signature (ontology):
def grouped_impute(df, columns, mapping=None, statistic=None):
    pass
Another potential function signature, if we wanted to sound more academic, is:
def stratified_impute(df, columns, mapping=None, statistic=None):
    pass
The grammar can actually be quite generic:

- Group by on the `columns` keyword.
- If a mapping (i.e. a dictionary) is provided, use the mapping. The mapping should have the groupby keys that are produced by `df.groupby(columns)`. Naturally, this is not the easiest thing for end-users to use, but I think it's useful to provide the option.
- If a statistic is provided, then we can easily map your mean imputation (or mode or median or minimum or maximum - all the M&Ms) to the groupby keys.
The way you've shown it is pretty good, actually. I'll use that as a jumping board for this.
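The mapping branch could be sketched like this (the column names and mapping values are hypothetical; the idea is that the dict is keyed by the groupby keys and broadcast back onto the rows with `.map`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "score": [np.nan, 2.0, 3.0, np.nan],
})

# User-supplied mapping: groupby key -> fill value for that group.
mapping = {"F": 0.0, "M": 9.0}

# map() builds a per-row Series of fill values; fillna only uses it
# where `score` is NaN, so non-null data is untouched.
df["score"] = df["score"].fillna(df["gender"].map(mapping))
print(df["score"].tolist())  # [0.0, 2.0, 3.0, 9.0]
```

For multi-column groupby keys, the same idea should extend by mapping over tuples of the key columns.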
A bit of an update. This doesn't handle variable statistics or textual data, but it's getting there. I've been using it generically for the past weekend or so. It's not the quickest thing (millions of rows take a few minutes per column), and it will only work with a single column.
def stratified_impute(df, mapping, columns):
    """
    Perform a stratified impute match. Note, mapping cannot contain columns.

    Method chaining usage:

    .. code-block:: python

        df = (
            pd.DataFrame(...)
            .stratified_impute(
                mapping=['gender', 'race', 'ethnicity', 'disease'],
                columns=['pack_year'],
            )
        )

    :param df: pandas DataFrame.
    :param mapping: Column(s) on which to map stratified impute.
    :param columns: Column(s) on which to perform stratified imputation.
    """
    if set(columns) & set(mapping):
        raise ValueError("{} must not include {}".format(mapping, columns))
    filtered_groupby = df.groupby(mapping)

    # First pass: record each stratum's mean for every target column.
    strat_dict = dict()
    for column in columns:
        for name, group in filtered_groupby:
            group = group.dropna()
            strat_dict.update(
                {str(column + str(name)): group[column].mean()}
            )

    # Second pass: fill the NaN rows from the dict; pass the rest along.
    group_nan_list = []
    group_nonnan_list = []
    for column in columns:
        for name, group in filtered_groupby:
            # nan_data to be filled
            nan_data = group[pd.isna(group[column])]
            # replace nan_data with backfilled data from dict
            nan_data = nan_data.fillna(
                {column: strat_dict.get(str(column + str(name)))}
            )
            group_nan_list.append(nan_data)
            # non nan_data to be passed along
            nonnan_data = group[~pd.isna(group[column])]
            group_nonnan_list.append(nonnan_data)
    group_nan_list.extend(group_nonnan_list)
    df = pd.concat(group_nan_list)
    return df
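On the speed point: the split/fill/concat round-trip is likely the bottleneck, and it also reorders rows. A vectorized equivalent (a sketch, not a drop-in replacement) fills per stratum via `transform` and preserves row order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "disease": ["flu", "flu", "flu", "cold", "cold"],
    "pack_year": [10.0, np.nan, 20.0, np.nan, 30.0],
})

def stratified_impute_fast(df, mapping, columns):
    # Same contract as above: strata defined by `mapping`, fill NaNs in
    # `columns` with the stratum mean. transform('mean') skips NaN and
    # returns a Series aligned to the original index.
    df = df.copy()
    for column in columns:
        stratum_mean = df.groupby(mapping)[column].transform("mean")
        df[column] = df[column].fillna(stratum_mean)
    return df

out = stratified_impute_fast(df, ["gender", "disease"], ["pack_year"])
print(out["pack_year"].tolist())  # [10.0, 10.0, 20.0, 30.0, 30.0]
```

Rows whose stratum has no non-null values stay NaN, since the stratum mean itself is NaN in that case.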