pslmodels / microdf

Analysis tools for working with survey microdata as DataFrames.

Home Page: http://pslmodels.github.io/microdf

License: MIT License

Python 100.00%

microdata pandas tax-calculator survey-microdata dataframes analysis psl-cataloged

microdf's Introduction

microdf

Analysis tools for working with survey microdata as DataFrames.

Disclaimer: MicroSeries and MicroDataFrame are experimental features and may not consider weights after performing some operations. See open issues.

Installation

Install with:

pip install git+https://github.com/PSLmodels/microdf.git

Questions

Contact the maintainer, Max Ghenis.

Citation

You may cite the source of your analysis as "microdf release #.#.#, author's calculations."

microdf's Issues

Remove Medicare and Medicaid from after-tax income used for VAT/FTT/CT

set_plot_style isn't setting the grid color

This should set the grid color to light gray:
https://github.com/MaxGhenis/microdf/blob/65ce8db2bbacdb991f60a92641b0e468cda561fe/microdf/style.py#L41

But it shows up blank:

import microdf as mdf
import matplotlib.pyplot as plt

!wget https://github.com/MaxGhenis/random/raw/master/Roboto-Regular.ttf
mdf.set_plot_style()

plt.plot([1, 2])

Remove weighted functions if pandas adds weights to Series operations

See pandas-dev/pandas#15039 and pandas-dev/pandas#10030

Add OSI license

PSL criteria 1:

Models MUST be released under an OSI-approved open source license or the Creative Commons Public Domain Dedication (CC0).
https://github.com/PSLmodels/PSL/blob/master/Criteria/library_criteria.md

Consistent weighting API

Currently, the functions in weighted.py have a mix of APIs, e.g.:

weighted_sum(df, col, w)
weighted_quantile(values, quantiles, sample_weight)

Make this consistent.
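One way to make this consistent, sketched below under assumptions: keep array-level helpers private and expose only DataFrame-level functions that all take (df, col, w, ...). The helper names and the step-function quantile are illustrative, not microdf's actual implementation (which interpolates).

```python
import numpy as np
import pandas as pd


def _weighted_sum(values, weights):
    """Array-level helper: sum of values times weights."""
    return float(np.dot(values, weights))


def _weighted_quantile(values, weights, q):
    """Array-level helper: smallest value whose cumulative weight share
    reaches q. Illustrative step-function definition, no interpolation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    cum_share = cum / cum[-1]  # Last element is exactly 1.0.
    return float(values[order][np.searchsorted(cum_share, q)])


# Public functions share one signature shape: (df, col, w, ...).
def weighted_sum(df, col, w):
    return _weighted_sum(df[col], df[w])


def weighted_quantile(df, col, w, q):
    return _weighted_quantile(df[col], df[w], q)


df = pd.DataFrame({"x": [1, 2, 3], "w": [1, 1, 2]})
total = weighted_sum(df, "x", "w")
median = weighted_quantile(df, "x", "w", 0.5)
```

With this layout, callers never need to remember whether a function wants a DataFrame and column names or raw arrays.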

calc_df fails when metric_vars isn't provided

Change in disposable income decile chart

Something like this with only two points on the x-axis: baseline and reform.

https://ourworldindata.org/incomes-across-the-distribution

Like comparing Gini indexes, this measures the total effect on the distribution, rather than slicing by the baseline distribution as most other charts do.

Add groupby argument to add_weighted_quantiles

This is useful for datasets that stack multiple reforms (groupby reform), or to calculate percentiles within certain tax unit types (e.g. by whether they have children).
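A minimal sketch of what a groupby option could look like, assuming a DataFrame with a unique index. The function name add_weighted_percentile_rank and its out argument are hypothetical; this is not the signature of add_weighted_quantiles.

```python
import pandas as pd


def add_weighted_percentile_rank(df, col, w, groupby=None, out="percentile"):
    """Add each row's weighted percentile rank of col, optionally by group.

    Hypothetical helper: the rank is the cumulative weight share (0-100)
    up to and including the row, after sorting by col.
    """
    ordered = df.sort_values(col)
    if groupby is None:
        cum = ordered[w].cumsum()
        total = ordered[w].sum()
    else:
        grouped = ordered.groupby(groupby)[w]
        cum = grouped.cumsum()
        total = grouped.transform("sum")
    # Assignment aligns on the index, restoring the original row order.
    df[out] = 100 * cum / total
    return df


stacked = pd.DataFrame({
    "reform": ["a", "a", "b", "b"],
    "income": [10, 20, 5, 1],
    "weight": [1, 3, 2, 2],
})
add_weighted_percentile_rank(stacked, "income", "weight", groupby="reform")
```

Here the ranks are computed separately for reform a and reform b, so each reform's highest-income row lands at the 100th percentile.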

Replace `tax` with `combined` in calc_df()

Avoid creating a new duplicate variable.

Apply for PSL

Requirements at https://github.com/PSLmodels/PSL/blob/master/Criteria/library_criteria.md

Separate issues for each requirement to come.

Restructure into microdf package and taxcalc module

Move functions that aren't particular to taxcalc into the general microdf (microdataframe) package, and then create a module for taxcalc functions named taxcalc (or maybe taxcalc_df, as PEP8 permits underscores in module names). The convention should be from microdf import taxcalc as tcdf.

Possibly move into OpenUBI repo, but maybe do that later.

See https://stackoverflow.com/questions/15746675/how-to-write-a-python-module-package.

Add to conda

Make weighted percentiles match unweighted percentiles of stacked data

weighted_quantile, which also powers weighted_median, comes from this SO answer, which works efficiently, but doesn't exactly match unweighted percentiles of stacked data. For example (per this SO comment), the following two should be equivalent:

mdf.weighted_quantile([1, 2], 0.5, [1, 3])  # 1.75
np.median([1, 2, 2, 2])  # 2

They differ because the weighted approach interpolates between values around the specified quantile.

I added a note to the docstring and test about this in https://github.com/MaxGhenis/microdf/pull/49, but it'd be good to figure out a solution, at least as an option. I asked on SO if this is possible; it could well not be, short of literally replicating the values according to the weights.
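For reference, the literal replication approach mentioned above can be sketched as follows. It only applies to nonnegative integer weights and uses memory proportional to the sum of the weights, so it is a cross-check rather than a practical replacement. The function name is hypothetical.

```python
import numpy as np


def quantile_by_replication(values, quantiles, weights):
    """Stack each value weights[i] times, then take unweighted quantiles."""
    weights = np.asarray(weights)
    if not np.issubdtype(weights.dtype, np.integer) or (weights < 0).any():
        raise ValueError("weights must be nonnegative integers")
    stacked = np.repeat(np.asarray(values), weights)
    return np.quantile(stacked, quantiles)


replicated = quantile_by_replication([1, 2], 0.5, [1, 3])
unweighted = np.median([1, 2, 2, 2])
```

Both results equal 2 for this example, unlike the interpolating weighted_quantile, which returns 1.75.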

Use Travis CI

Create roadmap

PSL criteria (see #13) include:

Projects SHOULD have a public roadmap.

Supplemental Poverty Measure

Estimating this will require merging more data from the raw CPS. Probably ideally done in C-TAM, but can test out here first.

See Trends in Poverty with an Anchored Supplemental Poverty Measure and PSLmodels/Tax-Calculator#1896.

Add ratio arg to add_custom_tax

Only one of total and ratio can be provided.

Add option to zero out incomes in `calc_df`

e.g.:

df['aftertax_income'] = df.aftertax_income.clip(lower=0)
df['aftertax_income_m'] = df.aftertax_income_m.clip(lower=0)

Make CASH_SHARES array

https://github.com/MaxGhenis/microdf/blob/master/microdf/constants.py currently has separate constants representing the cash share of each benefit program. This should be an array instead.

Consider splitting out taxcalc features

microdf supports data and tasks that include but are not limited to taxcalc, and it's increasingly general. Given taxcalc's size and large dependency set, ideally it would be an optional dependency. While pip supports optional dependencies, they're not yet implemented in conda (conda/conda#7502).

Another option is creating a new package like microdf-taxcalc which includes both microdf and taxcalc as dependencies, and includes functions like calc_df.

The current setup is OK for now; this issue is a placeholder for future consideration.
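A third possibility, sketched here as an assumption rather than a plan the maintainers have stated, is to import taxcalc lazily inside the functions that need it, so that the rest of the package works without it. The helper name require_optional is hypothetical.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency, or raise a readable error.

    Hypothetical helper: module_name is the package to import and feature
    names the function that needs it, for the error message.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'; "
            "install it to use this feature."
        ) from err


def calc_df_sketch():
    # Only this function fails if taxcalc is missing; imports of the
    # package itself are unaffected.
    tc = require_optional("taxcalc", "calc_df")
    return tc
```

This keeps a single package while deferring the heavy dependency until a taxcalc-specific function is actually called.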

Add codecov

tax_from_mtrs produces NaNs when starting brackets from -np.inf

This produces all NaNs:

mdf.tax_from_mtrs([-1, 0, 1, 1e6, 10e6], [-np.inf, 1e6], [0, 0.01])

It works with a large negative number, e.g.:

mdf.tax_from_mtrs([-1, 0, 1, 1e6, 10e6], [-9e99, 1e6], [0, 0.01])
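A likely source of the NaNs, assuming the implementation accumulates a base tax from bracket widths in the same way as the basic tax_from_mtrs sketch posted in another issue here: an infinite bracket width multiplied by a zero rate is NaN, and the cumulative sum then propagates it. The snippet below reproduces only that arithmetic.

```python
import numpy as np
import pandas as pd


def base_tax(brackets, rates):
    """Cumulative tax owed on reaching each bracket threshold."""
    brackets = pd.Series(brackets, dtype=float)
    rates = pd.Series(rates, dtype=float)
    width = brackets.sub(brackets.shift(fill_value=0))
    # With a -inf threshold, width is [-inf, inf]; times a rate of 0,
    # that is NaN.
    return width.mul(rates.shift(fill_value=0)).cumsum()


with_inf = base_tax([-np.inf, 1e6], [0, 0.01])
with_large_negative = base_tax([-9e99, 1e6], [0, 0.01])
```

with_inf is all NaN, while with_large_negative is all zeros, which matches the behavior reported above.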

Add avoidance/evasion rate to tax_from_mtrs

A constant rate, such as the 15-16% modeled by Saez and Zucman.

Add type hinting

See https://stackoverflow.com/a/32558710/1840471

Add functions to read reforms from taxcalc GitHub

Move these here: https://github.com/MaxGhenis/taxcalc-notebooks/blob/master/utils/maxghenis_taxcalc_utils.py

Include matplotlib stylesheet

A bunch of my plots, some of which would go here, use this style:

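import matplotlib as mpl
import matplotlib.font_manager  # So that mpl.font_manager resolves.
import seaborn as sns

!wget https://github.com/MaxGhenis/random/raw/master/Roboto-Regular.ttf -P /usr/local/lib/python3.6/dist-packages/matplotlib/mpl-data/fonts/ttf
# Private API that forces a font cache rebuild; absent from recent matplotlib releases.
mpl.font_manager._rebuild()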

sns.set_style('white')
DPI = 200
mpl.rc('savefig', dpi=DPI)
mpl.rcParams['figure.dpi'] = DPI
mpl.rcParams['figure.figsize'] = 6.4, 4.8  # Default.
mpl.rcParams['font.sans-serif'] = 'Roboto'
mpl.rcParams['font.family'] = 'sans-serif'

# Set title text color to dark gray (https://material.io/color) not black.
TITLE_COLOR = '#212121'
mpl.rcParams['text.color'] = TITLE_COLOR

# Axis titles and tick marks are medium gray.
AXIS_COLOR = '#757575'
mpl.rcParams['axes.labelcolor'] = AXIS_COLOR
mpl.rcParams['xtick.color'] = AXIS_COLOR
mpl.rcParams['ytick.color'] = AXIS_COLOR

Ideally this would be part of the package.

See https://stackoverflow.com/questions/31559225/how-to-ship-or-distribute-a-matplotlib-stylesheet

Fix collections warning

From pytest:

==================================================== warnings summary ====================================================
/home/mghenis/miniconda3/lib/python3.7/site-packages/numba/types/containers.py:3
  /home/mghenis/miniconda3/lib/python3.7/site-packages/numba/types/containers.py:3: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import Iterable

-- Docs: https://docs.pytest.org/en/latest/warnings.html

Create test generating outputs

To meet PSL criteria (#13):

At least one test MUST generate key outputs from source materials, the test MUST be run with every new version, and the outputs of the test MUST be checked into the repository.

This could be a summary table from calc_df (Tax-Calculator) or something from another dataset like SCF which can be easily downloaded.

Add unit tests

To satisfy PSL requirements 4 and 5:

At least one test MUST generate key outputs from source materials, the test MUST be run with every new version, and the outputs of the test MUST be checked into the repository.
Projects MUST have unit tests and SHOULD report code coverage.

https://github.com/PSLmodels/PSL/blob/master/Criteria/library_criteria.md

Allow for pre-divided weights in add_weighted_metrics

For example, to weight something by XTOT, you currently have to pre-calculate XTOT_m = XTOT * s006 and then do:

add_weighted_metrics(df, metric_vars, w='XTOT_m', divisor=1, suffix='_XTOT_m')

And then you're left with a superfluous XTOT_m_XTOT_m column.

Make this easier.

Despine as part of set_plot_style

set_plot_style should do the equivalent of seaborn.despine(left=True, bottom=True).

(Before and after images from the original issue are not reproduced here.)
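One way to get this effect without calling seaborn on every figure, sketched as an assumption about how set_plot_style could do it, is to switch off all four spines through matplotlib's rcParams. The function name despine_by_default is hypothetical.

```python
import matplotlib as mpl

mpl.use("Agg")  # Non-interactive backend so the sketch runs anywhere.
import matplotlib.pyplot as plt


def despine_by_default():
    """Hide all four spines on axes created after this call.

    seaborn.despine() already removes the top and right spines by default,
    so left=True, bottom=True leaves none visible.
    """
    for side in ("left", "bottom", "top", "right"):
        mpl.rcParams[f"axes.spines.{side}"] = False


despine_by_default()
fig, ax = plt.subplots()
ax.plot([1, 2])
```

Because this changes defaults rather than a specific Axes, it only affects axes created after the call.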

Add function to calculate tax from a MTR schedule

Basic tax calculation:

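import pandas as pd


def tax_from_mtrs(val, brackets, rates):
    """Calculate tax liability from a marginal tax rate schedule.

    Args:
        val: Value to assess tax on, e.g. wealth or income (list or Series).
        brackets: Left side of each bracket (list or Series).
        rates: Rate corresponding to each bracket.

    Returns:
        Series of tax liabilities, one per element of val.
    """
    df_tax = pd.DataFrame({'brackets': brackets, 'rates': rates})
    # Tax owed on all lower brackets upon reaching each bracket's threshold.
    df_tax['base_tax'] = (
        df_tax.brackets.sub(df_tax.brackets.shift(fill_value=0))
        .mul(df_tax.rates.shift(fill_value=0)).cumsum())
    # Index of the bracket that each value falls in.
    rows = df_tax.brackets.searchsorted(val, side='right') - 1
    income_bracket_df = df_tax.loc[rows].reset_index(drop=True)
    # Tax within the highest applicable bracket, plus tax from lower brackets.
    return (pd.Series(val).sub(income_bracket_df.brackets)
            .mul(income_bracket_df.rates).add(income_bracket_df.base_tax))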
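As a cross-check on the bracket arithmetic, here is a loop-based scalar version that taxes each slice of the value at its bracket's rate. It is a hypothetical reference for testing, not part of microdf, and the schedule below is only an illustration (0% up to $50 million, 2% up to $1 billion, 3% above).

```python
def tax_from_mtrs_scalar(val, brackets, rates):
    """Tax on a single value under a marginal rate schedule."""
    tax = 0.0
    for i, (lower, rate) in enumerate(zip(brackets, rates)):
        # The top bracket has no upper bound.
        upper = brackets[i + 1] if i + 1 < len(brackets) else float("inf")
        if val > lower:
            # Only the slice of val inside this bracket is taxed at rate.
            tax += (min(val, upper) - lower) * rate
    return tax


BRACKETS = [0, 50e6, 1e9]
RATES = [0, 0.02, 0.03]

taxes = [tax_from_mtrs_scalar(v, BRACKETS, RATES) for v in (10e6, 100e6, 2e9)]
```

For these inputs the tax works out to zero, $1 million (2% of the $50 million above the first threshold) and $49 million ($19 million from the middle bracket plus 3% of $1 billion), up to floating-point rounding.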

Add market income

See PSLmodels/Tax-Calculator#2309. Can do it here first.

Add Colab version of example notebooks

These will have installation at the top:

!pip install git+https://github.com/PSLmodels/Tax-Calculator
!pip install git+https://github.com/MaxGhenis/microdf

Could also be used for QuantEcon Notes.

Add disclaimer

PSL criteria (see #13) include:

Projects SHOULD include a disclaimer.

See e.g. Tax-Calculator's disclaimer:

Results will change as Tax-Calculator data and logic improve. A
fundamental reason for adopting open-source methods in this project
is so that people from all backgrounds can contribute to the models
that our society uses to assess economic policy; when
community-contributed improvements are incorporated, the model will
produce different results.

Add n65

Until the new version of Tax-Calculator implements it: PSLmodels/taxdata#243

Add dollar_axis

import matplotlib as mpl
import matplotlib.ticker  # So that mpl.ticker resolves.

def dollar_axis(axis):
    axis.set_major_formatter(
        mpl.ticker.FuncFormatter(lambda x, p: '$' + format(int(x), ',')))
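A short usage sketch, assuming the function is applied to an Axes' yaxis or xaxis object. The plotted data is made up for illustration.

```python
import matplotlib as mpl

mpl.use("Agg")  # Non-interactive backend so the sketch runs anywhere.
import matplotlib.pyplot as plt
import matplotlib.ticker


def dollar_axis(axis):
    """Format an axis's tick labels as whole dollars with comma separators."""
    axis.set_major_formatter(
        mpl.ticker.FuncFormatter(lambda x, p: '$' + format(int(x), ',')))


fig, ax = plt.subplots()
ax.plot([0, 1], [0, 2_500_000])
dollar_axis(ax.yaxis)

# Ask the installed formatter how it would label one tick value.
label = ax.yaxis.get_major_formatter()(1234567, 0)
```

Here label is '$1,234,567'. Values are truncated to integers, so this formatter suits axes where cents do not matter.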

Add inequality metrics

Create a new inequality.py script with inequality metrics:

Share from the top and bottom x percent (make general and also for top 50, 10, 1, 0.1)
Ratio of these, e.g. T10/B50 as used in World Inequality Database
gini (move from utils.py)
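A rough sketch of how these metrics could be computed with weights, assuming nonnegative values and positive weights. The function names and formulas are illustrative, not the existing gini in utils.py.

```python
import numpy as np


def top_share(values, weights, top_frac):
    """Share of the weighted total held by the top top_frac of the population."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(-values)  # Richest first.
    v, w = values[order], weights[order]
    target = top_frac * w.sum()
    # Weight of each record that falls inside the top group; the record
    # straddling the boundary is counted fractionally.
    cum_before = np.cumsum(w) - w
    w_in = np.clip(target - cum_before, 0, w)
    return float((v * w_in).sum() / (v * w).sum())


def top_bottom_ratio(values, weights, top_frac=0.1, bottom_frac=0.5):
    """Ratio of top to bottom shares, e.g. T10/B50."""
    bottom = 1 - top_share(values, weights, 1 - bottom_frac)
    return top_share(values, weights, top_frac) / bottom


def gini(values, weights):
    """Weighted Gini index: 1 minus twice the area under the Lorenz curve."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum_w = np.concatenate([[0], np.cumsum(w)]) / w.sum()
    cum_vw = np.concatenate([[0], np.cumsum(v * w)]) / (v * w).sum()
    # Trapezoid rule over the Lorenz curve.
    area = np.sum(np.diff(cum_w) * (cum_vw[1:] + cum_vw[:-1]) / 2)
    return float(1 - 2 * area)


share_top_half = top_share([1, 2, 3, 4], [1, 1, 1, 1], 0.5)
gini_equal = gini([1, 1, 1], [1, 2, 3])
```

In the example, the top half holds 7 of the 10 total units, and a population with identical values has a Gini index of zero.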

Distributional impact chart

Make this a function:

From https://nbviewer.jupyter.org/github/MaxGhenis/taxcalc-notebooks/blob/master/tcja/distributional_impact_within_groups.ipynb

Check that metrics are in both files in agg()

Currently you can pass a metric that is only in one of the two files in tch.agg().

Add PSL_catalog.json

PSL criteria (#13) include:

A PSL_catalog.json configuration file to be used for cataloging these criteria MUST be included in the project's repository. Specific instructions for creating this file can be found in the Catalog-Builder Documentation.

Add build status and codecov icons to README

Transfer to PSLmodels organization

To comply with PSL criteria (see #13).

See https://help.github.com/en/github/creating-cloning-and-archiving-repositories/duplicating-a-repository#mirroring-a-repository

Add avoidance/evasion elasticity to tax_from_mtrs

e.g. the 8 identified by Saez and Zucman

Refer to percentiles rather than deciles in quantile change charts

e.g. the 90th percentile is clearer than the 9th decile boundary.

mtr() doesn't work with values below the bottom bracket threshold

For example, the below produces NaN values for negative networth.

WARREN_RATES = [0, 0.02, 0.03]  # 0%, 2%, 3%.

WARREN_BRACKETS = [0,
                   50e6,  # First $50 million.
                   1e9]   # Over $1 billion.

mdf.mtr(wealth.networth, WARREN_BRACKETS, WARREN_RATES)

Workaround is to start the brackets below zero, e.g.:

WARREN_BRACKETS = [-np.inf,
                   50e6,  # First $50 million.
                   1e9]   # Over $1 billion.

But it could be nice to have an option to do this automatically in mtr (which is also used in tax_from_mtrs).
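One way such an option could work, sketched under the assumption that the lookup is a searchsorted over the bracket thresholds: clip the bracket index at zero so that values below the bottom threshold are assigned the bottom bracket's rate. The function name and the extend_below parameter are hypothetical, not mdf.mtr's signature.

```python
import numpy as np


def mtr_with_floor(val, brackets, rates, extend_below=True):
    """Marginal tax rate for each value in val.

    If extend_below is True, values below the lowest threshold get the
    lowest bracket's rate, equivalent to starting the brackets at -inf.
    """
    val = np.asarray(val, dtype=float)
    brackets = np.asarray(brackets, dtype=float)
    rates = np.asarray(rates, dtype=float)
    idx = np.searchsorted(brackets, val, side="right") - 1
    if extend_below:
        idx = np.clip(idx, 0, None)
    elif (idx < 0).any():
        raise ValueError("val contains values below the bottom bracket")
    return rates[idx]


WARREN_RATES = [0, 0.02, 0.03]
WARREN_BRACKETS = [0, 50e6, 1e9]
marginal_rates = mtr_with_floor([-5, 10e6, 100e6, 2e9],
                                WARREN_BRACKETS, WARREN_RATES)
```

Clipping the index avoids putting -np.inf into the brackets themselves, which matters if the same brackets are later used to compute bracket widths.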

Projects MUST use a consistent versioning scheme, which SHOULD be semantic versioning. If projects want to use the PSL Package-Builder tool to distribute packages via the Anaconda Cloud PSLmodels channel, there are additional MUST criteria.