pymc-labs / causalpy
A Python package for causal inference in quasi-experimental settings
Home Page: https://causalpy.readthedocs.io
License: Apache License 2.0
Instead, the plots tend to look like this:
[screenshots omitted]
TODO:
Create an in-depth notebook to illustrate regression discontinuity
For Bayesian synthetic control:
[similar to #45]
Do this after #22
Tagging @juanitorduz
We are not removing custom PyMC models. It makes a lot of sense to be able to write custom PyMC models, for maximum flexibility.
But for the majority of cases, a linear model will be used. Because of this, it doesn't make sense to duplicate all the work that Bambi does in terms of specifying custom priors and handling hierarchical model formulae.
So we need to figure out how to support Bambi models in addition.
To fit with the scikit-learn API we need to be able to pass in a blank model object, so the creation of that will have to happen behind the scenes. So maybe we have a `PymcModel` class as a wrapper around a Bambi model.
A better idea would be to have `ModelBuilder` subclass the Bambi model class, not `pm.Model`.
Need to work on the class structure to get this working smoothly.
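As a rough illustration only (not the package's actual design), here is a minimal sketch of what a scikit-learn-flavoured wrapper around a Bambi model could look like; the class name, formula handling, and prior plumbing are all assumptions:

```python
# Hypothetical sketch: a thin fit/predict wrapper around a Bambi model.
# Nothing here reflects CausalPy's actual class structure.
import bambi as bmb

class BambiModel:
    """Wrap a Bambi model behind a scikit-learn-flavoured fit/predict API."""

    def __init__(self, formula, priors=None):
        self.formula = formula          # model formula, e.g. "y ~ x1 + x2"
        self.priors = priors            # optional dict of custom priors
        self.idata = None

    def fit(self, data):
        # Build the model lazily, so users can pass in a "blank" model object
        self.model = bmb.Model(self.formula, data, priors=self.priors)
        self.idata = self.model.fit()   # run MCMC; returns InferenceData
        return self

    def predict(self, data):
        # Posterior predictions for new data, without mutating self.idata
        return self.model.predict(self.idata, data=data, inplace=False)
```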
- `InterruptedTimeSeries` experiment + notebook example
- `DifferenceInDifferences` experiment + notebook example
- `RegressionDiscontinuity` experiment + notebook example
Export from the notebooks in SVG format. Could likely reduce file sizes. Double check I can embed SVG files in the readme.
@ricardoV94 suggested that it could be useful to plot the cumulative sum of absolute impact values. This might be useful in some situations, for example if the intervention causes an increase in volatility.
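A minimal sketch of the idea, with synthetic values standing in for the model's estimated impact (everything here is illustrative):

```python
# Illustrative only: cumulative sum of absolute impact values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scale = np.concatenate([np.ones(50), 3 * np.ones(50)])  # volatility rises mid-series
impact = rng.normal(scale=scale)                        # stand-in impact estimates

plt.plot(np.cumsum(np.abs(impact)))
plt.xlabel("time")
plt.ylabel("cumulative |impact|")  # slope increases where volatility increases
plt.show()
```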
Update:
Originally this was a more complex time series with a seasonal component. But we need a much simpler example. So this will now be a simple linear trend with no seasonality or complex temporal component.
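For example, a minimal version of such a series could be generated like this (parameters purely illustrative):

```python
# Sketch of a simple synthetic series: linear trend plus noise, no seasonality.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
t = np.arange(100)
y = 1.0 + 0.5 * t + rng.normal(scale=2.0, size=t.size)  # intercept + trend + noise
df = pd.DataFrame({"t": t, "y": y})
```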
- `__init__.py` and `setup.py`
- `setup.cfg`
- `manifest.in`
- `.gitignore` to ignore the `dist/` folder
- `twine` as a developer requirement
- `CausalPy` -> `causalpy`
I've experienced clearly sub-optimal weightings when running the `WeightedProportion` custom scikit-learn model. It is likely due to bad optimisation, perhaps getting stuck in local optima. So we need to explore the dependence of the results upon `w_start`.
See CausalPy/causalpy/skl_models.py, lines 22 to 33 (commit 815c14c).
One way to make the results more reliable (more likely to represent the global minimum) is a particle-swarm-type approach where we run the optimisation multiple times, each with a different `w_start`; see the sketch below.
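A minimal sketch of the multi-start idea, assuming the objective is a least-squares fit with non-negative weights that sum to 1 (the objective and constraints are assumptions based on this issue, not the actual `WeightedProportion` code):

```python
# Illustrative multi-start optimisation: run from several random w_start
# values and keep the best solution. Objective/constraints are assumptions.
import numpy as np
from scipy.optimize import minimize

def fit_weights_multistart(X, y, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    n_units = X.shape[1]

    def loss(w):
        return np.mean((X @ w - y) ** 2)           # squared prediction error

    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    bounds = [(0.0, 1.0)] * n_units

    best = None
    for _ in range(n_starts):
        w_start = rng.dirichlet(np.ones(n_units))  # random point on the simplex
        res = minimize(loss, w_start, method="SLSQP",
                       bounds=bounds, constraints=constraints)
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```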
Create an in-depth notebook to illustrate difference in differences
Create an in-depth notebook to illustrate interrupted time series
Split up the following into smaller methods:
- `TimeSeriesExperiment.__init__()`
- `DifferenceInDifferences.__init__()`
- `RegressionDiscontinuity.__init__()`
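One hypothetical shape for such a refactor, with every helper name and signature invented for illustration:

```python
# Hypothetical refactor sketch: __init__ delegates to small, testable helpers.
import pandas as pd

class TimeSeriesExperiment:
    def __init__(self, data, treatment_time):
        self._validate_inputs(data, treatment_time)
        self.datapre, self.datapost = self._split_pre_post(data, treatment_time)

    def _validate_inputs(self, data, treatment_time):
        if treatment_time not in data.index:
            raise ValueError("treatment_time must lie within the data index")

    def _split_pre_post(self, data, treatment_time):
        return (data[data.index < treatment_time],
                data[data.index >= treatment_time])

# usage sketch
df = pd.DataFrame({"y": range(10)})
exp = TimeSeriesExperiment(df, treatment_time=6)
```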
Suggestion from @ricardoV94: in situations where there are multiple valid models, we either have to pick which model we want to use, or we can do Bayesian model averaging. So you can just fit both models, do model comparison (which gives the model weightings), then generate model-averaged predictions.
I think this was done as `posterior_predictive_w` (or similar) in PyMC3, but was not ported to v4.
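A rough sketch of how this could be done today with ArviZ; the variable name `"y"` and the two fitted `InferenceData` objects are assumptions:

```python
# Illustrative Bayesian model averaging via stacking weights from az.compare.
# Assumes idata_a, idata_b are InferenceData objects for two fitted PyMC
# models, each with a log-likelihood and a posterior_predictive group for "y".
import numpy as np
import arviz as az

cmp = az.compare({"a": idata_a, "b": idata_b})   # LOO-based stacking weights
w_a = cmp["weight"]["a"]

ppc_a = az.extract(idata_a, group="posterior_predictive")["y"].values  # (obs, sample)
ppc_b = az.extract(idata_b, group="posterior_predictive")["y"].values

rng = np.random.default_rng(0)
use_a = rng.random(ppc_a.shape[-1]) < w_a        # pick each draw from one model
averaged = np.where(use_a, ppc_a, ppc_b)         # model-averaged predictive draws
```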
- Move `simulate_data.py` into the data folder
- Move the `statsmodels` dependency into requirements-docs.txt
In the example notebooks, there's an error when calling the Seaborn plot code:
ValueError: Could not interpret value `y` for parameter `y`
Maybe related to my Seaborn version?
It would improve the plot if we add the untreated units to the plot (e.g. in light grey).
This will deviate from the `plot` method in the `TimeSeriesExperiment` class. So it's probably best to override this plot method: call the superclass method, then additionally plot the untreated units.
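A minimal sketch of that override; the return value of `plot()` and the attribute names (`datapre`, `datapost`, `control_units`) are assumptions, not CausalPy's actual API:

```python
# Hypothetical override: draw the base plot, then add untreated units in grey.
class SyntheticControl(TimeSeriesExperiment):  # parent class per the issue
    def plot(self):
        fig, ax = super().plot()               # assumed to return (fig, ax)
        for unit in self.control_units:        # assumed list of column names
            ax.plot(self.datapre.index, self.datapre[unit],
                    color="lightgrey", alpha=0.5)
            ax.plot(self.datapost.index, self.datapost[unit],
                    color="lightgrey", alpha=0.5)
        return fig, ax
```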
Suggestion by @ricardoV94. At the moment, users test how well the model fits the pre-treatment data visually. But we should add quantitative metrics.
This could happen in the `fit` method: override `ModelBuilder.fit`, call `super().fit()`, then compute the metrics.
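For example (a sketch only; the `predict` call, the metric choice, and the storage attribute are assumptions about the API):

```python
# Hypothetical sketch: compute a pre-treatment fit metric inside fit().
from sklearn.metrics import r2_score

class ScoredModel(ModelBuilder):              # ModelBuilder per the issue
    def fit(self, X, y, **kwargs):
        result = super().fit(X, y, **kwargs)  # delegate the actual fitting
        y_hat = self.predict(X)               # assumed point-prediction API
        self.score_r2 = r2_score(y, y_hat)    # quantitative pre-treatment fit
        return result
```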
Set up pre-commit checks to enforce code and notebook formatting.
Add a plot where we compare the inferred causal impact to the true causal impact.
At the moment we use `sklearn.linear_model.LinearRegression`, but that is bad because: a) we can overfit, b) regression coefficients could be negative.
What we really want is to constrain coefficients to be positive and to have some kind of penalty on the weights.
We could try:
- `sklearn.linear_model.Ridge` with `positive=True`
- `sklearn.linear_model.Lasso` with `positive=True`
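A quick sketch of both options on toy synthetic-control-style data (all parameters illustrative):

```python
# Positivity-constrained Ridge and Lasso on toy data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))                  # untreated-unit predictors
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])   # non-negative ground truth
y = X @ true_w + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0, positive=True, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.01, positive=True, fit_intercept=False).fit(X, y)
print(ridge.coef_)  # all coefficients constrained to be >= 0
print(lasso.coef_)  # sparsity plus positivity
```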
Create an in-depth notebook to illustrate synthetic control
Known problem for regression discontinuity, possibly for other experiments...
When the treatment column data is integer (0/1) we get an error; it currently only works when the dtype is boolean.
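Until that's fixed, a workaround sketch (the column name is an assumption):

```python
# Workaround sketch: cast an integer-coded treatment column to boolean.
import pandas as pd

df = pd.DataFrame({"treated": [0, 0, 1, 1]})  # integer 0/1 coding
df["treated"] = df["treated"].astype(bool)    # the dtype the package expects
```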
Size/aspect ratio specs for digital:
TODO:
- `black-jupyter` to pre-commit
- `isort` working in notebooks
- `nbqa`
Suggested by @juanitorduz. Would be good to get measures of uncertainty for the non-Bayesian models. Could use:
At the moment, the pair of examples are jarringly different.
Suggestion by @tomicapretto.
`ModelBuilder` is currently in pymc-experimental, but it will be merged into PyMC soon.
Change the code around to use `ModelBuilder`. This repo will then supply a couple of pre-built models, but it also means users can use the `ModelBuilder` class to make their own models.
Need to provide quantitative outputs/reports for synthetic control and interrupted time series.
The Causal Impact package provides these summary stats:
For the frequentist version: add the ability to test for the presence/absence of a causal impact. There is a traditional way of doing this, but we could also envisage a bootstrap on the pre-intervention data.
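A toy sketch of the bootstrap idea; everything here, from the residuals to the window length, is a stand-in:

```python
# Illustrative bootstrap test: resample pre-intervention residuals to build a
# null distribution for the mean post-intervention impact.
import numpy as np

rng = np.random.default_rng(0)
pre_residuals = rng.normal(size=80)   # stand-in for pre-treatment residuals
observed_impact = 1.7                 # stand-in for mean post-treatment impact
post_len = 20                         # length of the post-treatment window

boot = np.array([
    rng.choice(pre_residuals, size=post_len, replace=True).mean()
    for _ in range(10_000)
])
p_value = np.mean(np.abs(boot) >= abs(observed_impact))  # two-sided tail prob
print(f"bootstrap p-value: {p_value:.4f}")
```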
TODO
There is at least one instance in the readme, to be replaced with “CausalPy”.
…the `mu`, not the `yhat`.
A `summary` method for the Bayesian model: might be best to create some get methods to avoid repeating this task multiple times.
This issue will likely be touched by a number of other issues as we flesh out the quantitative outputs and work through more examples. But it is important to go beyond the slightly vague 'causal impact' terminology and be more specific about:
At the moment, all the examples show very clear causal impacts. But it would be nice to add an example without any causal impact, particularly if it demonstrates how one can be fooled into thinking there is an effect when there is not.
(Suggestion by @ricardoV94)
At the moment, the assumption is that the units above the threshold are treated. But this absolutely is not always going to be true, so we need to allow for this.
Option 1: a string kwarg, setting `threshold_function='<='` or `threshold_function='>='`.
Option 2: allow users to pass a function via a kwarg, e.g. `threshold_function=np.greater_equal` or `threshold_function=np.less_equal`.
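A minimal sketch of Option 2; the helper name mirrors `_is_treated` from the codebase, but this standalone signature is an assumption:

```python
# Illustrative Option 2: the comparison operator is a user-supplied function.
import numpy as np

def is_treated(running_variable, threshold, threshold_function=np.greater_equal):
    """Boolean treatment indicator; pass np.less_equal to flip which side
    of the cutoff receives treatment."""
    return threshold_function(running_variable, threshold)

x = np.array([-1.0, 0.2, 0.5, 1.3])
print(is_treated(x, 0.5))                  # treated at/above the threshold
print(is_treated(x, 0.5, np.less_equal))   # treated at/below the threshold
```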
Do this on the synthetic regression discontinuity datasets, for both PyMC and skl. Append it as another analysis example.
Things to think about:
- `_is_treated` uses `np.greater_equal`
- The `treated` column in the dataset: this presents some redundancy, because all we need is the running variable and the `_is_treated` helper function. That function is there because we need a way of working out which data are treated when we interpolate for `xpred`. One solution would be to remove `treated` as a column of data and instead derive it from the running variable and `_is_treated`. However, `treated` still needs to appear in the model formula, so we would have to add some explanatory text in the notebooks.
- `discontinuity_at_threshold`
- [Optional] Do we want to add a shaded region above/below the treatment threshold?
#14 improved the synthetic control example by moving from a linear regression model to a `Ridge` model (with a positive-weights constraint). But ideally we can use either `Lasso` or an actual model with positive weights that sum to a desired value (normally 1, but higher values allow for some level of extrapolation).
See the example in the skl_demos.ipynb notebook.
Suggestion by @juanitorduz... Rather than just applying the package to synthetic datasets, it would be good to apply the methods to classic datasets / causal inference problems. This also gives people some faith that the package produces sensible results, or at least results similar to other implementations.
See https://matheusfacure.github.io/python-causality-handbook/16-Regression-Discontinuity-Design.html#
its_pymc.ipynb
its_skl.ipynb
This will almost certainly require code changes. At the moment there is a hard-wired constraint that there is just a single pre and post observation.
#2 added a very simple interrupted time series example with no predictors.
But it would be good to add another example with more temporal structure. This would then be well suited to an actual time series model, here an AR model.
- `generate_time_series_data` (rename this)
- `AutoRegressive` subclass of `CausalBase`
TODO:
- scikit-learn or sktime model. But pmdarima actually looks very promising: it wraps statsmodels but provides the fit/predict API (see the sketch below).
- pymc model
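A quick sketch of the pmdarima route; the model order and the series are purely illustrative:

```python
# Illustrative AR-style counterfactual forecast with pmdarima's fit/predict API.
import numpy as np
from pmdarima.arima import ARIMA

rng = np.random.default_rng(1)
y_pre = np.cumsum(rng.normal(size=100))       # stand-in pre-treatment series

model = ARIMA(order=(1, 0, 0))                # a simple AR(1)
model.fit(y_pre)
counterfactual = model.predict(n_periods=20)  # forecast over the post period
```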