mlresearchatosram / cause2e

The cause2e package provides tools for performing an end-to-end causal analysis of your data. Developed by Daniel Grünbaum (@dg46).

Home Page: https://gitlab.com/causal-inference/working-group/-/wikis/home

License: MIT License

Python 99.43% Makefile 0.26% Batchfile 0.31%
causal-analysis dowhy domain-knowledge tetrad python causal-reasoning causal-inference causal-models causality

cause2e's Introduction

Getting started:

The easiest way of learning about cause2e's functionality and starting your own causal analyses is to check out this example notebook, which can be easily adapted to fit the needs of your custom analysis, and its resulting report file. Additional notebooks with examples of more specific functionality are also provided. If you are new to graph-based causal inference, this presentation from CDSM21 might help you out.

Contributing:

Your contributions are more than welcome! If you have an idea for a feature, please open an issue or (even better) implement it and make a pull request.

Overview:

The cause2e package provides tools for performing an end-to-end causal analysis of your data. If you have data and domain knowledge about the data generating process, it allows you to:

  • learn a graphical causal model of the data generating process
  • identify a statistical estimand for the causal effect that one variable has on another variable
  • estimate the effect with various statistical techniques
  • check the robustness of your results with respect to changes in the causal model

For analyzing the whole system at once after learning the causal graph, you can use a single command to

  • visualize your qualitative domain knowledge
  • estimate all possible direct, indirect and overall causal effects between your variables via do-calculus-backed linear regression
  • receive a ranking of the strongest causal effects
  • visualize all causal effects in heatmaps
  • validate your model against a priori known quantitative causal effects
  • receive a pdf report containing all relevant information
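
To make the terms direct, indirect and overall effect concrete, here is a minimal sketch of how they can be read off linear regressions in a simple mediation setting. It uses plain NumPy rather than cause2e's own API, and the graph (X -> M -> Y plus X -> Y), the coefficients and the helper `ols_slopes` are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical linear data generating process: X -> M -> Y and X -> Y.
x = rng.normal(size=n)
m = 0.8 * x + rng.normal(size=n)
y = 0.5 * x + 1.5 * m + rng.normal(size=n)


def ols_slopes(target, *regressors):
    """Return the slope coefficients of an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs[1:]


# Overall effect of X on Y: there is no backdoor path, so regress Y on X alone.
total = ols_slopes(y, x)[0]
# Direct effect: hold the mediator M fixed by adding it as a regressor.
direct = ols_slopes(y, x, m)[0]
# Indirect effect: the part of the overall effect that is transmitted via M.
indirect = total - direct

print(f"overall={total:.2f}, direct={direct:.2f}, indirect={indirect:.2f}")
```

With the coefficients chosen above, the estimates should land close to the true values of 1.7 (overall), 0.5 (direct) and 0.8 * 1.5 = 1.2 (indirect).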

The main contribution of cause2e is the integration of two established causal packages that have so far been separate and cumbersome to combine:

  • Causal discovery methods from the py-causal package [1], which is a Python wrapper around parts of the Java TETRAD software. It provides many algorithms for learning the causal graph from data and domain knowledge.

  • Causal reasoning methods from the DoWhy package [2], which is the current standard for the steps of a causal analysis starting from a known causal graph and data:

    • Algebraically identifying a statistical estimand for a causal effect from the causal graph via do-calculus.
    • Using statistical estimators to actually estimate the causal effect.
    • Performing robustness tests to check how sensitive the estimate is to model misspecification and other errors.

Structured API:

cause2e provides an easy-to-use API for performing an end-to-end causal analysis without having to worry about fitting together different libraries and data structures for causal discovery and causal reasoning:

  • The StructureLearner class for causal discovery can

    • read and preprocess data
    • accept domain knowledge in a simple data format
    • learn the causal graph using py-causal algorithms
    • visualize the causal graph and the influence of the specified domain knowledge on the result
    • manually postprocess the resulting graph in case you want to add, delete or reverse some edges
    • check if the graph is acyclic and respects the domain knowledge
    • save the graph to various file formats
  • The Estimator class for causal reasoning can

    • read data and imitate the preprocessing steps applied by the StructureLearner
    • load the causal graph that was saved by the StructureLearner
    • perform the above-mentioned causal reasoning steps suggested by the DoWhy package
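
The check that a graph is acyclic and respects the domain knowledge can be pictured with a small sketch in plain Python. This is not the StructureLearner's implementation; the function names and the representation of graphs and knowledge as lists of (source, target) pairs are assumptions for this illustration.

```python
from collections import defaultdict, deque


def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be sorted topologically."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        children[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Nodes on a cycle never reach indegree 0 and are therefore never visited.
    return visited == len(nodes)


def respects_knowledge(edges, required=(), forbidden=()):
    """Check that all required edges are present and no forbidden edge is."""
    edge_set = set(edges)
    return set(required) <= edge_set and not (set(forbidden) & edge_set)


graph = [("season", "temperature"), ("temperature", "sales")]
print(is_acyclic(graph))
print(respects_knowledge(graph, forbidden=[("sales", "season")]))
```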

Additionally, cause2e offers helper classes for handling all paths to your data and output, representing domain knowledge, as well as bookkeeping, ranking, visualization and validation of the results of a multi-effect analysis.

Documentation:

For detailed documentation of the package, please refer to mlresearchatosram.github.io/cause2e. The documentation has been generated from Python docstrings via Sphinx.

Outlook:

We are planning to integrate the causal discovery toolbox [3] as a second collection of causal discovery algorithms. In the spirit of end-to-end causal analysis, it would also be desirable to include causal representation learning before the discovery step (e.g. for image data), or causal reinforcement learning after having distilled a valid causal model that delivers interventional distributions.

Installation:

First, install py-causal by following these instructions. If you run into trouble with Java on Windows, check out this comment.

You can then install cause2e from PyPI:

pip install cause2e

You can also install it directly from this GitHub repository to receive the newest version:

pip install dowhy -U
pip install ipython -U
pip install jinja2 -U
pip install pillow -U
pip install pyarrow -U
pip install seaborn -U

pip install git+https://github.com/MLResearchAtOSRAM/cause2e

Afterwards, try to run the minimal example notebook to check that everything works.

If you want to clone the repository into a folder for development on your local machine, please navigate to the folder and run:

git clone https://github.com/MLResearchAtOSRAM/cause2e

Disclaimer:

cause2e is not meant to replace either py-causal or DoWhy. Our goal is to make it easier for researchers to string together causal discovery and causal reasoning with these libraries. If you are only interested in causal discovery, it is preferable to directly use py-causal or the TETRAD GUI. If you are only interested in causal reasoning, it is preferable to directly use DoWhy.

Citation:

If you are using cause2e in your work, please cite:

Daniel Grünbaum (2021). cause2e: A Python package for end-to-end causal analysis. https://github.com/MLResearchAtOSRAM/cause2e

References:

[1] Chirayu (Kong) Wongchokprasitti, Harry Hochheiser, Jeremy Espino, Eamonn Maguire, Bryan Andrews, Michael Davis, & Chris Inskip. (2019, December 26). bd2kccd/py-causal v1.2.1 (Version v1.2.1). Zenodo. http://doi.org/10.5281/zenodo.3592985

[2] Amit Sharma, Emre Kiciman, et al. DoWhy: A Python package for causal inference. 2019. https://github.com/microsoft/dowhy

[3] Kalainathan, D., & Goudet, O. (2019). Causal Discovery Toolbox: Uncover causal relationships in Python. arXiv:1903.02278. https://github.com/FenTechSolutions/CausalDiscoveryToolbox

cause2e's Issues

Sensitivity Analysis

Currently we only get point estimates of the causal effects, assuming that our causal graph is flawless. If the graph is wrong, we have no idea how bad our estimation results can get.

It would lend additional credibility to our analyses if we could specify multiple possible graphs (e.g. because we are not sure about the presence of one edge), estimate the causal effects based on each of the graphs and return something like a confidence interval instead of the current point estimates. A visual representation could be added to the automated pdf report.
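
A minimal sketch of this idea, assuming linear estimation and two candidate graphs that differ in a single edge. It is not the prototype mentioned in this issue; the data, the helper `linear_effect` and the mapping from graphs to adjustment sets are hypothetical. Note that the result is a range of estimates across graphs, not a statistical confidence interval.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: Z influences both X and Y; the true effect of X on Y is 1.0.
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)
data = {"x": x, "y": y, "z": z}


def linear_effect(data, treatment, outcome, adjustment):
    """Coefficient of the treatment in an OLS fit that also includes the adjustment set."""
    columns = [data[treatment]] + [data[name] for name in adjustment]
    design = np.column_stack([np.ones(len(data[outcome])), *columns])
    coefs, *_ = np.linalg.lstsq(design, data[outcome], rcond=None)
    return coefs[1]


# Each candidate graph implies its own backdoor adjustment set for X -> Y.
candidate_graphs = {
    "with edge Z -> X": ["z"],  # Z is a confounder and must be adjusted for
    "without edge Z -> X": [],  # Z is not a confounder under this graph
}

estimates = {
    name: linear_effect(data, "x", "y", adjustment)
    for name, adjustment in candidate_graphs.items()
}
low, high = min(estimates.values()), max(estimates.values())
print(f"effect of x on y lies between {low:.2f} and {high:.2f} across graphs")
```

A wide range, as in this example, signals that the conclusion depends strongly on the uncertain edge.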

I have already implemented a prototype solution; I just need to refactor and integrate it properly.

Exception handling

Currently, we have only minimal handling of exceptions. Given that causal analyses with cause2e have a mostly fixed structure, it would be helpful for the users to receive clearer feedback, at least whenever their input for the main analysis methods cannot be processed as desired.

Steps:

  • identify frequent sources of errors
  • write custom exception types for them to avoid confusion with other errors
  • handle each exception in a way that enables the users to adapt their input accordingly (or inform them whenever the desired functionality is implemented)
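
The steps above could be sketched as follows. All class and function names are hypothetical and do not exist in cause2e; `estimate_effect` is a stand-in that only validates its input.

```python
class Cause2eError(Exception):
    """Hypothetical base class so that users can catch all package errors at once."""


class GraphNotLearnedError(Cause2eError):
    """Raised when an estimation step is requested before a causal graph exists."""


class InvalidEstimationMethodError(Cause2eError):
    """Raised when the requested estimation method is not supported."""

    def __init__(self, method, allowed):
        self.method = method
        self.allowed = sorted(allowed)
        super().__init__(
            f"Unknown estimation method '{method}'. "
            f"Choose one of: {', '.join(self.allowed)}."
        )


# Illustrative whitelist; the real set of supported methods may differ.
ALLOWED_METHODS = {"backdoor.linear_regression"}


def estimate_effect(graph, method):
    """Stand-in for a main analysis method that fails early with clear feedback."""
    if graph is None:
        raise GraphNotLearnedError("No causal graph available. Learn or load a graph first.")
    if method not in ALLOWED_METHODS:
        raise InvalidEstimationMethodError(method, ALLOWED_METHODS)
    return f"estimating with {method}"


try:
    estimate_effect(graph=None, method="backdoor.linear_regression")
except Cause2eError as error:
    print(f"Input problem: {error}")
```

Deriving every custom exception from one base class lets users distinguish input problems from unrelated errors with a single `except` clause.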

Windows installation

The continuous integration pipeline works on an Ubuntu image, but changing to a Windows image results in errors (the first is Java-related, followed by an issue with multithreading in pytest when the unit tests are run). I am using cause2e on a Windows machine for my own analyses, so I know that it can work. I will fix the pipeline so that CI ensures that cause2e is in a usable state for Linux and Windows users.

Test coverage

Test coverage should be increased to at least 80% (currently about 40%). The full end-to-end analysis from reading data to generating the summary pdf of the analysis should be covered in the tests.

PySpark dependency

Cause2e is a lightweight package except for the dependency on PySpark. Can we make this optional, given that most users will not really need it?
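
One common way to make such a dependency optional is to import it lazily, only inside the code path that needs it. The sketch below illustrates that pattern; the helper `require_optional` is hypothetical and not part of cause2e.

```python
import importlib


def require_optional(module_name, feature):
    """Import an optional dependency on demand, failing with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from error


# A Spark-based reader would call this at the start of its body, e.g.
#     pyspark = require_optional("pyspark", "Reading data via Spark")
# so that users who never touch Spark do not need PySpark installed.
json_module = require_optional("json", "JSON export")
print(json_module.dumps({"ok": True}))
```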

Continuous integration failing

Some tests do not pass, likely related to pandas updating to 2.0.0 (breaking changes).
The warning says that we are passing a set as an indexer to a DataFrame, which is no longer supported: pandas-dev/pandas#42825
Try fixing it by requiring an older pandas version in requirements.txt.
If this does not work, manually either convert the sets to lists in our code or patch Loc as suggested here: pandas-dev/pandas#42825 (comment)
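
The set-to-list conversion could look like the following sketch. The DataFrame and column names are made up for illustration and are not taken from the cause2e code base.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
wanted = {"a", "c"}  # a set of column names, e.g. the result of a set operation

# Sets are unordered, so fix the column order explicitly when converting to a list.
subset = df[sorted(wanted)]
subset_loc = df.loc[:, sorted(wanted)]

print(list(subset.columns))
```

Sorting is one way to obtain a deterministic column order; if the original column order matters, a list comprehension over `df.columns` that keeps only the wanted names works as well.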

Improve automated reporting

The automated pdf report is a helpful summary of the causal analysis. However, it has not been updated since the original proof of concept and could be rendered more visually pleasing and easier to understand.

  • Add short descriptions to every page (e.g. explain the color coding in the graphs).
  • Check if we can group all heatmaps on one page, all full tables on one page etc. for an easier overview.
  • Polish axis labels and figure titles wrt. font size and text placement.
  • Check if additional information should be added to the report.

Check variable names when passing domain knowledge

I have had recurring issues caused by typos in the variable names when passing domain knowledge.

These errors are hard to find manually after you've made them, but it is trivial to detect them in an automated way: For every variable that is used when passing domain knowledge, check if it is actually the name of a data column; otherwise, raise an error.

This check should be part of the "set_knowledge" method of the learner.
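
A minimal sketch of such a check, assuming the domain knowledge is given as (source, target) pairs of variable names. The function name is hypothetical, and the typo suggestion via `difflib` is an optional extra that goes beyond what this issue asks for.

```python
import difflib


def check_knowledge_variables(knowledge_edges, data_columns):
    """Raise a ValueError if a variable used in the domain knowledge is not a data column."""
    columns = list(data_columns)
    used = {name for edge in knowledge_edges for name in edge}
    problems = []
    for name in sorted(used - set(columns)):
        # Suggest the closest column name to help spot typos.
        suggestion = difflib.get_close_matches(name, columns, n=1)
        hint = f" (did you mean '{suggestion[0]}'?)" if suggestion else ""
        problems.append(f"'{name}'{hint}")
    if problems:
        raise ValueError("Unknown variables in domain knowledge: " + ", ".join(problems))


columns = ["temperature", "pressure", "yield"]
check_knowledge_variables([("temperature", "yield")], columns)  # passes silently

try:
    check_knowledge_variables([("temperatur", "yield")], columns)
except ValueError as error:
    print(error)
```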

Bug in edge analysis

If no remaining allowed edges exist, the result manager throws an error when saving the edge analysis to png. Might also happen for other boundary cases. Will fix it by adding an additional check for empty information.

Batch methods for non-linear estimation

The functionality for estimating multiple causal effects at once and triggering cause2e's reporting capabilities is currently only implemented for linear estimation methods. At least for the ATE, it should be realistic to expand the functionality to some non-linear methods.

Replace py-causal by causal-learn?

Causal-learn "is a python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms, which is a Python translation and extension of Tetrad."

If it provides the same functionality as the (Java wrapper) py-causal library, but with the algorithms actually implemented in Python, this would free cause2e of any Java dependencies. I need to check out the package in more detail to see if it is that easy.

Allow Java VM restart

Currently, the Java VM cannot be restarted after it is shut down.
This limitation can be avoided if we run all tasks that require the VM in a separate process using the multiprocessing module: LeeKamentsky/python-javabridge#88
This would also allow us to run many causal discovery procedures in parallel. A downside is that the start of a new process adds additional overhead that will affect the runtime of an analysis.
