causy-dev / causy

Causal discovery made easy.

Home Page: https://causy-dev.github.io/causy/

License: MIT License

Topics: causality, pc-algorithm, python, pytorch, causal-discovery, causal-inference

causy's Introduction

Warning

causy is a prototype. Please report any issues and be mindful when using it in production.

causy

causy is a command line tool that lets you apply causal inference methods such as causal discovery and causal effect estimation. You can adjust causal discovery algorithms with pipelines that are easy to use, extend, and maintain. causy is built on PyTorch, which allows you to run the algorithms on CPUs as well as GPUs.

causy workspaces allow you to manage your data sets, algorithm adjustments, and (hyper-)parameters for your experiments.

causy UI allows you to look at your resulting graphs in the browser and gain further insights into every step of the algorithms.

You can find the documentation at https://causy-dev.github.io/causy/.

Installation

Currently, we support Python 3.11 and 3.12. To install causy, run

pip install causy

Usage

causy can be used via the CLI with workspaces, or directly from code.

Workspaces Quickstart

See options for causy workspace

causy workspace --help

Create a new workspace and interactively configure your pipeline, data loader, and experiments. Your input data should be a JSON file stored in the same directory.

causy workspace init

Add an experiment

causy workspace experiment add your_experiment_name

Update a variable in the experiment

causy workspace experiment update-variable your_experiment_name your_variable_name your_variable_value 

Run multiple experiments

causy workspace execute 

Compare the graphs of the experiments with different variable values via a matrix plot

causy workspace diff

Compare the graphs in the UI, switch between different experiments and visualize the causal discovery process

causy ui

Usage via Code

Use a default algorithm

from causy.algorithms import PC
from causy.graph_utils import retrieve_edges

model = PC()
# Load the observational data; each dict is one sample.
model.create_graph_from_data(
    [
        {"a": 1, "b": 0.3},
        {"a": 0.5, "b": 0.2}
    ]
)
# Start from the fully connected skeleton, then run the pipeline.
model.create_all_possible_edges()
model.execute_pipeline_steps()
edges = retrieve_edges(model.graph)

for edge in edges:
    print(
        f"{edge[0].name} -> {edge[1].name}: {model.graph.edges[edge[0]][edge[1]]}"
    )

Supported algorithms

Currently, causy supports the following algorithms:

  • PC (Peter-Clark)
    • PC - the original PC algorithm without any modifications: causy.algorithms.PC
    • ParallelPC - a parallelized version of the PC algorithm: causy.algorithms.ParallelPC
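
Assuming ParallelPC exposes the same interface as PC (both live in causy.algorithms per the list above), the earlier example should only need a different import; a sketch:

from causy.algorithms import ParallelPC
from causy.graph_utils import retrieve_edges

# Same workflow as the PC example above, with the parallelized variant.
model = ParallelPC()
model.create_graph_from_data(
    [
        {"a": 1, "b": 0.3},
        {"a": 0.5, "b": 0.2}
    ]
)
model.create_all_possible_edges()
model.execute_pipeline_steps()
edges = retrieve_edges(model.graph)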

Supported pipeline steps

Detailed information about the pipeline steps can be found in the API Documentation.

Dev usage

Setup

We recommend using Poetry to manage the dependencies. To install Poetry, follow the instructions at https://python-poetry.org/docs/#installation.

Install dependencies

poetry install

Execute tests

poetry run python -m unittest

Funded by the Prototype Fund from March 2024 until September 2024


causy's People

Contributors

dependabot[bot], lilithwittmann, this-is-sofia


causy's Issues

Assumptions: Test how to best introduce warnings / guides

Example: The RKI data in the project is not i.i.d. (independent and identically distributed) because postal codes (PLZs) that lie close together are highly correlated. Therefore, the results of the PC algorithm can be heavily biased.

Test how to best integrate this information. Ideas:

  • Use available tests for assumptions (e.g., whether the data are i.i.d. or stationary) and throw warnings (see the sketch below)
  • Suggest algorithms that do not require the violated assumption, if available
  • If no algorithm is available: offer different heuristics to account for the assumption violation, but indicate that the results are no longer reliable. Intuitively, this could be done by showing the outputs of different heuristics and documenting their weaknesses, as well as running robustness tests whenever possible.
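
To illustrate the warning idea, a deliberately simple sketch; check_iid_assumption, the lag-1 autocorrelation test, and the 0.2 threshold are all invented here and are not causy API:

import warnings
import numpy as np

def check_iid_assumption(samples, threshold=0.2):
    """Hypothetical helper: warn if a variable shows strong lag-1
    autocorrelation, which hints that the i.i.d. assumption is violated."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    denom = (x ** 2).sum()
    if denom == 0:
        return
    autocorr = (x[:-1] * x[1:]).sum() / denom
    if abs(autocorr) > threshold:
        warnings.warn(
            f"lag-1 autocorrelation {autocorr:.2f} exceeds {threshold}; "
            "samples may not be i.i.d., PC results can be biased"
        )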

Add advanced rules for conflict resolution in collider rule stage

At the moment, we have implemented basic conflict resolution strategies for the collider rule stage in the PC algorithm (orientation rules -> collider test), namely (sketched below):

  • KEEP_FIRST: If an edge in an unshielded triple has been oriented as a collider and a second unshielded triple later attempts to orient it again, the first orientation is kept and the other edge in the second unshielded triple is not oriented.
  • KEEP_LAST: Works analogously with the order flipped.
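
A rough sketch of how such strategies could be dispatched; ConflictResolution and resolve_collider_conflict are hypothetical names, not the current causy implementation:

from enum import Enum

class ConflictResolution(Enum):
    KEEP_FIRST = "keep_first"  # keep the earlier collider orientation
    KEEP_LAST = "keep_last"    # let the later orientation win

def resolve_collider_conflict(existing, candidate, strategy):
    # Hypothetical helper: returns the orientation to keep when two
    # unshielded triples try to orient the same edge differently.
    if strategy is ConflictResolution.KEEP_FIRST:
        return existing
    return candidate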

However, there are several enhancements that can be implemented:

Direct Effect Estimation For Graphs With Assumption Violations

Causal effect estimation relies heavily on identifying a valid adjustment set. For example, when estimating direct effects in DAGs under causal sufficiency and linearity assumptions, regressing the effect variable on all parents of the effect variable provides an unbiased estimator of the true causal effect. However, whenever an edge is wrongly oriented, direct effect estimation can become heavily biased, which we observe on real data and on toy models with built-in assumption violations. Wrong orientations can occur due to small-sample effects (statistical tests returning a wrong result), violations of the faithfulness assumption, or the PC algorithm being applied to data with hidden confounding (where the FCI algorithm should have been used instead). Think about how to indicate this uncertainty whenever there are orientation conflicts.

For example, consider this toy model:

model = IIDSampleGenerator(
    edges=[
        SampleEdge(NodeReference("A"), NodeReference("C"), 1),
        SampleEdge(NodeReference("B"), NodeReference("C"), 2),
        SampleEdge(NodeReference("A"), NodeReference("D"), 3),
        SampleEdge(NodeReference("B"), NodeReference("D"), 1),
        SampleEdge(NodeReference("C"), NodeReference("D"), 1),
        SampleEdge(NodeReference("B"), NodeReference("E"), 4),
        SampleEdge(NodeReference("E"), NodeReference("F"), 5),
        SampleEdge(NodeReference("B"), NodeReference("F"), 6),
        SampleEdge(NodeReference("C"), NodeReference("F"), 1),
        SampleEdge(NodeReference("D"), NodeReference("F"), 1),
    ],
)
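
For reference, the regression estimator described above is easy to sketch in plain numpy; direct_effect is a hypothetical helper, not causy's estimator:

import numpy as np

def direct_effect(data, cause, effect, parents_of_effect):
    # OLS-regress the effect on all of its parents (including `cause`);
    # unbiased only under causal sufficiency, linearity, and a correct graph.
    X = np.column_stack([data[p] for p in parents_of_effect])
    X = np.column_stack([X, np.ones(X.shape[0])])  # intercept column
    coef, *_ = np.linalg.lstsq(X, np.asarray(data[effect]), rcond=None)
    return coef[parents_of_effect.index(cause)]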

Create loops over pipeline steps

Clean up create_pipeline and add the following features:

  • using different generators for each rule
  • iterating over pipeline steps until exit condition

Update config accordingly.
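
One possible shape for the iteration feature; a hedged sketch in which run_with_loop and the convention that steps report whether they changed the graph are both assumptions:

def run_with_loop(steps, graph, max_iterations=100):
    # Iterate over pipeline steps until an exit condition holds: here,
    # a fixed point where no step changed the graph (or a hard cap).
    for _ in range(max_iterations):
        changed = False
        for step in steps:
            changed |= bool(step.apply(graph))  # assumes steps report changes
        if not changed:
            break
    return graph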

Implement edge type enum per algorithm

Currently, we give our edges meaning implicitly, based on the context of the algorithms they are used in.

But edges have specific, different meanings in different algorithms. One option would be to find a common superset across those algorithms. Another option would be to have one EdgeType enum class per algorithm.

This could look something like this:

class PCEdgeTypes(EdgeType):
    DIRECTED_EDGE = "directed"
    UNDIRECTED_EDGE = "undirected"

    # Bare member names are used here because PCEdgeTypes does not exist
    # yet while its own class body is being executed.
    @pre
    @on_updated([UNDIRECTED_EDGE], [DIRECTED_EDGE])
    @classmethod
    def check_update_of_undirected_edge_possible(cls, node_a, node_b, graph, operations):
        pass

This also means that a PipelineStep needs to state explicitly which edge types it requires, and that the edge type enum can be configured when a model is created.

Pre-knowledge: Allow edges to be protected

Currently, our data structure does not support protecting edges from deletion.

Protecting edges is needed so that we can incorporate pre-knowledge into our graphs.

Therefore, we need to (see the sketch after this list):

  • add a protected field to our Edge class
  • check before modifying or deleting an edge whether the operation is allowed
  • show the user a warning and record the attempt in our edge history if we try to remove a protected edge
  • add an option to incorporate pre-knowledge (#9)
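
A minimal sketch of the protection mechanism, assuming a simplified Edge class (causy's real Edge differs):

import warnings
from dataclasses import dataclass, field

@dataclass
class Edge:
    # Hypothetical, simplified stand-in for causy's Edge class.
    u: str
    v: str
    protected: bool = False
    history: list = field(default_factory=list)

def remove_edge(edges, edge):
    # Check before deletion whether the operation is allowed; warn and
    # record the attempt in the edge history if the edge is protected.
    if edge.protected:
        edge.history.append("deletion attempted on protected edge; skipped")
        warnings.warn(f"edge {edge.u}-{edge.v} is protected and was not removed")
        return False
    edges.remove(edge)
    return True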

Fix IID Sample generator bug

It currently generates the data based on the initial value and not based on the current step. Also, we later don't want initial values at all, but will dynamically compute the order such that no variable depends on a variable that has not been assigned a value yet. But for now, initial values are acceptable; the generator should first work properly.
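
The dynamic ordering mentioned above amounts to a topological sort of the sample-generation DAG; a minimal, causy-independent sketch using Kahn's algorithm:

from collections import defaultdict, deque

def generation_order(edges):
    """Return node names so that every node appears after all its parents.
    `edges` is a list of (parent, child) pairs."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order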

Fix misleading naming of path and edge types

Currently, we use the following functions to check the following tasks:

directed_edge_exists(v, w): checks if there is a directed edge from node v to node w or a bidirected edge between two nodes v and w
only_directed_edge_exists(v, w): checks if there is a directed edge from node v to node w
directed_path_exists(v, w): checks if a directed path from node v to node w exists, not containing any bidirected edges
path_exists(v, w): checks if a path exists between node v and node w on the underlying undirected graph, ignoring edge types

Think about a better and coherent naming. First ideas:

path_exists -> orientation_agnostic_path_exists
directed_edge_exists -> directed_from_to_or_bidirected_edge_exists
only_directed_edge_exists -> directed_edge_exists
directed_path_exists - ok.

Also add better documentation of the concept of inducing paths.
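
If the renaming goes ahead, transition shims could keep old call sites working; a sketch against an abstract graph object. Note that directed_edge_exists changes meaning under the proposal (old: directed or bidirected; new: strictly directed), so that one cannot be a silent alias and would need a deprecation cycle:

def orientation_agnostic_path_exists(graph, v, w):
    # Proposed new name; delegates to the current path_exists.
    return graph.path_exists(v, w)

def directed_from_to_or_bidirected_edge_exists(graph, v, w):
    # Proposed new name for the current (broad) directed_edge_exists.
    return graph.directed_edge_exists(v, w)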

Test that fails in current setup

Check why:

def test_second_toy_model_example(self):
    rdnv = self.seeded_random.normalvariate
    model = IIDSampleGenerator(
        edges=[
            SampleEdge(NodeReference("A"), NodeReference("C"), 1),
            SampleEdge(NodeReference("B"), NodeReference("C"), 2),
            SampleEdge(NodeReference("A"), NodeReference("D"), 3),
            SampleEdge(NodeReference("B"), NodeReference("D"), 1),
            SampleEdge(NodeReference("C"), NodeReference("D"), 1),
            SampleEdge(NodeReference("B"), NodeReference("E"), 4),
            SampleEdge(NodeReference("E"), NodeReference("F"), 5),
            SampleEdge(NodeReference("B"), NodeReference("F"), 6),
            SampleEdge(NodeReference("C"), NodeReference("F"), 1),
            SampleEdge(NodeReference("D"), NodeReference("F"), 1),
        ],
        random=lambda: rdnv(0, 1),
    )

    sample_size = 100000
    test_data, sample_graph = model.generate(sample_size)

    tst = PCStable()
    tst.create_graph_from_data(test_data)
    tst.create_all_possible_edges()
    tst.execute_pipeline_steps()

Graph rendering: Prevent overlapping edge ends

At the moment, multiple edges can end at the same point of a node, so it can be hard to tell which end belongs to which edge. This is a problem when edges are of different types (partially directed, bidirected, undirected) and might lead to confusion.

Possible solutions would be curved edges, spacing between the edge ends, or at least including the edge type in the widget that pops up when you click on the edge.
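
For the spacing idea, attachment points can be distributed around a circular node; a geometry-only sketch, not tied to causy's renderer:

import math

def attachment_points(cx, cy, radius, n_edges):
    # Spread n_edges attachment points evenly around a circular node so
    # edge ends no longer overlap at a single point.
    return [
        (cx + radius * math.cos(2 * math.pi * i / n_edges),
         cy + radius * math.sin(2 * math.pi * i / n_edges))
        for i in range(n_edges)
    ]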

Allow multiple edges of different types between nodes

The output of the FCI algorithm is a MAG with at most one edge between two nodes. However, to properly test the algorithm, it is helpful to test the inducing_path_exists function on ADMGs with possibly two different edge types between two nodes, a directed edge representing a direct effect and a bidirected edge representing a hidden confounder. (In a MAG, there would just be a directed edge in this case.)

Therefore, we should think about whether to implement this option. For now, we exclude tests that would need such an option in order to return the desired results, for example:

def test_is_path_inducing_multiple_edges(self):
    graph = GraphManager()
    node1 = graph.add_node("test1", [1, 2, 3])
    node2 = graph.add_node("test2", [1, 2, 3])
    node3 = graph.add_node("test3", [1, 2, 3])
    graph.add_bidirected_edge(node1, node2, {"test": "test"})
    graph.add_bidirected_edge(node2, node3, {"test": "test"})
    graph.add_directed_edge(node2, node3, {"test": "test"})
    path = [(node1, node2), (node2, node3)]
    self.assertTrue(graph._is_path_inducing(path, node1, node3))

Implement Skeleton Generator Concept

Currently, the graph is initialised with one hard-coded skeleton (create_all_possible_edges). This should be configurable so that including prior knowledge becomes easy. Also, when initialising the pre-configured algorithms, you should not have to initialise the graph explicitly anymore.
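
One possible shape for the concept; SkeletonGenerator, FullyConnectedSkeleton, and the graph.nodes / graph.add_edge interface are hypothetical here:

from abc import ABC, abstractmethod
from itertools import combinations

class SkeletonGenerator(ABC):
    # Hypothetical interface: algorithms would be configured with one of
    # these instead of hard-coding create_all_possible_edges().
    @abstractmethod
    def generate(self, graph):
        ...

class FullyConnectedSkeleton(SkeletonGenerator):
    # Reproduces today's hard-coded behaviour.
    def generate(self, graph):
        for u, v in combinations(graph.nodes, 2):  # assumed graph interface
            graph.add_edge(u, v)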

Implement effect estimation for CPDAGs

Currently, we have implemented causal effect estimation that is guaranteed to be unbiased (and even variance-minimizing) for directed acyclic graphs (DAGs): regressing on all parents. However, this does not work if an adjacent edge is undirected. Therefore, we have to implement causal effect estimation using valid adjustment sets in completed partially directed acyclic graphs (CPDAGs), which are the output of the PC algorithm; see for example https://www.jmlr.org/papers/v21/20-175.html.
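
One common approach in the spirit of the cited work is IDA-style enumeration: since undirected edges at the cause may point either way, estimate the effect once per admissible parent set and report the resulting set of values. A coarse sketch with a hypothetical estimate callback (local validity checks on each orientation are omitted):

from itertools import chain, combinations

def possible_total_effects(data, cause, effect, directed_parents,
                           undirected_neighbors, estimate):
    # Each subset of the undirected neighbours of `cause` is a candidate
    # addition to its parent set; adjusting for each candidate yields one
    # possible effect. `estimate(data, cause, effect, adjustment_set)`
    # stands in for a regression-based estimator.
    candidate_sets = chain.from_iterable(
        combinations(undirected_neighbors, r)
        for r in range(len(undirected_neighbors) + 1)
    )
    return {
        estimate(data, cause, effect, list(directed_parents) + list(extra))
        for extra in candidate_sets
    }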

Move from serialize methods everywhere to a serializer mixin

Currently, we hack a serialize method into every graph to allow users to eject and modify them in JSON (soon YAML) format. But it would be so much cooler to just have a generic mixin that makes every part of our pipeline serializable.
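
A generic mixin along these lines; a sketch that recurses one level into attributes and ignores lists of serializable objects:

import json

class SerializerMixin:
    # Hypothetical generic replacement for the per-class serialize hacks.
    def serialize(self):
        return {
            key: value.serialize() if isinstance(value, SerializerMixin) else value
            for key, value in vars(self).items()
        }

    def to_json(self):
        return json.dumps(self.serialize())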
