causy-dev / causy

Causal discovery made easy.

Home Page: https://causy-dev.github.io/causy/

License: MIT License

Topics: causality, pc-algorithm, python, pytorch, causal-discovery, causal-inference

causy's Introduction

Warning

causy is a prototype. Please report any issues and be mindful when using it in production.

causy

causy is a command line tool that lets you apply causal inference methods such as causal discovery and causal effect estimation. You can adjust causal discovery algorithms with pipelines that are easy to use, extend, and maintain. causy is built on PyTorch, which allows you to run the algorithms on CPUs as well as GPUs.

causy workspaces allow you to manage your data sets, algorithm adjustments, and (hyper-)parameters for your experiments.

causy UI allows you to look at your resulting graphs in the browser and gain further insights into every step of the algorithms.

You can find the documentation at https://causy-dev.github.io/causy/.

Installation

Currently, we support Python 3.11 and 3.12. To install causy, run

pip install causy

Usage

causy can be used via the CLI with workspaces, or directly from code.

Workspaces Quickstart

See options for causy workspace

causy workspace --help

Create a new workspace and interactively configure your pipeline, data loader, and experiments. Your input data should be a JSON file stored in the same directory.

causy workspace init

Add an experiment

causy workspace experiment add your_experiment_name

Update a variable in the experiment

causy workspace experiment update-variable your_experiment_name your_variable_name your_variable_value 

Run multiple experiments

causy workspace execute 

Compare the graphs of the experiments with different variable values via a matrix plot

causy workspace diff

Compare the graphs in the UI, switch between different experiments and visualize the causal discovery process

causy ui

Usage via Code

Use a default algorithm

from causy.algorithms import PC
from causy.graph_utils import retrieve_edges

model = PC()
# Load the observational data; each dict is one sample.
model.create_graph_from_data(
    [
        {"a": 1, "b": 0.3},
        {"a": 0.5, "b": 0.2}
    ]
)
# Start from the fully connected skeleton, then run the pipeline.
model.create_all_possible_edges()
model.execute_pipeline_steps()
edges = retrieve_edges(model.graph)

for edge in edges:
    print(
        f"{edge[0].name} -> {edge[1].name}: {model.graph.edges[edge[0]][edge[1]]}"
    )

Supported algorithms

Currently, causy supports the following algorithms:

  • PC (Peter-Clark)
    • PC - the original PC algorithm without any modifications: causy.algorithms.PC
    • ParallelPC - a parallelized version of the PC algorithm: causy.algorithms.ParallelPC
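
Assuming ParallelPC exposes the same interface as PC (both live in causy.algorithms per the list above), the earlier example should only need a different import; a sketch:

from causy.algorithms import ParallelPC
from causy.graph_utils import retrieve_edges

# Same workflow as the PC example above, with the parallelized variant.
model = ParallelPC()
model.create_graph_from_data(
    [
        {"a": 1, "b": 0.3},
        {"a": 0.5, "b": 0.2}
    ]
)
model.create_all_possible_edges()
model.execute_pipeline_steps()
edges = retrieve_edges(model.graph)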

Supported pipeline steps

Detailed information about the pipeline steps can be found in the API Documentation.

Dev usage

Setup

We recommend using Poetry to manage the dependencies. To install Poetry, follow the instructions at https://python-poetry.org/docs/#installation.

Install dependencies

poetry install

Execute tests

poetry run python -m unittest

Funded by the Prototype Fund from March 2024 until September 2024


causy's People

Contributors

dependabot[bot], lilithwittmann, this-is-sofia


causy's Issues

Assumptions: Test how to best introduce warnings / guides

Example: The RKI data in the project is not i.i.d. (independent and identically distributed) because postal codes (PLZs) that lie close together are highly correlated. Therefore, the results of the PC algorithm can be heavily biased.

Test how to best integrate this information. Ideas:

  • Use available tests for assumptions (e.g., whether the data are i.i.d. or stationary) and throw warnings (see the sketch below)
  • Suggest algorithms that do not require the violated assumption, if available
  • If no algorithm is available: offer different heuristics to account for the assumption violation, but indicate that the results are no longer reliable. Intuitively, this could be done by showing the outputs of different heuristics and documenting their weaknesses, as well as running robustness tests whenever possible.
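
To illustrate the warning idea, a deliberately simple sketch; check_iid_assumption, the lag-1 autocorrelation test, and the 0.2 threshold are all invented here and are not causy API:

import warnings
import numpy as np

def check_iid_assumption(samples, threshold=0.2):
    """Hypothetical helper: warn if a variable shows strong lag-1
    autocorrelation, which hints that the i.i.d. assumption is violated."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()
    denom = (x ** 2).sum()
    if denom == 0:
        return
    autocorr = (x[:-1] * x[1:]).sum() / denom
    if abs(autocorr) > threshold:
        warnings.warn(
            f"lag-1 autocorrelation {autocorr:.2f} exceeds {threshold}; "
            "samples may not be i.i.d., PC results can be biased"
        )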

Add advanced rules for conflict resolution in collider rule stage

At the moment, we have implemented basic conflict resolution strategies for the collider rule stage in the PC algorithm (orientation rules -> collider test), namely (sketched below):

  • KEEP_FIRST: If an edge in an unshielded triple has been oriented as a collider and a second unshielded triple later attempts to orient it again, the first orientation is kept and the other edge in the second unshielded triple is not oriented.
  • KEEP_LAST: Works analogously with the order flipped.
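
A rough sketch of how such strategies could be dispatched; ConflictResolution and resolve_collider_conflict are hypothetical names, not the current causy implementation:

from enum import Enum

class ConflictResolution(Enum):
    KEEP_FIRST = "keep_first"  # keep the earlier collider orientation
    KEEP_LAST = "keep_last"    # let the later orientation win

def resolve_collider_conflict(existing, candidate, strategy):
    # Hypothetical helper: returns the orientation to keep when two
    # unshielded triples try to orient the same edge differently.
    if strategy is ConflictResolution.KEEP_FIRST:
        return existing
    return candidate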

However, there are several enhancements that can be implemented:

Direct Effect Estimation For Graphs With Assumption Violations

Causal effect estimation relies heavily on identifying a valid adjustment set. For example, when estimating direct effects in DAGs under causal sufficiency and linearity assumptions, regressing the effect variable on all parents of the effect variable provides an unbiased estimator of the true causal effect. However, whenever an edge is wrongly oriented, direct effect estimation can become heavily biased, which we observe on real data and on toy models with built-in assumption violations. Wrong orientations can occur due to small-sample effects (statistical tests returning a wrong result), violations of the faithfulness assumption, or the PC algorithm being applied to data with hidden confounding (where the FCI algorithm should have been used instead). Think about how to indicate this uncertainty whenever there are orientation conflicts.

For example, consider this toy model:

model = IIDSampleGenerator(
    edges=[
        SampleEdge(NodeReference("A"), NodeReference("C"), 1),
        SampleEdge(NodeReference("B"), NodeReference("C"), 2),
        SampleEdge(NodeReference("A"), NodeReference("D"), 3),
        SampleEdge(NodeReference("B"), NodeReference("D"), 1),
        SampleEdge(NodeReference("C"), NodeReference("D"), 1),
        SampleEdge(NodeReference("B"), NodeReference("E"), 4),
        SampleEdge(NodeReference("E"), NodeReference("F"), 5),
        SampleEdge(NodeReference("B"), NodeReference("F"), 6),
        SampleEdge(NodeReference("C"), NodeReference("F"), 1),
        SampleEdge(NodeReference("D"), NodeReference("F"), 1),
    ],
)
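
For reference, the regression estimator described above is easy to sketch in plain numpy; direct_effect is a hypothetical helper, not causy's estimator:

import numpy as np

def direct_effect(data, cause, effect, parents_of_effect):
    # OLS-regress the effect on all of its parents (including `cause`);
    # unbiased only under causal sufficiency, linearity, and a correct graph.
    X = np.column_stack([data[p] for p in parents_of_effect])
    X = np.column_stack([X, np.ones(X.shape[0])])  # intercept column
    coef, *_ = np.linalg.lstsq(X, np.asarray(data[effect]), rcond=None)
    return coef[parents_of_effect.index(cause)]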

Create loops over pipeline steps

Clean up create_pipeline and add the following features:

  • using different generators for each rule
  • iterating over pipeline steps until exit condition

Update config accordingly.
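
One possible shape for the iteration feature; a hedged sketch in which run_with_loop and the convention that steps report whether they changed the graph are both assumptions:

def run_with_loop(steps, graph, max_iterations=100):
    # Iterate over pipeline steps until an exit condition holds: here,
    # a fixed point where no step changed the graph (or a hard cap).
    for _ in range(max_iterations):
        changed = False
        for step in steps:
            changed |= bool(step.apply(graph))  # assumes steps report changes
        if not changed:
            break
    return graph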

Implement edge type enum per algorithm

Currently, we give our edges meaning implicitly, based on the context of the algorithms they are used in.

But edges have specific, different meanings in different algorithms. One option would be to find a common superset across those algorithms. Another option would be to have one EdgeType enum class per algorithm.

This could look something like this:

class PCEdgeTypes(EdgeType):
    DIRECTED_EDGE = "directed"
    UNDIRECTED_EDGE = "undirected"

    # Bare member names are used here because PCEdgeTypes does not exist
    # yet while its own class body is being executed.
    @pre
    @on_updated([UNDIRECTED_EDGE], [DIRECTED_EDGE])
    @classmethod
    def check_update_of_undirected_edge_possible(cls, node_a, node_b, graph, operations):
        pass

This also means that a PipelineStep needs to state explicitly which edge types it requires, and that the edge type enum can be configured when a model is created.

Pre-knowledge: Allow edges to be protected

Currently, our data structure does not support protecting edges from deletion.

Protecting edges is needed so that we can incorporate pre-knowledge into our graphs.

Therefore, we need to (see the sketch after this list):

  • add a protected field to our Edge class
  • check before modifying or deleting an edge whether the operation is allowed
  • show the user a warning and record the attempt in our edge history if we try to remove a protected edge
  • add an option to incorporate pre-knowledge (#9)
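
A minimal sketch of the protection mechanism, assuming a simplified Edge class (causy's real Edge differs):

import warnings
from dataclasses import dataclass, field

@dataclass
class Edge:
    # Hypothetical, simplified stand-in for causy's Edge class.
    u: str
    v: str
    protected: bool = False
    history: list = field(default_factory=list)

def remove_edge(edges, edge):
    # Check before deletion whether the operation is allowed; warn and
    # record the attempt in the edge history if the edge is protected.
    if edge.protected:
        edge.history.append("deletion attempted on protected edge; skipped")
        warnings.warn(f"edge {edge.u}-{edge.v} is protected and was not removed")
        return False
    edges.remove(edge)
    return True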

Fix IID Sample generator bug

It currently generates the data based on the initial value and not based on the current step. Also, we later don't want initial values at all, but will dynamically compute the order such that no variable depends on a variable that has not been assigned a value yet. But for now, initial values are acceptable; the generator should first work properly.
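
The dynamic ordering mentioned above amounts to a topological sort of the sample-generation DAG; a minimal, causy-independent sketch using Kahn's algorithm:

from collections import defaultdict, deque

def generation_order(edges):
    """Return node names so that every node appears after all its parents.
    `edges` is a list of (parent, child) pairs."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order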

Fix misleading naming of path and edge types

Currently, we use the following functions to check the following tasks:

directed_edge_exists(v, w): checks if there is a directed edge from node v to node w or a bidirected edge between two nodes v and w
only_directed_edge_exists(v, w): checks if there is a directed edge from node v to node w
directed_path_exists(v, w): checks if a directed path from node v to node w exists, not containing any bidirected edges
path_exists(v, w): checks if a path exists between node v and node w on the underlying undirected graph, ignoring edge types

Think about a better and coherent naming. First ideas:

path_exists -> orientation_agnostic_path_exists
directed_edge_exists -> directed_from_to_or_bidirected_edge_exists
only_directed_edge_exists -> directed_edge_exists
directed_path_exists - ok.

Also add better documentation of the concept of inducing paths.
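
If the renaming goes ahead, transition shims could keep old call sites working; a sketch against an abstract graph object. Note that directed_edge_exists changes meaning under the proposal (old: directed or bidirected; new: strictly directed), so that one cannot be a silent alias and would need a deprecation cycle:

def orientation_agnostic_path_exists(graph, v, w):
    # Proposed new name; delegates to the current path_exists.
    return graph.path_exists(v, w)

def directed_from_to_or_bidirected_edge_exists(graph, v, w):
    # Proposed new name for the current (broad) directed_edge_exists.
    return graph.directed_edge_exists(v, w)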

Test that fails in current setup

Check why:

def test_second_toy_model_example(self):
    rdnv = self.seeded_random.normalvariate
    model = IIDSampleGenerator(
        edges=[
            SampleEdge(NodeReference("A"), NodeReference("C"), 1),
            SampleEdge(NodeReference("B"), NodeReference("C"), 2),
            SampleEdge(NodeReference("A"), NodeReference("D"), 3),
            SampleEdge(NodeReference("B"), NodeReference("D"), 1),
            SampleEdge(NodeReference("C"), NodeReference("D"), 1),
            SampleEdge(NodeReference("B"), NodeReference("E"), 4),
            SampleEdge(NodeReference("E"), NodeReference("F"), 5),
            SampleEdge(NodeReference("B"), NodeReference("F"), 6),
            SampleEdge(NodeReference("C"), NodeReference("F"), 1),
            SampleEdge(NodeReference("D"), NodeReference("F"), 1),
        ],
        random=lambda: rdnv(0, 1),
    )

    sample_size = 100000
    test_data, sample_graph = model.generate(sample_size)

    tst = PCStable()
    tst.create_graph_from_data(test_data)
    tst.create_all_possible_edges()
    tst.execute_pipeline_steps()

Graph rendering: Prevent overlapping edge ends

At the moment, multiple edges can end at the same point of a node, so it can be hard to tell which end belongs to which edge. This is a problem when edges are of different types (partially directed, bidirected, undirected) and might lead to confusion.

Possible solutions would be curved edges, spacing between the edge ends, or at least including the edge type in the widget that pops up when you click on the edge.
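
For the spacing idea, attachment points can be distributed around a circular node; a geometry-only sketch, not tied to causy's renderer:

import math

def attachment_points(cx, cy, radius, n_edges):
    # Spread n_edges attachment points evenly around a circular node so
    # edge ends no longer overlap at a single point.
    return [
        (cx + radius * math.cos(2 * math.pi * i / n_edges),
         cy + radius * math.sin(2 * math.pi * i / n_edges))
        for i in range(n_edges)
    ]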

Allow multiple edges of different types between nodes

The output of the FCI algorithm is a MAG with at most one edge between two nodes. However, to properly test the algorithm, it is helpful to test the inducing_path_exists function on ADMGs with possibly two different edge types between two nodes, a directed edge representing a direct effect and a bidirected edge representing a hidden confounder. (In a MAG, there would just be a directed edge in this case.)

Therefore, we should think about whether to implement this option. For now, we exclude tests that would need such an option in order to return the desired results, for example:

def test_is_path_inducing_multiple_edges(self):
    graph = GraphManager()
    node1 = graph.add_node("test1", [1, 2, 3])
    node2 = graph.add_node("test2", [1, 2, 3])
    node3 = graph.add_node("test3", [1, 2, 3])
    graph.add_bidirected_edge(node1, node2, {"test": "test"})
    graph.add_bidirected_edge(node2, node3, {"test": "test"})
    graph.add_directed_edge(node2, node3, {"test": "test"})
    path = [(node1, node2), (node2, node3)]
    self.assertTrue(graph._is_path_inducing(path, node1, node3))

Implement Skeleton Generator Concept

Currently, the graph is initialised with one hard-coded skeleton (create_all_possible_edges). This should be configurable so that including prior knowledge becomes easy. Also, when initialising the pre-configured algorithms, you should not have to initialise the graph explicitly anymore.
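
One possible shape for the concept; SkeletonGenerator, FullyConnectedSkeleton, and the graph.nodes / graph.add_edge interface are hypothetical here:

from abc import ABC, abstractmethod
from itertools import combinations

class SkeletonGenerator(ABC):
    # Hypothetical interface: algorithms would be configured with one of
    # these instead of hard-coding create_all_possible_edges().
    @abstractmethod
    def generate(self, graph):
        ...

class FullyConnectedSkeleton(SkeletonGenerator):
    # Reproduces today's hard-coded behaviour.
    def generate(self, graph):
        for u, v in combinations(graph.nodes, 2):  # assumed graph interface
            graph.add_edge(u, v)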

Implement effect estimation for CPDAGs

Currently, we have implemented causal effect estimation that is guaranteed to be unbiased (and even variance-minimizing) for directed acyclic graphs (DAGs): regressing on all parents. However, this does not work if an adjacent edge is undirected. Therefore, we have to implement causal effect estimation using valid adjustment sets in completed partially directed acyclic graphs (CPDAGs), which are the output of the PC algorithm; see for example https://www.jmlr.org/papers/v21/20-175.html.
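
One common approach in the spirit of the cited work is IDA-style enumeration: since undirected edges at the cause may point either way, estimate the effect once per admissible parent set and report the resulting set of values. A coarse sketch with a hypothetical estimate callback (local validity checks on each orientation are omitted):

from itertools import chain, combinations

def possible_total_effects(data, cause, effect, directed_parents,
                           undirected_neighbors, estimate):
    # Each subset of the undirected neighbours of `cause` is a candidate
    # addition to its parent set; adjusting for each candidate yields one
    # possible effect. `estimate(data, cause, effect, adjustment_set)`
    # stands in for a regression-based estimator.
    candidate_sets = chain.from_iterable(
        combinations(undirected_neighbors, r)
        for r in range(len(undirected_neighbors) + 1)
    )
    return {
        estimate(data, cause, effect, list(directed_parents) + list(extra))
        for extra in candidate_sets
    }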

Move from serialize methods everywhere to a serializer mixin

Currently, we hack a serialize method into every graph to allow users to eject and modify them in JSON (soon YAML) format. But it would be so much cooler to just have a generic mixin that makes every part of our pipeline serializable.
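
A generic mixin along these lines; a sketch that recurses one level into attributes and ignores lists of serializable objects:

import json

class SerializerMixin:
    # Hypothetical generic replacement for the per-class serialize hacks.
    def serialize(self):
        return {
            key: value.serialize() if isinstance(value, SerializerMixin) else value
            for key, value in vars(self).items()
        }

    def to_json(self):
        return json.dumps(self.serialize())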
