causy-dev / causy
Causal discovery made easy.
Home Page: https://causy-dev.github.io/causy/
License: MIT License
Currently our data structure does not support protecting edges from being deleted.
Protecting edges is needed so that we can incorporate pre-knowledge into our graphs.
Therefore we need to add a `protected` field to our Edge class.

Clean up create_pipeline and add the following features:
Update config accordingly.
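A minimal sketch of what the `protected` field could look like; the `Edge` fields besides `protected` and the `remove_edge` helper are illustrative, not the current causy API:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Edge:
    # existing edge payload (illustrative names)
    u: str
    v: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    # new: protected edges carry pre-knowledge and must survive deletion steps
    protected: bool = False

def remove_edge(edges, edge):
    """Remove an edge unless it is protected by pre-knowledge."""
    if edge.protected:
        raise ValueError(f"Edge {edge.u} -> {edge.v} is protected and cannot be deleted")
    edges.remove(edge)
```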
The output of the FCI algorithm is a MAG with at most one edge between two nodes. However, to properly test the algorithm, it is helpful to test the inducing_path_exists function on ADMGs with possibly two different edge types between two nodes, a directed edge representing a direct effect and a bidirected edge representing a hidden confounder. (In a MAG, there would just be a directed edge in this case.)
Therefore, we should think about whether to implement this option. For now, we exclude tests that would need such an option in order to return the desired results, for example:
def test_is_path_inducing_multiple_edges(self):
    graph = GraphManager()
    node1 = graph.add_node("test1", [1, 2, 3])
    node2 = graph.add_node("test2", [1, 2, 3])
    node3 = graph.add_node("test3", [1, 2, 3])
    graph.add_bidirected_edge(node1, node2, {"test": "test"})
    graph.add_bidirected_edge(node2, node3, {"test": "test"})
    graph.add_directed_edge(node2, node3, {"test": "test"})
    path = [(node1, node2), (node2, node3)]
    self.assertTrue(graph._is_path_inducing(path, node1, node3))
Currently we give our edges meaning implicitly, based on the context of the algorithms they are used in.
But edges have specific, different meanings in different algorithms. One option would be to find a common superset of edge types across those algorithms. Another option would be to have one EdgeType enum class per algorithm.
This could look something like this:
class PCEdgeTypes(EdgeType):
    DIRECTED_EDGE = "directed"
    UNDIRECTED_EDGE = "undirected"

@pre
@on_updated([PCEdgeTypes.UNDIRECTED_EDGE], [PCEdgeTypes.DIRECTED_EDGE])
@classmethod
def check_update_of_undirected_edge_possible(cls, node_a, node_b, graph, operations):
    pass
This also means that a PipelineStep needs to state explicitly which edge types it requires, and that the edge type enum can be configured when a model is created.
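One way a PipelineStep could declare its required edge types and have them checked at model creation; all names here (`required_edge_types`, `OrientCollidersStep`, `validate_pipeline`) are hypothetical sketches, not the current causy API:

```python
from enum import Enum

class EdgeType(str, Enum):
    """Base class for per-algorithm edge type enums (no members itself)."""
    pass

class PCEdgeTypes(EdgeType):
    DIRECTED_EDGE = "directed"
    UNDIRECTED_EDGE = "undirected"

class PipelineStep:
    # edge types this step reads or writes; checked at model creation
    required_edge_types: list = []

class OrientCollidersStep(PipelineStep):
    required_edge_types = [PCEdgeTypes.UNDIRECTED_EDGE, PCEdgeTypes.DIRECTED_EDGE]

def validate_pipeline(steps, configured_edge_types):
    """Reject a pipeline whose steps need edge types the model was not configured with."""
    for step in steps:
        missing = [t for t in step.required_edge_types if t not in configured_edge_types]
        if missing:
            raise ValueError(f"{step.__class__.__name__} requires edge types {missing}")
```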
Use analytic results for the mean and variance (such that the process stays stationary) and a normal distribution.
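For illustration with an AR(1) process x_t = phi * x_{t-1} + eps_t, eps_t ~ N(0, sigma^2), |phi| < 1: the analytic stationary distribution is N(0, sigma^2 / (1 - phi^2)), so drawing the initial value from it keeps every step marginally identical. A minimal sketch (function and parameter names are illustrative):

```python
import math
import random

def sample_stationary_ar1(phi, sigma, steps, rng=random.Random(0)):
    """Sample an AR(1) path whose initial value is drawn from the analytic
    stationary distribution N(0, sigma^2 / (1 - phi^2)), so the process
    has the same marginal distribution at every step."""
    assert abs(phi) < 1, "stationarity requires |phi| < 1"
    stationary_std = sigma / math.sqrt(1 - phi ** 2)
    x = rng.normalvariate(0, stationary_std)  # analytic stationary draw
    path = [x]
    for _ in range(steps - 1):
        x = phi * x + rng.normalvariate(0, sigma)
        path.append(x)
    return path
```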
Example: the RKI data in the project is not i.i.d. (independent and identically distributed) because postal codes (PLZs) that lie close together are highly correlated. Therefore, the results of the PC algorithm can be highly biased.
Test how to best integrate this information. Ideas:
Currently our entire codebase works with PyTorch, except for this one little function (scipy_stats.t.ppf),
which exists only in SciPy; the whole world relies on >30-year-old C code, so no one needs to implement it again. But we want to run it on the GPU.
So we should either do a nice fake implementation, as Amazon did in their GluonTS project, or implement it properly in PyTorch.
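One option for the fake-implementation route is the classic expansion of the Student-t quantile around the normal quantile (Abramowitz & Stegun 26.7.5). The sketch below is plain Python for readability; in PyTorch the same polynomial runs elementwise on the GPU, with `torch.special.ndtri` supplying the normal quantile. Accuracy degrades for very small degrees of freedom (roughly nu < 5):

```python
from statistics import NormalDist

def student_t_ppf(p, nu):
    """Approximate the Student-t quantile via an expansion in 1/nu around
    the normal quantile (Abramowitz & Stegun 26.7.5).  Accurate to a few
    1e-3 for nu >= ~5; degrades for very small nu."""
    z = NormalDist().inv_cdf(p)  # in torch: torch.special.ndtri(p)
    g1 = (z ** 3 + z) / 4
    g2 = (5 * z ** 5 + 16 * z ** 3 + 3 * z) / 96
    g3 = (3 * z ** 7 + 19 * z ** 5 + 17 * z ** 3 - 15 * z) / 384
    return z + g1 / nu + g2 / nu ** 2 + g3 / nu ** 3
```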
The following test needs investigation; check why:
def test_second_toy_model_example(self):
    rdnv = self.seeded_random.normalvariate
    model = IIDSampleGenerator(
        edges=[
            SampleEdge(NodeReference("A"), NodeReference("C"), 1),
            SampleEdge(NodeReference("B"), NodeReference("C"), 2),
            SampleEdge(NodeReference("A"), NodeReference("D"), 3),
            SampleEdge(NodeReference("B"), NodeReference("D"), 1),
            SampleEdge(NodeReference("C"), NodeReference("D"), 1),
            SampleEdge(NodeReference("B"), NodeReference("E"), 4),
            SampleEdge(NodeReference("E"), NodeReference("F"), 5),
            SampleEdge(NodeReference("B"), NodeReference("F"), 6),
            SampleEdge(NodeReference("C"), NodeReference("F"), 1),
            SampleEdge(NodeReference("D"), NodeReference("F"), 1),
        ],
        random=lambda: rdnv(0, 1),
    )
    sample_size = 100000
    test_data, sample_graph = model.generate(sample_size)
    tst = PCStable()
    tst.create_graph_from_data(test_data)
    tst.create_all_possible_edges()
    tst.execute_pipeline_steps()
Currently we hack a serialize method into every graph to let users eject and modify them in JSON (soon YAML) format. It would be much nicer to have a generic mixin that makes every part of our pipeline serializable.
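A minimal sketch of such a generic mixin; the class and method names are hypothetical, and a real version would also need to handle deserialization and non-JSON-native types:

```python
import json
from dataclasses import asdict, is_dataclass

class SerializableMixin:
    """Hypothetical generic mixin: serializes public attributes (or
    dataclass fields) to a dict, recursing into nested serializables."""

    def to_dict(self):
        if is_dataclass(self):
            return asdict(self)
        return {
            key: value.to_dict() if isinstance(value, SerializableMixin) else value
            for key, value in vars(self).items()
            if not key.startswith("_")
        }

    def serialize(self):
        return json.dumps(self.to_dict())
```

Any pipeline part would then just inherit from the mixin instead of carrying its own serialize method.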
Implement strategies for interpolating missing data points on the graph.
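One simple candidate strategy, sketched here for a single node's series (the function is hypothetical; edge gaps fall back to the nearest observed value):

```python
def interpolate_linear(values):
    """Fill None gaps in a numeric series by linear interpolation between
    the nearest observed neighbours; gaps at the ends are filled with the
    nearest observed value."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    if not known:
        return result
    for i, v in enumerate(result):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            result[i] = result[right]
        elif right is None:
            result[i] = result[left]
        else:
            frac = (i - left) / (right - left)
            result[i] = result[left] + frac * (result[right] - result[left])
    return result
```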
Currently, the graph is initialised with one hard-coded skeleton (create_all_possible_edges). This should be configurable so that including prior knowledge becomes easy. Also, when initialising the pre-configured algorithms, you should not have to initialise the graph explicitly any more.
Currently, we use the following functions to check the following tasks:
directed_edge_exists(v, w): checks if there is a directed edge from node v to node w or a bidirected edge between two nodes v and w
only_directed_edge_exists(v, w): checks if there is a directed edge from node v to node w
directed_path_exists(v, w): checks if a directed path from node v to node w exists, not containing any bidirected edges
path_exists(v, w): checks if a path exists between node v and node w on the underlying undirected graph, ignoring edge types
Think about a better and coherent naming. First ideas:
path_exists -> orientation_agnostic_path_exists
directed_edge_exists -> directed_from_to_or_bidirected_edge_exists
only_directed_edge_exists -> directed_edge_exists
directed_path_exists - ok.
Also add better documentation of the concept of inducing paths.
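For reference, the orientation-agnostic check (path_exists) boils down to a BFS on the skeleton, ignoring edge types. A minimal sketch over a plain adjacency mapping (not the causy graph API):

```python
from collections import deque

def orientation_agnostic_path_exists(adjacency, v, w):
    """BFS over the undirected skeleton: edge types are ignored, only
    adjacency matters.  `adjacency` maps a node to the set of nodes it
    shares any edge with (directed, undirected, or bidirected)."""
    seen = {v}
    queue = deque([v])
    while queue:
        current = queue.popleft()
        if current == w:
            return True
        for neighbour in adjacency.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False
```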
Think of an example that runs into the situations in the pictures, and test whether the current quadruple orientation rules also work as intended in bigger examples.
(Pictures taken from https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/CausalInference/2019/Introduction_to_Constraint-Based_Causal_Structure_Learning.pdf)
It currently generates the data based on the initial value rather than on the current step. Also, we later don't want initial values at all, but will dynamically compute an order such that no variable depends on a variable that has not yet been assigned a value. For now, it's OK with initial values; it should first work properly.
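Computing that order is a topological sort of the sample graph. A minimal Kahn's-algorithm sketch over (parent, child) pairs (a hypothetical edge representation, not the SampleEdge API):

```python
from collections import defaultdict, deque

def assignment_order(edges):
    """Kahn's algorithm: return node names in an order where every node
    appears after all of its parents.  `edges` is a list of
    (parent, child) pairs; raises ValueError on cycles."""
    children = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1
        nodes.update((parent, child))
    queue = deque(sorted(n for n in nodes if in_degree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle; no valid assignment order")
    return order
```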