
felixleopoldo / benchpress

A Snakemake workflow to run and benchmark structure learning (a.k.a. causal discovery) algorithms for probabilistic graphical models.

Home Page: https://benchpressdocs.readthedocs.io

License: GNU General Public License v2.0

R 39.21% Python 45.20% Shell 3.97% Dockerfile 0.54% TeX 11.08%
graphical-models bayesian-networks markov-networks benchmarking reproducible-research machine-learning snakemake-workflow structure-learning causal-discovery causal-models

benchpress's People

Contributors

aditya003singh, alex-markham, dependabot[bot], felixleopoldo, jackkuipers, jcussens, melmasri, rocafuerte, yasu-sh


benchpress's Issues

Boxplot

Enable the use of box plots in the ROC curves.

Tetrad JSON graph structure may have the arrow-tail direction as "<--"

isdirected <- ((e$endpoint1 == "TAIL") && (e$endpoint2 == "ARROW")) | ((e$endpoint2 == "TAIL") && (e$endpoint1 == "ARROW"))

Currently the code only checks whether an edge is directed.
But Tetrad with bootstrapping may also output the opposite arrow direction as an edge: "<--".

The code below may avoid this potential problem.
In my experience, Tetrad outputs only one direction, "-->", except when bootstrapping.
If this behaviour is intentional, feel free to let me know.

    # No CIRCLE endpoint check; only TAIL/ARROW combinations are handled.
    isdirected1to2 <- (e$endpoint1 == "TAIL") && (e$endpoint2 == "ARROW")
    isdirected2to1 <- (e$endpoint1 == "ARROW") && (e$endpoint2 == "TAIL")

    if (isdirected1to2) {
      m[node1_ind, node2_ind] <- 1
    } else if (isdirected2to1) {
      # Covers the "<--" output that bootstrapping can produce.
      m[node2_ind, node1_ind] <- 1
    } else {
      # Neither direction matched: treat as undirected and mark both entries.
      m[node1_ind, node2_ind] <- 1
      m[node2_ind, node1_ind] <- 1
    }

Support for arm64 architecture

The Docker images are currently built for the amd64 architecture, but arm64 images should also be available so that Benchpress can be used (through Docker) on e.g. Apple M1/M2 machines.

SHD

Plot the SHD metric.

Write and plot edge weights

Some algorithms, like NOTEARS, estimate edge weights/parameters. It should be possible to access these. This can be done by adding another output field, edge_weights, to the rules corresponding to these algorithms. For MCMC algorithms there should be a general converter rule that creates an estimate of the edge probabilities.
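Such a converter rule could be sketched as follows. This is a hypothetical illustration, not Benchpress's actual rule: edge probabilities are estimated by averaging the trajectory's adjacency-matrix samples after an (assumed) burn-in.

```python
import numpy as np

# Hypothetical sketch of a converter: estimate edge probabilities from an MCMC
# trajectory of adjacency-matrix samples by averaging them after a burn-in.
# The function name and burn-in handling are illustrative.
def edge_probabilities(trajectory, burn_in=0):
    samples = np.asarray(trajectory[burn_in:], dtype=float)
    return samples.mean(axis=0)

trajectory = [
    np.array([[0, 1], [0, 0]]),
    np.array([[0, 1], [0, 0]]),
    np.array([[0, 0], [0, 0]]),
]
probs = edge_probabilities(trajectory)  # probs[0, 1] is 2/3
```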

Failed to find loop device: could not attach image file to loop device: no loop devices available

@felixleopoldo Thanks for your work on Benchpress. It works fine on my machines.
I was using Benchpress (and hitting these issues) on WSL2 under Windows 11 with Docker, so I moved to a native Ubuntu machine with the Docker image, following the instructions in the Benchpress manual.

One symptom in some setups is worth reporting.

  • Countermeasure 1: reduce the number of cores - worked.
    • snakemake --cores 4 --use-singularity --configfile config/config.json
  • Countermeasure 2: install on Linux (Ubuntu) on WSL - worked!
    • Used Miniforge (Mambaforge) instead of Miniconda, since Miniconda corrupted the base conda environment.
    • Apptainer version 1.2.3 installed as a non-setuid installation

I guess the root cause has not been solved yet, but in some cases updating resolves it.
I hope my report helps other Benchpress users.
sylabs/singularity#67 <- the same symptoms.

When I run the command below:

(snakemake) root@:/mnt# snakemake --cores all --use-singularity --configfile config/config.json

Then I faced the error below. Both setups showed the same symptom: there were not enough loop devices.

  • Win11 on Docker Desktop with WSL2, volume mounted on the Windows file system
    PS > docker run -it -w /mnt --privileged -v F:/benchpress:/mnt bpimages/snakemake:v7.32.3

  • Win11 on Docker Desktop with WSL2, volume mounted on the WSL2 file system
    docker run -it -w /mnt --privileged --name bntab -v /home/path/benchpress:/mnt bpimages/snakemake:v7.32.3

  • Symptom:

(omit)
[Fri Sep 29 05:17:07 2023]
Finished job 112.
1 of 346 steps (0.3%) done
FATAL:   container creation failed: mount /proc/self/fd/3->/opt/conda/envs/snakemake/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: could not attach image file to loop device: no loop devices available

(snakemake) root@d6f240d00620:/mnt# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       48 bits physical, 48 bits virtual
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Model name:          AMD Ryzen 5 5500
CPU MHz:             3593.164

Memory 64GB / GPU nvidia 8GB

On the working setup, the number of loop devices had increased from 8 to 255:

(snakemake) root@:/mnt# ls /dev/loop*
/dev/loop-control  /dev/loop119  /dev/loop140  /dev/loop162  /dev/loop184  /dev/loop205  /dev/loop227  /dev/loop249  /dev/loop40  /dev/loop62  /dev/loop84
/dev/loop0         /dev/loop12   /dev/loop141  /dev/loop163  /dev/loop185  /dev/loop206  /dev/loop228  /dev/loop25   /dev/loop41  /dev/loop63  /dev/loop85
(omit)
/dev/loop118       /dev/loop14   /dev/loop161  /dev/loop183  /dev/loop204  /dev/loop226  /dev/loop248  /dev/loop4    /dev/loop61  /dev/loop83

Variable naming convention.

When running the diffplot method to compare graphs, line L127 of this file

compares the true graph to the estimated one. This throws an error like:

Error in check.nodes(.nodes(custom[[i]]), graph = nodes, min.nodes = length(nodes),  : 
  invalid node(s) 'X0' 'X1' 'X2' 'X3' 'X4' 'X5' 'X6' 'X7' 'X8' 'X9' 'X10' 'X11' 'X12' 'X13' 'X14' 'X15' 'X16' 'X17' 'X18' 'X19' 'X20' 'X21' 'X22' 'X23' 'X24' 'X25' 'X26' 'X27' 'X28' 'X29' 'X30' 'X31' 'X32' 'X33' 'X34' 'X35' 'X36' 'X37' 'X38' 'X39' 'X40' 'X41' 'X42' 'X43' 'X44' 'X45' 'X46' 'X47' 'X48' 'X49' 'X50' 'X51' 'X52' 'X53' 'X54' 'X55' 'X56' 'X57' 'X58' 'X59' 'X60' 'X61' 'X62' 'X63' 'X64' 'X65' 'X66' 'X67' 'X68' 'X69' 'X70' 'X71' 'X72' 'X73' 'X74' 'X75' 'X76' 'X77' 'X78' 'X79' 'X80' 'X81' 'X82' 'X83' 'X84' 'X85' 'X86' 'X87' 'X88' 'X89' 'X90' 'X91' 'X92' 'X93' 'X94' 'X95' 'X96' 'X97' 'X98' 'X99'.
Calls: benchmarks ... graphviz.compare -> check.customlist -> check.nodes
Execution halted

The issue here is that names(pattern_true_bn$nodes) are named ["0", "1", "2", ..., "99"] while names(pattern_estimated_bn$nodes) are named ["X0", "X1", "X2", ...], so the comparison throws an error since the names do not match.

Here is a printout of those names:

[1] "names 1"
  [1] "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
 [16] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
 [31] "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44"
 [46] "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
 [61] "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71" "72" "73" "74"
 [76] "75" "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87" "88" "89"
 [91] "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
[1] "names 2"
  [1] "X0"  "X1"  "X2"  "X3"  "X4"  "X5"  "X6"  "X7"  "X8"  "X9"  "X10" "X11"
 [13] "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20" "X21" "X22" "X23"
 [25] "X24" "X25" "X26" "X27" "X28" "X29" "X30" "X31" "X32" "X33" "X34" "X35"
 [37] "X36" "X37" "X38" "X39" "X40" "X41" "X42" "X43" "X44" "X45" "X46" "X47"
 [49] "X48" "X49" "X50" "X51" "X52" "X53" "X54" "X55" "X56" "X57" "X58" "X59"
 [61] "X60" "X61" "X62" "X63" "X64" "X65" "X66" "X67" "X68" "X69" "X70" "X71"
 [73] "X72" "X73" "X74" "X75" "X76" "X77" "X78" "X79" "X80" "X81" "X82" "X83"
 [85] "X84" "X85" "X86" "X87" "X88" "X89" "X90" "X91" "X92" "X93" "X94" "X95"
 [97] "X96" "X97" "X98" "X99"
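One possible workaround, sketched below, is to map the true graph's node names onto the estimated graph's convention before calling graphviz.compare, so both graphs share one naming scheme. The renaming rule ("prefix with X") is an assumption drawn from the printout above.

```python
# Hypothetical sketch: rename the true graph's nodes ("0", "1", ...) to the
# estimated graph's convention ("X0", "X1", ...) before comparing, so the
# node-name check no longer fails. The "X" prefix rule is an assumption.
true_names = [str(i) for i in range(100)]
estimated_names = ["X%d" % i for i in range(100)]

renamed_true = ["X" + name for name in true_names]
```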

Set output directory in conf

The benchmark_setup section of the config file should have a key called output_dir specifying where the output of the evaluation modules should be saved. Thus everything currently saved in output should instead be saved in output/config["benchmark_setup"]["output_dir"]. When running a config file, the config file itself should also be saved there.
One would basically just have to change the output part of rules.smk in the evaluation modules, for example this line and the ones below it.
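A minimal sketch of what such a config could look like (the key name and directory value here are assumptions based on the description above, not the current schema):

```json
{
  "benchmark_setup": {
    "output_dir": "my_experiment"
  }
}
```

With this, everything now written to output/ would instead go to output/my_experiment/, together with a copy of the config file itself.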

Compress MCMC trajectories

MCMC trajectories should be compressed into tarballs to save disk space.
This can be implemented with an additional rule that compresses a trajectory; another rule should extract it. The output from the MCMC algorithms should be rule-temporary CSV files (Snakemake's temp()).
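The two proposed rules could be sketched as follows, assuming a trajectory stored as a CSV. File names and the tarball layout are illustrative, not Benchpress's actual paths.

```python
import os
import tarfile
import tempfile

# Hypothetical sketch of the two rules: one compresses a trajectory CSV into a
# gzipped tarball, the other extracts it again. File names are illustrative.
with tempfile.TemporaryDirectory() as d:
    csv_path = os.path.join(d, "trajectory.csv")
    with open(csv_path, "w") as f:
        f.write("iteration,score\n0,-123.4\n")

    # Compression rule: pack the CSV and drop the raw file (rule-temporary).
    tar_path = os.path.join(d, "trajectory.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(csv_path, arcname="trajectory.csv")
    os.remove(csv_path)

    # Extraction rule: recover the CSV when an evaluation module needs it.
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(d)
    restored = open(csv_path).read()
```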

Difference adjacency matrix plots

In the graph_plots module, in case the true graph is provided, we should also plot a difference matrix plot, similar to the adjacency matrix plots, where correct, missing, and false edges are indicated in black, blue, and red, respectively.
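The edge classification behind such a plot can be sketched as below (toy matrices, not the module's actual code): each cell is compared between the true and estimated adjacency matrices.

```python
import numpy as np

# Hypothetical sketch: classify each potential edge as correct (in both
# graphs), missing (true graph only), or false (estimate only); these would
# be colored black, blue, and red, respectively.
true_adj = np.array([[0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 0]])
est_adj = np.array([[0, 1, 0],
                    [0, 0, 0],
                    [1, 0, 0]])

correct = (true_adj == 1) & (est_adj == 1)      # black
missing = (true_adj == 1) & (est_adj == 0)      # blue
false_edges = (true_adj == 0) & (est_adj == 1)  # red
```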

True graph skeleton as optional algorithm input

In other words: create an undirected graph from the true DAG (or any graph) and pass it to an algorithm as input. Sampled data should be passed as input as well (as usual). It is important to note that the passed skeleton is the true undirected graph, not an estimate.

The reason is to be able to test pairwise algorithms, such as these.

Usually, pairwise methods are tested only on (X, Y) datasets, but testing them on bigger graphs (more than 2 nodes) is arguably more interesting and challenging. This, however, requires providing the algorithms with a starting point in the form of the graph's skeleton; the task then boils down to orienting the edges. The final product is a fully oriented graph (it can have cycles), so most, if not all, of the existing metrics can be used without issues.

For an example, see section 5 of [1] (sections 5.2 and 5.4 specifically).

[1] O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, and M. Sebag, ‘Learning Functional Causal Models with Generative Neural Networks’, Springer International Publishing, 2018. doi: 10.1007/978-3-319-98131-4.
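Deriving the true skeleton from a DAG's adjacency matrix can be sketched by symmetrizing it, as below (toy matrix; how the skeleton is actually passed to an algorithm module is not specified here):

```python
import numpy as np

# Hypothetical sketch: obtain the true skeleton from a DAG's adjacency matrix
# by symmetrizing it; the pairwise algorithm then only has to orient these
# edges.
dag = np.array([[0, 1, 0],
                [0, 0, 1],
                [0, 0, 0]])
skeleton = ((dag + dag.T) > 0).astype(int)
```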

The ground truth adjacency matrix's column order might need to match the dataset's column order.

I suspect this because I observed large SHD numbers without bootstrapping, even though the Tetrad results with bootstrapping look reasonable to my eye in the plots.
If this is true, it is important for users to know.

Dataset: alarm (generated from bnlearn by me)
left: ground truth / center: without bootstrapping / right: with bootstrapping = 5
[image]

Diffplot
[image]

Graph structure
[image]

Estimated graph as input

It would be nice to have a reserved field name, e.g. input_graph_id, that could be used to pass an estimated graph to an algorithm via the algorithm object's ID. This is already done in some algorithm modules, but it is not as easy as using a reserved field in the JSON config.
Looking at this implementation of the gobnilp module/rule:

startgraph_file would correspond to something like input_graph_file, and wildcards["startalg"] here would be wildcards["input_graph_id"]. The idea is to make this feature available seamlessly in any algorithm module by adding the input_graph_id field in the config file.
To do that, something similar to this part of the code should be evaluated for any algorithm having input_graph_id in the config file:
items["startalg"] = idtopath(items["startalg"], json_string)
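The generic handling could be sketched as follows. idtopath exists in the code base, but the stub below only stands in for it, and the path layout is purely illustrative.

```python
# Hypothetical sketch of the proposed generic handling: for any algorithm
# object in the config that carries the reserved field, resolve the
# referenced object's ID to a file path.
def idtopath(obj_id, json_string):
    # Stand-in for the real helper; the path layout is illustrative only.
    return "results/" + obj_id + "/adjmat.csv"

def resolve_input_graph(items, json_string):
    if "input_graph_id" in items:
        items["input_graph_id"] = idtopath(items["input_graph_id"], json_string)
    return items

items = resolve_input_graph({"input_graph_id": "pc-gauss"}, "{}")
```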

Timing

Proper timing for all algorithms. There is already support for timing, but the times are not set for all algorithms.

Annotations to ROC curves

When the ROC plots contain many IDs, the curves tend to be hard to distinguish by color alone. Annotations in the plots should make this easier.

Data preprocessing

There should be a data preprocessing field in the benchmark_setup section. This could e.g. handle modules that normalize, pollute or discretize data.

Using custom dataset

I'm interested in running Benchpress with my own dataset, which has no solution graph, and I would specifically like to use the notears and golem algorithms for research purposes.

I've already set this data configuration:

            "graph_id": null,

            "parameters_id": null,

            "data_id": "insilico.csv",

            "seed_range": null

but I don't know how I should configure the evaluation section so that the config validator doesn't show the error "ROC evaluation requires graph_id.".

I started from the gcastle.json configuration.

It may be that I'm missing something or don't fully understand the environment, sorry in advance!

Thank you for your help!

Score plots

Save the scores for the score-based algorithms and plot them.
Scores are probably best saved in separate files, as the timings are.
In the plots, results for different seeds should not be mixed up, so the score of one method should be used as the benchmark.

Readme file for the datasets

There should be a README.rst in the resources/data/mydatasets folder that contains one (or several) tables describing the datasets, with info such as

  • Title
  • Filename
  • Dimension
  • Number of observations
  • Datatype
  • Underlying graph (if applicable)
  • Description

When running parallelDG, the autocorrelation doesn't plot properly

You get a flat picture for the size/score autocorrelation when running
snakemake --configfile config/parallelDG.json --cores all --use-singularity

The issue seems to come from this line of code. The ffill routine fills forward, but when the first index in your data is not 0, the re-indexing in this line

        df2 = df2.reindex(newindex).reset_index().reindex(
            columns=df2.columns).fillna(method="ffill")

creates index 0 with NaN.

Here is an example

>>> df2
       size
index      
2         1
4         3
6         6
7        10
8        15
...     ...
99978   197
99980   196
99988   195
99996   194
99998   195

[21120 rows x 1 columns]
>>> df2.reindex(newindex).reset_index()
       index   size
0          0    NaN
1          1    NaN
2          2    1.0
3          3    NaN
4          4    3.0
...      ...    ...
99993  99993    NaN
99994  99994    NaN
99995  99995    NaN
99996  99996  194.0
99997  99997    NaN

[99998 rows x 2 columns]
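The symptom and one possible fix can be sketched with toy data (not the real trajectory): after reindexing, ffill leaves NaN before the first observed index; back-filling the leading gap is one way to handle it.

```python
import pandas as pd

# Minimal sketch of the symptom: after reindexing, forward-fill leaves NaN
# at every index before the first observed one (here 0 and 1).
df2 = pd.DataFrame({"size": [1, 3, 6]}, index=[2, 4, 6])
newindex = range(8)

filled = df2.reindex(newindex).ffill()         # index 0 and 1 stay NaN
# One possible fix (an assumption, not the project's chosen solution):
# back-fill the leading gap after forward-filling.
fixed = df2.reindex(newindex).ffill().bfill()
```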

Custom Dataset format

Hello,

if we want to use our own dataset (tabular CSV format), what should the format of the dataset be?

Thanks

Summarize Iterative search

Bug when summarising iterative search

Error in dag2essgraph(g) : Invalid graph passed to replaceUnprotected().
Calls: <Anonymous> -> dag2essgraph
Execution halted

Full Traceback (most recent call last):
  File "/users/staff/dmi-dmi/rios0000/anaconda3/envs/benchmark/lib/python3.8/site-packages/snakemake/executors.py", line 2141, in run_wrapper
    run(
  File "/users/staff/dmi-dmi/rios0000/git/benchpress/workflow/rules/algorithm_rules.smk", line 588, in __rule_summarise_itsearch
  File "/users/staff/dmi-dmi/rios0000/anaconda3/envs/benchmark/lib/python3.8/site-packages/snakemake/shell.py", line 176, in __new__
    raise sp.CalledProcessError(retcode, cmd)

Allow for different parameters for json objects from same algorithm or module

The same algorithm can be parameterized in different ways depending on e.g. the score function. This is currently handled by setting some values to null. One way to solve this could be to generate pattern strings from each algorithm object and then generate the Snakemake rules in a for loop using the pattern strings.

A bug in creating the adj mat

This line is buggy:

m = nx.to_numpy_matrix(g) - np.identity(g.order())

It should be

m = nx.to_numpy_matrix(g) 

As you can see in the code below, you are removing the diagonal. Is this intended? In trilearn:

g = dlib.gen_AR_graph(10, width=2)

m = nx.to_numpy_matrix(g) - np.identity(g.order())
m
matrix([[1., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0., 0., 0.],
        [0., 1., 1., 1., 1., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 1., 1., 1., 0., 0., 0.],
        [0., 0., 0., 1., 1., 1., 1., 1., 0., 0.],
        [0., 0., 0., 0., 1., 1., 1., 1., 1., 0.],
        [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 0., 0., 0., 1., 1., 1.]])
nx.to_numpy_matrix(g) - np.identity(g.order())
matrix([[0., 1., 1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 0., 1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 1., 1., 0., 0., 0., 0., 0.],
        [0., 1., 1., 0., 1., 1., 0., 0., 0., 0.],
        [0., 0., 1., 1., 0., 1., 1., 0., 0., 0.],
        [0., 0., 0., 1., 1., 0., 1., 1., 0., 0.],
        [0., 0., 0., 0., 1., 1., 0., 1., 1., 0.],
        [0., 0., 0., 0., 0., 1., 1., 0., 1., 1.],
        [0., 0., 0., 0., 0., 0., 1., 1., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.]])

Looking at the code downstream, this shouldn't make a difference, because you are using the cov_matrix function, which takes in the graph rather than the adjacency matrix. However, I found that in simulations this actually makes a difference.

If you generate the same simulation with the two different adjacency matrices, you get a substantially different true graph. I'll look into it later; I am just noting it here.

Problem with time limit

It's basically working, but I find that when I hit the time limit, rather than getting "None" in my time file, the file is empty. I'm not sure what is going on there. This is causing problems in combine_ROC_data.R, since we get an error:
Error in summarise():
ℹ In argument: time_median = median(time).
ℹ In group 1: id = "gobnilp-neat-bge", adjmat = "pcalg_randdag/max_parents=5/n=20/d=4/par1=None/par2=None/method=er/DAG=True", parameters = "sem_params/min=0.25/max=1", data = "iid/n=5000/standardized=True", alpha_mu = 0.01.
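A workaround on the reading side could be sketched as below: treat an empty time file the same as one containing "None", so the aggregation step can skip it instead of failing. The function and file names are illustrative assumptions, not the project's actual code.

```python
import os
import tempfile

# Hypothetical sketch: read a time file, treating an empty file (time limit
# hit) the same as an explicit "None". Names here are illustrative.
def read_time(path):
    with open(path) as f:
        content = f.read().strip()
    return None if content in ("", "None") else float(content)

with tempfile.TemporaryDirectory() as d:
    empty_file = os.path.join(d, "time_hit_limit.txt")
    open(empty_file, "w").close()  # time limit hit: file left empty
    ok_file = os.path.join(d, "time_ok.txt")
    with open(ok_file, "w") as f:
        f.write("12.5\n")
    t_limit, t_ok = read_time(empty_file), read_time(ok_file)
```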

Separate plots instead of ggplot facet_wrap

It would be better to plot e.g. the ROC curves in separate plots instead of using facet_wrap, since it often happens that the title doesn't fit into the figure and there can be too many plots.
