furrer-lab / abn
Bayesian network analysis in R
Home Page: https://r-bayesian-networks.org/
License: GNU General Public License v3.0
Use afex::mixed() instead of lme4::glmer()? This would return p-values etc. See: https://mspeekenbrink.github.io/sdam-r-companion/generalized-linear-models.html#generalized-linear-mixed-effects-models
Run examples that require INLA only if INLA is available.
Flavor: r-devel-linux-x86_64-debian-gcc
Check: examples, Result: ERROR
Running examples in 'abn-Ex.R' failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: buildScoreCache
> ### Title: Build a cache of goodness of fit metrics for each node in a DAG,
> ### possibly subject to user-defined restrictions
> ### Aliases: buildScoreCache buildScoreCache.bayes forLoopContentBayes
> ### forLoopContent buildScoreCache.mle
> ### Keywords: buildScoreCache.bayes buildScoreCache.mle calc.node.inla.glm
> ### calc.node.inla.glmm fitAbn.bayes fitAbn.mle internal models
>
> ### ** Examples
>
> ## Simple example
> # Generate data
> N <- 1e6
> mydists <- list(a="gaussian",
+ b="gaussian",
+ c="gaussian")
> a <- rnorm(n = N, mean = 0, sd = 1)
> b <- 1 + 2*rnorm(n = N, mean = 5, sd = 1)
> c <- 2 + 1*a + 2*b + rnorm(n = N, mean = 2, sd = 1)
> mydf <- data.frame("a" = scale(a),
+ "b" = scale(b),
+ "c" = scale(c))
>
> # ABN with MLE
> mycache.mle <- buildScoreCache(data.df = mydf,
+ data.dists = mydists,
+ method = "mle",
+ max.parents = 2)
Loading required package: Matrix
> dag.mle <- mostProbable(score.cache = mycache.mle,
+ max.parents = 2)
Step1. completed max alpha_i(S) for all i and S
Total sets g(S) to be evaluated over: 8
> myfit.mle <- fitAbn(object = dag.mle,
+ method = "mle",
+ max.parents = 2)
> plot(myfit.mle)
>
> # ABN with Bayes
> mycache.bayes <- buildScoreCache(data.df = mydf,
+ data.dists = mydists,
+ method = "bayes",
+ max.parents = 2)
Error in library(p, character.only = TRUE) :
there is no package called 'INLA'
Calls: buildScoreCache ... buildScoreCache.bayes -> %do% -> <Anonymous> -> library
Execution halted
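One way to keep such examples from failing on machines without INLA is to guard them with a check for the package. The following is a minimal sketch, not the package's actual fix; has_pkg() is a hypothetical helper name.

```r
# Hypothetical helper: TRUE only if the namespace of `pkg` can be loaded.
has_pkg <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (has_pkg("INLA")) {
  # Examples that call buildScoreCache(..., method = "bayes") would go here.
  message("INLA is available: running the Bayesian examples.")
} else {
  message("INLA is not available: skipping the Bayesian examples.")
}
```

The same condition could be used in Rd examples or in a vignette chunk option, so that only the INLA-dependent parts are skipped.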
If max.parents is provided as a list, this will fail:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L523-L524
Catch it and/or resolve the max.parents list (e.g. when all items in the list are equal).
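Resolving the list could look roughly like the sketch below. resolve_max_parents() is a hypothetical helper, not an existing abn function; it only covers the "all items are equal" case mentioned above and stops with a clear message otherwise.

```r
# Hypothetical helper, not part of abn: collapse a per-node `max.parents`
# list to a single number when all entries agree, otherwise stop early.
resolve_max_parents <- function(max.parents) {
  if (!is.list(max.parents)) {
    return(max.parents)
  }
  vals <- unlist(max.parents, use.names = FALSE)
  if (!is.numeric(vals)) {
    stop("`max.parents` must contain numeric values only.")
  }
  if (length(unique(vals)) == 1L) {
    return(vals[1])
  }
  stop("Node-specific `max.parents` values that differ are not supported here.")
}

# All entries agree, so the list collapses to a single value.
resolve_max_parents(list(a = 2, b = 2, c = 2))
```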
Some C-level functions are mentioned here to silence R CMD check
but are not properly documented.
This fails only because no check for the combination of these arguments is implemented.
Check that defn.res and which.nodes do not mismatch, and keep them if they are consistent:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L517-L518
basically,
vignettes/precomile.R
. See #11 .pkgdown::build_site()
The counterpart of "/dev/null" on Windows is "nul".
https://stackoverflow.com/questions/4507312/how-to-redirect-stderr-to-null-in-cmd-exe
Currently, tests whose output is captured in "/dev/null" are omitted on Windows.
Consider something like this instead:
test_that("plot.abnDag() works.", {
  mydag <- createAbnDag(dag = ~a+b|a, data.df = data.frame("a"=1, "b"=1))
  if(.Platform$OS.type == "unix") {
    FILE <- "/dev/null"
  } else {
    FILE <- "nul"
  }
  capture.output({
    expect_no_error({
      plot(mydag)
    })
  },
  file = FILE)
})
Not sure this example works well...
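An alternative that avoids the platform switch is base R's nullfile(), which returns the name of the null device for the current platform (available since R 3.6.0). A minimal sketch outside of testthat:

```r
# nullfile() returns the platform's null device ("/dev/null" or "nul:").
null_dev <- nullfile()

# The printed text is discarded. With `file` set, capture.output()
# returns NULL invisibly instead of a character vector.
res <- capture.output(print("this text is discarded"), file = null_dev)
```

Inside a test, `file = nullfile()` could replace the FILE variable, provided the package can depend on R >= 3.6.0.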
Thanks, we see:
Size of tarball: 7142960 bytes
Please reduce to less than 5 MB.
irls_poisson_fast.cpp results in slightly different score values compared to the same model computed with glm.
Compare mycache.mle with modglm in test-build_score_cache_mle.R.
> mycache.mle$mlik
[1] -1418.438 -Inf
> logLik(modglm)
'log Lik.' -1410.645 (df=2)
Analogous for AIC and BIC scores.
I'm unsure if this variation in score values is expected.
This was temporarily fixed with an increased tolerance to pass the tests.
Double-check the IRLS Poisson Fast algorithm. Numerical overflow is not handled properly for large values of eta. It is unclear whether eta should ever be that large or whether this was only caused by a faulty test. If the latter, consider catching such cases properly upstream and investigate why glm did not raise a warning.
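As a reference point for such comparisons, the Poisson log-likelihood reported by glm() can be reproduced by hand from the fitted linear predictor eta, and the overflow threshold of exp() in double precision can be checked directly. This is a sketch on simulated data; the data and variable names are made up for illustration and are unrelated to the package tests.

```r
set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.3 * x))

# Reference fit with glm()
fit <- glm(y ~ x, family = poisson)

# Log-likelihood recomputed from the linear predictor eta
eta <- predict(fit, type = "link")
ll_manual <- sum(dpois(y, lambda = exp(eta), log = TRUE))

# Both values should agree up to numerical tolerance
all.equal(ll_manual, as.numeric(logLik(fit)))

# exp() overflows to Inf in double precision once eta exceeds about 709.78
exp(709)
exp(710)
```

A custom IRLS implementation that deviates from this reference by several log-likelihood units, as in the output above, probably differs in more than rounding.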
export fitted abn to .net file to be read by e.g. HUGIN GUI.
These might help:
the data field in .net file contains the CPT of the nodes.
alternatives to HUGIN (commercial):
they use .dot files.
Currently, not all control parameters are checked for eligibility.
Extend for build.control()
here:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/abn-internal.R#L650-L697
and extend for fit.control()
here:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/abn-internal.R#L766-L815
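Such eligibility checks could be organised as a table of per-parameter rules. The sketch below is only an illustration: check_control() is a hypothetical function, and the two parameter names and their rules are assumptions rather than the package's actual validation logic.

```r
# Hypothetical validator; the parameter names and rules are illustrative only.
check_control <- function(ctrl) {
  stopifnot(is.list(ctrl))
  is_scalar_num <- function(x) is.numeric(x) && length(x) == 1L && !is.na(x)
  rules <- list(
    max.iters = function(x) is_scalar_num(x) && x >= 1,
    epsabs    = function(x) is_scalar_num(x) && x > 0
  )
  # Only check the parameters that are actually present in `ctrl`.
  for (nm in intersect(names(rules), names(ctrl))) {
    if (!rules[[nm]](ctrl[[nm]])) {
      stop("Invalid value for control parameter `", nm, "`.")
    }
  }
  invisible(ctrl)
}

check_control(list(max.iters = 100, epsabs = 1e-7))
```

Adding a new parameter then only requires one more entry in `rules`.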
Please always write package names, software names and API (application
programming interface) names in single quotes in title and description.
e.g: --> 'INLA'
Please note that package names are case sensitive.
From case study zero of the old abn-homepage.
The for-loop comparing INLA, internal C Laplace and glm results shows an over/underflow warning originating from the Laplace calculations in node_binomial.c.
In several places (e.g. line 940), large numbers are exponentiated, which raises the overflow warning and yields Inf values that can cause issues downstream.
load(system.file("extdata", "QA_glm_case1_data.RData", package = "abn")) # or download from here: http://r-bayesian-networks.org/source/Rcode/QA_glm_case2.tar.gz
## 1. plot of raw differences, a wide range of values since both poisson, bin and gaus distributions used.
## vast majority as almost identical, but some are rather different
#plot(mycache.inla$mlik-mycache.c$mlik);
## 2. also look at % differences - gives a crude overview
## as 1. so suggests perhaps not just floating point rounding issue e.g. in log transforms
perc<-100*(mycache.c$mlik-mycache.inla$mlik)/mycache.c$mlik;
## 3. get all mliks which are adrift by more than 1%
bad<-which(abs(perc)>1);
## go through each and check for issues
##
mydat<-ex2.dag.data;## this data comes with abn see ?ex2.dag.data
mydat.std<-mydat;
## setup distribution list for each node
mydists<-list(b1="binomial",
g1="gaussian",
p1="poisson",
b2="binomial",
g2="gaussian",
p2="poisson",
b3="binomial",
g3="gaussian",
p3="poisson",
b4="binomial",
g4="gaussian",
p4="poisson",
b5="binomial",
g5="gaussian",
p5="poisson",
b6="binomial",
g6="gaussian",
p6="poisson"
);
## create standardised dataset for comparison with glm
for(i in 1:length(mydists)){if(mydists[[i]]=="gaussian"){## then std data for comparison with glm_case
mydat.std[,i]<-(mydat.std[,i]-mean(mydat.std[,i]))/sd(mydat.std[,i]);}
}
## create empty matrix which will be filled with nodes as needed
mydag<-matrix(rep(0,dim(mydat)[2]^2),ncol=dim(mydat)[2]);colnames(mydag)<-rownames(mydag)<-names(mydat);
## loop through each node which differed from INLA by at least 1% and compare with glm() modes
for(i in 1:length(bad)){
mydag[,]<-0;## reset
node<-mycache.c$child[bad[i]];pars<-mycache.c$node.defn[bad[i],];
form<-as.formula(paste(colnames(mydag)[node],"~",paste(colnames(mydag)[which(pars==1)],collapse="+",sep=""),sep=""));
family<-mydists[[node]];
mydag[node,]<-pars;## copy "bad" node into DAG
myres.c<-fitabn(dag.m=mydag,data.df=mydat,data.dists=mydists,max.mode.error=0,compute.fixed=TRUE);## use C
myres.inla<-fitabn(dag.m=mydag,data.df=mydat,data.dists=mydists,max.mode.error=100,compute.fixed=TRUE,n.grid=NULL,std.area=FALSE);## use INLA
myres.glm<-glm(form,data=mydat.std,family=family);
cat("################ bad=",i,"#################\n");
cat("\n# 1. glm()\n");print(coef(myres.glm));
cat("\n# 2. C\n");print(myres.c$modes[[node]]);
cat("\n# 3. INLA\n");print(myres.inla$modes[[node]]);
cat("\n###########################################\n");
}
The operation from line 940 appears in several locations in the code. These are often marked with an old note about the potential to overflow, and one place has a note about a workaround. Consider investigating this workaround further and checking whether the other parts of the code could be adapted accordingly, or whether a better strategy exists (the workaround does not seem to be a universal solution).
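Whether the operation at line 940 has exactly this form is an assumption. A common source of such overflows in binomial likelihoods is log(1 + exp(x)), which can be rewritten so that exp() is only ever applied to non-positive arguments. A sketch in R (the C code would need the analogous change):

```r
# Naive form: exp(x) overflows to Inf for large x.
naive_log1pexp <- function(x) log(1 + exp(x))

# Stable form: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|)),
# so exp() only sees arguments <= 0 and cannot overflow.
stable_log1pexp <- function(x) pmax(x, 0) + log1p(exp(-abs(x)))

naive_log1pexp(800)   # Inf, because exp(800) overflows
stable_log1pexp(800)  # 800
```

For moderate x both forms agree, so the rewrite should not change results where the naive version already works.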
We can potentially save considerable compute if we modify the regular runs of the fast pipeline so that a single job runs first and all other flavors run only if it does not fail.
We might even consider designing the fast pipeline to run only on a subset of flavors and postpone the extensive checks (i.e. on all combinations) to ongoing pull requests and commits to master.
Because there is no check that handles this situation properly, providing defn.res and which.nodes together results in an error. https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L590-L603
Resolve by checking that defn.res and which.nodes do not mismatch, and keep them if they are consistent.
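A consistency check along these lines might look as follows. This is a sketch under assumptions: check_defn_res() is a hypothetical helper, defn.res is assumed to be a list with a `children` vector of node indices, and which.nodes is assumed to be a vector of node indices.

```r
# Hypothetical helper: stop if `defn.res` refers to nodes outside `which.nodes`.
check_defn_res <- function(defn.res, which.nodes) {
  # Nothing to compare if either argument is missing.
  if (is.null(defn.res) || is.null(which.nodes)) {
    return(invisible(TRUE))
  }
  unknown <- setdiff(unique(defn.res[["children"]]), which.nodes)
  if (length(unknown) > 0) {
    stop("`defn.res` refers to nodes not in `which.nodes`: ",
         paste(unknown, collapse = ", "))
  }
  invisible(TRUE)
}

# Consistent: all children (1 and 2) are contained in which.nodes.
check_defn_res(list(children = c(1L, 1L, 2L)), which.nodes = 1:3)
```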
Think about:
Flavor: r-devel-linux-x86_64-debian-gcc
Check: re-building of vignette outputs, Result: ERROR
Error(s) in re-building vignettes:
...
--- re-building 'data_simulation.Rmd' using rmarkdown
Quitting from lines 29-58 [fit_model] (data_simulation.Rmd)
Error: processing vignette 'data_simulation.Rmd' failed with diagnostics:
there is no package called 'INLA'
--- failed re-building 'data_simulation.Rmd'
--- re-building 'mixed_effect_BN_model.Rmd' using rmarkdown
--- finished re-building 'mixed_effect_BN_model.Rmd'
--- re-building 'model_specification.Rmd' using rmarkdown
--- finished re-building 'model_specification.Rmd'
--- re-building 'multiprocessing.Rmd' using rmarkdown
Quitting from lines 88-130 [benchmarking] (multiprocessing.Rmd)
Error: processing vignette 'multiprocessing.Rmd' failed with diagnostics:
worker initialization failed: there is no package called 'INLA'
--- failed re-building 'multiprocessing.Rmd'
--- re-building 'paper.Rmd' using rmarkdown
--- finished re-building 'paper.Rmd'
--- re-building 'parameter_learning.Rmd' using rmarkdown
Quitting from lines 67-72 [unnamed-chunk-3] (parameter_learning.Rmd)
Error: processing vignette 'parameter_learning.Rmd' failed with diagnostics:
there is no package called 'INLA'
--- failed re-building 'parameter_learning.Rmd'
--- re-building 'quick_start_example.Rmd' using rmarkdown
--- finished re-building 'quick_start_example.Rmd'
--- re-building 'structure_learning.Rmd' using rmarkdown
--- finished re-building 'structure_learning.Rmd'
SUMMARY: processing the following files failed:
'data_simulation.Rmd' 'multiprocessing.Rmd' 'parameter_learning.Rmd'
Error: Vignette re-building failed.
Execution halted
We want to implement a robust testing and deployment pipeline.
The idea is that the creation of a new tag on the master branch triggers a CRAN submission, provided that our fast-running checks have passed. In this case we also want to start a slow run to monitor, among other things, memory leakage. If the slow run succeeds, the pipeline can create a new release from the tag.
x.x.x-rc
)-rc
):
Only with method = "bayes" can we set the maximal number of allowed parents individually per node.
### Generate data
# Set seed for reproducibility
set.seed(123)
# Number of groups
n_groups <- 5
# Number of observations per group
n_obs_per_group <- 100
# Total number of observations
n_obs <- n_groups * n_obs_per_group
# Simulate group effects
group <- factor(rep(1:n_groups, each = n_obs_per_group))
group_effects <- rnorm(n_groups)
# Simulate variables
G1 <- rnorm(n_obs) + group_effects[group]
B1 <- rbinom(n_obs, 1, plogis(group_effects[group]))
G2 <- 1.5 * B1 + 0.7 * G1 + rnorm(n_obs) + group_effects[group]
B2 <- rbinom(n_obs, 1, plogis(2 * G2 + group_effects[group]))
# Create data frame
data <- data.frame(group = group, G1 = G1, G2 = G2, B1 = factor(B1), B2 = factor(B2))
# Look at data
str(data)
summary(data)
######
# Reproduce issue
######
### method = "mle"
# OK: Build the score cache with 2 parents for each variable
score_cache <- buildScoreCache(data.df = data,
data.dists = list(G1 = "gaussian",
G2 = "gaussian",
B1 = "binomial",
B2 = "binomial"),
group.var = "group",
max.parents = 2,
method = "mle")
# BUG: Build the score cache with different number of parents for each variable
score_cache <- buildScoreCache(data.df = data,
data.dists = list(G1 = "gaussian",
G2 = "gaussian",
B1 = "binomial",
B2 = "binomial"),
group.var = "group",
max.parents = list(G1 = 0, G2 = 2, B1 = 0, B2 = 3),
method = "mle")
### method = "bayes"
# OK: Build the score cache with different number of parents for each variable
score_cache <- buildScoreCache(data.df = data,
data.dists = list(G1 = "gaussian",
G2 = "gaussian",
B1 = "binomial",
B2 = "binomial"),
group.var = "group",
max.parents = list(G1 = 0, G2 = 2, B1 = 0, B2 = 3),
method = "bayes")
Extend the checks for the combination of the cor.var, which.nodes and group.var arguments here: https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/build_score_cache.R#L446C1-L454C4
Print meaningful warnings/errors for the specific combinations.
CRAN has the package archived. Fix this with a new release.
This approach actually includes 3 types of tests:
- fast tests with testthat which run regularly
- fast tests that are CRAN-like which run on changes (and change requests to) the default branch
- slow tests that track memory usage
The first two are implemented (about to be - see furrer-lab/devel-abn#100 ), what remains is the tests that include the tracking of memory usage.
Originally posted by @j-i-l in #81
We want to run tests with valgrind enabled (what else?) if we have a release candidate.
Depending on what exactly we want to track, it might be enough to run R CMD check with --use-valgrind, in which case we could handle this by setting some variables in the existing GitHub action CRAN_checks.
This relates to #33
Decide if it is worth speeding up mostProbable().
For this example, it takes quite a while to run:
# get data
mydat <- ex5.dag.data[,-19] ## get the data - drop group variable
# Restrict DAG
banned<-matrix(c(
# 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b3
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b4
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b6
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g3
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g4
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g6
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g7
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g8
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g9
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g10
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g11
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # g12
),byrow=TRUE,ncol=18)
colnames(banned)<-rownames(banned)<-names(mydat)
retain<-matrix(c(
# 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b3
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b4
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # b6
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g3
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g4
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g6
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g7
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g8
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g9
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g10
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, # g11
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 # g12
),byrow=TRUE,ncol=18)
## again must set names
colnames(retain)<-rownames(retain)<-names(mydat)
# set distributions
mydists<-list(b1="binomial",
b2="binomial",
b3="binomial",
b4="binomial",
b5="binomial",
b6="binomial",
g1="gaussian",
g2="gaussian",
g3="gaussian",
g4="gaussian",
g5="gaussian",
g6="gaussian",
g7="gaussian",
g8="gaussian",
g9="gaussian",
g10="gaussian",
g11="gaussian",
g12="gaussian"
)
# Compute score cache
mycache.1par <- buildScoreCache(data.df=mydat,data.dists=mydists, max.parents=1,centre=TRUE)
# Estimate most probable DAG
mp.dag <- mostProbable(score.cache = mycache.1par)
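For readability, the two constraint matrices in this example could also be built from an all-zero template with named dimensions, setting only the non-zero entries. The sketch below assumes the node names b1..b6 and g1..g12 taken from the row comments above; it does not need the abn package.

```r
# Node names as given in the row comments of the example (assumed order).
nodes <- c(paste0("b", 1:6), paste0("g", 1:12))

# All-zero 18 x 18 template with row and column names.
empty <- matrix(0, nrow = length(nodes), ncol = length(nodes),
                dimnames = list(nodes, nodes))

# Ban the arcs between b1 and b2 in both directions.
banned <- empty
banned["b1", "b2"] <- 1
banned["b2", "b1"] <- 1

# No arcs are retained.
retain <- empty
```

This yields the same two matrices as the literal definitions, with less room for misplaced entries.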
Fix the following error message
Found the following (possibly) invalid URLs:
URL: http://aje.oxfordjournals.org/content/176/11/1051.abstract (moved to https://academic.oup.com/aje/article-abstract/176/11/1051/178588)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://aje.oxfordjournals.org/content/176/11/1051.full.pdf?keytype=ref&ijkey=zCJD2Zt88XaDYyY (moved to https://academic.oup.com/aje/article-pdf/176/11/1051/428801/kws183.pdf?keytype=ref&ijkey=zCJD2Zt88XaDYyY)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://download.springer.com/static/pdf/949/art%253A10.1186%252Fs12917-016-0649-0.pdf?originUrl=http%3A%2F%2Fbmcvetres.biomedcentral.com%2Farticle%2F10.1186%2Fs12917-016-0649-0&token2=exp=1455044551~acl=%2Fstatic%2Fpdf%2F949%2Fart%25253A10.1186%25252Fs12917-016-0649-0.pdf*~hmac=e04039a7400eefea35dc05635bccae1688e549b8b0eb36edc0b8fd72caba73fc
From: README.md
Status: 404
Message: Not Found
URL: http://mcmc-jags.sourceforge.net/ (moved to https://mcmc-jags.sourceforge.io/)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271186&_user=4429&_pii=S0167587711000341&_check=y&_origin=browseVolIssue&_zone=rslt_list_item&_coverDate=2011-06-15&wchp=dGLbVlS-zSkWb&md5=29522e1462a0ac05fe07c787a4cd3d0a&pid=1-s2.0-S0167587711000341-main.pdf
From: README.md
Status: Error
Message: Could not resolve host: pdn.sciencedirect.com
URL: http://web.cs.iastate.edu/~jtian/cs673/cs673_spring05/references/Friedman-Koller-2003.pdf (moved to https://faculty.sites.iastate.edu/jtian/)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://www.bioconductor.org/ (moved to https://www.bioconductor.org/)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html (moved to https://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://www.ete-online.com/content/10/1/4 (moved to https://link.springer.com/journal/12982)
From: README.md
Status: 301
Message: Moved Permanently
URL: http://www.r-inla.org/ (moved to https://www.r-inla.org/)
From: README.md
Status: 301
Message: Moved Permanently
URL: https://r-bayesian-networks.org/quick_start_example.html
From: inst/doc/paper.html
Status: Error
Message: schannel: SNI or certificate check failed: SEC_E_WRONG_PRINCIPAL (0x80090322) - Der Zielprinzipalname ist falsch.
For content that is 'Moved Permanently', please change http to https,
add trailing slashes, or replace the old by the new URL.
Found the following (possibly) invalid file URI:
URI: quick_start_example.md
From: README.md
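For the permanently moved URLs, the replacement can be scripted with a lookup table from old to new address. The following sketch uses three of the pairs reported above; update_urls() is a hypothetical helper and the two README lines are made-up sample text.

```r
# Old -> new URLs, taken from the check output above.
moved <- c(
  "http://mcmc-jags.sourceforge.net/" = "https://mcmc-jags.sourceforge.io/",
  "http://www.r-inla.org/"            = "https://www.r-inla.org/",
  "http://www.bioconductor.org/"      = "https://www.bioconductor.org/"
)

# Hypothetical helper: replace every old URL literally (no regex).
update_urls <- function(text, map) {
  for (old in names(map)) {
    text <- gsub(old, map[[old]], text, fixed = TRUE)
  }
  text
}

# Made-up sample lines standing in for README.md content.
readme <- c("See http://www.r-inla.org/ for INLA.",
            "JAGS: http://mcmc-jags.sourceforge.net/")
update_urls(readme, moved)
```

Because the match is a literal prefix, the bioconductor.org entry would also update the longer Rgraphviz URL. URLs that return 404 or no longer resolve still need to be replaced by hand.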
We have two different actions for running the CRAN-like tests and one just for getting the test coverage.
It is unclear to me why they exist separately.
Remove test-coverage.yml and include its last 3 steps in the CRAN tests.
Move the test pipeline from the private devel-abn repo to this public repo.
Unexecutable code in man/fitAbn.Rd.
Please make sure that all your examples are executable. I think you
forgot to comment out a line there:
This is a basic plot of some posterior densities. The algorithm used
for selecting
density points is quite straightforward, but it might result in a
sparse distribution.
Therefore, we also recompute the density over an evenly spaced grid
of 50 points between the two endpoints that had a minimum PDF at f=min.pdf.
Setting max.mode.error=0 forces the use of the internal C code.
buildScoreCache(mle, group.var) warning "nlminb message: false convergence (8)", "nlminb message: function evaluation limit reached without convergence (9)".
See:
https://stackoverflow.com/a/40049233/6098024
https://stat.ethz.ch/pipermail/r-help/2008-June/164797.html
https://stats.stackexchange.com/a/44884/152981)
latest
containers are updated each month
Add section to README (subsection of contributing) Development Environment and Testing
manipulate VarCov to bring in correct shape:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/fitabn_mle.R#L455-L457
Make glmm.score with an interface to Julia so that
i) performance increases
ii) rank deficiency is handled properly: https://juliastats.org/MixedModels.jl/dev/rankdeficiency/
- see example package: https://github.com/Non-Contradiction/ipoptjlr/blob/master/R/IPOPT.R
- using the R package JuliaCall.
Consider calling the loops in getmarginals() with foreach to gain a speed-up.
fit.control()
Catch fit when it is NULL and return a very low score:
https://github.com/furrer-lab/devel-abn/blob/40dc1e44269bac02adfc65ca9a4b489ebe017229/abn/R/fitabn_mle.R#L606-L609
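The idea can be sketched as a small wrapper around the score extraction. safe_loglik() is a hypothetical helper, and using the most negative finite double as the "very low score" is an assumption; the built-in cars data set only serves as a stand-in model.

```r
# Hypothetical helper: fall back to a very low score when the fit failed.
safe_loglik <- function(fit, floor = -.Machine$double.xmax) {
  if (is.null(fit)) {
    return(floor)
  }
  as.numeric(logLik(fit))
}

fit_ok <- lm(dist ~ speed, data = cars)  # stand-in for a successful fit
safe_loglik(fit_ok)  # regular log-likelihood
safe_loglik(NULL)    # fallback score for a failed fit
```

A finite floor keeps later comparisons and sums well-defined, which -Inf would not always do.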
Found the following (possibly) invalid file URI:
URI: quick_start_example.md
From: README.md
Please omit the redundant "The abn R package is a powerful tool for" from the Description field.
Please single quote software names with straight (rather than directed)
single quotes in the Description field as in 'abn'.
Please fix and resubmit.
Get the .tar.gz from the build command as an artifact. This eases the CRAN submission.
In addition, the software associated with your submission must:
In addition, JOSS requires that software should be
Co-publication of science, methods, and software:
CoI Policy:
What should my paper contain?
Given this format, a “full length” paper is not permitted, and software documentation such as API (Application Programming Interface) functionality should not be in the paper and instead should be outlined in the software documentation.
Check: examples, Result: NOTE
Examples with CPU (user + system) or elapsed time > 10s
user system elapsed
buildScoreCache 8.67 1.61 10.28
We might not always need to have the quick tests running on every commit to a branch.
When working on the documentation or, as we do now, on the paper we are not interested in the tests.
Therefore it should be easy to skip the tests.
noT
then the quick tests do not run at all on this branch
noT
then for this commit the tests are skipped
Dear Matteo,
Included is a patch from the latest version to help scale building the score cache for the mle option. I'm using a "Sparse Candidate" type algorithm, so the number
of possible parents is normally quite constrained, in the region of 10s. This algorithm also needs to be able to check the scoring on adding
a single node, so I've had to make max.parents per node (I'm not sure why it was forbidden before). I've tested this on features running to
1000s and it seems to work quite well and replicates the previous results.
I also have code that scales the hill climbing algorithm to 1000s of variables, in R, but this is missing some of the functionality of the C
code, so I won't offer it yet.
Any questions, please get in touch!
Many thanks,
Rónán
On 15 Nov 2023, at 16:57, Delucchi Matteo [xxx] wrote:
Dear Ronan,
Thank you for your interest in our abn package and for taking the time to provide feedback.
We currently host the code on our institute’s GitLab server, which can be found at this link: https://git.math.uzh.ch/mdeluc/abn
I greatly appreciate your contribution towards improving the scalability of the package. I would happily review your patch and consider incorporating it for the next release.
Please don’t hesitate to reach out if you encounter any issues or have further questions.
Thank you again for your feedback!
Best regards,
Matteo
From: Ronan
Subject: abn R package
Dear Mr Delucchi,
I've been experimenting with the abn package and found that scaling up to large numbers of nodes was causing issues
with the code setting up the cache structure, specifically in buildScoreCache.mle where banned possibilities are filtered
out, was causing runtime to grow perhaps quadratically. I've implemented a fix that means the code can now scale to
larger examples and I'm wondering is there a way to incorporate this into the mainline of your package? I haven't seen a
github repository, but could send a patch etc.
Many thanks,
Ronan
diff --git a/R/build_score_cache_mle.R b/R/build_score_cache_mle.R
index 6e3e650..5b03b3f 100755
--- a/R/build_score_cache_mle.R
+++ b/R/build_score_cache_mle.R
@@ -377,6 +377,9 @@ buildScoreCache.mle <-
############################## Function to create the cache
+ if ( length(max.parents) == 1 ) {
+ max.parents <- rep(max.parents, nvars)
+ }
if (!is.null(defn.res)) {
max.parents <- max(apply(defn.res[["node.defn"]], 1, sum))
@@ -392,83 +395,64 @@ buildScoreCache.mle <-
return(v)
}
- node.defn <- matrix(data = as.integer(0), nrow = 1L, ncol = nvars)
- children <- 1
+ ## Generate all possible bit patterns for n variables, with a maximum of m 1s
+ generateBitPatterns = function(n, m) {
+ z <- rep(0,n)
+ do.call(rbind, lapply(0:m, function(i) t(apply(combn(1:n,i), 2, function(k) {z[k]=1;z}))))
+ }
- for (j in 1:nvars) {
- if (j != 1) {
- node.defn <- rbind(node.defn, matrix(data = as.integer(0),
- nrow = 1L, ncol = nvars))
- children <- cbind(children, j)
- }
- # node.defn <- rbind(node.defn,matrix(data = 0,nrow = 1,ncol = n))
+ # Function to generate all possible combinations of parents
+ filteredCombinations = function(x, m, bannedParents, retainedParents) {
+ # These are the parents that cannot change
+ fixedParents = bannedParents | retainedParents | (fun.return(x, length(x) + 2) + 1) %% 2
+ # These are the parents that can change
+ parentPossibleChoices = which(fixedParents == 0)
+ numPossibleChoices = length(parentPossibleChoices)
+ numRetainedParents = sum(retainedParents)
+
+ # Generate all possible combinations of parents, taking account of banned, retained and maximum number of parents
+ parentChoices = generateBitPatterns(numPossibleChoices, min(m-numRetainedParents, numPossibleChoices)) == 1
+ output = t(apply(parentChoices, 1, function(pc) {
+ combinedRow = 1L*(retainedParents | fun.return(parentPossibleChoices[pc], length(x) + 2))
+ combinedRow
+ }))
+ output
+ }
+
+ children <- matrix(nrow=1, ncol=0)
+ node.defn.list = list()
+ for (j in 1:nvars) {
if(is.list(max.parents)){
stop("ISSUE: `max.parents` as list is not yet implemented further down here. Try with a single numeric value as max.parents instead.")
if(!is.null(which.nodes)){
stop("ISSUE: `max.parents` as list in combination with `which.nodes` is not yet implemented further down here. Try with single numeric as max.parents instead.")
}
- } else if (is.numeric(max.parents) && length(max.parents)>1){
- if (length(unique(max.parents)) == 1){
- max.parents <- unique(max.parents)
- } else {
- stop("ISSUE: `max.parents` with node specific values that are not all the same, is not yet implemented further down here.")
- }
- }
-
- if(max.parents == nvars){
- max.parents <- max.parents-1
- warning(paste("`max.par` == no. of variables. I set it to (no. of variables - 1)=", max.parents)) #NOTE: This might cause differences to method="bayes"!
}
- for (i in 1:(max.parents)) {
- tmp <- t(combn(x = (nvars - 1), m = i, FUN = fun.return, n = nvars, simplify = TRUE))
- tmp <- t(apply(X = tmp, MARGIN = 1, FUN = function(x) append(x = x, values = 0, after = j - 1)))
-
- node.defn <- rbind(node.defn, tmp)
-
- # children position
- children <- cbind(children, t(rep(j, length(tmp[, 1]))))
- }
+ # The parents that are banned and retained for node j
+ bannedParents = dag.banned[j, ]
+ retainedParents = dag.retained[j, ]
+ # All possible parents for node j, which is all nodes except j
+ parentChoice = c(seq.int(from=1, length.out=j-1), seq.int(from=j+1, length.out=nvars-j))
+ # How many parents we are keeping for node j
+ numRetainedParents = sum(retainedParents)
+ # The maximum number of parents for node j
+ m = max.parents[j]
+
+ # Generate all possible combinations of parents for node j
+ tmp <- filteredCombinations(x = parentChoice, m=m, bannedParents=bannedParents, retainedParents=retainedParents)
+ # We need a sparse matrix here to deal with large numbers of variables, otherwise memory usage if very high.
+ tmp2 = Matrix(tmp, sparse = TRUE)
+ node.defn.list[[length(node.defn.list) + 1]] <- tmp2
+ children <- cbind(children, t(rep(j, length(tmp2[, 1]))))
}
- # children <- rowSums(node.defn)
+ node.defn = do.call(rbind, node.defn.list)
colnames(node.defn) <- colnames(data.df)
- ## Coerce numeric matrix into integer matrix !!!
- node.defn <- apply(node.defn, c(1, 2), function(x) {
- (as.integer(x))
- })
-
children <- as.integer(children)
# node.defn_ <- node.defn
- ## DAG RETAIN/BANNED
- for (i in 1:nvars) {
- for (j in 1:nvars) {
-
- ## DAG RETAIN
- if (dag.retained[i, j] != 0) {
- tmp.indices <- which(children == i & node.defn[, j] == 0)
-
- if (length(tmp.indices) != 0) {
- node.defn <- node.defn[-tmp.indices, ]
- children <- children[-tmp.indices]
- }
- }
-
- ## DAG BANNED
- if (dag.banned[i, j] != 0) {
- tmp.indices <- which(children == i & node.defn[, j] == 1)
-
- if (length(tmp.indices) != 0) {
- node.defn <- node.defn[-tmp.indices, ]
- children <- children[-tmp.indices]
- }
- }
-
- }
- }
-
mycache <- list(children = as.integer(children), node.defn = (node.defn))
###------------------------------###
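To see what the bit-pattern enumeration in the patch produces, here is a stand-alone re-implementation for small inputs. It follows the same idea (all 0/1 vectors of length n with at most m ones) but is not the patch's exact code; generate_bit_patterns() is a name chosen for this sketch.

```r
# All 0/1 row vectors of length n with at most m ones.
generate_bit_patterns <- function(n, m) {
  m <- min(m, n)
  rows <- lapply(0:m, function(i) {
    # The pattern without any parent is handled explicitly.
    if (i == 0) {
      return(matrix(0L, nrow = 1, ncol = n))
    }
    idx <- combn(n, i)  # one column per choice of i positions
    t(apply(idx, 2, function(k) {
      z <- integer(n)
      z[k] <- 1L
      z
    }))
  })
  do.call(rbind, rows)
}

# 3 candidate parents, at most 2 of them: 1 + 3 + 3 = 7 patterns.
patterns <- generate_bit_patterns(3, 2)
patterns
```

The number of rows is the sum of choose(n, i) for i = 0..m, which is why restricting the candidate parents per node keeps the cache small.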
Prediction of a fitted BN can be achieved in several ways.
new branch in the public abn repo where we build a new version of the package from scratch
This is related to #9
We include sanity checks on URLs/URIs in the testing procedure, also because CRAN does the same when a package is submitted.
The check can be performed with https://github.com/r-lib/urlchecker which might even update permanent redirects (301s).
As such sanity checks are generally relevant, the installation of https://github.com/r-lib/urlchecker should already happen in the testing container. This issue therefore depends on the resolution of furrer-lab/r-containers#21
fitAbn and buildScoreCache (both "mle"): Collinearity is only addressed for binomial variables. Extend to all distributions.
Go through all examples. Run those that might require INLA only if INLA is available.
Flavor: r-devel-linux-x86_64-debian-gcc
Check: package dependencies, Result: NOTE
Package suggested but not available for checking: 'INLA'
Flavor: r-devel-linux-x86_64-debian-gcc
Check: examples, Result: ERROR
Running examples in 'abn-Ex.R' failed
The error most likely occurred in:
> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: print.abnCache
> ### Title: Print objects of class 'abnCache'
> ### Aliases: print.abnCache
>
> ### ** Examples
>
> ## Subset of the build-in dataset, see ?ex0.dag.data
> mydat <- ex0.dag.data[,c("b1","b2","g1","g2","b3","g3")] ## take a subset of cols
>
> ## setup distribution list for each node
> mydists <- list(b1="binomial", b2="binomial", g1="gaussian",
+ g2="gaussian", b3="binomial", g3="gaussian")
>
> # Structural constraints
> # ban arc from b2 to b1
> # always retain arc from g2 to g1
>
> ## parent limits
> max.par <- list("b1"=2, "b2"=2, "g1"=2, "g2"=2, "b3"=2, "g3"=2)
>
> ## now build the cache of pre-computed scores accordingly to the structural constraints
>
> res.c <- buildScoreCache(data.df=mydat, data.dists=mydists,
+ dag.banned= ~b1|b2, dag.retained= ~g1|g2, max.parents=max.par)
Error in library(p, character.only = TRUE) :
there is no package called 'INLA'
Calls: buildScoreCache ... buildScoreCache.bayes -> %do% -> <Anonymous> -> library
Execution halted
Currently we have multiple locations where we need to set the version manually (DESCRIPTION, News.md, configure and configure.ac, others?) in addition to the version we set via git tag.
The goal would be to streamline the process of bumping the version, ideally designating one source for the version and have all other mentions be generated automatically.
As @matteodelucchi pointed out, usethis::use_version() might be a solution.
If no existing implementation suits our needs, we might also implement this via templating (e.g. with https://github.com/davidchall/jinjar/).
In addition to streamlining the version-bumping process, we might also consider adhering to the semver versioning scheme.
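If DESCRIPTION were designated as the single source, other files could read the version from it instead of repeating it. A minimal base R sketch: read_version() and bump_patch() are hypothetical helpers, the DESCRIPTION file is a temporary stand-in, and a three-part major.minor.patch version is assumed.

```r
# Temporary stand-in for a package DESCRIPTION file.
desc <- tempfile("DESCRIPTION")
writeLines(c("Package: demo", "Version: 1.2.3"), desc)

# Hypothetical helper: read the Version field as a package_version object.
read_version <- function(path) {
  package_version(read.dcf(path, fields = "Version")[1, 1])
}

# Hypothetical helper: increase the third (patch) component by one.
bump_patch <- function(v) {
  parts <- unclass(v)[[1]]
  parts[3] <- parts[3] + 1L
  package_version(paste(parts, collapse = "."))
}

v <- read_version(desc)
as.character(v)              # "1.2.3"
as.character(bump_patch(v))  # "1.2.4"
```

Generated files such as configure.ac could then be rendered from this value by a template step.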