The missSBM from GrossSBM

Go parallel

inferSBM should exploit multicore computing if possible (either with parallel or future).
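A minimal sketch of the idea with the parallel package, assuming the fits for different numbers of blocks are independent of each other. Here fit_one is a hypothetical placeholder for a single-model fit, not a missSBM function, and vBlocks is an illustrative name. Forking through mclapply is not available on Windows, hence the fallback to a single core.

```r
library(parallel)

# Hypothetical placeholder standing in for one model fit with Q blocks.
fit_one <- function(Q) {
  list(nBlocks = Q, score = -Q^2)
}

vBlocks <- 1:5
# mclapply relies on forking, which Windows does not support.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
fits <- mclapply(vBlocks, fit_one, mc.cores = n_cores)

# Collect one score per fitted model, in the order of vBlocks.
scores <- vapply(fits, function(f) f$score, numeric(1))
```

The same loop could be written with future (for instance through a future-aware lapply), which would let the user choose the backend.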

TASK: for testing network samplings, we need the theoretical expected sampling rate

Consider the following piece of code for sampling in a 300-node SBM with various sampling schemes.

## SBM parameters
N <- 300
Q <- 3
alpha <- rep(1,Q)/Q                     # mixture parameter
pi <- diag(.45,Q) + .05                 # connectivity matrix
directed <- FALSE
# Draw a SBM model (Bernoulli, undirected)
mySBM <- simulateSBM(N, alpha, pi, directed)
A <- mySBM$adjacencyMatrix

## network samplings
dyad  <- samplingSBM(A, "dyad", parameters = .1)
node  <- samplingSBM(A, "node", parameters = .1)
block <- samplingSBM(A, "block", parameters = c(.1, .2, .7), clusters = mySBM$memberships)
double_standard <- samplingSBM(A, "double_standard", parameters = c(0.1, 0.5))
degree <- samplingSBM(A, "degree", parameters = c(0.01, 0.01))
snowball <- samplingSBM(A, "snowball", parameters = .3)

To test whether each sampling design returns a network whose proportion of NA matches what is expected, I need the theoretical sampling rate for each sampling design.

For instance, for dyad sampling, the rate is simply the value of the parameter (0.1).

I need the theoretical expectation for the other ones as a function of the vector of parameters.
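As a starting point, here is a minimal sketch of closed-form rates for four of the designs, in base R. It rests on assumptions that are not confirmed by the package code: node sampling observes every dyad incident to a sampled node; "block" sampling is node sampling with a block-specific probability; the "double_standard" parameters are the probabilities of observing an absent dyad and a present dyad, in that order. The helper expected_rate is hypothetical. The rate here is the expected proportion of observed dyads, so the expected proportion of NA is one minus it. No closed form is attempted for degree and snowball sampling.

```r
# Hypothetical helper (not part of missSBM): expected proportion of observed dyads.
expected_rate <- function(design, parameters, alpha = NULL, pi = NULL) {
  switch(design,
    # each dyad is observed independently with the given probability
    dyad = parameters,
    # assumption: a dyad is observed as soon as one of its two nodes is sampled
    node = 1 - (1 - parameters)^2,
    # assumption: a node in block q is sampled with probability parameters[q]
    block = 1 - (1 - sum(alpha * parameters))^2,
    # assumption: parameters = c(P(observed | no edge), P(observed | edge))
    double_standard = {
      density <- drop(t(alpha) %*% pi %*% alpha)  # expected edge density of the SBM
      parameters[1] * (1 - density) + parameters[2] * density
    },
    stop("no closed form sketched for design: ", design)
  )
}

Q <- 3
alpha <- rep(1, Q) / Q
pi <- diag(.45, Q) + .05
rates <- c(
  dyad            = expected_rate("dyad", .1),
  node            = expected_rate("node", .1),
  block           = expected_rate("block", c(.1, .2, .7), alpha = alpha),
  double_standard = expected_rate("double_standard", c(0.1, 0.5), alpha = alpha, pi = pi)
)
```

Under these assumptions and with the parameters of the code above, the formulas give 0.1, 0.19, 5/9 and 0.18 respectively, the edge density of this SBM being 0.2.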

Error: netSampling %in% available_samplings is not TRUE

This occurs when compiling the JSS paper with knitr.

Code coverage

We should use covr, as a complement to testthat (see #2).

add S3 methods for missSBM-fit (in particular, predict and summary)

Required by JSS pre-reviewing

bug: bad estimation of glm coefficients in missSBM

See e.g., test_consistency:

fittedSBM$covarParam
fittedSampling$parameters

Handle covariates

The networkSampling_fit class should be split into two subclasses, with and without covariates.

The same applies to the class networkSampling_sampler.

See the structure of SBM_fit for a reference.

Estimation covariate effect

See vignette for an astonishing estimate for the covariate effect... (comparison with SBM)

SBM_fit_covariates only needs covarArray

So we should never store either covarMatrix or covarSimilarity in it.

Bug : Warnings smoothing

Recurrent warnings appear while smoothing the ICL curve with degree sampling: "In log(1 - prob) : NaNs produced"
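In R, log(1 - prob) produces NaN with this warning whenever prob exceeds 1. One possible guard, offered as an assumption about the cause rather than a diagnosis of the package code, is to clamp probabilities to the open interval (0, 1) before taking logs. clamp_prob is a hypothetical helper.

```r
# Hypothetical helper: force probabilities strictly inside (0, 1).
clamp_prob <- function(prob, eps = .Machine$double.eps) {
  pmin(pmax(prob, eps), 1 - eps)
}

prob <- c(0, 0.3, 1, 1 + 1e-12)   # the last value would give NaN in log(1 - prob)
safe_log <- log(1 - clamp_prob(prob))
```

Clamping only silences the symptom; if probabilities above 1 come from the degree sampling parameters themselves, the underlying estimate still deserves a look.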

use formula to specify our model

Once covariates are available, maybe we could use network ~ covariates to specify our model.

Also think about a smart way to specify the sampling model.

Find a small network data set to illustrate the package

This can be as simple as the karate club network with some randomly sampled edges and dyads.
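A minimal base R sketch of the masking step, assuming an undirected network and dyads hidden independently at random. mask_dyads is a hypothetical helper, and the matrix below is a random stand-in with the same number of nodes as the karate club network; the real adjacency matrix could be obtained elsewhere, for instance from igraph.

```r
# Hypothetical helper: keep each dyad with probability `rate`, hide the others as NA.
mask_dyads <- function(A, rate) {
  stopifnot(isSymmetric(A))
  upper <- which(upper.tri(A))
  hidden <- upper[runif(length(upper)) > rate]
  A[hidden] <- NA
  A[lower.tri(A)] <- t(A)[lower.tri(A)]   # mirror so the result stays symmetric
  A
}

set.seed(1)
N <- 34                                   # size of the karate club network
A <- matrix(0L, N, N)
A[upper.tri(A)] <- rbinom(N * (N - 1) / 2, 1, .15)
A <- A + t(A)                             # random symmetric stand-in, zero diagonal
A_obs <- mask_dyads(A, rate = .5)
prop_hidden <- mean(is.na(A_obs[upper.tri(A_obs)]))  # empirical proportion of hidden dyads
```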

Documentation

All functions (even internal ones) should have basic documentation to help other developers understand what's going on...

Vignette

We should start to write a basic vignette.

Bug: internal clustering init_clustering takes a matrix of covariates

The correct argument should be an array of covariates.

ICL comparisons

See file inst/lostinICL.R for a reproducible example.

ICL inconsistency when comparing with a direct computation (I mean the ICL computed in the class SBMfit).

Task: Add a class for fitting dyad and node sampling in the presence of covariates

Probably in file networkSampling_fit-Class.R

Initial imputation in missSBM-fit is probably not adapted to cases with covariates

Maybe related to #21: the following code for the first estimation of pi in missSBM-fit is relevant for the problem without covariates (when private$pi is indeed the mathematical connectivity matrix between blocks):

https://github.com/jchiquet/missSBM/blob/18c3959d60ae4cf12039e37492987e89a8702253/R/missingSBM_fit.R#L32-L37

However, when private$pi represents gamma, as is the case for the model with covariates, we should adapt this first initialization and imputation. It has been shown to be crucial in order to properly reproduce the results found with @TabouyT's implementation.

So @TabouyT, how do you initialize the pi/gamma in the models with covariates?

MAR case: perform optimization only on observed values

At this stage of the development, we perform imputation even in the MAR case to keep the same framework, whatever the underlying sampling process (MAR or NMAR).

Not only would it save time to perform the inference only on the observed part of the surrogate log-likelihood, but it would also be more correct in the LMAR case with covariates.

I will create a branch for that, as it changes the interface with the C++ code a bit, as well as the structure of the R6 object. An elegant solution would be to handle NA in the C++ code, by looping only over the non-NA values of the network adjacency matrix.
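To make the "observed part only" idea concrete, here is a small sketch in R rather than C++, for readability. It is a simplification under stated assumptions: an undirected Bernoulli SBM, hard block memberships given as a one-hot matrix Z (the actual inference uses variational probabilities), and no covariates. loglik_observed is a hypothetical helper.

```r
# Hypothetical helper: Bernoulli SBM log-likelihood over observed dyads only
# (undirected network, hard memberships given as a one-hot N x Q matrix Z).
loglik_observed <- function(A, Z, pi) {
  P <- Z %*% pi %*% t(Z)               # connection probability of every dyad
  obs <- upper.tri(A) & !is.na(A)      # each observed dyad counted once
  sum(A[obs] * log(P[obs]) + (1 - A[obs]) * log(1 - P[obs]))
}

Z <- diag(2)[c(1, 1, 2, 2), ]          # 4 nodes, 2 blocks
pi <- matrix(c(.5, .1, .1, .5), 2, 2)
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 1
A[3, 4] <- A[4, 3] <- NA               # unobserved dyad, simply skipped
ll <- loglik_observed(A, Z, pi)
```

No imputation is needed here: the NA entries never enter the sum, which is the behavior a C++ loop restricted to non-NA values would reproduce.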

Define a S3 class with basic methods to manipulate the output of inferSBM

Also related to #8

Bug: smooth()

Error: netSampling %in% available_samplings is not TRUE

when sampling = "block-node", same code

Add basic show/print method for all R6 classes

At this stage, printing the result of inferSBM is not informative at all...

Symmetric adj matrices?

If the adjacency matrix is symmetric, does missSBM use the symmetric version of the likelihood?

Issues with missSBM::smooth

The smooth function doesn't always improve the estimation; it sometimes even makes it worse, which should not happen. See the example with R code in the JSS paper (.Rnw code) and the ggplot attached to this issue.

Rplot01.pdf

Neither 'samplingSBM' nor 'inferSBM' is a good function name

Indeed, sampling can be performed on a network even if the network was not drawn from an SBM.

How to fit blockmodels with a fixed (given) number of clusters?

This would save time in the tests; so far I have only managed to fit "up to" a required number of clusters 👍

BM <- blockmodels::BM_bernoulli_covariates("SBM_sym", A, covariates_BM, verbosity = 0, explore_max = Q, plotting = "", ncores = 1)

Bug when defining an SBM object

The following code does not work properly:

A <- matrix(rbinom(100, 1, .2), 10, 10)
type <- "simple"
mySBM <- SimpleSBM_fit$new(A, "poisson", directed = FALSE)

Add unit tests for everything

This is important and should avoid wasting a lot of time when code has not been checked for a while. It is also important for targeting a release on CRAN and for preparing the end of Timothée's PhD.

Class networkSamplingCovariates is not well defined

Moreover, it does not make sense since we should not store the array of covariates in there.

Add a summary S3 method for sampledNetwork

With basic output like: number of nodes, dyads, sampling rates.
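A rough sketch of what such a method could report, assuming an undirected network stored as an adjacency matrix with NA for unobserved dyads. The class name sampledNetworkSketch and the constructor below are hypothetical stand-ins (a plain S3 list, not the package's actual object).

```r
# Hypothetical constructor: wraps an adjacency matrix (NA = unobserved dyad).
as_sampled_sketch <- function(A) {
  structure(list(adjacencyMatrix = A), class = "sampledNetworkSketch")
}

# S3 summary method returning the basic quantities as a named list.
summary.sampledNetworkSketch <- function(object, ...) {
  A <- object$adjacencyMatrix
  dyads <- upper.tri(A)                       # undirected: count each dyad once
  list(
    nNodes       = nrow(A),
    nDyads       = sum(dyads),
    samplingRate = mean(!is.na(A[dyads]))     # proportion of observed dyads
  )
}

A <- matrix(0, 3, 3)
A[1, 2] <- A[2, 1] <- NA
s <- summary(as_sampled_sketch(A))
```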

Required for resubmission to JSS

Covariates for network / node sampling

See inst/covariates.R, in particular the case of node sampling when we want to provide a matrix for the covariates on dyads.

Issues with next version of ggplot2

Hi

We are preparing the next version of ggplot2 and our reverse dependency tests show an issue with missSBM. The issue revolves around tighter checks of theme settings in facet rendering and means that free scales in facets will error if the theme has a specified aspect ratio. This change results in an error when running the examples in the estimateMissSBM documentation.

The next release is available in the v3.3.4-rc branch if you need to test against it. We plan on releasing in the next week.

best
Thomas

SetModel does not work for the model with greatest index

Hi Julien !

Using the SBM and LBM functions, we would like to explore storedModels using the setModel method. However, setModel does not allow exploring the last model, the one with the highest Index value. Is it possible to correct that?

Also, I have a question about the storedModels method. Are all models listed there, or are more models stored but not accessible through storedModels?

Virginie and Benoit