philchalmers / simdesign Goto Github PK

View Code? Open in Web Editor NEW

57.0 7.0 15.0 98.94 MB

Structure for organizing Monte Carlo simulations in R

R 99.87% Makefile 0.12% Shell 0.01%

r simulation simulation-framework monte-carlo-simulation

simdesign's Introduction

SimDesign

Structure for Organizing Monte Carlo Simulation Designs

Installation

To install the latest stable version of the package from CRAN, please use the following in your R console:

install.packages('SimDesign')

To install the Github version of the package with devtools, type the following (assuming you have already installed the devtools package from CRAN).

library('devtools')
install_github('philchalmers/SimDesign')

Getting started

For a discription pertaining to the philosophy and general workflow of the package it is helpful to first read through the following: Chalmers, R. Philip, Adkins, Mark C. (2020) Writing Effective and Reliable Monte Carlo Simulations with the SimDesign Package, The Quantitative Methods for Psychology, 16(4), 248-280. doi: 10.20982/tqmp.16.4.p248

Coding examples found within this article range from relatively simple (e.g., a re-implementation of one of Hallgren's (2013) simulation study examples, as well as possible extensions to the simulation design) to more advanced real-world simulation experiments (e.g., Flora and Curran's (2004) simulation study). For additional information and instructions about how to use the package please refer to the examples in the associated Github wiki.

simdesign's People

Contributors

Stargazers

Watchers

Forkers

sjrushford mattsigal pedroliman guhjy oguzhanogreden anhnguyendepocen liuhuan90123 cymgen30 gloriacolmenares marklhc phdtai erge324 datakosta tangchen bsiepe

simdesign's Issues

aggregate_simulations and summary results

aggregate_simulations works currently by aggregating all results from many results files. This may not work well with very large simulations. For example, I have a simulation that produced around 300GB of replication data. (In this case, computation was expensive, and therefore, it made sense to save a lot of different statistics from each replication.) Loading these results with aggregate_simulations would not work because the computer runs out of memory.

I also propose adding aggregate_simulations_summaries function, which would run the summarise function from dplyr to each design before aggregating the results. This would allow aggregating the simulation results as summary statistics instead of raw statistics considerably reducing the memory requirements.

Seeds for each replication inside a condition?

Hi Phil,

First of all, thanks for your wonderful package! I find it really useful and complete for carrying out Monte Carlo studies.

One of the situations that I face commonly is to try to reproduce a particular replication or a particular condition. Ideally, I would like to do that by prefixing the seeds for all cases and then generate the same data. However seed in runSimulation seems only to fix the seed for conditions, that is, blocks of replications:

a vector of integers to be used for reproducibility. The length of the vector must be equal the number of rows in design. This argument calls set.seed or clusterSetRNGStream for each condition, respectively, but will not be run when MPI = TRUE. Default is NULL, indicating that no seed is set for each condition

That means that if I want to reproduce replicate 99 in a condition with 100 replicates is not going to be an easy task, as there will be 98 calls to random functions before the one in which I am interested. And in parallel I guess it will become messier.

My question (or feature request) is whether there is transparent way of fixing seeds for each of the replicates and conditions. So if I have 100 replicates and 5 conditions, fix a seed vector for each of the 500 cases.

Thank you!

Errors in analysis functions are catched and passed as input to `Summarise` when there are multiple analysis functions

Hi @philchalmers, thank you so much for adding the capacity to handle multiple analysis functions, which I think is really a game-changer as I usually need to compare multiple methods applied to the same data set! As I'm trying it out, however, I run into an issue in that when one of the analysis functions throw an error, it does not stop the program or regenerate a new data set, like what it typically does when there is a single analysis function. Instead, it seems the error is captured using tryCatch() (or try()), and is included in the results object passed to the Summarise function. Here's a reproducible example that I adapted from this vignette:

library(SimDesign)
# SimFunctions(comments=FALSE)

Design <- createDesign(N = c(10, 20, 30))

Generate <- function(condition, fixed_objects = NULL) {
    ret <- with(condition, rnorm(N))
    ret
}

Analyse.a1 <- function(condition, dat, fixed_objects = NULL) {
    whc <- sample(c(0, 1, 2, 3), 1, prob = c(.7, .20, .05, .05))
    if (whc == 0) {
        ret <- mean(dat)
    } else if (whc == 1) {
        ret <- t.test() # missing arguments
    } else if (whc == 2) {
        ret <- t.test("invalid") # invalid arguments
    } else if (whc == 3) {
        # throw error manually
        stop("Manual error thrown")
    }
    # manual warnings
    if (sample(c(TRUE, FALSE), 1, prob = c(.1, .9))) {
        warning("This warning happens rarely")
    }
    if (sample(c(TRUE, FALSE), 1, prob = c(.5, .5))) {
        warning("This warning happens much more often")
    }
    ret
}

Analyse.a2 <- function(condition, dat, fixed_objects = NULL) {
    ret <- median(dat)
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    ret <- c(bias = bias(results, 0))
    ret
}

result <- runSimulation(Design,
                        replications = 10, seed = rep(103, 3),
                        generate = Generate, analyse = list(a1 = Analyse.a1, a2 = Analyse.a2), summarise = Summarise
)
#> 
#> Design row: 1/3;   Started: Tue Aug 17 11:32:50 2021;   Total elapsed time: 0.00s
#> Error: Summarise() should not throw errors. Message was:
#>     Error in t(estimate) - parameter : 
#>   non-numeric argument to binary operator

^{Created on 2021-08-17 by the reprex package (v2.0.1)}

I think this is not the intended behavior but maybe I'm mistaken. Would you mind taking a look? Thanks so much.

Mark

Code in SimDesign Package Won't Run

results <- runSimulation(Design, replications = 1000, generate=Generate,
analyse=Analyse, summarise=Summarise)
Returns error message:
Error: Function arguments for analyse are not correct!!!

Error arises for all your examples.

Using latest version of RStudio and R 3.1.3 on windows 10 computer. First time I have ever had sample code in a Package fail.

`SimBoot` always uses CI = 0.99

Hi Phil,

Thank you for making the great SimDesign package. I'm exploring the SimBoot function but noticed that currently the argument CI did not get passed to the function, as the bootstrap CI always has 99% confidence level. Should be an easy fix for this. Thank you!

Best,

Mark

For-loop vs SimDesign issues

Hi, first I would like to congratulate with your very nice package. I don't know if I will use it in my simulations because I'm a long time user of the for-loop approach. However, your approach is very elegant.
I was reading your note at http://philchalmers.github.io/SimDesign/html/12-What_not_2_do.html
and I guess your claim about the slow for-loop approach is mainly related to the fact that the matrix d is initialised as a NULL object and then updated each time by increasing the size of the matrix. That is something to absolutely avoid in R. The solution is simple, just define the matrix of the size needed and then assign the results to the matrix using an index for the row to update.
For example:

# set nrows of matrix d
m = 2*length(N_list)*length(a_list)*length(b_list)*length(cp_list)*reps
# define matrix d
d = matrix(as.double(NA), nrow = m, ncol = 7)
idx = 0 # index to add results on rows of d
for (N in N_list){
    for (a in a_list){
        for (b in b_list){
            for (cp in cp_list){
                for (i in 1:reps){
                    X = rnorm(N, 0, 1)
                    M = a*X + rnorm(N, 0, 1)
                    Y = cp*X + b*M + rnorm(N, 0, 1)
                    idx <- idx+1
                    d[idx,] = c(i, a, b, cp, N, 1, sobel_test(X, M, Y))
                    idx <- idx+1
                    d[idx,] = c(i, a, b, cp, N, 2, sobel_test(X, Y, M))
                }
            }
        }
    }
}
colnames(d) = c("iteration", "a", "b", "cp", "N", "model", "Sobel_z")
d = data.frame(d)

I did a little experiment using reps=100 and the time required goes from 38 secs to 31 secs (about 20% faster), and I expect the speed up will be even much larger as the number of replications increases because in the old approach matrices of larger sizes have to be allocated each time.

Best,

Luca

Variable naming error when using `runSimulation()`

Hello,

thank you for creating this amazing package!

I am using SimDesign for a simulation study with a package that is work-in-progress, making a fully reproducible example complicated at this point. I fully understand if you are unable to help me without one, and I realize that the underlying issue may not be within your package.

The simulation involves four different analysis strategies for time-series structural equation models, where each simulated individual has a unique data-generating process embedded within a larger group-model. This results in a list of data-generating parameters and estimates passed along the analysis pathway. The summarise function returns a list containing four named vectors, each containing performance measures for a specific analysis function.

In a working condition, this is the output of the SimExtract(., what = "summarise") function:

list(`n_ind=30 ; ar=0.2 ; ind_a=0.15 ; ind_phi=0.05` = 
list(
default = c(ari_m = -0.0041, prec_dir_m = 0.14119, recall_dir_m = 0.14119), 
liberal = c(ari_m = -0.0041, prec_dir_m = 0.15279, recall_dir_m = 0.15279), 
strict = c(ari_m = -0.0041, prec_dir_m = 0.13942, recall_dir_m = 0.13942), 
threshold1 = c(ari_m = -0.0041, prec_dir_m = 0.14549, recall_dir_m = 0.14549)))

The problem arises when I change the sample size in my data generation to a larger number, which generates more individual models. When running runSimulation(), an error occurs with the following message:

Error in do.call(data.frame, c(x, alis)) : 
  variable names are limited to 10000 bytes

5: do.call(data.frame, c(x, alis))
4: as.data.frame.list(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
3: as.data.frame(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors)
2: data.frame(design[i, ], as.list(tmp), check.names = FALSE)
1: SimDesign::runSimulation(sim_design_test, replications = replications, 
       generate = sim_generate, analyse = sim_analyse_all, 
       summarise = sim_summarize, packages = c("mvgimme", "here"), 
       fixed_objects = sim_pars, parallel = TRUE)

It seems likely to me that some larger object is somewhere saved as a name. I have used browser() to debug all three simulation functions and found no apparent structural differences between the conditions that work and those that throw an error. There are also no issues with nonconvergence, and the output of summarise does not depend on the sample size.

I have tried the following without success:

Using a list of analysis functions vs. one analysis function containing multiple analyses
Returning a list of results vs. a named vector of results for summarise
shortening the names and reducing the number of performance measures in summarise
not using a summarise function at all
turning the various save options in SimDesign on or off

I would greatly appreciate any help or intuition about this problem.

Thanks,
Björn

Implicit activation of parallelization can be risky

Hi. I'm happy to see you're supporting future parallelization.

Your NEWS file entry on this got me curious, so I looked at the source code and spotted:

SimDesign/R/analysis.R

Line 10 in acc8e07

if("future" %in% (.packages())){

Note that this way of conditioning parallelization on what packages are attached is a bit risky. The reason is that you (developer but also user) never know in what context your package is used. For example, there might be another package is the user's script/code that attaches future, which then silently changes the behavior.

An alternative is to use reuse, say, argument cl, e.g. cl = "future". This is the approach that pbapply recently took (https://cran.r-project.org/web/packages/pbapply).

PS. Yes, I'm in favor of not having any parallelization arguments at alll, but I understand that some packages want them, or have them for legacy reasons.

Error with `reSummarise()` with `bootSE = TRUE`

Hi Phil,

I got this error when running reSummarise() with bootSE = TRUE:

Error in as.integer(max) : 
  cannot coerce type 'closure' to vector of type 'integer'

Because inside the function it has this part:

      SE_sim_results <- sapply(1L:boot_draws, function(r) {
        pick <- rint(n = replications, min = 1L, max = replications)
        tmp <- if (!is.data.frame(inp$results)) 
          inp$results[pick]
        else inp$results[pick, , drop = FALSE]
        summarise(results = tmp, condition = inp$condition, 
          fixed_objects = fixed_objects)
      })

but replications is not defined or extracted from the saved results.

Best,

Mark

Rstudio server crashes when debugging runSimulation()

First of all, thank you for developing this great R package.

I am currently running simulations using SimDesign and a Rstudio Server. The script works as intended if I run the entire script as a whole, but when I add the argument debug = 'generate' (or any of the other debug options), it gives the following message:

"The previous R session was abnormally terminated due to an unexpected crash. You may have lost workspace dat as a result of this crash."

The content of the simulation does not seem to impact the crash. Simplest reproducible example that crashes:

`library(SimDesign)

Design <- createDesign(factor1 = NA,
factor2 = NA)

#-------------------------------------------------------------------

Generate <- function(condition, fixed_objects = NULL) {
dat <- data.frame()
dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
ret <- c(stat1 = NaN, stat2 = NaN)
ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
ret <- c(bias = NaN, RMSE = NaN)
ret
}

#-------------------------------------------------------------------

res <- runSimulation(design=Design, replications=1000, generate=Generate,
analyse=Analyse, summarise=Summarise, debug = 'Generate') # change to none and it works
res`

SessionInfo:

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] SimDesign_2.7.1

loaded via a namespace (and not attached):
[1] rstudioapi_0.13 magrittr_2.0.1 tidyselect_1.1.1 R6_2.5.1
[5] rlang_0.4.11 foreach_1.5.1 pbapply_1.5-0 fansi_0.5.0
[9] dplyr_1.0.7 tools_4.1.1 parallel_4.1.1 sessioninfo_1.1.1
[13] utf8_1.2.2 cli_3.0.1 withr_2.4.2 iterators_1.0.13
[17] ellipsis_0.3.2 tibble_3.1.4 lifecycle_1.0.1 crayon_1.4.1
[21] purrr_0.3.4 vctrs_0.3.8 codetools_0.2-18 RPushbullet_0.3.4
[25] curl_4.3.2 glue_1.4.2 compiler_4.1.1 pillar_1.6.3
[29] generics_0.1.0 jsonlite_1.7.2 pkgconfig_2.0.3

Could it be is server-related instead of specific to this R package?
Might you have any idea what could be causing this crash and have ideas for potential solutions?

Thank you in advance,

Joppe

Checking for incomplete designs in large simulations

The current way of checking for incomplete designs in distributed simulations is as follows:

Final <- SimDesign::aggregate_simulations(files=dir())
pick <- subset(Final, REPLICATIONS < 10000)
subDesign <- subset(pick, select=N)
replications_missed <- 10000 - pick$REPLICATIONS

This consumes a lot of memory and might not be a feasible solution for very large simulations that might produce several gigabytes of results. I propose a new function called check_missing_simulations(), which would load each file, store the replication count, and then discard the actual simulation results before loading the next result file.

sum() on Boolean vector not working as expected in Summarise function

In the Analyse function of my simulation I am using a function that returns TRUE or FALSE when a certain issue arises in my analysis (singularity). In the Summarise function, I wish to count how many times the TRUE occurred given the parameters of my design. When I have a Boolean vector, I can use sum() to count the number of times TRUE occurs.

However, when I use the sum() function inside of my Summarise function, it does not return an integer, but instead returns a float, whereas I am expecting an integer of the amount of times TRUE occurred.

Below is a script that should occasionally provide a singular fit. In the results, I would expect the column "singularity" to provide the value of how often a singular fit (TRUE) occurred for each parameter combination.

library(SimDesign)
library(lme4)
library(tidyverse)

Design <- expand.grid(sampleSize = c(250),
                      effectSize = c(0, 4), 
                      measError = c(0, 0.1)) # Percentage measurement error

# Function that generates the data
Generate <- function(condition, fixed_objects=NULL) {
  Attach(condition) # So I don't have to type "condition$" in front of every element in our design

  # Generate measurement error as percentage of total possible value of BDI-II. Example: if Measurement Error = 0.1 (10% of range), the mean of the measurement error distribution is 0.1*63=6.3.
  r <- 0.5 # Intra-individual correlation of measurement error
  Sigma <- matrix(c((3.214286)^2, r, r, (3.214286)^2),nrow=2) # Covariance matrix
  measurementError <- rmvnorm(sampleSize, mean=c(measError*63,measError*63), sigma=Sigma)
  measSign <- rbinom(sampleSize, 1, prob=c(0.5,0.5)) # random binomial to make measurement error randomly positively or negatively signed
  measSign[measSign==0] <- -1 # Change zero to -1 for sign
  measurementError <- measurementError * measSign # Apply sign
  measurementErrorDat <- data.frame(id = sort(rep(c(1:sampleSize))),
                          Pre = measurementError[,1],
                          Post = measurementError[,2]) %>% gather("time", "measurementError", -id)
  
  # Random effects
  i <- 0.2 # sd between intercepts
  s <- 0.5 # sd between slopes
  r <- 0.8 # correlation between intercepts and slopes
  
  cov.matrix1 <- matrix(c(i^2, r * i * s, r * i * s, s^2), nrow = 2, byrow = T)
  randomEffects <- rmvnorm(sampleSize, mean = c(0, 0), sigma = cov.matrix1)
  
  indDf <- data.frame(id = c(1:sampleSize), 
                      condition=c(rep('Treatment', sampleSize/2), rep('Control', sampleSize/2))) %>% 
    mutate(alpha = 24.1+randomEffects[, 1],
           beta = ifelse(condition=="Treatment", -(9.828+effectSize+randomEffects[, 2]), -(9.828+randomEffects[, 2])))
  
  timeContrast = c(0,1)
  withinDf <-  data.frame(id = sort(rep(c(1:sampleSize), 2)), 
                          time=c(rep(c('Pre', 'Post'), sampleSize)), 
                          x=rep(timeContrast, 2))
  
  dat <- merge(indDf, withinDf) %>% merge(measurementErrorDat) %>% mutate(DV = alpha + x * beta + measurementError)
  # Truncate to 0 or 63
  dat$DV[dat$DV<0] <- 0
  dat$DV[dat$DV>63] <- 63

  dat
}

# Function for analysing the data set
Analyse <- function(condition, dat, fixed_objects=NULL){
  Attach(condition) # So I don't have to type "condition$" in front of every element in our design
  fit <- lmer(DV ~ condition * time + (1 | id), data=dat)
  singular <- isSingular(fit)
  ret <- c(coefficientInt=coef(summary(fit))[4,1], singular=singular)
  ret
}

# Function that summarises the analysis results into useful statistics for examining which of the data sets showed less bias
Summarise <- function(condition, results, fixed_objects = NULL) {
  bias_Int <- bias(results[1])
  sing <- sum(results[2], na.rm=TRUE)
  
  ret <- c(bias_Int=bias_Int, singular=sing)
  ret
}

results <- runSimulation(Design, replications=5, verbose=FALSE, parallel=FALSE, generate=Generate, analyse=Analyse, summarise=Summarise, progress=TRUE)

Another quick issue (rbind.SimDesign) that might be as designed

I have two runs that are identical but for a "sample size" condition.

Full_Sim_500reps_CIs <- SimDesign:::rbind.SimDesign(simresults_500reps_500sample_boot_all, simresults_500reps_1ksample_boot_all)

yields error: Error in if (x$Version != tmp$sessionInfo$otherPkgs[[x$Package]]$Version) warning(sprintf("Different \"%s\" package used across simulation objects (%s and %s).", : argument is of length zero

Now I understand the utility of this check, but the only difference between runs appears to be that someone loaded a few packages before one of the runs but not the others. One run only has: SimDesign, RPushBullet as otherPkgs, the other shares those but we've also loaded dplyr and magrittr, and maybe something else. If I check the shared packages, however:

> identical(rep500_samp500$sessionInfo$otherPkgs$RPushbullet, y = rep500_samp500$sessionInfo$otherPkgs$RPushbullet)
[1] TRUE
> identical(rep500_samp500$sessionInfo$otherPkgs$SimDesign, y = rep500_samp500$sessionInfo$otherPkgs$SimDesign)
[1] TRUE

These packages were of the same version, and others loaded in a one of the objects were irrelevant to the simulation. I can handle this in a number of ways that make it not critical to address, but I'm wondering if my read of your error code is correct, and if so, is this a necessary check? If it's necessary, I'd suggest pointing out (somewhere in the documentation... that may already exist and I missed) it might come into play later if you want to use the special rbind retaining the list structure.

sorry for the long post and no reprex, but I it's not needed for this one?

Improve documentation on parallel processing

The documentation on parallel processing could be improved by explaining on which level parallelization is done. Is it done on the simulation desing level, on the replication level, or both? The current documentation and the 2020 article leave this unexplained.

Features to support runnign as array jobs on a cluster

I am running my simulations as array jobs on a computer cluster. I know that I could, in principle, set up an array job backend using futures, but this is not a very flexible approach. Ideally, I would be able to run array jobs using SimDesing.

Let's assume that I have a simulation with 1000 conditions and 1000 replications each and I am runnign on a cluster where I have 10 000 cores available. I would like to split the simulation so that each core runs 100 replications of one condition. This is controlled by array job number so that e.g. job number 1 means replications 1-100 of design 1, job number 2 means replications 101-200 of desing 2, and so on.

The simulation should be ideally fully reproducible so that no matter how I split the work, I will get the same results. This means that I cannot split the simulation into chunks of 10100 replications without explicitly considering how seeds are managed because otherwise, the results would be different from just running 11000 replications. The computational time is constrained so that I can reserve one core for at most 4 hours and need to exit before that, or otherwise, the job will be killed.

To implement my workflow with SimDesign, I would need a couple of features:

Setting maximum execution time and aborting the simulation saving the incomplete results when the maximum execution time has been exceeded by SimDesign. I believe this might be useful for researchers who run simulations in computer classes, for example. SimDesign already has a way to continue from incomplete replications.
Running a subset of replications while maintaining reproducibility. This could work, for example, by generating a list of 1000 seeds randomly and then setting a seed for each replication. This way, it would not matter in which order I run the replications, and the replications would still be reproducible. This could be implemented by allowing the seed argument to take a list of vectors of seeds so that each vector contains 1000 seeds each used for each replication, and the list elements correspond to the conditions. Ideally, this would be done internally, though, and I would just tell simDesign to, e.g., run replication 201-300 out of 1000.

I have custom code that implements this approach here https://osf.io/5uep6 (See: # Multiple Monte Carlo simulation, parallel execution on a computer cluster ====)

Question / Issue

Hello Phil,

I hope this message finds you well.

I've been utilizing simDesign to execute a substantial MCMC Simulation, encompassing over 200,000 models. To systematically track the individual outcomes, I initially stored all results from the analyse function into an .RDS file. The filename encapsulates both the condition and the model. For instance, within the summarise function, I've employed the following approach:

` Analyze_M3(){

..........
# max rhat for subject theta

max_rhat_c <- subj %>% select(contains("rhat_c")) %>% max()
max_rhat_a <- subj %>% select(contains("rhat_a")) %>% max()

# Predictive Correlation for Different Response Categories

cor_rep_IIP <- cor(dat[[3]][,"IIP"],count_rep$IIP)
cor_rep_IOP <- cor(dat[[3]][,"IOP"],count_rep$IOP)
cor_rep_NPL <- cor(dat[[3]][,"NPL"],count_rep$NPL)


# Merge Results for current iteration

ret <- data.frame(re_c = recoveries_c,
                  rmse_c = rmse_c,
                  rmse_mu_c=rmse_mu_c,
                  rhat_c = max_rhat_c,
                  
                  re_a=recoveries_a,
                  rmse_a = rmse_a,
                  rmse_mu_a=rmse_mu_a,
                  rhat_a=max_rhat_a,
                  
                  cor_rep_IIP = cor_rep_IIP,
                  cor_rep_IOP = cor_rep_IOP,
                  cor_rep_NPL = cor_rep_NPL)

ret <- cbind(condition, ret)

.....rest of the code }`

Summarise_M3 <- function(condition,results,fixed_objects=NULL) {
  
  Attach(condition)
  
  
  # Load Model Type 
  
  # Create the base model name
  model_name <- paste0("M3_", model_type)
  
  # Check if the 'fixed' condition applies
  if (!(model_type == "SS" || model_type == "CS")) {
    if (fixed == 1) {
      # Append '_fixed' to the model name
      model <- paste0(model_name, "_fixed")
    } else {
      model <- model_name
    }
  } else {
    model <- model_name
  }
  

  
  # Create filename according to condition
  # 
  # if (!(model == "M3_SS" || model == "M3_CS")) {
  #   
  #   file.name <- paste0("Con.",model,".","N.",OtherItems,".","K.",NPL,
  #                       ".","nRet.",Retrievals,".",
  #                       "nFT.",nFreetime,".","length_FT.",length_FT,
  #                       ".fixed.", fixed, ".RDS")
  # } else {
  #   
  #   file.name <- paste0("Con.",model,".","N.",OtherItems,".","K.",NPL,
  #                       ".","nRet.",Retrievals,".RDS")
  #   
  # }
  # 
  # 
  # 
  # saveRDS(results, paste0(fixed_objects$output.dir.reps,file.name))
  
  if(model== "M3_CS_EE"){
    
    ..... rest of code.....

Error:


Design row: 1/1;   RAM used: 157.1 Mb;   Total elapsed time: 0.00s 
 Conditions: REPLICATION=0, OtherItems=5, NPL=8, Retrievals=500, model_type=CS
New names:
• `ID` -> `ID...1`
• `REPLICATION` -> `REPLICATION...2`
• `OtherItems` -> `OtherItems...3`
• `NPL` -> `NPL...4`
• `Retrievals` -> `Retrievals...5`
• `model_type` -> `model_type...6`
• `ID` -> `ID...7`
• `REPLICATION` -> `REPLICATION...8`
• `OtherItems` -> `OtherItems...9`
• `NPL` -> `NPL...10`
• `Retrievals` -> `Retrievals...11`
• `model_type` -> `model_type...12`
Error in `dplyr::as_tibble()`:
! Column names `REPLICATION`, `OtherItems`, `NPL`, `Retrievals`, and
  `model_type` must not be duplicated.
Use `.name_repair` to specify repair.
Caused by error in `repaired_names()`:
! Names must be unique.
✖ These names are duplicated:
  * "REPLICATION" at locations 1 and 7.
  * "OtherItems" at locations 2 and 8.
  * "NPL" at locations 3 and 9.
  * "Retrievals" at locations 4 and 10.
  * "model_type" at locations 5 and 11.
Backtrace:
     ▆
  1. └─SimDesign::runSimulation(...)
  2.   ├─dplyr::as_tibble(stored_Results_list)
  3.   └─tibble:::as_tibble.data.frame(stored_Results_list)
  4.     └─tibble:::lst_to_tibble(unclass(x), .rows, .name_repair)
  5.       └─tibble:::set_repaired_names(...)
  6.         └─tibble:::repaired_names(...)
  7.           ├─tibble:::subclass_name_repair_errors(...)
  8.           │ └─base::withCallingHandlers(...)
  9.           └─vctrs::vec_as_names(...)
 10.             └─vctrs (local) `<fn>`()
 11.               └─vctrs:::validate_unique(names = names, arg = arg, call = call)
 12.                 └─vctrs:::stop_names_must_be_unique(names, arg, call = call)
 13.                   └─vctrs:::stop_names(...)
 14.                     └─vctrs:::stop_vctrs(...)
 15.                       └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
Execution halted

Interestingly, this setup runs seamlessly on my home system, saving a dataframe (not a list) to a designated "conditions" folder with the specified filename and all pertinent condition information. However, when executed on the HPC, I encounter an error related to duplicated columns in the summary file, which is a dplyr error. I surmise this might be due to the combination process in the preceding function. Could this be attributed to a version-specific change? It's worth noting that everything operates as expected on my home setup.

On further exploration, I discerned that it's feasible to store the intermediate results from the analyse function directly, eliminating the need for an additional saving routine:

 save = FALSE, save_results = TRUE,
              save_details = list(out_rootdir=paste0(default.out,"Results/",model_type,"/"),
                                  save_results_dirname=paste0("Conditions/")),

This method yields the desired results without any errors. However, there's a minor hiccup with the naming convention. Given that I parallelize all simulation conditions simultaneously, with each HPC node handling a single condition, the resultant folder is simply named "_1" and the file is labeled "results-row-1". I'm concerned this might lead to complications, especially when running multiple simulations across different nodes, as I intend for the results to be consolidated within a single directory. Is there a way to modify the naming to reflect the simulation condition?

Thank you for your time and assistance. I appreciate your expertise on this matter.

Warm regards,

Jan

Error in load_packages(packages) : task 1 failed - "could not find function "load_packages""

Hey folks,

first of all, thanks for this great package - really nice work !

I ran in some problems when I tried to use MPI (thegoal is to use multiple nodes on a cluster) - if I follow the example given by Phil on parallel computing, I get the following error:

cl <- startMPIcluster()
	2 slaves are spawned successfully. 0 failed.
registerDoMPI(cl)

res <- runSimulation(sim3, replications = reps2con, generate = Generate_M3, 
                     analyse = Analyze_M3, summarise = Summarise,
                     fixed_objects = fo,extra_options=list(MPI=TRUE))
                     packages = c("cmdstanr","posterior","tmvtnorm","psych","tidyverse","tidybayes")

Error in load_packages(packages) : 
  task 1 failed - "could not find function "load_packages""

It's an error related to packages which should be loaded by some function defined in runSimualtion() I guess, it seems that the there's no pass through for the required functions from the included libs statement? I really have no idea - but I thought reaching out can't be wrong. I get this error even if I submit a slurm job, following the exact example given here:

https://cran.r-project.org/web/packages/SimDesign/vignettes/Parallel-computing.html#2_Network_computing

If someone has an advice, I'd be very happy!

cheers

reSummarise from dir with missing non-sequential results

I don't currently have any reproducible code, but I can probably demonstrate it if necessary:

I have a simulation running in which fatal errors are relatively routine, though I need to get through them. At times, the simulation has ended, temp file deleted, but the results meta-statistic data frame has failed to write. The save results directory may have, then, only 50 files from "results-row-1001.rds" to "results-row-1964.rds" when running from 1001 to 2000 due to fatal terminations not write to the directory.

Is there any quick way to summarize these without having all the fatal terminations in the directory?

Implement MCSE for performance measure functions

Hello,

as outlined in #27 , Samuel Pawel, František Bartoš, and I would be willing to help with the implementation of some MCSE-related functionality for SimDesign.

Our suggestions:
Allow users to obtain CLT-based Monte Carlo Standard Errors for the performance measures in the SimDesign summary functions based on the formulas provided in Siepe et al. (2023).

Sketch of what such an extension could look like:

(building on the existing EDR function):

mcse_prop <- function(x, n){
  sqrt((1/n) * (x * (1 - x)))
}

EDR <- function(p,
            	alpha = .05,
            	unname = FALSE,
            	MCSE = TRUE){
 
  browser()
  if(is.data.frame(p)) p <- as.matrix(p)
  stopifnot(all(p <= 1 & p >= 0))
  stopifnot(length(alpha) == 1L)
  stopifnot(alpha <= 1 && alpha >= 0)
  if(is.vector(p)) p <- matrix(p)
  ret <- colMeans(p <= alpha)
 
  if(MCSE){
	mcse <- mcse_prop(ret, n = nrow(p))
	if(!is.null(names(ret))){
  	names(mcse) <- paste0(names(ret), ".MCSE")
	} else {
  	names(ret) <- paste0("V", 1:length(ret))
  	names(mcse) <- paste0("MCSE.V", 1:length(ret))
	}
	ret <- c(ret, mcse)
  }
 
  if(unname) ret <- unname(ret)
  ret
}

Open Questions/Ideas:

Should the additional MCSE argument only allow for the CLT-based approach, or should this somehow also be tied to bootstrap-based MCSE estimation as available in reSummarize() or runSimulation()?
The named vector output could potentially lead to confusion; another possibility would be to provide the MCSE as an attribute

Depending on your input, we will open a pull request suggesting the functions soon.

Validate function outputs

The package should provide an informative error message when the generate or analyse functions do not return valid outputs.

Here is a test case that produces a non-informative error

library(lavaan)
library(tidyverse)
library(SimDesign)

model.str <- "
F1 =~ x1 + x2 + x3
F2 =~ x4 + x5 + x6
M =~ x1 + x2 + x3 + x4 + x5 + x6
"

model <- lavaanify(model.str, model.type = "cfa")

popModel <- mutate(model,
                    free = 0,
                    ustart = 1)

# approximate population covariance matrix

cov(simulateData(popModel, sample.nobs = 100000))


Design <- createDesign(N = 10^(2:5))

#-------------------------------------------------------------------

Generate <- function(condition, fixed_objects = NULL) {
  dat <- simulateData(popModel, sample.nobs = condition$N)
  dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
  ret <- cfa(model.str, dat)
  ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
  ret <- c(converged = lavInspect(results, "converged"),
           admissible = lavInspect(results,"post.check"))
  ret
}

#-------------------------------------------------------------------

res <- runSimulation(design=Design, replications=1000, generate=Generate, 
                     analyse=Analyse, summarise=Summarise)

Allow RAM testing considerations

Revisit use of object.size() to track the storage magnitude, and manage in a way similar to how max_time is currently used.

Help: moderated mediation sample size calculation

Good morning,
I was wondering if anyone could help me out in setting up code for a simulation based sample size estimation?

I'm working with a moderated mediation model (Model 14 as per A.Hayes PROCESS templates), where:
X is a dichotomous experimental condition (0 and 1)
M is a continuous mediator (mean of discrete 1 to 5 items)
V is a continuous mediator (mean of discrete 1 to 5 items)
Y is a continuous mediator (mean of discrete 1 to 5 items)

When I was analyzing pilot data I mean centered X, M and V, computed the product of M*V by hand and set the analysis syntax in for example lavaan.

I already learned that I can't run a simulation for sample size using lavaan model syntax in simsem - due to the fact, that I'm introducing a variable being a product of other two variables.

I was wondering if you could help me out in working out syntax for such simulation, where conditional indirect effects are hypothesized, and sample size is estimated based on previous research data. For example:


      lhs op          est mult             rhs
        M  ~  1.202004232    *               X
        Y  ~  0.575722530    *               M
        Y  ~ -0.006049529    *               V
        Y  ~  0.069165295    *             MxV
        Y  ~ -0.030178110    *               X
        M ~~  0.615693995    *               M
        Y ~~  0.361699422    *               Y
        X ~~  0.249909489    *               X
        X ~~  0.028958249    *               V
        X ~~ -0.164863026    *             MxV
        V ~~  2.527681615    *               V
        V ~~ -0.102325584    *             MxV
      MxV ~~  2.918988907    *             MxV
 indirect :=  0.692020918    *            a*b1
   direct := -0.030178110    *               c
      ab3 :=  0.083136978    *            a*b3
       lo :=  0.532131320    * a*(b1+ab3*-1.6)
       av :=  0.692020918    *    a*(b1+ab3*0)
       hi :=  0.851910516    *  a*(b1+ab3*1.6)

These where the model estimates of a pilot study (analyzed with lavaan)

Unable to save generated datasets when using saved seeds

Hello,

I am trying to replicate simulation conditions using the saved seeds.

When I set save_seeds = TRUE and I choose a save_seeds_dirname, everything works fine.

However, I am only able to load the seed if it is in the first seed directory: SimDesign-seeds_james-Oryx-Pro without _4. So to load the seeds, I had to rename the seed directory from SimDesign-seeds_james-Oryx-Pro_4 to SimDesign-seeds_james-Oryx-Pro.

I could not specify this option: load_seed = "./SimDesign-seeds_james-Oryx-Pro_4/design-row-1/seed-639". I had to rename the directory, then delete the directory name in the load_seed argument. This is a minor issue.

The major problem is I am unable to save the generated data when using the loaded seed. I obtain results, but the generated dataset is not saved. It only saves if I comment the load_seed = "design-row-1/seed-639", line out.

This is my script below.

library(SimDesign)

inv.logit <- function(x) 1 / (1 + exp(-x))

design_4 <- expand.grid(n = 100)

generate_4 <- function(condition, fixed_objects = NULL) {
  dat <- with(condition, {
    x <- c(rnorm(n - .05 * n), rcauchy(.05 * n))
    dat <- data.frame(
      x = x, y = rbinom(n, 1, inv.logit(x * log(1.5))))
  })
  dat
}

analyze_4 <- function(condition, dat, fixed_objects = NULL) {
  ols.0 <- glm(y ~ x, gaussian, dat)

  ols.rd <- bias(unname(ols.0$coefficients[2]), .1, TRUE)
  stat <- c(ols.rd = ols.rd)
  stat
}

summaries_4 <- function(condition, results, fixed_objects = NULL) {
  ret <- colMeans(results)
  ret
}

results_4 <- runSimulation(
  design = design_4, replications = 10000, parallel = TRUE, save_seeds = TRUE,
  generate = generate_4, analyse = analyze_4, summarise = summaries_4, edit = "none",
  save = TRUE, save_results = TRUE, save_generate_data = FALSE, progress = TRUE,
  filename = "./simdata_4", seed = c(100),
  save_details = list(save_seeds_dirname = "./SimDesign-seeds_james-Oryx-Pro_4"))

results_4 <- readRDS("simdata_4.rds")

# Generated dataset does not save.
results_4_r <- runSimulation(
  design = design_4, replications = 1, seed = c(100), edit = "none",
  generate = generate_4, analyse = analyze_4, summarise = summaries_4,
  save = TRUE, save_results = FALSE, save_generate_data = TRUE, progress = TRUE,
  load_seed = "design-row-1/seed-639",
  filename = "./simdata_4r", save_details = list(
    save_generate_data_dirname = "./SimDesign-generate-data_james-Oryx-Pro_4r/")
)

Thank you very much

Problem with restarting simulation after error in summarise-function

Hello Phil,

I encountered a possible bug or at least unintented behaviour of SimDesign. I was running a simulation with save_results = TRUE and everything was fine until I had an error in one condition in Summarise (because of a warning message, which was too long for a column name). I fixed that problem in Analyse and wanted to restart the simulation, but got the following error:

save_results_dirname not starting from correct location according to tempfile

I checked all my paths and the saved dirnames in the tempfile and everything was in order. Looking at the source code, the problem seems to be that after restarting a simulation it compares the first condition in the tempfile which is NULL (= starts) to the number of results saved in the folder save_results_dirname and expects there to be one less file than starts, otherwise it throws the error from above. I fixed this problem by deleting the last results-file manually. Is there a way to handle this problem in SimDesign directly by catching that error with a specific error-message or ideally restarting the simulation by repeating the condition which lead to the error in Summarise.

I included a small example:

library(SimDesign)

Design <- data.frame(m = 1:4)

Generate <- function(condition, fixed_objects = NULL) {
  dat <- data.frame(x = rnorm(100, condition$m, 1))
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
  error <- 0
  if(condition$m == 3) error <- 1 # exclude this line after first run
  res <- c(m = mean(dat$x), error = error)
}

Summarise <- function(condition, results, fixed_objects = NULL) {
  if(any(results[, "error"] == 1)) {
    stop("ERROR")
  }

  res <- c(m = mean(results[, "m"]))
}

results <- runSimulation(design=Design, replications=5, 
                         generate=Generate, analyse=Analyse, summarise=Summarise,
                         progress = TRUE, max_errors = 5,
                         save = TRUE, save_results = TRUE,
                         save_details = list(save_results_dirname = "SimResults")
)

Add better complete crash information

When a design row crashes due to too many consecutive errors there should be a file dumped to the hard-drive containing the .Random.seed components useful for replicating these issues. This file can be over-writable since it should be meant as just a crash log for debugging purposes.

Computing relative bias when population parameter is a vector rather than scalar.

I had problems computing the relative bias using the bias() function when the population parameter is a vector.

Looking at the code for bias(), it appears bias is first computed as a scalar, mean(estimate(i) - parameter(i)). If the user sets the option, type = "relative", this scalar is then divided by the parameter vector which results in a vector of results instead of a scalar for relative bias.

I do not know if this is clear, but mathematically, bias() appears to be doing mean(bias(i)) / parameter(i) resulting in a vector of length i, when I think it should be doing: mean(bias(i)/parameter(i)) resulting in a scalar.

Thinking about it some more, maybe we should divide by the absolute value of the parameter instead?

Reproducible example

set.seed(1); n <- 5

par <- rnorm(n)
est <- par + .05 * rnorm(n)

SimDesign::bias(est, par, type = "relative")
[1] -0.010785765  0.036792970 -0.008085869  0.004235482  0.020505687

mean(est - par) / (par)
[1] -0.010785765  0.036792970 -0.008085869  0.004235482  0.020505687

# I can take the mean to produce a scalar but
mean(SimDesign::bias(est, par, type = "relative"))
[1] 0.008532501

# Using the SimDesign approach but if I divide by the absolute value instead,
mean(mean(est - par) / abs(par))
[1] 0.01608115

# This is what I think I want:
mean((est - par) / abs(par))
[1] 0.01662195

Which approach of the three is preferable?

Thanks.

Is it possible to capture all errors when using multiple analysis functions?

Hi Phil, I have another question regarding using multiple analysis functions. In a current project I want to compare the convergence rates of different methods, and since SimDesign draws a new sample when one of the analysis functions throws an error. My understanding is that currently SimDesign will stop a replication as soon as the first error was found, which means that the convergence rate of each analysis depends on the ordering of the method, as in the following example:

library(SimDesign)
# SimFunctions(comments=FALSE)

Design <- createDesign(N = c(10, 20, 30))

Generate <- function(condition, fixed_objects = NULL) {
  ret <- with(condition, if (runif(1) > 0.8) rep(NA, N) else rnorm(N))
  ret
}

Analyse.a1 <- function(condition, dat, fixed_objects = NULL) {
  if (any(is.na(dat))) {
    stop("mean did not converge")
  }
  ret <- mean(dat)
  ret
}

Analyse.a2 <- function(condition, dat, fixed_objects = NULL) {
  if (any(is.na(dat))) {
    stop("median did not converge")
  }
  ret <- median(dat)
  ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
  ret <- c(bias = bias(results, 0))
  ret
}

result <- runSimulation(Design,
                        replications = 10, seed = rep(103, 3),
                        generate = Generate, 
                        analyse = list(a1 = Analyse.a1, a2 = Analyse.a2), 
                        summarise = Summarise
)
#> 
#> Design row: 1/3;   Started: Fri Oct 22 15:04:57 2021;   Total elapsed time: 0.00s 
#> 
#> Design row: 2/3;   Started: Fri Oct 22 15:04:57 2021;   Total elapsed time: 0.01s 
#> 
#> Design row: 3/3;   Started: Fri Oct 22 15:04:57 2021;   Total elapsed time: 0.02s
#> 
#> Simulation complete. Total execution time: 0.02s

# Switch the order of a1 and a2
result2 <- runSimulation(Design,
                        replications = 10, seed = rep(103, 3),
                        generate = Generate, 
                        analyse = list(a2 = Analyse.a2, a1 = Analyse.a1), 
                        summarise = Summarise
)
#> 
#> Design row: 1/3;   Started: Fri Oct 22 15:04:57 2021;   Total elapsed time: 0.00s 
#> 
#> Design row: 2/3;   Started: Fri Oct 22 15:04:57 2021;   Total elapsed time: 0.01s 
#> 
#> Design row: 3/3;   Started: Fri Oct 22 15:04:57 2021;   Total elapsed time: 0.01s
#> 
#> Simulation complete. Total execution time: 0.02s

attr(result, which = "ERROR")
#> # A tibble: 3 × 1
#>   `ERROR: .mean did not converge\n`
#>                               <dbl>
#> 1                                 1
#> 2                                 2
#> 3                                 4
attr(result2, which = "ERROR")
#> # A tibble: 3 × 1
#>   `ERROR: .median did not converge\n`
#>                                 <dbl>
#> 1                                   1
#> 2                                   2
#> 3                                   4

^{Created on 2021-10-22 by the reprex package (v2.0.1)}

Would it be possible to add an option so that for a given replication, all functions in the analysis list will run, and then all errors will be recorded? I know I can treat method as another design factor but I want to have the same data sets when comparing the different analysis functions. What I previously have done is to use multiple tryCatch() inside the Analyse() functions, and threw an error if anyone of the analyses resulted in an error; when multiple functions resulted in errors I just chained the error messages together (and there may be better ways to do it). Curious what you think about that, and whether something like that can be done with multiple analysis functions.

Thank you very much!

Mark

Random number seed management

The page about cluster computation should discuss random number management more explicly.

https://cran.r-project.org/web/packages/SimDesign/vignettes/Parallel-computing.html

For example, the section "3 Poor man’s cluster computing for independent nodes" suggests running R=200 and R=300 on two computers. This can easily lead to running with the same random number seeds causing the replications to be non-independent. It would be useful to provide some recommendations on how seeds should be managed in this scenario.

runSimulation fails if design and estimate are named the same

Here is a minimal reproducible example that shows the problem:

SimClean()

#-------------------------------------------------------------------

Design <- createDesign(b = 1)

#-------------------------------------------------------------------

# 1) Generate data

Generate <- function(condition, fixed_objects = NULL) {
    
    dat <- data.frame()
    dat
}

# 2) Estimate 

Analyse <- function(condition, dat, fixed_objects = NULL) {
    ret <- nc(b = 1)
    ret
}

# 3) Summarize 

Summarise <- function(condition, results, fixed_objects = NULL) {
    ret <- c(b = 1)
    ret
}

#-------------------------------------------------------------------

res <- runSimulation(design = Design, replications = 10, generate = Generate, 
                     analyse = Analyse, summarise = Summarise)
res

Implement sample size planning functions

Hello Phil,

as we previously briefly discussed via email, it could be useful to have functionality that allows for sample size planning for simulation studies to achieve a desired Monte Carlo Standard Error (MCSE). These functions should allow users to specify their performance measure of interest and the desired precision. The functions then return the number of repetitions needed to achieve said precision. The calculations can be based on the formulas we provide in Siepe et al. (2023).

We (Samuel Pawel, František Bartoš, and I) would like to contribute to this functionality.

Our Suggestions:

implement helper functions plan_*, where * stands for performance measures such as bias or coverage as implemented in the SimDesign summary functions
let users either specify a 'worst-case' scenario (in case of performance measures with known SE) or an empirical variance of the estimates (based on previous simulation studies or a pilot simulation)
the plan_* can be used within the Summarise() function. Users can then run a pilot simulation study to obtain the empirical variance and return the required sample size for each condition/method.
this would seamlessly build upon existing infrastructure. We could create a wiki/vignette that explains the idea.

Sketch of what such a function could look like:

For a performance measure with known SE:

plan_EDR <- function(target_mcse,
                     target_edr = 0.5){ 
  n_rep <- target_edr * (1 - target_edr) / target_mcse^2
  n_rep
}

For a performance measure with unknown SE:

plan_bias <- function(target_mcse,
                      empirical_var){
  n_rep <- empirical_var / target_mcse^2
  n_rep
}

Depending on your input, we will open a pull request suggesting the functions soon.

Best
Björn

Infinite loop triggered by error at analyse= return

I stumbled upon a way to trigger an infinite loop. I must say this was a result of disregarding a detail in the examples. I'm opening an issue since I think this was a reasonable disregard.

The infinite loop can be triggered like this:

analyse <- function(condition, dat, fixed_objects = NULL) {
  analysis_1 <- mean(rnorm(100))

  list(a1 = analisis_1) # mind the typo...
}

I believe this is a common way of returning functions, although the examples first create an object and return that.

check design dataset on resume

If a folder contains a tempfile from simulations with different parameter-sets the simulation is restarted with the new parameter-set starting from the last line number contained in the old file. The output file is concatenated, if the parameter-sets contained different columns, the columns are NA before and after the respective line number.

Please add a check and do not resume simulations for a different design dataset but instead stop with an informative error-message.

aggregate_simulations and selective loading of results

aggregate_simulations work currently by aggregating all results from many results files. This may not work well with very large simulations. For example, I have a simulation that produced around 300GB of replication data. (In this case, computation was expensive and therefore it made sense to save a lot of different statistics from each replication.) Loading these results with aggregate_simulations would not work because the computer runs out of memory.

I propose adding three more arguments to aggregate_simulations to handle this kind of scenario. The filter and select arguments, modeled after the dplyr package, would allow the users to choose which conditions and which result variables to load. This would be useful, for example, when producing figures or tables that each need just a subset of the full simulation results.

Code in SimDesign Package Won't Run for window 10!!

install.packages("simdesign")
Installing package into ‘C:/Users/HP/AppData/Local/R/win-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘simdesign’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Warning in install.packages :
Perhaps you meant ‘SimDesign’ ?

Seeds and reproducibility in distributed simulations

A simulation may not fully complete when running on an HPC cluster because of the cluster's time limits. The currently recommended workflow is to check the missing replications, add more rows to the design matrix, and rerun with a new set of arrayIDs. This may need to be repeated several times.

This workflow is problematic for reproducibility because it would require the user to keep track of which runs each replication originates from.

Storing the random number generator state before each replication is one way to solve this problem, but it is impractical with very large simulations.

I propose three changes to make reproducibility easier:

1) Implement replication number

Large simulations can now run replications as smaller sets. For example, if we run 1000 reps, this can run as 10 sets of 100 replications using `expandDesign()'. This complicates the management of replications because the aggregated results do not clearly indicate which run each replication originated from.

I propose a new function that adds replication numbers to the design. The replication number would be stored as a new column, replications, that contains a vector of two or more elements. If the vector contains two elements, it is interpreted as a range where the first element is the first replication number and the second element is the final replication number. If the vector has more than two elements, it is interpreted as a vector of replications to be run.

Here is a proposed implementation


library(SimDesign)

Design <- createDesign(N = c(10, 20, 30))

# Add design numbers, this is needed for adding replication numbers

Design <- tibble::add_column(Design, 
                             design = 1:nrow(Design),
                             .before = 1)

Generate <- function(condition, fixed_objects = NULL) {
    dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
    dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL){
    colMeans(results)
}

expandDesign <- function(Design, repeat_conditions){
    stopifnot(!missing(Design))
    stopifnot(!missing(repeat_conditions))
    if(length(repeat_conditions) == 1L)
        repeat_conditions <- rep(repeat_conditions, nrow(Design))
    stopifnot("length of repeat_rows must equal number of rows in final object"=
                  length(repeat_conditions) == nrow(Design))
    ret <- Design[sort(rep(1L:nrow(Design), times=repeat_conditions)), ]
    rownames(ret) <- NULL
    
    ret
}

addReplicationsToDesign <- function(ExpandedDesign, replications){
    stopifnot(!missing(ExpandedDesign))
    stopifnot(!missing(replications))

    ExpandedDesign %>%
        group_by(design) %>%
        group_modify(function(.x, .y){
            
            # Split the replications in sets
            replicationSets <- split(1:replications,
                                     cut(1:replications, nrow(.x),
                                         labels = FALSE))
            # Then return ranges
            ret <- tibble(replications = lapply(replicationSets, range))
            ret
        })
}

# expand the design to create 300 rows with associated replications
Design300 <- expandDesign(Design, repeat_conditions = 100L)
DesignWithReplications <- addReplicationsToDesign(Design300, replications = 10000L)
DesignWithReplications

If the simulation design contains the replications column, it should override the replications argument of the runSimulation() function.

2) Add design number and replication number to each set of results

When we print out the raw simulation results, the output should look like this:

# A tibble: 180 × 7
   design_number replication_number sample_size group_size_ratio standard_deviation_r…¹  welch independent
           <int>              <int>       <dbl>            <dbl>                  <dbl>  <dbl>       <dbl>
 1             1                  1          30                1                    0.5 0.850       0.849 
 2             1                  2          30                1                    0.5 0.574       0.573 
 3             1                  3          30                1                    0.5 0.945       0.944 
 4             1                  4          30                1                    0.5 0.0910      0.0883
 5             1                  5          30                1                    0.5 0.0365      0.0329
 6             2                  1          60                1                    0.5 0.819       0.819 
 7             2                  2          60                1                    0.5 0.355       0.354 
 8             2                  3          60                1                    0.5 0.586       0.585 
 9             2                  4          60                1                    0.5 0.671       0.670 
10             2                  5          60                1                    0.5 0.285       0.283

Adding design numbers is easy in the user code, but adding replication numbers is not. The code to produce this example is

Design <- createDesign(sample_size = c(30, 60, 90, 120),
                       group_size_ratio = c(1, 4, 8),
                       standard_deviation_ratio = c(.5, 1, 2))
Design

# THIS SHOULD BE ADDED AUTOMATICALLY
Design <- tibble::add_column(Design, 
                             design_number = 1:nrow(Design),
                             .before = 1)

#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions

Generate <- function(condition, fixed_objects = NULL) {
    N <- condition$sample_size      # could use Attach() to make objects available
    grs <- condition$group_size_ratio
    sd <- condition$standard_deviation_ratio
    if(grs < 1){
        N2 <- N / (1/grs + 1)
        N1 <- N - N2
    } else {
        N1 <- N / (grs + 1)
        N2 <- N - N1
    }
    group1 <- rnorm(N1)
    group2 <- rnorm(N2, sd=sd)
    dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
    dat
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    welch <- t.test(DV ~ group, dat)$p.value
    independent <- t.test(DV ~ group, dat, var.equal=TRUE)$p.value
    
    # In this function the p values for the t-tests are returned,
    #  and make sure to name each element, for future reference
    ret <- nc(welch, independent)
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    #find results of interest here (e.g., alpha < .1, .05, .01)
    ret <- EDR(results, alpha = .05)
    ret
}


#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design

# first, test to see if it works
res <- runSimulation(design=Design, replications=5,
                     generate=Generate, analyse=Analyse)
res

# ADD REPLICATION NUMBERS LIKE THIS IN runSimulation

tibble::add_column(res, 
                   replication_number = rep(1:5, nrow(Design)),
                   .after = 1)

3) Use RNG substreams

Parallel computation is currently implemented with L'Ecuyer-CMRG's (2002) method. This method supports RNG streams that can each have a substream https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html. The current implementation uses streams on the design level. I propose using substreams on a replication level.

One way to implement this would be to have the seed argument to also accept functions that set the RNG state for each replication: seed = function(design, replication). This would allow the user to specify their own seed generating functions on a replication level. One useful function would be

function(design, replication){
    nextRNGStream(design)
    nextRNGSubStream(replication)
}

This could be implemented also as a new argument accepting a function initRNG = function(seed, design, replication)

philchalmers / simdesign Goto Github PK

simdesign's Introduction

SimDesign

Installation

Getting started

simdesign's People

Contributors

Stargazers

Watchers

Forkers

simdesign's Issues

1) Implement replication number

2) Add design number and replication number to each set of results

3) Use RNG substreams

Recommend Projects

Recommend Topics

Recommend Org