A simulation may not fully complete when running on an HPC cluster because of the cluster's time limits. The currently recommended workflow is to check the missing replications, add more rows to the design matrix, and rerun with a new set of arrayIDs. This may need to be repeated several times.
This workflow is problematic for reproducibility because it would require the user to keep track of which runs each replication originates from.
Storing the random number generator state before each replication is one way to solve this problem, but it is impractical with very large simulations.
I propose three changes to make reproducibility easier:
1) Implement replication number
Large simulations can now run replications as smaller sets. For example, if we run 1000 reps, this can run as 10 sets of 100 replications using `expandDesign()'. This complicates the management of replications because the aggregated results do not clearly indicate which run each replication originated from.
I propose a new function that adds replication numbers to the design. The replication number would be stored as a new column, replications
, that contains a vector of two or more elements. If the vector contains two elements, it is interpreted as a range where the first element is the first replication number and the second element is the final replication number. If the vector has more than two elements, it is interpreted as a vector of replications to be run.
Here is a proposed implementation
library(SimDesign)
Design <- createDesign(N = c(10, 20, 30))
# Add design numbers, this is needed for adding replication numbers
Design <- tibble::add_column(Design,
design = 1:nrow(Design),
.before = 1)
Generate <- function(condition, fixed_objects = NULL) {
dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
dat
}
Analyse <- function(condition, dat, fixed_objects = NULL) {
ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
ret
}
Summarise <- function(condition, results, fixed_objects = NULL){
colMeans(results)
}
expandDesign <- function(Design, repeat_conditions){
stopifnot(!missing(Design))
stopifnot(!missing(repeat_conditions))
if(length(repeat_conditions) == 1L)
repeat_conditions <- rep(repeat_conditions, nrow(Design))
stopifnot("length of repeat_rows must equal number of rows in final object"=
length(repeat_conditions) == nrow(Design))
ret <- Design[sort(rep(1L:nrow(Design), times=repeat_conditions)), ]
rownames(ret) <- NULL
ret
}
addReplicationsToDesign <- function(ExpandedDesign, replications){
stopifnot(!missing(ExpandedDesign))
stopifnot(!missing(replications))
ExpandedDesign %>%
group_by(design) %>%
group_modify(function(.x, .y){
# Split the replications in sets
replicationSets <- split(1:replications,
cut(1:replications, nrow(.x),
labels = FALSE))
# Then return ranges
ret <- tibble(replications = lapply(replicationSets, range))
ret
})
}
# expand the design to create 300 rows with associated replications
Design300 <- expandDesign(Design, repeat_conditions = 100L)
DesignWithReplications <- addReplicationsToDesign(Design300, replications = 10000L)
DesignWithReplications
If the simulation design contains the replications
column, it should override the replications argument of the runSimulation()
function.
2) Add design number and replication number to each set of results
When we print out the raw simulation results, the output should look like this:
# A tibble: 180 × 7
design_number replication_number sample_size group_size_ratio standard_deviation_r…¹ welch independent
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 30 1 0.5 0.850 0.849
2 1 2 30 1 0.5 0.574 0.573
3 1 3 30 1 0.5 0.945 0.944
4 1 4 30 1 0.5 0.0910 0.0883
5 1 5 30 1 0.5 0.0365 0.0329
6 2 1 60 1 0.5 0.819 0.819
7 2 2 60 1 0.5 0.355 0.354
8 2 3 60 1 0.5 0.586 0.585
9 2 4 60 1 0.5 0.671 0.670
10 2 5 60 1 0.5 0.285 0.283
Adding design numbers is easy in the user code, but adding replication numbers is not. The code to produce this example is
Design <- createDesign(sample_size = c(30, 60, 90, 120),
group_size_ratio = c(1, 4, 8),
standard_deviation_ratio = c(.5, 1, 2))
Design
# THIS SHOULD BE ADDED AUTOMATICALLY
Design <- tibble::add_column(Design,
design_number = 1:nrow(Design),
.before = 1)
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 2 --- Define generate, analyse, and summarise functions
Generate <- function(condition, fixed_objects = NULL) {
N <- condition$sample_size # could use Attach() to make objects available
grs <- condition$group_size_ratio
sd <- condition$standard_deviation_ratio
if(grs < 1){
N2 <- N / (1/grs + 1)
N1 <- N - N2
} else {
N1 <- N / (grs + 1)
N2 <- N - N1
}
group1 <- rnorm(N1)
group2 <- rnorm(N2, sd=sd)
dat <- data.frame(group = c(rep('g1', N1), rep('g2', N2)), DV = c(group1, group2))
dat
}
Analyse <- function(condition, dat, fixed_objects = NULL) {
welch <- t.test(DV ~ group, dat)$p.value
independent <- t.test(DV ~ group, dat, var.equal=TRUE)$p.value
# In this function the p values for the t-tests are returned,
# and make sure to name each element, for future reference
ret <- nc(welch, independent)
ret
}
Summarise <- function(condition, results, fixed_objects = NULL) {
#find results of interest here (e.g., alpha < .1, .05, .01)
ret <- EDR(results, alpha = .05)
ret
}
#~~~~~~~~~~~~~~~~~~~~~~~~
#### Step 3 --- Collect results by looping over the rows in design
# first, test to see if it works
res <- runSimulation(design=Design, replications=5,
generate=Generate, analyse=Analyse)
res
# ADD REPLICATION NUMBERS LIKE THIS IN runSimulation
tibble::add_column(res,
replication_number = rep(1:5, nrow(Design)),
.after = 1)
3) Use RNG substreams
Parallel computation is currently implemented with L'Ecuyer-CMRG's (2002) method. This method supports RNG streams that can each have a substream https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html. The current implementation uses streams on the design level. I propose using substreams on a replication level.
One way to implement this would be to have the seed
argument to also accept functions that set the RNG state for each replication: seed = function(design, replication)
. This would allow the user to specify their own seed generating functions on a replication level. One useful function would be
function(design, replication){
nextRNGStream(design)
nextRNGSubStream(replication)
}
This could be implemented also as a new argument accepting a function initRNG = function(seed, design, replication)