Hi. I'm happy to see you're supporting future parallelization.

Implicit activation of parallelization can be risky about simdesign HOT 8 CLOSED

HenrikBengtsson commented on May 26, 2024

Implicit activation of parallelization can be risky

from simdesign.

Comments (8)

philchalmers commented on May 26, 2024

Hi @HenrikBengtsson!

First, thank you very much for taking the time to inspect and open this issue! You raise a good point about silent behaviour issues that I hadn't fully thought through. That said, I find the pbapply approach to be jarringly atypical of the future style, which is why I avoided such argument declarations in the current approach (e.g., early on I contemplated adding a future = TRUE argument to do something similar to overloading the cl input, but decided against it as it felt awkward).

I agree that checking whether the package is attached is not ideal, but I would still prefer a standard future-type specification. I wonder if it's possible to check whether the default plan() has been overwritten, in which case the use of future.apply would be less ambiguous (e.g., if the plan is anything other than the stock-standard sequential default with ncpus=1, then the front-end user clearly meant to use a different computational plan, regardless of whether this is set in a main.R file or source()ed at some point). That's slightly better than checking whether future was simply attached, though of course it doesn't fix the issue of sourcing files without knowing about the package or function masking consequences.

Any thoughts you have in this area are appreciated as balancing the global specification approach used by future is a little tricky to navigate. Cheers.

Phil

philchalmers commented on May 26, 2024

I see that one can use

if (is(future::plan(), "sequential")) { ... }

which appears to be a reasonable solution; however, it doesn't appear to be correct in the situation where multiple plan()s are defined.

library(future)
plan(list('multisession', 'sequential'))

is(plan(), 'multisession') # TRUE
is(plan(), 'sequential') # FALSE

plan(list('sequential', 'multisession'))
is(plan(), 'multisession') # FALSE
is(plan(), 'sequential') # TRUE

HenrikBengtsson commented on May 26, 2024

The undocumented future::plan("next") returns the next future strategy on the stack. There's also nbrOfWorkers(), which returns 1L for sequential. OTOH, it can also return 1L for other backends.
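As a rough sketch of the two checks discussed so far (not something taken from the package itself), one could compare the class of the current plan with the worker count. The helper name uses_parallel_plan() is made up for illustration, and the example stays on plan(sequential) so that no workers are launched.

```r
library(future)

# Hypothetical helper (not part of future or SimDesign): TRUE when the
# first level of the plan stack is something other than sequential.
uses_parallel_plan <- function() {
  !inherits(plan(), "sequential")
}

# plan() invisibly returns the previous plan, so it can be restored later.
old_plan <- plan(sequential)

seq_flag    <- uses_parallel_plan()  # class-based check on the current plan
seq_workers <- nbrOfWorkers()        # worker count; per the thread, 1L for sequential

plan(old_plan)
```

Neither check is conclusive on its own: as noted above, other backends can also report a single worker, and the class check only looks at the first level of a nested plan.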

HenrikBengtsson commented on May 26, 2024

For the bigger question: As you probably understand by now, I'm trying to avoid function arguments that control how and if a specific function runs in parallel. With futureverse, I'm trying to push toward that goal as far as I can. I consider that too low-level for an API that does analysis.

Looking at your Analysis(), you have several different options for parallelization:

    if("future" %in% (.packages())){
        ...
    } else if(is.null(cl)){
        ...
        results <- if(progress){
            try(pbapply::pblapply(1L:replications, mainsim, condition=condition,
                ...)
        } else {
            try(lapply(1L:replications, mainsim, condition=condition,
                ...)
        }
    } else {
        if(MPI){
            ...
            results <- try(foreach(i=1L:replications, .export=export_funs, .packages=packages,
                                   .options.mpi=.options.mpi) %dopar%
                ...)
        } else {
            ...
            results <- if(progress){
                try(pbapply::pblapply(1L:replications, mainsim,
                ...)
            } else {
                try(parallel::parLapply(cl, 1L:replications, mainsim,
                ...)
            }
        }
    }

I think you can replace all special cases with that single future_lapply() version. Then you can remove arguments cl and MPI, and let the current plan() control how parallelization is done, and if not specified, then it defaults to sequential processing.
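To make the suggestion concrete, here is a minimal sketch of what a single future_lapply() call could look like. The mainsim() body, the condition list, and the replication count are hypothetical stand-ins, not the package's actual internals.

```r
library(future.apply)

# Hypothetical stand-in for the per-replication worker function.
mainsim <- function(i, condition) {
  mean(rnorm(condition$n))
}

condition    <- list(n = 10L)  # assumed structure, for illustration only
replications <- 5L

# Runs sequentially unless the caller has set another plan beforehand,
# e.g. plan(multisession). future.seed = TRUE requests parallel-safe
# random number streams for the rnorm() calls.
results <- future_lapply(seq_len(replications), mainsim,
                         condition = condition,
                         future.seed = TRUE)
```

The function itself never mentions a backend. Whoever calls it decides how it runs by setting plan() first.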

Another advantage of this approach is that you no longer have to write separate package tests for each of those cases to make sure you have a high test code coverage. Testing and validation toward different backends is done by the futureverse framework, so you don't have to worry about it (https://www.futureverse.org/quality.html).

So that's my view and take on it. That said, I don't want to "force" futureverse on anyone, and I understand there are other reasons for using alternatives.

HenrikBengtsson commented on May 26, 2024

"plan is other than the stock-standard sequential default with ncpus=1 then the front-end user clearly meant to use a different computational plan"

Note that your function might be called in a parallel worker by some other code. Then, at least in the future framework, the default is to run with plan(sequential) to avoid CPU overuse from nested parallelization.

Point is, it's really hard to predict how, where, and in what context one's code will be used. We also don't control what happens in the future, so an update to another package might change this all of a sudden.

philchalmers commented on May 26, 2024

Thanks so much for the detailed replies; they have given me a lot to think about. I've decided to use a parallel = 'future' approach instead of the previous behaviour, though while doing so I've uncovered somewhat of a snag using the new tests (e.g., some internal objects must be exported as they are not visible when using a plan other than plan(sequential)).

For now I'll roll back the current version on CRAN until a suitable future option is available and well tested in this package. Thanks again for all your help and pointing me in better directions moving forward.

P.S. While I have your attention, if you're able to direct me to some future equivalent of parallel::clusterExport(cl, ..., envir), that would be very helpful, since in the current setup I export objects from different environment locations at runtime. I can see this is possible with future::plan(...), though in my early attempts this has failed quite miserably and doesn't feel kosher.

HenrikBengtsson commented on May 26, 2024

"While I have your attention, if you're able to direct me to some future equivalent of parallel::clusterExport(cl, ..., envir) that would be very helpful since in the current setup I export objects for different environment locations at runtime."

There is no counterpart, by design. The way to think about futures is that they may end up running anywhere, and we should assume each one runs in a fresh environment. Futureverse tries to identify all required global variables automatically, but it's not 100% reliable. When it misses something, you can guide Futureverse in the right direction. See https://future.futureverse.org/articles/future-4-issues.html#missing-globals-false-negatives for common solutions. If a function/method is missing, then it could be that it fails to detect required packages. See the same vignette for how to guide what packages should be added.
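One way this guidance can look in practice is through the future.globals and future.packages arguments of future_lapply(). The sketch below is only an illustration: offset and add_offset() are hypothetical, and get() is used to mimic a variable that static code inspection would not find.

```r
library(future.apply)

offset <- 100  # hypothetical object that is only looked up by name

add_offset <- function(i) {
  # get("offset") hides the dependency from automatic globals detection
  i + get("offset")
}

# Name the missed global explicitly; FUN itself is always passed along.
# future.packages lists packages to attach where the future is evaluated.
results <- future_lapply(1:3, add_offset,
                         future.globals  = list(offset = offset),
                         future.packages = "stats")
```

Under the default sequential plan this would run even without the extra arguments. They matter once the plan sends the work to a separate R process.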

philchalmers commented on May 26, 2024

Looks to be patched now. Thanks again for all your helpful comments and references to solutions! The current specification now works well with the following structure:

plan(multisession)
results <- runSimulation(..., parallel = 'future', ...)
