henrikbengtsson / wishlist-for-r
Features and tweaks to R that I and others would love to see - feel free to add yours!

Home Page: https://github.com/HenrikBengtsson/Wishlist-for-R/issues

License: GNU Lesser General Public License v3.0

wishlist-for-r's Issues

WISH: Error messages from library() are much less informative than from loadNamespace()

Issue

If there's an error loading a package using library(), then the error message, e.g.

Error: package or namespace load failed for 'future'

is much less informative than when using loadNamespace(), e.g.

Error: unable to load shared object
'/Users/foo/R/test-3.4/digest/libs/digest.so':
     dlopen(/Users/foo/R/test-3.4/digest/libs/digest.so, 6):
Library not loaded: libR.dylib
     Referenced from: /Users/foo/R/test-3.4/digest/libs/digest.so
     Reason: Incompatible library version: digest.so requires version 3.4.0 or later, but libR.dylib provides version 3.3.0

Troubleshooting

The reason the library() error message is so generic is that the code simply catches any type of error and generates a new generic error message without forwarding the original one:

  tt <- try({
      attr(package, "LibPath") <- which.lib.loc
      ns <- loadNamespace(package, lib.loc)
      env <- attachNamespace(ns, pos = pos, deps)
  })
  attr(package, "LibPath") <- NULL
  if (inherits(tt, "try-error"))
      if (logical.return)
          return(FALSE)
      else stop(gettextf("package or namespace load failed for %s",
                         sQuote(package)),
                call. = FALSE, domain = NA)

Suggestion / wish

Also include conditionMessage(attr(tt, "condition")) in the error message, e.g.

      else stop(gettextf("package or namespace load failed for %s",
                         sQuote(package)),
                         ":\n", conditionMessage(attr(tt, "condition")),
                call. = FALSE, domain = NA)

Note that the above uses the same gettextf() format string as before.

With this patch, the above library() error message would look like:

Error: package or namespace load failed for 'future':
unable to load shared object
'/Users/foo/R/test-3.4/digest/libs/digest.so':
     dlopen(/Users/foo/R/test-3.4/digest/libs/digest.so, 6):
Library not loaded: libR.dylib
     Referenced from: /Users/foo/R/test-3.4/digest/libs/digest.so
     Reason: Incompatible library version: digest.so requires version 3.4.0 or later, but libR.dylib provides version 3.3.0
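
Until such a patch lands, a minimal user-level workaround sketch (plain base R): call loadNamespace() directly so the original condition message is not swallowed:

## Workaround sketch: surface the load error that library() hides
msg <- tryCatch({
  loadNamespace("future")
  NULL
}, error = conditionMessage)
if (!is.null(msg)) stop("package load failed for 'future': ", msg)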

Building R from source: `configure` should also look for `cmp` and `find`

Issue

When building R from source on Linux, configure asserts that required compilers and libraries are available. However, it does not check for:

  • cmp (part of diffutils) - if missing, make gives an error during modules/lapack
  • find (part of findutils) - if missing, make gives a (silent) error during library/translations
  • which - if missing, make gives an error during modules/lapack (PR18510); fixed in R (>= 4.3.1), wch/r-source@079f17d

Wish

Have configure also check for these tools.
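
For reference, the availability of these tools can be checked from any existing R session (a sketch; configure itself is a shell script, this merely illustrates the lookup):

## Sys.which() returns "" for commands not found on the PATH
tools <- Sys.which(c("cmp", "find", "which"))
missing <- names(tools)[tools == ""]
if (length(missing) > 0L)
  stop("missing build tools: ", paste(missing, collapse = ", "))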

WISH: .rm(x) - a fast light-weight version of rm(x)

Background

rm(x) and rm(list="x") are slow. The latter 2-3 times faster, but still very slow (100-200 times slower) compared to a simple assignment, e.g. x <- NULL. For a few number of calls to rm() this makes little difference, but if it's called thousands of times it is noticable.

Some benchmark results:

> options(digits=3)
> microbenchmark::microbenchmark(
  "rm(x)"            = { x <- 1; rm(x) },
  "rm(list='x')"     = { x <- 1; rm(list="x") },
  ".Internal(rm(x))" = { x <- 1; .Internal(remove("x", parent.frame(), FALSE)) },
  "x <- NULL"        = { x <- 1; x <- NULL },
  times=10e3, unit="ms"
)

Unit: milliseconds
             expr      min       lq     mean   median       uq    max neval
            rm(x) 0.030027 0.033492 0.036719 0.034647 0.036186 3.3753 10000
     rm(list='x') 0.018479 0.021558 0.023979 0.022329 0.023483 1.5960 10000
 .Internal(rm(x)) 0.000385 0.001155 0.001249 0.001156 0.001541 0.0192 10000
        x <- NULL 0.000000 0.000001 0.000174 0.000001 0.000386 0.0273 10000

Troubleshooting

One reason rm() is slow is that already at the R level it carries lots of extra weight in order to work in many different cases, e.g. rm(x), rm(list="x"), rm(x,y), rm(list=c("x", "y"), envir=env, inherits=TRUE), etc. As the benchmark stats show, calling .Internal(remove("x", ...)) is faster still, but remains roughly 10 times slower than a plain assignment.

> base::rm
function (..., list = character(), pos = -1, envir = as.environment(pos),
    inherits = FALSE)
{
    dots <- match.call(expand.dots = FALSE)$...
    if (length(dots) && !all(vapply(dots, function(x) is.symbol(x) ||
        is.character(x), NA, USE.NAMES = FALSE)))
        stop("... must contain names or character strings")
    names <- vapply(dots, as.character, "")
    if (length(names) == 0L)
        names <- character()
    list <- .Primitive("c")(list, names)
    .Internal(remove(list, envir, inherits))
}

Suggestion 1

As a straightforward first improvement, the base package could provide:

.rm <- function(x) .Internal(remove(x, parent.frame(), FALSE))
> options(digits=3)
> microbenchmark::microbenchmark(
  "rm(x)"            = { x <- 1; rm(x) },
  "rm(list='x')"     = { x <- 1; rm(list="x") },
  ".Internal(rm(x))" = { x <- 1; .Internal(remove("x", parent.frame(), FALSE)) },
  ".rm('x')" = { x <- 1; .rm("x") },
  "x <- NULL"        = { x <- 1; x <- NULL },
  times=10e3, unit="ms"
)

Unit: milliseconds
             expr      min       lq     mean   median       uq    max neval
            rm(x) 0.030412 0.033492 0.036597 0.034647 0.036186 1.6772 10000
     rm(list='x') 0.018863 0.021558 0.023578 0.022328 0.023483 1.5206 10000
 .Internal(rm(x)) 0.000385 0.000771 0.001293 0.001156 0.001540 1.4509 10000
         .rm('x') 0.000770 0.001540 0.001976 0.001925 0.002310 1.5279 10000
        x <- NULL 0.000000 0.000001 0.000154 0.000001 0.000386 0.0189 10000

Suggestion 2

The above could probably be improved by a native implementation. In [1], @s-u suggests:

If you really want to go overboard, you can define your own function:

SEXP rm(SEXP x, SEXP rho) { setVar(x, R_UnboundValue, rho); return R_NilValue; }
poof <- function(x) .Call(rm_C, substitute(x), parent.frame())

That will be faster than anything else (mainly because it avoids the trip through strings as it can use the symbol directly).

Miscellaneous

Alternative names for this function:

  • .rm()
  • poof()
  • yank()

See also

BUG: NEWS.md are not found by utils::news()

Background

About NEWS.md files in packages:

Issue

However, utils::news() doesn't report on them, e.g.

> p <- system.file(package = "batchtools")
> news(package = "batchtools")
NULL
> system.file("NEWS.md", package = "batchtools", mustWork = TRUE)
[1] "/home/hb/R/x86_64-pc-linux-gnu-library/3.4/batchtools/NEWS.md"

Suggestion

Without having to worry about parsing NEWS.md and converting its format to HTML etc., it could be displayed as-is, just as if it were named NEWS.

Troubleshooting

The problem comes from the fact that tools:::.build_news_db() does not look for NEWS.md:

> f <- tools:::.build_news_db(package = "batchtools")
> f
NULL

It doesn't look hard to do that (UPDATE: It's actually not straightforward; see comment below);

> tools:::.build_news_db
function (package, lib.loc = NULL, format = NULL, reader = NULL) 
{
    dir <- system.file(package = package, lib.loc = lib.loc)
    nfile <- file.path(dir, "NEWS.Rd")
    if (file_test("-f", nfile)) 
        return(.build_news_db_from_package_NEWS_Rd(nfile))
    nfile <- file.path(dir, "NEWS")
    if (!file_test("-f", nfile)) 
        return(invisible())
    if (!is.null(format)) 
        .NotYetUsed("format", FALSE)
    if (!is.null(reader)) 
        .NotYetUsed("reader", FALSE)
    reader <- .news_reader_default
    reader(nfile)
}
<bytecode: 0x36c6e00>
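
For reference, a minimal sketch of where a NEWS.md fallback could slot in, following the display-as-is suggestion above (per the UPDATE, the real fix is less straightforward than this):

    ## Hypothetical fallback inside .build_news_db(): treat NEWS.md as if
    ## it were a plain NEWS file and display it as-is
    nfile <- file.path(dir, "NEWS.md")
    if (file_test("-f", nfile))
        return(.news_reader_default(nfile))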

See also

parallel::makePSOCKcluster(): Add support for reverse SSH tunnels (and any optional SSH command-line options)

Quick summary

Add support for reverse SSH tunneling (-R <port>:localhost:<port>) when setting up PSOCK clusters using parallel::makeCluster(). This helps avoid firewall and port forwarding issues that appear when trying to connect to remote machines / clusters.

Basically, the proposed patch allows you to connect to remote R machines from anywhere as long as you can ssh directly to the machine.

If you have comments, suggestions, ideas and / or critique, please comment below. The plan is to collect and summarize feedback here, then to bring it up on R-devel, and eventually submit the patch to https://bugs.r-project.org/.

Background

The makeCluster() function of the parallel package can be used to run R workers on a remote machine or cluster. This is typically done as:

library("parallel")
cl <- makeCluster("remote.myserver.org", user="johndoe", master="local.mymachine.org", port=11001, homogeneous=FALSE)
res <- parLapply(cl, 1:3, fun=function(x) x^2)
stopCluster(cl)

(*) If port is not specified, a random port in [11000,11999] is used.

By default this results in a connection to remote.myserver.org over SSH via an internal system() call like:

ssh -l johndoe remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=local.mymachine.org PORT=11001 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\"

Issue

Now, in order for this remote connection to be set up successfully, it is not only necessary for the ssh -l johndoe remote.myserver.org connection to work, but also for remote.myserver.org to be able to open a socket in the reverse direction back to our local machine at local.mymachine.org on port 11001. The latter part is problematic because it requires us to open up any local firewalls to allow incoming connections to port 11001 (or any port in the range [11000,11999]).
It is even worse when we're behind a local router, e.g. on a notebook connected via a WiFi router. In such cases we also have to configure the router to forward ("port forwarding") incoming connections on port 11001 (or any port in the range [11000,11999]) to our notebook. If two or more users try to do the same, things become complicated. This not only requires access privileges to configure the local router; you most likely also have to configure the DHCP server to assign a static IP to your notebook, and the same goes for everyone else who wishes to do this. You also have to make sure you're not trying to use the same ports.

Solution

In SSH there is a concept called reverse tunneling, which basically makes it possible to set up a reverse port-to-port connection within the outgoing connection. This way there is no need to worry about remote.myserver.org being able to connect back to your local machine. As long as you can make the outgoing SSH connection, the reverse connection should work out of the box (*).

By replacing the above SSH call with

ssh -l johndoe -R 11001:localhost:11001 remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11001 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\"

the remote R worker will try to open the reverse connection on port 11001 on localhost (i.e. the remote machine). Since reverse tunneling is used, this will be port forwarded to port 11001 on the calling machine (i.e. your local machine).

In addition to the above, this also has the advantage that you don't need to know your public IP address or have a dynamic DNS set up.

(*) An exception is when you use SSH tunneling in your outgoing connection to remote.myserver.org. In such cases, you might have to use more complex reverse SSH tunneling than proposed here.

Suggestion

Add support for reverse SSH tunneling, e.g.

cl <- makeCluster("remote.myserver.org", user="johndoe", revtunnel=TRUE, master="localhost", port=11001, homogeneous=FALSE)

Proposed patch

Here is a patch (svn diff src/library/parallel) that:

  • Adds argument revtunnel (logical) to control whether reverse SSH tunneling should be used or not (this issue).
  • Adds argument rshopts (a character string) for adding any command-line options of choice. This gives the user further options for setting up a reverse SSH tunnel and / or other SSH configurations.
  • If user=NULL (new default), then -l <user> is skipped. This makes it possible to specify the user name in ~/.ssh/config (see Issue #31 for the full discussion).
Index: src/library/parallel/R/snow.R
===================================================================
--- src/library/parallel/R/snow.R   (revision 71320)
+++ src/library/parallel/R/snow.R   (working copy)
@@ -97,8 +97,10 @@
                     outfile = "/dev/null",
                     rscript = rscript,
                     rscript_args = character(),
-                    user = Sys.info()[["user"]],
+                    user = NULL,
                     rshcmd = "ssh",
+                    revtunnel = FALSE,
+                    rshopts = NULL,
                     manual = FALSE,
                     methods = TRUE,
                     renice = NA_integer_,
Index: src/library/parallel/R/snowSOCK.R
===================================================================
--- src/library/parallel/R/snowSOCK.R   (revision 71320)
+++ src/library/parallel/R/snowSOCK.R   (working copy)
@@ -71,11 +71,24 @@
         if (machine != "localhost") {
             ## This assumes an ssh-like command
             rshcmd <- getClusterOption("rshcmd", options)
+            opts <- NULL
+
+            ## Specify '-l user'?
             user <- getClusterOption("user", options)
+            if (!is.null(user)) opts <- c(opts, paste("-l", user))
+
+            ## Use SSH reverse tunneling?
+            revtunnel <- getClusterOption("revtunnel", options)
+            if (isTRUE(revtunnel)) opts <- c(opts, sprintf("-R %d:%s:%d", port, master, port))
+
+            ## Additional SSH options?
+            opts <- c(opts, getClusterOption("rshopts", options))
+
             ## this assume that rshcmd will use a shell, and that is
             ## the same shell as on the master.
             cmd <- shQuote(cmd)
-            cmd <- paste(rshcmd, "-l", user, machine, cmd)
+            opts <- paste(opts, collapse = " ")
+            cmd <- paste(rshcmd, opts, machine, cmd)
         }

         if (.Platform$OS.type == "windows") {
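
With this patch applied, the opening example could be written as (a sketch reusing the argument names from the patch):

library("parallel")
cl <- makeCluster("remote.myserver.org", user="johndoe", revtunnel=TRUE,
                  master="localhost", port=11001, homogeneous=FALSE)
res <- parLapply(cl, 1:3, fun=function(x) x^2)
stopCluster(cl)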

Generic support for dimension-aware attributes, e.g. colattr(x, 'gender')

Description

Generic support for dimension-aware attributes that are acknowledged whenever the object is subsetted. For vectors we have names(), for matrices and data frames we have rownames() and colnames(), and for arrays and other objects we have dimnames().

Example

> x <- matrix(1:12, ncol=4)
> colnames(x) <- c("A", "B", "C", "D")
> colattr(x, 'gender') <- c("male", "male", "female", "male")
> colattr(x, 'age') <- c(26, 43, 28, 33)
> x

     male male female male
       26   43     28   33
        A    B      C    D
[1,]    1    4      7   10
[2,]    2    5      8   11
[3,]    3    6      9   12

> y <- x[,2:3]
> y
     male female
       43     28
        B      C
[1,]    4      7
[2,]    5      8
[3,]    6      9

> colnames(y)
[1] "B" "C"
> colattr(y, 'name')
[1] "B" "C"
> colattr(y, 'gender')
[1] "male" "female"
> colattr(y, 'age')
[1] 43 28
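
For illustration, a user-level proof-of-concept sketch of the proposed API (colattr() is hypothetical; this handles only the two-index matrix case used above, and custom printing is omitted):

## Store per-column attributes in a named list; re-subset them in `[`
colattr <- function(x, name) attr(x, "colattrs")[[name]]

`colattr<-` <- function(x, name, value) {
  stopifnot(length(value) == ncol(x))
  ca <- attr(x, "colattrs")
  if (is.null(ca)) ca <- list()
  ca[[name]] <- value
  attr(x, "colattrs") <- ca
  class(x) <- unique(c("colattrs", class(x)))
  x
}

`[.colattrs` <- function(x, i, j, ..., drop = TRUE) {
  y <- unclass(x)[i, j, drop = drop]
  if (is.matrix(y)) {
    ## carry the per-column attributes along with the column subset
    attr(y, "colattrs") <- lapply(attr(x, "colattrs"), `[`, j)
    class(y) <- class(x)
  }
  y
}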

WISH: Way to call function with explicit "missing" arguments

(adopted from Wiki entry)

Wish

A way to specify that an argument value should be considered "missing", e.g. foo(x, y=missing()).

Background / Issue

Some functions use code that is evaluated conditionally on whether an argument was (explicitly) specified or not - if not explicitly specified, we say the argument is "missing". For example, base::sample() sets size to x or length(x) (depending on the value of x) if it is "missing";

> sample
function (x, size, replace = FALSE, prob = NULL)
{
    if (length(x) == 1L && is.numeric(x) && x >= 1) {
        if (missing(size))
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size))
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x000000000b909e88>
<environment: namespace:base>

If explicit "missing" values would be supported by R, we could do things like:

my_sample <- function(x, size) {
  if (!missing(size)) size <- 2*size
  sample(x, size=size)
}

Instead, we have to write:

my_sample <- function(x, size) {
  if (missing(size)) {
    sample(x)
  } else {
    size <- 2*size
    sample(x, size=size)
  }
}

Comment

A common design pattern is to allow NULL to represent a "missing" value;

sample2 <- function (x, size  = NULL, replace = FALSE, prob = NULL)
{
    if (length(x) == 1L && is.numeric(x) && x >= 1) {
        if (is.null(size))
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (is.null(size))
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}

Note that this would allow us to do:

my_sample <- function(x, size=NULL) {
  if (!is.null(size)) size <- 2*size
  sample(x, size=size)
}

See also

nullfile() / nullcon(): Gets the "null" device / file

Background

On Unix, we have the null device /dev/null which is a file. On Windows there is NUL.

Suggestion

A unified function for retrieving the platform specific null device. Something like:

nullfile <- function() {
  switch(.Platform$OS.type, windows = "NUL", "/dev/null")
}

It's also useful to be able to check whether a file refers to the null device or not:

is_nullfile <- function(pathname) {
  normalizePath(pathname) == nullfile()
}

Discussion

The name nullfile() was chosen to be in line with tempfile(). Is devnull() a better or worse name? It is certainly biased toward Unix.

Also, should it be NUL or NUL: on Windows? I've seen both used. Note that on R for Windows, both normalizePath("NUL") and normalizePath("NUL:") give what I think is garbage output ("\\\\.\\NUL").

UPDATE 2017-02-21: Added is_nullfile() to the wishlist. Mention that normalizePath("NUL") doesn't work on Windows.
UPDATE 2016-03-03: Using a switch instead of if-else statement.

WISH: base::file.rename() to fall back to copy-then-delete for cross-device renames

(Moved from HenrikBengtsson/R.utils#42 (comment))

Background

On Unix, moving / renaming a file across devices is not natively supported. For instance, if we try this in R, we get a warning:

> file.rename(src, dest)
[1] FALSE
Warning: cannot rename file '/dev1/path/mysrc' to '/dev2/path/mysrc', reason 'Invalid cross-device link'

Workaround

When renaming / moving files across devices, the Unix command mv detects that it's a cross-device rename and falls back to first copying the file over and then removing the original, cf. http://stackoverflow.com/questions/24209886/invalid-cross-device-link-error-with-boost-filesystem. R could use the same strategy, by conceptually doing:

> dir.create(dirname(dest), recursive = TRUE)
> res <- file.copy(src, dest)
> if (res) file.remove(src)

(probably needs a few more assertions before deleting the original file).
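
For illustration, a user-level fallback sketch along those lines (file_move() is a hypothetical helper; one such assertion is included):

file_move <- function(src, dest) {
  ## fast path: rename works when src and dest are on the same device
  if (suppressWarnings(file.rename(src, dest))) return(TRUE)
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.copy(src, dest, copy.date = TRUE)) return(FALSE)
  ## assert the copy before deleting the original
  if (file.size(dest) != file.size(src)) {
    file.remove(dest)
    return(FALSE)
  }
  file.remove(src)
}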

Suggestion / Wish

In Issue HenrikBengtsson/R.utils#42 (comment) regarding a similar feature of R.utils::renameFile(), @lawremi wrote:

It would be nice to have the copying logic as a patch to file.rename(). One obvious error to handle from rename() is EXDEV, where the hard link fails across devices.

Expanding on this, file.rename() could update its native code to detect the error code EXDEV that rename() sets whenever one tries to rename across devices:

By detecting this case, we can update the native code to take a copy-then-delete approach.

See also

BUG-ISH: parallel::makeCluster(..., type="PSOCK") overrides ~/.ssh/config username specifications

Quick summary

The PSOCK functionality of the parallel package currently always passes an -l <username> option in the ssh call. If the user doesn't specify a username (via argument user), it falls back to the default username (= Sys.info()[["user"]]). The problem is that this overrides any username specification in a ~/.ssh/config file. I propose a patch to parallel that only passes the -l <username> option if user is explicitly set. If not specified, it relies on ssh to do the correct thing.

Background

Using the parallel package, we can connect to a remote machine (or cluster) by using:

library("parallel")
myip <- readLines("https://myexternalip.com/raw")
cl <- makeCluster("remote.myserver.org", user="henrik", master=myip, homogeneous=FALSE)

The default is that the connection is set up via an ssh call. To verify that we can connect to the remote machine from our current location, we can use:

ssh -l henrik remote.myserver.org

(Verifying that the remote machine in turn can connect back, which is also required, is a different topic and not relevant to this issue.)

Issue

If one uses different usernames locally and remotely, it is convenient to configure the default username for a given server by editing ~/.ssh/config, e.g.

$ more ~/.ssh/config

Host remote.myserver.org
  User henrik

So, if my local username is hb, i.e.

$ echo $USER
hb

with the above ~/.ssh/config file, I no longer have to specify -l henrik, but I can just do:

ssh remote.myserver.org

regardless of my local username. (Without a ~/.ssh/config file, ssh would fall back to using my local username, as in -l $USER. I will return to this in my proposal at the end.)

Unfortunately, this does not work with parallel::makeCluster(). In other words, it is not possible to just do:

library("parallel")
myip <- readLines("https://myexternalip.com/raw")
cl <- makeCluster("remote.myserver.org", master=myip, homogeneous=FALSE)

Troubleshooting

The reason the username in ~/.ssh/config is ignored is that the parallel package overrides it by always passing option -l <local username> to ssh. For example,

> library("parallel")
> myip <- readLines("https://myexternalip.com/raw")
> trace(system, tracer=quote(print(command)))
> cl <- makeCluster("remote.myserver.org", master=myip, homogeneous=FALSE)
Tracing system(cmd, wait = FALSE) on entry 
[1] "ssh -l hb remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=1.2.3.4 PORT=11633 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\""

which matches my local username:

> Sys.info()[["user"]]
[1] "hb

If we dig into the code of the parallel package, we find the following piece of code in parallel:::newPSOCKnode():

        if (machine != "localhost") {
            ## This assumes an ssh-like command
            rshcmd <- getClusterOption("rshcmd", options)
            user <- getClusterOption("user", options)
            ## this assume that rshcmd will use a shell, and that is
            ## the same shell as on the master.
            cmd <- shQuote(cmd)
            cmd <- paste(rshcmd, "-l", user, machine, cmd)
        }

where

> options <- parallel:::defaultClusterOptions
> options$user
[1] "hb"

Further inspection shows that getClusterOption("user", options) returns the above default options$user value if user is not explicitly specified as an argument to makeCluster() (see the initial example).

The parallel:::defaultClusterOptions object is initialized with user = Sys.info()[["user"]] as seen in parallel:::initDefaultClusterOptions() which is called when the parallel package is loaded.

Workaround

Since parallel:::newPSOCKnode() always injects -l <username>, it is not clear how to circumvent this. For instance, trying to trick it with parallel:::defaultClusterOptions$user <- NULL will not work.

Another alternative would be to use a custom rshcmd script that discards the -l username options. However, that would also discard user when it is legitimately specified as an argument to makeCluster(). It would also be tricky to write a solution that works out-of-the-box on all operating systems.

The only solution I see is to parse ~/.ssh/config to check whether it specifies a different remote username than the default local username. If it does, then user should be set to this remote username specified in ~/.ssh/config. That could kind of work, but it would require a robust parser.

In summary, I don't see a neat workaround for this problem.

Suggestion

If option -l <username> is not specified in the call to ssh, ssh falls back to whatever is specified in the ~/.ssh/config file, and otherwise to the local username (basically -l $USER). In other words, there is no real reason for the parallel package to do this work instead of ssh. If the parallel package would only pass -l <username> when user is explicitly set, then any User specifications in ~/.ssh/config would also be acknowledged.

Patch

Here's an SVN patch (svn diff src/library/parallel/R/snow*.R) that would achieve this:

Index: src/library/parallel/R/snow.R
===================================================================
--- src/library/parallel/R/snow.R   (revision 71304)
+++ src/library/parallel/R/snow.R   (working copy)
@@ -97,7 +97,7 @@
                     outfile = "/dev/null",
                     rscript = rscript,
                     rscript_args = character(),
-                    user = Sys.info()[["user"]],
+                    user = NULL,
                     rshcmd = "ssh",
                     manual = FALSE,
                     methods = TRUE,
Index: src/library/parallel/R/snowSOCK.R
===================================================================
--- src/library/parallel/R/snowSOCK.R   (revision 71304)
+++ src/library/parallel/R/snowSOCK.R   (working copy)
@@ -75,7 +75,8 @@
             ## this assume that rshcmd will use a shell, and that is
             ## the same shell as on the master.
             cmd <- shQuote(cmd)
-            cmd <- paste(rshcmd, "-l", user, machine, cmd)
+            opts <- if (is.null(user)) NULL else paste("-l", user)
+            cmd <- paste(rshcmd, opts, machine, cmd)
         }

         if (.Platform$OS.type == "windows") {

Proof of concept

Here's a proof-of-concept hack that allows you to test the above patch without having to rebuild R from source:

## Tweak parallel:::newPSOCKnode()
newPSOCKnode <- parallel:::newPSOCKnode
expr <- body(newPSOCKnode)
code <- deparse(expr)
pattern <- 'cmd <- paste(rshcmd, "-l", user, machine, cmd)'
replacement <- 'opts <- if (is.null(user)) NULL else paste("-l", user); cmd <- paste(rshcmd, opts, machine, cmd)'
code <- gsub(pattern, replacement, code, fixed=TRUE)
expr <- parse(text=code)
body(newPSOCKnode) <- expr
assignInNamespace("newPSOCKnode", newPSOCKnode, ns=getNamespace("parallel"))

## Remove default 'user' option
opts <- parallel:::defaultClusterOptions
opts$user <- NULL
assignInNamespace("defaultClusterOptions", opts, ns=getNamespace("parallel"))
stopifnot(is.null(parallel:::defaultClusterOptions$user))
> library("parallel")
> myip <- readLines("https://myexternalip.com/raw")
> trace(system, tracer=quote(print(command)))

## Argument 'user' still works
> cl <- makeCluster("remote.myserver.org", user="henrik", master=myip, homogeneous=FALSE)
Tracing system(cmd, wait = FALSE) on entry 
[1] "ssh -l henrik remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=1.2.3.4 PORT=11266 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\""

## But if not specified it is not passed
> cl <- makeCluster("remote.myserver.org", master=myip, homogeneous=FALSE)
Tracing system(cmd, wait = FALSE) on entry 
[1] "ssh remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=1.2.3.4 PORT=11633 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\""

tools: Informative error message if vignette engine pattern produces empty name

The filename pattern for a vignette engine should be such that sub(engine$pattern, "", filename) returns the vignette name. For instance, vignette engine R.rsp::md registers pattern = "[.]md$", such that for vignette foo.bar.md the name becomes foo.bar.
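
For illustration, the correct and the too-greedy pattern side by side:

sub("[.]md$", "", "foo.bar.md")    ## "foo.bar" - the name is preserved
sub(".*[.]md$", "", "foo.bar.md")  ## ""        - the pattern consumed the name too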

Issue

A likely mistake (I just made it and had to resort to trial-and-error and code inspection) is to use a pattern that also consumes the name part. For instance, if instead of pattern = "[.]md$" we had used pattern = ".*[.]md$", which still matches foo.bar.md, then the inferred vignette name would be empty, i.e. "". If such a pattern is used, the error message produced by R CMD build and tools::buildVignette[s]() is not very informative:

Failed to locate the ‘weave’ output file (by engine ‘R.rsp::md’) for vignette with
name ‘’. The following files exist in directory ‘.’: ‘intro.R’, ‘intro.html’, ‘intro.md’

Suggestion

Update tools::buildVignette() to check the inferred name and give an informative error message if an empty name was produced, e.g.

    # Infer the vignette name
    names <- sapply(engine$pattern, FUN = sub, "", file)
    name <- basename(names[(names != file)][1L])
    if (nchar(name) == 0) {
        stop("Inferred an empty vignette name from vignette file ", sQuote(file),
             ". This happened because the file name pattern ", sQuote(engine$pattern),
             " for vignette engine ",
             sQuote(paste(engine$package, engine$name, sep = "::")),
             " matched all of the vignette file name.")
    }

This should then produce an error such as:

Error: Inferred an empty vignette name from vignette file 'foo.bar.md'. This happened
because the file name pattern '.*[.]md$' for vignette engine 'R.rsp::md' matched all of
the vignette file name.

Support for IEC (KiB, MiB, ...) and SI (kB, MB, ...) byte units

Background

There are a few standards [1] for binary prefixes for byte-size units:

  • IEC: KiB (1024 bytes), MiB (1024^2 bytes), GiB (1024^3 bytes), TiB (1024^4 bytes), ...
  • JEDEC & customary standard: KB (1024 bytes), MB (1024^2 bytes), GB (1024^3 bytes)

Note that for decimal prefixes, we have:

  • SI: kB (1000 bytes), MB (1000^2 bytes), GB (1000^3 bytes), TB (1000^4 bytes), ...

For byte versus bit, we have:

  • IEC & customary standard: 'B' for 'byte' and 'bit' for 'bit' [3,4].
  • IEEE: 'b' for 'bit' [3].

Problem

  • R uses Kb, Mb and Gb. None of these are part of the above byte standards. Note that a lower case 'b' is typically used for bit, not byte.

For example,

> size <- object.size(1:1e7)
> size
40000040 bytes
> format(size, units="auto")
[1] "38.1 Mb"

This specific example illustrates a problem with utils:::format.object_size(). Another example is:

> base::gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 279622 15.0     592000 31.7   350000 18.7
Vcells 478234  3.7    1023718  7.9   786432  6.0
> str(base::gc())
 num [1:2, 1:6] 279638 478263 15 3.7 592000 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "Ncells" "Vcells"
  ..$ : chr [1:6] "used" "(Mb)" "gc trigger" "(Mb)" ...

The issue with non-standard byte units in R has been reported to R-devel [5].

Wish / Suggestion

  • Use units KiB, MiB, GiB, TiB, ... everywhere in R because they are unambiguous. UPDATE: ... or SI units?
  • Migrate smoothly by:
    • Add support for IEC, JEDEC and SI prefixes where applicable;
      • IEC units for utils:::format.object_size(), cf. PR #16649. Completed as of 2016-01-06 in r69879.
      • JEDEC units for utils:::format.object_size(), cf. PR #16657. UPDATE: See discussion in comments below.
      • SI units for utils:::format.object_size(). UPDATE: Added to R-devel on 2017-01-11 (r71960)
    • Add an option for the default unit standard used in R, e.g. getOption("byte.unit.standard", "legacy").
    • Make SI units the new default, e.g. in gc(), format.object_size(..., units="auto") and allocation error messages.
    • Deprecate invalid units (lower case b) with .Deprecated().
    • Eventually drop them using .Defunct().

Known functions / code affected:

Note, the out-of-memory errors in the native code cannot easily be tweaked to support a global option; if tried, there is a risk that this itself triggers another out-of-memory error.
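
For reference, a minimal sketch of what unambiguous IEC-style output could look like (format_iec() is a hypothetical helper, not a proposed API):

format_iec <- function(bytes) {
  units <- c("B", "KiB", "MiB", "GiB", "TiB")
  e <- if (bytes > 0) min(floor(log(bytes, base = 1024)), length(units) - 1L) else 0L
  sprintf("%.1f %s", bytes / 1024^e, units[e + 1L])
}
format_iec(40000040)  ## "38.1 MiB" (cf. the "38.1 Mb" example above)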

Usages of IEC / SI elsewhere

  • The Ubuntu Linux distribution uses IEC prefixes for base-2 units and SI prefixes for base-10 units [6].
  • Windows and Android use JEDEC prefixes.
  • Mac OS X has used decimal SI units (kB) since 2009.

References

  1. Binary prefix, Wikipedia, https://en.wikipedia.org/wiki/Binary_prefix
  2. Byte, Wikipedia, https://en.wikipedia.org/wiki/Byte#Unit_symbol
  3. Bit, Wikipedia, https://en.wikipedia.org/wiki/Bit#Unit_and_symbol
  4. Man page units(7), http://man7.org/linux/man-pages/man7/units.7.html
  5. R devel thread 'format(object.size(...), units): KB, MB, and GB instead of Kb, Mb, and Gb?' started on 2014-09-07
  6. UnitsPolicy, Ubuntu Wiki, Jan 2016, https://wiki.ubuntu.com/UnitsPolicy
  • UPDATE 2016-05-03: Added src/gnuwin32/malloc.c to the list of places that need to be updated.
  • UPDATE 2017-01-01: Aim for SI to be the new standard.
  • UPDATE 2017-01-11: Propose option byte.unit.standard for smooth transition.
  • UPDATE 2017-05-17: Identified more (all?) locations in R and native code that require updating.

WISH: Allow for mc.cores=0 in parallel::mclapply() and friends

Background

From help("options") we have that mc.cores is default as:

mc.cores:
a integer giving the maximum allowed number of additional R processes allowed to be run in parallel to the current R process. Defaults to the setting of the environment variable MC_CORES if set. Most applications which use this assume a limit of 2 if it is unset.

From this definition, I would interpret mc.cores = 0, 1, 2 to mean:

  • mc.cores = 0: Only the main R process may run.
  • mc.cores = 1: The main R process plus one more forked process may run.
  • mc.cores = 2: The main R process plus two more forked processes may run.

Comment: This means that, from a computational point of view, it makes little sense to use mc.cores = 1 with parallel::mclapply() and friends, because it forks off a single R process (with the main process only polling/waiting for it to finish) that performs the same calculation the main R process could have done alone. In this sense, mc.cores = 1 could effectively be implemented the same way as mc.cores = 0. (However, you could imagine implementations that make full use of exactly two R processes. This is for instance possible using the future package.)

On compute clusters with schedulers such as PBS and Slurm, you submit jobs and request the number of cores you need. If you request a single-core process, it makes sense to do all calculations in the main R process. Thus, we should really use mc.cores = 0 whenever we are allocated a single-core R session. If we use mc.cores = 1, we are actually consuming two processes.

Problem

Currently, mc.cores = 0 gives an error when used by parallel::mclapply() and friends, e.g.

>  parallel::mclapply(1:3, FUN=rnorm, mc.cores=0L)
Error in parallel::mclapply(1:3, FUN = rnorm, mc.cores = 0L) :  'mc.cores' must be >= 1

> options(mc.cores=0L)
> parallel::mclapply(1:3, FUN=rnorm)
Error in parallel::mclapply(1:3, FUN = rnorm) : 'mc.cores' must be >= 1

This means that, in order to write code that is agile to cluster settings and works with any number of allocated cores, we need tedious coding such as:

if (mc.cores == 0L) {
  y <- lapply(X, FUN=...)
} else {
  y <- mclapply(X, FUN=..., mc.cores=mc.cores)
}

It is clear that not everyone is aware that mc.cores specifies additional R processes. For instance, it is not uncommon to see mc.cores = detectCores() where the developer probably intended mc.cores = detectCores() - 1. (PS. It is not really a good idea to use detectCores() this way, cf. the help.)
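
Until then, the boilerplate above can at least be wrapped once, e.g. (a sketch; it also treats mc.cores = 1 as sequential, per the comment above):

mclapply0 <- function(X, FUN, ..., mc.cores = getOption("mc.cores", 2L)) {
  if (mc.cores <= 1L) {
    lapply(X, FUN, ...)  ## run everything in the main R process
  } else {
    parallel::mclapply(X, FUN, ..., mc.cores = mc.cores)
  }
}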

Wish / Suggestion

Add support for mc.cores = 0 by mclapply() and friends in the parallel package. Specifically:

  • Update mclapply() and friends to allow for mc.cores = 0, which should fall back to lapply() or similar. Actually, mc.cores = 1 could do the same thing.
  • Clarify in the help pages that mc.cores = 0 is a perfectly fine setting.
  • Consistently handle when mc.cores is a missing value.
  • Update mclapply() on Windows to default to mc.cores = 0 instead of mc.cores = 1 as done currently.

Details

The current implementation of parallel::mclapply() already falls back to using base::lapply() whenever it is called from a multicore child process and recursive multicore processing is not explicitly enabled;

> parallel::mclapply
function (X, FUN, ..., mc.preschedule = TRUE, mc.set.seed = TRUE,
    mc.silent = FALSE, mc.cores = getOption("mc.cores", 2L),
    mc.cleanup = TRUE, mc.allow.recursive = TRUE)
{
    cores <- as.integer(mc.cores)
    if (is.na(cores) || cores < 1L)
        stop("'mc.cores' must be >= 1")
    .check_ncores(cores)
    if (isChild() && !isTRUE(mc.allow.recursive))
        return(lapply(X = X, FUN = FUN, ...))
[...]

Thus, it would take very little to extend it to also support mc.cores = 0, e.g.

    if (is.na(cores) || cores < 0L)
        stop("'mc.cores' must be >= 0")
    if (mc.cores == 0 || (isChild() && !isTRUE(mc.allow.recursive)))
        return(lapply(X = X, FUN = FUN, ...))
[...]

We may even want to use if (mc.cores <= 1 || ...) as suggested above.

UPDATE 2016-02-04: As @ilarischeinin points out in his comment below, with mclapply(..., mc.preschedule=TRUE) (the default), a bit further down in the code it actually already says:

    ## mc.preschedule = TRUE from here on.
    if (length(X) < cores) cores <- length(X)
    if (cores < 2L) return(lapply(X = X, FUN = FUN, ...))

Thus, it's clear that here the developer has had similar thoughts.

Continuing: on R for Windows, which does not support multicore processing / forking of processes, mclapply() falls back to calling lapply();

> parallel::mclapply
function (X, FUN, ..., mc.preschedule = TRUE, mc.set.seed = TRUE,
    mc.silent = FALSE, mc.cores = 1L, mc.cleanup = TRUE,
mc.allow.recursive = TRUE)
{
    cores <- as.integer(mc.cores)
    if (cores < 1L)
        stop("'mc.cores' must be >= 1")
    if (cores > 1L)
        stop("'mc.cores' > 1 is not supported on Windows")
    lapply(X, FUN, ...)
}

It would not be hard to update this one accordingly, i.e.

    if (is.na(cores) || cores < 0L)
        stop("'mc.cores' must be >= 0")
    if (cores > 0L)
        stop("'mc.cores' > 0 is not supported on Windows")
    lapply(X, FUN, ...)

Interestingly, looking at parallel::pvec(), we can see that the developer also thinks it is unnecessary to fork off a process if mc.cores = 1 (see also the paragraph on mclapply(..., mc.preschedule=TRUE) above);

> pvec
function (v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
    mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
{
    if (!is.vector(v))
        stop("'v' must be a vector")
    cores <- as.integer(mc.cores)
    if (cores < 1L)
        stop("'mc.cores' must be >= 1")
    if (cores == 1L)
        return(FUN(v, ...))
[...]

Thus, here too it would be easy to add support for mc.cores = 0.

ANNOYANCE: Interrupt behavior of Sys.sleep() with setTimeLimit() depends on OS and R front-end used

Issue

Although setting setTimeLimit(cpu = timeout, elapsed = timeout) will signal a timeout error on Sys.sleep(wait) whenever wait > timeout, it does so after timeout seconds only on Windows. On Linux and macOS, the timeout is signaled only after wait seconds.

Answer / Reason

In R-devel thread 'setTimeLimit sometimes fails to terminate idle call in R' on 2013-05-16, Simon Urbanek wrote:

What causes this difference?

The time limit can only be checked in R_ProcessEvents() so for all practical purposes it can be only triggered by interruptible code that calls R_CheckUserInterrupt().
Now, it is entirely up to the front-end to decide how it will [run] the event loop. For example the terminal version of R has no other interrupts to worry about other than input handlers which trigger asynchronously, so it doesn't need to do any polling. Sys.sleep() only triggers on input handlers, so if you don't have any external event source hook as input handler, there is no reason to process any events so Sys.sleep() won't see any reason to check the time limit.
[...]
But note that this is really just a special case of Sys.sleep(). If you actually run R code, then ProcessEvents is triggered automatically during the evaluation (or in interruptible C code).

Reproducible example

Windows

R version 3.4.2 (2017-09-28) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
[...]
> setTimeLimit(cpu = 1.0, elapsed = 1.0, transient=FALSE)
> system.time(try(Sys.sleep(3)))
Error in Sys.sleep(3) : reached elapsed time limit
   user  system elapsed 
  0.004   0.000   1.007

Linux

$ R --vanilla
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
[...]
> setTimeLimit(cpu = 1.0, elapsed = 1.0, transient=FALSE)
> system.time(try(Sys.sleep(3)))
Error in Sys.sleep(3) : reached elapsed time limit
   user  system elapsed 
  0.004   0.000   3.002

This is also the same on R 2.11.0, R 2.15.3, R 3.3.2, R 3.3.3, and R devel (2017-11-03 r73667).

macOS

@ilarischeinin confirms via Twitter that macOS has the same problem:

3s, with R 3.3.3 on Yosemite 10.10.5. [...]

@mllg likewise confirms that macOS behaves like Linux above ("3 seconds").

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

@alexrecuenco reports that when using the RStudio frontend it times out after "1 second" (as wanted) whereas using the R terminal it times out after "3 seconds":

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
> setTimeLimit(cpu = 1.0, elapsed = 1.0, transient=FALSE)
> t0 <- Sys.time(); try(Sys.sleep(3)) ; dt <- Sys.time() - t0; print(dt)
Error in Sys.sleep(3) : reached elapsed time limit
Time difference of 1.00736 secs
Error: reached elapsed time limit

macOS workaround

@ilarischeinin wrote: Just to confirm that the quartz(); dev.off() trick that was mentioned on the R-devel thread from 2013 still applies on macOS 10.13 High Sierra and R 3.4.2:

$ R --vanilla
R version 3.4.2 (2017-09-28) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)
[...]
> setTimeLimit(cpu = 1.0, elapsed = 1.0, transient=FALSE)
> t0 <- Sys.time(); try(Sys.sleep(3)) ; dt <- Sys.time() - t0; print(dt)
Error in Sys.sleep(3) : reached elapsed time limit
Time difference of 3.007089 secs
> quartz(); dev.off()
null device 
          1 
> t0 <- Sys.time(); try(Sys.sleep(3)) ; dt <- Sys.time() - t0; print(dt)
Error in Sys.sleep(3) : reached elapsed time limit
Time difference of 1.080138 secs

Comment 1: Could it be related to the R version?

Comment 2: That macOS has the same behavior as Linux shouldn't be surprising, because they use the same native code base for Sys.sleep().

See also

UPDATE 2017-11-05: Simplified example by using system.time(try(Sys.sleep(3))) instead of t0 <- Sys.time(); try(Sys.sleep(3)) ; dt <- Sys.time() - t0; print(dt).

UPDATE 2017-11-05: @alexrecuenco found the answer in an R-devel thread from Oct 2013 (see above). Thxs.

UPDATE 2017-12-02: Fix cut'n'paste typo; The timining in the Windows example now confirms that it times out after one second (not three seconds).

BUG: socketSelect(..., timeout): failed timeout on Unix for certain fractional timeouts

Issue

Certain non-integer values of argument timeout to base::socketSelect() result in an infinite timeout on Linux (but not Windows).

Example

To illustrate the problem, set up a server-client connection between the current R session and a background R session. In a fresh R session, do:

setupConnection <- function(host = "localhost", port = 11001L) {
  Rscript <- file.path(R.home("bin"), "Rscript")
  cmd <- sprintf("Sys.sleep(1); socketConnection('%s', port = %d, server = FALSE, blocking = TRUE, open = 'a+b'); repeat { Sys.sleep(10) }", host, port)
  system2(Rscript, args = c("-e", shQuote(cmd)), wait = FALSE)
  socketConnection(host, port = port, server = TRUE, blocking = TRUE, open = 'a+b')
}
con <- setupConnection()
print(con)

##         description        class         mode        text 
## "<-localhost:11001"   "sockconn"        "a+b"    "binary" 
##              opened     can read    can write 
##            "opened"        "yes"        "yes" 

Next, use socketSelect() to wait for input to become available on the connection, but give up after a certain timeout period:

for (timeout in c(0, 1, 2, 2.1, 2.5, 3)) {
  t <- system.time({
    ans <- socketSelect(list(con), write = FALSE, timeout = timeout)
  })
  stopifnot(!ans) ## Nothing should be available
  print(t)
  stopifnot(abs(t[["elapsed"]] - timeout) < 0.1)
}

##   user  system elapsed 
##      0       0       0 
##   user  system elapsed 
##  0.000   0.000   1.001 
##   user  system elapsed 
##  0.000   0.000   2.002 
##   user  system elapsed 
##  0.000   0.000   2.102 
##   user  system elapsed 
##  0.000   0.000   2.501 
##   user  system elapsed 
##  0.000   0.000   3.003 

So far so good. However, non-integer timeouts in (0,2) result in an infinite timeout on Linux. For example,

# Wait at most 1.9 seconds
> t <- system.time(r <- socketSelect(list(con), write = FALSE, timeout = 1.9)); print(t); print(r)
^C   user  system elapsed
  3.780  14.888  20.594
> t <- system.time(r <- socketSelect(list(con), write = FALSE, timeout = 0.1)); print(t); print(r)
^C   user  system elapsed
  2.596  11.208  13.907
[1] FALSE

Note how I had to signal a user interrupt (Ctrl-C) to exit socketSelect(). It is as if there is something special about non-integer values in (0,2). Also, note that the original timeouts (0, 1, 2, 2.1, 2.5, 3) still work afterward.

I've tested and confirmed the above behavior with R 3.3.2 on Ubuntu 16.04 and with R 3.3.1 on RedHat 6.6. It does not appear with R 3.3.1 on Windows (running via Linux Wine). Behavior on macOS is unknown / untested. The fact that it works on Windows suggests it may be Linux-specific.

Session information details

Ubuntu 16.04:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.3.2

RedHat 6.6:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.3.1

Windows via Wine on Linux:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows XP x64 (build 2600) Service Pack 3

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=C                         LC_NUMERIC=C
[5] LC_TIME=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.3.1

See also

WISH: Rscript -p <n> to specify number of parallel R processes

Wish

Add a command-line option to R and Rscript for specifying the number of parallel processes to use/allocate. Although it is possible for individual packages to define this themselves, it would be handy to have one standard option for this across the board. In R itself, we already have two related options with different names for this.

Suggested names for options (same for R):

Rscript -p <n>
Rscript --processes=<n>
Rscript --max-processes=<n>
Rscript --cores=<n>
Rscript --max-cores=<n>

Alternatives? What do other software use? Here are some examples:

make -j <n> / make --jobs=<n>
parallel -j <n>
xargs -P <n> / xargs --max-procs=<n>
julia -p, --procs {N|auto}

Examples:

  • Rscript foo.R (default; equivalent to Rscript -p $R_PROCESSES foo.R)
  • Rscript -p 1 foo.R - R uses a single process.
  • Rscript -p 2 foo.R - R uses two processes; the main process plus one more, cf. options(mc.cores=1L).
  • Rscript -p 3 foo.R - R uses three processes; the main process plus two more, cf. options(mc.cores=2L).

Also, we might want an analogous environment variable, e.g. R_PROCESSES.

What should it do?

Specifying this command line option should set related R options, e.g.

  • options(mc.cores=n-1) - Number of additional cores/processes used by the parallel::mclapply() and friends.
  • options(Ncpus=n) (or n-1?) - Number of processes used by install.packages() to install packages in parallel. HB: Is this including the main R processes or additional ones?
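
For illustration, what setting these could amount to, as a sketch (R_PROCESSES is the hypothetical environment variable proposed above):

## Hypothetical: derive R options from the proposed R_PROCESSES variable
n <- as.integer(Sys.getenv("R_PROCESSES", unset = "1"))
options(mc.cores = max(n - 1L, 1L),  ## n - 1 'additional' processes, floored at 1
        Ncpus = n)                   ## since mc.cores = 0 is not allowed today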

Additional comments

Note that mc.cores as defined by the parallel package specifies additional R processes. In other words, the total number of processes used is one more than mc.cores. Then why change the definition from additional (n-1) to the total number of processes/cores (n)? Because specifying additional cores is confusing and not widely understood. One proof of this is that you see examples using options(mc.cores=detectCores()) when they really meant options(mc.cores=detectCores()-1). More importantly, mc.cores is really defined for parallel::mclapply() and friends, which call a function on an additional set of processes and use the main R process to wait/poll for results. However, you can imagine other implementations that also use the main process for full processing (and poll only occasionally). This is for instance supported by the future package.

Related

Preserve element names in multi-dimensional subsetting

Background

Vectors in R can have dimensions and corresponding names, cf. dim() and dimnames(). They can concurrently have element names, cf. names(). For instance,

> x <- matrix(1:6, nrow=2)
> rownames(x) <- c("A", "B")
> colnames(x) <- c("X", "Y", "Z")
> names(x) <- letters[1:6]
> x
  X Y Z
A 1 3 5
B 2 4 6
attr(,"names")
[1] "a" "b" "c" "d" "e" "f"

This allows us to access elements both by dimension names and by element names, e.g.

> x[3]
c
3
> x["c"]
c
3
> x[1,2]
[1] 3
> x["A","Y"]
[1] 3

Problem

When subsetting by dimensions, we lose the element names, e.g.

> y <- x[,2:3]
> y
  Y Z
A 3 5
B 4 6
> names(y)
NULL

Wish

Preserve element names also when subsetting by dimensions, e.g.

> y
  Y Z
A 3 5
B 4 6
attr(,"names")
[1] "c" "d" "e" "f"
> y[1]
c
3
> y["c"]
c
3
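
In the meantime, a user-level workaround sketch (subset_keep_names() is a hypothetical helper; both indices are required):

subset_keep_names <- function(x, i, j) {
  ## element positions that survive the dimensional subset
  idx <- matrix(seq_along(x), nrow = nrow(x))[i, j]
  y <- x[i, j, drop = FALSE]
  names(y) <- names(x)[idx]
  y
}
y <- subset_keep_names(x, 1:2, 2:3)
names(y)  ## "c" "d" "e" "f"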

R CMD check --flavor=<flavor>: customized add-on package validation

(adopted from existing Wiki entry)

Background

By default

R CMD check MyPkg

validates the package MyPkg using a (continuously growing) set of tests/checks. In addition to these, we can run

R CMD check --as-cran MyPkg

which will run an additional set of validation tests to assert that the package meets CRAN's requirements before package submission.

When submitting a package to Bioconductor, they require that in addition to R CMD check you also run:

R CMD BiocCheck MyPkg

to run a "set of tests that encompass Bioconductor Best Practices".

Problems / Issues

Although we currently have great mechanisms for validating R packages, we don't have an easy way to extend and/or customize them (beyond hard-coded command-line options and environment variables). This in turn makes it hard for developers outside R core to implement and share additional package validation tests.

Slightly related: the current R CMD check --as-cran hard-codes CRAN into the R base distribution. We already have Bioconductor as an additional large-scale package repository, and one could imagine others in the future. In other words, there might be a need to decouple CRAN from R base at some point, which would be in line with what we often hear when users ask questions about CRAN and/or R in the wrong forum: "CRAN is not R" and "R is not CRAN".

Wish / Suggestion

One approach could be to extend R CMD check with an option for running add-on validation tests;

R CMD check --flavor=<flavors> MyPkg

For example,

R CMD check --flavor=CRAN MyPkg
R CMD check --flavor=Bioconductor MyPkg
R CMD check --flavor=CRAN,Bioconductor MyPkg
R CMD check --flavor=CRAN,covr,lintr MyPkg

Alternative option names

R CMD check --as=<...> MyPkg
R CMD check --by=<...> MyPkg
R CMD check --flavor=<...> MyPkg
R CMD check --suite=<...> MyPkg
R CMD check --with=<...> MyPkg

How?

A natural place to implement add-on package validation tests is in an R package itself. The most straightforward approach would be to name the package after the flavor, i.e. we would have packages CRAN and Bioconductor. These packages would then need to export an R_CMD_check_flavor() function that can be called by R CMD check --flavor=<flavor> MyPkg.
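
As a sketch, such a package could export something along these lines (everything here is hypothetical, per the proposal):

## Hypothetical entry point called by 'R CMD check --flavor=<flavor> MyPkg'
R_CMD_check_flavor <- function(pkgdir, ...) {
  results <- list()
  ## example add-on check: flag packages without a NEWS file
  if (!any(file.exists(file.path(pkgdir, c("NEWS", "NEWS.md")))))
    results$news <- "NOTE: no NEWS or NEWS.md file found"
  results
}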

Things to consider

  • Should / could there be a way for an add-on validation package to register or tag itself as a validation package? Is R_CMD_check_flavor() good enough?
  • Should the flavor be bound to the package name (as suggested above)? An alternative is that a package has an R_CMD_check_get_flavors() function which returns the supported "flavors", cf. how a vignette build package can provide/register one or more vignette engines.
  • What about sub-flavors? E.g. imagine that Bioconductor wishes to provide different types of tests depending on package type (software, annotation data, ...) or "current" versus "upcoming" tests. Do we need a special syntax for non-default validation sets, e.g. R CMD check --flavor=CRAN,Bioconductor::upcoming?

WISH: Built-in MD5 checksum calculator, e.g. `tools::md5(x)`

Background

R has a built-in MD5 checksum calculator, tools::md5sum(), but it only operates on files. It takes a vector of pathnames (not connections) as input and returns a character vector of the same length containing the MD5 checksums, e.g.

> files <- dir(R.home(), pattern = "^C", full.names = TRUE)
> files
[1] "C:/PROGRA~1/R/R-3.2.5/CHANGES" "C:/PROGRA~1/R/R-3.2.5/COPYING"
> tools::md5sum(files)
     C:/PROGRA~1/R/R-3.2.5/CHANGES      C:/PROGRA~1/R/R-3.2.5/COPYING
"d45eec95ce49830fdd3277950397dbde" "0cce1e42ef3fb133940946534fcf8896"

Wish / Suggestion

Calculating MD5 checksums is such a common task that it would warrant a core R function for calculating the checksum of an R object x, e.g. tools::md5(x).

There is an internal src/library/tools/src/md5.c file that implements the MD5 checksum. It even has an internal md5_buffer() function that seems to do exactly this.
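
In the meantime, a workaround sketch that reuses the file-based md5sum() by serializing to a temporary file (note that the checksum then depends on the serialization settings, unlike a purpose-built tools::md5()):

md5 <- function(x) {
  f <- tempfile()
  on.exit(file.remove(f))
  saveRDS(x, file = f, compress = FALSE)
  unname(tools::md5sum(f))
}
md5(1:100)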

See also

  • digest package: Provides well-tested function digest::digest(x, algo="md5") for calculating the MD5 checksum for R object x.

WISH: Auto-installing of non-installed packages

Wish

Instead of getting an error when trying to load/attach a package that does not exist:

> library("future")
Error in library("future") : there is no package called 'future'

wouldn't it be handy if the package were installed on the fly? For instance,

> library("future")
Note in library("future") : there is no package called 'future', will try install the package
and its dependencies.  To disable this, set option(install.packages = FALSE).
Installing package into 'C:/Users/hb/R/win-library/3.4'
(as 'lib' is unspecified)
also installing the dependencies 'globals', 'listenv'
trying URL 'https://cran.r-project.org/src/contrib/globals_0.6.1.tar.gz'
Content type 'application/x-gzip' length 10882 bytes (10 KB)
downloaded 10 KB

trying URL 'https://cran.r-project.org/src/contrib/listenv_0.6.0.tar.gz'
Content type 'application/x-gzip' length 31182 bytes (30 KB)
downloaded 30 KB

trying URL 'https://cran.r-project.org/src/contrib/future_0.13.0.tar.gz'
Content type 'application/x-gzip' length 109367 bytes (106 KB)
downloaded 106 KB
[...]
* DONE (future)

Automatic installation of packages could be controlled via an option and a system environment variable:

  • options(install.packages=Sys.getenv("R_INSTALL_PACKAGES", FALSE)):
    • FALSE or "never" - gives an error as now (backward compatible).
    • TRUE or "always" - will try to install packages that are not installed.
    • NA or "ask" - ask the user, e.g. "Package 'future' is not installed. Do you wish to install it? [Y/n]: "

Possible solution

Add support for pre-load hooks in the functions that load/attach packages. A pre-load hook function can then be used to check whether the package is already installed or not. If not, install.packages() can be called by the hook function before returning control to the package loading/attaching function.
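
For comparison, a minimal user-level sketch of the wished-for behavior (without the proposed hook mechanism):

## Install a missing package before attaching it
library2 <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message(sprintf("Package %s is not installed; trying to install it",
                    sQuote(pkg)))
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}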

WISH: Rprofmem() overhaul

Background

The Rprofmem() of the utils package is built into the very core of R (*) and can be used to log every memory allocation requested in R (including by R itself, by R packages, and directly by the user). It provides information on the number of bytes allocated and the (reverse) call stack trace that led up to the request. For example:

> foo <- function(mode, n) vector(mode, n)
> Rprofmem()
> x <- integer(1000)
> y <- double(1000)
> z <- complex(1000)
> a <- foo("integer", 1000)
> b <- foo("double", 1000)
> c <- foo("complex", 1000)
> Rprofmem(NULL)
> cat(readLines("Rprofmem.out", warn=FALSE), sep="\n")
4040 :"integer"
8040 :"double"
16040 :"complex"
4040 :"vector" "foo"
8040 :"vector" "foo"
16040 :"vector" "foo"

Note that Rprof(..., interval=0.01, memory.profiling=TRUE) also profiles memory usage, but that is based on a sampling method (every interval seconds), which means it will only report on the overall memory usage (at those snapshots) but not on the individual memory allocation requests. Because of this, Rprof() will not provide any information on what caused the memory footprint to increase.

(*) In order for Rprofmem() (and tracemem()) to work at all, the R executable must have been built with memory profiling enabled (see Section 'Wishes' below).

Details

The Rprofmem() machinery is built into the very core of R; more precisely, it logs every memory allocation done in the allocVector3() function (part of the R API);

#ifdef R_MEMORY_PROFILING
        R_ReportAllocation(hdrsize + size * sizeof(VECREC));
#endif

The more commonly used allocVector() function is just an inline function calling allocVector3(). Moreover, Rprofmem() memory profiling also logs every "newpage" memory allocation done by internal GetNewPage().

Note that Rprofmem() does not log low-level memory allocation done by Calloc() / Free().

Wishes

1. [FIXED] Fix the bug causing allocations without a call stack to clutter up output

When R does memory allocations internally, these are also logged. These entries have an empty call stack. Due to a bug in the code, the log of such entries lacks newlines, causing several entries to appear on the same line in the log file. For example,

> Rprofmem()
> x <- integer(1000)
> y <- double(1000)
> Rprofmem(NULL)
> cat(readLines("Rprofmem.out", warn=FALSE), sep="\n")
4040 :"integer"
200 :360 :360 :1064 :8040 :"double"

This makes it unnecessarily hard/tricky to parse the Rprofmem log file.

Moved to Issue #42 (solved).

2. Enable memory profiling by default

In order for Rprofmem() (and tracemem()) to work at all, the R executable must have been built with memory profiling enabled, i.e. ./configure --enable-memory-profiling, cf. the 'R Installation and Administration' manual. However, note that the Windows binaries provided via CRAN do indeed have this enabled by default.

However, as Radford Neal suggests (in the R-devel thread below), "the overhead of having [Rprofmem()] enabled is negligible when profiling is not actually being done". Internally, the logging is done after testing if (R_IsMemReporting) { ... }, which is a very cheap logical test and could therefore be part of the default build (and not conditional on #ifdef R_MEMORY_PROFILING).

For more details on Radford Neal's suggestions and improvements, see:

Implementing this is very simple, e.g. Radford's patch.

3. Log more information

In the current implementation, which is more or less from 2006, the R memory profiling collects and reports on:

  1. Number of bytes allocated (requested)
  2. The call stack trace as the name of the functions called

However, it should be possible to gather more information than this, e.g.

  1. Timestamp
  2. Number of bytes allocated
  3. Data type allocated
  4. The call stack:
    • name and namespace/environment of each function, e.g. base::vector()
    • the source code line (if available)
    • frame identifiers

Similar improvements have already been proposed by others:

4. Different output formats

One could imagine that Rprofmem() supports different flavors of logging. For backward compatibility, one could have a "legacy" mode. One could have options to control exactly what to log, and possibly also options for output format, e.g. tab- or comma-separated value files.

5. Log deallocations

AFAIU, all deallocations of memory allocated by allocVector3() are done by the R garbage collector. It would be useful if these memory deallocations were recorded in the Rprofmem output too. An immediate benefit would be that one could use the cumulative sum of logged allocations and deallocations to infer the amount of memory currently allocated by R.

Updates

  • UPDATE 2021-10-18: Add feature request '5. Log deallocations'
  • UPDATE 2017-05-30: Bug / Item 1 (Issue #42) has been resolved in R-devel (to become R 3.5.0)
  • UPDATE 2017-05-29: Moved Item 1 to Issue #42.
  • UPDATE 2016-06-05: Add frames/frame identifiers as information to collect. This will make it possible to distinguish two separate calls to a() which each in turn calls b() ({ a() -> b(), a() -> b() }) from one call to a() that in turn calls b() twice (a() -> { b(), b() }).

HASNA(x): SEXP flag indicating whether `x` has missing values or not (or unknown)

Adopted from existing Wiki entry:

Wish / Suggestion

Analogously to NAMED(x), an internal SEXP flag that indicates whether x has missing values or not (or whether this is unknown) and that can be queried as HASNA(x), with possible values:

  • HASNA(x) = 0: x has no missing values
  • HASNA(x) = 1: x has one or more missing values
  • HASNA(x) = 2: it is unknown whether x has missing values or not

This SEXP flag can be set by any function that has scanned x for missing values, e.g. anyNA(x), sum(x) etc.

This would allow functions to skip expensive testing for missing values whenever HASNA(x) == 0, because for real-valued x the internal ISNAN(x) and ISNA(x) are quite expensive and slow down the processing significantly. For instance, with HASNA(x) == 0 a call to sum(x, na.rm=TRUE) can fall back to sum(x, na.rm=FALSE). Currently, it is up to the user/developer to keep track and use na.rm=FALSE.

Similarly, functions such as anyNA(x) can return (TRUE or FALSE) instantaneously - O(1) - if HASNA(x) != 2. Also, sum(x, na.rm=FALSE) and many similar functions can directly return a missing value if HASNA(x) == 1.
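
As an illustration of the cost that HASNA(x) == 0 could avoid (a sketch; timings are indicative only):

x <- runif(1e7)                     ## 10 million doubles, no missing values
system.time(sum(x, na.rm = FALSE))  ## no per-element NA test needed
system.time(sum(x, na.rm = TRUE))   ## tests every element with ISNAN()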

Status

ML wrote (2015-11-14):

Luke [Tierney] is changing the SEXP header for reference counting. Thanks to the need for alignment, we will get some extra bits. We have already decided to use one of those for this purpose. Another bit will track whether a vector is sorted.

  • HB: That's good news. Will there be two bits for sorted, to specify increasing versus decreasing ordering?
    • GB: This would be an extremely cheap check (binary search for an element different than the first in the worst case, assuming NAs-at-end or NAs-at-beginning). Not sure it's worth a valuable header bit.
  • HB: What about character vectors; will they ever be flagged as sorted? For instance, how will you know in what locale such a vector was sorted, e.g. if you first sorted/collated it lexicographically in the C locale but then work in the en_US.UTF-8 locale?
  • ML: Good point. Will need to invalidate the flag after a locale change.

WISH: Simple class for files / pathnames

It would be useful to have a class for files (e.g. File) and basic functions for creating objects of such classes. I can imagine that this file class is a simple extension of character, because that is how files/pathnames are currently represented, e.g.

File <- function(...) { x <- c(...); structure(x, class=c("File", class(x))) }

Then existing functions returning files/pathnames, e.g.

dir <- function(...) File(base::dir(...))

and instead of c() users can have, say, p() for very brief syntax, e.g. p <- File.

Examples:

> pathname <- p("R/zzz.R")
> pathnames <- p("R/000.R", "R/zzz.R")
> pathnames <- dir("R/")
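
One immediate benefit is that methods can dispatch on the class, e.g. a hypothetical print method:

print.File <- function(x, ...) {
  cat(sprintf("<File: %s>\n", unclass(x)), sep = "")
  invisible(x)
}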

WISH: install.packages had option to throw error code if install fails

When installing packages using scripts (e.g. Dockerfiles), it is nice to have install.packages() throw an error instead of a warning if the package installation fails. One can of course simulate this using withCallingHandlers(warning = stop), but it's not ideal to treat any warning as an error, since some warnings are just that and do not mean that the package installation has failed. Or maybe there's already a good workaround for this that I'm just overlooking?
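
A minimal workaround sketch is to fail only when the package is missing afterwards (assuming it was not already installed), rather than promoting every warning to an error:

install_or_die <- function(pkg, ...) {
  install.packages(pkg, ...)
  ## Fail if the package still cannot be loaded after installation
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(sprintf("Failed to install package %s", sQuote(pkg)))
  }
}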

Thanks for any ideas and for maintaining this Wishlist!

BUG: Rprofmem() clutters up output when the call stack is empty

(Extracted from Issue #25)

When R does memory allocations internally, these are also logged with Rprofmem(). These entries have an empty call stack. Due to a bug in the code, the log of such entries lacks newlines, causing several entries to appear on the same line in the log file. For example,

> Rprofmem()
> x <- integer(1000)
> y <- double(1000)
> Rprofmem(NULL)
> cat(readLines("Rprofmem.out", warn=FALSE), sep="\n")
4040 :"integer"
200 :360 :360 :1064 :new page:8040 :"double"

The lack of newlines for some of the lines makes it unnecessarily hard/tricky to parse the Rprofmem log file.

Solution

Fixing this is very simple; it is just a matter of making sure there are no side effects, which there appear not to be (see the R-devel thread 'RProfmem output format' below).

$ svn diff src/main/memory.c 
Index: src/main/memory.c
===================================================================
--- src/main/memory.c	(revision 72746)
+++ src/main/memory.c	(working copy)
@@ -3803,7 +3803,6 @@
 
 static void R_OutputStackTrace(FILE *file)
 {
-    int newline = 0;
     RCNTXT *cptr;
 
     for (cptr = R_GlobalContext; cptr; cptr = cptr->nextcontext) {
@@ -3810,13 +3809,12 @@
 	if ((cptr->callflag & (CTXT_FUNCTION | CTXT_BUILTIN))
 	    && TYPEOF(cptr->call) == LANGSXP) {
 	    SEXP fun = CAR(cptr->call);
-	    if (!newline) newline = 1;
 	    fprintf(file, "\"%s\" ",
 		    TYPEOF(fun) == SYMSXP ? CHAR(PRINTNAME(fun)) :
 		    "<Anonymous>");
 	}
     }
-    if (newline) fprintf(file, "\n");
+    fprintf(file, "\n");
 }
 
 static void R_ReportAllocation(R_size_t size)

With the above patch, we get:

> Rprofmem()
> x <- integer(1000)
> y <- double(1000)
> Rprofmem(NULL)
> cat(readLines("Rprofmem.out", warn=FALSE), sep="\n")
4040 :"integer" 
240 :
480 :
472 :
1064 :
new page:
8040 :"double" 
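
With this one-entry-per-line format in place, parsing the log becomes straightforward; a minimal sketch (skipping "new page" entries for simplicity):

parse_rprofmem <- function(file = "Rprofmem.out") {
  lines <- readLines(file, warn = FALSE)
  lines <- grep("^[0-9]+ :", lines, value = TRUE)  ## drop "new page:" lines
  data.frame(
    bytes = as.numeric(sub("^([0-9]+) :.*", "\\1", lines)),
    stack = trimws(sub("^[0-9]+ :", "", lines))
  )
}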

See also

  • This issue has been discussed previously:
  • The profmem package is backward and forward compatible with this bug / fix.

/cc @kalibera, if you still have a few brain cycles to spare after tweaking summaryRprof() (r72743), I'm cc:ing you in the hope that you can also fix this very old bug, which seems to get forgotten over and over.

WISH: environmentName(new.env()) to return the address and not just an empty string

We have

> environmentName(globalenv())
[1] "R_GlobalEnv"

> environmentName(baseenv())
[1] "base"

> environmentName(getNamespace("tools"))
[1] "tools"

but for

> env <- new.env()
> print(env)
<environment: 0x2c31298>

we get:

> environmentName(env)
[1] ""

It would be useful if the latter returned the environment address instead, i.e.

> environmentName(env)
[1] "0x2c31298"

TYPO: Some error messages for R_MAX_NUM_DLLS lack prefix `R_`

Some of the error messages for R_MAX_NUM_DLLS do not include the prefix R_. From file src/main/Rdynload.c in trunk:

    char *req = getenv("R_MAX_NUM_DLLS");
    if (req != NULL) {
	int reqlimit = atoi(req);
	if (reqlimit < 100)
	    R_Suicide(_("R_MAX_NUM_DLLS must be at least 100"));
	if (reqlimit > maxlimit) {
	    if (maxlimit == 1000)
		R_Suicide(_("MAX_NUM_DLLS cannot be bigger than 1000"));
	    
	    char msg[128];
	    snprintf(msg, 128,
	      _("MAX_NUM_DLLS bigger than %d may exhaust open files limit"),
	      maxlimit);
	    R_Suicide(msg);
	}
	MaxNumDLLs = reqlimit;
    } else
	MaxNumDLLs = 100;

Action

WISH: List and clear registered finalizers

Background

Using base::reg.finalizer(e, f), then f(e) will be called when object e is garbage collected, e.g.

> env <- new.env()
> reg.finalizer(env, function(e) { print("Finalizing"); print(e) })
NULL
> rm(env)
> t <- gc()
[1] "Finalizing"
<environment: 0x000000000be93920>

It is possible to register more than one finalizer per object, e.g.

> env <- new.env()
> reg.finalizer(env, function(e) { print("Finalizing A"); print(e) })
NULL
> reg.finalizer(env, function(e) { print("Finalizing B"); print(e) })
NULL
> reg.finalizer(env, function(e) { print("Finalizing C"); print(e) })
NULL
> rm(env)
> t <- gc()
[1] "Finalizing C"
<environment: 0x000000000be91f90>
[1] "Finalizing B"
<environment: 0x000000000be91f90>
[1] "Finalizing A"
<environment: 0x000000000be91f90>

Note also how the finalizers are applied in LIFO order (last registered, first called).

Wish / Suggestions

  1. Document that multiple finalizers can be registered and in what order they are applied/called. Currently, help("reg.finalizer") does not mention this at all and talks about "the finalizer".
  2. Add a function for listing all registered finalizers for an object, e.g. list.reg.finalizers(e).
  3. Add a function for removing all registered finalizers for an object, e.g. reg.finalizer(e, f=NULL).

parallel: Unnecessary loading of 'stats', 'graphics' & 'grDevices' by parallel package

Problem

When loading the parallel package, you also load the stats package which in turn loads graphics, grDevices etc. Example:

$ R_DEFAULT_PACKAGES=base R --vanilla --quiet
> loadedNamespaces()
[1] "base"
> loadNamespace("parallel")
<environment: namespace:parallel>
> loadedNamespaces()
[1] "graphics"  "parallel"  "utils"     "grDevices" "stats"     "base"

Troubleshooting / Workaround

The reason for the parallel package loading stats in the first place is that it uses stats::runif(1) in a few places, in order (i) to create the random seed, and (ii) to set up the default port for SNOW clusters.

We can avoid parallel calling these statements by (i) generating the random seed ourselves and (ii) specifying the port to use before loading parallel. For example:

$ R_DEFAULT_PACKAGES=base R --vanilla --quiet
> sample.int(1L)
[1] 1
> Sys.setenv(R_PARALLEL_PORT=11321)
> loadedNamespaces()
[1] "base"
> loadNamespace("parallel")
<environment: namespace:parallel>
> loadedNamespaces()
[1] "parallel" "base"

UPDATE 2016-01-17:
As Martin Morgan suggests in his comment on PR #16668, it is enough to specify the port, i.e.

$ R_DEFAULT_PACKAGES=base R --vanilla --quiet
> Sys.setenv(R_PARALLEL_PORT=11321)
> loadNamespace("parallel")
<environment: namespace:parallel>
> loadedNamespaces()
[1] "parallel" "base"
> res <- parallel::mclapply(1:3, FUN=seq_len)
> loadedNamespaces()
[1] "parallel" "base"
> utils::str(res)
List of 3
 $ : int 1
 $ : int [1:2] 1 2
 $ : int [1:3] 1 2 3

Although most usages of R involve the stats package, not all do. Being able to launch a minimal R session with only 'base' and 'parallel' loaded, as fast as possible and with minimal memory usage, is useful for, say, SNOW R sessions that run in the background and serve as computational power for the main R session, cf. parallel::makeCluster().

Suggestion

Where parallel uses stats::runif(1) for generating a random seed, it can equally well use base::sample.int(1L). The above workaround example shows this.

The other place where parallel uses stats::runif(1) is to generate a random port;

    if (is.na(port))
        port <- 11000 + 1000 * ((stats::runif(1L) + unclass(Sys.time())/300) %% 1)

This is a bit trickier to parse, but the key is the expression

1000 * ((stats::runif(1L) + unclass(Sys.time())/300) %% 1)

and the fact that we later use as.integer(port). In other words, this expression generates a random number in [0, 1000), which as.integer() turns into an integer in 0-999. I'm pretty sure this can be replaced by:

1000 * ((sample.int(1000, size=1L)/1000 + unclass(Sys.time())/300) %% 1)

or possibly

1000 * ((sample.int(10000, size=1L)/10000 + unclass(Sys.time())/300) %% 1)

Results

With the default loading of 'parallel', where 'stats' et al. are loaded, the memory usage of R is ~102 MiB, whereas without 'stats' it is ~45 MiB.

For example:

$ R_DEFAULT_PACKAGES=base,parallel /usr/bin/time -v Rscript --vanilla --quiet -e "x <- sample.int(1L); Sys.setenv(R_PARALLEL_PORT=11321); x <- loadNamespace('parallel'); loadedNamespaces()"
[1] "graphics"  "parallel"  "utils"     "grDevices" "stats"     "base"
        Command being timed: "Rscript --vanilla --quiet -e x <- sample.int(1L); Sys.setenv(R_PARALLEL_PORT=11321); x <- loadNamespace('parallel'); loadedNamespaces()"
        User time (seconds): 0.15
        System time (seconds): 0.14
        Percent of CPU this job got: 89%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.33
[...]
        Maximum resident set size (kbytes): 104240
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 9582
        Voluntary context switches: 25
        Involuntary context switches: 8
[...]

versus

$ R_DEFAULT_PACKAGES=base /usr/bin/time -v Rscript --vanilla --quiet -e "x <- sample.int(1L); Sys.setenv(R_PARALLEL_PORT=11321); x <- loadNamespace('parallel'); loadedNamespaces()"
[1] "parallel" "base"
        Command being timed: "Rscript --vanilla --quiet -e x <- sample.int(1L); Sys.setenv(R_PARALLEL_PORT=11321); x <- loadNamespace('parallel'); loadedNamespaces()"
        User time (seconds): 0.06
        System time (seconds): 0.01
        Percent of CPU this job got: 93%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.09
[...]
        Maximum resident set size (kbytes): 46064
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 5904
        Voluntary context switches: 24
        Involuntary context switches: 13
[...]

resample(x): A less error prone version of sample(x)

Suggestion / Wish

I'd like to propose to add the following sampling function to the base package:

resample <- function(x, ...) {
  x[sample.int(length(x), ...)]
}

Both sample() and sample.int() are in base already.

Background

It is too easy to make mistakes with the existing base::sample() function, mainly because its API tries to handle too many special cases.

Below are some examples illustrating the problem. In these examples, x should be seen as a random set of data of random length, e.g. with zero, one, or more elements.

Example 1 (resampling without replacement)

> x <- 11:13
> sample(x)
[1] 12 11 13

> x <- 11:12
> sample(x)
[1] 12 11

> x <- 11:11
> sample(x)
 [1] 10  7  8  9  1  5  3  4 11  2  6

In the latter case, we have length(x) == 1, and then x is interpreted as the upper limit of the range of values to sample from, i.e. it effectively returns sample(1:x).

With the proposed resample() it works as expected also for length(x) == 1, e.g.

> resample(x)
[1] 11

Example 2 (resampling subset with replacement)

> x <- 11:13
> sample(x, size=5, replace=TRUE)
[1] 13 13 11 13 13

> x <- 11L
> sample(x, size=5, replace=TRUE)
[1] 9 8 8 6 9  ## Because it samples from 1:11

> resample(x, size=5, replace=TRUE)
[1] 11 11 11 11 11

For more background and further rationales for this, see for instance:

Known existing implementations

  • The R.utils package implements resample() as above (as a default S3 method).

R CMD check: Add warning for use of skeleton vignette titles, e.g. "Vignette Title"

Issue

It's not uncommon that new packages forget to update skeleton vignette titles, e.g. "Vignette Title". This is often not detected until the package ends up on CRAN, where the mistake becomes obvious because the package's CRAN page lists the vignette titles. These skeleton titles often originate from cut'n'paste from examples and/or from vignette-skeleton-generating functions.

Suggestion

Add a check to R CMD check that gives a WARNING (or just a NOTE?) about this, as sketched below. Should this always be checked, or only when using --as-cran?
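
A minimal sketch of such a check, flagging vignette sources that still carry the skeleton title (the file patterns and title string are just examples):

check_vignette_titles <- function(pkgdir) {
  vigs <- dir(file.path(pkgdir, "vignettes"),
              pattern = "[.](Rmd|Rnw)$", full.names = TRUE)
  bad <- vapply(vigs, function(f) {
    any(grepl("Vignette Title", readLines(f, warn = FALSE), fixed = TRUE))
  }, logical(1L))
  vigs[bad]  ## vignettes that still have a skeleton title
}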

capabilities(): Make it possible to *disable* capabilities for testing/reproducibility purposes

For testing/reproducibility purposes, make it possible to disable capabilities();

> capabilities()
       jpeg         png        tiff       tcltk         X11        aqua 
       TRUE        TRUE        TRUE        TRUE        TRUE       FALSE 
   http/ftp     sockets      libxml        fifo      cledit       iconv 
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
        NLS     profmem       cairo         ICU long.double     libcurl 
       TRUE        TRUE        TRUE       FALSE        TRUE        TRUE 

For instance, on some systems tcltk is not supported. It would be great to be able to emulate that by disabling tcltk on an R installation that has it. This could be done with environment variables, e.g. R_DISABLE=tcltk,profmem.

WISH: R CMD check to assert that DLLs are unregistered when package is unloaded

Background

Packages with native code load DLLs when they are loaded. More precisely, on Windows, Dynamic Link Library (DLL) files are loaded, and on Unix-like systems, shared library (SO) files are loaded.

For example, when a fresh R session is started we have the following DLLs:

$ R --vanilla
> dll0 <- getLoadedDLLs()
> dll0
                                                Filename Dynamic.Lookup
base                                                base          FALSE
methods       /usr/lib/R/library/methods/libs/methods.so          FALSE
utils             /usr/lib/R/library/utils/libs/utils.so          FALSE
grDevices /usr/lib/R/library/grDevices/libs/grDevices.so          FALSE
graphics    /usr/lib/R/library/graphics/libs/graphics.so          FALSE
stats             /usr/lib/R/library/stats/libs/stats.so          FALSE

When loading a package with native code, it will add another entry, e.g.

> library("matrixStats")
> getLoadedDLLs()
                                                                              Filename Dynamic.Lookup
base                                                                              base          FALSE
methods                                     /usr/lib/R/library/methods/libs/methods.so          FALSE
utils                                           /usr/lib/R/library/utils/libs/utils.so          FALSE
grDevices                               /usr/lib/R/library/grDevices/libs/grDevices.so          FALSE
graphics                                  /usr/lib/R/library/graphics/libs/graphics.so          FALSE
stats                                           /usr/lib/R/library/stats/libs/stats.so          FALSE
matrixStats /home/hb/R/x86_64-pc-linux-gnu-library/3.3/matrixStats/libs/matrixStats.so           TRUE

When a package that registers a DLL is unloaded, ideally not only is the package unloaded but its DLL is also unregistered, e.g.

> unloadNamespace("matrixStats")
> getLoadedDLLs()
                                                Filename Dynamic.Lookup
base                                                base          FALSE
methods       /usr/lib/R/library/methods/libs/methods.so          FALSE
utils             /usr/lib/R/library/utils/libs/utils.so          FALSE
tools             /usr/lib/R/library/tools/libs/tools.so          FALSE
internet                 /usr/lib/R/modules//internet.so           TRUE
grDevices /usr/lib/R/library/grDevices/libs/grDevices.so          FALSE
graphics    /usr/lib/R/library/graphics/libs/graphics.so          FALSE
stats             /usr/lib/R/library/stats/libs/stats.so          FALSE

A package can unload its registered DLLs using:

.onUnload <- function(libpath) {
    gc()
    library.dynam.unload(utils::packageName(), libpath)
 }

Forcing the garbage collector to run (gc()) will trigger finalizer functions to be called, some of which may need the DLL in order to run.

Issue

It turns out that several packages forget to unregister their DLLs when unloaded. For example,

> library("digest")
> unloadNamespace("digest")
> getLoadedDLLs()
                                                                  Filename Dynamic.Lookup
base                                                                  base          FALSE
methods                         /usr/lib/R/library/methods/libs/methods.so          FALSE
utils                               /usr/lib/R/library/utils/libs/utils.so          FALSE
tools                               /usr/lib/R/library/tools/libs/tools.so          FALSE
internet                                   /usr/lib/R/modules//internet.so           TRUE
grDevices                   /usr/lib/R/library/grDevices/libs/grDevices.so          FALSE
graphics                      /usr/lib/R/library/graphics/libs/graphics.so          FALSE
stats                               /usr/lib/R/library/stats/libs/stats.so          FALSE
digest    /home/hb/R/x86_64-pc-linux-gnu-library/3.3/digest/libs/digest.so           TRUE

(UPDATE: The digest package has since fixed this, but the example still applies to many other packages).

The problem with packages not unregistering their DLLs when unloaded is that it risks eventually filling up R's internal DLL registry, which can only hold MAX_NUM_DLLS (== 100) entries. When that happens, R will fail to load any package that needs to register a DLL, with the following error message:

maximal number of DLLs reached...

This is guaranteed to happen if one tries to load and unload all CRAN packages one by one, e.g.

for (pkg in CRANpkgs) {
  loadNamespace(pkg)
  unloadNamespace(pkg)
}

There have been several reports on hitting this limit, e.g.

Suggestion / Wish

R CMD check assertion

Have R CMD check also assert that the package unloads any registered DLLs, e.g.

* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... WARNING
  Unloading the namespace does not unload DLL
* checking loading without being on the library search path ... OK

unloadNamespace()

Assert / warn

Maybe unloadNamespace() should check for left-over DLLs and give a warning whenever the package's DLLs are not unloaded.
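
A minimal sketch of what such a check could look like at the R level:

## Report DLLs still registered after loading and unloading a package
dll_leak_check <- function(pkg) {
  before <- names(getLoadedDLLs())
  loadNamespace(pkg)
  unloadNamespace(pkg)
  setdiff(names(getLoadedDLLs()), before)  ## non-empty => left-over DLLs
}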

Concerns

Karl Millar wrote on 2016-12-20 (https://stat.ethz.ch/pipermail/r-devel/2016-December/073528.html):
"It's not always clear when it's safe to remove the DLL."

UPDATE 2016-12-20: Add recommendation to run gc() before removing DLL when unloading a package. See thread https://stat.ethz.ch/pipermail/r-devel/2016-December/073522.html.

A more informative abort message than "aborting ..."

WISH

Use "An exceptional error occurred that R could not recover from. The R session is now aborting ..." instead of just "aborting ...", because with the latter it is not always clear where that message comes from, i.e. it could have been output by something else. Here's an example:

checking tests ... ERROR
Running the tests in ‘tests/devEval.R’ failed.
Last 13 lines of output:
1: .External2(C_X11, d$display, d$width, d$height, d$pointsize, d$gamma, d$colortype, d$maxcubesize, d$bg, d$canvas, d$fonts, NA_integer_, d$xpos, d$ypos, d$title, type, antialias, d$family)
[...]
12: tryCatch({ res <- devEval(type, name = "any", aspectRatio = 2/3, scale = 1.2, { plot(100:1) }) printf("Result: %s (%s)\n\n", sQuote(res), attr(res, "type")) devOff()}, error = function(ex) { printf("Failed: %s\n\n", sQuote(ex$message))})
aborting ...
checking for unstated dependencies in vignettes ... OK

ACTIONS TAKEN

Support for vector("<custom class>", length=n)

It would be useful if vector() could be generalized such that any type of relevant class can be supported, e.g.

vector("listenv", length=n)

Internally, vector() uses a native switch statement for allocating the basic data types. If the type is not one of the supported ones, an error is generated. An alternative to the latter would be to have it search for an allocateVector.<class>() function before giving an error.
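
A user-level sketch of the idea (the allocateVector.<class>() allocator functions are hypothetical):

vector2 <- function(mode = "logical", length = 0L) {
  tryCatch(vector(mode, length), error = function(e) {
    ## Fall back to a class-specific allocator, if one exists
    alloc <- get0(paste0("allocateVector.", mode), mode = "function")
    if (is.null(alloc)) stop(e)
    alloc(length)
  })
}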

R() / Rscript(): Calling R / Rscript with identical setup as the main process

Created for Wiki entry.

Wish / Suggestion

Implement functions R() and Rscript() for calling R / Rscript via system() with the option to use a system setup identical to that of the main process, e.g.

  • same system environment variables (e.g. PATH, same .Renviron file, ...)
  • same library paths (.libPaths())
  • same .Rprofile
  • same options()
  • ...

Example:

Rscript("-e"="commandArgs()", seed=34, foo="Hello world!")

ROBUSTNESS: Give error if condition for control statements have length != 1

Issue

Control statements if(cond) and while(cond) in R give an error if length(cond) == 0, but if length(cond) > 1, then only a warning is produced, e.g.

> x <- 1:2
> if (x == 1) message("x == 1")
x == 1
Warning message:
In if (x == 1) message("x == 1") :
  the condition has length > 1 and only the first element will be used

By the design of the if and while control statements, it makes no sense to use a condition with a length other than one. Because of this, there is no logical reason why the above should not be considered an error rather than a warning.

Suggestion

The long-term goal should be that R always produces an error if the length of the condition differs from one. However, because only a warning has been produced thus far, there is a great risk that lots of existing code will break immediately if an error is generated. In order to avoid wreaking havoc, a migration from producing a warning to an error may go via an option check.condition with default value FALSE. When FALSE, the current behavior remains and only a warning is produced. With TRUE, an error is produced. R CMD check --as-cran could enable options(check.condition = TRUE) such that all new packages submitted to CRAN must pass this new requirement. This would also allow individual developers to run checks locally.

Patch

A complete patch is available in https://github.com/HenrikBengtsson/r-source/compare/hb-develop...HenrikBengtsson:hb-feature/check-condition. Example:

> options("check.condition")
$check.condition
[1] FALSE
> if (x == 1) message("x == 1")
x == 1
Warning message:
In if (x == 1) message("x == 1") :
  the condition has length > 1 and only the first element will be used

> options(check.condition = TRUE)
> x <- 1:2
> if (x == 1) message("x == 1")
Error in if (x == 1) message("x == 1") : the condition has length > 1

References

WISH: Make mode="wb" the default for download.file()

Background

The utils::download.file() function can be used to download a file from the web. For example:

url <- "https://cran.r-project.org/doc/manuals/r-devel/NEWS.html"
p <- basename(url)
download.file(url, destfile=p)
trying URL 'https://cran.r-project.org/doc/manuals/r-devel/NEWS.html'
Content type 'text/html' length 295979 bytes (289 KB)
==================================================
downloaded 289 KB
> readLines(p, n=3)
[1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>R: R News</title>"
[2] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"                                                                                                                   
[3] "<link rel=\"stylesheet\" type=\"text/css\" href=\"R.css\" />" 

Issue

Whenever downloading binary files, they must be downloaded "as-is", byte for byte. However, the default behavior of download.file() is to download files as "text" rather than "binary". More precisely, argument mode defaults to "w" (= as text). Depending on platform / OS, downloading a binary file as text may or may not modify the downloaded file. More specifically, on MS Windows, downloaded newline characters (LF) will be modified to the Windows-specific ones (CRLF). This explains why the following download of a binary RDS file fails on Windows:

> url <- "https://cran.r-project.org/src/contrib/Views.rds"
> p <- basename(url)
> download.file(url, destfile=p)
trying URL 'https://cran.r-project.org/src/contrib/Views.rds'
Content type '�' length 22681 bytes (22 KB)
downloaded 22 KB
> d <- readRDS(p)
Error in readRDS(p) : error reading from connection
> file.size(p)
[1] 22776

whereas it still works on, say, Linux:

> d <- readRDS(p)
> file.size(p)
[1] 22681

Note the difference in file sizes; on Windows the file is 22776 bytes whereas on Linux it is 22681 bytes (95 bytes less than on Windows).

Solution / Workaround

As help("download.file", package="utils") says:

Code written to download binary files must use mode = "wb", but the problems incurred by a text transfer will only be seen on Windows.

Thus, if we download as "binary";

> download.file(url, destfile=p, mode="wb")
trying URL 'https://cran.r-project.org/src/contrib/Views.rds'
Content type '�' length 22681 bytes (22 KB)
downloaded 22 KB
> d <- readRDS(p)
> file.size(p)
[1] 22681

The file is downloaded "as-is" also on Windows.

There are several user requests / questions on "corrupt" downloads, which can be traced back to the mistake of not specifying mode="wb". Since this only occurs on Windows, there is a risk that developers on Linux and macOS don't pay attention to this problem and only use the default mode in their code / packages.

Suggestion

Change the default for argument mode to be mode="wb".

I cannot see a downside of making this the new default. Text reading functions such as readLines() already handle different newline symbols (LF, CRLF, CR, ...).

The upside is that the risk for bugs and incorrect downloads is removed.

See also

Identify all external software and files R uses during runtime

External software that R uses

External files that R uses

Background

If your system running R doesn't have which installed, you'll get:

> sessionInfo()
Error in system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE) : 
error in running command
> traceback()
5: system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE)
4: withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
3: suppressWarnings(system(paste(which, shQuote(names[i])), intern = TRUE, 
       ignore.stderr = TRUE))
2: Sys.which("uname")
1: sessionInfo()

Wish

Identify all external software used by R (each of the core packages) and the recommended R packages. These should be considered required system run-time dependencies.

WISH: writeline() - non-sinkable output, cf. readline()

Background

Some user interactions make little sense if the intended message does not reach the user. base::readline() provides a mechanism for asking the user a question in a way that is guaranteed to be seen by the user, e.g.

> readline(prompt = "What's your name?: ")
What's your name?: joe
[1] "joe"

The readline message cannot be captured by sink(), e.g.

# Send all stdout and stderr to file
> con <- file("void.out", open = "w")
> sink(file = con, append = TRUE, type = "output")
> sink(file = con, append = TRUE, type = "message")

# Verify
> cat("A sinked message to stdout\n")
> cat("A sinked message to stderr\n", file = stderr())

# readline output is not redirected
> readline(prompt = "What's your name?: ")
What's your name?: joe

# Stop redirecting
> sink(type = "message")
> sink(type = "output")
> close(con)

# See what's redirected
> readLines("void.out")
[1] "A sinked message to stdout" "A sinked message to stderr" "[1] \"joe\""               

Limitations

readline() messages can only be 255 characters long

The readline output message may contain at most 255 characters + \0 (define CONSOLE_PROMPT_SIZE 256), e.g.

> x <- paste0(sprintf("%03i. ", seq(1, 300, by = 5)), collapse = "")
> x
[1] "001. 006. 011. 016. 021. 026. 031. 036. 041. 046. 051. 056. 061. 066. 071. 076. 081. 086. 091. 096. 101. 106. 111. 116. 121. 126. 131. 136. 141. 146. 151. 156. 161. 166. 171. 176. 181. 186. 191. 196. 201. 206. 211. 216. 221. 226. 231. 236. 241. 246. 251. 256. 261. 266. 271. 276. 281. 286. 291. 296. "
> nchar(x)
[1] 300
> readline(prompt = x)
001. 006. 011. 016. 021. 026. 031. 036. 041. 046. 051. 056. 061. 066. 071. 076. 081. 086. 091. 096. 101. 106. 111. 116. 121. 126. 131. 136. 141. 146. 151. 156. 161. 166. 171. 176. 181. 186. 191. 196. 201. 206. 211. 216. 221. 226. 231. 236. 241. 246. 251. 
6. 151. 156. 161. 166. 171. 176. 181. 186. 191. 196. 201. 206. 211. 216. 221. 226. 231. 236. 241. 246. 251. [1] ""

It's not possible to output to console without requiring a response

Occasionally, it can be useful to send a message to the user that is guaranteed to be seen and not captured, but that does not require input from the user. Note that outputting to standard error is not safe in this sense, because it can be redirected from within R. Likewise, base::message() can be captured because it outputs to stderr. An alternative would be to provide a writeline() that works similarly to readline() but does not prompt the user for input. Such a function would also help provide a workaround for the above readline() limitation.

A poor-man's workaround / proof-of-concept is to output the message using a system call, e.g.

writeline <- function (..., appendLF = FALSE) 
{
    fh <- tempfile()
    on.exit(file.remove(fh))
    cat(..., file = fh)
    if (appendLF) 
        cat("\n", file = fh, append = TRUE)
    if (.Platform$OS.type == "windows") {
        file.show(fh, pager = "console", header = "", title = "", 
            delete.file = FALSE)
    }
    else {
        system(sprintf("cat %s", fh))
    }
    invisible()
}

(adopted from R.utils::cmsg()).

Complications

The native readline() implementation differs between platforms, i.e. fixing the readline() limitation and/or adding a writeline() function would require adding code supporting all existing platforms, as well as solid testing.

Related

Just so it's not forgotten (it should be its own issue): the base::menu() function outputs the menu items to (sinkable) standard output (not standard error) and the actual menu prompt via the same non-sinkable output as readline(). As a first step, menu() should probably output to standard error.

NextMethod() quirks: It's not a regular function - don't pass arguments!

(Adding some old notes of mine here)

Passing arguments to NextMethod() by explicitly specifying them, as one does in regular function calls, should be avoided. The following example illustrates why. First, assume the following setup:

x <- structure(NA, class = "A")
y_truth <- list(x = x, a = 3)

foo <- function(x, a) UseMethod("foo")

foo.default <- function(x, a) {
  list(x = x, a = a)
}

Next, consider that we want to create a foo() method for class A that should do what the default method does, except that it should increase the value of a by one. If the default method is updated, our foo() for A should not have to be updated, so we need to use NextMethod().

Attempt 1: Pass arguments by name

It's tempting to use NextMethod() as follows:

foo.A <- function(x, a) {
  tmp <- a + 1
  NextMethod("foo", object = x, a = tmp)
}

which at a first glance seems to do what we want:

y <- foo(x, a = 2)
stopifnot(identical(y, y_truth))

However, if you try to set argument a by position, we get a rather surprising error:

y <- foo(x, 2)
## Error in foo.default(x, 2, a = 3) : unused argument (a)

Hmm, that's not how we'd expect foo() to work.

Comment: It doesn't matter if we use object = x or x above.

Attempt 2: Arguments by position

Let's try to pass tmp by position instead;

foo.A <- function(x, a) {
  tmp <- a + 1
  NextMethod("foo", x, tmp)
}

Nah, that's even worse:

y <- foo(x, a = 2)
## Error in foo.default(x, a = 2) : unused argument (tmp)

y <- foo(x, 2)
## Error in foo.default(x, 2) : unused argument (tmp)

Attempt 3: Arguments by position (same name)

Ok, it doesn't like tmp to be passed. What happens if we use local variable named the same as the argument and pass that instead?

foo.A <- function(x, a) {
  a <- a + 1
  NextMethod("foo", x, a)
}

You'd think that hack could do the trick, eh? Nope:

y <- foo(x, a = 2)
## Error in foo.default(x, a = 2) : unused argument (a)

y <- foo(x, 2)
## Error in foo.default(x, 2) : unused argument (a)

Attempt 4: Don't pass arguments (modify instead)

That leaves us with passing the arguments implicitly, i.e. by not specifying them when calling NextMethod():

foo.A <- function(x, a) {
  a <- a + 1
  NextMethod("foo", x)
}

which works:

y <- foo(x, a = 2)
stopifnot(identical(y, y_truth))

y <- foo(x, 2)
stopifnot(identical(y, y_truth))

Cleaner version of Attempt 4

There's actually no need to specify either the generic function (the first argument, as a string) or the object to dispatch on, so we can just do:

foo.A <- function(x, a) {
  a <- a + 1
  NextMethod()
}

which is easier to remember and less to update in case you rename the generic. Most importantly, it does work:

y <- foo(x, a = 2)
stopifnot(identical(y, y_truth))

y <- foo(x, 2)
stopifnot(identical(y, y_truth))

Session info

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.1

See also

WISH: Atomic writing to file

(Adopted from Wiki entry)

Background

When writing to file, there is always the risk that the process is interrupted, which may result in an incomplete file. Depending on the file format, it can be extremely hard, or even impossible, to detect that the file is incomplete. For instance, when writing a data frame with 100,000 rows to a comma-delimited file using write.csv(), the writing may, if we're unlucky, be interrupted at the end of a row, e.g. when 98,953 complete rows have been written. If so, data <- read.csv() will happily read the 98,953 rows and there is no way for us to know that the file is incomplete. Even if it is possible to detect incomplete and/or corrupt files, it can be extremely tedious to identify them.

This is a real problem when generating a large number of files, especially large files for which the risk of being exposed to an interrupt increases.

Suggestion / Wish

If files are written atomically, that is, either the whole file is there at the end or no file at all, then the problem of knowing whether a file is complete would not exist. One approach for writing files atomically is to write using a temporary file name and then rename on completion.

Prototype / example

Assume we save the file using saveRDS(x, file="foo.rds", atomic=TRUE). This could in principle be done as:

  1. saveRDS(x, file="foo.rds.tmp")
  2. file.rename("foo.rds.tmp", "foo.rds")

If there is an interrupt, there will be a left-over *.rds.tmp file, but not the final *.rds file. There could be options for automatically cleaning up incomplete files, or renaming the temporary file to, say, *.rds.error if an error was thrown while writing the file.
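
A minimal wrapper sketch of the idea (the atomic argument above is hypothetical, so a separate function name is used here):

saveRDS_atomic <- function(object, file, ...) {
  tmp <- paste0(file, ".tmp")   ## same directory => rename stays on one file system
  on.exit(if (file.exists(tmp)) file.remove(tmp))
  saveRDS(object, file = tmp, ...)
  file.rename(tmp, file)
}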

WISH: Increase limit of maximum number of open connections (currently 125+3)

Background

As documented in help("connections", package="base"), the maximum number of connections one can have open in R (in addition to the three always reserved) is 125;

"A maximum of 128 connections can be allocated (not necessarily open) at any one time. Three of these are pre-allocated (see stdout). The OS will impose limits on the numbers of connections of various types, but these are usually larger than 125."

Here is an example showing what happens when we try to open too many connections:

> cons <- list()
> for (ii in 1:126) { cons[[ii]] <- textConnection("foo") }
Error in textConnection("foo") : all connections are in use
> nrow(showConnections())
[1] 125
> head(showConnections())
  description class            mode text   isopen   can read can write
3 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
4 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
5 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
6 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
7 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
8 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
> tail(showConnections())
    description class            mode text   isopen   can read can write
122 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
123 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
124 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
125 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
126 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     
127 "\"foo\""   "textConnection" "r"  "text" "opened" "yes"    "no"     

Issue

There are several use cases where one might hit the upper limit on the number of open connections in R. A common use case where one may face this issue is when using SNOW compute clusters. SNOW clusters as implemented by the parallel package (a core R package) use one connection per SNOW worker. These days more users have access to large clusters or machines with a large number of cores, making it more likely that they will try to use clusters with > 125 nodes.

> library("parallel")
> cl <- makeCluster(126L)
Error in socketConnection(port = port, server = TRUE, blocking = TRUE,  : 
  all connections are in use
> nrow(showConnections())
[1] 125

The problem with the low NCONNECTIONS limit in relationship to SNOW clusters has been discussed by others in the past, e.g.

Troubleshooting

The total limit of 128 connections is hardcoded into the R source code as constant / macro NCONNECTIONS in src/main/connections.c;

#define NCONNECTIONS 128 /* snow needs one per slave node */

which is used to preallocate a set of Rconnection entries of this size;

static Rconnection Connections[NCONNECTIONS];

The NCONNECTIONS limit was increased from 50 to 128 in R 2.4.0 (released October 2006), which appears to have been done for the same reason as explained here.

Wish

  • Increase the NCONNECTIONS limit, to say, 1024.
    • I've verified that it works with NCONNECTIONS=16384 on Linux (see comment below). Similar checks may have to be done on macOS and Windows as well.
    • This would only require a simple update of the above constant / macro.
    • The disadvantage of increasing the limit is that it will also increase the linear-search time of internal int ConnIndex(Rconnection con) for non-existing connections. Using a linked list would avoid this particular problem (see below).
  • Make the error message informative about the actual limit, e.g. all 128 connections are in use.
  • An alternative, and possibly better, approach would be to re-implement Connections as a linked list, which (including its memory usage) could grow and shrink as needed. This could even remove having a limit at all. This would require redesign of the code and increase the risk of introducing bugs. (This idea was proposed by @mtmorgan).

ROBUSTNESS: x || y and x && y to give warning/error if length(x) != 1 or length(y) != 1

Idea

In the spirit of Issue #38 (if/while (c(TRUE, TRUE)) ...) of giving a warning (soon error), @hadley proposed in a Tweet:

@HenrikBengtsson as part of new if() warning, I wonder if && and || should give warning when collapsing vector to scalar

Issue

Today we have that x || y performs x[1] || y for length(x) > 1. For instance,

> c(TRUE, TRUE) || FALSE
[1] TRUE
> c(TRUE, FALSE) || FALSE
[1] TRUE
> c(TRUE, NA) || FALSE
[1] TRUE
> c(FALSE, TRUE) || FALSE
[1] FALSE

This property is symmetric in LHS and RHS (i.e. y || x behaves the same) and it also applies to x && y.

The issue is that the above truncation of x is completely silent; neither an error nor a warning is produced.

Discussion/Suggestion

Using x || y and x && y with a non-scalar x or y is likely a mistake. Either the code is written assuming x and y are scalars, or there is a coding error and the vectorized versions x | y and x & y were intended. Should x || y always be considered a mistake if length(x) != 1 or length(y) != 1? If so, should it be a warning or an error? For instance,

> x <- c(TRUE, TRUE)
> y <- FALSE
> x || y

Error in x || y : applying scalar operator || to non-scalar elements
Execution halted

What about the case where length(x) == 0 or length(y) == 0? Today x || y returns NA in such cases, e.g.

> logical(0) || c(FALSE, NA)
[1] NA
> logical(0) || logical(0)
[1] NA
> logical(0) && logical(0)
[1] NA

I don't know the background for this behavior, but I'm sure there is an argument behind it. Maybe it's simply that || and && should always return a scalar logical, and with zero-length operands neither TRUE nor FALSE can be justified.

WISH: New default `data=NA_real_` for `matrix()` and `array()`

Issue

The default value of argument data for matrix() and array() is NA (logical). It is quite common to see pre-allocation of matrices and arrays using this default, as in:

> X <- matrix(nrow=10, ncol=10)
> str(X)
 logi [1:10, 1:10] NA NA NA NA NA NA ...

However, I would argue that often users assume that they actually get a numeric matrix. With the current setup, unnecessary coercion and copying of the whole matrix follows if one assigns a double value to one of the elements.

For full details with empirical evidence, see [1].
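
A quick demonstration of the coercion copy (a sketch; tracemem() requires R built with memory profiling, cf. the Rprofmem() issue above):

X <- matrix(nrow = 2, ncol = 2)            ## logical NA matrix (current default)
tracemem(X)
X[1, 1] <- 1.0                             ## logical -> double coercion copies X
Y <- matrix(NA_real_, nrow = 2, ncol = 2)  ## proposed default
tracemem(Y)
Y[1, 1] <- 1.0                             ## already double; no coercion copy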

Proposal

My suggestion is to change the default to be data = NA_real_ (not data = NA as now):

matrix <- function (data = NA_real_, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL) { ... }
array <- function (data = NA_real_, dim = length(data), dimnames = NULL) { ... }

References

A thread for partial argument/dollar/attribute matching in core R (paste here if found)

Issue

> options(warnPartialMatchArgs = TRUE)
> example("prcomp", package = "stats")
[...]
Warning messages:
1: In all.equal.numeric(cov(Z), S, tol = 0.08) :
  partial argument match of 'tol' to 'tolerance'
2: In all.equal.numeric(pz3$sdev, pZ$sdev, tol = 1e-15) :
  partial argument match of 'tol' to 'tolerance'
3: In prcomp.default(USArrests, scale = TRUE) :
  partial argument match of 'scale' to 'scale.'
4: In prcomp.default(x, ...) :
  partial argument match of 'scale' to 'scale.'
5: In prcomp.default(USArrests, scale = TRUE) :
  partial argument match of 'scale' to 'scale.'
6: In prcomp.default(USArrests, scale = TRUE) :
  partial argument match of 'scale' to 'scale.'
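
For reference, the full set of options controlling partial-matching warnings:

options(
  warnPartialMatchArgs   = TRUE,  ## partial matching of function arguments
  warnPartialMatchDollar = TRUE,  ## partial matching with $
  warnPartialMatchAttr   = TRUE   ## partial matching in attr()
)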

Action

MEMORY INEFFICIENCY: Parallel cluster workers hold on to objects longer than necessary

Issue

The internal parallel:::slaveLoop() function that acts as the main program for cluster workers created by parallel::makeCluster() hangs on to the object returned to the master process longer than necessary.

Example

With

## Memory usage of a process (only Linux)
pmem <- function(pid) {
  res <- system2("pmap", args=pid, stdout=TRUE)
  res <- grep("total", res, value=TRUE)
  res <- strsplit(res, split="[ ]+")[[1]]
  grep("[0-9]+", res, value=TRUE)
}

we get the following

> library("parallel")
> cl <- makeCluster(1L)

## Get the PID for the worker
> pid <- clusterEvalQ(cl, { Sys.getpid() })[[1]]
> pid
[1] 16087
> pmem(pid)
[1] "203696K"

## Have worker allocate large object and return
> res <- clusterEvalQ(cl, { integer(length = 100e6) })
> object.size(res)
400000088 bytes
> pmem(pid)
[1] "594324K"

## Ask worker to garbage collect (above memory is NOT released)
> res <- clusterEvalQ(cl, { gc() })
> pmem(pid)
[1] "594324K"

## Ask worker to garbage collect again (above memory IS released)
> res <- clusterEvalQ(cl, { gc() })
> pmem(pid)
[1] "203696K"

Note how the first garbage collection has little effect and it's only in the second call that the value object from the first iteration is removed.

Troubleshooting

The reason for the first garbage collection having no effect is that parallel:::slaveLoop() holds on to the value of the first call in the local variable value throughout the evaluation of the second call as well, and only overwrites the local value variable after completing that evaluation. This can be seen if one studies its repeat statement below:

slaveLoop <- function(master)
{
    repeat
        tryCatch({
            msg <- recvData(master)
            # cat(paste("Type:", msg$type, "\n"))

            if (msg$type == "DONE") {
                closeNode(master)
                break;
            } else if (msg$type == "EXEC") {
                success <- TRUE
                ## This uses the message rather than the exception since
                ## the exception class/methods may not be available on the
                ## master.
                handler <- function(e) {
                    success <<- FALSE
                    structure(conditionMessage(e),
                              class = c("snow-try-error","try-error"))
                }
                t1 <- proc.time()
                value <- tryCatch(do.call(msg$data$fun, msg$data$args, quote = TRUE),
                                  error = handler)
                t2 <- proc.time()
                value <- list(type = "VALUE", value = value, success = success,
                              time = t2 - t1, tag = msg$data$tag)
                sendData(master, value)
            }
        }, interrupt = function(e) NULL)
}

Patch

The obvious solution is to remove the local variable value directly after the data has been sent back to the master process. Analogously, the local variable msg should also be removed as soon as possible, because it may contain large data objects. That one can be removed before the data is sent back to the master process, making it available for garbage collection while sendData() is called.

Subversion patch

$ svn diff src/library/parallel/R/worker.R
Index: src/library/parallel/R/worker.R
===================================================================
--- src/library/parallel/R/worker.R (revision 70874)
+++ src/library/parallel/R/worker.R (working copy)
@@ -44,7 +44,9 @@
                 t2 <- proc.time()
                 value <- list(type = "VALUE", value = value, success = success,
                               time = t2 - t1, tag = msg$data$tag)
+                rm(list = "msg")
                 sendData(master, value)
+                rm(list = "value")
             }
         }, interrupt = function(e) NULL)
 }

This patch has been validated by rebuilding R from SVN source and running make check.

With this patch, we get:

> library("parallel")
> cl <- makeCluster(1L)
> pid <- clusterEvalQ(cl, { Sys.getpid() })[[1]]
> pmem(pid)
[1] "235832K"

> res <- clusterEvalQ(cl, { integer(length = 100e6) })
> pmem(pid)
[1] "626460K"

> res <- clusterEvalQ(cl, { gc() })
> pmem(pid)
[1] "235832K"

WISH: R -f and Rscript supporting URLs

(Adopted from Wiki entry)

Background

The executables Rscript and R -f take R script files as input and source() them, e.g.

$ echo "cat('Hello\n')" > hello.R
$ R --quiet -f hello.R
> cat('Hello\n')
Hello
>

$ Rscript hello.R
Hello

This only works with local files. If we try with an online file, we get an error;

$ Rscript https://example.org/hello.R
Fatal error: cannot open file 'https://example.org/hello.R': Invalid argument

A possible, but tedious, workaround is to use:

$ Rscript -e "source('https://example.org/hello.R')"
Hello

Wish

Add support for Rscript and R -f to recognize URLs and then try to source them as URLs and not local files, e.g.

$ R --quiet -f https://example.org/hello.R
> cat('Hello\n')
Hello
>

$ Rscript https://example.org/hello.R
Hello

UPDATE: Fixed Rscript <url> example. /HB 2016-05-29
