Coder Social home page Coder Social logo

bioconductor / delayedarray Goto Github PK

View Code? Open in Web Editor NEW
20.0 16.0 9.0 1.38 MB

A unified framework for working transparently with on-disk and in-memory array-like datasets

Home Page: https://bioconductor.org/packages/DelayedArray

R 99.19% C 0.59% TeX 0.22%
core-package bioconductor-package

delayedarray's Introduction

DelayedArray is an R/Bioconductor package that provides a unified framework for working transparently with on-disk and in-memory array-like datasets.

See https://bioconductor.org/packages/DelayedArray for more information including how to install the release version of the package (please refrain from installing directly from GitHub).

delayedarray's People

Contributors

hpages avatar jwokaty avatar ltla avatar mtmorgan avatar nturaga avatar petehaitch avatar vobencha avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

delayedarray's Issues

show method for >2D arrays load the whole 2d slices (than necessary) in the middle steps

Is it expected to return just a few rows and a few columns? (e.g., first and last 2 rows, ...)

da <- DelayedArray(array(1, dim=c(20, 5, 4)))
trace(extract_array, quote(print(str(index))), exit=quote(print(returnValue())))
## Tracing function "extract_array" in environment
## <namespace:DelayedArray>
## Warning: Tracing only in the namespace; to untrace you will need:
## untrace("extract_array", where = getNamespace("DelayedArray"))
## [1] "extract_array"
## attr(,"package")
## [1] "DelayedArray"
> da
## Tracing extract_array(x, index) on entry 
## List of 3
##  $ : int(0) 
##  $ : int(0) 
##  $ : int(0) 
## NULL
## Tracing extract_array(x@seed, index) on entry 
## List of 3
##  $ : int(0) 
##  $ : int(0) 
##  $ : int(0) 
## NULL
## Tracing extract_array(x@seed, index) on exit 
## <0 x 0 x 0 array of double>


## Tracing extract_array(x, index) on exit 
## <0 x 0 x 0 array of double>


## <20 x 5 x 4> DelayedArray object of type "double":
## Tracing extract_array(x, index) on entry 
## List of 3
##  $ : int(0) 
##  $ : int(0) 
##  $ : int(0) 
## NULL
## Tracing extract_array(x@seed, index) on entry 
## List of 3
##  $ : int(0) 
##  $ : int(0) 
##  $ : int(0) 
## NULL
## Tracing extract_array(x@seed, index) on exit 
## <0 x 0 x 0 array of double>

## ,,1
## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on entry 
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : int 1
## NULL
## Tracing extract_array(x@seed, index) on entry 
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : int 1
## NULL
## Tracing extract_array(x@seed, index) on exit 
## , , 1
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    1    1
##  [2,]    1    1    1    1    1
##  [3,]    1    1    1    1    1
##  [4,]    1    1    1    1    1
##  [5,]    1    1    1    1    1
##  [6,]    1    1    1    1    1
##  [7,]    1    1    1    1    1
##  [8,]    1    1    1    1    1
##  [9,]    1    1    1    1    1
## [10,]    1    1    1    1    1
## [11,]    1    1    1    1    1
## [12,]    1    1    1    1    1
## [13,]    1    1    1    1    1
## [14,]    1    1    1    1    1
## [15,]    1    1    1    1    1
## [16,]    1    1    1    1    1
## [17,]    1    1    1    1    1
## [18,]    1    1    1    1    1
## [19,]    1    1    1    1    1
## [20,]    1    1    1    1    1

## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on exit 
## , , 1

##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    1    1
##  [2,]    1    1    1    1    1
##  [3,]    1    1    1    1    1
##  [4,]    1    1    1    1    1
##  [5,]    1    1    1    1    1
##  [6,]    1    1    1    1    1
##  [7,]    1    1    1    1    1
##  [8,]    1    1    1    1    1
##  [9,]    1    1    1    1    1
## [10,]    1    1    1    1    1
## [11,]    1    1    1    1    1
## [12,]    1    1    1    1    1
## [13,]    1    1    1    1    1
## [14,]    1    1    1    1    1
## [15,]    1    1    1    1    1
## [16,]    1    1    1    1    1
## [17,]    1    1    1    1    1
## [18,]    1    1    1    1    1
## [19,]    1    1    1    1    1
## [20,]    1    1    1    1    1

## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on entry 
## List of 2
##  $ : int [1:4] 1 2 19 20
##  $ : NULL
## NULL
## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on exit 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    1    1    1    1    1
## [3,]    1    1    1    1    1
## [4,]    1    1    1    1    1
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    1    1
##  [2,]    1    1    1    1    1
##   ...    .    .    .    .    .
## [19,]    1    1    1    1    1
## [20,]    1    1    1    1    1

## ...

## ,,4
## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on entry 
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : int 4
## NULL
## Tracing extract_array(x@seed, index) on entry 
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : int 4
## NULL
## Tracing extract_array(x@seed, index) on exit 
## , , 1

##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    1    1
##  [2,]    1    1    1    1    1
##  [3,]    1    1    1    1    1
##  [4,]    1    1    1    1    1
##  [5,]    1    1    1    1    1
##  [6,]    1    1    1    1    1
##  [7,]    1    1    1    1    1
##  [8,]    1    1    1    1    1
##  [9,]    1    1    1    1    1
## [10,]    1    1    1    1    1
## [11,]    1    1    1    1    1
## [12,]    1    1    1    1    1
## [13,]    1    1    1    1    1
## [14,]    1    1    1    1    1
## [15,]    1    1    1    1    1
## [16,]    1    1    1    1    1
## [17,]    1    1    1    1    1
## [18,]    1    1    1    1    1
## [19,]    1    1    1    1    1
## [20,]    1    1    1    1    1

## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on exit 
## , , 1

##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    1    1
##  [2,]    1    1    1    1    1
##  [3,]    1    1    1    1    1
##  [4,]    1    1    1    1    1
##  [5,]    1    1    1    1    1
##  [6,]    1    1    1    1    1
##  [7,]    1    1    1    1    1
##  [8,]    1    1    1    1    1
##  [9,]    1    1    1    1    1
## [10,]    1    1    1    1    1
## [11,]    1    1    1    1    1
## [12,]    1    1    1    1    1
## [13,]    1    1    1    1    1
## [14,]    1    1    1    1    1
## [15,]    1    1    1    1    1
## [16,]    1    1    1    1    1
## [17,]    1    1    1    1    1
## [18,]    1    1    1    1    1
## [19,]    1    1    1    1    1
## [20,]    1    1    1    1    1

## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on entry 
## List of 2
##  $ : int [1:4] 1 2 19 20
##  $ : NULL
## NULL
## Tracing extract_array(x, expand_Nindex_RangeNSBS(Nindex)) on exit 
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    1    1    1    1    1
## [3,]    1    1    1    1    1
## [4,]    1    1    1    1    1
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    1    1    1    1
##  [2,]    1    1    1    1    1
##   ...    .    .    .    .    .
## [19,]    1    1    1    1    1
## [20,]    1    1    1    1    1

registering `SerialParam()` sets a global preference, but package-specific preference required

This line https://github.com/Bioconductor/DelayedArray/blob/master/R/zzz.R#L12 sets a global preference for SerialParam() on Windows -- whenever DelayedArray is loaded, the users' preference is replaced by SerialParam() for all uses of DelayedArray. But really one wants only a package-local preference, maybe through a helper function invoked on each use of bplapply() and friends. Maybe I gave misleading advice earlier.

The current implementation causes windows builds in both release and devel to fail, e.g., http://bioconductor.org/checkResults/devel/bioc-LATEST/BiocParallel/tokay1-checksrc.html when, in the example before the failing example, GenomicAlignments depends on DelayedArray, replacing the normal default (SnowParam()) with SerialParam().

How to implement a realization backend

The Implementing A DelayedArray Backend vignette only covers how to implement a backend for read access only. Backends that support saving DelayedArray objects (a.k.a. realization backends) are not covered yet. Reasons for this are various: no demand so far, exact procedure still kind of a work-in-progress and subject to changes, lack of time, etc...

In the meantime, I'm putting some material here (and will move it to the Implementing A DelayedArray Backend vignette as time allows).

Say we want to implement a realization backend for the ADS format (the imaginary format made up for the Implementing A DelayedArray Backend vignette), the 2 core things we need to implement are:

  1. A RealizationSink subclass for the ADS backend. RealizationSink is a virtual class defined in the DelayedArray package (in R/RealizationSink-class.R). By analogy with the HDF5RealizationSink class defined in the HDF5Array package (in R/writeHDF5Array.R), let's assume that the RealizationSink subclass for the ADS backend will be called ADSRealizationSink.

  2. Coercion methods from ADSRealizationSink to ADSArraySeed, ADSArray, and DelayedArray.

RealizationSink subclass

The purpose of an ADSRealizationSink object is to point to a new ADS dataset and allow writing blocks of data to it. The class definition for ADSRealizationSink would typically look something like:

setClass("ADSRealizationSink",
    contains="RealizationSink",
    representation(
        dim="integer",     # Naming this slot "dim" makes dim() work out of the box.
        dimnames="list",
        type="character",  # Single string.
        ## Additional slots would typically include the path or connection to a file....
        ...
    )
)

Then we need a constructor function for these objects. The constructor should be named as the class and its first 3 arguments should be dim, dimnames, and type. It can have more arguments but those are optional and calling ADSRealizationSink() with the first 3 arguments only (i.e. ADSRealizationSink(dim, dimnames, type)) should work. Furthermore, every call to ADSRealizationSink() should create a new dataset that is ready to be written to.

ADSRealizationSink objects must support the following operations (via defining appropriate methods):

  1. dim(), dimnames(), and type(). These should return the values that were passed to the call to ADSRealizationSink() that was used to create the object.
  2. write_block(). This is a generic defined in the DelayedArray package in R/read_block.R.
  3. close(). This base R S3 generic is promoted to S4 generic in the DelayedArray package in R/RealizationSink-class.R. A default method for RealizationSink objects is provided and does nothing (no-op). Implement a method for ADSRealizationSink objects only if some connection needs to be closed and/or other resources need to be released after writing the data is complete and before the ADSRealizationSink object can be turned into an ADSArraySeed object for reading.

Coercion methods from ADSRealizationSink to ADSArraySeed, ADSArray, and DelayedArray

From ADSRealizationSink to ADSArraySeed

Think of an ADSRealizationSink object as a "write" connection to an ADS data set. Think of an ADSArraySeed object as a "read only" connection to an ADS data set. The purpose of the coercion from ADSRealizationSink to ADSArraySeed is to change the nature of this connection from "write" to "read only" and to produce an object that can be wrapped into a DelayedArray object.

From ADSRealizationSink to ADSArray and DelayedArray

setAs("ADSRealizationSink", "ADSArray",
    function(from) DelayedArray(as(from, "ADSArraySeed"))
)

setAs("ADSRealizationSink", "DelayedArray", function(from) as(from, "ADSArray"))

A basic example

Once we have the above (i.e. ADSRealizationSink objects, ADSRealizationSink() constructor, and the 3 coercion methods), we can realize an arbitrary DelayedArray object x as a new pristine ADSArray object x2 by using the simple code below:

realize_as_ADSArray <- function(x)
{
    sink <- ADSRealizationSink(dim(x), dimnames(x), type(x))
    DelayedArray:::BLOCK_write_to_sink(x, sink) 
    close(sink)
    as(sink, "DelayedArray")  # a pristine ADSArray object semantically equivalent to `x`
}

DelayedArray:::BLOCK_write_to_sink() reads blocks from x, realizes them in memory, and writes them to sink (with write_block()).
Note that realize_as_ADSArray() also works on an ordinary array or any array-like object that supports extract_array(), not just on a DelayedArray object.

Add some convenience

Now we can build some convenience on top of this.

One basic convenience is a coercion method from ANY to ADSArray that just does what realize_as_ADSArray() does:

setAs("ANY", "ADSArray", function(from) realize_as_ADSArray(from))

Unfortunately, trying to coerce a DelayedArray or DelayedMatrix object to ADSArray would produce a broken object if we didn't also have the following coercion methods:

setAs("DelayedArray", "ADSArray", function(from) realize_as_ADSArray(from))
setAs("DelayedMatrix", "ADSArray", function(from) realize_as_ADSArray(from))

So in the same way that an array-like object x (ordinary array or DelayedArray object) can be realized as an HDF5Array or RleArray object with as(x, "HDF5Array") or as(x, "RleArray"), now it can also be realized as an ADSArray object with as(x, "ADSArray").

Real examples of realization backends

Refer to R/writeHDF5Array.R and R/writeTENxMatrix.R in the HDF5Array package for the implementation of the HDF5Array and TENxMatrix realization backends.

Note that you can use supportedRealizationBackends() to see the list of realization backends currently supported.

possble DelayedArray constructor over `IntegerList` (or other `List`)

Hi @hpages ,

In the VariantAnnotation package, the CollapsedVCF (or ExpandedVCF) are saving the data entries in IntegerList / CharacterList.... And in the development of VCFArray, we are trying to represent these data entries as DelayedArray instances. Now we are converting the data entries into array to add dimension, and then use the DelayedArray constructor over the array. Is it possible to have the DelayedArray constructor directly work on List object? so that the internal data saving are still using a more efficient way in List structure than the ordinary list? @mtmorgan

 > XX <- IntegerList(c(list(rep(NA, 2)), list(rep(NA, 2)), list(rep(NA, 2)), list(rep(NA, 2))))
 > XX
 IntegerList of length 4
 [[1]] <NA> <NA>
 [[2]] <NA> <NA>
 [[3]] <NA> <NA>
 [[4]] <NA> <NA>

 > DelayedArray(array(XX))
 <4> DelayedArray object of type "list":          ## the type is ordinary list. 
    [1]    [2]    [3]    [4]
 NA, NA NA, NA NA, NA NA, NA

!> DelayedArray(array(XX))[1:2]  
 [[1]]
 [1] NA NA

 [[2]]
 [1] NA NA

## What we want is something like: 

!> DelayedArray(XX)
 <4> DelayedArray object of type "IntegerList":      
    [1]    [2]    [3]    [4]
 NA, NA NA, NA NA, NA NA, NA

!> DelayedArray(XX)[1:2]  
 IntegerList of length 2
 [[1]] <NA> <NA>
 [[2]] <NA> <NA>

Re-implementing rowsum/colsum for HDF5Matrix

I just noticed that bsseq has been broken in devel for a while due to the removal of the rowsum,HDF5Matrix-method (which I okayed back in 1a6a44f#comments).

I now remember why I had this method: if the data are so large that they need to be stored in HDF5 then the result of rowsum()/colsum() may itself be so large that it can't be stored in memory.
Therefore, the method I had allowed for this by allowing the result to be written to an HDF5RealizationSink (it also allowed me some extra control over the result, such as writing 2 separate rowsum()s to the same file as in PeteHaitch/DelayedMatrixStats@27712d7).

Would it be possible to have a rowsum()/colsum() that supports realizing the result to a RealizationSink as it goes?
I will try to find time to do this if you're okay with it.

Calling TFSBTools

After updating DelayedArray to the latest(0.66), upon starring R, DelayedArray tried to load TFSBtools which is not in a list of required package as follows;

Loading required package: TFBSTools
Error: package or namespace load failed for ‘TFBSTools’:
.onLoad failed in loadNamespace() for 'DelayedArray', details:
call: initialize(value, ...)

This message repeats dozen times until it stops. I'd appreciate if you can suggest a workaround for this.

What about using a seed with "dual layout" i.e. row- and col- oriented?

I'm opening an issue to facilitate the discussion about this. The discussion started with Mike's following email (May 12, 2018):

Herve,

I am in the process of implementing a delayedArray backend and have some design questions for you. In vignette In theory, it should be possible to implement a DelayedArray backend for any file format that has the capability to store array data with fast random access, which I understand as the minimum requirement for extract_array method (used for random indexing [i,j]) and dims,dimnames slots. (edited)
However, I am not quite sure what is the purpose of seed class which contains the filepath of the actual data store. I thought the DelayedArray should be agnostic about the physical storage information of the backend.
If it is mandated, then my seed object will contain at least two file paths, which represent different data layout (both row and column oriented storage) of the same matrix. is that going to play well with the DelayedArray?
Also, I am envisioning our extended array type will be a generic disk-based class, which allows arbitrary file format (h5, tiledb,etc..) to be used in this type of hybrid data layout framework.
Hopefully you can give me some clarifications on these. Thanks a lot!
Mike

Implementing sweep() for DelayedArray objects

Hi Hervé,

I thought to implement a sweep,DelayedArray-method as base::sweep() does not work when x is a DelayedArray:

suppressPackageStartupMessages(library(DelayedArray))
x <- DelayedArray(matrix(1:10000000, ncol = 1000))
sweep(x, 2, colMeans(x))
#> Error in .DelayedArray_Ops_with_right_vector(.Generic, e1, e2): length of right object is not a divisor of number of rows in left
#>   object

In doing so, I realised that base::sweep() almost works, it only breaks at the final line:

# https://github.com/wch/r-source/blob/af7f52f70101960861e5d995d3a4bec010bc89e6/src/library/base/R/sweep.R#L44
FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)

i.e. in this case, because FUN() (-) works when e1 and e2 are either both array objects or both DelayedArray objects, but errors if there is one of each type.

I'm thinking of 2 directions I could go in:

  1. Changes to DelayedArray,Ops-method: Automagic wrapping of e1 or e2 in a DelayedArray when one arg is an array and the other is a DelayedArray (and the e1 and e2 have conformable dimensions). base::sweep() will then work out-of-the-box on for most commonly used FUN when x is a DelayedArray
  2. Implement a sweep,DelayedArray-method: This will introduce some redundant code from base::sweep() but gives more control over how it is implemented. For example, base::sweep() creates an intermediate array of the same dimensions as x. This intermediate array may be very large, even impossible to store in-memory, in which case it could be written to disk as an HDF5Array. Alternatively, as in the above example, this intermediate array may be highly redundant and could be efficiently stored in-memory as an RleArray.

What do you think is the best way to go?

Feature Request/Help on Applying Linear Models

Hello,

Absolutely amazing package! I am working on a relatively simply problem but can't quite figure out an efficient solution. I have a dataset x and a dataset y that I've saved as HDF5 objects and load them into R using HDF5Array. I want to run the linear models on each column of the y and x HDF5Array objects. Specifically, I want to run a set of linear models (lm(), gee::gee(), lme4::lmer()). Is there a way to use blockApply(), apply() or other delayed operations to do this efficiently?

I've provided a simple example below for reference so you know exactly what I'm doing. This isn't how I've actually implemented but just wanted you to see exactly the problem.

# Set made up data
x = as(matrix(1:40, ncol = 4), "HDF5Array")
y = as(matrix(41:80, ncol = 4), "HDF5Array")

# Initalize results
results = matrix(NA, ncol = 4)

for(i in 1:dim(x)[1]){
  # only use lm for simple example
  fit = lm(y[,1] ~ x[,1])
  results[i] = coefficients(fit)[2]
}

Is it possible to implement something like lm() and other modeling like the functions in DelayedMatrixStats?

replacing dimnames with identical values demotes >2-dim DelayedArray derivatives into DelayedArray

Hi @hpages ,

To continue with the slack conversion, I'm opening an issue here for better track.

In my case of GDSArray, if already a 2-dimensional GDSMatrix, replacing the dimnames with identical values is expected as no-op. However, for >2-dimensional GDSArray, replacing dimnames[1:2] with identical values demotes the object into DelayedDimnames.

library(SeqArray)
library(GDSArray)

gds <- SeqArray::seqExampleFileName("gds")
ga1 <- GDSArray(gds, "genotype/data")
ga2 <- GDSArray(gds, "phase/data")

names(dimnames(ga2))
## [1] "variant.id" "sample.id"                                                                                                                                                                                                                              
names(dimnames(ga1))
## [1] "variant.id" "sample.id"  "ploidy.id"                                                                                                                                                                                                                 
identical(dimnames(ga1)[1:2], dimnames(ga2))
## [1] TRUE                                                                                                                                                                                                                                                  

dnames <- dimnames(ga2)
dimnames(ga2) <- dnames
is(ga2, "GDSMatrix")
## [1] TRUE                                                                                                                                                                                                                                                  
str(ga2)
## no-op                                                                                                                                                                                                                                                     

dimnames(ga1)[1:2] <- dnames  ## demoted as "DelayedDimnames", not a no-op.                                                                                                                                                                                  
is(ga1, "GDSArray")
## [1] FALSE                                                                                                                                                                                                                                                 
str(ga1)
## Demoted into "DelayedDimnames"                                                                                                                                                                                                                            

I was mainly curious about this performance. In your development of DelayedArray, was it something expected? Or are you also expecting no-op for this kind of operation?

The question comes from the assay,SummarizedExperiment method, when I put the 3-dimensional GDSArray into the @assay slot, each time calling of assays(se) will demotes the object inside @assays into DelayedArray (internally as DelayedDimnames).

!> selectMethod(assays, "SummarizedExperiment")
 Method Definition:

 function (x, ..., withDimnames = TRUE)
 {
     assays <- as(x@assays, "SimpleList")
     if (withDimnames) {
         assays <- endoapply(assays, function(assay) {
             dimnames(assay)[1:2] <- dimnames(x)
             assay
         })
     }
     assays
 }

For the source of this question from my package VariantExperiment, It doesn't really matter to check is(, "GDSArray") or is(, "DelayedArray"), and I also think your idea for an updated show method would be nice. Like your example in slack:

<6 x 2 x 90354753> DelayedArray object of type "list" with a seed of class GDSArraySeed: 

Thanks,
Qian

Obscure error when passing DelayedArray objects to DataFrame()

Originally reported by Martin via our Slack

library(DelayedArray)
da2 <- DelayedArray(matrix(1:52, 26, 2))
da3 <- DelayedArray(array(1:26, c(26, 1, 1)))
DataFrame(da2 = I(da2[1,]), da3 = I(da3[1,,]))
# Error in eval(subscript, envir = eframe, enclos = eframe) : 
#    '...' used in an incorrect context

names on dimnames

Names on dimnames seem to be dropped. Those are sometimes useful for indicating what the names mean.

dn <- dimnames(array)
names(dn) <- c("foo", "bar")
dimnames(array) <- dn
names(dimnames(array)) # NULL

Transposing a subsetted DelayedArray: turtles all the way down?

I recently noticed an interesting behaviour with transposing a subsetting DelayedArray:

library(DelayedArray)
blah <- matrix(runif(1000), ncol=20, nrow=50)
X <- DelayedArray(blah)
Y <- t(X[,20:1])

When I try to dig out the seed, I need to go pretty deep:

class(Y)[1]
## [1] "DelayedMatrix"
class(Y@seed)[1]
## [1] "SeedDimPicker"
class(Y@seed@seed)[1]
## [1] "DelayedMatrix"
class(Y@seed@seed@seed)
## [1] "matrix"

This gets even more amusing if I transpose and subset again:

Z <- t(Y[20:1,])

... which results in the following:

class(Z)[1]
## [1] "DelayedMatrix"
class(Z@seed)[1]
## [1] "SeedDimPicker"
class(Z@seed@seed)[1]
## [1] "DelayedMatrix"
class(Z@seed@seed@seed)[1]
## [1] "SeedDimPicker"
class(Z@seed@seed@seed@seed)[1]
## [1] "DelayedMatrix"
class(Z@seed@seed@seed@seed@seed)
## [1] "matrix"

You could say that I shouldn't be trying to dig out the seed in the first place, and that's fair enough. However, I can imagine that at least a few of us will be writing back-end specific code for greater efficiency (beachmat for me, and probably also DelayedMatrixStats for @PeteHaitch), in which case it would be pretty inconvenient to have to dig so deep to figure out what the backend actually is. Not only that - we would have to correct for multiple sets of delayed operations, not just in Y but also in Y@seed@seed, which makes life more difficult when doing back-end specific optimizations.

From the outside looking in, it should be possible for (i) the first transposition to store the actual backend (in this case, "matrix") as the seed of the SeedDimPicker, rather than storing a DelayedMatrix; and (ii) for re-transpositions to undo the construction of a SeedDimPicker so that Z@seed becomes identical to X@seed (aside from relatively minor modifications to index). This seems simpler, and possibly more efficient, unless there are some compelling arguments for the current set-up.

FYI, beachmat now does chunk-by-chunk realization when it encounters an unknown seed type, so the nested seed setup above will work happily with beachmat. However, certain optimizations that skip explicit realization are no longer possible with the nested seeds.

Oh, and:

R Under development (unstable) (2017-10-27 r73632)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] DelayedArray_0.5.22 BiocParallel_1.13.1 IRanges_2.13.28    
[4] S4Vectors_0.17.37   BiocGenerics_0.25.3 matrixStats_0.53.1 

loaded via a namespace (and not attached):
[1] compiler_3.5.0

Error when calling as.matrix() on DelayedMatrix with length > .Machine$integer.max

I made a large DelayedMatrix, M0, by cbind()-ing two HDF5Matrix objects (there was also some row-subsetting). When calling as.matrix(M0)), I got a rather obscure error. It looks like integer overflow, which may be related to the fact the the length(M0) > .Machine$integer.max:

> dim(M0)
[1] 27660298      177
> length(M0)
[1] 4895872746
> length(M0) > .Machine$integer.max
[1] TRUE
> showtree(M0)
27660298x177 double: DelayedMatrix object
└─ 27660298x177 double: Unary iso op
   └─ 27660298x177 double: Subset
      └─ 29307078x177 double: Abind (along=2)
         ├─ 29307078x32 double: Set dimnames
         │  └─ 29307078x32 double: [seed] HDF5ArraySeed object
         └─ 29307078x145 double: Set dimnames
            └─ 29307078x145 double: [seed] HDF5ArraySeed object
> m0 <- as.matrix(M0)
Error in successiveIRanges(rep.int(width, k), from = offset + 1L) :
  'width' cannot contain NAs or negative values
In addition: Warning message:
In block_lens * dims[n, ] : NAs produced by integer overflow

I can post more details tomorrow, but I wanted to jot this down before heading home for the day. Obviously calling as.matrix() on such a large DelayedMatrix isn't generally a good idea, but I've got a big memory machine and it helps a bit when doing quick exploratory analyses.

Improving performance of loading random subsets of data into memory

Hi Hervé,

I'm looking to improve the performance of loading random subsets of disk-backed data into memory. My specific case is loading random rows of an HDF5Matrix into memory, but it may be a more general issue.

I use an example of loading 10,000 randomly sampled rows of a 500,000 x 6 HDF5Matrix into memory as an ordinary matrix. I initially compared 2 strategies: "single bracket subset" and "linear single bracket subset". For each of these strategies I also looked to see whether it helped to first sort the index (I thought this might help by improving the contiguity of data access). As a baseline, I compared these with the naive (and generally unviable) method of loading all the data into memory and then subsetting. I later thought that extract_array() might be what I'm really looking for and added it to the comparison. In my package and analysis code, I've been using "single bracket subset" (i.e. as.matrix(x[i, ])) quite a bit and found this to be a bit of a bottleneck, which is what prompted this investigation.

Remarkably, the "linear single bracket subset" method is only 3x slower than the "load all data then subset" strategy. In contrast, the "single bracket subset" and "extract array" methods are 60-70x slower than the "load all data then subset" strategy.

It looks like the "linear single brack subset" method trades off increased memory allocations for this impressive performance. But my understanding is "linear single bracket subset" is using the block-processing machinery, so memory usage should be controllable using the tools that are already in place.

I think this is a fair analysis, but I may have overlooked something.

What I'm left wondering is: What is (or should be) the canonical way for loading a subset of a DelayedArray into memory as an ordinary array, especially row- or column-slices of an HDF5Matrix? Can existing functions benefit by leveraging "linear single brack subset"? Might there be a need for explicit methods that offer different ways of handling the memory vs. performance tradeoff?

I look forward to hearing your thoughts.

Thanks,
Pete

# Setup
suppressPackageStartupMessages(library(HDF5Array))
suppressPackageStartupMessages(library(microbenchmark))
suppressPackageStartupMessages(library(profmem))

# Simulate some data
nrow <- 500000
ncol <- 6
x <- writeHDF5Array(matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol))
n <- 10000
i <- sample(nrow(x), n)

# Functions
loadAll <- function(x, i) {
    as.matrix(x)
}
loadAllThenSubset <- function(x, i) {
    as.matrix(x)[i, ]
}
singleBracketSubset <- function(x, i) {
    as.matrix(x[i, ])
}
singleBracketWithSortSubset <- function(x, i) {
    o <- order(i)
    as.matrix(x[i[o], ])[order(o), ]
}
linearSingleBracketSubset <- function(x, i) {
    i2 <- DelayedArray:::to_linear_index(list(i, NULL), dim(x))
    matrix(x[i2], ncol = ncol(x))
}
linearSingleBracketSubsetWithSort <- function(x, i) {
    o <- order(i)
    i2 <- DelayedArray:::to_linear_index(list(i, NULL), dim(x))
    o2 <- order(i2)
    matrix(x[i2[o2]], ncol = ncol(x))[order(o), ]
}
extractArray <- function(x, i) {
    Nindex <- list(i, NULL)
    extract_array(x, Nindex)
}

# Check all methods get same result
all.equal(loadAllThenSubset(x, i), singleBracketSubset(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), singleBracketWithSortSubset(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), linearSingleBracketSubset(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), linearSingleBracketSubsetWithSort(x, i))
#> [1] TRUE
all.equal(loadAllThenSubset(x, i), extractArray(x, i))
#> [1] TRUE

# Benchmark running times
microbenchmark(
    loadAll(x, i),
    loadAllThenSubset(x, i),
    singleBracketSubset(x, i),
    singleBracketWithSortSubset(x, i),
    linearSingleBracketSubset(x, i),
    linearSingleBracketSubsetWithSort(x, i),
    extractArray(x, i),
    times = 10)
#> Unit: milliseconds
#>                                     expr        min         lq       mean
#>                            loadAll(x, i)   210.8005   226.0236   267.8243
#>                  loadAllThenSubset(x, i)   229.3803   236.5944   281.0997
#>                singleBracketSubset(x, i) 15809.3548 15956.4365 16661.4532
#>        singleBracketWithSortSubset(x, i) 15948.9005 16023.4474 16956.4327
#>          linearSingleBracketSubset(x, i)   753.1962   794.5452   835.2245
#>  linearSingleBracketSubsetWithSort(x, i)   736.4124   753.4237   799.8785
#>                       extractArray(x, i) 15911.2468 16203.3846 16754.2316
#>      median         uq        max neval cld
#>    247.3354   276.3357   382.3206    10  a 
#>    243.5460   346.0734   391.9918    10  a 
#>  16098.7406 16440.0989 19399.0879    10   b
#>  16586.2331 17295.1056 19703.1299    10   b
#>    808.5406   893.2855   958.5585    10  a 
#>    786.1045   818.1867   929.4077    10  a 
#>  16395.2204 16658.6316 19342.9889    10   b

# Benchmark total memory allocations
total(profmem(loadAll(x, i)))
#> [1] 88592128
total(profmem(loadAllThenSubset(x, i)))
#> [1] 89069664
total(profmem(singleBracketSubset(x, i)))
#> [1] 5531216
total(profmem(singleBracketWithSortSubset(x, i)))
#> [1] 5547480
total(profmem(linearSingleBracketSubset(x, i)))
#> [1] 425103376
total(profmem(linearSingleBracketSubsetWithSort(x, i)))
#> [1] 422994840
total(profmem(extractArray(x, i)))
#> [1] 5888816

Created on 2018-03-28 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                                             
#>  version  R Under development (unstable) (2018-03-05 r74359)
#>  system   x86_64, darwin15.6.0                              
#>  ui       X11                                               
#>  language (EN)                                              
#>  collate  en_AU.UTF-8                                       
#>  tz       America/New_York                                  
#>  date     2018-03-28
#> Packages -----------------------------------------------------------------
#>  package        * version date       source                             
#>  backports        1.1.2   2017-12-13 CRAN (R 3.5.0)                     
#>  base           * 3.5.0   2018-03-06 local                              
#>  BiocGenerics   * 0.25.3  2018-02-09 Bioconductor                       
#>  BiocParallel   * 1.13.3  2018-03-23 Bioconductor                       
#>  codetools        0.2-15  2016-10-05 CRAN (R 3.5.0)                     
#>  compiler         3.5.0   2018-03-06 local                              
#>  datasets       * 3.5.0   2018-03-06 local                              
#>  DelayedArray   * 0.5.22  2018-03-02 Bioconductor                       
#>  devtools         1.13.5  2018-02-18 CRAN (R 3.5.0)                     
#>  digest           0.6.15  2018-01-28 CRAN (R 3.5.0)                     
#>  evaluate         0.10.1  2017-06-24 CRAN (R 3.5.0)                     
#>  graphics       * 3.5.0   2018-03-06 local                              
#>  grDevices      * 3.5.0   2018-03-06 local                              
#>  grid             3.5.0   2018-03-06 local                              
#>  HDF5Array      * 1.7.9   2018-03-02 Bioconductor                       
#>  htmltools        0.3.6   2017-04-28 CRAN (R 3.5.0)                     
#>  IRanges        * 2.13.28 2018-02-24 Bioconductor                       
#>  knitr            1.20    2018-02-20 CRAN (R 3.5.0)                     
#>  lattice          0.20-35 2017-03-25 CRAN (R 3.5.0)                     
#>  magrittr         1.5     2014-11-22 CRAN (R 3.5.0)                     
#>  MASS             7.3-49  2018-02-23 CRAN (R 3.5.0)                     
#>  Matrix           1.2-12  2017-11-20 CRAN (R 3.5.0)                     
#>  matrixStats    * 0.53.1  2018-02-11 CRAN (R 3.5.0)                     
#>  memoise          1.1.0   2017-04-21 CRAN (R 3.5.0)                     
#>  methods        * 3.5.0   2018-03-06 local                              
#>  microbenchmark * 1.4-4   2018-01-24 CRAN (R 3.5.0)                     
#>  multcomp         1.4-8   2017-11-08 CRAN (R 3.5.0)                     
#>  mvtnorm          1.0-7   2018-01-25 CRAN (R 3.5.0)                     
#>  parallel       * 3.5.0   2018-03-06 local                              
#>  profmem        * 0.5.0   2018-01-30 CRAN (R 3.5.0)                     
#>  Rcpp             0.12.16 2018-03-13 CRAN (R 3.5.0)                     
#>  rhdf5          * 2.23.6  2018-02-15 Github (Bioconductor/rhdf5@f452f9e)
#>  Rhdf5lib         1.1.5   2018-01-11 Bioconductor                       
#>  rmarkdown        1.9     2018-03-01 CRAN (R 3.5.0)                     
#>  rprojroot        1.3-2   2018-01-03 CRAN (R 3.5.0)                     
#>  S4Vectors      * 0.17.38 2018-03-28 Bioconductor                       
#>  sandwich         2.4-0   2017-07-26 CRAN (R 3.5.0)                     
#>  splines          3.5.0   2018-03-06 local                              
#>  stats          * 3.5.0   2018-03-06 local                              
#>  stats4         * 3.5.0   2018-03-06 local                              
#>  stringi          1.1.7   2018-03-12 CRAN (R 3.5.0)                     
#>  stringr          1.3.0   2018-02-19 CRAN (R 3.5.0)                     
#>  survival         2.41-3  2017-04-04 CRAN (R 3.5.0)                     
#>  TH.data          1.0-8   2017-01-23 CRAN (R 3.5.0)                     
#>  tools            3.5.0   2018-03-06 local                              
#>  utils          * 3.5.0   2018-03-06 local                              
#>  withr            2.1.2   2018-03-15 CRAN (R 3.5.0)                     
#>  yaml             2.1.18  2018-03-08 CRAN (R 3.5.0)                     
#>  zoo              1.8-1   2018-01-08 CRAN (R 3.5.0)

NA subscripting

Would be useful to support subscripting with NAs. The rows or columns would just be filled with NA.

Subsetting a DelayedArray object

Hi,
I am using DelayedArray package to process some sequencing data. In the manual it says DelayedArray object can be subsetted by a logical DelayedArray object. So I tried some of the examples in the manual.

Linear single bracket subsetting:

A[which(A <= 1e-5)]

Subassignment:

A[A < 0.2] <- NA
where A is a delayedArray.

Linear single bracket subsetting didn't work and I got this error Error in which(A <= 1e-05) : argument to 'which' is not logical, I am guessing which function is not recognizing logical delayedArray object as an acceptable logical object.

For Subassignment, I am getting the following error Error in A[A < 0.2] <- NA : object of type 'S4' is not subsettable

Any idea why this is happening? My main intent was to subset the DelayedArray, so is it possible in this case?

Thank you,
Divy

Subsetting with drop=TRUE doesn't behave as expected

As the title suggests:

library(DelayedArray)
blah <- matrix(runif(1000), ncol=20, nrow=50)
X <- DelayedArray(blah)
X[,1,drop=TRUE]
## <50 x 1> DelayedMatrix object of type "double":
##            [,1]
##  [1,] 0.9186617
##  [2,] 0.4121906
##  [3,] 0.7185965
##  [4,] 0.3157570
##  [5,] 0.2866116
##   ...         .
## [46,] 0.6924624
## [47,] 0.1779758
## [48,] 0.8557219
## [49,] 0.5277486
## [50,] 0.7759722

I would have expected a vector as I would have gotten from blah[,1,drop=TRUE].

Running on DelayedArray version 0.5.22.

Subsetting on DelayedArray objects with list elements fails

Hi @hpages ,

When we tried to construct a DelayedArray object with list elements, and do the subsetting, it will fail. Any idea on this? Thanks.

!> DelayedArray(array(list(), dim=10))
 <10> DelayedArray object of type "list":
  [1]  [2]  [3]    .  [9] [10]
 NULL NULL NULL    . NULL NULL

!> aa <- DelayedArray(array(list(), dim=10))[1:2]
 Error in .normarg_grid(grid, x) :
   The current default grid maker returned an error when called on 'x'.
   Please use setDefaultGridMaker() to set a valid default grid maker.

replacement method [<- support the same subassignment as regular array

It is neither intuitive nor convenient for users to be required to always construct a DelayedArray index before they can update the data. Can this be done automatically behind the scene in [<- method so that the existing user code doesn't need the major changes to achieve partial write to DelayedArray?

> a <- array(0, dim=c(2, 2))
> A <- DelayedArray(a)
> a[1,1] <- 1
> a[4] <- 2
> a
     [,1] [,2]
[1,]    1    0
[2,]    0    2
> A[1,1] <- 1
Error in `[<-`(`*tmp*`, 1, 1, value = 1) : 
  subassignment to a DelayedArray object 'x' (i.e. 'x[i] <- value') is supported only when the
  subscript 'i' is a logical DelayedArray object with the same dimensions as 'x' and when 'value' is a
  scalar (i.e. an atomic vector of length 1)
> A[1] <- 1
Error in `[<-`(`*tmp*`, 1, value = 1) : 
  subassignment to a DelayedArray object 'x' (i.e. 'x[i] <- value') is supported only when the
  subscript 'i' is a logical DelayedArray object with the same dimensions as 'x' and when 'value' is a
  scalar (i.e. an atomic vector of length 1)
> idx <- DelayedArray(array(c(T,F,F,F), dim  = c(2,2)))
> A[idx] <- 1
> A
<2 x 2> DelayedMatrix object of type "double":
     [,1] [,2]
[1,]    1    0
[2,]    0    0

Implementing crossprod and dual-DA %*%

I recently needed crossprod for DA objects, as well as a %*% that would accept two DAs. I've put the code I've used here, in the hope that it may help development of the methods within the DelayedArray package itself. Both functions are parallelized by splitting the input DA into blocks that contain all columns (for crossprod) or all rows (for %*%), and then distributing each block to a separate worker for calculations.

One major difficulty is that this strategy does not respect the maximum block size defined by setAutoBlockSize. Rather, the size of the blocks is inversely proportional to the number of cores. Each block is fully realized in memory during processing, so the core-based definition of the block size may exceed memory limits for a very large input matrix. While I'm aware of colGrid and rowGrid, these may not define blocks with sufficient granularity to allow distribution across the specified number of workers. Someone who sets setAutoBlockSize to a very large value - ostensibly to improve performance - would end up having very few blocks from colGrid or rowGrid to process independently.

Tagging @kasperdanielhansen who may also have some code lying around to do this.

Serialized DelayedArray instances from release (BioC 3.6) are broken in devel (BioC 3.7)

The recent internal changes to the DelayedArray class mean that DelayedArray instances serialized using the release version of BioC are broken when read in using the devel branch. Unfortunately, updateObject,DelayedArray-method is unable to fix them. A reproducible example is shown below.

Can updateObject,DelayedArray-method be fixed? If necessary, can updateObject,SummarizedExperiment-method please be similarly updated?

This was brought to my attention by @j-lawson whose MIRA package uses a serialized BSseq instance in tests (BSseq extends SummarizedExperiment); the object is available from https://github.com/databio/MIRA/blob/4eaaca95cbb52bac6dadcee45261c463d748c219/data/exampleBSseqObj.RData. The serialized instance was created by something like bsseq::BS.chr22[1:20, ], where bsseq::BS.chr22 is a BSseq object with DelayedMatrix instances as assay elements.

Reprex

Create some data and serialize in using release (BioC 3.6)

library(DelayedArray)
x <- DelayedArray(matrix(1:10, ncol = 2))
y <- x[1:3, ]
saveRDS(y, "~/test.rds")
Session info
sessionInfo()
R version 3.4.3 Patched (2018-01-20 r74142)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.9 (Santiago)

Matrix products: default
BLAS: /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.4.x/R/3.4.x/lib64/R/lib/libRblas.so
LAPACK: /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-3.4.x/R/3.4.x/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] DelayedArray_0.4.1  IRanges_2.12.0      S4Vectors_0.16.0
[4] BiocGenerics_0.24.0 matrixStats_0.53.1

loaded via a namespace (and not attached):
[1] compiler_3.4.3

Load the data using devel (BioC 3.7)

library(DelayedArray)
y <- readRDS("~/test.rds")
validObject(y)
#> [1] TRUE
y
#> Error in new("standardGeneric", .Data = function (object)  :
#>   DelayedMatrix object uses internal representation from DelayedArray < 0.5.11
#>   and cannot be displayed or used. Please update it with:
#>     object <- updateObject(object, verbose=TRUE)
updateObject(y, verbose = TRUE)
#> updateObject(object="ANY") default for object of class 'matrix'
#> [updateObject] DelayedMatrix object uses internal representation from
#> [updateObject] DelayedArray < 0.5.11. Updating it ...
#> Error in initialize(value, ...) :
#>   invalid names for slots of class “DelayedMatrix”: index, delayed_ops
Session info
sessionInfo()
R Under development (unstable) (2018-03-21 r74433)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.9 (Santiago)

Matrix products: default
BLAS: /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-devel/R/devel/lib64/R/lib/libRblas.so
LAPACK: /jhpce/shared/jhpce/core/conda/miniconda-3/envs/svnR-devel/R/devel/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices datasets  utils
[8] methods   base

other attached packages:
[1] DelayedArray_0.5.30 BiocParallel_1.13.3 IRanges_2.13.28
[4] S4Vectors_0.17.41   BiocGenerics_0.25.3 matrixStats_0.53.1
[7] devtools_1.13.5

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0    withr_2.1.2    memoise_1.1.0  knitr_1.20
[6] digest_0.6.15

Strange behaviour when subsetting and coercing to vector

Hi,

I noticed some strange behavior in the devel version of DelayedArray (0.7.19). When subsetting and coercing a DelayedMatrix to a vector, the output is twice the length I'd expect, and doesn't contain the correct elements.

Here is a minimal example:

require(DelayedArray)
m <- DelayedArray(matrix(0:8, ncol = 3))
m[m>0]
 [1] 1 2 0 1 2 0 1 2 0 0 1 1 1 2 2 2

This doesn't happen in the release version of DelayedArray (0.6.1). Here, the example gives the expected output:

require(DelayedArray)
m <- DelayedArray(matrix(0:8, ncol = 3))
m[m>0]
[1] 1 2 3 4 5 6 7 8

Help implementing 'parallel' writing to a RealizationSink with BiocParallel

I'm trying to implement writing to an arbitrary RealizationSink backend via BiocParallel::bplapply() with an arbitrary BiocParallelParam backend. That is, I want to be able to construct blocks of output, possibly in parallel, and then safely write each block to a HDF5Array, RleArray, etc. (controllable by the user) after each block is generated.

I've managed to get this set up for an HDF5RealizationSink (my main use case) by using the inter-process locks available in BiocParallel to ensure the writes are safe (@mtmorgan am I using these correctly?). But I've not had any luck using the same code for an arrayRealizationSink. Nor can I get it to work for the RleArray backend, although here the problem seems to be slightly different.

Below is a toy example that tries to construct a DelayedMatrix column-by-column using this strategy, in conjunction with different DelayedArray backends:

suppressPackageStartupMessages(library(DelayedArray))
suppressPackageStartupMessages(library(BiocParallel))

FUN_with_IPC_lock <- function(b, nrow, sink, grid, ipcid) {
    # Ensure DelayedArray package is loaded on worker (needed for backends
    # other than MulticoreParam).
    suppressPackageStartupMessages(library("DelayedArray"))
    if (is(sink, "HDF5RealizationSink")) {
        # Ensure HDF5Array package is loaded on worker, if required.
        # NOTE: Feels a little clunky that this is necessary; would be good if
        #       this was automatically loaded, properly.
        suppressPackageStartupMessages(library("HDF5Array"))
    }
    message("Processing block = ", b, " on PID = ", Sys.getpid())
    tmp <- matrix(seq_len(nrow) + (b - 1L) * nrow, ncol = 1)
    ipclock(ipcid)
    write_block_to_sink(tmp, sink, grid[[b]])
    ipcunlock(ipcid)
}

g <- function(FUN, BPPARAM, nrow = 100L, ncol = 10L) {
    # Construct RealizationSink
    backend <- getRealizationBackend()
    message("The backend is ", if (is.null(backend)) "NULL" else backend)
    sink <- DelayedArray:::RealizationSink(dim = c(nrow, ncol), type = "integer")
    # Consruct ArrayGrid over columns of RealizationSink
    grid <- RegularArrayGrid(dim(sink), c(nrow(sink), 1L))
    # Fill the RealizationSink
    n_block <- length(grid)
    # Construct an IPC mutex ID
    ipcid <- ipcid()
    # Apply function with BiocParallel
    bplapply(seq_len(n_block), FUN, BPPARAM = BPPARAM, nrow = nrow, sink = sink,
             grid = grid, ipcid = ipcid)
    # Return the RealizationSink as a DelayedArray
    as(sink, "DelayedArray")
}

# Everything works with the HDF5Array backend
setRealizationBackend("HDF5Array")
#> Loading required package: rhdf5
# Works
g(FUN_with_IPC_lock, SerialParam())
#> The backend is HDF5Array
#> Processing block = 1 on PID = 5740
#> Processing block = 2 on PID = 5740
#> Processing block = 3 on PID = 5740
#> Processing block = 4 on PID = 5740
#> Processing block = 5 on PID = 5740
#> Processing block = 6 on PID = 5740
#> Processing block = 7 on PID = 5740
#> Processing block = 8 on PID = 5740
#> Processing block = 9 on PID = 5740
#> Processing block = 10 on PID = 5740
#> <100 x 10> DelayedMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]     1   101   201   301   .   601   701   801   901
#>   [2,]     2   102   202   302   .   602   702   802   902
#>   [3,]     3   103   203   303   .   603   703   803   903
#>   [4,]     4   104   204   304   .   604   704   804   904
#>   [5,]     5   105   205   305   .   605   705   805   905
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    96   196   296   396   .   696   796   896   996
#>  [97,]    97   197   297   397   .   697   797   897   997
#>  [98,]    98   198   298   398   .   698   798   898   998
#>  [99,]    99   199   299   399   .   699   799   899   999
#> [100,]   100   200   300   400   .   700   800   900  1000
# Works: the IPC lock does its job!
g(FUN_with_IPC_lock, MulticoreParam(2L))
#> The backend is HDF5Array
#> <100 x 10> DelayedMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]     1   101   201   301   .   601   701   801   901
#>   [2,]     2   102   202   302   .   602   702   802   902
#>   [3,]     3   103   203   303   .   603   703   803   903
#>   [4,]     4   104   204   304   .   604   704   804   904
#>   [5,]     5   105   205   305   .   605   705   805   905
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    96   196   296   396   .   696   796   896   996
#>  [97,]    97   197   297   397   .   697   797   897   997
#>  [98,]    98   198   298   398   .   698   798   898   998
#>  [99,]    99   199   299   399   .   699   799   899   999
#> [100,]   100   200   300   400   .   700   800   900  1000
# Works: the IPC lock does its job!
g(FUN_with_IPC_lock, SnowParam(2L))
#> The backend is HDF5Array
#> Loading required package: HDF5Array
#> Loading required package: rhdf5
#> Processing block = 1 on PID = 5755
#> Processing block = 2 on PID = 5755
#> Processing block = 3 on PID = 5755
#> Processing block = 4 on PID = 5755
#> Processing block = 5 on PID = 5755
#> Loading required package: HDF5Array
#> Loading required package: rhdf5
#> Processing block = 6 on PID = 5761
#> Processing block = 7 on PID = 5761
#> Processing block = 8 on PID = 5761
#> Processing block = 9 on PID = 5761
#> Processing block = 10 on PID = 5761
#> <100 x 10> DelayedMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]     1   101   201   301   .   601   701   801   901
#>   [2,]     2   102   202   302   .   602   702   802   902
#>   [3,]     3   103   203   303   .   603   703   803   903
#>   [4,]     4   104   204   304   .   604   704   804   904
#>   [5,]     5   105   205   305   .   605   705   805   905
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    96   196   296   396   .   696   796   896   996
#>  [97,]    97   197   297   397   .   697   797   897   997
#>  [98,]    98   198   298   398   .   698   798   898   998
#>  [99,]    99   199   299   399   .   699   799   899   999
#> [100,]   100   200   300   400   .   700   800   900  1000

# Only SerialParam() works for the in-memory backend. The in-memory backend,
# an arrayRealizationSink, is implemented as an array inside an environment.
setRealizationBackend(NULL)
# Works
g(FUN_with_IPC_lock, SerialParam())
#> The backend is NULL
#> Processing block = 1 on PID = 5740
#> Processing block = 2 on PID = 5740
#> Processing block = 3 on PID = 5740
#> Processing block = 4 on PID = 5740
#> Processing block = 5 on PID = 5740
#> Processing block = 6 on PID = 5740
#> Processing block = 7 on PID = 5740
#> Processing block = 8 on PID = 5740
#> Processing block = 9 on PID = 5740
#> Processing block = 10 on PID = 5740
#> <100 x 10> DelayedMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]     1   101   201   301   .   601   701   801   901
#>   [2,]     2   102   202   302   .   602   702   802   902
#>   [3,]     3   103   203   303   .   603   703   803   903
#>   [4,]     4   104   204   304   .   604   704   804   904
#>   [5,]     5   105   205   305   .   605   705   805   905
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    96   196   296   396   .   696   796   896   996
#>  [97,]    97   197   297   397   .   697   797   897   997
#>  [98,]    98   198   298   398   .   698   798   898   998
#>  [99,]    99   199   299   399   .   699   799   899   999
#> [100,]   100   200   300   400   .   700   800   900  1000
# Doesn't work: sink isn't filled. The IPC mutex isn't doing its job?
g(FUN_with_IPC_lock, MulticoreParam(2L))
#> The backend is NULL
#> <100 x 10> DelayedMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [2,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [3,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [4,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [5,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>  [97,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>  [98,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>  [99,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#> [100,]    NA    NA    NA    NA   .    NA    NA    NA    NA
# Doesn't work: sink isn't filled. The IPC mutex isn't doing its job?
g(FUN_with_IPC_lock, SnowParam(2L))
#> The backend is NULL
#> Processing block = 6 on PID = 5777
#> Processing block = 7 on PID = 5777
#> Processing block = 8 on PID = 5777
#> Processing block = 9 on PID = 5777
#> Processing block = 10 on PID = 5777
#> Processing block = 1 on PID = 5771
#> Processing block = 2 on PID = 5771
#> Processing block = 3 on PID = 5771
#> Processing block = 4 on PID = 5771
#> Processing block = 5 on PID = 5771
#> <100 x 10> DelayedMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [2,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [3,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [4,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>   [5,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>  [97,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>  [98,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#>  [99,]    NA    NA    NA    NA   .    NA    NA    NA    NA
#> [100,]    NA    NA    NA    NA   .    NA    NA    NA    NA

# Only SerialParam() works for the RleArray backend.
setRealizationBackend("RleArray")
# Works
g(FUN_with_IPC_lock, SerialParam())
#> The backend is RleArray
#> Processing block = 1 on PID = 5740
#> Processing block = 2 on PID = 5740
#> Processing block = 3 on PID = 5740
#> Processing block = 4 on PID = 5740
#> Processing block = 5 on PID = 5740
#> Processing block = 6 on PID = 5740
#> Processing block = 7 on PID = 5740
#> Processing block = 8 on PID = 5740
#> Processing block = 9 on PID = 5740
#> Processing block = 10 on PID = 5740
#> <100 x 10> RleMatrix object of type "integer":
#>         [,1]  [,2]  [,3]  [,4] ...  [,7]  [,8]  [,9] [,10]
#>   [1,]     1   101   201   301   .   601   701   801   901
#>   [2,]     2   102   202   302   .   602   702   802   902
#>   [3,]     3   103   203   303   .   603   703   803   903
#>   [4,]     4   104   204   304   .   604   704   804   904
#>   [5,]     5   105   205   305   .   605   705   805   905
#>    ...     .     .     .     .   .     .     .     .     .
#>  [96,]    96   196   296   396   .   696   796   896   996
#>  [97,]    97   197   297   397   .   697   797   897   997
#>  [98,]    98   198   298   398   .   698   798   898   998
#>  [99,]    99   199   299   399   .   699   799   899   999
#> [100,]   100   200   300   400   .   700   800   900  1000
# Doesn't work: Don't understand why.
g(FUN_with_IPC_lock, MulticoreParam(2L))
#> The backend is RleArray
#> Error in validObject(.Object): invalid class "ChunkedRleArraySeed" object: 
#>     length of object data [0] does not match object dimensions
#>     [product 1000]
# Doesn't work: don't understand why I get this error.
g(FUN_with_IPC_lock, SnowParam(2L))
#> The backend is RleArray
#> Processing block = 6 on PID = 5793
#> Processing block = 7 on PID = 5793
#> Processing block = 8 on PID = 5793
#> Processing block = 9 on PID = 5793
#> Processing block = 10 on PID = 5793
#> Processing block = 1 on PID = 5787
#> Processing block = 2 on PID = 5787
#> Processing block = 3 on PID = 5787
#> Processing block = 4 on PID = 5787
#> Processing block = 5 on PID = 5787
#> Error in validObject(.Object): invalid class "ChunkedRleArraySeed" object: 
#>     length of object data [0] does not match object dimensions
#>     [product 1000]

Created on 2018-05-21 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  tz       America/New_York            
#>  date     2018-05-21
#> Packages -----------------------------------------------------------------
#>  package      * version date       source        
#>  backports      1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base         * 3.5.0   2018-04-24 local         
#>  BiocGenerics * 0.27.0  2018-05-01 Bioconductor  
#>  BiocParallel * 1.15.3  2018-05-11 Bioconductor  
#>  compiler       3.5.0   2018-04-24 local         
#>  datasets     * 3.5.0   2018-04-24 local         
#>  DelayedArray * 0.7.0   2018-05-01 Bioconductor  
#>  devtools       1.13.5  2018-02-18 CRAN (R 3.5.0)
#>  digest         0.6.15  2018-01-28 CRAN (R 3.5.0)
#>  evaluate       0.10.1  2017-06-24 CRAN (R 3.5.0)
#>  graphics     * 3.5.0   2018-04-24 local         
#>  grDevices    * 3.5.0   2018-04-24 local         
#>  HDF5Array    * 1.9.0   2018-05-01 Bioconductor  
#>  htmltools      0.3.6   2017-04-28 CRAN (R 3.5.0)
#>  IRanges      * 2.15.13 2018-05-20 Bioconductor  
#>  knitr          1.20    2018-02-20 CRAN (R 3.5.0)
#>  magrittr       1.5     2014-11-22 CRAN (R 3.5.0)
#>  matrixStats  * 0.53.1  2018-02-11 CRAN (R 3.5.0)
#>  memoise        1.1.0   2017-04-21 CRAN (R 3.5.0)
#>  methods      * 3.5.0   2018-04-24 local         
#>  parallel     * 3.5.0   2018-04-24 local         
#>  Rcpp           0.12.17 2018-05-18 CRAN (R 3.5.0)
#>  rhdf5        * 2.25.0  2018-05-01 Bioconductor  
#>  Rhdf5lib       1.3.1   2018-05-17 Bioconductor  
#>  rmarkdown      1.9     2018-03-01 CRAN (R 3.5.0)
#>  rprojroot      1.3-2   2018-01-03 CRAN (R 3.5.0)
#>  S4Vectors    * 0.19.5  2018-05-20 Bioconductor  
#>  snow           0.4-2   2016-10-14 CRAN (R 3.5.0)
#>  stats        * 3.5.0   2018-04-24 local         
#>  stats4       * 3.5.0   2018-04-24 local         
#>  stringi        1.2.2   2018-05-02 CRAN (R 3.5.0)
#>  stringr        1.3.1   2018-05-10 CRAN (R 3.5.0)
#>  tools          3.5.0   2018-04-24 local         
#>  utils        * 3.5.0   2018-04-24 local         
#>  withr          2.1.2   2018-03-15 CRAN (R 3.5.0)
#>  yaml           2.1.19  2018-05-01 CRAN (R 3.5.0)

Bug in makeCappedVolumeBox() results in zero-column volumn box

suppressPackageStartupMessages(library(DelayedArray))

# What triggered the error
makeCappedVolumeBox(maxvol = 1000000L, maxdim = c(24587696L, 18L), shape = "scale")
#> [1] 1168752       0

# Some exploring to see when you end up with boxes with zero columns
nc <- seq.int(1, 50)
names(nc) <- nc
sapply(nc, function(i) {
  makeCappedVolumeBox(maxvol = 1000000L, maxdim = c(24587696L, i), shape = "scale")[2]
})
#>  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
#>  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1 
#> 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
#>  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

Created on 2018-09-27 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2018-09-27
#> Packages -----------------------------------------------------------------
#>  package      * version date       source        
#>  backports      1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base         * 3.5.1   2018-07-05 local         
#>  BiocGenerics * 0.27.1  2018-06-17 Bioconductor  
#>  BiocParallel * 1.15.12 2018-09-13 Bioconductor  
#>  compiler       3.5.1   2018-07-05 local         
#>  datasets     * 3.5.1   2018-07-05 local         
#>  DelayedArray * 0.7.41  2018-09-14 Bioconductor  
#>  devtools       1.13.6  2018-06-27 CRAN (R 3.5.0)
#>  digest         0.6.17  2018-09-12 CRAN (R 3.5.1)
#>  evaluate       0.11    2018-07-17 CRAN (R 3.5.0)
#>  graphics     * 3.5.1   2018-07-05 local         
#>  grDevices    * 3.5.1   2018-07-05 local         
#>  htmltools      0.3.6   2017-04-28 CRAN (R 3.5.0)
#>  IRanges      * 2.15.17 2018-08-24 Bioconductor  
#>  knitr          1.20    2018-02-20 CRAN (R 3.5.0)
#>  magrittr       1.5     2014-11-22 CRAN (R 3.5.0)
#>  matrixStats  * 0.54.0  2018-07-23 CRAN (R 3.5.1)
#>  memoise        1.1.0   2017-04-21 CRAN (R 3.5.0)
#>  methods      * 3.5.1   2018-07-05 local         
#>  parallel     * 3.5.1   2018-07-05 local         
#>  Rcpp           0.12.18 2018-07-23 CRAN (R 3.5.1)
#>  rmarkdown      1.10    2018-06-11 CRAN (R 3.5.0)
#>  rprojroot      1.3-2   2018-01-03 CRAN (R 3.5.0)
#>  S4Vectors    * 0.19.19 2018-07-18 Bioconductor  
#>  stats        * 3.5.1   2018-07-05 local         
#>  stats4       * 3.5.1   2018-07-05 local         
#>  stringi        1.2.4   2018-07-20 CRAN (R 3.5.1)
#>  stringr        1.3.1   2018-05-10 CRAN (R 3.5.0)
#>  tools          3.5.1   2018-07-05 local         
#>  utils        * 3.5.1   2018-07-05 local         
#>  withr          2.1.2   2018-03-15 CRAN (R 3.5.0)
#>  yaml           2.2.0   2018-07-25 CRAN (R 3.5.1)

I initially hit this when running the following:

dim(x)
#> [1] 24587696       18
saveHDF5SummarizedExperiment(x)
#> Start writing assay 1/3 to '/scratch/temp/5531374.1.cegs2.q/Rtmp0WGrOh/test/assays.h5':
#> Error in chunkdim(x) :
#>   The "chunkdim" method for HDF5RealizationSink objects returned an
#>   integer vector with illegal zeros. chunkdim() should always return an
#>   integer vector with nonzero values unless the zero values correspond to
#>   dimensions in 'x' that are also zero. Please contact the author of the
#>   HDF5RealizationSink class (defined in the HDF5Array package) about this
#>   and point him/her to the man page for extract_array() in the
#>   DelayedArray package (?extract_array).

Reshaping Array

First, thanks for the package. It's exactly what we needed. We're working on neuroimaging data: https://github.com/muschellij2/niftiArray. We're trying to do statistics over multiple 3D arrays. I believe this doesn't fit well with DelayedMatrixStats as matrixStats requires 2D matrices.

Similar to #30 and the comment here #30 (comment), I was wondering about "reshaping" an array. Really, I want to be able to vectorize a DelayedArray or convert a 4D DelayedArray to a 2D matrix where the first 3 dimensions are the rows of the matrix and the 4th dimension is the columns. I can realize the array, convert to a matrix, and then create a DelayedArray or DelayedMatrix for the current solution.

Ideally, I'd be able to do this without realizing the array into memory. I'm not sure if HDF5 (the backend we're using) would support this, but we would it useful. Trying to set dim or change the dimensions fails due the (correct) checks on DelayedMatrix. Just wondering if it were possible, if it makes sense with HDF5, and if both are true, then would it be possible to implement? I can send a reprex if necessary, but the setting of dimensions is well documented.

generic for deep copying

I haven't found the specification regarding to the generic interface (e.g. clone or deep_copy generics) for implementing the deep copy of DelayedArray object with the file backend.

Bug in rowSums,DelayedMatrix-method

Calling rowSums() on a large-ish DelayedMatrix leads to a serialization/forking/memory issue on macOS and Linux

library(DelayedArray)
x <- DelayedArray(matrix(1L, nrow = 10000000, ncol = 100))

# Errors
rowSums(x)

On macOS (16GB RAM) the error is:

Error: vector memory exhausted (limit reached?)

On Linux (20GB RAM) the error is:

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Error: failed to stopSOCKclustercluster: error writing to connection
Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
# Then "Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal" repeats indefinitely

I thought it might be a more general issue with blockApply() and its use of BiocParallel, but I haven't been able to trigger the problem in some brief testing. For example, using colSums() or blockApply()-ing max() over individual columns or rows of x worked fine.

Implicitly controlling the grid for blockApply

If grid is not otherwise specified, blockApply calls defaultGrid to determine the size of the blocks of the matrix to be extracted in the block processing mechanism. There are two obvious questions here:

  1. Is it possible for the default grid choice to be aware of the internal chunking scheme (for HDF5Matrix and RLEMatrix objects)? I recall having these discussions with @hpages and I believe that this is on the agenda, but I'll just mention it here anyway.
  2. Is it possible for users to implicitly override the default grid choice, e.g., via global options? One can imagine that blockApply is called within some internal functions where it would be inconvenient to have to specify the grid as a user-visible parameter in the top-level exported function.

With respect to point number 2, my real issue is that beachmat relies on defaultGrid to choose the block size for realizing chunks of a DelayedMatrix object for data access. Unfortunately, it's not possible to pass any grid specifications explicitly to the beachmat API, as this would not be representation-agnostic. All information must either be present in or derivable from the matrix object (which would be the case for back-ends that have an inherent chunking, but not obvious for other types); or it should be extracted from global options, hence my request in the second point above.

Should I be using @seed instead of seed()?

With the change to the seed() getter in 3551306, should I now be using x@seed instead of seed(x) to check whether a DelayedArray is "simple" (i.e. the seed is just a matrix, Matrix, HDF5ArraySeed, SolidRleArraySeed, etc.) rather than a DelayedOp?

Memory issue with subassignment

The new feature introduced by #32 seems to have memory leak issue, as the example shown below the DelayedArray object gets bigger after each iteration of subassignment

library(HDF5Array)

M <- matrix(1:100, nrow=10)
tmp <- tempfile()
h5write(M, tmp, "data")
hm <- HDF5Array(tmp, "data")
object.size(hm)
#> 2144 bytes
set.seed(123)
for(k in 1:10)
{
  i <- sample(10,10)
  j <- sample(10,10)
  hm[i, j][hm[i, j] >0] <- 0
}
  
object.size(hm)
#> 10593240 bytes

Created on 2018-10-01 by the reprex package (v0.2.1)

unable to change dim

I am trying to adapt the in-memory-matrix code to support DelayedArray, but run into the following error when it tries to coerce 2d matrix to 3d array

> a <- array(runif(10), dim=c(2, 5))
> A <- DelayedArray(a)
> dim(a) <- c(dim(a), 1)
> a
, , 1

          [,1]      [,2]      [,3]      [,4]       [,5]
[1,] 0.8584375 0.3051332 0.5198660 0.1106134 0.87002666
[2,] 0.4446721 0.6195048 0.3658824 0.1589915 0.01943329

> dim(A) <- c(dim(A), 1)
Error in .normalize_dim_replacement_value(value, x_dim) : 
  too many dimensions supplied
> packageVersion("DelayedArray")
[1] ‘0.7.41

dimnames setting on a DelayedArray should coerce to character

At some point, this error message appeared:

library(DelayedArray)
A <- DelayedArray(matrix(100, 10, 20))
colnames(A) <- seq_len(20)
## Error in validObject(.Object) : invalid class “DelayedDimnames” object:
##     each list element in 'x@dimnames' must be NULL, or a character vector
##     of length the extent of the corresponding dimension, or special value
##     -1

Well, okay. But it's possible to do this with every other matrix:

B <- matrix(100, 10, 20)
colnames(B) <- seq_len(20)

library(Matrix)
C <- rsparsematrix(10, 20, 0.1)
colnames(C) <- seq_len(20)

The difference in behavior seems unnecessary and breaks code that would otherwise work with DA objects, e.g., scater::plotHeatmap (picked up by @j-andrews7).

Session info
R version 3.6.0 Patched (2019-05-10 r76483)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/luna/Software/R/R-3-6-branch-dev/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] Matrix_1.2-17       DelayedArray_0.11.5 BiocParallel_1.19.2
[4] IRanges_2.19.16     S4Vectors_0.23.23   BiocGenerics_0.31.6
[7] matrixStats_0.55.0

loaded via a namespace (and not attached):
[1] compiler_3.6.0  grid_3.6.0      lattice_0.20-38

Slow matrix multiplication using DelayedArray

How could I achieve the same speed for matrix multiplication between two DelayedArray with matrix seed?

library(DelayedArray)
set.seed(123)
n = 100
p = 1000

A_mat = matrix(rpois(n*p, lambda = 0.1), nrow = p)
B_mat = matrix(rpois(n*p, lambda = 0.1), nrow = n)
A_mat_da = DelayedArray::DelayedArray(A_mat)
B_mat_da = DelayedArray::DelayedArray(B_mat)
system.time(A_mat %*% B_mat)
#   user  system elapsed 
#  0.059   0.003   0.063 
system.time(A_mat_da %*% B_mat_da)
#   user  system elapsed 
#  3.075   1.762   1.307 

I understand this is due to block processing, but even if I provide a BPPARAM, there is still no increase in the speed?

DelayedArray::setAutoBPPARAM(BPPARAM = BiocParallel::MulticoreParam(workers = 5))
DelayedArray::getAutoBPPARAM()
system.time(A_mat_da %*% B_mat_da)
#    user  system elapsed 
#  2.975   1.724   1.274 

sub-assignment for DelayedArrays doesn't follow base API

Hi Hervé,

Just wondering if this is expected behaviour for subassignment on multi dimensional arrays?

suppressedPackageStartupMessages(library(DelayedArray))
seed <- array(NA_real_, dim = c(10, 2, 5))
f <- DelayedArray(seed)
# subassignment on regular array
seed[,,1] <- matrix(runif(20), ncol = 2)
# incompatible dimensions?
f[,,1] <- matrix(runif(20), ncol = 2)
#> Error in new_DelayedSubassign(x@seed, Nindex, value): dimensions of replacement value are incompatible with the number
#>   of array elements to replace

Created on 2019-09-19 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2019-09-19                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package      * version date       lib source        
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
#>  backports      1.1.4   2019-04-10 [1] CRAN (R 3.6.0)
#>  BiocGenerics * 0.31.5  2019-07-03 [1] Bioconductor  
#>  BiocParallel * 1.19.2  2019-08-07 [1] Bioconductor  
#>  callr          3.3.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  cli            1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
#>  crayon         1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
#>  DelayedArray * 0.11.4  2019-07-03 [1] Bioconductor  
#>  desc           1.2.0   2018-05-01 [1] CRAN (R 3.6.0)
#>  devtools       2.2.0   2019-09-07 [1] CRAN (R 3.6.0)
#>  digest         0.6.20  2019-07-04 [1] CRAN (R 3.6.0)
#>  DT             0.8     2019-08-07 [1] CRAN (R 3.6.0)
#>  ellipsis       0.2.0.1 2019-07-02 [1] CRAN (R 3.6.0)
#>  evaluate       0.14    2019-05-28 [1] CRAN (R 3.6.0)
#>  fs             1.3.1   2019-05-06 [1] CRAN (R 3.6.0)
#>  glue           1.3.1   2019-03-12 [1] CRAN (R 3.6.0)
#>  highr          0.8     2019-03-20 [1] CRAN (R 3.6.0)
#>  htmltools      0.3.6   2017-04-28 [1] CRAN (R 3.6.0)
#>  htmlwidgets    1.3     2018-09-30 [1] CRAN (R 3.6.0)
#>  IRanges      * 2.19.16 2019-09-13 [1] Bioconductor  
#>  knitr          1.24    2019-08-08 [1] CRAN (R 3.6.0)
#>  lattice        0.20-38 2018-11-04 [1] CRAN (R 3.6.1)
#>  magrittr       1.5     2014-11-22 [1] CRAN (R 3.6.0)
#>  Matrix         1.2-17  2019-03-22 [1] CRAN (R 3.6.1)
#>  matrixStats  * 0.55.0  2019-09-07 [1] CRAN (R 3.6.0)
#>  memoise        1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
#>  pkgbuild       1.0.5   2019-08-26 [1] CRAN (R 3.6.0)
#>  pkgload        1.0.2   2018-10-29 [1] CRAN (R 3.6.0)
#>  prettyunits    1.0.2   2015-07-13 [1] CRAN (R 3.6.0)
#>  processx       3.4.1   2019-07-18 [1] CRAN (R 3.6.0)
#>  ps             1.3.0   2018-12-21 [1] CRAN (R 3.6.0)
#>  R6             2.4.0   2019-02-14 [1] CRAN (R 3.6.0)
#>  Rcpp           1.0.2   2019-07-25 [1] CRAN (R 3.6.1)
#>  remotes        2.1.0   2019-06-24 [1] CRAN (R 3.6.0)
#>  rlang          0.4.0   2019-06-25 [1] CRAN (R 3.6.0)
#>  rmarkdown      1.15    2019-08-21 [1] CRAN (R 3.6.0)
#>  rprojroot      1.3-2   2018-01-03 [1] CRAN (R 3.6.0)
#>  S4Vectors    * 0.23.23 2019-09-13 [1] Bioconductor  
#>  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
#>  stringi        1.4.3   2019-03-12 [1] CRAN (R 3.6.0)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 3.6.0)
#>  testthat       2.2.1   2019-07-25 [1] CRAN (R 3.6.0)
#>  usethis        1.5.1   2019-07-04 [1] CRAN (R 3.6.0)
#>  withr          2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
#>  xfun           0.9     2019-08-21 [1] CRAN (R 3.6.0)
#>  yaml           2.2.0   2018-07-25 [1] CRAN (R 3.6.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

DelayedArray backend fails on repeated invocation of devtools::test

Our DelayedArray backends (sampoll/rhdf5client2/RHDF5Array and its predecessor sampoll/rhdf5client/H5S_Array) fail tests if devtools::test is invoked repeatedly. It looks like this:

Testing rhdf5client2
✔ | OK F W S | Context
✔ | 2 | Sources [0.4 s]
✔ | 2 | Files [1.1 s]
✔ | 4 | Datasets [7.3 s]
✔ | 2 | DelayedArray subclass RHDF5Array [9.4 s]

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 18.3 s

OK: 10
Failed: 0
Warnings: 0
Skipped: 0

test(".")
Loading rhdf5client2
in method for ‘show’ with signature ‘"Dataset"’: no definition for class “Dataset”
Testing rhdf5client2
✔ | OK F W S | Context
✔ | 2 | Sources [0.4 s]
✔ | 2 | Files [1.3 s]
✔ | 4 | Datasets [7.4 s]
✖ | 0 1 | DelayedArray subclass RHDF5Array [0.1 s]
────────────────────────────────────────────────────────────────────────────────
test.R:53: error: DelayedArray can be instantiated and accessed
invalid class “RHDF5Matrix” object:
'x' must have exactly 2 dimensions
1: RHDF5Array("http://h5s.channingremotedata.org:5000", "h5serv", "tenx_full.h5s.channingremotedata.org",
"/newassay001") at /Users/spollack/TMP-TMP/rhdf5client2/tests/testthat/test.R:53
2: DelayedArray(RHDF5ArraySeed(endpoint, svrtype, domain, dsetname)) at /Users/spollack/TMP-TMP/rhdf5client2/R/RHDF5Array.R:100
3: DelayedArray(RHDF5ArraySeed(endpoint, svrtype, domain, dsetname))
4: new_DelayedArray(seed, Class = "RHDF5Array") at /Users/spollack/TMP-TMP/rhdf5client2/R/RHDF5Array.R:94
5: new2(Class, seed = seed)
6: new(...)
7: initialize(value, ...)
8: initialize(value, ...)
9: validObject(.Object)
10: stop(msg, ": ", errors, domain = NA)
────────────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 9.2 s

OK: 8
Failed: 1
Warnings: 0
Skipped: 0

I believe in you!
Warning message:
In .removePreviousCoerce(class1, class2, where, prevIs) :
methods currently exist for coercing from “RHDF5Matrix” to “DelayedArray”; they will be replaced.
===== <end of output, discussion recommences below> =====

After quite a bit of stack-tracing and source-code reading, I have only the following insights:

devtools::test calls devtools::load_all() (Note: Running testthat::test_dir("tests/testthat/") repeatedly works fine.)

load_all() causes setClass("RHDF5Matrix", contains=c("DelayedArray", "RHDF5Array")) to be executed. From setClass, methods::setIs is invoked for both superclasses. On the second invocation of load_all(), setIs removes the previous coercion method and replaces it with one which fails validity check. That's where the warning is printed out.

This is kind of a hypothesis: because of the multiple inheritance, there is a right way and a wrong way to coerce a "RHDF5Matrix" to a "DelayedArray". The right way is via RHDF5Array and the wrong way is via DelayedMatrix. If it goes via DelayedMatrix, the dim method returns NULL.

methods::validityMethod runs the validity method on each superclass of "RHDF5Matrix." The coercion to "DelayedMatrix" returns NULL for dim, which is not of dimension 2, so the validity function fails.

A reasonable solution is "just don't invoke devtools::test repeatedly." We are OK with this solution, but apprehensive that the Bioconductor build scripts will do it.

Any advice or insight would be very helpful.

Adding an mapply-like blockApply() function to the API

Hi Hervé,

In a recent email you wrote:

At some point the old block-processing functions (i.e. block_APPLY(), block_MAPPLY(), and
block_APPLY_and_COMBINE()) will go away.

I'd like to request that a mapply-like blockApply() function is part of the exported API. In minfi, we need to iterate over multiple (conformable) array-like objects in parallel for several of the preprocessing routines (e.g., converting the red & green channels from the array to methylated/unmethylated calls).

The return value is a large matrix-like object, so it'd be great if this mapply-like blockApply() also supported the sink or BACKEND argument (as in #10)

Linear indexing causes trouble when an index is larger than .Machine$integer.max

[Originally reported by @demuellae here ]

Linear indexing causes trouble when an index is larger than .Machine$integer.max:

Error: subscript contains NAs
In addition: Warning message: In NSBS(i, x, exact = exact,
strict.upper.bound = !allow.append, : NAs introduced by coercion to integer range

I am assuming because indices larger than .Machine$integer.max get converted into NAs somewhere downstream (sorry, have not yet been able to find out where exactly)

Error in extract_array when print the extended backend

source code of extension is here https://github.com/RGLab/mbenchmark/blob/master/R/bmarrayseed.R
To reproduce, install mbenchmark and run

library(mbenchmark)
mat <- matrix(seq_len(2e4), nrow = 1e2, ncol =2e2)
dims <- dim(mat)

#bigmemory
library(bigmemory)
bm.file <- tempfile()
suppressMessages(bm <- as.big.matrix(mat, backingfile = basename(bm.file), backingpath = dirname(bm.file)))
#wrap it into DelayedArray
library(DelayedArray)
bmseed <- BMArraySeed(bm)
bm <- DelayedArray(bmseed)
> as.matrix(bm[1:2,1:2])
     [,1] [,2]
[1,]    1  101
[2,]    2  102
> bm
Error in extract_array(x@seed, index) : 
  The "extract_array" method for BMArraySeed objects returned an
  array with incorrect dimensions. Please contact the author of
  the BMArraySeed class (defined in the mbenchmark package) about
  this and point him/her to the man page for extract_array() in
  the DelayedArray package (?extract_array).

type accessor not found, when DelayedArray and Biostrings package are loaded

Hi

the type accessor seems to be block, if the Biostrings package is loaded in the same session as DelayedArray. Not loading Biostrings or loading Biostrings before DelayedArray solves the problem.

library(DelayedArray)
library(Biostrings)
a <- array(runif(1500000), dim=c(10000, 30, 5))
A <- DelayedArray(a)
type(A)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'type' for signature '"DelayedArray"'
sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 18362)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
#> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.1252    
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] Biostrings_2.52.0    XVector_0.24.0       DelayedArray_0.10.0 
#> [4] BiocParallel_1.17.18 IRanges_2.19.10      S4Vectors_0.23.17   
#> [7] BiocGenerics_0.31.4  matrixStats_0.54.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.1      lattice_0.20-38 digest_0.6.19   grid_3.6.0     
#>  [5] magrittr_1.5    evaluate_0.14   highr_0.8       zlibbioc_1.30.0
#>  [9] stringi_1.4.3   Matrix_1.2-17   rmarkdown_1.13  tools_3.6.0    
#> [13] stringr_1.4.0   xfun_0.8        yaml_2.2.0      compiler_3.6.0 
#> [17] htmltools_0.3.6 knitr_1.23

Can this behavior be circumvented?

Type-preserving alternative to read_block

This request is mainly motivated by the current DelayedMatrix matrix multiplication code, which does a read_block call to obtain a dense matrix for actual multiplication. This prevents us from taking advantage of underlying features of the matrix representation that might enable faster multiplication with native methods, e.g., dgCMatrix with Matrix::%*%. I guess this touches on some stuff that SparseArraySeed was designed for, but in a more general fashion (e.g., to handle remote matrices with operations executed server-side).

This would be resolved by having a read_block alternative that tries to preserve the underlying nature of the matrix. The easiest way to do this is to allow extract_array to return something other than a dense ordinary matrix, e.g., by passing something like dense=FALSE to extract_array. I know that an arbitrary matrix representation may not support various delayed operations, whereas a dense matrix always will. In such cases, one could use selectMethod() to check whether the delayed operation dispatch is possible for the matrix representation; execute it if it is; and convert it into a dense matrix if it isn't.

Does this sound reasonable, or am I missing something?

Adding 'sink' arg to blockApply()

Hi Hervé,

Will blockApply() be gaining a sink or BACKEND argument? This can be convenient when blockApply()-ing over a matrix-like object to create a normalised matrix-like object, for example.

Thanks!

Please consider adding RUnit to Suggests

DelayedArray:::.test() calls BiocGenerics:::testPackage() which calls library("RUnit", quietly = TRUE). I Since RUnit is only a suggested package of BiocGenerics, I think this causes install.packages() on DelayedArray to fail picking up RUnit. This, in turn, causes automated reverse package dependency checks on DelayedArray to fail with:

     ERROR
    Running the tests in ‘tests/run_unitTests.R’ failed.
    Last 13 lines of output:
      
      The following objects are masked from 'package:matrixStats':
      
          colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
      
      The following objects are masked from 'package:base':
      
          aperm, apply
      
      [1] TRUE
      > DelayedArray:::.test()
      Error in library("RUnit", quietly = TRUE) : 
        there is no package called 'RUnit'
      Calls: <Anonymous> -> <Anonymous> -> library
      Execution halted

I observe this when running revdep checks on matrixStats. Adding RUnit to Suggests should fix this.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.