irlba's Introduction

irlba

Implicitly-restarted Lanczos methods for fast truncated singular value decomposition of sparse and dense matrices (also referred to as partial SVD). IRLBA stands for Augmented, Implicitly Restarted Lanczos Bidiagonalization Algorithm. The package provides the following functions (see help on each for details and examples).

  • irlba() partial SVD function
  • ssvd() l1-penalized matrix decomposition for sparse PCA (based on Shen and Huang's algorithm)
  • prcomp_irlba() principal components function similar to the prcomp function in stats package for computing the first few principal components of large matrices
  • svdr() alternate partial SVD function based on randomized SVD (see also the rsvd package by N. Benjamin Erichson for an alternative implementation)
  • partial_eigen() a very limited partial eigenvalue decomposition for symmetric matrices (see the RSpectra package for more comprehensive truncated eigenvalue decomposition)

The help page for each function includes extensive documentation and examples. Also see the package vignette, vignette("irlba", package="irlba").
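For example, a minimal rank-3 partial SVD looks like this (a small sketch using a random matrix):

library(irlba)
set.seed(1)
A <- matrix(rnorm(5000), 100, 50)
S <- irlba(A, nv=3)      # truncated SVD retaining 3 singular triplets
S$d                      # the three largest singular values
svd(A)$d[1:3]            # agrees with the corresponding values from svd()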

An overview web page is here: https://bwlewis.github.io/irlba/.

New in 2.3.2

  • Fixed a regression in prcomp_irlba() discovered by Xiaojie Qiu, see #25, and other related problems reported in #32.
  • Added rchk testing to pre-CRAN submission tests.
  • Fixed a sign bug in ssvd() found by Alex Poliakov.

What's new in Version 2.3.1?

  • Fixed an irlba() bug associated with centering (PCA), see #21.
  • Fixed irlba() scaling to conform to scale, see #22.
  • Improved prcomp_irlba() from a suggestion by N. Benjamin Erichson, see #23.
  • Significantly changed/improved svdr() convergence criterion.
  • Added a version of Shen and Huang's Sparse PCA/SVD L1-penalized matrix decomposition (ssvd()).
  • Fixed valgrind errors.

Deprecated features

I will remove partial_eigen() in a future version. As its documentation states, users are better off using the RSpectra package for eigenvalue computations (although not generally for singular value computations).
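For example, a truncated eigenvalue decomposition of a symmetric matrix with RSpectra looks like this (a small sketch; see RSpectra's documentation for details):

library(RSpectra)
set.seed(1)
S <- crossprod(matrix(rnorm(2000), 100, 20))   # a 20 x 20 symmetric matrix
eigs_sym(S, k=3)$values                        # the three largest eigenvalues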

The mult argument is deprecated and will be removed in a future version. We now recommend simply defining a custom class with a custom multiplication operator. The example below illustrates the old and new approaches.

library(irlba)
set.seed(1)
A <- matrix(rnorm(100), 10)

# ------------------ old way ----------------------------------------------
# A custom matrix multiplication function that scales the columns of A
# to unit norm (cf. the scale option).
col_scale <- sqrt(apply(A, 2, crossprod))
mult <- function(x, y)
{
  # check if x is a vector
  if (is.vector(x))
  {
    return((x %*% y) / col_scale)
  }
  # else x is the matrix
  x %*% (y / col_scale)
}
irlba(A, 3, mult=mult)$d
## [1] 1.820227 1.622988 1.067185

# Compare with:
irlba(A, 3, scale=col_scale)$d
## [1] 1.820227 1.622988 1.067185

# Compare with:
svd(sweep(A, 2, col_scale, FUN=`/`))$d[1:3]
## [1] 1.820227 1.622988 1.067185

# ------------------ new way ----------------------------------------------
setClass("scaled_matrix", contains="matrix", slots=c(scale="numeric"))
setMethod("%*%", signature(x="scaled_matrix", y="numeric"), function(x, y) x@.Data %*% (y / x@scale))
setMethod("%*%", signature(x="numeric", y="scaled_matrix"), function(x, y) (x %*% y@.Data) / y@scale)
a <- new("scaled_matrix", A, scale=col_scale)

irlba(a, 3)$d
## [1] 1.820227 1.622988 1.067185

We have learned that using R's existing S4 system is simpler, easier, and more flexible than using custom arguments with idiosyncratic syntax and behavior. We've even used the new approach to implement distributed parallel matrix products for very large problems with amazingly little code.

Wishlist / help wanted...

  • More Matrix classes supported in the fast code path
  • Help improving the solver for singular values in tricky cases (basically, for ill-conditioned problems and especially for the smallest singular values); in general this may require a combination of more careful convergence criteria and use of harmonic Ritz values; Dmitriy Selivanov has proposed alternative convergence criteria in #29 for example.

References

  • Baglama, James, and Lothar Reichel. "Augmented implicitly restarted Lanczos bidiagonalization methods." SIAM Journal on Scientific Computing 27.1 (2005): 19-42.
  • Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. "Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions." (2009).
  • Shen, Haipeng, and Jianhua Z. Huang. "Sparse principal component analysis via regularized low rank matrix approximation." Journal of Multivariate Analysis 99.6 (2008): 1015-1034.
  • Witten, Daniela M., Robert Tibshirani, and Trevor Hastie. "A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis." Biostatistics 10.3 (2009): 515-534.

irlba's People

Contributors

bwlewis, willtownes, zdk123


irlba's Issues

strange behavior on small matrix

Hi! I just installed irlba and ran a test example.

I don't know if this is expected behavior, but it seems strange:

library(Matrix)
library(irlba)

i <- c(1, 3:8)
j <- c(2, 9, 6:10)
x <- 7 * (1:7)
A <- sparseMatrix(i, j, x = x)
l <- irlba(A)
A
round(l$u %*% diag(l$d) %*% t(l$v))

The output:

> A
8 x 10 sparse Matrix of class "dgCMatrix"

[1,] . 7 . . .  .  .  .  .  .
[2,] . . . . .  .  .  .  .  .
[3,] . . . . .  .  .  . 14  .
[4,] . . . . . 21  .  .  .  .
[5,] . . . . .  . 28  .  .  .
[6,] . . . . .  .  . 35  .  .
[7,] . . . . .  .  .  . 42  .
[8,] . . . . .  .  .  .  . 49
> round(l$u %*% diag(l$d) %*% t(l$v))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    0    0     0
[2,]    0    0    0    0    0    0    0    0    0     0
[3,]    0    0    0    0    0    0    0    0   14     0
[4,]    0    0    0    0    0   21    0    0    0     0
[5,]    0    0    0    0    0    0   28    0    0     0
[6,]    0    0    0    0    0    0    0   35    0     0
[7,]    0    0    0    0    0    0    0    0   42     0
[8,]    0    0    0    0    0    0    0    0    0    49

As you can see, the reconstructed version is missing the number 7 in the top row. With R's built-in svd, the reconstruction is perfect. I would have expected irlba to give a very approximate reconstruction, perhaps, since the method is iterative - but not to miss out the smallest value entirely. If not a bug, then something for the FAQ, perhaps?

Thank you for writing this library.

conformable matrix error with big.matrix

There are examples (on Stack Overflow and elsewhere) where irlba() is passed an out-of-core matrix created using the bigmemory/bigalgebra packages. When I try to follow these examples, I receive an error:

library(bigalgebra)
library(irlba)

A <- matrix(runif(20 * 10), 20, 10)

mult <- function(A, B, transpose=FALSE) {
    ## Bigalgebra requires matrix/vector arguments
    if(is.null(dim(B))) B <- cbind(B)

    if(transpose)
        return(cbind((t(B) %*% A)[]))

    cbind((A %*% B)[])
}


> S.ref <- svd(A)
> S.irlba <- irlba(A,  mult = mult)

> S.irlba.bm <- irlba(as.big.matrix(A),  mult = mult)
Error in F - S * V[, j, drop = FALSE] : non-conformable arrays

There is a peculiarity in big.matrix multiplication that yields the transpose of the expected result when a vector is multiplied by the big.matrix:

W<- matrix(runif(20*5),20,5)

## from inside irlba, where the problem occurs, code that defines F
> str( t(as.matrix( mult( t(W[ ,1, drop=FALSE]), A ))) )  # good
 num [1:10, 1] 5.84 5.53 5.95 4.67 6.03 ...

# big.matrix returns transpose of expected result,
> str( t(as.matrix( mult( t(W[ ,1, drop=FALSE]), as.big.matrix(A) ))) )
 num [1, 1:10] 5.84 5.53 5.95 4.67 6.03 ...

# two column multiplication works as expected
> str( t(as.matrix( mult( t(W[ ,1:2, drop=FALSE]), A ))) )  # good
 num [1:10, 1:2] 5.84 5.53 5.95 4.67 6.03 ...
> str( t(as.matrix( mult( t(W[ ,1:2, drop=FALSE]), as.big.matrix(A) ))) ) # good
 num [1:10, 1:2] 5.84 5.53 5.95 4.67 6.03 ...

My crude fix, branches for the big.matrix case:

#   Lanczos process
    while (j <= work)
    {
      j_w <- ifelse(w_dim > 1, j, 1)

      ## workaround for big.matrix problem when handed n x 1 vector,  yields a non-conforming F

      if(class(A)=="big.matrix"){
        if (iscomplex)
        {
          F <- Conj( (as.matrix(mult(Conj(t(W[,j_w,drop=FALSE])), A))))
        }
        else F <- as.matrix(mult(t(W[,j_w,drop=FALSE]), A))

      } else {

        if (iscomplex)
        {
          F <- Conj(t(as.matrix(mult(Conj(t(W[,j_w,drop=FALSE])), A))))
        }
        else F <- t(as.matrix(mult(t(W[,j_w,drop=FALSE]), A)))

      }


Thank you for adapting irlba for the R environment.

Using warm starts.

I'm very interested in the v argument of the irlba function.
I am developing an algorithm that requires computing a partial SVD over many iterations, each time on a matrix that differs only slightly from the one in the previous iteration.
So I expect the SVDs to be very similar between iterations, which is why I would like to use warm starts to improve performance.

I tried:

A <- matrix(0, 5e3, 5e3); A[] <- rnorm(length(A))
print(system.time(
  S <- irlba(A, 10)  
)) # 7 sec
print(system.time(
  S2 <- irlba(A, 10, v = S$v)  
)) # 6 sec

I'm a bit surprised that it is not much faster because I gave it the previous answer.
Am I missing something?

increase default maxiter

I've run across this a few times now in various examples and think it was a mistake to reduce the default value of maxiter. See for instance the nice post from Julia Silge: https://juliasilge.com/blog/tidy-word-vectors/

A future version of the package (2.3.2) will increase the default maximum number of iterations to 1000.
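In the meantime, the limit can be raised per call via the maxit argument (a sketch; A is any matrix):

S <- irlba(A, nv=10, maxit=1000)   # allow up to 1000 iterations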

Make the irlba_c routines available for linking

I would like to use your wonderful irlba in a package I am working on, using constrained SVD as a subroutine in some Rcpp/RcppArmadillo code. For large datasets, the overhead of calling R code from C++ overwhelms the computational savings of irlba.

Would you consider making your C code available to be called by other C routines? I am following the advice in Romain Francois' Stack Overflow post, and he links to the relevant R docs. I am not as clear on how to go about using his hacky solution (capturing the irlba function pointer).

thanks!

Un-deprecate mult?

Is it possible to un-deprecate the mult argument? I would like to use a custom matrix multiplication algorithm (in this case, parallelized with BiocParallel) on matrix representations that already have their own %*% methods. This is currently easy to achieve with mult - but if mult were to become unavailable, I would have to derive a new class for the sole purpose of implementing a new %*% method. This is not always easy in situations where the class of the input matrix is not known in advance. The example in ?irlba subclasses from an ordinary matrix, but one could imagine that the input is a sparse matrix instead, and I'm not sure how to define a single class during package compilation that accommodates all possible input representations.

I guess I could do something like:

setClass("myclass", slots=c(data="list"))

... store the matrix as an element of data, and then define %*% to retrieve the matrix from data when needed. But this seems a lot harder to write and understand compared to just calling mult.
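Filled out, that idea might look like the following sketch (hypothetical code, not part of irlba; depending on the code path, additional methods may be needed):

library(irlba)
setClass("myclass", slots=c(data="list"))
setMethod("dim", "myclass", function(x) dim(x@data$mat))   # irlba queries the dimensions
setMethod("%*%", signature(x="myclass", y="numeric"),
          function(x, y) x@data$mat %*% y)                 # a custom/parallel product would go here
setMethod("%*%", signature(x="numeric", y="myclass"),
          function(x, y) x %*% y@data$mat)

a <- new("myclass", data=list(mat=matrix(rnorm(100), 10)))
irlba(a, 3)$d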

irlba with warm start returns incorrect decomposition

I have an iterative algorithm that:

  1. performs SVD on a matrix,
  2. slightly modifies the original matrix,
  3. performs SVD on the new matrix, and
  4. repeats steps 1-3 until convergence.

Since the modification of the matrix at each iteration is pretty small, I expect the changes to the U,D, V matrices of the SVD to be pretty small. Therefore, I want to use the warm start feature of irlba where I can use the previous SVD as a starting point so that irlba will converge faster. However, when using the warm start approach, it seems that irlba just returns the warm start matrix and gives an incorrect decomposition.

Here is an example:

library(irlba)
#> Loading required package: Matrix

# simulate a simple dataset
set.seed(1)
n = 100
p = 1000
X = matrix(rnorm(n*p), n,p)

# decomposition of original matrix
decomp.init = irlba(X)

# decomposing of related matrix
###########################

# irlba
decmp.irlba = irlba(X * 10)

# svd
decomp.svd = svd(X * 10)

# the singular values are the same by both methods
head(decmp.irlba$d)
#> [1] 414.4095 407.4580 405.0538 401.7232 398.5208
head(decomp.svd$d)
#> [1] 414.4095 407.4580 405.0538 401.7232 398.5208 396.5865

# Since I expect the decomposition of the second matrix (i.e. X * 10)
# to be very similar to that of the first (i.e. X), I want to use
# the decomposition of the first matrix as a warm start for the decomposition
# of the second

# irlba using warm start
decomp.restart = irlba(X * 10, v=decomp.init)

# However, this does not compute the correct singular values
head(decomp.restart$d)
#> [1] 41.44095 40.74580 40.50538 40.17232 39.85208

# In fact it just returns the warm start value without doing any calculations
head(decomp.init$d)
#> [1] 41.44095 40.74580 40.50538 40.17232 39.85208

The cause seems to be this line.

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin19.6.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] irlba_2.3.3 Matrix_1.3-2 RUnit_0.4.32 Rfast_2.0.1
[5] RcppZiggurat_0.1.6 Rcpp_1.0.6 decorrelate_0.0.2

loaded via a namespace (and not attached):
[1] jquerylib_0.1.3 bslib_0.2.4 pillar_1.5.0 compiler_4.0.3
[5] highr_0.8 tools_4.0.3 digest_0.6.27 jsonlite_1.7.2
[9] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.0 lattice_0.20-41
[13] pkgconfig_2.0.3 rlang_0.4.10 reprex_1.0.0 cli_2.3.1
[17] rstudioapi_0.13 yaml_2.2.1 parallel_4.0.3 xfun_0.21
[21] knitr_1.31 sass_0.3.1 fs_1.5.0 vctrs_0.3.6
[25] grid_4.0.3 glue_1.4.2 R6_2.5.0 processx_3.4.5
[29] fansi_0.4.2 rmarkdown_2.7 callr_3.5.1 clipr_0.7.1
[33] magrittr_2.0.1 ps_1.6.0 ellipsis_0.3.1 htmltools_0.5.1.1
[37] assertthat_0.2.1 utf8_1.1.4 crayon_1.4.1

Discrepancy in irlba with center argument

irlba seems to behave inconsistently depending on whether column-centring is performed explicitly or via center. To demonstrate, I've mocked up some single-cell RNA-seq data:

set.seed(1000)
ncells <- 100
ngenes <- 10000
counts <- matrix(as.double(rpois(ncells*ngenes, lambda=100)), ncol=ncells)
centers <- rowMeans(counts)

If I apply irlba on the transposed matrix (i.e., genes are now columns, cells are rows) with explicit centring outside the function or via center, I get substantially different results:

library(irlba)
set.seed(100)
out <- irlba(t(counts - centers), nu=10, nv=10)
head(out$d)
## [1] 1105.339 1091.932 1086.880 1085.875 1083.415 1080.327

set.seed(100)
out2 <- irlba(t(counts), center=centers, nu=10, nv=10)
head(out2$d)
## [1] 3961.623 2629.205 1221.687 1190.174 1170.183 1165.110

I might have expected some small differences due to vagaries of random initialization or numerical precision, but these differences in the singular values seem to be rather large. On a related note, running the following code in a fresh R session results in a segfault ("memory not mapped"):

set.seed(1000)
ncells <- 100
ngenes <- 10000
counts <- matrix(rpois(ncells*ngenes, lambda=100), ncol=ncells)
centers <- rowMeans(counts)
out <- irlba::irlba(t(counts), center=centers, nu=10, nv=10)

Presumably, it's something to do with the integer nature of counts, as coercion to double-precision avoids the problem. Anyway, here's my session information:

R version 3.4.0 Patched (2017-04-24 r72627)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-4-branch_devel/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-4-branch_devel/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] irlba_2.2.1   Matrix_1.2-11

loaded via a namespace (and not attached):
[1] compiler_3.4.0  grid_3.4.0      lattice_0.20-35

Infinite Values for norm2

Hi, for my larger datasets (250,000 x 2000) run with the R code (fastpath=FALSE), I run into the problem that the values in some of the data structures (e.g. V) get so large that the L2 norm (norm2) becomes infinite. Then the comparisons of R and S against eps2 fail, because R or S is infinite. I fixed this problem by scaling by the max of the vector before doing the L2 scaling (example below). Then the code runs to completion. However, I get really large results (e.g. max d is 1e150), which don't match the C implementation. I suspect these large values are themselves the result of a bug.

  V[, 1] <- max_scale(V[, 1])
  V[, 1] <- V[, 1] / norm2(V[, 1])
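Here max_scale rescales by the largest absolute entry; a plausible definition (the report does not include one) is:

  max_scale <- function(v) v / max(abs(v))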

This may be a related issue: when I ran the C version (fastpath=TRUE) yesterday on the same data, I got the error message "BLAS/LAPACK routine 'DLASCL' gave error code -4". It seems that this error arises when there are NA or INF values in the original matrix. I wonder if it can also arise from INF values in the L2 norm computation. Strangely, I ran the same thing today and didn't get this error, so if this is not an issue others have, please ignore.

Thanks for looking into this!

Cholmod error 'invalid xtype'

I use the irlba and Matrix packages to compute collaborative filtering scores in production. After I upgraded irlba 2.0.0 -> 2.1.1, I systematically get the error:

irlba(X[users.train, ], nu = rank, nv = rank, maxit = 50) : 
  Cholmod error 'invalid xtype' at file:../MatrixOps/cholmod_sdmult.c, line 82

Error in crossprod(x) : could not find symbol "..." in environment of the generic function

I do not understand this error. I have had similar errors before, which were usually solved by updating R or the package itself. Are there any compatibility issues I am unaware of?

R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] irlba_2.0.0 tm_0.6-2 NLP_0.1-8 Matrix_1.2-2

loaded via a namespace (and not attached):
[1] colorspace_1.2-6 ggplot2_2.1.0 grid_3.1.1 gtable_0.1.2 lattice_0.20-29 munsell_0.4.2 parallel_3.1.1 plyr_1.8.3 Rcpp_0.12.1
[10] scales_0.3.0 slam_0.1-32 tools_3.1.1

prcomp_irlba code has a bug in dealing with scaling

When you set scale. = TRUE in the prcomp_irlba function, it throws an error:

> prcomp_irlba(exprs(HSMM), scale. = T)
Error in if (tol * ans$d[1] < eps) warning("convergence criterion below machine epsilon") : 
  missing value where TRUE/FALSE needed

I think the error comes from the variable center on line 94 of the prcomp.R file; it should be args$center instead of center, because center is a logical value by default.

prcomp_irlba

You give the following example for prcomp_irlba; however, the results are not consistent:

> set.seed(1)
> x  <- matrix(rnorm(200), nrow=20)
> p1 <- prcomp_irlba(x, n=3)
> summary(p1)
Importance of components%s:
                         PC1    PC2    PC3
Standard deviation     1.541 1.2513 1.1916
Proportion of Variance 0.443 0.2921 0.2649
Cumulative Proportion  0.443 0.7351 1.0000
> p2 <- prcomp(x, tol=0.7)
> summary(p2)
Importance of first k=3 (out of 10) components:
                          PC1    PC2    PC3
Standard deviation     1.5411 1.2513 1.1916
Proportion of Variance 0.2806 0.1850 0.1678
Cumulative Proportion  0.2806 0.4656 0.6334

`right_only=TRUE` is a memory hog

I have a dgCMatrix of about 1.8GB with 20 million columns, and I need just the right singular vectors, so I tried right_only=TRUE. It turned out to be both considerably less efficient and much more memory-hungry than right_only=FALSE.

By looking at the code I see that fastpath is bypassed in this case and the work dimension is artificially increased by 20 here. This basically creates a V matrix of almost 7GB in my case and eventually blows up on my 16GB laptop.

I guess right_only is intended for the n >> p case, where those extra 20 columns don't matter. In my case it's the other way around. Thus, maybe an appropriate fix would be to simply document this in the right_only argument documentation.

irlba is using all possible cores?

Hi,

As you may already know, the single-cell analysis package Seurat makes use of irlba.

I opened a question there: satijalab/seurat#3991

It seems that irlba is using every core available to it. Is there a way that I can limit this to N cores/R sessions?

As an example, this is enough for it to start 30 R sessions and use 3000% CPU:

matrix(rnorm(100000000), ncol = 1000) -> testmat
res = irlba(testmat)
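One workaround, assuming the parallelism comes from a multithreaded BLAS rather than from irlba's own code, is to cap the BLAS thread count, for example with the RhpcBLASctl package:

library(RhpcBLASctl)
blas_set_num_threads(4)   # limit the BLAS to 4 threads
res = irlba(testmat)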

Thanks,

Kim

Issue with proportion of variance in prcomp_irlba with scaling

I believe this is related to issue #23. The calculated proportion of variance is pretty consistent with stats::prcomp without scaling (fixed after #23), but when scale. = TRUE it differs widely. I believe the solution would be to check for scale. and use the scaled matrix to calculate the proportion of variance if scale. = TRUE.

Based on your example, but using scale. = TRUE:

set.seed(1)
x  <- matrix(rnorm(200), nrow=20)
p1 <- prcomp_irlba(x, n=3, scale. = TRUE)
summary(p1)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.5308 1.3415 1.3121
Proportion of Variance 0.2769 0.2126 0.2034
Cumulative Proportion  0.2769 0.4895 0.6929
p2 <- prcomp(x, tol=0.7, scale = TRUE)
summary(p2)
Importance of first k=4 (out of 10) components:
                          PC1    PC2    PC3    PC4
Standard deviation     1.5308 1.3415 1.3121 1.1814
Proportion of Variance 0.2343 0.1800 0.1722 0.1396
Cumulative Proportion  0.2343 0.4143 0.5864 0.7260

Using scaled matrix:

# Proportion of Variance:
round(summary(p1)$importance[1,]**2 / 
sum( apply( Re(scale(x, center=colMeans(x), scale=TRUE)) , 2, stats::var ) ), 4)
   PC1    PC2    PC3 
0.2343 0.1800 0.1722
# Cumulative Proportion:
round(cumsum(summary(p1)$importance[1,]**2 / 
sum( apply( Re(scale(x, center=colMeans(x), scale=TRUE)) , 2, stats::var ) )), 4)
   PC1    PC2    PC3 
0.2343 0.4143 0.5864 

Specifying both center= and scale= does not yield consistent results

Consider the following code snippet:

set.seed(1234)
a <- matrix(rnorm(10000), ncol=20)
center <- runif(ncol(a))
scale <- runif(ncol(a))

library(irlba)
set.seed(100)
out <- irlba(a, center=center, scale=scale, k=5, fastpath=FALSE)
out$d
## [1] 489.44223 316.87816 124.50882 121.74289  97.49367
set.seed(100)
ref <- irlba(t((t(a) - center)/scale), k=5, fastpath=FALSE)
ref$d
## [1] 531.5186 349.5748 126.9928 124.3211 106.4896

Interestingly, specifying only one of center and scale yields consistent results, which suggests that the above problem is caused by an interaction between the two. Some testing indicates that lines 542-543 are the offenders. In particular, it seems like line 543 should have dv / scale instead, if scale is defined. This seems to fix the problem for fastpath=FALSE, though note that the same issue is present in the C code.

Session information
R version 3.5.0 Patched (2018-04-30 r74679)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] irlba_2.3.3   Matrix_1.2-14

loaded via a namespace (and not attached):
[1] compiler_3.5.0  grid_3.5.0      lattice_0.20-35

irlba does not produce consistent results

I am using irlba to decompose a 20,000 x 1 million matrix for Latent Semantic Analysis. irlba is the only library that works with such a big matrix in R, but I get very different results every time I call the function. Even if I set tol=0, hoping that more iterations would produce a more reliable result, it stops (converges?) after around 230 iterations. I understand that irlba's SVD is an approximation, but is there any way to get consistent results?
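A workaround, assuming the variation comes from the random starting vector, is to fix the seed before each call (a sketch; mat stands in for the document-term matrix):

set.seed(1)
S <- irlba(mat, nv=300)   # with the seed fixed, repeated runs use the same starting vector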

Thank you for developing an amazing library.

Kohei

replacement has length zero bug

reported by @koheiw moved from issue #5 thread:

mx <- matrix(sample(1:100, 10^3 * 10^3, replace=TRUE), nrow=10^3)
S <- irlba(mx, nv=300, verbose=TRUE, center=colMeans(mx), right_only=TRUE)
Working dimension size 317
Error in W[, j_w] <- W[, j_w] - ds * drop(cross(dv, VJ)) * du : 
  replacement has length zero

Error in irlba: object 'C_IRLB' not found

Hi

First of all, thank you for this great package!

With version 2.1.2 of package irlba I get the expected result for the attached sparse matrix

> packageVersion('irlba')
[1] '2.1.2'
> Xh <- readRDS('test_fail_irlba_C_IRLB_not_found.rds')
> SVD <- irlba(Xh,20)
> str(SVD)
List of 5
 $ d    : num [1:20] 9.6 8.65 8.53 7.25 6.92 ...
 $ u    : num [1:82, 1:20] 0.01995 0.01768 -0.00259 -0.03284 -0.12891 ...
 $ v    : num [1:90, 1:20] -0.0976 -0.1344 -0.1261 -0.1774 -0.0851 ...
 $ iter : int 6
 $ mprod: int 84

But with 2.3.3 it doesn't return anything and shows an error message that I can't interpret.

> packageVersion('irlba')
[1] '2.3.3'
> Xh <- readRDS('test_fail_irlba_C_IRLB_not_found.rds')
> SVD <- irlba(Xh,20)
Error in irlba(Xh, 20) : object 'C_IRLB' not found
> str(SVD)
Error in str(SVD) : object 'SVD' not found

Thanks in advance and best regards.
Víctor de Buen

test_fail_irlba_C_IRLB_not_found.zip

sparse matrices and prcomp_irlba

While 'irlba' can work with sparse matrices, 'prcomp_irlba' cannot. All sparse matrices get coerced to dense ones, due to the line if (!is.matrix(x)) x <- as.matrix(x)

Hence, maybe the man page should not state that A can be sparse.

BTW, would it be difficult to make prcomp_irlba leverage the support for sparse matrices of the irlba function?

leading eigenvectors are very dependent on seed when there are many small eigenvals

Hi,

Using your code from your RSpectra comparison with eigenvalues like these:

 [1] 1.2010e+05 3.3975e+04 1.5586e+04 8.3248e+03 7.5574e+03 7.0686e+03 6.2953e+03 6.1647e+03 5.5930e+03 5.3994e+03 4.9431e+03         
 [12] 4.6991e+03 4.3388e+03 4.1773e+03 3.9298e+03 3.8293e+03 3.7712e+03 3.7303e+03 3.6607e+03 3.5395e+03 3.4908e+03 3.3796e+03         
 [23] 3.2971e+03 3.2209e+03 3.1451e+03 3.0029e+03 2.8943e+03 2.8009e+03 2.7219e+03 2.6431e+03 9.7071e-10 6.6343e-10 2.8837e-10         
 [34] 1.8375e-10 1.0578e-10 1.0274e-10 7.8556e-11 7.1549e-11 5.6877e-11 4.8820e-11 4.0724e-11 3.5984e-11 3.4707e-11 2.8522e-11         
 [45] 2.5319e-11 2.5076e-11 2.3933e-11 2.3539e-11 2.1689e-11 2.1335e-11 2.0271e-11 1.9699e-11 1.9426e-11 1.8862e-11 1.8385e-11         
 [56] 1.8283e-11 1.7684e-11 1.7657e-11 1.6216e-11 1.5041e-11 1.3888e-11 1.3781e-11 1.2863e-11 1.2862e-11 1.2635e-11 1.2570e-11         
 [67] 1.2540e-11 1.1974e-11 1.1850e-11 1.1368e-11 1.1355e-11 1.1192e-11 1.0941e-11 1.0780e-11 1.0678e-11 1.0511e-11 9.8443e-12         
 [78] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
 [89] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[100] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[111] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[122] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[133] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[144] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[155] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[166] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[177] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[188] 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12 9.8073e-12         
[199] 9.8073e-12 9.8073e-12 9.8073e-12 

for example, irlba gives very different leading eigenvectors depending on the random seed. svd and RSpectra seem mostly stable on this input.

best,
michael

expand vignette

Add sections on fast/slow code paths and which matrix classes work with which path, on memory use, and other internals.

"pathological example" fails on i386 architecture

In case it helps, printing l$d at the point of failure produces the following on i386:

 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 0.9999996
[15] 0.9999978 0.9999895 0.9999511 0.9997809 0.9990578 0.9961445

...and the following on all other architectures (x86_64, ARM, POWER, etc.):

 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999
[15] 0.9999996 0.9999978 0.9999895 0.9999511 0.9997809 0.9990578

I think this may be caused by i386 using 80-bit floating point precision internally, while other architectures use 64-bit.

Is it possible to adjust this test so that it would pass under both conditions?

"Error in if (tol * ans$d[1] < eps) warning("convergence criterion below machine epsilon") : missing value where TRUE/FALSE needed"

Hi, I am trying to integrate scRNA-seq and scATAC-seq data using Seurat's vignette (https://satijalab.org/seurat/v3.1/atacseq_integration_vignette.html).

The following step is giving me an error.

pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50)

Here is the error:

"Error in if (tol * ans$d[1] < eps) warning("convergence criterion below machine epsilon") :
  missing value where TRUE/FALSE needed"


> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/katiyn/miniconda3/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] patchwork_1.0.0 ggplot2_3.3.0   Seurat_3.1.5

loaded via a namespace (and not attached):
 [1] rsvd_1.0.3          ggrepel_0.8.2       Rcpp_1.0.4.6
 [4] ape_5.3             lattice_0.20-38     ica_1.0-2
 [7] tidyr_1.0.3         listenv_0.8.0       png_0.1-7
[10] zoo_1.8-8           assertthat_0.2.1    digest_0.6.25
[13] lmtest_0.9-37       R6_2.4.1            plyr_1.8.6
[16] ggridges_0.5.2      httr_1.4.1          pillar_1.4.4
[19] rlang_0.4.6         lazyeval_0.2.2      data.table_1.12.8
[22] irlba_2.3.3         leiden_0.3.3        Matrix_1.2-17
[25] reticulate_1.15     splines_3.6.1       Rtsne_0.15
[28] stringr_1.4.0       htmlwidgets_1.5.1   uwot_0.1.8
[31] igraph_1.2.5        munsell_0.5.0       sctransform_0.2.1
[34] compiler_3.6.1      pkgconfig_2.0.3     globals_0.12.5
[37] htmltools_0.4.0     tidyselect_1.1.0    gridExtra_2.3
[40] tibble_3.0.1        RANN_2.6.1          codetools_0.2-16
[43] fitdistrplus_1.0-14 future_1.17.0       viridisLite_0.3.0
[46] withr_2.2.0         crayon_1.3.4        dplyr_0.8.5
[49] MASS_7.3-51.3       rappdirs_0.3.1      grid_3.6.1
[52] tsne_0.1-3          nlme_3.1-139        jsonlite_1.6.1
[55] gtable_0.3.0        lifecycle_0.2.0     magrittr_1.5
[58] npsurv_0.4-0.1      scales_1.1.1        KernSmooth_2.23-15
[61] stringi_1.4.6       future.apply_1.5.0  lsei_1.2-0.1
[64] pbapply_1.4-2       reshape2_1.4.4      ROCR_1.0-11
[67] ellipsis_0.3.0      vctrs_0.3.0         cowplot_1.0.0
[70] RcppAnnoy_0.0.16    RColorBrewer_1.1-2  tools_3.6.1
[73] glue_1.4.0          purrr_0.3.4         parallel_3.6.1
[76] survival_3.1-12     colorspace_1.4-1    cluster_2.0.8
[79] plotly_4.9.2.1

prcomp_irlba crashes R

Attempting prcomp_irlba crashes R (3.3.2). I updated to the latest version of irlba (2.3.2) to make sure that wasn't the issue.

prcomp_irlba(x.sparse, n = 50, retx = F, center = T, scale = F)

where x.sparse is a 940498 x 8713 sparse matrix of dummy variables.

This works fine:
irlba(x.sparse, nv = 50, nu = 0, center = colMeans(x.sparse), right_only = TRUE)

If this is a memory issue, is there a way to issue a warning and end the computation instead of crashing?

Thanks!

Inconsistent results with complex matrices

I'm using {irlba} to perform SVD on complex matrices and found that the results were inconsistent. In my real code, I had an issue in which the resulting principal values depended on the order of the rows. While trying to create a reproducible example, I realised that the solution was inconsistent between different calls even when the matrix remained the same.

set.seed(42)
N <- 10
M <- 40

m <- matrix(complex(real = rnorm(M*N), imaginary = rnorm(M*N)),
             ncol = N, nrow = M)

first_pc <- function(matrix, package) {
  if (package == "svd") {
    s <- svd(matrix, nu = 2)  
  } else {
    s <- suppressWarnings(irlba::irlba(matrix, nu = 2, rng = runif) )
  }

  s$v[, 1]
}

Plotting the first principal vector from several calls to irlba::irlba, the results change considerably!

plot(first_pc(m, "irlba"), type = "l")
lines(first_pc(m, "irlba"))
lines(first_pc(m, "irlba"))
lines(first_pc(m, "irlba"))

Using base::svd, the results are always the same:

plot(-first_pc(m, "svd"), type = "l")
lines(-first_pc(m, "svd"))
lines(-first_pc(m, "svd"))
lines(-first_pc(m, "svd"))

With real values, this doesn't happen:

m <- matrix(rnorm(M*N),
            ncol = N, nrow = M)

plot(first_pc(m, "irlba"), type = "l")
lines(first_pc(m, "irlba"))
lines(first_pc(m, "irlba"))
lines(first_pc(m, "irlba"))

Created on 2020-11-02 by the reprex package (v0.3.0)

use NULL/is.null for optional arguments (scale, center, etc.)

Many of the optional arguments are checked internally using the missing() function. Unfortunately, trying to write a wrapper function where these arguments may or may not be provided is made more difficult with this approach. If you instead set the default values to NULL and checked using is.null() instead, it would be easier to write wrapper functions that optionally use or don't use these arguments. I think this is the standard approach for multi-valued optional arguments like this, insofar as anything in R is standard. Thanks for considering the change!
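A small sketch of the difference (hypothetical functions, not irlba's actual signatures):

# missing()-style: a wrapper must branch on whether to pass the argument at all
f_missing <- function(x, center) {
  if (missing(center)) x else sweep(x, 2, center)
}

# NULL-style: a wrapper can always forward its own (possibly NULL) value
f_null <- function(x, center=NULL) {
  if (is.null(center)) x else sweep(x, 2, center)
}
wrapper <- function(x, center=NULL) f_null(x, center)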

irlba and svd don't match

I have a square matrix M that is obtained after performing some calculations, and I have to compute its SVD. I've tried both the svd and irlba functions and the results are not the same. I would be very grateful if you could indicate how to get the SVD using irlba, since your method is extremely useful for my problem. The matrix is attached to this message and this is the code I use:

[M.txt](https://github.com/bwlewis/irlba/files/806438/M.txt)

> library(irlba)

> t1 <- svd(M)
> t2 <- irlba(M, 5)
> t3 <- irlba(M, 400)
> t1$u[1:5,1:5]
              [,1]         [,2]         [,3]        [,4]         [,5]
[1,]  1.337658e-05 -0.022042268  0.072997682 -0.06740229  0.024306528
[2,]  4.150039e-03 -0.088295027 -0.014797329 -0.06674204 -0.040441062
[3,] -1.461567e-03 -0.005261221 -0.009778176  0.09441301 -0.059103520
[4,] -3.989281e-02  0.061646129 -0.021303995  0.02768563 -0.058945328
[5,] -3.433767e-02  0.021539362 -0.016010397 -0.02408006  0.002703832
> t2$u[1:5,1:5]
            [,1]         [,2]          [,3]          [,4]        [,5]
[1,] 0.004987207 -0.004488571 -0.0676958252  0.0511607108 -0.05923026
[2,] 0.001833914 -0.089510673  0.0004203942 -0.0269936629  0.05055553
[3,] 0.001189413 -0.012061085 -0.0166318788  0.0971423657  0.10112741
[4,] 0.037901336  0.070035622  0.1303613492  0.0001745698  0.02288906
[5,] 0.027499900  0.009384006 -0.0056690380  0.0468421469  0.01239040
> t3$u[1:5,1:5]
              [,1]         [,2]         [,3]        [,4]          [,5]
[1,]  0.0005569089  0.022238880 -0.072787347  0.06127649  0.0008511529
[2,]  0.0084894343  0.087430938  0.014782482  0.06570138 -0.0463571028
[3,] -0.0026525785  0.005337798  0.009813996 -0.09609238 -0.0642166988
[4,] -0.0427231000 -0.059921706  0.022765963 -0.02962612 -0.0642055456
[5,] -0.0344191182 -0.020140115  0.015668233  0.02770949  0.0123238452

difficulty finding invariant subspace

The issue is that the tolerance for detecting an invariant subspace, internally set to 2 * eps, is too small. In some cases, irlba moves on past the subspace and blows up. Here is a rank 10 example:

library(irlba)

set.seed(1)
r = 10
n = 1000
X1 = matrix(rnorm(n * r), n) 
X2 = matrix(rnorm(n * r), n)
X = X1 %*% t(X2)

irlba(X, 20, fastpath=FALSE, verbose=TRUE, svtol=Inf)$d
#Working dimension size 27
#iter= 1, mprod= 54, sv[20]=9.69e-10, %change=, k=23
#iter= 2, mprod= 62, sv[20]=7.59e-07, %change=9.99e-01, k=23
# [1] 5.274810e+06 3.966244e+04 1.140974e+03 1.059975e+03 1.044177e+03
# [6] 1.023285e+03 9.910259e+02 9.744555e+02 9.668851e+02 9.537244e+02
#[11] 9.332256e+02 9.053668e+02 3.587707e+02 3.964667e+00 5.473509e-02
#[16] 4.585135e-03 9.709451e-04 2.294908e-05 2.173387e-05 7.588123e-07

# even worse!
irlba(X, 20, work=35, fastpath=FALSE, verbose=TRUE, svtol=Inf)$d
#Working dimension size 35
#iter= 1, mprod= 70, sv[20]=1.53e+11, %change=, k=23
#iter= 2, mprod= 94, sv[20]=3.20e+12, %change=9.52e-01, k=23
# [1] 3.209668e+28 7.737559e+25 2.065238e+23 6.129716e+20 2.035176e+18
# [6] 6.530794e+17 7.609339e+15 1.343803e+15 3.212394e+13 3.199367e+12
#[11] 3.199367e+12 3.199367e+12 3.199367e+12 3.199367e+12 3.199367e+12
#[16] 3.199367e+12 3.199367e+12 3.199367e+12 3.199367e+12 3.199367e+12

head(svd(X)$d, 5)
#[1] 1140.9737 1059.9747 1044.1772 1023.2846  991.0259

# This is OK:
irlba(X, 5)$d
#[1] 1140.9737 1059.9747 1044.1772 1023.2846  991.0259

# And this is OK:
irlba(X, 20, work=25, fastpath=FALSE, verbose=TRUE, svtol=Inf)$d
#Working dimension size 25
#iter= 1, mprod= 50, sv[20]=1.16e-11, %change=, k=23
#iter= 2, mprod= 54, sv[20]=1.15e-10, %change=9.00e-01, k=23
# [1] 1.140974e+03 1.059975e+03 1.044177e+03 1.023285e+03 9.910259e+02
# [6] 9.744555e+02 9.668851e+02 9.537244e+02 9.332256e+02 9.053668e+02
#[11] 8.502320e+01 9.502886e-01 1.334770e-02 2.414586e-04 5.837561e-06
#[16] 1.982668e-07 2.297901e-08 1.013727e-08 8.461105e-10 1.154962e-10

One possible solution is to adjust the tolerance for detecting an invariant subspace.

Poor results with only scale-type arguments in irlba and prcomp_irlba

When I compare the results from the irlba package (using the irlba function or prcomp_irlba) to R's svd and prcomp functions, the results are usually quite good ('mean relative difference' on the order of 1e-7 or so, as reported by all.equal). However, when using scale.=TRUE, or the standard deviations of the input matrix's columns for scale in irlba(), the mean relative difference from the prcomp result is quite large, on the order of 0.1. When using center alone or together with scale, the mean relative difference goes back to 1e-7 or so. Is this expected behavior?

I ran this on the latest git version, installed with install_github.

Here is sample code; I left out the centered and centered-and-scaled cases, as those work well:

normalize_signs = function(X, Y) {
    for (i in 1:ncol(X)) {
        if (sign(X[1, i]) != sign(Y[1, i])) {
            Y[,i] = -Y[,i]
        }
    }
    return(Y)
}

all.equal_pca = function(X, Y) {
    Y = normalize_signs(X, Y)
    return(all.equal(X, Y, check.attributes=F))
}

set.seed(1)
X = matrix(rnorm(2000), ncol=40)
M = 5 # number of PCA components
centers = colMeans(X)
sds = apply(X, 2, sd)
Xc = sweep(X, 2, centers, `-`)
Xs = sweep(X, 2, sds, `/`)
Xcs = sweep(Xc, 2, sds, `/`)

# unscaled
scaled=F
centered=F
pca = prcomp(X, center=centered, scale.=scaled)
sv = svd(X)
svir = irlba(X, nv=M, nu=M)
pcair = prcomp_irlba(X, n=M, center=centered, scale.=scaled)

Xpca = predict(pca)[,1:M]

Xsvl = sv$u[,1:M] %*% diag(sv$d[1:M])
Xsvr = X %*% sv$v[,1:M]

Xsvirl = svir$u %*% diag(svir$d)
Xsvirr = X %*% svir$v

Xpcair = predict(pcair)
Xpcair2 = X %*% pcair$rotation

all.equal_pca(Xsvl, Xsvr)
all.equal_pca(Xpca, Xsvl)
all.equal_pca(Xsvirl, Xsvirr)
all.equal_pca(Xpca, Xsvirl)
all.equal_pca(Xpcair, Xpcair2)
all.equal_pca(Xpca, Xpcair)
all.equal_pca(Xpcair, Xsvirl)

# scaled, uncentered
scaled=T
centered=F
pca = prcomp(X, center=centered, scale.=scaled)
sv = svd(Xs)
svir = irlba(X, nv=M, nu=M, scale=sds)
pcair = prcomp_irlba(X, n=M, center=centered, scale.=scaled)

Xpca = predict(pca)[,1:M]

Xsvl = sv$u[,1:M] %*% diag(sv$d[1:M])
Xsvr = Xs %*% sv$v[,1:M]

Xsvirl = svir$u %*% diag(svir$d)
Xsvirr = Xs %*% svir$v

Xpcair = predict(pcair)
Xpcair2 = Xs %*% pcair$rotation

all.equal_pca(Xsvl, Xsvr)
all.equal_pca(Xpca, Xsvl)
all.equal_pca(Xsvirl, Xsvirr)
all.equal_pca(Xpca, Xsvirl)
all.equal_pca(Xpcair, Xpcair2)
all.equal_pca(Xpca, Xpcair)
all.equal_pca(Xpcair, Xsvirl)

Store scaling parameters when doing PCA with scaling

When scaling, stats::prcomp stores the calculated scaling values in the returned object, while irlba::prcomp_irlba only stores scale=TRUE. This behavior doesn't match the documentation for irlba::prcomp_irlba.

Storing the scaling values makes it possible to apply the fitted PCA model object to other datasets.

I'm happy to write a patch, though it looks like pull request #52 rewrites irlba::prcomp_irlba; I haven't checked to see if the problem exists there.

Some code to see the difference:

library(irlba)

set.seed(1234)
r<-100L
c<-10L
M<-matrix(data=runif(r*c),nrow=r,ncol=c)

# scaling and centering
builtin<-prcomp(M,rank.=4,center=TRUE,scale.=TRUE)
str(builtin$scale)  # a numeric vector
summary(builtin$x-( sweep(sweep(M,2,builtin$center),2,builtin$scale,FUN=`/`) %*% builtin$rotation ))

packaged<-prcomp_irlba(M,n=4,center=TRUE,scale.=TRUE)
str(packaged$scale)  # the logical TRUE
scaling<-apply(M,2,sd)
summary(packaged$x-( sweep(sweep(M,2,packaged$center),2,scaling,FUN=`/`) %*% packaged$rotation ))


# just scaling. Uses RMS
RMS <- function (v) sqrt(sum(v^2)/(length(v)-1))
builtin<-prcomp(M,rank.=4,center=FALSE,scale.=TRUE)
str(builtin$scale)  # a numeric vector
summary(builtin$x-( sweep(M,2,builtin$scale,FUN=`/`) %*% builtin$rotation ))

packaged<-prcomp_irlba(M,n=4,center=FALSE,scale.=TRUE)
str(packaged$scale)  # the logical TRUE
scaling<-apply(M,2,RMS)
summary(packaged$x-( sweep(M,2,scaling,FUN=`/`) %*% packaged$rotation ))

sparse matrices and irlba

The help page for 'irlba' states that A can be a 'numeric real- or complex-valued matrix or real-valued sparse matrix'. I tried it with a 'dgTMatrix' and it blew up my memory. Looking into the source code, I found a comment that the fast C implementation only works for dgCMatrix objects.

Two Suggestions:

  1. Please state explicitly in the help page that only column-compressed sparse matrices (class dgCMatrix) take the fast path, and that other sparse types (such as triplet-form dgTMatrices) will get expanded to dense form (see the conversion sketch after this list).

  2. Maybe issue a warning when a sparse matrix gets expanded using 'as.matrix'.
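In the meantime, converting to column-compressed form before the call avoids the dense expansion (a sketch):

library(Matrix)
library(irlba)
set.seed(1)
A  <- as(sparseMatrix(i=1:100, j=1:100, x=rnorm(100)), "TsparseMatrix")  # dgTMatrix
Ac <- as(A, "CsparseMatrix")   # dgCMatrix, eligible for the fast C path
S  <- irlba(Ac, 5)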

PS: Thanks a lot for irlba -- especially the sparse implementation is really useful for my project.

Consistency in sign.

I know that the singular value decomposition is only defined up to sign, but I think that, for reproducibility, irlba() should be consistent. Ideally, it should follow the same convention as the base::svd() function. The workaround right now seems to be to set the seed each time irlba is run; another workaround is sketched after the examples below.

Reproducible example:

data("volcano")
sapply(1:10, FUN = function(x) sign(irlba::irlba(volcano, nv = 1)$u[1]))
#>  [1] -1  1  1 -1 -1 -1  1  1  1  1

sapply(1:10, FUN = function(x) sign(svd(volcano, nv = 1)$u[1]))
#>  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

sapply(1:10, FUN = function(x) {
    set.seed(42) 
    sign(irlba::irlba(volcano, nv = 1)$u[1])})
#>  [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
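Another workaround is to normalize signs after the fact; a sketch of one common convention (flip each singular-vector pair so the largest-magnitude entry of u is positive):

fix_signs <- function(s) {
  flip <- sign(s$u[cbind(apply(abs(s$u), 2, which.max), seq_len(ncol(s$u)))])
  s$u <- sweep(s$u, 2, flip, `*`)
  s$v <- sweep(s$v, 2, flip, `*`)
  s
}
sign(fix_signs(irlba::irlba(volcano, nv = 1))$u[1])  # consistent across runs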

(42 turns out to be the answer to the ultimate question of life, the universe, everything and consistent singular value decompositions)

problem updating to 2.3.2

Hi,

I've been trying to run Monocle 2.0 and got stuck with the BLAS/LAPACK error. Apparently this issue can be resolved by updating the irlba package to 2.3.2. Unfortunately, I am unable to update to the latest version (screenshot attached).

Can you help me resolve this error?

Thanks!
Sanjeev


Poor convergence with clustered large singular values.

Giuseppe Rodriguez pointed out a pathology when the singular values cluster near the largest value. In this case it's hard to pick out a truncated space containing the largest few singular values because they are all very similar. This is the mirror image of the problem detecting the smallest singular values for discrete ill-posed problems.

We implemented an additional convergence tolerance requiring the estimated singular values to converge (along with the estimated spaces) to help improve this a bit. The convergence criteria are now:

  1. (tol) subspace convergence when ||A^T U - VS|| < tol*||A||, where the spectral norm ||A|| is approximated by the largest estimated singular value, and U, V, S are the matrices corresponding to the estimated left and right singular vectors, and diagonal matrix of estimated singular values, respectively.
  2. (svtol) singular value convergence when the maximum absolute relative change in each singular value from one iteration to the next is at most svtol.
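Both tolerances can be passed directly; for example (a sketch, where A is any matrix):

S <- irlba(A, 10, tol=1e-5, svtol=1e-5)   # require both criteria
S <- irlba(A, 10, svtol=Inf)              # restore the old single-criterion behavior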

See https://github.com/bwlewis/irlba/blob/master/tests/test.R#L130 for a good example of this problem.

The new convergence criterion comes at increased cost: two or more Lanczos iterations are always required.

The improvement outlined above is only a partial fix. A better solution will use block methods, which we plan on adding soon.

wrong PC calculated in 2.1.2?

It seems the PCs are not correct. Here is my test:

set.seed(1)
x  <- matrix(rnorm(200), nrow=20)
p1 <- prcomp_irlba(x, n=3)
summary(p1)

# Compare with
p2 <- prcomp(x, tol=0.7)
summary(p2)

norm(p1$x - p2$x)
# output [1] 2.566585

norm(scale(x, scale = F) %*% p1$rotation - p2$x)
# output [1] 4.597156e-13

Installation Error on Ubuntu 16.04

Hi,

I've got the following error. Basically, I am trying to install the networkD3 package for network analysis.

install.packages("irlba")
Installing package into ‘/home/rstudio/R/x86_64-pc-linux-gnu-library/3.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/src/contrib/irlba_2.1.2.tar.gz'
Content type 'application/x-gzip' length 218471 bytes (213 KB)
==================================================
downloaded 213 KB

* installing *source* package ‘irlba’ ...
** package ‘irlba’ successfully unpacked and MD5 sums checked
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I"/usr/lib/R/library/Matrix/include" -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c irlb.c -o irlb.o
gcc: error: unrecognized command line option ‘-fstack-protector-strong’
gcc: error: unrecognized command line option ‘-Wdate-time’
/usr/lib/R/etc/Makeconf:132: recipe for target 'irlb.o' failed
make: *** [irlb.o] Error 1
ERROR: compilation failed for package ‘irlba’
* removing ‘/home/rstudio/R/x86_64-pc-linux-gnu-library/3.3/irlba’
Warning in install.packages :
  installation of package ‘irlba’ had non-zero exit status

The downloaded source packages are in
‘/tmp/Rtmpac9rpA/downloaded_packages’

===================
sessionInfo() output is as follows:

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_3.3.1

Possible performance issue in irlba with sparse matrices

I think that irlba takes too much time with sparse matrices. Check out a comparison with RSpectra:

require(irlba)
require(RSpectra)
require(Matrix)

A <- as(sparseMatrix(i=1:5000,j=1:5000,x=1:5000), "dgCMatrix");
set.seed(1)
system.time(r<-irlba(A,40,tol=1e-5,verbose=TRUE))
#> Working dimension size 47
#> Initializing starting vector v with samples from standard normal distribution.
#> Use `set.seed` first for reproducibility.
#> irlba: using fast C implementation
#>    user  system elapsed
#>   7.696   0.000   7.680
r$mprod
#> [1] 3796
set.seed(1)
system.time(r<-irlba(A,40,tol=1e-5,work=80,verbose=TRUE))
#> Working dimension size 80
#> Initializing starting vector v with samples from standard normal distribution.
#> Use `set.seed` first for reproducibility.
#> irlba: using fast C implementation
#>    user  system elapsed
#>   1.904   0.000   1.905
r$mprod
#> [1] 1424
system.time(r<-RSpectra::svds(A,40,tol=1e-5))
#>    user  system elapsed
#>   0.192   0.000   0.193
r$nops
#> [1] 1141

When the maximum basis size is increased, the time drops, but it is still ten times slower than RSpectra. I think it may be an issue with the matvec implementation in C or with the restarting.
