
Comments (22)

mikelove commented on August 28, 2024

This is mostly done. Alevin work continues, but that is not related to this thread.

from tximport.

mikelove commented on August 28, 2024

Interesting problem. So sparsifying the largest chunk possible and then combining is probably the implementation to beat in terms of speed. To think out loud, the sparse version would import the nonzero elements and their indices, and then use Matrix::sparseMatrix to create the matrix, right? I wonder if this is already implemented in the SingleCellExperiment codebase or whether it would belong here.

roryk commented on August 28, 2024

Heya -- I've looked at datasets of around a million cells using Seurat, and memory blows up when fitting models to that many cells. The initial import is, I think, just the first and one of the least memory-intensive parts of what DESeq2 will end up doing with the data, so sparsifying it might solve that one problem without helping you get much further.

We've looked at thousands of cells with DESeq2 and it seems to work fine, but you do need a bunch of memory to get it working.

koenvandenberge commented on August 28, 2024

Yes, importing the nonzero element values and indices would probably be a good idea.
It could be relatively simple: select the nonzero elements from the last column of the quant.sf files.
Then the only other things we need are a vector of feature names and a vector of column names for the matrix.
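A minimal sketch of that idea (this is not tximport's actual code): `sparse_import` is a hypothetical helper, and the column names assume standard salmon quant.sf output (Name, Length, EffectiveLength, TPM, NumReads, with counts in the last column).

```r
# Rough sketch: keep only the nonzero counts from each quant.sf and
# assemble one dgCMatrix from the accumulated (i, j, x) triplets.
library(Matrix)

sparse_import <- function(files, sample_names = names(files)) {
  i <- j <- x <- vector("list", length(files))
  tx_names <- NULL
  for (k in seq_along(files)) {
    quant <- read.delim(files[k])
    if (is.null(tx_names)) tx_names <- quant$Name
    nz <- which(quant$NumReads > 0)      # nonzero elements of the last column
    i[[k]] <- nz                         # row indices (features)
    j[[k]] <- rep(k, length(nz))         # column index (sample)
    x[[k]] <- quant$NumReads[nz]         # the nonzero values
  }
  sparseMatrix(i = unlist(i), j = unlist(j), x = unlist(x),
               dims = c(length(tx_names), length(files)),
               dimnames = list(as.character(tx_names), sample_names))
}
```

Memory then scales with the number of nonzero entries rather than with features times samples.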

I think that the SingleCellExperiment framework could be good as output, since it also builds on SummarizedExperiment but accepts sparse matrices. But then there's the problem of when tximport should return a SingleCellExperiment rather than a SummarizedExperiment ...

I'm often still working locally on my 16GB RAM laptop. While I agree that I won't be able to run DESeq2 on thousands of cells, I think that importing them in sparse matrix format should be possible. I think that it is a good thing to be optimistic and hope that in due time fast and memory-efficient methods will be around to process those datasets, even on a local machine. For now, I just wanted to make some plots :-).

mikelove commented on August 28, 2024

Setting aside the inference stuff (we're working on this, but side topic), we've tried to set it up so tximport has no dependencies to encourage broad adoption (developers don't want too many upstream dependencies) and tximeta builds the rich S4 objects. So I'm thinking tximport would give back a more basic object like a sparse matrix as implemented in Matrix package (and this would only be a Suggests). I think I can whip this up soon-ish. Meanwhile, we also are looking into tximport importing alevin output, which I should also find time to implement.

It seems simple enough for the point estimates, but would take some more time for the inferential replicates.

mikelove commented on August 28, 2024

In version 1.11.1, I started to implement sparse import. Might be of interest also to @csoneson

The commit: 18beb22

Some caveats: this only works for txOut=TRUE, countsFromAbundance either "no" or "scaledTPM", doesn't work with inferential replicates, and only imports counts (and abundances if countsFromAbundance="scaledTPM"). There is a new argument sparse and an argument sparseThreshold=1.

To think about: the length matrix is dense, so given that we're in a setting where dense won't work at all because we are out of memory, there's no hope to import the length matrix. That means that lengthScaledTPM won't work as it is implemented. I could import a sparse length matrix, but since the un-stored elements in a sparse matrix are taken to be zero, I'd have to be really careful with this. It would break some existing code across packages that assumes that the length matrix does not have any zeros (the log of the length matrix becomes an offset).
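A two-line illustration of that hazard (a toy example, not code from any package): un-stored entries of a sparse matrix read back as 0, so taking the log, as the existing offset code does, silently yields -Inf.

```r
# Un-stored entries in a sparse matrix are 0, so log() of a sparse length
# matrix would produce -Inf offsets wherever a length was not stored.
library(Matrix)

len <- sparseMatrix(i = 1, j = 1, x = 500, dims = c(2, 1))  # entry [2,1] un-stored
offset <- log(as.matrix(len))
offset[2, 1]  # -Inf: exactly the breakage downstream offset code would hit
```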

Gene-level summarization in tximport uses some fast base R code that doesn't work out of the box for sparse matrices, so I've not implemented that right now. I'd have to work more on that.

Inferential replicate import is possible, but that would take much more work and testing than this implementation took. I just did a first pass to see if it is useful, and then when/if I have more time I could extend to inferential replicates.

 > txi.cfa <- tximport(files, type="salmon", txOut=TRUE, 
     countsFromAbundance="scaledTPM", sparse=TRUE)
 > txi.cfa$counts[1:10,]
 10 x 6 sparse Matrix of class "dgCMatrix"
                      sample1  sample2    sample3  sample4    sample5    sample6
 ENST00000456328.2   1.350157   .        2.177719  .         5.730738   1.897019
 ENST00000450305.2   .          .        .         .         .          .
 ENST00000488147.1 124.781758 193.5745 137.846412 94.20554 178.185519 235.519863
 ENST00000619216.1   .          .        .         .         .          .
 ENST00000473358.1   .          .        .         .         .          .
 ENST00000469289.1   .          .        .         .         .          .
 ENST00000607096.1   .          .        .         .         .          .
 ENST00000417324.1   .          .        .         .         .          .
 ENST00000461467.1   .          .       23.789177 49.52768  27.406386  44.840428
 ENST00000606857.1   .          .        .         .         .          .
 > txi.cfa$abundance[1:10,]
 10 x 6 sparse Matrix of class "dgCMatrix"
                     sample1 sample2   sample3 sample4  sample5   sample6
 ENST00000456328.2 0.0591243  .      0.0631446 .       0.199672  0.083908
 ENST00000450305.2 .          .      .         .       .         .
 ENST00000488147.1 5.4642800  8.1588 3.9969600 3.94843 6.208390 10.417400
 ENST00000619216.1 .          .      .         .       .         .
 ENST00000473358.1 .          .      .         .       .         .
 ENST00000469289.1 .          .      .         .       .         .
 ENST00000607096.1 .          .      .         .       .         .
 ENST00000417324.1 .          .      .         .       .         .
 ENST00000461467.1 .          .      0.6897850 2.07585 0.954901  1.983360
 ENST00000606857.1 .          .      .         .       .         .

mikelove commented on August 28, 2024

Another note: I'll be working on supporting alevin output (see #24), and I'll have to see to what degree we can keep the data sparse there as well...

mikelove commented on August 28, 2024

Chatting with @csoneson on Slack, and wanted to save thoughts here.

There's an obvious weak point in the current code where I grow some lists sample-by-sample:

https://github.com/mikelove/tximport/blob/master/R/tximport.R#L384-L386

I do this because I don't know ahead of time how many counts will be >= 1.

An alternative, which might be faster, is to go through the files once just counting, so that I can allocate vectors of the correct size, then read the files again. The tradeoff is an extra read_csv pass vs. growing a list sample-by-sample. I can take a shot at implementing both, and someone might be able to time-trial them on a real dataset?
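A hedged sketch of that two-pass alternative (hypothetical helper name, base R `read.delim` standing in for read_csv, and an assumed NumReads column): pass one only counts nonzero entries so the triplet vectors can be preallocated at exactly the right size; pass two re-reads each file and fills them.

```r
# Pass 1: count nonzero entries per file so exact-size vectors can be
# allocated up front. Pass 2: re-read each file and fill the triplets.
two_pass_triplets <- function(files) {
  n_nz <- vapply(files, function(f) sum(read.delim(f)$NumReads >= 1), integer(1))
  total <- sum(n_nz)
  i <- integer(total); j <- integer(total); x <- numeric(total)
  pos <- 0L
  for (k in seq_along(files)) {
    quant <- read.delim(files[k])
    nz <- which(quant$NumReads >= 1)
    idx <- pos + seq_along(nz)
    i[idx] <- nz; j[idx] <- k; x[idx] <- quant$NumReads[nz]
    pos <- pos + length(nz)
  }
  list(i = i, j = j, x = x)   # ready for Matrix::sparseMatrix()
}
```

Whether the second read costs more than repeated list growth is exactly the question a time trial on a real dataset would answer.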

roryk commented on August 28, 2024

Happy to help out testing which works faster.

pinin4fjords commented on August 28, 2024

Sorry to butt in on this old thread, but I'm currently trying (possibly foolishly) to aggregate the results of 50k Kallisto runs with the older, non-sparse version of tximport. I gave up when the memory topped 150 GB on our cluster before failing.

Is this something the sparse options are likely to deal with effectively, or are these numbers far beyond what tximport can sensibly deal with?

mikelove commented on August 28, 2024

You can try e.g. 10 samples with and without sparse=TRUE to see what your savings will be with sparsity.

Likely you need to find another approach that doesn't involve creating an in-memory matrix that is 20k x 50k.
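Rough arithmetic behind that advice: even one dense double matrix of that shape costs roughly 7.5 GiB before R makes a single copy, and tximport holds several such matrices at once (counts, abundance, length).

```r
# Dense storage cost of a 20k x 50k matrix of doubles (8 bytes each),
# before any of the copies R routinely makes during manipulation:
20000 * 50000 * 8 / 2^30   # ~7.45 GiB per matrix
```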

pinin4fjords commented on August 28, 2024

Okay, that's useful -- thanks.

lambdamoses commented on August 28, 2024

I'm working with some alevin output, and even not-so-large datasets run out of memory because a dense matrix is returned. If the barcodes are not filtered, there can be over 200 times more columns (barcodes) than real cells, so the vast majority of columns will be almost all 0; that's just a huge waste of memory. It seems that the sparse argument does not affect alevin: https://github.com/mikelove/tximport/blob/d1a195b41f92489562f451ecf99f8fd6d47bb112/R/tximport.R#L201-L214. Is there any way to read the alevin binary matrix as a sparse matrix? I guess that might require writing some custom Rcpp code.

rob-p commented on August 28, 2024

cc @k3yavi --- I think we should bump the priority of having a sparse-matrix pipeline from alevin -> tximport.

mikelove commented on August 28, 2024

Let’s doooo it. 🤘

lambdamoses commented on August 28, 2024

I'm doing it right now. If it works well, I'll PR

mikelove commented on August 28, 2024

@lambdamoses we are planning to change the output format to make it massively smaller on disk, and faster and more efficient to read as well.

lambdamoses commented on August 28, 2024

Do you mean that you are changing the alevin binary matrix output so it's a sparse matrix? That would be great! I mean I'm rewriting ReadAlevin to read the existing binary file as a dgCMatrix, without changing alevin itself.

rob-p commented on August 28, 2024

@lambdamoses : Yes, we are planning on having alevin output sparse (and compressed) matrices directly so that they can be directly read as sparse matrices into R, either via some existing library function or via some (hopefully trivial) function that we implement.

pinin4fjords commented on August 28, 2024

I adapted @k3yavi's Alevin parser to make 10X-style matrix market format via a sparse python matrix. I put the script on the gitter channel here a couple of days ago: https://gitter.im/COMBINE-lab/salmon (disclaimer: not thoroughly tested yet, it's part of a new and currently private internal repo). That format is nicely parsable by DropletUtils etc in R, straight to a SingleCellExperiment.

Obviously it would be great if that (or e.g. Loom format) was a direct output of Alevin.

lambdamoses commented on August 28, 2024

Before alevin and tximport support sparse matrices, you can use this script to read alevin output as a sparse matrix:

#' Read salmon alevin binary output as sparse matrix
#' 
#' The most annoying part about alevin: all available methods to load the binary
#' output return a dense matrix, which is a huge waste of memory. Here I try to
#' load it as a sparse matrix.
#' 
#' @param files Path to the \code{quants_mat.gz} file inside the \code{alevin}
#' output directory.
#' @param save_mtx Logical, whether the sparse matrix should be saved as a mtx
#' file. If \code{FALSE}, then a dgCMatrix is returned.
#' @param save_path Where to save the mtx file. If \code{save_mtx = TRUE}, then
#' this argument must be specified. If the directory doesn't exist, then it will
#' be created.
#' @param \dots Other arguments passed to \code{BUSpaRse::save_cellranger}.
#' @return If \code{save_mtx = FALSE}, then a dgCMatrix with cells in columns 
#' and genes in rows, with barcodes as column names and gene IDs as row names.
#' Otherwise, 3 files will be written to the directory specified in \code{save_path}:
#' matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz, and the dgCMatrix is returned
#' invisibly.
#' 
library(Matrix) # provides sparseMatrix(); BUSpaRse is needed only when save_mtx = TRUE

readAlevin_sparse <- function(files, save_mtx = FALSE, save_path = NULL,
                              ...) {
  dir <- sub("/alevin$","",dirname(files))
  barcode.file <- file.path(dir, "alevin/quants_mat_rows.txt")
  gene.file <- file.path(dir, "alevin/quants_mat_cols.txt")
  matrix.file <- file.path(dir, "alevin/quants_mat.gz")
  for (f in c(barcode.file, gene.file, matrix.file)) {
    if (!file.exists(f)) {
      stop("expecting 'files' to point to 'quants_mat.gz' file in a directory 'alevin'
           also containing 'quants_mat_rows.txt' and 'quants_mat_cols.txt'.
           please re-run alevin preserving output structure")
    }
  }
  cell.names <- readLines(barcode.file)
  gene.names <- readLines(gene.file)
  num.cells <- length(cell.names)
  num.genes <- length(gene.names)
  values <- rowinds <- colinds <- vector("list", num.cells)
  con <- gzcon(file(matrix.file, "rb"))
  pb <- txtProgressBar(style = 3, max = num.cells)
  for (j in seq_len(num.cells)) {
    v <- readBin(con, double(), endian = "little", n=num.genes)
    non0_inds <- which(v > 0)
    values[[j]] <- v[non0_inds]
    rowinds[[j]] <- non0_inds
    colinds[[j]] <- rep(j, length(non0_inds))
    setTxtProgressBar(pb, j)
  }
  close(pb)
  close(con)
  values <- unlist(values)
  rowinds <- unlist(rowinds)
  colinds <- unlist(colinds)
  mat <- sparseMatrix(i = rowinds, j = colinds, x = values, 
                      dims = c(num.genes, num.cells),
                      dimnames = list(gene.names, cell.names))
  if (save_mtx) {
    save_path <- normalizePath(save_path, mustWork = FALSE)
    if (!dir.exists(save_path)) {
      dir.create(save_path, recursive = TRUE)
    }
    save_cellranger(mat, save_path, ...)
    invisible(mat)
  } else {
    return(mat)
  }
}

Here is the save_cellranger function:
https://github.com/BUStools/BUSpaRse/blob/ca0236611c82aac2281c0ffcfc1c9827a9c65bf5/R/utils.R#L104-L122

LTLA commented on August 28, 2024

Just wanted to mention that there's nothing special in SingleCellExperiment with respect to sparse matrices; those will work off the shelf with your standard SummarizedExperiment. You only need a SCE if you want a better place to store stuff like reduced dimensions, spike-in information, etc.
