
Comments (22)

mikelove commented on August 28, 2024

This is mostly done. Alevin work continues, but that is not related to this thread.

from tximport.

mikelove commented on August 28, 2024

Interesting problem. So sparsifying the largest chunk possible and then combining is probably the implementation to beat in terms of speed. To think out loud, the sparse version would import the nonzero elements and their indices, and then use Matrix::sparseMatrix to create the matrix, right? I wonder if this is already implemented in the SingleCellExperiment codebase or whether it would belong here.

roryk commented on August 28, 2024

Heya -- I've looked at datasets of around a million cells using Seurat, and memory blows up when fitting models to that many cells. The initial import is, I think, just the first and one of the least memory-intensive parts of what DESeq2 will end up doing with the data, so sparsifying it might solve that one problem without helping you get much further.

We've looked at thousands of cells with DESeq2 and it seems to work fine, but you do need a bunch of memory to get it working.

koenvandenberge commented on August 28, 2024

Yes, importing the nonzero element values and indices would probably be a good idea.
It could be relatively simple: select the nonzero elements from the last column of the quant.sf files.
Then the only other things we need are a vector of feature names and a vector of column names for the matrix.
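A minimal sketch of that idea (this is not tximport's actual code): `sparse_import` is a hypothetical helper, and the column names assume standard salmon quant.sf output (Name, Length, EffectiveLength, TPM, NumReads, with counts in the last column).

```r
# Rough sketch: keep only the nonzero counts from each quant.sf and
# assemble one dgCMatrix from the accumulated (i, j, x) triplets.
library(Matrix)

sparse_import <- function(files, sample_names = names(files)) {
  i <- j <- x <- vector("list", length(files))
  tx_names <- NULL
  for (k in seq_along(files)) {
    quant <- read.delim(files[k])
    if (is.null(tx_names)) tx_names <- quant$Name
    nz <- which(quant$NumReads > 0)      # nonzero elements of the last column
    i[[k]] <- nz                         # row indices (features)
    j[[k]] <- rep(k, length(nz))         # column index (sample)
    x[[k]] <- quant$NumReads[nz]         # the nonzero values
  }
  sparseMatrix(i = unlist(i), j = unlist(j), x = unlist(x),
               dims = c(length(tx_names), length(files)),
               dimnames = list(as.character(tx_names), sample_names))
}
```

Memory then scales with the number of nonzero entries rather than with features times samples.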

I think that the SingleCellExperiment framework could be good as output, since it also builds on SummarizedExperiment but accepts sparse matrices. But then there's the problem of when tximport should return a SingleCellExperiment rather than a SummarizedExperiment ...

I'm often still working locally on my 16GB RAM laptop. While I agree that I won't be able to run DESeq2 on thousands of cells, I think that importing them in sparse matrix format should be possible. I think that it is a good thing to be optimistic and hope that in due time fast and memory-efficient methods will be around to process those datasets, even on a local machine. For now, I just wanted to make some plots :-).

mikelove commented on August 28, 2024

Setting aside the inference stuff (we're working on this, but side topic), we've tried to set it up so tximport has no dependencies to encourage broad adoption (developers don't want too many upstream dependencies) and tximeta builds the rich S4 objects. So I'm thinking tximport would give back a more basic object like a sparse matrix as implemented in Matrix package (and this would only be a Suggests). I think I can whip this up soon-ish. Meanwhile, we also are looking into tximport importing alevin output, which I should also find time to implement.

It seems simple enough for the point estimates, but would take some more time for the inferential replicates.

mikelove commented on August 28, 2024

In version 1.11.1, I started to implement sparse import. Might be of interest also to @csoneson

The commit: 18beb22

Some caveats: this only works for txOut=TRUE, countsFromAbundance either "no" or "scaledTPM", doesn't work with inferential replicates, and only imports counts (and abundances if countsFromAbundance="scaledTPM"). There is a new argument sparse and an argument sparseThreshold=1.

To think about: the length matrix is dense, so given that we're in a setting where dense won't work at all because we are out of memory, there's no hope to import the length matrix. That means that lengthScaledTPM won't work as it is implemented. I could import a sparse length matrix, but since the un-stored elements in a sparse matrix are taken to be zero, I'd have to be really careful with this. It would break some existing code across packages that assumes that the length matrix does not have any zeros (the log of the length matrix becomes an offset).
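A two-line illustration of that hazard (a toy example, not code from any package): un-stored entries of a sparse matrix read back as 0, so taking the log, as the existing offset code does, silently yields -Inf.

```r
# Un-stored entries in a sparse matrix are 0, so log() of a sparse length
# matrix would produce -Inf offsets wherever a length was not stored.
library(Matrix)

len <- sparseMatrix(i = 1, j = 1, x = 500, dims = c(2, 1))  # entry [2,1] un-stored
offset <- log(as.matrix(len))
offset[2, 1]  # -Inf: exactly the breakage downstream offset code would hit
```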

Gene-level summarization in tximport uses some fast base R code that doesn't work out of the box for sparse matrices, so I've not implemented that right now. I'd have to work more on that.

Inferential replicate import is possible, but that would take much more work and testing than this implementation took. I just did a first pass to see if it is useful, and then when/if I have more time I could extend to inferential replicates.

 > txi.cfa <- tximport(files, type="salmon", txOut=TRUE, 
     countsFromAbundance="scaledTPM", sparse=TRUE)
 > txi.cfa$counts[1:10,]
 10 x 6 sparse Matrix of class "dgCMatrix"
                      sample1  sample2    sample3  sample4    sample5    sample6
 ENST00000456328.2   1.350157   .        2.177719  .         5.730738   1.897019
 ENST00000450305.2   .          .        .         .         .          .
 ENST00000488147.1 124.781758 193.5745 137.846412 94.20554 178.185519 235.519863
 ENST00000619216.1   .          .        .         .         .          .
 ENST00000473358.1   .          .        .         .         .          .
 ENST00000469289.1   .          .        .         .         .          .
 ENST00000607096.1   .          .        .         .         .          .
 ENST00000417324.1   .          .        .         .         .          .
 ENST00000461467.1   .          .       23.789177 49.52768  27.406386  44.840428
 ENST00000606857.1   .          .        .         .         .          .
 > txi.cfa$abundance[1:10,]
 10 x 6 sparse Matrix of class "dgCMatrix"
                     sample1 sample2   sample3 sample4  sample5   sample6
 ENST00000456328.2 0.0591243  .      0.0631446 .       0.199672  0.083908
 ENST00000450305.2 .          .      .         .       .         .
 ENST00000488147.1 5.4642800  8.1588 3.9969600 3.94843 6.208390 10.417400
 ENST00000619216.1 .          .      .         .       .         .
 ENST00000473358.1 .          .      .         .       .         .
 ENST00000469289.1 .          .      .         .       .         .
 ENST00000607096.1 .          .      .         .       .         .
 ENST00000417324.1 .          .      .         .       .         .
 ENST00000461467.1 .          .      0.6897850 2.07585 0.954901  1.983360
 ENST00000606857.1 .          .      .         .       .         .

mikelove commented on August 28, 2024

Another note: I'll be working on supporting alevin output (see #24), and I'll have to see to what degree we can keep the data sparse there as well...

mikelove commented on August 28, 2024

Chatting with @csoneson on Slack, and wanted to save thoughts here.

There's an obvious weak point in the current code where I grow some lists sample-by-sample:

https://github.com/mikelove/tximport/blob/master/R/tximport.R#L384-L386

I do this because I don't know ahead of time how many counts will be >= 1.

An alternative, which might be faster, is to go through the files once just counting, so that I can allocate vectors of the correct size, then read the files again. The tradeoff is an extra read_csv pass vs. growing a list sample-by-sample. I can take a shot at implementing both, and someone might be able to time-trial them on a real dataset?
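A hedged sketch of that two-pass alternative (hypothetical helper name, base R `read.delim` standing in for read_csv, and an assumed NumReads column): pass one only counts nonzero entries so the triplet vectors can be preallocated at exactly the right size; pass two re-reads each file and fills them.

```r
# Pass 1: count nonzero entries per file so exact-size vectors can be
# allocated up front. Pass 2: re-read each file and fill the triplets.
two_pass_triplets <- function(files) {
  n_nz <- vapply(files, function(f) sum(read.delim(f)$NumReads >= 1), integer(1))
  total <- sum(n_nz)
  i <- integer(total); j <- integer(total); x <- numeric(total)
  pos <- 0L
  for (k in seq_along(files)) {
    quant <- read.delim(files[k])
    nz <- which(quant$NumReads >= 1)
    idx <- pos + seq_along(nz)
    i[idx] <- nz; j[idx] <- k; x[idx] <- quant$NumReads[nz]
    pos <- pos + length(nz)
  }
  list(i = i, j = j, x = x)   # ready for Matrix::sparseMatrix()
}
```

Whether the second read costs more than repeated list growth is exactly the question a time trial on a real dataset would answer.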

roryk commented on August 28, 2024

Happy to help out testing which works faster.

pinin4fjords commented on August 28, 2024

Sorry to butt in on this old thread, but I'm currently trying (possibly foolishly) to aggregate the results of 50k Kallisto runs with the older, non-sparse version of tximport. I gave up when the memory topped 150 GB on our cluster before failing.

Is this something the sparse options are likely to deal with effectively, or are these numbers far beyond what tximport can sensibly deal with?

mikelove commented on August 28, 2024

You can try e.g. 10 samples with and without sparse=TRUE to see what your savings will be with sparsity.

Likely you need to find another approach that doesn't involve creating an in-memory matrix that is 20k x 50k.
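Rough arithmetic behind that advice: even one dense double matrix of that shape costs roughly 7.5 GiB before R makes a single copy, and tximport holds several such matrices at once (counts, abundance, length).

```r
# Dense storage cost of a 20k x 50k matrix of doubles (8 bytes each),
# before any of the copies R routinely makes during manipulation:
20000 * 50000 * 8 / 2^30   # ~7.45 GiB per matrix
```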

pinin4fjords commented on August 28, 2024

Okay, that's useful -- thanks.

lambdamoses commented on August 28, 2024

I'm working with some alevin output, and even not-so-large datasets run out of memory because a dense matrix is returned. If the barcodes are not filtered, there can be over 200 times more columns (barcodes) than real cells, so the vast majority of columns will be almost all 0; that's just a huge waste of memory. It seems that the sparse argument does not affect alevin: https://github.com/mikelove/tximport/blob/d1a195b41f92489562f451ecf99f8fd6d47bb112/R/tximport.R#L201-L214. Is there any way to read the alevin binary matrix as a sparse matrix? I guess that might require writing some custom Rcpp code.

rob-p commented on August 28, 2024

cc @k3yavi --- I think we should bump the priority of having a sparse-matrix pipeline from alevin -> tximport.

mikelove commented on August 28, 2024

Let’s doooo it. 🤘

lambdamoses commented on August 28, 2024

I'm doing it right now. If it works well, I'll PR

mikelove commented on August 28, 2024

@lambdamoses we are planning to change the output format to make it massively smaller on disk, and faster and more efficient to read as well.

lambdamoses commented on August 28, 2024

Do you mean that you are changing the alevin binary matrix output so it's a sparse matrix? That would be great! I mean I'm rewriting ReadAlevin to read the existing binary file as a dgCMatrix, without changing alevin itself.

rob-p commented on August 28, 2024

@lambdamoses : Yes, we are planning on having alevin output sparse (and compressed) matrices directly so that they can be directly read as sparse matrices into R, either via some existing library function or via some (hopefully trivial) function that we implement.

pinin4fjords commented on August 28, 2024

I adapted @k3yavi's Alevin parser to make 10X-style matrix market format via a sparse python matrix. I put the script on the gitter channel here a couple of days ago: https://gitter.im/COMBINE-lab/salmon (disclaimer: not thoroughly tested yet, it's part of a new and currently private internal repo). That format is nicely parsable by DropletUtils etc in R, straight to a SingleCellExperiment.

Obviously it would be great if that (or e.g. Loom format) was a direct output of Alevin.

lambdamoses commented on August 28, 2024

Before alevin and tximport support sparse matrices, you can use this script to read alevin output as a sparse matrix:

#' Read salmon alevin binary output as sparse matrix
#' 
#' The most annoying part about alevin: all available methods to load the binary
#' output return a dense matrix, which is a huge waste of memory. Here I try to
#' load it as a sparse matrix.
#' 
#' @param files Path to the \code{quants_mat.gz} file inside the \code{alevin}
#' output directory.
#' @param save_mtx Logical, whether the sparse matrix should be saved as a mtx
#' file. If \code{FALSE}, then a dgCMatrix is returned.
#' @param save_path Where to save the mtx file. If \code{save_mtx = TRUE}, then
#' this argument must be specified. If the directory doesn't exist, then it will
#' be created.
#' @param \dots Other arguments passed to \code{BUSpaRse::save_cellranger}.
#' @return If \code{save_mtx = FALSE}, then a dgCMatrix with cells in columns 
#' and genes in rows, with barcodes as column names and gene IDs as row names.
#' Otherwise, 3 files will be written to the directory specified in \code{save_path}:
#' matrix.mtx.gz, features.tsv.gz, and barcodes.tsv.gz, and the dgCMatrix is returned
#' invisibly.
#' 
library(Matrix) # provides sparseMatrix(); BUSpaRse is needed only when save_mtx = TRUE

readAlevin_sparse <- function(files, save_mtx = FALSE, save_path = NULL,
                              ...) {
  dir <- sub("/alevin$","",dirname(files))
  barcode.file <- file.path(dir, "alevin/quants_mat_rows.txt")
  gene.file <- file.path(dir, "alevin/quants_mat_cols.txt")
  matrix.file <- file.path(dir, "alevin/quants_mat.gz")
  for (f in c(barcode.file, gene.file, matrix.file)) {
    if (!file.exists(f)) {
      stop("expecting 'files' to point to 'quants_mat.gz' file in a directory 'alevin'
           also containing 'quants_mat_rows.txt' and 'quants_mat_cols.txt'.
           please re-run alevin preserving output structure")
    }
  }
  cell.names <- readLines(barcode.file)
  gene.names <- readLines(gene.file)
  num.cells <- length(cell.names)
  num.genes <- length(gene.names)
  values <- rowinds <- colinds <- vector("list", num.cells)
  con <- gzcon(file(matrix.file, "rb"))
  pb <- txtProgressBar(style = 3, max = num.cells)
  for (j in seq_len(num.cells)) {
    v <- readBin(con, double(), endian = "little", n=num.genes)
    non0_inds <- which(v > 0)
    values[[j]] <- v[non0_inds]
    rowinds[[j]] <- non0_inds
    colinds[[j]] <- rep(j, length(non0_inds))
    setTxtProgressBar(pb, j)
  }
  close(pb)
  close(con)
  values <- unlist(values)
  rowinds <- unlist(rowinds)
  colinds <- unlist(colinds)
  mat <- sparseMatrix(i = rowinds, j = colinds, x = values, 
                      dims = c(num.genes, num.cells),
                      dimnames = list(gene.names, cell.names))
  if (save_mtx) {
    save_path <- normalizePath(save_path, mustWork = FALSE)
    if (!dir.exists(save_path)) {
      dir.create(save_path, recursive = TRUE)
    }
    save_cellranger(mat, save_path, ...)
    invisible(mat)
  } else {
    return(mat)
  }
}

Here is the save_cellranger function:
https://github.com/BUStools/BUSpaRse/blob/ca0236611c82aac2281c0ffcfc1c9827a9c65bf5/R/utils.R#L104-L122

LTLA commented on August 28, 2024

Just wanted to mention that there's nothing special in SingleCellExperiment with respect to sparse matrices; those will work off the shelf with your standard SummarizedExperiment. You only need a SCE if you want a better place to store stuff like reduced dimensions, spike-in information, etc.
