Comments (5)

LTLA commented on August 31, 2024

Works fine for me.

library(bluster)
library(BiocNeighbors)

# 100,000 random "cells" with 50 dimensions each.
m <- matrix(rnorm(5e6), ncol=50)

# Approximate neighbor search via Annoy.
system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain", BNPARAM=AnnoyParam())))
##    user  system elapsed 
## 181.504   1.076 182.632 

# Default (exact) neighbor search.
system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain")))
# ... DNF (did not finish) after 10 minutes...

For smaller datasets, there may not be any speed-up as the Annoy approach writes its index to file for more general multi-threading via BPPARAM; the file I/O offsets any performance gain from approximation.
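To check whether approximation helps on a given small dataset, one could time the neighbor search alone with the exact default (KmknnParam) and with AnnoyParam. This is only a minimal sketch, not a benchmark from this thread: the matrix size, k=10, and the variable names are assumptions, and timings will vary by machine.

```r
library(BiocNeighbors)

set.seed(100)
# Hypothetical small dataset: 2,000 points in 50 dimensions.
small <- matrix(rnorm(2000 * 50), ncol=50)

# Exact search with the default algorithm.
t.exact <- system.time(exact <- findKNN(small, k=10, BNPARAM=KmknnParam()))

# Approximate search with Annoy.
t.annoy <- system.time(ann <- findKNN(small, k=10, BNPARAM=AnnoyParam()))

# Elapsed seconds for each approach.
c(exact=t.exact[["elapsed"]], annoy=t.annoy[["elapsed"]])

# Average fraction of the exact neighbors that Annoy also reports.
recall <- mean(vapply(seq_len(nrow(small)), function(i) {
    length(intersect(exact$index[i,], ann$index[i,])) / 10
}, numeric(1)))
recall
```

If the two elapsed times are close, there is little to gain from switching to AnnoyParam at that scale.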

Session information:

Still on the last devel cycle, but there weren't any major changes in the latest release, so it shouldn't matter.

R version 4.2.1 RC (2022-06-17 r82506)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-2-branch-devel/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-2-branch-devel/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocNeighbors_1.15.1 bluster_1.7.0       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9           lattice_0.20-45      codetools_0.2-18    
 [4] grid_4.2.1           stats4_4.2.1         magrittr_2.0.3      
 [7] cli_3.4.1            rlang_1.0.6          S4Vectors_0.35.4    
[10] Matrix_1.5-1         BiocParallel_1.31.12 igraph_1.3.5        
[13] parallel_4.2.1       compiler_4.2.1       pkgconfig_2.0.3     
[16] BiocGenerics_0.43.4  cluster_2.1.4       

from bluster.

Yunuuuu commented on August 31, 2024

@LTLA, thanks for your reply and for developing this great single-cell toolkit in Bioconductor. Since I only used the first 50 PCs for clustering, Annoy cannot provide any optimization. I recently compared the performance of scran and Seurat. Seurat uses the Annoy algorithm by default and uses Louvain and Jaccard to cluster cells, which I found to be much faster than scran with the same parameters, but I cannot get similar performance here. Is it possible to make this run faster?

Yunuuuu commented on August 31, 2024

Here is my sessionInfo:

[R]> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
 [1] here_1.0.1                 
 [2] batchelor_1.14.0           
 [3] SingleCellExperiment_1.20.0
 [4] SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0             
 [6] GenomicRanges_1.50.0       
 [7] GenomeInfoDb_1.34.0        
 [8] IRanges_2.32.0             
 [9] S4Vectors_0.36.0           
[10] BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0      
[12] matrixStats_0.62.0         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               
 [2] bluster_1.8.0            
 [3] compiler_4.2.1           
 [4] progressr_0.11.0         
 [5] XVector_0.38.0           
 [6] bitops_1.0-7             
 [7] BiocNeighbors_1.16.0     
 [8] tools_4.2.1              
 [9] DelayedMatrixStats_1.20.0
[10] zlibbioc_1.44.0          
[11] statmod_1.4.37           
[12] metapod_1.6.0            
[13] digest_0.6.30            
[14] jsonlite_1.8.3           
[15] lattice_0.20-45          
[16] pkgconfig_2.0.3          
[17] rlang_1.0.6              
[18] igraph_1.3.5             
[19] Matrix_1.5-1             
[20] DelayedArray_0.24.0      
[21] cli_3.4.1                
[22] parallel_4.2.1           
[23] GenomeInfoDbData_1.2.9   
[24] cluster_2.1.4            
[25] locfit_1.5-9.6           
[26] rprojroot_2.0.3          
[27] grid_4.2.1               
[28] scuttle_1.8.0            
[29] BiocParallel_1.32.0      
[30] limma_3.54.0             
[31] irlba_2.3.5.1            
[32] edgeR_3.40.0             
[33] magrittr_2.0.3           
[34] BiocSingular_1.14.0      
[35] codetools_0.2-18         
[36] sparseMatrixStats_1.10.0 
[37] beachmat_2.14.0          
[38] rsvd_1.0.5               
[39] dqrng_0.3.0              
[40] ResidualMatrix_1.8.0     
[41] ScaledMatrix_1.6.0       
[42] RCurl_1.98-1.9           
[43] scran_1.26.0            

LTLA commented on August 31, 2024

FWIW the vast majority of time is spent inside igraph's cluster_* functions. With 100,000 random cells, the nearest-neighbor search and graph construction take about 30 seconds; the rest of it (> 4 minutes) is spent inside cluster_louvain.
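One way to see this split on your own data is to run the two stages separately and time each. The following is a rough sketch rather than the exact code behind the numbers above: it uses a smaller, hypothetical matrix (`m.small`) so that it finishes quickly, and assumes the defaults of makeSNNGraph() (k=10) match what clusterRows() would use.

```r
library(bluster)
library(BiocNeighbors)
library(igraph)

set.seed(100)
# Hypothetical input: 10,000 random "cells" with 50 dimensions.
m.small <- matrix(rnorm(10000 * 50), ncol=50)

# Stage 1: neighbor search and SNN graph construction (bluster).
t.graph <- system.time(g <- makeSNNGraph(m.small, k=10, BNPARAM=AnnoyParam()))

# Stage 2: community detection on the resulting graph (igraph).
t.louvain <- system.time(comm <- cluster_louvain(g))

# Elapsed seconds per stage.
c(graph=t.graph[["elapsed"]], louvain=t.louvain[["elapsed"]])

# Cluster assignment for each row of m.small.
clusters <- membership(comm)
```

Comparing the two elapsed times shows which stage is worth optimizing for a given dataset.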

You can speed up the graph construction by parallelizing it via BPPARAM, but it'll probably just save you a few seconds or so. The actual clustering itself is serial so it doesn't benefit from parallelization, at least not in igraph's C implementation.
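A sketch of what passing BPPARAM might look like is below. It assumes a bluster version, such as the 1.8.0 in the session information above, where extra arguments to SNNGraphParam() are forwarded to makeSNNGraph() and that function accepts BPPARAM; the worker count and the matrix size are arbitrary choices for illustration.

```r
library(bluster)
library(BiocNeighbors)
library(BiocParallel)

set.seed(100)
# Hypothetical input: 5,000 random "cells" with 50 dimensions.
m.small <- matrix(rnorm(5000 * 50), ncol=50)

# BNPARAM and BPPARAM are passed through to the graph construction step.
# Only the neighbor search is parallelized; the Louvain step stays serial.
param <- SNNGraphParam(cluster.fun="louvain", BNPARAM=AnnoyParam(),
    BPPARAM=MulticoreParam(workers=2))

clusters <- clusterRows(m.small, BLUSPARAM=param)
table(clusters)
```

MulticoreParam relies on forking, so on Windows a SnowParam would be the usual substitute.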

IIRC Seurat had their own implementation of the Louvain algorithm. I don't know whether or not this is of the same quality as igraph's implementation, but if it's noticeably faster, they may be taking some short-cuts that igraph does not.

Yunuuuu commented on August 31, 2024

Thanks for your detailed explanation. I'll stick with the Bioconductor single-cell toolkit instead of blindly pursuing performance.

So from my side it would be OK to close the issue now. I also found the same explanation, that we won't get any benefit from Annoy for small datasets, in http://bioconductor.org/books/3.15/OSCA.advanced/dealing-with-big-data.html#fast-approximations. I'll read this book again.

Thanks for your help @LTLA
