fertiglab / cogaps

Bayesian MCMC matrix factorization algorithm

Home Page: https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html

License: BSD 3-Clause "New" or "Revised" License

R 13.76% C++ 85.47% M4 0.49% TeX 0.28%

bioinformatics single-cell-rna-seq non-negative-matrix-factorization human-cell-atlas bayesian-inference machine-learning

cogaps's People

genesofeve gofflab grimbough smgroves theron-palmer sherman5 melanieloth hmf0103 deshpandelab jmitchell81 maxyte yafeiwang89 ashtsang msteph20 gladelephant

cogaps's Issues

calcCoGAPSStat missing argument

Hello! I am trying to run gene enrichment and I can't find the argument in calcCoGAPSStat that would accept a gene list. Any advice would be appreciated!
Here is the documentation from the CoGAPS package
Usage
calcCoGAPSStat(
object,
sets = NULL,
whichMatrix = "featureLoadings",
numPerm = 1000,
...
)

Rename "genes" to "measurements" or some other alternative

In many places throughout the software and documentation the columns of the data are referred to as "samples" and the rows of the data are referred to as "genes". Since CoGAPS can work on more types of data than just genomic data, it would be nice to come up with a new name.

HDF5 File Support

It would be nice if CoGAPS could read data directly from an hdf5 (particularly h5ad) file.

nPatterns value selection

I'm not certain what to set nPatterns to (in params for CoGAPS::CoGAPS()). Is there any harm (besides run time) in setting a large value for this, and then potentially pruning patterns with "low" amplitudes?

Add Profiling

update lambda to average only non-zero entries

@sherman5 This is to accommodate the additional sparsity of single-cell data. A switch needs to be created to toggle between the current version of lambda (for bulk RNA-seq, etc.) and the version implemented in feature/cal_totalsum.
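A minimal sketch of the averaging step with such a switch. The function name `dataMean` and the `nonZeroOnly` flag are hypothetical, and the way lambda is then derived from the mean is not shown in the issue, so only the averaging itself is illustrated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the CoGAPS source): mean of the data,
// optionally restricted to non-zero entries.
inline double dataMean(const std::vector<double> &values, bool nonZeroOnly)
{
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : values)
    {
        if (nonZeroOnly && v == 0.0)
        {
            continue; // skip zeros so sparse data does not pull the mean down
        }
        sum += v;
        ++count;
    }
    // an empty (or all-zero) input yields 0 rather than dividing by zero
    return count == 0 ? 0.0 : sum / static_cast<double>(count);
}
```

For the values {0, 0, 2, 4}, the plain mean is 1.5 while the non-zero mean is 3, which shows how strongly the two variants can differ on sparse data.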

Remove boost dependency for distribution functions

This has caused a few headaches in the past and in general seems to be vulnerable to errors based on architecture/compile flags.

how to select pattern after scCoGAPS

Thanks for this wonderful work.
I tried to learn the pattern from dataset A then transfer to dataset B for annotation, but I am a bit confused about how to select the patterns, could you please give me some hints?

add basic benchmarks

Currently there are no performance benchmarks. It would be helpful for all development to have some top-level benchmarks we can use to track how our changes are affecting performance. More detailed benchmarks can be done on a case-by-case basis.

incorporate sub_func into randgen

related to #5 and #1

put all functions into one namespace called Random

fix the check for semi supervised case

Here is a comment left by someone in the code, let's treat it properly here.

# TODO: fix this check for semisupervised case
# if (sum(object@featureLoadings < 0) > 0 | sum(object@loadingStdDev < 0) > 0)
#     "negative values in feature Matrix"
# if (sum(object@sampleFactors < 0) > 0 | sum(object@factorStdDev < 0) > 0)
#     "negative values in sample Matrix"

Add checkpoints to CoGAPS

Add parameters for how often to create a backup and how many to store. Be able to start cogaps from a backup and end up at the same final run.
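One way the two parameters could fit together, as a rough sketch only. `CheckpointPolicy`, `shouldCheckpoint`, `checkpointFile`, and the file name pattern are all hypothetical names chosen for illustration. Restarting and reaching the same final run would also require saving the sampler and random number generator state, which is not covered here.

```cpp
#include <string>

// Hypothetical policy object; the field names are illustrative.
struct CheckpointPolicy
{
    unsigned interval;  // create a backup every `interval` iterations (0 disables)
    unsigned maxStored; // number of backup files to rotate through
};

// true when a backup should be written after this iteration
inline bool shouldCheckpoint(const CheckpointPolicy &p, unsigned iteration)
{
    return p.interval > 0 && p.maxStored > 0
        && iteration > 0 && iteration % p.interval == 0;
}

// File name for the backup taken at `iteration`. Slots are reused, so at most
// `maxStored` files exist at once. Only call this when shouldCheckpoint() is true.
inline std::string checkpointFile(const CheckpointPolicy &p, unsigned iteration)
{
    const unsigned slot = (iteration / p.interval - 1) % p.maxStored;
    return "gaps_checkpoint_" + std::to_string(slot) + ".out";
}
```

With `interval = 1000` and `maxStored = 2`, the backups at iterations 1000 and 3000 share slot 0 and the one at 2000 uses slot 1, so the oldest file is overwritten first.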

fewer patterns than 'nPatterns'

Dear developers,

Thanks for the cool package you designed. I tried to follow the 'cogaps for seurat objects' tutorial. One thing that seems confusing is that the number of patterns I found was less than the 'nPatterns' I set. Is that normal, or did I miss something?

Another question is what normalization is best for running CoGAPS for single-cell data?

Thanks!

Genome-wide vs. single-cell

From the vignette + documentation, it's not entirely clear to me what the difference between distributed = "genome-wide" and distributed = "single-cell" is. What is changing? I note that for the SeuratWrapper vignette (https://htmlpreview.github.io/?https://github.com/satijalab/seurat-wrappers/blob/master/docs/cogaps.html) "genome-wide" is used despite handling scRNA-seq data - is this correct?

'Error in corcut' after full CoGAPS run

As my 3-hour CoGAPS run finished up, the following error appeared:

Matching Patterns Across Subsets...
Error in corcut(allPatterns, gapsParams@cut, gapsParams@minNS) : 
  NA values in correlation of patterns
In addition: Warning messages:
1: In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data
2: In cor(allPatterns) : the standard deviation is zero

Is there a way to export part of the CoGAPS result during the run, so that the result isn't lost if this occurs?

add at() method to matrix class

Right now elements are accessed through a double** which is extremely messy. Adding an at() method will make double** get_matrix() obsolete and make the matrix class much cleaner, allowing for the underlying implementation to be changed.
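A minimal sketch of the idea, assuming a flat row-major buffer as the storage. The class below is illustrative and is not the CoGAPS matrix class; member names such as `nRow()` and `mData` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix class; names are illustrative, not the CoGAPS API.
class Matrix
{
public:
    Matrix(std::size_t nRow, std::size_t nCol)
        : mRows(nRow), mCols(nCol), mData(nRow * nCol, 0.0)
    {}

    // read/write access; callers never see the storage layout
    double& at(std::size_t r, std::size_t c)
    {
        assert(r < mRows && c < mCols);
        return mData[r * mCols + c];
    }

    // read-only access for const matrices
    const double& at(std::size_t r, std::size_t c) const
    {
        assert(r < mRows && c < mCols);
        return mData[r * mCols + c];
    }

    std::size_t nRow() const { return mRows; }
    std::size_t nCol() const { return mCols; }

private:
    std::size_t mRows;
    std::size_t mCols;
    std::vector<double> mData; // row-major; can change without touching callers
};
```

Because every access goes through `at()`, the storage could later be switched (for example to column-major or an aligned buffer) without changing any calling code, which is the benefit the issue describes.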

checkInputs Warning clarification

Good afternoon,

Thank you so much for the detailed vignettes and papers for using CoGAPS. We are really finding it beneficial to our research. We are running CoGAPS using the Seurat Wrapper and find that we are getting this warning message:

3: In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

after running:
results <- RunCoGAPS(object, nPatterns = 25, nIterations = 100000, outputFrequency = 5000, sparseOptimization = TRUE, nThreads = 4, distributed = "genome-wide", singleCell = TRUE, seed = 891)

If we change the above code to add temp.file=TRUE to RunCoGAPS(), the message goes away. Upon searching, I found this is part of checkInputs.

Does this warning message affect the output of the function? What is the purpose of the temp.file?

Thank you for your time and consideration!

CoGAPS on proteomic data

Hello
Is CoGAPS usable on proteomic data?

pass by const reference instead of value in matrix functions

Simple fix that should have an impact on performance.
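To illustrate the difference, here is a small sketch using `std::vector<double>` as a stand-in for the matrix storage (the actual matrix functions are not shown in the issue, so `Values`, `sumByValue`, and `sumByConstRef` are hypothetical).

```cpp
#include <vector>

// Stand-in for the matrix storage; the real CoGAPS matrix type is not shown in the issue.
using Values = std::vector<double>;

// by value: every call copies the whole container before summing
inline double sumByValue(Values v)
{
    double total = 0.0;
    for (double x : v) { total += x; }
    return total;
}

// by const reference: same result, no copy, and the callee cannot modify the argument
inline double sumByConstRef(const Values &v)
{
    double total = 0.0;
    for (double x : v) { total += x; }
    return total;
}
```

Both functions return the same sum. The second avoids an allocation and a copy proportional to the container size on every call, which is where any performance gain would come from.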

limit number of cores for tests

The automated build-check process yields test failures during check. This happens because although usually unset, the base _R_CHECK_LIMIT_CORES_ is set to 2 during CRAN checks and is ignored in the package's parallel implementation.

Error: Error:   _R_CHECK_LIMIT_CORES_' environment variable detected, BiocParallel
  workers must be <= 2 was (4)

Here is the description from RShowDoc("R-ints")

_R_CHECK_LIMIT_CORES_
If set, check the usage of too many cores in package parallel. If set to ‘warn’ gives a warning, to ‘false’ or ‘FALSE’ the check is skipped, and any other non-empty value gives an error when more than 2 children are spawned. Default: unset (but ‘TRUE’ for CRAN submission checks).

We should add support for _R_CHECK_LIMIT_CORES_ variable.
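The capping logic could look roughly like the sketch below, written in C++, the package's main language. The helper name `limitWorkers` is hypothetical. In the package itself, the check would belong wherever the worker count is chosen, which may well be R code. As a conservative assumption, the sketch also caps the count when the variable is set to "warn", although R only warns in that case.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>

// Hypothetical helper: cap the requested worker count when the check variable is active.
// `value` is the raw content of _R_CHECK_LIMIT_CORES_, or nullptr when it is unset.
inline int limitWorkers(int requested, const char *value)
{
    if (value == nullptr) { return requested; } // unset: no limit
    const std::string v(value);
    if (v.empty() || v == "false" || v == "FALSE") { return requested; } // check skipped
    return std::min(requested, 2); // any other value: stay within 2 workers
}

// Convenience overload that reads the environment directly.
inline int limitWorkers(int requested)
{
    return limitWorkers(requested, std::getenv("_R_CHECK_LIMIT_CORES_"));
}
```

With this in place, a test that asks for 4 workers would run with 2 under CRAN-style checks and with 4 everywhere else.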

PUMP branch

stub for my PUMP-based features... Am currently merging my fork into a branch cloned from most recent master (I cloned master bc there was no develop yesterday)

using randgen() to generate random numbers is inefficient

Any time a random number is generated by calling randgen, the function has to go through a switch statement to determine which distribution it should generate the number from. It would be more efficient to have a function for each distribution and call those directly. This would also improve readability of the code.

double randgen(char rand_type, double para1, double para2) {
    switch (rand_type) {
        case 'U': {
            boost::random::uniform_01<boost::mt19937 &> zeroone(rng);
            return zeroone();
        }

        case 'N': {
            double mean = para1;
            double std_dev = para2;
            boost::normal_distribution<> nd(mean, std_dev);
            boost::variate_generator<boost::mt19937 &, boost::normal_distribution<> > norm_rnd(rng, nd);
            return norm_rnd();
        }

        case 'P': {
            double lambda = (para1 == 0 ? para2 : para1);
            boost::poisson_distribution<> pd(lambda);
            boost::variate_generator<boost::mt19937 &, boost::poisson_distribution<> > poisson_rnd(rng, pd);
            return poisson_rnd();
        }

        case 'E': {
            double lambda = (para1 == 0 ? para2 : para1);
            boost::exponential_distribution<> expd(lambda);
            boost::variate_generator<boost::mt19937 &, boost::exponential_distribution<> > exp_rnd(rng, expd);
            return exp_rnd();
        }

        default:
            return -9999.0;
    }

    // EJF -- return dummy value to avoid warning
    return -9999.0;
}
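A sketch of the proposed alternative: one function per distribution, grouped in a namespace called Random. Note that this sketch swaps Boost for the C++11 standard library `<random>` header, which is an assumption and not what the code above uses. The function names and the `seed()` helper are also illustrative. The `para1 == 0 ? para2 : para1` fallback from `randgen` is intentionally dropped, since each function now takes exactly the parameters it needs.

```cpp
#include <random>

namespace Random
{
    // one shared generator, seeded explicitly for reproducible runs
    inline std::mt19937& generator()
    {
        static std::mt19937 rng(0);
        return rng;
    }

    inline void seed(unsigned s) { generator().seed(s); }

    // uniform draw on [0, 1)
    inline double uniform()
    {
        return std::uniform_real_distribution<double>(0.0, 1.0)(generator());
    }

    inline double normal(double mean, double stdDev)
    {
        return std::normal_distribution<double>(mean, stdDev)(generator());
    }

    inline int poisson(double lambda)
    {
        return std::poisson_distribution<int>(lambda)(generator());
    }

    inline double exponential(double lambda)
    {
        return std::exponential_distribution<double>(lambda)(generator());
    }
}
```

A call site then reads `Random::normal(0.0, 1.0)` instead of `randgen('N', 0.0, 1.0)`: there is no switch on a character code, no sentinel return value for an unknown type, and the compiler checks the argument list for each distribution.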

family="Helvetica-Narrow" in PatternMatcher.R bug

family="Helvetica-Narrow" in PatternMatcher.R errors on computers where the font is not installed. Needs to be removed from Bioconductor and all other versions ASAP.

Write unit tests for GAPSNorm and Matrix

use floats instead of doubles for matrix values

This can have a crazy performance boost depending on how the compiler is handling vector multiplication.

RNASeq Data Preprocessing for CoGAPS

Hello,

I recently came across this package and am really interested to try it out. However, I am not entirely sure what I should do to preprocess my RNA-Seq data before inputting it to CoGAPS. Would you mind listing the necessary steps? I tried using an rlog-transformed count matrix, which I have used for WGCNA, but then I got an error saying that the algorithm does not accept negative values.

Thank you in advance.

Best,
Mikhael

Make sure explicit SIMD instructions don't break on non-Intel processors

Current CoGAPS version?

According to the most recent vignettes (found at https://www.bioconductor.org/packages/devel/bioc/vignettes/CoGAPS/inst/doc/CoGAPS.html and https://bioconductor.org/packages/release/bioc/vignettes/CoGAPS/inst/doc/CoGAPS.html), CoGAPS 3.8 and 3.9 should be the most recent versions. However, when I install from GitHub, 3.7 is the version that I get from both the "master" and "develop" branches. I am wondering whether I am getting the wrong version or the vignettes are incorrect. I have also heard that there is an issue with the sparseOptimization flag, which I will set to "FALSE" for the time being. Let us know when this feature is ready to be used! Thanks so much!

Set patterns manually (e.g cell type)

Hi,
I have two data sets I wish to cross compare, but instead of learning N patterns from an experiment, what I would like is to set the patterns manually (say, according to cell type) and generate coefficients etc. for these.

It isn't entirely obvious to me how this is done (indeed if it can) so any pointers would be appreciated.

RunCoGAPS on Seurat object

Hello,

I found the idea of finding patterns and projecting them into similar datasets very promising in such a challenging part of scRNA-seq analysis as cell identification. I am trying to annotate my own data using CoGAPS and I found this tutorial that shows how to run CoGAPS on Seurat objects. The function that was used there is called RunCoGAPS, but I can't find it in the package documentation. I have several questions about that function:

does it use all genes or only highly variable ones?
which data slot does it use? normalized I guess?
will the cell embeddings from RunCoGAPS be similar to the result of scCoGAPS on a scRNA-seq normalized count matrix?

thank you,
Yulia

result interpretation

Hello

I ran the vignette but I have some trouble interpreting the results.

The rows in the example matrix have a prefix Hs. followed by a number. Is this a gene annotation?
If I understood correctly, would the patterns be groups of correlated features? If so, where can I find which feature belongs to which pattern? I was not able to find it in the result object.

CoGAPS checkpoints not enabled

I'm using scCoGAPS and haven't been able to use the checkpoint saving steps. I found that the checkpointsEnabled() is set to FALSE, with no clear way to change this. I've used the BiocManager to install CoGAPS as well as the conda bioconductor to install CoGAPS on a slurm unix cluster. I've also installed on a mac, and all forms didn't have checkpointsEnabled(). Any thoughts to turn on checkpoints will be much appreciated.

The compile settings (representative of one of these installations) are below.

SIMD: AVX instructions enabled
Compiler did not support OpenMP

R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/bnphan/miniconda3/envs/single-cell/lib/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] CoGAPS_3.6.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6                cluster_2.1.0              
 [3] XVector_0.26.0              GenomicRanges_1.38.0       
 [5] BiocGenerics_0.32.0         zlibbioc_1.32.0            
 [7] IRanges_2.20.1              BiocParallel_1.20.1        
 [9] lattice_0.20-41             GenomeInfoDb_1.22.0        
[11] caTools_1.18.0              tools_3.6.3                
[13] SummarizedExperiment_1.16.0 parallel_3.6.3             
[15] grid_3.6.3                  rhdf5_2.30.1               
[17] Biobase_2.46.0              KernSmooth_2.23-16         
[19] gtools_3.8.2                matrixStats_0.56.0         
[21] Matrix_1.2-18               GenomeInfoDbData_1.2.2     
[23] Rhdf5lib_1.8.0              RColorBrewer_1.1-2         
[25] S4Vectors_0.24.4            bitops_1.0-6               
[27] SingleCellExperiment_1.8.0  RCurl_1.98-1.1             
[29] gdata_2.18.0                DelayedArray_0.12.0        
[31] compiler_3.6.3              gplots_3.0.3               
[33] stats4_3.6.3

Cache downloads for vignette

Nightly builds download a ~500MB file from Zenodo, which sometimes causes the build to fail. While reviewing another of our packages, Bioc reviewers proposed using BiocFileCache so that the file is downloaded only once. Let's try following that proposal for CoGAPS as well.

Problem Installing Version 3.18.0 - clang: error: the clang compiler does not support '-march=native'

The most recent Version 3.18.0 fails to install on my computer. Version 3.15.2 installs correctly.

I think this may be due to a C++ compiling error on Mac with M1 Pro chip.
See: https://stackoverflow.com/questions/65966969/why-does-march-native-not-work-on-apple-m1

Is there any way to change the clang version used when compiling CoGAPS?

My apologies in advance if I'm not using computing terms correctly! I'm new to CompBio.

Thanks!

Old packages: 'CoGAPS'
Update all/some/none? [a/s/n]: 
a
Package which is only available in source form, and may need
  compilation of C/C++/Fortran: ‘CoGAPS’
Do you want to attempt to install these from sources? (Yes/no/cancel) yes
installing the source package ‘CoGAPS’

trying URL 'https://bioconductor.org/packages/3.16/bioc/src/contrib/CoGAPS_3.18.0.tar.gz'
Content type 'application/x-gzip' length 20839969 bytes (19.9 MB)
==================================================
downloaded 19.9 MB

* installing *source* package ‘CoGAPS’ ...
** using staged installation
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether clang++ -arch arm64 -std=gnu++14 accepts -g... yes
checking how to run the C++ preprocessor... clang++ -arch arm64 -std=gnu++14 -E
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether clang++ -arch arm64 -std=gnu++14 accepts -g... (cached) yes
./configure: line 2752: AX_COMPILER_VENDOR: command not found
./configure: line 2753: AX_COMPILER_VERSION: command not found
./configure: line 2764: AX_OPENMP: command not found
building on  compiler version 
Using AVX instructions if available
configure: creating ./config.status
config.status: creating src/Makevars
** libs
clang++ -arch arm64 -std=gnu++14 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DBOOST_MATH_PROMOTE_DOUBLE_POLICY=0 -DGAPS_DISABLE_CHECKPOINTS -D__GAPS_R_BUILD__ -Iinclude -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/BH/include' -I/opt/R/arm64/include  -march=native  -fPIC  -falign-functions=64 -Wall -g -O2  -c Cogaps.cpp -o Cogaps.o
clang: error: the clang compiler does not support '-march=native'
make: *** [Cogaps.o] Error 1
ERROR: compilation failed for package ‘CoGAPS’

Handle duplicates when sub-setting indices in distributed CoGAPS

Any rows or columns of the data that are duplicated when sampling the subsets in distributed cogaps are not collapsed back down. We can end up with multiple results for a single cell or a single gene because of this.
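One possible way to collapse such duplicates is to average the values that share a name, sketched below. Averaging is an assumption, since the issue does not say how duplicates should be combined, and `collapseDuplicates` is a hypothetical helper operating on a single vector of values rather than on the actual result matrices.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: average the values of entries that share a name.
inline std::map<std::string, double> collapseDuplicates(
    const std::vector<std::string> &names, const std::vector<double> &values)
{
    std::map<std::string, double> sums;
    std::map<std::string, std::size_t> counts;
    // guard against mismatched input lengths
    const std::size_t n = names.size() < values.size() ? names.size() : values.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        sums[names[i]] += values[i];
        ++counts[names[i]];
    }
    // turn each sum into a mean over the number of times the name occurred
    for (auto &entry : sums)
    {
        entry.second /= static_cast<double>(counts[entry.first]);
    }
    return sums;
}
```

For names {cellA, cellB, cellA} with values {1, 5, 3}, the result holds one entry per cell: cellA = 2 and cellB = 5.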

enable sparse matrix support when input is object

CoGAPS/R/HelperFunctions.R

Line 342 in 241867b

convertDataToMatrix <- function(data)

to reproduce:

> library("CoGAPS")
> 
> #pass mtx directly fails
> mtx <- Matrix::readMM("inst/extdata/GIST.mtx")
> class(mtx)
[1] "dgTMatrix"
attr(,"package")
[1] "Matrix"
> res <- CoGAPS(mtx)
Error in convertDataToMatrix(data) : unsupported data type
> 
> #pass mtx as filepath works
> res2 <- CoGAPS("inst/extdata/GIST.mtx", messages = FALSE, nIterations=100)

This is CoGAPS version 3.22.0 
Running Standard CoGAPS on inst/extdata/GIST.mtx (1363 genes and 9 samples)

Linking to Rhdf5lib

Apologies for breaking your package by changing the returned value of Rhdf5lib::pkgconfig().

It looks like you've removed the dependency on this, and based on a cursory look I can't see where you need it, so that is probably the simplest solution. However, if you do need to link against Rhdf5lib you can update the entry in src/Makevars to be:

    RHDF5_LIBS=$(shell echo 'Rhdf5lib::pkgconfig("PKG_CXX_LIBS")'|\
        "${R_HOME}/bin/R" --vanilla --slave)  
    PKG_LIBS=$(RHDF5_LIBS)

I've reverted the change in BioC release, so no need to worry about fixing that before the deadline. Apologies again for doing this at such a crappy time.

Remove singleCell parameter and always calculate sparse mean

running time cogaps

Hi,

I'm trying to run cogaps for a sc-RNAseq dataset (5k genes x 7k cells), with this command:

k=70
cores=2
params <- new("CogapsParams")
params <- setParam(params, "nPatterns", k)
params <- setDistributedParams(params, nSets=cores)
#getParam(params, "nSets")

#set.seed(676)
result <- scCoGAPS(assays(scset)$logcounts, 
                   params, 
                   distributed="single-cell", 
                   singleCell=TRUE,
                   BPPARAM=BiocParallel::MulticoreParam(workers=cores),
                   messages=TRUE)

Running time is quite high

setting distributed parameters - call this again if you change nPatterns

This is CoGAPS version 3.3.40
Running single-cell CoGAPS on 5000 genes and 7133 samples with parameters:

-- Standard Parameters --
nPatterns            70
nIterations          5000
seed                 295
singleCell           TRUE
sparseOptimization   FALSE
distributed          single-cell

-- Sparsity Parameters --
alpha          0.01
maxGibbsMass   100

-- Distributed CoGAPS Parameters --
nSets          2
cut            70
minNS          1
maxNS          3

Creating subsets...
set sizes (min, mean, max): (3566, 3566.5, 3567)
Running Across Subsets...

Loading Data...Done! (00:00:02)
    worker 1 is starting!
    worker 2 is starting!
-- Calibration Phase --
500 of 5000, Atoms: 74391(74157), ChiSq: 136910144, Time: 00:20:43 / 10:52:57
1000 of 5000, Atoms: 82704(92591), ChiSq: 135660368, Time: 00:48:20 / 11:12:02
1500 of 5000, Atoms: 90865(102149), ChiSq: 135199152, Time: 01:20:04 / 11:34:25
2000 of 5000, Atoms: 93763(115371), ChiSq: 134509280, Time: 01:52:35 / 11:40:22
2500 of 5000, Atoms: 97406(110010), ChiSq: 135169216, Time: 02:26:07 / 11:43:22
3000 of 5000, Atoms: 98089(116037), ChiSq: 134589984, Time: 03:01:02 / 11:47:17
3500 of 5000, Atoms: 96612(122988), ChiSq: 134334176, Time: 03:35:33 / 11:46:17
4000 of 5000, Atoms: 97890(122860), ChiSq: 134350288, Time: 04:11:01 / 11:46:30
4500 of 5000, Atoms: 93855(120010), ChiSq: 134670336, Time: 04:46:07 / 11:44:25
5000 of 5000, Atoms: 94880(124085), ChiSq: 134414736, Time: 05:19:54 / 11:38:54
-- Sampling Phase --
500 of 5000, Atoms: 94504(122858), ChiSq: 134543696, Time: 05:53:40 / 11:33:37
1000 of 5000, Atoms: 93654(128185), ChiSq: 134323920, Time: 06:28:42 / 11:30:54
1500 of 5000, Atoms: 95538(127658), ChiSq: 134265616, Time: 07:02:27 / 11:26:00
2000 of 5000, Atoms: 96887(126601), ChiSq: 134209024, Time: 07:37:26 / 11:23:14

However, as suggested by the vignettes, I only split the computation between 2 cores, to keep a consistent number of cells/genes. Would you recommend increasing the number of cores?
Also, I've noticed that the default nIterations changed between the bioconductor package version (around 500 I believe) and the latest github dev version (5000). What about choosing a value half way, like 2000? Would it dramatically affect the results?

Thank you for your time,

Giovanni

Re-Write Vignettes

Instead of a single, monolithic vignette, it is more useful to have multiple vignettes that focus on different user objectives with CoGAPS. Some ideas:

Data Input
Configuring Parameters
Running over multiple values for nPatterns
Cell Type Identification with Pattern Markers
Gene Set Statistics

Branch: revamp_vignettes

SparseOptimization pattern discrepancy

Good afternoon! I recently ran into an issue where there is pattern discrepancy between runs with sparseOptimization set to TRUE versus FALSE. The code I ran and the output is below. With sparseOptimization set to TRUE I noticed that the ChiSq value was -nan and during the equilibration phase, the P matrix was 0. With sparseOptimization set to FALSE there seemed to be no problems, however the number of patterns learned differed in either case, i.e. SparseOptimization = TRUE gave 5 patterns while SparseOptimization = FALSE gave 6 patterns. This was true for a range of patterns that I ran (5-50)

SPARSE OPTIMIZATION ENABLED

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42, 
sparseOptimization=TRUE,
distributed="genome-wide")

params <- setDistributedParams(params, nSets=6)

Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   TRUE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            5 
minNS          3 
maxNS          9 

Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 4 is starting!
    worker 6 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 13376(A), 1242(P), ChiSq: -nan, Time: 00:00:45 / 01:16:13
...
30000 of 30000, Atoms: 20636(A), 1461(P), ChiSq: -nan, Time: 00:35:40 / 01:16:38
-- Sampling Phase --
1000 of 30000, Atoms: 20671(A), 1460(P), ChiSq: -nan, Time: 00:36:54 / 01:16:28
...
29000 of 30000, Atoms: 20645(A), 1469(P), ChiSq: -nan, Time: 01:12:07 / 01:13:27
    worker 2 is finished! Time: 01:12:22
30000 of 30000, Atoms: 20670(A), 1484(P), ChiSq: -nan, Time: 01:13:21 / 01:13:21
    worker 1 is finished! Time: 01:13:21
    worker 3 is finished! Time: 01:13:24
    worker 5 is finished! Time: 01:15:26
    worker 4 is finished! Time: 01:15:26
    worker 6 is finished! Time: 01:19:08

Matching Patterns Across Subsets...
Running Final Stage...

Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 2 is starting!
    worker 6 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 10022(A), 0(P), ChiSq: -nan, Time: 00:00:27 / 00:45:43
...
30000 of 30000, Atoms: 15174(A), 0(P), ChiSq: -nan, Time: 00:47:13 / 00:47:13
    worker 1 is finished! Time: 00:47:13
    worker 2 is finished! Time: 00:47:28
    worker 5 is finished! Time: 00:47:34
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

SPARSE OPTIMIZATION DISABLED

params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42,
distributed="genome-wide")

params <- setDistributedParams(params, nSets=6)

Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)

This is CoGAPS version 3.19.1 
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:

-- Standard Parameters --
nPatterns            5 
nIterations          30000 
seed                 42 
sparseOptimization   FALSE 
distributed          genome-wide 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            5 
minNS          3 
maxNS          9 

Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...

    worker 2 is starting!
    worker 3 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
    worker 4 is starting!
    worker 5 is starting!
    worker 6 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 4665(A), 966(P), ChiSq: 5137063, Time: 00:01:16 / 02:08:43
...
30000 of 30000, Atoms: 9933(A), 2460(P), ChiSq: 4886798, Time: 00:49:52 / 01:47:09
-- Sampling Phase --
1000 of 30000, Atoms: 10033(A), 2514(P), ChiSq: 4886740, Time: 00:51:31 / 01:46:45
...
30000 of 30000, Atoms: 9953(A), 2489(P), ChiSq: 4886819, Time: 01:34:05 / 01:34:05
    worker 1 is finished! Time: 01:34:05
    worker 5 is finished! Time: 01:44:52
    worker 4 is finished! Time: 01:54:06
    worker 2 is finished! Time: 01:54:29
    worker 6 is finished! Time: 01:54:31
    worker 3 is finished! Time: 01:54:38

Matching Patterns Across Subsets...
Running Final Stage...

    worker 5 is starting!
    worker 4 is starting!
    worker 3 is starting!
    worker 2 is starting!
    worker 6 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
    worker 1 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 5928(A), 0(P), ChiSq: 14908930, Time: 00:00:10 / 00:16:56
...
30000 of 30000, Atoms: 10469(A), 0(P), ChiSq: 14908930, Time: 00:08:47 / 00:18:52
-- Sampling Phase --
1000 of 30000, Atoms: 10403(A), 0(P), ChiSq: 14908930, Time: 00:09:00 / 00:18:39
...
30000 of 30000, Atoms: 10379(A), 0(P), ChiSq: 14908930, Time: 00:15:17 / 00:15:17
    worker 1 is finished! Time: 00:15:17
    worker 5 is finished! Time: 00:16:29
    worker 3 is finished! Time: 00:19:47
    worker 2 is finished! Time: 00:20:37
    worker 4 is finished! Time: 00:20:38
    worker 6 is finished! Time: 00:20:45
Warning message:
In checkInputs(data, uncertainty, allParams) :
  running distributed cogaps without mtx/tsv/csv/gct data

After obtaining the patterns, I ran patternMarkers on patterns learned with sparseOptimization = TRUE. When I set threshold = “all”, I would get this error.

test <- patternMarkers_all(Hoxd10_matnp5, threshold = "all")

Error in colnames(markerScores)[apply(markerScores, 1, which.min)] : 
  invalid subscript type 'list'
This error would not trigger when threshold was set to “cut”.
PatternMarkers worked normally when run on patterns learned without sparseOptimization. 

UPDATE @dimalvovs  - delete rows for readability

Atoms and Patterns Stabilization

Good afternoon! I recently have run into large changes in the atoms (A) and (P) matrix values when running CoGAPS across multiple nSets. In a previous issue, these numbers were mentioned as a metric to determine if CoGAPS had stabilized, so I found this a bit concerning. Can you help me understand what is happening here, and point me to a solution? Thank you in advance! (Include the full parameters to help)

-- Standard Parameters --
nPatterns            25 
nIterations          1e+05 
seed                 891 
sparseOptimization   TRUE 
distributed          single-cell 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            25 
minNS          3 
maxNS          9 
[1] 33650 21914
[1] 14160 21914

This is CoGAPS version 3.14.0 
Running single-cell CoGAPS on /tmp/90251.tmpdir/Rtmp31EVRF/file62dc263e1.mtx (14160 genes and 21914 samples) with parameters:

-- Standard Parameters --
nPatterns            25 
nIterations          1e+05 
seed                 891 
sparseOptimization   TRUE 
distributed          single-cell 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 

-- Distributed CoGAPS Parameters -- 
nSets          6 
cut            25 
minNS          3 
maxNS          9 

14160 gene names provided
first gene name: FAM87B 

21914 sample names provided
first sample name: 56546_tube1_AAACCTGCACGGCGTT-1 

Creating subsets...
set sizes (min, mean, max): (3652, 3652.333, 3654)
Running Across Subsets...

    worker 2 is starting!
    worker 4 is starting!
Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:01:26)
    worker 1 is starting!
    worker 6 is starting!
    worker 3 is starting!
    worker 5 is starting!
-- Equilibration Phase --
5000 of 100000, Atoms: 100971(A), 52232(P), ChiSq: 468311360, Time: 03:28:08 / 206:51:55
10000 of 100000, Atoms: 132274(A), 61924(P), ChiSq: 466190848, Time: 09:57:29 / 271:50:33
15000 of 100000, Atoms: 147622(A), 64281(P), ChiSq: 465477408, Time: 17:28:09 / 302:57:29
20000 of 100000, Atoms: 157316(A), 65815(P), ChiSq: 465079392, Time: 25:23:51 / 319:39:54
25000 of 100000, Atoms: 165574(A), 67174(P), ChiSq: 464867552, Time: 33:37:39 / 330:19:17
30000 of 100000, Atoms: 172295(A), 68477(P), ChiSq: 464668896, Time: 42:07:54 / 338:07:31
35000 of 100000, Atoms: 179143(A), 69047(P), ChiSq: 464584608, Time: 50:52:01 / 344:12:38
40000 of 100000, Atoms: 184012(A), 69617(P), ChiSq: 464515392, Time: 59:49:42 / 349:18:55
45000 of 100000, Atoms: 189512(A), 69805(P), ChiSq: 464441792, Time: 68:54:57 / 353:19:43
50000 of 100000, Atoms: 196919(A), 68410(P), ChiSq: 464404192, Time: 78:14:55 / 357:11:07
55000 of 100000, Atoms: 206964(A), 64601(P), ChiSq: 464468192, Time: 87:40:43 / 360:20:54
60000 of 100000, Atoms: 808401(A), 12023(P), ChiSq: 500546240, Time: 96:09:45 / 359:07:47
65000 of 100000, Atoms: 1055385(A), 8298(P), ChiSq: 527779296, Time: 103:43:00 / 354:42:26
70000 of 100000, Atoms: 950700(A), 7337(P), ChiSq: 533742400, Time: 112:06:23 / 353:24:58
75000 of 100000, Atoms: 924196(A), 7082(P), ChiSq: 539779776, Time: 120:40:25 / 352:40:01
80000 of 100000, Atoms: 948739(A), 6947(P), ChiSq: 541298112, Time: 129:33:54 / 352:45:34
85000 of 100000, Atoms: 1000041(A), 6826(P), ChiSq: 543738112, Time: 138:59:54 / 354:05:43
90000 of 100000, Atoms: 1064006(A), 6708(P), ChiSq: 546889728, Time: 148:56:49 / 356:23:28
95000 of 100000, Atoms: 1251780(A), 8249(P), ChiSq: 534668288, Time: 159:05:50 / 358:46:42

add unit tests for random number generation

Cut thresholding for PatternMarkers is not working as intended

Calling patternMarkers with threshold = "cut" returns more genes as PatternMarkers than calling it with threshold = "all". My guess is that the cut-thresholded genes are simply being included as an extra element of the list returned by patternMarkers, but I don't know what "cut" is actually doing.

Example:
`library(CoGAPS)
data("GIST")
PMList <- patternMarkers(GIST.result, threshold = "all")
PMListCut <- patternMarkers(GIST.result, threshold = "cut")
lapply(PMList$PatternMarkers, length)
lapply(PMListCut$PatternMarkers, length)`

sub_func used as a namespace but defined as a class

Amatrix rownames repeated

When sampling with annotated weights to make subsets, the final CoGAPS result has repeated samples in the sample factor matrix. Should the sample weights be averaged by sample for each pattern? Thanks!

Eliminate warnings during build

Although the build errors have been fixed, the build process still generates multiple warnings.

patternMatcher isn't exported

patternMatcher is not exported in either the devel or the Bioconductor version.

bioconductor nightly build fails on Windows platform

##############################################################################
##############################################################################
###
### Running command:
###
###   chmod a+r CoGAPS -R && F:\biocbuild\bbs-3.18-bioc\R\bin\R.exe CMD build --keep-empty-dirs --no-resave-data CoGAPS
###
##############################################################################
##############################################################################


* checking for file 'CoGAPS/DESCRIPTION' ... OK
* preparing 'CoGAPS':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building 'CoGAPS.Rmd' using rmarkdown
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
10000 of 50000, Atoms: 68(A), 29(P), ChiSq: 329, Time: 00:00:00 / 00:00:00
20000 of 50000, Atoms: 74(A), 43(P), ChiSq: 161, Time: 00:00:00 / 00:00:00
30000 of 50000, Atoms: 84(A), 38(P), ChiSq: 141, Time: 00:00:00 / 00:00:00
40000 of 50000, Atoms: 80(A), 32(P), ChiSq: 143, Time: 00:00:01 / 00:00:02
50000 of 50000, Atoms: 81(A), 42(P), ChiSq: 117, Time: 00:00:01 / 00:00:02
-- Sampling Phase --
10000 of 50000, Atoms: 74(A), 43(P), ChiSq: 136, Time: 00:00:01 / 00:00:01
20000 of 50000, Atoms: 80(A), 37(P), ChiSq: 126, Time: 00:00:02 / 00:00:02
30000 of 50000, Atoms: 80(A), 41(P), ChiSq: 127, Time: 00:00:02 / 00:00:02
40000 of 50000, Atoms: 81(A), 38(P), ChiSq: 130, Time: 00:00:03 / 00:00:03
50000 of 50000, Atoms: 82(A), 44(P), ChiSq: 104, Time: 00:00:03 / 00:00:03
trying URL 'https://zenodo.org/record/7709664/files/inputdata.Rds?download=1'
Content type 'application/octet-stream' length 433262849 bytes (413.2 MB)
==================================================
downloaded 413.2 MB


Quitting from lines 235-240 [load single cell data from zenodo] (CoGAPS.Rmd)
Error: processing vignette 'CoGAPS.Rmd' failed with diagnostics:
error reading from connection
--- failed re-building 'CoGAPS.Rmd'

SUMMARY: processing the following file failed:
  'CoGAPS.Rmd'

Error: Vignette re-building failed.
Execution halted
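
The log above shows the 413.2 MB download completing and the failure happening afterwards with "error reading from connection". The vignette chunk itself is not shown here, so the following is only a sketch of a more defensive way to load a large remote .Rds file: download it to a local file in binary mode with a longer timeout, then read it from disk. `fetch_rds` and its parameters are hypothetical names; `download.file`, `options(timeout = )`, and `readRDS` are base R.

```r
# Hypothetical helper: download an .Rds file to disk, then read it locally.
# The 600 second default timeout is an illustrative choice, not a tested value.
fetch_rds <- function(url,
                      destfile = tempfile(fileext = ".Rds"),
                      timeout = 600) {
  # Raise the download timeout for this call only, then restore it
  old <- options(timeout = max(timeout, getOption("timeout")))
  on.exit(options(old), add = TRUE)

  # mode = "wb" keeps the file binary, which matters on Windows
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("download failed with status ", status)
  }
  readRDS(destfile)
}

# Usage with the URL from the log above (not run here):
# inputdata <- fetch_rds(
#   "https://zenodo.org/record/7709664/files/inputdata.Rds?download=1"
# )
```

Reading from a local file separates download problems from parsing problems, so a repeat of the failure would at least show which step is at fault.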

Package build fails on master

When trying to build the package from the master branch of the GitHub repo, the build fails with the following error:

   -- Sampling Phase --
   10000 of 50000, Atoms: 72(A), 40(P), ChiSq: 108, Time: 00:00:12 / 00:00:21
   20000 of 50000, Atoms: 77(A), 41(P), ChiSq: 115, Time: 00:00:14 / 00:00:20
   30000 of 50000, Atoms: 78(A), 36(P), ChiSq: 135, Time: 00:00:17 / 00:00:21
   40000 of 50000, Atoms: 73(A), 38(P), ChiSq: 111, Time: 00:00:19 / 00:00:21
   50000 of 50000, Atoms: 75(A), 43(P), ChiSq: 118, Time: 00:00:21 / 00:00:21
   
   Quitting from lines 243-248 [load single cell data from file] (CoGAPS.Rmd)
   Error: processing vignette 'CoGAPS.Rmd' failed with diagnostics:
   unknown input format
   --- failed re-building 'CoGAPS.Rmd'
   
   SUMMARY: processing the following file failed:
     'CoGAPS.Rmd'
   
   Error: Vignette re-building failed.

The Bioconductor build passes these lines just fine, so the .Rds files in the GitHub master branch may be corrupted (or may have been created with an incompatible R version).
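
One way to test this hypothesis is to inspect the files directly. The sketch below is plain base R; `check_rds` is a hypothetical helper, not part of CoGAPS, and the path in the usage comment is a placeholder. It reports the file size, whether the file starts with the gzip magic bytes that `saveRDS` writes with its default compression, and whether `readRDS` can read it at all.

```r
# Hypothetical helper: basic sanity checks on a file expected to be an .Rds.
check_rds <- function(path) {
  # First two bytes; gzip-compressed files start with 1f 8b.
  # A valid .Rds saved uncompressed or with bzip2/xz will not have them,
  # so gzip_header = FALSE alone does not prove the file is broken.
  magic <- readBin(path, what = "raw", n = 2L)
  gzip_header <- length(magic) == 2L &&
    identical(magic, as.raw(c(0x1f, 0x8b)))

  # Try to read the file and record success or failure instead of stopping
  readable <- tryCatch({
    readRDS(path)
    TRUE
  }, error = function(e) FALSE)

  list(size_bytes = file.size(path),
       gzip_header = gzip_header,
       readable = readable)
}

# Usage (placeholder path, not run here):
# check_rds("path/to/file.Rds")
```

A file that comes back with `readable = FALSE` would point at the file itself rather than at the vignette code, and an unexpectedly small `size_bytes` would suggest the checkout does not contain the real data.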

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] CoGAPS_3.19.1