Bayesian MCMC matrix factorization algorithm
Home Page: https://www.bioconductor.org/packages/release/bioc/html/CoGAPS.html
License: BSD 3-Clause "New" or "Revised" License
Hello! I am trying to run gene enrichment and I can't find the argument in calcCoGAPSStat that would accept a gene list. Any advice would be appreciated!
Here is the documentation from the CoGAPS package
Usage
calcCoGAPSStat(
object,
sets = NULL,
whichMatrix = "featureLoadings",
numPerm = 1000,
...
)
In many places throughout the software and documentation the columns of the data are referred to as "samples" and the rows of the data are referred to as "genes". Since CoGAPS can work on more types of data than just genomic data, it would be nice to come up with a new name.
It would be nice if CoGAPS could read data directly from an hdf5 (particularly h5ad) file.
I'm not certain what to set nPatterns to (in params for CoGAPS::CoGAPS()). Is there any harm (besides run time) in setting a large value for this, and then potentially pruning patterns with "low" amplitudes?
@sherman5 To accommodate the additional sparsity of single-cell data, a switch needs to be created to toggle between the current version of lambda (for bulk RNA-seq, etc.) and the version implemented in feature/cal_totalsum.
This has caused a few headaches in the past and in general seems to be vulnerable to errors based on architecture/compile flags.
Thanks for this wonderful work.
I tried to learn the pattern from dataset A then transfer to dataset B for annotation, but I am a bit confused about how to select the patterns, could you please give me some hints?
Currently there are no performance benchmarks. It would be helpful for all development to have some top-level benchmarks we can use to track how our changes are affecting performance. More detailed benchmarks can be done on a case-by-case basis.
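A top-level benchmark could start as small as a timing helper. The sketch below is only an illustration in standard C++: the name `meanMillis` is hypothetical and nothing here comes from the CoGAPS source.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: mean wall-clock milliseconds per call of `work`,
// measured over `reps` consecutive runs with a monotonic clock.
double meanMillis(const std::function<void()>& work, unsigned reps) {
    assert(reps > 0);
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < reps; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / reps;
}
```

Passing a lambda that runs a fixed-seed factorization on a small data set, and recording the result per commit, would give a coarse trend line. Finer-grained measurements would still need dedicated benchmarks.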
Here is a comment left by someone in the code, let's treat it properly here.
# TODO: fix this check for semisupervised case
# if (sum(object@featureLoadings < 0) > 0 | sum(object@loadingStdDev < 0) > 0)
# "negative values in feature Matrix"
# if (sum(object@sampleFactors < 0) > 0 | sum(object@factorStdDev < 0) > 0)
# "negative values in sample Matrix"
Add parameters for how often to create a backup and how many to store. Be able to start cogaps from a backup and end up at the same final run.
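One way to picture the two requested parameters is a small rotation policy: one value for how often to back up, one for how many backups to keep. This is a sketch under assumptions; `BackupPolicy` and its members are hypothetical names, and actually writing or restoring the sampler state is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical backup policy: take a backup every `interval` iterations
// and keep only the `maxStored` most recent ones.
class BackupPolicy {
public:
    BackupPolicy(unsigned interval, std::size_t maxStored)
        : mInterval(interval), mMaxStored(maxStored) {
        assert(interval > 0 && maxStored > 0);
    }

    // True when a backup is due at this (1-based) iteration.
    bool isDue(unsigned iteration) const {
        return iteration % mInterval == 0;
    }

    // Record a backup taken at `iteration`; drop the oldest beyond the cap.
    void record(unsigned iteration) {
        mStored.push_back(iteration);
        if (mStored.size() > mMaxStored) {
            mStored.pop_front();
        }
    }

    // Iterations of the backups currently kept, oldest first.
    const std::deque<unsigned>& stored() const { return mStored; }

private:
    unsigned mInterval;
    std::size_t mMaxStored;
    std::deque<unsigned> mStored;
};
```

With an interval of 250 and a cap of 2, a 1000-iteration run would end with backups from iterations 750 and 1000. Restarting and reaching the same final result would presumably also require saving the random number generator state alongside the matrices.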
Dear developers,
Thanks for the cool package you designed. I tried to follow the 'cogaps for seurat objects' tutorial. One thing that seems confusing is that the number of patterns I found was less than the 'nPatterns' I set. Is that normal, or did I miss something?
Another question is what normalization is best for running CoGAPS for single-cell data?
Thanks!
From the vignette + documentation, it's not entirely clear to me what the difference between distributed = "genome-wide" and distributed = "single-cell" is. What is changing? I note that for the SeuratWrapper vignette (https://htmlpreview.github.io/?https://github.com/satijalab/seurat-wrappers/blob/master/docs/cogaps.html) "genome-wide" is used despite handling scRNA-seq data - is this correct?
As my 3-hour CoGAPS run finished up, the following error appeared:
Matching Patterns Across Subsets...
Error in corcut(allPatterns, gapsParams@cut, gapsParams@minNS) :
NA values in correlation of patterns
In addition: Warning messages:
1: In checkInputs(data, uncertainty, allParams) :
running distributed cogaps without mtx/tsv/csv/gct data
2: In cor(allPatterns) : the standard deviation is zero
Is there a way to export part of the CoGAPS result during the run, so that the result isn't lost if this occurs?
Right now elements are accessed through a double**, which is extremely messy. Adding an at() method will make double** get_matrix() obsolete and make the matrix class much cleaner, allowing the underlying implementation to be changed.
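A minimal sketch of such an accessor, assuming row-major storage in a std::vector. The class layout and member names are illustrative and are not taken from the CoGAPS source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix class: callers use at() and never see the storage.
class Matrix {
public:
    Matrix(std::size_t nRow, std::size_t nCol)
        : mRows(nRow), mCols(nCol), mData(nRow * nCol, 0.0) {}

    // Mutable element access with a debug-mode bounds check.
    double& at(std::size_t r, std::size_t c) {
        assert(r < mRows && c < mCols);
        return mData[r * mCols + c];
    }

    // Read-only element access for const matrices.
    double at(std::size_t r, std::size_t c) const {
        assert(r < mRows && c < mCols);
        return mData[r * mCols + c];
    }

    std::size_t nRow() const { return mRows; }
    std::size_t nCol() const { return mCols; }

private:
    std::size_t mRows;
    std::size_t mCols;
    std::vector<double> mData; // contiguous, row-major; free to change later
};
```

Because every read and write goes through at(), the storage could later become column-major or aligned for SIMD without touching any calling code.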
Good afternoon,
Thank you so much for the detailed vignettes and papers for using CoGAPS. We are really finding it beneficial to our research. We are running CoGAPS using the Seurat Wrapper and find that we are getting this warning message:
3: In checkInputs(data, uncertainty, allParams) :
running distributed cogaps without mtx/tsv/csv/gct data
after running:
results <- RunCoGAPS(object, nPatterns = 25, nIterations = 100000, outputFrequency = 5000, sparseOptimization = TRUE, nThreads = 4, distributed = "genome-wide", singleCell = TRUE, seed = 891)
If we change the above code to add temp.file=TRUE into RunCoGAPS(), the message goes away. Upon searching, I found this is part of checkInputs.
Does this warning message affect the output of the function? What is the purpose of the temp.file?
Thank you for your time and consideration!
Hello
Is CoGAPS usable on proteomic data?
Simple fix that should have an impact on performance.
The automated build-check process yields test failures during check. This happens because the _R_CHECK_LIMIT_CORES_ environment variable, although usually unset, is set during CRAN-style checks (capping parallel workers at 2), and the package's parallel implementation ignores it.
Error: Error: _R_CHECK_LIMIT_CORES_' environment variable detected, BiocParallel
workers must be <= 2 was (4)
Here is the description from RShowDoc("R-ints")
_R_CHECK_LIMIT_CORES_
If set, check the usage of too many cores in package parallel. If set to ‘warn’ gives a warning, to ‘false’ or ‘FALSE’ the check is skipped, and any other non-empty value gives an error when more than 2 children are spawned. Default: unset (but ‘TRUE’ for CRAN submission checks).
We should add support for the _R_CHECK_LIMIT_CORES_ variable.
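The rule quoted from R-ints can be sketched as a small clamp on the requested worker count. This is an illustration only: `limitWorkers` and `limitWorkersFromEnv` are hypothetical names, and clamping to 2 (rather than warning or raising an error) is a design assumption made for the sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper. `value` is the raw content of _R_CHECK_LIMIT_CORES_,
// or nullptr when the variable is unset.
int limitWorkers(int requested, const char* value) {
    assert(requested >= 1);
    if (value == nullptr) {
        return requested; // unset: no limit applies
    }
    const std::string v(value);
    if (v.empty() || v == "false" || v == "FALSE") {
        return requested; // the check is skipped for these values
    }
    // "warn" or any other non-empty value: stay within 2 workers
    // (assumption: clamp silently instead of warning or failing).
    return requested > 2 ? 2 : requested;
}

// Convenience wrapper that reads the variable from the environment.
int limitWorkersFromEnv(int requested) {
    return limitWorkers(requested, std::getenv("_R_CHECK_LIMIT_CORES_"));
}
```

In the package itself the equivalent check would more likely live on the R side, before the BiocParallel workers are created.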
Stub for my PUMP-based features. I am currently merging my fork into a branch cloned from the most recent master (I cloned master because there was no develop yesterday).
Any time a random number is generated by calling randgen, the function has to go through a switch statement to determine which distribution it should generate the number from. It would be more efficient to have a function for each distribution and call those directly. This would also improve readability of the code.
double randgen(char rand_type, double para1, double para2) {
    switch (rand_type) {
        case 'U': {
            boost::random::uniform_01<boost::mt19937 &> zeroone(rng);
            return zeroone();
        }
        case 'N': {
            double mean = para1;
            double std_dev = para2;
            boost::normal_distribution<> nd(mean, std_dev);
            boost::variate_generator<boost::mt19937 &, boost::normal_distribution<> > norm_rnd(rng, nd);
            return norm_rnd();
        }
        case 'P': {
            double lambda = (para1 == 0 ? para2 : para1);
            boost::poisson_distribution<> pd(lambda);
            boost::variate_generator<boost::mt19937 &, boost::poisson_distribution<> > poisson_rnd(rng, pd);
            return poisson_rnd();
        }
        case 'E': {
            double lambda = (para1 == 0 ? para2 : para1);
            boost::exponential_distribution<> expd(lambda);
            boost::variate_generator<boost::mt19937 &, boost::exponential_distribution<> > exp_rnd(rng, expd);
            return exp_rnd();
        }
        default:
            return -9999.0;
    }
    // EJF -- return dummy value to avoid warning
    return -9999.0;
}
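The split into one function per distribution might look like the sketch below. It is an assumption-laden illustration: it uses the C++11 `<random>` header instead of Boost, the namespace `rng_sketch` and all function names are hypothetical, and the fixed seed is there only to keep the example deterministic.

```cpp
#include <cassert>
#include <random>

namespace rng_sketch {

// Single shared Mersenne Twister engine (assumption: fixed seed for the demo).
inline std::mt19937& engine() {
    static std::mt19937 e(42);
    return e;
}

// Uniform draw on [0, 1).
inline double uniform() {
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine());
}

// Normal draw with the given mean and standard deviation.
inline double normal(double mean, double stdDev) {
    assert(stdDev > 0.0);
    std::normal_distribution<double> dist(mean, stdDev);
    return dist(engine());
}

// Poisson draw with rate lambda; returns an integer count.
inline int poisson(double lambda) {
    assert(lambda > 0.0);
    std::poisson_distribution<int> dist(lambda);
    return dist(engine());
}

// Exponential draw with rate lambda.
inline double exponential(double lambda) {
    assert(lambda > 0.0);
    std::exponential_distribution<double> dist(lambda);
    return dist(engine());
}

} // namespace rng_sketch
```

Each call site then names the distribution it wants, so there is no character dispatch and no sentinel value such as -9999.0 for an unknown type. The `para1 == 0 ? para2 : para1` fallback from the original is deliberately not reproduced here.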
family="Helvetica-Narrow" in PatternMatcher.R errors on computers where the font is not installed. Needs to be removed from Bioconductor and all other versions ASAP.
This can have a crazy performance boost depending on how the compiler is handling vector multiplication.
Hello,
I recently came across this package and am really interested in trying it out. However, I am not entirely sure what I should do to preprocess my RNA-Seq data before passing it to CoGAPS. Would you mind listing the necessary steps? I tried using an rlog-transformed count matrix, which I have used for WGCNA, but then I got an error saying that the algorithm does not accept negative values.
Thank you in advance.
Best,
Mikhael
According to the most recent vignettes (found at https://www.bioconductor.org/packages/devel/bioc/vignettes/CoGAPS/inst/doc/CoGAPS.html and https://bioconductor.org/packages/release/bioc/vignettes/CoGAPS/inst/doc/CoGAPS.html), CoGAPS 3.8 and 3.9 should be the most recent versions. However, when I install from GitHub, 3.7 is the version that I get from both the "master" and "develop" branches. I am wondering if I am getting the wrong version, or if the vignettes are incorrect. I have also heard that there is an issue with the sparseOptimization flag, which I will set to "FALSE" for the time being. Let us know when this feature is ready to be used! Thanks so much!
Hi,
I have two data sets I wish to cross-compare, but instead of learning N patterns from an experiment, what I would like is to set the patterns manually (say, according to cell type) and generate coefficients etc. for these.
It isn't entirely obvious to me how this is done (indeed if it can) so any pointers would be appreciated.
Hello,
I found the idea of finding patterns and projecting them into similar datasets very promising in such a challenging part of scRNA-seq analysis as cell identification. I am trying to annotate my own data using CoGAPS and I found this tutorial that shows how to run CoGAPS on Seurat objects. The function that was used there is called RunCoGAPS, but I can't find it in the package documentation. I have several questions about that function:
thank you,
Yulia
Hello
I ran the vignette but I have some trouble interpreting the results.
The rows in the example matrix have a prefix Hs. followed by a number. Is this a gene annotation?
If I understood correctly, would the patterns be groups of correlated features? If so, where can I find which feature belongs to which pattern? I was not able to find it in the result object.
I'm using scCoGAPS and haven't been able to use the checkpoint saving steps. I found that checkpointsEnabled() is set to FALSE, with no clear way to change this. I've used BiocManager to install CoGAPS, as well as conda bioconductor to install CoGAPS on a Slurm Unix cluster. I've also installed on a Mac, and checkpointsEnabled() was FALSE in every case. Any thoughts on how to turn on checkpoints would be much appreciated.
The compile settings (representative of one of these installations) are below.
SIMD: AVX instructions enabled
Compiler did not support OpenMP
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /home/bnphan/miniconda3/envs/single-cell/lib/libopenblasp-r0.3.7.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] CoGAPS_3.6.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 cluster_2.1.0
[3] XVector_0.26.0 GenomicRanges_1.38.0
[5] BiocGenerics_0.32.0 zlibbioc_1.32.0
[7] IRanges_2.20.1 BiocParallel_1.20.1
[9] lattice_0.20-41 GenomeInfoDb_1.22.0
[11] caTools_1.18.0 tools_3.6.3
[13] SummarizedExperiment_1.16.0 parallel_3.6.3
[15] grid_3.6.3 rhdf5_2.30.1
[17] Biobase_2.46.0 KernSmooth_2.23-16
[19] gtools_3.8.2 matrixStats_0.56.0
[21] Matrix_1.2-18 GenomeInfoDbData_1.2.2
[23] Rhdf5lib_1.8.0 RColorBrewer_1.1-2
[25] S4Vectors_0.24.4 bitops_1.0-6
[27] SingleCellExperiment_1.8.0 RCurl_1.98-1.1
[29] gdata_2.18.0 DelayedArray_0.12.0
[31] compiler_3.6.3 gplots_3.0.3
[33] stats4_3.6.3
Nightly builds download a ~500 MB file from Zenodo, which sometimes causes the build to fail. Bioc reviewers, while reviewing our other package, proposed using BiocFileCache to download the file only once. Let's try following that proposal for CoGAPS as well.
The most recent Version 3.18.0 fails to install on my computer. Version 3.15.2 installs correctly.
I think this may be due to a C++ compiling error on Mac with M1 Pro chip.
See: https://stackoverflow.com/questions/65966969/why-does-march-native-not-work-on-apple-m1
Is there any way to change the clang version used when compiling CoGAPS?
My apologies in advance if I'm not using computing terms correctly! I'm new to CompBio.
Thanks!
Old packages: 'CoGAPS'
Update all/some/none? [a/s/n]:
a
Package which is only available in source form, and may need
compilation of C/C++/Fortran: ‘CoGAPS’
Do you want to attempt to install these from sources? (Yes/no/cancel) yes
installing the source package ‘CoGAPS’
trying URL 'https://bioconductor.org/packages/3.16/bioc/src/contrib/CoGAPS_3.18.0.tar.gz'
Content type 'application/x-gzip' length 20839969 bytes (19.9 MB)
==================================================
downloaded 19.9 MB
* installing *source* package ‘CoGAPS’ ...
** using staged installation
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether clang++ -arch arm64 -std=gnu++14 accepts -g... yes
checking how to run the C++ preprocessor... clang++ -arch arm64 -std=gnu++14 -E
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether clang++ -arch arm64 -std=gnu++14 accepts -g... (cached) yes
./configure: line 2752: AX_COMPILER_VENDOR: command not found
./configure: line 2753: AX_COMPILER_VERSION: command not found
./configure: line 2764: AX_OPENMP: command not found
building on compiler version
Using AVX instructions if available
configure: creating ./config.status
config.status: creating src/Makevars
** libs
clang++ -arch arm64 -std=gnu++14 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DBOOST_MATH_PROMOTE_DOUBLE_POLICY=0 -DGAPS_DISABLE_CHECKPOINTS -D__GAPS_R_BUILD__ -Iinclude -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/BH/include' -I/opt/R/arm64/include -march=native -fPIC -falign-functions=64 -Wall -g -O2 -c Cogaps.cpp -o Cogaps.o
clang: error: the clang compiler does not support '-march=native'
make: *** [Cogaps.o] Error 1
ERROR: compilation failed for package ‘CoGAPS’
Any rows or columns of the data that are duplicated when sampling the subsets in distributed cogaps are not collapsed back down. We can end up with multiple results for a single cell or a single gene because of this.
Line 342 in 241867b
to reproduce:
> library("CoGAPS")
>
> #pass mtx directly fails
> mtx <- Matrix::readMM("inst/extdata/GIST.mtx")
> class(mtx)
[1] "dgTMatrix"
attr(,"package")
[1] "Matrix"
> res <- CoGAPS(mtx)
Error in convertDataToMatrix(data) : unsupported data type
>
> #pass mtx as filepath works
> res2 <- CoGAPS("inst/extdata/GIST.mtx", messages = FALSE, nIterations=100)
This is CoGAPS version 3.22.0
Running Standard CoGAPS on inst/extdata/GIST.mtx (1363 genes and 9 samples)
Apologies for breaking your package by changing the returned value of Rhdf5lib::pkgconfig().
It looks like you've removed the dependency on this, and based on a cursory look I can't see where you need it, so that is probably the simplest solution. However, if you do need to link against Rhdf5lib you can update the entry in src/Makevars to be:
RHDF5_LIBS=$(shell echo 'Rhdf5lib::pkgconfig("PKG_CXX_LIBS")'|\
"${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS=$(RHDF5_LIBS)
I've reverted the change in BioC release, so no need to worry about fixing that before the deadline. Apologies again for doing this at such a crappy time.
Hi,
I'm trying to run cogaps for a sc-RNAseq dataset (5k genes x 7k cells), with this command:
k=70
cores=2
params <- new("CogapsParams")
params <- setParam(params, "nPatterns", k)
params <- setDistributedParams(params, nSets=cores)
#getParam(params, "nSets")
#set.seed(676)
result <- scCoGAPS(assays(scset)$logcounts,
params,
distributed="single-cell",
singleCell=TRUE,
BPPARAM=BiocParallel::MulticoreParam(workers=cores),
messages=T)
Running time is quite high
setting distributed parameters - call this again if you change nPatterns
This is CoGAPS version 3.3.40
Running single-cell CoGAPS on 5000 genes and 7133 samples with parameters:
-- Standard Parameters --
nPatterns 70
nIterations 5000
seed 295
singleCell TRUE
sparseOptimization FALSE
distributed single-cell
-- Sparsity Parameters --
alpha 0.01
maxGibbsMass 100
-- Distributed CoGAPS Parameters --
nSets 2
cut 70
minNS 1
maxNS 3
Creating subsets...
set sizes (min, mean, max): (3566, 3566.5, 3567)
Running Across Subsets...
Loading Data...Done! (00:00:02)
worker 1 is starting!
worker 2 is starting!
-- Calibration Phase --
500 of 5000, Atoms: 74391(74157), ChiSq: 136910144, Time: 00:20:43 / 10:52:57
1000 of 5000, Atoms: 82704(92591), ChiSq: 135660368, Time: 00:48:20 / 11:12:02
1500 of 5000, Atoms: 90865(102149), ChiSq: 135199152, Time: 01:20:04 / 11:34:25
2000 of 5000, Atoms: 93763(115371), ChiSq: 134509280, Time: 01:52:35 / 11:40:22
2500 of 5000, Atoms: 97406(110010), ChiSq: 135169216, Time: 02:26:07 / 11:43:22
3000 of 5000, Atoms: 98089(116037), ChiSq: 134589984, Time: 03:01:02 / 11:47:17
3500 of 5000, Atoms: 96612(122988), ChiSq: 134334176, Time: 03:35:33 / 11:46:17
4000 of 5000, Atoms: 97890(122860), ChiSq: 134350288, Time: 04:11:01 / 11:46:30
4500 of 5000, Atoms: 93855(120010), ChiSq: 134670336, Time: 04:46:07 / 11:44:25
5000 of 5000, Atoms: 94880(124085), ChiSq: 134414736, Time: 05:19:54 / 11:38:54
-- Sampling Phase --
500 of 5000, Atoms: 94504(122858), ChiSq: 134543696, Time: 05:53:40 / 11:33:37
1000 of 5000, Atoms: 93654(128185), ChiSq: 134323920, Time: 06:28:42 / 11:30:54
1500 of 5000, Atoms: 95538(127658), ChiSq: 134265616, Time: 07:02:27 / 11:26:00
2000 of 5000, Atoms: 96887(126601), ChiSq: 134209024, Time: 07:37:26 / 11:23:14
However, as suggested by the vignettes, I only split the computation between 2 cores, to keep a consistent number of cells/genes. Would you recommend increasing the number of cores?
Also, I've noticed that the default nIterations changed between the Bioconductor package version (around 500 I believe) and the latest GitHub dev version (5000). What about choosing a value halfway, like 2000? Would it dramatically affect the results?
Thank you for your time,
Giovanni
Instead of a single, monolithic vignette, it is more useful to have multiple vignettes that focus on different user objectives with CoGAPS. Some ideas:
Branch: revamp_vignettes
Good afternoon! I recently ran into an issue where there is a pattern discrepancy between runs with sparseOptimization set to TRUE versus FALSE. The code I ran and the output are below. With sparseOptimization set to TRUE, I noticed that the ChiSq value was -nan and, during the equilibration phase of the final stage, the P matrix had 0 atoms. With sparseOptimization set to FALSE there seemed to be no problems; however, the number of patterns learned differed between the two cases, i.e. sparseOptimization = TRUE gave 5 patterns while sparseOptimization = FALSE gave 6 patterns. This was true for the range of nPatterns values that I ran (5-50).
SPARSE OPTIMIZATION ENABLED
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42,
sparseOptimization=TRUE,
distributed="genome-wide")
params <- setDistributedParams(params, nSets=6)
Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)
This is CoGAPS version 3.19.1
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:
-- Standard Parameters --
nPatterns 5
nIterations 30000
seed 42
sparseOptimization TRUE
distributed genome-wide
-- Sparsity Parameters --
alpha 0.01
maxGibbsMass 100
-- Distributed CoGAPS Parameters --
nSets 6
cut 5
minNS 3
maxNS 9
Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...
Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
worker 1 is starting!
worker 2 is starting!
worker 4 is starting!
worker 6 is starting!
worker 3 is starting!
worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 13376(A), 1242(P), ChiSq: -nan, Time: 00:00:45 / 01:16:13
...
30000 of 30000, Atoms: 20636(A), 1461(P), ChiSq: -nan, Time: 00:35:40 / 01:16:38
-- Sampling Phase --
1000 of 30000, Atoms: 20671(A), 1460(P), ChiSq: -nan, Time: 00:36:54 / 01:16:28
...
29000 of 30000, Atoms: 20645(A), 1469(P), ChiSq: -nan, Time: 01:12:07 / 01:13:27
worker 2 is finished! Time: 01:12:22
30000 of 30000, Atoms: 20670(A), 1484(P), ChiSq: -nan, Time: 01:13:21 / 01:13:21
worker 1 is finished! Time: 01:13:21
worker 3 is finished! Time: 01:13:24
worker 5 is finished! Time: 01:15:26
worker 4 is finished! Time: 01:15:26
worker 6 is finished! Time: 01:19:08
Matching Patterns Across Subsets...
Running Final Stage...
Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
worker 1 is starting!
worker 2 is starting!
worker 6 is starting!
worker 4 is starting!
worker 3 is starting!
worker 5 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 10022(A), 0(P), ChiSq: -nan, Time: 00:00:27 / 00:45:43
...
30000 of 30000, Atoms: 15174(A), 0(P), ChiSq: -nan, Time: 00:47:13 / 00:47:13
worker 1 is finished! Time: 00:47:13
worker 2 is finished! Time: 00:47:28
worker 5 is finished! Time: 00:47:34
Warning message:
In checkInputs(data, uncertainty, allParams) :
running distributed cogaps without mtx/tsv/csv/gct data
SPARSE OPTIMIZATION DISABLED
params <- CogapsParams(nPatterns=5, nIterations=30000, seed=42,
distributed="genome-wide")
params <- setDistributedParams(params, nSets=6)
Hoxd10_matnp5 <- CoGAPS(Hoxd10_mat, params)
This is CoGAPS version 3.19.1
Running genome-wide CoGAPS on Hoxd10_mat (30407 genes and 380 samples) with parameters:
-- Standard Parameters --
nPatterns 5
nIterations 30000
seed 42
sparseOptimization FALSE
distributed genome-wide
-- Sparsity Parameters --
alpha 0.01
maxGibbsMass 100
-- Distributed CoGAPS Parameters --
nSets 6
cut 5
minNS 3
maxNS 9
Creating subsets...
set sizes (min, mean, max): (5067, 5067.833, 5072)
Running Across Subsets...
worker 2 is starting!
worker 3 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
worker 1 is starting!
worker 4 is starting!
worker 5 is starting!
worker 6 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 4665(A), 966(P), ChiSq: 5137063, Time: 00:01:16 / 02:08:43
...
30000 of 30000, Atoms: 9933(A), 2460(P), ChiSq: 4886798, Time: 00:49:52 / 01:47:09
-- Sampling Phase --
1000 of 30000, Atoms: 10033(A), 2514(P), ChiSq: 4886740, Time: 00:51:31 / 01:46:45
...
30000 of 30000, Atoms: 9953(A), 2489(P), ChiSq: 4886819, Time: 01:34:05 / 01:34:05
worker 1 is finished! Time: 01:34:05
worker 5 is finished! Time: 01:44:52
worker 4 is finished! Time: 01:54:06
worker 2 is finished! Time: 01:54:29
worker 6 is finished! Time: 01:54:31
worker 3 is finished! Time: 01:54:38
Matching Patterns Across Subsets...
Running Final Stage...
worker 5 is starting!
worker 4 is starting!
worker 3 is starting!
worker 2 is starting!
worker 6 is starting!
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
worker 1 is starting!
-- Equilibration Phase --
1000 of 30000, Atoms: 5928(A), 0(P), ChiSq: 14908930, Time: 00:00:10 / 00:16:56
...
30000 of 30000, Atoms: 10469(A), 0(P), ChiSq: 14908930, Time: 00:08:47 / 00:18:52
-- Sampling Phase --
1000 of 30000, Atoms: 10403(A), 0(P), ChiSq: 14908930, Time: 00:09:00 / 00:18:39
...
30000 of 30000, Atoms: 10379(A), 0(P), ChiSq: 14908930, Time: 00:15:17 / 00:15:17
worker 1 is finished! Time: 00:15:17
worker 5 is finished! Time: 00:16:29
worker 3 is finished! Time: 00:19:47
worker 2 is finished! Time: 00:20:37
worker 4 is finished! Time: 00:20:38
worker 6 is finished! Time: 00:20:45
Warning message:
In checkInputs(data, uncertainty, allParams) :
running distributed cogaps without mtx/tsv/csv/gct data
After obtaining the patterns, I ran patternMarkers on patterns learned with sparseOptimization = TRUE. When I set threshold = “all”, I would get this error.
test <- patternMarkers_all(Hoxd10_matnp5, threshold = "all")
Error in colnames(markerScores)[apply(markerScores, 1, which.min)] :
invalid subscript type 'list'
This error would not trigger when threshold was set to “cut”.
PatternMarkers worked normally when run on patterns learned without sparseOptimization.
UPDATE @dimalvovs - delete rows for readability
Good afternoon! I have recently run into large changes in the atom counts for the (A) and (P) matrices when running CoGAPS across multiple nSets. In a previous issue, these numbers were mentioned as a metric to determine whether CoGAPS had stabilized, so I found this a bit concerning. Can you help me understand what is happening here, and point me to a solution? Thank you in advance! (The full parameters are included to help.)
-- Standard Parameters --
nPatterns 25
nIterations 1e+05
seed 891
sparseOptimization TRUE
distributed single-cell
-- Sparsity Parameters --
alpha 0.01
maxGibbsMass 100
-- Distributed CoGAPS Parameters --
nSets 6
cut 25
minNS 3
maxNS 9
[1] 33650 21914
[1] 14160 21914
This is CoGAPS version 3.14.0
Running single-cell CoGAPS on /tmp/90251.tmpdir/Rtmp31EVRF/file62dc263e1.mtx (14160 genes and 21914 samples) with parameters:
-- Standard Parameters --
nPatterns 25
nIterations 1e+05
seed 891
sparseOptimization TRUE
distributed single-cell
-- Sparsity Parameters --
alpha 0.01
maxGibbsMass 100
-- Distributed CoGAPS Parameters --
nSets 6
cut 25
minNS 3
maxNS 9
14160 gene names provided
first gene name: FAM87B
21914 sample names provided
first sample name: 56546_tube1_AAACCTGCACGGCGTT-1
Creating subsets...
set sizes (min, mean, max): (3652, 3652.333, 3654)
Running Across Subsets...
worker 2 is starting!
worker 4 is starting!
Data Model: Sparse, Normal
Sampler Type: Sequential
Loading Data...Done! (00:01:26)
worker 1 is starting!
worker 6 is starting!
worker 3 is starting!
worker 5 is starting!
-- Equilibration Phase --
5000 of 100000, Atoms: 100971(A), 52232(P), ChiSq: 468311360, Time: 03:28:08 / 206:51:55
10000 of 100000, Atoms: 132274(A), 61924(P), ChiSq: 466190848, Time: 09:57:29 / 271:50:33
15000 of 100000, Atoms: 147622(A), 64281(P), ChiSq: 465477408, Time: 17:28:09 / 302:57:29
20000 of 100000, Atoms: 157316(A), 65815(P), ChiSq: 465079392, Time: 25:23:51 / 319:39:54
25000 of 100000, Atoms: 165574(A), 67174(P), ChiSq: 464867552, Time: 33:37:39 / 330:19:17
30000 of 100000, Atoms: 172295(A), 68477(P), ChiSq: 464668896, Time: 42:07:54 / 338:07:31
35000 of 100000, Atoms: 179143(A), 69047(P), ChiSq: 464584608, Time: 50:52:01 / 344:12:38
40000 of 100000, Atoms: 184012(A), 69617(P), ChiSq: 464515392, Time: 59:49:42 / 349:18:55
45000 of 100000, Atoms: 189512(A), 69805(P), ChiSq: 464441792, Time: 68:54:57 / 353:19:43
50000 of 100000, Atoms: 196919(A), 68410(P), ChiSq: 464404192, Time: 78:14:55 / 357:11:07
55000 of 100000, Atoms: 206964(A), 64601(P), ChiSq: 464468192, Time: 87:40:43 / 360:20:54
60000 of 100000, Atoms: 808401(A), 12023(P), ChiSq: 500546240, Time: 96:09:45 / 359:07:47
65000 of 100000, Atoms: 1055385(A), 8298(P), ChiSq: 527779296, Time: 103:43:00 / 354:42:26
70000 of 100000, Atoms: 950700(A), 7337(P), ChiSq: 533742400, Time: 112:06:23 / 353:24:58
75000 of 100000, Atoms: 924196(A), 7082(P), ChiSq: 539779776, Time: 120:40:25 / 352:40:01
80000 of 100000, Atoms: 948739(A), 6947(P), ChiSq: 541298112, Time: 129:33:54 / 352:45:34
85000 of 100000, Atoms: 1000041(A), 6826(P), ChiSq: 543738112, Time: 138:59:54 / 354:05:43
90000 of 100000, Atoms: 1064006(A), 6708(P), ChiSq: 546889728, Time: 148:56:49 / 356:23:28
95000 of 100000, Atoms: 1251780(A), 8249(P), ChiSq: 534668288, Time: 159:05:50 / 358:46:42
Using the "cut" threshold for the patternMarkers function returns more genes as PatternMarkers than using the "all" threshold. I would guess that the cut thresholded genes are simply being included as an element of the list returned by the patternMarkers function, and I don't know what "cut" is doing.
Example:
data("GIST")
PMList <- patternMarkers(GIST.result, threshold = "all")
PMListCut <- patternMarkers(GIST.result, threshold = "cut")
lapply(PMList$PatternMarkers, length)
lapply(PMListCut$PatternMarkers, length)
When sampling with annotated weights to make subsets, the final CoGAPS result has repeated samples in the Sample Factor Matrix. Should the sample weights be averaged by sample for each pattern? Thanks!
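Averaging is one possible way to collapse repeated samples, and it can be sketched as follows. This is a hypothetical illustration, not the package's behavior: `averageDuplicateRows` is an invented name, each row stands for one sample's weights across patterns, and whether a mean is the right summary is exactly the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: average all rows that share a sample name,
// returning one row of per-pattern weights per unique name.
std::map<std::string, std::vector<double>> averageDuplicateRows(
    const std::vector<std::string>& names,
    const std::vector<std::vector<double>>& rows)
{
    assert(names.size() == rows.size());
    std::map<std::string, std::vector<double>> sums;
    std::map<std::string, std::size_t> counts;

    // Accumulate per-pattern sums and occurrence counts for each name.
    for (std::size_t i = 0; i < rows.size(); ++i) {
        std::vector<double>& acc = sums[names[i]];
        if (acc.empty()) {
            acc.assign(rows[i].size(), 0.0);
        }
        assert(acc.size() == rows[i].size());
        for (std::size_t j = 0; j < rows[i].size(); ++j) {
            acc[j] += rows[i][j];
        }
        ++counts[names[i]];
    }

    // Turn the sums into means.
    for (auto& entry : sums) {
        const double n = static_cast<double>(counts[entry.first]);
        for (double& x : entry.second) {
            x /= n;
        }
    }
    return sums;
}
```

For example, two rows named "cellA" with weights (1, 2) and (3, 4) would collapse to a single "cellA" row (2, 3), while a name that appears once is returned unchanged.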
In either the devel or bioconductor versions.
##############################################################################
##############################################################################
###
### Running command:
###
### chmod a+r CoGAPS -R && F:\biocbuild\bbs-3.18-bioc\R\bin\R.exe CMD build --keep-empty-dirs --no-resave-data CoGAPS
###
##############################################################################
##############################################################################
* checking for file 'CoGAPS/DESCRIPTION' ... OK
* preparing 'CoGAPS':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building 'CoGAPS.Rmd' using rmarkdown
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data...Done! (00:00:00)
-- Equilibration Phase --
10000 of 50000, Atoms: 68(A), 29(P), ChiSq: 329, Time: 00:00:00 / 00:00:00
20000 of 50000, Atoms: 74(A), 43(P), ChiSq: 161, Time: 00:00:00 / 00:00:00
30000 of 50000, Atoms: 84(A), 38(P), ChiSq: 141, Time: 00:00:00 / 00:00:00
40000 of 50000, Atoms: 80(A), 32(P), ChiSq: 143, Time: 00:00:01 / 00:00:02
50000 of 50000, Atoms: 81(A), 42(P), ChiSq: 117, Time: 00:00:01 / 00:00:02
-- Sampling Phase --
10000 of 50000, Atoms: 74(A), 43(P), ChiSq: 136, Time: 00:00:01 / 00:00:01
20000 of 50000, Atoms: 80(A), 37(P), ChiSq: 126, Time: 00:00:02 / 00:00:02
30000 of 50000, Atoms: 80(A), 41(P), ChiSq: 127, Time: 00:00:02 / 00:00:02
40000 of 50000, Atoms: 81(A), 38(P), ChiSq: 130, Time: 00:00:03 / 00:00:03
50000 of 50000, Atoms: 82(A), 44(P), ChiSq: 104, Time: 00:00:03 / 00:00:03
trying URL 'https://zenodo.org/record/7709664/files/inputdata.Rds?download=1'
Content type 'application/octet-stream' length 433262849 bytes (413.2 MB)
==================================================
downloaded 413.2 MB
Quitting from lines 235-240 [load single cell data from zenodo] (CoGAPS.Rmd)
Error: processing vignette 'CoGAPS.Rmd' failed with diagnostics:
error reading from connection
--- failed re-building 'CoGAPS.Rmd'
SUMMARY: processing the following file failed:
'CoGAPS.Rmd'
Error: Vignette re-building failed.
Execution halted
When trying to build the package from master branch of github repo, the build fails with the following error:
-- Sampling Phase --
10000 of 50000, Atoms: 72(A), 40(P), ChiSq: 108, Time: 00:00:12 / 00:00:21
20000 of 50000, Atoms: 77(A), 41(P), ChiSq: 115, Time: 00:00:14 / 00:00:20
30000 of 50000, Atoms: 78(A), 36(P), ChiSq: 135, Time: 00:00:17 / 00:00:21
40000 of 50000, Atoms: 73(A), 38(P), ChiSq: 111, Time: 00:00:19 / 00:00:21
50000 of 50000, Atoms: 75(A), 43(P), ChiSq: 118, Time: 00:00:21 / 00:00:21
Quitting from lines 243-248 [load single cell data from file] (CoGAPS.Rmd)
Error: processing vignette 'CoGAPS.Rmd' failed with diagnostics:
unknown input format
--- failed re-building 'CoGAPS.Rmd'
SUMMARY: processing the following file failed:
'CoGAPS.Rmd'
Error: Vignette re-building failed.
The bioconductor build passes these lines just fine, so it may mean that .Rds files are corrupted in github master branch (or created with an incompatible R version).
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] CoGAPS_3.19.1