cole-trapnell-lab / monocle-release


R 99.39% C++ 0.53% Shell 0.08%


monocle-release's Issues

clusterCells throws error if max_components < length(clustering_genes)

Looks like it's coming from the call to kmeans in reduceDimension(). Might be better to reset max_components if few clustering genes are specified, or at least to stop with a more informative error message.
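A guard of the kind suggested here could look like the sketch below. It assumes the failure arises when fewer clustering genes are supplied than components requested, as the body of the report suggests. The helper name `check_max_components` and its arguments are hypothetical and not part of monocle; the sketch only illustrates clamping the value with a warning instead of failing later inside kmeans.

```r
# Hypothetical helper (not part of monocle): clamp the requested number of
# components to the number of clustering genes, warning when it changes.
check_max_components <- function(max_components, clustering_genes) {
  n_genes <- length(clustering_genes)
  if (n_genes < 1) {
    stop("No clustering genes were supplied.")
  }
  if (max_components > n_genes) {
    warning(sprintf(
      "max_components (%d) exceeds the number of clustering genes (%d); using %d.",
      max_components, n_genes, n_genes
    ))
    max_components <- n_genes
  }
  max_components
}

# Three genes but ten components requested: the value is reduced to 3.
adjusted <- suppressWarnings(check_max_components(10, c("MYOG", "CDK1", "ID1")))
```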

example files for tutorial section 2.1

I am trying to run your tutorial. In section 2.1 you are using three text files:

fpkm_matrix.txt
cell_sample_sheet.txt
gene_annotations.txt

Did you upload the example files you used somewhere? I couldn't find them on github.

Thanks!

newCellDataSet with sparseMatrix

I've been having issues getting newCellDataSet to accept a sparseMatrix (and have had this issue since the initial release of monocle 2). Below is my code and the error message I get each time I try to run it.

bc1 <- newCellDataSet(as(as.matrix(expression.data1),"sparseMatrix"), phenoData = pd1, lowerDetectionLimit=1,expressionFamily = negbinomial.size())

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘annotatedDataFrameFrom’ for signature ‘"dgCMatrix"’

My R version is 3.3.1 and monocle version 2.2.0.
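The error names `annotatedDataFrameFrom`, and the call above passes phenoData but no featureData. One thing to try, offered as an assumption about the cause and not a confirmed fix, is to supply both annotation tables explicitly so that no default has to be derived from the sparse matrix. The toy matrix and the names `expr`, `pheno` and `feature` below are illustrative only.

```r
library(Matrix)  # ships with R; provides the sparse dgCMatrix class

# Toy counts standing in for expression.data1 (3 genes x 4 cells).
expr <- Matrix(c(0, 2, 0, 5, 1, 0, 0, 3, 0, 0, 4, 0),
               nrow = 3, ncol = 4, sparse = TRUE,
               dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Plain data frames whose row names match the matrix dimnames.
pheno   <- data.frame(cell_id = colnames(expr), row.names = colnames(expr))
feature <- data.frame(gene_short_name = rownames(expr),
                      row.names = rownames(expr))

# Only runs when monocle is installed; attaching it also attaches Biobase.
if (requireNamespace("monocle", quietly = TRUE)) {
  library(monocle)
  pd <- new("AnnotatedDataFrame", data = pheno)
  fd <- new("AnnotatedDataFrame", data = feature)
  cds <- newCellDataSet(expr, phenoData = pd, featureData = fd,
                        lowerDetectionLimit = 1,
                        expressionFamily = negbinomial.size())
}
```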

Error in UseMethod("depth") : no applicable method for 'depth' applied to an object of class "NULL"

When I typed the following command I got that error:

plot_cell_clusters(HSMM,1,2,color="cell_class")
Error in UseMethod("depth") :
no applicable method for 'depth' applied to an object of class "NULL"

And a blank figure popped up.

However, I typed it again and a normal plot appeared.

Any ideas about that?

markers argument for plot_cell_trajectory does not work

When plotting a cell trajectory with plot_cell_trajectory, I tried to vary the size of the cell markers by passing a gene name (a row name of my expression matrix) to the "markers" argument, but the size of the cell markers stayed about the same.
In fact, when I pass a nonsensical string for the argument, the plot still proceeds without any error message, as if the argument were entirely ignored.

My session info is as such:
`R version 3.4.0 Patched (2017-05-31 r72753)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] splines stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] HSMMSingleCell_0.110.0 plyr_1.8.4 scran_1.4.4 BiocParallel_1.10.1 SC3_1.4.2
[6] scater_1.3.35 monocle_2.4.0 DDRTree_0.1.5 irlba_2.2.1 VGAM_1.0-3
[11] ggplot2_2.2.1 Biobase_2.36.2 BiocGenerics_0.22.0 Matrix_1.2-10 BiocInstaller_1.26.0

loaded via a namespace (and not attached):
[1] bitops_1.0-6 matrixStats_0.52.2 doParallel_1.0.10 RColorBrewer_1.1-2 dynamicTreeCut_1.63-1
[6] tools_3.4.0 doRNG_1.6.6 DT_0.2 R6_2.2.1 KernSmooth_2.23-15
[11] vipor_0.4.5 DBI_0.6-1 lazyeval_0.2.0 colorspace_1.3-2 gridExtra_2.2.1
[16] compiler_3.4.0 pkgmaker_0.22 labeling_0.3 slam_0.1-40 caTools_1.17.1
[21] scales_0.4.1 mvtnorm_1.0-6 DEoptimR_1.0-8 robustbase_0.92-7 proxy_0.4-17
[26] stringr_1.2.0 digest_0.6.12 rrcov_1.4-3 htmltools_0.3.6 WriteXLS_4.0.0
[31] limma_3.32.2 htmlwidgets_0.8 rlang_0.1.1 RSQLite_1.1-2 FNN_1.1
[36] shiny_1.0.3 zoo_1.8-0 combinat_0.0-8 gtools_3.5.0 dplyr_0.5.0
[41] RCurl_1.95-4.8 magrittr_1.5 Rcpp_0.12.11 ggbeeswarm_0.5.3 munsell_0.4.3
[46] S4Vectors_0.14.3 viridis_0.4.0 stringi_1.1.5 edgeR_3.18.1 zlibbioc_1.22.0
[51] rhdf5_2.20.0 gplots_3.0.1 Rtsne_0.13 grid_3.4.0 gdata_2.17.0
[56] shinydashboard_0.6.0 lattice_0.20-35 locfit_1.5-9.1 igraph_1.0.1 rjson_0.2.15
[61] rngtools_1.2.4 reshape2_1.4.2 codetools_0.2-15 biomaRt_2.32.0 XML_3.98-1.7
[66] data.table_1.10.4 httpuv_1.3.3 foreach_1.4.3 gtable_0.2.0 assertthat_0.2.0
[71] mime_0.5 xtable_1.8-2 e1071_1.6-8 pcaPP_1.9-61 class_7.3-14
[76] qlcMatrix_0.9.5 viridisLite_0.2.0 tibble_1.3.3 pheatmap_1.0.8 iterators_1.0.8
[81] AnnotationDbi_1.38.1 registry_0.3 beeswarm_0.2.3 memoise_1.1.0 IRanges_2.10.2
[86] tximport_1.4.0 cluster_2.0.6 fastICA_1.2-0 densityClust_0.2.1 statmod_1.4.29
[91] ROCR_1.0-7 `

object 'disp_table' not found when estimateDispersions(cds)

I am taking the averages of clusters of cells and running monocle with them.

cluster.gene.avg.list = list()
for(n in unique(km$km.cluster)){
  cluster = scaled.data[, km$km.cluster %in% n]; 
  cluster.gene.avg.list[[n]] <- rowMeans(cluster)
}
cluster.avg = do.call(cbind, cluster.gene.avg.list)
colnames(cluster.avg) = paste0("C", colnames(cluster.avg))

cluster.avg looks something like this:

> head(cluster.avg[,1:5])
                 C16          C14          C3         C40          C4
A4GALT   -0.15852378 -0.268434779 -0.07329452  0.05332599 -0.05781838
AA413626 -0.06237282  0.007022857 -0.06237282 -0.06237282  0.03301530
AA414768  0.12676744 -0.054279647 -0.13331964 -0.05369653 -0.12118771
AA465934 -0.13563968 -0.209988851 -0.31531591 -0.24636586  0.08446100
AA987161 -0.13853327 -0.209409869  0.04191605 -0.22809665 -0.02878748
AAAS     -0.11055740 -0.167985646 -0.14395537 -0.16066358 -0.03010135

Looks okay but then...


library(monocle)
pheno <- data.frame(row.names=colnames(cluster.avg), Cluster=colnames(cluster.avg))
cds <- newCellDataSet(as.matrix(cluster.avg),
                      phenoData=new("AnnotatedDataFrame", data=pheno), 
                      featureData=new("AnnotatedDataFrame", data=gene.design))
cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)

returns the error:

Error in estimateDispersionsForCellDataSet(object, modelFormulaStr, relative_expr,  : 
  object 'disp_table' not found

If I add expressionFamily=negbinomial() to newCellDataSet then the error changes to:

Error in `[.data.frame`(`*tmp*`, res$mu == 0) : 
  undefined columns selected

Any help would be much appreciated.

EDIT: I should mention that gene.design is a data.frame with gene information:

               Mean Disperson.Raw Disperson.Norm Pct.Cells.Exp
A4GALT   0.19691749      1.345950      0.3952008         0.052
AA413626 0.01126692      1.287438      0.1910357         0.003
AA414768 0.32179354      1.525643      0.6702481         0.084
AA465934 0.20542904      1.040429     -0.6708513         0.058
AA987161 0.25788802      1.354202      0.4239956         0.070
AAAS     0.35056009      1.154092     -0.7676060         0.105

error from estimateDispersions

pd = new("AnnotatedDataFrame", data=cell)
fd = new("AnnotatedDataFrame", data=gene)
test = newCellDataSet(data,phenoData = pd, featureData = fd, expressionFamily=negbinomial.size())
test = estimateSizeFactors(test)
test = estimateDispersions(test)

This yields the error:

Error in `[.data.frame`(`*tmp*`, res$mu == 0) :
undefined columns selected
In addition: Warning message:
Deprecated, use tibble::rownames_to_column() instead.

Any particular reason for this? Thanks.

What is the parameter of "reverse" in orderCells function?

Hi, we are doing pseudotime analysis on single-cell RNA-seq data. The ordering looks good based on the expression of known marker genes, but the beginning and ending states are flipped. We tried to run the orderCells function with the reverse parameter, but we don't know what value it expects. We've tried reverse=true, reverse=1, and some other values we could think of, but none of them work. Could you please help us with that? Thanks!
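One general R point that may be relevant: logical constants are spelled in upper case (`TRUE`, `FALSE`), and lower-case `true` is looked up as an ordinary variable. Assuming `reverse` expects an R logical, which is not confirmed here against the monocle source, the call would be `orderCells(cds, reverse = TRUE)`. The function `flip_order` below is a hypothetical stand-in used only to show the difference.

```r
# Hypothetical stand-in for a function with a logical `reverse` flag.
flip_order <- function(pseudotime, reverse = FALSE) {
  if (!is.logical(reverse) || length(reverse) != 1) {
    stop("`reverse` must be a single TRUE or FALSE")
  }
  if (reverse) max(pseudotime) - pseudotime else pseudotime
}

pt <- c(0, 2.5, 7, 10)
flipped <- flip_order(pt, reverse = TRUE)   # upper-case TRUE is a logical

# Lower-case `true` is not a constant in R; evaluating it raises an error
# because no object of that name exists.
lowercase_result <- tryCatch(flip_order(pt, reverse = true),
                             error = function(e) conditionMessage(e))
```

If upper-case `TRUE` gives the same result as before, the cause lies elsewhere.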

progenitor_method="omit" for buildBranchCellDataSet

Currently the buildBranchCellDataSet function in BEAM() and plot_genes_branched_heatmap() handles the progenitor branch with either the "duplicate" or "sequential" progenitor_method option. It may be nice to have an option that omits the progenitor branch and only models from the selected branch point forward.

I've found that the row clustering in the branched heatmap can sometimes be driven by expression dynamics in the progenitor branch. Sometimes it may be more appropriate to focus solely on expression patterns after the branch point (e.g. up in Branch A, down in Branch B).

dispersionTable not found

Hello,

I installed monocle through bioconductor. However, when I ran the code in the vignette

disp_table <- dispersionTable(HSMM)

An error occurred
Error: could not find function "dispersionTable"

My R is version 3.2.5

Can you help me to fix this ?

error message when I am using estimateDispersions() function.

Thank you for developing such nice tools for analyzing scRNA-seq data. I enjoyed using monocle 1. I have now reinstalled monocle to try Census counts and to visualize data with monocle.

However, I am getting errors that I didn't have previously.
Below are the error message from the estimateDispersions() function and my sessionInfo().

load("Xerr.RData") # npX : scRNA-seq expression data, pd = pheno, AnnotatedDataFrame, fd = feature, AnnotatedDataFrame.
pXX <- newCellDataSet(npX, phenoData = pd, featureData = fd)

rpc_matrix <- relative2abs(pXX)

pXX <- newCellDataSet(as(as.matrix(rpc_matrix), "sparseMatrix"),
                   phenoData = pd,
                   featureData = fd,
                   lowerDetectionLimit=1,
                   expressionFamily=negbinomial.size())

pXX <- estimateSizeFactors(pXX)
pXX <- estimateDispersions(pXX)
Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) :
invalid character indexing
In addition: Warning message:
Deprecated, use tibble::rownames_to_column() instead.

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] C

attached base packages:
[1] splines stats4 parallel stats graphics grDevices utils
[8] datasets methods base

other attached packages:
[1] monocle_2.2.0 DDRTree_0.1.4 irlba_2.1.2
[4] VGAM_1.0-3 ggplot2_2.2.1 Biobase_2.34.0
[7] BiocGenerics_0.20.0 Matrix_1.2-7.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 compiler_3.3.2 RColorBrewer_1.1-2
[4] plyr_1.8.4 tools_3.3.2 tibble_1.2
[7] gtable_0.2.0 lattice_0.20-34 igraph_1.0.1
[10] DBI_0.5-1 HSMMSingleCell_0.108.0 fastICA_1.2-0
[13] dplyr_0.5.0 stringr_1.1.0 cluster_2.0.5
[16] combinat_0.0-8 grid_3.3.2 R6_2.2.0
[19] qlcMatrix_0.9.5 pheatmap_1.0.8 limma_3.30.8
[22] reshape2_1.4.2 magrittr_1.5 scales_0.4.1
[25] matrixStats_0.51.0 assertthat_0.1 colorspace_1.3-2
[28] stringi_1.1.2 lazyeval_0.2.0 munsell_0.4.3
[31] slam_0.1-40

I am not sure what I've done wrong. I can send "Xerr.RData" file if needed.

Thank you so much!
Sincerely,
ilyoup

Provide function to calculate fold changes between arbitrary groups of cells

It would be handy to have a function that takes as input a CellDataSet and a grouping column (potentially even a continuous variable from pData) and returns a dataframe with fold changes in expression. The function should be easy to use in conjunction with output from differentialGeneTest()
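A rough sketch of what such a helper might compute is shown below, using a plain matrix and a grouping vector. The function `group_fold_change`, its arguments, and the pseudocount are hypothetical and not part of monocle. With a CellDataSet the inputs would presumably come from `exprs(cds)` and a `pData(cds)` column.

```r
# Hypothetical helper (not a monocle function): log2 fold change of mean
# expression between two groups of cells, with a pseudocount to avoid log(0).
group_fold_change <- function(expr, groups, numerator, denominator,
                              pseudocount = 1) {
  stopifnot(ncol(expr) == length(groups))
  mean_num <- rowMeans(expr[, groups == numerator, drop = FALSE])
  mean_den <- rowMeans(expr[, groups == denominator, drop = FALSE])
  data.frame(
    gene = rownames(expr),
    mean_numerator = mean_num,
    mean_denominator = mean_den,
    log2_fold_change = log2((mean_num + pseudocount) / (mean_den + pseudocount)),
    row.names = rownames(expr)
  )
}

# Toy data: geneA is higher in "ko" cells, geneB is unchanged.
expr <- matrix(c(1, 1, 3, 3,
                 4, 4, 4, 4),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))
groups <- c("ctl", "ctl", "ko", "ko")
fc <- group_fold_change(expr, groups, numerator = "ko", denominator = "ctl")
```

Because the result is keyed by gene, it could be merged with a differentialGeneTest() result table by row name.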

Missing descriptions for some arguments in reduceDimension()

The descriptions for the arguments relative_expr, auto_param_selection and scaling are missing from the help for reduceDimension.
What does changing these arguments do?

kmeans centers not distinct in new DDRTree version

I find that with some (but not all) of my datasets, I receive an error when performing reduceDimension(object, max_components = 2, reduction_method = "DDRTree"):
Error in kmeans(t(Z), K, centers = centers): initial centers are not distinct.
The problem seems to occur in my smaller datasets (though I'm not sure if that is just by happenstance). This problem has started occurring relatively recently (after release of the new DDRTree version), and is not observed when reverting to an older version of DDRTree. Any ideas for how to fix?
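For context, base R's `kmeans` raises this message when the starting centers it is given contain duplicated rows. Whether duplicated points in the reduced space are what triggers it inside DDRTree is an assumption. The toy matrix `Z` below only reproduces the message and shows a quick check for duplicates.

```r
# Toy embedding: 6 cells in 2 dimensions, with two pairs of identical rows.
Z <- rbind(c(0, 0), c(0, 0), c(1, 1), c(1, 1), c(5, 5), c(9, 9))

# stats::kmeans refuses explicit starting centers that contain duplicates.
centers <- Z[1:3, ]
msg <- tryCatch({
  kmeans(Z, centers = centers)
  "ok"
}, error = function(e) conditionMessage(e))

# Counting distinct points shows how many centers can safely be requested.
n_distinct <- nrow(unique(Z))

# Starting from distinct rows avoids the error.
fit <- kmeans(Z, centers = unique(Z)[1:3, ])
```

Small datasets are more likely to contain cells with identical profiles, which would be consistent with the observation above.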

differentialGeneTest taking too long

hi there,

I am using Monocle and have around 700 cells, ~13000 expressed genes, and 13 clusters after filtering and running DDRTree. differentialGeneTest is taking too long, even though I am using 10 cores. I am also not sure whether it is doing anything or is stuck somewhere without throwing an error. Is there a way to speed up the process? Or do you think ~13000 genes is too many?

FAIL differentialGeneTest

Why does differentialGeneTest fail on a lot of the genes?

Custom colors in tSNE plot

Hi,

Just started to use Monocle and several functions look very promising for what I am looking to do. Was just wondering if there is a way to provide custom colors when plotting tSNE plot using plot_cell_clusters function?

Thanks,
ST

Specify reshape2::dcast / reshape2::melt in calculateMarkerSpecificity()

As written, currently throws an error if reshape2 isn't attached.

10x data

hi,
i used the 10x data to run this code:
cellranger_pipestance_path <- "/path/to/your/pipeline/output/directory"
gbm <- load_cellranger_matrix(cellranger_pipestance_path)
gbm_cds <- newCellDataSet(exprs(gbm),
phenoData = pData(gbm),
featureData = fData(gbm),
lowerDetectionLimit=0.5,
expressionFamily=negbinomial.size())

but, the error happened:
Error: CellDataSet 'phenoData' is class 'data.frame' but should be or extend 'AnnotatedDataFrame'

Any reply will be appreciated.
Thanks
frank
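The error message says that phenoData arrived as a plain data.frame. Wrapping each table in an AnnotatedDataFrame first, as other reports on this page do, is the likely remedy. The toy tables below stand in for `pData(gbm)` and `fData(gbm)` and are illustrative only.

```r
# Toy stand-ins for pData(gbm) and fData(gbm), which are plain data frames.
pheno   <- data.frame(barcode = c("AAAC-1", "AAAG-1"),
                      row.names = c("AAAC-1", "AAAG-1"))
feature <- data.frame(gene_short_name = c("Gnai3", "Pbsn"),
                      row.names = c("ENSMUSG00000000001", "ENSMUSG00000000003"))

# Only runs when Biobase is installed.
if (requireNamespace("Biobase", quietly = TRUE)) {
  library(Biobase)
  pd <- new("AnnotatedDataFrame", data = pheno)
  fd <- new("AnnotatedDataFrame", data = feature)
  # These wrapped objects are what phenoData= and featureData= expect, e.g.
  # newCellDataSet(exprs(gbm), phenoData = pd, featureData = fd, ...)
}
```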

Relative values after using relative2abs

Hi,

Later edit: Read the Census paper and saw that transcripts counts are also normalized and relative, so it all makes sense now (the name of the function relative2abs was the confusing part).

Original message:
Thanks for developing and maintaining Monocle2, it's a great tool! My question is related to the type of data I should expect after using relative2abs. More specifically, I have a TPM dataset, which I read into a CellDataSet object like this

HSMM <- newCellDataSet(matrix.rsem_filter, phenoData = pd, featureData = fd)

A random sample from my object looks like that

exprs(HSMM)[1:5, 4]

 A1BG A1BG-AS1     A1CF      A2M  A2M-AS1 
 0.00     0.00     0.41   281.39     0.00

I want to transform the data to absolute counts, so I do

rpc_matrix <- relative2abs(HSMM)

followed by

HSMM <- newCellDataSet(as(as.matrix(rpc_matrix), "sparseMatrix"),
                       phenoData = pd,
                       featureData = fd,
                       lowerDetectionLimit=1,
                       expressionFamily=negbinomial.size())

My understanding is that I should now expect absolute count levels in my new HSMM object (or in rpc_matrix). Instead, the values I have are still relative:

rpc_matrix[1:5, 4]
A1BG    A1BG-AS1        A1CF         A2M     A2M-AS1 
0.000000000 0.000000000 0.003786963 2.599057193 0.000000000

In this case, I am wondering whether setting lowerDetectionLimit=1 is informative, since the data is still relative and, for example, gene A1CF, which would have been above the limit before transformation, is now below it.

Thanks!

Simona

Error while running estimateDispersions()

Hello,
I have an error when I estimate dispersion on my CDS. The data that I am feeding into it is UMI based (10X). The error that I get is:

gbm <- estimateDispersions(gbm)
Removing 140 outliers
Warning message:
Deprecated, use tibble::rownames_to_column() instead.

I suspect that I have zeros or some other value in one of my cells. However, I am not sure how to find that or remove those. I can send the CDS, if required.
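Note that the output shown is a message and a deprecation warning, not an error. To look for all-zero cells or genes anyway, a check along these lines may help. It uses a toy dense matrix named `counts`; with a CellDataSet the matrix would presumably come from `exprs(gbm)`.

```r
# Toy UMI count matrix (genes x cells); cell3 and geneC are entirely zero.
counts <- matrix(c(3, 0, 0, 1,
                   0, 2, 0, 4,
                   0, 0, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("cell", 1:4)))

empty_cells <- colnames(counts)[colSums(counts) == 0]
empty_genes <- rownames(counts)[rowSums(counts) == 0]
bad_values  <- sum(!is.finite(counts))   # number of NA, NaN and Inf entries

# Keep only cells and genes with at least one count.
filtered <- counts[rowSums(counts) > 0, colSums(counts) > 0, drop = FALSE]
```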

Bug in differentialGeneTest with relative_expr = FALSE and running estimateSizeFactors

Hi,

I've been getting familiar with Monocle, and I believe I found a bug when running estimateSizeFactors before running differentialGeneTest with relative_expr = FALSE. If I understand the documentation and differentialGeneTest code properly, size factors should be ignored when relative_expr = FALSE, but I don't think that's happening.

Basically, I am getting 30 differentially expressed genes if I run

expr.data<-read.delim(<file>, sep="")
cell_sample_sheet<-read.delim("sample_sheet.txt", row.names=1)
pd<-new("AnnotatedDataFrame", data=cell_sample_sheet)

gene_ann<-data.frame(gene_short_name = row.names(expr.data), row.names = row.names(expr.data))
fd <- new("AnnotatedDataFrame",data=gene_ann)

cells<-newCellDataSet(as.matrix(expr.data),
                      phenoData = pd,
                      featureData = fd,
                      expressionFamily=negbinomial.size())

cells <- estimateSizeFactors(cells)
cells <- estimateDispersions(cells)

cells<-detectGenes(cells, min_expr = 0.1)

expressed_genes <- row.names(subset(fData(cells), num_cells_expressed >= 10))

cells.diff_test_res <- differentialGeneTest(cells,
                                      fullModelFormulaStr = "~Genotype",
                                      cores=8,
                                      relative_expr = FALSE)
cells.sig_genes <- subset(cells.diff_test_res, qval < 0.1)

I get 126 diff. expressed genes if I omit cells <- estimateSizeFactors(cells), however. I assume the result with 126 genes is the correct result, since I'm using 10x Genomics single cell (UMI) data. I didn't realize I don't need to run cells <- estimateSizeFactors(cells).

Also, a less important (and unrelated) "bug" (if it even qualifies as a bug), it looks like plot_genes_jitter calculates cds_exprs$adjusted_expression <- log10(cds_exprs$expression), but never uses it. The data is transformed by ggplot, instead.

Really appreciate your help.

Correction

Apparently you cannot run cells <- estimateDispersions(cells) without first running cells <- estimateSizeFactors(cells) in a clean environment. I deleted all objects (rm(list=ls())) and re-ran the code, getting an error stating that I had to run cells <- estimateSizeFactors(cells), first.

I do think something funky is going on though, because if I re-run the code (omitting cells <- estimateSizeFactors(cells)) without deleting all objects, I am able to run cells <- estimateDispersions(cells), and I get the different results mentioned above.

So, I guess there are potentially four different issues here:

1. Should I run cells <- estimateSizeFactors(cells) and cells <- estimateDispersions(cells) with 10x Genomics UMI data? This is directly related to #50.
2. Is there a bug where erroneous size factors persist even if I re-run newCellDataSet?
3. Should differentialGeneTest with relative_expr = FALSE ignore whether cells <- estimateSizeFactors(cells) and cells <- estimateDispersions(cells) are run/set?
4. Oh, and the small "bug" in plot_genes_jitter.

CellDataSet crashing Rstudio 1.0.136

I am currently trying to use Monocle for some new scRNAseq analysis. After executing newCellDataSet(), RStudio aborts the current session. I am working with a relatively small dataset of filtered genes (24 genes by 48 samples), so file size is unlikely to be the issue. Creating a CellDataSet via command-line R works fine; however, saving the object as an .rda file and loading it in RStudio once again forces a crash. I've replicated this error on both my Linux Mint desktop and my MacBook.

I'm also able to replicate the crashing using a small, fake dataset:

library(monocle)
expr_test <- data.frame(S1 = c(1,1), S2 = c(0,0), row.names = c("G1", "G2"))
p_data <- data.frame(Group = c(1, 2), row.names = colnames(expr_test))
f_data <- data.frame(Anno = c("Foo", "Bar"), row.names = row.names(expr_test))
cell_data <- newCellDataSet(as.matrix(expr_test),
                            phenoData = new("AnnotatedDataFrame", data = p_data),
                            featureData = new("AnnotatedDataFrame", data = f_data))

Here is my session info with Monocle as the only specific library call:

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
 [1] splines   stats4    parallel  stats    
 [5] graphics  grDevices utils     datasets 
 [9] methods   base     

other attached packages:
[1] monocle_2.2.0       DDRTree_0.1.4      
[3] irlba_2.1.2         VGAM_1.0-3         
[5] ggplot2_2.2.1       Biobase_2.34.0     
[7] BiocGenerics_0.20.0 Matrix_1.2-7.1     

loaded via a namespace (and not attached):
 [1] igraph_1.0.1          
 [2] Rcpp_0.12.9           
 [3] cluster_2.0.5         
 [4] magrittr_1.5          
 [5] munsell_0.4.3         
 [6] colorspace_1.3-2      
 [7] lattice_0.20-34       
 [8] R6_2.2.0              
 [9] stringr_1.1.0         
[10] dplyr_0.5.0           
[11] plyr_1.8.4            
[12] tools_3.3.2           
[13] grid_3.3.2            
[14] gtable_0.2.0          
[15] DBI_0.5-1             
[16] matrixStats_0.51.0    
[17] lazyeval_0.2.0        
[18] assertthat_0.1        
[19] tibble_1.2            
[20] reshape2_1.4.2        
[21] RColorBrewer_1.1-2    
[22] HSMMSingleCell_0.108.0
[23] slam_0.1-40           
[24] qlcMatrix_0.9.5       
[25] pheatmap_1.0.8        
[26] stringi_1.1.2         
[27] limma_3.30.9          
[28] fastICA_1.2-0         
[29] scales_0.4.1          
[30] combinat_0.0-8

orderCells(): Unable to allocate vector of size ~9000GB

I'm sorry to bother twice in a day, but this is another showstopper bug that we've encountered when trying to use Monocle. I've tried to file a report about this before and closed it due to lack of precision with finding the issue, but since we've fixed the newCellDataSet() issue, I think it'll be better to pinpoint the cause of the issue.

We've been trying to use a ~23000x875 raw gene-sample dataset to test with Monocle. However, this problem still occurs even when trimming down to just a 25x25 dataset. The data set, dummy grouping factors, and code that we are using can be found (temporarily?) here: http://filebin.ca/2UGWtSOINFOc

The issue occurs within the orderCells() function. I've gone through the vignette on Bioconductor. In my debugging, I've been able to track down the vector problem to this for loop which contains the recursive call within the function: extract_good_ordering():

        for (child in V(pq_tree)[nei(curr_node, mode = "out")]) {
                    p_level[[length(p_level) + 1]] <- extract_good_ordering_(pq_tree, 
                                                                            child, dist_matrix)
                  }

Even if I replace some of the deprecated functions for nei() into .nei(), the vector allocation still fails.

Given that the code for the P and Q nodes look similar, I'd assume it fails for both cases.

I don't know much about graph theory or how the code applies it, but I was able to track the problem down to here. Given what looks like the appending of more nodes to the graph with each recursion, could it be a problem with the list depth?

My version of igraph is that hosted on GitHub at igraph/rigraph@master, pulled from and installed on 1/6/2016.

Thank you.

Error during differentialGeneTest calculations

I am using 10X data.

print(head(fData(MCX2MCX3)))
id gene_short_name use_for_ordering
ENSMUSG00000000001 ENSMUSG00000000001 Gnai3 FALSE
ENSMUSG00000000003 ENSMUSG00000000003 Pbsn FALSE
ENSMUSG00000000028 ENSMUSG00000000028 Cdc45 FALSE
ENSMUSG00000000031 ENSMUSG00000000031 H19 FALSE
ENSMUSG00000000037 ENSMUSG00000000037 Scml2 FALSE
ENSMUSG00000000049 ENSMUSG00000000049 Apoh FALSE
num_cells_expressed
ENSMUSG00000000001 1546
ENSMUSG00000000003 0
ENSMUSG00000000028 56
ENSMUSG00000000031 56
ENSMUSG00000000037 96
ENSMUSG00000000049 12

print(head(pData(MCX2MCX3)))
barcode library age genotype
MCX2_AAACCTGCATAAAGGT-1 MCX2_AAACCTGCATAAAGGT-1 MCX2 P7 CTL
MCX2_AAACCTGCATTTCACT-1 MCX2_AAACCTGCATTTCACT-1 MCX2 P7 CTL
MCX2_AAACCTGGTGTTCTTT-1 MCX2_AAACCTGGTGTTCTTT-1 MCX2 P7 CTL
MCX2_AAACGGGAGACTTTCG-1 MCX2_AAACGGGAGACTTTCG-1 MCX2 P7 CTL
MCX2_AAACGGGAGAGGGCTT-1 MCX2_AAACGGGAGAGGGCTT-1 MCX2 P7 CTL
MCX2_AAACGGGAGGTGATAT-1 MCX2_AAACGGGAGGTGATAT-1 MCX2 P7 CTL
num_genes_expressed
MCX2_AAACCTGCATAAAGGT-1 3600
MCX2_AAACCTGCATTTCACT-1 2007
MCX2_AAACCTGGTGTTCTTT-1 4521
MCX2_AAACGGGAGACTTTCG-1 1512
MCX2_AAACGGGAGAGGGCTT-1 2322
MCX2_AAACGGGAGGTGATAT-1 569

diff_test_res <- differentialGeneTest(MCX2MCX3, fullModelFormulaStr="~genotype")
gives error as below

<simpleError in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), NULL) else NULL): 'data' must be of a vector type, was 'NULL'>
<simpleError in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), NULL) else NULL): 'data' must be of a vector type, was 'NULL'>
<simpleError in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), NULL) else NULL): 'data' must be of a vector type, was 'NULL'>
..
..
..
..

The resulting dataframe/matrix has all FAIL with p-val and q-val equal to 1.

Vignette not up to date

I think the monocle 2 vignette has not been entirely updated. There are references to several functions that are no longer present, for example plot_pc_variance_explained and plot_cell_clusters.

Include support for data sets larger than memory capacity

I've encountered a problem: when I simply try to create a newCellDataSet(), the function crashes my Linux-based RStudio 0.99.447 running R 3.2.2. For some reason, it appears to work with one dataset while it crashes with another. The cause is quite mysterious, but I think it might have to do with the phenotype and assay labeling.

Now that I've tried to fix it a few times, I think it's due to the large nature of the dataset. I have over 20000 genes and over 800 samples, so one of the intermediary vectors may not come out right. This may not be an actual problem with Monocle, but with igraph instead.

Error message using markerDiffTable() function: "invalid character indexing"

Hi @Xiaojieqiu, I have the same problem as @ikwak2 (issue #20), but with the markerDiffTable function. This error appears: Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : invalid character indexing

I've searched for zero, NA, or Inf values but there are none, and the rownames (genes) and colnames (cell barcodes) don't contain any strange symbols... I'm a bit confused. I've copied my code below in case you can help me:

path <- paste(getwd(), "2_count_outs/outs/filtered_gene_bc_matrices/GRCh38/", sep = "/")
matrix <- readMM(paste(path, "matrix.mtx", sep = ""))
pd <- read.table(paste(path, "barcodes.tsv", sep = ""))
colnames(pd) <- "cell_ID"
pd <- as.data.frame(apply(pd, 2, function(x) gsub("-", "", x)))
rownames(pd) <- pd$cell_ID
fd <- read.table(paste(path, "genes.tsv", sep = ""))
colnames(fd) <- c("transcript_ID", "gene_short_name")
rownames(fd) <- fd$transcript_ID
colnames(matrix) <- pd$cell_ID; rownames(matrix) <- fd$transcript_ID
pdata <- new("AnnotatedDataFrame", data = pd)
fdata <- new("AnnotatedDataFrame", data = fd)
rawdata <- newCellDataSet(matrix, phenoData = pdata, featureData = fdata, expressionFamily = negbinomial.size())

rawdata <- rawdata[1:30000,1:500]
rawdata <- estimateSizeFactors(rawdata)
rawdata <- estimateDispersions(rawdata)

rawdata <- detectGenes(rawdata, min_expr = 1) #zero
expressed_genes <- row.names(subset(fData(rawdata), num_cells_expressed >= 1))

gata1 <- row.names(subset(fData(rawdata), gene_short_name == "GATA1"))
gypa <- row.names(subset(fData(rawdata), gene_short_name == "GYPA"))
mpo <- row.names(subset(fData(rawdata), gene_short_name == "MPO"))
cebpb <- row.names(subset(fData(rawdata), gene_short_name == "CEBPB"))
dntt <- row.names(subset(fData(rawdata), gene_short_name =="DNTT"))
ebf1 <- row.names(subset(fData(rawdata), gene_short_name =="EBF1"))
fos <- row.names(subset(fData(rawdata), gene_short_name == "FOS"))
prdm1 <- row.names(subset(fData(rawdata), gene_short_name == "PRDM1"))
thy1 <- row.names(subset(fData(rawdata), gene_short_name == "THY1"))

cth <- newCellTypeHierarchy()
cth <- addCellType(cth, "Erythrocyte", classify_func = function(x) {x[ery_id,] >= 1 & x[gypa,] >= 1})
cth <- addCellType(cth, "Myeloid", classify_func = function(x) {x[mpo,] >= 1 & x[cebpb,] >= 1})
cth <- addCellType(cth, "LiT", classify_func = function(x) {x[ebf1,] >= 1 & x[dntt,] >= 1})
cth <- addCellType(cth, "LiB", classify_func = function(x) {x[fos,] >= 1 & x[prdm1,] >= 1})
cth <- addCellType(cth, "Progenitors", classify_func = function(x) {x[thy1,] >= 1 & x[fos,] < 1})
rawdata_ct <- classifyCells(rawdata, cth)

marker_diff <- markerDiffTable(rawdata[expressed_genes,], cth, cores = 2)
## and here the error appears:
## Error in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : invalid character indexing

I can't upload my count matrix because it is in .mtx format, but if you need it I can send it to you by email.

Thanks in advance!

BEAM function never finishes

Dear developers,

I ran Monocle2 on an RPKM matrix of 400 single cells. RPKMs were calculated with an in-house script specific to my lab (I don't own the code). I estimated the RNA counts with relative2abs(), then generated the dataset with negbinomial().
I successfully ordered the cells by pseudotime with supervised and unsupervised methods, and now I would like to identify the genes explaining the branching using BEAM().

I ran BEAM on the lung dataset in a minute, but with my data the BEAM function runs and never ends. I left my computer on all weekend just to see, and it was still running after 3 days.

I created a test dataset from my data (50 randomly picked cells) and I've got the same issue.

Here is my script:
https://github.com/IStevant/XY-mouse-gonad-scRNA-seq/blob/master/Scripts/test_monocle2.R

And here is the test dataset:
https://github.com/IStevant/XY-mouse-gonad-scRNA-seq/blob/master/Data/sample_test_monocle2.csv

I hope you will find out what's going wrong.

Extract normalized expression matrix

Hi,

I'm working with 10x Genomics UMI data. I'd like to extract Monocle's normalized expression matrix so I can make custom expression plots, etc. Calling exprs(<cds>) only returns the original matrix, as far as I can tell.

Is there a quick way to extract the normalized matrix?

Really appreciate your help.
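One common notion of "normalized" is raw counts divided by each cell's size factor. The sketch below shows that operation on a toy matrix with made-up size factors. With a CellDataSet, the inputs would presumably be `exprs(cds)` and `sizeFactors(cds)` after `estimateSizeFactors()`; whether this matches what Monocle uses internally for every plot is an assumption.

```r
# Toy counts (genes x cells) and hypothetical per-cell size factors.
counts <- matrix(c(10, 20, 40,
                    1,  2,  4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), paste0("cell", 1:3)))
size_factors <- c(cell1 = 0.5, cell2 = 1, cell3 = 2)

# Divide every column (cell) by its size factor.
normalized <- sweep(counts, MARGIN = 2, STATS = size_factors, FUN = "/")
```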

CellTypeHierarchy Issues

When I try to run the classifyCells function as part of the CellTypeHierarchy analysis (section 3.1 in the vignette), I get the following error:

Error in if (type_res[cell_name] == TRUE) next_nodes <- c(next_nodes, :
missing value where TRUE/FALSE needed
In addition: Warning message:
Deprecated, use tibble::rownames_to_column() instead.

I am using:

R: 3.3.1
monocle: 2.2.0
dplyr: 0.5.0

Differential Gene expression between clusters

Hi,
I have 10X data that I am able to segregate into 5 different clusters. Three of those clusters are well separated and two are partly overlapping. I have a few genes that are high in one population and low in the other. I would like to perform DGE between just the two closely related populations.
The "differentialGeneTest" function reports the DGE but does not indicate which cluster a given target is more highly expressed in. I could subset each pair of clusters and run "differentialGeneTest" individually, but that becomes a long workaround as the number of clusters increases.
Any idea what the best method is for doing DGE between specific clusters?
Thanks
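A minimal base R sketch of the subsetting idea: keep only the cells from the two clusters of interest, then compare per-cluster means to see in which cluster each gene is higher. The matrix, the cluster labels and the pseudocount of 1 are all made up for illustration. With a CellDataSet, the equivalent subset would presumably look like cds[, pData(cds)$Cluster %in% c(1, 2)], where the column name Cluster is an assumption. This only reports direction and is not a substitute for the statistical test.

```r
# Toy expression matrix: 2 genes x 6 cells (hypothetical values).
expr <- matrix(
  c(8, 9, 1, 2, 0, 0,
    1, 1, 7, 9, 3, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:6))
)
cluster <- c(1, 1, 2, 2, 3, 3)  # hypothetical cluster assignment per cell

# Keep only the two clusters to be compared.
keep <- cluster %in% c(1, 2)
sub_expr <- expr[, keep, drop = FALSE]
sub_cluster <- factor(cluster[keep])

# Mean expression per gene within each retained cluster.
means <- sapply(levels(sub_cluster), function(cl) {
  rowMeans(sub_expr[, sub_cluster == cl, drop = FALSE])
})

# Log2 fold change (cluster 1 over cluster 2) with a pseudocount of 1.
log2fc <- log2((means[, "1"] + 1) / (means[, "2"] + 1))
higher_in <- ifelse(log2fc > 0, "1", "2")

print(data.frame(means, log2fc = log2fc, higher_in = higher_in))
```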

orderCells(): takes a long time

Hi
Monocle works well on a 10x dataset of about 3k cells, but when I move on to a larger dataset of about 30k cells, orderCells() takes about 36 hours. Can this step be parallelized? Thanks.

error in plot_genes_branched_pseudotime

On some set of genes, I get the error

Error in full_model_expectation[x[2], x[1]] : subscript out of bounds

What is even weirder is that if I plot that set of genes separately, together with some random other gene, it works.

Any particular cause for this?

Also, why can't plot_genes_branched_pseudotime work on only one gene?

I get the error

Error in if (nrow(ancestor_exprs) == 1) exprs_data <- t(as.matrix(ancestor_exprs)) else exprs_data <- ancestor_exprs :
argument is of length zero

when I input data[c(31421),] or data[,31421,] for instance. Thanks.
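The "argument is of length zero" message is what R produces when nrow() returns NULL inside an if(). One plausible cause, offered only as a guess since the Monocle internals are not shown here, is that selecting a single row from a matrix drops it to a plain vector. The base R sketch below reproduces that behaviour on a made-up matrix.

```r
# Toy matrix: 2 genes x 3 cells (hypothetical values).
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("geneA", "geneB"), paste0("cell", 1:3)))

# Selecting one row drops the matrix to a named vector by default.
one_row_dropped <- m["geneA", ]
# drop = FALSE keeps the result as a 1 x 3 matrix.
one_row_kept <- m["geneA", , drop = FALSE]

print(nrow(one_row_dropped))  # NULL, so `nrow(x) == 1` has length zero
print(nrow(one_row_kept))     # 1

# Evaluating the same kind of condition on the dropped vector fails.
msg <- tryCatch(
  if (nrow(one_row_dropped) == 1) "one row" else "many rows",
  error = function(e) conditionMessage(e)
)
print(msg)
```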

monocle version 1

Is there an archived version of the original monocle code available? I recently upgraded to the new version and am trying to reproduce plots made previously for a comparison but the furthest back that I can find is 2.4.0. Any help would be much appreciated.

estimateDispersions fails

Hello,

I am having trouble running the estimateDispersions function on RNA-seq count data (HTSeq). The following error is thrown.

Error in parametricDispersionFit(disp_table) :
Parametric dispersion fit failed. Try a local fit and/or a pooled estimation. (See '?estimateDispersions')
In addition: Warning messages:
1: In .local(object, ...) :
in estimateDispersions: Ignoring extra argument(s).
2: Deprecated, use tibble::rownames_to_column() instead.

I am running the following code.

HSMM <- newCellDataSet(scell_non0_matrix, expressionFamily = negbinomial.size())
HSMM <- estimateSizeFactors(HSMM)

min_expr in plot_genes_in_pseudotime has incorrect log transformation

From Kieran Campbell:

In the function plot_genes_in_pseudotime, the following is used as the predicted expression curve:

res <- 10^(predict(vg, type = "response"))
res[res < log10(min_expr)] <- log10(min_expr)

where vg is the VGAM model fit and min_expr is the minimum expression cut-off (default 0.1). However, res has already been transformed back to the non-logged scale, yet it is censored against log10(min_expr), a log-scale value. The cut-off applied in most plots is therefore far below where it should be.
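The effect can be seen in a small base R sketch. The predicted values below are made up, and the "fixed" variant is just one way the censoring could be written consistently on the linear scale, not necessarily how the package resolves it.

```r
min_expr <- 0.1

# Hypothetical model predictions on the log10 scale.
log10_pred <- c(-3, -1.5, 0, 1)

# Back-transform to the linear scale, as in the quoted code.
res <- 10^log10_pred            # 0.001, ~0.0316, 1, 10

# Quoted behaviour: linear-scale values are compared against a log-scale
# threshold. log10(0.1) is -1 and no back-transformed value is below -1,
# so nothing is censored.
buggy <- res
buggy[buggy < log10(min_expr)] <- log10(min_expr)

# Consistent alternative (assumption): censor on the linear scale.
fixed <- res
fixed[fixed < min_expr] <- min_expr

print(rbind(res = res, buggy = buggy, fixed = fixed))
```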

Error when using few genes

How many features/genes does monocle 2 need? When I used 8 genes, I received this error

Thanks

10x data formatting

Hello,
This is a rather basic question. I am trying to load the matrix (.mtx) file that is an output of the cellrangerRkit.

"gbm_cds <- newCellDataSet(exprs(gbm),
phenoData = new("AnnotatedDataFrame", data=pData(gbm)),
featureData = new("AnnotatedDataFrame", data=fData(gbm)),
lowerDetectionLimit = 0.5,
expressionFamily = negbinomial.size())"

However, the matrix does not have the correct featureData. I get the error:

"none of your featureData columns are named “gene_short_name”. Some functions will not be able to take this function as an input as a result"

Do I need to change a column title [that is currently labeled as "symbol"] to "gene_short_name"? If not, will it create problems when I run some of the downstream analyses? For example when I calculate dispersions or when I classify cell type hierarchy?

Also, the featureData of my matrix looks like this:

> fData(gbm)
                      id                       symbol
ENSMUSG00000051951 ENSMUSG00000051951           Xkr4
ENSMUSG00000089699 ENSMUSG00000089699         Gm1992
ENSMUSG00000102343 ENSMUSG00000102343        Gm37381
ENSMUSG00000025900 ENSMUSG00000025900            Rp1
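Renaming the column is a plain data frame operation. The sketch below does it on a small stand-in data frame built from the first two rows shown above. With the real object one would presumably start from fData(gbm) and wrap the result in new("AnnotatedDataFrame", data = ...) as in the call above, but that part is left as a comment because it depends on the packages being loaded.

```r
# Stand-in for fData(gbm), using two of the rows shown above.
feature_df <- data.frame(
  id = c("ENSMUSG00000051951", "ENSMUSG00000089699"),
  symbol = c("Xkr4", "Gm1992"),
  row.names = c("ENSMUSG00000051951", "ENSMUSG00000089699"),
  stringsAsFactors = FALSE
)

# Rename the "symbol" column to "gene_short_name".
names(feature_df)[names(feature_df) == "symbol"] <- "gene_short_name"

print(feature_df)

# With the real data (not run here):
#   fd <- fData(gbm)
#   names(fd)[names(fd) == "symbol"] <- "gene_short_name"
#   featureData = new("AnnotatedDataFrame", data = fd)
```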

Error when using reduceDimension

Hi,
I am trying to order cells using monocle, but I get an error at the dimensionality reduction step (reduceDimension). Looking at the verbose output, this seems to happen at iteration 19, and the error says:
Clearing MST sparse matrix
Setting up MST sparse matrix with 1225
Error in colnames(Y)[closest_vertex] : invalid subscript type 'list'

Did anyone see this error before? Do you have any idea what might cause such an error?

`orderCells` gives `igraph` error with some `root_state` inputs

The exact error message is Error in as.igraph.vs(graph, root) : Invalid vertex names. Unfortunately, I can't work up a reasonable minimal example of this bug, but it happens when I call orderCells a second time for the purpose of rooting the pseudotime variable. It happens with some root_state inputs and not others, and it happens even when I've checked to make sure the desired root state is present in the PhenoData. Here's my sessionInfo, and thanks for all your hard work on this package.


R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      splines   stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] enrichR_0.0.0.9000  gridExtra_2.2.1     hexbin_1.27.1       monocle_2.2.0       DDRTree_0.1.4       irlba_2.1.2        
 [7] VGAM_1.0-3          Biobase_2.32.0      BiocGenerics_0.18.0 Matrix_1.2-8        class_7.3-14        reshape_0.8.6      
[13] magrittr_1.5        dplyr_0.5.0         reshape2_1.4.2      colorspace_1.3-2    Seurat_1.3.3        cowplot_0.7.0      
[19] ggplot2_2.2.1       freezr_0.1.0       

loaded via a namespace (and not attached):
 [1] segmented_0.5-1.4      nlme_3.1-131           tsne_0.1-3             bitops_1.0-6           matrixStats_0.51.0    
 [6] pbkrtest_0.4-6         RColorBrewer_1.1-2     httr_1.2.1             prabclus_2.2-6         tools_3.3.1           
[11] R6_2.2.0               KernSmooth_2.23-15     DBI_0.5-1              lazyeval_0.2.0         mgcv_1.8-17           
[16] trimcluster_0.1-2      nnet_7.3-12            curl_2.3               glmnet_2.0-5           quantreg_5.29         
[21] SparseM_1.74           labeling_0.3           slam_0.1-40            diptest_0.75-7         caTools_1.17.1        
[26] scales_0.4.1           DEoptimR_1.0-8         mvtnorm_1.0-5          robustbase_0.92-7      proxy_0.4-16          
[31] pbapply_1.3-1          stringr_1.2.0          digest_0.6.12          minqa_1.2.4            mixtools_1.0.4        
[36] lme4_1.1-12            limma_3.28.21          FNN_1.1                jsonlite_1.3           combinat_0.0-8        
[41] mclust_5.2.2           gtools_3.5.0           ModelMetrics_1.1.0     car_2.1-4              modeltools_0.2-21     
[46] lars_1.2               Rcpp_0.12.9            munsell_0.4.3          ape_4.1                stringi_1.1.2         
[51] MASS_7.3-45            flexmix_2.3-13         gplots_3.0.1           Rtsne_0.11             plyr_1.8.4            
[56] gdata_2.17.0           crayon_1.3.2           lattice_0.20-34        knitr_1.15.1           igraph_1.0.1          
[61] boot_1.3-18            fpc_2.1-10             codetools_0.2-15       evaluate_0.10          data.table_1.10.4     
[66] nloptr_1.0.4           foreach_1.4.3          testthat_1.0.2         MatrixModels_0.4-1     purrr_0.2.2           
[71] gtable_0.2.0           kernlab_0.9-25         assertthat_0.1         HSMMSingleCell_0.106.2 qlcMatrix_0.9.5       
[76] tibble_1.2             pheatmap_1.0.8         iterators_1.0.8        cluster_2.0.5          fastICA_1.2-0         
[81] caret_6.0-73           ROCR_1.0-7

Error in reduceDimension()

Hi,

I get this error when running
count_matrix <- reduceDimension(count_matrix, max_components = 2, reduction_method = 'tSNE', verbose = T)

Remove noise by PCA ...
You're computing too large a percentage of total singular values, use a standard svd instead.
Reduce dimension by tSNE ...
Error in Rtsne.default(as.matrix(topDim_pca), dims = max_components, pca = F, :
Perplexity is too large.

I have absolutely no idea how to fix it and why it happens. I would be grateful for any help!
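Rtsne raises this error when the number of samples is too small for the chosen perplexity: it requires roughly 3 * perplexity to be no larger than the number of rows minus one, and its default perplexity is 30. The helper below, a hypothetical name and not part of Rtsne or Monocle, computes the largest perplexity that satisfies this for a given number of cells.

```r
# Largest perplexity satisfying 3 * perplexity <= n_cells - 1.
# `max_perplexity` is a hypothetical helper written for this illustration.
max_perplexity <- function(n_cells) {
  floor((n_cells - 1) / 3)
}

print(max_perplexity(50))  # 50 cells allow a perplexity of at most 16
print(max_perplexity(91))  # 91 cells are needed for the default of 30
```

If reduceDimension() forwards extra arguments to Rtsne (an assumption worth checking against the installed version), passing a smaller perplexity value there may avoid the error on small datasets.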

Remove cell cycle effect

Hi:

Thanks for putting this tool together. I have a general question about removing the cell cycle effect from the data analysis. Do you have any suggestions, for example using a regression model, for reducing the weight of cell-cycle-related effects in clustering?

Thanks,
Rosemarie

RStudio crashes when I try to use newCellDataSet()

I've tried doing this with Monocle many times now, and it doesn't seem to work. It works outside of RStudio, but it crashes only if I run it in RStudio.

I haven't the slightest clue why; I've tried to make sure that the data is indeed correct and conforms to the AnnotatedDataFrame requirements.

I've included a link to the data I've used for Monocle. http://filebin.ca/2SZMzFp2gDxX

summary statistics

In your documentation about differential expression you mention "We could also simply compute summary statistics such as mean or median expression level on a per-CellType basis"...

This would be really useful for my objectives. Would you mind sharing a way to do this?

Thanks for your time.

Andrew
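A minimal base R sketch of per-group summaries, using a made-up expression matrix and made-up cell type labels. With a CellDataSet the inputs would presumably be exprs(cds) and pData(cds)$CellType, where the column name CellType is an assumption based on the quoted sentence.

```r
# Toy expression matrix: 2 genes x 4 cells (hypothetical values).
expr <- matrix(
  c( 1,  2,  3,  4,
    10, 20, 30, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4))
)
cell_type <- factor(c("A", "A", "B", "B"))  # hypothetical labels

# Mean expression of each gene within each cell type.
mean_by_type <- sapply(levels(cell_type), function(ct) {
  rowMeans(expr[, cell_type == ct, drop = FALSE])
})

# Median expression of each gene within each cell type.
median_by_type <- sapply(levels(cell_type), function(ct) {
  apply(expr[, cell_type == ct, drop = FALSE], 1, median)
})

print(mean_by_type)
print(median_by_type)
```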

problem in plotting log transformed values in plot_pseudotime_heatmap

Hi,
First, thank you for the great package.
There is just a small issue in the plot_pseudotime_heatmap function when plotting the data in log space.
The pseudocount added to the values is set to NA on the first line, which causes all the values to be NA,
which throws an error.

Maybe just move this line, pseudocount <- NA, into the arguments of the function.

Code and documentation/vignette suggestion

Hi,

Loving Monocle. I have a small suggestion in the code and documentation/vignette. Perhaps I've misunderstood, so correct me if I'm wrong.

I noticed the call to estimate size factors and dispersion is buried at the bottom of a section Converting TPM to mRNA counts that not everyone needs. I skipped that section since I have UMI counts and proceeded with remaining sections, including differential expression. I assumed I had performed any required steps up to that point, and that any other necessary calculations were within the differentialGeneTest call.

My suggestion would be to place the documentation for estimateSizeFactors and estimateDispersions in a more general location. I also think it would be helpful to add a warning or error if someone runs differentialGeneTest without running important steps prior.

Or, have I misunderstood something?

Really appreciate it!

Column gene_short_name in fData required for many functions

Hi,

I spent a little time today trying to work out why setting the markers argument in plot_cell_trajectory wasn't doing anything, before I realised it's actually required that a column in the fData is named gene_short_name (I had named it just gene_name).

This wasn't stated in the vignette when describing how to construct the CellDataSet object and it's not really in the manual as far as I can tell. I think it would be useful to have this more explicitly stated somewhere (like in the vignette).

Cheers,
Tim

Cannot estimate dispersion in Monocle2

Hi,

I was trying to use Monocle2 for unsupervised cell clustering.
I loaded read counts as input (full-length single-cell RNA-seq, no UMIs, no spike-ins) as follows:

conds<- substr(colnames(expr_matrix), 14,18)

HSMM_expr_matrix <- as.data.frame(expr_matrix)
HSMM_sample_sheet <- data.frame(cells=names(expr_matrix), stages=conds)
rownames(HSMM_sample_sheet)<- names(expr_matrix)
HSMM_gene_annotation <- as.data.frame(rownames(expr_matrix))
rownames(HSMM_gene_annotation)<- rownames(expr_matrix)

pd <- new("AnnotatedDataFrame", data = HSMM_sample_sheet)
fd <- new("AnnotatedDataFrame", data = HSMM_gene_annotation)
HSMM <- newCellDataSet(
    as.matrix(HSMM_expr_matrix), 
    phenoData = pd, 
    featureData = fd,
    expressionFamily=negbinomial.size()
    )

I then was able to estimateSizeFactors() with no issues:
HSMM <- estimateSizeFactors(HSMM)

But estimateDispersions() failed (error messages translated from French):
HSMM <- estimateDispersions(HSMM)

Error in as.data.frame(cds_pdata %>% do(disp_calc_helper_NB(cds[, .$rowname],  :
  error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Error in t(t(rounded[nzGenes, ])/pData(cds[nzGenes, ])$Size_Factor) :
  error in evaluating the argument 'x' in selecting a method for function 't': Error in t(rounded[nzGenes, ]) :
  error in evaluating the argument 'x' in selecting a method for function 't': Error in rounded[nzGenes, ] : subscript out of bounds

My sessionInfo():


sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] splines   stats4    parallel  stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
[1] monocle_1.99.0      DDRTree_0.1.3       irlba_2.0.0
[4] VGAM_1.0-1          ggplot2_2.1.0       Biobase_2.30.0
[7] BiocGenerics_0.16.1 Matrix_1.2-6

loaded via a namespace (and not attached):
 [1] igraph_1.0.1           Rcpp_0.12.4            cluster_2.0.4
 [4] magrittr_1.5           munsell_0.4.3          colorspace_1.2-6
 [7] lattice_0.20-33        R6_2.1.2               stringr_1.0.0
[10] plyr_1.8.3             dplyr_0.4.3            tools_3.2.3
[13] grid_3.2.3             gtable_0.2.0           DBI_0.4
[16] matrixStats_0.50.2     lazyeval_0.1.10        assertthat_0.1
[19] reshape2_1.4.1         RColorBrewer_1.1-2     HSMMSingleCell_0.104.0
[22] slam_0.1-34            pheatmap_1.0.8         stringi_1.0-1
[25] limma_3.26.9           fastICA_1.2-0          scales_0.4.0
[28] combinat_0.0-8
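One thing worth checking in the setup code of this report, offered as a guess since the class of expr_matrix is not stated: names() returns NULL for a matrix, whereas colnames() returns the column names. If expr_matrix is a matrix rather than a data frame, the sample sheet would end up without the cell names as row names, which could lead to an out-of-bounds subscript later. The base R sketch below shows the difference on a made-up matrix.

```r
# Toy matrix standing in for expr_matrix (hypothetical values).
expr_matrix <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3"))
)

# names() on a matrix is NULL; colnames() gives the cell names.
names_result <- names(expr_matrix)
colnames_result <- colnames(expr_matrix)

print(names_result)     # NULL
print(colnames_result)  # "cell1" "cell2" "cell3"

# Building the sample sheet from colnames() keeps row names aligned
# with the columns of the expression matrix.
sample_sheet <- data.frame(cells = colnames_result,
                           row.names = colnames_result)
print(identical(rownames(sample_sheet), colnames(expr_matrix)))
```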

Provide an efficient means of extracting model coefficients

fitModel() can currently be used to generate full VGAM model objects for each gene in the CellDataSet, but these can be large, so collecting them all in order to extract coefficients can be onerous. We should wrap fit_model_helper() to pull out the coefficients (and their significance scores) and return them as a data frame for easy plotting.