Comments (16)
Hello,
Yes, there are a number of parameters you may want to tweak.
When setting your gene expression matrices with setGexpMats, the default filtering is quite strict (filter=TRUE, minMeanBoth=4.5, minMeanTest=6, minMeanRef=8). I am considering changing the default parameters to be the 25% quantile of the input matrices as opposed to hard thresholds. Generally, I would recommend filtering out the most lowly expressed genes such that you still have > 4000 genes after filtering.
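As a concrete illustration, a call with relaxed thresholds might look like the sketch below. The threshold parameter names are from the discussion above; the `hb` object, the input matrix names, the `mart.obj` argument, and the `gexp.norm` slot are assumptions based on the package's getting-started workflow, and the relaxed values are illustrative only.

```r
library(HoneyBADGER)

# hb is assumed to be an initialized HoneyBADGER object; gexp.sc and
# gexp.ref are assumed genes x cells single-cell and reference
# expression matrices, and mart.obj an assumed biomaRt annotation object.
hb$setGexpMats(gexp.sc, gexp.ref, mart.obj,
               filter = TRUE,
               minMeanBoth = 1,  # relaxed from the 4.5 default
               minMeanTest = 2,  # relaxed from the 6 default
               minMeanRef = 3)   # relaxed from the 8 default

# sanity check: aim to retain > 4000 genes after filtering
# (the gexp.norm slot name is an assumption)
nrow(hb$gexp.norm)
```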
In calcGexpCnvBoundaries, there are two parameters, pd and pa, which are the expected magnitude shifts due to deletion and amplification respectively. While we expect a doubling of gene expression for each extra copy, or a halving for each copy lost (which is the default), this appears to differ quite a bit depending on the dataset/protocol. I am considering changing this to a standard-deviation-based measurement (e.g. if gene expression is 1.96 standard deviations lower than the median, we may want to look at this region further as a potential deletion). You may also want to increase min.traverse (to something like 10 from the default 3), which is the depth of the tree traversal (higher number = look for smaller pools of cells).
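A call with the settings discussed above might look like this sketch; the pd/pa values are illustrative, and the `init` flag is an assumption from the package's tutorial workflow:

```r
# Sketch: HMM-based search for CNV boundaries with relaxed expected
# shifts and a deeper tree traversal. hb is assumed to be a
# HoneyBADGER object with expression matrices already set.
hb$calcGexpCnvBoundaries(init = TRUE,          # assumed tutorial flag
                         pd = -0.5,            # expected downward shift for a deletion
                         pa = 0.5,             # expected upward shift for an amplification
                         min.traverse = 10)    # look for smaller pools of cells
```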
If you provide allele information, it should improve the power to identify CNVs, particularly for smaller alterations. In our manuscript, we show that while the gene expression based approach is able to identify whole-chromosome-level alterations with high accuracy and precision, it performs rather poorly for focal alterations, while the allele-based approach can achieve high accuracy and precision even for deletions as small as 10 Mb. Adding allele information will also allow you to catch copy-neutral LOH events.
To identify tumor cells (separate from normal), you can do hierarchical clustering on the final posterior probabilities for all CNVs tested. Tumor cells and normal cells should separate into two obvious groups. You can also identify normal cells as they should not have high posterior probabilities for any of the CNVs tested.
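A minimal sketch of this clustering, assuming `post.prob` is a cells-by-CNVs matrix of final posterior probabilities collected from the retest step (the matrix name and the 0.5 cutoff are assumptions for illustration):

```r
# Separate tumor from normal cells by hierarchically clustering the
# matrix of posterior CNV probabilities (cells x CNVs tested).
hc <- hclust(dist(post.prob), method = 'ward.D2')
groups <- cutree(hc, k = 2)  # expect a tumor group and a normal group

# normal cells should show low posteriors for every CNV tested
normal.cells <- names(which(apply(post.prob, 1, max) < 0.5))
```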
Please do note that this work is not yet published, with the software still not officially released, and thus subject to substantial changes during our manuscript revision. We are currently working to deposit a copy of the manuscript on bioRxiv. I'd be happy to incorporate any of your feedback and feature requests in the release if you have any to share. Thanks!
Best,
Jean
from honeybadger.
Hello,
Thanks for your reply. It is very helpful.
Over the past few days I have tried hard to understand how HoneyBADGER works and ran many experiments based on your advice. Here are my updates:
> When setting your gene expression matrices with setGexpMats, the default filtering is quite strict (filter=TRUE, minMeanBoth=4.5, minMeanTest=6, minMeanRef=8). I am considering changing the default parameters to be the 25% quantile of the input matrices as opposed to hard thresholds. Generally, I would recommend filtering out the most lowly expressed genes such that you still have > 4000 genes after filtering.
I have ~9000 protein-coding genes in my dataset after filtering. I realized that the default thresholds were too strict for my data, so I had already done this step before my first post.
> In calcGexpCnvBoundaries, there are two parameters, pd and pa, which are the expected magnitude shifts due to deletion and amplification respectively. While we expect a doubling of gene expression for each extra copy, or a halving for each copy lost (which is the default), this appears to differ quite a bit depending on the dataset/protocol. I am considering changing this to a standard-deviation-based measurement (e.g. if gene expression is 1.96 standard deviations lower than the median, we may want to look at this region further as a potential deletion). You may also want to increase min.traverse (to something like 10 from the default 3), which is the depth of the tree traversal (higher number = look for smaller pools of cells).
From the code, it seems that the HMM step tries to identify the CNV status of regions. In this step, with the default pd and pa, I can get some candidate regions. However, the posterior probabilities for these regions are always between 0.6 and 0.7 in my dataset, even when I changed pa or pd to get more regions (bound genes?). The posterior probabilities are still not high enough to pass a significance level (sorry for misusing statistical terms). I increased the m parameter to 0.5, and then the posterior probabilities rose to 0.8 ~ 0.95. Is it reasonable to do that? I am not sure why the Bayesian inference works better after increasing m; I say that because I had tried decreasing m first, but that didn't work as I expected.
> If you provide allele information, it should improve the power to identify CNVs, particularly for smaller alterations. In our manuscript, we show that while the gene expression based approach is able to identify whole-chromosome-level alterations with high accuracy and precision, it performs rather poorly for focal alterations, while the allele-based approach can achieve high accuracy and precision even for deletions as small as 10 Mb. Adding allele information will also allow you to catch copy-neutral LOH events.
How many SNPs are enough for HoneyBADGER? I filtered the ExAC v3.1 VCF with AF > 0.1 and got ~60000 SNPs; the tutorial dataset only contains ~5000 SNPs. I can see very strong LOH on chr17 (consistent with the gene expression plot if we consider chr17 a copy number loss).
However, even after tweaking pa and pd, I can't get this chromosome as a candidate for calculating posterior probabilities. The regions I got with high confidence from gene expression are all focal regions, including chr6 [42206801, 44127457], chr12 [8058302, 10454685] and chr11 [77066932, 78139660].
I haven't tried the combined calculation yet, but it seems that I have lost chr17 already.
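The ExAC filtering step mentioned above (AF > 0.1, from the text) could be sketched with bcftools; the input/output file names are placeholders, and restricting to biallelic SNVs is an added assumption:

```shell
# Keep common SNPs (allele frequency > 0.1) from the ExAC VCF,
# restricting to biallelic SNVs; file names are placeholders.
bcftools view -m2 -M2 -v snps -i 'AF>0.1' ExAC.r3.1.sites.vcf.gz \
  -Oz -o exac.common.snps.vcf.gz
bcftools index exac.common.snps.vcf.gz
```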
> To identify tumor cells (separate from normal), you can do hierarchical clustering on the final posterior probabilities for all CNVs tested. Tumor cells and normal cells should separate into two obvious groups. You can also identify normal cells as they should not have high posterior probabilities for any of the CNVs tested.
I still don't get the full posterior probabilities for all cells. Also, the returnPlot parameter in plotGexpMat does not return the smoothed matrix. Is that because it returns it to a certain slot of the ref class itself? I am not sure, and I can't find the slot.
> Please do note that this work is not yet published, with the software still not officially released, and thus subject to substantial changes during our manuscript revision. We are currently working to deposit a copy of the manuscript on bioRxiv. I'd be happy to incorporate any of your feedback and feature requests in the release if you have any to share. Thanks!
Yes, I noted that. So far I have changed some code to fit my analysis on my computer. Do you welcome code contributions via GitHub branches, or some other way?
By the way, is HoneyBADGER sensitive to drop-out events? Do we need to run knn.error.model from PAGODA to correct the expression values first? knn.error.model from PAGODA always takes a very long time to run.
Since this work is not yet published, my email is [email protected].
Please feel free to let me know if more detailed information about the tool or my data is necessary.
Best,
Yao He
Hello,
Today I tried the breast cancer data from the recently published study "Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer": https://www.nature.com/articles/ncomms15081
I can reproduce Figure 2b using HoneyBADGER's plotGexpProfile.
However, I encountered almost the same problem as with my own dataset. Specifically, with the default parameters of calcGexpCnvBoundaries, no bound genes are identified. Using min.traverse = 10, pd = -0.5, pa = 0.5, several bound genes came up, but the posterior probabilities are low (0.5 ~ 0.8) and only chr8 [47960185, 143790644] is identified as a CNV. It seems that the striking expression change due to CNV in the MGH dataset is not typical of other tumors.
Hi Yao,
Thanks so much for the wonderful feedback.
To address your questions:
The intuition behind the posterior probabilities for the gene expression model is that, given N genes in a region, we ask what the average expression is. If the average is low, that points to a deletion; if it is high, it points to an amplification. However, to quantify this, we need to assess whether this average expression is significantly lower than what is expected given the observed variance in the average estimates. As you can imagine, if N is small, there is quite a bit of variability in the average estimate, so even if you observe a very low average in a cell, it could have arisen by chance. If N is much bigger, we have a much more robust estimate of the average expression in this region. Some datasets are also cleaner than others, perhaps with a better normal reference, so the variance as a function of N can differ from cell to cell and dataset to dataset.
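This variance-versus-N intuition can be illustrated with a small standalone simulation (not HoneyBADGER code): the spread of the average-expression estimate shrinks as the number of genes in the region grows, so the same observed deviation is far more convincing for a large region than a small one.

```r
# Illustration: variability of the mean of N standard-normal "genes".
set.seed(1)
sim.sd.of.mean <- function(N, n.sims = 5000) {
  sd(replicate(n.sims, mean(rnorm(N))))
}
sapply(c(5, 50, 500), sim.sd.of.mean)
# the sd of the average estimate falls roughly as 1/sqrt(N),
# i.e. on the order of 0.45, 0.14, 0.045 for N = 5, 50, 500
```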
For the obvious deletions and amplifications you can visually identify, can you test them manually (instead of relying on the HMM)?
library(GenomicRanges)  # provides GRanges and IRanges
region <- GRanges(seqnames='chr5', ranges=IRanges(start=0, end=1e9))
hb$calcGexpCnvProb(region = region)
Similarly, this can be done for alleles using calcAlleleCnvProb. This will return a list of the posterior probabilities of amplification and deletion, and the estimated expression deviation in this region, for each cell.
It would be interesting to compare this estimated expression deviation across different datasets. I agree that the striking expression change due to CNV in the MGH dataset (and our own MM datasets in the paper) may not be typical of other tumors, so I should switch to flexible thresholds set based on the observed gene expression variances in each dataset.
In terms of the number of SNPs HoneyBADGER requires, the more the better, of course; 5000 is quite a good number. Actually, I think the allele visualization on chromosomes 17, 21, and 22 for your data is particularly striking and agrees well with the expression data. It would be cool to test out those regions with the combined approach.
Unfortunately, I think the current combined approach requires regions to be identified by both the allele and gene expression methods. I need to change this from an intersect to a union. Sorry for that mistake.
Finally, yes, I would greatly welcome any contributions. If you make a pull request with a detailed comment describing your edits, I would be happy to merge your changes so you can be a contributor to the package, and I would love to acknowledge you in the publication.
Thanks again!
Best,
Jean
Thanks so much for your instructions.
Here are some quick replies:
- Better reference
I tried GTEx and the matched normal cells as references; the chromosomal expression profiles do not change much.
- Test ranges manually
For chr17:
exome data: LOH but copy-number neutral
expression model: 443 genes, deletion posterior probability 0.4 ~ 0.7, mean expression deviation -0.03 ~ -0.20
allele model: 109 SNPs, 99 genes, deletion posterior probability 0.96 ~ 1
For chr21:
exome data: LOH and copy number loss
expression model: 72 genes, deletion posterior probability 0.4 ~ 0.57, mean expression deviation -0.03 ~ 0.3
allele model: 32 SNPs, 22 genes, deletion posterior probability 0 ~ 1 (some cells have this copy number loss while others don't)
It seems that the gene expression model needs further tweaking.
Can you tell me how to set the thresholds based on the observed gene expression variances manually?
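One hedged sketch, following the 1.96-standard-deviation idea mentioned earlier in this thread (this is not an official HoneyBADGER API; `gexp.mat` is an assumed genes x cells normalized expression matrix, and deriving pd/pa this way is only an illustration of the suggestion):

```r
# Derive pd/pa from the observed spread of the normalized expression
# values rather than from fixed fold-change expectations.
s <- sd(as.vector(gexp.mat), na.rm = TRUE)
pd <- -1.96 * s  # expected shift for a candidate deletion
pa <-  1.96 * s  # expected shift for a candidate amplification
hb$calcGexpCnvBoundaries(init = TRUE, pd = pd, pa = pa)
```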
Thanks again!
Best,
Yao
I have an issue installing the package:
Error in parse(outFile) :
/private/var/folders/c8/gwf0h1lx1j3fjwxcmmfgmdtdj1zjr2/T/Rtmpw09DOU/R.INSTALL6ed33bdf35fe/JEFworks-HoneyBADGER-94eb861/R/Hone:912:70: unexpected symbol
911: sd <- sd(mat.smooth)
912: z <- HiddenMarkov::dthmm(mat.smooth, matrix(c(1-2t
^
ERROR: unable to collate and parse R files for package ‘HoneyBADGER’
In reply to @saeedsaberi: Ah, thanks for the catch. Fixed as of commit 65e6ef6. If you have additional questions, please open a new issue so that we can keep each thread addressing a separate problem.
Another issue:
hb$retestIdentifiedCnvs()
Retesting bound genes ...
Error: 'subjectHits' is not an exported object from 'namespace:GenomicRanges'
@JEFworks, please let me know if this issue is on my side! I have had no luck solving it so far.
S
@saeedsaberi Can you please post your sessionInfo()? This may be related to the version of GenomicRanges you're using.
Here it is:
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] HoneyBADGER_0.1 BiocInstaller_1.24.0 HiddenMarkov_1.8-11 caTools_1.17.1 RColorBrewer_1.1-2 GenomicRanges_1.26.4
[7] GenomeInfoDb_1.10.3 IRanges_2.8.2 S4Vectors_0.12.2 BiocGenerics_0.20.0 rjags_4-6 coda_0.19-1
[13] biomaRt_2.30.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 git2r_0.19.0 plyr_1.8.4 XVector_0.14.1 bitops_1.0-6 tools_3.3.2
[7] zlibbioc_1.20.0 digest_0.6.12 bit_1.1-12 RSQLite_2.0 memoise_1.1.0 tibble_1.3.4
[13] gtable_0.2.0 lattice_0.20-35 rlang_0.1.4 DBI_0.7 curl_3.0 knitr_1.17
[19] httr_1.3.1 withr_2.1.0 devtools_1.13.4 bit64_0.9-7 grid_3.3.2 Biobase_2.34.0
[25] R6_2.2.2 AnnotationDbi_1.36.2 XML_3.98-1.9 ggplot2_2.2.1 blob_1.1.0 scales_0.5.0
[31] colorspace_1.3-2 RCurl_1.95-4.8 lazyeval_0.2.1 munsell_0.4.3
@saeedsaberi Hmm, I'm not sure if the issue is related to alyssafrazee/ballgown#92: subjectHits was removed from the S4Vectors package in 0.12 (and renamed to `to`). I made a few edits that seem like they should work, but I have not been able to recreate your error. Can you please reinstall from GitHub using devtools and try again?
Thanks @JEFworks, it is working now. Just note that you now need to add the following explicitly at the beginning:
library(GenomicRanges)
I'm wondering if you have a conda channel for this; I am having problems installing it on a cluster because rjags won't work.
@saeedsaberi We do not have a conda channel for this because it is an R package, not a Python package supported through Anaconda. If you have additional questions, please open a new issue so that we can keep each thread addressing a separate problem. Please check that JAGS (http://mcmc-jags.sourceforge.net/) is installed on your cluster. There is an additional installation tutorial here: https://www.r-bloggers.com/getting-started-with-jags-rjags-and-bayesian-modelling/
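For a cluster without root access, one possible sketch is to build JAGS into a user prefix and point rjags at it; the version number, paths, and environment-variable approach below are assumptions to adapt for your system:

```shell
# Build JAGS into $HOME/jags without root (version is a placeholder).
wget https://sourceforge.net/projects/mcmc-jags/files/JAGS/4.x/Source/JAGS-4.3.0.tar.gz
tar xzf JAGS-4.3.0.tar.gz && cd JAGS-4.3.0
./configure --prefix=$HOME/jags && make && make install

# Help rjags find the user-local JAGS at install and run time.
export PKG_CONFIG_PATH=$HOME/jags/lib/pkgconfig:$PKG_CONFIG_PATH
export LD_LIBRARY_PATH=$HOME/jags/lib:$LD_LIBRARY_PATH
R -e 'install.packages("rjags")'
```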
Hey,
I have a fundamental question: I want to use this on a 10x Genomics dataset, so I would like to know whether this package can be useful for that.
One basic question: in 10x data there is a huge bias towards the 3' end in the reads, so not all of the SNPs will be seen in the data. Can I call CNVs based only on the expression of proxy genes?
Also, we will have about 5000 cells per library; might this package be too slow for that?