
Comments (16)

JEFworks commented on June 16, 2024

Hello,

Yes, there are a number of parameters you may want to tweak.

When setting your gene expression matrix with setGexpMats, the default filtering is quite strict: filter=TRUE, minMeanBoth=4.5, minMeanTest=6, minMeanRef=8 (I am considering changing the default parameters to the 25% quantile of the input matrices as opposed to hard thresholds). Generally, I would recommend filtering out only the most lowly expressed genes, such that you still have > 4000 genes after filtering.
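
A rough sketch of that quantile idea (illustrative only, not the official setGexpMats interface; `gexp` here stands for a genes-by-cells log-scale expression matrix and is an assumed name):

```r
## Illustrative sketch: derive a filtering threshold from the 25% quantile
## of per-gene mean expression rather than a hard cutoff.
gene.means <- rowMeans(gexp)            # mean expression per gene
min.mean <- quantile(gene.means, 0.25)  # data-driven threshold
gexp.filtered <- gexp[gene.means > min.mean, ]
nrow(gexp.filtered)                     # aim to retain > 4000 genes
```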

In calcGexpCnvBoundaries, there are two parameters, pd and pa, which are the expected magnitude shifts due to deletion and amplification respectively. While we expect a doubling of gene expression for each extra copy and a halving for each copy lost (which is the default), this appears to differ quite a bit depending on the dataset/protocol, so I am considering changing this to a standard-deviation measurement (e.g. if gene expression is 1.96 standard deviations lower than the median, we may want to look at this region further as a potential deletion). You may also want to increase min.traverse (to something like 10 from the default 3), which is the depth of the tree traversal (higher number = look for smaller pools of cells).
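
For concreteness, a hedged sketch of such a call (parameter names taken from the discussion above; the exact signature may differ in the released version):

```r
## Relax the expected shift magnitudes and traverse the clustering tree
## more deeply to look for CNVs shared by smaller subpopulations of cells.
hb$calcGexpCnvBoundaries(init = TRUE,
                         pd = -0.5,         # expected shift for a deletion
                         pa = 0.5,          # expected shift for an amplification
                         min.traverse = 10) # deeper than the default of 3
```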

If you provide allele information, it should improve the power to identify CNVs, particularly for smaller alterations. In our manuscript, we show that while the gene-expression-based approach is able to identify whole-chromosome-level alterations with high accuracy and precision, it performs rather poorly for focal alterations, whereas the allele-based approach can achieve high accuracy and precision even for deletions as small as 10 Mb. Adding allele information will also allow you to catch copy-neutral LOH events.

To identify tumor cells (separate from normal), you can do hierarchical clustering on the final posterior probabilities for all CNVs tested. Tumor cells and normal cells should separate into two obvious groups. You can also identify normal cells as they should not have high posterior probabilities for any of the CNVs tested.
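
A minimal sketch of that clustering step (base R; `results` is an assumed name for a CNVs-by-cells matrix of final posterior probabilities):

```r
## Cluster cells on their CNV posterior-probability profiles.
hc <- hclust(dist(t(results)), method = 'ward.D2')
plot(hc)                    # tumor vs. normal cells should form two clades
groups <- cutree(hc, k = 2) # cell-to-group assignments

## Normal cells should lack high posteriors for every tested CNV.
normal <- colnames(results)[apply(results, 2, max) < 0.5]
```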

Please do note that this work is not yet published and the software is still not officially released, and thus subject to substantial changes during our manuscript revision. We are currently working to deposit a copy of the manuscript on bioRxiv. I'd be happy to incorporate any feedback and feature requests in the release if you have any to share. Thanks!

Best,
Jean

from honeybadger.

ahy1221 commented on June 16, 2024

Hello,
Thanks for your reply; it is very helpful.
Over the past few days I have worked hard to understand how HoneyBADGER works, and I ran a number of experiments following your advice. Here are my updates:

> When setting your gene expression matrix with setGexpMats, the default filtering is quite strict: filter=TRUE, minMeanBoth=4.5, minMeanTest=6, minMeanRef=8 (I am considering changing the default parameters to the 25% quantile of the input matrices as opposed to hard thresholds). Generally, I would recommend filtering out only the most lowly expressed genes, such that you still have > 4000 genes after filtering.

I have ~9000 protein-coding genes in my dataset after filtering. I realized that the default thresholds were too strict for my data, so I had already relaxed this step before my first post.

> In calcGexpCnvBoundaries, there are two parameters, pd and pa, which are the expected magnitude shifts due to deletion and amplification respectively. While we expect a doubling of gene expression for each extra copy and a halving for each copy lost (which is the default), this appears to differ quite a bit depending on the dataset/protocol, so I am considering changing this to a standard-deviation measurement (e.g. if gene expression is 1.96 standard deviations lower than the median, we may want to look at this region further as a potential deletion). You may also want to increase min.traverse (to something like 10 from the default 3), which is the depth of the tree traversal (higher number = look for smaller pools of cells).

From the code, it seems that the HMM step tries to identify the CNV status of regions. In this step, with the default pd and pa, I can get some candidate regions. However, the posterior probabilities for these regions are always between 0.6 and 0.7 in my dataset. Even when I changed pa or pd to get more regions (bound genes?), the posterior probabilities were still not high enough to pass a significance level (sorry if I misuse statistical terms). When I increased the m parameter to 0.5, the posterior probabilities rose to 0.8-0.95. Is it reasonable to do that? I am not sure why the Bayesian inference works after increasing m; I say that because I tried decreasing m first, and that did not work as I expected.

> If you provide allele information, it should improve the power to identify CNVs, particularly for smaller alterations. In our manuscript, we show that while the gene-expression-based approach is able to identify whole-chromosome-level alterations with high accuracy and precision, it performs rather poorly for focal alterations, whereas the allele-based approach can achieve high accuracy and precision even for deletions as small as 10 Mb. Adding allele information will also allow you to catch copy-neutral LOH events.

How many SNPs are enough for HoneyBADGER? I filtered the ExAC v3.1 VCF for AF > 0.1 and got ~60,000 SNPs; the tutorial dataset contains only ~5,000 SNPs. I can see very strong LOH on chr17 (consistent with the gene expression plot if we consider chr17 a copy number loss).
[image]

However, even after tweaking pa and pd, I cannot get this chromosome as a candidate for calculating posterior probabilities. The regions for which I get high confidence from gene expression are all focal, including chr6 [42206801, 44127457], chr12 [8058302, 10454685], and chr11 [77066932, 78139660].
I have not yet tried the combined calculation, but it seems I have already lost chr17.

> To identify tumor cells (separate from normal), you can do hierarchical clustering on the final posterior probabilities for all CNVs tested. Tumor cells and normal cells should separate into two obvious groups. You can also identify normal cells as they should not have high posterior probabilities for any of the CNVs tested.

I still have not obtained the full posterior probabilities for all cells. Also, the returnPlot parameter in plotGexpMat does not return the smoothed matrix. Is that because it is returned to a certain slot of the reference class itself? I am not sure, and I cannot find the slot.

> Please do note that this work is not yet published and the software is still not officially released, and thus subject to substantial changes during our manuscript revision. We are currently working to deposit a copy of the manuscript on bioRxiv. I'd be happy to incorporate any feedback and feature requests in the release if you have any to share. Thanks!

Yes, I noted that. So far I have changed some code locally to fit my analysis; would you welcome code contributions via GitHub branches, or some other way?

By the way, is HoneyBADGER sensitive to dropout events? Do we need to run knn.error.model from PAGODA to correct the expression values first? knn.error.model from PAGODA always takes a very long time to run.

Since this work is not yet published, my email is [email protected].
Please feel free to let me know if more detailed information about the tool or my data would be helpful.

Best,
Yao He


ahy1221 commented on June 16, 2024

Hello,
Today I tried the breast cancer data from the recently published study "Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer" (https://www.nature.com/articles/ncomms15081).
I can recover Figure 2b using HoneyBADGER's plotGexpProfile.
[image]

However, I encounter almost the same problem as with my own dataset. Specifically, with the default parameters of calcGexpCnvBoundaries, no bound genes are identified. Using min.traverse = 10, pd = -0.5, pa = 0.5, several bound genes came up, but the posterior probabilities are quite low (0.5-0.8) and only chr8 [47960185, 143790644] is identified as a CNV. It seems that the striking expression change due to CNV in the MGH dataset is not typical of other tumors.


JEFworks commented on June 16, 2024

Hi Yao,

Thanks so much for the wonderful feedback.

To address your questions:

The intuition for the posterior probabilities in the gene expression model is that, given N genes in a region, we ask what the average expression is. If the average is low, that points to a deletion; if it is high, it points to an amplification. However, to quantify this, we need to assess whether this average expression is significantly lower than expected given the observed variance in the average estimates. As you can imagine, if N is small, there is quite a bit of variability in the average estimate, so even if you observe a very low average in a cell, it could have arisen by chance. If N is much bigger, we have a much more robust estimate of the average expression in the region. Some datasets are also cleaner than others, perhaps with a better normal reference, so the variance as a function of N can differ from cell to cell and dataset to dataset.
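
The dependence on N can be illustrated with a toy simulation (base R, purely illustrative): the spread of the average shrinks roughly as 1/sqrt(N), so a low average over a few genes is far weaker evidence than the same average over many genes.

```r
## Standard deviation of the mean of N simulated expression values:
## it shrinks roughly as 1/sqrt(N), so averages over more genes are
## much more stable estimates.
set.seed(1)
sd.of.mean <- function(N) sd(replicate(2000, mean(rnorm(N))))
sapply(c(10, 100, 1000), sd.of.mean)
## roughly 0.32, 0.10, 0.03, i.e. ~1/sqrt(N)
```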

For the obvious deletions and amplifications you can visually identify, can you test them manually (instead of relying on the HMM):

library(GenomicRanges)
region <- GRanges(seqnames='chr5', ranges=IRanges(start=0, end=1e9))
hb$calcGexpCnvProb(region = region)

Similarly, this can be done for alleles using calcAlleleCnvProb. This will return a list of the posterior probabilities of amplification, deletion, and the estimated expression deviation in this region for each cell.

It would be interesting to compare this estimated expression deviation across datasets. I agree the striking expression change due to CNV in the MGH dataset (and our own MM datasets) may not be typical for other tumors, so I should switch to flexible thresholds set based on the observed gene expression variances in each dataset.

In terms of the number of SNPs HoneyBADGER requires, the more the better, of course; 5000 is quite a good number. Actually, I think the allele visualization on chromosomes 17, 21, and 22 for your data is particularly striking and agrees well with the expression data. It would be cool to test out those regions with the combined approach.

Unfortunately, I think the current combined approach requires regions to be identified by both the allele and gene expression method. I need to change this from an intersect to a union. Sorry for that mistake.

Finally, yes, I would greatly welcome any contributions. If you could please make a pull request with a detailed comment on your edits, I would be happy to merge your changes so you can be a contributor to the package, and I would love to acknowledge you in the publication.

Thanks again!

Best,
Jean


ahy1221 commented on June 16, 2024

Thanks so much for your instruction.
Here are some quick replies:

  • Better reference
    I tried GTEx and the matched normal cells as references; the chromosomal expression profiles do not change much.
  • Test ranges manually
    For chr17:
    exome data: LOH but copy-number neutral
    expression model: 443 genes; deletion posterior probability: 0.4-0.7; mean expression deviation: -0.03 to -0.20
    allele model: 109 SNPs, 99 genes; deletion posterior probability: 0.96-1
    For chr21:
    exome data: LOH and copy-number loss
    expression model: 72 genes; deletion posterior probability: 0.4-0.57; mean expression deviation: -0.03 to 0.3
    allele model: 32 SNPs, 22 genes; deletion posterior probability: 0-1 (some cells have this copy-number loss while others don't)

It seems that the gene expression model needs further tweaking.
Can you tell me how to set the thresholds manually based on the observed gene expression variances?

Thanks again!
Best,
Yao


saeedsaberi commented on June 16, 2024

I have an issue installing the package:

Error in parse(outFile) :
/private/var/folders/c8/gwf0h1lx1j3fjwxcmmfgmdtdj1zjr2/T/Rtmpw09DOU/R.INSTALL6ed33bdf35fe/JEFworks-HoneyBADGER-94eb861/R/Hone:912:70: unexpected symbol
911: sd <- sd(mat.smooth)
912: z <- HiddenMarkov::dthmm(mat.smooth, matrix(c(1-2t
^
ERROR: unable to collate and parse R files for package ‘HoneyBADGER’


JEFworks commented on June 16, 2024

In reply to @saeedsaberi: Ah, thanks for the catch. Fixed as of commit 65e6ef6. If you have additional questions, please open a new issue so that each thread addresses a separate problem.


saeedsaberi commented on June 16, 2024

another issue:
hb$retestIdentifiedCnvs()
Retesting bound genes ...
Error: 'subjectHits' is not an exported object from 'namespace:GenomicRanges'


saeedsaberi commented on June 16, 2024

@JEFworks, please let me know if this issue is on my side! I have had no luck solving it so far.
S


JEFworks commented on June 16, 2024

@saeedsaberi Can you please post your sessionInfo() ? This may be related to the version of GenomicRanges you're using.


saeedsaberi commented on June 16, 2024

Here it is:
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] HoneyBADGER_0.1 BiocInstaller_1.24.0 HiddenMarkov_1.8-11 caTools_1.17.1 RColorBrewer_1.1-2 GenomicRanges_1.26.4
[7] GenomeInfoDb_1.10.3 IRanges_2.8.2 S4Vectors_0.12.2 BiocGenerics_0.20.0 rjags_4-6 coda_0.19-1
[13] biomaRt_2.30.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 git2r_0.19.0 plyr_1.8.4 XVector_0.14.1 bitops_1.0-6 tools_3.3.2
[7] zlibbioc_1.20.0 digest_0.6.12 bit_1.1-12 RSQLite_2.0 memoise_1.1.0 tibble_1.3.4
[13] gtable_0.2.0 lattice_0.20-35 rlang_0.1.4 DBI_0.7 curl_3.0 knitr_1.17
[19] httr_1.3.1 withr_2.1.0 devtools_1.13.4 bit64_0.9-7 grid_3.3.2 Biobase_2.34.0
[25] R6_2.2.2 AnnotationDbi_1.36.2 XML_3.98-1.9 ggplot2_2.2.1 blob_1.1.0 scales_0.5.0
[31] colorspace_1.3-2 RCurl_1.95-4.8 lazyeval_0.2.1 munsell_0.4.3


JEFworks commented on June 16, 2024

@saeedsaberi Hmm, I'm not sure if the issue is related to: alyssafrazee/ballgown#92

subjectHits was removed from the S4Vectors package in 0.12 (and renamed to `to`)

I made a few edits that seem like they should work, but I have not been able to recreate your error. Can you please reinstall from GitHub using devtools and try again?


saeedsaberi commented on June 16, 2024

Thanks @JEFworks, it is working now. You just need to add the following explicitly at the beginning now:

library(GenomicRanges)


saeedsaberi commented on June 16, 2024

I'm wondering if you have a conda channel for this; I am having trouble installing it on a cluster because rjags won't work.


JEFworks commented on June 16, 2024

@saeedsaberi We do not have a conda channel for this because it is an R package, not a Python package supported through Anaconda. If you have additional questions, please open a new issue so that each thread addresses a separate problem. Please check that JAGS (http://mcmc-jags.sourceforge.net/) is installed on your cluster. There is an additional installation tutorial here: https://www.r-bloggers.com/getting-started-with-jags-rjags-and-bayesian-modelling/
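
A quick way to sanity-check the cluster environment before installing rjags (shell sketch; module names and install routes vary by cluster):

```shell
# Check whether the JAGS binary is on PATH; rjags links against the JAGS
# library at install time, so this needs to succeed first.
if which jags > /dev/null 2>&1; then
    echo "JAGS found: $(which jags)"
else
    echo "JAGS not found; install it (or load a jags module) before rjags"
fi
```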


saeedsaberi commented on June 16, 2024

Hey,

I have a fundamental question: I want to use this on a 10x Genomics dataset, and I would like to know whether this package can be useful there.
One basic issue is that 10x data has a huge 3' bias in the reads, so not all SNPs will be seen in the data. Can I call CNVs based only on the expression of proxy genes?
Also, we will have about 5,000 cells per library, so might this package be too slow?

