Comments (22)

jvikkula avatar jvikkula commented on August 19, 2024 3

Hi @gioelelm and @yueqiw,

Did you manage to solve this issue? I'm facing a rather similar issue, except that for me most genes, not just some, are detected with significantly lower counts in the Velocyto pipeline. I calculated the average initial cell size from the matrix that cellranger produces and it is ~4126, whereas the average initial cell size after the Velocyto pipeline is ~171 for spliced, ~378 for unspliced and ~17 for ambiguous. It also seems somewhat suspicious that there are far more unspliced reads than spliced reads. I have 14 samples altogether, and the results are similar across all of them.
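For anyone who wants to reproduce this comparison, here is a minimal sketch of how the "average cell size" (mean total count per cell) could be computed for each layer. It assumes the matrices are already loaded as genes x cells nested lists; the toy numbers and the names `mean_cell_size`, `cellranger` and `layers` are hypothetical and stand in for the real cellranger matrix and velocyto layers.

```python
def mean_cell_size(counts):
    """Average total count per cell for a genes x cells matrix (list of rows)."""
    n_cells = len(counts[0])
    totals = [0] * n_cells
    for gene_row in counts:
        for cell, value in enumerate(gene_row):
            totals[cell] += value
    return sum(totals) / n_cells


# Toy genes x cells matrices standing in for the real data (hypothetical numbers).
cellranger = [[10, 20], [30, 40]]
layers = {
    "spliced": [[1, 2], [3, 0]],
    "unspliced": [[2, 4], [1, 1]],
    "ambiguous": [[0, 1], [0, 0]],
}

cellranger_mean = mean_cell_size(cellranger)
layer_means = {name: mean_cell_size(matrix) for name, matrix in layers.items()}

# Fraction of the cellranger signal recovered when all three layers are summed.
velocyto_mean = sum(layer_means.values())
recovered_fraction = velocyto_mean / cellranger_mean
```

With the figures reported above, the three layers sum to about 566 counts per cell, roughly 14% of the ~4126 seen by cellranger.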

On the other hand, some genes, which are not detected at all by cellranger, are very highly expressed based on the Velocyto pipeline. I also started wondering if this is an issue related to the annotation file.

I am also running Velocyto on data generated by the 10x Genomics cellranger pipeline:
velocyto run10x -m repeat_msk.gtf 10x_sample_folder refdata-cellranger-GRCh38-1.2.0/genes/genes.gtf

I tried running without the repeat mask file, but that did not have an effect.

I would appreciate it a lot if you were able to help me with this issue. Thanks!

-Johanna

from velocyto.py.

LikaiTan avatar LikaiTan commented on August 19, 2024 1

hyjforesight avatar hyjforesight commented on August 19, 2024 1

@sandrav-CGEN check this theislab/scvelo#813

gioelelm avatar gioelelm commented on August 19, 2024

Hi,

First of all, thank you for using velocyto, and a double thank you for opening this issue; user feedback is really appreciated.

I have some hypotheses about why this might be happening, including differences in how multiple mappings, chimeric molecules and UMIs are treated. I think I handle many corner cases rather conservatively, and that could perhaps be relaxed.

However, I think that to answer your question properly one needs to understand the details of both pipelines. I am worried that without detailed knowledge of the cellranger source code (which I unfortunately do not have yet), whatever I say at this point about the differences is going to be a little hand-wavy. This comparison will take some time and may require access to the specific files from users observing unexpected behaviour (note that this is the first time I have seen such a difference, so it might be sample-specific somehow).

Note also that we provide a way to inspect what the counting pipeline is doing by outputting a molecular counting report using the -d 1 (HDF5 file with a detailed summary of each read's mapping annotation) or -d p1 (complete Python pickle object with all the mapping decisions) options. But again, an adequate comparison of all the corner cases is its own little project.

I am sorry that I cannot be more helpful than this at this point. I can try to get back to you in a week or so with some more suggestions. But the final, extensive comparison will require a little extra time... maybe a discussion with a cellranger core developer would speed up the process.

yueqiw avatar yueqiw commented on August 19, 2024

Hi Gioele,

Thanks for the fast reply! I wasn't sure if I was using velocyto properly, so I wanted to check if this issue has been observed before. Since it's never been reported, I have several more questions:

(1) If you look into the 10x datasets from your lab, do you observe similar issues? What about data from similar techniques, such as Dropseq?

(2) I looked at a separate experiment analyzed using the same cellranger and velocyto pipeline, and got very similar results. So the issue is not specific to that particular sample. Interestingly, the sets of genes that are under-represented in velocyto are very similar between two independent experiments (~600 genes are shared among 1000 under-represented genes from experiment 1 and 1000 under-represented genes from experiment 2. The samples are the same species, but quite different in terms of age and cell type composition). I guess the issue is specific to how a certain subset of genes are counted.
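A rough sketch of how such an overlap between experiments could be computed is shown below. It assumes per-gene total counts are available as dictionaries for each pipeline; the function name `under_represented`, the pseudocount and the toy numbers are hypothetical choices for illustration, not what was actually used for the plots in this thread.

```python
import math


def under_represented(cellranger_totals, velocyto_totals, top_n, pseudocount=1.0):
    """Return the top_n genes with the lowest log2(velocyto / cellranger) ratio."""
    ratios = {}
    for gene, cr_count in cellranger_totals.items():
        vc_count = velocyto_totals.get(gene, 0)
        ratios[gene] = math.log2((vc_count + pseudocount) / (cr_count + pseudocount))
    # Sort by ratio, breaking ties by gene name so the result is deterministic.
    ranked = sorted(ratios, key=lambda g: (ratios[g], g))
    return set(ranked[:top_n])


# Toy per-gene totals for two independent experiments (hypothetical numbers).
exp1_cr = {"A": 100, "B": 100, "C": 100, "D": 100}
exp1_vc = {"A": 5, "B": 10, "C": 90, "D": 95}
exp2_cr = {"A": 200, "B": 50, "C": 80, "D": 60}
exp2_vc = {"A": 10, "B": 45, "C": 4, "D": 55}

low1 = under_represented(exp1_cr, exp1_vc, top_n=2)
low2 = under_represented(exp2_cr, exp2_vc, top_n=2)
shared = low1 & low2
shared_fraction = len(shared) / 2
```

On real data, `top_n` would be 1000 and a shared fraction near 0.6 would correspond to the ~600 shared genes described above.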

> It will take some time to do this comparison and will maybe require to have the access to the specific files from the users observing any unexpected behaviour.

If it helps resolve the unexpected behavior and it's specific to our datasets, we'd be happy to provide the files.

Thanks!

gioelelm avatar gioelelm commented on August 19, 2024

> (1) If you look into the 10x datasets from your lab, do you observe similar issues? What about data from similar techniques, such as Dropseq?

Of course I checked, and last time the comparison looked definitely better than your plots. However, the codebase has changed a little in the meantime. I will check again as soon as possible.

> (2) I looked at a separate experiment analyzed using the same cellranger and velocyto pipeline, and got very similar results. So the issue is not specific to that particular sample. Interestingly, the sets of genes that are under-represented in velocyto are very similar between two independent experiments (~600 genes are shared among 1000 under-represented genes from experiment 1 and 1000 under-represented genes from experiment 2. The samples are the same species, but quite different in terms of age and cell type composition). I guess the issue is specific to how a certain subset of genes are counted.

OK, interesting. Given this evidence, I can think of only two effects that could give rise to this kind of consistency.

  1. Annotation of introns that span exons/introns of other genes. Even if this sounds like an unrealistic situation in animal genomes, I have seen at least a couple of cases like this. Some of them were extreme, probably annotation errors, where two exons of the same transcript model were annotated one upstream and one downstream of a chromosomal region containing dozens of genes. This means that the intron between them overlapped with hundreds of features (introns and exons), which in practice caused velocyto to call chimeric molecules. I fixed that bug long ago by removing these unrealistic introns, but maybe there are more of them of modest length. If the problem is of this nature, one needs to figure out non-trivial logic (or a good heuristic) to assign such reads.
    Note that this kind of odd annotation would not cause any trouble for cellranger, which, I guess, does not consider the introns of transcript models.

  2. Multiple mappings. If cellranger counts the first match among multiple mappings (or resolves them in some other way), then, depending on how the aligner produces its output, those reads would be systematically assigned to the same genes. I think it has been good practice since the old bulk-sequencing days to remove multiple mappings from the analysis, and that is what I am doing in the current version of velocyto.
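To make the first hypothesis concrete, here is a minimal sketch of how one might flag transcript models whose introns contain exons of other genes. It is not velocyto's actual logic; it assumes all transcripts lie on the same chromosome and strand, uses 1-based inclusive coordinates, and the names `introns`, `spanning_introns` and the toy transcripts are hypothetical.

```python
def introns(exons):
    """Gaps between consecutive exons of one transcript, as (start, end) pairs."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start - prev_end > 1
    ]


def spanning_introns(transcripts):
    """Map each transcript to the other genes whose exons overlap its introns."""
    hits = {}
    for tx_id, (gene, exons) in transcripts.items():
        for intron_start, intron_end in introns(exons):
            for other_gene, other_exons in transcripts.values():
                if other_gene == gene:
                    continue
                # Two closed intervals overlap when each starts before the other ends.
                if any(s <= intron_end and e >= intron_start for s, e in other_exons):
                    hits.setdefault(tx_id, set()).add(other_gene)
    return hits


# Toy annotation: transcript id -> (gene name, list of exon intervals).
transcripts = {
    "tx1": ("GENE1", [(100, 200), (5000, 5100)]),   # long intron spanning GENE2
    "tx2": ("GENE2", [(1000, 1200), (1500, 1600)]),
    "tx3": ("GENE3", [(6000, 6100), (6300, 6400)]),
}

hits = spanning_introns(transcripts)
```

The pairwise loop is quadratic in the number of transcripts, which is fine for a spot check on a handful of suspicious genes but would need an interval index for a whole genome annotation.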

> If it helps resolve the unexpected behaviour and it's specific to our datasets, we'd be happy to provide the files.

That would be great. Please contact me by email and we can discuss how to share the data; however, I will start by repeating some diagnostics on my own data.

I would like to close this issue for now, since what is reported here is not an explicit bug, but let's talk about it over email ([email protected]). I will reopen it, and fix it immediately, if there is evidence that this was unexpected behaviour and not an effect of the extra stringency needed when considering introns of transcript models.

yueqiw avatar yueqiw commented on August 19, 2024

Hi Gioele,

Thanks for the detailed explanation. I'll approach you via email soon.

In the meantime, while you run diagnostics, another factor I can think of is that we use the human genome and the GRCh38-1.2.0 GTF provided by cellranger. So it's also possible that the effect is somehow related to the human genome annotation.

LikaiTan avatar LikaiTan commented on August 19, 2024

Hi,
We have exactly the same problem. I'm using the 10x 5' chemistry for both TCR and transcriptome; I don't know whether that is the problem.

Likai

yueqiw avatar yueqiw commented on August 19, 2024

@LikaiTan @jvikkula I didn't find a solution. Based on the email correspondence, the author @gioelelm wasn't able to invest much time on this issue.

He did note in the email that

> I have noticed that doing the counting with different gtfs can have a rather noticeable effect on the results.
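As a first diagnostic for the GTF effect mentioned in that email, one could compare which genes have exon records in two annotation files. The sketch below is only an illustration under simple assumptions: it reads standard tab-separated GTF lines with a `gene_id "..."` attribute, and the helper names `exon_genes` and `gtf_line` and the toy records are hypothetical.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def exon_genes(gtf_lines):
    """Collect gene_id values from exon records of a GTF given as an iterable of lines."""
    genes = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match:
            genes.add(match.group(1))
    return genes


def gtf_line(feature, gene_id):
    """Build a minimal tab-separated GTF record for the toy example."""
    attrs = f'gene_id "{gene_id}"; transcript_id "{gene_id}.1";'
    return "\t".join(["chr1", "toy", feature, "1", "100", ".", "+", ".", attrs])


# Two toy annotations standing in for the GTFs being compared.
gtf_a = [gtf_line("exon", "G1"), gtf_line("exon", "G2"), gtf_line("gene", "G3")]
gtf_b = ["# comment line", gtf_line("exon", "G2"), gtf_line("exon", "G4")]

genes_a = exon_genes(gtf_a)
genes_b = exon_genes(gtf_b)
only_in_a = genes_a - genes_b
only_in_b = genes_b - genes_a
shared = genes_a & genes_b
```

For real files, an open file handle can be passed directly, for example `exon_genes(open(path))` with `path` pointing at each GTF in turn.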

Since this issue has been closed, I'd suggest opening a new issue that describes your problem and refers to this one.

Hope it helps!

jvikkula avatar jvikkula commented on August 19, 2024

@yueqiw Thanks for your reply. I spent some more time trying to solve this but wasn't able to. However, I found that you can also use, for example, dropEst to do the counting, and that is working rather well for me.

yueqiw avatar yueqiw commented on August 19, 2024

@jvikkula Does dropEst count both spliced and unspliced transcripts? And were you able to use the counting result for RNA velocity? Thanks!

jvikkula avatar jvikkula commented on August 19, 2024

@yueqiw dropEst counts exonic, intronic and exon/intron-spanning reads (https://dropest.readthedocs.io/en/latest/dropest.html#velocyto-integration). I'm assuming that exonic reads are the same as spliced ones, and that you get unspliced transcripts by summing intronic and exon/intron-spanning reads.
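The mapping assumed in this comment can be sketched as follows. This only illustrates the assumption (spliced = exonic, unspliced = intronic + spanning) on toy genes x cells matrices; it is not dropEst's or velocyto's API, and `add_matrices` and the numbers are hypothetical.

```python
def add_matrices(a, b):
    """Element-wise sum of two equally shaped genes x cells matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy count matrices for the three read categories (hypothetical numbers).
exonic = [[5, 0], [2, 3]]
intronic = [[1, 1], [0, 4]]
spanning = [[0, 2], [1, 0]]

# Assumption from the comment above: exonic reads are treated as spliced,
# intronic plus exon/intron-spanning reads as unspliced.
spliced = exonic
unspliced = add_matrices(intronic, spanning)
```

The three matrices must share the same gene and cell ordering for the sum to be meaningful.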

yueqiw avatar yueqiw commented on August 19, 2024

@jvikkula Thanks! That's very helpful!

LikaiTan avatar LikaiTan commented on August 19, 2024

Hi @jvikkula,
I have a stupid question. I tried to run the dropEst pipeline, but I don't know how to get the config.xml. I tried 10x.xml, but it didn't work:
Run: 07/22/2019 17:04:15.
Can't open file with barcodes: './../data/barcodes/10x_aug_2016_split'

Many thanks in advance.

jvikkula avatar jvikkula commented on August 19, 2024

@LikaiTan I took only the estimation/dropEst part from the 10x config file and not the "tags search"/dropTag part at all, because if I understood correctly, that part is not needed for dropEst.

LikaiTan avatar LikaiTan commented on August 19, 2024

Problem solved; now I get counts comparable to those from cellranger.
Thanks very much!

Zifeng-L avatar Zifeng-L commented on August 19, 2024

Excuse me, can you tell me how to run dropEst without config.xml?

LikaiTan avatar LikaiTan commented on August 19, 2024

Hi, I think you do need a config.xml. It was too long ago and I have forgotten many details, but for me it looked like this:

<config>
    <Estimation>
        <Merge>
            <barcodes_file>[path to where you install dropest]/data/barcodes/10x_aug_2016_split</barcodes_file>
            <barcodes_type>const</barcodes_type>
            <min_merge_fraction>0.2</min_merge_fraction>
            <max_cb_merge_edit_distance>2</max_cb_merge_edit_distance>
            <max_umi_merge_edit_distance>1</max_umi_merge_edit_distance>
            <min_genes_after_merge>100</min_genes_after_merge>
            <min_genes_before_merge>20</min_genes_before_merge>
        </Merge>

        <PreciseMerge>
            <max_merge_prob>1e-5</max_merge_prob>
            <max_real_merge_prob>1e-7</max_real_merge_prob>
        </PreciseMerge>
    </Estimation>
</config>
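The "Can't open file with barcodes" error earlier in the thread showed a relative path, so it may help to check the configured path before running dropEst. The sketch below is a small standalone check, not part of dropEst: the wrapper elements `<config>`, `<Estimation>` and `<Merge>`, the `/hypothetical/...` path and the helper `barcodes_path` are assumptions made for illustration.

```python
import os
import xml.etree.ElementTree as ET

# Trimmed-down config; the wrapper elements and the path are assumptions.
CONFIG_TEXT = """
<config>
    <Estimation>
        <Merge>
            <barcodes_file>/hypothetical/dropEst/data/barcodes/10x_aug_2016_split</barcodes_file>
            <barcodes_type>const</barcodes_type>
        </Merge>
    </Estimation>
</config>
"""


def barcodes_path(config_text):
    """Return the text of the first <barcodes_file> element, or None if absent."""
    root = ET.fromstring(config_text)  # raises ParseError if the XML is malformed
    node = root.find(".//barcodes_file")
    return node.text.strip() if node is not None and node.text else None


path = barcodes_path(CONFIG_TEXT)
# True only if the path is absolute and points at an existing file.
path_ok = path is not None and os.path.isabs(path) and os.path.isfile(path)
```

If `path_ok` is false for a real config, replacing the relative path with the absolute location of the dropEst installation, as done in the comment above, is the first thing to try.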

Zifeng-L avatar Zifeng-L commented on August 19, 2024

Thanks a lot!!!! So it seems that the 5' data can be processed the same as the 3' data, no matter where the barcodes and UMI are?

Zifeng-L avatar Zifeng-L commented on August 19, 2024

Yes, though it didn't give me informative results 😂

Let me try! Thank you very much!

hyjforesight avatar hyjforesight commented on August 19, 2024

Same issue here. Any suggestions?

sandrav-CGEN avatar sandrav-CGEN commented on August 19, 2024

Same issue here. I will try the dropEst approach now...
