
Comments (3)

bwlang commented on July 22, 2024

Hi Daehwan:

This does occur when mapping reads to the entire human genome, but I don't
observe this strange splicing pattern at other loci (though I'm not sure
I'd know).
chrUn_gl000220 contains the 45S pre-ribosomal RNA loci. Transcripts from these
loci can make up as much as 90% of all transcripts in the cell, and individuals
can have many repeats of these loci. Both GRCh38 and GRCh37 represent the 45S
rRNA loci only in an unplaced contig containing 2 repeats. Reads from many
other loci are probably compressed onto these two repeats.
I will try the --no-temp-splicesite and --dta-cufflinks options to see what I get.
Right now I'm happy with mapping these reads via bowtie and doing the rest
via HISAT2, but it's a bit of a cumbersome workflow.
I wonder if HISAT2 could do a better job if the cost of gap extension (or
whatever the HISAT2 equivalent is) were increased. That might get it to use
the nearest possible splice site instead of the most closely matching one.
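
For readers who want to reproduce the two-stage workflow mentioned above, here is a minimal sketch that only builds the two command lines. It assumes bowtie2 (the comment says only "bowtie") and an rRNA-only index; the index names, file names, and the `two_stage_commands` helper are hypothetical and not taken from this thread. Pairs that fail to align concordantly to the rRNA index are written out with bowtie2's `--un-conc` and passed to hisat2.

```python
from typing import List, Tuple


def two_stage_commands(
    rrna_index: str,
    genome_index: str,
    reads_1: str,
    reads_2: str,
    prefix: str = "sample",
) -> Tuple[List[str], List[str]]:
    """Return (bowtie2 argv, hisat2 argv) for a two-stage alignment.

    Assumption: stage 1 uses bowtie2 against an rRNA-only index; the
    thread does not say which bowtie version or reference was used.
    """
    # bowtie2 replaces '%' in the --un-conc path with 1 or 2 (one file per mate).
    leftover_pattern = f"{prefix}.non_rrna.%.fastq"
    stage1 = [
        "bowtie2",
        "-x", rrna_index,
        "-1", reads_1,
        "-2", reads_2,
        "--un-conc", leftover_pattern,   # pairs that did not align concordantly
        "-S", f"{prefix}.rrna.sam",
    ]
    stage2 = [
        "hisat2",
        "-x", genome_index,
        "-1", leftover_pattern.replace("%", "1"),
        "-2", leftover_pattern.replace("%", "2"),
        "-S", f"{prefix}.genome.sam",
    ]
    return stage1, stage2


if __name__ == "__main__":
    # Hypothetical index and file names, for illustration only.
    s1, s2 = two_stage_commands("rrna_idx", "hg19_idx", "r.1.fastq.gz", "r.2.fastq.gz")
    print(" ".join(s1))
    print(" ".join(s2))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools and indexes are in place.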

from hisat2.

bwlang commented on July 22, 2024

Here is a small set of reads that illustrates the problem.
I baited pairs using mirabait and http://hgdownload.cse.ucsc.edu/goldenpath/hg19/snp138Mask/chrUn_gl000220.fa (requiring 6 31-mers in either read), then sampled 50k pairs with ngs-tools.
Then I aligned the reads to hg19, using the attached RefSeq GTF as a guide and directional mode wherever I could specify these.
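
To make the baiting criterion above concrete, here is a minimal sketch of k-mer baiting in plain Python. It is not mirabait's actual implementation: the function names are hypothetical, hits are counted per read position rather than per distinct k-mer, and both strands of the bait sequence are indexed as an assumption. A pair is kept when either read shares at least 6 31-mers with the bait sequence.

```python
from typing import Iterable, Iterator, Set, Tuple

K = 31        # k-mer length quoted in the comment above
MIN_HITS = 6  # minimum number of bait k-mers a read must contain

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int = K) -> Iterator[str]:
    """Yield every overlapping k-mer of seq, upper-cased (soft-masking ignored)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_bait_index(reference: str, k: int = K) -> Set[str]:
    """Collect the k-mers of the bait sequence on both strands."""
    index = set(kmers(reference, k))
    index.update(kmers(reverse_complement(reference), k))
    return index


def count_hits(read: str, index: Set[str], k: int = K) -> int:
    """Count read positions whose k-mer occurs in the bait index."""
    return sum(1 for kmer in kmers(read, k) if kmer in index)


def bait_pairs(
    pairs: Iterable[Tuple[str, str]],
    index: Set[str],
    k: int = K,
    min_hits: int = MIN_HITS,
) -> Iterator[Tuple[str, str]]:
    """Keep a pair when either mate reaches min_hits."""
    for r1, r2 in pairs:
        if count_hits(r1, index, k) >= min_hits or count_hits(r2, index, k) >= min_hits:
            yield r1, r2
```

Down-sampling the baited pairs to a fixed number (50k in the comment) could then be done with `random.Random(seed).sample(list_of_pairs, n)`.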

undepleted_chrUn_gl000220.1.fastq.gz
undepleted_chrUn_gl000220.2.fastq.gz

encode_and_RM_rRNA.merged.interval.gz
hg19_genes.gtf.gz

Compared with the larger dataset above, the TopHat "false" splice junctions are less obvious, and STAR distributes the reads more evenly across the two loci (I don't understand this, but that's a different rabbit hole).
spliced_mappers_rrna_100k


infphilo commented on July 22, 2024

Hi @bwlang,

Thank you for the detailed information (and sorry for the late response). It looks like chrUn_gl000220.fa contains many repeats (shown in lowercase) and low-complexity sequences, which could largely explain why HISAT2 reports many spliced alignments. Does this issue occur generally across the whole human genome, or only on this particular sequence (chrUn_gl000220)?

With HISAT2, we can be more conservative about spliced alignment by using the --no-temp-splicesite and --dta-cufflinks options, but they could have a negative impact on the alignment of non-ribosomal RNA reads.
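
One way to check whether the two options above actually reduce spliced alignments on chrUn_gl000220 is to run HISAT2 with and without them and count spliced records on that contig. The sketch below is only an illustration: the helper names, index name, and file names are hypothetical, and a record is treated as spliced when its CIGAR string contains an `N` operation.

```python
from typing import Iterable, List, Tuple


def hisat2_command(
    index: str,
    reads_1: str,
    reads_2: str,
    out_sam: str,
    conservative: bool = False,
) -> List[str]:
    """Build a hisat2 argv; optionally add the two options from this thread."""
    cmd = ["hisat2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", out_sam]
    if conservative:
        cmd += ["--no-temp-splicesite", "--dta-cufflinks"]
    return cmd


def count_spliced(sam_lines: Iterable[str], contig: str) -> Tuple[int, int]:
    """Return (records on contig, records on contig with an N in the CIGAR)."""
    total = spliced = 0
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[2] != contig:   # column 3 is RNAME
            continue
        total += 1
        if "N" in fields[5]:              # column 6 is CIGAR; N marks a skipped region
            spliced += 1
    return total, spliced


if __name__ == "__main__":
    # Hypothetical index and file names, for illustration only.
    print(" ".join(hisat2_command("hg19_idx", "r.1.fq.gz", "r.2.fq.gz", "out.sam", True)))
```

Running `count_spliced(open("out.sam"), "chrUn_gl000220")` on the default and the conservative output gives two pairs of counts to compare; the same function applied to another contig gives a rough view of the side effect on other reads.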

Thanks,
Daehwan

