I think hisat is not handling ribosomal RNA reads very well (see image). Neither are S

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

hisat2 does not handle rRNA reads well about hisat2 HOT 3 OPEN

daehwankimlab commented on July 22, 2024

hisat2 does not handle rRNA reads well

from hisat2.

Comments (3)

bwlang commented on July 22, 2024 1

Hi Daehwan:

This does occur when mapping reads to the entire human genome, but I don't
observe this strange splicing pattern in other loci (though I'm not sure
I'd know).
ChrUn_gl000220 contains the 45S pre-ribosomal RNA loci. These can be up to
90% of all transcripts in the cell, and individuals can have many repeats of
these loci. Both GRCh38 and 37 only represent 45S rRNA loci in an unplaced contig containing 2
repeats. Reads from many other loci are probably compressed onto these two
repeats.
I will try the --no-temp-splicesite and --dta-cufflinks options to see what I get.
Right now I'm happy with mapping these reads via bowtie and doing the rest
via Hisat, but it's a bit of a cumbersome workflow.
I wonder if Hisat could do a better job by increasing the cost of gap
extension (or whatever is similar for hisat). That might get it to use the
nearest possible splice site instead of the most closely matching.

from hisat2.

bwlang commented on July 22, 2024

Here is a small set of reads that illustrates the problem.
I baited pairs using mirabait and http://hgdownload.cse.ucsc.edu/goldenpath/hg19/snp138Mask/chrUn_gl000220.fa (requiring 6 31mers) in either read, then sampled 50k pairs with ngs-tools
Then I aligned the reads to hg19 using the attached refseq gtf as a guide and in directional mode where i could specify these.

undepleted_chrUn_gl000220.1.fastq.gz
undepleted_chrUn_gl000220.2.fastq.gz

encode_and_RM_rRNA.merged.interval.gz
hg19_genes.gtf.gz

Compared with the larger dataset above, the Tophat "false" splice junctions are less obvious, and star more evenly distributes the reads across the two loci (don't understand this... but that's a different rabbit hole).

from hisat2.

infphilo commented on July 22, 2024

Hi @bwlang,

Thank you for your detailed information (and sorry for the late response). It looks like chrUn_gl000220.fa contains many repeats (as shown in lowercase) and sequences of low complexity, which could mainly explain why HISAT2 reports many spliced alignments. Does this issue generally happen across the whole human genome or on this particular sequence (chrUn_gl000220)?

With HISAT2, we can be more conservative regarding spliced alignment using --no-temp-splicesite and --dta-cufflinks options, but they could have negative impacts on the alignment of non-ribosome RNA reads.

Thanks,
Daehwan

from hisat2.

Recommend Projects

hisat2 does not handle rRNA reads well about hisat2 HOT 3 OPEN

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent