Coder Social home page Coder Social logo

daehwankimlab / hisat2 Goto Github PK

View Code? Open in Web Editor NEW
461.0 461.0 111.0 32.5 MB

Graph-based alignment (Hierarchical Graph FM index)

License: GNU General Public License v3.0

Makefile 0.29% C++ 62.76% C 18.41% Perl 6.76% Python 8.64% Shell 1.38% CSS 0.14% HTML 1.41% VBScript 0.01% CMake 0.09% Java 0.09%

hisat2's People

Contributors

a-n-other avatar benlangmead avatar chbe-helix avatar daehwankimlab20191011 avatar hhuang58 avatar hlfernandez avatar imzhangyun avatar infphilo avatar jpaggi avatar kloetzl avatar leejj82 avatar molecules avatar mourisl avatar nigeldyer avatar parkchanhee avatar roryk avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hisat2's Issues

Paired-end unaligned output (via '--un-conc') no longer paired

After aligning to one genome, I tried to align unaligned sequences to a second genome, but the unaligned fastq is no longer paired correctly:

Error, fewer reads in file specified with -2 than in file specified with -1
libc++abi.dylib: terminating with uncaught exception of type int
(ERR): hisat2-align died with signal 6 (ABRT) 

EDIT: version information

$ hisat2 --version
<install path>/hisat2-2.0.1-beta/hisat2-align-s version 2.0.1-beta
64-bit
Built on anitra.peabody.jhu.edu
Thu Nov 19 15:42:46 EST 2015
Compiler: Thread model: posix
Options: -O3 -m64 -msse2 -funroll-loops -g3 -mmacosx-version-min=10.6 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

And checking the line counts for two separate runs:

> echo $(( $(wc -l < pair1_unal.2.fastq) / 4 ))
823739
> echo $(( $(wc -l < pair1_unal.1.fastq) / 4 ))
823749
> echo $(( $(wc -l < pair2_unal.1.fastq) / 4 ))
941561
> echo $(( $(wc -l < pair2_unal.2.fastq) / 4 ))
931280

Just to check that the order matches:

> head -n 16 pair1_unal.1.fastq | grep "^@"
@DJDP2NM1:346:HC2CHBCXX:2:1101:4282:2221 1:Y:0:AGTCAA
@DJDP2NM1:346:HC2CHBCXX:2:1101:4143:2234 1:N:0:AGTCAA
@DJDP2NM1:346:HC2CHBCXX:2:1101:11297:2216 1:N:0:AGTCAA
@DJDP2NM1:346:HC2CHBCXX:2:1101:14238:2221 1:Y:0:AGTCAA
> head -n 16 pair1_unal.2.fastq | grep "^@"
@DJDP2NM1:346:HC2CHBCXX:2:1101:4282:2221 2:Y:0:AGTCAA
@DJDP2NM1:346:HC2CHBCXX:2:1101:4143:2234 2:N:0:AGTCAA
@DJDP2NM1:346:HC2CHBCXX:2:1101:11297:2216 2:N:0:AGTCAA
@DJDP2NM1:346:HC2CHBCXX:2:1101:14238:2221 2:Y:0:AGTCAA
> tail -n 72 pair1_unal.1.fastq | grep "^@.*:"
@HWI-1KL138:239:HC5VKBCXX:2:2216:5739:34660 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:5426:34706 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:7056:34669 1:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:5909:34700 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:7023:34634 1:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:7785:34656 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:9909:34567 1:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:11084:34675 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:10330:34696 1:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:11410:34621 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:11406:34714 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:11842:34668 1:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:12588:34688 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:13170:34508 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:13639:34552 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:14874:34704 1:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:14980:34540 1:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:15840:34735 1:N:0:AGTCAA
> tail -n 16 pair1_unal.2.fastq | grep "^@.*:"
@HWI-1KL138:239:HC5VKBCXX:2:2216:9959:100454 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:10737:100356 2:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:10102:100383 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:13536:100402 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:13210:100332 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:13329:100306 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:14345:100252 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:14612:100475 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:14016:100390 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:16204:100410 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:16959:100418 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:17484:100313 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:17691:100274 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:19157:100450 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:2391:100634 2:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:3060:100637 2:N:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:3293:100581 2:Y:0:AGTCAA
@HWI-1KL138:239:HC5VKBCXX:2:2216:2503:100535 2:N:0:AGTCAA

And the last sequences to not match.

The two commands were (approximately):

hisat2 \
  --phred33 --fr \
  --un-conc pair1_GRCh38p3_unal.fastq \
  -1 pair1_1.fastq \
  -2 pair1_2.fastq \
  -I 0 \
  -S pair1_GRCh38p3.sam \
  -X 500 \
  -p 3 \
  -x GRCh38p3 \
  2>&1 | tee -a pair1_GRCh38_hisat2.log

hisat2 \
  --phred33 --fr \
  --un-conc pair1_GRCh38p3_H37Rv_unal.fastq \
  -1 pair1_GRCh38p3_unal.1.fastq \
  -2 pair1_GRCh38p3_unal.2.fastq \
  -I 0 \
  -S pair1_GRCh38p3_unal_H37Rv.sam \
  -X 500 \
  -p 3 \
  -x H37Rv \
  2>&1 | tee -a pair1_GRCh38_H37Rv_hisat2.log
  • NOTE: File names have been shortened for clarity.

extract_snps.py throwing assertion error

Hi

I have tried extract_snps.py to get snps using a vcf file (ftp://ftp-mouse.sanger.ac.uk/current_snps/) and also using the snpCommon.txt file that you mentioned in the manual. In both cases I got an assertion error.

example with snpCommon.txt ::

~/hisat2-2.0.0-beta/extract_snps.py GRCh37_ensembl_genome.fa snpCommon.txt > humanSNP_test.out

Traceback (most recent call last):
  File "/package/hisat2-2.0.0-beta/extract_snps.py", line 284, in <module>
    extract_snps(args.genome_file, args.snp_file, args.verbose, args.testset)
  File "/package/hisat2-2.0.0-beta/extract_snps.py", line 242, in extract_snps
    assert len(snp_list) > 0
AssertionError

Can you point out what's the issue here? Also is it not supposed to work with a vcf file like the one above?

remove scaffold and other unplaced sequence before mapping ?

Hi,
I downloaded reference genomes from Ensembl (fasta format).
But there are lots of sequences with name "dna:scaffold": https://github.com/CTLife/TEMP/tree/master/RefGenomes

Such as Mouse_GRCm38 (mm10), except chromosome 1-19, Mt, X and Y; others should be removed before mapping ? https://github.com/CTLife/TEMP/blob/master/RefGenomes/Mouse_GRCm38.p4.txt

Such as Human_GRCh38.p5 (hg38), https://github.com/CTLife/TEMP/blob/master/RefGenomes/Human_GRCh38.p5.txt, there are 516 sequences. In addition to chromosome 1-22, Mt, X and Y; others (such as CHR_HG2241_PATCH and KI270728.1) should be removed before mapping ?

Improve docs for extract_snps.py

Hey Guys,

Quick question re: usage.

$extract_snps.py -h
usage: extract_snps.py [-h] [-v] [--testset] [genome_file] [snp_file]

What exactly is a [genome_file]? I've tried running this command with FASTA files (e.g. mm10.fa) and the mm10 genes.gtf files successfully used for $extract_exons.py and $extract_splice_sites.py but in both cases I receive the following error:

Traceback (most recent call last):
  File "/Users/johnfinnigan/Desktop/Utilities/HISAT2/extract_snps.py", line 284, in <module>
    extract_snps(args.genome_file, args.snp_file, args.verbose, args.testset)
  File "/Users/johnfinnigan/Desktop/Utilities/HISAT2/extract_snps.py", line 242, in extract_snps
    assert len(snp_list) > 0
AssertionError

I'm not entirely sure what I'm doing wrong. Any help would be greatly appreciated

index builder requires massive memory and mapping fails

I tried building an index for the chinese hamster genome (link at bottom). I made a splice site and exon file with the python scripts.
It failed with the pre-compiled linux version from the hisat2 webpage. Unsure why.
I then downloaded and compiled directly from the github and it now succeeded, but required more than 40 gb of memory. To begin with I only had 16 gb available and it kept hitting the memory limit and then saying that it would adjust and use less memory, but I stopped it after the 4th time that it happened and switched to a bigger machine.
Now the index was made and I tried using Hisat2 to align 3 single ended fastq files which had previously successfully been aligned using tophat. I used the following command line
hisat2 -p 8 --no-unal -k 1 -x combined -U Sample_L_3/L_3_TTAGGC_L001_R1_001.fastq.gz,Sample_L_3/L_3_TTAGGC_L001_R1_002.fastq.gz,Sample_L_3/L_3_TTAGGC_L002_R1_001.fastq.gz -S Sample_L_3/hisat2_output

After 10 hours on a ubuntu 15.10, 8 core, 16 gb ram, machine I stopped it.
It gave zero output, but constantly used 100% of 1 core and a few gigs of ram.

Fasta:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Cricetulus_griseus/latest_assembly_versions/GCF_000223135.1_CriGri_1.0/GCF_000223135.1_CriGri_1.0_genomic.fna.gz
Gff:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Cricetulus_griseus/latest_assembly_versions/GCF_000223135.1_CriGri_1.0/GCF_000223135.1_CriGri_1.0_genomic.gff.gz

Hisat v2.0.1 reports identical alignments and does not mark either as secondary.

I first discovered this while using MarkDuplicates on Tophat2 output. I got the same Picard error as below and figured, maybe this issue is gone in HISAT2. After looking at the offending reads, the alignments reported by HISAT2 are clearly identical and reported more than once. This is related to issue 36 on Tophat2

Sofware versions:
samtools: 1.3.1
Picard 2.2.2
Tophat 2.1.1

Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  3: null:NB501257:32:HWCMVBGXX:1:212$
9:5229:14102
        at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:133)
        at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
        at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:442)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:193)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
ralstonm@host  ~/sandbox/test_mark_duplicates/test>samtools view mark_dup_hisat.sorted.bam | grep NB501257:32:HWCMVBGXX:1:21$
09:5229:14102
NB501257:32:HWCMVBGXX:1:21209:5229:14102        433     chr1    54441201        0       54M2D22M        =       113992827       0       ATTA$
TAGTGTGGTATTTCTGAACAACAATTTTATCAACTGTGTCATGATCATCAAGTTACAAAAGCAAATCCTCT AE/AEAEEEEEAEEEAAEEEEEEEEEE6EEEE6EEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE$
EAAAAAA AS:i:-16        ZS:i:-16        XN:i:0  XM:i:1  XO:i:1  XG:i:2  NM:i:3  MD:Z:42A11^AA22 YT:Z:UP NH:i:2
NB501257:32:HWCMVBGXX:1:21209:5229:14102        81      chr1    113992827       255     76M     chr3    75264199        0       AATGGGCATAAT$
GTGAAGTGAAAAAGGCCCTTTCTAAACAAGAGATGCAGTCTGCTGGATCACAGAGAGGTCGTG EEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAAAAA$
S:i:0   ZS:i:-3 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:76 YT:Z:UP NH:i:1
NB501257:32:HWCMVBGXX:1:21209:5229:14102        81      chr1    113992827       255     76M     chr3    75264199        0       AATGGGCATAAT$
GTGAAGTGAAAAAGGCCCTTTCTAAACAAGAGATGCAGTCTGCTGGATCACAGAGAGGTCGTG EEEEAEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAAAAAA
S:i:0   ZS:i:-3 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:76 YT:Z:UP NH:i:1
NB501257:32:HWCMVBGXX:1:21209:5229:14102        161     chr3    75264199        0       20M2D56M        chr1    113992827       0       AGAGG
ATTTGCTTTTGTAACTTGATGATCATGACACAGTTGATAAAATTGTTGTTCAGAAATACCACACTATTAAT AAAAAAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE6EEEE6EEEEEEEEEEAAEEEAEEEE
EAEA/EA AS:i:-16        ZS:i:-16        XN:i:0  XM:i:1  XO:i:1  XG:i:2  NM:i:3  MD:Z:20^TT13T42 YT:Z:UP NH:i:2
NB501257:32:HWCMVBGXX:1:21209:5229:14102        161     chr3    75264199        0       20M2D56M        chr1    113992827       0       AGAGG
ATTTGCTTTTGTAACTTGATGATCATGACACAGTTGATAAAATTGTTGTTCAGAAATACCACACTATTAAT AAAAAAEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEE6EEEE6EEEEEEEEEEAAEEEAEEEE
EAEA/EA AS:i:-16        ZS:i:-16        XN:i:0  XM:i:1  XO:i:1  XG:i:2  NM:i:3  MD:Z:20^TT13T42 YT:Z:UP NH:i:2

hisat2-align died with signal 11

Hi, I am moving from TopHat2 to HISAT2. I have downloaded the index files of the mice transcriptome from the HISAT2 website, specifically GRCm38 genome_tran. The HISAT2 died and prompted "hisat2-align died with signal 11", when I was trying to map mice's RNA-Seq data (SRR1873514). Any idea about what's problem here? By the way, it was OK to map the same data to human's transcriptome. It was also OK for the index files genome and genome_snp_tran of GRCm38.

--no-temp-splicesite option

Hi,
In ths paper, you said that HISATx1 uses a one-pass approach that aligns each pair of reads independently of others. What does one-pass mean? Does it menn the option --no-temp-splicesite is used to disable the default alignment strategy of making use of splice sites found by earlier reads to align later reads in the same run?

Haihua

Trouble with gapped alignment

I am trying to use HISAT2 for alignment of PE 150 bp Illumina DNA sequences. It is working great for perfect matches; however, gapped alignment is a problem. As an example, here are a pair of reads that map to chrM within 300 bp of each other. The first read is a perfect match to the reference (hg37) while the second has a 6 bp deletion.

@SRR3106538.397648993 ec:Z:0_18:14_0_4:0_0
TGAAATCTGGTTAGGCTGGTGTTAGGGTTCTTTGTTTTTGGGGTTTGGCAGAGATGTGTTTAAGTGCTGTGGCCAGAAGCGGGGGGAGGGGGGGGGGGGTTGGGGGGAAATTTTTTGTTATGATGTCTGTGtGGAAAGCGGCtGtGCaGAC
+
???????????????????????????????????????????????????????????????????????????????????????????????????+?+??+??????????+++?+++++?+++?++$?++?+++???$?$?+$+++

@SRR3106538.397648993 ec:Z:0_0:14_0_1:0_0
CACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCCCATCCTATTATTTATCGCa
+
??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????+????+???++?+???+???++?+?++++???#

Both BLAT and BWA MEM can align these reads properly. However, HISAT2 fails to align the second read. I have tried adjusting the minimum score threshold to "C,-50" and the gap penalties to "6,0" - the latter to match the default penalty used by BWA. I am also ignoring qualities. None of these measures have succeeded.

I'm not sure if this is a bug or if I'm just missing the magic combination of options that will enable HISAT2 to deal with gaps. Either way, thanks in advance for any help.

Cuffdiff is complaining about missing XS attribute

Hi,
I don't see the XS column in the bam file using samtools view
the command is the following :
hisat2 -q -u1000000 -5 0 -3 0 --known-splicesite-infile /./splicesites.txt --dta-cufflinks -p20 -x mm10 -U /../.fastq -S /.../...

version is 2.0.1 and I have also tried with .3 to no avail.

any help will be greatly appreciated.
Best

hisat2-align died with signal 6 (ABRT)

i am getting this error while using hisat2 to align data with reference genome,
can u please help me in solving this??

thank you

error is as follows

Error, fewer reads in file specified with -2 than in file specified with -1
terminate called after throwing an instance of 'int'
(ERR): hisat2-align died with signal 6 (ABRT)

i am attaching my code in png format

capture

HISAT2 statistic result is discordant with the one gets from bam file

Usually, when we use HISAT to do an alignment, we can get a report likes the following:
1652485 reads; of these:
1652485 (100.00%) were paired; of these:
123729 (7.49%) aligned concordantly 0 times
1267088 (76.68%) aligned concordantly exactly 1 time
261668 (15.83%) aligned concordantly >1 times
----
123729 pairs aligned concordantly 0 times; of these:
7415 (5.99%) aligned discordantly 1 time
----
116314 pairs aligned 0 times concordantly or discordantly; of these:
232628 mates make up the pairs; of these:
164172 (70.57%) aligned 0 times
53749 (23.11%) aligned exactly 1 time
14707 (6.32%) aligned >1 times
95.03% overall alignment rate

But when I use bam file to calculate unique ratio, I've got an discordant result.
The parameter I use to calculate is NH:i:1 && YT:Z:CP
And the result is 2984370

I will appreciate a lot if you can tell me what's wrong with it.

HISAT2 annotation mode with sra-acc function not work for unknown reasons

Hey Daehwan,

I tried this pipeline to run HISAT2 with annotation, but it is just paused there without any warning forever.

module load hisat/2.0.0

extract_splice_sites.py dmel_ERCC.gtf >dmel_ERCC.htsat2_splicesites

extract_exons.py dmel_ERCC.gtf >dmel_ERCC.htsat2_exons

hisat2-build --ss dmel_ERCC.htsat2_splicesites --exon dmel_ERCC.htsat2_exons dmel_ERCC.fa dmel_ERCC

hisat2 -x dmel_ERCC --sra-acc SRR070418

dmel_ERCC is just Drosophila melanogaster (FlyBase version 6.04) with Spike-In ERCC information, and the fasta and gtf files can be found in my good drive:
https://drive.google.com/folderview?id=0B8RMbm3u9H6-X3pTNHF5ODBZR0k&usp=sharing

Thanks very much!

Best regards,
Haiwang

Suggestion: Improved logging

Hi there,

I'm a big fan of HISAT2 but I struggle with the summary logs. Whilst they're intuitive to read as a user whilst running the tool, they're very tricky to parse computationally after the run is complete. The reason that I'm interested in this is because I'm the author of a tool called MultiQC, which creates summary reports describing the output of multiple bioinfo tools. MultiQC can only rely on the output from each tool and no additional logs or information from the user (which would vary case by case).

My first request would be to log the HISAT software version number and input parameters along with the summary stats (such as input filenames). This is especially useful when a pipeline concatenates the stderr from multiple runs into a single file. In the case of HISAT2 it would also help as the stderr log is identical to that of Bowtie2 and it's impossible to tell where the log output came from.

Secondly, a machine parseable format such as YAML would be a lot easier to unambiguously interpret, whilst still being fairly easy to read by eye.

Finally, and this one is less important and more subjective, I prefer summary statistics to be printed to a file - this makes them much easier to find. For example, for an input file input_R1.fastq.gz you could create input_R1.fastq.gz.HISAT2_summary.yaml. This is what a lot of other bioinformatics tools do (eg. STAR, Salmon and even Tophat with the align_summary.txt file). Note that if printing summary stats to a file then you can of course keep the stderr messages as they are currently.

So, in summary, instead of / in addition to the current stderr stream:

20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate

It would be fantastic to have something like this:

- HISAT2_version: 2.0.4
- input_files:
  - input_R1.fastq.gz
  - input_R2.fastq.gz
- summary_stats:
  - total_reads: 20000
  - unpaired_reads:
    - counts:
      - total: 20000
      - aligned_0_times: 1247
      - aligned_1_time: 18739
      - aligned_gt1_time: 14
    - percentages:
      - total: 100.00%
      - aligned_0_times: 6.24%
      - aligned_1_time: 93.69%
      - aligned_gt1_time: 0.07%
  - overall_alignment_rate: 93.77%

Does this make sense? Do you have any thoughts on the topic?

Thanks in advance,

Phil

Using --sra-acc randomly fails after a variable amount of reads

Using the --sra-acc option works as expected, but I'm getting:

Time loading forward index: 00:00:28
Time loading reference: 00:00:08
VCursorCellDataDirect failed: '(INSDC:quality:text:phred_33)QUALITY' [25526273]$
Multiseed full-index search: 00:32:22
25526272 reads; of these:
25526272 (100.00%) were unpaired; of these:
753459 (2.95%) aligned 0 times
21078965 (82.58%) aligned exactly 1 time
3693848 (14.47%) aligned >1 times
97.05% overall alignment rate

Note the VCursor error; the thing is, it fails silently after a variable amount of reads (25526272 here, 25460736 the next run, etc) - it could lead to problems if undetected.

hisat2-align died with signal 11

Hi there,

I am trying to align ESTs and their probes to the Salmo salar and O. mykiss reference genomes with HISAT2 Beta v. 2.03 : ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000233375.1_ICSASG_v2/ , http://www.genoscope.cns.fr/trout/data/

When trying to align the ESTs provided below I get the following error from this command (these ESTs used to align in v 2.01 without error). On a side note, a similar file of ESTs had poor alignment success ~45% in v 2.01 but now works much better with 72% success in v2.03.

./hisat2 -f -x /media/genetics/Data1/Jeff_workstation_1/hisat2-2.0.3-b
eta/genomes/Salmo_salar_WGS -U /media/genetics/Data1/Jeff_workstation_1/hisat2-2.0.3-beta/reads/32K.FASTA -S /media/genetics/Data1/Jeff_workstation_1/hisat2-2.0.3-beta/hits/32K_EST_hits.sam -p 8 --un /media/genetics/Data1/Jeff_workstation_1/hisat2-2.0.3-beta/hits/32K_EST_nohits.fasta

Use of uninitialized value $l in scalar chomp at ./hisat2 line 536, line 235645.
Use of uninitialized value $l in substitution (s///) at ./hisat2 line 537, line 235645.
Use of uninitialized value $l in print at ./hisat2 line 541, line 235645.
(ERR): hisat2-align died with signal 11 (SEGV)

https://drive.google.com/folderview?id=0B7DoVb4vIYNuc0xZN0xNcEtXUGM&usp=sharing

I have also emailed [email protected] with a similar issue

Thank you!

hisat2-build abnormal messages

When I ran hisat2-build with 2.0.3-beta:
hisat2-build -p 20 --ss splice_sites.txt --exon exons.txt $IN/${id}.fa genome

I just got following messages:

fchr[A]: 0
fchr[C]: 816149930
fchr[G]: 1403119588
fchr[T]: 1990227070
fchr[$]: 2807565693
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 1733303589 bytes to primary GFM file: Hisat2Index/Bos_taurus/genome.1.ht2
Wrote 701863824 bytes to secondary GFM file: Hisat2Index/Bos_taurus/genome.2.ht2
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
Returning from initFromVector
Wrote 1477089719 bytes to primary GFM file: Hisat2Index/Bos_taurus/genome.5.ht2
Wrote 713467316 bytes to secondary GFM file: Hisat2Index/Bos_taurus/genome.6.ht2
Re-opening _in5 and _in5 as input streams
Returning from HierEbwt constructor

In the first line, "fchr[A]: " should not be zero.

Also sometime the hisat-build process may exit.

Returning from GFM constructor
do_build.sh: line 7: 37686 Segmentation fault hisat2-build -p 20 --ss $OUT/$bpp/splice_sites.txt --exon $OUT/$bpp/exons.txt $IN/${id}.fa $OUT/$bpp/genome 2> $OUT/$bpp/my.log

Is there something error? My RAM is 1TB.

bam format

--hello,

is HISAT2 able to output directly in bam format as tophat2 ?
also, I was using tophat2 with this command line:

tophat2 -p 10 -N 6 --read-gap-length 6 --read-edit-dist 6 --library-type fr-unstranded -r 152 --mate-std-dev 53 -G bowtie2_index/mygenome.gtf --transcriptome-index transcript_index -o myOutput bowtie2_index/mygenome L001_R1.fastq.gz,L002_R1.fastq.gz L001_R2.fastq.gz,L002_R2.fastq.gz

which is the similar command using HISAT2 ?

thanx,
Laurent --

hisat2 does not handle rRNA reads well

I think hisat is not handling ribosomal RNA reads very well (see image). Neither are Star or Tophat2.
Tophat also finds a bunch of strange splice junctions (top)
Hisat is finding a bunch of splice junctions that are probably not there (middle)
Star does better on the splice junctions (bottom), but it only seems to map to the first locus it sees.

Hisat is particularly problematic because many more reads are split across the two 45S loci than with the other mappers. Feeding a hisat bam file to Picard's RNASeq Metrics tool produces about 50% rRNA, while Tophat, Star, and k-mer classifier produce ~80% rRNA.

This may be compounded by the fact that the genome reference is not complete for rRNA regions.

Are there settings I should be changing to handle this situation better?
I could disable splicing, but that doesn't make much sense since it would screw up the mapping of the mRNA.

Right now I'm going to try filtering rRNA as pre-mapping step, then mapping with bowtie and merging the bams.
I'll attach some downsampled fastq files and my rRNA interval list in case they're of use to you.

spliced_mappers_rrna

HISAT2 alignment never finishes

Hi there,
I am running hisat2 for paired end alignment using several cores (-p parameter) on an index I built myself. Seems like the program never exits and runs forever, also the SAM file seem to be complete.

Thanks,
Pascal

Bug with make_grch38_tran.sh script

Hi,

I believe you have some minor bugs in the make_grch38_tran.sh and possibly to the similar ones.

In line 50:

    if ! which extract_snps.py ; then

it should be:

   if ! which extract_splice_sites.py ; then

Similarly, in line 64

   HISAT2_SS_SCRIPT=`which extract_exons.py`

it should be:

   HISAT2_EXON_SCRIPT=`which extract_exons.py`

This seems quite serious since it creates an empty exon file!

Cheers

"exceeded integer bounds, remove adjacent SNPs or switch to 64-bit version"

I'm having an issue building in index using a snp file I generated myself. The reference is mm10 and I'm using SNPs where the SNP density may be very high in some regions. Should I convert my SNPs into dbSNP format and run extract_snps.py on that?

$ ~/hisat2-2.0.0-beta/hisat2-build -p 18 --snp castVs129.snp 129.fa 129
[ ... ]
18 182120741
18 184736243
17 176414121
17 174605850
17 178728918
17 177410899
17 173222117
18 175043857
20 168496928
FINISHED RECURSIVE SORTS: 16
BUILD TABLE: 25
BUILD INDEX: 10
COUNTED NEW NODES: 2
exceeded integer bounds, remove adjacent SNPs or switch to 64-bit version
Total time for call to driver() for forward index: 00:33:37
Error: Encountered internal HISAT2 exception (#1)
Command: hisat2-build --wrapper basic-0 -p 18 --snp castVs129.snp 129.fa 129 
Deleting "129.3.ht2" file written during aborted indexing attempt.
Deleting "129.4.ht2" file written during aborted indexing attempt.
Deleting "129.1.ht2" file written during aborted indexing attempt.
Deleting "129.2.ht2" file written during aborted indexing attempt.

Soft-clipping in default mode and --end-to-end option

I also came across the issue that Hisat2 is performing soft-clipping on reads (operation 'S' in the CIGAR string). Since this was unexpected I tried to read up on it on the Hisat2 homepage (ccb.jhu.edu/software/hisat2/manual.shtml) but the only mention of soft-clipping I found was:

--sp MX,MN
Sets the maximum (MX) and minimum (MN) penalties for soft-clipping per base, both integers. A number less than or equal to MX and greater than or equal to MN is subtracted from the alignment score for each position. The number subtracted is MN + floor( (MX-MN)(MIN(Q, 40.0)/40.0) ) where Q is the Phred quality value. Default: MX = 2, MN = 1.

I then found the following passage in the MANUAL document here on GitHub:

--end-to-end
In this mode, HISAT2 requires that the entire read align from one end to the other, without any trimming (or "soft clipping") of characters from either end. The match bonus--maalways equals 0 in this mode, so all alignment scores are less than or equal to 0, and the greatest possible alignment score is 0. This is mutually exclusive with--local.--end-to-endis the default mode.

I went ahead to use --end-to-end to prevent soft-clipping altogether (version 2.0.3b) but soft-clipping was still happening. I am wondering if the option --end-to-end exists at all (since it not documented anywhere else than here) and if it is working as intended? I found that the only way to prevent soft-clipping was to use stupidly high values for --sp, such as --sp 1000,1000.

I think in any case it would be good to add some documentation about soft-clipping because it certainly isn't what I would have expected as the default mode (especially since the documentation seems to rule it out as well).

Many thanks, Felix

Problem hisat2-build

Good morning hisat2-experts,

I really want to use hisat2, but I run into trouble building up an index on a compute server with enough cpus, disk space and ram using this command:

~/hisat2-2.0.3-beta/hisat2-build -p 60 --ss Sscrofa10.2.84.splicesites.txt --exon Sscrofa10.2.84.exons.txt --large-index --haplotype Sus_scrofa.10.2.84.haplotype --snp Sus_scrofa.10.2.84.snp -f Sscrofa10.2.84.fa Sscrofa10.2.84

Without snp and haplotype information this job succeeds! Snp and haplotype files were generated successfully with hisat2_extract_snps_haplotypes_VCF.py. Data origin is ensembl.
These are the last lines of the console output:

FINISHED RECURSIVE SORTS: 420
SORT, Make index: 522
TOTAL: 2698
Allocating ftab, absorbFtab
Entering GFM loop
Exited GFM loop
fchr[A]: 0
fchr[C]: 1042314871
fchr[G]: 1795533585
fchr[T]: 2547556460
fchr[$]: 3591534603
Exiting GFM::buildToDisk()
Returning from initFromVector
Wrote 2227544711 bytes to primary GFM file: Sscrofa10.2.84.1.ht2l
Wrote 1783815948 bytes to secondary GFM file: Sscrofa10.2.84.2.ht2l
Re-opening _in1 and _in2 as input streams
Returning from GFM constructor
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted

(Temporary) output files are: Sscrofa10.2.84.{1..8}.ht2l

I am very pleased about every helping advice.

Cheers, Frieder

Uniquely mapped reads: filtering by FLAG vs. NH:i field value

I am interested in alignments of only uniquely-mapped reads and am thinking of filtering HISAT2 alignments to retain only uniquely-mapped reads based on the fact that for a read that maps to more than one location HISAT2 will:

[1] set the SAM secondary bit in the FLAG field, and
[2] have the NH:i field set to >1

To filter based on [1], I use: hisat2 -1 1.fastq -2 2.fastq -q --phred33 --rna-strandness RF -x genome_tran -k 2 | samtools view -h -F 256 -

To filter based on [2], I use: hisat2 -1 1.fastq -2 2.fastq -q --phred33 --rna-strandness RF -x genome_tran -k 2 | grep -P "^\@|NH:i:1$

I expect both filterings to have the same effect but samtools flagstat suggests that it is not so. With [1], I get:

54850426 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
44272472 + 0 mapped (80.71% : N/A)

With [2], I get:

41636729 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
41636729 + 0 mapped (100.00% : N/A)

I am trying to understand the reason for the discrepancy between the number of mapped reads left in the HISAT2 alignment after filtering (44272472 vs. 41636729 ).

Without either filtering, I get:

57236373 + 0 in total (QC-passed reads + QC-failed reads)
2385947 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
46658419 + 0 mapped (81.52% : N/A)

(Using HISAT2 02-05-16 GitHub master, samtools 1.3)

VG format inputs and outputs

The paper indicates that the authors do not know of other published graph based aligners. I'm only familiar with two--- GCSA and BWBBLE. Recently another came to my attention, but I've lost the reference. (It related to resequencing against a bacterial population reference graph.) There are and will be other graph based methods. It would be really useful if they could talk to each other.

I develop vg. It provides a toolchain for working with bidirectional, cyclic sequence graphs--- construction, mapping, and genotyping in addition to many other functions. Although the documentation on the wiki refers to a prototype key/value based genome index, vg now uses GCSA2 for mapping and xg for storing a succinct version of the reference graph. Mapped reads can be subjected into the linear reference, although I don't think the support for this is as good as in HISAT2. From the perspective of this project probably the greatest utility in vg are its graph manipulation and visualization functions, which provide ways of understanding alignments against the graph and also modifying the graph using alignments.

We have implemented protobuf-based serialization formats for the graph (nodes, edges, and paths through it), alignments, pileup, and streams of these objects. I think it would be amazing to work with HISAT2 alignments in graph space and also really useful if I could provide HISAT2 a reference graph constructed using other methods. So, I'd like to suggest implementing adapters for some of the protobuf serialization formats that vg uses. If others find this potentially useful I can try to help the process along.

In the Global Alliance for Genomics and Health reference variation group we are exploring ways of working with these kinds of reference systems. It would be great if a user or developer of HISAT2 were participating! Right now we are working on constructing graphs for difficult regions of the reference, and we will be evaluating aligners to map against these in the coming months.

(ERR): hisat2-align died with signal 11 (SEGV) (core dumped)

Hello,

I tried to use Hisat2 with two different references. The first consists of some large pseudomolecules and the second consists of lots of contigs (both are from the same species). I used version 2.0.4 for mapping and 2.0.3 for indexing. Mapping RNA-seq data to pseudomolecules works fine without any problems but it fails for the contigs.
hisat2-align died with signal 11 (SEGV) (core dumped)

I found in a previous issue that these error might be related to indexing. https://github.com/infphilo/hisat2/issues/13

I also get warning messages while building an index for the contigs:
Warning: Encountered reference sequence with only gaps

Can you help me with that? Is there a minimum length for unmasked sequences?

Best
Sven

Crash in hisat2-align: CIGAR and query sequence are of different length

Hi,

HISAT2 crashed for me with the following error.

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 2250172
[main_samview] truncated file.
(ERR): hisat2-align died with signal 13 (PIPE) 

On the same file TopHat2 returns good results.

Corrupted index makes hisat2 run forever

Hi,

we created a GRCh37 index (with some extra MHC haplotype sequences included). There was no "eye-catching" error and we went on aligning or datasets with hisat2 against this index.
These jobs were running on one single cpu core (-p 12 was used) "forever".
strace was just dumping read(6, "", 4096) = 0 ... and obviously nothing happened ...
So we took a closer look at the index. There were two files missing (5+6).

  • In case of an error you could make the error message more "eye-catching" during or after building the index. You could even remove the incomplete index to make people aware that there was sth. serious going wrong.
  • the alignment process (or the wrapper hisat2) should probably check if the index is complete and not corrupted.

Great tool though, .. just my 2p ;-)
Sven

(ERR): hisat2-align exited with value 1

Not sure if this is the right place, but I am having trouble running Hisat2 and can't figure out what is wrong. Could you please help?
I am running 2.0.1-beta version and Hisat2 was run as follows:
hisat2 -p 31 -x ${DATABASE} -1 ${R1} -2 ${R2} -U ${UNP} | samtools view -bS - > ${OUTPUT}.bam

And the error I get is:

Error reading _plen[] array: 0, 112320 Error: Encountered internal HISAT2 exception (#1) Command: /data004/software/GIF/packages/hisat/2.0.1-beta/hisat2-align-s --wrapper basic-0 -p 31 -x genome_indexed -1 /tmp/21951.inpipe1 -2 /tmp/21951.inpipe2 -U /tmp/21951.unp (ERR): hisat2-align exited with value 1

Any help will be greatly appreciated!

Thanks,

Identical alignments reported as secondary in output

Hi @infphilo,

I ran into a problem again. Occasionaly the aligner reports a secondary hit that is identical to the primary hit (approx 187 times with 1239178 sam lines). This looks like a buffering/syncing thing in the alignment optimisation. This might also be related to issue #44. For now Iโ€™ll remove the secondary reads.

I'm sorry for the headache this will give you.

Thanks for looking at the problem,
mmterpstra

Problem

Here some examples with the duplicates being on subsequent line numbers:

M01997:246:000000000-ANBW6_NNNNNN:1:2103:14983:7946 16  12  6646173 0   4M90N116M31S    *   0   0   TGGGGCTCATTTGCAGGGGGGAGCCAAAAGGGTCATCATCTCTGCCCCCTCTGCTGATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAGCAATGCCTCTGCACCACCAACTGCTTAGCAC GGGG1FFFFGF0/CCCGGGFF1</1FCHGGCFHFF><10FGFGEE//FCHGHHHHHFEGEE/FG1GEEEFHGBF/GEHHHHHFHFFA2HHHHFFA2DB0FHG0EFEG221BFBBFGFHHHHHHFEACBG1BCGEAAGGEB1FFFFFAA>1> AS:i:-31    ZS:i:-31    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:120    YT:Z:UU XS:A:+  NH:i:2
M01997:246:000000000-ANBW6_NNNNNN:1:2103:14983:7946 272 12  6646173 0   4M90N116M31S    *   0   0   TGGGGCTCATTTGCAGGGGGGAGCCAAAAGGGTCATCATCTCTGCCCCCTCTGCTGATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAGCAATGCCTCTGCACCACCAACTGCTTAGCAC GGGG1FFFFGF0/CCCGGGFF1</1FCHGGCFHFF><10FGFGEE//FCHGHHHHHFEGEE/FG1GEEEFHGBF/GEHHHHHFHFFA2HHHHFFA2DB0FHG0EFEG221BFBBFGFHHHHHHFEACBG1BCGEAAGGEB1FFFFFAA>1> AS:i:-31    ZS:i:-31    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:120    YT:Z:UU XS:A:+  NH:i:2
M01997:246:000000000-ANBW6_NNNNNN:1:2103:27010:12001    16  1   115256595   0   5M2071N88M58S   *   0   0   GAATCCTCTATGGTGGGATCATATTCATCTACAAAGTGGTTCTGGATTAGCTGGATTGTCAGTGCGCTTTTCCCAACACCACCTGCTCCAACCTCCGCCAGTTTGTACTCAGTCAGTTCACACCAGCAAGAGCCTCAAGCTCCACTGCCTC <GHEGGFHHFHCF@011GG2G>D11FB2GFFGHHCGFEGF1BFFHBFGGBFHHHGHHGFEEE>//GECHGAGGEA>F>F@>E>>0GFCGBA/A/EA/FFGHHE2DF1HFFBGBAD1HGABB0HFHFFHHGE1BGFGB1G>B1B1AAAAA1A AS:i:-58    ZS:i:-58    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:93 YT:Z:UU XS:A:-  NH:i:2
M01997:246:000000000-ANBW6_NNNNNN:1:2103:27010:12001    272 1   115256595   0   5M2071N88M58S   *   0   0   GAATCCTCTATGGTGGGATCATATTCATCTACAAAGTGGTTCTGGATTAGCTGGATTGTCAGTGCGCTTTTCCCAACACCACCTGCTCCAACCTCCGCCAGTTTGTACTCAGTCAGTTCACACCAGCAAGAGCCTCAAGCTCCACTGCCTC <GHEGGFHHFHCF@011GG2G>D11FB2GFFGHHCGFEGF1BFFHBFGGBFHHHGHHGFEEE>//GECHGAGGEA>F>F@>E>>0GFCGBA/A/EA/FFGHHE2DF1HFFBGBAD1HGABB0HFHFFHHGE1BGFGB1G>B1B1AAAAA1A AS:i:-58    ZS:i:-58    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:93 YT:Z:UU XS:A:-  NH:i:2
M01997:246:000000000-ANBW6_NNNNNN:1:2103:14433:13500    16  19  3054192 1   1M1469N63M585N45M1322N25M17S    *   0   0   GCTCTTGGGAGAGGTAGGGCAGGACCTGGGCACAAATCCCGTTCAGCCTTTTGACGATCTCAGCCTGTTTGTGCATCTCGATGTTCAAGCCGTAGGACATCTCGTAGTACATCACATAGTGACGCTGCATCTCTCTCTCTCTCCCCCGCTA C..://CGCGDG<</CGCC1GDCGAECFGF0=1GFED?CFGDFC?//BGFGHGG?1F1CAGAFFF0HFGBFBFFGFFGEFF43C0/EE>1DFGHGFGBGE>1EEBHHGHHFGFF5HFGE1GGGHGBFGHGECCCFAGE2AA2@D??@BA?A AS:i:-17    ZS:i:-17    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:134    YT:Z:UU XS:A:-  NH:i:2
M01997:246:000000000-ANBW6_NNNNNN:1:2103:14433:13500    272 19  3054192 1   1M1469N63M585N45M1322N25M17S    *   0   0   GCTCTTGGGAGAGGTAGGGCAGGACCTGGGCACAAATCCCGTTCAGCCTTTTGACGATCTCAGCCTGTTTGTGCATCTCGATGTTCAAGCCGTAGGACATCTCGTAGTACATCACATAGTGACGCTGCATCTCTCTCTCTCTCCCCCGCTA C..://CGCGDG<</CGCC1GDCGAECFGF0=1GFED?CFGDFC?//BGFGHGG?1F1CAGAFFF0HFGBFBFFGFFGEFF43C0/EE>1DFGHGFGBGE>1EEBHHGHHFGFF5HFGE1GGGHGBFGHGECCCFAGE2AA2@D??@BA?A AS:i:-17    ZS:i:-17    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:134    YT:Z:UU XS:A:-  NH:i:2

cmdline (based on the sam output with version 2.0.3-beta):

.../software/hisat2/2.0.3-beta-foss-2016a/bin/hisat2-align-s --wrapper basic-0 -x .../ftp.broadinstitute.org/bundle/2.8/b37//hisat2/2.0.3-beta-goolf-1.7.20/human_g1k_v37_decoy -S .../some.hisat2.sam --known-splicesite-infile .../ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/hisat2-2.0.3-beta-foss-2016a/Homo_sapiens.GRCh37.75.splicesites.txt --score-min L,0,-0.6 --sp 1,1.5 -D 20 -R 3 --threads 1 -U /tmp/145331.unp

Here are the first few line numbers based on a diff of the read headers:

10802a10803
> M01997:246:000000000-ANBW6_NNNNNN:1:2103:14983:7946
22124a22126
> M01997:246:000000000-ANBW6_NNNNNN:1:2103:27010:12001
26429a26432
> M01997:246:000000000-ANBW6_NNNNNN:1:2103:14433:13500
28921a28925
> M01997:246:000000000-ANBW6_NNNNNN:1:2103:6939:14355
32653a32658
> M01997:246:000000000-ANBW6_NNNNNN:1:2103:20841:15645
76258a76264
> M01997:246:000000000-ANBW6_NNNNNN:1:2104:12265:9450
91860a91867
> M01997:246:000000000-ANBW6_NNNNNN:1:2104:14486:14331
99438a99446
> M01997:246:000000000-ANBW6_NNNNNN:1:2104:14377:16768
99993,100000d100000

Testing for minimal reproduction

Redo shows that this is not repeatable in a simplified run and then yields the correct output:

$ cat test.fq
@M01997:246:000000000-ANBW6_NNNNNN:1:2103:14433:13500
GCTCTTGGGAGAGGTAGGGCAGGACCTGGGCACAAATCCCGTTCAGCCTTTTGACGATCTCAGCCTGTTTGTGCATCTCGATGTTCAAGCCGTAGGACATCTCGTAGTACATCACATAGTGACGCTGCATCTCTCTCTCTCTCCCCCGCTA
+
C..://CGCGDG<</CGCC1GDCGAECFGF0=1GFED?CFGDFC?//BGFGHGG?1F1CAGAFFF0HFGBFBFFGFFGEFF43C0/EE>1DFGHGFGBGE>1EEBHHGHHFGFF5HFGE1GGGHGBFGHGECCCFAGE2AA2@D??@BA?A
@M01997:246:000000000-ANBW6_NNNNNN:1:2103:27010:12001
GAATCCTCTATGGTGGGATCATATTCATCTACAAAGTGGTTCTGGATTAGCTGGATTGTCAGTGCGCTTTTCCCAACACCACCTGCTCCAACCTCCGCCAGTTTGTACTCAGTCAGTTCACACCAGCAAGAGCCTCAAGCTCCACTGCCTC
+
<GHEGGFHHFHCF@011GG2G>D11FB2GFFGHHCGFEGF1BFFHBFGGBFHHHGHHGFEEE>//GECHGAGGEA>F>F@>E>>0GFCGBA/A/EA/FFGHHE2DF1HFFBGBAD1HGABB0HFHFFHHGE1BGFGB1G>B1B1AAAAA1A
@M01997:246:000000000-ANBW6_NNNNNN:1:2103:14983:7946
TGGGGCTCATTTGCAGGGGGGAGCCAAAAGGGTCATCATCTCTGCCCCCTCTGCTGATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAGCAATGCCTCTGCACCACCAACTGCTTAGCAC
+
GGGG1FFFFGF0/CCCGGGFF1</1FCHGGCFHFF><10FGFGEE//FCHGHHHHHFEGEE/FG1GEEEFHGBF/GEHHHHHFHFFA2HHHHFFA2DB0FHG0EFEG221BFBBFGFHHHHHHFEACBG1BCGEAAGGEB1FFFFFAA>1>

Running on this fastq:

$ .../software/hisat2/2.0.3-beta-foss-2016a/bin/hisat2-align-s --wrapper basic-0 -x .../ftp.broadinstitute.org/bundle/2.8/b37//hisat2/2.0.3-beta-goolf-1.7.20/human_g1k_v37_decoy -S /dev/stdout --known-splicesite-infile .../ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/hisat2-2.0.3-beta-foss-2016a/Homo_sapiens.GRCh37.75.splicesites.txt --score-min L,0,-0.6 --sp 1,1.5 -D 20 -R 3 --threads 1 -U test.fq
3 reads; of these:
  3 (100.00%) were unpaired; of these:
    0 (0.00%) aligned 0 times
    0 (0.00%) aligned exactly 1 time
    3 (100.00%) aligned >1 times
100.00% overall alignment rate
@HD VN:1.0  SO:unsorted
@SQ SN:1    LN:249250621
@SQ SN:2    LN:243199373
@SQ SN:3    LN:198022430
@SQ SN:4    LN:191154276
@SQ SN:5    LN:180915260
@SQ SN:6    LN:171115067
@SQ SN:7    LN:159138663
@SQ SN:8    LN:146364022
@SQ SN:9    LN:141213431
@SQ SN:10   LN:135534747
@SQ SN:11   LN:135006516
@SQ SN:12   LN:133851895
@SQ SN:13   LN:115169878
@SQ SN:14   LN:107349540
@SQ SN:15   LN:102531392
@SQ SN:16   LN:90354753
@SQ SN:17   LN:81195210
@SQ SN:18   LN:78077248
@SQ SN:19   LN:59128983
@SQ SN:20   LN:63025520
@SQ SN:21   LN:48129895
@SQ SN:22   LN:51304566
@SQ SN:X    LN:155270560
@SQ SN:Y    LN:59373566
@SQ SN:MT   LN:16569
@SQ SN:GL000207.1   LN:4262
@SQ SN:GL000226.1   LN:15008
@SQ SN:GL000229.1   LN:19913
@SQ SN:GL000231.1   LN:27386
@SQ SN:GL000210.1   LN:27682
@SQ SN:GL000239.1   LN:33824
@SQ SN:GL000235.1   LN:34474
@SQ SN:GL000201.1   LN:36148
@SQ SN:GL000247.1   LN:36422
@SQ SN:GL000245.1   LN:36651
@SQ SN:GL000197.1   LN:37175
@SQ SN:GL000203.1   LN:37498
@SQ SN:GL000246.1   LN:38154
@SQ SN:GL000249.1   LN:38502
@SQ SN:GL000196.1   LN:38914
@SQ SN:GL000248.1   LN:39786
@SQ SN:GL000244.1   LN:39929
@SQ SN:GL000238.1   LN:39939
@SQ SN:GL000202.1   LN:40103
@SQ SN:GL000234.1   LN:40531
@SQ SN:GL000232.1   LN:40652
@SQ SN:GL000206.1   LN:41001
@SQ SN:GL000240.1   LN:41933
@SQ SN:GL000236.1   LN:41934
@SQ SN:GL000241.1   LN:42152
@SQ SN:GL000243.1   LN:43341
@SQ SN:GL000242.1   LN:43523
@SQ SN:GL000230.1   LN:43691
@SQ SN:GL000237.1   LN:45867
@SQ SN:GL000233.1   LN:45941
@SQ SN:GL000204.1   LN:81310
@SQ SN:GL000198.1   LN:90085
@SQ SN:GL000208.1   LN:92689
@SQ SN:GL000191.1   LN:106433
@SQ SN:GL000227.1   LN:128374
@SQ SN:GL000228.1   LN:129120
@SQ SN:GL000214.1   LN:137718
@SQ SN:GL000221.1   LN:155397
@SQ SN:GL000209.1   LN:159169
@SQ SN:GL000218.1   LN:161147
@SQ SN:GL000220.1   LN:161802
@SQ SN:GL000213.1   LN:164239
@SQ SN:GL000211.1   LN:166566
@SQ SN:GL000199.1   LN:169874
@SQ SN:GL000217.1   LN:172149
@SQ SN:GL000216.1   LN:172294
@SQ SN:GL000215.1   LN:172545
@SQ SN:GL000205.1   LN:174588
@SQ SN:GL000219.1   LN:179198
@SQ SN:GL000224.1   LN:179693
@SQ SN:GL000223.1   LN:180455
@SQ SN:GL000195.1   LN:182896
@SQ SN:GL000212.1   LN:186858
@SQ SN:GL000222.1   LN:186861
@SQ SN:GL000200.1   LN:187035
@SQ SN:GL000193.1   LN:189789
@SQ SN:GL000194.1   LN:191469
@SQ SN:GL000225.1   LN:211173
@SQ SN:GL000192.1   LN:547496
@SQ SN:NC_007605    LN:171823
@SQ SN:hs37d5   LN:35477943
@PG ID:hisat2   PN:hisat2   VN:2.0.3-beta   CL:".../software/hisat2/2.0.3-beta-foss-2016a/bin/hisat2-align-s --wrapper basic-0 -x .../ftp.broadinstitute.org/bundle/2.8/b37//hisat2/2.0.3-beta-goolf-1.7.20/human_g1k_v37_decoy -S /dev/stdout --known-splicesite-infile .../ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/hisat2-2.0.3-beta-foss-2016a/Homo_sapiens.GRCh37.75.splicesites.txt --score-min L,0,-0.6 --sp 1,1.5 -D 20 -R 3 --threads 1 -U test.fq"
M01997:246:000000000-ANBW6_NNNNNN:1:2103:14433:13500    0   19  3054192 255 1M1469N63M585N45M1322N25M17S    *   0   0   GCTCTTGGGAGAGGTAGGGCAGGACCTGGGCACAAATCCCGTTCAGCCTTTTGACGATCTCAGCCTGTTTGTGCATCTCGATGTTCAAGCCGTAGGACATCTCGTAGTACATCACATAGTGACGCTGCATCTCTCTCTCTCTCCCCCGCTA C..://CGCGDG<</CGCC1GDCGAECFGF0=1GFED?CFGDFC?//BGFGHGG?1F1CAGAFFF0HFGBFBFFGFFGEFF43C0/EE>1DFGHGFGBGE>1EEBHHGHHFGFF5HFGE1GGGHGBFGHGECCCFAGE2AA2@D??@BA?A AS:i:-17    ZS:i:-18    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:134    YT:Z:UU XS:A:-  NH:i:1
M01997:246:000000000-ANBW6_NNNNNN:1:2103:27010:12001    0   1   115256595   255 5M2071N88M58S   *   0   0   GAATCCTCTATGGTGGGATCATATTCATCTACAAAGTGGTTCTGGATTAGCTGGATTGTCAGTGCGCTTTTCCCAACACCACCTGCTCCAACCTCCGCCAGTTTGTACTCAGTCAGTTCACACCAGCAAGAGCCTCAAGCTCCACTGCCTC <GHEGGFHHFHCF@011GG2G>D11FB2GFFGHHCGFEGF1BFFHBFGGBFHHHGHHGFEEE>//GECHGAGGEA>F>F@>E>>0GFCGBA/A/EA/FFGHHE2DF1HFFBGBAD1HGABB0HFHFFHHGE1BGFGB1G>B1B1AAAAA1A AS:i:-58    ZS:i:-62    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:93 YT:Z:UU XS:A:-  NH:i:1
M01997:246:000000000-ANBW6_NNNNNN:1:2103:14983:7946 0   12  6646173 255 4M90N116M31S    *   0   0   TGGGGCTCATTTGCAGGGGGGAGCCAAAAGGGTCATCATCTCTGCCCCCTCTGCTGATGCCCCCATGTTCGTCATGGGTGTGAACCATGAGAAGTATGACAACAGCCTCAAGATCATCAGCAATGCCTCTGCACCACCAACTGCTTAGCAC GGGG1FFFFGF0/CCCGGGFF1</1FCHGGCFHFF><10FGFGEE//FCHGHHHHHFEGEE/FG1GEEEFHGBF/GEHHHHHFHFFA2HHHHFFA2DB0FHG0EFEG221BFBBFGFHHHHHHFEACBG1BCGEAAGGEB1FFFFFAA>1> AS:i:-31    ZS:i:-34    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:120    YT:Z:UU XS:A:+  NH:i:1

PS: if it matters it was ran on CentOS 6.7 with intel arch with a home compiled hisat2

hisat2-align died with signal 6 (ABRT) && hisat2-align died with signal 4 (ILL)

Hello, Daehwan Kim.
I use the same script to run HISAT2, the first time I succeed and get the correct result, but the next time failed and until now , it still fails.
The error information are listed as follows:
image
image
Notably, all of this error information come from the same shell script.
I wonder if I use HISAT2 in a wrong way or there is bugs.

CIGAR M operator maps off end of reference

Hi there,
Thanks for hisat2, it's turning into a really great aligner for hg38.

I'm getting some improper alignments though with the following error from qualimap/samtools:
net.sf.samtools.SAMFormatException: SAM validation error: ERROR: Read name NS500580:53:HLLYNBGXX:1:21205:17950:6068, CIGAR M operator maps off end of reference

The read looks like this:
NS500580:53:HLLYNBGXX:1:21205:17950:6068 83 chrM 16503 255 69M6S = 16303 -269 GGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCGCGATGGATCACAG EEEEEEEEEEEEEEEAAEEAEEEAEEEEEEAEEEEEEEAEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEAAAAAAS:i:-11 ZS:i:-14 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:61A5N0N0 YS:i:-5 YT:Z:CP XS:A:+ NH:i:1 RG:Z:10

I can provide you a bam file with the above read if you wish. Would be great if this could be resolved somehow. I'm using 2.0.1-beta

Fasta file error only when adding --ss and --exon inputs

Hi,

When I run hisat2-build -p 4 ${genome_file} ${index_file}
the index builds fine.

However, when I use the same inputs but run:

hisat2-build -p 4 --ss ${ss_file} --exon ${exon_file} ${genome_file} ${index_file}

Then I get the error:

...
Writing header
Reserving space for joined string
Joining reference sequences
Reference file does not seem to be a FASTA file
Time to join reference sequences: 00:00:00
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal HISAT2 exception (#1)
...

Now, this is clearly not an accurate error report as it's the same fasta file.

I'm happy to supply the input file if you want (this is yeast so takes ~1min to build index).

Thanks
Edward

Hisat2 reporting non-primary alignments

Hi Daehwan,
I am using Hisat2 VN:2.0.3-beta, and the default reporting option says:

Reporting: (default) look for multiple alignments, report best, with MAPQ

As it says I would expect a single alignment per sequence, but there are potentially many more, e.g. like so:
SRR522252_C_B_RNA.66 16 10 29698912 1 35M * 0 0 CAGAAAGTTCTCTAGCCAAAGACCGGTGTCTCATC >?<>=97=>7<8;;@8A?A<<9B?B>B=@ABB9=; AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: 35 YT:Z:UU NH:i:5 SRR522252_C_B_RNA.66 272 10 128676274 1 35M * 0 0 CAGAAAGTTCTCTAGCCAAAGACCGGTGTCTCATC >?<>=97=>7<8;;@8A?A<<9B?B>B=@ABB9=; AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: 35 YT:Z:UU NH:i:5 SRR522252_C_B_RNA.66 256 2 32963549 1 35M * 0 0 GATGAGACACCGGTCTTTGGCTAGAGAACTTTCTG ;=9BBA@=B>B?B9<<A?A8@;;8<7>=79=><?> AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: 35 YT:Z:UU NH:i:5 SRR522252_C_B_RNA.66 272 11 20847595 1 35M * 0 0 CAGAAAGTTCTCTAGCCAAAGACCGGTGTCTCATC >?<>=97=>7<8;;@8A?A<<9B?B>B=@ABB9=; AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z: 35 YT:Z:UU NH:i:5

Am I misunderstanding something? If this is intended it might be good to warn users somehow that they should filter out non-primary alignments. Thanks, Felix

AS, ZS and NH tags: semantics?

Dear Daehwan,

I stumbled across two cases (see below) where I am unclear about whether I either incorrectly interpret the tags or they are indeed inconsistent. Hisat2 version and invocation were:

@PG ID:hisat2   PN:hisat2   VN:2.0.3-beta   CL:"/home/rschulz/bin/hisat2-2.0.3-beta/hisat2-align-s --wrapper basic-0 -p 4 --rna-strandness RF -x /home/rschulz/research/data/genomes/GRCm38/hisat2/genome_tran -S ./output/E.sam -1 /tmp/31877.inpipe1 -2 /tmp/31877.inpipe2"

Based on the definition from the Hisat2 web site, ZS:i:<N> Alignment score for the best-scoring alignment found other than the alignment reported. [...], I understand ZS to refer to the best-scoring alignment among the other found alignments, which could be greater than AS. However, I am confused by the additional sentence Note that, when the read is part of a concordantly-aligned pair, this score could be greater than [AS:i].. Why can ZS only be greater than AS when the read is part of a concordantly-aligned pair?

NH is defined as The number of mapped locations for the read or the pair. Does the use of locations instead of alignments imply that distinct alignments spanning the same coordinates in the target genome are not counted here? That could explain case 1 below, but not case 2.

Any help with understanding this would be much appreciated.

Case 1: ZS is present, suggesting that there are >1 alignments, but NH=1.

HWI-ST1037:275:C496DACXX:7:1206:15243:63664 163 1   4802273 255 7S10M43550N82M  =   4845943 43777   GTTTGGGTCCCCCCTCCCCTGTCTCGGAAACAAACAAACAAACAAACCGAAACACAGACATACAGTATTTCCAACCTAGGTAATATGAAAAGAAATCAA BBBFFFBFFBBFFFFIIIFFFBBFBFBFFFFFFI7BFF<BBF<BFFIF7B77<<<BBBBB<07B<0<<BB<B<B777'<<0<B<<<<BBFFFFFF<<BF AS:i:-9 ZS:i:-16    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:92 YS:i:0  YT:Z:CP XS:A:+  NH:i:1

Case 2: NH=2, but ZS is not present.

HWI-D00505:41:C437JACXX:5:1315:15701:47013  163 1   4807888 1   95M472N5M   =   4808454 666 CCGACGCACTGTCCGCCAGCCGGTGGATGTGCGGCAACAACATGTCCGCTCCGATGCCCGCCGTTGTGCCGGCCGCCCGGAAGGCCACCGCCGCGGTTAT    BBBFFFFFFFFFFIIIIIIIIIIFFIFFIFFIIIIIFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFF<BFF    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YS:i:0  YT:Z:CP XS:A:+  NH:i:2

Best, Reiner

Hisat2-align failing

Hi Daehwan,

I hope this is the right place to report an error and ask for assistance.

I have created an index for the genome of my organism of study using extract_splice_sites.py, extract_exons.py and the following command:

hisat2-build -p 32 --ss gtffile.ss --exon gtffile.exon genome.fa genomeindex

When I go to align reads using:

hisat2 -p 32 -x genomeindex -1 1_1.fq -2 1_2.fq -S ucsd_1.sam

I get the error:
(ERR): hisat2-align died with signal 11 (SEGV)

Am I doing something wrong or is this a bug?

Thanks,
Craig

P.S - hisat2-inspect is also returning a Seg fault for all option other than --names

Gene fusions

Hello, are there any plans to also include gene fusion detection in Hisat2, or will that be left to other tools working on the output Samfiles?

--mm command line flag dramatically increases runtime

I noticed that adding the --mm flag considerably increases runtime, but it seems to only happen for some reads. I have reduced this to a simple test case, which you can download from here:

meru.qb3.berkeley.edu/snf/read1.fastq
meru.qb3.berkeley.edu/snf/read2.fastq
meru.qb3.berkeley.edu/snf/read1_660.fastq
meru.qb3.berkeley.edu/snf/read2_660.fastq

The above four files are paired-end reads from an experiment I am trying to align with HISAT2. If I align either of these the program finishes normally:

[snf@meru test]$ time hisat2 -t -k 100 -p 6 -x /Volumes/11tb/seq/Homo_sapiens/Ensembl/GRCh38/Sequence/grch38_tran/genome_tran -1 read1_660.fastq -2 read2_660.fastq --rna-strandness RF --dta-cufflinks -S test3.sam
Time loading forward index: 00:00:05
Time loading reference: 00:00:00
Multiseed full-index search: 00:00:00
165 reads; of these:
  165 (100.00%) were paired; of these:
    22 (13.33%) aligned concordantly 0 times
    10 (6.06%) aligned concordantly exactly 1 time
    133 (80.61%) aligned concordantly >1 times
    ----
    22 pairs aligned concordantly 0 times; of these:
      3 (13.64%) aligned discordantly 1 time
    ----
    19 pairs aligned 0 times concordantly or discordantly; of these:
      38 mates make up the pairs; of these:
        19 (50.00%) aligned 0 times
        7 (18.42%) aligned exactly 1 time
        12 (31.58%) aligned >1 times
94.24% overall alignment rate
Time searching: 00:00:01
Overall time: 00:00:06
hisat2 -t -k 100 -p 6 -u 165 -x  -1 read1_660.fastq -2 read2_660.fastq  RF  -  1.34s user 5.26s system 100% cpu 6.572 total

[snf@meru test]$ time hisat2 -t -k 100 -p 6 -x /Volumes/11tb/seq/Homo_sapiens/Ensembl/GRCh38/Sequence/grch38_tran/genome_tran -1 read1.fastq -2 read2.fastq --rna-strandness RF --dta-cufflinks -S test3.sam 
Time loading forward index: 00:00:04
Time loading reference: 00:00:01
Multiseed full-index search: 00:00:00
164 reads; of these:
  164 (100.00%) were paired; of these:
    21 (12.80%) aligned concordantly 0 times
    10 (6.10%) aligned concordantly exactly 1 time
    133 (81.10%) aligned concordantly >1 times
    ----
    21 pairs aligned concordantly 0 times; of these:
      3 (14.29%) aligned discordantly 1 time
    ----
    18 pairs aligned 0 times concordantly or discordantly; of these:
      36 mates make up the pairs; of these:
        18 (50.00%) aligned 0 times
        7 (19.44%) aligned exactly 1 time
        11 (30.56%) aligned >1 times
94.51% overall alignment rate
Time searching: 00:00:02
Overall time: 00:00:07
hisat2 -t -k 100 -p 6 -x  -1 read1.fastq -2 read2.fastq --rna-strandness RF    1.25s user 5.19s system 100% cpu 6.415 total

However, if I include the --mm option, then the program does not finish even after hours elapsed, but only for the "660" file (containing 660 lines). Happy to explain more or run other tests if it helps.

edit: forgot to include some relevant info - let me know if you'd like more info about the system:

[snf@meru seq]$ uname -a
Darwin meru.qb3.berkeley.edu 13.4.0 Darwin Kernel Version 13.4.0: Wed Mar 18 16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64 x86_64

[snf@meru seq]$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.9.5
BuildVersion:   13F1603

[snf@meru seq]$ hisat2 --version
/usr/local/bin/../Cellar/hisat2/2.0.1/bin/hisat2-align-s version 2.0.1-beta
64-bit
Built on mavericksvm.local
Wed Jan 27 03:26:09 GMT 2016
Compiler: Thread model: posix
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

[snf@meru seq]$ brew info hisat2
homebrew/science/hisat2: stable 2.0.1 (bottled)
graph-based alignment to a population of genomes
http://ccb.jhu.edu/software/hisat2/
/usr/local/Cellar/hisat2/2.0.1 (17 files, 4M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew-science/blob/master/hisat2.rb

-Stephen

Reference genomes on hisat2's website

I am trying to use bedtools/bcftools downstream of hisat2 and am just curious which reference genomes were used to create the index files on the hisat2 website?

For example, I am trying to use "GCA_000001635.6_GRCm38.p4_genomic.fna" with the "M. musculus, Ensembl GRCm38" genome but am not sure if the reference indices were built using the latest GRCm38 patch.

Any help would be more than appreciated. Thank you.

hisat2-build genome index problems for C. elegans

I'm trying to build an index for C. elegans genome using the command below:

$HISAT2_HOME/hisat2-build Caenorhabditis_elegans.WBcel235.dna.toplevel.fa -ss splice_sites.txt and --exon exon.txt C_elegans

It builds an index in under 2 minutes but when I use hisat2 inspect --ss --exon C_elegans, I see that splice sites and exon info is not included in the index. Should I add other options to the script for this to work? Do you think the reason can be the difference between chromosome names between human and C. elegans?

Would it be possible for you to provide hisat2 indexes for C. elegans in your website? These are the ensembl files for latest fasta and gtf for C. elegans.

ftp://ftp.ensembl.org/pub/release-83/fasta/caenorhabditis_elegans/dna/
ftp://ftp.ensembl.org/pub/release-83/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.83.gtf.gz

head splice_sites.txt
I 4357 5194 -
I 5295 6036 -
I 6326 9726 -
I 9845 10094 -
I 10584 11617 +
I 11556 11617 +
I 11560 11617 +
I 11560 11622 +
I 11685 14949 +
I 11688 14950 +

head exons.txt
I 3746 3908 -
I 4118 4357 -
I 5194 5295 -
I 6036 6326 -
I 9726 9845 -
I 10094 10229 -
I 10412 10584 +
I 11494 11560 +
I 11617 11688 +
I 14949 15159 +

hisat2-align crashes

hisat2 crashes during alignment, I have been able to narrow it down to a combination of one specific read pair and a very short template sequence.

$ hisat2-build contig.fasta contig
$ hisat2 -x contig -1 f.fastq -2 r.fastq
(ERR): hisat2-align died with signal 11 (SEGV) (core dumped)

$ cat contig.fasta

contig
AAGAGGGAAAGGAAAGATAAAAGAAAGGCAGTCGATTCAGTTAAGAAGCAAGGGGCTTCGAAGAAATAGTCAGCTCCTGTTTCTTTTCTTCAGCAGAAACAT

contig.fasta.txt

$ cat f.fastq

@SRR648705.1811113 HWI-ST132_0474:2:1102:1147:78001 length=100
TTCTGCTGAAGAAAAGAAACAGGAGCTNNNNNNNNNTTCGAAGCCCCTTGCTTCTTAAC
+SRR648705.1811113 HWI-ST132_0474:2:1102:1147:78001 length=100
CCCFFFFFHHHHHIIIGIDHIHIIIII%%%%%%%%%00BDF>DFHGGIIGIIG<EECA@

f.fastq.txt

$ cat r.fastq

@SRR648705.1811113 HWI-ST132_0474:2:1102:1147:78001 length=100
CTGATTCGAGCAACTGATGGCAAGAAAA
+SRR648705.1811113 HWI-ST132_0474:2:1102:1147:78001 length=100
CCCFFFFFDFHHHIIII4A++<3+<EGC

r.fastq.txt

Encountered, MPI

Hi,

When I use hisat2 to build index file for genome "hg19", it cannot finish successfully. The error shows as below:

Error: Encountered internal HISAT2 exception.
slurmd[cn7004]: *** STEP 287714.0 KILLED AT 2016-04-16T03:00:42 WITH SIGNAL 9 ***

Another question, is hisat based on MPI?

Thanks.

(ERR): hisat2-align died with signal 11 (SEGV) (core dumped)

Dear Daehwan,

I try to use hisat2. But I found one problem:
If I use --ss and --exon in the index step: $ hisat2-build genome.fasta --ss splice.txt --exon exons.txt gmindex, I got 8 .ht2 files.
in the next step: $ hisat2 -q -p 30 --known-splicesite-infile splice.txt -x gmindex -1 Fusarium_PH1_good_1.fq.gz -2 Fusarium_PH1_good_2.fq.gz -S ph1.sam
It will show (ERR): hisat2-align died with signal 11 (SEGV) (core dumped).
But if I don't use --ss and --exon, $ hisat2-build genome.fasta gmindex, everythink is ok including the mapping step.
could you help me about it?

Best regards,
Zhang Hao

The process looks like frozen

I ran HISAT2 for different data in the same time, but some of them didn't get the result, the jobs were continuing, but the result never updated anymore.
image
(pay attention to IGF0003548, IGF0003560 please)
image
(get into IGF0003548, you can see the two sam files just stop at 199M and 178M, while the others had been successfully compressed into bam, they all ran in the same time!)
image
(it looks like they continues their jobs, but in fact, they don't)
To solve this question, I have to kill the process and start a new one again and again untill I get the correct result, but to be honest, it is so tiring.
I wonder if there is any bugs in HISAT2?

Treat VCF files as tab-delimited

In hisat2_extract_snps_haplotypes_VCF.py, lines are parsed using str.split(), which splits on all whitespace. We have some filter values that contain spaces, which breaks the script. Switching to use split("\t"), or, better yet, csv.reader(..., delimiter="\t") fixes the issue.

HISAT and HISAT2 errors

Dear Daehwan,

I installed HISAT and HISAT2 to try out alignments on my own Desktop computer. I recently generated RNA-seq data from human cell lines. I have ~100 million reads (single end, Ion Proton) ranging from 30-300nts. I built HISAT and HISAT2 from source on my Fedora20 and faced no problem. 

However after i ran HISAT using the following command :

hisat --local -D 15 -R 2 -N 2 -L 20 -i S,1,0.75 -p 10 --known-splicesite-infile splicesite.txt -x hs1/hsagenome -q U1_length_qual_trimmed.fastq -S untreated_mcf.sam --un untreated_mcf_unal.fastq

I get the following error

Encountered internal Bowtie 2 exception (#1)
Command: /usr/local/bin/hisat-align-s --wrapper basic-0 --local -D 15 -R 2 -N 2 -L 20 -i S,1,0.75 -p 10 --known-splicesite-infile splicesite.txt -x hs1/hsagenome -q U1_length_qual_trimmed.fastq --passthrough
(ERR): hisat-align exited with value 1

I built the index using hisat-build from hg19 which gave .bt2 files. I am not sure whats wrong.

Then i proceeded to try out HISAT2 (built the index as well) using

hisat2 --end-to-end -D 15 -R 2 -N 1 -L 24 -i S,1,1.15 --known-splicesite-infile splicesite.txt -p 10 -x /home/gopal/MCF_RNAseq/compare/hg19/genome -q U1_length_qual_trimmed.fastq -S untreated_mcf_genome_mapped.sam --un untreated_mcf_rejected.fastq

But this stop in between with the following

(ERR): hisat2-align died with signal 15 (TERM)

Could you please tell what I am possibly doing wrong here.

Thanks,

Gopal

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.