I'm having an issue building an index using a SNP file I generated myself. The referen

"exceeded integer bounds, remove adjacent SNPs or switch to 64-bit version" about hisat2 HOT 9 CLOSED

daehwankimlab commented on July 22, 2024

"exceeded integer bounds, remove adjacent SNPs or switch to 64-bit version"

from hisat2.

Comments (9)

infphilo commented on July 22, 2024

It's very likely that you have many SNPs concentrated in small genomic regions, which can cause HISAT2 to use a lot of memory. This is something I'm working on now. Briefly, I'll incorporate haplotype information so that only the combinations of SNPs actually present in human populations are considered, instead of allowing all combinations of SNPs.

roryk commented on July 22, 2024

This happened for me as well using the UCSC commonSnps 142 and hg38 with the HLA and decoy alternative alleles.

infphilo commented on July 22, 2024

@roryk, thanks for the input again. Solving this issue is my main focus nowadays. I'm also working on representing a variety of HLA sequences using HISAT2's graph approach. Hopefully, I'll come up with a systematic solution to this problem within two months.

JosieReinhardt commented on July 22, 2024

Hello, I have the same error with the current Linux binary, using my own genome and SNP annotations (it's an insect, not human, if that matters, and I don't have haplotype data). I could just remove the adjacent SNPs, I suppose, but would rather not. I can use up to 1 TB of RAM, so high memory usage is not really an issue for me. If I give the program the full 1 TB, will this error be avoided?

Also I'm a bit confused by the "switch to 64-bit version" suggestion, given I downloaded the Linux x86_64 binary... And when I call the version, this is confirmed:

hisat2 --version
/homes/bin/hisat2-align-s version 2.0.1-beta
64-bit
Built on igm3
Thu Nov 19 15:53:38 EST 2015
Compiler: gcc version 4.5.4 (GCC) 
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Thanks!

infphilo commented on July 22, 2024

The latest version of HISAT2 has some fixes for the case where many SNPs are concentrated in small genomic regions (e.g. hisat2_extract_snps_haplotypes_UCSC.py and hisat2_extract_snps_haplotypes_VCF.py). To prevent graph index construction from exploding, we need to either (1) use haplotype information so that we consider only those SNP combinations that are in the population or (2) discard some SNPs. I anticipate that both SNPs and haplotypes will be available for many species.
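
As a toy illustration of why option (1) helps, the sketch below compares the number of allele combinations when every SNP in a cluster is allowed to vary independently with the number of distinct combinations seen in a haplotype panel. This is not HISAT2's code; the function names and the five-haplotype panel are hypothetical and serve only to show the arithmetic.

```python
def all_combinations(num_snps: int) -> int:
    """Combinations if each biallelic SNP is independently ref or alt."""
    return 2 ** num_snps


def observed_combinations(haplotypes) -> int:
    """Distinct allele combinations actually present in a haplotype panel."""
    return len(set(haplotypes))


# Hypothetical panel: 5 haplotypes over 10 clustered SNPs (0 = ref, 1 = alt).
panel = [
    "0000000000",
    "1111100000",
    "0000011111",
    "1111111111",
    "0000000000",  # duplicate of the first haplotype
]

print(all_combinations(10))          # 1024
print(observed_combinations(panel))  # 4
```

With ten clustered biallelic SNPs, allowing every combination gives 1024 paths through the cluster, while this made-up panel contains only four distinct ones.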

JosieReinhardt commented on July 22, 2024

That makes sense - I decided to build the index with a much more conservative set of SNP calls and it did work. Perhaps it would be helpful to give some guidelines for users about exactly how dense SNPs can be allowed to be, given various amounts of available memory? Thanks.

infphilo commented on July 22, 2024

For a genome the size of the human genome, memory usage is typically either under 200 GB or it explodes (no matter how much memory you have). I'll try to provide a guideline, and the scripts HISAT2 provides should generally work when using common SNPs.

Even for a comprehensive set of SNPs, as long as you have haplotype information, I think indexing also works most of the time. I was able to index the human genome plus 80 million SNPs using haplotype information.

orionzhou commented on July 22, 2024

Hi Daehwan, I am wondering if you could provide any directions on how to remove adjacent SNPs to avoid this "exceeded integer bounds" error. I have tried many ways to remove SNPs in SNP-dense regions, and have discarded regions where there are more than 2 SNPs in any 10 bp window, but the program still crashes with the same error. I am building the index with only 8 million high-confidence SNPs. Any suggestions would be greatly appreciated!
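
For reference, the kind of window filter described above can be sketched as follows. This is a minimal sketch, not an official HISAT2 tool: it assumes the tab-separated SNP file layout of ID, type, chromosome, 0-based position, and alternate allele, it looks only at start positions (so the span of a deletion is ignored), and the names `parse_snp_line` and `filter_dense_snps` are hypothetical. As the comment above notes, such a filter does not by itself guarantee that index construction succeeds.

```python
from collections import defaultdict


def parse_snp_line(line):
    """Parse one line of an (assumed) 5-column, tab-separated SNP file."""
    snp_id, snp_type, chrom, pos, alt = line.rstrip("\n").split("\t")
    return (snp_id, snp_type, chrom, int(pos), alt)


def filter_dense_snps(records, window=10, max_snps=2):
    """Drop every SNP lying in some `window`-bp span with more than `max_snps` SNPs.

    `records` is a list of (snp_id, snp_type, chrom, pos, alt) tuples.
    Input order is preserved for the records that are kept.
    """
    by_chrom = defaultdict(list)
    for idx, rec in enumerate(records):
        by_chrom[rec[2]].append((rec[3], idx))

    drop = set()
    for entries in by_chrom.values():
        entries.sort()
        right = 0
        for left in range(len(entries)):
            if right < left:
                right = left
            # Extend the window while the next SNP is within `window` bp of `left`.
            while (right + 1 < len(entries)
                   and entries[right + 1][0] - entries[left][0] < window):
                right += 1
            if right - left + 1 > max_snps:
                drop.update(idx for _, idx in entries[left:right + 1])

    return [rec for idx, rec in enumerate(records) if idx not in drop]


# Made-up records for illustration only.
example = [
    ("a", "single", "chr1", 100, "T"),
    ("b", "single", "chr1", 103, "G"),
    ("c", "single", "chr1", 107, "A"),  # a, b, c: 3 SNPs within 10 bp, all dropped
    ("d", "single", "chr1", 200, "C"),
    ("e", "single", "chr1", 205, "T"),  # d, e: only 2 SNPs within 10 bp, kept
    ("f", "single", "chr2", 100, "G"),
]
print([rec[0] for rec in filter_dense_snps(example)])  # ['d', 'e', 'f']
```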

ekg commented on July 22, 2024

It may be possible to use tools in vcflib to convert the clusters of SNPs to haplotype alleles, then remove low-frequency haplotypes. The specific tool is vcfgeno2haplo, and you give it -w to specify a window size. You'd then need to use vcffixup to re-count the allele frequencies and then filter after that.
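
The final filtering step could look roughly like the sketch below, applied to the VCF text after the allele frequencies have been re-counted. It is only an illustration under assumptions: it expects an `AF=` entry in the INFO column, it keeps a record if any of its alternate alleles reaches the threshold, and the function names and the 0.05 cutoff are hypothetical rather than anything prescribed by vcflib or HISAT2.

```python
def max_allele_frequency(info):
    """Return the largest AF value in a VCF INFO string, or None if unavailable."""
    for field in info.split(";"):
        if field.startswith("AF="):
            values = [float(x) for x in field[3:].split(",") if x != "."]
            return max(values) if values else None
    return None


def filter_by_af(vcf_lines, min_af=0.05):
    """Yield header lines unchanged, plus records whose maximum AF >= min_af."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        af = max_allele_frequency(cols[7])  # INFO is the 8th VCF column
        if af is not None and af >= min_af:
            yield line


# Made-up VCF lines for illustration only.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tACG\tTCA\t.\t.\tAC=40;AF=0.4\n",
    "chr1\t200\t.\tGT\tAC,AT\t.\t.\tAF=0.01,0.02\n",
    "chr1\t300\t.\tC\tT\t.\t.\tAC=1\n",
]
for kept_line in filter_by_af(vcf):
    print(kept_line, end="")
```

Records without an AF entry are dropped here; whether that is the right choice depends on the data set.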

We have the same problem in vg, and we've resolved it by graph pruning (as implemented in vg prune). Recently we have integrated @jltsiren's GBWT haplotype index and he has provided a mechanism to only keep real haplotypes in the SNP clusters that have been pruned. It sounds like you have a similar mechanism in hisat2. Can you take GFA as input? If so then you could use the construction and pruning tools in vg to mitigate this problem.
