Comments (9)

infphilo commented on July 22, 2024

It's very likely you have many SNPs concentrated in small genomic regions, which can cause HISAT2 to use a lot of memory. This is something I'm working on now. Briefly, I'll incorporate haplotype information so that only the combinations of SNPs actually present in human populations are considered, instead of allowing every possible combination of SNPs.

from hisat2.

roryk commented on July 22, 2024

This happened for me as well using the UCSC commonSnps 142 and hg38 with the HLA and decoy alternative alleles.

infphilo commented on July 22, 2024

@roryk, thanks again for the input. Solving this issue is my main focus right now. I'm also working on representing a variety of HLA sequences using HISAT2's graph approach. Hopefully, I'll come up with a systematic solution to this problem within two months.

JosieReinhardt commented on July 22, 2024

Hello, I have the same error with the current Linux binary, using my own genome and SNP annotations (it's an insect, not human, if that matters, and I don't have haplotype data). I could just remove the adjacent SNPs, I suppose, but would rather not. I can use up to 1 TB of RAM, so high memory usage is not really an issue for me. If I give the program the full 1 TB, will this error be avoided?

Also I'm a bit confused by the "switch to 64-bit version" suggestion, given I downloaded the Linux x86_64 binary... And when I call the version, this is confirmed:

hisat2 --version
/homes/bin/hisat2-align-s version 2.0.1-beta
64-bit
Built on igm3
Thu Nov 19 15:53:38 EST 2015
Compiler: gcc version 4.5.4 (GCC) 
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}

Thanks!

infphilo commented on July 22, 2024

The latest version of HISAT2 has some fixes for the case where many SNPs are concentrated in small genomic regions (e.g. hisat2_extract_snps_haplotypes_UCSC.py and hisat2_extract_snps_haplotypes_VCF.py). To prevent graph index construction from exploding, we need to either (1) use haplotype information so that we consider only those SNP combinations that are present in the population or (2) discard some SNPs. I anticipate that both SNPs and haplotypes will be available for many species.
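
To make the explosion concrete, here is a minimal sketch (not HISAT2 code) comparing the number of allele combinations that n clustered biallelic SNPs allow with the number of haplotypes actually observed. The function names and the toy phased data are hypothetical and only illustrate the counting argument.

```python
def count_snp_combinations(num_snps: int) -> int:
    """Number of allele combinations if every biallelic SNP varies independently."""
    return 2 ** num_snps


def observed_haplotypes(phased_chromosomes):
    """Distinct haplotypes present in a set of phased chromosomes."""
    return set(phased_chromosomes)


# Hypothetical toy data: 4 clustered SNPs, 6 phased chromosomes.
# Each string is one chromosome; '0' = reference allele, '1' = alternate allele.
phased = ["0000", "0000", "1100", "1100", "0011", "1111"]

all_combos = count_snp_combinations(4)
seen = observed_haplotypes(phased)
print(f"all combinations: {all_combos}, observed haplotypes: {len(seen)}")
```

With 4 SNPs the gap is 16 versus 4; with 30 SNPs in one small region the unrestricted count is already above a billion, while the number of observed haplotypes cannot exceed the number of sampled chromosomes.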

JosieReinhardt commented on July 22, 2024

That makes sense - I decided to build the index with a much more conservative set of SNP calls and it did work. Perhaps it would be helpful to give some guidelines for users about exactly how dense SNPs can be allowed to be, given various amounts of available memory? Thanks.

infphilo commented on July 22, 2024

For a genome the size of the human genome, index construction typically either needs <200GB or explodes (no matter how much memory you have). I'll try to provide a guideline; the scripts HISAT2 provides should generally work when using common SNPs.

Even for a comprehensive set of SNPs, as long as you have haplotype information, I think indexing also works most of the time. I was able to index the human genome plus 80 million SNPs using haplotype information.

orionzhou commented on July 22, 2024

Hi Daehwan, I am wondering if you could provide any directions on how to remove adjacent SNPs to avoid this "exceeded integer bounds" error. I have tried many ways to remove SNPs in SNP-dense regions, and have discarded regions with more than 2 SNPs in any 10 bp window, but the program is still crashing with the same error. I am building the index with only 8 million high-confidence SNPs. Any suggestions would be greatly appreciated!
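
For reference, the kind of window-based thinning described above can be sketched as follows. This is an illustrative filter under my own assumptions (SNP positions for a single chromosome given as integers, a hypothetical function name `thin_dense_snps`), not a HISAT2 utility, and the thread does not establish any threshold at which index construction is guaranteed to succeed.

```python
def thin_dense_snps(positions, window=10, max_snps=2):
    """Drop every SNP that falls in a window of `window` bp holding more than
    `max_snps` SNPs. `positions` are coordinates on one chromosome."""
    pos = sorted(positions)
    drop = [False] * len(pos)
    right = 0
    for left in range(len(pos)):
        if right < left:
            right = left
        # Extend the window [pos[left], pos[left] + window) as far as it goes.
        while right + 1 < len(pos) and pos[right + 1] - pos[left] < window:
            right += 1
        # Too many SNPs start within this window: mark all of them for removal.
        if right - left + 1 > max_snps:
            for k in range(left, right + 1):
                drop[k] = True
    return [p for p, d in zip(pos, drop) if not d]


# Hypothetical positions: three SNPs within 6 bp, then sparser ones.
kept = thin_dense_snps([100, 103, 105, 200, 300, 304], window=10, max_snps=2)
print(kept)  # [200, 300, 304]
```

The cluster at 100, 103 and 105 is removed because three SNPs fall within one 10 bp window, while the pair at 300 and 304 stays because it does not exceed the limit of 2.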

ekg commented on July 22, 2024

It may be possible to use tools in vcflib to convert the clusters of SNPs to haplotype alleles, then remove low-frequency haplotypes. The specific tool is vcfgeno2haplo, and you give it -w to specify a window size. You'd then need to use vcffixup to re-count the allele frequencies and then filter after that.

We have the same problem in vg, and we've resolved it by graph pruning (as implemented in vg prune). Recently we have integrated @jltsiren's GBWT haplotype index and he has provided a mechanism to only keep real haplotypes in the SNP clusters that have been pruned. It sounds like you have a similar mechanism in hisat2. Can you take GFA as input? If so then you could use the construction and pruning tools in vg to mitigate this problem.
