Comments (9)
It's very likely you have many SNPs concentrated in small genomic regions, which can cause HISAT2 to use a lot of memory. This is something I'm working on now. Briefly, I'll incorporate haplotype information so that only the combinations of SNPs actually present in human populations are considered, instead of allowing all possible combinations of SNPs.
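To make the scale of the problem concrete, here is a toy sketch in Python. It is not HISAT2 code; the function names and the haplotype strings are hypothetical and only illustrate why allowing every SNP combination grows so much faster than keeping the combinations seen in a population.

```python
def all_combinations(num_snps):
    """Allele combinations if each biallelic SNP can vary independently."""
    return 2 ** num_snps


def observed_combinations(haplotypes):
    """Distinct allele combinations actually present in a set of haplotypes."""
    return len(set(haplotypes))


# Hypothetical haplotypes over 10 biallelic SNPs in one small region
# ('0' = reference allele, '1' = alternate allele).
haplotypes = ["0000000000", "1111100000", "0000011111", "1111100000"]

print(all_combinations(10))               # 1024
print(observed_combinations(haplotypes))  # 3
```

With 10 clustered SNPs the unrestricted count is already 1024, while the made-up population above carries only 3 distinct combinations.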
from hisat2.
This happened for me as well using the UCSC commonSnps 142 and hg38 with the HLA and decoy alternative alleles.
from hisat2.
@roryk, thanks again for the input. Solving this issue is my main focus right now. I'm also working on representing a variety of HLA sequences using HISAT2's graph approach. Hopefully, I'll come up with a systematic solution to this problem within two months.
from hisat2.
Hello, I have the same error with the current Linux binary, my own genome and SNP annotations (it's an insect, not human, if that matters, and I don't have haplotype data). I could just remove the adjacent SNPs, I suppose, but would rather not. I can use up to 1 TB of RAM, so high memory usage is not really an issue for me. If I give the program the full 1 TB, will this error be avoided?
Also I'm a bit confused by the "switch to 64-bit version" suggestion, given I downloaded the Linux x86_64 binary... And when I call the version, this is confirmed:
hisat2 --version
/homes/bin/hisat2-align-s version 2.0.1-beta
64-bit
Built on igm3
Thu Nov 19 15:53:38 EST 2015
Compiler: gcc version 4.5.4 (GCC)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
Thanks!
from hisat2.
The latest version of HISAT2 has some fixes for the case where many SNPs are concentrated in small genomic regions (e.g. hisat2_extract_snps_haplotypes_UCSC.py and hisat2_extract_snps_haplotypes_VCF.py). To prevent graph index construction from exploding, we need to either (1) use haplotype information so that we consider only those SNP combinations that are present in the population, or (2) discard some SNPs. I anticipate that both SNPs and haplotypes will be available for many species.
from hisat2.
That makes sense - I decided to build the index with a much more conservative set of SNP calls and it did work. Perhaps it would be helpful to give some guidelines for users about exactly how dense SNPs can be allowed to be, given various amounts of available memory? Thanks.
from hisat2.
For a genome the size of the human genome, memory use is typically either under 200 GB or it explodes (no matter how much memory you have). I'll try to provide a guideline, and the scripts HISAT2 provides should generally work when using common SNPs.
Even for a comprehensive set of SNPs, as long as you have haplotype information, I think indexing also works most of the time. I was able to index the human genome plus 80 million SNPs using haplotype information.
from hisat2.
Hi Daehwan, I am wondering if you could provide any directions on how to remove adjacent SNPs to avoid this "exceeded integer bounds" error. I have tried many ways to remove SNPs in SNP-dense regions, and have discarded regions where there are more than 2 SNPs in any 10 bp window, but the program is still crashing with the same error. I am building the index with only 8 million high-confidence SNPs. Any suggestions would be greatly appreciated!
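For reference, the density filter described above can be sketched in Python roughly as follows. This is only one possible reading of "more than 2 SNPs in a 10 bp window"; the function name and parameters are hypothetical, positions are assumed to come from a single chromosome, and, as noted, a filter of this kind did not by itself prevent the error.

```python
def drop_dense_snps(positions, window=10, max_snps=2):
    """Drop every SNP that falls in a window of `window` bp holding more than `max_snps` SNPs."""
    positions = sorted(positions)
    dense = [False] * len(positions)
    left = 0
    for right, pos in enumerate(positions):
        # Advance the left edge until the window spans fewer than `window` bases.
        while pos - positions[left] >= window:
            left += 1
        # Too many SNPs in this window: flag all of them for removal.
        if right - left + 1 > max_snps:
            for i in range(left, right + 1):
                dense[i] = True
    return [p for p, is_dense in zip(positions, dense) if not is_dense]


# Hypothetical SNP positions on one chromosome.
print(drop_dense_snps([100, 103, 105, 200, 260, 265, 500]))
# [200, 260, 265, 500]
```

The three SNPs at 100, 103 and 105 share one 10 bp window and are all removed, while the pair at 260 and 265 stays because it does not exceed the limit.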
from hisat2.
It may be possible to use tools in vcflib to convert the clusters of SNPs to haplotype alleles, then remove low-frequency haplotypes. The specific tool is vcfgeno2haplo, and you give it -w to specify a window size. You'd then need to use vcffixup to re-count the allele frequencies and then filter after that.
We have the same problem in vg, and we've resolved it by graph pruning (as implemented in vg prune). Recently we have integrated @jltsiren's GBWT haplotype index and he has provided a mechanism to only keep real haplotypes in the SNP clusters that have been pruned. It sounds like you have a similar mechanism in hisat2. Can you take GFA as input? If so then you could use the construction and pruning tools in vg to mitigate this problem.
from hisat2.