medvedevgroup / vargeno


Towards fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics.

Home Page: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty641/5056043

License: MIT License

Makefile 2.22% C++ 89.95% C 1.89% CMake 1.40% R 1.71% TeX 2.45% Shell 0.27% GDB 0.06% Batchfile 0.02% Python 0.03%

algorithms bioinformatics computational-biology data-structures genotyping snps

vargeno's Issues

Segmentation fault during geno step

Hi,
when I try to run vargeno on the same data linked in my previous issue (#2), it crashes during the geno step.

This is the output of vargeno index:

[BloomFilter constructBfFromGenomeseq] bit vector: 1130814221/9600000000
[BloomFilter constructBfFromGenomeseq] lite bit vector: 2131757218/18400000000
[BloomFilter constructBfFromVCF] bit vector: 68265608/1120000000
SNP Dictionary
Total k-mers:        2593345952
Unambig k-mers:      2367171409
Ambig unique k-mers: 37905369
Ambig total k-mers:  226174543
Ref Dictionary
Total k-mers:        2858648351
Unambig k-mers:      2488558606
Ambig unique k-mers: 61723937
Ambig total k-mers:  370089745

and these are the files produced during the index step:

4.0K    vargeno.RMNISTHS_30xdownsample.index.chrlens
1.2G    vargeno.RMNISTHS_30xdownsample.index.ref.bf
2.2G    vargeno.RMNISTHS_30xdownsample.index.ref.bf.lite.bf
34G     vargeno.RMNISTHS_30xdownsample.index.ref.dict
134M    vargeno.RMNISTHS_30xdownsample.index.snp.bf
39G     vargeno.RMNISTHS_30xdownsample.index.snp.dict

When running the geno step, vargeno prints "Processing..." and crashes shortly thereafter:

Initializing...
Processing...
Segmentation fault (core dumped)

\time reports that the process is terminated by signal 11, but I'm not sure where this happens. At first I thought it was due to RAM exhaustion (the machine used to test the tool has 256GB of RAM), but the same behaviour occurs on a cluster with 1TB of RAM.

I also tried running vargeno on a smaller set of variants (I halved the input VCF), and in that case it completes the analysis.

The complete VCF contains 84739838 variants and the sample consists of 696168435 reads. The whole (unzipped) data accounts for ~240GB of disk space. If you want to reproduce this behaviour on your machine, I can share the data with you.

Luca

pre-compiled binary

Would it be possible to provide pre-compiled binaries for the releases?

Test example produces an empty VCF file

Running the test example from the documentation produces an empty VCF file:

vargeno index chr22.fa snp.vcf test_prefix
vargeno geno test_prefix reads.fq snp.vcf genotyped.vcf

The resulting genotyped.vcf file does not contain any variants:

cat genotyped.vcf | tail -3

produces:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	DONOR

Empty .dict files

I've been trying to run vargeno on non-human data and am running into problems at the indexing stage. No error is reported during the process, but both .dict files are empty, so the genotyping step fails.

I'm working with a fragmentary reference assembly of a grasshopper genome, so neither the bioinformatic nor the biological properties of the data are what vargeno was designed for.

Do you have any tips for troubleshooting? Attached (here) is a sample of the .vcf input. Since my data is not human data and I'm obviously not working with dbSNP it's a little unclear how to properly format this file. Variants were detected with freebayes in the first instance.

Here is the terminal output:

$ vargeno index packardii.sub.fa snp.vcf test
[BloomFilter constructBfFromGenomeseq] bit vector: 755356701/9600000000
[BloomFilter constructBfFromGenomeseq] lite bit vector: 988176227/18400000000
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000
SNP Dictionary
Total k-mers:        21626752
Unambig k-mers:      20575340
Ambig unique k-mers: 296062
Ambig total k-mers:  1051412
Ref Dictionary
Total k-mers:        1305711431
Unambig k-mers:      1130124620
Ambig unique k-mers: 36489256
Ambig total k-mers:  175586811

And here are the output files:

-rw-r--r--  1 oliver users   12348187 Feb  5 11:42 test.chrlens
-rw-r--r--  1 oliver users 1200000008 Feb  5 10:43 test.ref.bf
-rw-r--r--  1 oliver users 2300000008 Feb  5 10:43 test.ref.bf.lite.bf
-rw-r--r--  1 oliver users          0 Feb  5 14:47 test.ref.dict
-rw-r--r--  1 oliver users  140000008 Feb  5 11:41 test.snp.bf
-rw-r--r--  1 oliver users          0 Feb  5 11:42 test.snp.dict

All of the test files (in /vargeno/test) run fine and reproduce the provided output files. I'm running on Ubuntu 18.04.5 in a conda environment with the following packages:

# packages in environment at /home/oliver/miniconda2/envs/vargeno:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bioawk                    1.0                  hed695b0_5    bioconda
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
seqtk                     1.3                  hed695b0_2    bioconda
vargeno                   1.0.3                hc9558a2_1    bioconda
zlib                      1.2.11            h516909a_1010    conda-forge

Indexing problem

I'm facing the same problem reported in #7.
I am using the test dataset, but the index command generates empty .dict files. I'm running vargeno on Ubuntu 20.04.2 in a conda environment, and the terminal output shows that the bit vector of the Bloom filter built from the VCF is empty:

[BloomFilter constructBfFromGenomeseq] bit vector: 27926073/9600000000
[BloomFilter constructBfFromGenomeseq] lite bit vector: 29747190/18400000000
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000
SNP Dictionary
Total k-mers:        2816
Unambig k-mers:      2816
Ambig unique k-mers: 0
Ambig total k-mers:  0
Ref Dictionary
Total k-mers:        34894128
Unambig k-mers:      31402166
Ambig unique k-mers: 926632
Ambig total k-mers:  3491962
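
For anyone who wants to catch this condition automatically before running the geno step, here is a minimal sketch that scans one line of the index log and reports whether the VCF Bloom filter came out empty. It assumes only the log format shown above ("bit vector: SET/TOTAL"); the struct and function names are hypothetical and are not part of vargeno.

```cpp
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical helper types, not part of vargeno.
// Counts taken from one "... bit vector: SET/TOTAL" log line.
struct BitVectorFill {
    std::uint64_t set_bits;
    std::uint64_t total_bits;
};

// Parse the "SET/TOTAL" pair that follows "bit vector: " in a log line.
// Returns false if the line does not have that shape.
bool parse_bit_vector_line(const std::string& line, BitVectorFill& fill) {
    const std::string marker = "bit vector: ";
    const std::size_t start = line.find(marker);
    if (start == std::string::npos) return false;
    const std::size_t num_begin = start + marker.size();
    const std::size_t slash = line.find('/', num_begin);
    if (slash == std::string::npos) return false;
    try {
        // stoull stops at the first non-digit, so trailing spaces are harmless.
        fill.set_bits = std::stoull(line.substr(num_begin, slash - num_begin));
        fill.total_bits = std::stoull(line.substr(slash + 1));
    } catch (const std::exception&) {
        return false;
    }
    return true;
}

// True only for a constructBfFromVCF line whose SET count is zero.
bool vcf_filter_is_empty(const std::string& line) {
    if (line.find("constructBfFromVCF") == std::string::npos) return false;
    BitVectorFill fill{};
    return parse_bit_vector_line(line, fill) && fill.set_bits == 0;
}
```

Feeding each line of the index output to vcf_filter_is_empty would flag the "0/1120000000" line above, which is the symptom shared by this issue and #7.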

Output VCF file is empty

Hello,
I tried to run vargeno (latest commit, 00ee0f0) on the following data: reference, VCFs (provided by 1000genomes), and a sample sequenced from the NA12878 individual.

I ran some tests on different chromosomes, but I always got an output VCF file that contains only the header. I tried to figure out the reason for this, and these are my hypotheses.

I think the main problem is in the index step. In all of my tests, I obtain

...
[BloomFilter constructBfFromVCF] bit vector: 0/1120000000
...

From here, I started digging into the code and saw that there could be some problems in how you index the input reference. In more detail:

When you read the VCF file and extract the chromosome name from each line, you add "chr" at its start:

vargeno/src/generate_bf.cc

Line 206 in 00ee0f0

if(chr_name[0] != 'c') chr_name = "chr" + chr_name;

but you don't do the same when you read the headers from the input FASTA file:

vargeno/src/generate_bf.cc

Line 47 in 00ee0f0

id = line.substr(1);

For instance, this is a problem if the headers in the FASTA file contain only the chromosome number. I think this problem also affects the way you store information in the .chrlens file of your index:

vargeno/src/qv.cc

Line 2345 in 00ee0f0

fprintf(chrlens, "%s %lu\n", ref.seqs[i].name, ref.seqs[i].size);

When a header in the FASTA file contains the unique identifier of the sequence along with additional information (such as ">22 dna:chromosome chromosome:GRCh37:22:1:51304566:1"), you treat the whole line as the unique identifier:

vargeno/src/generate_bf.cc

Line 47 in 00ee0f0

id = line.substr(1);

But when you parse the VCF file, you take only the unique identifier as the chromosome name, since you read it from the first column of the VCF. (This should not affect the .chrlens file, since you use a different function to read the reference FASTA when you write the .chrlens file.)
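
To illustrate the two mismatches described above, here is a minimal sketch of one way to make the FASTA and VCF names comparable. It is not the code in generate_bf.cc and not necessarily what the pull request does: normalize_seq_name is a hypothetical helper, and stripping the "chr" prefix on both sides (rather than adding it on one side) is my own assumption about a reasonable convention.

```cpp
#include <string>

// Hypothetical helper (not vargeno's code): reduce a FASTA header line or a
// VCF CHROM field to a sequence name that can be compared across both files.
std::string normalize_seq_name(const std::string& raw) {
    // Drop the leading '>' of a FASTA header, if present.
    const std::size_t begin = (!raw.empty() && raw[0] == '>') ? 1 : 0;
    // Keep only the identifier: everything up to the first whitespace.
    const std::size_t end = raw.find_first_of(" \t\r\n", begin);
    std::string name = (end == std::string::npos)
                           ? raw.substr(begin)
                           : raw.substr(begin, end - begin);
    // Strip an existing "chr" prefix so that "chr22" and "22" compare equal.
    if (name.compare(0, 3, "chr") == 0) name.erase(0, 3);
    return name;
}
```

With this, ">22 dna:chromosome chromosome:GRCh37:22:1:51304566:1", "chr22" and "22" all map to "22", so a lookup keyed on the normalized name would match regardless of which naming style the FASTA and the VCF use.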

I tried to solve these two problems by changing some lines in your code, but I'm not sure whether what I've done is correct and sufficient (I'll open a pull request anyway: it could be a good starting point for you). With my fixes, the output is no longer empty.

Moreover, I think this behaviour also occurs if the input VCF declares the GT field in the header and contains the GT columns (as in the VCFs provided by the 1000genomes project). If I run vargeno index, I obtain:

...
SNP Dictionary
Total k-mers:        0
Unambig k-mers:      0
Ambig unique k-mers: 0
Ambig total k-mers:  0
...

For now, I worked around this problem by removing the GT line from the header and the ~2500 sample columns from the VCFs. Maybe you could find a better solution, or update the README accordingly.
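
The workaround can be sketched as a per-line filter. This is only an illustration under stated assumptions: strip_sample_columns is a hypothetical name, dropping every "##FORMAT=" header line (not just the GT one) is my simplification, and keeping exactly the first eight columns (CHROM through INFO) is an assumption about what the index step accepts, not something the vargeno documentation confirms.

```cpp
#include <string>

// Hypothetical helper (not part of vargeno). Returns false for header lines
// that should be dropped ("##FORMAT=..."); otherwise writes the line, cut
// down to its first eight tab-separated columns, into `out`.
bool strip_sample_columns(const std::string& line, std::string& out) {
    // Drop FORMAT declarations such as the GT line.
    if (line.compare(0, 9, "##FORMAT=") == 0) return false;
    // Other meta-information lines pass through unchanged.
    if (line.compare(0, 2, "##") == 0) {
        out = line;
        return true;
    }
    // For the #CHROM line and data lines, locate the eighth tab.
    std::size_t pos = 0;
    for (int col = 0; col < 8; ++col) {
        pos = line.find('\t', pos);
        if (pos == std::string::npos) {
            out = line;  // eight columns or fewer: nothing to strip
            return true;
        }
        ++pos;
    }
    out = line.substr(0, pos - 1);  // everything before the eighth tab
    return true;
}
```

Applied to each line of a 1000genomes VCF, this keeps the site-level columns and discards FORMAT plus all per-sample genotype columns.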

Thanks in advance!

Best,
Luca
