Comments (6)
Dear Jordi,
Thanks for this issue. I'm sorry I'm seeing it only now.
I'll have a deeper look in a few days, as soon I find a few minutes.
Best,
Pierre
from discosnp.
Dear Jordi,
I tried to reproduce the problem you mention with a toy example. With no success.
I created a file with two sequences identical but with 3 mutations, two closes and one isolated. These are lower case characters in the second sequence.
sequences.fa
:
>tibo
TGTATCCGCATTTGATGCTACCGTGGATGAGTCAGCGTCGAGCACGCGGCACTTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCCTCGCTGTACTAATAGTTGTCGACAGATCGTCAAGATTAGAAAACGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATAGCCGTGTCTTCACTGTGCTGCGGC
>tibo2
TGTATCCGCATTTGATGCTACCGTGGATGAGTgAGCGaCGAGCACGCGGCACTTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCCTCGCTGTACTAATAGTTGTCGACAGATCGTCAAGATTAGAAAACGGTACCAGCATTTTCGGAGGTTCgCTAACTAGTATGGATAGCCGTGTCTTCACTGTGCTGCGGC
I also used "tibo" as the reference genome:
ref.fa
>tibo
TGTATCCGCATTTGATGCTACCGTGGATGAGTCAGCGTCGAGCACGCGGCACTTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCCTCGCTGTACTAATAGTTGTCGACAGATCGTCAAGATTAGAAAACGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATAGCCGTGTCTTCACTGTGCTGCGGC
Mutations are located at positions 33 38 and 160.
I then ran the following commands with disco commit version "1be346567efc4db5cdda6c19d1d1fb07a3310385" (actual version).
ls sequences.fa > fof
run_discoSnp++.sh -r fof -c 1 -G ref.f
The resulting vcf is
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1
tibo 33 1_2 C G . PASS TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=-1 GT:DP:PL:AD:HQ 0|1:2:21,7,21:1,1:0,0
tibo 38 1_1 T A . PASS TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=-1 GT:DP:PL:AD:HQ 0|1:2:21,7,21:1,1:0,0
tibo 160 2 T G . PASS Ty=SNP;Rk=0;UL=91;UR=10;CL=.;CR=.;Genome=T;Sd=1 GT:DP:PL:AD:HQ 0/1:2:21,7,21:1,1:0,0
We retrieve positions 33, 38, and 160 with no difference between close SNPs and isolated SNPs.
Could you please check if this problem persists with the current version?
Feel free to re-open this issue if this is not the case.
I hope this helps.
Pierre
from discosnp.
Hello Pierre,
Thanks for your answer.
Indeed it works when using a reference sequence, but in my analysis I don't have one so I tried your data without it with:
run_discoSnp++.sh -c 1 -r fof
And then the problem appears again. (I used the latest github version)
The unitigs found by discosnp in the discoRes_k_31_c_1_D_100_P_3_b_0_coherent.fa file are:
>SNP_higher_path_2|P_1:30_T/G|high|nb_pol_1|left_unitig_length_91|right_unitig_length_10|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
cgagcacgcggcacttattgcatgagtagggttgactaagagccgttagatgcctcgctgtactaatagttgtcgacagatcgtcaagattAGAAAACGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATAGCCGTGTCTTCACTgtgctgcggc
>SNP_lower_path_2|P_1:30_T/G|high|nb_pol_1|left_unitig_length_91|right_unitig_length_10|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
cgagcacgcggcacttattgcatgagtagggttgactaagagccgttagatgcctcgctgtactaatagttgtcgacagatcgtcaagattAGAAAACGGTACCAGCATTTTCGGAGGTTCGCTAACTAGTATGGATAGCCGTGTCTTCACTgtgctgcggc
>SNP_higher_path_1|P_1:30_A/T,P_2:35_G/C|high|nb_pol_2|left_unitig_length_91|right_unitig_length_2|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
gaacctccgaaaatgctggtaccgttttctaatcttgacgatctgtcgacaactattagtacagcgaggcatctaacggctcttagtcaacCCTACTCATGCAATAAGTGCCGCGTGCTCGACGCTGACTCATCCACGGTAGCATCAAATGCGGATAca
>SNP_lower_path_1|P_1:30_A/T,P_2:35_G/C|high|nb_pol_2|left_unitig_length_91|right_unitig_length_2|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
gaacctccgaaaatgctggtaccgttttctaatcttgacgatctgtcgacaactattagtacagcgaggcatctaacggctcttagtcaacCCTACTCATGCAATAAGTGCCGCGTGCTCGTCGCTCACTCATCCACGGTAGCATCAAATGCGGATAca
And the vcf discoRes_k_31_c_1_D_100_P_3_b_0_coherent.vcf is giving 3 snps:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1
SNP_lower_path_2 121 2 G T . . Ty=SNP;Rk=0;UL=91;UR=10;CL=.;CR=.;Genome=.;Sd=. GT:DP:PL:AD:HQ 0/1:2:21,7,21:1,1:0,0
SNP_higher_path_1 122 1_1 A T . . TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=. GT:DP:PL:AD:HQ 0|1:2:21,7,21:1,1:0,0
SNP_higher_path_1 127 1_2 G C . . TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=. GT:DP:PL:AD:HQ 0|1:2:21,7,21:1,1:0,0
Identically as I observed before with my dataset, the isolated SNP position is incorrect by one base (SNP_lower_path_2 121). If you look at the sequence of the SNP_lower_path_2 unitig, the nucleotide in position 121 is a C and not a G, the G being positioned at position 122 (or 121 in a 0-based system). The positions of the 2 others SNPs are correct (in a 1-based system).
From what I have seen in the code of VCF_creator.py I think you may need to add a +1 to the mappingPositionCouple value somewhere (maybe in the WhichPathIsTheRef() function of the SNP class (and maybe to the INDEL one too)). I have the impression that the calculation are correct for the SNPCLOSE class because it don't use this value to determine the position and only add the corrected position and the unitig length, but I didn't had the time to look further.
Tell me if you can reproduce the problem or if you want me to run some tests and maybe propose a pull request.
(And for the issue I do no have the right to reopen it myself so I will leave it to you if you found it appropriate)
Best regards,
Jordi
from discosnp.
Thanks again Jordi for your detailed and precious feedback.
I found and corrected the bug for close SNPs. Everything should now be zero-based. You may pull and retry (no need to recompile).
I let the issue open as the bug also exists for INDELS. I have to explore this code.
(When I have a few days free I'll recode from scratch VCF_creator.)
from discosnp.
The bug is also fixed for unmapped indels
from discosnp.
Thanks a lot Pierre for fixing this.
I see that you have made everything 0-based, but the vcf convention state that vcf are 1-based. To avoid confusion with software working on vcf files do you think you could produce 1-based vcf instead? I think this could prevent many mistakes in the future.
Jordi
from discosnp.
Related Issues (20)
- SNP ID in VCF file with reference genome HOT 3
- problem running disco snp HOT 8
- compatibility with HiFi reads HOT 3
- Using Disco SNP to find sex markers; calling on a cohort with males and females HOT 1
- Requires python3 but calls python2 HOT 1
- DiscoSnp VCF record; Ref allele does not match Assembly Used HOT 8
- Type of Variant (Ty) format error in .vcf HOT 1
- Options for .vcf filtering HOT 1
- Problem running discoSnpRad HOT 4
- Call to python 2 in run_discoSNP++.sh HOT 2
- EXCEPTION: cannot open ./trashme_3684_dsk_partitions_gatb/parts.360 HOT 4
- Problem with VCF creation: an error occurred in determining the filter of close snps (an unmapped SNP is "PASS") HOT 3
- Multiple reference alleles output HOT 3
- error: there was a problem with readFileName Dumping HOT 3
- error: there was a problem with readFileName Dumping HOT 2
- Trouble filtering with vcftools HOT 2
- Equivalent information between discosnp and STACKS2 HOT 8
- Unable to open VCF file in IGV - SAMPLE metainformation line HOT 3
- Nucleotide diversity from DiscoSnp HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from discosnp.