Coder Social home page Coder Social logo

Comments (6)

pierrepeterlongo avatar pierrepeterlongo commented on July 19, 2024

Dear Jordi,

Thanks for this issue. I'm sorry I'm seeing it only now.
I'll have a deeper look in a few days, as soon I find a few minutes.

Best,
Pierre

from discosnp.

pierrepeterlongo avatar pierrepeterlongo commented on July 19, 2024

Dear Jordi,

I tried to reproduce the problem you mention with a toy example. With no success.

I created a file with two sequences identical but with 3 mutations, two closes and one isolated. These are lower case characters in the second sequence.
sequences.fa:

>tibo
TGTATCCGCATTTGATGCTACCGTGGATGAGTCAGCGTCGAGCACGCGGCACTTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCCTCGCTGTACTAATAGTTGTCGACAGATCGTCAAGATTAGAAAACGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATAGCCGTGTCTTCACTGTGCTGCGGC
>tibo2
TGTATCCGCATTTGATGCTACCGTGGATGAGTgAGCGaCGAGCACGCGGCACTTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCCTCGCTGTACTAATAGTTGTCGACAGATCGTCAAGATTAGAAAACGGTACCAGCATTTTCGGAGGTTCgCTAACTAGTATGGATAGCCGTGTCTTCACTGTGCTGCGGC

I also used "tibo" as the reference genome:
ref.fa

>tibo
TGTATCCGCATTTGATGCTACCGTGGATGAGTCAGCGTCGAGCACGCGGCACTTATTGCATGAGTAGGGTTGACTAAGAGCCGTTAGATGCCTCGCTGTACTAATAGTTGTCGACAGATCGTCAAGATTAGAAAACGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATAGCCGTGTCTTCACTGTGCTGCGGC

Mutations are located at positions 33 38 and 160.
I then ran the following commands with disco commit version "1be346567efc4db5cdda6c19d1d1fb07a3310385" (actual version).

ls sequences.fa > fof
run_discoSnp++.sh -r fof -c 1 -G ref.f

The resulting vcf is

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	G1
tibo	33	1_2	C	G	.	PASS	TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=-1	GT:DP:PL:AD:HQ	0|1:2:21,7,21:1,1:0,0
tibo	38	1_1	T	A	.	PASS	TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=-1	GT:DP:PL:AD:HQ	0|1:2:21,7,21:1,1:0,0
tibo	160	2	T	G	.	PASS	Ty=SNP;Rk=0;UL=91;UR=10;CL=.;CR=.;Genome=T;Sd=1	GT:DP:PL:AD:HQ	0/1:2:21,7,21:1,1:0,0

We retrieve positions 33, 38, and 160 with no difference between close SNPs and isolated SNPs.
Could you please check if this problem persists with the current version?
Feel free to re-open this issue if this is not the case.

I hope this helps.

Pierre

from discosnp.

jdalino avatar jdalino commented on July 19, 2024

Hello Pierre,

Thanks for your answer.

Indeed it works when using a reference sequence, but in my analysis I don't have one so I tried your data without it with:
run_discoSnp++.sh -c 1 -r fof

And then the problem appears again. (I used the latest github version)

The unitigs found by discosnp in the discoRes_k_31_c_1_D_100_P_3_b_0_coherent.fa file are:

>SNP_higher_path_2|P_1:30_T/G|high|nb_pol_1|left_unitig_length_91|right_unitig_length_10|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
cgagcacgcggcacttattgcatgagtagggttgactaagagccgttagatgcctcgctgtactaatagttgtcgacagatcgtcaagattAGAAAACGGTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATAGCCGTGTCTTCACTgtgctgcggc

>SNP_lower_path_2|P_1:30_T/G|high|nb_pol_1|left_unitig_length_91|right_unitig_length_10|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
cgagcacgcggcacttattgcatgagtagggttgactaagagccgttagatgcctcgctgtactaatagttgtcgacagatcgtcaagattAGAAAACGGTACCAGCATTTTCGGAGGTTCGCTAACTAGTATGGATAGCCGTGTCTTCACTgtgctgcggc

>SNP_higher_path_1|P_1:30_A/T,P_2:35_G/C|high|nb_pol_2|left_unitig_length_91|right_unitig_length_2|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
gaacctccgaaaatgctggtaccgttttctaatcttgacgatctgtcgacaactattagtacagcgaggcatctaacggctcttagtcaacCCTACTCATGCAATAAGTGCCGCGTGCTCGACGCTGACTCATCCACGGTAGCATCAAATGCGGATAca

>SNP_lower_path_1|P_1:30_A/T,P_2:35_G/C|high|nb_pol_2|left_unitig_length_91|right_unitig_length_2|C1_1|Q1_0|G1_0/1:21,7,21|rank_0
gaacctccgaaaatgctggtaccgttttctaatcttgacgatctgtcgacaactattagtacagcgaggcatctaacggctcttagtcaacCCTACTCATGCAATAAGTGCCGCGTGCTCGTCGCTCACTCATCCACGGTAGCATCAAATGCGGATAca

And the vcf discoRes_k_31_c_1_D_100_P_3_b_0_coherent.vcf is giving 3 snps:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  G1
SNP_lower_path_2        121     2       G       T       .       .       Ty=SNP;Rk=0;UL=91;UR=10;CL=.;CR=.;Genome=.;Sd=. GT:DP:PL:AD:HQ  0/1:2:21,7,21:1,1:0,0
SNP_higher_path_1       122     1_1     A       T       .       .       TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=.   GT:DP:PL:AD:HQ  0|1:2:21,7,21:1,1:0,0
SNP_higher_path_1       127     1_2     G       C       .       .       TySNP;Rk=0;UL=91;UR=2;CL=.;CR=.;Genome=.;Sd=.   GT:DP:PL:AD:HQ  0|1:2:21,7,21:1,1:0,0

Identically as I observed before with my dataset, the isolated SNP position is incorrect by one base (SNP_lower_path_2 121). If you look at the sequence of the SNP_lower_path_2 unitig, the nucleotide in position 121 is a C and not a G, the G being positioned at position 122 (or 121 in a 0-based system). The positions of the 2 others SNPs are correct (in a 1-based system).

From what I have seen in the code of VCF_creator.py I think you may need to add a +1 to the mappingPositionCouple value somewhere (maybe in the WhichPathIsTheRef() function of the SNP class (and maybe to the INDEL one too)). I have the impression that the calculation are correct for the SNPCLOSE class because it don't use this value to determine the position and only add the corrected position and the unitig length, but I didn't had the time to look further.

Tell me if you can reproduce the problem or if you want me to run some tests and maybe propose a pull request.

(And for the issue I do no have the right to reopen it myself so I will leave it to you if you found it appropriate)

Best regards,
Jordi

from discosnp.

pierrepeterlongo avatar pierrepeterlongo commented on July 19, 2024

Thanks again Jordi for your detailed and precious feedback.

I found and corrected the bug for close SNPs. Everything should now be zero-based. You may pull and retry (no need to recompile).

I let the issue open as the bug also exists for INDELS. I have to explore this code.
(When I have a few days free I'll recode from scratch VCF_creator.)

from discosnp.

pierrepeterlongo avatar pierrepeterlongo commented on July 19, 2024

The bug is also fixed for unmapped indels

from discosnp.

jdalino avatar jdalino commented on July 19, 2024

Thanks a lot Pierre for fixing this.

I see that you have made everything 0-based, but the vcf convention state that vcf are 1-based. To avoid confusion with software working on vcf files do you think you could produce 1-based vcf instead? I think this could prevent many mistakes in the future.

Jordi

from discosnp.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.