Hello,
Thanks for this great tool I just have a concern.
I ran DiscoSNP using the following command:
Running discoSnp++ 2.3.X, in directory /home/robert/tools/DiscoSnp with following parameters:
read_sets=mahi_file.list
prefix=discoRes_k_31_c_3
c=3
C=2147483647
k=31
b=0
d=1
D=40
s=
P=3
p=discoRes
G=/home/robert/assembly/MAHI_genome.fasta
e=
starting date=Thu 28 Oct 2021 09:15:24 AM ADT
I also generated a second VCF by mapping the fasta to the same reference as follows:
ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 10 -v 3 -R @rg\tID:F_003_\tLB:L001\tSM:F_003_\tPL:ILLUMINA /home/robert/assembly/MAHI_genome.fasta.gz /data/MAHI/F_003_i/F-003-i_S82_L002_R1_001.fastq.gz /data/MAHI/F_003/F-003-_S82_L002_R2_001.fastq.gz
The only difference was that for the VCF generated by read mapping the assembly was bgzipped.
I then wanted to do some concordance between the VCFs using the VCF generated by read mapping as TRUTH and the DiscoSnp VCF as the CALL set.
AT this point I noticed that for some records the ref alleles do not match. For example, here's a TRUTH VCF record:
ptg000002l 799357 . C A 4761.33 PASS AB=0.548571;ABP=6.59633;AC=28;AF=0.33;AN=86;AO=176;CIGAR=1X;DP=511;DPB=511;DPRA=0.994422;EPP=5.42853;EPPR=44.2023;GTI=2;LEN=1;MEANALT=1.08;MQM=60;MQMR=60;NS=49;NUMALT=1;ODDS=0.545245;PAIRED=0.988636;PAIREDR=0.99696;PAO=0;PQA=0;PQR=0;PRO=0;QA=6338;QR=11961;RO=329;RPL=81;RPP=5.42853;RPPR=3.54492;RPR=95;RUN=1;SAF=106;SAP=19.0002;SAR=70;SRF=192;SRP=22.976;SRR=137;TYPE=snp;technology.ILLUMINA=1 GT:AD:AO:DP:FT:GQ:PL:QA:QR:RO ./.:1,0:0:1:DP4:11:0,3,37:0:37:1 1/1:0,6:6:6:PASS:19:203,18,0:222:0:0 0/1:7,6:6:13:PASS:99:164,0,198:222:259:7 0/1:2,7:7:9:PASS:54:210,0,43:259:74:2 0/0:15,0:0:15:PASS:53:0,45,481:0:531:15 0/0:12,0:0:12:PASS:44:0,36,403:0:444:12 0/1:3,4:4:7:PASS:90:116,0,82:148:111:3 0/0:8,0:0:8:PASS:32:0,24,270:0:296:8 0/0:10,0:0:10:PASS:38:0,30,313:0:344:10 1/1:0,9:9:9:PASS:28:292,27,0:321:0:0 0/1:4,8:8:13:PASS:99:232,0,99:296:148:4 0/0:11,0:0:11:PASS:41:0,33,359:0:395:11 1/1:0,11:11:11:PASS:34:370,33,0:407:0:0 0/0:7,0:0:7:PASS:29:0,21,226:0:247:7 0/1:6,9:9:15:PASS:99:258,0,147:333:210:6 0/0:9,0:0:9:PASS:35:0,27,303:0:333:9 0/1:4,3:3:7:PASS:82:82,0,116:111:148:4 0/0:18,0:0:18:PASS:62:0,54,603:0:666:18 0/1:5,12:12:17:PASS:99:341,0,119:432:185:5 1/1:0,21:21:21:PASS:64:657,63,0:727:0:0 0/0:9,0:0:9:PASS:35:0,27,292:0:321:9 0/0:4,0:0:4:PASS:20:0,12,137:0:148:4 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/0:15,0:0:15:PASS:53:0,45,492:0:543:15 0/0:20,0:0:20:PASS:68:0,60,669:0:740:20 0/1:5,7:7:12:PASS:99:190,0,110:247:159:5 ./.:0,3:3:3:DP4:10:104,9,0:111:0:0 0/1:10,5:5:15:PASS:99:114,0,291:173:370:10 0/1:6,5:5:11:PASS:99:137,0,170:185:222:6 0/0:16,0:0:16:PASS:56:0,48,536:0:592:16 1/1:0,7:7:7:PASS:22:213,21,0:233:0:0 0/1:10,5:5:15:PASS:99:125,0,268:185:344:10 1/1:0,11:11:11:PASS:34:346,33,0:381:0:0 0/1:8,6:6:14:PASS:99:161,0,204:222:270:8 0/0:14,0:0:14:PASS:50:0,42,470:0:518:140/1:4,9:9:13:PASS:99:264,0,98:333:148:4 1/1:0,11:11:16:PASS:0:366,33,0:407:0:0 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/1:2,5:5:7:PASS:57:149,0,49:185:74:2 0/0:9,0:0:9:PASS:35:0,27,292:0:321:9 0/0:10,0:0:10:PASS:38:0,30,337:0:370:10 0/0:9,0:0:9:PASS:35:0,27,303:0:333:9 0/0:15,0:0:15:PASS:53:0,45,492:0:543:15 0/0:5,0:0:5:PASS:23:0,15,170:0:185:5
And here's the DiscoSnp record:
ptg000002l 799357 1589465 A G . PASS Ty=SNP;Rk=0.44587;UL=58;UR=50;CL=396;CR=395;Genome=C;Sd=-1 GT:DP:PL:AD:HQ 0/0:7:5,25,144:7,0:68,0 0/0:6:5,22,124:6,0:70,0 0/0:9:5,31,184:9,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:9:5,31,184:9,0:68,00/0:10:15,24,174:9,1:70,44 ./.:0:.,.,.:0,0:0,0 0/0:15:5,49,304:15,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:10:5,34,204:10,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:3:.,.,.:3,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:15:5,49,304:15,0:67,0 0/0:22:5,70,444:22,0:68,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:7:5,25,144:7,0:68,0 ./.:5:.,.,.:5,0:67,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:6:5,22,124:6,0:70,0 ./.:5:.,.,.:5,0:70,0 0/0:12:5,40,244:12,0:67,0 0/0:7:5,25,144:7,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:10:5,34,204:10,0:70,0 0/1:16:52,20,212:12,4:70,70 ./.:0:.,.,.:0,0:0,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0
And when I look at the assembly in this region I get this:
bedtools getfasta -fi MAHI_genome.fasta -bed var.bed
ptg000002l:799355-799359
CCTC
There's no A here. The start in the bed is 0-based so the second C is the REF being reported. I can't figure out what the problem is or why DiscoSnp is reporting a REF allele that does not appear to exist.
Any ideas? Thank You. - RObert