Coder Social home page Coder Social logo

fritzsedlazeck / survivor Goto Github PK

View Code? Open in Web Editor NEW
333.0 12.0 45.0 20.6 MB

Toolset for SV simulation, comparison and filtering

License: MIT License

Makefile 2.95% C++ 96.86% R 0.19%
bioinformatics structural-variations simulator comparison survivor vcf bioconda

survivor's People

Contributors

fritzsedlazeck avatar lindenb avatar mschatz avatar wdecoster avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

survivor's Issues

row names in R script crashing

Hi
The R-script which is supposed to visualize the summary output is not working.

Error in Math.factor(t[, 1]) : 'log10' not meaningful for factors

Which is not surprising as it takes the first column for the x-axis and transforms it into log10:

t=read.table(file,header=T)
plot(log10(t[,1]),t[,2],xlab="log10(length(bp))",ylab="# of SVs",type='l',col=cols[1],main=fie)

But the format of the file has changed (at least that's my guess):

    
pbsv.summary
Len     Del     Dup     Inv     INS     TRA     UNK
0-50bp  0       0       0       0       0       0
50-100bp        1038    0       0       2875    0       0
100-1000bp      1162    0       0       2507    0       0
1000-10000bp    241     0       0       108     0       0
10000+bp        0       0       0       0       0       0

As it is now a string and no number any-more the match function reports an error

Venn vs. VCF event.

000000F_002|arrow | 3616 | TRA002SUR | N | | . | PASS | SUPP=1;SUPP_VEC=01;AVGLEN=100000;SVTYPE=TRA;SVMETHOD=SURVIVORv2;CHR2=000000F_080|arrow;END=778870;STRANDS=+- | GT:LN:DR:ST:TY:CO | ./.:0:0,0:--:NaN:NaN | 1/1:6507834:0,15:+-:TRA:000000F_002|arrow_3616-000000F_080|arrow_778870
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
But venn shows both samples have this event
[jiwu2573@login4 ngmlr-0.2.6]$ head w53w104tow64r13_new_merged.vcf_venn
IdentifierW53tow64r13_new.vcf W104tow64r13_new.vcf Length
TRA002SUR 1 1 100000

Dropping INS during vcftobed

Hey there Fritz,
In your newest release there is an issue with dropping of INS calls during the vcftobed conversion. This is related to issue #14 where you were fixing the names of the variants being lifted from VCF to the bedpe step.

Can we get an update where the INS are also represented within the bedpe output? Not sure exactly why there were lost.

Thanks,
Phil

Detail about SURVIVOR simSV

Hi fritz.

I tried to make simulated reads using SURVIVOR simSV
I created SV parameter file as bellow:

$ cat parameter_file 
PARAMETER FILE: DO JUST MODIFY THE VALUES AND KEEP THE SPACES!
DUPLICATION_minimum_length: 100
DUPLICATION_maximum_length: 10000
DUPLICATION_number: 3
INDEL_minimum_length: 20
INDEL_maximum_length: 500
INDEL_number: 1
TRANSLOCATION_minimum_length: 1000
TRANSLOCATION_maximum_length: 3000
TRANSLOCATION_number: 2
INVERSION_minimum_length: 600
INVERSION_maximum_length: 800
INVERSION_number: 4
INV_del_minimum_length: 600
INV_del_maximum_length: 800
INV_del_number: 2
INV_dup_minimum_length: 600
INV_dup_maximum_length: 800
INV_dup_number: 2

and, I run this command:

SURVIVOR simSV ~/human_reference_genome/GRCh38_p12_chr22.fa parameter_file 0.1 0 simulated

This script takes more than 2 days for generating SVs.

Can you explain how SURVIVOR simSV works and why it takes so long time?
I imaged that just modifying reference sequence doesn't take so long time.

Sorry for bothering you.
Thanks.

harazono

some request on genotyping

Hi,
This tools is quite good in merging different call sets. I have merged 336 samples into one vcf file effectively. However, there is still one problem for me. In the merged vcf file, it only shows the 0/1, 1/1 ./. genotypes without 0/0. The problem is how to determine the ./. genotypes belong to. Are they belong to 0/0 or they are definitely ./. type ? Do you have some resolutions on this?

Cheers,

SURVIVOR merge AVGLEN positive when SVTYPE=DEL

Hi Fritz,

Despite this:

convert << ";AVGLEN=";
	if(entry->type == 0){
		convert << get_avglen(entry->caller_info)*-1;
	}else if (entry->type != 3) {
		convert << get_avglen(entry->caller_info);
	} else {
		convert << "0";   // TODO think about it.
	}

I am running into occurences of merged deletions that have a positive AVGLEN.

This is with the latest 1.0.5 version.

SURVIVOR merge

Hi Fritz,

I came across a strange behavior today, when using SURVIVOR merge. This is the command I used:
SURVIVOR merge input_files 500 1 1 0 0 20 merged.vcf
Meaning that taking the SV type into account is turned on.

My input contains 2 sorted VCF files, and I know that the first one does not contain any SVTYPE=DUP, while the second does.
However, in the output I have 4 merged SVs with support vector 11 and SVTYPE=DUP. In fact from the VCF lines below it is visible that 3 INS and 1 DEL got merged with DUPs from the second set:

chr13   85730064        DUP0022605SUR   N       <DUP>   .       PASS    SUPP=2;SUPP_VEC=11;AVGLEN=902;SVTYPE=DUP;SVMETHOD=SURVIVORv2;CHR2=chr13;END=85730228;CIPOS=0,164;CIEND=0,0;STRANDS=++   GT:PSV:LN:DR:ST:TY:CO      0/1:NA:1640:0,0:+-:INS:chr13_85730228-chr13_85730228    0/1:010:164:0,0:++:DUP:chr13_85730064-chr13_85730228
chr14   88530275        DUP0025745SUR   N       <DUP>   .       PASS    SUPP=2;SUPP_VEC=11;AVGLEN=220.5;SVTYPE=DUP;SVMETHOD=SURVIVORv2;CHR2=chr14;END=88530492;CIPOS=0,217;CIEND=0,0;STRANDS=-+ GT:PSV:LN:DR:ST:TY:CO      0/1:NA:224:0,0:+-:INS:chr14_88530492-chr14_88530492     0/1:010:217:0,0:-+:DUP:chr14_88530275-chr14_88530492
chr3    195501349       DUP0060843SUR   N       <DUP>   .       PASS    SUPP=2;SUPP_VEC=11;AVGLEN=-197;SVTYPE=DUP;SVMETHOD=SURVIVORv2;CHR2=chr3;END=195501705;CIPOS=0,318;CIEND=0,0;STRANDS=-+  GT:PSV:LN:DR:ST:TY:CO      0/1:NA:38:0,0:+-:DEL:chr3_195501667-chr3_195501705      0/1:001:356:0,0:-+:DUP:chr3_195501349-chr3_195501705
chr8    51696645        DUP0085793SUR   N       <DUP>   .       PASS    SUPP=2;SUPP_VEC=11;AVGLEN=322;SVTYPE=DUP;SVMETHOD=SURVIVORv2;CHR2=chr8;END=51696870;CIPOS=0,225;CIEND=0,0;STRANDS=-+    GT:PSV:LN:DR:ST:TY:CO      1/1:NA:419:0,0:+-:INS:chr8_51696870-chr8_51696870       0/1:011:225:0,0:-+:DUP:chr8_51696645-chr8_51696870

Is there an explanation for this behavior or is it a bug?

Thanks :)

Marc

Assemblytics output to vcf

Hi,
Thank you so much for your effort to maintain this wonderful software.
I'd like to ask how to use ID for SURVIVOR option.

My goal is to convert Assemblytics output (bed) to a vcf file for SURVIVOR_ant, then use SURVIVOR_ant to annotate my structural variation. So, first I tried ./SURVIVOR bedtovcf assemblytics.output.bed DEL output.vcf to convert bed to vcf, but it changed all variant types into deletion. I tried ALL, All, all, etc., but it showed
Bed file
Type
Output vcf file
repeatedly. I would be grateful if you added other explanations how to use it properly, which part should I change, or what Type I can use, etc.
Second, I tried to use ./SURVIVOR ID, so I replaced ID to the number 11: Convert SV calls from Assemblytics: ./SURVIVOR 11 However, it didn't show any informative error message, but showed the same message with typing ./SURVIVOR. Could you explain how to use ./SURVIVOR properly?

Yours sincerely,
Thank you.
Jun.

Minimum number of supporting caller

Hi Fritz,

As a follow up on fritzsedlazeck/Sniffles#61, I saw you suggested:

Step 3:
SURVIVOR merge files 1000 1 1 0 0 0 output_merged.vcf

Step 5:
SURVIVOR merge 1000 0 1 0 0 0 output_genotyped_merged.vcf

I assume in step 5 you missed the argument specifying the input vcfs and meant:
SURVIVOR merge files_genotyped 1000 0 1 0 0 0 output_genotyped_merged.vcf

But can you explain why in step 3 you recommend 1 for "Minimum number of supporting caller" and 0 in step 5? I'd expect both of them to be "1".

Cheers,
Wouter

vcftobed use AVGLEN values in INFO column

Hello,

I used the SURVIVOR vcftobed function like so (with filenames removed)

./SURVIVOR vcftobed <input.vcf> 0 -1 <output.bed>

the output.bed file reported variants as single bases which did not account for the AVGLEN values in the INFO column of the input.vcf file. Is there a way to get vcftobed to use that length information when reporting start and end values in the output bed file?

Any help is greatly appreciated. Thank you

Alberto

Minor systematic bias on small deletions/insertions

Hi @fritzsedlazeck ,

I've noticed survivor usage in some recent publications analyzes ins/del at sizes 50 and higher -- note that the manta default is actually off by one from this value. If you haven't reconfigured it, Manta results will be 51 bases and higher, per the configuration file setting here:

https://github.com/Illumina/manta/blob/c96587d1ebaba48d17d7144f2454972489e37f3f/src/python/bin/configManta.py.ini#L23

This will have a minor impact overall, but I think it's worth noting/correcting in future if this is not already configured in the SURVIVOR pipeline.

Thanks! -c

somethings not understand

Hi,
I am using the merge option to merge my SV vcf files. However, there are two sections that I do not quite understand in using SURVIVOR merge"SURVIVOR merge sample_files 1000 2 1 1 0 30 sample_merged.vcf".
1.What does the 1000 means? You explained it as maximum allowed distance of 1kb. Does that mean the start position of the two SV sites could be 1kb distance? For example, one SV start position is chr1:100 and the other SV start position is chr1:1100 , and they can be merged as one sv site if they are the same SV type and on the same strand.
2. What does the 0 mean? You explained it as Estimate distance based on the size of SV. What distance does it estimate?

Cheers

Details about simSV parameter

SURVIVOR simSV
To simulate SV:
1: Reference fasta file
2: Parameter file
3: SNP mutations frequency (0-1)
4: 0= simulated reads; 1= real reads
5: output prefix

Hi Fritz,

Could you please tell me what is the 4th parameter? My aim is to introduce svs on given genome but I really did not understand 4th parameter. This step only creates an altered ref with given parameter file as expected but I don't understand what does 4th parameter.

Thanks,
Mehmet

options for merge

Hi,

First, SURVIVOR is really a very good tool. I used it for merging structure variations of multi samples. But we also want to define the reciprocal overlap of SVs when we merge data. I wonder is there any chance you consider add this option for SURVIVOR ?

Best,
Shuo

SURVIVOR simreads

Hi,
Thanks for the solving the problem of scanreads funciton yesterday.
However, I have the similar problem with simreads module:

/home/wyzhang/opt/biosoftware/ONT_Analysis_Tools/SURVIVOR/Debug/SURVIVOR simreads ../../1.Deletion_Simulation/mm10_Chr1_Homo_DEL.fa ./ONT_error.txt 4 mm10_Chr1_Homo_DEL_ONT

I still got the help info on the screen:
No parameters provided:
1: Reference fasta file
2: error profile file (see scan)
3: Coverage
4: output prefix

I think I have given the program all the neccessary information, but it did not work again.

Best,
Wenyu

sample names instead of file names (merge tool)

Thank you for developing such a great tool !
We use SURVIVOR to merge Manta calls across multiple samples and we love how simple it is to run.

We have a small enhancement request. Would it be possible to include the actual sample names in the output multi-sample VCF ?
Basically, we would like to have the sample name (extracted from the original VCF) to appear in the corresponding header column of the output multi-sample VCF.
We typically build a list of absolute paths to the input VCF files and we would like just the sample name, instead of the long absolute path, to be listed in the output file after merging.

Simulating NGS reads from SV introduced fasta file

Hi Fritz,

I want to compare several SV callers which use short read data including (pindel, delly and lumpy) to see their accuracy. I have used SURVIVOR to introduce artifical SVs, but I couldnt find any parameter to simulate ngs reads from simulated genome, I have only found long reads parameter.

Do I miss something or survivor only implemented to simulate long reads?

merge strategy

Hi,

I wonder the merge strategy of SURVIVOR. For different SVs, DEL, INS, DUP and others, SURVIOR merge them together or seperately?

Best,
Shuo

filter by min SV size using proportion of SV overlapping

Hello,

is it possible to filter out SV's from an input vcf file based on the proportion of overlap with regions in a bedpe file (similar to the -f option in bedtools intersect). I attempted to filter out SV's whose intervals had a 50% overlap with SV's in an exclusion bed file with the following command:

SURVIVOR filter <input.vcf> <remove.bed> 0.5 -1 0.01 10 <out.vcf>

However the out.vcf file did not have the expected output; there are SV's in out.vcf that do intersect with intervals with remove.bed that were not actually removed.

Does entering a decimal value for min SV size the way i did accomplish what I am intending to do?

SURVIVOR stalls when reference FASTA lacks newlines

Hello,
I've been trying to run LRSIM to generate some simulated human reads, and I've noticed that the run is stalling at the initial SURVIVOR stage. (Ie. stuck there for > 12 hours).
After doing some tests, I suspect that it is due to my reference FASTA file lacking newlines for a given sequence.

Ex. for e. coli tests:
SURVIVOR 0 ./ecoli.fa ./test1.hap.parameter 0 ./test1.hap 1000
completes in under a second (fasta has 80 residues per line)

while
SURVIVOR 0 ./ecoli_noNewlines.fa ./test1.hap.parameter 0 ./test1.hap 1000
stalls, and I aborted it after ~4 minutes (fasta has all residues for a sequence on one line).

I'm just wondering if that is a known formatting requirement for running SURVIVOR?

Thanks!
Lauren

SURVIVOR merge produces unsorted VCFs

The output of SURVIVOR merge is not always sorted by chrom,POS when the input files are.

For a test see https://gist.github.com/amwenger/f3cfcfa2e144a7bea7623e520cb982fa.

$ SURVIVOR 2>&1 | head -n2
Program: SURVIVOR (Tools for Structural Variations in the VCF format)
Version: 1.0.5

$ SURVIVOR merge sort-error.fofn 500 1 1 0 0 30 merged.vcf

# input file is sorted
$ grep -vE "^#" sort-error.vcf | cut -f1-3
1       811404  var1
1       811404  var2
1       811680  var3
1       811680  var4

# output file is not sorted
$ grep -vE "^#" merged.vcf | cut -f1-3
1       811680  var3;var4
1       811404  var1;var2

SURVIVOR merge vcf files

I tried to merge a couple of vcf files with the following command:
SURVIVOR merge sample_files 100 1 0 0 0 30 survivor_merge.vcf

In the output vcf file there are those lines for example:

22	18502316	INV00936SUR	N	<INV>	.	PASS	SUPP=8;SUPP_VEC=00001111111100;AVGLEN=690.1;SVTYPE=INV;SVMETHOD=SURVIVORv2;CHR2=22;END=18503011;CIPOS=0,6;CIEND=-5,0;STRANDS=++	GT:PSV:LN:DR:ST:TY:CO	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	0/0:NA:688:0,7:++:INV:22_18502319-22_18503007	0/0:NA:689:0,3:++:INV:22_18502319-22_18503008	0/1:NA:694:0,0:++:INV,INV:22_18502317-22_18503011,22_18502316-22_18503009	0/1:NA:694:0,0:++:INV,INV:22_18502317-22_18503011,22_18502316-22_18503009	0/1:NA:695:0,0:++:INV:22_18502316-22_18503011	0/1:NA:695:0,0:++:INV:22_18502316-22_18503011	0/1:NA:686:0,0:++:INV,INV:22_18502319-22_18503005,22_18502322-22_18503005	0/1:NA:686:0,0:++:INV,INV:22_18502322-22_18503004,22_18502319-22_18503005	0/1:NA:687:0,20:++:INV:22_18502319-22_18503006	0/1:NA:687:0,14:++:INV:22_18502319-22_18503006	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN
22	18502316	INV00938SUR	N	<INV>	.	PASS	SUPP=4;SUPP_VEC=00000000100111;AVGLEN=641.25;SVTYPE=INV;SVMETHOD=SURVIVORv2;CHR2=22;END=18503009;CIPOS=0,75;CIEND=0,0;STRANDS=++	GT:PSV:LN:DR:ST:TY:CO	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	0/1:NA:559:0,0:++:INV:22_18502391-22_18502950	./.:NaN:0:0,0:--:NaN:NaN	./.:NaN:0:0,0:--:NaN:NaN	1/1:NA:620:0,2:--:INV:22_18502387-22_18503007	1/1:NA:693:0,0:++:INV:22_18502316-22_18503009	1/1:NA:693:0,0:++:INV:22_18502316-22_18503009```

Can you explain why those SV's are not merged into one? 

misleading short program description

At first, I skipped past this program because I thought it was just another SV detection tool for short reads. I heard from somewhere else that this tool does some format interchange and, going through the README now, became interested.

I strongly suggest that you change the repository description and the top of the README to better represent what this program can do.

Duplication max length does not work

Hi Fritz,

I used SURVIVOR simsv function with parameter file to introduce SVs in given genome.

For deletions and inversions I did not have any problem based on length. I introduced 1000 svs including indels duplications and inversions. Length of duplications are always greater than DUPLICATION_maximum_length. Could you please tell me how to solve this?

Best,
Mehmet

merge vcf SVTYPE

Hi,
I'm merging two VCF files from NanoSV using
SURVIVOR merge input_files 1000 1 0 0 0 10 TUMOR_NORMAL_merged.vcf

I found the merged VCF files with SVTYPE:INV added, although the SVTYPE is BND in original VCF files. Please see the following example:

TUMOR.vcf
chr1 | 179947 | 36 | T | T[chr1:181453[ | 45 | GAP;MapQual | SVTYPE=BND;SVMETHOD=NanoSV_v1.1.6;CIPOS=0,0;CIEND=0,0;SVLEN=1506;RT=0,1,0;GAP=997;MAPQ=60,60;PID=0.975,0.964;PLENGTH=0.536,0.412;RLENGTH=19764;DEPTHPVAL=0.558 | GT:DR:DV:GQ:HR:PL | 0/1:2,1:1,1:15:0,0:45,0,15

NORMAL.vcf
chr1 | 180588 | 1344 | C | C[chr1:181422[ | 60 | GAP;MapQual | SVTYPE=BND;SVMETHOD=NanoSV_v1.1.6;CIPOS=0,0;CIEND=0,0;SVLEN=834;RT=0,1,0;GAP=556;MAPQ=60,60;PID=0.941,0.932;PLENGTH=0.324,0.553;RLENGTH=4834;DEPTHPVAL=0.289 | GT:DR:DV:GQ:HR:PL | 1/1:0,0:1,1:6:0,0:60,6,0

TUMOR_NORMAL_merged.vcf
chr1 | 180588 | INV004SUR | N | | . | PASS | SUPP=2;SUPP_VEC=11;AVGLEN=1170;SVTYPE=INV;SVMETHOD=SURVIVORv2;CHR2=chr1;END=181422;CIPOS=-641,0;CIEND=-641,0;STRANDS=++ | GT:PSV:LN:DR:ST:TY:CO | 0/1:NA:1506:0,0:++:INV:chr1_179947-chr1_181453 | 1/1:NA:834:0,0:++:INV:chr1_180588-chr1_181422

ALL SVs with SVTYPE=BND in original VCF files are added as INV, except those with breakend in different chromosomes. How SURVIVOR infer the SVTYPE?
If I want to keep the SVTYPE in original VCF file, what should I do?

Thank you very much!

using NA12878_nano_error_profile_bwa.txt

Hello,

I am trying to use survivor to generated simulated data using existing error profile NA12878_nano_error_profile_bwa.txt. Below is my command.

However the bed file is empty despite running vcf2bed again and the vcf file is huge. Could you please clarify if I am using this in the right way?

SURVIVOR simSV hs37d5.fa NA12878_nano_error_profile_bwa.txt 0.1 1 simulated1

ls -sh simulated1.*

0 simulated1.bed 3.0G simulated1.fasta 0 simulated1.insertions.fa 31G simulated1.vcf

vcf2bed < simulated1.vcf > simulated1.bed

ls -sh simulated1.*

0 simulated1.bed 3.0G simulated1.fasta 0 simulated1.insertions.fa 31G simulated1.vcf

Thank you,
Ram

Terminate program as it could not find a non overlapping region

Hi,

I ran simSV with SURVIVOR simSV $REFERENCE parameter_file 0 1 test but got the following error:

# Chrs passed size threshold:24
generate SV
Terminate program as it could not find a non overlapping region

The reference I used was hg38 and the parameter file was like:

PARAMETER FILE: DO JUST MODIFY THE VALUES AND KEEP THE SPACES!
DUPLICATION_minimum_length: 50
DUPLICATION_maximum_length: 1000000
DUPLICATION_number: 2500
INDEL_minimum_length: 50
INDEL_maximum_length: 1000000
INDEL_number: 2500
TRANSLOCATION_minimum_length: 50
TRANSLOCATION_maximum_length: 1000000
TRANSLOCATION_number: 2500
INVERSION_minimum_length: 50
INVERSION_maximum_length: 1000000
INVERSION_number: 2500
INV_del_minimum_length: 50
INV_del_maximum_length: 1000000
INV_del_number: 2500
INV_dup_minimum_length: 50
INV_dup_maximum_length: 1000000
INV_dup_number: 2500

Do you know how this happened? Thanks in advance!

error when running SURVIVOR merge

Hi,
I want to merge my vcf files using 'SURVIVOR merge 925.gt.vcf 924.gt.vcf 1000 1 1 1 50 merge.vcf'
But it has an error showing "VCF Parser: could not open file: ##fileformat=VCFv4.2"
How can I resolve this problem?
Cheers.

SURVIVOR scanreads does not work

Hi,
I am trying to use the latest version of SURVIVOR (v1.02) to simulate ONT reads based on "true" alignment data. The first step is to generate error file from the alignment data. Using the command line suggested, I cannot get the program running, however:

samtools view ./MC_All_Albacore_trim_filtered_BWA.bam | ./SURVIVOR 2 scan 10000 error.txt
_Program: SURVIVOR (Tools for Structural Variations in the VCF format)
Version:

Usage: SURVIVOR [options]

Commands:
-- Simulation/ Evaluation
simSV Simulates SVs and SNPs on a reference genome.
scanreads Obtain error profiles form mapped reads for simulation.
simreads Simulates long reads (Pacio or ONT).
eval Evaluates a VCF file after SV calling over simulated data.

-- Comparison/filtering
merge Compare or merge VCF files to generate a consensus or multi sample vcf files.
filter Filter a vcf file based on size and/or regions to ignore
stats Report multipe stats over a VCF file
compMUMMer Annotates a VCF file with the breakpoints found with MUMMer (Show-diff).

-- Conversion
bincov Bins coverage vector to a bed file to filter SVs in low MQ regions
vcftobed Converts a VCF file to a bed file
bedtovcf Converts a bed file to a VCF file
smaptovcf Converts the smap file to a VCF file (beta version)_

As you can see, with the command line above, I get the "help" information on the screen, and did not have anything to output.

Also tried:

samtools view ./MC_All_Albacore_trim_filtered_BWA.bam | ./SURVIVOR scanreads 1000 1 error.txt

But still did not work.

Could you help to take a look at what is going on there?
Thanks!

SURVIVOR scanreads output doesn't contain any information.

Hi, Friz

First, thank you for developing such a great tools to evaluate SV callers.
I tried SURVIVOR scanreads and it generated a file as bellow.

samtools view ~/20180705_aligner_and_SVcaller_evaluation/minialign/chr22.bam | head -5000 |SURVIVOR scanreads 1000 error.txt
Pos     P(stop) P(match)        P(mismatch)     P(ins)] P(del)

I aligned NA12878 Nanopore(red 3) reads to GRCH38 with minialign,
bam file is as bellow:

$ samtools view ~/20180705_aligner_and_SVcaller_evaluation/minialign/chr22.bam | head -1
4730cf64-110d-4363-a502-2b940dd91a6d_Basecall_Alignment_template        256     CM000684.2      10510001        0       2730H41M1D35M1D5M1D1M1D18M1D18M1I1M1D6M1D2M1D52M1D64M2I4M2D33M2D27M1D10M1D2M1D8M2D2M1D2M2D1M7D35M2D41M1I30M3D1M1D2M1D11M2D12M1D16M5D10M1D39M3D1M1D9M1D18M2D30M1D12M2D25M1D4M1D35M1D24M1D11M9D17M1D5M1D4M1I2M1D3M2D1M1D4M1D17M1D1M1D2M1D7M3D6M1D7M1D18M1D33M3D8M1D6M2D5M1D4M2D22M1I20M1D7M1D50M1D15M1I25M1I3M3D44M1D12M3D16M1D7M1D11M1D21M2D1M1D5M1D35M1I1M1D12M2D5M2D4M1D23M1I3M1I2M2D21M1D5M2D12M1I1M1I22M1D6M1D17M4D2M1D2M1I29M1D9M1D6M1D1M1D1M1D7M1D5M1D4M1I1M1I38M1D4M1D2M4D5M1D3M1D1M1D32M1D9M1D9M1D9M2D2M1D9M9935H        *       0       0       GAATTCTTGTGTTTATATAATAAGATGTCCTATAATTTCTGTTGGAATATAAAATCAGCAACTAATATGTATTTTCAAGCATTAAATACAGAGTGCTAAGTACTTCACTGTGAAATGTGATCATATAAAACATAATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAGGTTTCATCAGTAATTCATTATATCAAAATGCTCCAGGCCAGGCGTGGTGGCTTATGCCTGTATAATCAGCACTTTGGGAGGTCGAAGTGGGCGGATCGCGAGGTCAGGAGATGGAGACTAGCCTGGCAACATGATGAACCCGTGTCATATTAAAAATTAGCTGGGTGTGGTGGTGGGCAGCTGTAATCAGCTAATCAGGAGGCTGAGGCAGAAGAATTGCTTGAACGATGGAAGGCAGAGTTTACAGTGTGCCAAGATACACCTACTCCAGCGGGTGACAGAGCAGACGCAGTCTCAAGGAAAAAGCTCGAAAATGTTTGCTTATTTTGGTAAAATTATTCATTGACTATATAAATCAACAAACTGTCCATATTTCATTTTGAAATTACATATTAAGGATCTAAAACAAGGATAAAATATGCAAAAATATTTGCATTTCTGGATAAGTGGTTCCTGATTTAACTCCTTGTTGACACTGAATCTGTTTAAATTAAACAAGGTGCTGATAGACAAAAGACAATTCAAAATAAAATTGAAGGTAAATCCAAACACAGAATAAAATTTCCTCTGACGGTAAAGAGATCATACAAAATCAAATTTATATAGTCTCTTTCAAAGGGTCATAAAACCAATCAGTTAATAGTTGTTTATGTGAAAAAGGCTGTAAAGAAAATAATGACACCAAAATCCCTTTTTAAGTTCAAAGAGTCATAAGTTTCAAAAATGTAATTCAGATAAGAAATTAATGTCAGCTACATTCTTCCACAGTAAGGTTTGCATCAACTTTTTCAGCTAATAGCTTTAGCATTCGACTAAACTTGTTTCAACCATTATTTTACTAGGTGTTATAACAGGGTATACATTCTTTCATTTTCTTCATTTATTAATGGAATATTTCTAAAAGACATTTCCTCCCAAATGTGTAGTTACACAAAACACAGTTTGGACAGGAAAAACAAGATAAATGAACAAATGCTTGATTTATGCCTCCCACTTTTCACAGTAATGAATTAATTAGCCAGCAAACTTTAATAGTGATTAAAGCATTTCCCATAATAATAAAATTTTAAGTTTTTGTAAGTAGTGATAATAATAATTTGTCCAGTCTGTAGTACCTATAACCAGCCAGGATCCATTCTCAACAATGTGGGAAAGGCCAGTTTAAATTCTGTCCAAGGATACCAGCAAGATGATGAAATAGAAAGCCTCAAAGCCCTCACCTCTCAAGGACACCAGCTCGATAATAATACACAGATAAATCCCTTCGTGAAAGTTCAGGGAGAATTTAGCTCTGGATT        *

SURVIVOR filter removing all SV's and receiving error message

Hello,

I'm trying to use the SURVIVOR filter function to remove SV's from a vcf file. Here is the command I am using with the file names removed:

SURVIVOR filter <input.vcf> <remove.bed> 1 1000 0.01 10 <out.vcf>

I entered dummy values for min and max sv size and just used the values from the example usage on your github for min allele frequency and min number of reads. I am not entirely sure how to determine what values to input for these parameters.

The command executes successfully and produces an output but I do receive this warning/error message:

Warning: Max size threshold set, TRA wont be reported as their size cannot be assesst.

Additionally the <out.vcf> file has all SV's removed and only contains the header. There are 131704 SV's in my <input.vcf> and I receive the message SVs ignored: 131704 resulting in an empty output file.

Is this due to an incorrect use of parameters and if so how can I correct these settings? Any help is greatly appreciated.

Best,
Alberto

Reciprocal overlap

Hello,
First off, I love the tool and it is very useful to our SV analysis for merging call sets between individuals.

One question with regards to Command #5:
Is there some way to control reciprocal overlap? I know there is a parameter that controls "within ## BP of each of each other" for merging SVs, but I'm wondering how the logic behind it works.
From your paper:

"Two SVs were defined as overlapping if they occur on the same chromosome, their start and stop coordinates were within 1 kb,"

Is it as such:

SV1: chr1 1000 3000
SV2: chr1 1500 3500

Do you then compute: ( | SV1_start - SV2_start | + | SV1_end - SV2_end | ) ≤ 1000bp
( | 1000 - 1500 | + | 3000 - 3500 | ) = (500 + 500) ≥ 1000, so you count this as the same variant?

Or is it: | SV1_start - SV2_start | ≤ 1000 AND | SV1_end - SV2_end | ≤ 1000, then you count this variant.

Thanks for your help and this awesome tool!

-Phil Richmond @Phillip-a-richmond

merge vcf: some entries failed to merge

Hi Fritz,
We sequenced a NA12878 run and used parliament2 to generate SVs. We tried to compare the parliament2 (>500bp) DEL results with Personalis' DELs, using "SURVIVOR merge", with parameters: 1000 2 1 1 0 500. We used the parliament2 combined vcf as well as the MANTA calls from the combined vcf (simply by grep MANTA). Here is what confused us:
We got more merged calls from the MANTA subset than the combined vcf. An example:

Personalis:
chr1 59582965 DEL00BED N . LowQual IMPRECISE;SVTYPE=DEL;SVMETHOD=BEDFILE;CHR2=chr1;END=59583989;SVLEN=1024;PE=1 GT:GL:GQ:FT:RC:DR:DV:RR:RV

Combined vcf:
chr1 59583072 DEL0086690SUR N . PASS SUPP=2;AVGLEN=971.5;SVTYPE=DEL;SVMETHOD=SURVIVORv2;CHR2=1;END=59583990;STRANDS=+-;CALLERS=BREAKDANCER,LUMPY GT:SP 1/1:BREAKDANCER,LUMPY
chr1 59583302 DEL0086689SUR N . PASS SUPP=2;AVGLEN=687;SVTYPE=DEL;SVMETHOD=SURVIVORv2;CHR2=1;END=59583989;STRANDS=+-;CALLERS=MANTA GT:SP 1/1:MANTA

MANTA subset:
chr1 59583302 DEL0086689SUR N . PASS SUPP=2;AVGLEN=687;SVTYPE=DEL;SVMETHOD=SURVIVORv2;CHR2=1;END=59583989;STRANDS=+-;CALLERS=MANTA GT:SP 1/1:MANTA

In the above example, the combined vcf failed to merge the DEL with the Personalis' but the MANTA subset merged the DEL successfully.

Suspecting that the problem might be due to the overlapping calls in the combined vcf, we merged combined vcf against itself. In the above example, the two DELs from combined vcf indeed merged into one. However, the DEL from self merged combine-vcf still fail to merge with Personalis'.
chr1 59583072 DEL0087SUR N . PASS SUPP=2;SUPP_VEC=11;AVGLEN=971;SVTYPE=DEL;SVMETHOD=SURVIVORv2;CHR2=1;END=59583990;CIPOS=0,230;CIEND=-1,0;STRANDS=+- GT:PSV:LN:DR:ST:TY:CO 1/1:2,:971:0,0:+-:DEL,DEL:chr1_59583072-1_59583990,chr1_59583302-1_59583989 1/1:2,:971:0,0:+-:DEL,DEL:chr1_59583072-1_59583990,chr1_59583302-1_59583989
The above call still failed to merge with Personalis.

Your thought on this is greatly appreciated. Thanks much.

Please consider adding descriptive comments and a short summary

Dear valuable authors,

Please consider adding comments describing function of each function, and possibly a short document describing the architecture and overall structure of the code.
This way, future researchers could more easily understand and improve your code.
Of course documentation has lower priority and you can postpone it. But please consider it in the last step.

Regards

is there a way to simulate large deletions by simSV?

Hi Fritz,
I tried to simulate the reads using simSV and notice that the par file has INDEL but no separate DEL or INS. I am wondering if there is a way to directly simulate large deletions, as many benchmark data are deletions. Thanks much!

Yue

Interpretation of sv-simulated bed file

Hello Fritz,

I tried to use SURVIVOR to simulated a few number of SVs in one of the yeast reference genomes. The command line was:

SURVIVOR 1 yeast.fa par_file 0 simulated

Some lines of the bed output is as below:

NC_001139.9 239105 NC_001139.9 246078 DUP
NC_001137.3 304464 NC_001137.3 305703 DEL
NC_001134.8 469961 NC_001134.8 474862 INS
NC_001141.2 186459 NC_001141.2 189325 INV
NC_001137.3 338964 NC_001134.8 210746 TRA
NC_001137.3 339552 NC_001134.8 211334 TRA

I think I understand the insertion, deletion, and inversion lines. For instance, there is an insertion event from 469,961 to 474,852 on NC_001134.8 chromosome, right?

But I am not quite sure how to interpret the translocation and duplication lines. Appreciate your help in advance!

Simo

SVTYPE <NA>

Hello,

I've run SURVIVOR on the outputs of CNVnator, LUMPY, and Pindel, and I see that I have a significant number of SVs of type 'NA' in the output (~25%).

For example:
1 43171 NA004SUR N . PASS SUPP=1;SUPP_VEC=010;AVGLEN=295;SVTYPE=NA;SVMETHOD=SURVIVORv2;CHR2=1;END=43466;STRANDS=++ GT:LN:DR:ST:TY:CO ./.:0:0,0:--:NaN:NaN 0/0:295:0,0:++:NA:1_43171-1_43466 ./.:0:0,0:--:NaN:NaN

and

1 563387 NA0014SUR N . PASS SUPP=1;SUPP_VEC=010;AVGLEN=150;SVTYPE=NA;SVMETHOD=SURVIVORv2;CHR2=1;END=563537;STRANDS=++ GT:LN:DR:ST:TY:CO ./.:0:0,0:--:NaN:NaN 0/0:150:0,0:++:NA:1_563387-1_563537 ./.:0:0,0:--:NaN:NaN

What are these? Are they records with conflicting SV types?

Thanks for your help!
Madeline

cend.first and cend.second in combine_svs.cpp

Hello Fritz,
Should not Line 141 and 142 be as follows?

cend.first = std::min(caller_info[i]->stops[j] - caller_info[index]->stops[0], cend.first);
cend.second = std::max(caller_info[i]->stops[j] - caller_info[index]->stops[0], cend.second);

cend.first = std::min(caller_info[i]->stops[j] - caller_info[index]->stops[0], cpos.first);

cend.second = std::max(caller_info[i]->stops[j] - caller_info[index]->stops[0], cpos.first);

Cheers,
Luca

SVTYPE DUP merging into NA

Hi ,

I recently noticed that when merging two VCFs, all DUPs in my sample are merging into the NA type,

The command I ran is the following,

SURVIVOR merge vcf_files_raw_calls.txt 1000 1 1 -1 -1 -1 merged_SURVIVOR_1kbpdist_typesave.vcf

here's an example of the failed merge type

#merged line in SURVIVOR
chr1 10003 NA000SUR N . PASS SUPP=2;SUPP_VEC=110;AVGLEN=135;SVTYPE=NA;SVMETHOD=SURVIVORv2;CHR2=chr1;END=10199;CIPOS=0,75;CIEND=-49,0;STRANDS=++ GT:PSV:LN:DR:ST:TY:CO ./.:NA:73:0,2:++:NA:chr1_10078-chr1_10150 ./.:NA:197:0,3:++:NA:chr1_10003-chr1_10199 ./.:NaN:0:0,0:--:NaN:NaN

#the two entries that got merged together into this are the following
chr1 10078 TJ10668 N . PASS IMPRECISE;SVMETHOD=picky;END=10150;SVTYPE=DUP;RE=2;RNAMES=m151015_221903_42196_c100828522550000001823180911251523_s1_p0/91653/37610_37635,m150919_214537_42196_c100828492550000001823180911251585_s1_p0/30014/34259_34367;SVLEN=73;CIPOS=-47,47;CIEND=-76,76;NOTE=CIPOS_CIEND;ISVTYPE=TDSR(2) GT ./.

chr1 10003 TJ36088 N . PASS IMPRECISE;SVMETHOD=picky;END=10199;SVTYPE=DUP;RE=3;RNAMES=m150704_162046_42134_c100846082550000001823189212311577_s1_p0/140541/30105_30350,m150530_105730_00118_c100813682550000001823175310291582_s1_p0/88961/25088_25294,m150630_190056_00118_c100855592550000001823186312311594_s1_p0/131968/37141_37217;SVLEN=197;CIPOS=-2,69;CIEND=-1,90;NOTE=CIPOS_CIEND;ISVTYPE=TDSR(3) GT ./.

Please advise,

VCF merge about strand

This will merge all the vcf files specified in sample_files together using a maximum allowed distance of 1kb, as measured pairwise between breakpoints (begin1 vs begin2, end1 vs end2). Furthermore we ask SURVIVOR only to report calls supported by 2 callers and they have to agree on the type (1) and on the strand (1) of the SV

Hi,@fritzsedlazeck,
I didn't understand the strand,can you explain it briefly?
Thanks

SV position

I am using SURVIVOR to simulate structural variations, but I am not sure about that the position of each SV is about source genome or generated genome.

I compared the insertion sequences generated by SURVIVOR with generated genome, but very confusing. Some insertion sequences are really there in generated genome, but some others are not.

What happened?

SURVIVOR 10 issue

using ./SURVIVOR 10 ./test.tails 20 test.vcf

The input is:

#id	chrKey	uRef	uBreak	uMapq	dRef	dBreak	dMapq	remainSeq	annot	numReads	numZMWs	evidence
0	1_1	1	40015898	254	1	40030111	254	1	DEL	2	2	->p=i->;<-i=p<-

The output:

1	40015898	DEL00Honey	N	<DEL>	.	LowQual	IMPRECISE;SVTYPE=DEL;SVMETHOD=Honey_tails;CHR2=1;END=40030111;SVLEN=14213;PE=1	GT:GL:GQ:FT:RC:DR:DV:RR:RV

Is this value DEL00Honey is the default behavior?

also the coverage is not included, what is PE filed why it is always 1?

Thanks,

Compiling error

I'm getting a compiling error when trying to install the latest release.
Below the relevant log lines:

Building target: SURVIVOR
Invoking: Cross G++ Linker
g++ -o "SURVIVOR" ./src/vcfs/Annotate_vcf.o ./src/vcfs/Combine_3_VCF.o ./src/vcfs/Compoverlap_VCF.o ./src/vcfs/Detect_nested.o ./src/vcfs/Filter_vcf.o ./src/vcfs/Generate_distMat.o ./src/vcfs/Merge_VCF.o ./src/snp_overlap/Overlap_snps.o ./src/simulator/Error_scanner.o ./src/simulator/Eval_vcf.o ./src/simulator/Pac_Simulator.o ./src/simulator/SV_Simulator.o ./src/simulator/Sim_reads.o ./src/simulator/test_cov.o ./src/merge_vcf/IntervallTree.o ./src/merge_vcf/combine_svs.o ./src/convert/ConvertMQ0Bed.o ./src/convert/Convert_Assemblytics.o ./src/convert/Convert_Bionano.o ./src/convert/Convert_Honey_tails.o ./src/convert/Convert_MUMmer.o ./src/convert/Convert_Pindel.o ./src/convert/Convert_VCF_to_BED.o ./src/convert/Convert_hapcut2.o ./src/convert/Process_Coverage.o ./src/convert/Process_Lumpy.o ./src/analysis_sv/GIAB_summary.o ./src/analysis_sv/MT_identifier.o ./src/analysis_sv/MUMmer_overlap.o ./src/analysis_sv/Select_samples.o ./src/analysis_sv/Simplify_SVs.o ./src/analysis_sv/Summ_mat.o ./src/CorrectAllele.o ./src/DetectDif.o ./src/Extract_Seq.o ./src/SURVIVOR.o ./src/Summarize_SV.o
./src/convert/Convert_Assemblytics.o: In function std::string::_M_data() const': /home/cog/jvalleinclan/kloosterman/tools/SURVIVOR/Debug/../src/convert/Convert_Assemblytics.cpp:11: multiple definition of print_header(std::string, _IO_FILE*&)'
./src/vcfs/Compoverlap_VCF.o:/home/cog/jvalleinclan/kloosterman/tools/SURVIVOR/Debug/../src/vcfs/Compoverlap_VCF.cpp:33: first defined here
./src/SURVIVOR.o: In function official_interface(int, char**)': /home/cog/jvalleinclan/kloosterman/tools/SURVIVOR/Debug/../src/SURVIVOR.cpp:248: undefined reference to parental_phasing(std::string, std::string, std::string)'
collect2: error: ld returned 1 exit status
make: *** [SURVIVOR] Error 1

Is it on my side or is there a mis-definition somewhere?

Error when installing SURVIVOR: undefined references

I get the following errors when installing SURVIVOR:

.
.
.
Finished building: ../src/Summarize_SV.cpp
 
Building target: SURVIVOR
Invoking: Cross G++ Linker
g++  -o "SURVIVOR"  ./src/vcfs/Annotate_vcf.o ./src/vcfs/Combine_3_VCF.o ./src/vcfs/Compoverlap_VCF.o ./src/vcfs/Filter_vcf.o ./src/vcfs/Merge_VCF.o  ./src/simulator/Eval_vcf.o ./src/simulator/Pac_Simulator.o ./src/simulator/SV_Simulator.o  ./src/merge_vcf/IntervallTree.o ./src/merge_vcf/combine_svs.o  ./src/convert/ConvertMQ0Bed.o ./src/convert/Convert_Assemblytics.o ./src/convert/Convert_Bionano.o ./src/convert/Convert_Honey_tails.o ./src/convert/Convert_Pindel.o ./src/convert/Convert_VCF_to_BED.o ./src/convert/Process_Lumpy.o  ./src/CorrectAllele.o ./src/DetectDif.o ./src/Extract_Seq.o ./src/SURVIVOR.o ./src/Summarize_SV.o   
./src/SURVIVOR.o: In function `main':
/home/sean/SURVIVOR/Debug/../src/SURVIVOR.cpp:292: undefined reference to `summary_giab(std::string, std::string)'
/home/sean/SURVIVOR/Debug/../src/SURVIVOR.cpp:309: undefined reference to `detect_nested(std::string, std::string)'
/home/sean/SURVIVOR/Debug/../src/SURVIVOR.cpp:273: undefined reference to `summarize_badcoverage(std::string, int, int, std::string)'
collect2: error: ld returned 1 exit status
make: *** [SURVIVOR] Error 1

Thanks!

merge vcf : Information for DR

Hello,

I use SURVIVOR merge to merge information from several software.
I wanted to know how DR information was calculated.
For exemple, with Manta this information is true, for Delly this information is false and for Lumpy we have only the information for the variant and not for the reference.

Thank you in Advance,
Regards,

Steven

Feature Request: testing against a control SV set

I think one scenario which might often arise is that we have e.g.

  • SVs analyzed in a control group
  • SVs calls from a treated group (or group with a certain sickness)

Or alternatively that the assembly and the SV-caller would disagree with some assembly hypothesis and report SVs.

It would be great if an option would exist to define one set as background SVs which should be ignored, or flagged.
So essentially the opposite of "Consensus call from multipe SV calls (vcf files)" .

Cheers

How should start, end, and size of insertions be compared?

As we discussed during the hackathon, insertions are represented in many different ways by different callers, particularly in how the end and size of the insertion is represented. It would be great to find a way to compare these appropriately.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.