svim's Issues

FORMAT 'CN' is not defined in the header

Hi David,

Thanks for developing SVIM.
I am filtering the SVs called by SVIM and ran into an issue with variants.vcf.
There are DUP_TAN variants with the FORMAT field GT:CN:DP:AD, but 'CN' is not defined in the header.
I checked the wiki and did not find any information about CN.
I am not sure what to do with these variants since they are PASS variants.

Would love to get your insights about it.
Thanks!
Best wishes,
Zih-Hua
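A possible workaround until the header is fixed upstream is to inject the missing FORMAT definition before filtering. A minimal sketch in plain Python follows; the Number/Type/Description chosen for CN are assumptions, not SVIM's official definition:

```python
def add_cn_header(vcf_text):
    """Insert a FORMAT definition for CN before the #CHROM line if it is missing."""
    cn_line = ('##FORMAT=<ID=CN,Number=1,Type=Integer,'
               'Description="Copy number of the variant region (assumed definition)">')
    out = []
    for line in vcf_text.splitlines():
        # Add the definition once, right before the column header line.
        if line.startswith('#CHROM') and '##FORMAT=<ID=CN' not in vcf_text:
            out.append(cn_line)
        out.append(line)
    return '\n'.join(out) + '\n'
```

Header-validating downstream tools should then accept the CN key, since it is declared before the first record.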

Merge two-line translocation (BND) records into one line

In the output, I found that translocations (BND) are split across two lines: position1 and position2 are swapped, and the svim.BND.xxx SV IDs differ. I want to merge every translocation event into one line. Is there any recommended method?
I am using multiple SV callers and found that some callers (e.g. Manta and SvABA) do the same thing for translocations, while others (e.g. Sniffles and cuteSV) emit only one line per translocation.

chr1    53900   svim.BND.7      N       N[chr2:11656999[        1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.      GT:DP:AD        ./.:.:.,.
chr2    11656999        svim.BND.156311 N       ]chr1:53900]N   1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.      GT:DP:AD        ./.:.:.,.
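A common way to get one line per translocation (not a built-in SVIM option; this sketch assumes each BND record's mate coordinate is named in its ALT bracket notation, as above) is to keep only the record whose own coordinate sorts before its mate's:

```python
import re

def mate_coord(alt):
    """Parse the mate's (chrom, pos) from BND ALT notation like N[chr2:11656999[ or ]chr1:53900]N."""
    m = re.search(r'[\[\]]([^\[\]:]+):(\d+)[\[\]]', alt)
    return m.group(1), int(m.group(2))

def one_line_per_bnd(vcf_records):
    """Keep one record per mate pair: the one whose own (chrom, pos) sorts first."""
    kept = []
    for rec in vcf_records:
        fields = rec.split('\t')
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if (chrom, pos) <= mate_coord(alt):
            kept.append(rec)
    return kept
```

Note that string comparison of chromosome names is only a tie-breaker here; any deterministic ordering that both mates agree on works.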

Sequences of insertion

Hi David
I would like to explore the features of the insertions predicted by SVIM. Is there any option to extract the insertion sequences or the reads containing the insertions?

Thanks
Edison
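For what it's worth, the --insertion_sequences and --read_names options (visible in the parameter listings elsewhere in these issues) exist. Here is a generic sketch for pulling the SVTYPE=INS records out of variants.vcf with plain Python, after which the ALT and INFO fields can be inspected for sequence and read information (field positions follow the VCF spec):

```python
def insertion_records(vcf_text):
    """Yield (chrom, pos, alt, info) for records whose INFO declares SVTYPE=INS."""
    for line in vcf_text.splitlines():
        if line.startswith('#') or not line.strip():
            continue
        fields = line.split('\t')
        chrom, pos, alt, info = fields[0], int(fields[1]), fields[4], fields[7]
        # Match the exact key=value pair so e.g. SVTYPE=INV is not caught.
        if any(kv == 'SVTYPE=INS' for kv in info.split(';')):
            yield chrom, pos, alt, info
```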

Phasing interpretation

Hello,
I am just wondering how I should interpret the unphased genotypes in the 'variants.vcf' output file. Do those really matter?

Improved Documentation

Thanks so much for developing this tool and making it accessible to the research community!

As I was getting started with SVIM I noticed that in the example analysis you use options like --interspersed_duplications_as_insertions and --max_sv_size 1000000. I've tried to find a place where all options are listed (e.g. with svim --help) along with a description of what they do, but I haven't been able to find one. Does such documentation exist?

Thanks for your help!

ERROR when using assembly-vs-assembly BAM as input

Dear all,

Based on the literature, I decided to use minimap2 for alignment and then Sniffles and SVIM to call SVs; variants detected by both callers will be retained. However, when running SVIM on BAM files from a chromosome-level assembly aligned against another chromosome-level assembly, I always get an error like the following. Hope you can help me, thank you.

2020-05-02 03:19:04,661 [ERROR  ]  value too large to convert to uint32_t
Traceback (most recent call last):
  File "svim", line 165, in <module>
    sys.exit(main())
  File "svim", line 87, in main
    sv_signatures = analyze_alignment_file_coordsorted(aln_file, options)
  File "SVIM_COLLECT.py", line 138, in analyze_alignment_file_coordsorted
    supplementary_alignments = retrieve_other_alignments(current_alignment, bam)
  File "SVIM_COLLECT.py", line 82, in retrieve_other_alignments
    a.cigarstring = cigar
  File "pysam/libcalignedsegment.pyx", line 1296, in pysam.libcalignedsegment.AlignedSegment.cigarstring.__set__
  File "pysam/libcalignedsegment.pyx", line 2217, in pysam.libcalignedsegment.AlignedSegment.cigartuples.__set__
OverflowError: value too large to convert to uint32_t

Sincerely,
Zheng Zhuqing

Alignment pipeline error

Hi @eldariont

Thanks for working on this tool. I was trying to make a call using the svim reads method as follows:

svim reads --min_mapq 30 <working-dir> <fastq.gz> <genome.fasta>

It just suddenly stopped, so I decided to share the entire .log file here:

2020-10-11 20:25:47,438 [INFO   ]  ****************** Start SVIM, version 1.4.2 ******************
2020-10-11 20:25:47,438 [INFO   ]  CMD: python3 /home/cgarci39/.conda/envs/svim/bin/svim reads --min_mapq 30 /home/cgarci39/Projects/Bacillus_subtilis/Experiments/Svim_jobs/01_15-M1_nmlr-svim /home/cgarci39/Projects/Bacillus_subtilis/Data/Raw_fastq/M1_Fastqs/M1.fastq.gz /home/cgarci39/Projects/Bacillus_subtilis/Data/Local_genomes/EA_CB0015_hybrid.fasta
2020-10-11 20:25:47,438 [INFO   ]  WORKING DIR: /home/cgarci39/Projects/Bacillus_subtilis/Experiments/Svim_jobs/01_15-M1_nmlr-svim
2020-10-11 20:25:47,438 [INFO   ]  PARAMETER: sub, VALUE: reads
2020-10-11 20:25:47,438 [INFO   ]  PARAMETER: working_dir, VALUE: /home/cgarci39/Projects/Bacillus_subtilis/Experiments/Svim_jobs/01_15-M1_nmlr-svim
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: reads, VALUE: /home/cgarci39/Projects/Bacillus_subtilis/Data/Raw_fastq/M1_Fastqs/M1.fastq.gz
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: genome, VALUE: /home/cgarci39/Projects/Bacillus_subtilis/Data/Local_genomes/EA_CB0015_hybrid.fasta
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: verbose, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: cores, VALUE: 1
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: aligner, VALUE: ngmlr
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: nanopore, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: min_mapq, VALUE: 30
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: min_sv_size, VALUE: 40
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: max_sv_size, VALUE: 100000
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: segment_gap_tolerance, VALUE: 10
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: segment_overlap_tolerance, VALUE: 5
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: all_bnds, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: partition_max_distance, VALUE: 1000
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: distance_normalizer, VALUE: 900
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: cluster_max_distance, VALUE: 0.3
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: del_ins_dup_max_distance, VALUE: 1.0
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: trans_sv_max_distance, VALUE: 500
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: skip_genotyping, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: minimum_score, VALUE: 3
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: homozygous_threshold, VALUE: 0.8
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: heterozygous_threshold, VALUE: 0.2
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: minimum_depth, VALUE: 4
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: sample, VALUE: Sample
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: types, VALUE: DEL,INS,INV,DUP:TANDEM,DUP:INT,BND
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: sequence_alleles, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: insertion_sequences, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: tandem_duplications_as_insertions, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: interspersed_duplications_as_insertions, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: read_names, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  PARAMETER: zmws, VALUE: False
2020-10-11 20:25:47,439 [INFO   ]  ****************** STEP 1: COLLECT ******************
2020-10-11 20:25:47,439 [INFO   ]  MODE: reads
2020-10-11 20:25:47,439 [INFO   ]  INPUT: /home/cgarci39/Projects/Bacillus_subtilis/Data/Raw_fastq/M1_Fastqs/M1.fastq.gz
2020-10-11 20:25:47,439 [INFO   ]  GENOME: /home/cgarci39/Projects/Bacillus_subtilis/Data/Local_genomes/EA_CB0015_hybrid.fasta
2020-10-11 20:25:47,439 [INFO   ]  Recognized reads file as gzipped FASTQ format.
2020-10-11 20:25:47,671 [INFO   ]  Starting alignment pipeline..
2020-10-11 20:27:58,503 [ERROR  ]  The alignment pipeline failed with exit code 1. Command was: set -o pipefail && gunzip -c /home/cgarci39/Projects/Bacillus_subtilis/Data/Raw_fastq/M1_Fastqs/M1.fastq.gz | ngmlr -t 1 -r /home/cgarci39/Projects/Bacillus_subtilis/Data/Local_genomes/EA_CB0015_hybrid.fasta | samtools view -b -@ 1 | samtools sort -@ 1 -o /home/cgarci39/Projects/Bacillus_subtilis/Experiments/Svim_jobs/01_15-M1_nmlr-svim/M1.fastq.ngmlr.coordsorted.bam
Traceback (most recent call last):
  File "/home/cgarci39/.conda/envs/svim/lib/python3.7/site-packages/svim/SVIM_alignment.py", line 52, in run_alignment
    run(" ".join(command_align), shell=True, check=True, executable='/bin/bash')
  File "/home/cgarci39/.conda/envs/svim/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'set -o pipefail && gunzip -c /home/cgarci39/Projects/Bacillus_subtilis/Data/Raw_fastq/M1_Fastqs/M1.fastq.gz | ngmlr -t 1 -r /home/cgarci39/Projects/Bacillus_subtilis/Data/Local_genomes/EA_CB0015_hybrid.fasta | samtools view -b -@ 1 | samtools sort -@ 1 -o /home/cgarci39/Projects/Bacillus_subtilis/Experiments/Svim_jobs/01_15-M1_nmlr-svim/M1.fastq.ngmlr.coordsorted.bam' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cgarci39/.conda/envs/svim/bin/svim", line 216, in <module>
    sys.exit(main())
  File "/home/cgarci39/.conda/envs/svim/bin/svim", line 85, in main
    bam_path = run_alignment(options.working_dir, options.genome, options.reads, reads_type, options.cores, options.aligner, options.nanopore)
  File "/home/cgarci39/.conda/envs/svim/lib/python3.7/site-packages/svim/SVIM_alignment.py", line 55, in run_alignment
    raise AlignmentPipelineError('The alignment pipeline failed with exit code {0}. Command was: {1}'.format(e.returncode, e.cmd)) from e
svim.SVIM_alignment.AlignmentPipelineError: The alignment pipeline failed with exit code 1. Command was: set -o pipefail && gunzip -c /home/cgarci39/Projects/Bacillus_subtilis/Data/Raw_fastq/M1_Fastqs/M1.fastq.gz | ngmlr -t 1 -r /home/cgarci39/Projects/Bacillus_subtilis/Data/Local_genomes/EA_CB0015_hybrid.fasta | samtools view -b -@ 1 | samtools sort -@ 1 -o /home/cgarci39/Projects/Bacillus_subtilis/Experiments/Svim_jobs/01_15-M1_nmlr-svim/M1.fastq.ngmlr.coordsorted.bam

After the job executes, it also gives me a coordsorted.bam file, which I suspect is good and complete, but the pipeline does not continue.

By the way, I installed svim via conda. If there is a way to solve this issue I will really appreciate it!

Also here is my conda list for the env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
bzip2                     1.0.8                h516909a_3    conda-forge
c-ares                    1.16.1               h516909a_3    conda-forge
ca-certificates           2020.6.20            hecda079_0    conda-forge
certifi                   2020.6.20        py37he5f6b98_2    conda-forge
cycler                    0.10.0                     py_2    conda-forge
decorator                 4.4.2                      py_0    conda-forge
freetype                  2.10.3               he06d7ca_0    conda-forge
htslib                    1.11                 hd3b49d5_0    bioconda
jpeg                      9d                   h516909a_0    conda-forge
k8                        0.2.5                he513fc3_0    bioconda
kiwisolver                1.2.0            py37h99015e2_0    conda-forge
krb5                      1.17.1               hfafb76e_3    conda-forge
lcms2                     2.11                 hbd6801e_0    conda-forge
ld_impl_linux-64          2.35                 h769bd43_9    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libcurl                   7.71.1               hcdd3856_8    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.2.1             he1b5a44_1007    conda-forge
libgcc-ng                 9.3.0               h5dbcf3e_17    conda-forge
libgfortran-ng            7.5.0               hae1eefd_17    conda-forge
libgfortran4              7.5.0               hae1eefd_17    conda-forge
libgomp                   9.3.0               h5dbcf3e_17    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libnghttp2                1.41.0               h8cfc5f6_2    conda-forge
libopenblas               0.3.10          pthreads_hb3c22a3_5    conda-forge
libpng                    1.6.37               hed695b0_2    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               h2ae2ef3_17    conda-forge
libtiff                   4.1.0                hc7e4089_6    conda-forge
libwebp-base              1.1.0                h516909a_3    conda-forge
lz4-c                     1.9.2                he1b5a44_3    conda-forge
matplotlib-base           3.3.2            py37hd478181_0    conda-forge
minimap2                  2.17                 hed695b0_3    bioconda
ncurses                   6.2                  he1b5a44_1    conda-forge
networkx                  2.5                        py_0    conda-forge
ngmlr                     0.2.7                he513fc3_2    bioconda
numpy                     1.19.2           py37h7ea13bd_1    conda-forge
olefile                   0.46                       py_0    conda-forge
openssl                   1.1.1h               h516909a_0    conda-forge
pillow                    7.2.0            py37h718be6c_1    conda-forge
pip                       20.2.3                     py_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysam                     0.16.0.1         py37hc334e0b_1    bioconda
python                    3.7.8           h6f2ec95_1_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
samtools                  1.11                 h6270b1f_0    bioconda
scipy                     1.5.2            py37hb14ef9d_1    conda-forge
setuptools                49.6.0           py37he5f6b98_2    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.33.0               h4cf870e_1    conda-forge
svim                      1.4.2                      py_0    bioconda
tk                        8.6.10               hed695b0_1    conda-forge
tornado                   6.0.4            py37h8f50634_1    conda-forge
wheel                     0.35.1             pyh9f0ad1d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1009    conda-forge
zstd                      1.4.5                h6597ccf_2    conda-forge

Thanks in advance.
Camilo

error with gzipped fastq file and create working dir if it doesn't exist

Hi,

This looks like an interesting tool so I'm eager to give it a try.

Using a gzipped fastq file I get the following error:

Traceback (most recent call last):
  File "svim/SVIM_COLLECT.py", line 383, in <module>
    sys.exit(main())
  File "svim/SVIM_COLLECT.py", line 353, in main
    full_reads_path = create_full_file(options.working_dir, options.reads, reads_type)
  File "svim/SVIM_COLLECT.py", line 200, in create_full_file
    if line.startswith('@'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

I could solve this by changing the mode of gzip.open() from 'rb' to 'rt' in line 178 of SVIM_COLLECT.py.
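The difference can be illustrated in isolation: gzip.open() in binary mode ('rb') yields bytes, so str methods like startswith('@') fail, while text mode ('rt') yields str:

```python
import gzip
import os
import tempfile

def first_line_is_fastq_header(path):
    """Read a gzipped FASTQ in text mode so lines are str, not bytes."""
    with gzip.open(path, 'rt') as f:   # 'rt', not 'rb'
        return f.readline().startswith('@')

# demo: write a tiny gzipped FASTQ and check it
tmp = os.path.join(tempfile.mkdtemp(), 'toy.fastq.gz')
with gzip.open(tmp, 'wt') as f:
    f.write('@read1\nACGT\n+\nIIII\n')
```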

Additionally, would it be possible for the script to create the working dir if it doesn't exist? Earlier I had the following error:

  File "svim/SVIM_COLLECT.py", line 383, in <module>
    sys.exit(main())
  File "svim/SVIM_COLLECT.py", line 321, in main
    fileHandler = logging.FileHandler("{0}/SVIM-COLLECT_{1}.log".format(options.working_dir, strftime("%y%m%d_%H%M%S", localtime())), mode="w")
  File "/home/wdecoster/anaconda3/envs/svim/lib/python3.6/logging/__init__.py", line 1030, in __init__
    StreamHandler.__init__(self, self._open())
  File "/home/wdecoster/anaconda3/envs/svim/lib/python3.6/logging/__init__.py", line 1059, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/home/wdecoster/temp/test_svim/test/SVIM-COLLECT_180620_125723.log'

which was solved by creating the 'test' directory.
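The requested behaviour is essentially a one-liner; a sketch of creating the working directory (and any missing parents) before the log handler is attached:

```python
import os
import tempfile

def ensure_working_dir(path):
    """Create the working directory (and parents) if it does not exist yet; idempotent."""
    os.makedirs(path, exist_ok=True)
    return path

# demo on a path two levels deep that does not exist yet
base = tempfile.mkdtemp()
workdir = ensure_working_dir(os.path.join(base, 'test_svim', 'test'))
```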

Cheers,
Wouter

Accept coordinate sorted alignments

Hi!

All alignment visualizers rely on coordinate-sorted alignments, yet your tool requires me to re-sort, which can be annoying when dealing with >100 GB BAM files. Why do you need a special sorting? If you need information about all supplementary alignments of each read, make use of the SA tag.

Armin

Issue when reading SAM/BAM with an empty SEQ field

SVIM works fine in most cases, but I'm wondering whether it is possible to run it on a BAM with an empty SEQ field (the 10th column in the SAM format).
Currently it raises the following for such a BAM, because it tries to access the sequence bases:

  File "SVIM_intra.py", line 42, in analyze_alignment_indel
    insertion_seq = alignment.query_sequence[pos_read:pos_read+length]
TypeError: 'NoneType' object is not subscriptable

But since the final result doesn't contain information from the SEQ field, I think it would be more flexible if SVIM could run on SAM/BAM files with limited information (such as the CIGAR only).

Suggestion: Add BEDPE as an optional output

Hi!

I am currently trying to implement SVIM in a pipeline that requires BEDPE files as input for misassembly detection. At the moment I can manage converting the VCF to BEDPE, but I wonder if you might consider adding an option to output a BEDPE file directly from SVIM in the future.
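Until such an option exists, here is a minimal sketch of the conversion for intrachromosomal records that carry END in their INFO field (BEDPE uses 0-based, half-open coordinates; breakends and confidence intervals are left out):

```python
def vcf_to_bedpe(vcf_lines):
    """Convert simple intrachromosomal SV records (with END=... in INFO) to BEDPE lines."""
    out = []
    for line in vcf_lines:
        if line.startswith('#'):
            continue
        chrom, pos, name, _ref, _alt, score, _flt, info = line.split('\t')[:8]
        end = next((kv.split('=')[1] for kv in info.split(';') if kv.startswith('END=')), None)
        if end is None:
            continue                         # e.g. BND records, handled elsewhere
        start = int(pos) - 1                 # VCF is 1-based, BED is 0-based
        out.append('\t'.join([chrom, str(start), pos,
                              chrom, str(int(end) - 1), end,
                              name, score, '.', '.']))
    return out
```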

Suggestion/Request: add to PyPI and/or bioconda

Hi,

Not that I had issues with installing svim, but especially for novice users, adding your tool to PyPI and/or bioconda could make installation less painful. It would also make sure that your scripts are placed in $PATH.

Please let me know if you need some help to figure this out!

Cheers,
Wouter

The genotypes in the variants.vcf file are ./.

Hello,
I used SVIM to call SVs from HiFi reads. In the resulting variants.vcf file, the GT is ./. for all variants. My command is:

SVIM/bin/svim alignment ./ ${BAM} ${Reference} --min_mapq 20 --min_sv_size 30 --max_sv_size 100000 --minimum_depth 3 --sample ${SampleName} --sequence_alleles --insertion_sequences

I used HiFi reads at 20X coverage.
Do I need to set some additional parameters?

Best wishes~

Using svim alignment to genotype, all tandem duplications show missing values

Hello! When using svim alignment to genotype, all tandem duplications show missing genotype values.

  1. I generated simulated data with VISOR LASeR; here is my input:
    1 122200 125000 tandem duplication 2 10
    1 126200 128500 tandem duplication 2 10
    and then I obtained a 50× aligned dataset (BAM).
  2. I used SVIM to call SVs:
    svim alignment dup_svim DUP50.bam pathway/ref/human_hs37d5.chr1.fasta --min_sv_size 50 --minimum_depth 1 --minimum_score 1 --types DUP:TANDEM
  3. variants.vcf:
    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
    1 122204 svim.DUP_TANDEM.3 N DUP:TANDEM 4 PASS SVTYPE=DUP:TANDEM;END=125002;SVLEN=2799;SUPPORT=4;STD_SPAN=4.2;STD_POS=4.09 GT:CN:DP:AD ./.:2:.:.,.
    1 126200 svim.DUP_TANDEM.4 N DUP:TANDEM 42 PASS SVTYPE=DUP:TANDEM;END=128501;SVLEN=2302;SUPPORT=34;STD_SPAN=0.72;STD_POS=0.36 GT:CN:DP:AD ./.:2:.:.,.
  4. Why do these two DUP:TANDEM calls have missing genotypes (./.)? With the same BAM as input, other tools report 0/1, 0/0, or 1/1.
    Thank you very much!

complex events detection

Hi,

I am trying to use Sniffles and SVIM to search for complex SVs, and I am wondering if there are any recommended parameter settings for CCS reads.

Thanks!

Is SVIM suitable for detecting somatic SV?

Hi,
I am wondering whether I can run SVIM on a paired control-tumor sample and obtain somatic SVs by subtracting the normal sample's calls from the tumor sample's calls.
Or is SVIM designed only for germline SV detection?

Thanks,
Charlie
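The naive subtraction idea can be sketched as a position-based filter (this is not a proper somatic caller; the 500 bp matching window is an arbitrary assumption, and real somatic analysis should account for genotypes, support counts, and breakpoint uncertainty):

```python
def somatic_candidates(tumor_calls, normal_calls, window=500):
    """Keep tumor calls with no same-type normal call within `window` bp (naive subtraction)."""
    kept = []
    for t_chrom, t_pos, t_type in tumor_calls:
        matched = any(n_chrom == t_chrom and n_type == t_type
                      and abs(n_pos - t_pos) <= window
                      for n_chrom, n_pos, n_type in normal_calls)
        if not matched:
            kept.append((t_chrom, t_pos, t_type))
    return kept
```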

(Almost) no insertions

I am using SVIM to call SVs in a mammalian diploid genome with PacBio reads at around 60X. The genome was assembled with the same data. Because I have good depth, I am using a quality cut-off of 10 and calling SVs > 10 bp. This identifies 44K deletions and only 87 insertions with v1.3.

However, if I use an older version (1.1) with the same data it detects 46K deletions and 65K insertions.

I inspected a few INS from version 1.1 and they look correct to me.

I was wondering if this is an issue with the changes made in v1.3.1?

v1.3.1 command used:
svim alignment --min_sv_size 10 --max_sv_size 1000000 --insertion_sequences --sequence_alleles --interspersed_duplications_as_insertions --minimum_score 8 PB_to_BrahGenome20200102_N_1_D PB_to_BrahGenome20200102_N_1.default.bam BrahGenome20200102_N.fasta

v1.1 command used:
svim alignment --min_sv_size 10 --max_sv_size 1000000 --minimum_score 8 PB_to_BrahGenome20200102_N_1_D_svim1.1 PB_to_BrahGenome20200102_N_1.default.bam

Thanks.

SVIM tries to use deprecated Numpy feature

EXPLANATION:

  1. Extracted chromosome 1 from a large FASTQ
  2. Used minimap2 v2.17 to map chr1 onto the reference
  3. Sorted the BAM with sambamba
  4. Called SVs with SVIM v1.4.2
  5. Got the VisibleDeprecationWarning

COMMANDS:

minimap2 -a human_hs37d5.fasta GM24385_1[.chr1].fastq -z 600,200 -x map-ont --MD -Y -o out.sam
svim alignment . GM24385_1.chr1.sorted.bam human_hs37d5.fasta --min_sv_size 30

INPUT:

Reads - ftp://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/GM24385_1.fastq.gz
Reference - ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

FULL OUTPUT:

2021-02-18 11:58:53,050 [INFO ] PARAMETER: tandem_duplications_as_insertions, VALUE: False
2021-02-18 11:58:53,050 [INFO ] PARAMETER: interspersed_duplications_as_insertions, VALUE: False
2021-02-18 11:58:53,050 [INFO ] PARAMETER: read_names, VALUE: False
2021-02-18 11:58:53,050 [INFO ] PARAMETER: zmws, VALUE: False
2021-02-18 11:58:53,050 [INFO ] ****************** STEP 1: COLLECT ******************
2021-02-18 11:58:53,050 [INFO ] MODE: alignment
2021-02-18 11:58:53,051 [INFO ] INPUT: GM24385_1.chr1.sorted.bam
[E::idx_find_and_load] Could not retrieve index file for 'GM24385_1.chr1.sorted.bam'
2021-02-18 11:58:53,826 [WARNING] Input BAM file is missing a valid index. Please generate with 'samtools index'. Continuing without genotyping for now..
2021-02-18 11:59:29,171 [INFO ] Processed read 10000
2021-02-18 11:59:40,869 [INFO ] Processed read 20000
2021-02-18 11:59:49,003 [INFO ] Processed read 30000
2021-02-18 11:59:56,724 [INFO ] Processed read 40000
2021-02-18 12:00:04,810 [INFO ] Processed read 50000
2021-02-18 12:00:18,820 [INFO ] Processed read 60000
2021-02-18 12:00:31,692 [INFO ] Processed read 70000
2021-02-18 12:00:39,897 [INFO ] Processed read 80000
2021-02-18 12:00:47,652 [INFO ] Processed read 90000
2021-02-18 12:00:55,886 [INFO ] Processed read 100000
2021-02-18 12:01:05,996 [INFO ] Processed read 110000
2021-02-18 12:01:23,571 [INFO ] Found 37092 signatures for deleted regions.
2021-02-18 12:01:23,572 [INFO ] Found 30256 signatures for inserted regions.
2021-02-18 12:01:23,572 [INFO ] Found 26 signatures for inverted regions.
2021-02-18 12:01:23,572 [INFO ] Found 778 signatures for tandem duplicated regions.
2021-02-18 12:01:23,572 [INFO ] Found 2356 signatures for translocation breakpoints.
2021-02-18 12:01:23,572 [INFO ] Found 12 signatures for inserted regions with detected region of origin.
2021-02-18 12:01:23,572 [INFO ] ****************** STEP 2: CLUSTER ******************
2021-02-18 12:01:26,864 [INFO ] Clustered deleted regions: 11499 partitions and 15039 clusters
2021-02-18 12:01:30,957 [INFO ] Clustered inserted regions: 7197 partitions and 11356 clusters
2021-02-18 12:01:32,281 [INFO ] Clustered inverted regions: 8 partitions and 9 clusters
2021-02-18 12:01:32,383 [INFO ] Clustered tandem duplicated regions: 127 partitions and 189 clusters
2021-02-18 12:01:32,521 [INFO ] Clustered translocation breakpoints: 1495 partitions and 1522 clusters
2021-02-18 12:01:32,646 [INFO ] Clustered inserted regions with detected region of origin: 6 partitions and 6 clusters
2021-02-18 12:01:32,648 [INFO ] Finished clustering. Writing signature clusters..
2021-02-18 12:01:33,483 [INFO ] ****************** STEP 3: COMBINE ******************
2021-02-18 12:01:33,489 [INFO ] Combine inserted regions with translocation breakpoints..
2021-02-18 12:01:33,532 [INFO ] Create interspersed duplication candidates and flag cut&paste insertions..
2021-02-18 12:01:34,030 [INFO ] Cluster interspersed duplication candidates one more time..
2021-02-18 12:01:34,031 [INFO ] Clustered interspersed duplication candidates: 6 partitions and 6 clusters
2021-02-18 12:01:34,031 [INFO ] Write SV candidates..
2021-02-18 12:01:34,031 [INFO ] Final deletion candidates: 15039
2021-02-18 12:01:34,031 [INFO ] Final inversion candidates: 9
2021-02-18 12:01:34,032 [INFO ] Final interspersed duplication candidates: 6
2021-02-18 12:01:34,032 [INFO ] Final tandem duplication candidates: 189
2021-02-18 12:01:34,032 [INFO ] Final novel insertion candidates: 11276
2021-02-18 12:01:34,032 [INFO ] Final breakend candidates: 1522
2021-02-18 12:01:35,506 [INFO ] Draw plots..
/software/bioinformatics/svim-1.4.2/lib/python3.6/site-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)
2021-02-18 12:01:48,380 [INFO ] Done.

MemoryError

Hi,
When I run SVIM using the following command

svim alignment --sample darmor --min_mapq 30 --min_sv_size 10 variants/darmor/18 align/darmor/darmor.westar.sorted.REF_chrC08.bam westar.fa

but it raises a MemoryError; SVIM often uses more than 1 TB of memory, while each of my BAM files is only about 400 MB. Are there any suggestions for reducing the memory usage?

Thank you in advance!

Understanding the scoring?

Hello SVIM,

I just have a quick question about the score of each variant call and the values used as cut-offs, e.g. q5 and q10.

In the wiki it mentions:

Since the publication of our manuscript, we changed the score formula to the formula described **here.** The current formula puts more emphasis on the number of supporting reads than on the other features.

However, I don't see the formula used for scoring; am I missing something?

Kind Regards,

Alex Lim

bam index not found

Dear developers

I've tried to run svim alignment with a coordinate-sorted ngmlr BAM file which I'd indexed with samtools. The index files are in the same directory as the alignment files. At the genotyping stage, an error informs me that no index was found; see the log file content below. The same error occurs with my minimap2 alignments. It does produce a VCF file with variants, but no genotyping information for the sample.

(filenames and directories are anonymised)

2019-08-20 13:37:13,851 [INFO   ]  ****************** Start SVIM, version 1.2.0 ******************
2019-08-20 13:37:13,851 [INFO   ]  CMD: python3 /home/annew/.local/bin/svim alignment --min_sv_size 10 --minimum_depth 10 --sample sample . /scratch/alignments_to_ref/sample_nanopore_to_fungusref_ngmlr.sorted.bam /scratch/alignments_to_ref/fungusref.fasta
2019-08-20 13:37:13,852 [INFO   ]  WORKING DIR: /workdir/sample_SVIM-calls_nglmr
2019-08-20 13:37:13,852 [INFO   ]  PARAMETER: del_ins_dup_max_distance, VALUE: 1.0
2019-08-20 13:37:13,852 [INFO   ]  PARAMETER: segment_overlap_tolerance, VALUE: 5
2019-08-20 13:37:13,852 [INFO   ]  PARAMETER: duplications_as_insertions, VALUE: False
2019-08-20 13:37:13,852 [INFO   ]  PARAMETER: distance_normalizer, VALUE: 900
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: segment_gap_tolerance, VALUE: 10
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: minimum_depth, VALUE: 10
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: homozygous_threshold, VALUE: 0.8
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: minimum_score, VALUE: 3
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: read_names, VALUE: False
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: sample, VALUE: sample
2019-08-20 13:37:13,853 [INFO   ]  PARAMETER: heterozygous_threshold, VALUE: 0.2
2019-08-20 13:37:13,854 [INFO   ]  PARAMETER: max_sv_size, VALUE: 100000
2019-08-20 13:37:13,854 [INFO   ]  PARAMETER: trans_sv_max_distance, VALUE: 500
2019-08-20 13:37:13,854 [INFO   ]  PARAMETER: min_mapq, VALUE: 20
2019-08-20 13:37:13,854 [INFO   ]  PARAMETER: partition_max_distance, VALUE: 5000
2019-08-20 13:37:13,854 [INFO   ]  PARAMETER: skip_genotyping, VALUE: False
2019-08-20 13:37:13,854 [INFO   ]  PARAMETER: sequence_alleles, VALUE: False
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: trans_destination_partition_max_distance, VALUE: 1000
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: cluster_max_distance, VALUE: 0.3
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: sub, VALUE: alignment
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: bam_file, VALUE: /scratch/alignments_to_ref/sample_nanopore_to_fungusref_ngmlr.sorted.bam
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: types, VALUE: DEL,INS,INV,DUP_TAN,DUP_INT,BND
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: working_dir, VALUE: /local/tmp/annew/sample_SVIM-calls_nglmr
2019-08-20 13:37:13,855 [INFO   ]  PARAMETER: insertion_sequences, VALUE: False
2019-08-20 13:37:13,856 [INFO   ]  PARAMETER: min_sv_size, VALUE: 10
2019-08-20 13:37:13,856 [INFO   ]  PARAMETER: genome, VALUE: /alignments_to_ref/fungusref.fasta
2019-08-20 13:37:13,856 [INFO   ]  PARAMETER: trans_partition_max_distance, VALUE: 200
2019-08-20 13:37:13,856 [INFO   ]  ****************** STEP 1: COLLECT ******************
2019-08-20 13:37:13,856 [INFO   ]  MODE: alignment
2019-08-20 13:37:13,856 [INFO   ]  INPUT: /scratch/alignments_to_ref/sample_nanopore_to_fungusref_ngmlr.sorted.bam
2019-08-20 13:37:15,281 [INFO   ]  Processed read 10000

(...)

2019-08-20 13:38:00,210 [INFO   ]  Processed read 450000
2019-08-20 13:38:02,670 [INFO   ]  Found 106970 signatures for deleted regions.
2019-08-20 13:38:02,671 [INFO   ]  Found 37542 signatures for inserted regions.
2019-08-20 13:38:02,671 [INFO   ]  Found 243 signatures for inverted regions.
2019-08-20 13:38:02,671 [INFO   ]  Found 698 signatures for tandem duplicated regions.
2019-08-20 13:38:02,671 [INFO   ]  Found 17433 signatures for translocation breakpoints.
2019-08-20 13:38:02,671 [INFO   ]  Found 116 signatures for inserted regions with detected region of origin.
2019-08-20 13:38:02,671 [INFO   ]  ****************** STEP 2: CLUSTER ******************
2019-08-20 13:38:09,548 [INFO   ]  Clustered deleted regions: 20332 partitions and 34426 clusters
2019-08-20 13:38:13,878 [INFO   ]  Clustered inserted regions: 4769 partitions and 6662 clusters
2019-08-20 13:38:14,528 [INFO   ]  Clustered inverted regions: 173 partitions and 175 clusters
2019-08-20 13:38:14,614 [INFO   ]  Clustered tandem duplicated regions: 203 partitions and 239 clusters
2019-08-20 13:38:14,645 [INFO   ]  Clustered inserted regions with detected region of origin: 14 partitions and 14 clusters
2019-08-20 13:38:14,683 [INFO   ]  Finished clustering. Writing signature clusters..
2019-08-20 13:38:15,385 [INFO   ]  ****************** STEP 3: COMBINE ******************
2019-08-20 13:38:15,386 [INFO   ]  Cluster translocation breakpoints..
2019-08-20 13:38:17,001 [INFO   ]  Combine inserted regions with translocation breakpoints..
2019-08-20 13:38:17,027 [INFO   ]  Create interspersed duplication candidates and flag cut&paste insertions..
2019-08-20 13:38:18,819 [INFO   ]  Cluster interspersed duplication candidates one more time..
2019-08-20 13:38:18,822 [INFO   ]  Clustered interspersed duplication candidates: 14 partitions and 15 clusters
2019-08-20 13:38:18,828 [INFO   ]  ****************** STEP 4: GENOTYPE ******************
2019-08-20 13:38:18,828 [INFO   ]  Genotyping deletions..
2019-08-20 13:38:18,828 [ERROR  ]  fetch called on bamfile without index
Traceback (most recent call last):
  File "/home/annew/.local/bin/svim", line 165, in <module>
    sys.exit(main())
  File "/home/annew/.local/bin/svim", line 126, in main
    genotype(deletion_candidates, aln_file, "DEL", options)
  File "/home/annew/.local/lib/python3.5/site-packages/svim/SVIM_genotyping.py", line 49, in genotype
    alignment_it = bam.fetch(contig=contig, start=max(0, start-1000), stop=min(contig_length, end+1000))
  File "pysam/libcalignmentfile.pyx", line 1088, in pysam.libcalignmentfile.AlignmentFile.fetch
ValueError: fetch called on bamfile without index

Kind regards,

A. Webb

Wrong SVLEN values

Dear David,

I'm running SVIM 1.2.0 on some PacBio samples and I noticed that on a few occasions the SVLEN field shows wrong values. For example, this tandem duplication should be ~486kb but SVLEN is showing 972kb.

17 44165263 svim.DUP_TAN.511 N <DUP:TANDEM> 39 not_fully_covered SVTYPE=DUP:TANDEM;END=44651666;SVLEN=972806;SUPPORT=32;STD_SPAN=307.46;STD_POS=150.1 GT:DP:AD ./.:.:.,.

For other samples the SVLEN looks fine. Am I missing something?
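For records like this, a quick consistency check is to compare SVLEN against END - POS; the helpers below are my own sketch, not part of SVIM. Notably, 972806 is exactly twice the span 44651666 - 44165263 = 486403, so the length may be getting counted twice:

```python
def parse_info(info):
    """Split a VCF INFO string into a dict (flag entries map to True)."""
    fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value if value else True
    return fields

def svlen_matches_span(pos, info):
    """For symbolic DEL/DUP records, |SVLEN| should equal END - POS."""
    fields = parse_info(info)
    return abs(int(fields["SVLEN"])) == int(fields["END"]) - pos

info = "SVTYPE=DUP:TANDEM;END=44651666;SVLEN=972806;SUPPORT=32"
print(svlen_matches_span(44165263, info))  # False: the span is 486403, half of 972806
```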

Thanks!

Error at stage of GENOTYPE

Hi David
I mapped the long reads to the reference genome using minimap2, and then used SVIM to call SVs. The command line I used is "svim alignment $dir $bam". At the GENOTYPE stage, I got an error message: "RGBA values should be within 0-1 range". Any suggestions? Thanks
Edison

Define contigs in the header

Hi,

Me again. svim ran succesfully, and I'll do some more testing with bigger datasets (PromethION data).

Upon parsing the obtained vcf file (using cyvcf2) I get multiple warnings as below:

[W::vcf_parse] contig 'chr1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr10' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr11' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr12' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr13' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr14' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr14_GL000225v1_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr15' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr16' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr17' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr17_GL000205v2_random' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr18' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] contig 'chr19' is not defined in the header. (Quick workaround: index the file with tabix.)

The header of the vcf file looks indeed quite empty:

##fileformat=VCFv4.3
##source=SVIMV0.1
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
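As a workaround until the contig lines are emitted, the missing definitions can be patched in from the reference's samtools faidx index; a minimal sketch (the function names are my own, and it assumes a .fai with sequence name and length in the first two columns):

```python
def contig_header_lines(fai_rows):
    """Build ##contig header lines from the rows of a samtools faidx
    index (.fai): column 1 is the sequence name, column 2 its length."""
    lines = []
    for row in fai_rows:
        name, length = row.split()[:2]
        lines.append("##contig=<ID={},length={}>".format(name, length))
    return lines

def patch_vcf_header(vcf_lines, contig_lines):
    """Insert contig definitions right before the #CHROM line."""
    patched = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            patched.extend(contig_lines)
        patched.append(line)
    return patched

# e.g. contig_header_lines(open("hg38.fa.fai"))
```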

Cheers,
Wouter

Use case of your SVIM software

Hi David,

Thank you for making your SVIM software available to the research community.

I am starting on a new project of trying to discover SVs in assembled genomes of the same plant species.

It appears to me, from the details at https://github.com/eldariont/svim and https://github.com/eldariont/svim/wiki, that the inputs have to be reads rather than fully assembled genomes. Or can I use entire genomes? In that case, I would align the various genomes to one of them as a reference.

I mean, if I want to discover SVs across 20 individuals, then I'd choose the fasta assembly of 1 individual as my reference, and the fasta assemblies of the remaining 19 as queries. Is this possible in theory? And is this possible in practice?

Or can I break each of my 19 non-reference genomes into, let's say, 50 kb fake contiguous fasta reads?

Could you please advise me? Thanks in advance.

Sincerely,
Anand

Q: is it possible to allow users choose between NGMLR Minimap2?

Hi David,

I am reading the Wiki page and found that NGMLR is a required dependency in addition to minimap2.
Now since minimap2 also does alignment, I'd say it makes sense to give users the option to use minimap2 for mapping and alignment as well, therefore having one less dependency.

Of course, this is totally dependent on the assumptions that

  • SVIM doesn't depend on custom NGMLR tags in critical ways and
  • the two aligners don't give results so different that they affect the performance of SVIM

What do you think?

Steve

Long Walltime?

I have a 7x WGS ONT genome and my jobs crashed after 64 hours. Is this normal? I started with raw FASTQs, using default parameters except for the nanopore option.

Check existence of bam index upfront

I forgot to index my BAM (my mistake, too eager to get to the variants...) and fed it to SVIM.

SVIM happily accepted the BAM, but ultimately errored out

2020-04-01 17:28:51,738 [INFO   ]  ****************** STEP 3: COMBINE ******************
2020-04-01 17:28:51,767 [INFO   ]  Cluster translocation breakpoints..
2020-04-01 17:28:53,830 [INFO   ]  Combine inserted regions with translocation breakpoints..
2020-04-01 17:28:55,566 [INFO   ]  Create interspersed duplication candidates and flag cut&paste insertions..
2020-04-01 17:29:19,039 [INFO   ]  Cluster interspersed duplication candidates one more time..
2020-04-01 17:29:19,042 [INFO   ]  Clustered interspersed duplication candidates: 59 partitions and 70 clusters
2020-04-01 17:29:19,061 [INFO   ]  ****************** STEP 4: GENOTYPE ******************
2020-04-01 17:29:19,062 [INFO   ]  Genotyping deletions..
2020-04-01 17:29:19,062 [ERROR  ]  fetch called on bamfile without index
Traceback (most recent call last):
  File "/home/shuang/miniconda3/envs/svim/bin/svim", line 165, in <module>
    sys.exit(main())
  File "/home/shuang/miniconda3/envs/svim/bin/svim", line 126, in main
    genotype(deletion_candidates, aln_file, "DEL", options)
  File "/home/shuang/miniconda3/envs/svim/lib/python3.7/site-packages/svim/SVIM_genotyping.py", line 49, in genotype
    alignment_it = bam.fetch(contig=contig, start=max(0, start-1000), stop=min(contig_length, end+1000))
  File "pysam/libcalignmentfile.pyx", line 1092, in pysam.libcalignmentfile.AlignmentFile.fetch
ValueError: fetch called on bamfile without index

It would be nice to check for the existence of the index file (*.bam.bai and/or *.bai) upfront to save some time.
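Such a check could be a few lines run before Step 1 (a sketch; has_bam_index is my own name, and pysam's AlignmentFile.has_index() would be an alternative):

```python
import os

def has_bam_index(bam_path):
    """Return True if an index exists under any of the common namings:
    sample.bam.bai, sample.bai, or sample.bam.csi."""
    root, _ = os.path.splitext(bam_path)
    candidates = (bam_path + ".bai", root + ".bai", bam_path + ".csi")
    return any(os.path.exists(c) for c in candidates)
```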

Thanks!
Steve

Low number of interspersed duplications?

Hi,

I have recently aligned the CHM13 v1.0 assembly from the T2T consortium to the reference genome GRCh38.p13. I subsequently tried to detect SVs from the CHM13 alignment by using the following svim-asm command:

svim-asm haploid --max-sv-size 1000000 --sample CHM13 --reference_gap_tolerance 1000 --reference_overlap_tolerance 1000 --query_gap_tolerance 2000 --query_overlap_tolerance 2000 ./svim_chm13 chm13.bam hg38_p13.fa

This command results in only 4 interspersed duplications, which seems quite low to me given the completeness of CHM13. I am also a little bit confused by these duplications: when I look at them, the first 3 end at the same location on GRCh38.p13 but have very different lengths (all of them are PASS). Does that mean these are 3 different duplications occurring at different places in CHM13, or is it the same duplication but svim-asm cannot tell which of the 3 duplication lengths is correct?

Thank you for the help and the great software!

Guillaume

Underflow on variant position?

Hi,

I'm using SVIM v1.4.1, and I notice that for some of the random contigs and unplaced contigs (e.g. chr5_GL000208v1_random, chrUn_GL000226v1) I get surprisingly high coordinates, suspiciously always 4294967296 or 2^32. Building a normal tbi tabix index breaks for variants like that.

Below are some examples, I can share the full VCF for this sample if you want:

chr5_GL000208v1_random  4294967296      svim.BND.130629 N       ]chr5:47079731]N        7       PASS    SVTYPE=BND;SUPPORT=6;STD_POS1=.;STD_POS2=1.63   GT:DP:AD        ./.:.:.,.
chr17_KI270729v1_random 4294967296      svim.BND.311456 N       [chrX:66676029[N        1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.      GT:DP:AD        ./.:.:.,.
chr22_KI270736v1_random 4294967296      svim.BND.346279 N       ]chr22_KI270736v1_random:111780]N       1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.      GT:DP:AD        ./.:.:.,.
chrEBV  4294967296      svim.DUP_TANDEM.6691    N       <DUP:TANDEM>    1       not_fully_covered       SVTYPE=DUP:TANDEM;END=51803;SVLEN=51803;SUPPORT=1;STD_SPAN=.;STD_POS=.  GT:CN:DP:AD     ./.:2:.:.,.
chrUn_GL000226v1        4294967296      svim.DUP_TANDEM.6833    N       <DUP:TANDEM>    1       PASS    SVTYPE=DUP:TANDEM;END=15008;SVLEN=15008;SUPPORT=1;STD_SPAN=.;STD_POS=.  GT:CN:DP:AD     ./.:2:.:.,.
chrUn_KI270435v1        4294967296      svim.BND.347861 N       [chrY:10657300[N        1       PASS    SVTYPE=BND;SUPPORT=1;STD_POS1=.;STD_POS2=.      GT:DP:AD        ./.:.:.,.
chrUn_KI270435v1        4294967296      svim.BND.347860 N       ]chr16:34065991]N       2       PASS    SVTYPE=BND;SUPPORT=2;STD_POS1=.;STD_POS2=.      GT:DP:AD        ./.:.:.,.
chrUn_KI270590v1        4294967296      svim.DUP_TANDEM.6993    N       <DUP:TANDEM>    4       PASS    SVTYPE=DUP:TANDEM;END=2914;SVLEN=2914;SUPPORT=4;STD_SPAN=3.95;STD_POS=1.65      GT:CN:DP:AD     ./.:2:.:.,.

I checked the length of chr5_GL000208v1_random and that's only 92kb. So something is off here :)
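Until this is fixed, a defensive filter against such records can compare POS to the declared contig length (a sketch; the length in the example dict would come from the reference .fai, ~92 kb as noted above):

```python
def out_of_range(records, contig_lengths):
    """Yield (chrom, pos) records whose POS lies outside 1..contig length."""
    for chrom, pos in records:
        length = contig_lengths.get(chrom)
        if length is not None and not 1 <= pos <= length:
            yield chrom, pos

bad = list(out_of_range([("chr5_GL000208v1_random", 4294967296)],
                        {"chr5_GL000208v1_random": 92689}))
print(bad)  # the 2^32 record is flagged
```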

Cheers,
Wouter

[Feature request] Expose CLI for customizing batch size

Right now SVIM seems to process reads in batches of 10K and keep each batch in memory (judging from the log and the memory footprint).

Would it be possible to expose a CLI option allowing users to specify a bigger batch size, hence speeding up the process?

Thanks!
Steve

SVIM breaks with OSError: truncated file in Genotyping step

I have one bamfile for which svim repeatedly fails with the following error message:

2020-02-27 21:09:20,651 [INFO   ]  ****************** STEP 3: COMBINE ******************
2020-02-27 21:09:20,684 [INFO   ]  Cluster translocation breakpoints..
2020-02-27 21:09:25,173 [INFO   ]  Combine inserted regions with translocation breakpoints..
2020-02-27 21:09:25,720 [INFO   ]  Create interspersed duplication candidates and flag cut&paste insertions..
2020-02-27 21:12:28,804 [INFO   ]  Cluster interspersed duplication candidates one more time..
2020-02-27 21:12:28,809 [INFO   ]  Clustered interspersed duplication candidates: 26 partitions and 27 clusters
2020-02-27 21:12:28,844 [INFO   ]  ****************** STEP 4: GENOTYPE ******************
2020-02-27 21:12:28,844 [INFO   ]  Genotyping deletions..
2020-02-27 21:12:28,859 [ERROR  ]  truncated file
Traceback (most recent call last):
  File "/mnt/SRV018/projects/external/promethion/sv_lr_svim/venv/bin/svim", line 165, in <module>
    sys.exit(main())
  File "/mnt/SRV018/projects/external/promethion/sv_lr_svim/venv/bin/svim", line 126, in main
    genotype(deletion_candidates, aln_file, "DEL", options)
  File "/mnt/SRV018/projects/external/promethion/sv_lr_svim/venv/lib/python3.7/site-packages/svim/SVIM_genotyping.py", line 58, in genotype
    current_alignment = next(alignment_it)
  File "pysam/libcalignmentfile.pyx", line 2092, in pysam.libcalignmentfile.IteratorRowRegion.__next__
OSError: truncated file

To me this appears to be a problem with the input data; however, other analyses on the same bamfile worked without problems and samtools quickcheck reports no issues with the file. Also, it is strange that svim fails at such a late stage, after analysing the whole file. Do you have an idea what might cause the problem?

I am using a venv with python 3.7.3, pysam 0.15.3 and svim 1.2.0
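samtools quickcheck does verify the BGZF end-of-file block, so if it passes, the truncation may instead be in the .bai index (re-running samtools index is worth a try). For completeness, the EOF check itself is easy to reproduce; the 28-byte marker below is the one defined in the SAM/BGZF specification, but treat this as a sketch:

```python
# The 28-byte BGZF end-of-file marker from the SAM specification; a BAM
# that does not end with it was truncated during writing or copying.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)

def bam_has_eof(path):
    """Return True if the file ends with the BGZF EOF marker."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)               # jump to end of file
        if handle.tell() < len(BGZF_EOF):
            return False
        handle.seek(-len(BGZF_EOF), 2)  # back up over the last 28 bytes
        return handle.read(len(BGZF_EOF)) == BGZF_EOF
```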

population call with svim

Hi, many thanks for developing such a useful tool.

After using and comparing different kinds of SV calling software, I found that svim works best for SV calling in my project. However, I can only use it for an individual sample. Is there any function for population calling in svim? For example, sniffles + SURVIVOR for population SV calling.

Many thanks for your help.

Best,
Lin

Problem with installing through conda

Hi,

I'm about to test out this tool via

conda install -c bioconda svim

but received the following errors.
Can you help me please?
Thank you!

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: \
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed

UnsatisfiableError: The following specifications were found to be incompatible with each other:



Package biopython conflicts for:
svim -> biopython
Package ca-certificates conflicts for:
svim -> python=3.6 -> openssl=1.0 -> ca-certificates
python=3.7 -> openssl[version='>=1.1.1c,<1.1.2a'] -> ca-certificates
Package certifi conflicts for:
python=3.7 -> pip -> setuptools -> certifi[version='>=2016.09']
Package pip conflicts for:
svim -> python=3.6 -> pip
python=3.7 -> pip
Package wheel conflicts for:
svim -> python=3.6 -> pip -> wheel
python=3.7 -> pip -> wheel
Package setuptools conflicts for:
svim -> matplotlib -> setuptools
python=3.7 -> pip -> setuptools
Package python-dateutil conflicts for:
svim -> matplotlib -> python-dateutil
Package python conflicts for:
python=3.7

svim alignment error

I followed the evaluation step by step and I got an error at step 9:

9. Run SVIM on BAM file with distance threshold=1.4 for signatures to be merged into clusters:
svim alignment --cluster_max_distance 1.4 svim_wd aln/hs37d5.all.querysorted.bam

Running
svim alignment --cluster_max_distance 1.4 svim_wd aln/hs37d5.all.querysorted.bam
gives:
usage: svim alignment [-h] [--verbose] [--min_mapq MIN_MAPQ]
[--min_sv_size MIN_SV_SIZE] [--max_sv_size MAX_SV_SIZE]
[--segment_gap_tolerance SEGMENT_GAP_TOLERANCE]
[--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE]
[--partition_max_distance PARTITION_MAX_DISTANCE]
[--distance_normalizer DISTANCE_NORMALIZER]
[--cluster_max_distance CLUSTER_MAX_DISTANCE]
[--all_bnds]
[--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE]
[--trans_sv_max_distance TRANS_SV_MAX_DISTANCE]
[--skip_genotyping] [--minimum_score MINIMUM_SCORE]
[--homozygous_threshold HOMOZYGOUS_THRESHOLD]
[--heterozygous_threshold HETEROZYGOUS_THRESHOLD]
[--minimum_depth MINIMUM_DEPTH] [--sample SAMPLE]
[--types TYPES] [--sequence_alleles]
[--insertion_sequences]
[--tandem_duplications_as_insertions]
[--interspersed_duplications_as_insertions]
[--read_names] [--zmws]
working_dir bam_file genome
svim alignment: error: the following arguments are required: genome

How do I handle it?

Using SVIM for the detection of large somatic SVs

Hi David,

I'm trying to call large (>10kb) somatic structural variants in a tumour sample for which I have ~80X coverage of ONT reads.
From previous analysis conducted on this sample, I know there are at least 45 deletions, 10 inversions and 6 tandem duplications, all in the 10kb - 100Mb length range. The allele frequency of somatic variants is about 30% in this sample.

I have run SVIM (v0.5.0) with the following parameters:

svim alignment $WORKSPACE $BAM --sample $SAMPLE --min_mapq 1 --min_sv_size 10000 --max_sv_size 100000000

For the somatic deletions, the recall is not bad: 32 are detected. However only one of the inversions and none of the duplications are included in the results vcf.

I was wondering whether you had any suggestions for tweaking the parameters to improve sensitivity for the detection of large, somatic SVs?

Many thanks,
Hannah

Real world benchmarks

Hi!

I encourage you to use the latest HG002 datasets to investigate your FP and FN numbers using truvari and GIAB annotation.

PacBio offers

I benchmarked svim internally and didn't get great results. Just to give you a feeling, both public versions of sniffles and pbsv achieve >0.90 F1 score.

Happy developing!
Armin

undefined info in header

Me again :)

bcftools complains:

[W::vcf_parse] INFO 'STD_SPAN' is not defined in the header, assuming Type=String
[W::vcf_parse] INFO 'STD_POS' is not defined in the header, assuming Type=String

Cheers,
W

local variable 'end2' referenced before assignment

Hi there,
I am working with both ngmlr- and minimap-mapped reads, and although the ngmlr files are easily accepted, I find that the minimap BAMs sometimes run into this error during the combine step, depending on the read data set and whether it is sorted by name or not...

2019-02-06 10:38:27,976 [INFO ] ****************** STEP 3: COMBINE ******************
2019-02-06 10:38:27,976 [INFO ] Cluster translocation breakpoints..
2019-02-06 10:38:27,977 [INFO ] Combine inserted regions with translocation breakpoints..
2019-02-06 10:38:27,978 [INFO ] Create interspersed duplication candidates and flag cut&paste insertions..
Traceback (most recent call last):
File "/usr/local/bin/svim", line 224, in
sys.exit(main())
File "/usr/local/bin/svim", line 221, in main
combine_clusters(signature_clusters, options.working_dir, options, version, aln_file.references, aln_file.lengths, options.sample)
File "/usr/local/lib/python3.5/dist-packages/svim/SVIM_COMBINE.py", line 195, in combine_clusters
length2 = end2 - start2
UnboundLocalError: local variable 'end2' referenced before assignment

Attached in the log file also.
Any clues?

Thanks

SVIM_190206_103820.log

pysam error

Hi,

I am trying to run svim on some SAM or BAM files generated by minimap, but I always get the following error. Any idea how I can fix this?

2020-01-08 05:34:34,700 [INFO ] ****************** STEP 1: COLLECT ******************
2020-01-08 05:34:34,700 [INFO ] MODE: alignment
2020-01-08 05:34:34,700 [INFO ] INPUT: /staging/leuven/stg_00019/Nicolas/Nanopore_tatjana/nanopore1_minimap.bam
2020-01-08 05:34:34,758 [ERROR ] 'pysam.calignmentfile.AlignedSegment' object has no attribute 'get_cigar_stats'
Traceback (most recent call last):
File "/data/leuven/332/vsc33273/miniconda3/envs/NGS/bin/svim", line 165, in
sys.exit(main())
File "/data/leuven/332/vsc33273/miniconda3/envs/NGS/bin/svim", line 87, in main
sv_signatures = analyze_alignment_file_coordsorted(aln_file, options)
File "/data/leuven/332/vsc33273/miniconda3/envs/NGS/lib/python3.6/site-packages/svim/SVIM_COLLECT.py", line 138, in analyze_alignment_file_coordsorted
supplementary_alignments = retrieve_other_alignments(current_alignment, bam)
File "/data/leuven/332/vsc33273/miniconda3/envs/NGS/lib/python3.6/site-packages/svim/SVIM_COLLECT.py", line 47, in retrieve_other_alignments
if main_alignment.get_cigar_stats()[0][5] > 0:
AttributeError: 'pysam.calignmentfile.AlignedSegment' object has no attribute 'get_cigar_stats'
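The module path pysam.calignmentfile in the traceback suggests a very old pysam; recent releases use pysam.libcalignmentfile and do provide get_cigar_stats, so upgrading pysam in that environment (e.g. pip install --upgrade pysam) should help. A hedged guard, with the 0.15 floor being my assumption based on the versions reported working in other issues here:

```python
def version_at_least(version, minimum):
    """Numerically compare dotted version strings, e.g. '0.15.3' >= '0.15'."""
    def parts(v):
        return [int(x) for x in v.split(".") if x.isdigit()]
    return parts(version) >= parts(minimum)

# Usage sketch before starting an analysis:
# import pysam
# if not version_at_least(pysam.__version__, "0.15"):
#     raise SystemExit("pysam too old; run: pip install --upgrade pysam")
```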

Can I run SVIM with multi threads?

Hi,
SVIM is a good SV caller and is easy to run. Can I run it with multiple threads, and if so, how?
Hope your reply.
Thank you!

error with /bin/sh and set -o pipefail

Python's subprocess run command uses /bin/sh by default when shell=True. Not every Linux installation's /bin/sh supports "set -o pipefail". Consider updating the run calls with the executable flag set:

run(" ".join(command), shell=True, check=True, executable='/bin/bash')

order of reads in bam file

I believe this part of your wiki is outdated (https://github.com/eldariont/svim/wiki#bam-input):

Currently, read alignments need to be sorted by read name (samtools sort -n).

Same for the command line help information:

bam_file SAM/BAM file with aligned long reads (should be
queryname-sorted with samtools sort -n)

In addition, the wiki contains python svim.py usage, which can probably be replaced by just svim

SV types in output VCF

Hi,

In your WIKI you describe five different SVTYPEs: DEL, INV, DUP:TANDEM, DUP:INT, and INS:NOVEL.

I get all the above plus:

]chr19:107955]N
N[chr19:107969[
N[chr19:111124[
N[chr6:69267[
]chr1:278176]N

all SVTYPE=BND

What are these? Rubbish to filter out?

Liam
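These are breakend (BND) records as defined in the VCF specification: each line encodes one side of a novel adjacency (e.g. a translocation breakpoint), with the bracket direction giving the orientation of the joined pieces, so they are not rubbish. A sketch of pulling out the mate coordinate (the regex and function names are my own):

```python
import re

# VCF breakend ALT forms: t[p[, t]p], [p[t, ]p]t  (t = local sequence,
# p = mate position; bracket direction encodes how the ends are joined).
BND_ALT = re.compile(r"[\[\]](?P<chrom>[^:\[\]]+):(?P<pos>\d+)[\[\]]")

def bnd_mate(alt):
    """Return (chrom, pos) of the mate breakend, or None for non-BND ALTs."""
    match = BND_ALT.search(alt)
    if match is None:
        return None
    return match.group("chrom"), int(match.group("pos"))

print(bnd_mate("]chr19:107955]N"))  # ('chr19', 107955)
```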
