googlingthecancergenome / sv-callers

Snakemake-based workflow for detecting structural variants in genomic data

Home Page: https://research-software.nl/software/sv-callers

License: Apache License 2.0

Python 93.66% Shell 6.34%
bioinformatics structural-variants sv-calling germline-variants somatic-variants cancer-genomics wgs workflow snakemake hpc-applications

sv-callers's Introduction

sv-callers

Badges: DOI · Published in PeerJ · CI · Codacy

Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases. sv-callers is a Snakemake-based workflow that combines several state-of-the-art tools for detecting SVs in whole-genome sequencing (WGS) data. The workflow is easy to use and deploy on any Linux-based machine. In particular, it supports automated software deployment, easy configuration, and the addition of new analysis tools, and it scales from a single computer to different HPC clusters with minimal effort.

Dependencies

  • Python
  • Conda - package/environment management system
  • Snakemake - workflow management system
  • Xenon CLI - command-line interface to compute and storage resources
  • jq - command-line JSON processor (optional)
  • YAtiML - library for YAML type inference and schema validation

The workflow includes the following bioinformatics tools:

  • SV callers: Manta, DELLY, LUMPY, GRIDSS
  • post-processing: SURVIVOR, BCFtools

The software dependencies can be found in the conda environment files: [1],[2],[3].

1. Clone this repo.

git clone https://github.com/GooglingTheCancerGenome/sv-callers.git
cd sv-callers

2. Install dependencies.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n wf -f environment.yaml
conda activate wf
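
As a quick sanity check (an optional step, not part of the original instructions), confirm that the key tools are available in the activated environment:

snakemake --version
mamba --version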

3. Configure the workflow.

  • config files:

    • analysis.yaml - analysis-specific settings (e.g., workflow mode, I/O files, SV callers, post-processing, resources used, etc.)
    • samples.csv - list of (paired) samples (see the sketch after this list)
  • input files:

    • example data in workflow/data directory
    • reference genome in .fasta (incl. index files)
    • excluded regions in .bed (optional)
    • WGS samples in .bam (incl. index files)
  • output files:

    • (filtered) SVs per caller and merged calls in .vcf (incl. index files)
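
A minimal sketch of samples.csv for a paired-sample (somatic) run, assuming the TUMOR/NORMAL column convention described in the issues below and the tumor/normal BAMs (T3/N3) that appear in the example data directory; adjust paths and sample names to your own data:

cat > samples.csv <<'EOF'
PATH,TUMOR,NORMAL
data/bam/3,T3,N3
EOF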

4. Execute the workflow.

cd workflow

Locally

# 'dry' run only checks I/O files
snakemake -np

# 'vanilla' run: if echo_run is set to 1 (default) in analysis.yaml,
# the workflow merely mimics the execution of the SV callers by writing (dummy) VCF files;
# actual SV calling is performed if echo_run is set to 0
snakemake --use-conda --jobs

Submit jobs to Slurm or GridEngine cluster

SCH=slurm   # or gridengine
snakemake  --use-conda --latency-wait 30 --jobs \
--cluster "xenon scheduler $SCH --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 1 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log" &>smk.log&

Note: one sample or a tumor/normal pair generates 18 SV calling and post-processing jobs in total. See the workflow instance of single-sample (germline) or paired-sample (somatic) analysis.

To perform SV calling (an example command is given after this list):

  • edit (default) parameters in analysis.yaml

    • set echo_run to 0
    • choose between two workflow modes: single- (s) or paired-sample (p - default)
    • select one or more callers using enable_callers (default all)
  • use xenon CLI to set:

    • --max-run-time of workflow jobs (in minutes)
    • --temp-space (optional, in MB)
  • adjust compute requirements per SV caller according to the system used:

    • the number of threads,
    • the amount of memory (in MB),
    • the amount of temporary disk space or tmpspace (path in the TMPDIR env variable); it can be used for intermediate files by LUMPY and GRIDSS only.
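
For example, a full SV-calling run with the mode and callers selected on the command line could look like this (the command mirrors those reported in the issues below; adjust the mode and callers to your analysis):

snakemake -C echo_run=0 mode=s enable_callers="['manta','delly','lumpy','gridss']" --use-conda --jobs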

Query job accounting information

SCH=slurm   # or gridengine
xenon --json scheduler $SCH --location local:// list --identifier [jobID] | jq ...

sv-callers's People

Contributors

arnikz, codacy-badger, cshneider-zz, sverhoeven


sv-callers's Issues

GermlineSV Final Genotype Merged callsets

GermlineSV calling: LOCAL MACHINE

I tried the following things:

Input Data: sample_1, sample_2

Edited File:

/sv-callers/snakemake/samples.csv

For 2 samples I edited it as follows:

-1-

PATH,SAMPLE1,SAMPLE2
data/bam/3,chr22_1,chr22_2

(But when I edit it like this, the job runs for 1 sample and only 1 folder is created, so I tried a 2nd way.)

Another way I tried to edit samples.csv:

-2nd way-
PATH,SAMPLE1,SAMPLE2
data/bam/3,chr22_1
data/bam/3,chr22_2

After this correction to the samples.csv file, I executed the command "snakemake -C echo_run=0 mode=s enable_callers="['manta','delly','lumpy','gridss']" --use-conda"

Results:
/sv-callers/snakemake/data/bam/3
chr22_1(FOLDER1) chr22_1.bam chr22_1.bam.bai chr22_2 (FOLDER2) chr22_2.bam chr22_2.bam.bai N3.bam N3.bam.bai sample_testrun T3.bam T3.bam.bai

DOUBTS:

  1. Inside these folders, chr22_1 (FOLDER1) and chr22_2 (FOLDER2), an "all.vcf" has been generated in each corresponding folder; the SVs of all callers are in all.vcf (will sv-callers perform a merge?).
  2. I could not find the final joint-calling VCF file for the 2 samples; I only have the information per individual sample.

Eg: if I run Delly separately, the final merged genotype VCF contains information for both samples (#CHROM POS ... INFO Sample1 Sample2). I got similar results when I ran Manta (diploid_svcalling.vcf contains information for the 2 samples).

But when I run sv-callers for a 2-sample test run, I do not get the combined genotype callset information for the 2 samples in any caller's "all.vcf". The information appears sample-wise only.

I think I am making mistakes while editing the samples.csv file for germline SV calling. Could you please tell me how to edit the samples.csv file, particularly for germline SV calling?

Thanks for the help.

Error while loading shared libraries: libcrypto.so.1.0.0

bcftools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

bcftools (1.9, h47928c2_2, bioconda)

  • curl (7.64.0, h646f8bb_2, conda-forge)
    • libcurl (7.64.0, h541490c_2, conda-forge)
      • openssl (1.1.1a, h14c3975_1000, conda-forge) -> downgrade to v1.0

samtools v1.9 (h91753b0_5, bioconda)
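
A possible workaround, sketched under the assumption that the affected tools live in a conda environment you can modify (channel and version pin are assumptions): pin openssl to the 1.0 series so that libcrypto.so.1.0.0 is available again.

# activate the environment that provides bcftools/samtools first, then:
conda install -c conda-forge "openssl=1.0.*"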

Change the semantics of the workflow modes

Currently, the workflow (dev) supports somatic (using paired samples) or germline (using one sample) analysis. However, Manta supports single-sample analysis using a germline or tumor-only model. Therefore, it would be better to rename the workflow modes to single- and paired-sample analysis and to include a Manta-specific option for tumor-only analysis.

Update test data for the workflow

Add a proposal for a new data release (v1.2.0) to the wiki. Perhaps we should also consider artificial data generated by sv-gen.

  • select input data
  • run the sv-callers workflow on
    • NA24385/HG002 sample
    • CHM1_CHM13 sample !!! Review LUMPY data !!!
  • update Jupyter Notebooks
    • example_1
    • example_2
  • document reference genomes used !!! CROSS-CHECK PROVENANCE !!!
    • NA12878 - b37_human_decoy_reference.fasta how the FASTA file was created step-by-step (b57a59692e6d189eb99fce8d14589def)
      Note: the b37 reference genome of the Broad Institute, confusingly named Homo_sapiens_assembly19.fasta, is not reachable from the Google Cloud.
    • NA24385 - hs37d5.fa (12a0bed94078e2d9e8c00da793bbc84e after decompress; ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)
    • CHM1_CHM13 - Homo_sapiens_assembly19.fasta (886ba1559393f75872c1cf459eb57f2d)
    • COLO829 - Homo_sapiens.GRCh37.GATK.illumina.fasta (be672f01428605881b2ec450d8089a62) is in fact composed of the chromosomes (1 to 22, X, Y and MT) downloaded from Ensembl (ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/), concatenated and with the header containing only the chromosome name (for instance '1' for chromosome 1).
  • create a ZIP archive from I/O files and upload the archive to Zenodo
    • data path: /project/gcg/Data/notebooks-data

Processing trios for de novo mutations

This would require:

  • a three-column samples.csv format with mother, father, child (see the sketch after this list);
  • a post-processing step in which the overlap of the SV callsets resulting from the mother-child and father-child comparisons is computed. The overlap could be computed with StructuralVariantAnnotation (SVA).
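
A minimal sketch of what such a three-column samples.csv could look like (the column names and paths are hypothetical; this layout is part of the proposal, not something the current workflow accepts):

cat > samples.csv <<'EOF'
PATH,FATHER,MOTHER,CHILD
data/bam/trio1,father,mother,child
EOF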

LUMPY.vcf missing contig info in the header

bcftools raises warnings:

[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '3' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '4' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '5' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '6' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '7' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '8' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '9' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '10' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '11' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '12' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '13' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '14' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '15' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '16' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '17' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '18' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '19' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '20' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '21' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig '22' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'X' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'Y' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse] Contig 'MT' is not defined in the header. (Quick workaround: index the file with tabix.)

A quick fix:

awk '{printf "##contig=<ID=%s,length=%s>\n", $1, $2}' catalog.fa.fai > contigs.txt
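
One way to then add these lines to the LUMPY VCF header, as a sketch (the input and output file names are assumptions; bcftools annotate with --header-lines appends the given lines to the header):

bcftools annotate --header-lines contigs.txt -o lumpy_fixed.vcf lumpy.vcf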

Related issue: arq5x/lumpy-sv#211

sv-callers Installation in local machine and input files

Hi,

I would like to use sv-callers for calling germline SVs from WGS data. I have the following doubts:

  1. Can this tool be installed on a local CentOS 7 machine? And do the SV callers such as Manta, Delly, LUMPY and GRIDSS and the other tools (bcftools, SURVIVOR) have to be installed separately, or are they part of this repository (sub-modules already built into sv-callers)?

  2. The version of Manta is 1.1.0; does the current version of sv-callers support updated versions of Manta or GRIDSS?

  3. Can "cram" files be used as input?

  4. The samples are aligned to the GRCh38 reference; does sv-callers provide excluded regions in a .bed file for this reference genome?

Sorry for all the questions, let me know if there is a better place to ask them, person to email, etc.

Thanks in advance!
Nitha

HPC usage issue on Slurm?

Is there a way to specify a nodelist / specific nodes to run the jobs on Slurm for sv-callers using the xenon scheduler?

snakemake -C echo_run=0 mode=s enable_callers="['manta','delly','lumpy','gridss']" --use-conda --latency-wait 30 --jobs 14 \
--cluster 'xenon scheduler $SCH --location local:// submit --queue {slurm-partition-name} --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 1 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log' &>smk.log&
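
One detail worth flagging (an observation about shell quoting, not something stated in the issue): with the --cluster string in single quotes, $SCH is not expanded by the shell, so the scheduler name never reaches xenon. The README example above uses double quotes, e.g.:

SCH=slurm   # or gridengine
snakemake -C echo_run=0 mode=s enable_callers="['manta','delly','lumpy','gridss']" --use-conda --latency-wait 30 --jobs 14 \
--cluster "xenon scheduler $SCH --location local:// submit --queue {slurm-partition-name} --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 1 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log" &>smk.log&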

Update in-line doc

# This workflow runs an ensemble of structural variant (SV) callers on a set of
# paired samples (BAM files, tumor vs. normal) provided in a CSV file.
#
# The workflow uses two config files:
# analysis.yaml - set input/output files, SV callers' params, requirements etc.
# environment.yaml - set software dependencies incl. versions

v1.1.0 supports both single- and paired-sample analyses
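
A possible updated wording for that header comment, sketched to reflect both modes (the phrasing is a suggestion only):

# This workflow runs an ensemble of structural variant (SV) callers on single
# samples (germline) or paired samples (tumor/normal, somatic) provided as BAM
# files listed in a CSV file.
#
# The workflow uses two config files:
# analysis.yaml - set input/output files, SV callers' params, requirements etc.
# environment.yaml - set software dependencies incl. versions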

Change the semantics for naming samples

The workflow supports both germline and somatic modes. In the samples.csv file (and related code), the column names TUMOR/NORMAL are used for paired samples (somatic) and TUMOR alone for a single sample (germline).
Would it be better to rename them to SAMPLE1/SAMPLE2 and SAMPLE instead?

GermlineSV_calling in WGS (centos7 local machine)

Hi,

I tried a test run of germline SV calling (mode=s) with sv-callers for 2 samples, but it was not successful; it kept running for more than 30 hrs. The Delly BND call was running for >40 hrs, so I killed the jobs.

  1. What is the importance of using the exclude_regions: 1 file? Is it only used to speed up the analysis, or is it also used for detecting exact breakpoints in reads?

  2. Is the exclude_regions file (data/ENCFF001TDO.bed) specific to a reference genome version? I saw that the data set you used is based on the hg19 genome, but all my samples are aligned to GRCh38.

  3. In case I don't have an ENCFF001TDO.bed for the GRCh38 reference and I set exclude_regions=0 in the analysis.yaml file, does this modification create an accuracy problem in germline SV calling? Do you provide any excluded-region files for hg38?

  4. Delly provides an exclude-region file for hg38 in .tsv format (human.hg38.excl.tsv); can this file be used in sv-callers instead of ENCFF001TDO.bed?

  5. Could you please brief me on the files to be modified before running sv-callers for germline SV calling, and particularly on editing samples.csv (germline)? E.g., please give an idea of how the .csv file looks for 3 samples in germline SV calling.

If you could provide detailed steps separately in the README for (1) germline SV calling (WGS) and (2) somatic SV calling (WGS), it would be very useful for users of the tool to understand the workflow.

Thank you.

Add optional use of BED file with blacklisted regions

Exclusion lists are available in BED format from ENCODE or the Delly team. Note: the reference human genome GRCh37 = hg19 in UCSC's nomenclature.

SV caller's command-line options:

  • Manta: --callRegions FILE.bed but the inverse option is needed (not on the roadmap)
  • Delly: -x FILE.bed
  • Lumpy: -x FILE.bed
  • GRIDSS: BLACKLIST=FILE.bed

Speed up conda install

It might take >40 min for the workflow to install the callers and post-processing tools.
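
One possible mitigation, assuming a Snakemake version that supports the option: let Snakemake use mamba as the conda front-end when creating the per-rule environments.

snakemake --use-conda --conda-frontend mamba --jobs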

MAX_HEAP calculation fails if `bc` is not available

The gridss rule is failing for me with

Invalid maximum heap size: -Xmx0g

This seems to be because bc is not available in my PATH. It is available on conda-forge, so it could be included in environment.yaml.
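
A possible bc-free workaround, sketched with hypothetical values rather than the workflow's actual rule code: derive the Java heap size from the memory limit with awk, which is typically present on Linux systems.

MEM_MB=16384
MAX_HEAP=$(awk -v m="$MEM_MB" 'BEGIN { printf "%d", m/1024 }')
echo "-Xmx${MAX_HEAP}g"   # -> -Xmx16g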

Filter GRIDSS output for somatic SVs

The GRIDSS (v1.3.4) bioconda package does not include the R script to filter the calls obtained by paired (tumor/normal) analysis. Use bcftools (>=1.8) for filtering variants instead of the R script.
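
A minimal sketch of the bcftools-based approach (the exact somatic filtering criteria are not specified here, and the file names are assumptions); as a starting point, keep only records that pass GRIDSS's own filters:

bcftools view -f PASS -o gridss_somatic.vcf gridss.vcf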

LSF cluster usage

Hi,
Could you provide an example command to submit a job to an LSF cluster?
I am not a specialist in job schedulers.

Thanks.

HPC cluster usage

Hi @arnikz ,

When we ran sv-callers for 3 samples as a test run using a Slurm-based script, we found that the jobs were being scheduled on a single node [i.e., gpunode]. This takes longer than needed to process the data set, with hyperthreading enabled. Note: OpenMP & pyFlow have been enabled in the conda env.

Doubts:

  1. For three samples, should the maximum number of jobs be 14 or less than that? What does the flag --jobs 14 indicate?
  2. Can the xenon scheduler be mapped to various compute nodes?
  3. Can the --max-run-time and the location for Slurm be changed?

Jobscript used:

#!/bin/bash

SCH=slurm   # or gridengine
snakemake -C echo_run=0 mode=s enable_callers="['manta','delly','lumpy','gridss']" --use-conda --latency-wait 30 --jobs 14 \
--cluster 'xenon scheduler $SCH --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 1440 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log' &>smk.log&

What may be the problem? Could you help with this issue?

Thank you in advance.
