ghga-de / nf-platypusindelcalling

This page is reserved for the Nextflow-based Indel Calling Workflow (with Platypus) from DKFZ

License: MIT License

Nextflow 26.20% HTML 0.95% Python 38.89% Groovy 12.18% Perl 16.25% R 3.92% Shell 1.39% Dockerfile 0.24%

annotation annotations nextflow nextflow-pipeline pipeline platypus variant-calling workflow indel-calling dkfz

nf-platypusindelcalling's Introduction

Introduction

nf-platypusindelcalling: A Platypus-based insertion/deletion-detection workflow with extensive quality control additions. The workflow is based on the DKFZ - ODCF OTP Indel Calling Pipeline.

For now, this workflow is optimized only for the ODCF Cluster. The config file (conf/dkfz_cluster.config) can be used as an example. The Annotation, DeepAnnotation, Filter and Tinda steps are optional and can be turned off using the runIndelAnnotation, runIndelDeepAnnotation, runIndelVCFFilter and runTinda parameters, respectively.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

This Nextflow pipeline is a port of DKFZ-ODCF/IndelCallingWorkflow.

Important Notice: For now, the whole workflow is only ready for DKFZ cluster users, who are strongly recommended to read the whole documentation before use. On the cluster, this workflow works better with nextflow/22.07.1-edge. It is recommended to use >22.07.1.

Pipeline summary

The pipeline has 6 main steps: indel calling using Platypus, basic annotations, deep annotations, filtering, sample swap check and MultiQC report.

Indel Calling:

Platypus (Platypus): Platypus is used to call variants using local realignments and local assemblies. It can detect SNPs, MNPs, short indels, replacements, and deletions up to several kb. It can be used with both WGS and WES data. The tool has been thoroughly tested on data mapped with Stampy and BWA.
Basic Annotations (--runIndelAnnotation True):

In-house scripts to annotate with several databases like gnomAD, dbSNP, and ExAC.

ANNOVAR (Annovar): annotate_variation.pl is used to annotate variants. The tool classifies variants as intergenic, intronic, nonsynonymous SNP, frameshift deletion or large-scale duplication regions.

ENSEMBL VEP (Ensembl VEP): can be used as an alternative to ANNOVAR. Gene annotations will be extracted.

Reliability and confidence annotations: An optional step that checks mappability, HiSeq, self-chain and repeat regions for reliability and confidence annotation.
Deep Annotation (--runIndelDeepAnnotation True):

If basic annotations are applied, an optional extra step can add further indel annotations from databases such as enhancer, COSMIC, miRBase and ENCODE.
Filtering and Visualization (--runIndelVCFFilter True):

This step is optional. Filtering is only required for tumor samples without a control, and it can only be applied if basic annotation has been performed.

Indel Extraction and Visualizations: Indels can be extracted above a given minimum confidence level.

Visualization and json reports: Extracted INDELs are visualized and analytics of INDEL categories are reported as JSON.
Check Sample Swap (--runTinda True):

Canopy-based clustering and bias filter. This step can only be applied to tumor samples with a control.
MultiQC (--skipmultiqc False):

Produces pipeline level analytics and reports.
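
The dependencies between the optional steps described above can be sketched as a small pre-flight check. This is only an illustration in Python: the function names are hypothetical, and whether the pipeline itself enforces these rules in the same way is an assumption. The parameter names and the `True`/`False` value style are taken from the summary above.

```python
def check_step_flags(run_annotation, run_deep_annotation, run_filter,
                     run_tinda, has_control):
    """Return a list of problems with the requested step combination."""
    problems = []
    # Deep annotation and filtering both build on the basic annotation step.
    if run_deep_annotation and not run_annotation:
        problems.append("deep annotation requires basic annotation")
    if run_filter and not run_annotation:
        problems.append("filtering requires basic annotation")
    # The sample swap check (Tinda) needs a tumor sample with a control.
    if run_tinda and not has_control:
        problems.append("the sample swap check requires a control sample")
    return problems


def to_cli_args(run_annotation, run_deep_annotation, run_filter, run_tinda):
    """Render the four switches as '--name True/False' argument pairs."""
    flags = {
        "runIndelAnnotation": run_annotation,
        "runIndelDeepAnnotation": run_deep_annotation,
        "runIndelVCFFilter": run_filter,
        "runTinda": run_tinda,
    }
    return [part for name, value in flags.items()
            for part in (f"--{name}", str(value))]


# Example: a tumor-only sample with annotation and filtering switched on.
print(check_step_flags(True, False, True, False, has_control=False))
print(" ".join(to_cli_args(True, False, True, False)))
```

The resulting argument list could be appended to a `nextflow run main.nf ...` command once the check returns no problems.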

Quick Start

Install Nextflow (>=21.10.3)
Install either Docker or Singularity (you can follow this tutorial)
Download Annovar and set up a suitable annotation table directory to perform annotation. Example:

annotate_variation.pl -downdb wgEncodeGencodeBasicV19 humandb/ -build hg19

Gene annotation is also possible with the ENSEMBL VEP tool. For test purposes only, it can be used online, but for large analyses it is recommended to either download the cache file or use the --download_cache flag in the parameters.

Follow the documentation here

Example:

Download cache

cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
tar xzf homo_sapiens_vep_110_GRCh38.tar.gz

Download the pipeline and test it on a minimal dataset with a single command:
```
git clone https://github.com/ghga-de/nf-platypusindelcalling.git
```

Before running, make the scripts in the bin directory executable:

chmod +x bin/*

nextflow run main.nf -profile test,YOURPROFILE --outdir <OUTDIR> --input <SAMPLESHEET>

Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.

The pipeline comes with config profiles called docker and singularity which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.

Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.

If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.

Simple test run

nextflow run main.nf --outdir results -profile singularity,dkfz_cluster_38

Start running your own analysis!

nextflow run main.nf --input samplesheet.csv --outdir <OUTDIR> -profile <docker/singularity> --config test/institute.config

Samplesheet columns

sample: The sample name, which will be used to tag the job

tumor: The path to the tumor file

tumor_index: The path to the tumor index file

control: The path to the control file. Leave blank if there is no control.

control_index: The path to the control index file. Leave blank if there is no control.
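
As an illustration of the columns above, the following Python sketch writes a samplesheet with one tumor/control pair and one tumor-only sample. It assumes a comma-separated file with a header row (the file is called samplesheet.csv in the command above) and BAM inputs with .bai indexes. The sample names and paths are hypothetical.

```python
import csv
import io

# Column names taken from the list above.
COLUMNS = ["sample", "tumor", "tumor_index", "control", "control_index"]


def build_samplesheet(rows):
    """Return samplesheet CSV text; missing control fields are left blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample names and paths, for illustration only.
rows = [
    {"sample": "patient1",
     "tumor": "/data/p1_tumor.bam",
     "tumor_index": "/data/p1_tumor.bam.bai",
     "control": "/data/p1_control.bam",
     "control_index": "/data/p1_control.bam.bai"},
    # Tumor-only sample: control and control_index stay empty.
    {"sample": "patient2",
     "tumor": "/data/p2_tumor.bam",
     "tumor_index": "/data/p2_tumor.bam.bai"},
]

samplesheet_text = build_samplesheet(rows)
print(samplesheet_text)
```

Writing `samplesheet_text` to a file gives something that can be passed to `--input`, provided the real paths exist.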

Data Requirements

Annotations are optional for the user. All VCF and BED files need to be indexed with tabix and should be in the same folder!
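
A quick way to check this requirement before a run is sketched below in Python. It assumes that the tabix index of each file is named `<file>.tbi` and sits next to the file, which is the tabix default. The function names and the example paths are hypothetical.

```python
from pathlib import Path


def missing_tabix_indexes(annotation_files):
    """Return the annotation files that have no '<file>.tbi' index next to them."""
    missing = []
    for name in annotation_files:
        path = Path(name)
        index = path.with_name(path.name + ".tbi")
        if not index.is_file():
            missing.append(str(path))
    return missing


def share_one_folder(annotation_files):
    """Return True if all annotation files live in the same directory."""
    return len({Path(name).parent for name in annotation_files}) <= 1


# Hypothetical annotation files, for illustration only.
files = ["/annotations/dbsnp_indels.vcf.gz", "/annotations/repeatmasker.bed.gz"]
print("same folder:", share_one_folder(files))
print("missing index:", missing_tabix_indexes(files))
```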

The reference set bundle used in the PCAWG study can be found and downloaded here. (NOTE: only in hg19)

Basic Annotation Files

dbSNP INDELs (vcf)
1000K INDELs (vcf)
gnomAD Genome Sites for INDELs (vcf)
gnomAD Exome Sites for INDELs (vcf)
EVS variants (vcf)
ExAC variants (vcf)
Local Control files WGS (vcf)
Local Control files WES (vcf)

SNV Reliability Files

UCSC Repeat Masker region (bed)
UCSC Mappability regions (bed)
UCSC Simple tandem repeat regions (bed)
UCSC DAC Black List regions (bed)
UCSC DUKE Excluded List regions (bed)
UCSC Hiseq Deep sequencing regions (bed)
UCSC Self Chain regions (bed)

Deep Annotation Files

UCSC Enhancers (bed)
UCSC CpG islands (bed)
UCSC TFBS noncoding sites (bed)
UCSC Encode DNAse cluster (bed.gz)
snoRNAs miRBase (bed)
miRBase (bed)
Cosmic coding SNVs (bed)
miRNA target sites (bed)
Cgi Mountains (bed)
UCSC Phast Cons Elements (bed)
UCSC Encode TFBS (bed)

Reference Usage

This pipeline favors the use of igenomes and refgenie. Read the documentation here to learn more.

For igenomes usage: use genomes GRCh37 (--genome "GRCh37") or GRCh38 (--genome "GRCh38").

For refgenie usage: use genomes GRCh37 (--genome "hg37") or GRCh38 (--genome "hg38").

If not using igenomes or refgenie, --fasta, --fasta_fai, and --chr_prefix need to be specified! If --chr_sizes is not provided, it will be generated automatically.

Annotation files

Documentation

The nf-platypusindelcalling pipeline comes with documentation about the pipeline usage and output.

Please read usage document to learn how to perform sample analysis provided with this repository!

Credits

nf-platypusindelcalling was originally translated from the Roddy-based pipeline by Kuebra Narci.

The pipeline was originally written in the workflow management language Roddy. Inspired by the original GitHub page.

The Indel calling workflow was used in the pan-cancer analysis of whole genomes (PCAWG) and can be cited with the following publication:

Pan-cancer analysis of whole genomes. The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Nature volume 578, pages 82–93 (2020). DOI 10.1038/s41586-020-1969-6

We thank the following people for their extensive assistance in the development of this pipeline:

Nagarajan Paramasivam (@NagaComBio)

TODO

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

nf-platypusindelcalling's Issues

Extend CI test to the whole pipeline

Is your feature request related to a problem? Please describe

We currently have a small CI test only for indel calling. This needs to be extended to cover the whole pipeline.

Describe the solution you'd like

Small annotation files need to be prepared before this.

Describe alternatives you've considered

No response

Additional context

No response

Update documentation

Is your feature request related to a problem? Please describe

VEP annotation description is missing in README

Describe the solution you'd like

VEP annotation description is missing in README

Describe alternatives you've considered

No response

Additional context

No response

Output proper vcf

Is your feature request related to a problem? Please describe

The current output VCF is not in valid VCF format.

Describe the solution you'd like

Use https://github.com/DKFZ-ODCF/SNVCallingWorkflow/blob/master/resources/analysisTools/snvPipeline/convertToStdVCF.py to output a proper VCF.

Describe alternatives you've considered

No response

Additional context

No response

fix annotation file input

Is your feature request related to a problem? Please describe

fix annotation file input like snvcalling pipeline

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

add nf-prov

Is your feature request related to a problem? Please describe

add nf-prov plugin

@drewjbeh

Describe the solution you'd like

Add this below the profiles
    // Nextflow plugins
    plugins {
        id '[email protected]' // Provenance reports for pipeline runs
    }
Add this below the dag block, above manifest
    prov {
        enabled = true
        formats {
            bco {
                file = "${params.outdir}/pipeline_info/manifest_${trace_timestamp}.bco.json"
            }
        }
    }

Describe alternatives you've considered

No response

Additional context

No response

FILTER 'MQ' is not defined in the header

Have you checked the docs?

Description of the bug

Main branch, annotation using annovar. Reported by @Naga

Command used and terminal output

Process:
NF_PLATYPUSINDELCALLING:PLATYPUSINDELCALLING:OUTPUT_STANDARD_VCF:BCFTOOLS_SORT
Error:
  [W::vcf_parse_filter] FILTER 'MQ' is not defined in the header
  Error encountered while parsing the input at chr1:10048
  Cleaning
Issue:
##FILTER=<ID=ALTC,Description="Alternative reads in control">
##FILTER=<ID=ALTT,Description="Less than three variant reads in tumor">
##FILTER=<ID=GTQ,Description="Quality for genotypes below thresholds">
##FILTER=<ID=GTQFRT,Description="Quality for genotypes below thresholds and variant allele frequency in tumor < 10%">
##FILTER=<ID=HapScore,Description="Too many haplotypes are supported by the data in this region.">
##FILTER=<ID=PASS,Description="Position passed all filters, call is made">
##FILTER=<ID=Q20,Description="Variant quality is below 20.">
##FILTER=<ID=QD,Description="Variant-quality/read-depth for this variant">
##FILTER=<ID=QUAL,Description="Quality of entry too low and/or low coverage in region">
##FILTER=<ID=VAF,Description="Variant allele frequency in tumor < ' + str(args.newpun) + ' times allele frequency in control">
##FILTER=<ID=VAFC,Description="Variant allele frequency in tumor < 5% or variant allele frequency in control > 5%">
##FILTER=<ID=alleleBias,Description="Variant frequency is lower than expected for het.">
##FILTER=<ID=badReads,Description="Variant supported only by reads with low quality bases close to variant position, and not present on both strands.">
##FILTER=<ID=strandBias,Description="Variant fails strand-bias filter">
MQ not in header..
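
A mismatch like this can be detected before the file reaches bcftools by comparing the FILTER values used in the records with the IDs declared in the header. The Python sketch below is a hypothetical helper and not part of the pipeline. It treats PASS as always declared, and the example record is made up apart from the chr1:10048 position taken from the log above.

```python
def undeclared_filters(vcf_lines):
    """Return FILTER IDs used in records but missing from ##FILTER header lines."""
    prefix = "##FILTER=<ID="
    declared = {"PASS"}  # PASS is reserved by the VCF specification
    used = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            # The ID is everything between the prefix and the first comma.
            declared.add(line[len(prefix):].split(",", 1)[0].rstrip(">"))
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # Column 7 (index 6) holds a semicolon-separated FILTER list.
            if len(fields) > 6 and fields[6] != ".":
                used.update(fields[6].split(";"))
    return sorted(used - declared)


# Minimal made-up example: QD is declared in the header, MQ is not.
example = [
    "##fileformat=VCFv4.1",
    '##FILTER=<ID=QD,Description="Variant-quality/read-depth for this variant">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10048\t.\tCT\tC\t30\tMQ;QD\t.",
]
print(undeclared_filters(example))
```

Each ID reported this way would need a matching ##FILTER line in the header before sorting.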

Relevant files

No response

System information

No response

[FEATURE] Remove FREQ based FILTERing and move it as an annotation

Is your feature request related to a problem? Please describe

We do not wish to exclude somatic variants based on their population frequency. It would be beneficial to eliminate the filtering or penalty based on FREQ and instead include it as an annotation in the INFO columns. This would allow users to decide if they want to exclude such variants.

Similar to this commit

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

test slurm

Is your feature request related to a problem? Please describe

test slurm env in tubingen cluster

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

add .gitpod.yml

Is your feature request related to a problem? Please describe

add gitpod env

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

tries conda env build even for singularity runs

Have you checked the docs?

Description of the bug

Nextflow appears to try to build a Conda environment even for Singularity runs of nf-core modules.

This happens for VEP and bcftools.

Command used and terminal output

#!/bin/bash
module load nextflow/22.07.1-edge
nextflow run main.nf -profile dkfz_cluster_38,singularity --input testdata/samplesheet_test.csv -resume --annotation_tool vep

Relevant files

System information

No response

CI/CD Test integration

CI/CD integration needs to be ready for platypus indel calling phase.

check out fasta.contains

Have you checked the docs?

Description of the bug

When igenomes is used, the genome parameter pulls the fasta file as genome.fa, which does not include hg38 or hg37 in the fasta param! Check the usage.

Command used and terminal output

No response

Relevant files

No response

System information

No response

Replace annovar with VEP or SNPeff

Is your feature request related to a problem? Please describe

Annovar has license issues, so we need to replace it with either VEP or SNPeff. However, the annotation bundles for those tools are so large that we cannot ship them on the cloud or in containers.

Describe the solution you'd like

Annovar has license issues, so we need to replace it with either VEP or SNPeff.

Describe alternatives you've considered

No response

Additional context

No response

fix input checking

Is your feature request related to a problem? Please describe

It is not, but this would look better.

Describe the solution you'd like

fix input_check.nf
check nf-aceseq pipeline

Describe alternatives you've considered

implement nf-validation plugin

Additional context

No response

Generate contig lines from BAM, not fai

Have you checked the docs?

Description of the bug

Platypus does not add ##contig info to the VCF header, which goes against the standard VCF format that we report in the last step of the pipeline.

The ##contig lines are currently added from the fai, but they should come from the input BAM.

##contig lines should be generated from BAM files, not from the fai.

Comment by @NagaComBio
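
One possible way to do this is sketched below in Python: read the BAM header as text (for example with `samtools view -H`) and convert each @SQ record into a ##contig line. The function name is hypothetical and the header text is a shortened example. Only the ID and length fields are written here, which is an assumption about what the final VCF needs.

```python
def contig_lines_from_sam_header(header_text):
    """Convert @SQ records of a SAM/BAM header into VCF ##contig lines."""
    lines = []
    for row in header_text.splitlines():
        if not row.startswith("@SQ"):
            continue
        # Each field after the record type has the form TAG:VALUE.
        tags = dict(field.split(":", 1) for field in row.split("\t")[1:])
        lines.append(f"##contig=<ID={tags['SN']},length={tags['LN']}>")
    return lines


# Shortened example header, as printed by `samtools view -H sample.bam`.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
)
for line in contig_lines_from_sam_header(header):
    print(line)
```

Because the lines come from the BAM header, they follow the contig names and order of the reference the reads were aligned to.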

Command used and terminal output

No response

Relevant files

No response

System information

No response

Generate annotation files for test case

Is your feature request related to a problem? Please describe

We need small test annotation files for CI test for this pipeline.

Describe the solution you'd like

Annotation files could be subset from the originals according to the regions covered by the test samples.

Describe alternatives you've considered

No response

Additional context

No response
