nf-core / raredisease
Call and score variants from WGS/WES of rare disease patients.
Home Page: https://nf-co.re/raredisease
License: MIT License
Add this finishing touch to the mapping subworkflow so that the preprocessing of BAM files is complete before branching out into other tools, e.g. variant callers.
Hello, we use this to add additional annotations after vcfanno and to include a header with software and case information.
To complement the project board, we would like an interactive chart that reflects the progress of the development work and that can easily be modified, e.g. when we want to include more tools.
nf-core recommends LucidChart or Google Drawings for such tasks. For the moment we are going with Google Drawings.
We use this to aggregate SV VCF callsets: https://github.com/J35P312/SVDB
This overview is based on the WGS/WES rare disease pipeline (MIP) currently in use at Clinical Genomics Stockholm. It outlines the basic functionality and modules that we would like from a pipeline specialised for calling, annotating and scoring variants relevant for rare disease patients.
FASTQ files are prepared for variant calling by alignment with bwa-mem/bwa-mem2 followed by MarkDuplicates. From this point the workflow splits into an SNV/indel part and an SV part.
SNVs/indels are primarily called with DeepVariant and GLnexus, with the possibility of turning on the GATK HaplotypeCaller workflow. These two callsets can be combined into one for maximum sensitivity. vcfanno annotates the callset with population allele frequencies (gnomAD) and predicted pathogenicity (CADD). Common variation is removed from the callset and CADD scores are calculated for indels. VEP is used for transcript annotation, including annotation with ClinVar, SpliceAI and pLI scores. The SNVs/indels are split into a clinical callset and a research callset based on a BED file with genes of interest. Finally the variants are ranked for predicted pathogenicity based on their annotations as well as their modes of inheritance.
We use CNVnator, Manta, (Delly) and TIDDIT to call structural variants. Using SVDB we combine the variants into one callset, and using a local frequency database we remove common variants and sequencing/calling artefacts. The callset is annotated with vcfanno and VEP, followed by a split into a clinical callset and a research callset. The SVs are then ranked in the same manner as the SNVs.
Aside from SNVs and SVs, the pipeline identifies and visualizes runs of homozygosity/autozygosity as well as UPDs. Also included are identification and annotation of pathogenic STRs with ExpansionHunter and Stranger. SMNCopyNumberCaller is used to diagnose patients with spinal muscular atrophy.
The tools mentioned here are not set in stone and we are certainly open to adding and changing tools as we continue development. Below is a list of tools used in the workflow.
Bcftools
BedTools
BWA
CADD
Chanjo
Chromograph
CNVnator
Cyrius
Delly
DeepVariant
ExpansionHunter
FastQC
GATK
GENMOD
Gffcompare
GLnexus
Manta
MultiQC
Peddy
PicardTools
PLINK
Rhocall
Sambamba
Samtools
SMNCopyNumberCaller
Stranger
SVDB
Telomerecat
TIDDIT
Upd
Vcf2cytosure
Vcfanno
VEP
Add Expansionhunter nf-core module:
https://github.com/nf-core/modules/tree/master/modules/expansionhunter
Normalize and split multi-allelic variants using bcftools norm prior to annotation.
Incorporate the bcftools norm module from nf-core modules.
Alternatively, we could use vt decompose and vt normalize.
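For illustration, splitting a multi-allelic record into biallelic records (the operation `bcftools norm -m-` performs) can be sketched as below. This is a simplified sketch only: real normalization also rewrites the per-allele INFO and FORMAT fields, which this ignores.

```python
def split_multiallelic(vcf_line):
    """Split one multiallelic VCF data line into one line per ALT allele.

    Simplified sketch: INFO/FORMAT fields are copied as-is, unlike bcftools norm.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alts = fields[0], fields[1], fields[2], fields[3], fields[4]
    rest = fields[5:]
    out = []
    for alt in alts.split(","):
        out.append("\t".join([chrom, pos, vid, ref, alt] + rest))
    return out

# A record with two ALT alleles becomes two biallelic records
record = "1\t1000\t.\tA\tC,T\t50\tPASS\t."
for line in split_multiallelic(record):
    print(line)
```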
Add this to call_structural_variants.nf to combine VCFs from Manta, CNVpytor and TIDDIT.
Maybe a default file for variant_catalog (in case the user doesn't provide one) should still be added? What do you think? In case this should be included in this merge, I can try to look for how this could be done in prepare_genome.nf.
It's not a bad idea. However, I think we can go ahead and merge this one and add that option in a small PR later. We could bundle it with the pipeline or have it as a URL: https://raw.githubusercontent.com/Illumina/ExpansionHunter/master/variant_catalog/hg19/variant_catalog.json
There has also been a discussion about adding a download workflow which would automatically download all the references.
Originally posted by @jemten in #51 (comment)
Add mosdepth to raredisease
As shown in the screenshot, it takes roughly 19 minutes to finish the test. I propose that we use the incubating stub feature (https://www.nextflow.io/docs/latest/process.html#stub) for workflows that take a while, e.g. call_snv_deepvariant.nf.
If we implement this, it would involve adding a stub command to each of the processes in the above subworkflow.
Produce unmapped BAM (uBAM) from aligned BAM
bcftools/sort is used in the mitochondrial workflow. An issue is open here: nf-core/modules#915.
Here is a list of Sentieon tools that are relevant for the pipeline and for which issues have been opened in https://github.com/nf-core/modules.
Another tool that might be relevant but for which there is no open issue at the moment:
Our in-house pipeline (MIP) uses this tool and we want to add this to the nextflow pipeline. It's not part of nf-core modules yet: nf-core/modules#792.
Once the module is added to nf-core/modules, it will be added to the qc_bam.nf subworkflow.
We have a in-progress flowchart for the pipeline: https://docs.google.com/drawings/d/1QZsgxM4zuArI-N2kuWzwJB5PpjwOu8XxJ3-DInlNYEk/edit
To encourage everyone to contribute to it, we should write a "how-to".
It would be good to add read_group to meta so bwa_mem2 can use it, as can other future programs (e.g. peddy needs it).
I have briefly tested adding this line:
meta.read_group = "'@RG\tID:" + row.sample + "_" + row.fastq_1.split('/')[-1].split('R1*.fastq')[0] + "_" + row.lane + "\tPL:ILLUMINA\tSM:" + row.sample.split('_')[0] + "'"
in subworkflows/local/input_check.nf
But it creates issues when GLnexus needs to combine the different channels again (see nextflow log)
This problem does however not arise with
meta.read_group = "'@RG\tID:myid\tPL:ILLUMINA\tSM:" + row.sample.split('_')[0] + "'"
The problem arises both with a single sample and with multiple samples in the samplesheet.
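For illustration, the read-group string the snippet above aims to build can be sketched in Python; the function name, arguments and the exact ID scheme here are hypothetical, not the pipeline's API.

```python
def make_read_group(sample, lane, platform="ILLUMINA"):
    """Build a bwa-style read-group string (hypothetical ID scheme).

    ID should be unique per sample/lane pair; SM strips any suffix
    after the first underscore from the sample name.
    """
    rg_id = f"{sample}_{lane}"
    sm = sample.split("_")[0]
    # Literal \t separators, as bwa mem -R expects
    return f"'@RG\\tID:{rg_id}\\tPL:{platform}\\tSM:{sm}'"

print(make_read_group("1234N_T1", 2))
```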
Add this to call_structural_variants.nf so we can annotate combined callsets.
We need to know that the input VCFs used in, for example, the annotation process have been decomposed.
Write a small script that parses the header and checks for the bcftools norm command.
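A minimal sketch of that check, assuming bcftools records its command lines in the header as `##bcftools_normCommand=...` (which bcftools does for each subcommand run on the file):

```python
def was_normalized(header_lines):
    """Return True if the VCF header records a bcftools norm run."""
    return any(l.startswith("##bcftools_normCommand=") for l in header_lines)

# Illustrative header fragment, not from a real file
header = [
    "##fileformat=VCFv4.2",
    "##bcftools_normCommand=norm -m- -f ref.fa in.vcf.gz",
]
print(was_normalized(header))  # True
```

In the real script, the header lines would come from `bcftools view -h` or a gzip-aware reader rather than a hard-coded list.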
I have checked the following places for your error:
Steps to reproduce the behaviour:
Without customconf.conf
[dd/f36687] NOTE: Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134) -- Execution is retried (1)
WARN: Input tuple does not match input set cardinality declared by process `NFCORE_RAREDISEASE:RAREDISEASE:DEEPVARIANT_CALLER:GLNEXUS` -- offending value: [id:caseydonkey]
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'
Caused by:
Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134)
Command executed:
picard \
-Xmx6g \
MarkDuplicates \
--CREATE_INDEX \
-I 1234N.bam \
-O 1234N_sorted.bam \
-M 1234N_sorted.MarkDuplicates.metrics.txt
cat <<-END_VERSIONS > versions.yml
MARKDUPLICATES:
markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
END_VERSIONS
Command exit status:
134
Command output:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (g1PageBasedVirtualSpace.cpp:43), pid=211157, tid=211219
# guarantee(rs.is_reserved()) failed: Given reserved space must have been reserved already.
#
# JRE version: (11.0.9.1) (build )
# Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# hs_err_pid211157.log
#
#
Command error:
/usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
/usr/local/bin/picard: line 66: 211157 Aborted /usr/local/bin/java -Xmx6g -jar /usr/local/share/picard-2.25.7-0/picard.jar MarkDuplicates "--CREATE_INDEX" "-I" "1234N.bam" "-O" "1234N_sorted.bam" "-M" "1234N_sorted.MarkDuplicates.metrics.txt"
With customconf.conf:
process {
withName: PICARD_MARKDUPLICATES {
memory = 5.GB
}
}
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'
Caused by:
Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (1)
Command executed:
picard \
-Xmx5g \
MarkDuplicates \
--CREATE_INDEX \
-I 1234N.bam \
-O 1234N_sorted.bam \
-M 1234N_sorted.MarkDuplicates.metrics.txt
cat <<-END_VERSIONS > versions.yml
MARKDUPLICATES:
markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
END_VERSIONS
Command exit status:
1
Command output:
Error occurred during initialization of VM
Could not reserve enough space for 5242880KB object heap
Command error:
/usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
Successful completion of the analysis
Have you provided the following extra information/files:
.nextflow.log
In modules/nf-core/modules/picard/markduplicates/main.nf:
avail_mem = task.memory.giga - 2
A similar error occurs for bamqc.
For the first error (MarkDuplicates), the avail_mem fix above applies.
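The headroom logic behind avail_mem can be sketched as follows: the JVM heap (-Xmx) must sit below the task's memory limit, or the VM fails to reserve its heap, as in the logs above. The function name and 2 GB overhead are illustrative.

```python
def jvm_heap_gb(task_mem_gb, overhead_gb=2):
    """Return a -Xmx value (GB) that leaves headroom for JVM/native overhead.

    Setting -Xmx equal to the container/task limit leaves no room for the
    JVM's own overhead, so heap reservation can fail with errors like
    "Could not reserve enough space for ... object heap".
    """
    return max(1, task_mem_gb - overhead_gb)

print(f"-Xmx{jvm_heap_gb(6)}g")  # -Xmx4g
```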
Update mosdepth in nf-core modules to v3.3.3
Use the nf-core DeepVariant (DV) module if possible.
We should have a tests folder with test workflows for local modules + subworkflows. Not for nf-core/modules, because tests for those are already written in their repo.
Add GLnexus for genotyping
nf-core/modules#729
Our in-house pipeline uses this tool and we want to add this to the nextflow pipeline. It's not part of nf-core modules yet: nf-core/modules#793.
Once the module is added to nf-core/modules, it will be added to the qc_bam.nf subworkflow.
EDIT: The module is part of nf-core/modules now. Please go ahead and add it to subworkflow qc_bam.nf
An example of what new syntax looks like: https://github.com/nf-core/rnaseq/blob/dsl2/conf/modules.config
Rationale for template update on module + pipeline level: nf-core/tools#1327
Add VEP from nf-core modules
Add preprocessing for Gens to the pipeline.
The pipeline currently rebuilds the index for bwamem2 on every run. To save resources, there should be a check for existing indices to use instead.
This subworkflow should 1) check for existing reference index files and 2) allow the re-use of indices in different downstream processes.
Some relevant arguments are --max_sv_size <length of chromosome 1> and the ExACpLI plugin.
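The chromosome 1 length for --max_sv_size could be read from the reference's FASTA .fai index, where column 1 is the sequence name and column 2 its length. A sketch, with a made-up two-line .fai (the lengths shown match GRCh38):

```python
def chrom_length(fai_text, chrom="1"):
    """Return the length of a chromosome from FASTA .fai index text."""
    for line in fai_text.splitlines():
        name, length = line.split("\t")[:2]
        if name == chrom:
            return int(length)
    raise KeyError(chrom)

# Illustrative .fai content: name, length, offset, linebases, linewidth
fai = "1\t248956422\t112\t70\t71\n2\t242193529\t252513167\t70\t71"
print(chrom_length(fai))  # 248956422
```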
We have agreed to use the mitochondria workflow currently implemented in the GATK best practices.
The following steps are included. Modules already exist for some of them; all modules need to be included in a subworkflow. We plan to have the mitochondrial subworkflow run by default, but with the possibility to turn it off, and also to turn off the calling of variants for the autosomes.
bcftools query and bcftools view to prepare input for haplogrep2.
This list can be modified as new issues are created and new modules are added.
Test dataset including mtDNA: https://github.com/nf-core/test-datasets/tree/raredisease
Samtools and MultiQC are outdated.
Update their versions.
An issue has been opened: nf-core/modules#1092
In MIP, we combine the callsets from Manta, tiddit/sv, and CNVnator using SVDB. We should add tiddit/sv.
Should Stranger be added to the subworkflows/nf-core/call_repeat_expansions.nf subworkflow?
Refactor check vcf subworkflow
nf-core/modules updated the way versions are emitted, from <software>.version.txt to versions.yml. This allows emitting multiple versions in cases where a module or subworkflow uses multiple tools. Updated documentation here.
This pipeline is not updated accordingly yet.
Update the subworkflows and the main workflow accordingly.
Add test dataset to enable CI tests. A new branch needs to be created in nf-core/test-datasets.
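For illustration, a versions.yml emitted by a process that runs multiple tools could look like the fragment below; the process name and version numbers are placeholders, not taken from the pipeline.

```yaml
"NFCORE_RAREDISEASE:RAREDISEASE:QC_BAM:SAMTOOLS_STATS":
    samtools: 1.15.1
    mosdepth: 0.3.3
```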
SamToFastq
Convert SAM or BAM file to FastQ
Add manta to SV caller subworkflow
Currently, the code snippet is in the raredisease.nf script, but when there are more mappers/aligners in the picture we should hide the logic away in a bigger subworkflow with switches for which tool to use. This way we can declutter the raredisease.nf script.
if (params.aligner == 'bwamem2') {
ALIGN_BWAMEM2 (
INPUT_CHECK.out.reads,
PREPARE_GENOME.out.bwamem2_index
)
...
turns into...
if (aligner == 'bwamem2') {
ALIGN_BWAMEM2 (
reads,
bwamem2_index
)
...
stowed in the bigger subworkflow, where aligner is an input defined in the take: block.
We need the SV querying module to annotate: https://github.com/J35P312/SVDB#query
Create a subworkflow to check whether proper indices are available for BED files, and create them when not.
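The check-then-build logic for those indices can be sketched as below; the function, the `.tbi` suffix and the injected build step are illustrative only, not the pipeline's actual code.

```python
import os

def ensure_index(bed_path, build_index):
    """Reuse <bed>.tbi if it already exists; otherwise build it.

    build_index is any callable (bed_path, index_path) -> None, e.g. a
    wrapper around tabix, injected so the check is testable in isolation.
    """
    idx = bed_path + ".tbi"
    if not os.path.exists(idx):
        build_index(bed_path, idx)
    return idx
```

In the pipeline this decision would be made per reference file, so downstream processes receive the same index whether it was found or freshly built.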