nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing

Home Page: https://nf-co.re/sarek

License: MIT License

Languages: Nextflow 98.64%, HTML 0.53%, Python 0.83%
Topics: nextflow, workflow, pipeline, bioinformatics, genomics, cancer, next-generation-sequencing, conda, reproducible-research, containers

sarek's Introduction

nf-core/sarek

GitHub Actions CI Status GitHub Actions Linting Status AWS CI nf-test Cite with Zenodo

Nextflow run with conda run with docker run with singularity Launch on Seqera Platform

Get help on Slack Follow on Twitter Follow on Mastodon Watch on YouTube

Introduction

nf-core/sarek is a workflow designed to detect variants in whole-genome or targeted sequencing data. Initially designed for human and mouse, it can work on any species with a reference genome. Sarek can also handle tumour/normal pairs and can include additional relapse samples.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

It's listed on Elixir - Tools and Data Services Registry and Dockstore.

Pipeline summary

Depending on the options and samples provided, the pipeline can currently perform the following:

  • Form consensus reads from UMI sequences (fgbio)
  • Sequencing quality control (FastQC) and trimming (fastp, enabled by --trim_fastq)
  • Map Reads to Reference (BWA-mem, BWA-mem2, dragmap or Sentieon BWA-mem)
  • Process BAM file (GATK MarkDuplicates, GATK BaseRecalibrator and GATK ApplyBQSR or Sentieon LocusCollector and Sentieon Dedup)
  • Summarise alignment statistics (samtools stats, mosdepth)
  • Variant calling (enabled by --tools, see compatibility):
    • ASCAT
    • CNVkit
    • Control-FREEC
    • DeepVariant
    • freebayes
    • GATK HaplotypeCaller
    • Manta
    • mpileup
    • MSIsensor-pro
    • Mutect2
    • Sentieon Haplotyper
    • Strelka2
    • TIDDIT
  • Variant filtering and annotation (SnpEff, Ensembl VEP, BCFtools annotate)
  • Summarise and represent QC (MultiQC)

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

patient,sample,lane,fastq_1,fastq_2
ID1,S1,L002,ID1_S1_L002_R1_001.fastq.gz,ID1_S1_L002_R2_001.fastq.gz

Each row represents a pair of FASTQ files (paired-end).

Now, you can run the pipeline using:

nextflow run nf-core/sarek \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.

Benchmarking

On each release, the pipeline is run on three full-size tests:

  • test_full runs tumor-normal data for one patient from the SEQC2 consortium
  • test_full_germline runs a WGS 30X Genome in a Bottle (NA12878) dataset
  • test_full_germline_ncbench_agilent runs two WES samples with 75M and 200M reads (data available here). The results are uploaded to Zenodo, evaluated against a truth dataset, and made available via the NCBench dashboard.

Credits

Sarek was originally written by Maxime U Garcia and Szilveszter Juhos at the National Genomics Infrastructure and National Bioinformatics Infrastructure Sweden, which are both platforms at SciLifeLab, with the support of The Swedish Childhood Tumor Biobank (Barntumörbanken). Friederike Hanssen and Gisela Gabernet at QBiC later joined and helped with further development.

The Nextflow DSL2 conversion of the pipeline was led by Friederike Hanssen and Maxime U Garcia.

Maintenance is now led by Friederike Hanssen and Maxime U Garcia (now at Seqera Labs).

Main developers:

We thank the following people for their extensive assistance in the development of this pipeline:

Acknowledgements

  • Barntumörbanken
  • SciLifeLab
  • National Genomics Infrastructure
  • National Bioinformatics Infrastructure Sweden
  • QBiC
  • GHGA
  • DNGC

Contributions & Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #sarek channel (you can join with this invite), or contact us: Maxime U Garcia, Friederike Hanssen

Citations

If you use nf-core/sarek for your analysis, please cite the Sarek article as follows:

Friederike Hanssen, Maxime U Garcia, Lasse Folkersen, Anders Sune Pedersen, Francesco Lescai, Susanne Jodoin, Edmund Miller, Oskar Wacker, Nicholas Smith, nf-core community, Gisela Gabernet, Sven Nahnsen. Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery. NAR Genomics and Bioinformatics, Volume 6, Issue 2, June 2024, lqae031. doi: 10.1093/nargab/lqae031

Garcia M, Juhos S, Larsson M et al. Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 2; peer review: 2 approved]. F1000Research 2020, 9:63. doi: 10.12688/f1000research.16665.2

You can cite the Sarek Zenodo record for a specific version using the following DOI: 10.5281/zenodo.3476425

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

CHANGELOG

sarek's People

Contributors

abhi18av, adamrtalbot, adrlar, apeltzer, asp8200, berguner, chelauk, davidmasp, edmundmiller, friederikehanssen, ggabernet, grantn5, jfnavarro, johanneskersting, lassefolkersen, lconde-ucl, lescai, malinlarsson, matrulda, maxulysse, mirpedrol, nf-core-bot, nickhsmith, nschcolnicov, raqmanzano, robsyme, sppearce, susijo, vsmalladi, wackero


sarek's Issues

Optimize (pre)processing

Using the "old" 2.3 release version of Sarek on a 48-core node with +700G memory, the preprocessing of a 45x/45x tumour/normal WGS pair took 2d 1h 11m 24s - that is actually pretty good. OTOH, there are pretty long parts using only a single CPU Grafana showing on Munin CPU utilisation graph I know @maxulysse already managed to speed up recalibration, would be nice to

EDIT: add splitting fastq files

Cleanup issues before release

  • rename wgs_calling_regions_CAW.list for GRCh37
  • QC processes still launch with '--noReports' (they should not)
  • add a check for misspelled parameters (e.g. '--pom file.vcf' is silently ignored)
    ... more to come as I am testing

Collect MarkDuplicatesSpark metrics with the standalone Picard tool EstimateLibraryComplexity

cf #64
cf https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php

Collecting duplicate metrics slows down performance, so for the Spark version of the tool metrics collection is optional and must be requested with '-M'. It is possible to collect the metrics with the standalone Picard tool EstimateLibraryComplexity.
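
A sketch of what collecting the metrics separately could look like (placeholder file names; the GATK4 distribution ships the Picard tool under the same name):

gatk EstimateLibraryComplexity \
    --INPUT sample.md.bam \
    --OUTPUT sample.library_complexity_metrics.txt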

Use spark for GATK

From @apeltzer cf SciLifeLab#730

Might simply work like this: https://software.broadinstitute.org/gatk/documentation/article?id=11245

Copying the text from there over here:

You don't need a Spark cluster to run Spark-enabled GATK tools!

If you're working on a "normal" machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine -- or however many you choose to allocate. See the example parameters below and the local-Spark tutorial for more information on how to control this. And if your machine only has a single core, these tools can always be run in single-core mode -- it'll just take longer for them to finish.

To be clear, even the Spark-only tools can be run on regular machines, though in practice a few of them may be prohibitively slow (SV tools and PathSeq). See the Tool Docs for tool-specific recommendations.

If you do have access to a Spark cluster, the Spark-enabled tools are going to be extra happy but you may need to provide some additional parameters to use them effectively. See the cluster-Spark tutorial for more information.
Example command-line parameters

Here are some example arguments you would give to a Spark-enabled GATK tool:

    --spark-master local[*] -> "Run on the local machine using all cores"
    --spark-master local[2] -> "Run on the local machine using two cores" 

It should be possible to run e.g. Picard MarkDuplicates like this, and the same goes for other steps such as BaseRecalibrator and HaplotypeCaller.
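
For instance, duplicate marking on a single node might look like this (a hedged sketch, untested here; --spark-master sizes the in-place local cluster as described above):

gatk MarkDuplicatesSpark \
    -I sample.bam \
    -O sample.md.bam \
    --spark-master 'local[8]'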


https://gatkforums.broadinstitute.org/gatk/discussion/23441/markduplicatespark-is-slower-than-normal-markduplicates

We should really look into this!


Apparently it benefits from not sorting reads prior to MarkDuplicates, taking the BWA-MEM output BAM directly. So we should really try it :-)


We should move this over to nf-core/sarek ;)


Moved over ;-)

Checking read counts for raw FASTQ and deduplicated BAM

The short answer is yes: all the input reads are in the deduplicated BAM, even the unmapped ones. Note that the recalibrated BAM does not contain all the reads.

We expect the deduplicated BAMs to contain all the reads (even unmapped ones). Checking on munin under /data1/szilva/test/results/Preprocessing/B/DuplicateMarked :

There are 443237637 raw read pairs considering the "B" FASTQs in /data1/szilva/test/BTB0001.tsv . In read group 7_11 there are 209807914 reads, and all of them are paired (bwa would bail out for non-interleaved FASTQs anyway):

$ echo `zcat 7_170815_H57FTCCXY-11_BTB0001-B_AGCGATAG_1.fastq.gz| wc -l`/4|bc
209807914
$ echo `zcat 7_170815_H57FTCCXY-11_BTB0001-B_AGCGATAG_2.fastq.gz| wc -l`/4|bc
209807914

It can be seen that there are unmapped reads (checking a single read group):

$ samtools view -R 7_11.rg B.md.bam > 7_11.sam
$ wc -l 7_11.sam
426207196 7_11.sam
$ tail 7_11.sam | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4}'
ST-E00269:210:H57FTCCXY:7:1224:24545:73159      77      *       0
ST-E00269:210:H57FTCCXY:7:1224:24545:73159      141     *       0
ST-E00269:210:H57FTCCXY:7:1224:28706:73159      77      *       0
ST-E00269:210:H57FTCCXY:7:1224:28706:73159      141     *       0

And either the first or the second in the pair is exported at least once:

$ awk '{print $1}' 7_11.sam > 7_11.reads
$ sort --parallel=24 -u 7_11.reads | wc -l
209807914

Still, there can be reads where only one mate of the pair is saved, and the other is missing. Certainly there are multi-mapping reads (for SAM flags see https://broadinstitute.github.io/picard/explain-flags.html ; 64 stands for first in pair, 128 for second in pair):

$ samtools view -@16 -c -R 7_11.rg -f 64 B.md.bam 
213004053
$ samtools view -@16 -c -R 7_11.rg -f 128 B.md.bam 
213203143

When unique read names are checked for the first and the second mate, the number of printed reads matches the number of reads in the initial FASTQ for that read group:

 $ time samtools view -@16 -R 7_11.rg -f 64 B.md.bam | awk '{print $1}'| sort --parallel=16 -u | wc -l
209807914
real    110m24.646s
user    129m8.126s
 $ time samtools view -@16 -R 7_11.rg -f 128 B.md.bam | awk '{print $1}'| sort --parallel=16 -u | wc -l
209807914
real    108m56.789s
user    127m5.951s

The ultimate count considering all the read groups in the sample is also all right (a single pipe can kill /tmp, hence the two-step processing):

$ samtools view -@32 -f 64 B.md.bam | awk '{print $1}' > 64.reads 
$ samtools view -@32 -f 128 B.md.bam | awk '{print $1}' > 128.reads 
$ for f in 64.reads 128.reads; do echo -n $f" ";sort --parallel=32 -u $f| wc -l; done
64.reads 443237637
128.reads 443237637

@alneberg I think we can safely say that all the reads are there in the deduplicated BAM.

Check BED compatibility

Issue by @apeltzer, moved from SciLifeLab#687

Is your feature request related to a problem? Please describe.

Not really ;-)
But this happens quite often when people use a target capture method.

Describe the solution you'd like

A small extra process that checks BED files for consistency with the utilized reference genome.
Either check only and issue a warning, or also offer to remove the 'chr' prefix, for example ;-)
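
A minimal sketch of such a check (hypothetical file names; it compares the contig names in the BED against the reference .fai index):

$ cut -f1 targets.bed | sort -u > bed_contigs.txt
$ cut -f1 reference.fasta.fai | sort -u > ref_contigs.txt
$ comm -23 bed_contigs.txt ref_contigs.txt
# any output = contigs present in the BED but missing from the reference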

Comments:

@apeltzer

Can do that if you guys don't mind :-)

@maxulysse

That would be amazing \o/

wholegenome.interval_list not recognized by CreateIntervalsBed process

I may be overlooking something but Sarek does not seem to document the input file formats/purposes of the genome files. For most files, the purpose is obvious but some, at least for me, aren’t. And for others it isn’t clear what file format is expected.

For instance, I had assumed that the wholegenome.interval_list Picard-formatted file from the GATK resource bundle would be valid as a genomes.intervals file, but the result is a cryptic error message, as well as a work directory full of weird BED files:

java.lang.NumberFormatException: For input string: "VN" […]

It turns out that the relevant Sarek process only supports two out of the three formats described by the GATK documentation, and notably does not support the Picard-format file, which is included in the official GATK bundle.
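
As a workaround (a sketch, not an official recommendation), the Picard-format file can be converted to plain BED before being passed to the pipeline; GATK4 ships the Picard converter:

gatk IntervalListToBed \
    --INPUT wholegenome.interval_list \
    --OUTPUT wholegenome.bed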

Add check for unique lanes in sample when parsing TSV files from FASTQ samples

Issue by @maxulysse, moved from SciLifeLab#626

We should have a warning when at least one sample has more than 2 FASTQ files with the same lane.
Right now, it fails at the BAM-merging step.

Describe the solution you'd like

Add some checks when parsing the TSV file

Describe alternatives you've considered

No alternatives considered at the moment

Additional context

Issue discovered by Katja when running some WES project

Issue by @nicweb, moved from SciLifeLab#714

When there is no lane information given in the sample.tsv, markduplicates will fail with

htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem
parsing @RG key:value pair.

as the read group ID is empty.

It would be nice to check the TSV before execution, or to auto-set an empty lane from the sample ID (or similar) to avoid the downstream execution halt, or to update the documentation to state that this field must not be empty.
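
A quick pre-flight check could catch duplicate lanes (a hypothetical sketch; the column numbers assume sample in column 4 and lane in column 5 of the TSV and would need adjusting to the actual layout):

$ awk -F'\t' '{print $4 "\t" $5}' samples.tsv | sort | uniq -d
# any output = a sample/lane combination occurring more than once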

Community feedback needed: Consider other tools

Issue by @maxulysse, moved from SciLifeLab#666

  • ExpansionHunter for estimating repeat sizes
  • QDNAseq CNVs for shallow WGS
  • CNVkit for CNVs
  • deconstructSigs for mutational signatures
  • ditto mutation-signatures
  • SomaticSeq ensemble caller and machine learning
  • MSIsensor for replication slippage variants at microsatellite regions
  • Oncotator for annotating human point mutations and indels
  • MutSig for Mutation Significance checks
  • TrackSig to infer mutational signatures in cancer over time
  • DriverPower to discover potential coding and non-coding cancer driver elements from tumour whole-genome or whole-exome somatic mutations
  • SVclone for inferring the cancer cell fraction of tumour structural variation from whole-genome sequencing data
  • HsMetrics better QC
  • CNNScoreVariants Annotate a VCF with scores from a Convolutional Neural Network (CNN)
  • SignatureAnalyzer yet another mutation signature analyser from Broad
  • PROSIC2 is a caller for somatic variants in tumor-normal sample pairs
  • seqCAT Evaluation of SNV profiles
  • varlociraptor A Rust library for variant calling using a latent variable model
  • GATK CNN Convolutional Neural Net to filter annotated variants
  • Google DeepVariant Deep neural network to call genetic variants from next-generation DNA sequencing data
  • Octopus
  • mosdepth fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing
  • rock circos plot out of manta
  • DeepTrio
  • SmuRF for Random Forest Ensembl Somatic Calling
  • PureCN for estimating purity/ploidy
  • Lofreq a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. LoFreq is very sensitive; most notably, it is able to predict variants below the average base-call quality (i.e. sequencing error rate).
  • DeepSomatic extension of deep learning-based variant caller DeepVariant
  • MEDICC2 Minimum Event Distance for Intra-tumour Copy-number Comparisons
  • DPClust Dirichlet Process based methods for subclonal reconstruction of tumours
  • LINX annotation, interpretation and visualisation tool for structural variants

Fix path for `ControlFREEC`

Hi everyone,

running Sarek v2.5 with ControlFREEC among the tools returned

[nf-core/sarek] Pipeline completed with errors
Error executing process > 'ControlFreecViz (T_vs_N)'

Caused by:
  Process `ControlFreecViz (T_vs_N)` terminated with an error exit status (1)

Command executed:

  cat /opt/conda/envs/sarek-2.5/bin/assess_significance.R | R --slave --args T.pileup.gz_CNVs T.pileup.gz_ratio.txt
  cat /opt/conda/envs/sarek-2.5/bin/assess_significance.R | R --slave --args T.pileup.gz_normal_CNVs T.pileup.gz_normal_ratio.txt
  cat /opt/conda/envs/sarek-2.5/bin/makeGraph.R | R --slave --args 2 T.pileup.gz_ratio.txt T.pileup.gz_BAF.txt
  cat /opt/conda/envs/sarek-2.5/bin/makeGraph.R | R --slave --args 2 T.pileup.gz_normal_ratio.txt N.pileup.gz_BAF.txt
  perl /opt/conda/envs/sarek-2.5/bin/freec2bed.pl -f T.pileup.gz_ratio.txt > T.bed
  perl /opt/conda/envs/sarek-2.5/bin/freec2bed.pl -f T.pileup.gz_normal_ratio.txt > N.bed

Command exit status:
  1

Command output:
  (empty)

Command error:
  cat: /opt/conda/envs/sarek-2.5/bin/assess_significance.R: No such file or directory

The reason is that /opt/conda/envs/sarek-2.5/bin/assess_significance.R is actually /opt/conda/envs/nf-core-sarek-2.5/bin/assess_significance.R

A quick check suggests fixing the path in:

grep 'envs/sarek-' main.nf
    cat /opt/conda/envs/sarek-${workflow.manifest.version}/bin/assess_significance.R | R --slave --args ${cnvTumor} ${ratioTumor}
    cat /opt/conda/envs/sarek-${workflow.manifest.version}/bin/assess_significance.R | R --slave --args ${cnvNormal} ${ratioNormal}
    cat /opt/conda/envs/sarek-${workflow.manifest.version}/bin/makeGraph.R | R --slave --args 2 ${ratioTumor} ${bafTumor}
    cat /opt/conda/envs/sarek-${workflow.manifest.version}/bin/makeGraph.R | R --slave --args 2 ${ratioNormal} ${bafNormal}
    perl /opt/conda/envs/sarek-${workflow.manifest.version}/bin/freec2bed.pl -f ${ratioTumor} > ${idSampleTumor}.bed
    perl /opt/conda/envs/sarek-${workflow.manifest.version}/bin/freec2bed.pl -f ${ratioNormal} > ${idSampleNormal}.bed
    genesplicer = params.genesplicer ? "--plugin GeneSplicer,/opt/conda/envs/sarek-${workflow.manifest.version}/bin/genesplicer,/opt/conda/envs/sarek-${workflow.manifest.version}/share/genesplicer-1.0-1/human,context=200,tmpdir=\$PWD/${reducedVCF}" : "--offline"
    genesplicer = params.genesplicer ? "--plugin GeneSplicer,/opt/conda/envs/sarek-${workflow.manifest.version}/bin/genesplicer,/opt/conda/envs/sarek-${workflow.manifest.version}/share/genesplicer-1.0-1/human,context=200,tmpdir=\$PWD/${reducedVCF}" : "--offline"
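
A one-line patch along these lines fixed it locally (a sketch; the replacement must match the env name actually present in the container):

$ sed -i 's|envs/sarek-|envs/nf-core-sarek-|g' main.nf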

The run completed successfully after fixing this.

Federico

Singularity image - too many levels of symbolic links

XBIF04:Desktop $ singularity pull --name nf-core/sarek:2.5.2.img docker://nfcore/sarek:2.5.2

INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:c5e155d5a1d130a7f8a3e24cee0d9e1349bff13f90ec6a941478e558fde53c14
43.24 MiB / 43.24 MiB [===================================================] 42s
Copying blob sha256:a5eecf513b481c8a6b4251fd2a356dd9a719faa9458bf052014b56d1b760ea21
90.70 MiB / 90.70 MiB [=================================================] 1m41s
Copying blob sha256:a2f1f1a8c0a9ef4422f59ec97c292815684a9f18d2b8a557d3b2ddfc96fc1401
99.07 MiB / 99.07 MiB [=================================================] 1m55s
Copying blob sha256:f827e546ce315d340e846de1af32a6177d9c3d894dd69bfec358196401d0975a
1.03 MiB / 1.03 MiB [======================================================] 0s
Copying blob sha256:7eb6a47b847d5fb772d28529dc4e43628d4dacf8dadcb3806f8d90c4e32fd04d
12.82 MiB / 12.82 MiB [===================================================] 15s
Copying blob sha256:54030feaba2f217c1c7c1b9d40630384f9724748734e1957ed7014768dad36d6
421 B / 421 B [============================================================] 0s
Copying blob sha256:d1eda63fd41c73c659b5479c67d70b538fda017f2fe191f05c966fd27f0f6745
1.20 GiB / 1.20 GiB [==================================================] 30m16s
Copying config sha256:53abadc78ffac411938e957de42e3d95b1914d74132e6bb7964e56fb34bf3abd
4.12 KiB / 4.12 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
FATAL: While making image from oci registry: while building SIF from layers: packer failed to pack: while unpacking tmpfs: unpack: error extracting layer: link /var/folders/78/rjtn6lb534v9n8kxwm56fl_m0000gn/T/sbuild-316747039/fs/opt/conda/pkgs/ncurses-6.1-he6710b0_1/share/terminfo/N/NCR260VT300WPP /var/folders/78/rjtn6lb534v9n8kxwm56fl_m0000gn/T/sbuild-316747039/fs/opt/conda/share/terminfo/N/NCR260VT300WPP: too many levels of symbolic links

snpEff report bug

Hi, I found a small bug.

At the end of the report generated by snpEff there is a link to ~/results/Reports/SAMPLE/snpEff/Strelka_SAMPLE_variants_snpEff.genes.txt

however, the txt file in this directory is named ~/results/Reports/SAMPLE/snpEff/Strelka_SAMPLE_variants_snpEff.txt

(without the genes part, so the link doesn't work)

Provide reference files for WES for ASCAT

Issue by @Anuragksingh, moved from SciLifeLab#763

Hello,
I have some questions about ASCAT.
It would be really helpful if you could provide some insights.

Can we use the available ASCAT directly on WES data?
I guess the creation of BAF and LogR files from the BAM would be the same for WES data.
Or do we need to apply additional filters to use ASCAT on WES data?
For reference, kindly see this link:
VanLoo-lab/ascat#19
Also, do we need to add the --exome flag when running ASCAT on WES data?

Any suggestions on the above aspects would be really helpful.

Thanks

Comments:

@maxulysse

Running ASCAT for WES is definitely something that we're thinking about.
I remember reading about this link before, I'll definitely look more into it.
Thanks again for all your interest in this project

@szilvajuhos

To run ASCAT on exome data you will need a separate file of SNPs that covers the targeted exons only

@dingxm

Hi,
We are trying to analyze exome data for copy number.
The exome data has extended probes covering regions outside of the exomes.
It seems that last year ASCAT could not use exome data without pre-processing; is this still true, or is there a version that deals with exome data now?
Thank you

@szilvajuhos

Hej,
If you have a BED file that covers your WES target, I would recommend making a .loci file, e.g. by filtering the latest dbSNP for common biallelic SNPs that overlap your target.
You have to have quite a few loci positions (since the target is smaller and the genome is not covered evenly) to make sure you will still have enough SNPs.
@malinlarsson is better at playing with it; the .loci file I have created contains ~16M lines:

$ wc -l dbSNP_151_S04380110_AllExonV5_hg38.loci
16200625
$ head -3 dbSNP_151_S04380110_AllExonV5_hg38.loci
chr1    65529
chr1    65552
chr1    65591

We have managed to make meaningful charts with this loci file (I can share it if you want to give it a try), but certainly it is only the beginning of finding the right settings.
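
For reference, building such a .loci file might look roughly like this (a hedged sketch with placeholder file names; a population-frequency filter for "common" SNPs would still need to be added):

$ bcftools view -R wes_target.bed -v snps -m2 -M2 dbSNP_151.vcf.gz \
    | bcftools query -f '%CHROM\t%POS\n' - > wes_target.loci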

amplicon mode

Hi,
It would be a cool enhancement to have an "amplicon mode" for targeted sequencing (i.e. Illumina cancer panels, etc.).
Probably the big difference from what is already there is the inclusion of a manifest (a BED file with the targeted regions).

QC [fastqc+multiqc]
alignment [bwa-mem]
[deduplication?]<-probably not needed
variant calling [gatk,strelka]

Quality control for target capture experiments:
https://bioconductor.org/packages/release/bioc/html/TEQC.html

CNV:
Currently using CoNVaDING. I am sure there are better alternatives out there.
https://github.com/molgenis/CoNVaDING
Probably a good alternative is CNVkit: https://cnvkit.readthedocs.io/en/stable/

Calculation of coverages in the targeted regions, gene level and exon level.

Why are we using -B 3 (tumor mismatch penalty)?

We should have more information on why we're doing that.
I do believe it's a good idea, but can we back it up?

sarek/main.nf

Lines 673 to 681 in 05c6e0b

extra = status == 1 ? "-B 3" : ""
convertToFastq = hasExtension(inputFile1, "bam") ? "gatk --java-options -Xmx${task.memory.toGiga()}g SamToFastq --INPUT=${inputFile1} --FASTQ=/dev/stdout --INTERLEAVE=true --NON_PF=true | \\" : ""
input = hasExtension(inputFile1, "bam") ? "-p /dev/stdin - 2> >(tee ${inputFile1}.bwa.stderr.log >&2)" : "${inputFile1} ${inputFile2}"
"""
${convertToFastq}
bwa mem -K 100000000 -R \"${readGroup}\" ${extra} -t ${task.cpus} -M ${fasta} \
${input} | \
samtools sort --threads ${task.cpus} -m 2G - > ${idSample}_${idRun}.bam
"""

Add deepTools to get bigWig coverage files

bigWig is a compact format to get coverage information, and we can add it to IGV (https://github.com/igvteam/igv.js/wiki/Wig-Track) . With deepTools (https://deeptools.readthedocs.io/en/develop/index.html) we can easily calculate this file like

bamCoverage -bs 10 -v -b Preprocessing/Recalibrated/sample.recal.bam --ignoreDuplicates --numberOfProcessors 48 --outFileName coverageDir/sample.coverage.bigWig

It would be nice to have it among the results (and deepTools in the container; see the conda install instructions at https://deeptools.readthedocs.io/en/develop/content/installation.html). Maybe it can even replace QualiMap and provide the same QC output.

FilterMutect2Calls complains about '.' when expecting a numeric value in MPOS fields

Mutect2 complains about '.' when it expects a numeric value during the FilterMutect2Calls process. This appears to be caused by large negative values assigned to the MPOS Info field being converted to '.' by bcftools. We have verified that replacing these '.' with numeric values (we arbitrarily used 5) fixes the problem but this is obviously not an ideal solution.
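
The workaround we tested looks roughly like this (hedged; it only covers the simple single-value case, '5' is an arbitrary placeholder, and a proper fix belongs upstream):

$ zcat unfiltered.vcf.gz | sed 's/MPOS=\./MPOS=5/' > unfiltered.patched.vcf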

Mutect2 process skipped

Hello,

I'm trying to run this pipeline with as many variant callers as possible. However, I can only get output from a couple of them. One that I cannot get output from is Mutect2 - it seems Sarek skips that process, even if I explicitly provide my germlineResource and pon files (including both indexes). What am I missing?

Code line:
nextflow run nf-core/sarek --input FASTQ_SUB --tools Mutect2 --germlineResource somatic-b37_af-only-gnomad.raw.sites.vcf --germlineResourceIndex somatic-b37_af-only-gnomad.raw.sites.vcf.idx --pon somatic-b37_Mutect2-WGS-panel-b37.vcf.gz --pon_index somatic-b37_Mutect2-WGS-panel-b37.vcf.idx -profile singularity -c sarek.config

Thank you very much,
Pedro R.

mutect2 fails with --no_intervals option

Hi,
I'm trying the pipeline (dev version) with some exome data and there seems to be an issue when running it with the "--no_intervals" option. The mutect2 step fails because it tries to use an interval file "-L no_intervals.bed" that does not exist. This is an example .command.sh generated in the mutect2 process:

#!/bin/bash -euo pipefail
# Get raw calls
gatk --java-options "-Xmx7g"  Mutect2  \
-R Homo_sapiens_assembly38.fasta  \
-I NET03-A3_Tumour.recal.bam  -tumor NET03-A3_Tumour   \
-I NET03-A7_Ctr.recal.bam -normal NET03-A7_Ctr      \
-L no_intervals.bed    \
--germline-resource gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only.vcf.gz   \
-O no_intervals_NET03-A3_Tumour_vs_NET03-A7_Ctr.vcf

I understand the reasoning of using the -L argument with the intervals when the pipeline is run in normal mode (i.e., without the "--no_intervals" option), but when the user selects the --no_intervals option, shouldn't this be changed so that the -L argument in Mutect2 points to the --targetBed file (or nothing for whole genome)?

I guess what I'm also trying to understand is why the --targetBed targets are being passed to BamQC, Manta and Strelka, but not to Mutect2? Is it because variants called with Mutect2 outside the targets are filtered anyway in the concatVCF step later on?

Thanks
Lucia

Enable FilterMutect2Call without --pon

from @lconde-ucl (cf SciLifeLab#814)

Hi Max,

Thanks a lot. Yes, I understand that using a PON is recommended and is definitely of great help when you don't have matching normals. But when you do have matching tumour/normal pairs, it will just be another filter for systematic sequencing/protocol/pipeline artifacts.

But regardless of whether you use a PON, the filtering ("FilterMutectCalls" step) has to be done anyway, otherwise "mutect2" would only emit an unfiltered VCF. "FilterMutectCalls" provides the annotations for the variants called by mutect2 (either PASS or failed because of "germline_risk", "multiallelic", "clustered_events", "bad_haplotype", etc.), so irrespective of the use of a PON, it is needed to get a properly annotated VCF.

As an example, this is a VCF file that you would get after running mutect2 without any PON:

> gatk Mutect2 -R $ref -I $tumor.bam -I $normal.bam -tumor T -normal N  -O unfiltered.vcf --germline-resource ...

[...]
chr1	84490478	.	C	CT	.	.	DP=60;ECNT=1;NLOD=5.02;N_ART_LOD=-8.630e-01;POP_AF=1.000e-06;REF_BASES=CCCAAGTATCCTTTTTTTTTT;RPA=11,12;RU=T;STR;TLOD=10.72	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB	0/1:16,6:0.377:8,2:8,4:34,34:121,122:60:30:0.192,0.263,0.273:0.046,0.013,0.941	0/0:17,0:0.127:11,0:6,0:35,0:181,0:0:0
chr1	100206310	.	TAA	T,TA,TAAA	.	.	DP=263;ECNT=1;NLOD=-2.149e+00,-2.977e+01,4.27;N_ART_LOD=12.16,31.15,8.65;POP_AF=1.000e-06,4.452e-03,1.000e-06;REF_BASES=ATCTATTTTTTAAAAAAAAAA;RPA=13,11,12,14;RU=A;STR;TLOD=7.59,23.90,11.86	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB	0/1/2/3:51,7,16,9:0.135,0.212,0.158:15,4,4,3:36,3,12,6:29,28,29,23:122,100,107,104:60,60,60:17,34,17:0.182,0.172,0.193:0.019,0.011,0.970	0/0:59,8,22,12:0.119,0.223,0.158:20,5,6,3:39,3,16,9:32,33,32,32:186,193,183,154:60,60,60:29,25,26
chr1	102889509	.	T	G	.	.	DP=47;ECNT=1;NLOD=5.72;N_ART_LOD=-1.301e+00;POP_AF=1.000e-06;REF_BASES=CTCGGTCACCTTTTTCCCCTT;TLOD=25.83	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB	0/1:19,9:0.321:14,6:5,3:34,35:103,100:60:27:0.303,0.273,0.321:0.014,0.034,0.952	0/0:19,0:1.054e-04:9,0:10,0:35,0:180,0:0:0
[...]

And this is what you get after running filterMutect2Calls on the above file:

> gatk FilterMutectCalls -V unfiltered.vcf -O final.vcf

[...]
chr1	84490478	.	C	CT	.	str_contraction	DP=60;ECNT=1;NLOD=5.02;N_ART_LOD=-8.630e-01;POP_AF=1.000e-06;P_CONTAM=0.00;P_GERMLINE=-5.511e+00;REF_BASES=CCCAAGTATCCTTTTTTTTTT;RPA=11,12;RU=T;STR;TLOD=10.72	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB	0/1:16,6:0.377:8,2:8,4:34,34:121,122:60:30:0.192,0.263,0.273:0.046,0.013,0.941	0/0:17,0:0.127:11,0:6,0:35,0:181,0:0:0
chr1	100206310	.	TAA	T,TA,TAAA	.	artifact_in_normal;germline_risk;multiallelic	DP=263;ECNT=1;NLOD=-2.149e+00,-2.977e+01,4.27;N_ART_LOD=12.16,31.15,8.65;POP_AF=1.000e-06,4.452e-03,1.000e-06;P_CONTAM=0.00;P_GERMLINE=-5.710e+00,0.00,-1.101e+01;REF_BASES=ATCTATTTTTTAAAAAAAAAA;RPA=13,11,12,14;RU=A;STR;TLOD=7.59,23.90,11.86	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB	0/1/2/3:51,7,16,9:0.135,0.212,0.158:15,4,4,3:36,3,12,6:29,28,29,23:122,100,107,104:60,60,60:17,34,17:0.182,0.172,0.193:0.019,0.011,0.970	0/0:59,8,22,12:0.119,0.223,0.158:20,5,6,3:39,3,16,9:32,33,32,32:186,193,183,154:60,60,60:29,25,26
chr1	102889509	.	T	G	.	PASS	DP=47;ECNT=1;NLOD=5.72;N_ART_LOD=-1.301e+00;POP_AF=1.000e-06;P_CONTAM=0.00;P_GERMLINE=-6.212e+00;REF_BASES=CTCGGTCACCTTTTTCCCCTT;TLOD=25.83	GT:AD:AF:F1R2:F2R1:MBQ:MFRL:MMQ:MPOS:SA_MAP_AF:SA_POST_PROB	0/1:19,9:0.321:14,6:5,3:34,35:103,100:60:27:0.303,0.273,0.321:0.014,0.034,0.952	0/0:19,0:1.054e-04:9,0:10,0:35,0:180,0:0:0
[...]

The difference is that now the FILTER column (column 7) is filled in with annotations, and the variants that pass (only the last one in this case) can be identified. This is independent of the PON.

I hope this makes sense?

That's why I think the FilterMutectCalls step should always be done. If you feel you would like to enable it with a warning, that's fine. Or if you prefer to disable it by default and have an option to enable it without a PON, I think that is fine too. Whatever you think is best, but it would be great to give the user the option to use it without a PON.

Sorry for the long reply!
Many thanks as always,
Lucia

Missing Variant depths in MultiQC Report - Bcftools

With Sarek v2.5.2 on WES data (--genome 'GRCh38' --input sample_all.tsv --tools HaplotypeCaller,Strelka,Manta --targetBED exome_capture.bed), the Bcftools 'Variant depths' plot in the MultiQC report is empty. This would be nice to have in the future.

Do VQSR for HaplotypeCaller calls

Issue by @malinlarsson, moved from SciLifeLab#513

Filter the calls from HaplotypeCaller with Variant Quality Score Recalibration according to GATK best practice (tools VariantRecalibrator and ApplyRecalibration; see https://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr, cf https://gatk.broadinstitute.org/hc/en-us/articles/360037594511-VariantRecalibrator or a more recent version).
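
With the GATK4 tool names (ApplyRecalibration became ApplyVQSR), a minimal SNP-mode sketch might look like this (resource files, annotations and the tranche cut-off are placeholders that would need tuning):

gatk VariantRecalibrator \
    -R reference.fasta \
    -V joint_calls.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQ -an FS -an MQRankSum -an ReadPosRankSum \
    -mode SNP \
    -O snps.recal \
    --tranches-file snps.tranches

gatk ApplyVQSR \
    -R reference.fasta \
    -V joint_calls.vcf.gz \
    --recal-file snps.recal \
    --tranches-file snps.tranches \
    --truth-sensitivity-filter-level 99.7 \
    -mode SNP \
    -O joint_calls.snp_vqsr.vcf.gz

An analogous second pass with -mode INDEL would be needed for indels.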

Useful comment by @apeltzer

Keep in mind that you'd need a "bit" of data for doing VQSR properly. The recommendation was to use at least 30 WES samples or 1 WGS sample for performing VQSR.

I started working on a solution (in NGI-ExoSeq) to automatically download 35 of the 1000G exome datasets, run HaplotypeCaller on them, and use them for analyses with fewer than the minimum required 30 WES samples.

ASCAT executing R files

Hi, when running Sarek with multiple variant callers, it seems like the first one is picked and the rest are ignored. I ran it indicating Strelka and ASCAT, and ASCAT was just ignored.


nextflow run ggabernet/nf-core-sarek -r v2.5.2-branch \
--outdir 's3://qbic-bucket-virginia/resultsdirsarekicgc1' \
-w 's3://qbic-bucket-virginia/workdirsarekicgc1' \
--tracedir 's3://qbic-bucket-virginia/tracesarekicgc1' \
--input 's3://qbic-bucket-virginia/icgc-sarek/input-icgc-1.tsv' \
--genome 'GRCh38' \
--tools 'Strelka,ASCAT,snpEff' \
-c awsbatch.config \
--awsregion 'us-east-1' \
--igenomes_base 's3://qbic-bucket-virginia/references' \
--awscli '/home/ec2-user/miniconda/bin/aws' -resume

Target Coverage in BAMQC

Issue by @apeltzer, moved from SciLifeLab#670

Is your feature request related to a problem? Please describe.

The target coverage is computed in BAMQC based on the entire genome.
For exome data (even with a specified BED file and therefore regions), Sarek doesn't report the coverage on the covered sites but instead the overall coverage on the entire genome.

Describe the solution you'd like

Use the --gff switch in QualiMap2 to run with the specified BED file.
That provides more accurate coverage figures for target capture data.
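
Roughly (a sketch; QualiMap's bamqc takes the feature file via -gff, which also accepts BED):

$ qualimap bamqc -bam sample.recal.bam -gff exome_capture.bed -outdir sample_bamqc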

Comments:
@szilvajuhos

Cool, other groups were also requesting something similar. I already made a prototype; we can also add the QualiMap2 part. Will work on that.

@apeltzer

Just make sure to use a more recent QualiMap2 version.
The possibility to use BED-3 instead of BED-6 format was only introduced fairly recently (upon my request... https://bitbucket.org/kokonech/qualimap/commits/all ).
I don't know which version of QualiMap2 is shipped with the current Sarek container(s), so we should make sure that this works :-)

SamToFastq to unaligned BAM complains about unpaired mates

Hi, when starting sarek with unaligned BAM files, I get the following error in the MapReads process due to SamToFastq.

Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx372g -jar /opt/conda/envs/nf-core-sarek-2.5.2/share/gatk4-4.1.2.0-1/gatk-package-4.1.2.0-local.jar SamToFastq --INPUT=db2fe200f284c9d024ace1e110d4efa6.CHC1754N.sorted.bam --FASTQ=/dev/stdout --INTERLEAVE=true --NON_PF=true
11:40:52.633 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/envs/nf-core-sarek-2.5.2/share/gatk4-4.1.2.0-1/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Sat Feb 22 11:40:52 UTC 2020] SamToFastq  --INPUT db2fe200f284c9d024ace1e110d4efa6.CHC1754N.sorted.bam --FASTQ /dev/stdout --INTERLEAVE true --INCLUDE_NON_PF_READS true  --OUTPUT_PER_RG false --COMPRESS_OUTPUTS_PER_RG false --RG_TAG PU --RE_REVERSE true --CLIPPING_MIN_LENGTH 0 --READ1_TRIM 0 --READ2_TRIM 0 --INCLUDE_NON_PRIMARY_ALIGNMENTS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sat Feb 22 11:40:52 UTC 2020] Executing as [email protected] on Linux 4.14.165-133.209.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.2.0
[M::bwa_idx_load_from_disk] read 3171 ALT contigs
[W::main_mem] when '-p' is in use, the second query file is ignored.
[M::process] read 795736 sequences (100000250 bp)...
[M::process] 0 single-end sequences; 795736 paired-end sequences
INFO	2020-02-22 11:40:59	SamToFastq	Processed     1,000,000 records.  Elapsed time: 00:00:07s.  Time for last 1,000,000:    7s.  Last read position: chr1:3,679,276
...
...
INFO	2020-02-22 16:15:53	SamToFastq	Processed 1,004,000,000 records.  Elapsed time: 04:35:00s.  Time for last 1,000,000:   25s.  Last read position: chrY:59,079,046
[Sat Feb 22 16:15:54 UTC 2020] picard.sam.SamToFastq done. Elapsed time: 275.03 minutes.
Runtime.totalMemory()=144277241856
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Found 6534 unpaired mates
	at htsjdk.samtools.SAMUtils.processValidationError(SAMUtils.java:467)
	at picard.sam.SamToFastq.doWork(SamToFastq.java:211)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
	at org.broadinstitute.hellbender.Main.main(Main.java:291)
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 449, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (410, 495, 597)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (36, 971)
[M::mem_pestat] mean and std.dev: (496.58, 152.28)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1158)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::process] read 658604 sequences (82750538 bp)...
[M::mem_process_seqs] Processed 795956 reads in 1069.883 CPU sec, 22.743 real sec
[M::process] 0 single-end sequences; 658604 paired-end sequences
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (9, 47804, 251, 4)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (433, 512, 600)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (99, 934)
[M::mem_pestat] mean and std.dev: (515.67, 131.15)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1101)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (1392, 2857, 3468)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 7620)
[M::mem_pestat] mean and std.dev: (2465.89, 1275.67)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 9696)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 658604 reads in 1479.307 CPU sec, 30.950 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -K 100000000 -R @RG\tID:1\tPU:1\tSM:P1N\tLB:P1N\tPL:illumina -t 48 -M -p Homo_sapiens_assembly38.fasta /dev/stdin -
[main] Real time: 16546.077 sec; CPU: 697892.779 sec
[bam_sort_core] merging from 144 files and 48 in-memory blocks...

The command executed command.sh was:

gatk --java-options -Xmx372g SamToFastq --INPUT=db2fe200f284c9d024ace1e110d4efa6.CHC1754N.sorted.bam --FASTQ=/dev/stdout --INTERLEAVE=true --NON_PF=true | \
bwa mem -K 100000000 -R "@RG\tID:1\tPU:1\tSM:P1N\tLB:P1N\tPL:illumina"  -t 48 -M Homo_sapiens_assembly38.fasta         -p /dev/stdin - 2> >(tee db2fe200f284c9d024ace1e110d4efa6.CHC1754N.sorted.bam.bwa.stderr.log >&2) |         samtools sort --threads 48 -m 2G - > P1N_1.bam

I found some help from Samtools: in these cases either the BAM can be cleaned of unpaired reads before running SamToFastq, or the VALIDATION_STRINGENCY=SILENT flag can be used when running SamToFastq.
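
For the second option, the change to the command above would be something like this (a sketch, untested here):

gatk --java-options -Xmx372g SamToFastq --INPUT=input.bam --FASTQ=/dev/stdout \
    --INTERLEAVE=true --NON_PF=true --VALIDATION_STRINGENCY=SILENT | \
    bwa mem ...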

Is there any reason why you are not doing that? I can make a PR otherwise to fix that.

It would also be great to have the SamToFastq step separated from the mapping, so we can cache this process in case the mapping fails. Don't you think?

I'm happy to help if you think this is a good idea!

Ascat computes all possible N-T combinations

Hi, I was wondering why 25 ASCAT processes were started when I had just 5 tumor-normal pairs, and it turns out that ASCAT computes all possible N-T combinations. E.g.:

Ascat_CHC2113T_vs_CHC2113N
Ascat_CHC2113T_vs_CHC2115N
Ascat_CHC2113T_vs_CHC912N

I can also help with fixing that, just wanted to document it in an issue :)

Add Adaptor Clipping / Quality Trimming Step

Issue by @apeltzer, moved from SciLifeLab#664

Is your feature request related to a problem? Please describe.

Not really - but we had some discussions about it on the Gitter channel, for example that it might be helpful in certain cases to provide a means to clip adapters and also quality-trim reads before mapping.
I guess TrimGalore would be nice to have, but I also hear very good things about "Atropos".

Describe the solution you'd like

An optional step to perform read clipping/trimming before mapping in Sarek :-)

Describe alternatives you've considered

Externalize the process - but as it is probably easy to build in and some people might want it too, why not have it internalized + made optional :-)

Sex determination Option

Issue by @apeltzer, moved from SciLifeLab#688

Is your feature request related to a problem? Please describe.

If you don't know the sex of an individual, one could run simple scripts to perform sex determination.
This is commonly done in population genetics and there are quite a bunch of scripts that perform it automatically, more or less checking chrX/chrY ratios against thresholds.

Describe the solution you'd like

Maybe have an optional step to run sex determination, as we do have cases where we don't have the information...
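
A crude version of such a check (a hedged sketch; the threshold interpretation would need calibrating, and contig names depend on the reference):

$ samtools idxstats sample.bam | awk '
    $1 == "chrX" { x = $3 }
    $1 == "chrY" { y = $3 }
    END { printf "chrX/chrY mapped-read ratio: %.1f\n", x / (y + 1) }'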

Rename GRCh38 > hg38 in igenomes.config

The Broad/GATK ships their genome resource bundles relative to UCSC builds i.e. hg19 and hg38 and these are also used in Sarek. However, they have been named according to the Ensembl naming convention i.e. GRCh38 and GRCh37. This is slightly misleading and also adds to the confusion when trying to figure out how the pipeline is using and providing its genome reference data.

The keys in igenomes.config for these assemblies will have to be changed from GRCh38 > hg38 and GRCh37 > hg19, and any custom files added for use with the pipeline will also have to be renamed appropriately (and uploaded to AWS iGenomes!).

Should we use -Y option for bwa mem?

Issue by @maxulysse, moved from SciLifeLab#770

Use soft clipping CIGAR operation for supplementary alignments.
By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments.

We're not using this -Y option for bwa mem, but I looked more into the GATK best-practice workflow, and they're using it.

So I'm wondering, should we use that or not?
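
For comparison, the GATK-style invocation would differ from ours only by the added flag (placeholder read group and inputs):

bwa mem -K 100000000 -Y -R "@RG\tID:1\tSM:sample" -t 16 reference.fasta \
    sample_R1.fastq.gz sample_R2.fastq.gz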

Enable joint variant calling for germline samples

Allow calling of germline short variants jointly on multiple samples as outlined here https://software.broadinstitute.org/gatk/documentation/article?id=11019. This approach is expected to increase accuracy and sensitivity as it makes use of population-wide information.

Each sample is first called individually using HaplotypeCaller in -ERC GVCF mode. GenotypeGVCFs is then run on all g.vcf files together. See the GATK documentation here: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145
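
In GATK terms, that two-step flow looks roughly like this (a sketch with placeholder file names):

gatk HaplotypeCaller -R reference.fasta -I sample1.bam -O sample1.g.vcf.gz -ERC GVCF
gatk HaplotypeCaller -R reference.fasta -I sample2.bam -O sample2.g.vcf.gz -ERC GVCF
gatk CombineGVCFs -R reference.fasta -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz
gatk GenotypeGVCFs -R reference.fasta -V cohort.g.vcf.gz -O cohort.vcf.gz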

Final delivery checks

Issue by @szilvajuhos, moved from SciLifeLab#448

We are delivering variant calls and annotations at the very end.
We need a final step/utility to check that all the VCF/annotation/QC/alignment files are there.

Do we really need that?
