
pipelines's Introduction

MalariaGEN pipelines

This repository is for developing portable open source pipelines for processing malaria parasite and mosquito genome sequence data.

This repository is jointly maintained by teams at the Wellcome Sanger Institute, Oxford University and the Broad Institute.

pipelines's People

Contributors

ahernank, alimanfoo, gbggrant, gvandesteeg, jessicaway, jonkeatley112, jorels1, kevinpalis, lfulcrum, oliver-lorenz-dev, samuelklee, seretol, tnguyensanger


pipelines's Issues

Copy script to convert vcf to zarr to this repo

Copy script from https://github.com/malariagen/legacy_pipelines/blob/master/prod-tools/scripts/vcf_to_zarr.py to this repository.

@gbggrant All the legacy scripts used by vrpipe (the legacy workflow management system) can be found in https://github.com/malariagen/legacy_pipelines in case you want to refer to them.

The legacy vcf_to_zarr.py script needs edits for the new vector pipelines, which we will review in a PR:

  • Previous versions of zarr (until at least zarr v2.1.4) set permissions on files underneath a zarr directory to be read-write by user only, regardless of your umask. The newest version, zarr v2.4.0, obeys your umask. The legacy vcf_to_zarr.py script contains code that traverses the directory tree to explicitly set permissions; this code is no longer required.
  • The vcf_to_zarr.zip_zarr() function zips the converted zarr using the Python zipfile package with zip64 extensions disabled. This was required on legacy machines whose old versions of zip did not support zip64 extensions. zipfile requires zip64 extensions when the zipped file is > 4GiB (according to the v3.8.3 zipfile documentation, https://docs.python.org/3/library/zipfile.html). At the time we were creating small zip files < 4GiB, so it was not an issue. Either we enable zip64 extensions to allow for larger zip files (see the sketch below), or we scrap the Python zipfile package and use a recent version of zip directly, whichever turns out to be easier.
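A minimal sketch of the zip64-enabled option, assuming the zarr store is a plain directory store; the function name mirrors zip_zarr() but the body is illustrative, not the legacy implementation:

import os
import zipfile

def zip_zarr(zarr_path, zip_path):
    # allowZip64=True (the default in Python 3) lets zipfile write archives
    # and members larger than 4GiB
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_STORED,  # zarr chunks are already compressed
                         allowZip64=True) as zf:
        for root, _dirs, files in os.walk(zarr_path):
            for name in files:
                full = os.path.join(root, name)
                # store paths relative to the zarr root so the archive unzips cleanly
                zf.write(full, arcname=os.path.relpath(full, zarr_path))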

Determine final lanelet manifest fields required for vector alignment and genotyping

We need to solidify the final manifest fields required for vector alignment and genotyping. Right now, we are using fields found in https://github.com/malariagen/pipelines/blob/master/pipelines/short-read-alignment-and-genotyping-vector/farm5/input_files/validation/validation_lanelets_farm5_local.tsv

  • sample_id
  • run_ena
  • irods_path
  • bam_path
  • cram_path
  • read1_path
  • read2_path

We don't need all these fields for the production manifest passed in to the batch-irods import-alignment-genotyping vector pipelines. At the moment, we probably only need:

  • sample_id
  • irods_path

However, run_ena is used as an input in the per-lanelet-alignment vector pipelines.

We will need to solidify the final manifest inputs needed to keep track of samples and their progress in the production system. For example, we don't need the study in the input manifest to run the pipelines, but will we need the study stored as a workflow label/metadata so that we can query for it later?

Fix Picard location issue for Singularity

Add the changes made to the Picard Dockerfile by @gbggrant to resolve the location issue when using a singularity container.

Picard was stored by default in /root, but because Singularity executes with the invoking user's credentials, the jar is impossible to access unless the container is executed as root. The change proposed by @gbggrant is to move the jar from /root to /bin.

Cromwell configurations on farm5 are using localization = copy, which explodes the size of the workflow working directory

When running against an initial 28 samples from the actual AG3.2 dataset (https://github.com/malariagen/vector-ops/blob/master/tracking/1177-VO-ML-LEHMANN-VMF00015/wgs_lanelets.tsv), we noticed that the disk usage was > 300GB / sample.

Part of this is because files from one task are copied over as inputs for the next task, doubling the disk usage.

Change the cromwell server config such that localization = soft-link (see the sketch below). Check that the input files are symlinked from the output of the previous task, and that this happens for all tasks, including the first task. In the past, we've noticed that the WDL will copy over the initial WDL input files regardless of the localization config.
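A hedged sketch of where the setting lives in the Cromwell backend configuration; the provider name LSF is an assumption for farm5, and Cromwell tries the listed strategies in order:

backend {
  providers {
    LSF {
      config {
        filesystems {
          local {
            # try soft-link first, fall back to hard-link, then copy
            localization: ["soft-link", "hard-link", "copy"]
          }
        }
      }
    }
  }
}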

Use flexible fields in manifest

The mandatory fields in the manifest are sample_id and irods_path; however, their order should not matter. Also, if a user decides to add extra fields for their own use, that should not matter either (see the parsing sketch below). From comments on pull request #67 (comment)
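A minimal sketch of order-agnostic manifest parsing, assuming a tab-separated file; the REQUIRED set mirrors this issue, everything else is illustrative:

import csv

REQUIRED = {"sample_id", "irods_path"}

def read_manifest(path):
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"manifest is missing mandatory fields: {sorted(missing)}")
        # extra, user-defined columns are carried along untouched
        return list(reader)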

How to make bwa short read alignment deterministic for testing

@gbggrant requested ways to make bwa short read alignment deterministic for testing.

Trawling through the internet indicates that we can make bwa deterministic if we follow all of these steps:

  1. Always use the same fastq with the reads in the same order

  2. Always use the same machine and bwa settings. Alternatively, if we need to change the machine and the number of threads given to bwa, we can explicitly set the hidden bwa mem option -K to fix the chunk size (see the sketch below). If -K is not set, the chunk size is determined by the number of threads, and this makes a difference to the final bam because insert sizes are estimated per chunk.
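A hedged sketch of pinning the chunk size; -K 100000000 is the value used in the CCDG functional-equivalence discussion linked below, and the file names are placeholders:

import subprocess

cmd = [
    "bwa", "mem",
    "-K", "100000000",  # fixed bases per chunk, decoupling output from -t
    "-t", "8",
    "ref.fa", "reads_1.fastq.gz", "reads_2.fastq.gz",
]
with open("aligned.sam", "wb") as out:
    subprocess.run(cmd, stdout=out, check=True)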

See past comments from Heng Li:
lh3/bwa#192
lh3/bwa#272
lh3/bwa#121

Also see this Centers for Common Disease Genomics (CCDG) repo issue:
CCDG/Pipeline-Standardization#2

Genotype Concordance Validation of genotyping pipelines

Check that genotypes from the new WDL pipelines are >99% concordant with genotypes from the legacy pipelines for the same samples.

Legacy pipeline zarrs can be found on Sanger lustre:

/lustre/scratch118/malaria/team112/pipelines/setups/vo_agam/symlink/vo_agam_vcf2zarr/convert_vcf_to_zarr_serial_contig

Scripts that check genotype concordance for parasite release zarrs can be found at: https://gitlab.com/malariagen/gsp/pf7.0/-/blob/master/work/59_genotype_concordance_script/genotype_concordance.py

Note parasite release zarrs use a different schema from mosquito release zarrs. At the moment, the script will crash when it runs on mosquito zarrs.
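For reference, a minimal sketch of the underlying per-site check between two zarr call sets; the calldata/GT layout follows the usual scikit-allel convention and is an assumption, not either release schema:

import numpy as np
import zarr

legacy = zarr.open_group("legacy.zarr", mode="r")
new = zarr.open_group("new.zarr", mode="r")

gt_legacy = legacy["calldata/GT"][:]  # shape: (variants, samples, ploidy)
gt_new = new["calldata/GT"][:]

# a call is concordant when every allele matches for a sample at a site
match = np.all(gt_legacy == gt_new, axis=-1)
print(f"concordance: {match.mean():.4%}")
assert match.mean() > 0.99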

Input source for mosquito short read alignment pipeline

For the mosquito short read alignment pipeline, I believe there are two main options regarding the input data source for the sequence reads:

  • FASTQ files from ENA
  • BAM/CRAM files from IRODS

There are some practical trade-offs in these choices:

  • Using FASTQ as input is simpler because no prior conversion is required; using BAM/CRAM makes for a bit more pipeline engineering, because the BAM/CRAM needs to be converted to FASTQ (see the sketch after this list)
  • Using FASTQ from ENA is more portable, as ENA is accessible from anywhere whereas IRODS is only accessible from within the WSI network (I believe)
  • Using IRODS may be faster because data transfer to WSI lustre may be faster, either because of better network connectivity or more efficient transfer protocol -- this needs to be investigated and confirmed
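A hedged sketch of the extra conversion step the BAM/CRAM route implies, driving samtools collate and samtools fastq from Python; all file names are placeholders:

import subprocess

# group read pairs together, then split into gzipped R1/R2 fastq files;
# non-paired categories (-0) and singletons (-s) are discarded here
subprocess.run(
    "samtools collate -u -O input.cram tmp_prefix"
    " | samtools fastq -1 reads_1.fastq.gz -2 reads_2.fastq.gz"
    " -0 /dev/null -s /dev/null -n -",
    shell=True, check=True,
)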

Port baseline Plasmodium amplicon SNP calling pipeline.

The goal is to establish a baseline pipeline for comparison by porting https://gitlab.com/malariagen-aspis/aspis-pipeline/-/tree/master/Stage1-Pipeline with minimal functional changes.

Note that "Stage 1" encompasses BCL -> VCF and is broken into 3 steps, roughly: 1) BCL -> unaligned CRAMs, 2) -> aligned CRAMs, 3) -> pileup-based SNP calling using BCFtools. Methods improvements will focus on step 3, so this initial pipeline will only consist of this step.

  • Create a single Docker image containing a conda environment with all necessary tools. Note that step 3 only requires BCFtools, but this Docker image will include the tools necessary for the other steps as well; the exception is the python-based tools for processing manifest files in step 1 (see https://gitlab.com/malariagen-aspis/aspis-pipeline/-/blob/e8b35283ad70c41b4d0c9c9a3a660e31fff4431b/Docker/ManifestTools/Dockerfile). We can strip this down to just BCFtools or add the manifest tools later, if needed. At this stage, I'd rather not work with separate Docker images for each tool (as it seems we are currently doing for the vector pipeline), but happy to discuss pros/cons.
  • Rewrite the step 3 WDLs in a style consistent with #31 for deployment on Farm5 and GCP. Since this pipeline is short-running, this may include collapsing of tasks. EDIT: Gonna punt on Farm5; now that I see what's involved, I don't think it should be too hard to go back and add this once things settle.
  • Expose additional parameters.
  • Create CRAM snippets to serve as plumbing data (see #33). EDIT: CRAMs are small enough that we probably don't need to snippet.
  • Manually test on Farm5 and GCP.

It is likely that the result will be a single-sample WDL for step 3. We can file a subsequent issue to rework steps 1 and 2 as necessary.
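For orientation, a hedged sketch of the pileup-based calling that step 3 performs; the reference, input cram and options are placeholders rather than the aspis-pipeline's actual parameters:

import subprocess

# pile up reads against the reference, then emit variant sites only
subprocess.run(
    "bcftools mpileup -f ref.fasta sample.cram"
    " | bcftools call -m -v -Oz -o sample.vcf.gz",
    shell=True, check=True,
)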

Provide example crams released by Core Sequencing

Share example mosquito crams released from Core Sequencing with @gbggrant and @kbergin.

If possible, kill two birds with one stone and try to find the smallest valid crams possible, to also solve #25. Querying and sorting IRODS by file size may be a bit fiddly; instead, let's choose crams with the fewest reads according to sample metadata from past releases.

Minimal input to validate plumbing of vector short read pipelines

@gbggrant suggested we provide a minimal input so that we can run the vector short read alignment pipeline in under an hour. At present, the small bams the malaria programme has provided take at least 8 hours to process on the Broad Google Cloud infrastructure.

The minimal inputs need not be scientifically valid; they only need to test that the pipelines are connected properly.

For legacy pipelines, we tested plumbing by using small sample test bams. I would share those, but the legacy pipelines were originally developed for human sequencing, and so the test bams were actually downsampled human bams. For these pipelines, let's try downsampling the mosquito crams.

I suspect that randomly downsampled crams can easily be used to test short read alignment and genotyping (see the sketch below). However, we may need to ensure there are sufficient genotypes after genotyping to test phasing without crashing the phasing software.
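A hedged sketch of random downsampling with samtools; the seed, fraction and file names are illustrative:

import subprocess

subprocess.run(
    ["samtools", "view",
     "-C", "-T", "ref.fasta",  # emit CRAM; CRAM output needs the reference
     "-s", "42.001",           # seed 42, keep ~0.1% of reads
     "-o", "plumbing_test.cram",
     "full_sample.cram"],
    check=True,
)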

Mosquito short read alignment pipeline - test dataset

Part of #2 but breaking this out as a separate issue for convenience. Identify a suitable dataset to use for testing and validation purposes for the short read alignment pipeline.

  • fastq sequence read files to use as inputs
  • expected output bams
  • reference genome files

Add platform argument to build_pipeline_release script

Currently the pipeline wdl files for gcp and farm5 are extremely similar. It would be great to merge these into one file and pass an argument to the build_pipeline_release script specifying the platform to generate the wdl for (see the sketch below).

One concern is that making the pipeline wdl platform-agnostic (for instance, by adding a placeholder field for the platform to be substituted) might make the wdl invalid for our wdl checking tool (miniwdl).
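A minimal sketch of the proposed interface; only the added --platform argument is shown, and the substitution step is a placeholder for whatever build_pipeline_release actually does:

import argparse

parser = argparse.ArgumentParser(description="Build a pipeline release")
parser.add_argument("--platform", choices=["gcp", "farm5"], required=True,
                    help="platform to generate the wdl for")
args = parser.parse_args()
# ...substitute the platform-specific runtime blocks into the template wdl...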

Update vector spec to reflect removal of calmd and usage of Picard SetNmMdAndUqTags tool

The initial version of the pipeline generated an output bam that failed ValidateSamFile due to a difference between how samtools' calmd tool and picard's ValidateSamFile tool calculate the value of the NM tag. See: samtools/samtools#717 (comment)

After discussion with @alimanfoo and others, and with the understanding that the NM tag is not used elsewhere (particularly in gatk 3's UnifiedGenotyper), the decision was made to remove the usage of calmd and replace it with Picard's SetNmMdAndUqTags tool.

Convert cram to bam in vector alignment pipeline

We need to reduce the peak size of the vector alignment pipeline so that we can run more pipelines in parallel without running out of disk. Peak size refers to the total size of all files generated by the pipeline, including intermediate files. There is no built-in way for cromwell to delete intermediate files as we go along, so by default we keep all intermediate files until a workflow is complete.

One way to reduce peak size is to use CRAM instead of BAM: CRAMs are usually about a third the size of BAMs.

Verify that the output crams are lossless with respect to the information in the bams (see the sketch below). Verify that the genotyping pipeline is not affected by the use of crams in the alignment pipeline.
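A hedged sketch of a bam -> cram -> bam round-trip check; comparing samtools flagstat output is only a cheap first pass (a real validation would diff full records), and all file names are placeholders:

import subprocess

def flagstat(path):
    return subprocess.run(["samtools", "flagstat", path],
                          capture_output=True, text=True, check=True).stdout

subprocess.run(["samtools", "view", "-C", "-T", "ref.fasta",
                "-o", "aligned.cram", "aligned.bam"], check=True)
subprocess.run(["samtools", "view", "-b", "-T", "ref.fasta",
                "-o", "roundtrip.bam", "aligned.cram"], check=True)
assert flagstat("aligned.bam") == flagstat("roundtrip.bam")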

The pertinent files are anything related to alignment, starting from the possible workflow WDL entry points.

Follow the entrypoint WDLs to find all the dependent WDLs that may need modifying.

See https://github.com/malariagen/pipelines/blob/master/docs/specs/short-read-alignment-vector.md for alignment pipeline specs.

Validate results for phasing pipeline

from @gbggrant

I've run the 167 Burkina Faso samples through the phasing pipeline. I haven't yet integrated your cohort_vcf_to_zarr script, but the VCFs for 2R and 3R are at:
gs://malariagen/Phasing/outputs/Ag1000Phase2_BurkinaFaso/Ag1000Phase2_BurkinaFaso_2R_phased.vcf.gz and
gs://malariagen/Phasing/outputs/Ag1000Phase2_BurkinaFaso/Ag1000Phase2_BurkinaFaso_3R_phased.vcf.gz - full outputs for the pipeline can be found at gs://malariagen/Phasing/outputs/Ag1000Phase2_BurkinaFaso/ - let me know if you can't read these (I can download them to lustre).

@hardingnj to run an H12 scan, and check that we see the signals of natural selection we expect (see the sketch below).
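A hedged sketch of an H12 scan over a phased call set using scikit-allel; the zarr path, field layout and window size are assumptions:

import allel
import zarr

callset = zarr.open_group("phased.zarr", mode="r")
gt = allel.GenotypeArray(callset["calldata/GT"][:])
haps = gt.to_haplotypes()
# Garud's H statistics in moving 1000-SNP windows; h12 is sensitive to
# both hard and soft selective sweeps
h1, h12, h123, h2_h1 = allel.moving_garud_h(haps, size=1000)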

Mosquito short read alignment pipeline

This issue is intended to cover the implementation of a pipeline for producing analysis-ready alignments from paired-end short read Illumina sequencing of Anopheles mosquitoes.

Some tasks (not necessarily exhaustive):

  • Import the spec into this repo (#7)
  • Decide on a small dataset for testing purposes and import it into the github repo, or make it available some other way if too big (e.g., sequence reads for a couple of samples, reference genome, etc.) (#6, #9)
  • Add dockerfiles for tools needed (#11)
  • Reference implementation of pipeline (#14)
  • Validation of pipeline using expected outputs from test dataset (#22)

Number of SNPs and indels doesn't match the numbers reported in README files

Dear MalariaGen Team

Apologies if I am raising this question in the wrong repository. I tried to download the P. vivax variant data (Pv4 dataset) from https://www.malariagen.net/resource/30. I am interested in using this data for the Base Quality Score Recalibration (BQSR) step of the GATK germline variant calling pipeline. I merged the per-chromosome variant files and then filtered the merged variant file for high-confidence variants (SNPs and indels) as recommended in the README file. bcftools stats shows the correct number of total variants and samples; however, the number of SNPs reported by bcftools stats differs from what is reported in the README. Can you help me understand why? I am also sharing the code used below.

###############Bcftools ####################
SN [2]id [3]key [4]value
SN 0 number of samples: 1895
SN 0 number of records: 1303984
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 1117804
SN 0 number of MNPs: 0
SN 0 number of indels: 254148
SN 0 number of others: 0
SN 0 number of multiallelic sites: 331817
SN 0 number of multiallelic SNP sites: 33748

##############################README
The VCF files contain details of 4,571,056 discovered variant genome positions.
These variants were discovered amongst all samples from the release.
3,083,454 of these variant positions are SNPs, with the remainder being either
short insertion/deletions (indels), or a combination of SNPs and indels. It is
important to note that many of these variants are considered low quality. Only
the variants for which the FILTER column is set to PASS should be considered of
reasonable quality. There are 1,303,984 such PASS variants of which 945,649
are SNPs and 358,335 indels.

wget ftp://ngs.sanger.ac.uk/production/malaria/Resource/30/Pv4_vcf/*
ls -1 *.vcf.gz > vcf.list
#merging vcfs
picard GatherVcfs I=vcf.list O=merged.vcf.gz
gatk IndexFeatureFile -I merged.vcf.gz 

# using only high confidence variants for BQSR
gatk SelectVariants -R ../results/00_indexes/bwaidx/PlasmoDB-50_PvivaxP01_Genome.fasta -V merged.vcf.gz --exclude-filtered true -O filter.vcf.gz
## Complete output is attached below
bcftools stats  filter.vcf.gz > filter.bcftoolstats.txt 

Assembly used: PlasmoDB-50_PvivaxP01_Genome.fasta from PlasmoDB
gatk4-4.2.6.1-1
bcftools: Version: 1.10.2 (using htslib 1.10.2-3)
filter.bcftoolstats.txt

Repo file and folder structure

We anticipate that this repository will store all code artefacts necessary for defining data pipelines. E.g., WDL files, docker files, test datasets, pipeline documentation, etc. What file and folder structure should we use for organising these? Any other conventions we need to agree?
