csiro-crop-informatics / repset

Reproducible evaluation of short read mappers

License: GNU General Public License v3.0

Languages: Nextflow 49.81%, Groovy 29.67%, R 8.96%, Dockerfile 6.61%, Shell 4.95%

Topics: rmarkdown, biokanga, bioinformatics, alignment-algorithm, bioinformatics-analysis, bioinformatics-pipeline, bioinformatics-containers, singularity, aws-batch, pipeline

repset's Introduction

REPSET

REPSET or RREPSET or R²EPSET is a Reproducible, Reusable, Extensible, Portable and Scalable Evaluation Tool for short read aligners



Dependencies

  • Nextflow - you may consider using the exact version of Nextflow by setting the environment variable export NXF_VER=19.10.0 before running the workflow.
  • and
    • either Singularity
    • or Docker
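
For example, to pin the Nextflow version and install the launcher in the current directory (the curl command is the standard Nextflow installer; skip it if nextflow is already installed):

export NXF_VER=19.10.0                 # optional: use the exact Nextflow version
curl -s https://get.nextflow.io | bash # installs the ./nextflow launcher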

Preliminaries

The pipeline consists of several partly dependent paths which facilitate the evaluation of mappers using either DNA- or RNA-Seq data, either real or simulated. The paths can be executed separately or in a single run. When running paths separately or re-running the pipeline, the -resume flag ensures that previously computed results (or partial results) are re-used.

Default execution will simulate, align and evaluate reads from a few small data sets defined in conf/simulations.config.

Note on terminology (mapping vs alignment)

Terms related to read mapping and alignment, such as pseudoalignment and quasi-mapping, are used inconsistently in bioinformatics (which would make many a mathematician cringe). We hereby attempt to strictly follow this convention by consistently propagating these inconsistencies.

Running the pipeline

Execution profiles

There are several ways to execute the pipeline, each of which requires Nextflow and either Docker or Singularity. See nextflow.config for the available execution profiles (or to add your own!), e.g. for local execution this could be

Running with docker

nextflow run csiro-crop-informatics/repset -profile docker

Running with singularity (locally or on a Slurm cluster)

To run the workflow with Singularity on

  • a local machine,
  • a standalone server, or
  • in an interactive session on a cluster,

first make sure that a recent version of Singularity is available and then run

nextflow run csiro-crop-informatics/repset -profile singularity

On a Slurm cluster you can run

nextflow run csiro-crop-informatics/repset -profile slurm,singularity

Note! Multiple container images will be pulled in parallel from Docker Hub (and potentially other repositories). A bug (?) in singularity may cause the parallel processing to fail with an error message similar to

Error executing process > 'indexGenerator ([[species:Saccharomyces_cerevisiae, version:R64-1-1.44, seqtype:DNA], [tool:hisat2, version:2.1.0]])'

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name rsuchecki-hisat2-2.1.0_4cb1d4007322767b562e98f69179e8ebf6d31fb1.img docker://rsuchecki/hisat2:2.1.0_4cb1d4007322767b562e98f69179e8ebf6d31fb1 > /dev/null
  status : 255

Until this is fixed, our workaround is to run the following prior to running the main script:

nextflow run csiro-crop-informatics/repset/pull_containers.nf

which will pull most of the containers used by the workflow (the remaining ones are unlikely to be pulled in parallel).

Note that singularity must be available on the node where you execute the pipeline, e.g. by running module load singularity/3.2.1 prior to running the pipeline. It is also required on each compute node; your cluster configuration should ensure that. If it does not, the additional execution profile singularitymodule can be modified in nextflow.config to match your singularity module name and used at run-time.
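
For illustration, such a profile in nextflow.config may look roughly like the sketch below (the module name is an assumption - adjust it to whatever your cluster provides):

profiles {
  singularitymodule {
    process.module = 'singularity/3.2.1'   // load the singularity module on every compute node
    singularity.enabled = true
  }
}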

Running on AWS batch

If you are new to AWS Batch and/or Nextflow, follow this blog post; once you are done, or if you already use AWS Batch, simply run

nextflow run csiro-crop-informatics/repset \
  -profile awsbatch \
  -work-dir s3://your_s3_bucket/work

after replacing your_s3_bucket with a bucket you have created on S3.

To reduce potential connectivity issues you may consider running the workflow from an EC2 instance. This may guide your decision on where the result file(s) should be placed. If you wish to deposit the result file(s) on s3, you can specify e.g. --outdir s3://your_s3_bucket/results, otherwise you can find them under ./results.
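
For example, to also deposit the results on S3 (assuming the same bucket), you could run:

nextflow run csiro-crop-informatics/repset \
  -profile awsbatch \
  -work-dir s3://your_s3_bucket/work \
  --outdir s3://your_s3_bucket/results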

Warning! You will be charged by AWS according to your resource use.

Mapping modes

The workflow incorporates three ways of mapping and evaluating reads: dna2dna, rna2rna and rna2dna; by default all are executed. To restrict execution to one or two of these, you can run the workflow with e.g.

  • --mapmode dna2dna - evaluate DNA-Seq read mapping to the genome
  • --mapmode 'rna2rna|rna2dna' - evaluate RNA-Seq read mapping to the transcriptome and the genome, respectively
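
For example, to evaluate only DNA-Seq read mapping under the docker profile:

nextflow run csiro-crop-informatics/repset -profile docker --mapmode dna2dna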

Evaluated mappers

An alignment/mapping tool is included in the evaluation if appropriate templates are included as specified below in Adding another mapper. To execute the workflow for only a subset of the available tools, you can specify e.g.

  • --mappers star - only evaluate a single tool
  • --mappers 'bwa|bowtie2|biokanga' - evaluate a subset of tools
  • --mappers '^((?!bwa).)*$' - evaluate all tools except bwa

Other regular expressions can be specified to tailor the list of evaluated tools.
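
For example, to evaluate everything except bwa under the singularity profile:

nextflow run csiro-crop-informatics/repset -profile singularity --mappers '^((?!bwa).)*$'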

Alternative input data sets

To run the pipeline with alternative input data you can use the -params-file flag to specify a JSON or a YAML file overriding the definitions in conf/simulations.config, for example

nextflow run main.nf -params-file path/to/conf/simulations.json

Alternatively, you can simply edit the content of conf/simulations.config.
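
For illustration, a minimal params file might look like the sketch below (the values are assumptions); generic parameters such as mappers or mapmode can be overridden the same way, while the data set definitions themselves should mirror the structure used in conf/simulations.config (not reproduced here):

mappers: 'bwa|bowtie2'
mapmode: 'dna2dna'
outdir: './results'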

Computational resources

Resources required for running the workflow can be substantial and will vary greatly depending on multiple factors, such as

  • input genomes sizes
  • simulated read coverage
  • number and choice of mappers evaluated
  • mapping mode(s) selected

We have empirically derived simple functions to auto-scale resource requests for key processes such as genome indexing and read mapping. These depend on either the reference size or the number of reads and were based on the tools which were slowest or required the most memory; if e.g. a slower tool is added, they will need to be revised.

Failed tasks are re-submitted with increased resources only as long as valid comparisons can still be made between different tools' results. For that reason, CPU and memory limits are not increased on re-submission of the mapping process, but the maximum allowed wall-clock time is. In the case of the indexing process, the initial time and memory limits are increased on each task re-submission, as indexing performance is not well suited for comparisons anyway: many indexing processes are single-threaded, and in other cases it might make sense to skip the indexing process altogether and allow for on-the-fly index generation.

Resource auto-scaling is subject to constraints which may need to be adjusted for a particular compute environment, either at run time (e.g. --max_memory 64.GB --max_cpus 32 --max_time 72.h) or by editing conf/requirements.config, where the dynamic scaling functions can also be adjusted.
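
For illustration only, per-process auto-scaling of this kind typically looks like the sketch below in a Nextflow config (the process name mapSimulatedReads and all values are assumptions; the real functions and constraints live in conf/requirements.config):

process {
  withName: indexGenerator {
    errorStrategy = 'retry'
    maxRetries = 3
    memory = { 8.GB * task.attempt }  // indexing: memory and time both grow on re-submission
    time = { 4.h * task.attempt }
  }
  withName: mapSimulatedReads {
    errorStrategy = 'retry'
    maxRetries = 2
    cpus = 8                          // fixed, so mapping runs remain comparable
    memory = 16.GB                    // fixed for the same reason
    time = { 8.h * task.attempt }     // only the wall-clock limit grows
  }
}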

Capturing results and run metadata

Each pipeline run generates a number of files including

  • results in the form of report, figures, tables etc.
  • run meta data reflecting information about the pipeline version, software and compute environment etc.

These can be simply collected from the output directories but for full traceability of the results, the following procedure is preferable:

  1. Fork this repository

  2. (Optional) select a tagged revision or add a tag (adhering to the usual semantic versioning approach)

  3. Generate a GitHub access token which will allow the pipeline to create releases in your forked repset repository; when creating the token it should suffice to select only the following scope:

    public_repo Access public repositories

    (assuming your fork of repset remains public)

  4. Make the access token accessible as an environment variable

  5. Run the pipeline from the remote repository, specifying

    • the --release flag
    • the appropriate -profile
    • the intended revision e.g. -revision v0.9.10 (optional)

For example,

GH_TOKEN='your-token-goes-here' nextflow run \
  user-or-organisation-name/repset-fork-name \
  -profile singularity \
  -revision v0.9.10 \
  --release

On successful completion of the pipeline a series of API calls will be made to

  1. create a new release
  2. upload results and meta data files as artefacts for that release
  3. finalise the release (skipped if --draft flag used)

The last of these calls will trigger minting of a DOI for that release if Zenodo integration is configured and enabled for the repository. To keep your release as a draft, use the --draft flag.

Experimental pipeline overview

figures/dag.png

Execution environment

Execution environment is captured in runmeta.json.

Adding another mapper

A mapper may be included for any or all of the mapping modes (dna2dna, rna2dna, rna2rna). In each case the same indexing template will be used.

After you have cloned this repository add another entry in conf/mappers.config, under

params {
  mappersDefinitions = [
    //insert here
  ]
}

For example, to add a hypothetical my_mapper version 1.0 you might add the following:

[
  tool: 'my_mapper',
  version: '1.0',
  container: 'path/to/docker/repository/my_mapper:1.0',
  index: 'my_mapper index --input-fasta ${ref} --output-index ${ref}.idx',
  dna2dna:
  '''
  my_mapper align --index ${ref} \
  -1 ${reads[0]} -2 ${reads[1]} \
  --threads ${task.cpus} \
  ${ALIGN_PARAMS} > out.sam
  '''
],

Additional script templates can be added for rna2rna and rna2dna mapping modes. Script templates must be wrapped in either single ('script') or triple single ('''script''') quotes. If you would rather keep the templates in separate files follow these instructions.

Template variables

Applicable nextflow (not bash!) variables resolve as follows:

Indexing

  • ${task.cpus} - number of CPU threads available to the indexing process
  • ${ref} - the reference FASTA file name - we use it both to specify the input file and the base name of the generated index

Mapping

  • ${task.cpus} - number of logical CPUs available to the alignment process
  • ${ref} - base name of the index file (sufficient if the aligner uses the base name to find a multi-file index; otherwise an appropriate extension may need to be appended, e.g. ${ref}.idx)
  • ${reads[0]} and ${reads[1]} - file names of paired-end reads
  • ${ALIGN_PARAMS} - any additional params passed to the aligner
    • Empty by default, but one or more sets of params can be defined in conf/mapping_params.config. When multiple sets of params are specified, each set is used in a separate execution.

Separate template files (optional)

If you would rather keep the templates in separate files rather than embedding them in conf/mappers.config, you can place each file under the appropriate templates/<mapmode> directory (e.g. templates/rna2dna/),

and instead of including the script template string directly in conf/mappers.config as we did above, set

  • rna2dna: true, which will be resolved to templates/rna2dna/my_mapper.sh

or, when using a different file name,

  • rna2dna: 'foo_bar.sh', which will be resolved to templates/rna2dna/foo_bar.sh
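
For example, the my_mapper entry from above could then read as follows (the embedded script string is replaced by the template reference):

[
  tool: 'my_mapper',
  version: '1.0',
  container: 'path/to/docker/repository/my_mapper:1.0',
  index: 'my_mapper index --input-fasta ${ref} --output-index ${ref}.idx',
  rna2dna: true //script template read from templates/rna2dna/my_mapper.sh
],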

See the header of conf/mappers.config for more details and limitations.

Non-core mapping parameters (optional)

Add one or more sets of mapping parameters to conf/mapping_params.config. This is meant for parameter space exploration and should include any fine-tuning params, while the template should only include core params essential to mapper execution.
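
For illustration, such a set of params could be defined along the lines of the sketch below, which follows the alignersParamsRNA structure shown in the issues further down (the exact keys expected by conf/mapping_params.config may differ - see that file):

params {
  alignersParamsRNA = [
    'bbmap': [
      default: 'maxindel=100000 ambiguous=best intronlen=20 local=t',
      custom: 'maxindel=200000 ambiguous=best intronlen=20 local=f'
    ]
  ]
}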

Notes on container specification in conf/mappers.config

You can upload a relevant container image to a docker registry (such as Docker Hub) or locate an existing one, e.g. among quay biocontainers. If you opt for an existing one, choose one with a specific version tag and a Dockerfile. Alternatively, follow our procedure below for defining per-tool container images and docker automated builds.

We opt for Docker containers, which can also be executed using Singularity. Container images are pulled from Docker Hub, but Nextflow is able to access other registries and also local images; see the relevant Nextflow documentation.

Per-tool container images and docker automated builds

Dockerfiles for individual tools used can be found under dockerfiles/. This includes various mappers but also other tools used by the pipeline. For each tool (or tool-set) we created a docker hub/cloud repository and configured automated builds.

Setting-up an automated build

Builds can be triggered from branches and tags.

The following approach relies on creating a branch for a specific version of a tool. The same can be achieved by simply tagging the relevant commit, but this may result in a proliferation of tags, while branches can be merged into master and deleted while preserving the history. If you'd rather use tags, in step (2) below change the 'Source type' to 'Tag' and later tag an appropriate commit using the docker/tool/version pattern rather than committing to a dedicated branch. (Tags can be problematic: if a tag is based on the version of a tool and the container needs to be updated, tags may have to be removed and re-added.)

  1. Create a Docker Cloud repo for your tool - do not link it to a specific GitHub repo or configure an automated build at this stage, but only after it has been created - otherwise the tags for containers built later may be malformed.
  2. Link the created Docker Cloud repo with this GitHub repo (go to Builds -> Configure Automated Builds).
  3. Add an automated build rule (replace tool with the name of the tool).
| Source type | Source | Docker Tag | Dockerfile location | Build Context |
| ----------- | ------ | ---------- | ------------------- | ------------- |
| Branch | /^docker\/tool\/(.*)$/ | {\1} | tool.Dockerfile | /dockerfiles |

Adding or updating a Dockerfile

Check out a new branch, replacing tool and version with the intended tool name and version, respectively. For example,

tool='bwa'
version='0.7.17'
git checkout -b docker/${tool}/${version}

Add, create or modify dockerfiles/${tool}.Dockerfile as required.
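
A hypothetical minimal example for bwa (the actual Dockerfiles in this repository may differ, e.g. by building from source, pinning exact package versions to match the docker/${tool}/${version} branch, or adding samtools):

# dockerfiles/bwa.Dockerfile - hypothetical minimal sketch
FROM ubuntu:18.04
RUN apt-get update \
 && apt-get install -y --no-install-recommends bwa \
 && rm -rf /var/lib/apt/lists/*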

Commit and push to trigger an automated build

git add dockerfiles/${tool}.Dockerfile
git commit
git push --set-upstream origin docker/${tool}/${version}

This should trigger an automated build in the linked Docker Hub/cloud repository.

In case the automated build is not triggered for a newly created Docker repo, it may help to delete the Docker repo and repeat steps 1-3 above. Then push some innocuous change to the branch to trigger the build.

If everything works as intended, you may update conf/containers.config to the new tool version.

Then either create a PR to merge the new branch into master or, if you have write permissions for this repository or are working on your own fork of it, check out master and merge.

git checkout master
git merge docker/${tool}/${version}

Report

TODO: add information on

  • how to edit the report template
  • how the final report gets generated

If the report template is sufficiently generic we will be able to easily render it to HTML and PDF; otherwise we should settle for HTML(?).

Rendering of the report constitutes the final step of the pipeline and relies on a container defined in dockerfiles/renderer.Dockerfile for the rendering environment.

Rendering outside the pipeline

There are several ways of rendering the report outside the pipeline, with docker being the preferred option.

Using docker

Put all required files in one place

mkdir -p localrender
cp report/report.Rmd  localrender/
cp results/* localrender/
cp flowinfo/*.{json,tsv} localrender/

Docker run rendering

docker run \
  --rm \
  --user $(id -u):$(id -g) \
  --volume $(pwd)/localrender:/render \
  --volume $(pwd)/bin:/binr \
  --workdir /render \
  rsuchecki/renderer:0.4.1_81ab6b5d71509d48e3a37b5eafb4bca5b117b5fc /binr/render.R

Rendered report should be available under ./localrender

Using singularity

Put all required files in one place

mkdir -p localrender \
 && cp report/report.Rmd  localrender/ \
 && cp results/* localrender/ \
 && cp flowinfo/*.{json,tsv} localrender/

Singularity run rendering

singularity exec \
  --bind $(pwd)/bin:/binr \
  --pwd $(pwd)/localrender \
  docker://rsuchecki/renderer:0.4.1_81ab6b5d71509d48e3a37b5eafb4bca5b117b5fc /binr/render.R

Rendered report should be available under ./localrender

Natively

If you'd like to render the report without docker/singularity, you will need the following:

  • R e.g. on ubuntu sudo apt install r-base-core
  • pandoc e.g. on ubuntu sudo apt install pandoc pandoc-citeproc
  • LaTeX e.g. on ubuntu sudo apt install texlive texlive-latex-extra
  • R packages:
    • rmarkdown
    • rticles
    • bookdown
    • tidyverse
    • jsonlite
    • kableExtra

Then:

mkdir -p localrender \
 && cp report/report.Rmd  localrender/ \
 && cp results/* localrender/ \
 && cp flowinfo/*.{json,tsv} localrender/

cd localrender && ../bin/render.R

Manuscript

Manuscript source is under the manuscript/ subdirectory on the manuscript branch, which should not be merged into master. The application note is drafted in RMarkdown in the manuscript/repset.Rmd file. RMarkdown is well integrated in RStudio, but can be written/edited in a text editor of your choice.

Rendering

The manuscript will be rendered when the pipeline is executed while the manuscript branch is checked out, either

  • locally or
  • by specifying -revision manuscript at run-time

The appropriate revision of the master branch should first be merged into the manuscript branch.

The manuscript can be rendered outside the pipeline in a fashion analogous to how this can be done for the report, just replace any use of report by manuscript.

Bibliography

Among the alternatives available we opted for BibTeX, see writing/references.bib.

repset's People

Contributors

alexwhan, rsuchecki


repset's Issues

Results presentation

We need to update/refine how the results are presented. There are several data sets which are or should be displayed in the final report.

The report itself is generated from RMarkdown input, and the generation of individual plots could be embedded there, or (as it is at the moment) separate scripts can be used to generate the plots (these are called by dedicated processes within the pipeline). We could also have a mix of the two approaches.

Some of the generated plots should be included in the manuscript as well, which might be an argument for generating them separately. However, since the manuscript is also based on an RMarkdown file, plotting code could simply be duplicated, allowing for the tweaks required to accommodate the different requirements of the report and the manuscript.

We will also use kableExtra to embed tabular data in the report in addition to the plotted results.

In each of the following cases we may need to update the R code and potentially also the formatting of input data (either in the pipeline or in R)

aligned,paired,tool,target,seqtype,sra,aligntime
78420391,73364822,bowtie2,ucsc.hg19.fa,RNA,SRR4228250,3282000
71378006,63407826,subread,ucsc.hg19.fa,RNA,SRR4228250,3390000
69029156,69029156,biokanga,ucsc.hg19.fa,RNA,SRR4228250,3895477
70212274,70212274,hisat2,ucsc.hg19.fa,RNA,SRR4228250,781000
78749980,71688262,bwa,ucsc.hg19.fa,RNA,SRR4228250,1478937
78202761,76019142,bbmap,ucsc.hg19.fa,RNA,SRR4228250,17742714
73873470,73873470,star,ucsc.hg19.fa,RNA,SRR4228250,366156

Output data generated by the pipeline for DNA alignment stats should probably be reshaped within the pipeline before putting effort into plotting, feedback welcome!

  • Simulated DNA-Seq alignment results - summary statistics
  • Simulated DNA-Seq alignment results - Individual read placement along chromosomes for more in depth illustration on how different aligners handle repeats, low complexity etc. Perhaps using density plots - optional
  • Real DNA-Seq alignment results - to be added to the pipeline

Note that in addition to aligntime, currently extracted from Nextflow's trace functionality, we could easily pick up additional information from other fields:

nextflow.trace/v2
realtime=1239
%cpu=1404 
rchar=12190352
wchar=4301
syscr=1268
syscw=15
read_bytes=26866688
write_bytes=4096
%mem=0
vmem=823168
rss=27868
peak_vmem=823168
peak_rss=27868
vol_ctxt=1584
inv_ctxt=87

Explicit package version installs in rscripts.Dockerfile?

May have to revert to the earlier syntax of dockerfiles/rscripts.Dockerfile,

e.g.

RUN R -e "install.packages('jsonlite', version = '1.5', repos = 'https://cran.csiro.au')"
# OR
RUN R -e "install.packages(c('jsonlite', 'RColorBrewer'), version = c('1.5','1.1-2'), repos = 'https://cran.csiro.au')"

as littler does not support installing an explicit version (?)

Questions:

  • would that be sufficient for capturing versions?
  • is it worth the effort given that a version is captured in the container? I think it is.

Issue on hold due to limited and late stage use of RScripts

Results, artefacts, persistence

We should revisit how GH releases are presented so that the distinction between a run-capture release and a regular pipeline release is more obvious - I think this could be done by re-phrasing a few things under

repset/nextflow.config

Lines 186 to 199 in 5def49c

releaseArgs = [
  REPO : workflow.repository.replaceFirst("^(http[s]?://github\\.com/|git@github\\.com:)","").replaceFirst("\\.git\$",""),
  COMMIT : workflow.commitId,
  LOCAL_FILES : [
    // "${params.outdir}/report.html",
    // "${params.outdir}/biokanga-manuscript.pdf",
    "${params.outdir}/allstats.json",
    "${params.infodir}/runmeta.json",
    "${params.infodir}/trace.tsv"
  ],
  RELEASE_TAG: "${workflow.revision}_${workflow.runName}_${workflow.sessionId}",
  RELEASE_NAME: "${workflow.revision} - results and metadata for run '${workflow.runName}'",
  RELEASE_BODY: "Release created and artefacts uploaded for run '${workflow.runName}', session ID ${workflow.sessionId}, commit ${workflow.commitId}, see assets for more details."
]

Alternatively, we could re-consider what we capture and where we deposit - perhaps we should be using https://developers.zenodo.org/#rest-api directly?

Google cloud execution

Repset could, in principle, be executed using https://cloud.google.com/life-sciences/

limitations

  • not well suited to short-running tasks, this could be addressed by

  • requested memory + CPU combinations must match what is available, e.g.
    Execution failed: creating instance: inserting instance: Invalid value for field 'resource.machineType': 'zones/australia-southeast1-a/machineTypes/custom-8-4783'. Memory should be a multiple of 256MiB, while 4783MiB is requested. Memory size for 8 vCPU instance should be between 7424MiB and 53248MiB, while 4783MiB is requested

    • API calls to obtain available machineTypes & their specs?
    • logic to ensure memory & CPU reservations match?
  • ?

Would it be better to go k8s on https://cloud.google.com/kubernetes-engine instead?

To be continued...

Container versions may not be enough - combine with commit SHA?

Among the things we do to ensure reproducibility is to be explicit about the version of the containers we use, e.g.

withLabel: rnftools {
  container = 'rsuchecki/rnftools:0.3.1.3'
}

There can however be scenarios where the version remains unchanged but the container has to be modified, e.g. recently ps became essential for some of the Nextflow reporting functionality which we use for reporting run times. Most of the containers are not affected, but Debian-based ones are.

We now generate additional tags for docker containers, such that in addition to repo/tool:version we also get repo/tool:commitSHA. We can now point to the latter, but this is not exactly readable; should we add another tag, or replace it with something like repo/tool:version_commitSHA?

Gene-level rna2rna evaluation?

When aligning RNA-Seq reads simulated from transcripts extracted from genome+annotation, many (unsurprisingly) align to an incorrect splice form. Basic, rnf-style evaluation focuses on the exact position of the read. Should we, in addition to that, also capture the numbers of reads aligning to the correct gene, as in some analyses this would suffice? One could also be more lenient about imprecise alignment within the correct transcript, but this would be relatively rare, so probably not worth pursuing.

Add CI

Add Travis CI with a simple execution on

  • one or a few small genomes
  • the fastest tool for each alignment mode (or one in the leanest container)
  • possibly break this down into separate runs to test modes/tools separately - this may be limited by the size of some key container images
  • may have to reduce the size of some of the containers, particularly rnftools

Instructions for adding aligners to framework

We may add more aligners to the pipeline, but more importantly we should make it easy for a user not familiar with Nextflow and our pipeline to add new tools. For that we need to provide instructions on how to:

  • Add reference indexing process/template
  • Add one or more read alignment templates (e.g. defaults, optimised)
  • Specify a container to be used for running the additional tool
  • (Optional) add custom computational requirements config if needed.

Further modularise

Until NF modules (nextflow-io/nextflow#984) are rolled out, the pipeline should be made more manageable by moving some of the R and groovy scripts from main.nf to either

  • standalone scripts under bin/
  • templates e.g. under templates/R, templates/groovy

This could go hand-in-hand with finalizing #9 by streamlining how results are summarized (groovy) and presented (R)

Get basic stats for any set of real reads

  • Either
    • DNA
    • RNA
  • Either
    • Local
    • Remote (http/ftp/SRA)
  • Stats
    • Alignment rates
    • Run-time, memory
    • Read distribution?
  • Input definition in JSON or YAML would include
    • read file paths/URLs or SRA accession
    • reference file path or URL - covered elsewhere, match by target: [species, version]
  • This will require either or both
    • more robust resource requirements scaling - will require further refinements
    • exposing parameters for initial/max memory/time

Capture warnings in JSON/report

Some warnings generated e.g. during tool version validation, such as

WARN: Declared tool version string 0.7.17-r1188 not found in container image spec rsuchecki/bwa:0.7.17_8b61e2a77c105f3ec28d260b556af5cf12c49111

are printed to the terminal but should probably make their way into the report (via runmeta.json)

log.warn "Declared tool version string ${rec.version} not found in container image spec ${rec.container}."

possibly also

repset/main.nf

Line 70 in cec08e2

log.warn "Malformed real reads entry will be ignored: ${it}"

onComplete fails with NXF_VER newer than 19.04.0

Serialising workflow metadata to JSON causes onComplete handler execution to repeat and stack overflow.

Affects Nextflow versions 19.05.0-edge -> 19.10.0

Fix may break compatibility with 19.04.0

Automated builds for multiple Dockerfiles

It may be good to have all Docker containers auto-built from the existing dockerfiles/.

Problems:

  • rather than organisation/tool:version we'd have to rely on something like organisation/catchall-name:tool_version
  • would need a consistent way of storing the tool version such that a build hook script could consistently pick it up.
  • General build trigger(s) for the whole repo? Probably not, as we can have a rule per Dockerfile, but either way additional customisation of hooks (https://docs.docker.com/docker-cloud/builds/advanced/#custom-build-phase-hooks) will be necessary to avoid all images being built each time.

Upload results to release if tagged and token available

Use GitHub API to create releases and upload results - this could be weaved in or more likely around the pipeline.

  • requires using an authentication token which could be an input to the pipeline or a wrapper script, or env var
  • results could be uploaded to an existing release or a new release could be created if one matching current tag does not exist.
  • preferably used only for runs with specific revision e.g. using nextflow run -revision v1.2.1
    • if not, additional checks needed at execution to check if no diff to HEAD and use last commit SHA for tagging
  • uploaded results could carry additional run metadata allowing multiple results sets for the same release

Consider using NF workflow.onComplete block - trace available but not report, timeline or DAG

Singularity builds

Singularity containers built from docker pulls may not be fully consistent

WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull

This may not be a huge issue since we'd rather run the final version of the pipeline on AWS Batch, which means using docker containers directly. Nevertheless, it would be good to have fixed singularity containers on shub for each docker container on Docker Hub. Currently, automation on shub does not appear to be flexible enough to integrate with the procedure developed for Docker Hub. For example, you can only point automated builds to already existing GitHub branches 👎

container images for Rmd rendering

We currently have one r-ver/tidyverse - based container image for rendering

  • the final report
  • the manuscript from a separate branch

Only the latter requires LaTeX, so we could define separate container images for each of the two tasks.

In addition, the r-ver/tidyverse image includes the 350 MB (uncompressed) RStudio Server, which is not needed for either task.

containerize the rendering environment

  • R e.g. on ubuntu sudo apt install r-base-core
  • pandoc e.g. on ubuntu sudo apt install pandoc pandoc-citeproc
  • LaTeX e.g. on ubuntu sudo apt install texlive texlive-latex-extra
  • additional R packages
    • rmarkdown
    • rticles
    • bookdown

More recent rnftools containers not fully functional

It appears that rsuchecki/rnftools:0.3.1.3_3123fca68e14580a453deea77a0549929ed44715@sha256:58d993c965a3cddeca2eeff7191f4d3e46075b2007af2e38fdbaaaf9f7786e3a built from 3123fca may be the last one that is fully functional. Also, the recent biocontainers-based image fails when using the MASON simulator. This may be due to the conda recipe for rnftools not specifying versions for its dependencies (?)

  • We can continue using the existing container.
  • At least a warning required for rnftools.Dockerfile on master
  • Investigate in more detail
    • it appears that the image built from our Dockerfile is slightly less problematic than the biocontainers version (issues around libraries for samtools); the earlier version still results in what appears to be a fully working container

Additional genomes for evaluation

Additional genomes can be selected in many ways; this may be a start:

mysql -u anonymous -h mysql-eg-publicsql.ebi.ac.uk -P 4157
use ensemblgenomes_info_41;

Then

SELECT species, assembly_name, assembly_level, base_count \
FROM genome WHERE division = "EnsemblMetazoa" AND assembly_level = "chromosome";
species assembly_name assembly_level base_count
aedes_aegypti AaegL3 chromosome 1383974186
anopheles_darlingi AdarC3 chromosome 136950925
anopheles_gambiae AgamP4 chromosome 273109044
atta_cephalotes Attacep1.0 chromosome 317690795
belgica_antarctica ASM77530v1 chromosome 89583723
caenorhabditis_elegans WBcel235 chromosome 100286401
caenorhabditis_briggsae CB4 chromosome 108384165
culex_quinquefasciatus CpipJ2 chromosome 579057705
drosophila_simulans ASM75419v3 chromosome 124963774
drosophila_pseudoobscura Dpse_3.0 chromosome 152696192
drosophila_yakuba dyak_caf1 chromosome 165693946
drosophila_melanogaster BDGP6 chromosome 143725995
melitaea_cinxia MelCinx1.0 chromosome 389907520
mnemiopsis_leidyi MneLei_Aug2011 chromosome 155875873
nasonia_vitripennis Nvit_2.1 chromosome 295780872
pediculus_humanus PhumU2 chromosome 110804242
sarcoptes_scabiei SscaA1 chromosome 56262437
schistosoma_mansoni ASM23792v2 chromosome 364541798
solenopsis_invicta Si_gnG chromosome 396024718
trichinella_spiralis Tspiralis1 chromosome 63525422

Similarly

SELECT species, assembly_name, assembly_level, base_count \
FROM genome WHERE division = "EnsemblPlants" AND assembly_level = "chromosome";
species assembly_name base_count
arabidopsis_lyrata v.1.0 206667935
aegilops_tauschii ASM34733v1 3313764331
arabidopsis_thaliana TAIR10 119667750
beta_vulgaris RefBeet-1.2.2 566181630
brachypodium_distachyon Brachypodium_distachyon_v3.0 271163419
brassica_rapa Brapa_1.0 283822783
brassica_oleracea BOL 488622507
chondrus_crispus ASM35022v2 104980420
chlamydomonas_reinhardtii Chlamydomonas_reinhardtii_v5.5 111098438
cyanidioschyzon_merolae ASM9120v1 16728945
dioscorea_rotundata TDr96_F1_Pseudo_Chromosome_v1.0 456674974
cucumis_sativus ASM407v2 193829320
daucus_carota ASM162521v1 421502825
gossypium_raimondii Graimondii2_0 761405269
helianthus_annuus HanXRQr1.0 3027844945
glycine_max Glycine_max_v2.0 978416860
hordeum_vulgare IBSC v2 4834432680
leersia_perrieri Lperr_V1.4 266687832
lupinus_angustifolius LupAngTanjil_v1.0 609203021
manihot_esculenta Manihot esculenta v6 582117524
medicago_truncatula MedtrA17_4.0 412800391
musa_acuminata ASM31385v1 472960417
nicotiana_attenuata NIATTr2 2365682703
ostreococcus_lucimarinus ASM9206v1 13204888
oryza_glaberrima Oryza_glaberrima_V1 316419574
oryza_barthii O.barthii_v1 308272304
oryza_brachyantha Oryza_brachyantha.v1.4b 260838168
oryza_meridionalis Oryza_meridionalis_v1.3 335668232
oryza_glumipatula Oryza_glumaepatula_v1.5 372860283
oryza_punctata Oryza_punctata_v1.2 393816603
oryza_rufipogon OR_W1943 338040714
oryza_nivara Oryza_nivara_v1.0 337950324
oryza_indica ASM465v1 427004890
oryza_sativa IRGSP-1.0 375049285
phaseolus_vulgaris PhaVulg1_0 521076696
physcomitrella_patens Phypa V3 471852792
prunus_persica Prunus_persica_NCBIv2 227411381
populus_trichocarpa Pop_tri_v3 434132815
setaria_italica Setaria_italica_v2.0 405732883
solanum_tuberosum SolTub_3.0 810654046
sorghum_bicolor Sorghum_bicolor_NCBIv3 708735318
solanum_lycopersicum SL2.50 823630941
theobroma_cacao Theobroma_cacao_20110822 345993675
trifolium_pratense Trpr 304842038
triticum_dicoccoides WEWSeq v.1.0 10079039394
triticum_aestivum IWGSC 14547261565
triticum_urartu ASM34745v1 3747163292
vigna_angularis Vigan1.1 466744453
vitis_vinifera 12X 486265422
vigna_radiata Vradiata_ver6 463085359
zea_mays B73 RefGen_v4 2135083061

Include pseudoalignment approaches

We can consider adding pseudoalignment approaches (Kallisto/Salmon) for comparison with the aligners, but since these approaches rely on a transcriptome, it makes little sense to do a transcriptome-based read simulation followed by alignment to that transcriptome if the competing aligners were to target the whole genome. Valid options include:

  • Input transcriptome - simulation - (pseudo)alignment to transcriptome target using all tools
  • Input real RNA-Seq, (pseudo)aligned to either transcriptome or the genome or both depending on the tool

Points for paper

Main focus

  • Reproducibility
  • Re-computability
  • Re-use
  • Extensibility
  • report to be the main product of the pipeline
  • create manuscript branch, never pull from manuscript into master, only master to manuscript
  • examples of challenges, e.g. existing frameworks
  • identify individual challenges, illustrate our solutions for each of those
    • consistent software environment - building and exposing containers
    • code tags and releases under standard semantic versioning e.g. v1.5 + generated DOIs
    • capturing output with env and runtime details #17; how much of the pipeline info/introspection and execution environment can be captured and added to a release (and so also a DOI)? For that, we should have "special" run-related releases, e.g. v1.5_suffix, where suffix needs to be unique (a technical requirement anyway) and could potentially be the result of hashing the data to be uploaded to the release (results, env info, run info). A simpler version of the suffix could just capture the NF exec profile plus a timestamp.

Discuss

  • At a minimum, your code will run on a 16-core machine with 120 GB of memory
  • 5GB of storage and 1 hour of compute time per month. Use your academic email to get 20GB of storage and 10 hours of compute time per month
  • If you’re working on preparing your code and need more storage or compute time, please contact us. You can also purchase extra compute time at $1 per hour and extra storage at $5/mo for every 50GB.

Dynamic resource allocation or exposed max resources params

Memory & time (and indirectly CPU) requirements for indexing and aligning will vary depending on input size. Given that we want users to be able to supply any genome (plus optional annotation), we need to handle variable inputs.

Limitations / caveats

  • Speed comparisons will only make sense if CPU/mem and max time requirements are fixed for a given input set
  • Important to have a reasonable max time after which we accept that a tool fails

Possible solutions (either or both):

Different versions of the same tool

There are various ways in which we could include different versions of the same tools, for example

  1. Leave things as they are; if more than one version of a tool is to be included in the evaluation, embed the version as part of the tool "name" - this would apply to the template file name, alignerParams and the container config.
  • 👍 Quick and easy, no need to explicitly define each tool's details in a config file
  • 👎 Proliferation of templates and top-level entries
  • 👎 👍 Implicit enforcing of version match when pairing index with aligner
  2. Define everything in a dedicated config/YAML/JSON, as currently done in aligners.config for alignerParams. This would then also include the container spec for indexing/alignment.
  • 👍 clear
  • 👎 very verbose
  • Not clear what to do with the templates. Would we also define template filenames here? Currently templates are picked up automatically from dedicated dirs.

Proposed config file

mappers = [
    [
      tool: 'bbmap',
      version: '38.44',
      container: 'rsuchecki/bbmap:38.44_fae5e1e07240e69896dbf7095872fb6fea43d045'
    ],
    [
      tool: 'bbmap',
      version: '38.49',
      container: 'rsuchecki/bbmap:38.49_9e975d9bc6a657bc4306f4475be393b9fbe8e3fb',
      template: 'bbmap_old' //optional, otherwise the tool name is used to match templates
    ],
    [
      tool: 'minimap2',
      version: '2.17',
      container: 'rsuchecki/minimap2:2.17_1d3f326820696496f025a95632979cd4ea4140cb'
    ]
  ]

prep:

Channel.from(params.mappers)
.filter { record -> //ensure required info present
   ['tool','version','container'].every { it in record.keySet() }
}
.map { //populate optional field(s)
  if(!it.containsKey('template'))
    it.put('template', it.tool)
  it
}
  3. Extend/replace containers.config to allow an optional per-version level. Since NF dynamic labels are not currently supported (nextflow-io/nextflow#894), the incentive for keeping the current structure is only based on potential future NF support and is not essential. In other words, there is little point in continuing to use withLabel selectors for the alignment and indexing container spec.

If 2 or 3, or a mix of those:

  • Do we allow, and how do we allow, multiple templates to accommodate differences between versions?
  • Do we allow, and how do we allow, multiple sets of params to accommodate different versions? Easy - another level in alignerParams
  • Need to ensure separate indexing for each version; indexing params

Since we have moved fine tuning out of the templates, which now only provide core functionality with detailed parameterisation in alignerParams, perhaps we can do away with templates altogether and have those core scripts embedded in aligners.config or similar?

👎 However we specify the version string, it will be free text - there is no practical way of enforcing a match with the software. A discrepancy could happen with the container as well, i.e. the container label may be wrong, but at least we know exactly what container was used.

Read simulation from any genome and alignment eval

  • DNA

    • This is how Simulated DNA path of our workflow works
  • RNA (BEERS)

    • limited to existing, published datasets - human, malaria
    • requires annotated genome
  • RNA (general)

  • simulators which record original read position? If so, add conversion to RNF format already used for simulated DNA and which provides syntax for simulated 'spliced' reads

  • Alternatively, given genome + gff/bed, extract transcripts, simulate reads using existing DNA workflow, convert coordinates, evaluate using RNFtools

Validity of run-time comparisons

⚠️

Run times may not be comparable between tools/runs if we don't ensure that the underlying computational conditions were comparable. For that, the evaluation would probably have to be executed on a dedicated server with a task having exclusive access to that machine and input/output files being placed on local storage (e.g. using nextflow's scratch true).

  1. Cluster execution could be acceptable if we can ensure
  • homogeneity of the nodes (explicit partition spec?)
  • exclusive use of nodes
  • use of local scratch space
  2. Cloud (awsbatch) execution could be acceptable if we can ensure
  • homogeneity of the nodes
  • exclusive use of nodes
  • use of local scratch space

In addition, we must capture more of the task information via

trace.fields = 'task_id,name,status,exit,realtime,%cpu,rss'

  • this should also include the requested resources cpus,memory,time - more here

The CPU details can easily be picked up in the mapping process, e.g. beforeScript 'cat /proc/cpuinfo > cpuinfo', and parsed downstream. This is of limited value on its own for serious speed benchmarking, but may be useful for the indicative reporting of speed.

Report/manuscript rendering if output file exists

If a version of the manuscript already exists under writing/ the rendering process will fail as the output file glob * for the render process excludes any input files (as expected). This can happen if a user opts for rendering outside the pipeline and later re-runs the pipeline including the rendering process.

This can be fixed e.g. by specifying the output file name(s) explicitly rather than via a glob. Final report and/or manuscript file name could be defined in nextflow.config.

Multiple alignment runs/settings for a given tool/index

Exploration/comparison of different sets of parameters for a given aligner.

One way of doing that is by allowing multiple alignment templates for a given tool while preserving the link to the corresponding index. Index metadata could hold a list of alignment templates, and the index item in a channel could be duplicated for each alignment template. Doable, but a bit fragile. Labelling would be a problem; perhaps better to do this via config (also a run-time input JSON/YAML definition)?

This is basically a nested map where the top-level key is the tool label, and each second-level key is a free label with an associated string of params. For each tool we should have a default, which should include all params we use that are not essential to the tool functioning within the framework. We should probably also list the "reserved" params (in/out files, threads, in/out format-related params).

JSON or YAML could be produced as part of the workflow run, then added to or modified by the user and used with -params-file

Example:

  • NF config file
params {
  alignersParamsRNA = [
    'biokanga': [
      default: '--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 5 --minchimeric 50',
      alternative: '--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 50', 
      another: '--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 75'
    ],
    'bbmap': [
      default: 'maxindel=100000 ambiguous=best intronlen=20 local=t',
      custom: 'maxindel=200000 ambiguous=best intronlen=20 local=f'
    ] 
  ]
}
  • JSON
{
    "alignersParamsRNA": {
        "biokanga": {
            "default": "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 5 --minchimeric 50",
            "alternative": "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 50",
            "another": "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 75"
        },
        "bbmap": {
            "default": "maxindel=100000 ambiguous=best intronlen=20 local=t",
            "custom": "maxindel=200000 ambiguous=best intronlen=20 local=f"
        }
    }
}
  • YAML
---
alignersParamsRNA:
  biokanga:
    default: "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 5 --minchimeric 50"
    alternative: "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 50"
    another: "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 75"
  bbmap:
    default: maxindel=100000 ambiguous=best intronlen=20 local=t
    custom: maxindel=200000 ambiguous=best intronlen=20 local=f

Results presentation v2

(v2 - closing #26 as it is badly out-of-date)

Input:
Standardised, (close to) stable JSON file capturing all (simulated) read alignment results for all alignment modes, tools, versions and param sets. For each result it also includes metadata (tool, read simulation, reference).

Challenge:
Depending on the run, the number of data points to be displayed can range from a handful to an arbitrarily large number.

Output:
HTML report with interactive table or tables, allowing for filtering and sorting records

Technology (?):

Capture failed task in report?

An alignment task may fail for a variety of reasons; it is retried several times and finally ignored to allow the workflow to run to completion. Metadata on all tasks, including failed/ignored ones, are captured by NF in the trace and the HTML report, but this information is not currently included in our rendered report. Consequently, tables/figures will simply lack some entries rather than indicate that e.g. aligner X failed to complete a given task within the set time/memory limits.

  • is task metadata for already completed tasks available before pipeline finishes?

If it is,

  • include in rendered report
  • capture reason (time? memory? other?)

If it is not,

  • additional channels can be used to indicate expected output entries to data collecting processes.

If we want failures to be captured directly alongside or within the evaluation summary JSON, we could take over the handling of failures from Nextflow, with a dummy SAM/BAM produced and perhaps the contents of .command.err captured and embedded or linked to in the generated report.

  • πŸ‘ Failed tasks info in plain sight, able render some plots without the relevant data points just missing
  • πŸ‘Ž NF no longer re-submitting the failed tasks which is really useful to iron-out glitches in HPC/Cloud execution - could probably get around this by setting, outputting appropriate validExitStatus in conjunction with task.attempt .

Further reproducibility wish-list

Tool version capture beyond container tag.

  • Add a version command and a parsing template to each tool definition in mappers.config
  • Add a process definition to run the version command
  • Weave the output into the run metadata
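
A hypothetical sketch of the first two points (the versionCommand field and the process below are assumptions, not part of the current pipeline):

// hypothetical extra field per tool definition in conf/mappers.config:
//   versionCommand: 'my_mapper --version'

// hypothetical process collecting the reported version strings for the run metadata
process getToolVersion {
  tag "${mapper.tool}"
  container "${mapper.container}"

  input:
    val mapper from mappersChannel // channel of tool definition maps

  output:
    stdout into toolVersionsChannel // to be merged into runmeta.json downstream

  """
  ${mapper.versionCommand}
  """
}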

Storage for singularity images built from docker (a few gigs only)

  • Ideally via api calls to an appropriate repository
  • at a minimum, capture and make the exact container images available for any published results

Explicitly machine-readable output

We already output formatted results and metadata; additional formatting and standardisation will be required

  • Identify relevant standards
  • Apply the standards to outputs and metadata

Update containers

Biocontainers deliver great value: small image size, at the reasonable cost of not always having access to the (admittedly incorrectly) expected standard Linux tools such as awk, sed etc.

Standard biocontainers are one tool per container. This is ideal for most of our use cases, with some exceptions, in particular aligners outputting SAM rather than BAM.

Multi-package containers can be quickly built by submitting a PR to https://github.com/BioContainers/multi-package-containers. This generates mamba-based containers, with tool names and versions hashed in the image name.

Biocontainers server(s) are available with 40k Singularity image files.
