csiro-crop-informatics / repset

Reproducible evaluation of short read mappers

License: GNU General Public License v3.0

Languages: Nextflow 49.81%, Groovy 29.67%, R 8.96%, Dockerfile 6.61%, Shell 4.95%

Topics: rmarkdown, biokanga, bioinformatics, alignment-algorithm, bioinformatics-analysis, bioinformatics-pipeline, bioinformatics-containers, singularity, aws-batch, pipeline

repset's Introduction

REPSET

REPSET or RREPSET or R²EPSET is a Reproducible, Reusable, Extensible, Portable and Scalable Evaluation Tool for short read aligners



Dependencies

  • Nextflow - you may consider using the exact version of Nextflow by setting the environment variable export NXF_VER=19.10.0 before running the workflow.
  • and
    • either Singularity
    • or Docker
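
For example, to pin the Nextflow version and install the launcher in the current directory (the curl command is the standard Nextflow installer; skip it if nextflow is already installed):

export NXF_VER=19.10.0                 # optional: use the exact Nextflow version
curl -s https://get.nextflow.io | bash # installs the ./nextflow launcher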

Preliminaries

The pipeline consists of several partly dependent paths which facilitate the evaluation of mappers using either DNA- or RNA-Seq data, either real or simulated. The paths can be executed separately or in a single run. When running paths separately or re-running the pipeline, the -resume flag ensures that previously computed results (or partial results) are re-used.

Default execution will simulate, align and evaluate reads from a few small data sets defined in conf/simulations.config.

Note on terminology (mapping vs alignment)

Terms related to read mapping and alignment, such as pseudoalignment and quasi-mapping, are used inconsistently in bioinformatics (which would make many a mathematician cringe). We hereby attempt to strictly follow this convention by consistently propagating these inconsistencies.

Running the pipeline

Execution profiles

There are several ways to execute the pipeline, each of which requires Nextflow and either Docker or Singularity. See nextflow.config for the available execution profiles (or to add your own!), e.g. for local execution this could be

Running with docker

nextflow run csiro-crop-informatics/repset -profile docker

Running with singularity (locally or on a Slurm cluster)

To run the workflow with Singularity on

  • a local machine,
  • a standalone server, or
  • in an interactive session on a cluster,

first make sure that a recent version of Singularity is available and then run

nextflow run csiro-crop-informatics/repset -profile singularity

On a Slurm cluster you can run

nextflow run csiro-crop-informatics/repset -profile slurm,singularity

Note! Multiple container images will be pulled in parallel from Docker Hub (and potentially other repositories). A bug (?) in singularity may cause the parallel processing to fail with an error message similar to

Error executing process > 'indexGenerator ([[species:Saccharomyces_cerevisiae, version:R64-1-1.44, seqtype:DNA], [tool:hisat2, version:2.1.0]])'

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name rsuchecki-hisat2-2.1.0_4cb1d4007322767b562e98f69179e8ebf6d31fb1.img docker://rsuchecki/hisat2:2.1.0_4cb1d4007322767b562e98f69179e8ebf6d31fb1 > /dev/null
  status : 255

Until this is fixed, our workaround is to run the following prior to running the main script:

nextflow run csiro-crop-informatics/repset/pull_containers.nf

which will pull most of the containers used by the workflow (the remaining ones are unlikely to be pulled in parallel).

Note that singularity must be available on the node where you execute the pipeline, e.g. by running module load singularity/3.2.1 prior to running the pipeline. It is also required on each compute node; your cluster configuration should ensure that. If it does not, the additional execution profile singularitymodule can be modified in nextflow.config to match your singularity module name and used at run-time.
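
For illustration, such a profile in nextflow.config may look roughly like the sketch below (the module name is an assumption - adjust it to whatever your cluster provides):

profiles {
  singularitymodule {
    process.module = 'singularity/3.2.1'   // load the singularity module on every compute node
    singularity.enabled = true
  }
}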

Running on AWS batch

If you are new to AWS Batch and/or Nextflow, follow this blog post; once you are done, or if you already use AWS Batch, simply run

nextflow run csiro-crop-informatics/repset \
  -profile awsbatch \
  -work-dir s3://your_s3_bucket/work

after replacing your_s3_bucket with a bucket you have created on S3.

To reduce potential connectivity issues you may consider running the workflow from an EC2 instance. This may guide your decision on where the result file(s) should be placed. If you wish to deposit the result file(s) on s3, you can specify e.g. --outdir s3://your_s3_bucket/results, otherwise you can find them under ./results.
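
For example, to also deposit the results on S3 (assuming the same bucket), you could run:

nextflow run csiro-crop-informatics/repset \
  -profile awsbatch \
  -work-dir s3://your_s3_bucket/work \
  --outdir s3://your_s3_bucket/results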

Warning! You will be charged by AWS according to your resource use.

Mapping modes

The workflow incorporates three ways of mapping and evaluating reads: dna2dna, rna2rna and rna2dna; by default all are executed. To restrict execution to one or two of these, you can run the workflow with e.g.

  • --mapmode dna2dna - evaluate DNA-Seq read mapping to the genome
  • --mapmode 'rna2rna|rna2dna' - evaluate RNA-Seq read mapping to the transcriptome and the genome, respectively
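
For example, to evaluate only DNA-Seq read mapping under the docker profile:

nextflow run csiro-crop-informatics/repset -profile docker --mapmode dna2dna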

Evaluated mappers

An alignment/mapping tool is included in the evaluation if appropriate templates are included as specified below in Adding another mapper. To execute the workflow for only a subset of the available tools, you can specify e.g.

  • --mappers star - only evaluate a single tool
  • --mappers 'bwa|bowtie2|biokanga' - evaluate a subset of tools
  • --mappers '^((?!bwa).)*$' - evaluate all tools except bwa

Other regular expressions can be specified to tailor the list of evaluated tools.
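
For example, to evaluate everything except bwa under the singularity profile:

nextflow run csiro-crop-informatics/repset -profile singularity --mappers '^((?!bwa).)*$'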

Alternative input data sets

To run the pipeline with alternative input data you can use the -params-file flag to specify a JSON or a YAML file overriding the definitions in conf/simulations.config, for example

nextflow run main.nf -params-file path/to/conf/simulations.json

Alternatively, you can simply edit the content of conf/simulations.config.
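
For illustration, a minimal params file might look like the sketch below (the values are assumptions); generic parameters such as mappers or mapmode can be overridden the same way, while the data set definitions themselves should mirror the structure used in conf/simulations.config (not reproduced here):

mappers: 'bwa|bowtie2'
mapmode: 'dna2dna'
outdir: './results'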

Computational resources

Resources required for running the workflow can be substantial and will vary greatly depending on multiple factors, such as

  • input genomes sizes
  • simulated read coverage
  • number and choice of mappers evaluated
  • mapping mode(s) selected

We have empirically derived simple functions to auto-scale resource requests for key processes such as genome indexing and read mapping. These depend on either the reference size or the number of reads and were based on the tools which were slowest or required the most memory; if e.g. a slower tool is added, they will need to be revised.

Failed tasks are re-submitted with increased resources only as long as valid comparisons can still be made between different tools' results. For that reason, CPU and memory limits are not increased on re-submission of the mapping process, but the maximum allowed wall-clock time is. In the case of the indexing process, the initial time and memory limits are increased on each task re-submission, as indexing performance is not well suited for comparisons anyway: many indexing processes are single-threaded, and in other cases it might make sense to skip the indexing process altogether and allow for on-the-fly index generation.

Resource auto-scaling is subject to constraints which may need to be adjusted for a particular compute environment, either at run time (e.g. --max_memory 64.GB --max_cpus 32 --max_time 72.h) or by editing conf/requirements.config, where the dynamic scaling functions can also be adjusted.
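
For illustration only, per-process auto-scaling of this kind typically looks like the sketch below in a Nextflow config (the process name mapSimulatedReads and all values are assumptions; the real functions and constraints live in conf/requirements.config):

process {
  withName: indexGenerator {
    errorStrategy = 'retry'
    maxRetries = 3
    memory = { 8.GB * task.attempt }  // indexing: memory and time both grow on re-submission
    time = { 4.h * task.attempt }
  }
  withName: mapSimulatedReads {
    errorStrategy = 'retry'
    maxRetries = 2
    cpus = 8                          // fixed, so mapping runs remain comparable
    memory = 16.GB                    // fixed for the same reason
    time = { 8.h * task.attempt }     // only the wall-clock limit grows
  }
}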

Capturing results and run metadata

Each pipeline run generates a number of files including

  • results in the form of report, figures, tables etc.
  • run meta data reflecting information about the pipeline version, software and compute environment etc.

These can be simply collected from the output directories but for full traceability of the results, the following procedure is preferable:

  1. Fork this repository

  2. (Optional) select a tagged revision or add a tag (adhering to the usual semantic versioning approach)

  3. Generate a GitHub access token which will allow the pipeline to create releases in your forked repset repository; when creating the token it should suffice to select only the following scope:

    public_repo Access public repositories

    (assuming your fork of repset remains public)

  4. Make the access token accessible as an environment variable

  5. Run the pipeline from the remote repository, specifying

    • the --release flag
    • the appropriate -profile
    • the intended revision e.g. -revision v0.9.10 (optional)

For example,

GH_TOKEN='your-token-goes-here' nextflow run \
  user-or-organisation-name/repset-fork-name \
  -profile singularity \
  -revision v0.9.10 \
  --release

On successful completion of the pipeline a series of API calls will be made to

  1. create a new release
  2. upload results and meta data files as artefacts for that release
  3. finalise the release (skipped if --draft flag used)

The last of these calls will trigger minting of a DOI for that release if Zenodo integration is configured and enabled for the repository. To keep your release as a draft, use the --draft flag.

Experimental pipeline overview

figures/dag.png

Execution environment

Execution environment is captured in runmeta.json.

Adding another mapper

A mapper may be included for any or all of the mapping modes (dna2dna, rna2dna, rna2rna). In each case the same indexing template will be used.

After you have cloned this repository add another entry in conf/mappers.config, under

params {
  mappersDefinitions = [
    //insert here
  ]
}

For example, to add a hypothetical my_mapper version 1.0 you might add the following:

[
  tool: 'my_mapper',
  version: '1.0',
  container: 'path/to/docker/repository/my_mapper:1.0',
  index: 'my_mapper index --input-fasta ${ref} --output-index ${ref}.idx',
  dna2dna:
  '''
  my_mapper align --index ${ref} \
  -1 ${reads[0]} -2 ${reads[1]} \
  --threads ${task.cpus} \
  ${ALIGN_PARAMS} > out.sam
  '''
],

Additional script templates can be added for rna2rna and rna2dna mapping modes. Script templates must be wrapped in either single ('script') or triple single ('''script''') quotes. If you would rather keep the templates in separate files follow these instructions.

Template variables

Applicable nextflow (not bash!) variables resolve as follows:

Indexing

  • ${task.cpus} - number of CPU threads available to the indexing process
  • ${ref} - the reference FASTA file name - we use it both to specify the input file and the base name of the generated index

Mapping

  • ${task.cpus} - number of logical CPUs available to the alignment process
  • ${ref} - base name of the index file (sufficient if the aligner uses the base name to find a multi-file index; otherwise an appropriate extension may need to be appended, e.g. ${ref}.idx)
  • ${reads[0]} and ${reads[1]} - file names of paired-end reads
  • ${ALIGN_PARAMS} - any additional params passed to the aligner
    • Empty by default, but one or more sets of params can be defined in conf/mapping_params.config. When multiple sets of params are specified, each set is used in a separate execution.

Separate template files (optional)

If you would rather keep the templates in separate files rather than embedding them in conf/mappers.config, you can place each file under the appropriate templates/<mapmode> directory (e.g. templates/rna2dna/),

and instead of including the script template string directly in conf/mappers.config as we did above, set

  • rna2dna: true, which will be resolved to templates/rna2dna/my_mapper.sh

or, when using a different file name,

  • rna2dna: 'foo_bar.sh', which will be resolved to templates/rna2dna/foo_bar.sh
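
For example, the my_mapper entry from above could then read as follows (the embedded script string is replaced by the template reference):

[
  tool: 'my_mapper',
  version: '1.0',
  container: 'path/to/docker/repository/my_mapper:1.0',
  index: 'my_mapper index --input-fasta ${ref} --output-index ${ref}.idx',
  rna2dna: true //script template read from templates/rna2dna/my_mapper.sh
],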

See the header of conf/mappers.config for more details and limitations.

Non-core mapping parameters (optional)

Add one or more sets of mapping parameters to conf/mapping_params.config. This is meant for parameter space exploration and should include any fine-tuning params, while the template should only include core params essential to mapper execution.
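
For illustration, such a set of params could be defined along the lines of the sketch below, which follows the alignersParamsRNA structure shown in the issues further down (the exact keys expected by conf/mapping_params.config may differ - see that file):

params {
  alignersParamsRNA = [
    'bbmap': [
      default: 'maxindel=100000 ambiguous=best intronlen=20 local=t',
      custom: 'maxindel=200000 ambiguous=best intronlen=20 local=f'
    ]
  ]
}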

Notes on container specification in conf/mappers.config

You can upload a relevant container image to a docker registry (such as Docker Hub) or locate an existing one, e.g. among quay biocontainers. If you opt for an existing one, choose one with a specific version tag and a Dockerfile. Alternatively, follow our procedure below for defining per-tool container images and docker automated builds.

We opt for Docker containers, which can also be executed using Singularity. Container images are pulled from Docker Hub, but Nextflow is able to access other registries and also local images; see the relevant Nextflow documentation.

Per-tool container images and docker automated builds

Dockerfiles for individual tools used can be found under dockerfiles/. This includes various mappers but also other tools used by the pipeline. For each tool (or tool-set) we created a docker hub/cloud repository and configured automated builds.

Setting-up an automated build

Builds can be triggered from branches and tags.

The following approach relies on creating a branch for a specific version of a tool. The same can be achieved by simply tagging the relevant commit, but this may result in a proliferation of tags, while branches can be merged into master and deleted while preserving the history. If you'd rather use tags, in step (2) below change the 'Source type' to 'Tag' and later tag an appropriate commit using the docker/tool/version pattern rather than committing to a dedicated branch. (Tags can be problematic: if a tag is based on the version of a tool and the container needs to be updated, tags may have to be removed and re-added.)

  1. Create a Docker Cloud repo for your tool - do not link it to a specific GitHub repo or configure an automated build at this stage, but only after it has been created - otherwise the tags for containers built later may be malformed.
  2. Link the created Docker Cloud repo with this GitHub repo (go to Builds -> Configure Automated Builds).
  3. Add an automated build rule (replace tool with the name of the tool).
| Source type | Source | Docker Tag | Dockerfile location | Build Context |
| ----------- | ------ | ---------- | ------------------- | ------------- |
| Branch | /^docker\/tool\/(.*)$/ | {\1} | tool.Dockerfile | /dockerfiles |

Adding or updating a Dockerfile

Check out a new branch, replacing tool and version with the intended tool name and version, respectively. For example,

tool='bwa'
version='0.7.17'
git checkout -b docker/${tool}/${version}

Add, create or modify dockerfiles/${tool}.Dockerfile as required.
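
A hypothetical minimal example for bwa (the actual Dockerfiles in this repository may differ, e.g. by building from source, pinning exact package versions to match the docker/${tool}/${version} branch, or adding samtools):

# dockerfiles/bwa.Dockerfile - hypothetical minimal sketch
FROM ubuntu:18.04
RUN apt-get update \
 && apt-get install -y --no-install-recommends bwa \
 && rm -rf /var/lib/apt/lists/*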

Commit and push to trigger an automated build

git add dockerfiles/${tool}.Dockerfile
git commit
git push --set-upstream origin docker/${tool}/${version}

This should trigger an automated build in the linked Docker Hub/cloud repository.

In case the automated build is not triggered for a newly created Docker repo, it may help to delete the Docker repo and repeat steps 1-3 above. Then push some innocuous change to the branch to trigger the build.

If everything works as intended, you may update conf/containers.config to the new tool version.

Then either create a PR to merge the new branch into master or, if you have write permissions for this repository or are working on your own fork of it, check out master and merge.

git checkout master
git merge docker/${tool}/${version}

Report

TODO: add information on

  • how to edit the report template
  • how the final report gets generated

If the report template is sufficiently generic we will be able to easily render it to HTML and PDF; otherwise we should settle for HTML(?).

Rendering of the report constitutes the final step of the pipeline and relies on a container defined in dockerfiles/renderer.Dockerfile for the rendering environment.

Rendering outside the pipeline

There are several ways of rendering the report outside the pipeline, with docker being the preferred option.

Using docker

Put all required files in one place

mkdir -p localrender
cp report/report.Rmd  localrender/
cp results/* localrender/
cp flowinfo/*.{json,tsv} localrender/

Docker run rendering

docker run \
  --rm \
  --user $(id -u):$(id -g) \
  --volume $(pwd)/localrender:/render \
  --volume $(pwd)/bin:/binr \
  --workdir /render \
  rsuchecki/renderer:0.4.1_81ab6b5d71509d48e3a37b5eafb4bca5b117b5fc /binr/render.R

Rendered report should be available under ./localrender

Using singularity

Put all required files in one place

mkdir -p localrender \
 && cp report/report.Rmd  localrender/ \
 && cp results/* localrender/ \
 && cp flowinfo/*.{json,tsv} localrender/

Singularity run rendering

singularity exec \
  --bind $(pwd)/bin:/binr \
  --pwd $(pwd)/localrender \
  docker://rsuchecki/renderer:0.4.1_81ab6b5d71509d48e3a37b5eafb4bca5b117b5fc /binr/render.R

Rendered report should be available under ./localrender

Natively

If you'd like to render the report without docker/singularity, you will need the following:

  • R e.g. on ubuntu sudo apt install r-base-core
  • pandoc e.g. on ubuntu sudo apt install pandoc pandoc-citeproc
  • LaTeX e.g. on ubuntu sudo apt install texlive texlive-latex-extra
  • R packages:
    • rmarkdown
    • rticles
    • bookdown
    • tidyverse
    • jsonlite
    • kableExtra

Then:

mkdir -p localrender \
 && cp report/report.Rmd  localrender/ \
 && cp results/* localrender/ \
 && cp flowinfo/*.{json,tsv} localrender/

cd localrender && ../bin/render.R

Manuscript

Manuscript source is under the manuscript/ subdirectory on the manuscript branch, which should not be merged into master. The application note is drafted in RMarkdown in the manuscript/repset.Rmd file. RMarkdown is well integrated in RStudio, but can be written/edited in a text editor of your choice.

Rendering

The manuscript will be rendered when the pipeline is executed while the manuscript branch is checked out, either

  • locally or
  • by specifying -revision manuscript at run-time

The appropriate revision of the master branch should first be merged into the manuscript branch.

The manuscript can be rendered outside the pipeline in a fashion analogous to how this can be done for the report, just replace any use of report by manuscript.

Bibliography

Among the alternatives available we opted for BibTeX, see writing/references.bib.

repset's People

Contributors

alexwhan, rsuchecki


repset's Issues

Results presentation

We need to update/refine how the results are presented. There are several data sets which are or should be displayed in the final report.

The report itself is generated from RMarkdown input, and the generation of individual plots could be embedded there, or (as it is at the moment) separate scripts can be used to generate the plots (these are called by dedicated processes within the pipeline). We could also have a mix of the two approaches.

Some of the generated plots should be included in the manuscript as well, which might be an argument for generating them separately. However, since the manuscript is also based on an RMarkdown file, plotting code could simply be duplicated, allowing for the tweaks required to accommodate the different requirements of the report and the manuscript.

We will also use kableExtra to embed tabular data in the report in addition to the plotted results.

In each of the following cases we may need to update the R code and potentially also the formatting of input data (either in the pipeline or in R)

aligned,paired,tool,target,seqtype,sra,aligntime
78420391,73364822,bowtie2,ucsc.hg19.fa,RNA,SRR4228250,3282000
71378006,63407826,subread,ucsc.hg19.fa,RNA,SRR4228250,3390000
69029156,69029156,biokanga,ucsc.hg19.fa,RNA,SRR4228250,3895477
70212274,70212274,hisat2,ucsc.hg19.fa,RNA,SRR4228250,781000
78749980,71688262,bwa,ucsc.hg19.fa,RNA,SRR4228250,1478937
78202761,76019142,bbmap,ucsc.hg19.fa,RNA,SRR4228250,17742714
73873470,73873470,star,ucsc.hg19.fa,RNA,SRR4228250,366156

Output data generated by the pipeline for DNA alignment stats should probably be reshaped within the pipeline before putting effort into plotting, feedback welcome!

  • Simulated DNA-Seq alignment results - summary statistics
  • Simulated DNA-Seq alignment results - Individual read placement along chromosomes for more in depth illustration on how different aligners handle repeats, low complexity etc. Perhaps using density plots - optional
  • Real DNA-Seq alignment results - to be added to the pipeline

Note that in addition to aligntime, currently extracted from Nextflow's trace functionality, we could easily pick up additional information from other fields:

nextflow.trace/v2
realtime=1239
%cpu=1404 
rchar=12190352
wchar=4301
syscr=1268
syscw=15
read_bytes=26866688
write_bytes=4096
%mem=0
vmem=823168
rss=27868
peak_vmem=823168
peak_rss=27868
vol_ctxt=1584
inv_ctxt=87

Explicit package version installs in rscripts.Dockerfile?

May have to revert to the earlier syntax of dockerfiles/rscripts.Dockerfile,

e.g.

RUN R -e "install.packages('jsonlite', version = '1.5', repos = 'https://cran.csiro.au')"
# OR
RUN R -e "install.packages(c('jsonlite', 'RColorBrewer'), version = c('1.5','1.1-2'), repos = 'https://cran.csiro.au')"

as littler does not support installing an explicit version (?)

Questions:

  • would that be sufficient for capturing versions?
  • is it worth the effort given that a version is captured in the container? I think it is.

Issue on hold due to limited and late stage use of RScripts

Results, artefacts, persistence

We should revisit how GH releases are presented so that the distinction between a run-capture release and a regular pipeline release is more obvious - I think this could be done by re-phrasing a few things under

repset/nextflow.config

Lines 186 to 199 in 5def49c

releaseArgs = [
  REPO : workflow.repository.replaceFirst("^(http[s]?://github\\.com/|git@github\\.com:)","").replaceFirst("\\.git\$",""),
  COMMIT : workflow.commitId,
  LOCAL_FILES : [
    // "${params.outdir}/report.html",
    // "${params.outdir}/biokanga-manuscript.pdf",
    "${params.outdir}/allstats.json",
    "${params.infodir}/runmeta.json",
    "${params.infodir}/trace.tsv"
  ],
  RELEASE_TAG: "${workflow.revision}_${workflow.runName}_${workflow.sessionId}",
  RELEASE_NAME: "${workflow.revision} - results and metadata for run '${workflow.runName}'",
  RELEASE_BODY: "Release created and artefacts uploaded for run '${workflow.runName}', session ID ${workflow.sessionId}, commit ${workflow.commitId}, see assets for more details."
]

Alternatively, we could re-consider what we capture and where we deposit - perhaps we should be using https://developers.zenodo.org/#rest-api directly?

Google cloud execution

Repset could, in principle, be executed using https://cloud.google.com/life-sciences/

limitations

  • not well suited to short-running tasks, this could be addressed by

  • requested memory + CPU combinations must match what is available, e.g.
    Execution failed: creating instance: inserting instance: Invalid value for field 'resource.machineType': 'zones/australia-southeast1-a/machineTypes/custom-8-4783'. Memory should be a multiple of 256MiB, while 4783MiB is requested. Memory size for 8 vCPU instance should be between 7424MiB and 53248MiB, while 4783MiB is requested

    • API calls to obtain available machineTypes & their specs?
    • logic to ensure memory & CPU reservations match?
  • ?

Would it be better to go k8s on https://cloud.google.com/kubernetes-engine instead?

To be continued...

Container versions may not be enough - combine with commit SHA?

Among the things we do to ensure reproducibility is to be explicit about the version of the containers we use, e.g.

withLabel: rnftools {
  container = 'rsuchecki/rnftools:0.3.1.3'
}

There can however be scenarios where the version remains unchanged but the container has to be modified, e.g. recently ps became essential for some of the Nextflow reporting functionality which we use for reporting run times. Most of the containers are not affected, but Debian-based ones are.

We now generate additional tags for docker containers, such that in addition to repo/tool:version we also get repo/tool:commitSHA. We can now point to the latter, but this is not exactly readable; should we add another tag, or replace it with something like repo/tool:version_commitSHA?

Gene-level rna2rna evaluation?

When aligning RNA-Seq reads simulated from transcripts extracted from genome+annotation, many (unsurprisingly) align to an incorrect splice form. Basic, rnf-style evaluation focuses on the exact position of the read. Should we, in addition to that, also capture the numbers of reads aligning to the correct gene, as in some analyses this would suffice? One could also be more lenient about imprecise alignment within the correct transcript, but this would be relatively rare, so probably not worth pursuing.

Add CI

Add Travis CI with a simple execution on

  • one or a few small genomes
  • the fastest tool for each alignment mode (or one in the leanest container)
  • possibly break this down into separate runs to test modes/tools separately - this may be limited by the size of some key container images
  • may have to reduce the size of some of the containers, particularly rnftools

Instructions for adding aligners to framework

We may add more aligners to the pipeline, but more importantly we should make it easy for a user not familiar with Nextflow and our pipeline to add new tools. For that we need to provide instructions on how to:

  • Add reference indexing process/template
  • Add one or more read alignment templates (e.g. defaults, optimised)
  • Specify a container to be used for running the additional tool
  • (Optional) add custom computational requirements config if needed.

Further modularise

Until NF modules (nextflow-io/nextflow#984) are rolled out, the pipeline should be made more manageable by moving some of the R and groovy scripts from main.nf to either

  • standalone scripts under bin/
  • templates e.g. under templates/R, templates/groovy

This could go hand-in-hand with finalizing #9 by streamlining how results are summarized (groovy) and presented (R)

Get basic stats for any set of real reads

  • Either
    • DNA
    • RNA
  • Either
    • Local
    • Remote (http/ftp/SRA)
  • Stats
    • Alignment rates
    • Run-time, memory
    • Read distribution?
  • Input definition in JSON or YAML would include
    • read file paths/URLs or SRA accession
    • reference file path or URL - covered elsewhere, match by target: [species, version]
  • This will require either or both
    • more robust resource requirements scaling - will require further refinements
    • exposing parameters for initial/max memory/time

Capture warnings in JSON/report

Some warnings generated e.g. during tool version validation, such as

WARN: Declared tool version string 0.7.17-r1188 not found in container image spec rsuchecki/bwa:0.7.17_8b61e2a77c105f3ec28d260b556af5cf12c49111

are printed to the terminal but should probably make their way into the report (via runmeta.json)

log.warn "Declared tool version string ${rec.version} not found in container image spec ${rec.container}."

possibly also

repset/main.nf

Line 70 in cec08e2

log.warn "Malformed real reads entry will be ignored: ${it}"

onComplete fails with NXF_VER newer than 19.04.0

Serialising workflow metadata to JSON causes onComplete handler execution to repeat and stack overflow.

Affects Nextflow versions 19.05.0-edge -> 19.10.0

Fix may break compatibility with 19.04.0

Automated builds for multiple Dockerfiles

It may be good to have all Docker containers auto-built from the existing dockerfiles/.

Problems:

  • rather than organisation/tool:version we'd have to rely on something like organisation/catchall-name:tool_version
  • would need a consistent way of storing the tool version such that a build hook script could consistently pick it up.
  • General build trigger(s) for the whole repo? Probably not, as we can have a rule per Dockerfile, but either way additional customisation of hooks (https://docs.docker.com/docker-cloud/builds/advanced/#custom-build-phase-hooks) will be necessary to avoid all images being built each time.

Upload results to release if tagged and token available

Use GitHub API to create releases and upload results - this could be weaved in or more likely around the pipeline.

  • requires using an authentication token which could be an input to the pipeline or a wrapper script, or env var
  • results could be uploaded to an existing release or a new release could be created if one matching current tag does not exist.
  • preferably used only for runs with specific revision e.g. using nextflow run -revision v1.2.1
    • if not, additional checks needed at execution to check if no diff to HEAD and use last commit SHA for tagging
  • uploaded results could carry additional run metadata allowing multiple results sets for the same release

Consider using NF workflow.onComplete block - trace available but not report, timeline or DAG

Singularity builds

Singularity containers built from docker pulls may not be fully consistent

WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull

This may not be a huge issue since we'd rather run the final version of the pipeline on AWS Batch, which means using docker containers directly. Nevertheless, it would be good to have fixed singularity containers on shub for each docker container on Docker Hub. Currently, automation on shub does not appear to be flexible enough to integrate with the procedure developed for Docker Hub. For example, you can only point automated builds to already existing GitHub branches 👎

container images for Rmd rendering

We currently have one r-ver/tidyverse - based container image for rendering

  • the final report
  • the manuscript from a separate branch

Only the latter requires LaTeX, so we could define separate container images for each of the two tasks.

In addition, the r-ver/tidyverse image includes the 350 MB (uncompressed) RStudio Server, which is not needed for either task.

containerize the rendering environment

  • R e.g. on ubuntu sudo apt install r-base-core
  • pandoc e.g. on ubuntu sudo apt install pandoc pandoc-citeproc
  • LaTeX e.g. on ubuntu sudo apt install texlive texlive-latex-extra
  • additional R packages
    • rmarkdown
    • rticles
    • bookdown

More recent rnftools containers not fully functional

It appears that rsuchecki/rnftools:0.3.1.3_3123fca68e14580a453deea77a0549929ed44715@sha256:58d993c965a3cddeca2eeff7191f4d3e46075b2007af2e38fdbaaaf9f7786e3a built from 3123fca may be the last one that is fully functional. Also, the recent biocontainers-based image fails when using the MASON simulator. This may be due to the conda recipe for rnftools not specifying versions for its dependencies (?)

  • We can continue using the existing container.
  • At least a warning required for rnftools.Dockerfile on master
  • Investigate in more detail
    • it appears that the image built from our Dockerfile is slightly less problematic than the biocontainers version (issues around libraries for samtools); the earlier version still results in what appears to be a fully working container

Additional genomes for evaluation

Additional genomes can be selected in many ways; this may be a start:

mysql -u anonymous -h mysql-eg-publicsql.ebi.ac.uk -P 4157
use ensemblgenomes_info_41;

Then

SELECT species, assembly_name, assembly_level, base_count \
FROM genome WHERE division = "EnsemblMetazoa" AND assembly_level = "chromosome";
species assembly_name assembly_level base_count
aedes_aegypti AaegL3 chromosome 1383974186
anopheles_darlingi AdarC3 chromosome 136950925
anopheles_gambiae AgamP4 chromosome 273109044
atta_cephalotes Attacep1.0 chromosome 317690795
belgica_antarctica ASM77530v1 chromosome 89583723
caenorhabditis_elegans WBcel235 chromosome 100286401
caenorhabditis_briggsae CB4 chromosome 108384165
culex_quinquefasciatus CpipJ2 chromosome 579057705
drosophila_simulans ASM75419v3 chromosome 124963774
drosophila_pseudoobscura Dpse_3.0 chromosome 152696192
drosophila_yakuba dyak_caf1 chromosome 165693946
drosophila_melanogaster BDGP6 chromosome 143725995
melitaea_cinxia MelCinx1.0 chromosome 389907520
mnemiopsis_leidyi MneLei_Aug2011 chromosome 155875873
nasonia_vitripennis Nvit_2.1 chromosome 295780872
pediculus_humanus PhumU2 chromosome 110804242
sarcoptes_scabiei SscaA1 chromosome 56262437
schistosoma_mansoni ASM23792v2 chromosome 364541798
solenopsis_invicta Si_gnG chromosome 396024718
trichinella_spiralis Tspiralis1 chromosome 63525422

Similarly

SELECT species, assembly_name, assembly_level, base_count \
FROM genome WHERE division = "EnsemblPlants" AND assembly_level = "chromosome";
species assembly_name base_count
arabidopsis_lyrata v.1.0 206667935
aegilops_tauschii ASM34733v1 3313764331
arabidopsis_thaliana TAIR10 119667750
beta_vulgaris RefBeet-1.2.2 566181630
brachypodium_distachyon Brachypodium_distachyon_v3.0 271163419
brassica_rapa Brapa_1.0 283822783
brassica_oleracea BOL 488622507
chondrus_crispus ASM35022v2 104980420
chlamydomonas_reinhardtii Chlamydomonas_reinhardtii_v5.5 111098438
cyanidioschyzon_merolae ASM9120v1 16728945
dioscorea_rotundata TDr96_F1_Pseudo_Chromosome_v1.0 456674974
cucumis_sativus ASM407v2 193829320
daucus_carota ASM162521v1 421502825
gossypium_raimondii Graimondii2_0 761405269
helianthus_annuus HanXRQr1.0 3027844945
glycine_max Glycine_max_v2.0 978416860
hordeum_vulgare IBSC v2 4834432680
leersia_perrieri Lperr_V1.4 266687832
lupinus_angustifolius LupAngTanjil_v1.0 609203021
manihot_esculenta Manihot esculenta v6 582117524
medicago_truncatula MedtrA17_4.0 412800391
musa_acuminata ASM31385v1 472960417
nicotiana_attenuata NIATTr2 2365682703
ostreococcus_lucimarinus ASM9206v1 13204888
oryza_glaberrima Oryza_glaberrima_V1 316419574
oryza_barthii O.barthii_v1 308272304
oryza_brachyantha Oryza_brachyantha.v1.4b 260838168
oryza_meridionalis Oryza_meridionalis_v1.3 335668232
oryza_glumipatula Oryza_glumaepatula_v1.5 372860283
oryza_punctata Oryza_punctata_v1.2 393816603
oryza_rufipogon OR_W1943 338040714
oryza_nivara Oryza_nivara_v1.0 337950324
oryza_indica ASM465v1 427004890
oryza_sativa IRGSP-1.0 375049285
phaseolus_vulgaris PhaVulg1_0 521076696
physcomitrella_patens Phypa V3 471852792
prunus_persica Prunus_persica_NCBIv2 227411381
populus_trichocarpa Pop_tri_v3 434132815
setaria_italica Setaria_italica_v2.0 405732883
solanum_tuberosum SolTub_3.0 810654046
sorghum_bicolor Sorghum_bicolor_NCBIv3 708735318
solanum_lycopersicum SL2.50 823630941
theobroma_cacao Theobroma_cacao_20110822 345993675
trifolium_pratense Trpr 304842038
triticum_dicoccoides WEWSeq v.1.0 10079039394
triticum_aestivum IWGSC 14547261565
triticum_urartu ASM34745v1 3747163292
vigna_angularis Vigan1.1 466744453
vitis_vinifera 12X 486265422
vigna_radiata Vradiata_ver6 463085359
zea_mays B73 RefGen_v4 2135083061

Include pseudoalignment approaches

We can consider adding pseudoalignment approaches (Kallisto/Salmon) for comparison with the aligners, but since these approaches rely on a transcriptome, it makes little sense to do a transcriptome-based read simulation followed by alignment to that transcriptome if the competing aligners were to target the whole genome. Valid options include:

  • Input transcriptome - simulation - (pseudo)alignment to transcriptome target using all tools
  • Input real RNA-Seq, (pseudo)aligned to either transcriptome or the genome or both depending on the tool

Points for paper

Main focus

  • Reproducibility
  • Re-computability
  • Re-use
  • Extensibility
  • report to be the main product of the pipeline
  • create manuscript branch, never pull from manuscript into master, only master to manuscript
  • examples of challenges, e.g. existing frameworks
  • identify individual challenges, illustrate our solutions for each of those
    • consistent software environment - building and exposing containers
    • code tags and releases under standard semantic versioning e.g. v1.5 + generated DOIs
    • capturing output with env and runtime details #17; how much of the pipeline info/introspection and execution environment can be captured and added to a release (and so also a DOI)? For that, we should have "special" run-related releases, e.g. v1.5_suffix, where suffix needs to be unique (a technical requirement anyway) and could potentially be the result of hashing the data to be uploaded to the release (results, env info, run info). A simpler version of the suffix could just capture the NF exec profile plus a timestamp.

Discuss

  • At a minimum, your code will run on a 16-core machine with 120 GB of memory
  • 5GB of storage and 1 hour of compute time per month. Use your academic email to get 20GB of storage and 10 hours of compute time per month
  • If you’re working on preparing your code and need more storage or compute time, please contact us. You can also purchase extra compute time at $1 per hour and extra storage at $5/mo for every 50GB.

Dynamic resource allocation or exposed max resources params

Memory & time (and indirectly CPU) requirements for indexing and aligning will vary depending on input size. Given that we want users to be able to supply any genome (plus optional annotation), we need to handle variable inputs.

Limitations / caveats

  • Speed comparisons will only make sense if CPU/mem and max time requirements are fixed for a given input set
  • Important to have a reasonable max time after which we accept that a tool fails

Possible solutions (either or both):

Different versions of the same tool

There are various ways in which we could include different versions of the same tools, for example

  1. Leave things as they are; if more than one version of a tool is to be included in the evaluation, embed the version as part of the tool "name" - this would apply to the template file name, alignerParams and the container config.
  • 👍 Quick and easy, no need to explicitly define each tool's details in a config file
  • 👎 Proliferation of templates and top-level entries
  • 👎 👍 Implicit enforcing of version match when pairing index with aligner
  2. Define everything in a dedicated config/YAML/JSON, as currently done in aligners.config for alignerParams. This would then also include the container spec for indexing/alignment.
  • 👍 clear
  • 👎 very verbose
  • Not clear what to do with the templates. Would we also define template filenames here? Currently templates are picked up automatically from dedicated dirs.

Proposed config file

mappers = [
    [
      tool: 'bbmap',
      version: '38.44',
      container: 'rsuchecki/bbmap:38.44_fae5e1e07240e69896dbf7095872fb6fea43d045'
    ],
    [
      tool: 'bbmap',
      version: '38.49',
      container: 'rsuchecki/bbmap:38.49_9e975d9bc6a657bc4306f4475be393b9fbe8e3fb',
      template: 'bbmap_old' //optional, otherwise the tool name is used to match templates
    ],
    [
      tool: 'minimap2',
      version: '2.17',
      container: 'rsuchecki/minimap2:2.17_1d3f326820696496f025a95632979cd4ea4140cb'
    ]
  ]

prep:

Channel.from(params.mappers)
.filter { record -> //ensure required info present
   ['tool','version','container'].every { it in record.keySet() }
}
.map { //populate optional field(s)
  if(!it.containsKey('template'))
    it.put('template', it.tool)
  it
}
  3. Extend/replace containers.config to allow an optional per-version level. Since NF dynamic labels are not currently supported (nextflow-io/nextflow#894), the incentive for keeping the current structure is only based on potential future NF support and is not essential. In other words, there is little point in continuing to use withLabel selectors for the alignment and indexing container spec.

If 2 or 3, or a mix of those:

  • Do we allow, and how do we allow, multiple templates to accommodate differences between versions?
  • Do we allow, and how do we allow, multiple sets of params to accommodate different versions? Easy - another level in alignerParams
  • Need to ensure separate indexing for each version; indexing params

Since we have moved fine tuning out of the templates, which now only provide core functionality with detailed parameterisation in alignerParams, perhaps we can do away with templates altogether and have those core scripts embedded in aligners.config or similar?

👎 However we specify the version string, it will be free text - there is no practical way of enforcing a match with the software. A discrepancy could happen with the container as well, i.e. the container label may be wrong, but at least we know exactly what container was used.

Read simulation from any genome and alignment eval

  • DNA

    • This is how Simulated DNA path of our workflow works
  • RNA (BEERS)

    • limited to existing, published datasets - human, malaria
    • requires annotated genome
  • RNA (general)

  • simulators which record original read position? If so, add conversion to RNF format already used for simulated DNA and which provides syntax for simulated 'spliced' reads

  • Alternatively, given genome + gff/bed, extract transcripts, simulate reads using existing DNA workflow, convert coordinates, evaluate using RNFtools

Validity of run-time comparisons

⚠️

Run times may not be comparable between tools/runs if we don't ensure that the underlying computational conditions were comparable. For that, the evaluation would probably have to be executed on a dedicated server with a task having exclusive access to that machine and input/output files being placed on local storage (e.g. using nextflow's scratch true).

  1. Cluster execution could be acceptable if we can ensure
  • homogeneity of the nodes (explicit partition spec?)
  • exclusive use of nodes
  • use of local scratch space
  2. Cloud (awsbatch) execution could be acceptable if we can ensure
  • homogeneity of the nodes
  • exclusive use of nodes
  • use of local scratch space

In addition, we must capture more of the task information via

trace.fields = 'task_id,name,status,exit,realtime,%cpu,rss'

  • this should also include the requested resources cpus,memory,time - more here

The CPU details can easily be picked up in the mapping process, e.g. beforeScript 'cat /proc/cpuinfo > cpuinfo', and parsed downstream. This is of limited value on its own for serious speed benchmarking, but may be useful for the indicative reporting of speed.

Report/manuscript rendering if output file exists

If a version of the manuscript already exists under writing/ the rendering process will fail as the output file glob * for the render process excludes any input files (as expected). This can happen if a user opts for rendering outside the pipeline and later re-runs the pipeline including the rendering process.

This can be fixed e.g. by specifying the output file name(s) explicitly rather than via a glob. Final report and/or manuscript file name could be defined in nextflow.config.

Multiple alignment runs/settings for a given tool/index

Exploration/comparison of different sets of parameters for a given aligner.

One way of doing that is by allowing multiple alignment templates for a given tool while preserving the link to the corresponding index. Index metadata could hold a list of alignment templates, and the index item in a channel could be duplicated for each alignment template. Doable, but a bit fragile. Labelling would be a problem; perhaps better to do this via config (also a run-time input JSON/YAML definition)?

This is basically a nested map where the top-level key is the tool label, and each second-level key is a free label with an associated string of params. For each tool we should have a default, which should include all params we use that are not essential to the tool functioning within the framework. We should probably also list the "reserved" params (in/out files, threads, in/out format-related params).

JSON or YAML could be produced as part of the workflow run, then added to or modified by the user and used with -params-file

Example:

  • NF config file
params {
  alignersParamsRNA = [
    'biokanga': [
      default: '--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 5 --minchimeric 50',
      alternative: '--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 50', 
      another: '--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 75'
    ],
    'bbmap': [
      default: 'maxindel=100000 ambiguous=best intronlen=20 local=t',
      custom: 'maxindel=200000 ambiguous=best intronlen=20 local=f'
    ] 
  ]
}
  • JSON
{
    "alignersParamsRNA": {
        "biokanga": {
            "default": "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 5 --minchimeric 50",
            "alternative": "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 50",
            "another": "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 75"
        },
        "bbmap": {
            "default": "maxindel=100000 ambiguous=best intronlen=20 local=t",
            "custom": "maxindel=200000 ambiguous=best intronlen=20 local=f"
        }
    }
}
  • YAML
---
alignersParamsRNA:
  biokanga:
    default: "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 5 --minchimeric 50"
    alternative: "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 50"
    another: "--mode 0 --pemode 3 --pairmaxlen 100000 --substitutions 2 --minchimeric 75"
  bbmap:
    default: maxindel=100000 ambiguous=best intronlen=20 local=t
    custom: maxindel=200000 ambiguous=best intronlen=20 local=f

Results presentation v2

(v2 - closing #26 as it is badly out-of-date)

Input:
Standardised, (close to) stable JSON file capturing all (simulated) read alignment results for all alignment modes, tools, versions and param sets. For each result it also includes metadata (tool, read simulation, reference).

Challenge:
Depending on the run, the number of data points to be displayed can range from a handful to an arbitrarily large number.

Output:
HTML report with interactive table or tables, allowing for filtering and sorting records

Technology (?):

Capture failed task in report?

An alignment task may fail for a variety of reasons; it is retried several times and finally ignored to allow the workflow to run to completion. Metadata on all tasks, including failed/ignored ones, are captured by NF in the trace and the HTML report, but this information is not currently included in our rendered report. Consequently, tables/figures will simply lack some entries rather than indicate that e.g. aligner X failed to complete a given task within the set time/memory limits.

  • is task metadata for already completed tasks available before pipeline finishes?

If it is,

  • include in rendered report
  • capture reason (time? memory? other?)

If it is not,

  • additional channels can be used to indicate expected output entries to data collecting processes.

If we want failures to be captured directly alongside or within the evaluation summary JSON, we could take over the handling of failures from Nextflow, with a dummy SAM/BAM produced and perhaps the contents of .command.err captured and embedded or linked to in the generated report.

  • πŸ‘ Failed tasks info in plain sight, able render some plots without the relevant data points just missing
  • πŸ‘Ž NF no longer re-submitting the failed tasks which is really useful to iron-out glitches in HPC/Cloud execution - could probably get around this by setting, outputting appropriate validExitStatus in conjunction with task.attempt .

Further reproducibility wish-list

Tool version capture beyond container tag.

  • Add a version command and a parsing template to each tool definition in mappers.config
  • Add a process definition to run the version command
  • Weave the output into the run metadata
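
A hypothetical sketch of the first two points (the versionCommand field and the process below are assumptions, not part of the current pipeline):

// hypothetical extra field per tool definition in conf/mappers.config:
//   versionCommand: 'my_mapper --version'

// hypothetical process collecting the reported version strings for the run metadata
process getToolVersion {
  tag "${mapper.tool}"
  container "${mapper.container}"

  input:
    val mapper from mappersChannel // channel of tool definition maps

  output:
    stdout into toolVersionsChannel // to be merged into runmeta.json downstream

  """
  ${mapper.versionCommand}
  """
}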

Storage for singularity images built from docker (a few gigs only)

  • Ideally via api calls to an appropriate repository
  • at a minimum, capture and make the exact container images available for any published results

Explicitly machine-readable output

We already output formatted results and metadata; additional formatting and standardisation will be required

  • Identify relevant standards
  • Apply the standards to outputs and metadata

Update containers

Biocontainers deliver great value: small image size, at the reasonable cost of not always having access to the (admittedly incorrectly) expected standard Linux tools such as awk, sed etc.

Standard biocontainers are one tool per container. This is ideal for most of our use cases, with some exceptions, in particular aligners outputting SAM rather than BAM.

Multi-package containers can be quickly built by submitting a PR to https://github.com/BioContainers/multi-package-containers. This generates mamba-based containers, with tool names and versions hashed in the image name.

Biocontainers server(s) are available with 40k Singularity image files.
