phac-nml / mikrokondo

A simple pipeline for bacterial assembly and quality control

Home Page: https://phac-nml.github.io/mikrokondo/

License: MIT License

assembly bacteria bioinformatics quality-control annotation contamination-detection metagenomics nextflow pipelines

mikrokondo's Introduction


Introduction

What is mikrokondo?

Mikrokondo is a tidy workflow for performing routine bioinformatic tasks such as read pre-processing, contamination assessment, assembly, and assembly quality assessment. It is easily configurable, provides dynamic dispatch of species-specific workflows, and produces common outputs.

Is mikrokondo right for me?

Mikrokondo is purpose-built to give sequencing and clinical laboratories a single standardized workflow covering the initial quality assessment of sequencing reads and assemblies, along with initial pathogen-specific typing. It has been designed to be configurable, so new tools and quality metrics can easily be incorporated to automate these routine tasks regardless of the pathogen of interest. It currently accepts Illumina, Nanopore, or PacBio sequencing data (PacBio support is only partially tested), and it can perform hybrid assembly or accept pre-assembled genomes.

This workflow will detect which pathogen(s) are present and apply the applicable metrics and genotypic typing where appropriate, generating easy-to-read and easy-to-understand reports. If your group regularly sequences or analyzes genomic sequences, implementing this workflow will automate these common bioinformatic tasks and reduce the hands-on time they usually require.

Citation

This software (currently unpublished) can be cited as:

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Contact

[Matthew Wells] : [email protected]

Installing mikrokondo

Step 1: Installing Nextflow

Nextflow is required to run mikrokondo (Linux is required), and installation instructions can be found at either the Nextflow Home page or the Nextflow Documentation.
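For reference, the documented quick install downloads the Nextflow launcher with a single command (a sketch; the ~/.local/bin destination is an assumption, so use any directory on your PATH):

# Download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash
# Make it executable and move it onto your PATH
chmod +x nextflow
mv nextflow ~/.local/bin/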

Step 2: Choose a Container Engine

Nextflow and mikrokondo only support running the pipeline using containers such as Docker, Singularity (now Apptainer), Podman, Gitpod, Shifter, and Charliecloud. Currently only usage with Singularity has been fully tested (Docker and Apptainer have been partially tested), but support for each of the container services exists.

Note

Singularity was adopted by the Linux Foundation and is now called Apptainer. Singularity still exists, but it is likely newer installs will use Apptainer.

Docker or Singularity?

Docker requires root privileges, which can make it a hassle to install on computing clusters (there are workarounds). Apptainer/Singularity does not, so Apptainer/Singularity is the recommended method for running the pipeline.

Step 3: Install dependencies

Besides the Nextflow runtime (which requires Java) and a container engine, the dependencies required by mikrokondo are fairly minimal: only Python 3.10 or newer is needed.

Dependencies listed

  • Python (>=3.10)
  • Nextflow (>=22.10.1)
  • Container service (Docker, Singularity, and Apptainer have been tested)
  • The source code: git clone https://github.com/phac-nml/mikrokondo.git
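Putting these together, a minimal setup might look like the following (a sketch; the --help flag assumes the usual nf-core-style pipeline help output is available):

# Clone the pipeline and confirm Nextflow can parse it
git clone https://github.com/phac-nml/mikrokondo.git
cd mikrokondo
nextflow run main.nf --help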

Step 4: Further resources to download

  • GTDB Mash Sketch: required for speciation and for determining whether a sample is metagenomic
  • Decontamination Index: required for decontamination of reads (it is simply a minimap2 index)
  • Kraken2 nt database: required for binning of metagenomic data; an alternative to using Mash for speciation
  • Bakta database: running Bakta is optional. A light database option exists, but the full database is recommended. You will have to unzip and un-tar the database before use (see the sketch after this list); if you skip running Bakta, downloading this database is not required.
  • StarAMR database: running StarAMR is optional and requires downloading the StarAMR databases. If you wish to avoid the download, the StarAMR container includes a database that mikrokondo will use by default when one is not specified.
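Unpacking the Bakta database might look like the following (a sketch; the archive name db.tar.gz is a hypothetical placeholder for whatever file you downloaded):

# Unzip and un-tar the Bakta database into the current directory
tar -xzf db.tar.gz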

Configuration and settings:

The paths to the downloadable resources above must be set in your nextflow.config. The spots to update in the params section of the nextflow.config are listed below:

// Bakta db path, note the quotation marks
bakta {
    db = "/PATH/TO/BAKTA/DB"
}

// Decontamination minimap2 index, note the quotation marks
r_contaminants {
    mega_mm2_idx = "/PATH/TO/DECONTAMINATION/INDEX"
}

// Kraken2 db path, note the quotation marks
kraken {
    db = "/PATH/TO/KRAKEN/DATABASE/"
}

// GTDB Mash sketch, note the quotation marks
mash {
    mash_sketch = "/PATH/TO/MASH/SKETCH/"
}

// STARAMR database path, note the quotation marks
// Passing in a StarAMR database is optional; if one is not specified, the database in the container will be used.
// Leave the db option as null if you do not wish to pass one.
staramr {
  db = "/PATH/TO/STARMAR/DB"
}
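As an alternative to editing nextflow.config directly, Nextflow can merge overrides from a separate config file passed with -c (a minimal sketch; the file name resources.config and all paths are placeholders):

// resources.config: override only the database locations
params {
    bakta { db = "/data/bakta/db" }
    r_contaminants { mega_mm2_idx = "/data/decontamination/index.mmi" }
    kraken { db = "/data/kraken2/nt/" }
    mash { mash_sketch = "/data/gtdb/sketch.msh" }
}

The overrides are then applied per run with: nextflow run main.nf -c resources.config ...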

Getting Started

Usage

nextflow run main.nf --input PATH_TO_SAMPLE_SHEET --outdir OUTPUT_DIR --platform SEQUENCING_PLATFORM -profile CONTAINER_TYPE

Please check out the documentation for complete usage instructions here: docs

Under the usage section you can find example commands, instructions for configuration and a reference to a utility script to reduce command line bloat!
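For instance, a run on paired-end Illumina data using Singularity might look like this (a sketch; the sample sheet and output paths are placeholders, and the platform value illumina is an assumption based on the test profiles):

nextflow run main.nf --input samplesheet.csv --outdir results --platform illumina -profile singularity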

Data Input/formats

Mikrokondo requires two things as input:

  1. Sample files - fastq and fasta must be in gzip format
  2. Sample sheet - this FOFN (file of file names) contains sample names and allows users to combine read-sets. The following header fields are accepted (an example sheet is shown below):
    • sample
    • fastq_1
    • fastq_2
    • long_reads
    • assembly

For more information see the usage docs.
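A minimal sample sheet might look like the following (a sketch; the file names are hypothetical, and each sample only fills in the columns that apply to it):

sample,fastq_1,fastq_2,long_reads,assembly
sampleA,sampleA_R1.fastq.gz,sampleA_R2.fastq.gz,,
sampleB,,,sampleB_nanopore.fastq.gz,
sampleC,,,,sampleC.fasta.gz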

Output/Results

All output files will be written into the outdir (specified by the user). More detailed tool results can be found in both the Workflow and Subworkflow sections of the docs. Here is a brief description of the outdir structure:

  • annotations - dir containing all annotation tool output.
  • assembly - dir containing all assembly tool related output, including quality, 7 gene MLST and taxon determination.
  • pipeline_info - dir containing all pipeline related information including software versions used and execution reports.
  • ReadQuality - dir containing all read tool related output, including contamination, fastq, mash, and subsampled read sets (when present)
  • subtyping - dir containing all subtyping tool related output, including SISTR, ECtyper, etc.
  • SummaryReport - dir containing collated results files for all tools, including:
    • Individual flattened JSON reports for each sample
    • final_report - All tool results for all samples in both .json (including a flattened version) and .tsv format
  • bco.json - data provenance file generated from the nf-prov plug-in
  • manifest.json - data provenance file generated from the nf-prov plug-in

Run example data

Four test profiles with example data are provided and can be run like so:

  • Assembly test profile: nextflow run main.nf -profile test_assembly,<docker/singularity> --outdir <OUTDIR>
  • Illumina test profile: nextflow run main.nf -profile test_illumina,<docker/singularity> --outdir <OUTDIR>
  • Nanopore test profile: nextflow run main.nf -profile test_nanopore,<docker/singularity> --outdir <OUTDIR>
  • Pacbio test profile: nextflow run main.nf -profile test_pacbio,<docker/singularity> --outdir <OUTDIR>
    • The PacBio workflow has only been partially tested, as it fails at Flye because too few reads are present in the test data.

Testing

Integration tests are implemented using nf-test. In order to run tests locally, please do the following:

Install nf-test

# Only the nf-test package needs to be installed. The commands below are
# only for if you want Nextflow and nf-test in a separate conda environment.
conda create --name nextflow-testing nextflow nf-test
conda activate nextflow-testing

Run tests

# From mikrokondo root directory
nf-test test

Add --profile singularity to switch from the default (Docker) to Singularity.
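For example, to run the full suite with Singularity (combining the command above with the profile flag):

# From the mikrokondo root directory
nf-test test --profile singularity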

Troubleshooting and FAQs:

Within release 0.1.0, Bakta is skipped by default, though it can be enabled from the command line or within the nextflow.config (please check the docs for more information). It has been disabled by default due to issues using the latest Bakta database releases, which stem from an issue with amr_finder; fixes are available, and older databases still work, although they have not been tested. A user can still enable Bakta themselves or fix the database. More information is provided here: oschwengers/bakta#268

For a list of common issues or errors and their solutions, please read our FAQ section.

References

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Legal and Compliance Information:

Copyright Government of Canada 2023

Written by: National Microbiology Laboratory, Public Health Agency of Canada

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Updates and Release Notes:

mikrokondo's People

Contributors

apetkau, christypeterson, emarinier, mattheww95


mikrokondo's Issues

Contig count check before Quast

Description of the bug

Verify a given assembly contains contigs before proceeding into quality control steps.


Include sample name/identifier to any output files

Description of feature

Include additional sample identification information in the pipeline's output files.

Current behaviour

Currently, output files stored in IRIDA Next from mikrokondo include minimal information. For example, the quast report is always named quast.pdf, no matter which sample it represents.

Requested change

The requested change is to include the sample identifier to the output files. For example INXT_SAM_AYD3BVVS75_quast.pdf.

Alternative change 1

Alternatively, the sample name could be included: 08-5578_quast.pdf. This requires further investigation.

Alternative change 2

Or, both the sample name and identifier could be included: INXT_SAM_AYD3BVVS75_08-5578_quast.pdf.

Assemblies filtered for length not being used in subsequent processes

Description of the bug

Filtered assemblies are not being passed to the checkm or mlst processes. Additionally, they are not being returned from the QC_ASSEMBLIES workflow for subsequent usage.


Add sample failure message to final report

Description of feature

If a sample is screened out due to having too few reads, that is not reflected in the report. For example, FastP can filter out all of your reads because they are shorter than a minimum length value, yet this outcome is not verbosely reflected in the final report.

Rename ext.containers to ext.parameters

Are these `ext.containers` lines grabbing the whole dictionary of parameters for each process? For example, is the `ext.containers` here being assigned:
    locidex {
        // awaiting singularity image build
        //singularity = "https://depot.galaxyproject.org/singularity/locidex%3A0.1.1--pyhdfd78af_1"
        singularity = "quay.io/biocontainers/locidex:0.1.1--pyhdfd78af_1"
        docker = "quay.io/biocontainers/locidex:0.1.1--pyhdfd78af_1"
        min_evalue = params.lx_min_evalue
        min_dna_len = params.lx_min_dna_len
        min_aa_len = params.lx_min_aa_len
        max_dna_len = params.lx_max_dna_len

[...]

Which makes this kind of weird later, because in the process you're doing:

container "${workflow.containerEngine == 'singularity' || workflow.containerEngine == 'apptainer' ? task.ext.containers.get('singularity') : task.ext.containers.get('docker')}"

which makes sense functionally, but the naming doesn't track with the object being interacted with. Could you either change it so all of these ext.containers objects for each process contain only containers in their dictionaries, or maybe change the dictionary name to something like ext.parameters?

Originally posted by @emarinier in #62 (comment)

Update Bakta to v1.9.2 to manage skip amrfinder plus bug

Description of the bug

Using an updated version of the Bakta database results in an error from AMRFinderPlus saying the database is out of date.

Command used and terminal output

Traceback (most recent call last):
  File "/usr/local/bin/bakta", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/bakta/main.py", line 271, in main
    expert_amr_found = exp_amr.search(cdss, cds_aa_path)
  File "/usr/local/lib/python3.10/site-packages/bakta/expert/amrfinder.py", line 47, in search
    raise Exception(f"amrfinder error! error code: {proc.returncode}. Please, try 'amrfinder_update --force_update --database {amrfinderplus_db_path}' to update AMRFinderPlus's internal database.")
Exception: amrfinder error! error code: 1. Please, try 'amrfinder_update --force_update


Dynamically populated arguments in the nextflow.config need to be passed as closures into the modules.config

Description of the bug

I don't know if this is specific to version 23 of Nextflow or if it has just gone untested, but when a parameter for a nested variable is passed from the command line to populate an argument in the params, it causes part of the params (everything below where the arg is) to disappear. This can be fixed by converting the dynamically created args into closures, and the same will need to be done to the other arguments.

This should be resolved soon.

