bccdc-phl / mlst-nf

Run mlst on multiple samples with integrated quality control.

Nextflow 61.31% Python 38.69%

mlst microbial-genomics microbial-subtyping

mlst-nf's Issues

Pipeline fails on low-quality assembly

QUAST fails when given an assembly with no contig greater than 500 bp, which causes the whole pipeline to fail. One poor-quality sample can crash a full run, so the pipeline would be more robust if a single low-quality sample could not bring it down.
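
One possible guard is to check the longest contig in each assembly before handing it to QUAST. The following is a minimal sketch, not the pipeline's actual code: the function names are hypothetical, and the 500 bp default simply mirrors the threshold mentioned above.

```python
def longest_contig_length(fasta_path):
    """Return the length (in bp) of the longest sequence in a FASTA file."""
    longest = 0
    current = 0
    with open(fasta_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record.
                longest = max(longest, current)
                current = 0
            else:
                current += len(line)
    # Account for the final record, which has no header after it.
    return max(longest, current)


def passes_min_contig_length(fasta_path, min_length=500):
    """True if at least one contig reaches min_length (assumed threshold)."""
    return longest_contig_length(fasta_path) >= min_length
```

Assemblies that fail this check could be reported and skipped instead of being passed to QUAST.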

Make input QC optional

There are cases where we run this pipeline on the outputs of another pipeline (generally BCCDC-PHL/routine-assembly). That pipeline may already perform QC on its outputs, so running essentially the same QC on the inputs of this pipeline would be redundant.

Add a --skip_input_qc flag that causes the QUAST analysis on the input assemblies to be skipped.

Remove `versioned_outdir` param

The versioned_outdir param hasn't proven to be useful, and it clutters up our publishDir directives.

Remove the versioned_outdir param.

Pipeline version in manifest is out of sync with tagged release

The pipeline version as stated by the manifest here:

mlst-nf/nextflow.config

Line 4 in 58a3697

version = '0.1.1'

...is out of sync with our tagged releases as listed here:

https://github.com/BCCDC-PHL/mlst-nf/releases
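
A check like this could be automated. Below is a rough sketch that extracts the version from the text of nextflow.config and compares it with a release tag. The function names are hypothetical, and it assumes tags are written either as the bare version or with a leading "v".

```python
import re


def manifest_version(config_text):
    """Extract the first `version = '...'` value from nextflow.config text."""
    match = re.search(
        r'^\s*version\s*=\s*["\']([^"\']+)["\']', config_text, re.MULTILINE
    )
    return match.group(1) if match else None


def matches_tag(config_text, tag):
    """Compare the manifest version with a tag, ignoring one leading 'v'."""
    bare_tag = tag[1:] if tag.startswith("v") else tag
    return manifest_version(config_text) == bare_tag


config = "manifest {\n  version = '0.1.1'\n}\n"
print(manifest_version(config))
```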

Add support for `--collect_outputs`

We currently generate only a separate output directory for each sample. It would be convenient to also collect the sequence types for all samples into a single .csv file. The user should be able to specify a prefix for the collected outputs using a --collected_outputs_prefix flag, whose default value is collected.
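
As a rough illustration of the collection step, here is a sketch in Python. It assumes the per-sample layout `<outdir>/<sample>/<sample>_sequence_type.csv` with a shared header row; the function name and the `<prefix>_sequence_type.csv` output name are hypothetical. In the pipeline itself this would more likely be done with a Nextflow operator.

```python
import csv
from pathlib import Path


def collect_sequence_types(outdir, prefix="collected"):
    """Concatenate per-sample sequence type CSVs into one file under outdir."""
    outdir = Path(outdir)
    collected_path = outdir / f"{prefix}_sequence_type.csv"
    header = None
    rows = []
    # Only files one directory below outdir match, so the collected
    # file itself is never re-read on a second run.
    for path in sorted(outdir.glob("*/*_sequence_type.csv")):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header
            rows.extend(row for row in reader if row)
    with open(collected_path, "w", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return collected_path
```

The header is taken from the first file found and is not validated against the others.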

Update provenance to match schema

Update the provenance data collected by this pipeline to match our pipeline provenance schema.

`parse_alleles.py` fails when no alleles included in mlst output

Command error:
  Traceback (most recent call last):
    File "/home/dfornika/.nextflow/assets/BCCDC-PHL/mlst-nf/bin/parse_alleles.py", line 78, in <module>
      main(args)
    File "/home/dfornika/.nextflow/assets/BCCDC-PHL/mlst-nf/bin/parse_alleles.py", line 29, in main
      num_alleles = len(mlst[sample]['alleles'])
  TypeError: object of type 'NoneType' has no len()

json output from mlst was:

{
   "sample-X.fa" : {
      "scheme" : "-",
      "sequence_type" : "-",
      "alleles" : null,
      "filename" : "sample-X.fa"
   }
}
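
One way to avoid the TypeError is to treat a null `alleles` value as an empty collection before calling `len()`. This is a minimal sketch using the JSON above; `count_alleles` is a hypothetical helper, not a function from parse_alleles.py.

```python
import json


def count_alleles(mlst_record):
    """Return the number of alleles in one mlst record, treating null as zero."""
    # `or {}` covers both a missing key and an explicit JSON null.
    alleles = mlst_record.get("alleles") or {}
    return len(alleles)


mlst_output = json.loads("""
{
   "sample-X.fa" : {
      "scheme" : "-",
      "sequence_type" : "-",
      "alleles" : null,
      "filename" : "sample-X.fa"
   }
}
""")

for sample, record in mlst_output.items():
    print(sample, count_alleles(record))
```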

Add GitHub Actions-based testing workflow

Add a testing workflow using GitHub Actions.

Update provenance format to match schema

Our tests are currently failing because the provenance files produced by this pipeline don't match our schema.

Update the provenance format to match the schema.

Update mlst with databases

Pull in the latest version of mlst

Adopt nf-core conventions

In anticipation of integrating with tools and platforms like Seqera Platform, we'd like to evaluate what would be necessary to adopt the nf-core conventions for our existing pipelines. Since this is a fairly simple pipeline, it's a good candidate for conversion to nf-core.

Add optional versioned output directory

The pipeline currently creates one output directory per sample and publishes all outputs there, e.g.:

mlst-nf/modules/mlst.nf

Line 5 in 9f0e0f0

publishDir "${params.outdir}/${sample_id}", mode: 'copy', pattern: "${sample_id}_mlst.json"

When combining this pipeline with others, it may be useful to encapsulate the outputs from this pipeline in a sub-directory that is named with the pipeline name and version.

So by default we would create outputs of this structure:

.
├── sample-01
│   ├── sample-01_alleles.csv
│   └── sample-01_sequence_type.csv
├── sample-02
│   ├── sample-02_alleles.csv
│   └── sample-02_sequence_type.csv
└── sample-03
    ├── sample-03_alleles.csv
    └── sample-03_sequence_type.csv

...but when running with a --versioned_outdir flag, we would produce:

.
├── sample-01
│   └── mlst-nf-v0.1-output
│       ├── sample-01_alleles.csv
│       └── sample-01_sequence_type.csv
├── sample-02
│   └── mlst-nf-v0.1-output
│       ├── sample-02_alleles.csv
│       └── sample-02_sequence_type.csv
└── sample-03
    └── mlst-nf-v0.1-output
        ├── sample-03_alleles.csv
        └── sample-03_sequence_type.csv

...then a subsequent analysis could produce similar outputs alongside:

.
├── sample-01
│   ├── mlst-nf-v0.1-output
│   │   └── sample-01_mlst.csv
│   └── routine-assembly-v0.2-output
│       ├── sample-01_bakta.gbk
│       └── sample-01_unicycler.fa
├── sample-02
│   ├── mlst-nf-v0.1-output
│   │   └── sample-02_mlst.csv
│   └── routine-assembly-v0.2-output
│       ├── sample-02_bakta.gbk
│       └── sample-02_unicycler.fa
└── sample-03
    ├── mlst-nf-v0.1-output
    │   └── sample-03_mlst.csv
    └── routine-assembly-v0.2-output
        ├── sample-03_bakta.gbk
        └── sample-03_unicycler.fa

Run quast on input assemblies

We should QC incoming data by running quast on input assemblies.
