nf-core / bacass
Simple bacterial assembly and annotation pipeline
Home Page: https://nf-co.re/bacass
License: MIT License
The documentation is at https://nf-co.re/bacass/usage, but the link in the Documentation section of the pipeline page https://nf-co.re/bacass#documentation points to https://nf-core/bacass/docs, which is a broken link.
If you use nf-core/atacseq for your analysis, please cite it using the following doi: 10.5281/zenodo.2634132
Several processes use tools that may also be of interest in other pipelines and could be made available in nf-core/modules, such as:
Other processes use local modules because the nf-core modules were not applicable:
The MultiQC module retrieves QC and read-trimming stats only.
Gather the files of all modules (supported by MultiQC) and let MultiQC access their data to get a complete overview of the workflow.
Hi,
Thank you for this great workflow!
I encountered the following error from Unicycler when running hybrid mode:
Dependencies:
Program Version Status
spades.py 3.13.0 good
racon - good
makeblastdb 2.5.0+ good
tblastn 2.5.0+ good
bowtie2-build 2.4.1 good
bowtie2 2.4.1 good
samtools ? too old
java 11.0.8-internal good
pilon 1.23 good
bcftools not used
Error: Unspecified error with Unicycler dependencies
I think this is probably related to this issue.
It seems changing to another image fixes the issue. I used quay.io/biocontainers/unicycler:0.4.8--py37h13b99d1_3.
Software versions:
Chenhao
I am trying to run bacass v1.1.0, using Nextflow version 20.04.1 and the Singularity profile. The run terminates with an error exit status (127):
Error executing process > 'trim_and_combine (ERR3219830)'
Caused by:
Process `trim_and_combine (ERR3219830)` terminated with an error exit status (127)
Command executed:
# loop over readunits in pairs per sample
pairno=0
echo "ERR3219830_R1.fastq.gz ERR3219830_R2.fastq.gz" | xargs -n2 | while read fq1 fq2; do
skewer --quiet -t 8 -m pe -q 3 -n -z $fq1 $fq2;
done
cat $(ls *trimmed-pair1.fastq.gz | sort) >> ERR3219830_trm-cmb.R1.fastq.gz
cat $(ls *trimmed-pair2.fastq.gz | sort) >> ERR3219830_trm-cmb.R2.fastq.gz
Command exit status:
127
Command output:
(empty)
Command error:
.command.sh: line 5: skewer: command not found
The command I attempted was:
nextflow run nf-core/bacass --input bacass_short.tsv -profile singularity --kraken2db "~/db/minikraken2_v1_8GB_201904.tgz"
This breaks Prokka. The Docker container / Conda environment needs to be rebuilt.
See: tseemann/prokka#453
So apparently it doesn't work with ${sample_id}_report.tsv
- maybe I should try `${sample_id}.report.tsv` instead, and/or configure a custom name in the MultiQC config for the pipeline:
quast_config:
fn: *_report.tsv
could do the trick maybe :-)
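For reference, MultiQC lets a pipeline override a module's file search pattern via the `sp:` section of a custom multiqc_config.yml. A sketch of what that could look like here (the `quast` key and whether the QUAST module honours this particular glob are assumptions to be checked against the MultiQC docs):

```yaml
# Hypothetical multiqc_config.yml fragment: tell MultiQC's QUAST module
# to also pick up files named like <sample_id>_report.tsv
sp:
  quast:
    fn: "*_report.tsv"
```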
Hi all. I'm excited to get started with your workflow. However, I cannot as of yet run the fastqc step, at least on nextflow version 19.07.0.5106 / docker.
Here's a snippet of the output:
fastqc -t {task.cpus} -q 1_R1_001_trm-cmb.R1.fastq.gz 1_R1_001_trm-cmb.R2.fastq.gz
...
Value "{task.cpus}" invalid for option threads (number expected)
Looks like the actual value of task.cpus is not getting substituted. Perhaps there is a typo in Line 252 of main.nf:
Proposed change is:
- fastqc -t {task.cpus} -q ${fq1} ${fq2}
+ fastqc -t ${task.cpus} -q ${fq1} ${fq2}
I can issue a PR, but this seems like a straightforward fix.
thanks!
Hello! My colleagues and I have been actively working on enhancing the nf-core/bacass workflow to address lab-specific challenges in bacterial genome assembly. We would be happy to add these improvements to the main nf-core/bacass repository in case you are interested.
Currently, these enhancements have been implemented in my local fork of nf-core/bacass on the buisciii-develop branch.
nextflow run main.nf \
-profile singularity,test \
--skip_kmerfinder false \
--kmerfinderdb path/to/kmerfinder_db/bacteria \
--ncbi_assembly_metadata path/to/ncbi_assembly_metadata/assembly_summary_bacteria.txt \
--outdir ./results \
-w ./work \
-resume
If you think these improvements could be implemented in nf-core/bacass, let me know so I can work on the test data and test profile.
In reference to #11 - would it be an option to simply replace Prokka with Dfast:
https://github.com/nigyta/dfast_core
From my tests, it seems to outperform Prokka in terms of "loci annotated/named" - and it's also on bioconda and easily as fast as Prokka.
The issue with Prokka and Tbl2asn seems like it won't be fixable - and having the Docker image "expire" every so often is probably not an ideal solution moving forward.
Cheers,
Marc
The assembly polishing step is not performed in the workflow, neither by default nor with the param --polish_method.
Steps to reproduce the behaviour:
nextflow run nf-core/bacass --input 'sample_sheet.csv' --outdir /data/nihr/nanopore_sequencing/9_08_21/hybrid_assembly_output/ -profile docker --assembly_type hybrid -resume --kraken2db "https://genome-idx.s3.amazonaws.com/kraken/k2_standard_8gb_20210517.tar.gz"
bacass/modules/local/functions.nf
Line 32 in 9599673
There seems to be an incorrect regex here. The above regex can be translated as: match one or more `/` at the beginning, OR match one or more `/` followed by a literal `$`. I believe the regex doesn't remove whitespace and trailing slash(es). The correct regex should be replaceAll("\/+$| +", "").
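A quick sanity check of the suggested pattern (shown in Python since the regex semantics are the same as in Groovy; the sample paths are made up for illustration):

```python
import re

# Suggested pattern: strip one or more trailing slashes and any spaces.
def clean(s):
    return re.sub(r"/+$| +", "", s)

print(clean("my path/to/db//"))    # -> "mypath/to/db"
print(clean("/data/kraken2_db/"))  # -> "/data/kraken2_db"
```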
This is a collection of ideas that should be considered after the DSL2 conversion #56 is finished. The list is subject to change. Any ideas or discussions are welcome.
--skip_kraken2 should either be removed (i.e. using --krakendb to determine whether Kraken2 is used) or a simple default (small, fast, but helpful) value should be chosen for --krakendb, e.g. "https://genome-idx.s3.amazonaws.com/kraken/16S_Greengenes13.5_20200326.tgz". This is a very small 16S database but should be sufficient to detect serious bacterial contamination.
Follow viralrecon recommendations: tests should just test execution and not run on big data.
To prevent breaking this pipeline in the near future, the nf-validation version should be pinned to version 1.1.3 like:
plugins {
    id '[email protected]'
}
Dragonflye allows polishing the raw assembly with Illumina reads if provided (see the dragonflye docs). However, the current implementation of dragonflye in nf-core/bacass performs long-read assembly only.
Allow nf-core/bacass and the nf-core/dragonflye module to polish the assembled genome with short reads when provided.
Would be good to add the Zenodo DOI for the release to the main README of the pipeline in order to make it citable. You will have to do this via a branch pushed to the repo in order to directly update master. See PR below for example and file changes:
nf-core/atacseq#38
See https://zenodo.org/record/2669429#.XVZ0bOhKhPY
Web-hooks are already set up for this repo so that a unique Zenodo DOI is generated every time a new version of the pipeline is released. Would be good to add this in after every release.
nf-core v2.10 template update #89
Hi,
thank you for providing this pipeline.
Would you consider providing A5-miseq support for short-reads-only mode?
I normally use Spades (or Unicycler in this case), but I've consistently been getting better results with A5-miseq when assembling a short-reads-only dataset.
e.g. compare these two assemblies
bacass - unicycler:
Total n: 181
Total seq: 5473999 bp
Avg. seq: 30243.09 bp
Median seq: 1476.00 bp
N 50: 143598 bp
Min seq: 110 bp
Max seq: 623519 bp
a5-miseq:
Total n: 78
Total seq: 5532586 bp
Avg. seq: 70930.59 bp
Median seq: 4987.50 bp
N 50: 278625 bp
Min seq: 623 bp
Max seq: 761326 bp
It is not a huge difference, but I believe it would be a good addition to the pipeline. I'd love to make a PR myself, but I'm still not confident enough with Groovy/nextflow scripting.
Thank you for any assistance you can provide,
V
The documentation describes the input sample sheet as a tab-separated file, but it is labelled csv in the example. The pipeline fails if the suffix is .tsv.
I suggest changing the example and the pattern match to ^\S+\.tsv$
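As a quick illustration, the suggested pattern accepts .tsv paths and rejects the .csv example (Python is used here just to exercise the regex; the file names are made up):

```python
import re

pattern = re.compile(r"^\S+\.tsv$")

print(bool(pattern.match("samplesheet.tsv")))  # True
print(bool(pattern.match("samplesheet.csv")))  # False: wrong suffix
print(bool(pattern.match("my sheet.tsv")))     # False: \S+ forbids spaces
```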
Because of the issue in this post #24, I ran the command conda env create --prefix /home/ss/test_bacass/work/conda/nf-core-bacass-1.1-0-58ac097954559efb6ec2ce857847ed28 --file /home/ss/.nextflow/assets/nf-core/bacass/environment.yml
to create the conda environment, then I ran nextflow run nf-core/bacass --input bacass_short.csv --skip_kraken2 -profile conda
and got another error message:
Error executing process > 'quast (ER064912)'
Caused by:
Process 'quast (ER064912)' terminated with an error exit status (127)
Command executed:
quast -t 2 -o ER064912_assembly_QC ER064912_assembly.fasta
quast -v > v_quast.txt
Command exit status:
127
Command output:
(empty)
Command error:
.command.sh: line 2: quast: command not found
quast is not in the environment.yml file; could it have been missed?
After the latest round of Bakta updates, the latest release v1.8.2 works with the db-light on Zenodo (oschwengers/bakta#241).
The module BAKTA_DBDOWNLOAD_RUN:BAKTA_BAKTADBDOWNLOAD() can be updated to v1.8.2.
I just started using bacass on AWS and tried to access the kraken2db from an S3 bucket. My question: is it possible to access the kraken2db from an S3 bucket? I have tried running bacass with
--kraken2db 's3://kraken2_db/minikraken2_v2_8GB_201904_UPDATE' using nextflow-tower
I got the error message below.
Command error:
kraken2: database ("s3://kraken2_db/minikraken2_v2_8GB_201904_UPDATE") does not contain necessary file taxo.k2d
Thank you very much,
Piroon
In the checks for long reads and Fast5 files, the wrong file is reported if there is an error.
bacass/subworkflows/local/input_check.nf
Line 66 in 9599673
Should read
exit 1, "ERROR: Please check input samplesheet -> Long FastQ file does not exist!\n${row.LongFastQ}"
bacass/subworkflows/local/input_check.nf
Line 74 in 9599673
Should read
exit 1, "ERROR: Please check input samplesheet -> Fast5 file does not exist!\n${row.Fast5}"
I'm planning to work on updating the nf-core/tools-2.9 (#84 ) template to enhance its functionality, improve documentation, and ensure it aligns with the latest best practices.
To that end I am going through this list:
I aim to periodically update this issue to provide insights into the progress made. If anyone has expertise in template development or nf-core best practices, your input would be highly appreciated.
(edit v3: updated task list)
I am running into an issue with the --save_trimmed_fail setting when running the bacass pipeline. It indicates that 'false' is not a valid choice, but it also indicates that 'true' or 'false' are the only valid arguments for this setting.
nextflow run https://github.com/nf-core/bacass \
-name bacass-test-2 \
-params-file https://api.cloud.seqera.io/ephemeral/Mn6cNSXHpkJNUK5IW3P4HQ.json \
-with-tower \
-r 2.2.0 \
-profile docker
workDir : /seqcoast-aws/scratch/1kcilKqDAJaXkO
projectDir : /.nextflow/assets/nf-core/bacass
userName : root
profile : docker
configFiles :
Input/output options
input : https://raw.githubusercontent.com/nf-core/test-datasets/bacass/bacass_hybrid.tsv
outdir : s3://seqcoast-aws
Assembly parameters
assembly_type : hybrid
canu_mode : -nanopore
Annotation
dfast_config : /.nextflow/assets/nf-core/bacass/assets/test_config_dfast.py
Skipping Options
skip_kraken2 : true
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/bacass for your analysis please cite:
* The pipeline
10.5281/zenodo.2669428
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/nf-core/bacass/blob/master/CITATIONS.md
------------------------------------------------------
ERROR ~ ERROR: Validation of pipeline parameters failed!
-- Check 'nf-1kcilKqDAJaXkO.log' file for details
The following invalid input values have been detected:
* --save_trimmed_fail: 'false' is not a valid choice (Available choices: true, false)
-- Check script '.nextflow/assets/nf-core/bacass/./workflows/../subworkflows/local/utils_nfcore_bacass_pipeline/../../nf-core/utils_nfvalidation_plugin/main.nf' at line: 57 or see 'nf-1kcilKqDAJaXkO.log' file for more details
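For context, the error looks like a mismatch between a JSON boolean and a string enum. A hypothetical sketch of why a real JSON `false` fails a validation that lists the string choices "true"/"false" (the parameter name mirrors the log, but the check itself is illustrative, not the plugin's actual code):

```python
import json

# The params file sends a real JSON boolean...
params = json.loads('{"save_trimmed_fail": false}')
value = params["save_trimmed_fail"]  # Python False, not the string "false"

# ...but a schema that enumerates string choices will never match it.
choices = ["true", "false"]
print(value in choices)               # False: a boolean never equals a string
print(str(value).lower() in choices)  # True once coerced to a lowercase string
```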
Sample sheet validation could be performed via nf-validation plugin.
In addition, tsv/csv documentation and parsing will need adjustments.
nf-core/test-dataset::bacass). Two options:
The example run fails on my Ubuntu machine
nextflow run nf-core/bacass -r 1.1.0 -profile docker --input https://raw.githubusercontent.com/nf-core/test-datasets/bacass/bacass_short.csv --kraken2db ${PWD}/minikraken2 --max_memory 40.GB --max_cpus 10
with
Command exit status:
2
Command output:
(empty)
Command error:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
kraken2: database ("../minikraken2") does not contain necessary file taxo.k2d
However, I can use the minikraken2 db locally. Docker version is 19.03.5, Nextflow version is 19.10.0.5170. I'm only seeing this on Ubuntu 18.04: the test runs fine on my MacBook.
Any guidance?
Module compilation error
- file : /home/user/.nextflow/assets/nf-core/bacass/./workflows/../modules/local/skewer.nf
- cause: expecting '}', found ',' @ line 26, column 55.
, emit: lo
^
1 error
Steps to reproduce the behaviour:
nextflow run nf-core/bacass -r 2.0.0 -name amr_sample5 -profile docker -params-file nf-params.json
nf-params.json:
{
"input": "amr_sample2.csv",
"kraken2db": "\/home\/user\/minikraken2_v2_8GB_201904_UPDATE"
}
All nf-core pipelines will be converted to nextflow DSL2 and nf-core/bacass should not be left behind.
Additionally, this opportunity can be used to update all tools and progressively add more.
Currently, I am planning to start on that by mid-September 2021 at the latest. The earliest start would be when nf-core/tools releases its DSL2 template, which might be soon.
I'll write here as soon as I start. If anybody else is planning to tackle, or is already tackling, this problem, please share your plans here so that no redundant work is done.
Edit: #54 with support for DSL2 is open
Hi, may I ask whether it is possible to add homopolish as a polishing tool, applied after polishing with medaka?
I got a "too many arguments" error for one of the commands when I had sample IDs with spaces in the sample sheet. I don't recall which command it was exactly but this should be an issue for any command. I'd suggest to either throw an error when the sample sheet is incorrect in this regard or to automatically get rid of spaces in the IDs.
Thanks!
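For illustration, unquoted shell expansion of an ID containing a space splits it into multiple arguments, which is the usual source of "too many arguments" errors. A generic sketch (not the pipeline's actual command):

```shell
#!/usr/bin/env bash
sample_id="sample 1"

# Unquoted: the shell splits the ID on whitespace -> two arguments
set -- $sample_id
echo "unquoted arg count: $#"   # prints 2

# Quoted: the ID stays a single argument
set -- "$sample_id"
echo "quoted arg count: $#"     # prints 1
```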
Running the following command using bacass 1.1.0:
nextflow run nf-core/bacass --input bacass_short.csv -profile singularity --skip-kraken2
results in the following:
Missing Kraken2 DB arg
One would need to specify the --kraken2db argument, which is a bit counter-intuitive given the role of --skip-kraken2.
Hi, yesterday this pipeline failed for us in the Prokka process. After we rewrote the environment.yml file and pinned Prokka to 1.14.0, it works perfectly now.
Hi there,
I downloaded the latest pipeline v1.1.1
and ran it offline with the following command:
nextflow run $PWD/nf-core-bacass-1.1.1/workflow/ \
-profile singularity \
--kraken2db /path/to/krakendb \
--input $PWD/samples.csv \
--assembly_type long \
--skip_annotation \
--skip_polish \
--assembler canu \
--canu_args 'stopOnLowCoverage=0 minInputCoverage=0'
The contents of the kraken2db folder are
library
taxonomy
hash.k2d
opts.k2d
seqid2taxid.map
taxo.k2d
The error I'm getting is
kraken2: database ("/path/to/krakendb") does not contain necessary file taxo.k2d
Here are the .command.run and .command.sh files (I added .txt at the end to be able to attach them):
command.run.txt
command.sh.txt
In my profile file, I have defined
singularity {
enabled = true
autoMounts = true
cacheDir = "/path/to/images/singularity/nfcore/"
}
Thanks for looking into this. I'm just wondering whether I'm missing something.
Cheers,
Santiago
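One possible cause of the symptom above is that the database directory is not visible inside the Singularity container. As a hedged workaround (assuming the Kraken2 db lives on a filesystem that Singularity does not auto-mount), the host path could be bound explicitly via Nextflow's singularity.runOptions, e.g.:

```groovy
// Illustrative nextflow.config fragment; paths are placeholders
singularity {
    enabled    = true
    autoMounts = true
    cacheDir   = "/path/to/images/singularity/nfcore/"
    // Bind the host directory that holds hash.k2d / taxo.k2d etc.
    runOptions = "-B /path/to/krakendb"
}
```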
Based on the list of proposed enhancements for the nf-core/bacass
pipeline (#57), I suggest the integration of the Dragonflye module into the long-read assembly mode.
When I ran the command nextflow run nf-core/bacass --input bacass_short.csv --skip_kraken2 -profile conda
by using your test data then I got the error message:
Caused by:
Failed to create Conda environment
...
Status: 120
According to another post here: nextflow-io/nextflow#1081, I think it may also be caused by a timeout.
Can you add the line conda { createTimeout = '1 h' } to your nextflow.config file, then let me try it again? Thanks!
Which resolves all the nasty errors with python / updates / annoyances.
If anyone wants to help with this, what we need to do is:
As this is only spare-time work from my side, I need some help here from people with the possibility to contribute @nf-core/core :-) It is also necessary to make this pipeline DSLv2-compatible at some point!
I ran the pipeline with my own data. The command line is "nextflow run nf-core/bacass -r 2.2.0 -profile docker --input ./minikrakendb/baccsamplesheet.tsv --kraken2db /home/aslangabriel/minikrakendb/k2_standard_08gb_20240112.tar.gz --max_cpus 12 --max_memory '125.GB' --outdir ./minikrakendb/results". My sample sheet looks like this,
and the error message was as follows
"-[nf-core/bacass] Pipeline completed with errors-
ERROR ~ Error executing process > 'NFCORE_BACASS:BACASS:FASTQ_TRIM_FASTP_FASTQC:FASTP (JSAHVC01)'
Caused by:
Process NFCORE_BACASS:BACASS:FASTQ_TRIM_FASTP_FASTQC:FASTP (JSAHVC01)
terminated with an error exit status (255)
Command executed:
[ ! -f JSAHVC01_1.fastq.gz ] && ln -sf JSAHVC01_S1_R1.fastq.gz JSAHVC01_1.fastq.gz
[ ! -f JSAHVC01_2.fastq.gz ] && ln -sf JSAHVC01_S1_R2.fastq.gz JSAHVC01_2.fastq.gz
fastp
--in1 JSAHVC01_1.fastq.gz
--in2 JSAHVC01_2.fastq.gz
--out1 JSAHVC01_1.fastp.fastq.gz
--out2 JSAHVC01_2.fastp.fastq.gz
--json JSAHVC01.fastp.json
--html JSAHVC01.fastp.html
--thread 8
--detect_adapter_for_pe
2> >(tee JSAHVC01.fastp.log >&2)
cat <<-END_VERSIONS > versions.yml
"NFCORE_BACASS:BACASS:FASTQ_TRIM_FASTP_FASTQC:FASTP":
fastp: $(fastp --version 2>&1 | sed -e "s/fastp //g")
END_VERSIONS
Command exit status:
255
Command output:
(empty)
Command error:
ERROR: Failed to open file: JSAHVC01_1.fastq.gz
Work dir:
/home/aslangabriel/work/d0/6d8835e9fba25be3a34dcf33080fa2
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
", I have changed the file name per se the test example. please help me to fix it.
Since the last update (2.0.0), the NanoPlot command expects a .png output, but this output format was removed between versions 1.1.1 and 2.0.0. I solved the issue by removing the png output line in modules/local/nanoplot.nf.
Steps to reproduce the behaviour:
Caused by:
Missing output file(s) *.png
expected by process NFCORE_BACASS:BACASS:NANOPLOT (NS45)
Command executed:
NanoPlot
-t 2
--fastq 2111-DK-l1-001.fastq
echo $(NanoPlot --version 2>&1) | sed 's/^.*NanoPlot //; s/ .*$//' > nanoplot.version.txt
Command exit status:
0
Command output:
(empty)
Work dir:
/home/sysgen/Desktop/ncct-projects/2201-Kostner-Assembly/work/b6/5ee3ce889e000a378cc359d94da18a
Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out
I would expect the NanoPlot process to produce different output formats for the length distribution and similar statistics of my long reads.
Have you provided the following extra information/files:
The .nextflow.log file can be found in my fork of the bacass pipeline (https://github.com/jenmuell/bacass).
It seems that the default Docker image doesn't have a functional Prokka installation due to the expiry of tbl2asn. At the moment, I've gotten around this by specifying a different Docker image for the Prokka process, but it would be nice if this weren't necessary.
Is there a way to process single end reads with bacass? I am attempting to modify config files, and it would be amazing if you already had a solution in hand.
Best,
Emily
Hi!
I would like to know: is there a way to set default_jvm_mem_opts in Pilon (which is part of Unicycler) through the nextflow run command line? Especially when one uses -profile conda or -profile docker. Otherwise there will always be a problem when dealing with large genomes.
Hi @apeltzer
We have exchanged messages here: #28 (comment)
A related question:
I am relatively new to Nextflow and to using it with Docker. As a general rule, how big can a Docker image be before it is considered too large? I was wondering if you could help and share some guidelines.
If I do docker build with the environment.yml from nf-core/sarek or mag, the image size comes close to 2.4 GB. Is there a way to reduce the size of the image?
I tried some of these techniques too but it did not help:
https://uwekorn.com/2021/03/03/deploying-conda-environments-in-docker-cheatsheet.html
https://jcristharif.com/conda-docker-tips.html
I also tried with Micromamba, but the size of the final Docker image is still pretty huge:
https://github.com/mamba-org/micromamba-docker
Have you tried Mamba/Micromamba? I would be curious to know your findings
Thanks in advance.
Two warning messages appear during the execution of tests in the nf-core/bacass GitHub CI.
WARN: A process with name 'MINIMAP2_CONSENSUS' is defined more than once in module script: /home/runner/work/bacass/bacass/./workflows/bacass.nf -- Make sure to not define the same function as process
WARN: A process with name 'MINIMAP2_POLISH' is defined more than once in module script: /home/runner/work/bacass/bacass/./workflows/bacass.nf -- Make sure to not define the same function as process
Prokka is no longer under maintenance, and Bakta seems to be a reasonable replacement for genome annotation which incorporates several improvements.
Remove Prokka from nf-core/bacass and add Bakta instead.
It seems that Bakta needs a database to perform the annotations. However, even the light version of its database is somewhat heavy and could slow down the testing process.
Another option is to keep Prokka and add Bakta as an additional tool for annotation.
I am open to suggestions.