
pipelines-nextflow's Introduction

NBIS Annotation service pipelines

Overview

This Nextflow workflow is a compilation of several subworkflows for different stages of genome annotation. The overall genome annotation process is:

graph TD
  preprocessing[Annotation Preprocessing] --> evidenceAlignment[Evidence alignment]
  transcriptAssembly[Transcript Assembly] --> evidenceAlignment
  evidenceAlignment --> evidenceMaker[Evidence-based Maker]
  denovoRepeatLibrary[De novo Repeat Library] ---> evidenceMaker
  transcriptAssembly --> pasa[PASA]
  preprocessing --> pasa
  pasa --> evidenceMaker
  evidenceMaker --> abinitioTraining[Abinitio Training]
  abinitioTraining --> abinitioMaker[Abinitio-based Maker]
  evidenceMaker --> abinitioMaker
  pasa --> functionalAnnotation[Functional Annotation]
  abinitioMaker --> functionalAnnotation
  functionalAnnotation --> EMBLmyGFF3

The subworkflow is selected using the subworkflow parameter.

Citation

If you use these pipelines in your work, please acknowledge NBIS within your communication according to this example: "Support by NBIS (National Bioinformatics Infrastructure Sweden) is gratefully acknowledged."

Acknowledgments

These workflows were based on the Bpipe workflows written by Marc Höppner (@marchoeppner) and Jacques Dainat (@Juke34).

Thank you to everyone who contributes to this project.

Maintainers

  • Mahesh Binzer-Panchal (@mahesh-panchal)
    • Expertise: Nextflow workflow development
  • Jacques Dainat (@Juke34)
    • Expertise: Genome annotation, Nextflow workflow development
  • Lucile Soler (@LucileSol)
    • Expertise: Genome Annotation

Installation and Usage

Requirements:

  • Nextflow
  • A container platform (recommended) such as Singularity or Docker, or the conda/mamba package manager if a container platform is not available. If containers or conda/mamba are unavailable, then tool dependencies must be accessible from your PATH.

Nextflow

Install Nextflow directly:

curl -s https://get.nextflow.io | bash
mv ./nextflow ~/bin

Alternatively, installation can be managed with conda (or mamba) in its own conda environment:

conda create -c conda-forge -c bioconda -n nextflow-env nextflow
conda activate nextflow-env

See Nextflow: Get started - installation for further details.

General Usage

A workflow is run in the following way:

nextflow run NBISweden/pipelines-nextflow \
  [-profile <profile_name1>[,<profile_name2>,...] ] \
  [-c workflow.config ] \
  [-resume] \
  -params-file workflow_parameters.yml

where -profile selects from the predefined profiles (see the Profiles section below), and -c workflow.config loads a custom configuration that alters existing process settings defined in nextflow.config (loaded by default), such as the number of cpus, time allocation, memory, output prefixes, and tool command-line options. The -params-file is a YAML-formatted file listing workflow parameters, e.g.

subworkflow: 'annotation_preprocessing'
genome: '/path/to/genome'
busco_lineage:
  - 'eukaryota_odb10'
  - 'bacteria_odb10'
outdir: '/path/to/save/results'

Note: If running on a compute cluster infrastructure, nextflow must be able to communicate with the workload manager at all times, otherwise tasks will be cancelled. The best way to ensure this is to run nextflow inside a screen or tmux terminal.

E.g. Screen

# Open a named screen terminal session
screen -S my_nextflow_run
# load nextflow with conda
conda activate nextflow-env
# run nextflow
nextflow run -c <config> -profile <profile> <nextflow_script>
# "Detach" screen terminal
<ctrl + a> <ctrl + d>
# list screen sessions
screen -ls
# "Attach" screen session
screen -r my_nextflow_run

Profiles

  • uppmax: A profile for the Uppmax clusters. Tasks are submitted to the SLURM workload manager, executed within Singularity (unless otherwise noted), and use the $SNIC_TMP scratch space. Note: The workflow parameter project is mandatory when using Uppmax clusters.
  • conda: A general purpose profile that uses conda to manage software dependencies.
  • mamba: A general purpose profile that uses mamba to manage software dependencies.
  • docker: A general purpose profile that uses docker to manage software dependencies.
  • singularity: A general purpose profile that uses singularity to manage software dependencies.
  • nbis: A profile for the NBIS annotation cluster. Tasks are submitted to the SLURM workload manager, and use the disk space /scratch for task execution. Software should be managed using one of the general purpose profiles above.
  • gitpod: A profile to set local executor settings in the Gitpod environment.
  • test: A profile supplying test data to check if the workflows run on your system.
  • pipeline_report: Adds a folder to the outdir containing workflow execution reports.
Uppmax profile good practices

Note

Nextflow is enabled using the module system on Uppmax.

module load bioinfo-tools Nextflow

The following configuration in your workflow.config is recommended when running workflows on Uppmax.

// Set your work directory to a folder in your project directory under nobackup
workDir = '/proj/<snic_storage_project>/nobackup/work'
// Restart workflows from last successful execution (i.e. use cached results where possible).
resume = true
// Add any overriding process directives here, e.g.,
process {
    withName: 'BLAST_BLASTN' {
        cpus = 12
        time = 2.d
    }
}
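
Since the project workflow parameter is mandatory on the Uppmax clusters, it can also be set in this same workflow.config rather than on the command line; a minimal sketch with a placeholder project ID:

// SLURM compute project to charge jobs to (placeholder ID; mandatory on Uppmax)
params.project = 'snic20XX-XX-XXX'
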
NBIS profile good practices

Note

Both singularity and conda are installed; however, singularity is preferred for speed and reproducibility.

module load Singularity

The following configuration in your workflow.config is recommended when running workflows on the annotation cluster.

// Set your work directory to a folder on the /active partition
workDir = '/active/<project_id>/nobackup/work'
// Restart workflows from last successful execution (i.e. use cached results where possible).
resume = true
// Add any overriding process directives here, e.g.,
process {
    withName: 'BLAST_BLASTN' {
        cpus = 12
        time = 2.d
    }
}
// Use a shared cache folder for singularity images
singularity.cacheDir = '/active/nxf_singularity_cachedir'
// If using conda, use a shared cache for conda environments
conda.cacheDir = '/active/nxf_conda_cachedir'
// Use mamba for speed over conda
conda.useMamba = true

Project results should be published to /projects, work directories should be kept on /active, and computations are performed on the local /scratch partitions.

pipelines-nextflow's People

Contributors

aersoares81, juke34, lucilesol, mahesh-panchal, martinpippel, nimarafati, pontus, royfrancis, verku


pipelines-nextflow's Issues

Functional annotation pipeline failing when setting a path to a local copy of interproscan in uppmax

Hi @mahesh-panchal,

I am trying to use the nextflow NBIS Functional Annotation pipeline with the uppmax profile, but setting a path to a local copy of the latest version of interproscan (since the current version in Uppmax is two years old). For this, I made a few modifications to the config files, but the pipeline unfortunately fails at the interproscan step.

I downloaded interproscan from here to a local directory on Uppmax. The software was installed following the instructions on the referenced website.

Below, I show the modifications made to parts of some of the config files (see the # <<< comments for explanations):

  • pipelines-nextflow/FunctionalAnnotation/config/software_packages_uppmax.config:
process {
    withLabel: blast {
        container = 'quay.io/biocontainers/blast:2.9.0--pl526h3066fca_4'
    }
    withName: interproscan {
        // use Uppmax module
	//module = 'bioinfo-tools:InterProScan/5.30-69.0'  # <<< I commented this line to force looking at the local copy of interproscan
        # <<< Also added the modules below, as they are typically loaded when calling the InterProScan/5.30-69.0 module in Uppmax
        module = 'bioinfo-tools:java/OpenJDK_11.0.2'
        module = 'bioinfo-tools:python3/3.9.5'
        module = 'bioinfo-tools:perl/5.26.2'
        module = 'bioinfo-tools:Phobius/1.01'
        module = 'bioinfo-tools:SignalP/5.0b'
        module = 'bioinfo-tools:tmhmm/2.0c'
    }
    withLabel: 'AGAT' {
        container = 'quay.io/biocontainers/agat:0.5.1--pl526r35_0'
    }
}
  • pipelines-nextflow/FunctionalAnnotation/nextflow.config:
profiles {

    uppmax {
        executor {
            name = 'slurm'
        }
        process {
            scratch = '$SNIC_TMP'
        }
        includeConfig "$baseDir/config/compute_resources.config"
        singularity.enabled = true
        singularity.envWhitelist = 'SNIC_TMP'
        includeConfig "$baseDir/config/software_packages_uppmax.config"
        env.PATH='${PATH}:/proj/<path>/interproscan-5.52-86.0'  # <<< I added this line with the path to the local copy of interproscan
    }
}
  • I made another config file to rewrite the path to the local copy of interproscan and to set the SNIC account, which is called pipelines-nextflow/FunctionalAnnotation/params.config:
# <<< This chunk of code provides again the path to interproscan to the uppmax profile, just in case
profiles {

    uppmax {
        executor {
            name = 'slurm'
        }
        process {
            scratch = '$SNIC_TMP'
        }
        includeConfig "$baseDir/config/compute_resources.config"
        singularity.enabled = true
        singularity.envWhitelist = 'SNIC_TMP'
        includeConfig "$baseDir/config/software_packages_uppmax.config"
        env.PATH='${PATH}:/proj/<path>/ensembl_early_release/interproscan-5.52-86.0' 
    }
}

// Nextflow parameters
resume = true
process {
    clusterOptions = '-A snic2021-XX-XX' # <<< I added this line to set the snic compute account
    // You can also override existing process cpu or time settings here too.
}

The pipeline was launched using the conda environment with these commands:

# Open screen terminal
screen -S ann
# Load Nextflow environment with conda
conda activate nextflow-env
# Load modules
module load bioinfo-tools Nextflow/21.04.1
# Change NXF_HOME to a place in your project directory (export NXF_HOME=yourprojectfolder)
export NXF_HOME='/<path>/work'

# Set environment variables
REF_FASTA='/proj/<path>/Trachurus_trachurus-GCA_905171665.1-unmasked.fa'
GFF_FILE='/proj/<path>/Trachurus_trachurus-GCA_905171665.1-2021_03-genes.gff3'
PROTEIN_DB='/proj/<path>/uniprot_db/uniprot-filtered-organism-Human-2021-07-16.fasta'
FUNCT_ANN_PIPELINE='/proj/<path>/pipelines-nextflow/FunctionalAnnotation/FunctionalAnnotation.nf'
NEXTFLOW_CONFIG='/proj/<path>/pipelines-nextflow/FunctionalAnnotation/nextflow_mod.config'
OUTPUT_DIR='/proj/<path>/functionalAnn_results'

# Run the nextflow pipeline
NXF_VER=21.04.1 nextflow run -profile uppmax $FUNCT_ANN_PIPELINE \
--genome $REF_FASTA \
--gff_annotation $GFF_FILE \
--blast_db_fasta $PROTEIN_DB \
--outdir $OUTPUT_DIR

Questions:

  • Why is the pipeline failing to use the local copy of interproscan?
  • Are the modifications made to the config files correct to set the path to the local copy of interproscan?

Thanks

running on HPC

Have you tested running them on a Slurm-based HPC system?
Thanks
Shicheng

[New pipeline] MAKER3

See #17 for the general picture.

The purpose of this pipeline is simply to run MAKER. MAKER can be run to make an evidence-based annotation, an ab initio evidence-driven annotation, or only as a chooser/combiner like evidence modeller (if all annotations are computed outside MAKER).

Prerequisite: MAKER 3.02 is not in bioconda; we need to update the recipe. Also needs GAAS and AGAT.

It would be the following steps:

  1. If gff3 files are provided as evidence input, check that the gff files are in match/match_part format, otherwise run the AGAT script to convert them
  2. run MAKER
  3. run a light check that the run is complete (GAAS script)
  4. run gaas_maker_merge_output_from_datastore.pl (GAAS script, this script needs AGAT to make statistics too)

The parameters needed for MAKER are very verbose and complex. We should find a way to provide a copy of the maker_opt.ctl file. If we chain this pipeline with other pipelines (to feed input files) we need to find a way to modify this file properly/accordingly.

Taking forever on uppmax

One user reported that he was not able to run it on Uppmax; it was taking forever. He ran the nextflow pipeline directly after module load nextflow. Could we provide an example sbatch file in the readme to run it through Slurm?
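
A minimal sketch of such an sbatch wrapper for Uppmax, assuming the Nextflow module and a params file as described in the README above (the account, job time, and file names are placeholders to adapt):

#!/bin/bash
#SBATCH -A snic20XX-XX-XXX        # compute project (placeholder)
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 5-00:00:00             # the head job must outlive the whole workflow
#SBATCH -J nextflow_annotation

# Nextflow is provided through the module system on Uppmax
module load bioinfo-tools Nextflow

# The head job only coordinates; Nextflow submits the individual tasks to SLURM itself
nextflow run NBISweden/pipelines-nextflow \
  -profile uppmax \
  -params-file workflow_parameters.yml \
  -resume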

Problem with running busco with AnnotationPreprocessing and Singularity

I tried to run the pipeline AnnotationPreprocessing with Singularity like this :

nextflow run -c params.config -profile nbis,singularity ~/git/NBIS/pipelines-nextflow/AnnotationPreprocessing/AnnotationPreprocessing.nf

my params.config :

// Workflow parameters
params.genome = 'cns_p_ctg_mod_lr_pilon_x3_mtmask.fasta'
params.outdir = 'results'
params.min_length = 1000
// Use `busco --list-datasets` for full list of available lineage sets
params.busco_lineage = [ 'arachnida_odb10', 'bacteria_odb10' ]

My busco folders are created but empty

and in .command.log I get :

nxf-scratch-dir node-a02:/scratch/nxf.ggWYtc0RKT
WARNING: Skipping mount /sw/easybuild/software/Singularity/3.5.2/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
INFO:   ***** Start a BUSCO v4.0.2 analysis, current time: 05/13/2020 11:37:07 *****
INFO:   Configuring BUSCO with /usr/local/config/config.ini
INFO:   Mode is genome
INFO:   Input file is cns_p_ctg_mod_lr_pilon_x3_mtmask_purified.fa
INFO:   Downloading information on latest versions of BUSCO data...
CRITICAL:       Unhandled exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/busco/run_BUSCO.py", line 167, in run_BUSCO
    config_manager.load_busco_config(sys.argv)
  File "/usr/local/lib/python3.7/site-packages/busco/BuscoLogger.py", line 55, in wrapped_func
    self.retval = func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/busco/ConfigManager.py", line 46, in load_busco_config
    self.config.validate()
  File "/usr/local/lib/python3.7/site-packages/busco/BuscoConfig.py", line 243, in validate
    self._init_downloader()
  File "/usr/local/lib/python3.7/site-packages/busco/BuscoConfig.py", line 49, in _init_downloader
    self.downloader = BuscoDownloadManager(self)
  File "/usr/local/lib/python3.7/site-packages/busco/BuscoDownloadManager.py", line 49, in __init__
    self._load_versions()
  File "/usr/local/lib/python3.7/site-packages/busco/BuscoDownloadManager.py", line 58, in _load_versions
    with open(versions_file, "r") as v_file:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/nxf.ggWYtc0RKT/busco_downloads/file_versions.tsv'

HiSat2 image for TranscriptAssembly workflow needs authentication token.

The old mulled biocontainers image didn't work.

The new image: docker://docker.pkg.github.com/nbisweden/pipelines-nextflow/hisat2:2.1.0
works, but since it's on github packages, this needs a Github authentication token to even read from the public image.

The current solution is to pull the image separately using the same instruction as nextflow, provide the authentication token using the options provided by docker or singularity, and then manually copy it to the workDir.

e.g. singularity pull --docker-login --name $TESTDIR/work/singularity/docker.pkg.github.com-nbisweden-pipelines-nextflow-hisat2-2.1.0.img docker://docker.pkg.github.com/nbisweden/pipelines-nextflow/hisat2:2.1.0

The alternative is to make the image public on another site.

error with multiQC while running TranscriptAssembly.nf

I ran

nextflow run -c params.config -profile nbis,conda ~/git/NBIS/pipelines-nextflow/TranscriptAssembly/TranscriptAssembly.nf

and got the following error :

Caused by:
  Process `transcript_assembly:multiqc` terminated with an error exit status (1)
 
Command executed:
 
  multiqc . -c multiqc_conf.yml
 
Command exit status:
  1
 
Command output:
  Searching   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 10/10  
 
Command error:
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/multiqc.py", line 594, in run
      output = mod()
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/modules/flash/flash.py", line 26, in __init__
      super(MultiqcModule, self).__init__(
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/modules/base_module.py", line 45, in __init__
      config.update({anchor: mod_cust_config.get("custom_config", {})})
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/utils/config.py", line 250, in update
      return update_dict(globals(), u)
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/utils/config.py", line 256, in update_dict
      if isinstance(val, collections.Mapping):
  AttributeError: module 'collections' has no attribute 'Mapping'
  ============================================================
 
 [ERROR  ]         multiqc : Oops! The 'seqyclean' MultiQC module broke... 
    Please copy the following traceback and report it at https://github.com/ewels/MultiQC/issues 
    If possible, please include a log file that triggers the error - the last file found was:
      None
  ============================================================
  Module seqyclean raised an exception: Traceback (most recent call last):
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/multiqc.py", line 594, in run
      output = mod()
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/modules/seqyclean/seqyclean.py", line 18, in __init__
      super(MultiqcModule, self).__init__(
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/modules/base_module.py", line 45, in __init__
      config.update({anchor: mod_cust_config.get("custom_config", {})})
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/utils/config.py", line 250, in update
      return update_dict(globals(), u)
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/utils/config.py", line 256, in update_dict
      if isinstance(val, collections.Mapping):
  AttributeError: module 'collections' has no attribute 'Mapping'
  ============================================================
  [ERROR  ]         multiqc : Oops! The 'optitype' MultiQC module broke... 
    Please copy the following traceback and report it at https://github.com/ewels/MultiQC/issues 
    If possible, please include a log file that triggers the error - the last file found was:
      None
  ============================================================
  Module optitype raised an exception: Traceback (most recent call last):
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/multiqc.py", line 594, in run
      output = mod()
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/modules/optitype/optitype.py", line 24, in __init__
      super(MultiqcModule, self).__init__(
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/modules/base_module.py", line 45, in __init__
      config.update({anchor: mod_cust_config.get("custom_config", {})})
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/utils/config.py", line 250, in update
      return update_dict(globals(), u)
    File "/projects/annotation/Lepidium_7/pacbio/hybrid_campestre_heterophyllum_LcxLh/RNAseq/work/conda/multiqc-3f02d6c39194cb687c5310d7a88b0ee5/lib/python3.10/site-packages/multiqc/utils/config.py", line 256, in update_dict
      if isinstance(val, collections.Mapping):
  AttributeError: module 'collections' has no attribute 'Mapping'
  ============================================================
  [WARNING]         multiqc : No analysis results found. Cleaning up..
  [INFO   ]         multiqc : MultiQC complete

the rest of the pipeline is running fine!

[New pipeline] DeNovoRepeatLib

See #17 for the general picture.

Maybe it can be merged with the DeNovoRepeatLib pipeline (see #33).

The purpose of DeNovoRepeatLib is to make a de novo repeat library of a genome.
There are two approaches: should we only use the standard one? Should we use both solutions in parallel? We could provide an option to choose.

solution 1 (standard):
Input: a genome fasta file + an existing library (e.g. Dfam or RepBase) to classify the de novo repeats (give family names), and a protein database (swissprot eukaryote/prokaryote) to remove potential proteins from repeats.
Output: A repeat library fasta file

For the detailed approach see the wiki of the annotation cluster repo here and a more condensed description in this post on Biostars.

TransposonPSI is now in bioconda.
protexcluder is available in the nanjiang conda channel; it should be moved into bioconda.
Be careful with the Blast version (protexcluder needs particular ones).

Solution 2: use EDTA, available in conda and consequently as a biocontainer.

Bioconda env for interproscan not working for the functional_annotation subworkflow

The bioconda env for interproscan is not working in the functional_annotation subworkflow.

To use interproscan, one needs to install the tool and database locally and add a line similar to the following to a custom.config:

env.PATH ='${PATH}:/projects/interproscan/interproscan-5.59-91.0'

Or, one can use one's own conda env for interproscan and add it to the custom.config:

process {
    withName: 'INTERPROSCAN' {
        conda = '/sw/anaconda/2019.10/envs/interproscan'
        container = null
    }    
}

And then run the pipeline the following way :

nextflow run NBISweden/pipelines-nextflow -profile singularity -params-file params.yml -c custom.config

Parameters and paths for test profile

In order to make github actions work with the test profiles certain issues need to be resolved.

Issues:

  • FunctionalAnnotationPreparation.nf:
    • params.blast_db_fasta: Needs a URL to download the protein fasta file from.
    • Interproscan is locally installed, and not available through conda. Need a solution to make interproscan available in the github actions test environment.

[New pipeline] EvidenceAlignment

See #17 for the general picture.

The purpose of this pipeline is to generate gff alignments from protein or transcript fasta files.
Those gff files must be formatted in match/match_part style (see the AGAT script agat_sp_alignment_output_style.pl for that purpose if the tools producing the gff output do not do it by default).

There are 2 types of input: a protein fasta file and/or a nucleotide fasta file.
For both types of alignment we could offer an option to select which tool to use (indeed many tools exist for this task), so it would be nice to allow several choices (e.g. for protein splice-aware alignment: genomethreader, exonerate, gmap, etc.).

For protein alignment:
diamond or blastx for raw alignment, and exonerate, scipio, spawn, or genome threader for polished (splice-aware) alignment
=> priority to implement diamond, blastx and exonerate

For transcript alignment:
=> Minimap2
=> we should also implement the MAKER method in two steps: 1) raw alignment with tblastx for related species data, or blastn for species-specific data; 2) exonerate for polished alignment.

[New pipeline] PASA

See #17 for the general picture.

The idea of this pipeline is to parallelise PASA to make it faster to run. Having PASA would be nice because it can be used to predict genes from evidence, and those predictions can be used in different ways:

  • PASA probably makes a better pure evidence-based annotation than MAKER.
  • the PASA annotation can be used to improve/polish an annotation made with another tool.
  • the PASA annotation can be used to train ab initio tools

PASA is already available on Conda. The difficulty is to parallelise it. Marc has already implemented something in esga, see here.
Can we use it as it is?
It sounds like we could make it more generalised and better commented. For GFF-related tasks (e.g. GffToFasta) we can use AGAT's scripts (see other NBIS pipelines for how to use them), and for fastaSplitSize we can use the GAAS script gaas_fasta_splitter.pl.

AugustusTraining add extra steps

It would be nice to add at the end (where asecodes_parviclava is set as the species parameter in the workflow) the Augustus training steps:

new_species.pl --species=asecodes_parviclava
etraining --species=asecodes_parviclava outdir/TrainingData/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.train
augustus --species=asecodes_parviclava output.gbk.test | tee run.log
augustus --species=asecodes_parviclava TestingData/codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.test | tee run.log

Requires Augustus and the path to the shared profile folder

Problem with installation

Hello there, I want to use this beautiful tool for annotating my genome, but I can't install it.

could you please help me?

This is my log when I try to install it:

'''
[caoshuo@login04 tools]$ curl -s https://get.nextflow.io | bash
/usr/bin/md5sum: line 2: [: ==: unary operator expected
CAPSULE: Downloading dependency dev.failsafe:failsafe:jar:3.1.0
CAPSULE EXCEPTION: Error resolving dependencies. while processing attribute Allow-Snapshots: false (for stack trace, run with -Dcapsule.log=verbose)
Unable to initialize nextflow environment
'''

thanks!

best wishes.

Add MultiQC to Annotation Preprocessing pipeline

It would be nice to have a single report for all the output from
both summary stats and busco:

process MultiQC {

    input:
    path logs
    path config

    output:
    path "multiqc_*"

    script:
    """
    multiqc -c $config .
    """

}

PBS script not loading conda

Hi,

The .command.run that is generated in my scratch does not load conda and then fails to find it.

/var/spool/pbs/mom_priv/jobs/5865502.hpc-batch14.SC: line 286: conda: command not found
/var/spool/pbs/mom_priv/jobs/5865502.hpc-batch14.SC: line 286: /bin/activate: No such file or directory

In PBS, conda is not in the PATH unless it's loaded; in my case it's:

module load Miniconda/3 # but it might be different on other clusters

Is there a way of automatically integrating this into the .command.run file? Otherwise the solution is to put it in my .bashrc; I lost some time before that occurred to me.
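
One possible workaround, untested here and assuming the Miniconda module name from the report above, is to add a beforeScript directive in a custom config so that every generated .command.run loads the module before conda is activated:

// Load the conda module before each task script runs (module name is cluster-specific)
process {
    beforeScript = 'module load Miniconda/3'
}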

update busco for AnnotationPreprocessing pipeline, problems with singularity

I wanted to update the busco version to 5.2.2 in AnnotationPreprocessing.nf. I tried with conda and it worked.

I tried with singularity doing : nextflow run -c params_v5_2_2singularity.config -profile nbis,singularity ~/git/NBIS/pipelines-nextflow/AnnotationPreprocessing/AnnotationPreprocessing.nf

and it says :

Error executing process > 'annotation_preprocessing:busco (genome_uppercase_purified.fa)'

Caused by:
  Process `annotation_preprocessing:busco (genome_uppercase_purified.fa)` terminated with an error exit status (1)

Command executed:

  if [ ! -w "${AUGUSTUS_CONFIG_PATH}" ]; then
      # Create writable tmp directory for augustus
      AUG_CONF_DIR=$( mktemp -d -p $PWD )
      cp -r $AUGUSTUS_CONFIG_PATH/* $AUG_CONF_DIR
      export AUGUSTUS_CONFIG_PATH=$AUG_CONF_DIR
  fi
  # before with buscov4 it was echo "BUSCO_CONFIG_FILE=$BUSCO_CONFIG_FILE", it stops working for buscov5
  echo "BUSCO_CONFIG_FILE=$AUGUSTUS_CONFIG_PATH/myconfig.ini"
  echo "AUGUSTUS_CONFIG_PATH=$AUGUSTUS_CONFIG_PATH"
  busco -c 8 -i genome_uppercase_purified.fa -l viridiplantae_odb10 -m genome --out busco_genome_uppercase_purified_viridiplantae_odb10

Command exit status:
  1

Command output:
  (empty)

Command error:
  .command.sh: line 2: AUGUSTUS_CONFIG_PATH: unbound variable

Work dir:
  /projects/annotation/Lepidium_7/pacbio/L_campestre_Lc92/genome/work/4c/4eaa28784f54efd353288694d4dfbd

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

I guess it does not find the path of augustus inside the singularity container

[BUG] FunctionalAnnotation: InterProScan not working on Rackham with original configuration

I tried to run the pipeline the same way as I ran it successfully on NAC, but added the path to the local InterProScan module (/sw/apps/bioinfo/InterProScan/5.30-69.0/rackham) before starting the pipeline with the "uppmax" Nextflow profile. I got the following error:

Command: bin/hmmer/hmmer3/3.1b1/hmmsearch -Z 4488 --cut_tc --cpu 1 -o
/scratch/19129091/r57.uppmax.uu.se_20210312_150617097_ib08//jobTIGRFAM/000000000001_000000000206.raw.out
data/tigrfam/15.0/TIGRFAMs_HMM.LIB
/scratch/19129091/r57.uppmax.uu.se_20210312_150617097_ib08//jobTIGRFAM/000000000001_000000000206.fasta

  Error output from binary:

  Error: File existence/permissions problem in trying to open HMM file data/tigrfam/15.0/TIGRFAMs_HMM.LIB.
  HMM file data/tigrfam/15.0/TIGRFAMs_HMM.LIB not found (nor an .h3m binary of it)

Next, I installed InterProScan locally (testing different versions), and exported the respective location to the PATH instead of the path to the module, but I kept getting similar errors, e.g.

Error: File existence/permissions problem in trying to open HMM file data/sfld/3/sfld.hmm.  
HMM file data/sfld/3/sfld.hmm not found (nor an .h3m binary of it)

Some observations from testing different versions:

  • Newer Interproscan versions (e.g. 5.46-81.0, 5.51-85.0) have newer database versions installed but when run from within Nextflow, they kept trying to access the older database versions.
  • Version 5.30-69.0 requires Java 1.8 instead of 11 which is installed in the Nextflow conda environment along with Nextflow.

Finally, I got the impression that Nextflow was trying to run the Interproscan Docker container (version 5.30-69.0) that is provided in config/software_packages.config, although I added other installations to the PATH. I eventually managed to run the pipeline on Rackham by commenting out line 10 in the file config/software_packages.config where the container was specified and by commenting out line 183 in FunctionalAnnotation.nf that was adding the container installation to the PATH, and by starting the pipeline this way:

  1. activate the conda environment with Nextflow
  2. load the Uppmax module with module load bioinfo-tools InterProScan/5.30-69.0
  3. start the pipeline with the "uppmax" profile

This is fine for the purpose of my project, but maybe you have an idea how to get it to run on Rackham with the original configuration.

Installing and running pipelines questions

Hi, I'm trying to set up your pipelines. Specifically I'm interested in the Abinitio Training and the Functional Training workflows.

I have had no trouble setting up nextflow and cloning the git repository.

I installed nextflow with conda as suggested in your guide.

I'm new to singularity and nextflow so I may be doing something wrong, but my understanding is that singularity, docker, and mamba/conda are used in your scripts to manage environments. In my past experience with workflow managers like snakemake, the workflow manager downloaded and installed the required packages if they were needed. Is this the case for your pipelines?

I tried running the Abinitio Training workflow with both singularity and conda, but I needed to install tools into the environment myself to get the pipeline to start; it then stopped again when it needed blast. Do I need to set up these environments, and the databases for Interproscan and blast in the Functional training, myself, or am I doing something wrong in getting nextflow and the environment managers to deal with this for me?

Thanks for your help,

I get the following error using the conda profile in a new conda environment using the launch command

nextflow run ../../pipelines-nextflow/ -profile singularity -params-file params.yml tee log_singularity.txt

and the following parameters file:

subworkflow: 'abinitio_training'
genome: '/data/Maker_annotation/Inputs/Genome_A_soft_masked.fasta'
maker_evidence_gff: '/data/Maker_annotation/Genome_rnd1.maker.output/91K_rnd1.all.maker.gff'
maker_species_publishdir: '/data/miniconda3/envs/maker_env/config/species/'
species_label: 'Genome_NBIS'
codon_table: 1
outdir: '/data/Maker_annotation/NBISweden_Ab_initio_test/results'
Error executing process > 'ABINITIO_TRAINING:SPLIT_MAKER_EVIDENCE (91K_rnd1.all.maker)'

Caused by:
  Process `ABINITIO_TRAINING:SPLIT_MAKER_EVIDENCE (91K_rnd1.all.maker)` terminated with an error exit status (1)

Command executed:

  agat_sp_separate_by_record_type.pl \
      -g 91K_rnd1.all.maker.gff \
      -o maker_results_noAbinitio_clean
  if test -f maker_results_noAbinitio_clean/mrna.gff && test -f maker_results_noAbinitio_clean/transcript.gff; then
      agat_sp_merge_annotations.pl \
          --gff maker_results_noAbinitio_clean/mrna.gff \
          --gff maker_results_noAbinitio_clean/transcript.gff \
          --out merged_transcripts.gff
      mv merged_transcripts.gff maker_results_noAbinitio_clean/mrna.gff
  elif test -f maker_results_noAbinitio_clean/transcript.gff; then
      cp maker_results_noAbinitio_clean/transcript.gff maker_results_noAbinitio_clean/mrna.gff
  fi

  cat <<-END_VERSIONS > versions.yml
  "ABINITIO_TRAINING:SPLIT_MAKER_EVIDENCE":
      agat: 0.9.2
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  /bin/bash: .command.sh: No such file or directory

Work dir:
  /data/Maker_annotation/NBISweden_Ab_initio_test/work/c0/8e63a41a6026cdba481471321884e5

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Conda env create fails (limited internet connection)

Hello,

Due to hacker attacks mining for cryptocurrency, my university created a firewall for anything coming from the internet.

When we want to install anything from conda, we need to use a special mirror. I do something like this:

conda install -c http://conda.repo.test.hhu.de/main --override-channels package_name

Now, your pipeline obviously does not account for that and it fails to connect to the regular Anaconda channel (error message below). Is there a way of providing my private mirror in the pipeline?

Cheers,
Ricardo

Error executing process > 'abinitio_training:split_maker_evidence (filtered)'

Caused by:
Failed to create Conda environment
command: conda env create --prefix /gpfs/project/new_annotation/nextflow/scratch/nxf_work/conda/agat-6b6131ed4ecf41887a64062ec60a9397 --file /home/guerrer/projects/src/nextflow/AbinitioTraining/conda/label_agat.yml
status : 1
message:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url https://repo.anaconda.com/pkgs/main/noarch/repodata.json.bz2
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

If your current network has https://www.anaconda.com blocked, please file
a support request with your network engineering team.

ConnectionError(MaxRetryError("HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/main/noarch/repodata.json.bz2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x2b6072023b38>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"))

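
A possible workaround, untested here, is to point conda itself at the mirror through its user-level configuration, which the environment creation run by Nextflow should then pick up (the URL is the mirror from the report above). Note that the pipeline's environment files also list their own channels, so this may not be sufficient on its own:

# Add the local mirror to the channel list written to ~/.condarc
conda config --add channels http://conda.repo.test.hhu.de/main
# Verify the resulting channel configuration
conda config --show channels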

[functional annotation]

@mahesh You already mentioned the FA-nf pipeline; it might be a good alternative to ours. They would like to reimplement it in DSL2; maybe we could join our efforts and create a single pipeline, or use theirs as an alternative.

Error: Session aborted -- Cause: Cannot emit a multi-channel output

Here is the command:

module load conda/2019.10
conda activate nextflow-env

nextflow run \
 -c params.config \
 -profile nbis,conda ~/bower/resources/pipelines-nextflow/AbinitioTraining/AbinitioTraining.nf

Here is the params.config

params.config
// Workflow parameters
params.maker_evidence_gff = "evidence.gff"
params.genome = "genome_clean.fa"
params.outdir = "results"
params.species_label = 'ailuroedus_buccoides'  // e.g. 'asecodes_parviclava'
params.model_selection_value = 0.3
params.locus_distance = 3000
params.codon_table = 1
params.test_size = 100
params.flank_region_size = 1000
params.maker_species_publishdir = '/projects/references/augustus/config/species/ailuroedus_buccoides/'

// Nextflow parameters
resume = true
workDir = '/scratch/roy/augustus'
conda.cacheDir = "$HOME/.nextflow/conda"
singularity.cacheDir = "$HOME/.nextflow/singularity"

Here is the error log from .nextflow.log.

.nextflow.log
Jun-22 09:26:07.233 [main] DEBUG nextflow.cli.Launcher - $> nextflow run -c params.config -profile nbis,conda /home/roy/bower/resources/pipelines-nextflow/AbinitioTraining/AbinitioTraining.nf
Jun-22 09:26:07.383 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 20.01.0
Jun-22 09:26:07.403 [main] INFO  nextflow.cli.CmdRun - Launching `/home/roy/bower/resources/pipelines-nextflow/AbinitioTraining/AbinitioTraining.nf` [confident_dubinsky] - revision: 3402b15e08
Jun-22 09:26:07.431 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /home/roy/bower/resources/pipelines-nextflow/AbinitioTraining/nextflow.config
Jun-22 09:26:07.434 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /projects/annotation/3_bower_birds/ailuroedus_buccoides/augustus/params.config
Jun-22 09:26:07.435 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/roy/bower/resources/pipelines-nextflow/AbinitioTraining/nextflow.config
Jun-22 09:26:07.435 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /projects/annotation/3_bower_birds/ailuroedus_buccoides/augustus/params.config
Jun-22 09:26:07.462 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `nbis,conda`
Jun-22 09:26:08.086 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `nbis,conda`
Jun-22 09:26:08.126 [main] DEBUG nextflow.config.ConfigBuilder - Available config profiles: [bils, debug, test, conda, uppmax, singularity, nbis, docker]
Jun-22 09:26:08.191 [main] DEBUG nextflow.Session - Session uuid: b086f2b3-4b07-43a2-8339-0798ce408f82
Jun-22 09:26:08.192 [main] DEBUG nextflow.Session - Run name: confident_dubinsky
Jun-22 09:26:08.192 [main] DEBUG nextflow.Session - Executor pool size: 32
Jun-22 09:26:08.239 [main] DEBUG nextflow.cli.CmdRun - 
  Version: 20.01.0 build 5264
  Created: 12-02-2020 10:14 UTC (11:14 CEST)
  System: Linux 4.15.0-88-generic
  Runtime: Groovy 2.5.8 on OpenJDK 64-Bit Server VM 11.0.1-internal+0-adhoc..src
  Encoding: UTF-8 (UTF-8)
  Process: 35974@nac-login [192.168.11.35]
  CPUs: 32 - Mem: 62.9 GB (1.9 GB) - Swap: 8 GB (7.5 GB)
Jun-22 09:26:08.278 [main] DEBUG nextflow.Session - Work-dir: /scratch/roy/augustus [ext2/ext3]
Jun-22 09:26:08.279 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /home/roy/bower/resources/pipelines-nextflow/AbinitioTraining/bin
Jun-22 09:26:08.307 [main] DEBUG nextflow.Session - Observer factory: TowerFactory
Jun-22 09:26:08.309 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Jun-22 09:26:08.524 [main] DEBUG nextflow.Session - Session start invoked
Jun-22 09:26:08.529 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /projects/annotation/3_bower_birds/ailuroedus_buccoides/augustus/pipeline_report/execution_trace.txt
Jun-22 09:26:09.024 [main] WARN  nextflow.NextflowMeta$Preview - DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
Jun-22 09:26:09.028 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Jun-22 09:26:09.052 [main] INFO  nextflow.Nextflow - 
NBIS
  _   _ ____ _____  _____
 | \ | |  _ \_   _|/ ____|
 |  \| | |_) || | | (___
 | . ` |  _ < | |  \___ \
 | |\  | |_) || |_ ____) |
 |_| \_|____/_____|_____/  Annotation Service

 Abintio training dataset workflow
 ===================================

 General Parameters
     maker_evidence_gff            : evidence.gff
     genome                        : genome_clean.fa
     outdir                        : results
     species_label                 : ailuroedus_buccoides

 Model selection by AED
     model_selection_value         : 0.3

 Filter by locus distance
     locus_distance                : 3000
 Protein Sequence extraction parameters
     codon_table                   : 1

 Augustus training parameters
     test_size                     : 100
     flank_region_size             : 1000

 
Jun-22 09:26:09.092 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: gff2gbk, snap_training, split_maker_evidence, blast_makeblastdb, model_selection_by_AED, gbk2augustus, blast_recursive, gff_filter_by_blast, remove_incomplete_gene_models, filter_by_locus_distance, retain_longest_isoform, augustus_training, convert_gff2zff, extract_protein_sequence
Jun-22 09:26:09.272 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:split_maker_evidence
Jun-22 09:26:09.279 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.280 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.291 [main] DEBUG nextflow.executor.Executor - [warm up] executor > slurm
Jun-22 09:26:09.302 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'slurm' > capacity: 100; pollInterval: 5s; dumpInterval: 5m 
Jun-22 09:26:09.307 [main] DEBUG n.executor.AbstractGridExecutor - Creating executor 'slurm' > queue-stat-interval: 1m
Jun-22 09:26:09.368 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:split_maker_evidence -- maxForks: 20; blocking: true
Jun-22 09:26:09.398 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:model_selection_by_AED
Jun-22 09:26:09.400 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.400 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.437 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:model_selection_by_AED -- maxForks: 20; blocking: true
Jun-22 09:26:09.444 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:retain_longest_isoform
Jun-22 09:26:09.447 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.447 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.449 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:retain_longest_isoform -- maxForks: 20; blocking: true
Jun-22 09:26:09.470 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:remove_incomplete_gene_models
Jun-22 09:26:09.471 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.471 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.476 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:remove_incomplete_gene_models -- maxForks: 20; blocking: true
Jun-22 09:26:09.480 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:filter_by_locus_distance
Jun-22 09:26:09.481 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.481 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.484 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:filter_by_locus_distance -- maxForks: 20; blocking: true
Jun-22 09:26:09.489 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:extract_protein_sequence
Jun-22 09:26:09.490 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.490 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.492 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:extract_protein_sequence -- maxForks: 20; blocking: true
Jun-22 09:26:09.496 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:Blast` matches label `Blast` for process with name abinitio_training:blast_makeblastdb
Jun-22 09:26:09.497 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.497 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.500 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:blast_makeblastdb -- maxForks: 20; blocking: true
Jun-22 09:26:09.507 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:Blast` matches label `Blast` for process with name abinitio_training:blast_recursive
Jun-22 09:26:09.507 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withName:blast_recursive` matches process abinitio_training:blast_recursive
Jun-22 09:26:09.508 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.508 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.511 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:blast_recursive -- maxForks: 20; blocking: true
Jun-22 09:26:09.518 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:gff_filter_by_blast
Jun-22 09:26:09.519 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.520 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.521 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:gff_filter_by_blast -- maxForks: 20; blocking: true
Jun-22 09:26:09.528 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:Augustus` matches label `Augustus` for process with name abinitio_training:gff2gbk
Jun-22 09:26:09.529 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.529 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.530 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:gff2gbk -- maxForks: 20; blocking: true
Jun-22 09:26:09.536 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:Augustus` matches label `Augustus` for process with name abinitio_training:gbk2augustus
Jun-22 09:26:09.537 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.537 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.539 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:gbk2augustus -- maxForks: 20; blocking: true
Jun-22 09:26:09.562 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:Augustus` matches label `Augustus` for process with name abinitio_training:augustus_training
Jun-22 09:26:09.564 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.564 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.566 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:augustus_training -- maxForks: 20; blocking: true
Jun-22 09:26:09.574 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withLabel:AGAT` matches label `AGAT` for process with name abinitio_training:convert_gff2zff
Jun-22 09:26:09.575 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.575 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.577 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:convert_gff2zff -- maxForks: 20; blocking: true
Jun-22 09:26:09.584 [main] DEBUG nextflow.script.ProcessConfig - Config settings `withName:snap_training` matches process abinitio_training:snap_training
Jun-22 09:26:09.585 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: slurm
Jun-22 09:26:09.585 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'slurm'
Jun-22 09:26:09.587 [main] DEBUG nextflow.processor.TaskProcessor - Creating operator > abinitio_training:snap_training -- maxForks: 20; blocking: true
Jun-22 09:26:09.590 [main] DEBUG nextflow.Session - Session aborted -- Cause: Cannot emit a multi-channel output: augustus
Jun-22 09:26:09.625 [main] DEBUG nextflow.Session - The following nodes are still active:
  [operator] ifEmpty
  [operator] ifEmpty
  [operator] collect
  [operator] collect
  [operator] collect
  [operator] collect
  [operator] collect
  [operator] collect

Jun-22 09:26:09.630 [Actor Thread 3] DEBUG nextflow.Nextflow - Ignore exit because execution is already aborted -- message=Cannot find genome matching genome_clean.fa!

Jun-22 09:26:09.630 [Actor Thread 2] DEBUG nextflow.Nextflow - Ignore exit because execution is already aborted -- message=Cannot find gff file matching evidence.gff!

Jun-22 09:26:09.634 [main] ERROR nextflow.cli.Launcher - Cannot emit a multi-channel output: augustus
java.lang.IllegalArgumentException: Cannot emit a multi-channel output: augustus
	at nextflow.script.WorkflowDef.collectOutputs(WorkflowDef.groovy:149)
	at nextflow.script.WorkflowDef.run0(WorkflowDef.groovy:206)
	at nextflow.script.WorkflowDef.run(WorkflowDef.groovy:191)
	at nextflow.script.BindableDef.invoke_a(BindableDef.groovy:51)
	at nextflow.script.ComponentDef.invoke_o(ComponentDef.groovy:40)
	at nextflow.script.WorkflowBinding.invokeMethod(WorkflowBinding.groovy:87)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeOnDelegationObjects(ClosureMetaClass.java:397)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:339)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.callCurrent(PogoMetaClassSite.java:64)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:51)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:156)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:176)
	at Script_bd931283$_runScript_closure1$_closure18.doCall(Script_bd931283:67)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:101)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
	at org.codehaus.groovy.runtime.metaclass.ClosureMetaClass.invokeMethod(ClosureMetaClass.java:263)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1041)
	at groovy.lang.Closure.call(Closure.java:405)
	at groovy.lang.Closure.call(Closure.java:399)
	at nextflow.script.WorkflowDef.run0(WorkflowDef.groovy:204)
	at nextflow.script.WorkflowDef.run(WorkflowDef.groovy:191)
	at nextflow.script.BindableDef.invoke_a(BindableDef.groovy:51)
	at nextflow.script.ChainableDef$invoke_a.call(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:115)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:127)
	at nextflow.script.BaseScript.runDsl2(BaseScript.groovy:180)
	at nextflow.script.BaseScript.run(BaseScript.groovy:189)
	at nextflow.script.ScriptParser.runScript(ScriptParser.groovy:225)
	at nextflow.script.ScriptRunner.run(ScriptRunner.groovy:218)
	at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:126)
	at nextflow.cli.CmdRun.run(CmdRun.groovy:273)
	at nextflow.cli.Launcher.run(Launcher.groovy:460)
	at nextflow.cli.Launcher.main(Launcher.groovy:642)

Config file mandatory?

Hi,
I'm very interested in your pipeline, would make my life easier!

I'm trying to run the commands exactly as you describe but there are small discrepancies:

nextflow run -profile nbis,conda AbinitioTraining.nf \
--genome 'genome_assembly.fasta' \
--maker_evidence_gff 'path/to/annotation.gff3'

(The .nf above is not recognized. It only works after taking it out.)

But then, I get this:

N E X T F L O W ~ version 20.01.0
Pulling nextflow-io/AbinitioTraining ...
WARN: Cannot read project manifest -- Cause: Remote resource not found: https://api.github.com/repos/nextflow-io/AbinitioTraining/contents/nextflow.config
Remote resource not found: https://api.github.com/repos/nextflow-io/AbinitioTraining/contents/main.nf

I don't have a config file? Must I have one? What's missing exactly?

I installed with conda. And by the way, your conda installation instructions are not so clear; I think the correct way should be:

conda create -n nextflow-env
conda activate nextflow-env
conda install -c bioconda nextflow  # (and nf-core doesn't exist in any repository)

Kind regards,
Ricardo

CI testing: Conda environment creation failed inside actions VM.

https://github.com/NBISweden/pipelines-nextflow/runs/798595494?check_suite_focus=true

Package is corrupted when trying to make the package.

According to Pontus, the strace logs show it's simultaneously trying to read and delete the file from within the same process.
https://github.com/NBISweden/pipelines-nextflow/runs/798564814?check_suite_focus=true

Unsure whether it's caused by parallel environment creation, but from the strace it may be unlikely.

“Time Limit” error when running FunctionalAnnotation.nf on vertebrate scale genome

Hi,
I would like to run the Functional Annotation pipeline (FunctionalAnnotation.nf) on a genome of about 726 Mb. I tested this pipeline on one chromosome and it worked correctly, but when trying to run it on the whole genome, I ran into the following error message:

Error executing process > 'functional_annotation:merge_functional_annotation (1)'

Caused by:
  Process `functional_annotation:merge_functional_annotation (1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  agat_sp_manage_functional_annotation.pl -f Clupea_harengus.Ch_v2.0.2.104.gff3 \
      -b blast_merged.tsv -i interproscan_merged.tsv \
      -db uniprot_reviewed_yes.fasta -id NBIS \
      -pe 5 \
      -o Clupea_harengus.Ch_v2.0.2.104_plus-functional-annotation.gff

Command exit status:
  -

Command output:
  (empty)

Command wrapper:
  nxf-scratch-dir r33:/scratch/20329500/nxf.1ILctwRc1G
  WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
  [08:55:39] 06/05/2021
  usage: /usr/local/bin/agat_sp_manage_functional_annotation.pl -f Clupea_harengus.Ch_v2.0.2.104.gff3 -b blast_merged.tsv -i interproscan_merged.tsv -db uniprot_reviewed_yes.fasta -id NBIS -pe 5 -o Clupea_harengus.Ch_v2.0.2.104_plus-fun
  ->IDs are changed using <NBIS> as prefix.
  In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a signle feature.

  ********************************************************************************
  *                              - Start parsing -                               *
  ********************************************************************************
  -------------------------- parse options and metadata --------------------------
  => Accessing the feature level json files
  slurmstepd: error: *** JOB 20329500 ON r33 CANCELLED AT 2021-06-05T14:55:50 DUE TO TIME LIMIT ***

Could you please help me figure out which config setting needs to be changed to increase the allowed run time for this process?
I appreciate your attention to this issue.
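A minimal sketch of a custom config (passed to nextflow with -c) that raises the time limit for this process; the selector name is taken from the error message above, but the value and the exact structure are assumptions to adapt to your setup:

process {
    withName: 'merge_functional_annotation' {
        time = '48h'   // raise from the default; pick a value allowed by your cluster
    }
}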

Augustus training parameter tuning

Modify or add a wrapper around AbinitioTraining to execute parallel training jobs over a range of values for the parameters params.model_selection_value and params.locus_distance. Finally, traverse the results directories and create a summary table like the one below:

locus_distance	model_selection_value	exon_sensitivity	exon_specificity	nucleotide_sensitivity	nucleotide_specificity	gene_sensitivity	gene_specificity	genes
1000	0.01	0.412	0.556	0.855	0.988	0.44	0.458	479
1000	0.02	0.474	0.622	0.826	0.966	0.33	0.34	752
2000	0.01	0.503	0.591	0.846	0.983	0.49	0.5	474
2000	0.02	0.531	0.641	0.863	0.977	0.37	0.407	745

The table is to be sorted by genes (low to high).

I am not sure about automatically selecting the best run yet. I would leave it to manual selection for now. (The general idea is to have the largest number of genes while maintaining high values for gene_sensitivity and gene_specificity. In addition, exon-level metrics shouldn't be too low either. In the attached table, row 43 would be one good choice.)
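A minimal Nextflow sketch, assuming the sweep is driven by combining the two parameter ranges into a single channel; the values are arbitrary examples and no such wrapper exists in the pipeline yet:

workflow {
    // one tuple per training run: [ locus_distance, model_selection_value ]
    Channel.of( 1000, 2000, 3000 )
        .combine( Channel.of( 0.01, 0.02, 0.05 ) )
        .view()   // each tuple would be fed to a separate AbinitioTraining run,
                  // and the resulting metrics collected into the summary table above
}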

(Attached screenshots: training_summary_as, subalaris-metrics)

TranscriptAssembly pipeline can be improved

Should be more generalized, with a choice of read aligner:

  • STAR
  • HISAT2

a choice of different assemblers (different assemblers can give better results):

  • de novo:
    • SOAPdenovo-Trans
    • Oases
    • Trinity
  • guided:
    • StringTie
    • Scallop
    • Bayesembler

and an optional filtering step at the end:

  • Mikado
  • EviGene
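A hypothetical parameter block showing how such choices could be exposed; none of these option names exist in the pipeline yet, they are only illustrative:

params {
    aligner   = 'hisat2'      // or 'star'
    assembler = 'stringtie'   // or 'scallop', 'trinity', ...
    filtering = 'mikado'      // optional final filtering step, e.g. 'evigene', or null to skip
}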

"WARN: There's no process matching config selector" warning message

I'm getting this warning message:

[-        ] process > ANNOTATION_PREPROCESSING:AS... -
[-        ] process > ANNOTATION_PREPROCESSING:BUSCO -
Pulling Singularity image https://depot.galaxyproject.org/singularity/gaas:1.2.0--pl526r35_0 [cache /active/nxf_singularity_cachedir/depot.galaxyproject.org-singularity-gaas-1.2.0--pl526r35_0.img]
WARN: There's no process matching config selector: BLAST_BLASTN -- Did you mean: BLAST_BLASTP?

It doesn't look like it affects preprocessing, but I thought I'd bring it up anyway.
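For context, Nextflow prints this warning when a loaded configuration contains a withName selector that matches no process in the executed (sub)workflow. A minimal config sketch that would trigger it, using the process name from the warning above:

process {
    withName: 'BLAST_BLASTN' {   // annotation_preprocessing runs no BLAST_BLASTN process,
        cpus = 6                 // so this selector matches nothing and Nextflow warns about it
    }
}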

Update AGAT to version 0.4.0

AGAT can be updated. For each pipeline we need to change the version in:

  • label_agat.yml
  • software_packages.config

For pipelines using the script agat_sp_split_by_level2_feature.pl, it must be replaced by agat_sp_separate_by_record_type.pl (the name changed in versions > 0.2.3).

I don't know if any pipeline uses agat_sp_gxf_to_gff3.pl, but it has been removed in newer AGAT versions (> 0.2.3). Its successor is agat_convert_sp_gxf2gxf.pl.
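A hedged sketch of what the version bump could look like; the exact layout of software_packages.config may differ, so the label name and package string below are assumptions:

process {
    withLabel: 'AGAT' {
        conda = 'bioconda::agat=0.4.0'   // previously pinned to an older AGAT release
    }
}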

CI testing: `.command.trace` file cannot be written from dockerised busco process

https://github.com/NBISweden/pipelines-nextflow/runs/803540128?check_suite_focus=true

Specifically, in the AnnotationPreprocessing pipeline, the .command.trace file cannot be written to the folder /home/runner/work/pipelines-nextflow/pipelines-nextflow/work/25/26370cbc11c092fb40b0be38c8a9dd.

The .command.trace file is written by the nxf_trace() function from the .command.run script. This implies that the directory has the correct write permissions, since the .command.run script has already been written there.

The only difference I can think of at the moment is that .command.run is written by the nextflow process runner (outside the docker environment), while .command.trace is written from within the docker environment (which I'm reasonably sure has the same user and group as the nextflow process runner).

Abinitio outputs - Augustus

Hi,

Your pipeline is working well. I just don't understand which output from the Augustus part I'm supposed to collect for MAKER, and how. I have a good SNAP .hmm, but here's what I see in the Augustus part:

guerrer@hilbert141:$ ll Augustus_training/test_species
test_species_exon_probs.pbl
test_species_igenic_probs.pbl
test_species_intron_probs.pbl
test_species_metapars.cfg
test_species_metapars.cgp.cfg
test_species_metapars.utr.cfg
test_species_parameters.cfg
test_species_weightmatrix.txt

AED filter not working?

Hello,

I'm running the AbinitioTraining pipeline and noticed that there are still genes with high AEDs (including some AEDs of 1.0!) in this file:
codingGeneFeatures.filter.longest_cds.complete.good_distance.gff

which I think is produced after AED filtering. Is this a bug? And by the way, isn't your default AED threshold of 0.3 too low?

(I guess I'm not using the most up-to-date version; it's from April last year.)

Kind regards,
Ricardo

Test data for Abinitio training insufficient.

Trying to use the test profile for the Abinitio workflow resulted in:

Error executing process > 'abinitio_training:gbk2augustus (Make Augustus training set: codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered)'

Caused by:
  Missing output file(s) `codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk.train` expected by process `abinitio_training:gbk2augustus (Make Augustus training set: codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered)`

Command executed:

  randomSplit.pl codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk 10

Command exit status:
  0

Command output:
  size 10 is greater than the number of genes in file
  codingGeneFeatures.filter.longest_cds.complete.good_distance_blast-filtered.gbk. Aborting.

Command error:
  WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container

The workflow was run with:

#! /usr/bin/env bash

NXF_SCRIPT=/proj/snic2019-8-350/pipelines-nextflow/AbinitioTraining/AbinitioTraining.nf
nextflow run -profile nbis,singularity,test $NXF_SCRIPT -process.clusterOptions '-A snic2019-8-350'

Error running TranscriptAssembly.nf with fastp

I am running the following command:
nextflow run -c params_scriptseq.config -profile nbis,singularity ~/git/NBIS/pipelines-nextflow/TranscriptAssembly/TranscriptAssembly.nf

Param file:
// Workflow parameters
params.reads = 'Gb*_{R1,R2}.fastq.gz'
params.genome = 'genome.fa'
params.single_end = false
params.outdir = 'results_scriptseq_Gb'
params.skip_trimming = false
params.trimmer = 'fastp'
params.fastp_options = ' -Q -L'
params.trimmomatic_adapter_path = '/projects/references/adapters/TruSeq3_all.fa'
params.trimmomatic_clip_options = 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36'
params.hisat2_options = ' --fr'
params.stringtie_options = ' --fr'
params.multiqc_config = "/home/lucso605/git/NBIS/pipelines-nextflow/TranscriptAssembly/config/multiqc_conf.yml"

and get the following error:

Error executing process > 'transcript_assembly:fastp (Gb_ROV6_3)'

Caused by:
Process transcript_assembly:fastp (Gb_ROV6_3) terminated with an error exit status (127)

Command executed:

fastp -Q -L -w 2 -i Gb_ROV6_3_R1.fastq.gz -I Gb_ROV6_3_R2.fastq.gz \
    -o Gb_ROV6_3_fastp-trimmed_R1.fastq.gz \
    -O Gb_ROV6_3_fastp-trimmed_R2.fastq.gz \
    --json Gb_ROV6_3_fastp.json

Command exit status:
127

Command output:
(empty)

Command error:
WARNING: Skipping mount /sw/easybuild/software/Singularity/3.8.0/var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
/bin/bash: line 0: cd: /scratch/nxf.XCNzlbfCs9: No such file or directory
/bin/bash: .command.run: No such file or directory

Work dir:
/projects/annotation/species/RNAseq/work/e2/abcd73a4279e144994b22926615f69

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out
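Exit status 127 together with "cd: /scratch/nxf.XCNzlbfCs9: No such file or directory" suggests the node-local scratch directory was not available on the execution host; whether that is the actual cause here is only a guess. A minimal config sketch for adjusting scratch usage:

process {
    scratch = false   // run tasks directly in the work directory instead of node-local scratch
    // or point it at a variable/path that exists on the compute nodes, e.g. scratch = '$TMPDIR'
}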

Workflow refactor

It would be nice to refactor the workflow to the following directory structure:

pipelines-nextflow/
├── conf
│   ├── base.config          // Contains base configuration like in nf-core
│   ├── modules.config       // Contains per process configuration a-la nf-core, so publishDir, ext.args, etc go here.
│   ├── test.config          // Test profile using a minimal test set - output can be nonsense, but must test workflow runs through
│   └── test_full.config     // Test profile using a realistic data set
├── docs
│   ├── output.md            // Output description
│   ├── README.md
│   └── usage.md             // How to
├── lib
│   ├── Template.groovy      // Library of functions to print logo etc
├── main.nf                  // Primary workflow - calls other workflows based on a parameter (e.g. like a subcommand)
├── modules                  // Process definitions
│   ├── local                // custom definitions - where most of our stuff will be until converted to nf-core format.
│   │   └── samplesheet_check.nf
│   └── nf-core              // Existing nf-core modules we can already use - install with `nf-core install <module>`
│       └── modules
│           ├── custom
│           │   └── dumpsoftwareversions
│           │       ├── main.nf
│           │       ├── meta.yml
│           │       └── templates
│           │           └── dumpsoftwareversions.py
│           ├── fastqc
│           │   ├── main.nf
│           │   └── meta.yml
│           └── multiqc
│               ├── main.nf
│               └── meta.yml
├── modules.json
├── nextflow.config         // Base configuration file containing parameter initialisation and standard profiles
├── nextflow_schema.json 
├── README.md 
├── subworkflows            // Workflows used within workflows
│   └── local
│       └── input_check.nf
└── workflows               // Current workflows.
    ├── AbinitioTraining.nf
    ├── AnnotationPreprocessing.nf
    ├── FunctionalAnnotation.nf
    └── TranscriptAssembly.nf

The directory structure follows nf-core template structure, so less effort to port once we use their code.

I’ve added a Gitpod environment if you want to use that to develop. It’s a web based development environment with Nextflow, git, docker, conda, mamba, nf-core, pytest-workflow, and other things installed. There’s around 16 cores, 62GB mem, and ~280GB storage, and the environment is ephemeral ( so make sure you push your changes to your fork/branch ). You can install a Gitpod browser button which adds a button to open Gitpod from the github repo.

The first stage is to refactor the code as follows (see the modules.config sketch after this list):

  • Follow directory structure
  • Move process definitions to modules/local/<script.nf>
  • Consolidate configuration
  • Move logo to Template.groovy
  • Change from preview DSL2 syntax to current DSL2 syntax.
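As a reference for the configuration consolidation, a minimal nf-core-style sketch of what conf/modules.config could hold; the process name and options are placeholders for illustration:

process {
    withName: 'FASTQC' {
        ext.args   = '--quiet'
        publishDir = [
            path: { "${params.outdir}/fastqc" },
            mode: 'copy'
        ]
    }
}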

Here are nf-core's docs on how to create a new pipeline (https://nf-co.re/tools/#creating-a-new-pipeline); this will likely be more useful later.
However, feel free to use it to see how similar the workflow structures are.

Nextflow’s DSL2 docs are https://www.nextflow.io/docs/latest/dsl2.html

The Nextflow coding practices I wrote for the Carpentries workshop are at https://carpentries-incubator.github.io/workflows-nextflow/15-coding_practices/index.html (I see I need to fix some syntax there, so check back later).

Requested node configuration is not available in transcript_assembly

Hi,

I'm trying to run the subworkflow transcript_assembly on rackham (Uppmax). Here is the command that I run:

screen -S anno_tra_ass
module load bioinfo-tools Nextflow
export NXF_HOME=/proj/uppstore2019057/private/program_MF/nextflow_home
export NXF_LAUNCHER=$SNIC_TMP
export NXF_TEMP=$SNIC_TMP
export NXF_SINGULARITY_CACHEDIR=/proj/uppstore2019057/nobackup/pro_next/w_tra_ass
nextflow run -profile uppmax -params-file /proj/uppstore2019057/nobackup/pro_next/para_tra_ass.yml /proj/uppstore2019057/private/program_MF/pipelines-nextflow/main.nf

This is the yml file:

subworkflow: 'transcript_assembly'
reads: '/proj/uppstore2019057/nobackup/pro_next/reads_Ltri/TI*_{R1,R2}_001.fastq.gz'
genome: '/proj/uppstore2019057/private/Linum_ref/L_trigynum_pilon2.fasta'
single_end: false
outdir: '/proj/uppstore2019057/nobackup/pro_next/ris_tra_ass/'
project: 'snic2022-22-696'

This is the error that I got:

Caused by:
  Failed to submit process to grid scheduler for execution

Command executed:

  sbatch .command.run

Command exit status:
  1

Command output:
  sbatch: error: CPU count per node can not be satisfied
  sbatch: error: Batch job submission failed: Requested node configuration is not available

Work dir:
  /crex/proj/uppstore2019057/nobackup/pro_next/w_tra_ass/work/e3/dbf852ba17bf670d7ce2474e4e02b5

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

and this is the above-mentioned .command.sh:

#!/bin/bash -ue
mkdir hisat2
hisat2-build \
    -p 12 \
     \
    L_trigynum_pilon2.fasta \
    hisat2/L_trigynum_pilon2

cat <<-END_VERSIONS > versions.yml
"TRANSCRIPT_ASSEMBLY:HISAT2_BUILD":
    hisat2: 2.2.0
END_VERSIONS

I don't understand why it complains about the CPU count per node. If I understand correctly, hisat2-build requests 12 cores here, and that should not be a problem for a rackham node.
Best,

Marco
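One way to test whether the resource request is the problem (a guess rather than a confirmed diagnosis) is to override the process resources in a small custom config passed with -c; the values below are placeholders to adjust for rackham:

process {
    withName: 'HISAT2_BUILD' {
        cpus   = 8         // placeholder: a request known to fit a single rackham node
        memory = '48 GB'
    }
}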

[New pipeline] RepeatMaskMyGenome

See #17 for the general picture.

Maybe it can be merged with the DeNovoRepeatLib pipeline (see #32).

The purpose of RepeatMaskMyGenome is to repeat mask a genome based on a repeat library (made de novo or provided from an existing library, e.g. Dfam or RepBase).
Having this pipeline could make it easier to move to annotation tools other than MAKER, if needed.

This pipeline consists of 3 main steps (a sketch for the split step follows below):

  • split the genome into chunks (overlapping or not? we can look at the MAKER code to see how they do it)
  • mask the chunks
  • merge the annotations of the different chunks (if the chunks overlap, we need to find a way to resolve the merge cleanly)

Input:

  • Path to a library (a FASTA file if de novo, or a name if from Dfam or RepBase)
  • genome
  • param for the split size

Output: single gff file + stats
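For the split step, Nextflow's splitFasta operator could do the chunking directly; a minimal runnable sketch, where the chunk size is an arbitrary example and params.genome is the pipeline's existing genome parameter:

workflow {
    Channel.fromPath( params.genome )
        .splitFasta( size: 10.MB, file: true )   // step 1: write the genome out as chunk files
        .view()                                  // each chunk would then go to a masking process,
                                                 // and the per-chunk GFFs merged afterwards
}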

[New pipeline] AnnotationToENA

See #17 for the general picture.

Here is a description of the AnnotationToENA pipeline we need:

Input files: 2 => the GFF file along with the FASTA file
Tools needed: AGAT and EMBLmyGFF3 (both available via Bioconda), and webin-cli-.jar from https://github.com/enasequence/webin-cli (they provide a Docker image; we could create a Bioconda recipe)
Output file: 1 => EMBL flat file
Required parameters (all for EMBLmyGFF3):

  • LOCUS_TAG (default "XXX")
  • PROJECT (default "XXX")
  • MOLECULE (default "genomic DNA")
  • TABLE (default 1)
  • TOPOLOGIE (default linear)
  • SPECIES (latin name (e.g. "Drosophila melanogaster") or taxid, no default value)

Step 1: agat_sp_flag_short_intron.pl --gff annotation.gff -o annotation_short_intron_flagged.gff
Step 2: agat_sp_fix_features_locations_duplicated.pl --gff annotation_short_intron_flagged.gff -o annotation_short_intron_flagged_duplicated_location_fixed.gff
Step 3:

  • EMBLmyGFF3 --expose_translations # to get the json files locally
  • code: find a way to add "remove": true after the line "exon": { in the local translation_gff_feature_to_embl_feature.json file
  • example: EMBLmyGFF3 -I $LOCUS_TAG -p $PROJECT -m $MOLECULE -r $TABLE -t $TOPOLOGIE -s $SPECIES -o annotation.embl annotation.gff genome.fa

Step 4: validation using the Webin-CLI command-line submission program, which supports validation via the -validate option; see https://github.com/enasequence/webin-cli
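A minimal DSL2 sketch of how step 1 could be wrapped as a process; the process name and the params.gff input parameter are hypothetical, while the command is the one given above:

process FLAG_SHORT_INTRONS {
    input:
    path gff

    output:
    path "${gff.baseName}_short_intron_flagged.gff"

    script:
    """
    agat_sp_flag_short_intron.pl --gff ${gff} -o ${gff.baseName}_short_intron_flagged.gff
    """
}

workflow {
    FLAG_SHORT_INTRONS ( Channel.fromPath( params.gff ) )
}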

Can't select 0.0 nor 0.00 in model_selection_value in AbinitioTraining.nf

Hi!
I wanted to select only AED equal to 0 and got the following error:

Error executing process > 'abinitio_training:blast_makeblastdb (codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins type: null)'

Caused by:
  Process `abinitio_training:blast_makeblastdb (codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins type: null)` terminated with an error exit status (1)

Command executed:

  makeblastdb -in codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta -dbtype prot

Command exit status:
  1

Command output:
  
  
  Building a new DB, current time: 04/24/2020 09:39:23
  New DB name:   /scratch/nxf.EEhdsta03j/codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta
  New DB title:  codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta
  Sequence type: Protein
  Keep MBits: T
  Maximum file size: 1000000000B

Command error:
  BLAST options error: File codingGeneFeatures.filter.longest_cds.complete.good_distance_proteins.fasta is empty

Work dir:
  /projects/annotation/agrotis_infusa/abinitio/work/f6/2ec5d6f0ce7146a67310e30a9cde28

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

My AEDs are all 0.00, and it works if I choose 0.01 as the parameter.

Integrating workflows into a single workflow

It would be helpful if all stages of genome annotation could be run in a single workflow.

@Juke34 provided this diagram of how the single workflow might look.
(Attached image: single workflow concept)

With the introduction of Nextflow modules, each stage can be built as independent workflows and
then imported into a single workflow.

e.g. (note: modules must be locally available).

include { foo } from './some/module'

workflow {
    data = Channel.fromPath('/some/data/*.txt')
    foo(data)
}

Part of the suggested workflow proposes to include a part from another workflow:
https://github.com/ikmb-denbi/genome-annotation

A collaboration has been suggested, such that we just need to include the proposed part(s) into the single workflow. This requires the ikmb-denbi workflow to be converted to DSL2 and modularized. Together, we need to ensure interoperability between modules. The NBIS annotation workflows are almost ready to be used as modules.

Tasks:

  • Organelle detection pipeline see #5 (not important yet)
  • Preprocessing pipeline (NBIS)
  • De novo repeat library pipeline see #32 and #33 (high priority)
  • Transcript assembly pipeline (NBIS)
  • Evidence alignment pipeline (?) (low priority) see #35
  • PASA pipeline (IKMB) see #31 (low priority)
  • Maker pipeline (NBIS) see #34 (low priority)
  • Abinitio training pipeline (NBIS)
  • Functional annotation pipeline (NBIS)
  • EMBLmyGFF3 pipeline (NBIS) see #30 (high priority)

Should GeneMark training be included in the Abinitio training pipeline?

Which pipeline should the agat_sp_complement_annotations process go into?

Trinity will be added to the transcript assembly pipeline.
