sanger-tol / treeval
Pipelines for the production of Treeval data
Home Page: https://pipelines.tol.sanger.ac.uk/treeval
License: MIT License
Advice from nf-core is to update the version of the nf-core template used by the pipeline as often as we can. This should be a priority fix.
GENERATE_GENOME will use cut and sort to generate the final my.genome file.
PULL_DOTAS will use cp to pull a .as file from assets.
CAT_BLAST will use cat to concatenate multiple BLAST outputs into a single file.
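As a sketch of the cut/sort step that GENERATE_GENOME performs: in the pipeline the .fai index would come from samtools faidx on the input assembly, but here the index is fabricated so the step can be shown on its own (file names are placeholders).

```shell
# Sketch: build my.genome (sequence name + length) from a faidx-style index.
cd "$(mktemp -d)"

# A .fai is tab-separated: name, length, offset, linebases, linewidth.
# Fabricated here; the real one comes from `samtools faidx input.fasta`.
printf 'SUPER_2\t1500\t10\t60\t61\nSUPER_1\t2000\t1600\t60\t61\n' > input.fasta.fai

# Keep the first two columns and sort by name to produce the genome file.
cut -f1,2 input.fasta.fai | sort -k1,1 > my.genome

cat my.genome
```

The resulting two-column file is the chromosome-sizes format that downstream tools such as bedToBigBed expect.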
The outdir had been set to save all output whilst in development. Now that the selfcomp module is generating tens of thousands of files, we have realised it is prudent to restrict the outdir so that only required files and some pipeline information files are saved. We will also add automated clean-up, although at a later date.
This is a summary issue, please create PRs for the individual issues referenced here.
1. The structure of the modules is incorrect. You may need to delete the entire contents below modules/nf-core and repopulate using nf-core modules install.
2. miniprot modules need to first be submitted to nf-core using released bioconda packages and containers, then installed under nf-core using nf-core modules install.
3. There is already a combined minimap2 + samtools module in nf-core. Please use that instead of minimap_samtools.
4. The samtools in merge_bam module will not work without a container in the production environment. There is already a module for samtools merge; please use that instead.
5. There is already a combined bamToBed + bedtools sort module in nf-core. Please use that instead of bedtools_bed_sort. You also have a redundant module under sanger-tol; only keep one copy under nf-core.
6. blast/tblastn needs to be removed from sanger-tol and installed under nf-core. If not using it, remove it entirely.
7. For makecmap, please move the scripts, with the credits and licence information intact, to the bin folder. Then use a perl conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please note these in the script header, so credit is given where due. Do the same for selfcomp/splitfasta.
8. For selfcomp/mapids and selfcomp/mummer2bed, do the same as N°7 but with a python package and container.
9. For these local modules, please add a bioconda package and container: get_synteny_genomes, generate_genome_file, csv_generator, concatmummer (only conda needs adding), concat_gff, chunkfasta (a pyfasta conda package exists), cat_blast and bb_generator.
10. For filter_blast, add the script to your bin folder (which seems to be missing) and use the appropriate conda package and containers as needed. Please do not package the script in a container.
11. For the different cat local modules, it might be better to use a generic one and configure it for different purposes as needed – see the nf-core cat module. Examples: concatmummer, concat_gff and cat_blast.
12. Create local modules with nf-core modules create from within the pipeline directory. The idea is to keep the formatting and structure of the local modules as close as possible to the nf-core ones.

@muffato is happy to help with the tasks above so please contact him if you need help.
@priyanka-surana is happy to help manage this release, we can have regular catch ups to keep track of the work.
Samtools in merge_bam module will not work without a container in production env. There is already a module for samtools merge, please use that instead.
From Matthieu:
motivated by making the pipeline usable in our production environment (currently not possible)
The original SelfComp (#5) was not fit for purpose and did not replicate gEVAL closely enough; gEVAL used the Ensembl database to generate the SelfComp blocks. That API cannot be easily decoded due to its age and complexity, so @yumisims is reverse engineering a standalone solution to replace the SelfComp sub-workflow.
Workflow for gene alignment, this requires:
Branch from the documentation branch for adding documentation.
Include:
For local module filter_blast
, add the script in your bin folder (which seems to be missing) and use the appropriate conda package and containers as needed. Please do not package the script in a container.
From Matthieu:
motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)
Some variable names are less than ideal for their purpose (do not describe their usage). These should be changed.
For these local modules, please add a bioconda package and container. You can use a basic one: https://github.com/nf-core/modules/blob/master/modules/nf-core/md5sum/main.nf
From Matthieu:
motivated by making the pipeline usable outside of our production environment (currently at risk)
This will be a Python 3 script that takes a concatenated blast output file and parses it into the format required by bedToBigBed.
This script will need to be dockerised and will possibly be amalgamated with multiple other scripts before the final release.
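The final filter_blast will be a Python 3 script as described above; purely as a sketch of the parsing itself, the transformation can be shown with awk, assuming tabular BLAST output (-outfmt 6). The input file and column choices here are illustrative assumptions, not the script's actual specification.

```shell
# Sketch of the parsing step only (the real filter_blast will be Python 3).
# Assumed input: tabular BLAST (-outfmt 6): qseqid sseqid pident length
# mismatch gapopen qstart qend sstart send evalue bitscore.
cd "$(mktemp -d)"
printf 'gene1\tSUPER_1\t98.5\t200\t3\t0\t1\t200\t501\t700\t1e-50\t380\n' > concat.blast.tsv

# Emit subject sequence, 0-based start, end, query name; handle minus-strand
# hits where sstart > send, then sort as bedToBigBed requires.
awk -F'\t' -v OFS='\t' '{
  s = ($9 < $10) ? $9 : $10;   # smaller subject coordinate
  e = ($9 < $10) ? $10 : $9;   # larger subject coordinate
  print $2, s - 1, e, $1
}' concat.blast.tsv | sort -k1,1 -k2,2n > filtered.bed

cat filtered.bed
```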
Review complete details once individual workflows added.
The blast/tblastn needs to be removed from sanger-tol and installed under nf-core. If not using, remove entirely.
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
Branch from the documentation branch for adding documentation.
Include:
Branch from the documentation branch for adding documentation.
Include:
For the modules under makecmap, please move the scripts with the credits and licence information intact to the bin folder. Then, use a perl conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due. Do the same for selfcomp/splitfasta.
From Matthieu:
motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)
For the different cat local modules, might be better to use a generic one and configure it for different purposes as needed – see nf-core cat module. Example: concatmummer, concat_gff and cat_blast
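As a minimal sketch of why one generic module suffices: all three tasks reduce to the same concatenation, parameterised only by inputs and output name (the file names below are invented for illustration).

```shell
# Sketch: concatmummer / concat_gff / cat_blast all reduce to the same
# generic concatenation step, as in nf-core's cat module.
cd "$(mktemp -d)"
printf 'hit1\n' > chunk1.tblastn.tsv
printf 'hit2\n' > chunk2.tblastn.tsv

# One generic step, configured per use with an output name and input list.
concat() { out=$1; shift; cat "$@" > "$out"; }

concat merged.tblastn.tsv chunk1.tblastn.tsv chunk2.tblastn.tsv
cat merged.tblastn.tsv
```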
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
Add files such as:
Currently, this module will not run unless using the local script and python installations.
@yumisims is currently dockerising and submitting as a module for pipeline integration.
(as per #51 (comment) )
Currently, the integration tests only pass on the farm, because some config paths refer to /nfs/team135
(at least the one I've found, perhaps some are on /lustre
too). It would be useful to update the S3 test profile to include all data on the S3 server. This way, reviewers could rely on GitHub to test pull-requests and wouldn't need to run the pipeline themselves.
This would also tell us / confirm what the pipeline needs from the Sanger infrastructure, and allow us to plan the next steps for making the pipeline usable by external collaborators, for when we feel ready to support that.
Branch from the documentation branch for adding documentation.
Include:
For the modules selfcomp/mapids and selfcomp/mummer2bed please move the scripts with the credits and licence information intact to the bin folder. Then, use a python conda package and containers to create local modules using nf-core modules create
from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due.
From Matthieu:
motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)
Add makeblastdb to the gene alignment subworkflow.
This will take the input fasta and generate a db, allowing the alignment data to be blasted against it.
This is a module that makes use of the bedToBigBed software to generate a BigBed file for jBrowse display.
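As a sketch of the inputs this module has to prepare: bedToBigBed is a UCSC binary (assumed to be supplied by the module's container, so its invocation is only shown as a comment here), and it requires a coordinate-sorted BED plus a chromosome-sizes file. All file names below are placeholders.

```shell
# Sketch: preparing inputs for bedToBigBed.
cd "$(mktemp -d)"
printf 'SUPER_1\t700\t900\tfeatB\nSUPER_1\t100\t300\tfeatA\n' > features.bed
printf 'SUPER_1\t2000\n' > chrom.sizes

# bedToBigBed requires the BED sorted by chromosome, then start position.
sort -k1,1 -k2,2n features.bed > features.sorted.bed

# bedToBigBed features.sorted.bed chrom.sizes features.bb   # run inside the module's container
cat features.sorted.bed
```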
Using the --params-file flag for our input yaml does not conform to standards. I will produce a new subworkflow so we can use the --input flag, with all params parsed from the input.
This will require refactoring of subworkflows to take the new format of inputs as they are now channels not values.
Run nf-core schema docs -x markdown to generate a prettified version of this schema. Save it as parameters.md in the docs folder.
Based on documentation guidance here
The structure of the modules is incorrect. Take a look at genomenote for the structure expected. You may need to delete the entire contents below modules/nf-core and repopulate using nf-core modules install.
From Matthieu:
motivated by allowing to exchange modules and code with others, incl. from ToLA (currently not possible – nf-core pushed that breaking change, not us)
motivated by reducing the barrier to entry for other people (incl. from ToLA ) to contribute and debug, by following the same structure
@DLBPointon mentioned there might be some outdated modules, can you please remove these? Remove any modules or subworkflows no longer necessary. Remove commented out pieces of code, if not part of testing.
When you create a PR for this, please reference this issue, and set dev
as the base branch.
This will require some cdna, cds, rna and pep data as well as an input fasta.
The addition of SYNTENY sub-workflow which uses YAML params:
This sub-flow uses nf-core module:
As the GENERATE_GENOME subworkflow (SW) is required by multiple other SWs, it needs to be merged into main for colleagues to use.
The input csv files need to be updated for the new directory structure. The commands to generate these files also need to be corrected to:
for file in /lustre/scratch123/tol/resources/treeval/gene_alignment_data/{clade}/{item}/{item}.{accession}/*/*.fa ; do var=$(echo $file | cut -f10 -d/); var2=$(echo $file | cut -f11 -d/);echo $var,$var2,$file >> /lustre/scratch123/tol/resources/treeval/gene_alignment_data/{clade}/csv_data/{item}.{accession}-data.csv; done
This also needs updating to work on the whole library of gene_alignment data rather than one at a time.
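A sketch of how the loop above might be extended to cover the whole library: it walks every clade and assembly rather than one at a time. It is run here against a throwaway directory tree instead of /lustre, and the example clade, assembly and file names are invented; the real path-field positions may differ.

```shell
# Sketch: generate per-assembly csvs for every clade in the library.
root="$(mktemp -d)"
mkdir -p "$root/gene_alignment_data/dipteria/ApisMel/ApisMel.GCA_1/cdna" \
         "$root/gene_alignment_data/dipteria/csv_data"
printf '>seq\nACGT\n' > "$root/gene_alignment_data/dipteria/ApisMel/ApisMel.GCA_1/cdna/ApisMel1.fa"

for clade_dir in "$root"/gene_alignment_data/*/; do
  for file in "$clade_dir"*/*/*/*.fa; do
    [ -e "$file" ] || continue                                 # skip unmatched globs
    assembly="$(basename "$(dirname "$(dirname "$file")")")"   # e.g. ApisMel.GCA_1
    datatype="$(basename "$(dirname "$file")")"                # e.g. cdna
    echo "$assembly,$datatype,$file" >> "$clade_dir/csv_data/$assembly-data.csv"
  done
done

cat "$root"/gene_alignment_data/dipteria/csv_data/ApisMel.GCA_1-data.csv
```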
Samtools will be used in generating the my.genome file, containing chromosome sizes.
Branch from the documentation branch for adding documentation.
Include:
Add a workflow for MYGENOME generation.
As this file is required by multiple sub-workflows, it should be packaged into its own.
Containing SAMTOOLS FAIDX and the BASH found in #3 .
A process which copies the csv file into the nextflow directory and then allows for the data to be parsed in the main.nf.
This was decided upon with help from the Seqera team, as there is no direct way of building path objects from strings.
Review contents of usage.md file.
Include the BLASTN NF-CORE module, to be used in blasting set query data against the input genome.
There is already a combined bamToBed + bedtools sort module in nf-core. Please use that instead of bedtools_bed_sort. You also have a redundant module under sanger-tol. Only keep one copy under nf-core.
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
I would also recommend to move away from bedtools sort if you can. It's less efficient than a regular sort, as the bedtools authors say themselves (see the disclaimer at the bottom of https://bedtools.readthedocs.io/en/latest/content/tools/sort.html) and it's indeed caused us some problems in the read-mapping pipeline.
More details on the issue with bedtools sort: sanger-tol/genomenote#51
If you do separate the sort, please make sure it is a different module. You can borrow this gnu_sort module.
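As a sketch of the drop-in replacement: the coordinate sort that bedtools sort performs is just chromosome lexicographically, then start numerically, which plain GNU sort does directly (pinning LC_ALL=C keeps the ordering locale-stable). The input file here is invented.

```shell
# Sketch: bedtools-sort-style coordinate sorting with plain GNU sort.
cd "$(mktemp -d)"
printf 'chr2\t50\t100\nchr1\t900\t950\nchr1\t5\t60\n' > in.bed

# Sort by chromosome (lexicographic), then start position (numeric).
LC_ALL=C sort -k1,1 -k2,2n in.bed > sorted.bed
cat sorted.bed
```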
Branch from the documentation branch for adding documentation.
Include:
There is already a combined minimap2 + samtools module in nf-core. Please use that instead of minimap_samtools
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
Testing has revealed that since the directory change, the CSV pull module has begun mixing its input lists: the first for organism names and the second for file locations.
In the function there seems to be a cross-over where organism 1 and organism 2's path locations are swapped, causing a conflict with the expected file output, e.g., path/organism 1.
We are unsure why or how this is occurring, since the function is passed the correct values, but we are investigating and will fix it ASAP, before #52 is complete.
Branch from the documentation branch for adding documentation.
Include:
The input functions must be changed to instead take the gEVAL-yaml or a new treeval-yaml.
We have been using a file naming scheme that has resulted in multiple files overwriting each other upon completion of the pipeline.
This will be corrected in the next commit.