sanger-tol / treeval
Pipelines for the production of Treeval data
Home Page: https://pipelines.tol.sanger.ac.uk/treeval
License: MIT License
Advice from nf-core is to update the version of the nf-core template used by the pipeline as often as we can. This should be a priority fix.
GENERATE_GENOME will use cut and sort to generate the final my.genome file.
PULL_DOTAS will use cp to pull a .as file from assets.
CAT_BLAST will use cat to concatenate multiple BLAST outputs into a single file.
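As a sketch of the cut/sort step that GENERATE_GENOME performs: in the pipeline the .fai index would come from samtools faidx on the input assembly, but here the index is fabricated so the step can be shown on its own (file names are placeholders).

```shell
# Sketch: build my.genome (sequence name + length) from a faidx-style index.
cd "$(mktemp -d)"

# A .fai is tab-separated: name, length, offset, linebases, linewidth.
# Fabricated here; the real one comes from `samtools faidx input.fasta`.
printf 'SUPER_2\t1500\t10\t60\t61\nSUPER_1\t2000\t1600\t60\t61\n' > input.fasta.fai

# Keep the first two columns and sort by name to produce the genome file.
cut -f1,2 input.fasta.fai | sort -k1,1 > my.genome

cat my.genome
```

The resulting two-column file is the chromosome-sizes format that downstream tools such as bedToBigBed expect.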
The outdir had been set to save all output whilst in development. Now that the selfcomp module is generating tens of thousands of files, we have realised it is prudent to restrict the outdir so that only required files and some pipeline information files are saved. We will also add automated clean-up, although at a later date.
This is a summary issue, please create PRs for the individual issues referenced here.
1. The structure of the modules is incorrect. You may need to delete the entire contents below modules/nf-core and repopulate using nf-core modules install.
2. miniprot modules need to first be submitted to nf-core using released bioconda packages and containers, then installed under nf-core using nf-core modules install.
3. There is already a combined minimap2 + samtools module in nf-core. Please use that instead of minimap_samtools.
4. The samtools in merge_bam module will not work without a container in the production environment. There is already a module for samtools merge; please use that instead.
5. There is already a combined bamToBed + bedtools sort module in nf-core. Please use that instead of bedtools_bed_sort. You also have a redundant module under sanger-tol; only keep one copy under nf-core.
6. blast/tblastn needs to be removed from sanger-tol and installed under nf-core. If not using it, remove it entirely.
7. For makecmap, please move the scripts, with the credits and licence information intact, to the bin folder. Then use a perl conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please note these in the script header, so credit is given where due. Do the same for selfcomp/splitfasta.
8. For selfcomp/mapids and selfcomp/mummer2bed, do the same as N°7 but with a python package and container.
9. For these local modules, please add a bioconda package and container: get_synteny_genomes, generate_genome_file, csv_generator, concatmummer (only conda needs adding), concat_gff, chunkfasta (a pyfasta conda package exists), cat_blast and bb_generator.
10. For filter_blast, add the script to your bin folder (which seems to be missing) and use the appropriate conda package and containers as needed. Please do not package the script in a container.
11. For the different cat local modules, it might be better to use a generic one and configure it for different purposes as needed – see the nf-core cat module. Examples: concatmummer, concat_gff and cat_blast.
12. Create local modules with nf-core modules create from within the pipeline directory. The idea is to keep the formatting and structure of the local modules as close as possible to the nf-core ones.

@muffato is happy to help with the tasks above so please contact him if you need help.
@priyanka-surana is happy to help manage this release, we can have regular catch ups to keep track of the work.
Samtools in merge_bam module will not work without a container in production env. There is already a module for samtools merge, please use that instead.
From Matthieu:
motivated by making the pipeline usable in our production environment (currently not possible)
The original SelfComp (#5) was not fit for purpose and did not replicate gEVAL closely enough; gEVAL used the Ensembl database to generate the SelfComp blocks. That API cannot be easily decoded due to its age and complexity, so @yumisims is reverse engineering a standalone solution to replace the SelfComp sub-workflow.
Workflow for gene alignment, this requires:
Branch from the documentation branch for adding documentation.
Include:
For local module filter_blast
, add the script in your bin folder (which seems to be missing) and use the appropriate conda package and containers as needed. Please do not package the script in a container.
From Matthieu:
motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)
Some variable names are less than ideal for their purpose (do not describe their usage). These should be changed.
For these local modules, please add a bioconda package and container. You can use a basic one: https://github.com/nf-core/modules/blob/master/modules/nf-core/md5sum/main.nf
From Matthieu:
motivated by making the pipeline usable outside of our production environment (currently at risk)
This will be a Python 3 script that takes a concatenated blast output file and parses it into the format required by bedToBigBed.
This script will need to be dockerised and will possibly be amalgamated with multiple other scripts before the final release.
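The final filter_blast will be a Python 3 script as described above; purely as a sketch of the parsing itself, the transformation can be shown with awk, assuming tabular BLAST output (-outfmt 6). The input file and column choices here are illustrative assumptions, not the script's actual specification.

```shell
# Sketch of the parsing step only (the real filter_blast will be Python 3).
# Assumed input: tabular BLAST (-outfmt 6): qseqid sseqid pident length
# mismatch gapopen qstart qend sstart send evalue bitscore.
cd "$(mktemp -d)"
printf 'gene1\tSUPER_1\t98.5\t200\t3\t0\t1\t200\t501\t700\t1e-50\t380\n' > concat.blast.tsv

# Emit subject sequence, 0-based start, end, query name; handle minus-strand
# hits where sstart > send, then sort as bedToBigBed requires.
awk -F'\t' -v OFS='\t' '{
  s = ($9 < $10) ? $9 : $10;   # smaller subject coordinate
  e = ($9 < $10) ? $10 : $9;   # larger subject coordinate
  print $2, s - 1, e, $1
}' concat.blast.tsv | sort -k1,1 -k2,2n > filtered.bed

cat filtered.bed
```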
Review complete details once individual workflows added.
The blast/tblastn needs to be removed from sanger-tol and installed under nf-core. If not using, remove entirely.
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
Branch from the documentation branch for adding documentation.
Include:
Branch from the documentation branch for adding documentation.
Include:
For the modules under makecmap, please move the scripts with the credits and licence information intact to the bin folder. Then, use a perl conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due. Do the same for selfcomp/splitfasta.
From Matthieu:
motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)
For the different cat local modules, might be better to use a generic one and configure it for different purposes as needed – see nf-core cat module. Example: concatmummer, concat_gff and cat_blast
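As a minimal sketch of why one generic module suffices: all three tasks reduce to the same concatenation, parameterised only by inputs and output name (the file names below are invented for illustration).

```shell
# Sketch: concatmummer / concat_gff / cat_blast all reduce to the same
# generic concatenation step, as in nf-core's cat module.
cd "$(mktemp -d)"
printf 'hit1\n' > chunk1.tblastn.tsv
printf 'hit2\n' > chunk2.tblastn.tsv

# One generic step, configured per use with an output name and input list.
concat() { out=$1; shift; cat "$@" > "$out"; }

concat merged.tblastn.tsv chunk1.tblastn.tsv chunk2.tblastn.tsv
cat merged.tblastn.tsv
```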
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
Add files such as:
Currently, this module will not run unless using the local script and python installations.
@yumisims is currently dockerising and submitting as a module for pipeline integration.
(as per #51 (comment) )
Currently, the integration tests only pass on the farm, because some config paths refer to /nfs/team135
(at least the one I've found, perhaps some are on /lustre
too). It would be useful to update the S3 test profile to include all data on the S3 server. This way, reviewers could rely on GitHub to test pull-requests and wouldn't need to run the pipeline themselves.
This would also tell us / confirm what the pipeline needs from the Sanger infrastructure, and allow us to plan the next steps for making the pipeline usable by external collaborators, for when we feel ready to support that.
Branch from the documentation branch for adding documentation.
Include:
For the modules selfcomp/mapids and selfcomp/mummer2bed please move the scripts with the credits and licence information intact to the bin folder. Then, use a python conda package and containers to create local modules using nf-core modules create
from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due.
From Matthieu:
motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)
Add makeblastdb to the gene alignment subworkflow.
This will take the input fasta and generate a db, allowing the alignment data to be blasted against it.
This is a module that makes use of the bedToBigBed software to generate a BigBed file for jBrowse display.
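As a sketch of the inputs this module has to prepare: bedToBigBed is a UCSC binary (assumed to be supplied by the module's container, so its invocation is only shown as a comment here), and it requires a coordinate-sorted BED plus a chromosome-sizes file. All file names below are placeholders.

```shell
# Sketch: preparing inputs for bedToBigBed.
cd "$(mktemp -d)"
printf 'SUPER_1\t700\t900\tfeatB\nSUPER_1\t100\t300\tfeatA\n' > features.bed
printf 'SUPER_1\t2000\n' > chrom.sizes

# bedToBigBed requires the BED sorted by chromosome, then start position.
sort -k1,1 -k2,2n features.bed > features.sorted.bed

# bedToBigBed features.sorted.bed chrom.sizes features.bb   # run inside the module's container
cat features.sorted.bed
```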
Using the --params-file flag for our input yaml does not conform to standards. I will produce a new subworkflow so we can use the --input flag, with all params parsed from the input.
This will require refactoring of subworkflows to take the new format of inputs as they are now channels not values.
Run nf-core schema docs -x markdown to generate a prettified version of this schema. Save it as parameters.md in the docs folder.
Based on documentation guidance here
The structure of the modules is incorrect. Take a look at genomenote for the structure expected. You may need to delete the entire contents below modules/nf-core and repopulate using nf-core modules install.
From Matthieu:
motivated by allowing to exchange modules and code with others, incl. from ToLA (currently not possible – nf-core pushed that breaking change, not us)
motivated by reducing the barrier to entry for other people (incl. from ToLA ) to contribute and debug, by following the same structure
@DLBPointon mentioned there might be some outdated modules, can you please remove these? Remove any modules or subworkflows no longer necessary. Remove commented out pieces of code, if not part of testing.
When you create a PR for this, please reference this issue, and set dev
as the base branch.
This will require some cdna, cds, rna and pep data as well as an input fasta.
The addition of SYNTENY sub-workflow which uses YAML params:
This sub-flow uses nf-core module:
As the GENERATE_GENOME subworkflow (SW) is required by multiple other SWs, it needs to be merged into main for colleagues to use.
The input csv files need to be updated for the new directory structure. The commands to generate these files also need to be corrected to:
for file in /lustre/scratch123/tol/resources/treeval/gene_alignment_data/{clade}/{item}/{item}.{accession}/*/*.fa ; do var=$(echo $file | cut -f10 -d/); var2=$(echo $file | cut -f11 -d/);echo $var,$var2,$file >> /lustre/scratch123/tol/resources/treeval/gene_alignment_data/{clade}/csv_data/{item}.{accession}-data.csv; done
This also needs updating to work on the whole library of gene_alignment data rather than one at a time.
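A sketch of how the loop above might be extended to cover the whole library: it walks every clade and assembly rather than one at a time. It is run here against a throwaway directory tree instead of /lustre, and the example clade, assembly and file names are invented; the real path-field positions may differ.

```shell
# Sketch: generate per-assembly csvs for every clade in the library.
root="$(mktemp -d)"
mkdir -p "$root/gene_alignment_data/dipteria/ApisMel/ApisMel.GCA_1/cdna" \
         "$root/gene_alignment_data/dipteria/csv_data"
printf '>seq\nACGT\n' > "$root/gene_alignment_data/dipteria/ApisMel/ApisMel.GCA_1/cdna/ApisMel1.fa"

for clade_dir in "$root"/gene_alignment_data/*/; do
  for file in "$clade_dir"*/*/*/*.fa; do
    [ -e "$file" ] || continue                                 # skip unmatched globs
    assembly="$(basename "$(dirname "$(dirname "$file")")")"   # e.g. ApisMel.GCA_1
    datatype="$(basename "$(dirname "$file")")"                # e.g. cdna
    echo "$assembly,$datatype,$file" >> "$clade_dir/csv_data/$assembly-data.csv"
  done
done

cat "$root"/gene_alignment_data/dipteria/csv_data/ApisMel.GCA_1-data.csv
```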
Samtools will be used in generating the my.genome file, containing chromosome sizes.
Branch from the documentation branch for adding documentation.
Include:
Add a workflow for MYGENOME generation.
As this file is required by multiple sub-workflows, it should be packaged into its own.
Containing SAMTOOLS FAIDX and the BASH found in #3 .
A process which copies the csv file into the nextflow directory and then allows for the data to be parsed in the main.nf.
This was decided upon with help from the Seqera team, as there is no direct way of building path objects from strings.
Review contents of usage.md file.
Include the BLASTN NF-CORE module, to be used in blasting set query data against the input genome.
There is already a combined bamToBed + bedtools sort module in nf-core. Please use that instead of bedtools_bed_sort. You also have a redundant module under sanger-tol. Only keep one copy under nf-core.
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
I would also recommend to move away from bedtools sort if you can. It's less efficient than a regular sort, as the bedtools authors say themselves (see the disclaimer at the bottom of https://bedtools.readthedocs.io/en/latest/content/tools/sort.html) and it's indeed caused us some problems in the read-mapping pipeline.
More details on the issue with bedtools sort: sanger-tol/genomenote#51
If you do separate the sort, please make sure it is a different module. You can borrow this gnu_sort module.
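As a sketch of the drop-in replacement: the coordinate sort that bedtools sort performs is just chromosome lexicographically, then start numerically, which plain GNU sort does directly (pinning LC_ALL=C keeps the ordering locale-stable). The input file here is invented.

```shell
# Sketch: bedtools-sort-style coordinate sorting with plain GNU sort.
cd "$(mktemp -d)"
printf 'chr2\t50\t100\nchr1\t900\t950\nchr1\t5\t60\n' > in.bed

# Sort by chromosome (lexicographic), then start position (numeric).
LC_ALL=C sort -k1,1 -k2,2n in.bed > sorted.bed
cat sorted.bed
```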
Branch from the documentation branch for adding documentation.
Include:
There is already a combined minimap2 + samtools module in nf-core. Please use that instead of minimap_samtools
From Matthieu:
motivated by reducing the amount of code you / we need to maintain
Testing has revealed that since the directory change, the CSV pull module has begun mixing its input lists: the first for organism names and the second for file locations.
In the function there seems to be a cross-over where organism 1 and organism 2's path locations are swapped, causing a conflict with the expected file output, e.g., path/organism 1.
We are unsure why or how this is occurring, since the function is passed the correct values, but we are investigating and will fix it ASAP, before #52 is complete.
Branch from the documentation branch for adding documentation.
Include:
The input functions must be changed to instead take the gEVAL-yaml or a new treeval-yaml.
We have been using a file naming scheme that has resulted in multiple files overwriting each other upon completion of the pipeline.
This will be corrected in the next commit.