
coreugate's Introduction

COREugate - A pipeline for cgMLST

From contigs to cgMLST profile and SLC.

COREugate has had a small facelift! Under the hood we now use Nextflow as the pipeline engine, and we have added functionality for clustering the profiles. The pipeline runs the following steps:

  1. Prepare the schema (PrepSchema, if necessary) and call alleles using chewBBACA.
  2. Combine profiles and statistics for the whole dataset.
  3. Calculate pairwise allelic distances (missing data is ignored).
  4. Perform SLC to group related profiles, based on user-supplied thresholds.
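Steps 3 and 4 can be illustrated with a minimal sketch. It is not COREugate's own code: it assumes that SLC means single-linkage clustering, that a missing call is represented by `None`, and that two profiles are linked when their distance is less than or equal to the threshold. The function names and toy profiles are hypothetical.

```python
from itertools import combinations

MISSING = None  # assumed marker for a locus with no allele call


def allelic_distance(a, b):
    """Count loci where both profiles have a call and the calls differ."""
    return sum(
        1 for x, y in zip(a, b)
        if x is not MISSING and y is not MISSING and x != y
    )


def single_linkage(profiles, threshold):
    """Group isolates so that any pair within `threshold` ends up in one cluster."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def parse_thresholds(text):
    """Parse a comma-separated list (no spaces) such as '20,40,200'."""
    return [int(t) for t in text.split(",")]


# Toy profiles: one allele number per locus, None where no call was made.
profiles = {
    "iso1": [1, 2, 3, 4],
    "iso2": [1, 2, 3, None],
    "iso3": [1, 5, 6, 4],
    "iso4": [9, 9, 9, 9],
}
for t in parse_thresholds("0,2,4"):
    print(t, single_linkage(profiles, t))
```

Because the missing locus in `iso2` is skipped, `iso1` and `iso2` have a distance of 0 and cluster together even at a threshold of 0.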

Dependencies

Python >=3.7
Biopython >=1.70
Nextflow >=20.10
chewBBACA >=2.6
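Before running the pipeline it can help to confirm that the command-line tools are on your PATH. The sketch below is one way to do that. The executable names are assumptions (`nextflow` and `chewBBACA.py` are the names these tools commonly install under), and it does not check version numbers.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ("coreugate", "nextflow", "chewBBACA.py")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


absent = missing_tools()
print("missing:", ", ".join(absent) if absent else "none")
```

A missing `chewBBACA.py` is not necessarily a problem if you rely on a containerised chewBBACA instead of a local install.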

Nextflow

Ensure that you have Nextflow installed. Detailed instructions can be found in the Nextflow documentation.

chewBBACA

chewBBACA is used here to prepare the schema, by selecting exemplar alleles for comparison, and to call allele profiles. More information about chewBBACA and how it works can be found in the chewBBACA documentation. COREugate can use a Singularity version of chewBBACA, however if you want to install the latest version (>=2.0.16)

Run COREugate

Get COREugate

pip3 install git+https://github.com/kristyhoran/Coreugate

If you are installing COREugate on a server using --user, please ensure that your ~/.local/bin is part of your PATH:

export PATH=$PATH:/path/to/.local/bin

Running COREugate

coreugate [-h] [-v] [--input_file INPUT_FILE]
                 [--schema_path SCHEMA_PATH]
                 [--prodigal_training PRODIGAL_TRAINING] [--workdir WORKDIR]
                 [--threads THREADS]
                 [--filter_samples_threshold FILTER_SAMPLES_THRESHOLD]
                 [--cluster] [--cluster_thresholds CLUSTER_THRESHOLDS]
                 [--force] [--report]

Coreugate - a cgMLST pipeline implementing chewBACCA

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        Input file tab-delimited file3 columns isolate_id
                        path_to_input_file (contigs) (default: )
  --schema_path SCHEMA_PATH, -s SCHEMA_PATH
                        Path to species schema/allele db (or url if using
                        chewie Nomenclature server) (default: )
  --prodigal_training PRODIGAL_TRAINING, -p PRODIGAL_TRAINING
                        Prodigal file to be used in allele calling. See https:
                        //github.com/B-UMMI/chewBBACA/tree/master/CHEWBBACA/pr
                        odigal_training_files for options (default: )
  --workdir WORKDIR, -w WORKDIR
                        Working directory, default is current directory
                        (default: /home/khhor/validation/salmonella_typing/rev
                        erification_20210322)
  --threads THREADS, -t THREADS
                        Number of threads to run chewBACCA (default: 16)
  --filter_samples_threshold FILTER_SAMPLES_THRESHOLD, -ft FILTER_SAMPLES_THRESHOLD
                        The proportion of loci present in a sample for an
                        sample to be included in further analysis (0-1)
                        (default: 0.95)
  --cluster, -c         If you would like to cluster the pairwise distance
                        matrix. If selected you must provide a list of
                        thresholds. (default: False)
  --cluster_thresholds CLUSTER_THRESHOLDS, -ct CLUSTER_THRESHOLDS
                        Provide a comma separate list (NO SPACES) eg 20,40,200
                        (default: )
  --force, -f           If you want to force chewBBACA to re-run. (default:
                        False)
  --report              Save nextflow reports. (default: False)
                                 Display this help message
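If you launch COREugate from a script, the documented flags can be assembled programmatically. The sketch below only builds the argument list; `build_command` is a hypothetical helper and the file names are placeholders. It uses only flags that appear in the help text above.

```python
import shlex


def build_command(input_file, schema_path, prodigal_training=None,
                  workdir=None, threads=None, cluster_thresholds=None):
    """Assemble a coreugate argument list from the documented flags."""
    cmd = ["coreugate", "--input_file", input_file, "--schema_path", schema_path]
    if prodigal_training:
        cmd += ["--prodigal_training", prodigal_training]
    if workdir:
        cmd += ["--workdir", workdir]
    if threads:
        cmd += ["--threads", str(threads)]
    if cluster_thresholds:
        # --cluster needs --cluster_thresholds as a comma-separated list, no spaces.
        cmd += ["--cluster", "--cluster_thresholds",
                ",".join(str(t) for t in cluster_thresholds)]
    return cmd


cmd = build_command("isolates.tab", "/path/to/schema", threads=8,
                    cluster_thresholds=[20, 40, 200])
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.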

Sample data

Assemblies

isolate_name	path/to/assembly.fa	
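The input table can be generated from a directory of assemblies. The sketch below assumes the two-column layout shown in the example row (isolate name, then path), takes the isolate name from the file name, and treats `.fa`, `.fasta` and `.fna` as assembly suffixes. `write_input_table` is a hypothetical helper, not part of COREugate.

```python
import csv
from pathlib import Path


def write_input_table(assembly_dir, output_path, suffixes=(".fa", ".fasta", ".fna")):
    """Write one `isolate_name<TAB>path` row per assembly found in `assembly_dir`."""
    rows = [
        (path.stem, str(path.resolve()))
        for path in sorted(Path(assembly_dir).iterdir())
        if path.is_file() and path.suffix in suffixes
    ]
    with open(output_path, "w", newline="") as handle:
        # Tab-delimited output, one isolate per line.
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows
```

For example, `write_input_table("assemblies", "isolates.tab")` writes a file that can then be passed to `--input_file`.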

Species cgMLST schema

COREugate requires an existing cgMLST schema; this can be a schema generated by the user or downloaded from one of the publicly available databases. The schema should consist of one FASTA file per locus, with each file containing the different alleles for that locus. Note that during allele calling, chewBBACA (implemented by COREugate) will add inferred alleles (more information) to your schema, so it is recommended that the schema path be fixed; that is, keep the schema in a central location and use a single version for each species/study.
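Since chewBBACA can add inferred alleles to the schema, it may be useful to record how many alleles each locus holds before and after a run. The sketch below counts FASTA records per locus file. The `.fasta` suffix and the `count_alleles` helper are assumptions, so adjust them to match your schema.

```python
from pathlib import Path


def count_alleles(schema_dir, suffix=".fasta"):
    """Map each locus (file name without suffix) to its number of FASTA records."""
    counts = {}
    for fasta in sorted(Path(schema_dir).glob(f"*{suffix}")):
        with open(fasta) as handle:
            # Each allele is one FASTA record, so count the header lines.
            counts[fasta.stem] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

Comparing the output of `count_alleles("/path/to/schema")` between runs shows which loci gained alleles.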

Other optional arguments
  • prodigal_training: a Prodigal training file for allele calling. Recommended by the chewBBACA developers; a list of default training files and further information can be found at https://github.com/B-UMMI/chewBBACA/tree/master/CHEWBBACA/prodigal_training_files.

Limitations of the pipeline

  • COREugate can only derive profiles for isolates from pre-existing schemas that have been prepared as described above.
  • Possibly more, I just haven't found them yet!!

coreugate's People

Contributors

kristyhoran

coreugate's Issues

missing dependencies csvtk, rapidnj, cgmlst-dists, and scikit-learn.

After a bit of work I have a conda env that works with coreugate. There were a few undocumented dependencies so I thought I'd share my build for others who want to try it.
The main issues were a change in chewbbaca where they removed the --fr argument in v3.0.0, and missing csvtk, rapidnj, cgmlst-dists, and scikit-learn. I'm building the conda env inside a singularity container but this should work for standard conda/mamba as well.

conda create --yes --quiet --name coreugate_2.0.5 -c conda-forge -c anaconda -c bioconda  'nextflow>=20.10' bcftools=1.15.1 singularity 'python>=3.7' 'biopython>=1.70' 'chewbbaca>=2.8.4, <3.0.0' csvtk rapidnj cgmlst-dists scikit-learn

conda activate coreugate_2.0.5

#  use commit 89f613e9984c4d59400ec91d9065a786ddd256dd as there's a println bug in v2.0.5
python3 -mpip install git+https://github.com/kristyhoran/Coreugate.git@89f613e9984c4d59400ec91d9065a786ddd256dd

stray println statement in v2.0.5

It appears there's a debugging println statement that has made its way into release v2.0.5, which causes errors in Nextflow 23.04.1:
ERROR ~ No such variable: println
It looks like the culprit is line 97 in coreugate/utils/main.nf.

input file

Hi Kristy,
In the old version (https://github.com/MDU-PHL/Coreugate), the input file could be reads or assemblies. However, according to my understanding, in this new version, the input file could only be assemblies. Is this right? Can we input reads files?
