Edwards Lab License: MIT DOI

Prophage Prediction Comparisons

Open source comparisons of multiple different prophage predictions

What is it?

There are multiple different ways of identifying prophages in bacterial genomes, and this is an open source way of comparing them. Please feel free to clone this repo, add your tool or code, and then make a pull request.

What are prophages?

Prophages are viruses that are integrated into bacterial genomes. A few computational biologists are keen to identify those specific regions, because they are more interesting than the rest of the genome. For more about prophages, take a look at the home pages for some of the tools listed here.

This site is not intended to be a gentle introduction to prophages, but a FAIR (findable, accessible, interoperable, and reusable) data resource for comparing prophage prediction software.

How do I use it?

To run the tests, first clone the repository and pull the files (requires git and git lfs):

git clone https://github.com/linsalrob/ProphagePredictionComparisons.git
cd ProphagePredictionComparisons
git submodule update --init
git lfs install
git lfs pull

Then run the pipelines (requires snakemake and conda):

snakemake -s snakefiles/virsorter.smk --use-conda # --profile slurm or -j 16 etc...

If you develop prophage prediction software, clone the repository and implement your tool using a snakemake pipeline. There are several examples in the snakefiles directory. We have also defined conda environments for each of the tools (see the note below).

Once your tool is working, use it to predict the prophages in the genbank folder, and use the scripts to calculate true positive, true negative, false positive, false negative and related statistics.

The jupyter notebooks can be used to plot your data and make images like those below.

If you go to all that work, please make a pull request and we will update this site with your code.

What software is currently included?

We have:

We could not install:

  • LysoPhD - We could not find this tool available anywhere online.
  • ProphET - This requires legacy BLAST and EMBOSS packages and we could not get it to install and run.

If you know of other tools that should be included please let us know or make a PR.

How does it work?

We manually curated the prophages in the bacterial genomes in the genbank files. For each prophage, we mark both the prophage region and each prophage gene, which is tagged as a phage gene with an is_phage tag. We run the prediction software on those genbank files, and then compare the predictions with our manual curations.
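As a rough illustration of that comparison step, here is a minimal sketch that counts agreements and disagreements at the gene level. It is not the repository's own script: the function name `confusion_counts` and the gene identifiers are hypothetical, and it assumes that both the curation and the prediction have already been reduced to sets of gene identifiers.

```python
def confusion_counts(all_genes, curated_phage, predicted_phage):
    """Compare predicted phage genes with manually curated phage genes.

    All three arguments are iterables of gene identifiers (hypothetical
    names here). Identifiers that are not in all_genes are ignored.
    """
    all_genes = set(all_genes)
    curated = set(curated_phage) & all_genes
    predicted = set(predicted_phage) & all_genes
    return {
        "TP": len(curated & predicted),              # phage, called phage
        "FP": len(predicted - curated),              # not phage, called phage
        "FN": len(curated - predicted),              # phage, missed
        "TN": len(all_genes - curated - predicted),  # not phage, not called
    }


genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
counts = confusion_counts(genes, curated_phage={"g2", "g3"}, predicted_phage={"g3", "g4"})
print(counts)  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 3}
```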

We need more manually curated genomes! Please contribute by adding more manually curated genomes to our data set.

How can I contribute genomes?

Our dataset of manually curated genomes is a start, and we welcome submissions from anyone. To add a new genome:

  1. Please generate a GenBank format file with the complete bacterial genome
  2. For the CDS entries that are phage genes, please add the qualifier /is_phage="1" to the entry (the exact value doesn't matter: we check that the is_phage flag is present and that its value is not zero)
  3. Make a clone of this repository and add your genome(s)
  4. Make a pull request to add your genome(s) from your clone to the master branch

We welcome annotated microbial genomes from all sources, but we ask that you please manually curate the presence of phage, because it is that gold-standard manual curation that allows us to accurately compare tools.

What are the results?

Since we have a notion of truth, we calculate and plot:

  • true positives (TP)
  • true negatives (TN)
  • false positives (FP)
  • false negatives (FN)
  • accuracy: the ratio of correctly labeled genes (phage and non-phage) to the whole pool of genes
  • precision: the ratio of correctly labeled phage genes to all genes predicted to be phage genes
  • recall: the fraction of actual phage genes we got right
  • specificity: the fraction of non-phage genes we got right
  • F1 score: the harmonic mean of precision and recall; it is the best measure when, as in this case, there is a big difference between the number of phage and non-phage genes
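The statistics in this list follow the standard definitions, which can be written out as a short sketch. The function name `summary_statistics` and the example counts are made up for illustration, and the scripts in this repository may handle edge cases such as empty denominators differently.

```python
def summary_statistics(tp, tn, fp, fn):
    """Compute standard classification statistics from the four counts."""

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        # Harmonic mean of precision and recall, written in terms of counts.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts, for illustration only.
stats = summary_statistics(tp=80, tn=4800, fp=20, fn=100)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```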

Note that plots similar to these can be generated by the jupyter notebooks we provide, but please repeat them and let us know if we made an error!

We plotted the accuracy, precision, recall, and F1 score of the different callers. In this plot, all subplots share the same axes.

Accuracy, Precision, Recall, and f1 score of all the prophage callers

As noted above, however, most of these metrics are probably not the most robust, since we have a lot of non-phage genes (i.e. everything in the genome that is not a prophage) and relatively few phage genes. So we rely more on the F1 score.
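A small worked example shows why. The numbers are invented: assume a genome with 5,000 genes, of which 100 are phage genes, and a caller that never predicts a prophage at all.

```python
# Invented example: 5,000 genes, 100 of them phage genes.
# The caller labels every gene as "not phage".
tp, fp = 0, 0      # nothing is ever called phage
fn, tn = 100, 4900  # all phage genes missed, all other genes correct

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.98
print(f"F1 score = {f1:.2f}")        # F1 score = 0.00
```

The accuracy looks excellent even though the caller found nothing, while the F1 score reflects the failure.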

What about speed?

Speed is of the essence, and this is where the prophage callers really begin to differ. This plot shows the time (in seconds) to complete the predictions and the amount of memory consumed. We also plot disk write operations, as these can severely impact performance under high parallelization, and the total file output size, which is another consideration for large-scale analyses.
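For readers who want comparable numbers for their own tool, the sketch below shows one way to time a command and read its resource usage with the Python standard library. It is only an illustration and not how the figures in this repository were produced. The function name `benchmark` is hypothetical, the `resource` module is Unix-only, and the unit of `ru_maxrss` depends on the platform (kilobytes on Linux, bytes on macOS).

```python
import resource
import subprocess
import sys
import time


def benchmark(cmd):
    """Run cmd once and report wall time and child resource usage.

    Note: RUSAGE_CHILDREN covers all child processes waited for so far,
    so call this from a fresh process for each command being measured.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    seconds = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "seconds": seconds,
        "max_rss": usage.ru_maxrss,          # peak resident memory
        "block_writes": usage.ru_oublock,    # block output operations
    }


# Stand-in command; replace it with the caller you want to measure.
result = benchmark([sys.executable, "-c", "pass"])
print(result)
```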

Runtime performance for all callers

What do the results mean?

Not much! You should always take benchmarks with a grain of salt, because whoever made them (see below) usually has a vested interest in their outcome.

You should note, however, that phage_finder, the OG of prophage identification, is still one of the most robust methods.

Who did this?

This site was put together by Rob Edwards to compare prophage predictions. Help him out with curated genomes!

Citation

The preprint for this work is available on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.06.03.446868v2
