linsalrob / prophagepredictioncomparisons

22.0 4.0 12.0 5.32 MB

Comparisons of multiple different prophage predictions

License: MIT License

Perl 13.25% Python 7.17% Jupyter Notebook 79.58%

prophagepredictioncomparisons's People

beardymcjohnface pdec laura-rc npbhavya mpdoane2 jbogias sarahgiles1385 evanpargin simroux phenolophthaleinum celestial-bai ruixuan-zhang

prophagepredictioncomparisons's Issues

[Suggestion] Add strand information to the gbk files

Gene strand can be very useful for detecting prophages, but it is currently missing from the .gb files. As a result, a tool that relies on strandedness cannot be benchmarked using proteins/ORFs extracted from this dataset's .gb files (with genbank2sequences.py, for example).
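
As an illustration of what strand information looks like when it is present: in the GenBank flat-file format, a reverse-strand feature is written with a complement(...) wrapper around its location. The sketch below is not taken from genbank2sequences.py; the function name parse_location is hypothetical, and it only handles simple start..end locations (no joins).

```python
import re

# Simple GenBank locations only: "190..255" or "complement(5683..6459)".
# Joined or multi-part locations are deliberately not handled in this sketch.
_LOCATION = re.compile(r"^(complement\()?<?(\d+)\.\.>?(\d+)\)?$")


def parse_location(location):
    """Return (start, end, strand) for a simple GenBank location string.

    strand is 1 for the forward strand and -1 for complement(...) locations.
    """
    match = _LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    strand = -1 if match.group(1) else 1
    return int(match.group(2)), int(match.group(3)), strand


print(parse_location("190..255"))                # (190, 255, 1)
print(parse_location("complement(5683..6459)"))  # (5683, 6459, -1)
```

If every CDS location in a file is a bare start..end, as the issue describes, a parser like this would report strand 1 for all genes, which is why the information cannot be recovered downstream.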

Missing prophage in Escherichia coli K12

Hey Rob!

The Escherichia coli K12 genome seems to be missing more than one prophage. Here's one that I curated manually:

contig       gene                            start      end
---------------------------------------------------------------
NC_000913    RAST2:fig|83333.998.peg.2428    2464528    2465722
NC_000913    RAST2:fig|83333.998.peg.2429    2466233    2467154
NC_000913    RAST2:fig|83333.998.peg.2430    2467150    2468482
NC_000913    RAST2:fig|83333.998.peg.2431    2468822    2469125
NC_000913    RAST2:fig|83333.998.peg.2432    2469096    2469537
NC_000913    RAST2:fig|83333.998.peg.2433    2469563    2470157
NC_000913    RAST2:fig|83333.998.peg.2434    2470131    2470407
NC_000913    RAST2:fig|83333.998.peg.2435    2470406    2470901
NC_000913    RAST2:fig|83333.998.peg.2436    2470897    2471266
NC_000913    RAST2:fig|83333.998.peg.2437    2471415    2471589
NC_000913    RAST2:fig|83333.998.peg.2438    2471623    2471986
NC_000913    RAST2:fig|83333.998.peg.2439    2472051    2472876
NC_000913    RAST2:fig|83333.998.peg.2440    2473003    2473540
NC_000913    RAST2:fig|83333.998.peg.2441    2473530    2473893
NC_000913    RAST2:fig|83333.998.peg.2442    2473892    2474198
NC_000913    RAST2:fig|83333.998.peg.2443    2474329    2474530

There's also a clear prophage that starts with RAST2:fig|83333.998.peg.1592, but I couldn't determine its other boundary.
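
A per-gene table like the one above can be collapsed into a single candidate region per contig. This is a minimal sketch, assuming all listed genes on a contig belong to one contiguous prophage; the function name region_bounds is hypothetical, and only the first two and last two rows of the table are repeated here.

```python
# (contig, start, end) rows; first two and last two rows of the curated table.
genes = [
    ("NC_000913", 2464528, 2465722),
    ("NC_000913", 2466233, 2467154),
    ("NC_000913", 2473892, 2474198),
    ("NC_000913", 2474329, 2474530),
]


def region_bounds(rows):
    """Return {contig: (lowest start, highest end)} over all rows."""
    bounds = {}
    for contig, start, end in rows:
        low, high = min(start, end), max(start, end)
        if contig in bounds:
            low = min(low, bounds[contig][0])
            high = max(high, bounds[contig][1])
        bounds[contig] = (low, high)
    return bounds


print(region_bounds(genes))  # {'NC_000913': (2464528, 2474530)}
```

For the curated genes this gives a region of 10,003 bp (2464528 to 2474530, inclusive).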

Add DEPhT

https://github.com/chg60/DEPhT

Accuracy of your genbank annotations

Hello, I wanted to test your compare_predictions_to_phages.py to make sure that it was working, so I used the tsv file containing the reference locations for phages in NC_002655.

I was expecting to get perfect results, since I was using the reference intervals from the Casjens 2003 paper as reported on the PHASTER website statistics page. Instead I got these results:

(base) [u1323098@notch164:scripts]$ python3 compare_predictions_to_phages.py -t /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb -r reference.tsv --fp --fn -v
Reading reference.tsv
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb again to get the phage regions
Getting from 1879335 to 1897622
Getting from 3551577 to 3565707
Getting from 2966382 to 3015014
Getting from 2668339 to 2688870
Getting from 2285976 to 2330172
Getting from 300073 to 310251
Getting from 1897625 to 1908911
Getting from 1702185 to 1725748
Getting from 310756 to 323112
Getting from 1250521 to 1295458
Getting from 1330857 to 1391923
Getting from 1678706 to 1693737
Getting from 1849488 to 1879269
Getting from 1909139 to 1930250
Getting from 892845 to 930943
Getting from 1730065 to 1756006
Getting from 1626722 to 1673485
Getting from 1655548 to 1696145
Getting from 2743223 to 2788348
Getting from 2118738 to 2165694
Getting from 3263064 to 3270404
Getting from 1521574 to 1530771
Found 789 predicted prophage features
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb
Comparing real and predicted
Found:

Test set:
Phage: 676 Not phage: 4832

Predictions:
Phage: 789 Not phage: 4709

TP: 641
FP: 158
TN: 4674
FN: 35

Accuracy: 0.965 (this is the ratio of the correctly labeled phage genes to the whole pool of genes
Precision: 0.802 (This is the ratio of correctly labeled phage genes to all predictions)
Recall: 0.948 (This is the fraction of actual phage genes we got right)
Specificity: 0.967 (This is the fraction of non phage genes we got right)
f1_score: 0.869 (this is the harmonic mean of precision and recall, and is the best measure when, as in this case, there is a big difference between the number of phage and non-phage genes)
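
The summary statistics in this output follow from the four counts by the standard definitions. The sketch below recomputes them; it is not the code from compare_predictions_to_phages.py, and the function name classification_metrics is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    No guard against zero denominators is included in this sketch.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1_score": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(tp=641, fp=158, tn=4674, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.965
# precision: 0.802
# recall: 0.948
# specificity: 0.967
# f1_score: 0.869
```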

It seems that there are some differences between the reference intervals listed in your supplementary table and the intervals listed on the PHASTER website.

Do you have a list of the sources for the annotations you are using? Thank you
LeAnn

Some contig accession numbers in the Supplementary Tables seem to be incorrect?

I tried to download all of the genomes using the "Contig accession" column in the supplementary table and these are the ones that were not recognized:

411479.31.con.0020
1351.557.con.0001
1280.10152.con.0001

Searching https://www.ncbi.nlm.nih.gov/ also produces no results.

Do you know where these accessions came from?

Thank you
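
Before attempting a bulk download, identifiers like the three above can be screened with a rough format check. The patterns below are a simplified assumption about common NCBI nucleotide accession shapes and do not cover every accession class; the function name looks_like_ncbi_accession is hypothetical. The rejected identifiers share the number.number prefix shape of the fig|83333.998.peg gene identifiers quoted elsewhere on this page, which may mean they are annotation-pipeline contig names rather than NCBI accessions, but that is only a guess.

```python
import re

# Simplified, assumed shapes of common NCBI nucleotide accessions.
NCBI_LIKE = re.compile(
    r"^(?:"
    r"[A-Z]{2}_[A-Z0-9]+"   # RefSeq style, e.g. NC_000913
    r"|[A-Z]{1,2}\d{5,8}"   # GenBank style, e.g. U00096
    r"|[A-Z]{4,6}\d{8,}"    # WGS contig style
    r")(?:\.\d+)?$"          # optional version suffix
)


def looks_like_ncbi_accession(identifier):
    """Return True if the identifier matches one of the rough patterns."""
    return NCBI_LIKE.match(identifier) is not None


for identifier in ["NC_002655", "411479.31.con.0020", "1351.557.con.0001"]:
    print(identifier, looks_like_ncbi_accession(identifier))
# NC_002655 True
# 411479.31.con.0020 False
# 1351.557.con.0001 False
```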

Duplicated protein_id in Paracoccus sanguinis

Hey Rob,

The Paracoccus sanguinis 5503 genome has several duplicated protein IDs, which can cause trouble in downstream analysis. Here's the list:

WP_036707377.1
WP_036703025.1
WP_036705789.1
WP_036712647.1
WP_036705789.1
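
A check of this kind can be sketched with a counter over the protein_id values. The input list below is illustrative only: it reuses IDs from the report, but it is not the actual content of the genome file, and the function name duplicated_ids is hypothetical.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the sorted IDs that occur more than once in the input."""
    counts = Counter(ids)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Illustrative input, not the real file contents.
protein_ids = [
    "WP_036707377.1",
    "WP_036703025.1",
    "WP_036707377.1",
    "WP_036705789.1",
]
print(duplicated_ids(protein_ids))  # ['WP_036707377.1']
```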

Request to add tool and other thoughts

Hi Rob,

I was really excited to see that someone is doing a comparison of different prophage tools. If you are still open to adding more tools for comparison, I hope you try my tool Cenote-Taker 2. If you do use it, please use the settings in the readme for bacterial genomes: -p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2.

I'm sure it will be on the slower side compared to some other methods, but part of that is because Cenote-Taker 2 generates genome maps for each virus prediction. If you instead use Unlimited Breadsticks, which might be more of a direct comparison time-wise, it should go a bit faster (still not gonna be lightning fast).

I have some other thoughts: A really important part of prophage prediction is defining the boundary between the prophage and the cellular chromosome. Do you have any plans to analyze how close each tool gets to the manually curated boundary? If so, I would also recommend comparing other approaches to CheckV, which does a pretty good job (if a bit conservative).
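
One simple way to express the boundary comparison suggested here is a signed offset at each end. This is a sketch of one possible measure, not a method used by the repository; the function name boundary_offsets is hypothetical, the curated interval is the span of the genes in the E. coli K12 issue on this page, and the predicted interval is made up for illustration.

```python
def boundary_offsets(curated, predicted):
    """Return (start_offset, end_offset) in bp.

    Positive values mean the prediction extends beyond the curated boundary;
    negative values mean it falls short of it.
    """
    curated_start, curated_end = curated
    predicted_start, predicted_end = predicted
    return curated_start - predicted_start, predicted_end - curated_end


curated = (2464528, 2474530)    # span of the manually curated genes
predicted = (2464000, 2475000)  # hypothetical tool prediction
print(boundary_offsets(curated, predicted))  # (528, 470)
```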

Further, I have some suggestions about expanding the prophage dataset that you may or may not like.

There are some datasets in SRA that consist of reads from the DNA of induced prophages which also have corresponding bacterial reference genomes. I think these are good examples of prophages that don't rely on manual curation. I have several examples in my notes and I can send those along if you'd like.
I haven't tried this extensively, but I wonder about using a pangenome approach to carefully mine prophages. The idea is you would compare the genome content of several bacterial reference genomes for a species, then extract the "regions of plasticity" using panRGP from PPanGGOLiN. You could then predict which sequences represent prophages and get the coordinates from the original bacterial reference genome.

I'd be happy to discuss further.

Best regards,

Mike
