linsalrob / prophagepredictioncomparisons
Comparisons of multiple different prophage predictions
License: MIT License
Gene strand can be very useful for detecting prophages, but it is currently missing from the .gb files. Because of that, there's no way to benchmark a tool that leverages strandedness using proteins/ORFs extracted from this dataset's .gb files (using genbank2sequences.py, for example).
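One quick way to see what strand information a GenBank file carries is to look at the CDS locations in its feature table, where a minus-strand feature is normally written as `complement(...)`. The sketch below is a minimal stdlib-only check, not the repository's `genbank2sequences.py`; the function names and the two-gene example record are made up for illustration, and it only reads the first line of each location.

```python
import re

def strand_of(location: str) -> str:
    """Return '-' if the location uses complement(...), else '+'."""
    return "-" if "complement(" in location else "+"

def cds_strands(genbank_text: str) -> list[tuple[str, str]]:
    """Collect (location, strand) for each CDS line of a feature table."""
    found = []
    for line in genbank_text.splitlines():
        # Feature keys are indented by five spaces in the flat-file layout.
        match = re.match(r"^ {5}CDS\s+(\S+)", line)
        if match:
            location = match.group(1)
            found.append((location, strand_of(location)))
    return found

# Hypothetical two-gene excerpt, not taken from the dataset.
example = """\
FEATURES             Location/Qualifiers
     CDS             190..255
                     /locus_tag="gene_A"
     CDS             complement(5683..6459)
                     /locus_tag="gene_B"
"""

for location, strand in cds_strands(example):
    print(strand, location)
```

If every CDS in a file comes back as `+`, that suggests the strand was dropped when the file was written.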
Hey Rob!
The Escherichia coli K12 genome seems to be missing more than one prophage. Here's one that I curated manually:
contig gene start end
---------------------------------------------------------------
NC_000913 RAST2:fig|83333.998.peg.2428 2464528 2465722
NC_000913 RAST2:fig|83333.998.peg.2429 2466233 2467154
NC_000913 RAST2:fig|83333.998.peg.2430 2467150 2468482
NC_000913 RAST2:fig|83333.998.peg.2431 2468822 2469125
NC_000913 RAST2:fig|83333.998.peg.2432 2469096 2469537
NC_000913 RAST2:fig|83333.998.peg.2433 2469563 2470157
NC_000913 RAST2:fig|83333.998.peg.2434 2470131 2470407
NC_000913 RAST2:fig|83333.998.peg.2435 2470406 2470901
NC_000913 RAST2:fig|83333.998.peg.2436 2470897 2471266
NC_000913 RAST2:fig|83333.998.peg.2437 2471415 2471589
NC_000913 RAST2:fig|83333.998.peg.2438 2471623 2471986
NC_000913 RAST2:fig|83333.998.peg.2439 2472051 2472876
NC_000913 RAST2:fig|83333.998.peg.2440 2473003 2473540
NC_000913 RAST2:fig|83333.998.peg.2441 2473530 2473893
NC_000913 RAST2:fig|83333.998.peg.2442 2473892 2474198
NC_000913 RAST2:fig|83333.998.peg.2443 2474329 2474530
There's also a clear prophage that starts with RAST2:fig|83333.998.peg.1592, but I couldn't determine its other boundary.
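To turn a gene list like the one above into a single candidate region, the outermost coordinates per contig are enough. This is a minimal sketch that assumes whitespace-separated columns in the order contig, gene, start, end; the function name is hypothetical and only three of the rows are repeated here.

```python
def region_span(table_text: str) -> dict[str, tuple[int, int]]:
    """Return {contig: (min start, max end)} for a whitespace-separated table."""
    spans: dict[str, tuple[int, int]] = {}
    for line in table_text.splitlines():
        parts = line.split()
        # Skip the header, the ruler line, and anything without numeric coordinates.
        if len(parts) != 4 or not (parts[2].isdigit() and parts[3].isdigit()):
            continue
        contig, _gene, start, end = parts[0], parts[1], int(parts[2]), int(parts[3])
        low, high = spans.get(contig, (start, end))
        spans[contig] = (min(low, start), max(high, end))
    return spans

rows = """\
contig gene start end
NC_000913 RAST2:fig|83333.998.peg.2428 2464528 2465722
NC_000913 RAST2:fig|83333.998.peg.2432 2469096 2469537
NC_000913 RAST2:fig|83333.998.peg.2443 2474329 2474530
"""

print(region_span(rows))
```

The result is the span covered by the listed genes, not a curated prophage boundary.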
Hello, I wanted to test your compare_predictions_to_phages.py to make sure that it was working, so I used the tsv file containing the reference locations for phages in NC_002655.
I was expecting to get perfect results, since I was using the reference intervals from the Casjens 2003 paper as reported on the PHASTER website statistics page. Instead I got these results:
(base) [u1323098@notch164:scripts]$ python3 compare_predictions_to_phages.py -t /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb -r reference.tsv --fp --fn -v
Reading reference.tsv
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb again to get the phage regions
Getting from 1879335 to 1897622
Getting from 3551577 to 3565707
Getting from 2966382 to 3015014
Getting from 2668339 to 2688870
Getting from 2285976 to 2330172
Getting from 300073 to 310251
Getting from 1897625 to 1908911
Getting from 1702185 to 1725748
Getting from 310756 to 323112
Getting from 1250521 to 1295458
Getting from 1330857 to 1391923
Getting from 1678706 to 1693737
Getting from 1849488 to 1879269
Getting from 1909139 to 1930250
Getting from 892845 to 930943
Getting from 1730065 to 1756006
Getting from 1626722 to 1673485
Getting from 1655548 to 1696145
Getting from 2743223 to 2788348
Getting from 2118738 to 2165694
Getting from 3263064 to 3270404
Getting from 1521574 to 1530771
Found 789 predicted prophage features
Reading /uufs/chpc.utah.edu/common/home/u1323098/sundar-group-space2/PHAGE/BENCHMARKING/Philympics_dataset/Escherichia_coli_O157-H7_EDL933.gb
Comparing real and predicted
Found:
Test set:
Phage: 676 Not phage: 4832
Predictions:
Phage: 789 Not phage: 4709
TP: 641
FP: 158
TN: 4674
FN: 35
Accuracy: 0.965 (this is the ratio of the correctly labeled phage genes to the whole pool of genes
Precision: 0.802 (This is the ratio of correctly labeled phage genes to all predictions)
Recall: 0.948 (This is the fraction of actual phage genes we got right)
Specificity: 0.967 (This is the fraction of non phage genes we got right)
f1_score: 0.869 (this is the harmonic mean of precision and recall, and is the best measure when, as in this case, there is a big difference between the number of phage and non-phage genes)
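The five summary values can be recomputed from the TP/FP/TN/FN counts with the standard definitions. This is a sketch of those textbook formulas, not the code inside compare_predictions_to_phages.py, and the function name is an assumption.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard classification metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1_score": 2 * precision * recall / (precision + recall),
    }

# Counts copied from the log output above.
metrics = confusion_metrics(tp=641, fp=158, tn=4674, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

These reproduce the reported 0.965, 0.802, 0.948, 0.967 and 0.869. Note that TP + FP is 799, while the log reports 789 predicted prophage features; the sketch does not attempt to explain that difference.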
It seems that there are some differences between the reference intervals listed in your supplementary table and the intervals listed on the PHASTER website.
Do you have a list of the sources for the annotations you are using? Thank you
LeAnn
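A useful sanity check on a reference table before running a comparison is whether any of its intervals overlap. In the log above, 1626722 to 1673485 overlaps 1655548 to 1696145, which in turn contains 1678706 to 1693737. The sketch below assumes closed intervals given as (start, end) tuples; the function name is hypothetical and only four of the intervals are repeated. Whether the overlaps contribute to the unexpected counts is not established here.

```python
from itertools import combinations

Interval = tuple[int, int]

def overlapping_pairs(intervals: list[Interval]) -> list[tuple[Interval, Interval]]:
    """Return every pair of closed intervals that share at least one position."""
    pairs = []
    for a, b in combinations(sorted(intervals), 2):
        if a[0] <= b[1] and b[0] <= a[1]:
            pairs.append((a, b))
    return pairs

# Four of the intervals from the "Getting from ... to ..." lines above.
reference = [
    (1678706, 1693737),
    (1626722, 1673485),
    (1655548, 1696145),
    (1702185, 1725748),
]

for a, b in overlapping_pairs(reference):
    print(a, "overlaps", b)
```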
I tried to download all of the genomes using the "Contig accession" column in the supplementary table and these are the ones that were not recognized:
411479.31.con.0020
1351.557.con.0001
1280.10152.con.0001
Searching https://www.ncbi.nlm.nih.gov/ also produces no results.
Do you know where you got these?
Thank you
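The three identifiers share a `<number>.<number>.con.<number>` shape that differs from accessions such as NC_000913 and NC_002655 mentioned elsewhere on this page. A small filter can set such identifiers aside before a bulk download. The pattern is inferred only from these three examples, their origin is not stated here, and the names in the sketch are hypothetical.

```python
import re

# Shape inferred from the three identifiers above; an assumption, not a documented format.
CON_STYLE = re.compile(r"^\d+\.\d+\.con\.\d+$")

def split_accessions(accessions: list[str]) -> tuple[list[str], list[str]]:
    """Split identifiers into those matching the '.con.' shape and everything else."""
    con_style = [acc for acc in accessions if CON_STYLE.match(acc)]
    other = [acc for acc in accessions if not CON_STYLE.match(acc)]
    return con_style, other

column = [
    "NC_000913",
    "411479.31.con.0020",
    "1351.557.con.0001",
    "NC_002655",
    "1280.10152.con.0001",
]

con_style, other = split_accessions(column)
print("set aside:", con_style)
print("download: ", other)
```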
Hey Rob,
The Paracoccus sanguinis 5503 genome has a couple of duplicated protein IDs, which can cause trouble in downstream analysis. Here's the list:
WP_036707377.1
WP_036703025.1
WP_036705789.1
WP_036712647.1
WP_036705789.1
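Duplicates like these can be detected once the protein IDs of a genome have been collected into a list. A minimal sketch using `collections.Counter`; the function name is hypothetical and the input list is an illustrative arrangement of IDs from the list above, not the actual order in the genome.

```python
from collections import Counter

def duplicated_ids(protein_ids: list[str]) -> dict[str, int]:
    """Return {protein_id: count} for every ID that appears more than once."""
    counts = Counter(protein_ids)
    return {pid: n for pid, n in counts.items() if n > 1}

# Illustrative input only.
ids = [
    "WP_036707377.1",
    "WP_036703025.1",
    "WP_036707377.1",
    "WP_036712647.1",
]

print(duplicated_ids(ids))
```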
Hi Rob,
I was really excited to see that someone is doing a comparison of different prophage tools. If you are still open to adding more tools for comparison, I hope you try my tool Cenote-Taker 2. If you do use it, please use the settings in the readme for bacterial genomes: -p True -db virion --minimum_length_linear 3000 --lin_minimum_hallmark_genes 2.
I'm sure it will be on the slower side compared to some other methods, but part of that is because Cenote-Taker 2 generates genome maps for each virus prediction. If you instead use Unlimited Breadsticks, which might be more of a direct comparison time-wise, it should go a bit faster (still not gonna be lightning fast).
I have some other thoughts: A really important part of prophage prediction is defining the prophage/cellular chromosome boundary. Do you have any plans to analyze how close each tool gets to the manually curated boundary? If so, I would also recommend comparing other approaches to CheckV, which does a pretty good job (if a bit conservative).
Further, I have some suggestions about expanding the prophage dataset that you may or may not like.
There are some datasets in SRA that consist of reads from the DNA of induced prophages which also have corresponding bacterial reference genomes. I think these are good examples of prophages that don't rely on manual curation. I have several examples in my notes and I can send those along if you'd like.
I haven't tried this extensively, but I wonder about using a pangenome approach to carefully mine prophages. The idea is that you would compare the genome content of several bacterial reference genomes for a species, then extract the "regions of plasticity" using panRGP from PPanGGOLiN. You could then predict which sequences represent prophages and get the coordinates from the original bacterial reference genome.
I'd be happy to discuss further.
Best regards,
Mike
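The boundary question raised above could be measured per prophage as the signed offset of each predicted end from the curated end, plus the base-pair overlap. This is a minimal sketch assuming closed (start, end) coordinates on the same contig; the function names are hypothetical, the curated interval reuses the gene span from the E. coli K12 table earlier on this page, and the predicted interval is made up.

```python
Interval = tuple[int, int]

def boundary_offsets(curated: Interval, predicted: Interval) -> tuple[int, int]:
    """Signed distance (predicted - curated) at the left and right boundary."""
    return predicted[0] - curated[0], predicted[1] - curated[1]

def overlap_bp(curated: Interval, predicted: Interval) -> int:
    """Number of positions shared by two closed intervals (0 if disjoint)."""
    return max(0, min(curated[1], predicted[1]) - max(curated[0], predicted[0]) + 1)

curated = (2464528, 2474530)    # gene span from the K12 table above
predicted = (2464000, 2475000)  # hypothetical tool output

left, right = boundary_offsets(curated, predicted)
print(f"left offset: {left} bp, right offset: {right} bp")
print(f"overlap: {overlap_bp(curated, predicted)} bp")
```

A negative left offset or a positive right offset means the prediction extends past the curated boundary into the host chromosome.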