schlosslab / hannigan_conjunctisviribus_ploscompbio_2018



Using graph database techniques to understand the infectious relationships between bacteria and phages.

License: MIT License

Perl 0.93% R 4.42% Shell 1.58% TeX 0.14% Makefile 0.56% HTML 8.63% Python 0.11% PostScript 83.63%

reproducible-paper phage microbiome phage-bacteria-networks bacteria

hannigan_conjunctisviribus_ploscompbio_2018's Introduction

Biogeography and Environmental Conditions Shape Phage and Bacteria Interaction Networks Across the Healthy Human Microbiome

Geoffrey D Hannigan, Melissa B Duhaime, Danai Koutra, and Patrick D Schloss

Abstract

Viruses and bacteria are critical components of the human microbiome and play important roles in health and disease. Most previous work has relied on studying microbes and viruses independently, thereby reducing them to two separate communities. Such approaches are unable to capture how these microbial communities interact, such as through processes that maintain community stability or allow phage-host populations to co-evolve. We developed and implemented a network-based analytical approach to describe phage-bacteria network diversity throughout the human body. We accomplished this by building a machine learning algorithm to predict which phages could infect which bacteria in a given microbiome. This algorithm was applied to paired viral and bacterial metagenomic sequence sets from three previously published human cohorts. We organized the predicted interactions into networks that allowed us to evaluate phage-bacteria connectedness across the human body. We found that gut and skin network structures were person-specific and not conserved among cohabitating family members. High-fat diets and obesity were associated with less connected networks. Network structure differed between skin sites, with those exposed to the external environment being less connected and more prone to instability. This study quantified and contrasted the diversity of virome-microbiome networks across the human body and illustrated how environmental factors may influence phage-bacteria interactive dynamics. This work provides a baseline for future studies to better understand system perturbations, such as disease states, through ecological networks.

Importance

The human microbiome, the collection of microbial communities that colonize the human body, is a crucial component to health and disease. Two major components to the human microbiome are the bacterial and viral communities. These communities have primarily been studied separately using metrics of community composition and diversity. These approaches have failed to capture the complex dynamics of interacting bacteria and phage communities, which frequently share genetic information and work together to maintain stable ecosystems. Removal of bacteria or phage can disrupt or even collapse those ecosystems. Relationship-based network approaches allow us to capture this interaction information. Using this network-based approach with three independent human cohorts, we were able to present an initial understanding of how phage-bacteria networks differ throughout the human body, so as to provide a baseline for future studies of how and why microbiome networks differ in disease states.

This looks cool. Take me to the manuscript!


hannigan_conjunctisviribus_ploscompbio_2018's Issues

Contig binning by k-mer distances

Incorporate my k-mer distance script into the manuscript as a way of binning (or collapsing) contig nodes by similarity. This will allow me to justify assembling contigs by sample, which will avoid the oversampling that can harm contig assembly when the entire dataset is used.
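
As a rough illustration of the binning idea, here is a minimal Python sketch. It is not the repository's k-mer distance script: the tetranucleotide profile, the Bray-Curtis dissimilarity, the single-linkage grouping, the 0.2 threshold, and all function names are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k=4):
    """Return relative k-mer frequencies for one sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two profiles (0 means identical)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den if den else 0.0


def bin_contigs(contigs, k=4, threshold=0.2):
    """Single-linkage binning: contigs closer than `threshold` share a bin.

    `contigs` maps a contig name to its sequence. The threshold is an
    arbitrary illustrative value, not one taken from the manuscript.
    """
    profiles = {name: kmer_profile(seq, k) for name, seq in contigs.items()}
    parent = {name: name for name in contigs}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(contigs, 2):
        if bray_curtis(profiles[a], profiles[b]) < threshold:
            parent[find(a)] = find(b)

    bins = {}
    for name in contigs:
        bins.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in bins.values())
```

Comparing every pair of contigs is quadratic in the number of contigs, so a real dataset would likely need a faster clustering step than this.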

Remove slashes from genome names

The slashes in genome names are causing problems for the neo4p processing so I need to remove them.
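
A minimal sketch of the cleanup, assuming it can be applied to each name before the graph is built. The function name and the example genome name are hypothetical.

```python
def sanitize_genome_name(name, replacement=""):
    """Remove forward and back slashes from a genome name.

    Pass replacement="_" to substitute instead of delete, which avoids
    two different names collapsing into the same string.
    """
    return name.replace("/", replacement).replace("\\", replacement)
```

For example, `sanitize_genome_name("Phage_X/1/2")` returns `"Phage_X12"`, while `sanitize_genome_name("Phage_X/1/2", "_")` returns `"Phage_X_1_2"`.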

QC Information for Global Human Virome Datasets

Average number of sequences
Average sequence length per study
Sequencing platforms used
Geographic & anatomical sampling locations

Stop neo4j from always running in makefile

The Makefile is set up so that it always reruns the validation graph database creation script. This needs to be stopped.

Condense network Perl script

Network Perl script is becoming a set of repetitive blocks that can be condensed into a subroutine and run repeatedly.

Finish All Contig Assemblies

It looks like there is a problem with running the contig assembler across all of the samples. I say this because I don't have contigs for all of the samples in the results directory.

My strategy for now is going to be moving on and establishing the rest of the workflow, and then going back to fix the bug.

Install PilerCR on FLUX

Fix Uniprot on FLUX

Uniprot ID is still not working on FLUX even though it works with Pfam IDs. I think the reference database is too large, and I can fix this by removing the members of Uniprot that I do not need, such as human gene IDs.

Script for OPF Diversity

I decided that I am going to want to do some correlations with the functional metric of OPF diversity. I essentially already have the scripts together but I need to rework them a bit for this specific application.

Run relationship prediction value calculations on real data

I am hitting a wall with predicting interactions using the contigs instead of the reference sequences. The problem seems to occur when I try calling ORFs from the contigs. I am in the process of trying to figure this out.

Increased Speed for Network Creation

Okay home stretch here for the first version of the integrated study network. Unfortunately it is building very slowly because there is so much information to be added.

Thankfully I really only need the average blast scores instead of all of the individual blast scores for each gene-based approach so I can do that up front and greatly reduce the information to be added to the network.

This should be easy to implement as an R script.
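
The note above plans an R script. Purely as an illustration of the pre-aggregation step, here is the same idea sketched in Python. The `(phage, bacterium, score)` row layout and the function name are assumptions, not the format of the actual blast output files.

```python
from collections import defaultdict


def average_scores(rows):
    """Collapse per-gene scores into one mean score per phage-bacterium pair.

    `rows` is an iterable of (phage, bacterium, score) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [running sum, count]
    for phage, bacterium, score in rows:
        entry = totals[(phage, bacterium)]
        entry[0] += float(score)
        entry[1] += 1
    return {pair: total / n for pair, (total, n) in totals.items()}
```

Only one value per pair then needs to be written to the network, instead of one per gene hit.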

Add Sullivan interaction data

Sullivan interaction data, Nature 2003

FASTX Quality Trimming

I think all of the studies in my dataset will have the same quality score offsets for the ASCII characters but I'm not totally sure about that so I need to confirm.
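
One way to confirm the offsets is to look at the lowest quality character in each study's FASTQ files. The sketch below is a heuristic under stated assumptions: any character below ASCII 59 can only come from a Phred+33 encoding, a minimum of 64 or higher is consistent with Phred+64 but could also be uniformly high-quality Phred+33 data, and anything in between is reported as ambiguous. The function name is hypothetical.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII quality offset from FASTQ quality strings.

    Returns 33, 64 (a likely guess, not a proof), or None when the
    input is empty or ambiguous.
    """
    lowest = min(
        (ord(c) for line in quality_lines for c in line.strip()),
        default=None,
    )
    if lowest is None:
        return None
    if lowest < 59:
        return 33  # characters this low do not occur in +64 encodings
    if lowest >= 64:
        return 64  # consistent with +64; confirm on more reads
    return None  # 59-63: could be older Solexa-scaled data
```

The function expects only the quality lines (every fourth line of a FASTQ record), and more reads give a more reliable guess.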

`CreateProteinNetwork` Script Broken

The CreateProteinNetwork script is broken and I'm not entirely sure what is wrong yet. Probably a variable call issue.

Output from Makefile:

bash ./bin/CreateProteinNetwork \
                ./data/ValidationSet/Interactions.tsv \
                ./data/BenchmarkingSet/BenchmarkCrisprsFormat.tsv \
                ./data/BenchmarkingSet/BenchmarkProphagesFormatFlip.tsv \
                ./data/BenchmarkingSet/PfamInteractionsFormatScoredFlip.tsv \
                ./data/BenchmarkingSet/MatchesByBlastxFormatOrder.tsv \
                "TRUE"
Loaded perl-modules version 5.22.1 (default)
WARNING: Max 32768 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...WARNING: not changing user
process [54808]... waiting for server to be ready......... OK.
http://localhost:7474/ is ready.
Running neo4j script...
Not FoundStopping Neo4j Server [54808].... done
ERROR: Neo4j Server not running

Add k-mer screen to interaction scoring

I have the k-mer infrastructure done so I can get the k-mer distances between the bacteria and phages and see how good they are for prediction. Although the model is pretty solid right now so this is a maybe.

Add nodes for study associated with each bacteria and phage

Add a node with an edge defined as the bacteria/phage being found in that study.

Include the name of the study, sequencing method, and environment.

Condense Relationships

Right now I have a new relationship for each method of predicting interactions, but this should be built with just a single relationship with properties specifying the methods that support the prediction.
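
A small sketch of the condensed layout, independent of the graph database itself. The `(phage, bacterium, method, score)` tuple layout, the method labels, and the function name are assumptions for illustration.

```python
def condense_relationships(edges):
    """Merge per-method edges into one record per phage-bacterium pair.

    `edges` is an iterable of (phage, bacterium, method, score) tuples.
    The result maps each pair to a dict of {method: score}, which can be
    stored as properties on a single relationship.
    """
    condensed = {}
    for phage, bacterium, method, score in edges:
        condensed.setdefault((phage, bacterium), {})[method] = score
    return condensed
```

Each pair then carries one property per supporting method, so a missing method simply means that property is absent.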

Start make file for analysis implementation

Confirm ssDNA viruses in validation dataset

The validation dataset needs to have diverse phages, including ssDNA as well as dsDNA phages.

Fix bacteria name in interactions

I noticed one of the bacteria names was mistakenly propionibacterium phage in the interaction file. This needs to be fixed.

Add map to dataset summary figure

I've done it already but I want to add it here so I can close it, have a record, and feel accomplished. 🎉

Error in Perl Parsing of Literature

The Perl parsing script does not deal with multiple hosts that are documented for a single phage. It discards all hosts except one. This needs to be fixed to consider all potential hosts.

Add test that all samples are run through QC properly.

Array for Makefile Download

Instead of targeting a list file for download, I want to loop through an array with each accession number.

CRISPR nodes are being lost

CRISPR nodes are being lost if the node does not already exist from the literature parsing. Need to write section that adds new nodes if not already present.
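
The fix amounts to a get-or-create step before each CRISPR edge is added. Below is a minimal in-memory sketch. The dictionary-based node store, the `kind` property, and the function name are assumptions, not the repository's data model. In Cypher, the `MERGE` clause provides the same match-or-create behavior.

```python
def add_crispr_edges(nodes, edges, crispr_pairs):
    """Add CRISPR edges, creating any node that does not exist yet.

    `nodes` maps a node name to its properties, `edges` is a list of
    (phage, bacterium, method) tuples, and `crispr_pairs` is an iterable
    of (phage, bacterium) names.
    """
    for phage, bacterium in crispr_pairs:
        # setdefault keeps an existing node untouched and adds a new one otherwise.
        nodes.setdefault(phage, {"kind": "phage"})
        nodes.setdefault(bacterium, {"kind": "bacterium"})
        edges.append((phage, bacterium, "CRISPR"))
    return nodes, edges
```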

Diagram for Interaction Network

Make a .dot graph viz like diagram for the interaction model itself.

Establish OPF Clustering

Finish establishing the OPF workflow so that we can start scaling it up.

Add CRISPR data to validation

I need some CRISPR information in my validation dataset, otherwise I cannot train on it.

Add heat map of predicted interactions

Both as a way to list the organisms included in the model, as well as an additional visualization tool for the interactions we used.

Finish Data Downloading & Processing

Finish the real data downloading and processing up to where they can be scored and added to the graph database.

Quantify circular contigs

Add a section to QC that quantifies the number of predicted circular contigs based on ccontigs.

Plot Validation ROC Curve

I need to recreate the validation ROC curve in my makefile with my slightly different code structure.

I want the resulting ROC curve from my R script. This should include the final model, as well as the predictive accuracy of the individual characteristics (CRISPR, PFAM, etc).
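
The actual curve comes from the R script. For reference, this is a plain-Python sketch of how ROC points and the area under the curve can be computed from known interaction labels and prediction scores. The function names are hypothetical. Calling it once with the final model scores and once per individual feature (CRISPR, PFAM, etc.) would give the curves described above.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    `labels` holds 1 for a known interaction and 0 otherwise; `scores`
    holds the matching prediction scores (higher means more likely).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```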

Include phage-bacteria pairs without integration

The blast methods that I have utilized may favor integrated prophages, while short alignments may also predict integration interactions.

The number of validation phage-bacteria pairs without lysogenic relationships should be made clear in the model description text.
Incorporate Prochlorococcus MIT 9313 + Phage PSS2 pair as a standard example of known interactions without integration.

Addition of Gene Hit Values

Right now the gene hit values for multiple genes from a single contig are not being summed properly so I need to go back and fix this calculation.

Improve bowtie speed

The bowtie process could run a lot faster if I run the alignments in parallel. I think the key will be running many instances of qsub... I think...

Validation Network Naming Test

It looks like some of the nodes are getting named incorrectly in the graph database and this is resulting in a loss of interaction data. I want to incorporate some sort of simple test so that I know the information agrees between the relationship reference table and the final network.
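
A simple version of this test could compare the set of phage-bacteria pairs in the reference table with the pairs exported from the network. The sketch below assumes both sides can be reduced to `(phage, bacterium)` name tuples; the function name and the result keys are hypothetical.

```python
def compare_pairs(reference_pairs, network_pairs):
    """Report pairs that differ between the reference table and the network."""
    reference = set(reference_pairs)
    network = set(network_pairs)
    return {
        "missing_from_network": sorted(reference - network),
        "unexpected_in_network": sorted(network - reference),
    }
```

Both lists being empty means the names agree. A renamed node shows up as one entry in each list, which makes naming problems easy to spot.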

Run validation protocol through makefile

The validation dataset and output need to be able to be run through the makefile. This means that the main shell scripts need to be working properly.

Accuracy Validation

Need to add a set of scripts that validate the predicted interactions and demonstrate how well the results can be trusted.

Implement method for confirming bacteria

The whole metagenome samples primarily contain bacteria, but they are also expected to contain a subset of viral sequences. Therefore I cannot completely assume that the relationships are between bacteria and phages because it could be a match between a phage and a phage.

I need a method for confirming that contigs are in fact bacterial and not phage.

Circular Contig Script

Make a script that will detect circular contigs (complete virus genomes).
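
One common heuristic is that a contig assembled from a circular genome repeats its first bases at its end. The sketch below checks for an exact terminal overlap. It is an assumption-laden illustration rather than the script this issue refers to: the function names and the overlap limits are arbitrary, and it ignores sequencing errors and reverse-complement overlaps.

```python
def terminal_overlap(seq, min_overlap=10, max_overlap=1000):
    """Return the length of the longest prefix that exactly matches the suffix.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    Overlaps longer than half the contig are not considered.
    """
    seq = seq.upper()
    upper = min(max_overlap, len(seq) // 2)
    for length in range(upper, min_overlap - 1, -1):
        if seq[:length] == seq[-length:]:
            return length
    return 0


def is_circular(seq, min_overlap=10):
    """Flag a contig as putatively circular if its ends overlap."""
    return terminal_overlap(seq, min_overlap) > 0
```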

Set relative abundance relationship in graph

I need to get the relative abundance information in the graph database. I need a set script to do this.

Add proportion of hits to ROC curve figure

I need some sort of visualization of the proportion of samples with no scores. I can accomplish this with a pie chart or bar graph (I know I know but tables are boring).

Underscores in phage names

Many of the phage names are getting underscores followed by numbers which need to be removed. An example is Lactococcus_phage_phiL47_114.

The bacteria also have this problematic nomenclature that is throwing off the node creation.
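
A minimal sketch of the cleanup, assuming the unwanted part is always a trailing underscore followed only by digits. The function name is hypothetical.

```python
import re


def strip_numeric_suffix(name):
    """Remove a trailing `_<digits>` suffix from an organism name."""
    return re.sub(r"_\d+$", "", name)
```

With the example above, `strip_numeric_suffix("Lactococcus_phage_phiL47_114")` returns `"Lactococcus_phage_phiL47"`. A name that legitimately ends in an underscore and a number would also be shortened, so the output is worth spot-checking.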

Compare to ocean plankton model

Cite Joshua for temporal stuff and discuss the difference from interaction/predictions: "Ocean plankton. Determinants of community structure in the global plankton interactome." This is the gold standard, so it needs a discussion.

Assemble unique contigs

I am cat'ing together the contigs from all of the samples but I'm not sure they were given unique names, so there might be some repeated names in the fasta file. I need to append the contig names with the ID of the sample they originated from before pasting them together.
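
A small sketch of the renaming step, to be run on each sample's FASTA before concatenation. It places the sample ID in front of the contig name so that any description after the name is left alone; the function name and the `S01` sample ID are hypothetical.

```python
def tag_fasta_headers(fasta_lines, sample_id):
    """Yield FASTA lines with the sample ID added to every header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield f">{sample_id}_{line[1:]}"
        else:
            yield line
```

For example, `list(tag_fasta_headers([">contig_1\n", "ACGT\n"], "S01"))` gives `[">S01_contig_1\n", "ACGT\n"]`.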