schlosslab / hannigan_conjunctisviribus_ploscompbio_2018

Using graph database techniques to understand the infectious relationships between bacteria and phages.

License: MIT License

Perl 0.93% R 4.42% Shell 1.58% TeX 0.14% Makefile 0.56% HTML 8.63% Python 0.11% PostScript 83.63%
reproducible-paper phage microbiome phage-bacteria-networks bacteria

hannigan_conjunctisviribus_ploscompbio_2018's Introduction

Biogeography and Environmental Conditions Shape Phage and Bacteria Interaction Networks Across the Healthy Human Microbiome

Geoffrey D Hannigan, Melissa B Duhaime, Danai Koutra, and Patrick D Schloss

Abstract

Viruses and bacteria are critical components of the human microbiome and play important roles in health and disease. Most previous work has relied on studying microbes and viruses independently, thereby reducing them to two separate communities. Such approaches are unable to capture how these microbial communities interact, such as through processes that maintain community stability or allow phage-host populations to co-evolve. We developed and implemented a network-based analytical approach to describe phage-bacteria network diversity throughout the human body. We accomplished this by building a machine learning algorithm to predict which phages could infect which bacteria in a given microbiome. This algorithm was applied to paired viral and bacterial metagenomic sequence sets from three previously published human cohorts. We organized the predicted interactions into networks that allowed us to evaluate phage-bacteria connectedness across the human body. We found that gut and skin network structures were person-specific and not conserved among cohabitating family members. High-fat diets and obesity were associated with less connected networks. Network structure differed between skin sites, with those exposed to the external environment being less connected and more prone to instability. This study quantified and contrasted the diversity of virome-microbiome networks across the human body and illustrated how environmental factors may influence phage-bacteria interactive dynamics. This work provides a baseline for future studies to better understand system perturbations, such as disease states, through ecological networks.

Importance

The human microbiome, the collection of microbial communities that colonize the human body, is a crucial component to health and disease. Two major components to the human microbiome are the bacterial and viral communities. These communities have primarily been studied separately using metrics of community composition and diversity. These approaches have failed to capture the complex dynamics of interacting bacteria and phage communities, which frequently share genetic information and work together to maintain stable ecosystems. Removal of bacteria or phage can disrupt or even collapse those ecosystems. Relationship-based network approaches allow us to capture this interaction information. Using this network-based approach with three independent human cohorts, we were able to present an initial understanding of how phage-bacteria networks differ throughout the human body, so as to provide a baseline for future studies of how and why microbiome networks differ in disease states.

This looks cool. Take me to the manuscript!

hannigan_conjunctisviribus_ploscompbio_2018's People

Contributors

danai112358, ecogenomix, microbiology, pschloss

hannigan_conjunctisviribus_ploscompbio_2018's Issues

Contig binning by k-mer distances

Incorporate my k-mer distance script into the manuscript as a way of binning (or collapsing) contig nodes by similarity. This will allow me to justify assembling contigs by sample, which will avoid the oversampling that can harm contig assembly when using the entire dataset.
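
A minimal sketch of what such binning could look like, not the k-mer distance script referred to above. It assumes tetranucleotide counts, Bray-Curtis dissimilarity, and single-linkage grouping under a cutoff; the function names, `k=4`, and `threshold=0.25` are all illustrative choices.

```python
from collections import Counter
from itertools import combinations


def kmer_profile(seq, k=4):
    """Count overlapping k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two k-mer count profiles."""
    total = sum(p.values()) + sum(q.values())
    if total == 0:
        return 0.0
    shared = sum(min(p[kmer], q[kmer]) for kmer in p.keys() & q.keys())
    return 1.0 - 2.0 * shared / total


def bin_contigs(contigs, k=4, threshold=0.25):
    """Single-linkage binning: contigs closer than `threshold` share a bin."""
    profiles = {name: kmer_profile(seq, k) for name, seq in contigs.items()}
    parent = {name: name for name in contigs}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(contigs, 2):
        if bray_curtis(profiles[a], profiles[b]) < threshold:
            parent[find(a)] = find(b)

    bins = {}
    for name in contigs:
        bins.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in bins.values())
```

The pairwise comparison is quadratic in the number of contigs, so a real dataset would likely need a faster distance estimate or a pre-filter.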

Condense network Perl script

Network Perl script is becoming a set of repetitive blocks that can be condensed into a subroutine and run repeatedly.

Finish All Contig Assemblies

It looks like there is a problem with running the contig assembler across all of the samples. I say this because I don't have contigs for all of the samples in the results directory.

My strategy for now is going to be moving on and establishing the rest of the workflow, and then going back to fix the bug.

Fix Uniprot on FLUX

Uniprot ID is still not working on FLUX even though it works with Pfam IDs. I think the reference database is too large, and I can fix this by removing the members of Uniprot that I do not need, such as human gene IDs.

Script for OPF Diversity

I decided that I am going to want to do some correlations with the functional metric of OPF diversity. I essentially already have the scripts together but I need to rework them a bit for this specific application.

Increased Speed for Network Creation

Okay home stretch here for the first version of the integrated study network. Unfortunately it is building very slowly because there is so much information to be added.

Thankfully I really only need the average blast scores instead of all of the individual blast scores for each gene-based approach so I can do that up front and greatly reduce the information to be added to the network.

This should be easy to implement as an R script.
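
The note above plans an R script; purely as an illustration of the same pre-aggregation step, here is a Python sketch. It assumes the hits arrive as `(phage, bacterium, score)` rows, which is a guess at the table layout rather than the repository's actual format.

```python
from collections import defaultdict


def mean_scores(hits):
    """Collapse (phage, bacterium, score) rows to one mean score per pair."""
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [running sum, row count]
    for phage, bacterium, score in hits:
        entry = totals[(phage, bacterium)]
        entry[0] += score
        entry[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}
```

Loading one averaged row per pair, instead of every individual hit, is what reduces the volume of data written to the graph.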

FASTX Quality Trimming

I think all of the studies in my dataset will have the same quality score offsets for the ASCII characters but I'm not totally sure about that so I need to confirm.
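
One way to check this per study is to look at the range of ASCII codes in the quality lines. The sketch below is a heuristic only: it relies on the fact that Phred+33 encodings start at character code 33 while the +64 encodings cannot go below code 59, and it returns `None` when the observed range does not settle the question. The function name is hypothetical.

```python
def guess_phred_offset(quality_lines):
    """Guess the FASTQ quality offset from the range of ASCII codes seen."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        return 33  # codes below 59 cannot occur with an offset of 64
    if lowest >= 64:
        return 64  # consistent with +64, though high-quality +33 reads look the same
    return None  # ambiguous range; inspect more reads
```

A result of 64 is only suggestive, so it is worth running this over many reads per study before trusting it.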

`CreateProteinNetwork` Script Broken

The CreateProteinNetwork script is broken and I'm not entirely sure what is wrong yet. Probably a variable call issue.

Output from Makefile:

bash ./bin/CreateProteinNetwork \
                ./data/ValidationSet/Interactions.tsv \
                ./data/BenchmarkingSet/BenchmarkCrisprsFormat.tsv \
                ./data/BenchmarkingSet/BenchmarkProphagesFormatFlip.tsv \
                ./data/BenchmarkingSet/PfamInteractionsFormatScoredFlip.tsv \
                ./data/BenchmarkingSet/MatchesByBlastxFormatOrder.tsv \
                "TRUE"
Loaded perl-modules version 5.22.1 (default)
WARNING: Max 32768 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Starting Neo4j Server...WARNING: not changing user
process [54808]... waiting for server to be ready......... OK.
http://localhost:7474/ is ready.
Running neo4j script...
Not FoundStopping Neo4j Server [54808].... done
ERROR: Neo4j Server not running

Add k-mer screen to interaction scoring

I have the k-mer infrastructure done so I can get the k-mer distances between the bacteria and phages and see how good they are for prediction. Although the model is pretty solid right now so this is a maybe.

Condense Relationships

Right now I have a new relationship for each method of predicting interactions, but this should be built with just a single relationship with properties specifying the methods that support the prediction.
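
As a rough illustration of the target data shape, independent of Neo4j, the per-method edges can be folded into one record per pair whose properties name the supporting methods. The tuple layout and the method names in the test data are assumptions.

```python
def condense_relationships(edges):
    """Merge per-method edges into one edge per (phage, bacterium) pair.

    `edges` holds (phage, bacterium, method, score) tuples; each method
    becomes a property on the single merged relationship.
    """
    merged = {}
    for phage, bacterium, method, score in edges:
        merged.setdefault((phage, bacterium), {})[method] = score
    return merged
```

If the same method appears twice for a pair, the later score overwrites the earlier one, so any averaging should happen before this step.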

Error in Perl Parsing of Literature

The Perl parsing script does not deal with multiple hosts that are documented for a single phage. It discards all hosts except one. This needs to be fixed to consider all potential hosts.
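
The symptom described is what happens when a phage-keyed lookup stores a single value, so each new host overwrites the last. The actual script is in Perl and is not shown here, so the following Python sketch only illustrates the intended behavior; the `(phage, host)` record layout is an assumption.

```python
from collections import defaultdict


def hosts_by_phage(records):
    """Map each phage to every host recorded for it, in first-seen order."""
    hosts = defaultdict(list)
    for phage, host in records:
        if host not in hosts[phage]:  # skip duplicate mentions
            hosts[phage].append(host)
    return dict(hosts)
```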

Array for Makefile Download

Instead of targeting a list file for download, I want to loop through an array with each accession number.

CRISPR nodes are being lost

CRISPR nodes are being lost if the node does not already exist from the literature parsing. Need to write a section that adds new nodes if they are not already present.
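
A sketch of the "create if missing" logic using plain Python containers in place of the graph database. The node labels `Phage` and `Bacterium` and the function name are placeholders, not the schema used in this project.

```python
def add_crispr_hits(nodes, edges, crispr_hits):
    """Add CRISPR-supported edges, creating any endpoint node that is missing.

    `nodes` maps a node name to its label; `edges` is a set of
    (phage, bacterium, method) tuples. Both are updated in place.
    Returns the names of the nodes that had to be created.
    """
    created = []
    for phage, bacterium in crispr_hits:
        for name, label in ((phage, "Phage"), (bacterium, "Bacterium")):
            if name not in nodes:
                nodes[name] = label
                created.append(name)
        edges.add((phage, bacterium, "CRISPR"))
    return created
```

In Cypher, the `MERGE` clause gives the same match-or-create behavior for nodes.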

Plot Validation ROC Curve

I need to recreate the validation ROC curve in my makefile with my slightly different code structure.

I want the resulting ROC curve from my R script. This should include the final model, as well as the predictive accuracy of the individual characteristics (CRISPR, PFAM, etc).
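
For reference, the points of an ROC curve can be computed without any plotting library by sweeping a threshold down the ranked scores. This is a generic Python sketch rather than the R script mentioned above; running it once on the final model score and once per individual feature would give the curves to overlay.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping a threshold over scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```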

Include phage-bacteria pairs without integration

The blast methods that I have utilized may favor integrated prophages, while short alignments may also predict integration interactions.

  1. The number of validation phage-bacteria pairs without lysogenic relationships should be made clear in the model description text.

  2. Incorporate Prochlorococcus MIT 9313 + Phage PSS2 pair as a standard example of known interactions without integration.

Addition of Gene Hit Values

Right now the gene hit values for multiple genes from a single contig are not being summed properly so I need to go back and fix this calculation.

Improve bowtie speed

The bowtie process could run a lot faster if I run the alignments in parallel. I think the key will be running many instances of qsub... I think...

Validation Network Naming Test

It looks like some of the nodes are getting named incorrectly in the graph database and this is resulting in a loss of interaction data. I want to incorporate some sort of simple test so that I know the information agrees between the relationship reference table and the final network.
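
Such a test can be as small as a set difference between the pairs in the reference table and the pairs read back from the network. The sketch below assumes both sides can be exported as `(phage, bacterium)` name pairs; how they are exported from Neo4j is left open.

```python
def missing_interactions(reference_pairs, network_pairs):
    """Return reference (phage, bacterium) pairs that are absent from the network."""
    return sorted(set(reference_pairs) - set(network_pairs))
```

An empty result means every reference interaction survived node naming; anything else lists exactly which pairs were lost.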

Accuracy Validation

Need to add a set of scripts that validate the predicted interactions and demonstrate how well the results can be trusted.

Implement method for confirming bacteria

The whole metagenome samples primarily contain bacteria, but they are also expected to contain a subset of viral sequences. Therefore I cannot completely assume that the relationships are between bacteria and phages because it could be a match between a phage and a phage.

I need a method for confirming that contigs are in fact bacterial and not phage.

Add proportion of hits to ROC curve figure

I need some sort of visualization of the proportion of samples with no scores. I can accomplish this with a pie chart or bar graph (I know I know but tables are boring).

Underscores in phage names

Many of the phage names are getting underscores followed by numbers which need to be removed. An example is Lactococcus_phage_phiL47_114.

The bacteria also have this problematic nomenclature that is throwing off the node creation.
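
A regular expression anchored at the end of the name handles the example above. This sketch assumes the unwanted part is always a single trailing underscore followed only by digits; a name that legitimately ends that way would be truncated too, so the rule may need tightening.

```python
import re

# One trailing "_<digits>" group at the very end of the name.
_NUMERIC_SUFFIX = re.compile(r"_\d+$")


def strip_numeric_suffix(name):
    """Remove a single trailing underscore-plus-digits suffix, if present."""
    return _NUMERIC_SUFFIX.sub("", name)
```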

Compare to ocean plankton model

Cite Joshua for the temporal material and discuss the difference from interaction/predictions: "Ocean plankton. Determinants of community structure in the global plankton interactome." This is the gold standard, so include a discussion.

Assemble unique contigs

I am cat'ing together the contigs from all of the samples but I'm not sure they were given unique names, so there might be some repeated names in the fasta file. I need to append the contig names with the ID of the sample they originated from before pasting them together.
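
A small sketch of the renaming step, written as a generator over FASTA lines so it can be applied per sample before concatenation. It puts the sample ID in front of the contig name, joined by an underscore; both the position and the separator are arbitrary choices here.

```python
def tag_fasta_headers(lines, sample_id):
    """Yield FASTA lines with the sample ID prepended to each header."""
    for line in lines:
        if line.startswith(">"):
            yield f">{sample_id}_{line[1:]}"
        else:
            yield line  # sequence lines pass through unchanged
```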

Contigs need moving/renaming

Output contig files go to their own directory. The directory has the sample name, while the contig files themselves are all named final.contigs.fa.

I need to update the script so that the contig files are all in the same directory and the sample directories and extra files are deleted.
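
The cleanup could be sketched as follows, assuming one subdirectory per sample under a common root. The function name and the `<sample>.fa` output naming are hypothetical; only `final.contigs.fa` comes from the description above.

```python
import shutil
from pathlib import Path


def collect_contigs(assembly_root, output_dir, filename="final.contigs.fa"):
    """Move each <sample>/final.contigs.fa to <output_dir>/<sample>.fa.

    The sample directory and its extra files are deleted after the move.
    Returns the names of the files that were written.
    """
    assembly_root, output_dir = Path(assembly_root), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for sample_dir in sorted(p for p in assembly_root.iterdir() if p.is_dir()):
        contigs = sample_dir / filename
        if not contigs.is_file():
            continue  # leave failed or unfinished assemblies untouched
        target = output_dir / f"{sample_dir.name}.fa"
        shutil.move(str(contigs), str(target))
        shutil.rmtree(sample_dir)
        moved.append(target.name)
    return moved
```

Directories without a contig file are kept on purpose, which makes samples whose assembly failed easy to spot afterwards.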
