
syn-mrl's Introduction

Pipeline for whole-genome microsynteny-based phylogenetic inference

Our synteny-based phylogenetic reconstruction approach consists of four main steps: phylogenomic synteny network construction, network clustering, matrix representation, and maximum-likelihood estimation. We call the approach ‘Syn-MRL’ for short.

The synteny network is constructed in two main steps. First, all-vs-all reciprocal comparisons of the annotated proteins of all genomes are performed with DIAMOND. Second, MCScanX is used for pairwise synteny block detection. Parameter settings for MCScanX have been tested and compared before; here we adopt ‘b5s5m25’ (b: number of top homologous pairs, s: minimum number of matched syntenic anchors, m: maximum number of gene gaps), which various studies have found appropriate for the evolutionary distances among angiosperm genomes. To avoid large numbers of local collinear gene pairs due to tandem arrays, if consecutive homologs (up to five genes apart) share a common gene, the homologs are collapsed to one representative pair (the one with the smallest E-value). Further details regarding phylogenomic synteny network construction can be found in a tutorial available in the associated GitHub repository (https://github.com/zhaotao1987/SynNet-Pipeline).

Each pairwise synteny block contributes pairs of connected nodes (syntenic genes), and all identified pairwise synteny blocks together form a comprehensive synteny network with millions of nodes and edges. In this network, nodes are genes (from the synteny blocks) and edges connect syntenic genes. For our work, the entire synteny network summarizes information from 7,435,502 pairwise syntenic blocks and contains 3,098,333 nodes (genes) and 94,980,088 edges (syntenic connections).

The entire synteny network (database) is then clustered for further analysis. We used the Infomap algorithm, which detects clusters within the map equation framework (https://github.com/mapequation/infomap). We have discussed before why Infomap is more appropriate for clustering phylogenomic synteny networks. We used the two-level partitioning mode with ten trials (--clu -N 10 --map -2). The network was treated as undirected and unweighted.
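
As a rough illustration of the clustering input, the sketch below converts a gene-name edge list into an undirected, unweighted link list with integer node IDs. It assumes that Infomap's plain link-list input expects integer node IDs, which is worth checking against the Infomap documentation for the version in use. The function name `to_infomap_link_list` is hypothetical and not part of the Syn-MRL pipeline.

```python
from typing import Dict, Iterable, List, Set, Tuple


def to_infomap_link_list(
    edge_lines: Iterable[str],
) -> Tuple[List[str], Dict[str, int]]:
    """Turn 'geneA geneB [extra columns]' lines into integer link-list lines.

    Assumption: the clustering tool wants 'source target' pairs of integer IDs.
    Extra columns (e.g. scores) are dropped because the network is unweighted.
    """
    gene_ids: Dict[str, int] = {}
    seen: Set[Tuple[int, int]] = set()
    links: List[str] = []
    for raw in edge_lines:
        # str.split() with no arguments also discards a trailing '\r',
        # so files with Windows line endings are handled.
        fields = raw.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # skip blank, malformed, and comment lines
        a, b = fields[0], fields[1]
        if a == b:
            continue  # ignore self-links
        for gene in (a, b):
            if gene not in gene_ids:
                gene_ids[gene] = len(gene_ids) + 1  # IDs start at 1
        # Undirected network: store each pair once, smaller ID first.
        u, v = sorted((gene_ids[a], gene_ids[b]))
        if (u, v) in seen:
            continue
        seen.add((u, v))
        links.append(f"{u} {v}")
    return links, gene_ids
```

The returned dictionary maps each gene name to its integer ID, so cluster assignments can be translated back to gene names after clustering.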
The resulting synteny clusters vary in size and composition, reflecting whether synteny is well conserved or rather lineage- or species-specific. A typical synteny cluster comprises syntenic genes shared by a group of species, and thus represents the phylogenetic relatedness of genomic architecture among those species. Here, we partitioned the entire synteny network into 137,833 synteny clusters.
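
To make "size and composition" concrete, the following minimal sketch counts, for each cluster, how many genes it holds and how many species contribute to it. It assumes two mappings are already available, gene to cluster and gene to species; the function names and the input layout are hypothetical and not taken from the Syn-MRL code.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple


def cluster_profiles(
    gene_to_cluster: Dict[str, int],
    gene_to_species: Dict[str, str],
) -> Dict[int, Counter]:
    """Count the genes of each species in each cluster."""
    profiles: Dict[int, Counter] = defaultdict(Counter)
    for gene, cluster in gene_to_cluster.items():
        profiles[cluster][gene_to_species[gene]] += 1
    return dict(profiles)


def summarize(profiles: Dict[int, Counter]) -> Dict[int, Tuple[int, int]]:
    """Return (number of genes, number of species) for each cluster."""
    return {
        cluster: (sum(counts.values()), len(counts))
        for cluster, counts in profiles.items()
    }
```

Under these assumptions, a cluster spanning many species points to broadly conserved synteny, while a cluster restricted to one or a few species points to lineage- or species-specific synteny.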

The phylogenomic profile of a cluster describes its composition as the number of nodes (genes) from each species. We summarize the information residing in all synteny clusters as a data matrix for tree inference. The phylogenomic profiles of all clusters form a large data matrix in which rows represent species and columns represent clusters. The matrix was then reduced to a binary presence-absence matrix to obtain the final synteny matrix. Tree estimation was based on maximum likelihood as implemented in IQ-TREE (version 1.7-beta7) (Nguyen et al., 2014), using the MK+R+FO model, where “M” stands for “Markov” and “k” refers to the number of states observed (in our case, k = 2). The +R (FreeRate) model was used to account for site heterogeneity; it typically fits data better than the Gamma model for large datasets. State frequencies were optimized by maximum likelihood (‘+FO’). We generated 1000 replicates for the SH-like approximate likelihood ratio test (SH-aLRT) and 1000 ultrafast bootstrap (UFBoot) replicates (-alrt 1000 -bb 1000).
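
The matrix representation step can be sketched as follows. This is a minimal illustration, assuming the cluster profiles are held as a dictionary of per-species gene counts and that the binary matrix is written in relaxed PHYLIP format (taxon name, whitespace, then the character string). The function names are hypothetical, and the output format actually used by the pipeline may differ.

```python
from typing import Dict, List, Mapping


def binarize_profiles(
    profiles: Mapping[int, Mapping[str, int]],
    species: List[str],
) -> Dict[str, str]:
    """Build one presence/absence string per species.

    Rows are species, columns are clusters in sorted cluster-ID order:
    '1' if the species has at least one gene in the cluster, else '0'.
    """
    clusters = sorted(profiles)
    return {
        sp: "".join(
            "1" if profiles[c].get(sp, 0) > 0 else "0" for c in clusters
        )
        for sp in species
    }


def to_phylip(matrix: Dict[str, str]) -> str:
    """Serialize the binary matrix as relaxed PHYLIP text."""
    n_taxa = len(matrix)
    n_sites = len(next(iter(matrix.values()))) if matrix else 0
    lines = [f"{n_taxa} {n_sites}"]
    lines += [f"{name} {row}" for name, row in matrix.items()]
    return "\n".join(lines) + "\n"
```

The resulting text can be saved to a file and passed to the tree-inference step with the model and support options described above.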

 

Microsynteny-based vs sequence-alignment based phylogenetic reconstruction


syn-mrl's Issues

Error when carrying out Phylogenetic profiling

Hi Dr. Zhao,

I have run into an error while carrying out the phylogenetic profiling step of the pipeline. I am using 94 ecdysozoan genomes and am getting this error:

Loading required package: permute
Loading required package: lattice
This is vegan 2.5-7
Error in hclust(d, method = "ward.D") :
size cannot be NA nor exceed 65536
Execution halted

Could you help me overcome this?

All the best,
Dearbhaile.

Do you have a tutorial on how to run SYN-MRL after you have the Synteny Network file?

I have finished building the synteny network, but Infomap clustering fails with the following error:
Infomap SynNetformatted.txt Clustering/ --clu -N 10 -2 --flow-model undirected

Infomap v1.4.1 starts at 2023-10-19 14:37:23
-> Input network: SynNetformatted.txt
-> Output path: Clustering/
-> Configuration: clu
two-level
flow-model = undirected
num-trials = 10

OpenMP 201511 detected with 32 threads...
Parsing undirected network from file 'SynNetformatted.txt'...
Parsing links...
'rror: Can't parse link data from line 'PGAS1623 PGNA1056 700
