Coder Social home page Coder Social logo

begfe's Introduction

BEGFE2.0: Bayesian Estimation for Gene Family Evolution

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program implements a Markov Chain Monte Carlo algorithm to estimate the posterior probability distribution of the birth and death rate parameter and the numbers of gene copies at the internodes of the phylogenetic tree. In addition, BEGFE can simulate gene family data under the birth and death model.

New features in version 2.0

  1. Parallel computing using PTHREAD is available in version 2.0. Set PTHREAD ?= yes in Makefile
  2. The constraints on the birth rate parameter lambda in the old versions are removed in version 2.0
  3. The birth-death probability is calculated using two formulas - one for alpha < 0.5 and another for alpha > 0.5 where alpha = lambda*brlens / (1+lambda*brlens)
  4. The program provides a warning message if the species names in the data file do not match the names in the control file
  5. The program provides a warning message if the number of gene families in the control file > the number of gene families in the data file
  6. The second column "NUMBER" is removed in the data file

Compile from source code

To compile the program from source code, type make and hit return under the directory src.

Run the program

./begfe controlfile

Example control files

Three control files are included in the package. The control file controlsim is used to simulate gene family data.

A control file for simulating gene family data (controlsim)

1 #0:analysis, 1:simulation

sim1 #output file

-1 #random seed

100 #number of gene families to simulate

5 #number of species

(((chimp:6#0.005,human:6#0.005):81#0.005,(mouse:17#0.005,rat:17#0.005):70#0.005):6#0.005,dog:93#0.005)#0.005; #species tree, the numbers after '#' are birth/death rates

5 15 #the number of gene copies at the tree root is simulated from an uniform distribution (5, 15)

To run the simulation, type ./begfe controlsim

This will produce two files; sim1 and sim1.true. The simulated gene family dataset is saved in the file sim1, while sim1.true contains the true values of the parameters in the Bayesian model.

A control file for analyzing gene family data (control)

The other control file control is for carrying out the Bayesian analysis of the gene family data. Type ./begfe control and hit return.

0 #0:analysis, 1:simulation

sim1 #input file

-1 #random seed

100 #number of gene families

5 #number of species

(((chimp:6,human:6):81,(mouse:17,rat:17):70):6,dog:93); #species tree

10000 100 0 #number of MCMC generations, save every 100 samples, 0:unlinked (variable) lambdas and 1:linked (single) lambda

A second example of analyzing gene family data (AnolisMHC_control.txt)

0 #0:analysis, 1:simulation

AnolisMHCtab.txt #input file

-1 #random seed

4 #number of gene families

13 #number of species

(((((((Laticauda_laticaudata:32.71005600,(Pseudonaja_textilis:27.60000000,Notechis_scutatus:27.60000000):5.11005600):2.28994400,Naja_naja:35.00000000):132.12419034,(Anolis_carolinensis:66.08013667,Anolis_sagrei:66.08013667):101.04405368):10.87262926,(Podarcis_muralis:148.65977538,Salvator_merianae:148.65977538):29.33704422):73.83363773,Sphenodon_punctatus:251.83045733):27.82651933,(Gallus_gallus:98.04286929,Taeniopygia_guttata:98.04286929):181.61410737):32.24694470,(Mus_musculus:89.82318742,Homo_sapiens:89.82318742):222.08073394);

1000000 1000 0 #number of MCMC generations, save every 100 samples, 0:unlinked (variable) lambdas and 1:linked (single) lambda

Ouput files

There are two output files; sim1.out and sim1.pvalue. The MCMC output for all parameters is saved in sim1.out. Each column in sim.out represents the posterior distribution of a particular parameter in the birth and death model. The parameters such as the birth and death rate are estimated by the Bayesian means, i.e., the averages of the columns after discarding the burn-in period.

The Bayesian p-values (PPP) for all gene families in the dataset are saved in sim1.pvalue. The average of a column in sim1.pvalue is the Bayesian p-values for a particular gene family. Of course, the burn-in period must be discarded.

Citation

Liu, L., L. Yu, V. Kalavacharla, Z. Liu. A Bayesian model for gene family evolution. BMC Bioinformatics. 2011, 12:426

Old versions

Old versions BEGFE are available at https://github.com/lliu1871/oldversion

Example data set

The Anolis MHC example data set (unpublished as of April 2022) is provided courtesy of Daren Card and Scott Edwards, Harvard University

begfe's People

Contributors

lliu1871 avatar

Watchers

James Cloos avatar  avatar  avatar

Forkers

darencard

begfe's Issues

Clarifying settings in control files

I am hoping that some description of the control files can be provided so I can understand how to run the program on my own data. In looking at the files, I can reasonably guess certain fields, but there are others where I am puzzled. Apologies if I missed this information somewhere, but I have looked through the repo and cannot find any explanations.

controlsim

1
sim1
-1
124
5
(((chimp:6,human:6):81,(mouse:17,rat:17):70):6,dog:93);
5 15
0.005 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001

control1

0
sim1
-1
122
5
(((chimp:6,human:6):81,(mouse:17,rat:17):70):6,dog:93);
10000 100 0

Thanks in advance for the help.
Daren Card

Multimodal posteriors

I have been running begfe to assess gene family evolution among 8 species of vertebrates based on 10,683 gene families, with rates between lineages unlinked. Here is my configuration file for running begfe:

0
geneFamCounts.txt
begfe_ratesUnlinked_results_run1
-1
10683
8
(mm9:318.5,(galGal6:281.8,((gekJap1:191.9,((salMer1:168,(anoCar2:160.8,croVir1:160.8):7.2):7.0,lerBou1:175):16.9):54.0,sphPun1:245.9):35.9):36.7);
250000000 250 0

I ran two separate chains with the above configuration. and discarded the first 20% of MCMC repetitions as burn-in, leaving 800k MCMC reps from each chain. Both chains showed similar MCMC characteristics and converged on essentially identical posteriors. In some cases, ESS values were lower than I would have liked, but they appear largely sufficient. I plotted the species tree with divergence times and node labels on each branch to help with interpreting the following information. Happy to provide more details, if necessary.

I am interested in lambda for each node but am seeing a potential issue with multimodal posteriors for several nodes. I have attached graphs of the post-burn-in MCMC traces and densities for these parameters (chains combined) where this is evident for nodes 1-5 and 8. The distribution is strange and it is also centered on the same lambda estimate across the nodes. Moreover, nodes 14-15 are also centered on very similar lambda estimates to the other "problematic" nodes, though the distributions are much smoother and not multimodal. I would not expect lambdas for all of these nodes to be so similar, which also makes me suspicious of the results. Note that while I did not provide the traces/densities for all the other parameters estimates (contraction/no change/expansion), they appear reasonable and do not suffer from the same issues as what I observe for the estimates of lambdas.

I am hoping for some information about why I may be seeing these patterns and how I should consider proceeding with my analysis. Given I have run 2 chains with what appears to be sufficient MCMC reps that have converged on similar results, I'm at a loss for how (or whether) to interpret these results or how I can overcome these potential issues. Any help or advice is greatly appreciated!

tree

1-5

6-10

10-15

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.