begfe's Introduction

BEGFE2.0: Bayesian Estimation for Gene Family Evolution

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program implements a Markov Chain Monte Carlo algorithm to estimate the posterior probability distribution of the birth and death rate parameter and the numbers of gene copies at the internodes of the phylogenetic tree. In addition, BEGFE can simulate gene family data under the birth and death model.

New features in version 2.0

Parallel computing using PTHREAD is available in version 2.0. Set PTHREAD ?= yes in Makefile
The constraints on the birth rate parameter lambda in the old versions are removed in version 2.0
The birth-death probability is calculated using two formulas - one for alpha < 0.5 and another for alpha > 0.5 where alpha = lambda*brlens / (1+lambda*brlens)
The program provides a warning message if the species names in the data file do not match the names in the control file
The program provides a warning message if the number of gene families in the control file > the number of gene families in the data file
The second column "NUMBER" is removed in the data file

Compile from source code

To compile the program from source code, type make and hit return under the directory src.

Run the program

./begfe controlfile

Example control files

Three control files are included in the package. The control file controlsim is used to simulate gene family data.

A control file for simulating gene family data (controlsim)

1 #0:analysis, 1:simulation

sim1 #output file

-1 #random seed

100 #number of gene families to simulate

5 #number of species

(((chimp:6#0.005,human:6#0.005):81#0.005,(mouse:17#0.005,rat:17#0.005):70#0.005):6#0.005,dog:93#0.005)#0.005; #species tree, the numbers after '#' are birth/death rates

5 15 #the number of gene copies at the tree root is simulated from an uniform distribution (5, 15)

To run the simulation, type ./begfe controlsim

This will produce two files; sim1 and sim1.true. The simulated gene family dataset is saved in the file sim1, while sim1.true contains the true values of the parameters in the Bayesian model.

A control file for analyzing gene family data (control)

The other control file control is for carrying out the Bayesian analysis of the gene family data. Type ./begfe control and hit return.

0 #0:analysis, 1:simulation

sim1 #input file

-1 #random seed

100 #number of gene families

5 #number of species

(((chimp:6,human:6):81,(mouse:17,rat:17):70):6,dog:93); #species tree

10000 100 0 #number of MCMC generations, save every 100 samples, 0:unlinked (variable) lambdas and 1:linked (single) lambda

A second example of analyzing gene family data (AnolisMHC_control.txt)

0 #0:analysis, 1:simulation

AnolisMHCtab.txt #input file

-1 #random seed

4 #number of gene families

13 #number of species

(((((((Laticauda_laticaudata:32.71005600,(Pseudonaja_textilis:27.60000000,Notechis_scutatus:27.60000000):5.11005600):2.28994400,Naja_naja:35.00000000):132.12419034,(Anolis_carolinensis:66.08013667,Anolis_sagrei:66.08013667):101.04405368):10.87262926,(Podarcis_muralis:148.65977538,Salvator_merianae:148.65977538):29.33704422):73.83363773,Sphenodon_punctatus:251.83045733):27.82651933,(Gallus_gallus:98.04286929,Taeniopygia_guttata:98.04286929):181.61410737):32.24694470,(Mus_musculus:89.82318742,Homo_sapiens:89.82318742):222.08073394);

1000000 1000 0 #number of MCMC generations, save every 100 samples, 0:unlinked (variable) lambdas and 1:linked (single) lambda

Ouput files

There are two output files; sim1.out and sim1.pvalue. The MCMC output for all parameters is saved in sim1.out. Each column in sim.out represents the posterior distribution of a particular parameter in the birth and death model. The parameters such as the birth and death rate are estimated by the Bayesian means, i.e., the averages of the columns after discarding the burn-in period.

The Bayesian p-values (PPP) for all gene families in the dataset are saved in sim1.pvalue. The average of a column in sim1.pvalue is the Bayesian p-values for a particular gene family. Of course, the burn-in period must be discarded.

Citation

Liu, L., L. Yu, V. Kalavacharla, Z. Liu. A Bayesian model for gene family evolution. BMC Bioinformatics. 2011, 12:426

Old versions

Old versions BEGFE are available at https://github.com/lliu1871/oldversion

Example data set

The Anolis MHC example data set (unpublished as of April 2022) is provided courtesy of Daren Card and Scott Edwards, Harvard University

begfe's People

Contributors

Watchers

begfe's Issues

Clarifying settings in control files

I am hoping that some description of the control files can be provided so I can understand how to run the program on my own data. In looking at the files, I can reasonably guess certain fields, but there are others where I am puzzled. Apologies if I missed this information somewhere, but I have looked through the repo and cannot find any explanations.

controlsim

1
sim1
-1
124
5
(((chimp:6,human:6):81,(mouse:17,rat:17):70):6,dog:93);
5 15
0.005 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001

control1

0
sim1
-1
122
5
(((chimp:6,human:6):81,(mouse:17,rat:17):70):6,dog:93);
10000 100 0

Thanks in advance for the help.
Daren Card

Multimodal posteriors

I have been running begfe to assess gene family evolution among 8 species of vertebrates based on 10,683 gene families, with rates between lineages unlinked. Here is my configuration file for running begfe:

0
geneFamCounts.txt
begfe_ratesUnlinked_results_run1
-1
10683
8
(mm9:318.5,(galGal6:281.8,((gekJap1:191.9,((salMer1:168,(anoCar2:160.8,croVir1:160.8):7.2):7.0,lerBou1:175):16.9):54.0,sphPun1:245.9):35.9):36.7);
250000000 250 0

I ran two separate chains with the above configuration. and discarded the first 20% of MCMC repetitions as burn-in, leaving 800k MCMC reps from each chain. Both chains showed similar MCMC characteristics and converged on essentially identical posteriors. In some cases, ESS values were lower than I would have liked, but they appear largely sufficient. I plotted the species tree with divergence times and node labels on each branch to help with interpreting the following information. Happy to provide more details, if necessary.

I am interested in lambda for each node but am seeing a potential issue with multimodal posteriors for several nodes. I have attached graphs of the post-burn-in MCMC traces and densities for these parameters (chains combined) where this is evident for nodes 1-5 and 8. The distribution is strange and it is also centered on the same lambda estimate across the nodes. Moreover, nodes 14-15 are also centered on very similar lambda estimates to the other "problematic" nodes, though the distributions are much smoother and not multimodal. I would not expect lambdas for all of these nodes to be so similar, which also makes me suspicious of the results. Note that while I did not provide the traces/densities for all the other parameters estimates (contraction/no change/expansion), they appear reasonable and do not suffer from the same issues as what I observe for the estimates of lambdas.

I am hoping for some information about why I may be seeing these patterns and how I should consider proceeding with my analysis. Given I have run 2 chains with what appears to be sufficient MCMC reps that have converged on similar results, I'm at a loss for how (or whether) to interpret these results or how I can overcome these potential issues. Any help or advice is greatly appreciated!

Recommend Projects

lliu1871 / begfe Goto Github PK

begfe's Introduction

BEGFE2.0: Bayesian Estimation for Gene Family Evolution

New features in version 2.0

Compile from source code

Run the program

Example control files

A control file for simulating gene family data (controlsim)

A control file for analyzing gene family data (control)

A second example of analyzing gene family data (AnolisMHC_control.txt)

Ouput files

Citation

Old versions

Example data set

begfe's People

Contributors

Watchers

Forkers

begfe's Issues

Recommend Projects

Recommend Topics

Recommend Org