MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters.
Program features include:
A common command-line interface across Macintosh, Windows, and UNIX operating systems;
Extensive help available from the command line;
Analysis of nucleotide, amino acid, restriction site, and morphological data;
Mixing of data types, such as molecular and morphological characters, in a single analysis;
Easy linking and unlinking of parameters across data partitions;
An abundance of evolutionary models, including 4 X 4, doublet, and codon models for nucleotide data and many of the standard rate matrices for amino acid data;
Estimation of positively selected sites in a fully hierarchical Bayesian framework;
Full integration of the BEST algorithms for the multi-species coalescent.
Support for complex combinations of positive, negative, and backbone constraints on topologies;
Model jumping across the GTR model space and across fixed rate matrices for amino acid data;
Monitoring of convergence during the analysis, and access to a wide range of convergence diagnostics tools after the analysis has finished;
Rich summaries of posterior samples of branch and node parameters printed to majority rule consensus trees in FigTree format;
Implementation of the stepping-stone method for accurate estimation of model likelihoods for Bayesian model choice using Bayes factors;
The ability to spread jobs over a cluster of computers using MPI (for Macintosh (OS X) and UNIX environments only);
Support for the BEAGLE library, resulting in dramatic speedups for codon and amino acid models on compatible hardware (NVIDIA graphics cards);
Checkpointing across all models, allowing the user to seemlessly extend a previous analysis or recover from a system crash;
Installation
MrBayes can be downloaded from the MrBayes website. Pre-compiled, executable versions of MrBayes for PCs or Macs are available for direct download. Both serial and parallel versions are available. Parallel versions allow a single analysis to take advantage of multiple processors, if they are available.
The source code can also be downloaded and compiled locally for use on Unix systems or for command-line use on other systems. Command-line versions will facilitate batch and cluster-based analyses. Instructions for compilation are given on the website.
Starting MrBayes.
If you have downloaded a pre-compiled executable version of MrBayes, simply double-click on the icon to start MrBayes. If you have compiled the source code locally, you can execute MrBayes from the command line (e.g., ./mb in Unix), once you have set your working directory properly.
Once MrBayes has started, it should print out an intro splash screen and then wait for your commands at the command prompt:
MrBayes >
Getting Help.
The single most important MrBayes command is help. Simply typing help on the command line,
MrBayes > help
will provide a list of the possible MrBayes commands and a brief description of their purpose. If you would like more information on a particular command, including a list of its associated parameters and their current settings, use help <command_name> where <command_name> should be replaced by the name of the command in which you are interested. For example, to get help on the execute command, type:
MrBayes > help execute
The first command that you will need to use MrBayes is execute. This command loads information (either a data matrix or a set of MrBayes commands) from a file. The file should be nexus-formatted and placed in the same directory as the MrBayes executable. To load the primate sequence data, place the primates.nex file in the same folder as MrBayes and then type:
MrBayes > execute primates.nex
The execute command can be used identically with a file containing MrBayes commands. If this file contains all the commands needed for an analysis, one could simply open MrBayes and execute this command file at the command prompt. You could also run MrBayes with a command file by specifying the file name (e.g., mb_cmds.nex) on the command line when executing MrBayes. For instance, in Unix:
mb mb_cmds.nex
[Note: This command line should work if you've run the MrBayes installer, so the MrBayes binary has been automatically placed in your path. If you've compiled MrBayes from source code on your own, you'll need to move the binary to the folder where you want to run your analysis. You'll then need to type ./mb mb_cmds.nex instead.]
Once data has been loaded into MrBayes, sites can be excluded from an analysis using the exclude command. Sites are designated for exclusion based on the number corresponding to their column in an alignment. For instance, to exclude sites 90-99 in the primates data that we already loaded, type:
MrBayes > exclude 90-99
If sites need to be re-included, use the include command:
MrBayes > include 90-99
Often, one might want to include/exclude every n-th character. For instance, defining a character set that starts with the first character and includes every 3rd character would circumscribe all 1st codon positions. This type of inclusion/exclusion is accomplished by using the notation \3 at the end of the character list. For example, to exclude all 1st codon positions, type:
MrBayes > exclude 1-. \3
The "." symbol is used to mean the last position in our data set. To exclude all 2nd codon positions, use a very similar command, but start the character list with the 2nd site:
MrBayes > exclude 2-. \3
The inclusion/exclusion status of all sites in our dataset can be viewed using the charstat command.
Specifying a Model of Sequence Evolution.
Basic Model Manipulations.
Most model specification is done using the lset command. The most common model specifications performed in lset involve the number of substitution types and the model of rate variation across sites. For instance, the number of unique substitution rates can be specified using the nst parameter. Possible values for nst are 1 (as in the Jukes-Cantor model), 2 (as in the HKY model), and 6 (as in the GTR model). Possible models of rate variation specified with the rates parameter include a proportion of invariable sites (I), gamma-distributed rates across sites (G), or a combination of the two (I+G). For example, you could specify a GTR+G model by typing:
MrBayes > lset nst=6 rates=gamma
In the newest version of MrBayes (3.2), one can also avoid having to specify only one scheme of substitution types by allowing MrBayes to move across different schemes as part of its MCMC sampling. This procedure is known as reversible jump MCMC (RJ-MCMC). To set up reversible jumping, use the following command:
MrBayes > lset nst=mixed rates=gamma
Reversible jumping is not currently set up for different models of rate variation across sites, so you will still need to specify +I, +G, or +I+G.
While I will not provide detailed instructions here on analyses with doublet and codon models in MrBayes, these models can also be specified using the lset command. See the MrBayes manual for more information.
By default, MrBayes allows the stationary frequencies of the four nucleotides to be estimated from the data. To fix these frequencies at particular values (often 0.25 for all four nucleotides), use the prset command. MrBayes considers the fixation of base frequencies to be a prior setting, rather than a setting of the likelihood model. For instance, to fix all frequencies to be equal, type:
MrBayes > prset statefreqpr=fixed(equal)
Partitioning.
It is now widely recognized that the evolutionary process has not been homogeneous across sites. One approach to accomodating heterogeneity in the evolutionary process across sites is to divide sites into distinct subsets (a process called partitioning) and model the evolution of each subset using an independent Markov model of nucleotide substitution.
The first step in performing a partitioned analysis is to define the distinct subsets of your data. These definitions are made using the charset command. The syntax for specifying sites with charset is the same as with the include and exclude commands. For instance, to specify a subset corresponding to the first codon position, type:
MrBayes > charset cod.pos.1 = 1-. \3
Subsets corresponding to codon positions 2 and 3 could be similarly specified as:
MrBayes > charset cod.pos.2 = 2-. \3
MrBayes > charset cod.pos.3 = 3-. \3
Note that each site must be assigned to one and only one subset to perform a partitioned analysis.
Once all of the appropriate character sets have been defined, they need to be explicity combined into a partition. Appropriately, this is done with a command called partition. For instance, to specify a partitioning scheme with 3 subsets corresponding to the three codon positions, type
MrBayes > partition cod.pos = 3:cod.pos.1,cod.pos.2,cod.pos.3
The number immediately preceding the ":" gives the number of subsets in the partioning scheme.
After defining a partitioning scheme, you need to tell MrBayes that you actually want to use that scheme. This is done with the set command. So, to use the "cod.pos" partitioning scheme, type
MrBayes > set partition=cod.pos
Now that the partitioning scheme is all set, you need to define models of evolution for each subset. As above, this is done with the lset command. The one additional consideration is that you need to tell MrBayes to which subsets you want any particular lset call to apply. To apply one lset command to all subsets within a partitioning scheme, type
MrBayes > lset applyto=(all) nst=6 rates=gamma
If you want to apply different lset calls to different partitions, use the numbers corresponding to the subset (defined by the order in which the subsets are listed in the partition definition). So, to apply an HKY model to codon positions 1 & 2 and a GTR+G model to codon position 3, type
MrBayes > lset applyto=(1,2) nst=2
MrBayes > lset applyto=(3) nst=6 rates=gamma
After defining the appropriate models, you need to explicitly tell MrBayes that parameters (some or all) need to be estimated independently (or "unlinked") for each partition. The "unlinking" of parameters across parititions is accomplished with the unlink command
MrBayes > unlink revmat=(all) tratio=(all) statefreq=(all) shape=(all)
Each parameter (or related set of parameters) needs to be unlinked individually. "revmat" corresponds to the relative rates of substitution, "tratio" corresponds to the transition/transversion rate ratio, "statefreq" corresponds to the stationary nucleotide frequencies, and "shape" corresponds to the alpha shape parameter of the gamma distribution that models rate variation across sites. You may also need to unlink the proportion of invariable sites across partitions if they're included in your models.
To check that you've unlinked parameters appropriately across partitions, use the showmodel command.
It is also possible to allow the overall tree length to be estimated independently for each subset, through the use of independent "scaling parameters". By default, these scaling parameters are linked across partitions. Unlike the unlinking of other parameters, the scaling parameters are unlinked using prset by typing
MrBayes > prset ratepr=variable
Note that the use of rate multipliers across partitions still assumes that the relative branch lengths in the tree are identical across partitions. If, instead, one wants to model different relative branch lengths across partitions, every branch length can be estimated independently across partitions. The unlinking of every branch length across partitions is done with unlink
MrBayes > unlink brlens=(all)
Unlink commands can be reversed using the link command, with identical syntax.
Specifying Priors on Model Parameters.
Priors for nearly all parameters (i.e., relative rates of substitution, base frequencies, branch lengths, etc) can be set using the prset command. The available prior distributions vary by parameter type. Lots of information on possibile distributions can be gained from help prset. As one example, a more informative prior around equal base frequencies (but not fixed at equal frequencies) can be set by increasing the values of the Dirichlet parameters
MrBayes > prset applyto=(all) statefreqpr=Dirichlet(100,100,100,100)
One particularly important prior to consider is the branch-length prior. This prior is important because this single distribution is applied to all the branch lengths in the tree, which can be a very large number of independent parameters. This is a particularly good prior to try testing for prior sensitivity. In other words, run multiple analyses with different branch-length priors and examine the sensitivity of the resulting posteriors to the specification of the prior. Also note that the parameter specified for the exponential prior on branch lengths is the rate of the exponential distribution, which is the inverse of its mean. To specify an exponential prior that is pushed up more tightly around small branch lengths, you should specify a larger rate parameter. For instance,
MrBayes > prset brlenspr=Unconstrained:Exp(100) - Exponential prior on branch lengths with a mean of 0.01.
MrBayes > prset brlenspr=Unconstrained:Exp(10) - Exponential prior on branch lengths with a mean of 0.1. (Default in MrBayes)
MrBayes > prset brlenspr=Unconstrained:Exp(1) - Exponential prior on branch lengths with a mean of 1.
It has also been shown that the branch-length prior specification can affect the inferred posterior probabilities of the topology. Therefore, when performing prior sensitivity analyses with different branch-length priors, it might be important to monitor sensitivity to inferred branch lengths and topologies.
Manipulating Markov chain Monte Carlo (MCMC) Settings.
There are a huge number of potential settings pertaining to the Markov chain Monte Carlo (MCMC) analysis that can be manipulated in MrBayes. Most of these changes are made using the mcmcp and mcmc commands. These commands differ only in whether or not they begin the Markov chain when the command is issued. mcmcp just sets the parameter values without starting the analysis. mcmc will begin the analysis after setting the values of parameters. Some of the MCMC parameters that can be set using these commands include:
ngen - The number of MCMC generations MrBayes will run before pausing
nruns - The number of independent MCMC analyses run by MrBayes
nchains - The number of Metropolis-coupled chains for each independent MCMC analysis
samplefreq - The frequency with which MrBayes writes output to files
printfreq - The frequency with which MrBayes prints output to the screen
filename - The root name for output file names
temp - The degree of heating for Metropolis-coupled chains
Many other MCMC options can be set with these commands. The full list of these options can be obtained by typing help mcmcp or help mcmc.
The other command useful in tweaking the details of the MCMC analysis is the props command. props can alter the relative frequency of proposals to different parameters, as well as the distributions from which proposals are drawn. Proposal distributions differ according to the parameter of interest. The final page of the MrBayes manual lists the proposal distributions for each parameter type and their associated tuning parameters. It is also important to note that in the current version of MrBayes (3.1.2), the props command can only be issued "interactively". In other words, you cannot put a props command in a MrBayes block of a command file. Typing props will cause MrBayes to prompt you for more information about how to change proposal frequencies and distributions. Finally, a 'word of caution'. Certain changes to proposals can make it virtually impossible for the MCMC analysis to mix properly across the posterior distribution and give biased results. Using rigorous checks of convergence should allow you to detect cases of very poor mixing.
Summarizing Output.
After you have run an MCMC analysis in MrBayes and made sure that your runs have converged (a topic not covered here), you can summarize the estimated posterior distributions of both the parameter values and trees using the sump (parameters) and sumt (trees) commands. You can also discard those samples biased by the starting point of an MCMC analysis, a process known as discarding the "burn-in". Removing burn-in when summarizing the posterior distributions is done by specifying a value for the burn-in parameter with sump and sumt
MrBayes > sump burnin=1000 - Summarizing posterior samples of parameter values, discarding 1,000 samples (NOT generations)
MrBayes > sumt burnin=1000 - Summarizing posterior samples of trees, discarding 1,000 samples
Several other parameters can be specified when performing posterior summaries. A full list can be obtained by typing help sump or help sumt.
The output of the sump command will include means, medians, variances, and 95% credible intervals for all scalar parameters.
The output of the sumt command includes estimated posterior probabilities of branches (.parts file), estimated posterior probabilities of trees (.trprobs file), and a majority-rule consensus tree calculated from the posterior tree samples (.con file).
Other software (some listed below) can help you summarize posterior distributions of trees and parameter values, as well as diagnose convergence.
Estimating Marginal Likelihoods Using Steppingstone Sampling.
The marginal likelihood of a model is an important quantity that allows comparison of fit across different models, accounting for uncertainty in the values of model parameters. The simplest way to estimate the marginal likelihood of a model based on the output of MCMC sampling of the posterior is to calculate the harmonic mean of their likelihood scores. However, the harmonic mean can have some highly undesirable statistical properties, including bias and extremely high variance. This leads to low repeatability across replicates for estimates that are inaccurate. A much more accurate and repeatable approach was recently developed, also based on importance sampling, and is known as steppingstone sampling (Xie et al., 2011; Fan et al., 2011). In this approach, an MCMC chain is still used, but it samples a series of "power posterior distributions" that move progressively between the posterior and the prior (or vice versa). A theoretical treatment of this approach is beyond the scope of this tutorial, but I encourage any user interested in estimating marginal likelihoods to read the papers describing and characterizing this approach (Xie et al., 2011; Fan et al., 2011).
To estimate marginal likelihoods via steppingstone in MrBayes, one first needs to specify appropriate settings. This can be done using the mcmcp and ssp commands. Relevant parameters include the total number of generations sampled (set with mcmcp), the burn-in before the first step of the sampling (set with ssp), the burn-in for each sampling step (set with mcmcp), and the number of different power posterior distributions sampled (set with ssp).
After everything is set up, start the steppingstone sampling process using the simple ss command:
MrBayes > ss
Once sampling is completed across all power posterior distributions, MrBayes will provide an estimate of the marginal likelihood. It is good practice to estimate these values several times with independent runs to ensure their repeatability.
NOTE: Because MrBayes is sampling many different distributions that vary between the posterior and prior when steppingstone is run, do NOT use the trees and parameter files from an ss run to estimate posterior probabilities and credible intervals.
If you liked this article, subscribe to the feed by clicking the image below to keep informed about new contents of the blog:
0 commenti:
Post a Comment