It guides you through a discrete phylogeography analysis of a H5N1 epidemic in South China. This analysis will use the model developed by Lemey et al., 2009 that implements ancestral reconstruction of discrete states (locations) in a Bayesian statistical framework, and employs the Bayesian stochastic search variable selection (BSSVS) to identify the most parsimonious description of the phylogeographic diffusion process.

The additional benefit using this model is that we can summarise the phylogeographic inferences from an analysis, and use a virtual globe software to visualize the spatial and temporal information.

The programs used in this tutorial are listed below.

The NEXUS alignment

The data is in a file called h5n1.nex. By clicking the name of the file, it will be opened on your web browser, after which you can download the data by right-clicking on the main window, “Save Page As”, and saving the file as h5n1.nex in the desired folder.

The data is a subset of original dataset Wallace et al., 2007, and it consists of 43 influenza A H5N1 hemagglutinin and neuraminidase gene sequences isolated from a variety of hosts 1996 - 2005 across sample locations.

Constructing the scripts in LPhy Studio

The software LPhy Studio is used to specify and visualise models as well as simulate data from models defined in LPhy scripts.

The data block is used to input and store the data, which will be processed by the models defined later, and which also allows you to reuse the another dataset by simply replacing the current data. In this block, we normally include the constants for models, the alignment loaded from a NEXUS file, and the meta data regarding to the information of taxa that we have known.

The model block is to define and also describe your models and parameters in the Bayesian phylogenetic analysis. Therefore, your result could be easily reproduced by other researchers.

Please make sure the tab above the command console is set to data, when you intend to type or copy and paste the data block scripts into the console. In addition, make sure to switch the tab to model, when you intend to type or copy and paste the model block scripts into the console.

When you write your LPhy scripts, please be aware that data and model have been reserved and cannot be used as the variable name.

The LPhy scripts to define this analysis is listed below.


data {
  options = {ageDirection=“forward”, ageRegex=“_(\d+)$”};
  D = readNexus(file=“data/H5N1.nex”, options=options);
  L = D.nchar();
  taxa = D.taxa();
  D_trait = extractTrait(taxa=taxa, sep=“_”, i=2);
  K = D_trait.canonicalStateCount();
  dim = K*(K-1)/2;
model {
  π ~ Dirichlet(conc=[2.0, 2.0, 2.0, 2.0]);
  κ ~ LogNormal(meanlog=1.0, sdlog=1.25);
  γ ~ LogNormal(meanlog=0.0, sdlog=2.0);
  r ~ DiscretizeGamma(shape=γ, ncat=4, replicates=L);
  Θ ~ LogNormal(meanlog=0.0, sdlog=1.0);
  ψ ~ Coalescent(taxa=taxa, theta=Θ);
  D ~ PhyloCTMC(Q=hky(kappa=κ, freq=π), mu=0.004, siteRates=r, tree=ψ);
  π_trait ~ Dirichlet(conc=rep(element=3.0, times=K));
  I ~ Bernoulli(minSuccesses=dim-2, p=0.5, replicates=dim);
  R_trait ~ Dirichlet(conc=rep(element=1.0, times=dim));
  Q_trait = generalTimeReversible(rates=select(x=R_trait, indicator=I), freq=π_trait);
  μ_trait ~ LogNormal(meanlog=0, sdlog=1.25);
  D_trait ~ PhyloCTMC(L=1, Q=Q_trait, dataType=“standard”, mu=μ_trait, tree=ψ);

Graphical Model

Figure 1: The graphical model

For the details, please read the auto-generated narrative from LPhyStudio.

The code D_trait = extractTrait(taxa=taxa, sep=“_”, i=2); in the data block uses the function to extract the locations from the taxa names, and creates a trait alignment D_trait to contain these locations mapped to each taxon. Then the next line K = D_trait.canonicalStateCount(); count the number of unique locations in the trait alignment. Please note the method canonicalStateCount() returns the number of canonical states excluding ambiguous states.

Geographic Model

In this analysis, we have two parts mixed in the model section: the first part is modeling evolutionary history and demographic structure based on a nucleotide alignment, and the second part is defining how to sample the discrete states (locations) from the phylogeny $\psi$ shared with the 1st part.

For the nucleotide alignment, We fix a strict molecular clock to 0.004, to make the analysis converge a bit quicker.

The next is the geographic model. In the discrete phylogeography, the probability of transitioning to a new location through the time is computed by

\[P(t) = e^{\Lambda t}\]

where $\Lambda$ is a $ K \times K $ infinitesimal rate matrix, and $K$ is the number of discrete locations (i.e. 5 locations here). Then,

\[\Lambda = \mu S \Pi\]

where $\mu$ is an overall rate scalar, $S$ is a $ K \times K $ matrix of relative migration rates, and $\Pi = diag(\pi)$ where $\pi$ is the equilibrium trait frequencies. After the normalization, $\mu$ measures the number of migration events per unit time $t$. The detail is explained in Lemey et al., 2009.

So, assuming migration to be symmetric in this analysis, we define a vector variable R_trait with the length of $ \frac{K \times (K-1)}{2} $ to store the off-diagonal entries of the unnormalised $S$. Another boolean vector I with the same length determines which infinitesimal rates are zero, which is performed by the function select. This implements the Bayesian stochastic search variable selection (BSSVS).

Producing BEAST XML using LPhyBEAST

BEAST 2 reads instructions about the data and the model from a user-provided .xml, which can be produced in a variety of ways. Our goal with LPhy is to make the preparation of the .xml as painless, clear and precise as possible. In order to achieve that, we will use a companion application, LPhyBEAST, as a bridge between the LPhy script we typed above and the .xml.

LPhyBEAST is distributed as a BEAST 2 package, we can use an application called Package Manager, which is distributed with BEAST 2 together, to install it. To start LPhyBEAST, we have to use the script lphybeast. Some technical guides can help you to start.

In our h5n1.lphy script, the alignment file is assumed to locate under the folder tutorials/data/. So we need to go to the tutorials folder, which is normally where the LPhy is installed, run LPhyBEAST as below and check the end of message to find where is the generated XML.

Let us run LPhyBEAST now:

# BEAST_DIR="/Applications/BEAST2/"
$BEAST_DIR/bin/lphybeast -l 3000000 h5n1.lphy

Running BEAST

After LPhyBEAST generates a BEAST 2 .xml file (e.g., h5n1.xml), we can point BEAST 2 to it, which will then start the inferential MCMC analysis.

BEAST 2 will write its outputs to disk into text files specified in the .xml file (specific paths can be passed in, but in their absence BEAST 2 will write the outputs in the same directory from where it was called).

BEAST 2 will also output the progress of the analysis and some summaries to the screen, like this:

                         BEAST v2.6.7, 2002-2020
             Bayesian Evolutionary Analysis Sampling Trees
                       Designed and developed by
 Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard
                   Centre for Computational Evolution
                         University of Auckland
                   Institute of Evolutionary Biology
                        University of Edinburgh
                    David Geffen School of Medicine
                 University of California, Los Angeles
                      Downloads, Help & Resources:
  Source code distributed under the GNU Lesser General Public License:
                           BEAST developers:
   Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled, 
 Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li, 
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel, 
          Oliver Pybus, Tim Vaughan, Chieh-Hsi Wu, Walter Xie
                               Thanks to:
          Roald Forsberg, Beth Shapiro and Korbinian Strimmer

Random number seed: 1659590653904


        2850000     -5960.6708     -5834.2398      -126.4309         0.3211         0.1966         0.2226         0.2595         9.0333         0.3238         8.6857         0.1296         0.3150         0.1804         0.1649         0.2098              1              1              1              1              1              1              1              1              1              1         0.1884         0.0747         0.0422         0.1702         0.0805         0.0174         0.2373         0.0457         0.0291         0.1139         0.6191 1m16s/Msamples
        3000000     -5957.8699     -5834.9067      -122.9631         0.3277         0.1965         0.2239         0.2517         7.6506         0.4290         5.9862         0.1501         0.3367         0.1051         0.2549         0.1529              1              1              1              1              1              1              1              1              1              1         0.1650         0.0716         0.0709         0.1302         0.3421         0.0130         0.0994         0.0303         0.0485         0.0286         0.2453 1m16s/Msamples

Operator                                          Tuning    #accept    #reject      Pr(m)  Pr(acc|m)
BitFlipOperator(I.bitFlip)                             -      39782      90355    0.04340    0.30569 
DeltaExchangeOperator(R_trait.deltaExchange)     0.57921      15922     104777    0.04031    0.13191 
ScaleOperator(Theta.scale)                       0.46729       7620      18512    0.00866    0.29160 
ScaleOperator(gamma.scale)                       0.40531       7005      18916    0.00866    0.27024 
ScaleOperator(kappa.scale)                       0.52022       6840      19112    0.00866    0.26356 
ScaleOperator(mu_trait.scale)                    0.27589       7746      18166    0.00866    0.29893 
UpDownOperator(mu_traitUppsiDownOperator)        0.91311      40003     321915    0.12047    0.11053 
DeltaExchangeOperator(pi.deltaExchange)          0.06532       9549      46452    0.01868    0.17051 
DeltaExchangeOperator(pi_trait.deltaExchange)    0.54124      10870      57988    0.02285    0.15786 
Exchange(psi.narrowExchange)                           -      60429     295287    0.11850    0.16988 
ScaleOperator(psi.rootAgeScale)                  0.77165       3697      22375    0.00866    0.14180 
ScaleOperator(psi.scale)                         0.90511      35312     319993    0.11850    0.09939 Try setting scaleFactor to about 0.951
SubtreeSlide(psi.subtreeSlide)                   0.88122      70929     284431    0.11850    0.19960 
Uniform(psi.uniform)                                   -     149335     206383    0.11850    0.41981 
Exchange(psi.wideExchange)                             -       1575     353485    0.11850    0.00444 
WilsonBalding(psi.wilsonBalding)                       -       2051     353189    0.11850    0.00577 

     Tuning: The value of the operator's tuning parameter, or '-' if the operator can't be optimized.
    #accept: The total number of times a proposal by this operator has been accepted.
    #reject: The total number of times a proposal by this operator has been rejected.
      Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).
  Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator).

Total calculation time: 232.693 seconds

Analysing the BEAST output

Run the program called Tracer to analyze the output of BEAST. When the main window has opened, choose Import Trace File... from the File menu and select the file that BEAST has created called h5n1.log. You should now see a window like in Figure 2.

A screenshot of Tracer
Figure Figure 2: A screenshot of Tracer.

Remember that MCMC is a stochastic algorithm so the actual numbers will not be exactly the same. On the left hand side is a list of the different quantities that BEAST has logged. There are traces for the posterior (this is the log of the product of the tree likelihood and the prior probabilities), and the continuous parameters. Selecting a trace on the left brings up analyses for this trace on the right hand side depending on tab that is selected. When first opened, the “posterior”” trace is selected and various statistics of this trace are shown under the Estimates tab. In the top right of the window is a table of calculated statistics for the selected trace.

Tracer will plot a (marginal posterior) distribution for the selected parameter and also give you statistics such as the mean and median. The 95% HPD lower or upper stands for highest posterior density interval and represents the most compact interval on the selected parameter that contains 95% of the posterior probability. It can be thought of as a Bayesian analog to a confidence interval.

Obtaining an estimate of the phylogenetic tree

Run the program TreeAnnotator, and then choose 10% as the burn-in percentage, while keeping “Maximum clade credibility tree” as the “Target tree type”. For “Node heights”, choose “Mean heights”. Then load the tree log file that BEAST 2 generated (it will end in “.trees” by default) as “Input Tree File”. For this tutorial, the tree log file is called h5n1_with_trait.trees. Finally, for “Output File”, type h5n1_with_trait.tree.

Figure 3: A screenshot of TreeAnnotator showing how to create a summary tree from a posterior tree set.

This setup will take the set of trees in the tree log file, and summarize it with a single maximum clade credibility (MCC) tree. The MCC tree is the tree that has the largest clade probability product across all nodes. Divergence times will reflect the mean ages of each node, and those times will be annotated with their 95% HPD intervals. TreeAnnotator will also display the posterior clade credibility of each node in the MCC tree.

More details on summarizing trees can be found in

Note that TreeAnnotator only parses a tree log file into an output text file, but it will not allow you to visualize summary trees. Visualization has to be done with other programs (see next section).

Distribution of root location

When you open the summary tree with locations h5n1_with_trait.tree in a text editor, and scroll to the most right and locate the end of the tree definition, you can see the set of meta data for the root. Looking for the last entries of location.set and location.set.prob, you might find something like this:

location.set = {Guangdong,HongKong,Hunan,Guangxi,Fujian}
location.set.prob = {0.18656302054414214,0.6129927817878956,0.03220433092726263,0.1121599111604664,0.0560799555802332}

This means that we have the following distribution for the root location:

Location Probability
Guangdong 0.18656302054414214
HongKong 0.6129927817878956
Hunan 0.03220433092726263
Guangxi 0.1121599111604664
Fujian 0.0560799555802332

This distribution shows that the 95% HPD consists of all locations except Hunan, with a strong indication that HongKong might be the root with over 58% probability. It is quite typical that a lot of locations are part of the 95% HPD in discrete phylogeography.

Viewing the Location Tree

We can visualise the tree in a program called FigTree. Run this program, and open the summary tree file h5n1_with_trait.tree by using the Open command in the File menu. The tree should appear. You can now try selecting some of the options in the control panel on the left. Try selecting Appearance to set the branch Colour by location. In addition, you can set the branch Width by location.prob according to the posterior support of estimated locations. Increasing the Line Weight can make the branch width more different regarding to its posterior support. Finally, tick Legend and select location in the drop list of Attribute. You should end up with something like Figure 4.

MCC tree
Figure 4: Figtree representation of the summary tree. Branch colours represent location and branch widths posterior support for the branch.

Alternatively, you can load the posterior tree set h5n1_with_trait.trees (note this is NOT the summary tree, but the complete set) into DensiTree and set it up as follows.

  • Click Show to choose Root Canal tree to guide the eye.
  • Click Grid to choose Full grid option, type year 2005 in the Origin text field and tick Reverse to show the correct time scale. You also can reduce the Digits to 0 which will rounding years in the x-axis (i.g. 2005, instead of 2005.22).
  • Go to Line Color, you can colour branches by location. The final image look like Figure 5.
Figure 5: The posterior tree set visualised in DensiTree.

The number of estimated transitions

Sometime, we want to visualise how the location states are changed through the phylogeny. StateTransitionCounter can count the number of branches in a tree or a set of trees that have a certain state at the parent and another at the node.

So, install the Babel package and run the StateTransitionCounter through BEAST application launcher. The command line below will generate the output file stc.out containing all counts from the logged posterior trees h5n1_with_trait.trees, after removing 10% burn-in. Please make sure you install the latest version (Babel >= v0.3.2, BEASTLabs >= v1.9.6).

$BEAST2_PATH/bin/applauncher StateTransitionCounter -burnin 10 -in h5n1_with_trait.trees -tag location -out stc.out

Next, we need to use R to plot the histogram given the summary in stc.out. If you have a problem to generate it, you can download a prepared file stc.out. You also need to download a script PlotTransitions.R, which contains the functions to parse the file and plot the histograms. The script requires to install R package ggplot2 and tidyverse.

Run the following scripts in R:


stc <- parseTransCount(input="stc.out", pattern = "Histogram", target="Hunan")
# only => Hunan
p <- plotTransCount(stc$hist[grepl("=>Hunan", stc$hist[["Transition"]]),])
ggsave( paste0("transition-distribution-hunan.png"), p, width = 6, height = 4) 

The stc contains a statistical summary of the estimated transition counts related to the target location, here is Hunan, and a table of the actual counts. To plot a simple graph, we only pick up the transitions to Hunan in the next command, and then save the graph a PNG file. The counts are normalised into probabilities.

Figure 6: The probability distribution of estimated transitions into Hunan from other places.

The x-axis presents the number of estimated transitions in all migration events from one particular location to another which is separated by “=>”, and y-axis is the probability which is normalised from the total counts. As you can see, “Guangxi=>Hunan” (blue) has higher probability than other migration events. This type of visualisation will help you to quantify the uncertainty how the disease (H5N1) spread from/to the location(s) of interest, which is simulated from your model and given the data.

Because the posterior trees in this analysis are scaled to time, also known as “time tree”, we can convert this graph into daily transitions. More details and visualisations can be seen from Douglas et. al. 2020.

Post processing geography

Start the application spread, which can be used to analyze and visualize phylogeographic reconstructions resulting from Bayesian inference of spatio-temporal diffusion Bielejec et al., 2011.

Select the Open button in the panel under Load tree file, and select the summary tree file h5n1_with_trait.tree. Change the State attribute name to the name of the trait,
which is location in this analysis. Click the Setup button. A dialog pops up where you can edit altitude and longitude for the locations. Alternatively, you can load it from a tab-delimited file. A file locationCoordinates_H5N1.txt is prepared in Spread website.

Tip: to find latitude and longitude of locations, you can use Google maps, switch on photo’s and select a photo at the location of the map. Click the photo, then click Show in Panoramio and a new page opens that contains the locations where the photo was taken. An alternative is to use Google Earth, and point the mouse to the location. Google-Earth shows latitude and longitude of the mouse location at the bottom of the screen.

Now, open the Output tab in the panel on the left hand side. Here, you can choose where to save the KML file (default output.kml). Select the generate button to generate the KML file, and a world map appears with the tree superimposed onto the area where the rabies epidemic occurred.

If you have a problem to generate KML file, you can download a prepared output.kml.

The KML file can be read into Google Earth. Then, the spread of the epidemic can be animated through time. The coloured areas represent the 95% HPD regions of the locations of the internal nodes of the summary tree.

Programs used in this tutorial

The following software will be used in this tutorial:

  • LPhy Studio - this software will specify and visualise models as well as simulate data from models defined in LPhy scripts. It is available for download from LPhy releases.
  • LPhy BEAST - this software will construct an input file for BEAST. The installation guide and usage can be found from here.
  • BEAST - this package contains the BEAST program, BEAUti, DensiTree, TreeAnnotator and other utility programs. This tutorial is written for BEAST v2.6.7 or higher version, which has support for multiple partitions. It is available for download from
  • BEAST labs package - containing some generally useful stuff used by other packages.
  • BEAST feast package - this is a small BEAST 2 package which contains additions to the core functionality.
  • Tracer - this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and quantitatively summarises the distributions of continuous parameters and provides diagnostic information. At the time of writing, the current version is v1.7.2. It is available for download from
  • FigTree - this is an application for displaying and printing molecular phylogenies, in particular those obtained using BEAST. At the time of writing, the current version is v1.4.3. It is available for download from
  • BEAST classic package - Phylogeography is a part of the BEAST-CLASSIC package. BEAST-CLASSIC requires the BEASTlabs package. You can install them from BEAST 2 package manager.
  • Babel - A BEAST package containing tools for post-analysis. We will use StateTransitionCounter.
  • Spread - summarising the geographic spread in a KML file (available from
  • Google-earth - displaying the KML file (just Google for it, if you have not already have it installed).


The number of replicates, L is the number of characters of alignment, D. The alignment, D is read from the Nexus file with a file name of "data/H5N1.nex" and options. The options are ageDirection="forward" and ageRegex="_(\d+)$". The taxa is the list of taxa of alignment, D. The int, K is the number of states of alignment, Dtrait. The alignment, Dtrait extracts the trait from taxa, with a sep of "_" and an i of 2. The object, dim comes from the K*(K-1)/2.


The alignment, D is assumed to have evolved under a phylogenetic continuous time Markov process (Felsenstein; 1981) on phylogenetic time tree, ψ, with a molecular clock rate of 0.004, an instantaneous rate matrix and siteRates, r. An instantaneous rate matrix is the HKY model (Hasegawa et al; 1985) with transition bias parameter, κ and base frequency vector, π. The base frequency vector, π have a Dirichlet distribution prior with a concentration of [2.0, 2.0, 2.0, 2.0]. The transition bias parameter, κ has a log-normal prior with a mean in log space of 1.0 and a standard deviation in log space of 1.25. The double, ri is assumed to come from a DiscretizeGamma with shape, γ and a ncat of 4, for i in 0 to L - 1. The shape, γ has a log-normal prior with a mean in log space of 0.0 and a standard deviation in log space of 2.0. The time tree, ψ is assumed to come from a Kingman's coalescent tree prior (Kingman; 1982) with coalescent parameter, Θ and taxa. The coalescent parameter, Θ has a log-normal prior with a mean in log space of 0.0 and a standard deviation in log space of 1.0. The alignment, Dtrait is assumed to have evolved under a phylogenetic continuous time Markov process (Felsenstein; 1981) on phylogenetic time tree, ψ, with molecular clock rate, μtrait, instantaneous rate matrix, Qtrait, a length of 1 and a dataType of "standard". The instantaneous rate matrix, Qtrait is assumed to come from the generalTimeReversible with rates and freq, πtrait. The freq, πtrait have a Dirichlet distribution prior with a concentration. A concentration is assumed to come from the rep with an element of 3.0 and times, K. The number is assumed to come from the select with x, R_traiti and indicator, I0. The indicator, I is assumed to come from a Bernoulli with a p of 0.5, replicates, dim and minSuccesses. The minSuccesses is calculated by dim-2. The x, Rtrait have a Dirichlet distribution prior with a concentration. A concentration is assumed to come from the rep with an element of 1.0 and times, dim. The molecular clock rate, μtrait has a log-normal prior with a mean in log space of 0 and a standard deviation in log space of 1.25.


$$ \begin{split} P(\boldsymbol{\pi}, \kappa, \boldsymbol{r}, \gamma, \boldsymbol{\psi}, \Theta, \boldsymbol{\pi\textbf{}_{trait}}, \boldsymbol{I}, \boldsymbol{\textbf{R}_{trait}}, \mu\textrm{}_{trait} | \boldsymbol{D}, \boldsymbol{\textbf{D}_{trait}}) \propto &P(\boldsymbol{D} | \boldsymbol{\psi}, Q, \boldsymbol{r})P(\boldsymbol{\pi})P(\kappa)\\& \prod_{i=0}^{L - 1}P(\textrm{r}_i | \gamma)P(\gamma)P(\boldsymbol{\psi} | \Theta)\\& P(\Theta)P(\boldsymbol{\textbf{D}_{trait}} | \boldsymbol{\psi}, \mu\textrm{}_{trait}, \boldsymbol{\textbf{Q}_{trait}})\\& P(\boldsymbol{\pi\textbf{}_{trait}})P(\boldsymbol{I})P(\boldsymbol{\textbf{R}_{trait}})\\& P(\mu\textrm{}_{trait})\end{split} $$


  • Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6), 368-376.
  • Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22, 160–174 (1985)
  • Kingman JFC. The Coalescent. Stochastic Processes and their Applications 13, 235–248 (1982)
  • Lemey, P., Rambaut, A., Drummond, A. J. and Suchard, M. A. (2009). Bayesian phylogeography finds its roots. PLoS Comput Biol 5, e1000520.
  • Wallace, R., HoDac, H., Lathrop, R. and Fitch, W. (2007). A statistical phylogeography of influenza A H5N1. Proceedings of the National Academy of Sciences 104, 4473.
  • Bielejec F., Rambaut A., Suchard M.A & Lemey P. (2011). SPREAD: Spatial Phylogenetic Reconstruction of Evolutionary Dynamics. Bioinformatics, 27(20):2910-2912. doi:10.1093.
  • Douglas, J., Mendes, F. K., Bouckaert, R., Xie, D., Jimenez-Silva, C. L., Swanepoel, C., … & Drummond, A. J. (2020). Phylodynamics reveals the role of human travel and contact tracing in controlling COVID-19 in four island nations. doi: