Time-stamped data
by Fábio K. Mendes, Walter Xie and Alexei J. Drummond
This tutorial is modified from Taming the BEAST material. You can find the source tutorial, data and author information here.
Requirements
The programs used in this tutorial are listed below. In case you have not done it yet, please follow the installation instructions here before moving on.
Background
In this tutorial, we will study the phylodynamics of the Respiratory syncytial virus subgroup A (RSVA). RSVA infects the human lower respiratory tract, causing symptoms that are often indistinguishable from the common cold. By age 3, nearly all children will be infected, and a small percentage (<3%) will develop a more serious inflammation of the bronchioles requiring hospitalisation.
Goals
The goal of our phylodynamic analysis is to estimate:
- the rate of molecular evolution (of RSVA’s G gene);
- the date of the most recent common ancestor (MRCA) of all viral samples (i.e., the tree’s root), interpreted as the first infection event with respect to the samples at hand;
- the phylogenetic relationships among viral samples.
Data
Our data is comprised of 129 molecular sequences coding for RSVA’s G protein (Zlateva et al., 2004; Zlateva et al., 2005), a glycoprotein that allows the virus to attach itself to the host cells’ membranes, starting the infection. Importantly, our data is time stamped (also referred to as heterochronous or serial data) because it was collected at multiple time points, from 1956–2002.
Below we further detail how we will treat our molecular data in our analyses.
The NEXUS alignment
The data is in a file called RSV2.nex.
By clicking the name of the file, it will be opened on your web
browser, after which you can download the data by right-clicking on
the main window, “Save Page As”, and saving the file as RSV2.nex
in the desired folder.
This .nex file can also be found in the examples/nexus
folder,
where BEAST 2 was installed.
Multiple partitions
Our molecular alignment corresponds to a 629-bp coding sequence of RSVA’s G protein. We are going to partition our alignment into three compartments corresponding to the first, second and third codon positions.
Tip dates
In phylodynamic and macroevolutionary analyses, quite often our samples (viral sequences from infected individuals in the former case and from different species in the latter, for example) were collected only at the present moment. In such cases we assign our samples an age of 0.0 to represent the present – this is our default.
As mentioned above, however, our sequences have been sampled at
various dates going back to the 1950s.
We can see the date of each of our samples (sequences) in the NEXUS
file, after the last lower-case s
.
This date can be read as years since 1900.
Whenever tip dates are heterochronous and used as data, they can be read in natural forward time (here, we refer to them simply as “dates”) or backward time (in which case we call them “ages”).
Whether we treat the tip times as dates or ages internally in the code, however, depends on how the models we want to use are parameterized. The same goes for how we output the internal node times when we estimate them during phylogenetic inference. Regardless, the times are first converted into scalar values from 0.0 to some value > 0.0 (in days, years, millions of years, etc.).
LPhy’s behavior is to treat tip times always as ages.
Constructing the scripts in LPhy Studio
LPhy Studio implements a GUI for users to specify and visualize probabilistic graphical models, as well as for simulating data under those models. These tasks are executed according to an LPhy script the user types (or loads) on LPhy Studio’s interactive terminal. If you are new to LPhy, we recommend you to read this introduction first, before you continue on this tutorial.
Please note if you are working in LPhyStudio,
you do not need to add data
and model
keywords and curly brackets to define the code blocks.
We are supposed to add the lines without the data { }
and model { }
to the command line console
at the bottom of the window, where the data
and model
tabs in the GUI are used to specify
which block we are working on.
Below, we will build an LPhy script in two parts, the data
and
the model
blocks.
Data block
The data { ... }
block is necessary when we use LPhy Studio to prepare
instruction input files for inference software (e.g., BEAST 2,
RevBayes, etc.).
The purpose of this block is to tell LPhy which nodes of our graphical
model are to be treated as known constants (and not to be
sampled by the inference software) because they are observed
data.
Elsewhere, this procedure has been dubbed “clamping” (Höhna et al.,
2016).
In this block, we will either type strings representing values to be directly assigned to scalar variables, or use LPhy’s syntax to extract such values from LPhy objects, which might be read from file paths given by the user.
(Note that keyword data
cannot be used to name variables because it
is reserved for defining scripting blocks as outlined above.)
In order to start specifying the data { ... }
block, make sure you
type into the “data” tab of the command prompt, by clicking “data” at
the bottom of LPhy Studio’s window.
options = {ageDirection="forward", ageRegex="s(\d+)$"};
D = readNexus(file="data/RSV2.nex", options=options);
codon = D.charset(["3-629\3", "1-629\3", "2-629\3"]);
L = codon.nchar();
n = length(codon);
taxa = D.taxa();
}
Our data here consist of molecular alignments and the sample dates.
We start by parsing the latter with regular expression "s(\d+)"
when
defining some options in options={}
.
This tells LPhy how to extract the sample times from the NEXUS file,
and then we specify that we want to read those times as dates (i.e.,
forward in time).
After all, our sample times are given to us in natural time.
In order to check if you have set the sample times correctly, click
the graphical component taxa
and check the column “Age”.
The most recent sequences (i.e., from 2002 and that end with s102
)
should have an age of 0.0, while all other tips should be > 0.0.
This is because, as mentioned above, LPhy always treats sample
times as ages.
Then we must parse the molecular alignments, which we do when
initializing variable D
.
Note that our open reading frame (ORF) starts in position 3, which
must be reflected by the arguments we pass on to the charset()
function: first ("3-629\3"
), second ("1-629\3"
) and third
("2-629\3"
) codon positions.
Finally, we use the last three lines to set the number of loci L
,
the number of taxa n
, and the taxa themselves, taxa
.
Note that n=length(codon);
is equivalent to n=3;
because
length()
here is extracting the number of partitions in codon
.
If you want to double-check everything you have typed, click the “Model” tab in the upper right panel.
Model block
The model { ... }
block is the main protagonist of our scripts.
This is where you will specify the many nodes and sampling
distributions that characterize a probabilistic model.
(Note that keyword model
cannot be used to name variables because it
is reserved for defining scripting blocks as outlined above.)
In order to start specifying the model { ... }
block, make sure you
type into the “model” tab of the command prompt, by clicking “model”
at the bottom of LPhy Studio’s window.
We will specify sampling distributions for the following parameters, in this order:
π
, the equilibrium nucleotide frequencies (with 4 dimensions, one for each nucleotide);κ
, twice the transition:transversion bias assuming equal nucleotide frequencies (with 3 dimensions, one for each partition);r
, the relative substitution rates (with 3 dimensions, one for each partition);μ
, the global (mean) clock rate;Θ
, the effective population size (with as many dimensions as there are branches in the tree);ψ
, the time-scaled (i.e., in absolute time) phylogenetic tree.
π ~ Dirichlet(replicates=n, conc=[2.0, 2.0, 2.0, 2.0]);
κ ~ LogNormal(sdlog=0.5, meanlog=1.0, replicates=n);
r ~ WeightedDirichlet(conc=rep(element=1.0, times=n), weights=L);
μ ~ LogNormal(meanlog=-5.0, sdlog=1.25);
Θ ~ LogNormal(meanlog=3.0, sdlog=2.0);
ψ ~ Coalescent(taxa=taxa, theta=Θ);
codon ~ PhyloCTMC(L=L, Q=hky(kappa=κ, freq=π, meanRate=r), mu=μ, tree=ψ);
}
Note that parameters π
, κ
and r
are 3-dimensional vectors,
because they represent the nucleotide equilibrium frequencies,
(ts:tv)/2, and relative substitution rates of each of the three
partitions, respectively.
LPhy conveniently uses vectorization, so PhyloCTMC
recognizes that
three parameters above are vectors, and automatically takes care of
building three separate HKY models, one per partition!
One final remark is that we use rep(element=1.0, times=n)
(where n
evaluates to 3) to create three concentration vectors (one vector per
partition) for the WeightedDirichlet
sampling distribution.
Again, if you want to double-check everything you have typed, click the “Model” tab in the upper right panel.
The whole script
Let us look at the whole thing:
data {options = {ageDirection="forward", ageRegex="s(\d+)$"};
D = readNexus(file="data/RSV2.nex", options=options);
codon = D.charset(["3-629\3", "1-629\3", "2-629\3"]);
L = codon.nchar();
n = length(codon);
taxa = D.taxa();
}
model {
π ~ Dirichlet(replicates=n, conc=[2.0, 2.0, 2.0, 2.0]);
κ ~ LogNormal(sdlog=0.5, meanlog=1.0, replicates=n);
r ~ WeightedDirichlet(conc=rep(element=1.0, times=n), weights=L);
μ ~ LogNormal(meanlog=-5.0, sdlog=1.25);
Θ ~ LogNormal(meanlog=3.0, sdlog=2.0);
ψ ~ Coalescent(taxa=taxa, theta=Θ);
codon ~ PhyloCTMC(L=L, Q=hky(kappa=κ, freq=π, meanRate=r), mu=μ, tree=ψ);
}
If the script contains no errors, you will be able to see the specified graphical model. Simulated data, such as the alignment and the tree, can be seen in the right-hand panel.

As alternatives to typing or copying-and-pasting the script above,
you can find it in HTML format here, or you can load
the script with “File” > “Examples” and select RSV2.lphy
.
Phylogenetic inference with BEAST 2
Once we have read in our data and defined our model, we can move to the task of carrying out statistical inference. We shall try to estimate our parameters of interest, and for the purposes of this tutorial we will do that using the BEAST 2 program.
Producing a BEAST 2 .xml using LPhyBEAST
BEAST 2 reads instructions about the data and the model from a user-provided .xml, which can be produced in a variety of ways. Our goal with LPhy is to make the preparation of the .xml as painless, clear and precise as possible. In order to achieve that, we will use a companion application, LPhyBEAST, as a bridge between the LPhy script we typed above and the .xml.
LPhyBEAST is distributed as a BEAST 2 package,
we can use an application called Package Manager
, which is distributed with BEAST 2 together,
to install it.
To start LPhyBEAST, we have to use the script lphybeast
.
Some technical guides can help you to start.
In our RSV2.lphy script, the alignment file is assumed to locate under the folder tutorials/data/
.
So we need to go to the tutorials
folder, which is normally where the LPhy is installed,
run LPhyBEAST as below and check the end of message to find where is the generated XML.
Let us run LPhyBEAST now:
# BEAST_DIR="/Applications/BEAST2"
$BEAST_DIR/bin/lphybeast RSV2.lphy
Running BEAST 2
After LPhyBEAST generates a BEAST 2 .xml file (e.g., RSV2.xml), we can point BEAST 2 to it, which will then start the inferential MCMC analysis.
BEAST 2 will write its outputs to disk into text files specified in the .xml file (specific paths can be passed in, but in their absence BEAST 2 will write the outputs in the same directory from where it was called).
BEAST 2 will also output the progress of the analysis and some summaries to the screen, like this:
BEAST v2.6.7, 2002-2020
Bayesian Evolutionary Analysis Sampling Trees
Designed and developed by
Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard
Centre for Computational Evolution
University of Auckland
r.bouckaert@auckland.ac.nz
alexei@cs.auckland.ac.nz
Institute of Evolutionary Biology
University of Edinburgh
a.rambaut@ed.ac.uk
David Geffen School of Medicine
University of California, Los Angeles
msuchard@ucla.edu
Downloads, Help & Resources:
http://beast2.org/
Source code distributed under the GNU Lesser General Public License:
http://github.com/CompEvol/beast2
BEAST developers:
Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled,
Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li,
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel,
Oliver Pybus, Tim Vaughan, Chieh-Hsi Wu, Walter Xie
Thanks to:
Roald Forsberg, Beth Shapiro and Korbinian Strimmer
Random number seed: 1630288622480
...
...
950000 -6090.8619 -5484.6314 -606.2305 0.4475 0.2692 0.1000 0.1831 0.3183 0.4213 0.0826 0.1776 0.3530 0.3155 0.0702 0.2611 8.2537 8.5243 10.1785 0.7057 0.9198 1.3729 0.0020 38.8407 2m34s/Msamples
1000000 -6084.2226 -5480.6365 -603.5861 0.5015 0.2429 0.0847 0.1706 0.2456 0.4702 0.1013 0.1828 0.3366 0.3591 0.0873 0.2168 6.4508 10.1371 9.5392 0.6864 1.0965 1.2155 0.0022 35.9676 2m34s/Msamples
Operator Tuning #accept #reject Pr(m) Pr(acc|m)
ScaleOperator(Theta.scale) 0.61008 1298 3181 0.00450 0.28980
ScaleOperator(kappa.scale) 0.47919 2976 6630 0.00970 0.30981
ScaleOperator(mu.scale) 0.77528 1071 3560 0.00450 0.23127
UpDownOperator(muUppsiDownOperator) 0.94041 5481 130154 0.13497 0.04041 Try setting scaleFactor to about 0.97
DeltaExchangeOperator(pi_0.deltaExchange) 0.12380 1763 7805 0.00970 0.18426
DeltaExchangeOperator(pi_1.deltaExchange) 0.14501 1362 8290 0.00970 0.14111
DeltaExchangeOperator(pi_2.deltaExchange) 0.10881 1757 8020 0.00970 0.17971
Exchange(psi.narrowExchange) - 33869 100180 0.13424 0.25266
ScaleOperator(psi.rootAgeScale) 0.76570 567 3843 0.00450 0.12857
ScaleOperator(psi.scale) 0.90191 3141 131241 0.13424 0.02337 Try setting scaleFactor to about 0.95
SubtreeSlide(psi.subtreeSlide) 4.50313 18248 116199 0.13424 0.13573
Uniform(psi.uniform) - 72184 61788 0.13424 0.53880
Exchange(psi.wideExchange) - 318 133559 0.13424 0.00238
WilsonBalding(psi.wilsonBalding) - 764 133396 0.13424 0.00569
DeltaExchangeOperator(r.deltaExchange) 0.30317 1619 5737 0.00730 0.22009
Tuning: The value of the operator's tuning parameter, or '-' if the operator can't be optimized.
#accept: The total number of times a proposal by this operator has been accepted.
#reject: The total number of times a proposal by this operator has been rejected.
Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).
Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator).
Total calculation time: 154.759 seconds
End likelihood: -6084.222697270579
Analysing BEAST 2’s output
We start by opening the Tracer program, and dropping the RSV2.log
file onto Tracer’s window.
Alternatively you can load the trace (.log) file by clicking “File” > “Import
Trace File…”.
Tracer will display several summary statistics computed from the trace file in the left-hand and top-right panels. Other summaries will be displayed graphically. It is all pretty self-evident, just click around to familiarize yourself with the program.
The first thing you will notice from this first run is that the effective sample sizes (ESSs) for many of the logged quantities are small (ESSs less than 100 will be highlighted in red by Tracer). This is not good. A low ESS means that the trace contains a lot of correlated samples and thus may not represent the posterior distribution well. In the bottom-right panel you will see a histogram that looks a bit rough. This is expected when ESSs are low.

If you click the “Trace” tab in the upper-right corner, you will see the raw trace, that is, the sampled values against the step in the MCMC chain (Fig. 3).
The trace shows the value of a given statistic for each sampled step of the MCMC chain. It will help you see if the chain is mixing well and to what extent the samples are correlated. Here you can see how the samples are correlated.

In LPhyBEAST, the default MCMC chain length is 1,000,000. After we remove the burn-in (set to 10%), there will be 1,800 samples in the trace (we ran the MCMC for steps sampling every 500). Note how adjacent samples often tend to have similar values – not great.
The ESS for the mean molecular evolution rate (μ
) is about 25 so we
are only getting 1 independent sample every 72 samples
(to a total of ~1800/25 “effective” samples).
With such a short run, it may also be the case that the default
burn-in of 10% of the chain length is inadequate.
Not excluding enough of the start of the chain as burn-in will render
estimates of ESS unreliable.
Let us try to fix this low ESS issue by running the MCMC chain for longer.
We will set the chain length to 20,000,000, but still log the same
number of samples (2,000).
You can now run LPhyBEAST again, this time with the -l
argument
to create a new .xml file:
# MY_SCRIPT_PATH=~/WorkSpace/linguaPhylo/tutorials/
$BEAST_DIR/bin/lphybeast -wd $MY_SCRIPT_PATH -l 20000000 -o RSV2long.xml RSV2.lphy
Now run BEAST 2 again. It may take about half an hour to complete using modern computers.
If you do not want to wait, you can download the
RSV2long.log and RSV2long.trees
completed by running the longer MCMC chain.
Load the new log file into Tracer after it is finished (you can leave the old one loaded for comparison). Click on the “Trace” tab and look at the raw trace plot.

With this much longer MCMC chain, we still have 1,800 samples after removing the first 10% as burn-in, but our ESSs are all > 200. A large ESS does not mean there is no auto-correlation between samples overall, but instead that we have at least 200 effectively independent samples. More effective samples means we are approximating the posterior better than before, when ESSs were less than 100.
(You should keep an eye for weird trends in the trace plot suggesting that the MCMC chain has not yet converged, or is struggling to converge, like different value “plateaus” attracting the chain for a large number of steps.)
Now that we are satisfied with the length of the MCMC chain and its
mixing, we can move on to one of the parameters of interest: the
global substitution rate, μ
.
This is the average substitution rate across all sites in the
alignment.
First, select “mu” in the left-hand panel, and click the “Marginal Density” tab in the upper-right corner to see the posterior density plot for this parameter. You should see a plot similar to this:

As we can see in Figure 4, the marginal
posterior probability density of μ
is roughly bell-shaped.
If the curve does not look very smooth, it is because there is some
sampling noise: we could smooth it out more if we ran the MCMC chain
for longer or took more samples.
Regardless, our mean and highest posterior density interval (HPD; also
known as the credible interval) estimates should be good given the
ESSs.
Note that you can overlay the density plots of multiple traces in order to compare them (it is up to you to determine whether they are comparable on the the same axis or not). For example: select the relative substitution rates for all three codon positions (partitions) in the left-hand panel (labelled “r_0”, “r_1” and “r_2”).
You will now see the posterior probability densities for the relative substitution rate at all three codon positions overlaid:

Summarising the trees
Because of how peculiar and discrete tree space is, it is a bit harder
to summarize and visualize the posterior distribution over
phylogenetic trees, as compared to the mean molecular rate μ
, for
example.
We will use a special tool for that, TreeAnnotator.
Run the program TreeAnnotator, and then choose 10% as the burn-in
percentage, while keeping “Maximum clade credibility tree” as the
“Target tree type”.
For “Node heights”, choose “Mean heights”.
Then load the tree log file that BEAST 2 generated (it will end in
“.trees” by default) as “Input Tree File”.
For this tutorial, the tree log file is called RSV2long.trees
.
Finally, for “Output File”, type RSV2long.tree
.

This setup will take the set of trees in the tree log file, and summarize it with a single maximum clade credibility (MCC) tree. The MCC tree is the tree that has the largest clade probability product across all nodes. Divergence times will reflect the mean ages of each node, and those times will be annotated with their 95% HPD intervals. TreeAnnotator will also display the posterior clade credibility of each node in the MCC tree.
More details on summarizing trees can be found in beast2.org/summarizing-posterior-trees/.
Note that TreeAnnotator only parses a tree log file into an output text file, but it will not allow you to visualize summary trees. Visualization has to be done with other programs (see next section).
Visualizing the trees
Summary trees can be viewed using FigTree and DensiTree, the latter being distributed with BEAST 2. In FigTree, just load TreeAnnotator’s output file as the input (if you were to load the entire tree log file, FigTree would show you each tree in the posterior individually).
After loading the MCC tree in FigTree, click on the “Trees” tab on the left-hand side, then select “Order nodes” while keeping the default ordering (“increasing”). In addition, check “Node Bars”, choosing “height_95%_HPD” for “Display”. This last step shows the estimated 95% HPD intervals for all node heights. You should end up with a tree like the one in Figure 8.

In DensiTree, load the tree log file as the input. Here, the tree in blue is the MCC tree, where the green “fuzzy” trees are all the other trees in the posterior set.

Questions
In what year did the common ancestor of all RSVA viruses sampled live? What is the 95% HPD?
Programs used in this tutorial
The following software will be used in this tutorial:
- LPhy Studio - this software will specify and visualise models as well as simulate data from models defined in LPhy scripts. It is available for download from LPhy releases.
- LPhy BEAST - this software will construct an input file for BEAST. The installation guide and usage can be found from here.
- BEAST - this package contains the BEAST program, BEAUti, DensiTree, TreeAnnotator and other utility programs. This tutorial is written for BEAST v2.6.7 or higher version, which has support for multiple partitions. It is available for download from http://www.beast2.org.
- BEAST labs package - containing some generally useful stuff used by other packages.
- BEAST feast package - this is a small BEAST 2 package which contains additions to the core functionality.
- Tracer - this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and quantitatively summarises the distributions of continuous parameters and provides diagnostic information. At the time of writing, the current version is v1.7.2. It is available for download from http://beast.community/tracer.
- FigTree - this is an application for displaying and printing molecular phylogenies, in particular those obtained using BEAST. At the time of writing, the current version is v1.4.3. It is available for download from http://beast.community/figtree.
Useful Links
- LinguaPhylo: https://linguaphylo.github.io
- BEAST 2 website and documentation: http://www.beast2.org/
- Join the BEAST user discussion: http://groups.google.com/group/beast-users
References
- Zlateva, K. T., Lemey, P., Vandamme, A.-M., & Van Ranst, M. (2004). Molecular evolution and circulation patterns of human respiratory syncytial virus subgroup a: positively selected sites in the attachment g glycoprotein. J. Virol., 78(9), 4675–4683.
- Zlateva, K. T., Lemey, P., Moës, E., Vandamme, A.-M., & Van Ranst, M. (2005). Genetic variability and molecular evolution of the human respiratory syncytial virus subgroup B attachment G protein. J. Virol., 79(14), 9157–9167. https://doi.org/10.1128/JVI.79.14.9157-9167.2005
- Drummond, A. J., & Bouckaert, R. R. (2014). Bayesian evolutionary analysis with BEAST 2. Cambridge University Press.
- Höhna, S., Landis, M. J., Heath, T. A., Boussau, B., Lartillot, N., Moore, B. R., Huelsenbeck, J. P., Ronquist, F. (2016). RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol., 65(4), 726-736.