This tutorial is modified from Taming the BEAST material. You can find the source tutorial, data and author information here.

Requirements

The programs used in this tutorial are listed below. In case you have not done it yet, please follow the installation instructions here before moving on.

Background

In this tutorial, we will study the phylodynamics of the Respiratory syncytial virus subgroup A (RSVA). RSVA infects the human lower respiratory tract, causing symptoms that are often indistinguishable from the common cold. By age 3, nearly all children will be infected, and a small percentage (<3%) will develop a more serious inflammation of the bronchioles requiring hospitalisation.

Goals

The goal of our phylodynamic analysis is to estimate:

  • the rate of molecular evolution (of RSVA’s G gene);
  • the date of the most recent common ancestor (MRCA) of all viral samples (i.e., the tree’s root), interpreted as the first infection event with respect to the samples at hand;
  • the phylogenetic relationships among viral samples.

Data

Our data is comprised of 129 molecular sequences coding for RSVA’s G protein (Zlateva et al., 2004; Zlateva et al., 2005), a glycoprotein that allows the virus to attach itself to the host cells’ membranes, starting the infection. Importantly, our data is time stamped (also referred to as heterochronous or serial data) because it was collected at multiple time points, from 1956–2002.

Below we further detail how we will treat our molecular data in our analyses.

The NEXUS alignment

The data is in a file called RSV2.nex. By clicking the name of the file, it will be opened on your web browser, after which you can download the data by right-clicking on the main window, “Save Page As”, and saving the file as RSV2.nex in the desired folder.

This .nex file can also be found in the examples/nexus folder, where BEAST 2 was installed.

Multiple partitions

Our molecular alignment corresponds to a 629-bp coding sequence of RSVA’s G protein. We are going to partition our alignment into three compartments corresponding to the first, second and third codon positions.

Tip dates

In phylodynamic and macroevolutionary analyses, quite often our samples (viral sequences from infected individuals in the former case and from different species in the latter, for example) were collected only at the present moment. In such cases we assign our samples an age of 0.0 to represent the present – this is our default.

As mentioned above, however, our sequences have been sampled at various dates going back to the 1950s. We can see the date of each of our samples (sequences) in the NEXUS file, after the last lower-case s. This date can be read as years since 1900.

Whenever tip dates are heterochronous and used as data, they can be read in natural forward time (here, we refer to them simply as “dates”) or backward time (in which case we call them “ages”).

Whether we treat the tip times as dates or ages internally in the code, however, depends on how the models we want to use are parameterized. The same goes for how we output the internal node times when we estimate them during phylogenetic inference. Regardless, the times are first converted into scalar values from 0.0 to some value > 0.0 (in days, years, millions of years, etc.).

LPhy’s behavior is to treat tip times always as ages.

Constructing the scripts in LPhy Studio

LPhy Studio implements a GUI for users to specify and visualize probabilistic graphical models, as well as for simulating data under those models. These tasks are executed according to an LPhy script the user types (or loads) on LPhy Studio’s interactive terminal. If you are new to LPhy, we recommend you to read this introduction first, before you continue on this tutorial.

Please note if you are working in LPhyStudio, you do not need to add data and model keywords and curly brackets to define the code blocks. We are supposed to add the lines without the data { } and model { } to the command line console at the bottom of the window, where the data and model tabs in the GUI are used to specify which block we are working on.

Below, we will build an LPhy script in two parts, the data and the model blocks.

Data block

The data { ... } block is necessary when we use LPhy Studio to prepare instruction input files for inference software (e.g., BEAST 2, RevBayes, etc.). The purpose of this block is to tell LPhy which nodes of our graphical model are to be treated as known constants (and not to be sampled by the inference software) because they are observed data. Elsewhere, this procedure has been dubbed “clamping” (Höhna et al., 2016).

In this block, we will either type strings representing values to be directly assigned to scalar variables, or use LPhy’s syntax to extract such values from LPhy objects, which might be read from file paths given by the user.

(Note that keyword data cannot be used to name variables because it is reserved for defining scripting blocks as outlined above.)

In order to start specifying the data { ... } block, make sure you type into the “data” tab of the command prompt, by clicking “data” at the bottom of LPhy Studio’s window.

data {
  options = {ageDirection="forward", ageRegex="s(\d+)$"};
  D = readNexus(file="data/RSV2.nex", options=options);
  codon = D.charset(["3-629\3", "1-629\3", "2-629\3"]);
  L = codon.nchar();
  n = length(codon);
  taxa = D.taxa();
}

Our data here consist of molecular alignments and the sample dates.

We start by parsing the latter with regular expression "s(\d+)" when defining some options in options={}. This tells LPhy how to extract the sample times from the NEXUS file, and then we specify that we want to read those times as dates (i.e., forward in time). After all, our sample times are given to us in natural time.

In order to check if you have set the sample times correctly, click the graphical component taxa and check the column “Age”. The most recent sequences (i.e., from 2002 and that end with s102) should have an age of 0.0, while all other tips should be > 0.0. This is because, as mentioned above, LPhy always treats sample times as ages.

Then we must parse the molecular alignments, which we do when initializing variable D. Note that our open reading frame (ORF) starts in position 3, which must be reflected by the arguments we pass on to the charset() function: first ("3-629\3"), second ("1-629\3") and third ("2-629\3") codon positions. Finally, we use the last three lines to set the number of loci L, the number of taxa n, and the taxa themselves, taxa. Note that n=length(codon); is equivalent to n=3; because length() here is extracting the number of partitions in codon.

If you want to double-check everything you have typed, click the “Model” tab in the upper right panel.

Model block

The model { ... } block is the main protagonist of our scripts. This is where you will specify the many nodes and sampling distributions that characterize a probabilistic model.

(Note that keyword model cannot be used to name variables because it is reserved for defining scripting blocks as outlined above.)

In order to start specifying the model { ... } block, make sure you type into the “model” tab of the command prompt, by clicking “model” at the bottom of LPhy Studio’s window.

We will specify sampling distributions for the following parameters, in this order:

  1. π, the equilibrium nucleotide frequencies (with 4 dimensions, one for each nucleotide);
  2. κ, twice the transition:transversion bias assuming equal nucleotide frequencies (with 3 dimensions, one for each partition);
  3. r, the relative substitution rates (with 3 dimensions, one for each partition);
  4. μ, the global (mean) clock rate;
  5. Θ, the effective population size (with as many dimensions as there are branches in the tree);
  6. ψ, the time-scaled (i.e., in absolute time) phylogenetic tree.
model {
  π ~ Dirichlet(replicates=n, conc=[2.0, 2.0, 2.0, 2.0]);
  κ ~ LogNormal(sdlog=0.5, meanlog=1.0, replicates=n);
  r ~ WeightedDirichlet(conc=rep(element=1.0, times=n), weights=L);
  μ ~ LogNormal(meanlog=-5.0, sdlog=1.25);
  Θ ~ LogNormal(meanlog=3.0, sdlog=2.0);
  ψ ~ Coalescent(taxa=taxa, theta=Θ);
  codon ~ PhyloCTMC(L=L, Q=hky(kappa=κ, freq=π, meanRate=r), mu=μ, tree=ψ);
}

Note that parameters π, κ and r are 3-dimensional vectors, because they represent the nucleotide equilibrium frequencies, (ts:tv)/2, and relative substitution rates of each of the three partitions, respectively. LPhy conveniently uses vectorization, so PhyloCTMC recognizes that three parameters above are vectors, and automatically takes care of building three separate HKY models, one per partition!

One final remark is that we use rep(element=1.0, times=n) (where n evaluates to 3) to create three concentration vectors (one vector per partition) for the WeightedDirichlet sampling distribution.

Again, if you want to double-check everything you have typed, click the “Model” tab in the upper right panel.

The whole script

Let us look at the whole thing:

data {
  options = {ageDirection="forward", ageRegex="s(\d+)$"};
  D = readNexus(file="data/RSV2.nex", options=options);
  codon = D.charset(["3-629\3", "1-629\3", "2-629\3"]);
  L = codon.nchar();
  n = length(codon);
  taxa = D.taxa();
}
model {
  π ~ Dirichlet(replicates=n, conc=[2.0, 2.0, 2.0, 2.0]);
  κ ~ LogNormal(sdlog=0.5, meanlog=1.0, replicates=n);
  r ~ WeightedDirichlet(conc=rep(element=1.0, times=n), weights=L);
  μ ~ LogNormal(meanlog=-5.0, sdlog=1.25);
  Θ ~ LogNormal(meanlog=3.0, sdlog=2.0);
  ψ ~ Coalescent(taxa=taxa, theta=Θ);
  codon ~ PhyloCTMC(L=L, Q=hky(kappa=κ, freq=π, meanRate=r), mu=μ, tree=ψ);
}

If the script contains no errors, you will be able to see the specified graphical model. Simulated data, such as the alignment and the tree, can be seen in the right-hand panel.

LPhyStudio
Figure 1: A screenshot of LPhy Studio.

As alternatives to typing or copying-and-pasting the script above, you can find it in HTML format here, or you can load the script with “File” > “Examples” and select RSV2.lphy.

Phylogenetic inference with BEAST 2

Once we have read in our data and defined our model, we can move to the task of carrying out statistical inference. We shall try to estimate our parameters of interest, and for the purposes of this tutorial we will do that using the BEAST 2 program.

Producing a BEAST 2 .xml using LPhyBEAST

BEAST 2 reads instructions about the data and the model from a user-provided .xml, which can be produced in a variety of ways. Our goal with LPhy is to make the preparation of the .xml as painless, clear and precise as possible. In order to achieve that, we will use a companion application, LPhyBEAST, as a bridge between the LPhy script we typed above and the .xml.

LPhyBEAST is distributed as a BEAST 2 package, we can use an application called Package Manager, which is distributed with BEAST 2 together, to install it. To start LPhyBEAST, we have to use BEAST 2 applauncher, which is also distributed with BEAST 2. Here is some details how to run it from the command line.

Some technical guides can help you to start.

In our RSV2.lphy script, the alignment file is assumed to locate under the folder tutorials/data/. So we need to go to its parent folder, which is normally where the LPhy is installed, run LPhyBEAST as below and check the end of message to find where is the generated XML.

Let us run LPhyBEAST now:

# BEAST_DIR = "/Applications/BEAST2"
$BEAST_DIR/bin/applauncher LPhyBEAST RSV2.lphy 

Running BEAST 2

After LPhyBEAST generates a BEAST 2 .xml file (e.g., RSV2.xml), we can point BEAST 2 to it, which will then start the inferential MCMC analysis.

BEAST 2 will write its outputs to disk into text files specified in the .xml file (specific paths can be passed in, but in their absence BEAST 2 will write the outputs in the same directory from where it was called).

BEAST 2 will also output the progress of the analysis and some summaries to the screen, like this:

                         BEAST v2.6.5, 2002-2020
             Bayesian Evolutionary Analysis Sampling Trees
                       Designed and developed by
 Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard
                                    
                   Centre for Computational Evolution
                         University of Auckland
                       r.bouckaert@auckland.ac.nz
                        alexei@cs.auckland.ac.nz
                                    
                   Institute of Evolutionary Biology
                        University of Edinburgh
                           a.rambaut@ed.ac.uk
                                    
                    David Geffen School of Medicine
                 University of California, Los Angeles
                           msuchard@ucla.edu
                                    
                      Downloads, Help & Resources:
                           http://beast2.org/
                                    
  Source code distributed under the GNU Lesser General Public License:
                   http://github.com/CompEvol/beast2
                                    
                           BEAST developers:
   Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled, 
 Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li, 
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel, 
          Oliver Pybus, Tim Vaughan, Chieh-Hsi Wu, Walter Xie
                                    
                               Thanks to:
          Roald Forsberg, Beth Shapiro and Korbinian Strimmer

Random number seed: 1622764828047
    ...

    ...
         950000     -6079.9655     -5478.8367      -601.1287         0.4394         0.2880         0.0799         0.1925         0.2849         0.4208         0.1046         0.1896         0.2816         0.3837         0.0922         0.2423        11.1661         3.2157         5.1059         0.7299         0.9736         1.2950         0.0023        39.0070 2m17s/Msamples
        1000000     -6067.4526     -5479.6833      -587.7693         0.5137         0.2380         0.0903         0.1579         0.3206         0.4005         0.0970         0.1816         0.3359         0.3949         0.0733         0.1957        12.2206         1.4502         3.5749         0.6667         0.9711         1.3604         0.0023        41.3278 2m17s/Msamples

Operator                                      Tuning    #accept    #reject      Pr(m)  Pr(acc|m)
ScaleOperator(Theta.scale)                   0.63281       1426       3100    0.00450    0.31507 
ScaleOperator(kappa.scale)                   0.34579       3555       6070    0.00970    0.36935 
ScaleOperator(mu.scale)                      0.73189        829       3579    0.00450    0.18807 
UpDownOperator(muUppsiDownOperator)          0.96114       8363     127174    0.13497    0.06170 Try setting scaleFactor to about 0.98
DeltaExchangeOperator(pi_0.deltaExchange)    0.11883       1848       7697    0.00970    0.19361 
DeltaExchangeOperator(pi_1.deltaExchange)    0.12557       1648       8136    0.00970    0.16844 
DeltaExchangeOperator(pi_2.deltaExchange)    0.10541       1754       7864    0.00970    0.18237 
Exchange(psi.narrowExchange)                       -      34025     100095    0.13424    0.25369 
ScaleOperator(psi.rootAgeScale)              0.77138        705       3803    0.00450    0.15639 
ScaleOperator(psi.scale)                     0.92575       4055     129740    0.13424    0.03031 Try setting scaleFactor to about 0.962
SubtreeSlide(psi.subtreeSlide)               5.55829      14955     119345    0.13424    0.11136 
Uniform(psi.uniform)                               -      72300      62379    0.13424    0.53683 
Exchange(psi.wideExchange)                         -        351     134365    0.13424    0.00261 
WilsonBalding(psi.wilsonBalding)                   -        769     132705    0.13424    0.00576 
DeltaExchangeOperator(r.deltaExchange)       0.29898       1708       5658    0.00730    0.23188 

     Tuning: The value of the operator's tuning parameter, or '-' if the operator can't be optimized.
    #accept: The total number of times a proposal by this operator has been accepted.
    #reject: The total number of times a proposal by this operator has been rejected.
      Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).
  Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator).


Total calculation time: 140.526 seconds
End likelihood: -6067.4526573438525

Analysing BEAST 2’s output

We start by opening the Tracer program, and dropping the RSV2.log file onto Tracer’s window. Alternatively you can load the trace (.log) file by clicking “File” > “Import Trace File…”.

Tracer will display several summary statistics computed from the trace file in the left-hand and top-right panels. Other summaries will be displayed graphically. It is all pretty self-evident, just click around to familiarize yourself with the program.

The first thing you will notice from this first run is that the effective sample sizes (ESSs) for many of the logged quantities are small (ESSs less than 100 will be highlighted in red by Tracer). This is not good. A low ESS means that the trace contains a lot of correlated samples and thus may not represent the posterior distribution well. In the bottom-right panel you will see a histogram that looks a bit rough. This is expected when ESSs are low.

The posterior histogram of mu
Figure 2: A screenshot of Tracer showing the sampling histogram of `μ` from a short run.

If you click the “Trace” tab in the upper-right corner, you will see the raw trace, that is, the sampled values against the step in the MCMC chain (Fig. 3).

The trace shows the value of a given statistic for each sampled step of the MCMC chain. It will help you see if the chain is mixing well and to what extent the samples are correlated. Here you can see how the samples are correlated.

The trace of short run
Figure 3: A screenshot of Tracer showing the trace of a short run.

In LPhyBEAST, the default MCMC chain length is 1,000,000. After we remove the burn-in (set to 10%), there will be 1,800 samples in the trace (we ran the MCMC for steps sampling every 500). Note how adjacent samples often tend to have similar values – not great.

The ESS for the mean molecular evolution rate (μ) is about 25 so we are only getting 1 independent sample every 72 samples (to a total of ~1800/25 “effective” samples). With such a short run, it may also be the case that the default burn-in of 10% of the chain length is inadequate. Not excluding enough of the start of the chain as burn-in will render estimates of ESS unreliable.

Let us try to fix this low ESS issue by running the MCMC chain for longer. We will set the chain length to 15,000,000, but still log the same number of samples (2,000). This means we should log 15,000,000/2,000 samples, which can be manually specified in the .xml file with logEvery=7500.

<run id="MCMC" spec="MCMC" chainLength="15000000" preBurnin="1480">
<logger id="Logger" spec="Logger" logEvery="750000">
<logger id="Logger1" spec="Logger" fileName="RSV2long.log" logEvery="7500">
<logger id="psi.treeLogger" spec="Logger" fileName="RSV2long.trees" logEvery="7500">

You can now run LPhyBEAST again, this time with the -l argument to create a new .xml file:

# MY_SCRIPT_PATH = ~/WorkSpace/linguaPhylo/tutorials/
$BEAST_DIR/bin/applauncher LPhyBEAST -wd $MY_SCRIPT_PATH -l 15000000 -o RSV2long.xml RSV2.lphy

Now run BEAST 2 again. It may take about half an hour to complete using modern computers.
If you do not want to wait, you can download the RSV2long.log and RSV2long.trees completed by running the longer MCMC chain.

Load the new log file into Tracer after it is finished (you can leave the old one loaded for comparison). Click on the “Trace” tab and look at the raw trace plot.

The trace of long run
Figure 4: A screenshot of Tracer showing the trace of a long run.

With this much longer MCMC chain, we still have 1,800 samples after removing the first 10% as burn-in, but our ESSs are all > 200. A large ESS does not mean there is no auto-correlation between samples overall, but instead that we have at least 200 effectively independent samples. More effective samples means we are approximating the posterior better than before, when ESSs were less than 100.

(You should keep an eye for weird trends in the trace plot suggesting that the MCMC chain has not yet converged, or is struggling to converge, like different value “plateaus” attracting the chain for a large number of steps.)

Now that we are satisfied with the length of the MCMC chain and its mixing, we can move on to one of the parameters of interest: the global substitution rate, μ. This is the average substitution rate across all sites in the alignment.

First, select “mu” in the left-hand panel, and click the “Marginal Density” tab in the upper-right corner to see the posterior density plot for this parameter. You should see a plot similar to this:

marginal density
Figure 5: A screenshot of Tracer showing the marginal posterior probability density of the mean substitution rate, `μ`.

As we can see in Figure 4, the marginal posterior probability density of μ is roughly bell-shaped. If the curve does not look very smooth, it is because there is some sampling noise: we could smooth it out more if we ran the MCMC chain for longer or took more samples. Regardless, our mean and highest posterior density interval (HPD; also known as the credible interval) estimates should be good given the ESSs.

Note that you can overlay the density plots of multiple traces in order to compare them (it is up to you to determine whether they are comparable on the the same axis or not). For example: select the relative substitution rates for all three codon positions (partitions) in the left-hand panel (labelled “r_0”, “r_1” and “r_2”).

You will now see the posterior probability densities for the relative substitution rate at all three codon positions overlaid:

relative substitution rates
Figure 6: A screenshot of Tracer showing the marginal posterior probability densities for the relative substitution rates, `r_0`, `r_1` and `r_2`.

Summarising the trees

Because of how peculiar and discrete tree space is, it is a bit harder to summarize and visualize the posterior distribution over phylogenetic trees, as compared to the mean molecular rate μ, for example. We will use a special tool for that, TreeAnnotator.

Run the program TreeAnnotator, and then choose 10% as the burn-in percentage, while keeping “Maximum clade credibility tree” as the “Target tree type”. For “Node heights”, choose “Mean heights”. Then load the tree log file that BEAST 2 generated (it will end in “.trees” by default) as “Input Tree File”. For this tutorial, the tree log file is called RSV2long.trees. Finally, for “Output File”, type RSV2long.tree.

TreeAnnotator
Figure 7: A screenshot of TreeAnnotator showing how to create a summary tree from a posterior tree set.

This setup will take the set of trees in the tree log file, and summarize it with a single maximum clade credibility (MCC) tree. The MCC tree is the tree that has the largest clade probability product across all nodes. Divergence times will reflect the mean ages of each node, and those times will be annotated with their 95% HPD intervals. TreeAnnotator will also display the posterior clade credibility of each node in the MCC tree.

More details on summarizing trees can be found in beast2.org/summarizing-posterior-trees/.

Note that TreeAnnotator only parses a tree log file into an output text file, but it will not allow you to visualize summary trees. Visualization has to be done with other programs (see next section).

Visualizing the trees

Summary trees can be viewed using FigTree and DensiTree, the latter being distributed with BEAST 2. In FigTree, just load TreeAnnotator’s output file as the input (if you were to load the entire tree log file, FigTree would show you each tree in the posterior individually).

After loading the MCC tree in FigTree, click on the “Trees” tab on the left-hand side, then select “Order nodes” while keeping the default ordering (“increasing”). In addition, check “Node Bars”, choosing “height_95%_HPD” for “Display”. This last step shows the estimated 95% HPD intervals for all node heights. You should end up with a tree like the one in Figure 8.

MCC tree
Figure 8: The maximum clade credibility (MCC) tree for the G gene of 129 RSVA-2 viral samples.

In DensiTree, load the tree log file as the input. Here, the tree in blue is the MCC tree, where the green “fuzzy” trees are all the other trees in the posterior set.

MCC tree
Figure 9: The posterior tree set visualised in DensiTree.

Questions

In what year did the common ancestor of all RSVA viruses sampled live? What is the 95% HPD?

Programs used in this tutorial

The following software will be used in this tutorial:

  • LPhy Studio - this software will specify and visualise models as well as simulate data from models defined in LPhy scripts. It is available for download from LPhy releases.
  • LPhy BEAST - this software will construct an input file for BEAST. The installation guide and usage can be found from here.
  • BEAST - this package contains the BEAST program, BEAUti, DensiTree, TreeAnnotator and other utility programs. This tutorial is written for BEAST v2.6.x, which has support for multiple partitions. It is available for download from http://www.beast2.org.
  • BEAST outercore package - this package of BEAST 2 has core features, but not in the core. You can install and use it following the instruction of managing BEAST 2 packages.
  • BEAST labs package - containing some generally useful stuff used by other packages.
  • BEAST feast package - this is a small BEAST 2 package which contains additions to the core functionality.
  • Tracer - this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and quantitively summarises the distributions of continuous parameters and provides diagnostic information. At the time of writing, the current version is v1.7. It is available for download from http://beast.community/tracer.
  • FigTree - this is an application for displaying and printing molecular phylogenies, in particular those obtained using BEAST. At the time of writing, the current version is v1.4.3. It is available for download from http://beast.community/figtree.

You will also need to make sure all required BEAST 2 packages (e.g., outercore) have been installed on your local computer.
The Package Manager can help you do that (see the screenshot below).

Package manager
Figure 10: A screenshot of Package Manager.

References

  • Zlateva, K. T., Lemey, P., Vandamme, A.-M., & Van Ranst, M. (2004). Molecular evolution and circulation patterns of human respiratory syncytial virus subgroup a: positively selected sites in the attachment g glycoprotein. J. Virol., 78(9), 4675–4683.
  • Zlateva, K. T., Lemey, P., Moës, E., Vandamme, A.-M., & Van Ranst, M. (2005). Genetic variability and molecular evolution of the human respiratory syncytial virus subgroup B attachment G protein. J. Virol., 79(14), 9157–9167. https://doi.org/10.1128/JVI.79.14.9157-9167.2005
  • Drummond, A. J., & Bouckaert, R. R. (2014). Bayesian evolutionary analysis with BEAST 2. Cambridge University Press.
  • Höhna, S., Landis, M. J., Heath, T. A., Boussau, B., Lartillot, N., Moore, B. R., Huelsenbeck, J. P., Ronquist, F. (2016). RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol., 65(4), 726-736.