Bayesian Skyline Plots
by LinguaPhylo core team
This tutorial is modified from Taming the BEAST tutorial Skyline plots.
Population dynamics influence the shape of the tree and consequently, the shape of the tree contains some information about past population dynamics. The socalled Skyline methods allow to extract this information from phylogenetic trees in a nonparametric manner. It is nonparametric since there is no underlying system of differential equations governing the inference of these dynamics.
In this tutorial we will look at a popular coalescent method, the Coalescent Bayesian Skyline plot (Drummond, Rambaut, Shapiro, & Pybus, 2005), to infer these dynamics from sequence data.
The programs used in this tutorial are listed below.
Background: Classic and Generalized Plots
(Drummond, Rambaut, Shapiro, & Pybus, 2005) explained these concepts in the figure below:
 A genealogy of five individuals sampled contemporaneously (top) together with its associated classic (middle) and generalized (bottom) skyline plots.
 A genealogy of five individuals sampled at three different times (top) along with its associated classic (middle) and generalized (bottom) skyline plots.
In the classic skyline plots, the changes in effective population size coincide with coalescent events,
resulting in a stepwise function with n − 2
change points and n − 1
population sizes,
where n
is the number of sampled individuals.
In the generalized skyline plot, changes in effective population size coincide with some, but not necessarily all, coalescent events.
The resulting stepwise function has m − 1
change points (1 ≤ m ≤ n−1
) and m
effective population sizes.
The NEXUS alignment
The data is in a file called hcv.nex.
By clicking the name of the file, it will be opened on your web
browser, after which you can download the data by rightclicking on
the main window, “Save Page As”, and saving the file as hcv.nex
in the desired folder.
The dataset consists of an alignment of 63 Hepatitis C sequences sampled in 1993 in Egypt (Ray, Arthur, Carella, Bukh, & Thomas, 2000). This dataset has been used previously to test the performance of skyline methods (Drummond, Rambaut, Shapiro, & Pybus, 2005, and Stadler, Kuhnert, Bonhoeffer, & Drummond, 2013).
With an estimated 1525%, Egypt has the highest Hepatits C prevalence in the world. In the mid 20th century, the prevalence of Hepatitis C increased drastically (see Figure 2 for estimates). We will try to infer this increase from sequence data.
Constructing the scripts in LPhy Studio
LPhy Studio implements a GUI for users to specify and visualize probabilistic graphical models, as well as for simulating data under those models. These tasks are executed according to an LPhy script the user types (or loads) on LPhy Studio’s interactive terminal. If you are new to LPhy, we recommend you to read this introduction first, before you continue on this tutorial.
Please note if you are working in LPhyStudio,
you do not need to add data
and model
keywords and curly brackets to define the code blocks.
We are supposed to add the lines without the data { }
and model { }
to the command line console
at the bottom of the window, where the data
and model
tabs in the GUI are used to specify
which block we are working on.
Below, we will build an LPhy script in two parts, the data
and
the model
blocks.
Code
data {
D = readNexus(file=“data/hcv.nexus”);
L = D.nchar();
numGroups = 4;
taxa = D.taxa();
w = taxa.length()1;
}
model {
π ~ Dirichlet(conc=[3.0, 3.0, 3.0, 3.0]);
rates ~ Dirichlet(conc=[1.0, 2.0, 1.0, 1.0, 2.0, 1.0]);
Q = gtr(rates=rates, freq=π);
γ ~ LogNormal(meanlog=0.0, sdlog=2.0);
r ~ DiscretizeGamma(shape=γ, ncat=4, replicates=L);
A ~ RandomComposition(k=numGroups, n=w);
θ1 ~ LogNormal(meanlog=9.0, sdlog=2.0);
Θ ~ ExpMarkovChain(firstValue=θ1, n=numGroups);
ψ ~ SkylineCoalescent(groupSizes=A, taxa=taxa, theta=Θ);
D ~ PhyloCTMC(Q=Q, mu=7.9E4, siteRates=r, tree=ψ);
}
Graphical Model
For the details, please read the autogenerated narrative from LPhyStudio.
Data block
The data { ... }
block is necessary when we use LPhy Studio to prepare
instruction input files for inference software (e.g., BEAST 2,
RevBayes, etc.).
The purpose of this block is to tell LPhy which nodes of our graphical
model are to be treated as known constants (and not to be
sampled by the inference software) because they are observed
data.
Elsewhere, this procedure has been dubbed “clamping” (Höhna et al.,
2016).
In this block, we will either type strings representing values to be directly assigned to scalar variables, or use LPhy’s syntax to extract such values from LPhy objects, which might be read from file paths given by the user.
(Note that keyword data
cannot be used to name variables because it
is reserved for defining scripting blocks as outlined above.)
In order to start specifying the data { ... }
block, make sure you
type into the “data” tab of the command prompt, by clicking “data” at
the bottom of LPhy Studio’s window.
In the script, taxa
and L
respectively stores the taxa from the alignment D
and the length of D
.
numGroups = 4
sets the number of grouped intervals in the generalized Coalescent Bayesian Skyline plots,
and w
defines n − 1
effective population sizes, which is the same number of times at which coalescent events occur.
Model block
The model { ... }
block is the main protagonist of our scripts.
This is where you will specify the many nodes and sampling
distributions that characterize a probabilistic model.
(Note that keyword model
cannot be used to name variables because it
is reserved for defining scripting blocks as outlined above.)
In order to start specifying the model { ... }
block, make sure you
type into the “model” tab of the command prompt, by clicking “model”
at the bottom of LPhy Studio’s window.
In this analysis, we will use the GTR model, which is the most general reversible model and estimates transition probabilities between individual nucleotides separately. That means that the transition probabilities between e.g. A and T will be inferred separately to the ones between A and C, however transition probabilities from A to C will be the same as C to A etc. The nucleotide equilibrium state frequencies π are estimated here.
Additionally, we allow for rate heterogeneity among sites.
We do this by approximating the continuous rate distribution (for each site in the alignment)
with a discretized gamma probability distribution (mean = 1),
where the number of bins in the discretization ncat = 4
(normally between 4 and 6).
The shape parameter will be estimated in this analysis.
As explained in (Yang, 2006), the shape parameter α is inversely related to the extent of rate variation at sites. If α > 1, the distribution is bellshaped, meaning that most sites have intermediate rates around 1, while few sites have either very low or very high rates. In particular, when α → ∞, the distribution degenerates into the model of a single rate for all sites. If α ≤ 1, the distribution has a highly skewed Lshape, meaning that most sites have very low rates of substitution or are nearly ‘invariable’, but there are some substitution hotspots with high rates.
The sequences were all sampled in 1993 so we are dealing with a homochronous alignment and do not need to specify tip dates.
Because our sequences are contemporaneous (homochronous data), there is no information in our dataset to estimate the clock rate. We will use an estimate inferred in Pybus et al., 2001 to fix the clock rate. In this case all the samples were contemporaneous (sampled at the same time) and the clock rate is simply a scaling of the estimated tree branch lengths (in substitutions/site) into calendar time.
So, let’s set the clock rate $\mu$ to 0.00079 s/s/y
In addition, we define the priors for the following parameters:
 the vector of effective population sizes Θ;
 the relative rates of the GTR process rates;
 the base frequencies π;
 the shape of the discretized gamma distribution $\gamma$.
Here we setup a Markov chain of effective population sizes using ExpMarkovChain
,
and apply a LogNormal
distribution to the first value of the chain.
Please note that the first value Θ1 is measured from the tips according to
(Drummond, Rambaut, Shapiro, & Pybus, 2005).
The vector of group sizes A
are positive integers randomly sampled by the function RandomComposition
where the vector’s dimension equals to a constant numGroups
,
and they should sum to the number of coalescent events w
.
Questions

What are
numGroups
andw
according to the Figure 1? And how to compute the number of coalescent events given the number of taxa? 
How to change the above LPhy scripts to use the classic Skyline coalescent?
Tips: by default all group sizes in SkylineCoalescent function are 1 which is equivalent to the classic skyline coalescent.
Producing BEAST XML using LPhyBEAST
BEAST 2 reads instructions about the data and the model from a userprovided .xml, which can be produced in a variety of ways. Our goal with LPhy is to make the preparation of the .xml as painless, clear and precise as possible. In order to achieve that, we will use a companion application, LPhyBEAST, as a bridge between the LPhy script we typed above and the .xml.
LPhyBEAST is distributed as a BEAST 2 package,
we can use an application called Package Manager
, which is distributed with BEAST 2 together,
to install it.
To start LPhyBEAST, we have to use the script lphybeast
.
Some technical guides can help you to start.
In our hcv_coal.lphy script, the alignment file is assumed to locate under the folder tutorials/data/
.
So we need to go to the tutorials
folder, which is normally where the LPhy is installed,
run LPhyBEAST as below and check the end of message to find where is the generated XML.
Let us run LPhyBEAST now:
# BEAST_DIR="/Applications/BEAST2"
cd ~/WorkSpace/linguaPhylo/tutorials/
$BEAST_DIR/bin/lphybeast l 40000000 hcv_coal.lphy
Running BEAST
After LPhyBEAST generates a BEAST 2 .xml file (e.g., hcv_coal.xml), we can point BEAST 2 to it, which will then start the inferential MCMC analysis.
BEAST 2 will write its outputs to disk into text files specified in the .xml file (specific paths can be passed in, but in their absence BEAST 2 will write the outputs in the same directory from where it was called).
BEAST 2 will also output the progress of the analysis and some summaries to the screen, like this:
BEAST v2.6.7, 20022020
Bayesian Evolutionary Analysis Sampling Trees
Designed and developed by
Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard
Centre for Computational Evolution
University of Auckland
r.bouckaert@auckland.ac.nz
alexei@cs.auckland.ac.nz
Institute of Evolutionary Biology
University of Edinburgh
a.rambaut@ed.ac.uk
David Geffen School of Medicine
University of California, Los Angeles
msuchard@ucla.edu
Downloads, Help & Resources:
http://beast2.org/
Source code distributed under the GNU Lesser General Public License:
http://github.com/CompEvol/beast2
BEAST developers:
Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled,
Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li,
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel,
Oliver Pybus, Tim Vaughan, ChiehHsi Wu, Walter Xie
Thanks to:
Roald Forsberg, Beth Shapiro and Korbinian Strimmer
File: hcv_coal.xml seed: 1630290833514 threads: 1
...
...
38000000 6632.7087 6161.6478 471.0609 0.2063 0.3178 0.2259 0.2499 0.0461 0.3619 0.0614 0.0340 0.4518 0.0445 0.3966 1 10 2 49 1847.5418 8278.3538 788.9602 243.4601 1m19s/Msamples
40000000 6635.1459 6184.2080 450.9379 0.1935 0.3251 0.2391 0.2421 0.0541 0.3255 0.0663 0.0406 0.4662 0.0471 0.3508 13 32 10 7 7759.0760 88.4127 139.5266 295.8700 1m19s/Msamples
Operator Tuning #accept #reject Pr(m) Pr(accm)
DeltaExchangeOperator(A.deltaExchange) 3.18924 177760 541485 0.01800 0.24715
ScaleOperator(Theta.scale) 0.28842 229995 650541 0.02201 0.26120
ScaleOperator(gamma.scale) 0.63167 81407 252858 0.00834 0.24354
DeltaExchangeOperator(pi.deltaExchange) 0.07328 144605 576445 0.01800 0.20055
Exchange(psi.narrowExchange)  2453146 3545681 0.14993 0.40894
ScaleOperator(psi.rootAgeScale) 0.63607 56804 276244 0.00834 0.17056
ScaleOperator(psi.scale) 0.71569 1416928 4579768 0.14993 0.23628
SubtreeSlide(psi.subtreeSlide) 60.10339 575882 5421860 0.14993 0.09602 Try decreasing size to about 30.052
Uniform(psi.uniform)  2410120 3584371 0.14993 0.40206
Exchange(psi.wideExchange)  51775 5949658 0.14993 0.00863
WilsonBalding(psi.wilsonBalding)  86216 5907884 0.14993 0.01438
DeltaExchangeOperator(rates.deltaExchange) 0.08060 129011 899557 0.02573 0.12543
Tuning: The value of the operator's tuning parameter, or '' if the operator can't be optimized.
#accept: The total number of times a proposal by this operator has been accepted.
#reject: The total number of times a proposal by this operator has been rejected.
Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).
Pr(accm): The acceptance probability (#accept as a fraction of the total proposals for this operator).
Total calculation time: 3160.808 seconds
End likelihood: 6635.145974434814
Analysing the BEAST output
Run the program called Tracer to analyze the output of BEAST. When the
main window has opened, choose Import Trace File...
from the File
menu
and select the file that BEAST has created called hcv_coal.log
. You should now
see a window like in 5.
For the reconstruction of the population dynamics, we need two files, the hcv_coal.log
file and the hcv_coal.trees
file.
The log file contains the information about the group sizes and population sizes of each segment,
while the trees file is needed for the times of the coalescent events.
Navigate to Analysis > Bayesian Skyline Reconstruction.
From there open the tree log file. To get the correct dates in the analysis we should specify the Age of the youngest tip
.
In our case it is 1993, the year where all the samples were taken.
If the sequences were sampled at different times (heterochronous data),
the age of the youngest tip is the time when the most recent sample was collected.
Press OK to reconstruct the past population dynamics.
The output will have the years on the xaxis and the effective population size on the yaxis. By default, the yaxis is on a logscale. If everything worked as it is supposed to work you will see a sharp increase in the effective population size in the mid 20th century, similar to what is seen below.
Note that the reconstruction will only work if the *.log and *.trees files contain the same number of states and both files were logged at the same frequency.
There are two ways to save the analysis, it can either be saved as a PDF file for displaying purposes or as a tab delimited file.
Navigate to File > Export Data Table. Enter the filename as hcv_coal.tsv
and save the file.
The exported file will have five rows, the time, the mean, median, lower and upper boundary of the 95% HPD interval of the estimates,
which you can use to plot the data with other software (R, Matlab, etc).
The parameterization and choosing the dimension
Please read the section of “The Coalescent Bayesian Skyline parameterization” and “Choosing the Dimension” from Taming the BEAST tutorial Skyline plots.
Questions

How to choose the dimension for the Bayesian skyline analysis? How does the number of dimensions of effective population sizes affect the result?

What are the alternative models to deal with this dimension problem?

What does the Bayesian skyline plot in this analysis tell you?
Some considerations for using skyline plots
In the coalescent, the time is modeled to go backwards, from present to past.
The coalescent skylines assume that the population is wellmixed. That is, they assume that there is no significant population structure and that the sequences are a random sample from the population. However, if there is population structure, for instance sequences were sampled from two different villages and there is much more contact within than between villages, then the results will be biased (Heller, Chikhi, & Siegismund, 2013). Instead a structured model should then be used to account for these biases.
Programs used in this tutorial
The following software will be used in this tutorial:
 LPhy Studio  this software will specify and visualise models as well as simulate data from models defined in LPhy scripts. It is available for download from LPhy releases.
 LPhy BEAST  this software will construct an input file for BEAST. The installation guide and usage can be found from here.
 BEAST  this package contains the BEAST program, BEAUti, DensiTree, TreeAnnotator and other utility programs. This tutorial is written for BEAST v2.6.7 or higher version, which has support for multiple partitions. It is available for download from http://www.beast2.org.
 BEAST labs package  containing some generally useful stuff used by other packages.
 BEAST feast package  this is a small BEAST 2 package which contains additions to the core functionality.
 Tracer  this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and quantitatively summarises the distributions of continuous parameters and provides diagnostic information. At the time of writing, the current version is v1.7.2. It is available for download from http://beast.community/tracer.
 FigTree  this is an application for displaying and printing molecular phylogenies, in particular those obtained using BEAST. At the time of writing, the current version is v1.4.3. It is available for download from http://beast.community/figtree.
 BEAST SSM (standard substitution models) package  containing the following standard timereversible substitution models: JC, F81, K80, HKY, TrNf, TrN, TPM1, TPM1f, TPM2, TPM2f, TPM3, TPM3f, TIM1, TIM1f, TIM2, TIM2f, TIM3 , TIM3f, TVMf, TVM, SYM, GTR.
Data
The number of replicates, L is the number of characters of alignment, D. The alignment, D is read from the Nexus file with a file name of "data/hcv.nexus". numGroups = 4 The n, w is calculated by taxa.length()1. The taxa is the list of taxa of alignment, D.Model
The alignment, D is assumed to have evolved under a phylogenetic continuous time Markov process (Felsenstein; 1981) on phylogenetic time tree, ψ, with a molecular clock rate of 7.9E4, instantaneous rate matrix, Q and siteRates, r. The instantaneous rate matrix, Q is the general timereversible rate matrix (Rodriguez et al; 1990) with relative rates, rates and base frequencies, π. The base frequencies, π have a Dirichlet distribution prior with a concentration of [3.0, 3.0, 3.0, 3.0]. The relative rates, rates have a Dirichlet distribution prior with a concentration of [1.0, 2.0, 1.0, 1.0, 2.0, 1.0]. The double, r_{i} is assumed to come from a DiscretizeGamma with shape, γ and a ncat of 4, for i in 0 to L  1. The shape, γ has a lognormal prior with a mean in log space of 0.0 and a standard deviation in log space of 2.0. The phylogenetic time tree, ψ has a skyline coalescent prior (Drummond et al; 2005) with population sizes, Θ, group sizes, A and taxa. The group sizes, A is assumed to come from a RandomComposition with n, w and k, numGroups of 4. The taxa.length() is the .length of taxa. The population sizes, Θ have a smoothing prior in which each element has an exponential prior with a mean of the previous element in the chain with firstValue, θ1 and number of steps, numGroups of 4. The firstValue, θ1 has a lognormal prior with a mean in log space of 9.0 and a standard deviation in log space of 2.0.Posterior
$$ \begin{split} P(\boldsymbol{\pi}, \boldsymbol{\textbf{rates}}, \boldsymbol{r}, \gamma, \boldsymbol{\psi}, \boldsymbol{A}, \boldsymbol{\Theta}, \theta1  \boldsymbol{D}) \propto &P(\boldsymbol{D}  \boldsymbol{\psi}, \boldsymbol{Q}, \boldsymbol{r})P(\boldsymbol{\pi})\\& P(\boldsymbol{\textbf{rates}})\prod_{i=0}^{L  1}P(\textrm{r}_i  \gamma)P(\gamma)\\& P(\boldsymbol{\psi}  \boldsymbol{\Theta}, \boldsymbol{A})P(\boldsymbol{A})P(\boldsymbol{\Theta}  \theta1)\\& P(\theta1)\end{split} $$Useful Links
 LinguaPhylo: https://linguaphylo.github.io
 BEAST 2 website and documentation: http://www.beast2.org/
 Join the BEAST user discussion: http://groups.google.com/group/beastusers
References
 Drummond, A. J., Rambaut, A., Shapiro, B., & Pybus, O. G. (2005). Bayesian coalescent inference of past population dynamics from molecular sequences. Molecular Biology and Evolution, 22(5), 1185–1192. https://doi.org/10.1093/molbev/msi103
 Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C.H., Xie, D., … Drummond, A. J. (2014). BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Computational Biology, 10(4), e1003537. https://doi.org/10.1371/journal.pcbi.1003537
 Bouckaert, R., Vaughan, T. G., BaridoSottani, J., Duchêne, S., Fourment, M., Gavryushkina, A., … Drummond, A. J. (2019). BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLOS Computational Biology, 15(4).
 Ray, S. Ê. C., Arthur, R. Ê. R., Carella, A., Bukh, J., & Thomas, D. Ê. L. (2000). Genetic Epidemiology of Hepatitis C Virus throughout Egypt. The Journal of Infectious Diseases, 182(3), 698–707. https://doi.org/10.1086/315786
 Pybus, O. G., Drummond, A. J., Nakano, T., Robertson, B. H., & Rambaut, A. (2003). The Epidemiology and Iatrogenic Transmission of Hepatitis C Virus in Egypt: A Bayesian Coalescent Approach. Molecular Biology and Evolution, 20(3), 381–387. https://doi.org/10.1093/molbev/msg043
 Pybus, O. G., Charleston, M. A., Gupta, S., Rambaut, A., Holmes, E. C., Harvey, P. H., … Felsenstein, J. (2001). The epidemic behavior of the hepatitis C virus. Science (New York, N.Y.), 292(5525), 2323–2325. https://doi.org/10.1126/science.1058321
 Rosenberg, N. A., & Nordborg, M. (2002). Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nature Reviews Genetics, 3(5).
 Pybus, O. G., Rambaut, A., & Harvey, P. H. (2000). An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies. Genetics, 155(3).
 Heled, J., & Drummond, A. J. (2008). Bayesian inference of population size history from multiple loci. BMC Evolutionary Biology, 8(1), 289. https://doi.org/10.1186/147121488289
 Minin, V. N., Bloomquist, E. W., & Suchard, M. A. (2008). Smooth skyride through a rough skyline: Bayesian coalescentbased inference of population dynamics. Molecular Biology and Evolution, 25(7), 1459–1471. https://doi.org/10.1093/molbev/msn090
 Heller, R., Chikhi, L., & Siegismund, H. R. (2013). The confounding effect of population structure on Bayesian skyline plot inferences of demographic history. PloS One, 8(5), e62992. https://doi.org/10.1371/journal.pone.0062992
 Drummond, A. J., & Bouckaert, R. R. (2014). Bayesian evolutionary analysis with BEAST 2. Cambridge University Press.