It guides you through a discrete phylogeography analysis of a H5N1 epidemic in South China. This analysis will use the model developed by Lemey et al., 2009 that implements ancestral reconstruction of discrete states (locations) in a Bayesian statistical framework, and employs the Bayesian stochastic search variable selection (BSSVS) to identify the most parsimonious description of the phylogeographic diffusion process.

The additional benefit using this model is that we can summarise the phylogeographic inferences from an analysis, and use a virtual globe software to visualize the spatial and temporal information.

The programs used in this tutorial are listed below.

## The NEXUS alignment

The data is in a file called h5n1.nex. By clicking the name of the file, it will be opened on your web browser, after which you can download the data by right-clicking on the main window, “Save Page As”, and saving the file as h5n1.nex in the desired folder.

The data is a subset of original dataset Wallace et al., 2007, and it consists of 43 influenza A H5N1 hemagglutinin and neuraminidase gene sequences isolated from a variety of hosts 1996 - 2005 across sample locations.

## Constructing the scripts in LPhy Studio

The software LPhy Studio is used to specify and visualise models as well as simulate data from models defined in LPhy scripts.

The data block is used to input and store the data, which will be processed by the models defined later, and which also allows you to reuse the another dataset by simply replacing the current data. In this block, we normally include the constants for models, the alignment loaded from a NEXUS file, and the meta data regarding to the information of taxa that we have known.

The model block is to define and also describe your models and parameters in the Bayesian phylogenetic analysis. Therefore, your result could be easily reproduced by other researchers.

Please make sure the tab above the command console is set to data, when you intend to type or copy and paste the data block scripts into the console. In addition, make sure to switch the tab to model, when you intend to type or copy and paste the model block scripts into the console.

When you write your LPhy scripts, please be aware that data and model have been reserved and cannot be used as the variable name.

The LPhy scripts to define this analysis is listed below.

data {
options = {ageDirection=“forward”, ageRegex=“_(\d+)”}; D = readNexus(file=“data/H5N1.nex”, options=options); L = D.nchar(); taxa = D.taxa(); D_trait = extractTrait(taxa=taxa, sep=“_”, i=2); K = D_trait.canonicalStateCount(); dim = K*(K-1)/2; } model { π ~ Dirichlet(conc=[2.0, 2.0, 2.0, 2.0]); κ ~ LogNormal(meanlog=1.0, sdlog=1.25); γ ~ LogNormal(meanlog=0.0, sdlog=2.0); r ~ DiscretizeGamma(shape=γ, ncat=4, replicates=L); Θ ~ LogNormal(meanlog=0.0, sdlog=1.0); ψ ~ Coalescent(taxa=taxa, theta=Θ); D ~ PhyloCTMC(Q=hky(kappa=κ, freq=π), mu=0.004, siteRates=r, tree=ψ); π_trait ~ Dirichlet(conc=rep(element=3.0, times=K)); I ~ Bernoulli(minSuccesses=dim-2, p=0.5, replicates=dim); R_trait ~ Dirichlet(conc=rep(element=1.0, times=dim)); Q_trait = generalTimeReversible(rates=select(x=R_trait, indicator=I), freq=π_trait); μ_trait ~ LogNormal(meanlog=0, sdlog=1.25); D_trait ~ PhyloCTMC(L=1, Q=Q_trait, dataType=“standard”, mu=μ_trait, tree=ψ); } ## Graphical Model For the details, please read the auto-generated narrative from LPhyStudio. The code D_trait = extractTrait(taxa=taxa, sep=“_”, i=2); in the data block uses the function to extract the locations from the taxa names, and creates a trait alignment D_trait to contain these locations mapped to each taxon. Then the next line K = D_trait.canonicalStateCount(); count the number of unique locations in the trait alignment. Please note the method canonicalStateCount() returns the number of canonical states excluding ambiguous states. ## Geographic Model In this analysis, we have two parts mixed in the model section: the first part is modeling evolutionary history and demographic structure based on a nucleotide alignment, and the second part is defining how to sample the discrete states (locations) from the phylogeny\psishared with the 1st part. For the nucleotide alignment, We fix a strict molecular clock to 0.004, to make the analysis converge a bit quicker. The next is the geographic model. In the discrete phylogeography, the probability of transitioning to a new location through the time is computed by $P(t) = e^{\Lambda t}$ where\Lambda$is a$ K \times K $infinitesimal rate matrix, and$K$is the number of discrete locations (i.e. 5 locations here). Then, $\Lambda = \mu S \Pi$ where$\mu$is an overall rate scalar,$S$is a$ K \times K $matrix of relative migration rates, and$\Pi = diag(\pi)$where$\pi$is the equilibrium trait frequencies. After the normalization,$\mu$measures the number of migration events per unit time$t$. The detail is explained in Lemey et al., 2009. So, assuming migration to be symmetric in this analysis, we define a vector variable R_trait with the length of$ \frac{K \times (K-1)}{2} $to store the off-diagonal entries of the unnormalised$S. Another boolean vector I with the same length determines which infinitesimal rates are zero, which is performed by the function select. This implements the Bayesian stochastic search variable selection (BSSVS). ## Producing BEAST XML using LPhyBEAST BEAST 2 reads instructions about the data and the model from a user-provided .xml, which can be produced in a variety of ways. Our goal with LPhy is to make the preparation of the .xml as painless, clear and precise as possible. In order to achieve that, we will use a companion application, LPhyBEAST, as a bridge between the LPhy script we typed above and the .xml. LPhyBEAST is distributed as a BEAST 2 package, we can use an application called Package Manager, which is distributed with BEAST 2 together, to install it. To start LPhyBEAST, we have to use the script lphybeast. Some technical guides can help you to start. In our h5n1.lphy script, the alignment file is assumed to locate under the folder tutorials/data/. So we need to go to the tutorials folder, which is normally where the LPhy is installed, run LPhyBEAST as below and check the end of message to find where is the generated XML. Let us run LPhyBEAST now: # BEAST_DIR="/Applications/BEAST2/"BEAST_DIR/bin/lphybeast -l 3000000 h5n1.lphy


## Running BEAST

After LPhyBEAST generates a BEAST 2 .xml file (e.g., h5n1.xml), we can point BEAST 2 to it, which will then start the inferential MCMC analysis.

BEAST 2 will write its outputs to disk into text files specified in the .xml file (specific paths can be passed in, but in their absence BEAST 2 will write the outputs in the same directory from where it was called).

BEAST 2 will also output the progress of the analysis and some summaries to the screen, like this:

                         BEAST v2.6.7, 2002-2020
Bayesian Evolutionary Analysis Sampling Trees
Designed and developed by
Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard

Centre for Computational Evolution
University of Auckland
r.bouckaert@auckland.ac.nz
alexei@cs.auckland.ac.nz

Institute of Evolutionary Biology
University of Edinburgh
a.rambaut@ed.ac.uk

David Geffen School of Medicine
University of California, Los Angeles
msuchard@ucla.edu

http://beast2.org/

http://github.com/CompEvol/beast2

BEAST developers:
Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled,
Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li,
Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel,
Oliver Pybus, Tim Vaughan, Chieh-Hsi Wu, Walter Xie

Thanks to:
Roald Forsberg, Beth Shapiro and Korbinian Strimmer

Random number seed: 1659590653904

...

...
2850000     -5960.6708     -5834.2398      -126.4309         0.3211         0.1966         0.2226         0.2595         9.0333         0.3238         8.6857         0.1296         0.3150         0.1804         0.1649         0.2098              1              1              1              1              1              1              1              1              1              1         0.1884         0.0747         0.0422         0.1702         0.0805         0.0174         0.2373         0.0457         0.0291         0.1139         0.6191 1m16s/Msamples
3000000     -5957.8699     -5834.9067      -122.9631         0.3277         0.1965         0.2239         0.2517         7.6506         0.4290         5.9862         0.1501         0.3367         0.1051         0.2549         0.1529              1              1              1              1              1              1              1              1              1              1         0.1650         0.0716         0.0709         0.1302         0.3421         0.0130         0.0994         0.0303         0.0485         0.0286         0.2453 1m16s/Msamples

Operator                                          Tuning    #accept    #reject      Pr(m)  Pr(acc|m)
BitFlipOperator(I.bitFlip)                             -      39782      90355    0.04340    0.30569
DeltaExchangeOperator(R_trait.deltaExchange)     0.57921      15922     104777    0.04031    0.13191
ScaleOperator(Theta.scale)                       0.46729       7620      18512    0.00866    0.29160
ScaleOperator(gamma.scale)                       0.40531       7005      18916    0.00866    0.27024
ScaleOperator(kappa.scale)                       0.52022       6840      19112    0.00866    0.26356
ScaleOperator(mu_trait.scale)                    0.27589       7746      18166    0.00866    0.29893
UpDownOperator(mu_traitUppsiDownOperator)        0.91311      40003     321915    0.12047    0.11053
DeltaExchangeOperator(pi.deltaExchange)          0.06532       9549      46452    0.01868    0.17051
DeltaExchangeOperator(pi_trait.deltaExchange)    0.54124      10870      57988    0.02285    0.15786
Exchange(psi.narrowExchange)                           -      60429     295287    0.11850    0.16988
ScaleOperator(psi.rootAgeScale)                  0.77165       3697      22375    0.00866    0.14180
ScaleOperator(psi.scale)                         0.90511      35312     319993    0.11850    0.09939 Try setting scaleFactor to about 0.951
SubtreeSlide(psi.subtreeSlide)                   0.88122      70929     284431    0.11850    0.19960
Uniform(psi.uniform)                                   -     149335     206383    0.11850    0.41981
Exchange(psi.wideExchange)                             -       1575     353485    0.11850    0.00444
WilsonBalding(psi.wilsonBalding)                       -       2051     353189    0.11850    0.00577

Tuning: The value of the operator's tuning parameter, or '-' if the operator can't be optimized.
#accept: The total number of times a proposal by this operator has been accepted.
#reject: The total number of times a proposal by this operator has been rejected.
Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight).
Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator).

Total calculation time: 232.693 seconds


## Analysing the BEAST output

Run the program called Tracer to analyze the output of BEAST. When the main window has opened, choose Import Trace File... from the File menu and select the file that BEAST has created called h5n1.log. You should now see a window like in Figure 2.

Remember that MCMC is a stochastic algorithm so the actual numbers will not be exactly the same. On the left hand side is a list of the different quantities that BEAST has logged. There are traces for the posterior (this is the log of the product of the tree likelihood and the prior probabilities), and the continuous parameters. Selecting a trace on the left brings up analyses for this trace on the right hand side depending on tab that is selected. When first opened, the “posterior”” trace is selected and various statistics of this trace are shown under the Estimates tab. In the top right of the window is a table of calculated statistics for the selected trace.

Tracer will plot a (marginal posterior) distribution for the selected parameter and also give you statistics such as the mean and median. The 95% HPD lower or upper stands for highest posterior density interval and represents the most compact interval on the selected parameter that contains 95% of the posterior probability. It can be thought of as a Bayesian analog to a confidence interval.

## Obtaining an estimate of the phylogenetic tree

Run the program TreeAnnotator, and then choose 10% as the burn-in percentage, while keeping “Maximum clade credibility tree” as the “Target tree type”. For “Node heights”, choose “Mean heights”. Then load the tree log file that BEAST 2 generated (it will end in “.trees” by default) as “Input Tree File”. For this tutorial, the tree log file is called h5n1_with_trait.trees. Finally, for “Output File”, type h5n1_with_trait.tree.

This setup will take the set of trees in the tree log file, and summarize it with a single maximum clade credibility (MCC) tree. The MCC tree is the tree that has the largest clade probability product across all nodes. Divergence times will reflect the mean ages of each node, and those times will be annotated with their 95% HPD intervals. TreeAnnotator will also display the posterior clade credibility of each node in the MCC tree.

More details on summarizing trees can be found in beast2.org/summarizing-posterior-trees/.

Note that TreeAnnotator only parses a tree log file into an output text file, but it will not allow you to visualize summary trees. Visualization has to be done with other programs (see next section).

## Distribution of root location

When you open the summary tree with locations h5n1_with_trait.tree in a text editor, and scroll to the most right and locate the end of the tree definition, you can see the set of meta data for the root. Looking for the last entries of location.set and location.set.prob, you might find something like this:

location.set = {Guangdong,HongKong,Hunan,Guangxi,Fujian}
location.set.prob = {0.18656302054414214,0.6129927817878956,0.03220433092726263,0.1121599111604664,0.0560799555802332}


This means that we have the following distribution for the root location:

Location Probability
Guangdong 0.18656302054414214
HongKong 0.6129927817878956
Hunan 0.03220433092726263
Guangxi 0.1121599111604664
Fujian 0.0560799555802332

This distribution shows that the 95% HPD consists of all locations except Hunan, with a strong indication that HongKong might be the root with over 58% probability. It is quite typical that a lot of locations are part of the 95% HPD in discrete phylogeography.

## Viewing the Location Tree

We can visualise the tree in a program called FigTree. Run this program, and open the summary tree file h5n1_with_trait.tree by using the Open command in the File menu. The tree should appear. You can now try selecting some of the options in the control panel on the left. Try selecting Appearance to set the branch Colour by location. In addition, you can set the branch Width by location.prob according to the posterior support of estimated locations. Increasing the Line Weight can make the branch width more different regarding to its posterior support. Finally, tick Legend and select location in the drop list of Attribute. You should end up with something like Figure 4.

Alternatively, you can load the posterior tree set h5n1_with_trait.trees (note this is NOT the summary tree, but the complete set) into DensiTree and set it up as follows.

• Click Show to choose Root Canal tree to guide the eye.
• Click Grid to choose Full grid option, type year 2005 in the Origin text field and tick Reverse to show the correct time scale. You also can reduce the Digits to 0 which will rounding years in the x-axis (i.g. 2005, instead of 2005.22).
• Go to Line Color, you can colour branches by location. The final image look like Figure 5.

## The number of estimated transitions

Sometime, we want to visualise how the location states are changed through the phylogeny. StateTransitionCounter can count the number of branches in a tree or a set of trees that have a certain state at the parent and another at the node.

So, install the Babel package and run the StateTransitionCounter through BEAST application launcher. The command line below will generate the output file stc.out containing all counts from the logged posterior trees h5n1_with_trait.trees, after removing 10% burn-in. Please make sure you install the latest version (Babel >= v0.3.2, BEASTLabs >= v1.9.6).

## Model

The alignment, D is assumed to have evolved under a phylogenetic continuous time Markov process (Felsenstein; 1981) on phylogenetic time tree, ψ, with a molecular clock rate of 0.004, an instantaneous rate matrix and siteRates, r. An instantaneous rate matrix is the HKY model (Hasegawa et al; 1985) with transition bias parameter, κ and base frequency vector, π. The base frequency vector, π have a Dirichlet distribution prior with a concentration of [2.0, 2.0, 2.0, 2.0]. The transition bias parameter, κ has a log-normal prior with a mean in log space of 1.0 and a standard deviation in log space of 1.25. The double, ri is assumed to come from a DiscretizeGamma with shape, γ and a ncat of 4, for i in 0 to L - 1. The shape, γ has a log-normal prior with a mean in log space of 0.0 and a standard deviation in log space of 2.0. The time tree, ψ is assumed to come from a Kingman's coalescent tree prior (Kingman; 1982) with coalescent parameter, Θ and taxa. The coalescent parameter, Θ has a log-normal prior with a mean in log space of 0.0 and a standard deviation in log space of 1.0. The alignment, Dtrait is assumed to have evolved under a phylogenetic continuous time Markov process (Felsenstein; 1981) on phylogenetic time tree, ψ, with molecular clock rate, μtrait, instantaneous rate matrix, Qtrait, a length of 1 and a dataType of "standard". The instantaneous rate matrix, Qtrait is assumed to come from the generalTimeReversible with rates and freq, πtrait. The freq, πtrait have a Dirichlet distribution prior with a concentration. A concentration is assumed to come from the rep with an element of 3.0 and times, K. The number is assumed to come from the select with x, R_traiti and indicator, I0. The indicator, I is assumed to come from a Bernoulli with a p of 0.5, replicates, dim and minSuccesses. The minSuccesses is calculated by dim-2. The x, Rtrait have a Dirichlet distribution prior with a concentration. A concentration is assumed to come from the rep with an element of 1.0 and times, dim. The molecular clock rate, μtrait has a log-normal prior with a mean in log space of 0 and a standard deviation in log space of 1.25.

## Posterior

$$\begin{split} P(\boldsymbol{\pi}, \kappa, \boldsymbol{r}, \gamma, \boldsymbol{\psi}, \Theta, \boldsymbol{\pi\textbf{}_{trait}}, \boldsymbol{I}, \boldsymbol{\textbf{R}_{trait}}, \mu\textrm{}_{trait} | \boldsymbol{D}, \boldsymbol{\textbf{D}_{trait}}) \propto &P(\boldsymbol{D} | \boldsymbol{\psi}, Q, \boldsymbol{r})P(\boldsymbol{\pi})P(\kappa)\\& \prod_{i=0}^{L - 1}P(\textrm{r}_i | \gamma)P(\gamma)P(\boldsymbol{\psi} | \Theta)\\& P(\Theta)P(\boldsymbol{\textbf{D}_{trait}} | \boldsymbol{\psi}, \mu\textrm{}_{trait}, \boldsymbol{\textbf{Q}_{trait}})\\& P(\boldsymbol{\pi\textbf{}_{trait}})P(\boldsymbol{I})P(\boldsymbol{\textbf{R}_{trait}})\\& P(\mu\textrm{}_{trait})\end{split}$$

## References

• Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6), 368-376. https://doi.org/10.1007/BF01734359
• Hasegawa, M., Kishino, H. & Yano, T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22, 160–174 (1985) https://doi.org/10.1007/BF02101694
• Kingman JFC. The Coalescent. Stochastic Processes and their Applications 13, 235–248 (1982) https://doi.org/10.1016/0304-4149(82)90011-4
• Lemey, P., Rambaut, A., Drummond, A. J. and Suchard, M. A. (2009). Bayesian phylogeography finds its roots. PLoS Comput Biol 5, e1000520.
• Wallace, R., HoDac, H., Lathrop, R. and Fitch, W. (2007). A statistical phylogeography of influenza A H5N1. Proceedings of the National Academy of Sciences 104, 4473.
• Bielejec F., Rambaut A., Suchard M.A & Lemey P. (2011). SPREAD: Spatial Phylogenetic Reconstruction of Evolutionary Dynamics. Bioinformatics, 27(20):2910-2912. doi:10.1093.
• Douglas, J., Mendes, F. K., Bouckaert, R., Xie, D., Jimenez-Silva, C. L., Swanepoel, C., … & Drummond, A. J. (2020). Phylodynamics reveals the role of human travel and contact tracing in controlling COVID-19 in four island nations. doi: https://doi.org/10.1101/2020.08.04.20168518