This tutorial is modified from Taming the BEAST MASCOT Tutorial.

Background

Phylogeographic methods can help reveal the movement of genes between populations of organisms. Using the location or state of samples alongside sequence data, they have been widely used to quantify pathogen movement between different host populations, reconstruct the migration history of humans, and trace the geographic spread of languages or gene flow between species. Phylogenies therefore offer insights into migration processes not available from classic epidemiological or occurrence data alone.

The structured coalescent models the migration and coalescent processes coherently, but struggles with complex datasets due to the need to infer ancestral migration histories. Thus, approximations to the structured coalescent, which integrate over all ancestral migration histories, have been developed. This tutorial gives an introduction to how a MASCOT analysis in BEAST2 can be set up. MASCOT is short for Marginal Approximation of the Structured COalescenT (Müller, Rasmussen, & Stadler, 2018) and implements a structured coalescent approximation (Müller, Rasmussen, & Stadler, 2017). This approximation doesn't require migration histories to be sampled using MCMC and therefore makes it possible to analyse phylogenies with more than three or four states.

Practical: Parameter and State inference using the approximate structured coalescent

In this tutorial we will estimate migration rates, effective population sizes and locations of internal nodes using the marginal approximation of the structured coalescent implemented in BEAST2, MASCOT (Müller, Rasmussen, & Stadler, 2018).

Instead of following the “traditional” BEAST pipeline, we will use LPhy to build the MASCOT model for the analysis, and then use LPhyBEAST to create the XML from the LPhy scripts.

The aim is to:

• Learn how to infer structure from trees with sampling location
• Get to know how to choose the set-up of such an analysis
• Learn how to read the output of a MASCOT analysis

The programs used in this tutorial are listed below.

The NEXUS alignment

The data is in a file called H3N2.nex. Clicking the name of the file will open it in your web browser, after which you can download the data by right-clicking on the main window, selecting "Save Page As", and saving the file as H3N2.nex in the desired folder.

The dataset consists of 24 Influenza A/H3N2 sequences (sampled between 2000 and 2001) subsampled from the original dataset, collected in Hong Kong, New York and New Zealand. South-East Asia has been hypothesized to be a global source location of seasonal Influenza, while more temperate regions such as New Zealand are assumed to be global sinks (missing reference), meaning that Influenza strains are more likely to migrate from the tropics to the temperate regions than vice versa. We want to see if we can infer this source-sink dynamic from sequence data using the structured coalescent.

Constructing the scripts in LPhy Studio

The software LPhy Studio is used to specify and visualise models as well as simulate data from models defined in LPhy scripts.

The data block is used to input and store the data that will be processed by the models defined later; it also allows you to reuse the script with another dataset by simply replacing the current data. In this block, we normally include the constants for models, the alignment loaded from a NEXUS file, and the metadata describing the taxa.

The model block defines and describes your models and parameters in the Bayesian phylogenetic analysis, so that your result can easily be reproduced by other researchers.

Please make sure the tab above the command console is set to data when you type or copy and paste the data block scripts into the console, and switched to model when you type or copy and paste the model block scripts.

When you write your LPhy scripts, please be aware that data and model are reserved words and cannot be used as variable names.

The LPhy scripts defining this analysis are listed below.

Code

data {
  options = {ageDirection="forward", ageRegex=".*\|.*\|(\d*\.\d+|\d+\.\d*)\|.*$"};
  D = readNexus(file="data/h3n2.nexus", options=options);
  L = D.nchar();
  demes = split(str=D.taxaNames(), regex="\|", i=3);
  S = length(unique(arg=demes));
  dim = S*(S-1);
  taxa = D.taxa();
}
model {
  π ~ Dirichlet(conc=[2.0, 2.0, 2.0, 2.0]);
  κ ~ LogNormal(meanlog=1.0, sdlog=1.25);
  μ ~ LogNormal(meanlog=-5.298, sdlog=0.25);
  γ ~ LogNormal(meanlog=0.0, sdlog=2.0);
  r ~ DiscretizeGamma(shape=γ, ncat=4, replicates=L);
  m ~ Exp(mean=1.0, replicates=dim);
  Θ ~ LogNormal(sdlog=1.0, meanlog=0.0, replicates=S);
  M = migrationMatrix(theta=Θ, m=m);
  ψ ~ StructuredCoalescent(M=M, demes=demes, sort=true, taxa=taxa);
  D ~ PhyloCTMC(Q=hky(kappa=κ, freq=π), mu=μ, siteRates=r, tree=ψ);
}

Graphical Model

For the details, please read the auto-generated narrative from LPhy Studio.

Data block

The data { ... } block is necessary when we use LPhy Studio to prepare instruction input files for inference software (e.g., BEAST 2, RevBayes, etc.). The purpose of this block is to tell LPhy which nodes of our graphical model are to be treated as known constants (and not to be sampled by the inference software) because they are observed data. Elsewhere, this procedure has been dubbed "clamping" (Höhna et al., 2016). In this block, we will either type strings representing values to be directly assigned to scalar variables, or use LPhy's syntax to extract such values from LPhy objects, which might be read from file paths given by the user.

(Note that the keyword data cannot be used to name variables because it is reserved for defining scripting blocks as outlined above.)

In order to start specifying the data { ... } block, make sure you type into the "data" tab of the command prompt, by clicking "data" at the bottom of LPhy Studio's window.

Tip dates

Since the sequences were sampled through time, we have to specify the sampling dates. These are included in the sequence names, separated by |. To set the sampling dates, we will use the regular expression ".*\|.*\|(\d*\.\d+|\d+\.\d*)\|.*$" to extract these decimal numbers and turn them into ages.
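As a sanity check outside LPhy, the same regular expression can be applied to a taxon label with Python's re module. The label below is illustrative, assuming the name|accession|date|location layout used in this dataset:

```python
import re

# Regex from the LPhy data block that captures the decimal date field
age_regex = r".*\|.*\|(\d*\.\d+|\d+\.\d*)\|.*$"

# Illustrative taxon label in the assumed name|accession|date|location layout
label = "A/New_York/564/2000|CY010996|2000.60|New_York"

match = re.match(age_regex, label)
print(match.group(1))  # the captured decimal sampling date: 2000.60
```

The capture group accepts decimals written either as ".5"-style (`\d*\.\d+`) or "5."-style (`\d+\.\d*`) numbers, so plain integers in the other fields are never matched.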

How to set the age direction in LPhy is available in the Time-stamped data tutorial.

Tip locations

For this analysis we have additional information about the sampling locations of the sequences taken from patients: Hong Kong, New York and New Zealand. We can extract these sampling locations from the taxa labels using split with the separator |, taking the 4th element with i=3, where i is the zero-based index of the split elements. You can check them by clicking the graphical component demes (the yellow-orange diamond) generated by the data section. It is an array of locations required by the StructuredCoalescent in the model section later.
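The split call behaves like ordinary string splitting. A plain-Python equivalent, using an illustrative label in the assumed name|accession|date|location layout, would be:

```python
# Illustrative taxon label in the assumed name|accession|date|location layout
label = "A/New_York/564/2000|CY010996|2000.60|New_York"

# Equivalent of LPhy's split(str=..., regex="\|", i=3):
# split on "|" and take the element at zero-based index 3
deme = label.split("|")[3]
print(deme)  # -> New_York
```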

Model block

The model { ... } block is the main protagonist of our scripts. This is where you will specify the many nodes and sampling distributions that characterize a probabilistic model.

(Note that keyword model cannot be used to name variables because it is reserved for defining scripting blocks as outlined above.)

In order to start specifying the model { ... } block, make sure you type into the “model” tab of the command prompt, by clicking “model” at the bottom of LPhy Studio’s window.

In this analysis, we will use an HKY model with estimated frequencies. Additionally, we allow for rate heterogeneity among sites. We do this by approximating the continuous rate distribution (for each site in the alignment) with a discretized gamma probability distribution (mean = 1), where the number of bins in the discretization is ncat = 4 (normally between 4 and 6). The shape parameter will be estimated in this analysis. More details can be found in the Bayesian Skyline Plots tutorial.

Next, we are going to set the priors for MASCOT. First, consider the effective population size parameter. Since we have only a few samples per location, meaning little information about the different effective population sizes, we will need an informative prior. In this case we will use a log-normal prior with meanlog=0 and sdlog=1 (the mean and standard deviation, respectively, of the corresponding normal distribution in log space).
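As a quick check (not part of the analysis itself), the central 95% interval implied by a log-normal prior with meanlog=0 and sdlog=1 can be computed directly, since it is exp(meanlog ± 1.96 · sdlog):

```python
import math

meanlog, sdlog = 0.0, 1.0  # prior on each effective population size
lower = math.exp(meanlog - 1.96 * sdlog)
upper = math.exp(meanlog + 1.96 * sdlog)
print(f"95% prior interval: [{lower:.3f}, {upper:.3f}]")  # roughly [0.141, 7.099]
```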

The exponential distribution used as a prior on the migration rates puts much weight on lower values while not prohibiting larger ones. For migration rates, a prior that penalizes overly large values while not greatly distinguishing between very small and very, very small values is generally a good choice. Be aware, however, that the exponential distribution is quite an informative prior: one should be careful to choose a mean such that feasible rates fall at least within the 95% HPD interval of the prior. (This can be determined using an R script.)
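The same 95% interval can also be computed by hand, since the exponential quantile function is -mean · ln(1 - p). For the Exp(mean=1.0) migration-rate prior used in the model block:

```python
import math

def exp_quantile(p, mean=1.0):
    # Inverse CDF of the exponential distribution with the given mean
    return -mean * math.log(1.0 - p)

lower = exp_quantile(0.025)  # 2.5% quantile
upper = exp_quantile(0.975)  # 97.5% quantile
print(f"central 95% prior interval: [{lower:.4f}, {upper:.4f}]")  # about [0.0253, 3.6889]
```

So rates up to roughly 3.7 migration events per lineage per year are well covered by this prior.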

A more in-depth explanation of what backwards migration rates really are can be found in Peter Beerli's blog post.

Finally, set the prior for the clock rate. We have a good idea about the clock rate of Influenza A/H3N2 Hemagglutinin. From previous work by other people, we know that the clock rate will be around 0.005 substitutions per site per year. To include that prior knowledge, we can set the prior on the clock rate to a log-normal distribution. If we set meanlog=-5.298 and sdlog=0.25, then we expect the clock rate to be, with 95% certainty, between 0.00306 and 0.00816.
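These bounds can be verified directly from the log-normal quantiles: the median is exp(meanlog) and the central 95% interval is exp(meanlog ± 1.96 · sdlog):

```python
import math

meanlog, sdlog = -5.298, 0.25  # clock-rate prior from the model block
median = math.exp(meanlog)
lower = math.exp(meanlog - 1.96 * sdlog)
upper = math.exp(meanlog + 1.96 * sdlog)
print(f"median ~ {median:.5f}, 95% interval ~ [{lower:.5f}, {upper:.5f}]")
# median ~ 0.00500, interval ~ [0.00306, 0.00816]
```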

Please note that the dimension of the effective population sizes should equal the number of locations (say x); the dimension of the backwards-in-time migration rates should then equal x*(x-1).
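This mirrors what the data block computes from the taxon labels. A quick illustration with a hypothetical list of demes for the three locations in this tutorial:

```python
# Hypothetical demes extracted from taxon labels (three distinct locations)
demes = ["Hong_Kong", "New_York", "New_Zealand", "New_York", "Hong_Kong"]

S = len(set(demes))   # number of distinct locations, as in S = length(unique(arg=demes))
dim = S * (S - 1)     # number of pairwise backwards-in-time migration rates

print(S, dim)  # -> 3 6
```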

Producing BEAST XML using LPhyBEAST

BEAST 2 reads instructions about the data and the model from a user-provided .xml, which can be produced in a variety of ways. Our goal with LPhy is to make the preparation of the .xml as painless, clear and precise as possible. In order to achieve that, we will use a companion application, LPhyBEAST, as a bridge between the LPhy script we typed above and the .xml.

LPhyBEAST is distributed as a BEAST 2 package; we can install it using the Package Manager application, which is distributed together with BEAST 2. To start LPhyBEAST, we have to use the script lphybeast. Some technical guides can help you get started.

In our h3n2.lphy script, the alignment file is assumed to be located under the folder tutorials/data/. We therefore need to change to the tutorials folder, which is normally where LPhy is installed, run LPhyBEAST as below, and check the end of the output message to find where the generated XML is.

Let us run LPhyBEAST now:

# Path to your BEAST 2 installation; adjust for your system
BEAST_DIR="/Applications/BEAST2"
# Run LPhyBEAST on the tutorial script (the script name/path below is an assumption)
"$BEAST_DIR/bin/lphybeast" h3n2.lphy

Model

The alignment, D, is assumed to have evolved under a phylogenetic continuous-time Markov process (Felsenstein, 1981) on a phylogenetic time tree, ψ, with molecular clock rate, μ, an instantaneous rate matrix, and site rates, r. The instantaneous rate matrix is the HKY model (Hasegawa et al., 1985) with transition bias parameter, κ, and base frequency vector, π. The base frequency vector, π, has a Dirichlet prior with a concentration of [2.0, 2.0, 2.0, 2.0]. The transition bias parameter, κ, has a log-normal prior with a mean in log space of 1.0 and a standard deviation in log space of 1.25. The molecular clock rate, μ, has a log-normal prior with a mean in log space of -5.298 and a standard deviation in log space of 0.25. Each site rate, rᵢ, is assumed to come from a DiscretizeGamma distribution with shape, γ, and an ncat of 4, for i in 0 to L - 1. The shape, γ, has a log-normal prior with a mean in log space of 0.0 and a standard deviation in log space of 2.0. The phylogenetic time tree, ψ, is assumed to come from a StructuredCoalescent (Müller et al., 2017) with M, taxa, demes and a sort of true. M is given by migrationMatrix with theta, Θ, and m. Each migration rate, mᵢ, has an exponential prior with a mean of 1.0, for i in 0 to dim - 1. The set of unique demes is obtained from demes. Each effective population size, Θᵢ, has a log-normal prior with a mean in log space of 0.0 and a standard deviation in log space of 1.0, for i in 0 to S - 1.

Posterior

$$\begin{split} P(\boldsymbol{\pi}, \kappa, \mu, \boldsymbol{r}, \gamma, \boldsymbol{\psi}, \boldsymbol{m}, \boldsymbol{\Theta} \mid \boldsymbol{D}) \propto\ & P(\boldsymbol{D} \mid \boldsymbol{\psi}, \mu, Q, \boldsymbol{r})\,P(\boldsymbol{\pi})\,P(\kappa)\\& P(\mu)\prod_{i=0}^{L - 1}P(r_i \mid \gamma)\,P(\gamma)\,P(\boldsymbol{\psi} \mid \boldsymbol{M})\\& \prod_{i=0}^{\textrm{dim} - 1}P(m_i)\prod_{i=0}^{S - 1}P(\Theta_i) \end{split}$$

If you are interested in the derivations of the marginal approximation of the structured coalescent, you can find them in Müller, Rasmussen, & Stadler (2017). This paper also explains the mathematical differences from other methods, such as the theory underlying BASTA. To get a better idea of how the states of internal nodes are calculated, have a look at Müller, Rasmussen, & Stadler (2018).

References

• Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6), 368-376. https://doi.org/10.1007/BF01734359
• Hasegawa, M., Kishino, H., & Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22(2), 160-174. https://doi.org/10.1007/BF02101694
• Müller, N. F., Rasmussen, D., & Stadler, T. (2018). MASCOT: Parameter and state inference under the marginal structured coalescent approximation. Bioinformatics, bty406. https://doi.org/10.1093/bioinformatics/bty406
• Müller, N. F., Rasmussen, D. A., & Stadler, T. (2017). The Structured Coalescent and its Approximations. Molecular Biology and Evolution, msx186.
• Drummond, A. J., & Bouckaert, R. R. (2014). Bayesian evolutionary analysis with BEAST 2. Cambridge University Press.