Detailed documentation: https://emvolz.github.io/Coalescent.jl/dev/intro/

Introduction

Coalescent.jl is a package for simulation of coalescent genealogical histories with a flexible markup language for conditioning on demographic history. This package can be used as a building-block in pipelines for simulation-based inference in population genetics research.

The focus of this package is on support for simulation under demographic histories featuring nonlinear dynamics, potentially complex state spaces, and complex sampling over time. This makes the package especially relevent for studies of pathogen evolution, but other applications are possible. A design goal of Coalescent.jl is to enable fast and memory-efficient simulation of very large coalescent trees.

Other software:

  • msprime A flexible and highly optimised coalescent simulator in Python, but does not currently support complex nonlinear models
  • phydynR This is an R package which supports similar specification of demographic processes, but is substantially slower
  • TiPS A recently-developed R package supporting a similar specification of demographic processes
  • MASTER A BEAST2 add-on focused on stochastic demographic processes

Relative to msprime, some major features are currently missing:

  • Does not currently support recombination (ancestral recombination graphs)
  • Detailed models of evolution, which currently must be added on manually

A simple coalescent simulation

First, let's load the necessary packages:

using Coalescent
using Phylo # For plotting trees 
using Plots # For plotting demographic trajectories
using Distributions # For random number generation

Let's start with a simple coalescent simulation with a constant population size=10 (first argument) and 10 samples:

tr = SimTree(10.0, 10)
@show tr 
tonewick(tr)
Simulated coalescent tree with 10 tips and 9 internal nodes
Tip labels:
	Union{Nothing, String}["t.1", "t.2", "t.3", "t.4", "t.5", "t.6", "t.7", "t
.8", "t.9", "t.10"] ...

Rooted; includes branch lengths
tr = 
"((((t.4:2.520073107400519,t.5:2.520073107400519):1.3174591901714678,(t.9:1
.266569944939787,(t.10:0.36999246585678014,(t.1:0.36707169076260365,t.7:0.3
6707169076260365):0.0029207750941764865):0.8965774790830068):2.570962352632
2):2.013564887352403,(t.2:1.6115809554183689,t.8:1.6115809554183689):4.2395
16229506021):5.320626273601084,(t.3:5.617177202174617,t.6:5.617177202174617
):5.554546256350856):0.0;"

tonewick converts the tree to a portable newick string for loading into other software or packages.

Variable Population Size

Now, let's simulate a coalescent with a variable population size:

tr = SimTree((t,p...)->exp(-t), 10)
Simulating coalescent, sample size =10
Initial guess of time to most recent common ancestor (before most recent sa
mple): 1.0
Markovian coalescent algorithm
User-specified Ne(t) function 
Simulated coalescent tree with 10 tips and 9 internal nodes
Tip labels:
	Union{Nothing, String}["t.1", "t.2", "t.3", "t.4", "t.5", "t.6", "t.7", "t
.8", "t.9", "t.10"] ...

Rooted; includes branch lengths

This creates a tree where the population size increases exponentially over time. Note that in univariate models, the convention is for time to represent time before the most recent sample. Thus time zero always represents when the most recent sample was collected.

We can visualize the tree using Phylo.jl and [Plots.jl]:

tonewick(tr) |> parsenewick |> plot

Parametric population size functions

Suppose we wish to specify the growth rate and initial (most recent) population size in the exponential growth model.

tr = SimTree((t,Ne0,λ) -> Ne0*exp(-λ*t), 10, 2_000, 0.25)
@show tr 
tonewick(tr) |> parsenewick |> plot
Simulating coalescent, sample size =10
Initial guess of time to most recent common ancestor (before most recent sa
mple): 2000.0
Markovian coalescent algorithm
User-specified Ne(t) function 
Simulated coalescent tree with 10 tips and 9 internal nodes
Tip labels:
	Union{Nothing, String}["t.1", "t.2", "t.3", "t.4", "t.5", "t.6", "t.7", "t
.8", "t.9", "t.10"] ...

Rooted; includes branch lengths
tr =

This simulates a population that starts at size Ne0 and declines exponentially with rate λ. Arguments to SimTree folllowing the sample size (10) are passed as arguments to the effective population size function. So in this case, Ne0=2000 and λ=0.25.

Logistic Growth

Here is a tree simulated under logistic growth and carrying capacity K=100:

nefunc(t,K) = 10 + K*(1/(1+exp(-(t-5))))
plot(t->nefunc(t,100), 0, 10)

These simulations are pretty fast, although for unstructured models, there are faster alternatives. This is how long it takes to simulate a tree with 10 thousand samples and the logistic growth model:

@time SimTree(nefunc, 10_000, 100)
Simulating coalescent, sample size =10000
Initial guess of time to most recent common ancestor (before most recent sa
mple): 10.669285092428485
Markovian coalescent algorithm
User-specified Ne(t) function 
  0.600311 seconds (12.49 M allocations: 335.358 MiB, 10.20% gc time, 27.40
% compilation time)
Simulated coalescent tree with 10000 tips and 9999 internal nodes
Tip labels:
	Union{Nothing, String}["t.1", "t.2", "t.3", "t.4", "t.5", "t.6", "t.7", "t
.8", "t.9", "t.10"] ...

Rooted; includes branch lengths

Population Bottleneck

We can simulate a population that experiences a sudden bottleneck. In this case, the effective population size drops from 100 to 1 at time 5.

nefunc(t, Ne1,Ne2,T) = t<T ? Ne1 : Ne2
plot(t->nefunc(t, 10.0, 1.0, 5.0), 0, 10)

tr = SimTree(nefunc, 25, 100.0, 1.0, 5.0)
@show tr 
tonewick(tr) |> parsenewick |> plot
Simulating coalescent, sample size =25
Initial guess of time to most recent common ancestor (before most recent sa
mple): 100.0
Markovian coalescent algorithm
User-specified Ne(t) function 
Simulated coalescent tree with 25 tips and 24 internal nodes
Tip labels:
	Union{Nothing, String}["t.1", "t.2", "t.3", "t.4", "t.5", "t.6", "t.7", "t
.8", "t.9", "t.10"] ...

Rooted; includes branch lengths
tr =

Variable Sampling Times

We can also simulate coalescent trees with samples taken at different times:

tr = SimTree(1.0, rand(10))
tonewick(tr) |> parsenewick |> plot
Simulating coalescent, sample size =10
Initial guess of time to most recent common ancestor (before most recent sa
mple): 1.0
Markovian coalescent algorithm
User-specified Ne(t) function

The first argument is Ne, and the second argument is a vector sample times, which in this case were 10 uniform random numbers.

SIR Model

The specialty of Coalescent.jl is the simulation of trees under non-linear dynamics. We'll demonstrate this with the SIR (Susceptible-Infected-Recovered) epidemiological model which is specified by a system of ordinary differential equations. In this case, and in structured models (demonstrated later), there are some important differences regarding how population dynamics are specified:

  • Time now represents the conventional forward direction. This is different than how time is handled in most coalescent software packages (i.e. msprime), but is necessary to accomodate nonlinear models that are difficult or impossible to solve in retrospective time.
  • Rather than defining the dynamics with a julia expression, we will define the model in YAML format, usually kept in a separate file, but here we will write the YAML as multiline strings.
sir_yaml = """
modelname: example_sir
births:
  - source: I # Infections generate new infections at this rate 
    recipient: I 
    rate: β*S*I/N

deaths:
  - deme: I # Representing recovery of infected individuals
    rate: γ*I

parameters:
  - name: β # transmission rate
    value: 3.0
  - name: γ # recovery rate 
    value: 2.0

dynamic_variables:
  - name: I
    initial_value: 1.0
  - name: S
    initial_value: 1e5
    ode: -β*S*I/N
  - name: R
    initial_value: 0.0
    ode: γ*I

helpers: # Other variables that can make it easier to define the ODEs 
  - name: N
    definition: S+I+R

time: 
  initial: 1.0
  final: 20.0 
""";

Note that this model schema follows the conventions of so-called FGY models. Specifically:

  • The genealogical process is defined by several state & time dependant rates that can influence the history of a lineage: Births, Migrations (examples in later sections) and Deaths
  • Some dynamical variables correspond to a population size influencing the coalescent rate ("I" in this case representing the number of infections), but others represent the state of the system which must be known to determine birth, migration and death rates ("S" and "R"). We do not decompose dynamics of these variables into birth and death rates, but we must specify an ODE giving the time derivative of these variables.

Now we integrate the model and plot a trajectory:

sirmodel = ModelFGY(confstr = sir_yaml)
@show sirmodel 
soln_sirmodel = solveodes(sirmodel)
plot(soln_sirmodel)
Compartmental model with 1 demes, 
and 2 other dynamic variables.

Dynamic variables: ["I"], ["S", "R"]

Parameters: 
2×2 DataFrame
 Row │ parameter  value
     │ String     Float64
─────┼────────────────────
   1 │ γ              2.0
   2 │ β              3.0

Initial conditions: Dict{String, Number}("I" => 1.0, "S" => 100000.0, "R" =
> 0.0)
3×2 DataFrame
 Row │ variable  initial value
     │ String    Float64
─────┼─────────────────────────
   1 │ I                   1.0
   2 │ S              100000.0
   3 │ R                   0.0

Initial time: 1.0

Final time: 20.0

sirmodel =

Now let's define a sampling scheme. There are several ways to do this. Samples can be defined a particular time and a particular deme (see structured models). Or julia code can be provided that generates a vector of sample times. Or, a table can be loaded from a file which contains the time and deme of each sample. These can co-exist within the same configuration. Here is an example:

sirsample_yaml = """
sample: 
  # Take a single sample from deme I at time 10
  - deme: I
    time: 15.0 
    size: 10
  # Executable Julia code to define sample times. Takes 5 samples from deme I at equal intervals between times 5 and 10. 
  - deme: I
    time: range( 5.0, 10.0, 5 )
  # Another example of parsing and executing Julia code; sampling 5 times from a Normal distribution
  - deme: I
    time: rand( Normal(10), 5 )
  # [NOT RUN] Read sample time information from a table
  #- table: ./sampletimes.csv # A table with columns <sample_time>, <deme>
"""

sirsamp = SampleConfiguration(confstr = sirsample_yaml);

Now simulate the tree with the model and sample configuration:

tr = SimTree(sirmodel, sirsamp)
tonewick(tr) |> parsenewick |> plot

Models with population structure

It is straightforward to specify a model with population structure, nonlinear dynamics, and heterogeneous sampling over time using the FGY yaml format.

Island Model

First, let's reproduce the classic island model with 2 demes, a constant migration rate (μ) between demes, and a constant population size in each deme.

The population size is constant because we specify birth rates and death rates to be the same at a per-capita rate γ. This rate also sets the timescale of the process (one generation = 1/γ).

islandmodel_yaml = """
modelname: exampleisland
births:
  - source: A
    recipient: A
    rate: γ * A
  - source: B
    recipient: B
    rate: γ * B
migrations:
  - source: A 
    recipient: B 
    rate: μ * A 
  - source: B 
    recipient: A 
    rate: μ * B 
deaths:
  - deme: A
    rate: γ * A
  - deme: B
    rate: γ * B
dynamic_variables:
  - name: A
    initial_value: 100.0
  - name: B
    initial_value: 100.0
time:
  initial: 0
  final: 300
parameters:
  - name: γ
    value: 1.0
  - name: μ
    value: 0.05
"""

# Homochronous sampling from both demes: 
islandsample_yaml = """ 
sample:
  - deme: A 
    time: 300
    size: 20 
  - deme: B 
    time: 300
    size: 20 
"""

islandmodel = ModelFGY(confstr=islandmodel_yaml)
@show islandmodel 

# Simulate the tree 
tr = SimTree(
    islandmodel,
    SampleConfiguration(confstr=islandsample_yaml)
)
@show tr 
# Plot the tree: 
tonewick(tr) |> parsenewick |> plot
Compartmental model with 2 demes, 
and 0 other dynamic variables.

Dynamic variables: ["A", "B"], String[]

Parameters: 
2×2 DataFrame
 Row │ parameter  value
     │ String     Float64
─────┼────────────────────
   1 │ γ             1.0
   2 │ μ             0.05

Initial conditions: Dict{String, Number}("B" => 100.0, "A" => 100.0)
2×2 DataFrame
 Row │ variable  initial value
     │ String    Float64
─────┼─────────────────────────
   1 │ B                 100.0
   2 │ A                 100.0

Initial time: 0.0

Final time: 300.0

islandmodel = 
Simulated coalescent tree with 40 tips and 39 internal nodes
Tip labels:
	Union{Nothing, String}["t.A.1", "t.A.2", "t.A.3", "t.A.4", "t.A.5", "t.A.6
", "t.A.7", "t.A.8", "t.A.9", "t.A.10"] ...

Rooted; includes branch lengths
tr =

Note that the deme of sampling is embedded in the tip label, which makes it easy to extract in other software. The deme can also be accessed programmatically in tr.demes.

Imbalanced Migration

The way migration is handled in FGY models is quite different than most coalescent frameworks. The given rate is always a "per-capita" rate in a forwards time direction. In contrast, most coalescent frameworks specify a migration rate "per-lineage" in a reverse time direction. To show the difference, consider this variation on the island model which is the same as above except that deme B starts at size 1 and deme A at size 500.

twodememigration_yaml = """
modelname: examplemigration
births:
  - source: A
    recipient: A
    rate: γ * A
  - source: B
    recipient: B
    rate: γ * B
migrations:
  - source: A 
    recipient: B 
    rate: μ * A
  - source: B 
    recipient: A 
    rate: μ * B
deaths:
  - deme: A
    rate: γ * A
  - deme: B
    rate: γ * B
dynamic_variables:
  - name: A
    initial_value: 500.0
  - name: B
    initial_value: 1.0
time:
  initial: 0
  final: 50
parameters:
  - name: γ
    value: 1.0
  - name: μ
    value: 0.05
"""

twodememodel = ModelFGY(confstr=twodememigration_yaml)
solntwodeme = solveodes(twodememodel) 
plot(solntwodeme)

Because μ is a forward-time per-capita rate, the initial rate of migrations from A to B is 500μ while in the opposite direction it is only 1μ. Thus the populations converge in size over time until the total rate of migrations is the same in both directions.

If you want to use the per-lineage instead per-capita migration rate, it is possible to deduce the "per-lineage" retrospective rate of migrations. For example in this model, the per-lineage rate from B to A is μA/B and from A to B it is μB/A. See this paper for details.

A note about initial conditions

A disadvantage of the FGY model format is that the genealogical process is conditioned on a demographic history, and there is no guarantee that the process will coalesce to a single MRCA by the initial time of the simulation. For most applications, it is desired to simulate until this MRCA is reached. Consider the previous island model and this sampling scheme:

earlyislandsample_yaml = """ 
sample:
  - deme: A 
    time: 1
    size: 20 
  - deme: B 
    time: 1
    size: 20 
""";

In this case, all of the samples are collected at time 1 (after the initial time zero), so that there is not time for the process to coalesce. In this case, the simulator will raise a warning and add branches until a MRCA is found based on the distribution of events up to time zero. This is for convenience only, and branches before time zero should not be used for subsequent analysis:

julia> SimTree(
           ModelFGY(confstr=islandmodel_yaml),
           SampleConfiguration(confstr=earlyislandsample_yaml)
       )
Simulated coalescent tree with 40 tips and 39 internal nodes
Tip labels:
	Union{Nothing, String}["t.A.1", "t.A.2", "t.A.3", "t.A.4", "t.A.5", "t.A.6", "t.A.7", "t.A.8", "t.A.9", "t.A.10"] ...

Rooted; includes branch lengths
┌ Warning: Coalescent process did not reach a common ancestor. Adding 34 nodes. 
└ @ Coalescent ~/.julia/packages/Coalescent/kfJsL/src/s0.jl:41

If you get this warning, it will likely be important for your application to revise the model to begin simulation from an earlier time point.

SEIR Model

Finally, let's simulate a coalescent tree based on the SEIR (Susceptible-Exposed-Infected-Recovered) model which will illustrate population structure, nonlinear dynamics, and heterochronous sampling:

seir_yaml = """
modelname: example_sir
births:
  - source: I
    recipient: E
    rate: β*S*I/N
migrations:
  - source: E 
    recipient: I 
    rate: γ₁ * E 
deaths:
  - deme: I
    rate: γ₂ * I
parameters:
  - name: β 
    value: 6.0
  - name: γ₁
    value: 1.0
  - name: γ₂
    value: 1.0
dynamic_variables:
  - name: E
    initial_value: 0.1 
  - name: I
    initial_value: 0.0
  - name: S
    initial_value: 1e3
    ode: -β*S*I/N
  - name: R
    initial_value: 0.0
    ode: γ*I
helpers: 
  - name: N
    definition: S+E+I+R
time: 
  initial: 0.0
  final: 10.0 
"""

seirmodel = ModelFGY(confstr = seir_yaml) 
@show seirmodel
soln_seirmodel = solveodes(seirmodel)
plot(soln_seirmodel)

seirsample_yaml = """
sample:
  - deme: I
    time: range(2.0, 10.0, 10)
"""

seirsamp = SampleConfiguration(confstr = seirsample_yaml) 
tr = SimTree(seirmodel, seirsamp)
@show tr 
tonewick(tr) |> parsenewick |> plot
Compartmental model with 2 demes, 
and 2 other dynamic variables.

Dynamic variables: ["I", "E"], ["S", "R"]

Parameters: 
3×2 DataFrame
 Row │ parameter  value
     │ String     Float64
─────┼────────────────────
   1 │ γ₁             1.0
   2 │ γ₂             1.0
   3 │ β              6.0

Initial conditions: Dict{String, Number}("I" => 0.0, "S" => 1000.0, "E" => 
0.1, "R" => 0.0)
4×2 DataFrame
 Row │ variable  initial value
     │ String    Float64
─────┼─────────────────────────
   1 │ I                   0.0
   2 │ S                1000.0
   3 │ E                   0.1
   4 │ R                   0.0

Initial time: 0.0

Final time: 10.0

seirmodel = 
Simulated coalescent tree with 10 tips and 9 internal nodes
Tip labels:
	Union{Nothing, String}["t.I.1", "t.I.2", "t.I.4", "t.I.8", "t.I.11", "t.I.
16", "t.I.33", "t.I.50", "t.I.60", "t.I.70"] ...

Rooted; includes branch lengths
tr =

In this model, lineages are sampled from the "I" deme, representing for example microbial genomes collected from symptomatic infected individuals. Migration represents the progression of disease "E" to "I". Deaths represent recovery of infected individuals. And "I" lineages generate new lineages in "E" (forwards time) according to a mass action model.

Computing Distances

For some applications related to simulation-based inference, it may be better to work with a distance matrix rather than the tree. We can compute this matrix as follows:

tr = SimTree(seirmodel, seirsamp, computedescendants=true)
# You can modify `tr.edgelength` to simulate various evolutionary models
Coalescent.distancematrix(tr)
10×10 Matrix{Float64}:
  0.0      14.9448   12.7552   …  8.09323  8.3108   8.72257  8.30531
 14.9448    0.0      13.167       9.61146  8.72257  6.75937  7.41642
 12.7552   13.167     0.0         7.42192  5.02444  6.94479  6.52753
 11.835    12.2781   10.0886      6.50165  5.64414  6.05591  5.63864
 10.1032   11.3892    9.19969     4.76983  4.75525  5.16702  4.74975
  6.81885  10.5003    8.3108   …  3.64878  3.86636  4.27813  3.86086
  8.09323   9.61146   7.42192     0.0      2.97747  3.38924  2.97197
  8.3108    8.72257   5.02444     2.97747  0.0      2.50035  2.08309
  8.72257   6.75937   6.94479     3.38924  2.50035  0.0      1.1942
  8.30531   7.41642   6.52753     2.97197  2.08309  1.1942   0.0

Note that the simulator must be called with computedescendants=true in this case which will add modestly to simulation time and substantially to memory usage. If you wish to simulate an evolutionary model, so that for example distances represent substitutions per site, you can modify the branch lengths (tr.edgelength) before computing the distance matrix. For example, this would simulate a Jukes-Cantor model with substitution rate 0.001 and 10,000 nucleotides:

tr.edgelength .= [ Poisson(rate) |> rand for rate in tr.edgelength*10_000*0.001  ]
18-element Vector{Union{Nothing, Float64}}:
 58.0
  8.0
  7.0
 33.0
 11.0
  7.0
  1.0
 30.0
  6.0
 47.0
  0.0
  3.0
  3.0
 59.0
  5.0
  5.0
  2.0
  2.0

Future versions may streamline this simulation.