N. gonorrhea: Computing coalescent odds, identifying growing lineages, cluster identification & sample reweighting

This vignette demonstrates the main functions of cod using data from 1,102 Neiseria gonorrhoeae genomes first described in ¹. The data used here were further analysed in ² and this version of the data are available at https://github.com/xavierdidelot/gonophylo.

library(cod)
tr <- ape::read.tree( system.file('extdata/grad2016-treedater-tr1.nwk', package='cod' ) )
md <- read.csv( system.file( 'extdata/grad2016-a2md.csv' , package='cod' ) )

This is a time-scaled phylogeny estimated with the treedater R package.

tr

## 
## Phylogenetic tree with 1102 tips and 1101 internal nodes.
## 
## Tip labels:
##   17176_1#21, 15335_3#33, 15335_6#51, 15335_4#63, 15335_5#50, 8727_8#36, ...
## 
## Rooted; includes branch length(s).

Metadata includes the year of sample collection, clinic where the sample was collected, and resistance scores to several classes of antibiotics. For the purposes of this vignette, we will consider a score of “2” to represent resistance.

head(md)

##           ID PEN TET SPC CFX CRO CIP AZI Clinic Year
## 1  15335_2#1   2   2   0   2   0   2   1    POR 2012
## 2 15335_2#10   1   1   0   0   0   0   2    MIN 2005
## 3 15335_2#11   1   1   0   0   0   0   2    MIN 2005
## 4 15335_2#12   1   2   0   0   0   0   2    LVG 2006
## 5 15335_2#13   2   2   0  NA   0   2   2    CHI 2008
## 6 15335_2#14   2   2   0  NA   0   0   2    KCY 2008

hist( md$Year, title = '', ylabel = '', xlabel = 'Year' )

Estimating coalescent odds

The main function to estimate coalescent odds using weighted least squares is codls:

f <- codls( tr )

##    logtau      loss optimal
## 1    -4.0 200516.48        
## 2     0.1  87477.40        
## 3     4.2  86547.03     ***
## 4     8.3  86589.56        
## 5    12.4       Inf        
## 6    16.5       Inf        
## 7    20.6       Inf        
## 8    24.7       Inf        
## 9    28.8       Inf        
## 10   32.9       Inf        
## 11   37.0       Inf

##  Genealogical placement GMRF model fit 
## 
## Phylogenetic tree with 1102 tips and 1101 internal nodes.
## 
## Tip labels:
##   17176_1#21, 15335_3#33, 15335_6#51, 15335_4#63, 15335_5#50, 8727_8#36, ...
## 
## Rooted; includes branch length(s).
## Range of coefficients: 
## -0.442259845161674 0.426205484263409
## Precision parameter (log tau): 4.2

The only required argument is a time-scaled phylogenetic tree. This method models the correlation of coalescent odds between phylogenetic lineages using a Gaussian-Markov Random Field which includes a precision parameter logtau. If indepently estimated, this can be provided to codls to speed up estimation, but if omitted, logtau will be automatically estimated using the tauprofile function. You can also speed up codls by using multiple CPUs with the ncpu argument.

Plotting the fit will display a tree with estimated log odds of coalescence mapped by colour on to branches. Note that this requires the ggtree package to be installed.

# Plot the tree with coalescent odds - handle potential ggtree issues
tryCatch({
  plot(f)
}, error = function(e) {
  cat("Plot generation failed due to ggtree compatibility issue:\n")
  cat(e$message, "\n")
  cat("Tree summary:\n")
  print(f)
  cat("\nCoalescent odds summary:\n")
  print(summary(coef(f)))
})

The coalescent odds for each branch can be retrieved using coef, e.g.:

coef(f)[1:4]

## [1] 0.003407523 0.005130719 0.005719041 0.006293182

These are in the same order as nodes in the input tree.

Let’s merge the estimated coalescent odds back into the metadata for subsequent analysis:

fdf <- data.frame( tip = f$data$tip.label, psi = coef(f)[1:Ntip(tr)] )
md$tip  <- md$ID 
md <- merge( md, fdf, by = 'tip')
head( md )

##          tip         ID PEN TET SPC CFX CRO CIP AZI Clinic Year         psi
## 1  15335_2#1  15335_2#1   2   2   0   2   0   2   1    POR 2012 -0.04803569
## 2 15335_2#10 15335_2#10   1   1   0   0   0   0   2    MIN 2005  0.24893462
## 3 15335_2#11 15335_2#11   1   1   0   0   0   0   2    MIN 2005  0.24848434
## 4 15335_2#12 15335_2#12   1   2   0   0   0   0   2    LVG 2006  0.24821303
## 5 15335_2#13 15335_2#13   2   2   0  NA   0   2   2    CHI 2008 -0.02901797
## 6 15335_2#14 15335_2#14   2   2   0  NA   0   0   2    KCY 2008 -0.04413655

Sample weights

If we examine the relationship between coalescent odds and where samples originated (Clinic) there are a few clinics with significantly higher values:

table( md$Clinic ) |> sort()

## 
## COL LA2 SLC STL FBG GRB IND PON NOR LBC NYC CLE RIC KCY ALB DTR OKC DAL CIN ATL 
##   3   3   3   3   4   6   6   6   7   8   8  11  11  13  14  14  14  16  22  25 
## BHM MIA SEA DEN BAL HON POR MIN LAX ORA PHX SFO PHI CHI LVG SDG 
##  28  28  29  30  33  35  48  51  53  62  64  67  69  71  85 152

s <- lm( psi ~ Clinic, data = md) |> summary() 
s

## 
## Call:
## lm(formula = psi ~ Clinic, data = md)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50273 -0.08555 -0.03763  0.02845  0.56277 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.0052379  0.0480984   0.109   0.9133  
## ClinicATL    0.0037715  0.0600749   0.063   0.9500  
## ClinicBAL   -0.0297057  0.0574015  -0.518   0.6049  
## ClinicBHM   -0.0842535  0.0589083  -1.430   0.1529  
## ClinicCHI    0.0139142  0.0526273   0.264   0.7915  
## ClinicCIN   -0.0352066  0.0615277  -0.572   0.5673  
## ClinicCLE   -0.1300334  0.0725111  -1.793   0.0732 .
## ClinicCOL   -0.1405248  0.1144972  -1.227   0.2200  
## ClinicDAL    0.0690866  0.0658615   1.049   0.2944  
## ClinicDEN    0.0412778  0.0582501   0.709   0.4787  
## ClinicDTR   -0.0856221  0.0680215  -1.259   0.2084  
## ClinicFBG   -0.1115689  0.1020322  -1.093   0.2744  
## ClinicGRB    0.1535391  0.0878153   1.748   0.0807 .
## ClinicHON   -0.0043148  0.0569108  -0.076   0.9396  
## ClinicIND   -0.0488006  0.0878153  -0.556   0.5785  
## ClinicKCY    0.0552351  0.0693172   0.797   0.4257  
## ClinicLA2    0.1110863  0.1144972   0.970   0.3322  
## ClinicLAX    0.0161195  0.0540792   0.298   0.7657  
## ClinicLBC   -0.0761099  0.0797622  -0.954   0.3402  
## ClinicLVG   -0.0550232  0.0519086  -1.060   0.2894  
## ClinicMIA    0.0763673  0.0589083   1.296   0.1951  
## ClinicMIN   -0.0261175  0.0543003  -0.481   0.6306  
## ClinicNOR   -0.1240360  0.0833089  -1.489   0.1368  
## ClinicNYC   -0.0308319  0.0797622  -0.387   0.6992  
## ClinicOKC   -0.1496981  0.0680215  -2.201   0.0280 *
## ClinicORA    0.0320981  0.0532527   0.603   0.5468  
## ClinicPHI   -0.0270736  0.0527528  -0.513   0.6079  
## ClinicPHX    0.0520299  0.0530992   0.980   0.3274  
## ClinicPON   -0.0770099  0.0878153  -0.877   0.3807  
## ClinicPOR    0.0006079  0.0546646   0.011   0.9911  
## ClinicRIC   -0.0593261  0.0725111  -0.818   0.4134  
## ClinicSDG    0.0291153  0.0502647   0.579   0.5625  
## ClinicSEA   -0.0522784  0.0585688  -0.893   0.3723  
## ClinicSFO   -0.0039325  0.0528854  -0.074   0.9407  
## ClinicSLC   -0.2952074  0.1144972  -2.578   0.0101 *
## ClinicSTL   -0.2333512  0.1144972  -2.038   0.0418 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.18 on 1066 degrees of freedom
## Multiple R-squared:  0.07834,    Adjusted R-squared:  0.04808 
## F-statistic: 2.589 on 35 and 1066 DF,  p-value: 1.699e-06

It is possible that this occurred because these locations were sampled more intensively than other clinics which can artificially increase coalescent rates because of higher local density of co-circulating lineages. Samples can be down-weighted in codls by passing the *weights* argument, which should ameliorate bias from over-sampling if we know how much over-sampling took place. Unfortunately, this is rarely known, so cod includes a routine to consider a range of sample weights and will identify the maximum weight such that there is no longer a significant relationship between coalescent odds and a given variable (usually geographic). Here we identify all samples from the “MIA” clinical and pass these to the autoreweight function.

# Clinics associated with psi :
signifclinics <- rownames(s$coefficients)[ s$coefficients[ , 4]  < .1 ]
signifclinics <- substr(signifclinics, 7,9 )
# Subset of tips from clinics associated with psi :
reweighttips  <- md$tip[ md$Clinic %in% signifclinics ]
arw <- autoreweight( f, reweighttips, wlb = 1e-2, wub = .5, res = 5, alpha = .02 )
f = arw$fit
arw$summary

##   sampleweight            p
## 1       0.0100 1.554659e-05
## 2       0.1325 2.257137e-05
## 3       0.2550 3.269453e-05
## 4       0.3775 4.723516e-05
## 5       0.5000 6.804639e-05

Note that in some cases, the association will not disappear even if the weight is zero because lineages surrounding the given samples also have higher coalescent odds. In these cases, the relationship is more likely to be authentic. That is exactly what we see here. Even when weighting these samples at 1% (unrealistically low) there remains a significant association with coalescent odds. Consequently, this fit will be identical to the original fit.

Antibiotic resistance

Here we examine the relationship between coalescent odds and antibiotic resistance. First, we replot the coloured tree alongside resistance phenotypes.

abxs <- c( 'PEN', 'TET', 'CFX', 'CRO', 'CIP', 'AZI')
abxmat <- as.matrix(md[, abxs ] )
abxmat <- apply( abxmat, 2, function(x) (x == "2") ) # The value of '2' is coded as abx resistant
rownames( abxmat ) <- md$tip 
head(abxmat)

##              PEN   TET   CFX   CRO   CIP   AZI
## 15335_2#1   TRUE  TRUE  TRUE FALSE  TRUE FALSE
## 15335_2#10 FALSE FALSE FALSE FALSE FALSE  TRUE
## 15335_2#11 FALSE FALSE FALSE FALSE FALSE  TRUE
## 15335_2#12 FALSE  TRUE FALSE FALSE FALSE  TRUE
## 15335_2#13  TRUE  TRUE    NA FALSE  TRUE  TRUE
## 15335_2#14  TRUE  TRUE    NA FALSE FALSE  TRUE

abxmat[ is.na(abxmat) ] <- FALSE

# Try to create tree plot with heatmap, handle ggtree issues
tryCatch({
  trpl <- plot(f) + 
    ggplot2::scale_color_gradient2( low='blue'
                                  , mid = 'lightblue'
                                  , high = 'red'
                                  , midpoint = 0
                                  , limits = range(fdf$psi)
                                  , name = "ψ" )
  trpl <- ggtree::gheatmap( trpl, abxmat, colnames_position='top', colnames_offset_y = -11)
  trpl$data$label = '' # suppress tip labels 
  print(trpl)
}, error = function(e) {
  cat("Tree plot with heatmap failed due to ggtree compatibility issue:\n")
  cat(e$message, "\n")
  cat("\nAntibiotic resistance summary:\n")
  print(colSums(abxmat))
  cat("\nSamples with highest coalescent odds:\n")
  top_psi <- head(md[order(md$psi, decreasing=TRUE), c("tip", "psi", abxs)], 10)
  print(top_psi)
})

## Tree plot with heatmap failed due to ggtree compatibility issue:
## <ggtree> object properties are invalid:
## - @mapping must be <ggplot2::mapping>, not S3<data.frame> 
## 
## Antibiotic resistance summary:
## PEN TET CFX CRO CIP AZI 
## 537 661 266   8 594 232 
## 
## Samples with highest coalescent odds:
##            tip       psi PEN TET CFX CRO CIP AZI
## 492 15335_7#38 0.4262055   1   2   0   0   0   2
## 512 15335_7#57 0.4262055   1   2   0   0   0   2
## 490 15335_7#36 0.4262054   1   2   0   0   0   2
## 482 15335_7#29 0.4259646   1   2   0   0   0   2
## 488 15335_7#34 0.4259645   1   2   0   0   0   2
## 479 15335_7#26 0.4257999   1   2   0   0   0   2
## 459 15335_6#93 0.4254206   1   0   0   0   0   1
## 700 17176_1#28 0.4254206   1   0   0   0   0   1
## 702  17176_1#3 0.4253203   1   0   0   0   0   1
## 698 17176_1#26 0.4251364   1   0   0   0   0   1

It looks like higher coalescent odds are associated with AZI resistance and no other abx shows a positive effect. Let’s quantify this. Here we do a linear regression of coalescent odds on each abx and time, then infer the mean psi at the end of sampling (year 2013).

md$t  <- md$Year - min(md$Year)
psi2013 <- sapply( abxs, function(x)
{
    md1 <- md
    md1$v <- md1[[x]] == 2
    m = lm( psi ~ v*t  , data = md1 )
    predict( m, newdata= data.frame( psi = NA, v = TRUE, t = 13 ) , interval='confidence')
}) |> setNames( abxs )
ebdf <- as.data.frame( t( psi2013 ) ) 
colnames(ebdf) <- c( 'Median', '2.5%', '97.5%' )
ebdf$abx = rownames( ebdf ) 
ebdf <- ebdf[ order( ebdf$Median ) , ] 
print( ebdf )

##           Median        2.5%       97.5% abx
## CRO -0.062852959 -0.28257319  0.15686727 CRO
## PEN -0.042395359 -0.06566536 -0.01912536 PEN
## CIP -0.041630966 -0.06456462 -0.01869732 CIP
## CFX -0.005595061 -0.04700795  0.03581783 CFX
## TET -0.003377564 -0.02501186  0.01825674 TET
## AZI  0.247253178  0.21017410  0.28433226 AZI

Here the result is plotted:

peb = ggplot2::ggplot(ebdf, ggplot2::aes(x = abx, y = Median, ymin = `2.5%`, ymax = `97.5%`) ) + ggplot2::geom_errorbar(width=0) + 
ggplot2::geom_point() + 
ggplot2::labs(y = 'psi', x = '' )
peb

In fact, resistance to AZI expanded rapidly after these data were collected, from a prevalence of 0.6% in 2013 to 4.5% in 2017³. This is the most rapid growth among these antibiotics.

Phylogenetic clusters

An alternative way to look at coalescent odds is in terms of phylogenetic clusters. These are clades defined by a threshold change in coalescent odds along a lineage, from low to high values. There is some subjectivity in the choice of clustering thresholds and the best choice depends on the application, however cod provides a method based on the Calinski-Harabasz index.

chis <- chindices(f, clths = seq( .03, .4, length = 20 ), rescale=TRUE)
chis

##     threshold        CH optimal
## 1  0.03000000 1502518.1        
## 2  0.04947368  650599.0        
## 3  0.06894737  807982.9        
## 4  0.08842105  902511.2        
## 5  0.10789474  946716.4        
## 6  0.12736842 1334924.7        
## 7  0.14684211 1532793.1        
## 8  0.16631579 1641530.8     ***
## 9  0.18578947  786818.6        
## 10 0.20526316  950306.2        
## 11 0.22473684  885988.9        
## 12 0.24421053  885988.9        
## 13 0.26368421 1215831.1        
## 14 0.28315789 1215831.1        
## 15 0.30263158 1215831.1        
## 16 0.32210526 1215831.1        
## 17 0.34157895 1215831.1        
## 18 0.36105263 1215831.1        
## 19 0.38052632 1215831.1        
## 20 0.40000000 1215831.1

Here, we compute clusters using the maximum CH index. Note that if no threshold is provided, computeclusters will automatically select the optimal threshold:

# Using optimal threshold from CH index
chdf = computeclusters(f, chis$threshold[which.max(chis$CH)] )
# Alternatively, let computeclusters automatically select optimal threshold:
# chdf = computeclusters(f)  
head(chdf)

##   node  tip.label clusterid        psi   tip
## 1 1088 15335_5#48         1 -0.4209262  TRUE
## 2 1089 15335_3#80         1 -0.4311232  TRUE
## 3 1089 15335_3#80         1 -0.4311232  TRUE
## 4 2191                    1 -0.3838401 FALSE
## 5 1065 17176_1#39         2 -0.4356554  TRUE
## 6 1066  8289_2#72         2 -0.4411569  TRUE

Plotting these shows that one cluster very closely matches a clade with high levels of AZI resistance and high coalescent odds, so an alternative way to analyse these data would be to identify clusters with high coalescent odds and then characterise resistance patters within these clusters. The clusters can be visualised by running plotclusters(f, chdf).

Cluster visualization showing phylogenetic tree with heatmap of cluster assignments

# Cluster summary statistics
cat("Cluster summary:\n")

## Cluster summary:

print(table(chdf$clusterid))

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##   4  46 158 270 104  88 888 910 145   3  15 457   1  51

cat("\nCluster statistics (coalescent odds by cluster):\n")

## 
## Cluster statistics (coalescent odds by cluster):

cluster_stats <- aggregate(chdf$psi, by=list(chdf$clusterid),
                           function(x) c(mean=mean(x, na.rm=TRUE),
                                       sd=sd(x, na.rm=TRUE),
                                       n=length(x)))
names(cluster_stats) <- c("cluster", "psi_stats")
print(cluster_stats)

##    cluster psi_stats.mean  psi_stats.sd   psi_stats.n
## 1        1   -0.416753158   0.022462421   4.000000000
## 2        2   -0.416514088   0.019188764  46.000000000
## 3        3   -0.239161430   0.039221155 158.000000000
## 4        4   -0.071805631   0.012269594 270.000000000
## 5        5   -0.017465734   0.018791471 104.000000000
## 6        6   -0.013355858   0.010180204  88.000000000
## 7        7    0.315540443   0.091935636 888.000000000
## 8        8   -0.032857085   0.016487044 910.000000000
## 9        9   -0.094540079   0.021518781 145.000000000
## 10      10   -0.154503030   0.018780184   3.000000000
## 11      11   -0.216450242   0.009101256  15.000000000
## 12      12   -0.076401526   0.064003733 457.000000000
## 13      13   -0.277591516            NA   1.000000000
## 14      14   -0.297566015   0.018720376  51.000000000

Erik Volz

2025-09-25

Estimating coalescent odds

Sample weights

Antibiotic resistance

Phylogenetic clusters

References