Using codon models#

Section author: Gavin Huttley

The basic paradigm for evolutionary modelling is:

  1. construct the codon substitution model

  2. constructing likelihood function

  3. modify likelihood function (setting rules)

  4. providing the alignment(s)

  5. optimisation

  6. get results out

Note

In the following, a result followed by ‘…’ just means the output has been truncated for the sake of a succinct presentation.

Constructing the codon substitution model#

For the time-reversible category, Cogent3 implements 4 basic rate matrices: i) NF models, these are nucleotide frequency weighted rate matrices and were initially described by Muse and Gaut (1994); ii) a variant of (i) where position specific nucleotide frequencies are used; iii) TF models, these are tuple (codon in this case) frequency weighted rate matrices and were initially described by Goldman and Yang (1994); iv) CNF, these use the conditional nucleotide frequency and have developed by Yap, Lindsay, Easteal and Huttley. These different models can be created using provided convenience functions which will be the case here, or specified by directly calling the TimeReversibleCodon substitution model class and setting the argument mprob_model equal to:

  • NF, mprob_model='monomer'

  • NF with position specific nucleotide frequencies, mprob_model='monomers'

  • TF, mprob_model=None

  • CNF, mprob_model='conditional'

In the following I will construct GTR variants of i and iv and a HKY variant of iii.

We import these explicitly from the cogent3.evolve.models module.

from cogent3.evolve.models import get_model

These are functions and calling them returns the indicated substitution model with default behaviour of recoding gap characters into N’s.

tf = get_model("GY94")
nf = get_model("MG94GTR")
cnf = get_model("CNFGTR")

In the following demonstration I will use only the CNF form (cnf).

For our example we load a sample alignment and tree as per usual. To reduce the computational overhead for this example we will limit the number of sampled taxa.

from cogent3 import load_aligned_seqs, load_tree

aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
tree = load_tree("data/primate_brca1.tree")

Standard test of neutrality#

We construct a likelihood function and constrain omega parameter (the ratio of nonsynonymous to synonymous substitutions) to equal 1. We also set some display formatting parameters.

lf = cnf.make_likelihood_function(tree, digits=2, space=3)
lf.set_param_rule("omega", is_constant=True, value=1.0)

We then provide an alignment and optimise the model. In the current case we just use the local optimiser (hiding progress to keep this document succinct). We then print(the results.)

Note

I’m going to specify a set of conditions that will be used for all optimiser steps. For those new to python, one can construct a dictionary with the following form {'argument_name': argument_value}, or alternatively dict(argument_name=argument_value). I’m doing the latter. This dictionary is then passed to functions/methods by prefacing it with **.

optimiser_args = dict(
    local=True, max_restarts=5, tolerance=1e-8, show_progress=False
)
lf.set_alignment(aln)
lf.optimise(**optimiser_args)
lf

CNFGTR

log-likelihood = -6767.0980

number of free parameters = 16

Global params
A/CA/GA/TC/GC/Tomega
1.104.070.841.954.581.00
Edge params
edgeparentlength
Galagoroot0.53
HowlerMonroot0.14
Rhesusedge.30.07
Orangutanedge.20.02
Gorillaedge.10.01
Humanedge.00.02
Chimpanzeeedge.00.01
edge.0edge.10.00
edge.1edge.20.01
edge.2edge.30.04
edge.3root0.02
Motif params
AAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATA
0.060.020.030.060.020.000.000.030.020.030.010.040.02
continuation
ATCATGATTCAACACCAGCATCCACCCCCGCCTCGACGC
0.010.010.020.020.010.020.020.020.010.000.030.000.00
continuation
CGGCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCG
0.000.000.010.010.010.010.080.010.030.030.020.010.00
continuation
GCTGGAGGCGGGGGTGTAGTCGTGGTTTACTATTCATCC
0.010.020.010.010.010.010.010.010.020.000.010.020.01
continuation
TCGTCTTGCTGGTGTTTATTCTTGTTT
0.000.030.000.000.020.020.010.010.02

In the above output, the first table shows the maximum likelihood estimates (MLEs) for the substitution model parameters that are ‘global’ in scope. For instance, the C/T=4.58 MLE indicates that the relative rate of substitutions between C and T is nearly 5 times the background substitution rate.

The above function has been fit using the default counting procedure for estimating the motif frequencies, i.e. codon frequencies are estimated as the average of the observed codon frequencies. If you wanted to numerically optimise the motif probabilities, then modify the likelihood function creation line to

lf = cnf.make_likelihood_function(tree, optimise_motif_probs=True)

We can then free up the omega parameter, but before we do that we’ll store the log-likelihood and number of free parameters for the current model form for reuse later.

neutral_lnL = lf.get_log_likelihood()
neutral_nfp = lf.get_num_free_params()
lf.set_param_rule("omega", is_constant=False)
lf.optimise(**optimiser_args)
non_neutral_lnL = lf.get_log_likelihood()
non_neutral_nfp = lf.get_num_free_params()
lf

CNFGTR

log-likelihood = -6762.5761

number of free parameters = 17

Global params
A/CA/GA/TC/GC/Tomega
1.083.860.781.964.080.75
Edge params
edgeparentlength
Galagoroot0.53
HowlerMonroot0.14
Rhesusedge.30.07
Orangutanedge.20.02
Gorillaedge.10.01
Humanedge.00.02
Chimpanzeeedge.00.01
edge.0edge.10.00
edge.1edge.20.01
edge.2edge.30.03
edge.3root0.02
Motif params
AAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATA
0.060.020.030.060.020.000.000.030.020.030.010.040.02
continuation
ATCATGATTCAACACCAGCATCCACCCCCGCCTCGACGC
0.010.010.020.020.010.020.020.020.010.000.030.000.00
continuation
CGGCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCG
0.000.000.010.010.010.010.080.010.030.030.020.010.00
continuation
GCTGGAGGCGGGGGTGTAGTCGTGGTTTACTATTCATCC
0.010.020.010.010.010.010.010.010.020.000.010.020.01
continuation
TCGTCTTGCTGGTGTTTATTCTTGTTT
0.000.030.000.000.020.020.010.010.02

We then conduct a likelihood ratio test whether the MLE of omega significantly improves the fit over the constraint it equals 1. We import the convenience function from the cogent3 stats module.

from scipy.stats.distributions import chi2

LR = 2 * (non_neutral_lnL - neutral_lnL)
df = non_neutral_nfp - neutral_nfp
print(chi2.sf(LR, df))
0.0026357631760662533

Not surprisingly, this is significant. We then ask whether the Human and Chimpanzee edges have a value of omega that is significantly different from the rest of the tree.

lf.set_param_rule(
    "omega", tip_names=["Chimpanzee", "Human"], outgroup_name="Galago", clade=True
)
lf.optimise(**optimiser_args)
lf
chimp_human_clade_lnL = lf.get_log_likelihood()
chimp_human_clade_nfp = lf.get_num_free_params()
LR = 2 * (chimp_human_clade_lnL - non_neutral_lnL)
df = chimp_human_clade_nfp - non_neutral_nfp
print(chi2.sf(LR, df))
0.02842632490313894

This is basically a replication of the original Huttley et al (2000) result for BRCA1.

Rate-heterogeneity model variants#

It is also possible to specify rate-heterogeneity variants of these models. In the first instance we’ll create a likelihood function where these rate-classes are global across the entire tree. Because fitting these models can be time consuming I’m going to recreate the non-neutral likelihood function from above first, fit it, and then construct the rate-heterogeneity likelihood function. By doing this I can ensure that the richer model starts with parameter values that produce a log-likelihood the same as the null model, ensuring the subsequent optimisation step improves the likelihood over the null.

lf = cnf.make_likelihood_function(tree, digits=2, space=3)
lf.set_alignment(aln)
lf.optimise(**optimiser_args)
non_neutral_lnL = lf.get_log_likelihood()
non_neutral_nfp = lf.get_num_free_params()

Now, we have a null model which we know (from having fit it above) has a MLE < 1. We will construct a rate-heterogeneity model with just 2 rate-classes (neutral and adaptive) that are separated by the boundary of omega=1. These rate-classes are specified as discrete bins in Cogent3 and the model configuration steps for a bin or bins are done using the set_param_rule method. To ensure the alternate model starts with a likelihood at least as good as the previous we need to make the probability of the neutral site-class bin ~= 1 (these are referenced by the bprobs parameter type) and assign the null model omega MLE to this class.

To get all the parameter MLEs (branch lengths, GTR terms, etc ..) into the alternate model we get an annotated tree from the null model which will have these values associated with it.

annot_tree = lf.get_annotated_tree()
omega_mle = lf.get_param_value("omega")

We can then construct a new likelihood function, specifying the rate-class properties.

rate_lf = cnf.make_likelihood_function(
    annot_tree, bins=["neutral", "adaptive"], digits=2, space=3
)

We define a very small value (epsilon) that is used to specify the starting values.

epsilon = 1e-6

We now provide starting parameter values for omega for the two bins, setting the boundary

rate_lf.set_param_rule("omega", bin="neutral", upper=1, init=omega_mle)
rate_lf.set_param_rule(
    "omega", bin="adaptive", lower=1 + epsilon, upper=100, init=1 + 2 * epsilon
)

and provide the starting values for the bin probabilities (bprobs).

rate_lf.set_param_rule("bprobs", init=[1 - epsilon, epsilon])

The above statement essentially assigns a probability of nearly 1 to the ‘neutral’ bin. We now set the alignment and fit the model.

rate_lf.set_alignment(aln)
rate_lf.optimise(**optimiser_args)
rate_lnL = rate_lf.get_log_likelihood()
rate_nfp = rate_lf.get_num_free_params()
LR = 2 * (rate_lnL - non_neutral_lnL)
df = rate_nfp - non_neutral_nfp
rate_lf

CNFGTR

log-likelihood = -6755.4520

number of free parameters = 19

Global params
A/CA/GA/TC/GC/T
1.073.960.781.964.20
Bin params
binbprobsomega
neutral0.140.01
adaptive0.861.17
Edge params
edgeparentlength
Galagoroot0.56
HowlerMonroot0.14
Rhesusedge.30.07
Orangutanedge.20.02
Gorillaedge.10.01
Humanedge.00.02
Chimpanzeeedge.00.01
edge.0edge.10.00
edge.1edge.20.01
edge.2edge.30.03
edge.3root0.02
Motif params
AAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATA
0.060.020.030.060.020.000.000.030.020.030.010.040.02
continuation
ATCATGATTCAACACCAGCATCCACCCCCGCCTCGACGC
0.010.010.020.020.010.020.020.020.010.000.030.000.00
continuation
CGGCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCG
0.000.000.010.010.010.010.080.010.030.030.020.010.00
continuation
GCTGGAGGCGGGGGTGTAGTCGTGGTTTACTATTCATCC
0.010.020.010.010.010.010.010.010.020.000.010.020.01
continuation
TCGTCTTGCTGGTGTTTATTCTTGTTT
0.000.030.000.000.020.020.010.010.02
print(chi2.sf(LR, df))
0.0008054431261849434

We can get the posterior probabilities of site-classifications out of this model as

pp = rate_lf.get_bin_probs()

This is a DictArray class which stores the probabilities as a numpy.array.

Mixing branch and site-heterogeneity#

The following implements a modification of the approach of Zhang, Nielsen and Yang (Mol Biol Evol, 22:2472–9, 2005). For this model class, there are groups of branches for which all positions are evolving neutrally but some proportion of those neutrally evolving sites change to adaptively evolving on so-called foreground edges. For the current example, we’ll define the Chimpanzee and Human branches as foreground and everything else as background. The following table defines the parameter scopes.

<IPython.core.display.HTML object>

Note

Our implementation is not as parametrically succinct as that of Zhang et al, we have 1 additional bin probability.

After Zhang et al, we first define a null model that has 2 rate classes ‘0’ and ‘1’. We also get all the MLEs out using get_statistics, just printing out the bin parameters table in the current case.

rate_lf = cnf.make_likelihood_function(tree, bins=["0", "1"], digits=2, space=3)
rate_lf.set_param_rule("omega", bin="0", upper=1.0 - epsilon, init=1 - epsilon)
rate_lf.set_param_rule("omega", bins="1", is_constant=True, value=1.0)
rate_lf.set_alignment(aln)
rate_lf.optimise(**optimiser_args)
tables = rate_lf.get_statistics(with_titles=True)
for table in tables:
    if "bin" in table.title:
        print(table)
bin params
====================
bin   bprobs   omega
--------------------
0       0.11   0.00 
1       0.89   1.00 
--------------------

We’re also going to use the MLEs from the rate_lf model, since that nests within the more complex branch by rate-class model. This is unfortunately quite ugly compared with just using the annotated tree approach described above. It is currently necessary, however, due to a bug in constructing annotated trees for models with binned parameters.

globals = [t for t in tables if "global" in t.title][0]
globals = dict(zip(globals.header, globals.to_list()[0]))
bin_params = [t for t in tables if "bin" in t.title][0]
rate_class_omegas = dict(bin_params.to_list(["bin", "omega"]))
rate_class_probs = dict(bin_params.to_list(["bin", "bprobs"]))
lengths = [t for t in tables if "edge" in t.title][0]
lengths = dict(lengths.to_list(["edge", "length"]))

We now create the more complex model,

rate_branch_lf = cnf.make_likelihood_function(
    tree, bins=["0", "1", "2a", "2b"], digits=2, space=3
)

and set from the nested null model the branch lengths,

for branch, length in lengths.items():
    rate_branch_lf.set_param_rule("length", edge=branch, init=length)

GTR term MLES,

for param, mle in globals.items():
    rate_branch_lf.set_param_rule(param, init=mle)

binned parameter values,

rate_branch_lf.set_param_rule(
    "omega", bins=["0", "2a"], upper=1.0, init=rate_class_omegas["0"]
)
rate_branch_lf.set_param_rule(
    "omega", bins=["1", "2b"], is_constant=True, value=1.0
)
rate_branch_lf.set_param_rule(
    "omega",
    bins=["2a", "2b"],
    edges=["Chimpanzee", "Human"],
    init=99,
    lower=1.0,
    upper=100.0,
    is_constant=False,
)

and the bin probabilities.

rate_branch_lf.set_param_rule(
    "bprobs",
    init=[
        rate_class_probs["0"] - epsilon,
        rate_class_probs["1"] - epsilon,
        epsilon,
        epsilon,
    ],
)

The result of these steps is to create a rate/branch model with initial parameter values that result in likelihood the same as the null.

rate_branch_lf.set_alignment(aln)
rate_branch_lf.optimise(**optimiser_args)
print(rate_branch_lf)
Likelihood function statistics
log-likelihood = -6753.4561
number of free parameters = 21
=========================
      edge   bin    omega
-------------------------
    Galago     0     0.00
    Galago     1     1.00
    Galago    2a     0.00
    Galago    2b     1.00
 HowlerMon     0     0.00
 HowlerMon     1     1.00
 HowlerMon    2a     0.00
 HowlerMon    2b     1.00
    Rhesus     0     0.00
    Rhesus     1     1.00
    ...