Using codon models#
Section author: Gavin Huttley
The basic paradigm for evolutionary modelling is:
construct the codon substitution model
constructing likelihood function
modify likelihood function (setting rules)
providing the alignment(s)
optimisation
get results out
Note
In the following, a result followed by ‘…’ just means the output has been truncated for the sake of a succinct presentation.
Constructing the codon substitution model#
For the time-reversible category, Cogent3 implements 4 basic rate matrices: i) NF models, these are nucleotide frequency weighted rate matrices and were initially described by Muse and Gaut (1994); ii) a variant of (i) where position specific nucleotide frequencies are used; iii) TF models, these are tuple (codon in this case) frequency weighted rate matrices and were initially described by Goldman and Yang (1994); iv) CNF, these use the conditional nucleotide frequency and have developed by Yap, Lindsay, Easteal and Huttley. These different models can be created using provided convenience functions which will be the case here, or specified by directly calling the TimeReversibleCodon
substitution model class and setting the argument mprob_model
equal to:
NF,
mprob_model='monomer'
NF with position specific nucleotide frequencies,
mprob_model='monomers'
TF,
mprob_model=None
CNF,
mprob_model='conditional'
In the following I will construct GTR variants of i and iv and a HKY variant of iii.
We import these explicitly from the cogent3.evolve.models
module.
from cogent3.evolve.models import get_model
These are functions and calling them returns the indicated substitution model with default behaviour of recoding gap characters into N’s.
tf = get_model("GY94")
nf = get_model("MG94GTR")
cnf = get_model("CNFGTR")
In the following demonstration I will use only the CNF form (cnf
).
For our example we load a sample alignment and tree as per usual. To reduce the computational overhead for this example we will limit the number of sampled taxa.
from cogent3 import load_aligned_seqs, load_tree
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
tree = load_tree("data/primate_brca1.tree")
Standard test of neutrality#
We construct a likelihood function and constrain omega parameter (the ratio of nonsynonymous to synonymous substitutions) to equal 1. We also set some display formatting parameters.
lf = cnf.make_likelihood_function(tree, digits=2, space=3)
lf.set_param_rule("omega", is_constant=True, value=1.0)
We then provide an alignment and optimise the model. In the current case we just use the local optimiser (hiding progress to keep this document succinct). We then print(the results.)
Note
I’m going to specify a set of conditions that will be used for all optimiser steps. For those new to python, one can construct a dictionary with the following form {'argument_name': argument_value}
, or alternatively dict(argument_name=argument_value)
. I’m doing the latter. This dictionary is then passed to functions/methods by prefacing it with **
.
optimiser_args = dict(
local=True, max_restarts=5, tolerance=1e-8, show_progress=False
)
lf.set_alignment(aln)
lf.optimise(**optimiser_args)
lf
CNFGTR
log-likelihood = -6767.0980
number of free parameters = 16
A/C | A/G | A/T | C/G | C/T | omega |
---|---|---|---|---|---|
1.10 | 4.07 | 0.84 | 1.95 | 4.58 | 1.00 |
edge | parent | length |
---|---|---|
Galago | root | 0.53 |
HowlerMon | root | 0.14 |
Rhesus | edge.3 | 0.07 |
Orangutan | edge.2 | 0.02 |
Gorilla | edge.1 | 0.01 |
Human | edge.0 | 0.02 |
Chimpanzee | edge.0 | 0.01 |
edge.0 | edge.1 | 0.00 |
edge.1 | edge.2 | 0.01 |
edge.2 | edge.3 | 0.04 |
edge.3 | root | 0.02 |
AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | AGG | AGT | ATA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.06 | 0.02 | 0.03 | 0.06 | 0.02 | 0.00 | 0.00 | 0.03 | 0.02 | 0.03 | 0.01 | 0.04 | 0.02 |
ATC | ATG | ATT | CAA | CAC | CAG | CAT | CCA | CCC | CCG | CCT | CGA | CGC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01 | 0.00 | 0.03 | 0.00 | 0.00 |
CGG | CGT | CTA | CTC | CTG | CTT | GAA | GAC | GAG | GAT | GCA | GCC | GCG |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.08 | 0.01 | 0.03 | 0.03 | 0.02 | 0.01 | 0.00 |
GCT | GGA | GGC | GGG | GGT | GTA | GTC | GTG | GTT | TAC | TAT | TCA | TCC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 |
TCG | TCT | TGC | TGG | TGT | TTA | TTC | TTG | TTT |
---|---|---|---|---|---|---|---|---|
0.00 | 0.03 | 0.00 | 0.00 | 0.02 | 0.02 | 0.01 | 0.01 | 0.02 |
In the above output, the first table shows the maximum likelihood estimates (MLEs) for the substitution model parameters that are ‘global’ in scope. For instance, the C/T=4.58
MLE indicates that the relative rate of substitutions between C and T is nearly 5 times the background substitution rate.
The above function has been fit using the default counting procedure for estimating the motif frequencies, i.e. codon frequencies are estimated as the average of the observed codon frequencies. If you wanted to numerically optimise the motif probabilities, then modify the likelihood function creation line to
lf = cnf.make_likelihood_function(tree, optimise_motif_probs=True)
We can then free up the omega parameter, but before we do that we’ll store the log-likelihood and number of free parameters for the current model form for reuse later.
neutral_lnL = lf.get_log_likelihood()
neutral_nfp = lf.get_num_free_params()
lf.set_param_rule("omega", is_constant=False)
lf.optimise(**optimiser_args)
non_neutral_lnL = lf.get_log_likelihood()
non_neutral_nfp = lf.get_num_free_params()
lf
CNFGTR
log-likelihood = -6762.5761
number of free parameters = 17
A/C | A/G | A/T | C/G | C/T | omega |
---|---|---|---|---|---|
1.08 | 3.86 | 0.78 | 1.96 | 4.08 | 0.75 |
edge | parent | length |
---|---|---|
Galago | root | 0.53 |
HowlerMon | root | 0.14 |
Rhesus | edge.3 | 0.07 |
Orangutan | edge.2 | 0.02 |
Gorilla | edge.1 | 0.01 |
Human | edge.0 | 0.02 |
Chimpanzee | edge.0 | 0.01 |
edge.0 | edge.1 | 0.00 |
edge.1 | edge.2 | 0.01 |
edge.2 | edge.3 | 0.03 |
edge.3 | root | 0.02 |
AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | AGG | AGT | ATA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.06 | 0.02 | 0.03 | 0.06 | 0.02 | 0.00 | 0.00 | 0.03 | 0.02 | 0.03 | 0.01 | 0.04 | 0.02 |
ATC | ATG | ATT | CAA | CAC | CAG | CAT | CCA | CCC | CCG | CCT | CGA | CGC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01 | 0.00 | 0.03 | 0.00 | 0.00 |
CGG | CGT | CTA | CTC | CTG | CTT | GAA | GAC | GAG | GAT | GCA | GCC | GCG |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.08 | 0.01 | 0.03 | 0.03 | 0.02 | 0.01 | 0.00 |
GCT | GGA | GGC | GGG | GGT | GTA | GTC | GTG | GTT | TAC | TAT | TCA | TCC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 |
TCG | TCT | TGC | TGG | TGT | TTA | TTC | TTG | TTT |
---|---|---|---|---|---|---|---|---|
0.00 | 0.03 | 0.00 | 0.00 | 0.02 | 0.02 | 0.01 | 0.01 | 0.02 |
We then conduct a likelihood ratio test whether the MLE of omega significantly improves the fit over the constraint it equals 1. We import the convenience function from the cogent3
stats module.
from scipy.stats.distributions import chi2
LR = 2 * (non_neutral_lnL - neutral_lnL)
df = non_neutral_nfp - neutral_nfp
print(chi2.sf(LR, df))
0.0026357631760662533
Not surprisingly, this is significant. We then ask whether the Human and Chimpanzee edges have a value of omega that is significantly different from the rest of the tree.
lf.set_param_rule(
"omega", tip_names=["Chimpanzee", "Human"], outgroup_name="Galago", clade=True
)
lf.optimise(**optimiser_args)
lf
chimp_human_clade_lnL = lf.get_log_likelihood()
chimp_human_clade_nfp = lf.get_num_free_params()
LR = 2 * (chimp_human_clade_lnL - non_neutral_lnL)
df = chimp_human_clade_nfp - non_neutral_nfp
print(chi2.sf(LR, df))
0.02842632490313894
This is basically a replication of the original Huttley et al (2000) result for BRCA1.
Rate-heterogeneity model variants#
It is also possible to specify rate-heterogeneity variants of these models. In the first instance we’ll create a likelihood function where these rate-classes are global across the entire tree. Because fitting these models can be time consuming I’m going to recreate the non-neutral likelihood function from above first, fit it, and then construct the rate-heterogeneity likelihood function. By doing this I can ensure that the richer model starts with parameter values that produce a log-likelihood the same as the null model, ensuring the subsequent optimisation step improves the likelihood over the null.
lf = cnf.make_likelihood_function(tree, digits=2, space=3)
lf.set_alignment(aln)
lf.optimise(**optimiser_args)
non_neutral_lnL = lf.get_log_likelihood()
non_neutral_nfp = lf.get_num_free_params()
Now, we have a null model which we know (from having fit it above) has a MLE < 1. We will construct a rate-heterogeneity model with just 2 rate-classes (neutral and adaptive) that are separated by the boundary of omega=1. These rate-classes are specified as discrete bins in Cogent3 and the model configuration steps for a bin or bins are done using the set_param_rule
method. To ensure the alternate model starts with a likelihood at least as good as the previous we need to make the probability of the neutral site-class bin ~= 1 (these are referenced by the bprobs
parameter type) and assign the null model omega MLE to this class.
To get all the parameter MLEs (branch lengths, GTR terms, etc ..) into the alternate model we get an annotated tree from the null model which will have these values associated with it.
annot_tree = lf.get_annotated_tree()
omega_mle = lf.get_param_value("omega")
We can then construct a new likelihood function, specifying the rate-class properties.
rate_lf = cnf.make_likelihood_function(
annot_tree, bins=["neutral", "adaptive"], digits=2, space=3
)
We define a very small value (epsilon
) that is used to specify the starting values.
epsilon = 1e-6
We now provide starting parameter values for omega
for the two bins, setting the boundary
rate_lf.set_param_rule("omega", bin="neutral", upper=1, init=omega_mle)
rate_lf.set_param_rule(
"omega", bin="adaptive", lower=1 + epsilon, upper=100, init=1 + 2 * epsilon
)
and provide the starting values for the bin probabilities (bprobs
).
rate_lf.set_param_rule("bprobs", init=[1 - epsilon, epsilon])
The above statement essentially assigns a probability of nearly 1 to the ‘neutral’ bin. We now set the alignment and fit the model.
rate_lf.set_alignment(aln)
rate_lf.optimise(**optimiser_args)
rate_lnL = rate_lf.get_log_likelihood()
rate_nfp = rate_lf.get_num_free_params()
LR = 2 * (rate_lnL - non_neutral_lnL)
df = rate_nfp - non_neutral_nfp
rate_lf
CNFGTR
log-likelihood = -6755.4520
number of free parameters = 19
A/C | A/G | A/T | C/G | C/T |
---|---|---|---|---|
1.07 | 3.96 | 0.78 | 1.96 | 4.20 |
bin | bprobs | omega |
---|---|---|
neutral | 0.14 | 0.01 |
adaptive | 0.86 | 1.17 |
edge | parent | length |
---|---|---|
Galago | root | 0.56 |
HowlerMon | root | 0.14 |
Rhesus | edge.3 | 0.07 |
Orangutan | edge.2 | 0.02 |
Gorilla | edge.1 | 0.01 |
Human | edge.0 | 0.02 |
Chimpanzee | edge.0 | 0.01 |
edge.0 | edge.1 | 0.00 |
edge.1 | edge.2 | 0.01 |
edge.2 | edge.3 | 0.03 |
edge.3 | root | 0.02 |
AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | AGG | AGT | ATA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.06 | 0.02 | 0.03 | 0.06 | 0.02 | 0.00 | 0.00 | 0.03 | 0.02 | 0.03 | 0.01 | 0.04 | 0.02 |
ATC | ATG | ATT | CAA | CAC | CAG | CAT | CCA | CCC | CCG | CCT | CGA | CGC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.02 | 0.02 | 0.02 | 0.01 | 0.00 | 0.03 | 0.00 | 0.00 |
CGG | CGT | CTA | CTC | CTG | CTT | GAA | GAC | GAG | GAT | GCA | GCC | GCG |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.08 | 0.01 | 0.03 | 0.03 | 0.02 | 0.01 | 0.00 |
GCT | GGA | GGC | GGG | GGT | GTA | GTC | GTG | GTT | TAC | TAT | TCA | TCC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 |
TCG | TCT | TGC | TGG | TGT | TTA | TTC | TTG | TTT |
---|---|---|---|---|---|---|---|---|
0.00 | 0.03 | 0.00 | 0.00 | 0.02 | 0.02 | 0.01 | 0.01 | 0.02 |
print(chi2.sf(LR, df))
0.0008054431261849434
We can get the posterior probabilities of site-classifications out of this model as
pp = rate_lf.get_bin_probs()
This is a DictArray
class which stores the probabilities as a numpy.array
.
Mixing branch and site-heterogeneity#
The following implements a modification of the approach of Zhang, Nielsen and Yang (Mol Biol Evol, 22:2472–9, 2005). For this model class, there are groups of branches for which all positions are evolving neutrally but some proportion of those neutrally evolving sites change to adaptively evolving on so-called foreground edges. For the current example, we’ll define the Chimpanzee and Human branches as foreground and everything else as background. The following table defines the parameter scopes.
<IPython.core.display.HTML object>
Note
Our implementation is not as parametrically succinct as that of Zhang et al, we have 1 additional bin probability.
After Zhang et al, we first define a null model that has 2 rate classes ‘0’ and ‘1’. We also get all the MLEs out using get_statistics
, just printing out the bin parameters table in the current case.
rate_lf = cnf.make_likelihood_function(tree, bins=["0", "1"], digits=2, space=3)
rate_lf.set_param_rule("omega", bin="0", upper=1.0 - epsilon, init=1 - epsilon)
rate_lf.set_param_rule("omega", bins="1", is_constant=True, value=1.0)
rate_lf.set_alignment(aln)
rate_lf.optimise(**optimiser_args)
tables = rate_lf.get_statistics(with_titles=True)
for table in tables:
if "bin" in table.title:
print(table)
bin params
====================
bin bprobs omega
--------------------
0 0.11 0.00
1 0.89 1.00
--------------------
We’re also going to use the MLEs from the rate_lf
model, since that nests within the more complex branch by rate-class model. This is unfortunately quite ugly compared with just using the annotated tree approach described above. It is currently necessary, however, due to a bug in constructing annotated trees for models with binned parameters.
globals = [t for t in tables if "global" in t.title][0]
globals = dict(zip(globals.header, globals.to_list()[0]))
bin_params = [t for t in tables if "bin" in t.title][0]
rate_class_omegas = dict(bin_params.to_list(["bin", "omega"]))
rate_class_probs = dict(bin_params.to_list(["bin", "bprobs"]))
lengths = [t for t in tables if "edge" in t.title][0]
lengths = dict(lengths.to_list(["edge", "length"]))
We now create the more complex model,
rate_branch_lf = cnf.make_likelihood_function(
tree, bins=["0", "1", "2a", "2b"], digits=2, space=3
)
and set from the nested null model the branch lengths,
for branch, length in lengths.items():
rate_branch_lf.set_param_rule("length", edge=branch, init=length)
GTR term MLES,
for param, mle in globals.items():
rate_branch_lf.set_param_rule(param, init=mle)
binned parameter values,
rate_branch_lf.set_param_rule(
"omega", bins=["0", "2a"], upper=1.0, init=rate_class_omegas["0"]
)
rate_branch_lf.set_param_rule(
"omega", bins=["1", "2b"], is_constant=True, value=1.0
)
rate_branch_lf.set_param_rule(
"omega",
bins=["2a", "2b"],
edges=["Chimpanzee", "Human"],
init=99,
lower=1.0,
upper=100.0,
is_constant=False,
)
and the bin probabilities.
rate_branch_lf.set_param_rule(
"bprobs",
init=[
rate_class_probs["0"] - epsilon,
rate_class_probs["1"] - epsilon,
epsilon,
epsilon,
],
)
The result of these steps is to create a rate/branch model with initial parameter values that result in likelihood the same as the null.
rate_branch_lf.set_alignment(aln)
rate_branch_lf.optimise(**optimiser_args)
print(rate_branch_lf)
Likelihood function statistics
log-likelihood = -6753.4561
number of free parameters = 21
=========================
edge bin omega
-------------------------
Galago 0 0.00
Galago 1 1.00
Galago 2a 0.00
Galago 2b 1.00
HowlerMon 0 0.00
HowlerMon 1 1.00
HowlerMon 2a 0.00
HowlerMon 2b 1.00
Rhesus 0 0.00
Rhesus 1 1.00
...