Apply a non-stationary nucleotide model to an alignment with 3 sequences#
We load some sample data first and select just 3 sequences.
from cogent3 import get_app
loader = get_app("load_aligned", format="fasta", moltype="dna")
select_seqs = get_app("take_named_seqs", "Human", "Rhesus", "Galago")
process = loader + select_seqs
aln = process("data/primate_brca1.fasta")
aln.names
['Human', 'Rhesus', 'Galago']
We analyses these using the general Markov nucleotide, GN, model. Because we analyse just 3 sequences, there is only one possible unrooted tree, hence it is not required to specify the tree in this instance.
gn = get_app("model", "GN")
gn
model(sm='GN', tree=None, unique_trees=False, tree_func=None, name=None,
optimise_motif_probs=False, sm_args=None, lf_args=None, time_het=None,
param_rules=None, opt_args=None, lower=1e-06, upper=50, split_codons=False,
show_progress=False, verbose=False)
We apply this to aln
.
fitted = gn(aln)
type(fitted)
cogent3.app.result.model_result
model_result
#
As the output above indicates, fitted
is a model_result
object.
This object provides an interface for accessing attributes of a fitted model. The representation display (below), a styled table in a jupyter notebook, presents a summary view with the log-likelihood (lnL
), number of free parameters (nfp
) and whether all matrices satisfied the identifiability conditions diagonal largest in column (DLC) and a unique mapping of Q to P. (For description of these quantities and why they matter see Chang 1996 and Kaehler et al.)
model_result
has dictionary behaviour, hence the key
column. This will be demonstrated below.
fitted
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'GN' | -5964.2583 | 14 | True | True |
More detail on the fitted model are available via attributes. For instance, display the maximum likelihood estimates via the likelihood function attribute
fitted.lf
GN
log-likelihood = -5964.2583
number of free parameters = 14
A>C | A>G | A>T | C>A | C>G | C>T | G>A | G>C | G>T | T>A |
---|---|---|---|---|---|---|---|---|---|
1.0635 | 3.1945 | 1.0209 | 1.7944 | 2.3266 | 5.6766 | 9.0612 | 1.1126 | 0.8312 | 1.5001 |
T>C |
---|
3.5616 |
edge | parent | length |
---|---|---|
Human | root | 0.0213 |
Rhesus | root | 0.0207 |
Galago | root | 0.1773 |
A | C | G | T |
---|---|---|---|
0.3793 | 0.1734 | 0.2058 | 0.2415 |
fitted.lnL, fitted.nfp
(-5964.258307148185, 14)
fitted.source
'primate_brca1.fasta'
The model_result.tree
attribute is an “annotated tree”. Maximum likelihood estimates from the model have been assigned to the tree. Of particular significance, the “length” attribute corresponds to the expected number of substitutions (or ENS). For a non-stationary model, like GN, this can be different to the conventional length (Kaehler et al).
fitted.tree, fitted.alignment
(Tree("(Human,Rhesus,Galago)root;"),
3 x 2814 dna alignment: Human[TGTGGCACAA...], Rhesus[TGTGGCACAA...], Galago[TGTGGCAAAA...])
We can access the sum of all branch lengths. Either as “ENS” or “paralinear” using the total_length()
method.
fitted.total_length(length_as="paralinear")
np.float64(0.9292253997823097)
Fitting a separate nucleotide model to each codon position#
Controlled by setting split_codons=True
.
gn = get_app("model", "GN", split_codons=True)
fitted = gn(aln)
fitted
key | lnL | nfp | DLC | unique_Q |
---|---|---|---|---|
'' | -5867.4461 | 42 | True | True |
1 | -1955.7571 | 14 | ||
2 | -1934.2690 | 14 | ||
3 | -1977.4200 | 14 |
The model fit statistics, lnL
and nfp
are now sums of the corresponding values from the fits to the individual positions. The DLC
and unique_Q
are also a summary across all models. These only achieve the value True
when all matrices, from all models, satisfy the condition.
We get access to the likelihood functions of the individual positions via the indicated dict keys.
fitted[3]
LF id: 3
log-likelihood = -1977.4200
number of free parameters = 14
A>C | A>G | A>T | C>A | C>G | C>T | G>A | G>C | G>T |
---|---|---|---|---|---|---|---|---|
1.7529 | 3.4660 | 2.4151 | 2.3739 | 3.4119 | 16.1259 | 16.8500 | 2.0221 | 1.7349 |
T>A | T>C |
---|---|
1.9255 | 5.2916 |
edge | parent | length |
---|---|---|
Human | root | 0.0239 |
Rhesus | root | 0.0316 |
Galago | root | 0.1777 |
A | C | G | T |
---|---|---|---|
0.3443 | 0.1382 | 0.1567 | 0.3607 |