Phylogenetic reconstruction by least squares#

Section author: Gavin Huttley

We will load some pre-computed pairwise distance data. To see how that data was computed see the Calculate pairwise distances between sequences example. That data is saved in a format called pickle which is native to python. As per usual, we import the basic components we need.

import pickle

from cogent3.phylo import least_squares

Now load the distance data.

with open("dists_for_phylo.pickle", "rb") as f:
    dists = pickle.load(f)

If there are extremely small distances, they can cause an error in the least squares calculation. Since such estimates are between extremely closely related sequences we could simply drop all distances for one of the sequences. We won’t do that here, we’ll leave that as exercise.

We make the ls calculator.

ls = least_squares.WLS(dists)

We also add tests that pass in a cogent3.evolve.fast_distance.DistanceMatrix and a pairwise distance dict.

ls_distance_matrix = least_squares.WLS(DistanceMatrix(dists))
ls_pairwise_matrix = least_squares.WLS(dists.to_dict())

We will search tree space for the collection of best trees using the advanced stepwise addition algorithm (hereafter asaa).

Look for the single best tree#

In this use case we are after just 1 tree. We specify up to what taxa size all possible trees for the sample will be computed. Here we are specifying a=5. This means 5 sequences will be picked randomly and all possible trees relating them will be evaluated. k=1 means only the best tree will be kept at the end of each such round of evaluation. For every remaining sequence it is grafted onto every possible branch of this tree. The best k results are then taken to the next round, when another sequence is randomly selected for addition. This proceeds until all sequences have been added. The result with following arguments is a single wls score and a single Tree which can be saved etc ..

score, tree = ls.trex(a=5, k=1, show_progress=False)
assert score < 1e-4

We won’t display this tree, because we are doing more below.

Assessing the fit for a pre-specified tree topology#

In some instances we may have a tree from the literature or elsewhere whose fit to the data we seek to evaluate. In this case I’m going load a tree as follows.

from cogent3 import make_tree

query_tree = make_tree(
    "((Human:.2,DogFaced:.2):.3,(NineBande:.1, Mouse:.5):.2,HowlerMon:.1)"
)

We now just use the ls object created above. The following evaluates the query using it’s associated branch lengths, returning only the wls statistic.

ls.evaluate_tree(query_tree)
np.float64(2.893810286786861)

We can also evaluate just the tree’s topology, returning both the wls statistic and the tree with best fit branch lengths.

wls, t = ls.evaluate_topology(query_tree)
assert "%.4f" % wls == "0.0084"

Using maximum likelihood for measuring tree fit#

This is a much slower algorithm and the interface largely mirrors that for the above. The difference is you import maximum_likelihood instead of least_squares, and use the ML instead of WLS classes. The ML class requires a substitution model (like a HKY85 for DNA or JTT92 for protein), and an alignment. It also optionally takes a distance matrix, such as that used here, computed for the same sequences. These distances are then used to obtain estimates of branch lengths by the WLS method for each evaluated tree topology which are then used as starting values for the likelihood optimisation.