Genetic distance calculation#
Fast pairwise distance estimation#
For a limited number of evolutionary models a fast implementation is available.
from cogent3 import available_distances
available_distances()
Abbreviation | Suitable for moltype |
---|---|
paralinear | dna, rna, protein |
logdet | dna, rna, protein |
jc69 | dna, rna |
tn93 | dna, rna |
hamming | dna, rna, protein, text, bytes |
pdist | dna, rna, protein, text, bytes |
6 rows x 2 columns
Computing genetic distances using the Alignment
object#
Abbreviations listed from available_distances()
can be used as values for the distance_matrix(calc=<abbreviation>)
.
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists
names | Chimpanzee | Galago | Gorilla | HowlerMon | Human | Orangutan | Rhesus |
---|---|---|---|---|---|---|---|
Chimpanzee | 0.0000 | 0.1921 | 0.0054 | 0.0704 | 0.0089 | 0.0140 | 0.0396 |
Galago | 0.1921 | 0.0000 | 0.1923 | 0.2157 | 0.1965 | 0.1944 | 0.1962 |
Gorilla | 0.0054 | 0.1923 | 0.0000 | 0.0700 | 0.0086 | 0.0137 | 0.0393 |
HowlerMon | 0.0704 | 0.2157 | 0.0700 | 0.0000 | 0.0736 | 0.0719 | 0.0736 |
Human | 0.0089 | 0.1965 | 0.0086 | 0.0736 | 0.0000 | 0.0173 | 0.0423 |
Orangutan | 0.0140 | 0.1944 | 0.0137 | 0.0719 | 0.0173 | 0.0000 | 0.0411 |
Rhesus | 0.0396 | 0.1962 | 0.0393 | 0.0736 | 0.0423 | 0.0411 | 0.0000 |
Using the distance calculator directly#
from cogent3 import get_distance_calculator, load_aligned_seqs
aln = load_aligned_seqs("data/primate_brca1.fasta")
dist_calc = get_distance_calculator("tn93", alignment=aln)
dist_calc
<cogent3.evolve.fast_distance.TN93Pair at 0x7f5c54c13c50>
dist_calc.run(show_progress=False)
dists = dist_calc.get_pairwise_distances()
dists
names | Chimpanzee | Galago | Gorilla | HowlerMon | Human | Orangutan | Rhesus |
---|---|---|---|---|---|---|---|
Chimpanzee | 0.0000 | 0.1921 | 0.0054 | 0.0704 | 0.0089 | 0.0140 | 0.0396 |
Galago | 0.1921 | 0.0000 | 0.1923 | 0.2157 | 0.1965 | 0.1944 | 0.1962 |
Gorilla | 0.0054 | 0.1923 | 0.0000 | 0.0700 | 0.0086 | 0.0137 | 0.0393 |
HowlerMon | 0.0704 | 0.2157 | 0.0700 | 0.0000 | 0.0736 | 0.0719 | 0.0736 |
Human | 0.0089 | 0.1965 | 0.0086 | 0.0736 | 0.0000 | 0.0173 | 0.0423 |
Orangutan | 0.0140 | 0.1944 | 0.0137 | 0.0719 | 0.0173 | 0.0000 | 0.0411 |
Rhesus | 0.0396 | 0.1962 | 0.0393 | 0.0736 | 0.0423 | 0.0411 | 0.0000 |
The distance calculation object can provide more information. For instance, the standard errors.
dist_calc.stderr
Seq1 \ Seq2 | Galago | HowlerMon | Rhesus | Orangutan | Gorilla | Human | Chimpanzee |
---|---|---|---|---|---|---|---|
Galago | 0 | 0.010274827058395896 | 0.009616307832648562 | 0.009535646532276787 | 0.009491382249540176 | 0.009615033091864917 | 0.009469268026590141 |
HowlerMon | 0.010274827058395896 | 0 | 0.005411811712554772 | 0.0053348584951611175 | 0.005265612474694246 | 0.005406760238748984 | 0.005273572620183854 |
Rhesus | 0.009616307832648562 | 0.005411811712554772 | 0 | 0.0039408549417865755 | 0.003852798161903095 | 0.004005045920100125 | 0.0038665597157698894 |
Orangutan | 0.009535646532276787 | 0.0053348584951611175 | 0.0039408549417865755 | 0 | 0.0022291124743011375 | 0.0025151838791803655 | 0.0022606571679022955 |
Gorilla | 0.009491382249540176 | 0.005265612474694246 | 0.003852798161903095 | 0.0022291124743011375 | 0 | 0.0017596919902326876 | 0.0013848543487237903 |
Human | 0.009615033091864917 | 0.005406760238748984 | 0.004005045920100125 | 0.0025151838791803655 | 0.0017596919902326876 | 0 | 0.0017949285088691988 |
Chimpanzee | 0.009469268026590141 | 0.005273572620183854 | 0.0038665597157698894 | 0.0022606571679022955 | 0.0013848543487237903 | 0.0017949285088691988 | 0 |
7 rows x 8 columns
Likelihood based pairwise distance estimation#
The standard cogent3
likelihood function can also be used to estimate distances. Because these require numerical optimisation they can be significantly slower than the fast estimation approach above.
The following will use the F81 nucleotide substitution model and perform numerical optimisation.
from cogent3 import get_model, load_aligned_seqs
from cogent3.evolve import distance
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
d = distance.EstimateDistances(aln, submodel=get_model("F81"))
d.run(show_progress=False)
dists = d.get_pairwise_distances()
dists
names | Chimpanzee | Galago | Gorilla | HowlerMon | Human | Orangutan | Rhesus |
---|---|---|---|---|---|---|---|
Chimpanzee | 0.0000 | 0.1892 | 0.0054 | 0.0697 | 0.0089 | 0.0140 | 0.0395 |
Galago | 0.1892 | 0.0000 | 0.1891 | 0.2112 | 0.1934 | 0.1915 | 0.1930 |
Gorilla | 0.0054 | 0.1891 | 0.0000 | 0.0693 | 0.0086 | 0.0136 | 0.0391 |
HowlerMon | 0.0697 | 0.2112 | 0.0693 | 0.0000 | 0.0729 | 0.0713 | 0.0729 |
Human | 0.0089 | 0.1934 | 0.0086 | 0.0729 | 0.0000 | 0.0173 | 0.0421 |
Orangutan | 0.0140 | 0.1915 | 0.0136 | 0.0713 | 0.0173 | 0.0000 | 0.0410 |
Rhesus | 0.0395 | 0.1930 | 0.0391 | 0.0729 | 0.0421 | 0.0410 | 0.0000 |
Get the names of sequences with max pairwise distance#
Given a DistanceMatrix
object, finding the sequences that have the maximum pairwise distance is achieved through the max_pair
method.
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists.max_pair()
('Galago', 'HowlerMon')
To find the maximum distance, index the DistanceMatrix
with the result of max_pair
.
dists[dists.max_pair()]
np.float64(0.2156879978632928)
Get the names of sequences with min pairwise distance#
Given a DistanceMatrix
object, finding the sequences that have the minimum pairwise distance is achieved through the min_pair
method.
Note
As the distance between a sequence and itself is zero, and this is not informative, min_pair
will return the smallest distance not on the diagonal.
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists.min_pair()
('Chimpanzee', 'Gorilla')
To find the minimum distance, index the DistanceMatrix
with the result of min_pair
.
dists[dists.min_pair()]
np.float64(0.005354100636467117)