Genetic distance calculation

Fast pairwise distance estimation

A fast implementation is available for a limited number of evolutionary models. The supported calculators can be listed with available_distances().

from cogent3 import available_distances

available_distances()
Specify a pairwise genetic distance calculator using 'Abbreviation' (case insensitive).

Abbreviation    Suitable for moltype
paralinear      dna, rna, protein
logdet          dna, rna, protein
jc69            dna, rna
tn93            dna, rna
hamming         dna, rna, protein, text, bytes
percent         dna, rna, protein, text, bytes

6 rows x 2 columns
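
To make the simplest kinds of measure concrete, the sketch below takes a pair of aligned sequences and computes the number of differing positions, the proportion of differing positions p, and the Jukes-Cantor (JC69) correction d = -(3/4) ln(1 - 4p/3). It is plain Python written for illustration and is not cogent3's implementation. The function names, the rule of skipping any column that contains a character other than A, C, G or T, and the toy sequences are all assumptions made here.

```python
import math


def count_differences(seq1, seq2, valid="ACGT"):
    """Return (differing columns, compared columns) for two aligned sequences.

    Columns containing a character outside `valid` (gaps, ambiguity codes)
    are skipped. That rule is an assumption of this sketch.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            diffs += 1
    return diffs, compared


def jc69_distance(p):
    """Jukes-Cantor distance from the proportion of differing sites p."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned pair (hypothetical data, not taken from primate_brca1.fasta).
s1 = "ACGTACGTAC-T"
s2 = "ACGTTCGTACGA"

diffs, compared = count_differences(s1, s2)
p = diffs / compared
d = jc69_distance(p)
print(f"differences={diffs} compared={compared} p={p:.4f} jc69={d:.4f}")
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.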

Computing genetic distances using the Alignment object

The abbreviations listed by available_distances() can be passed as the calc argument of the alignment's distance_matrix() method, i.e. distance_matrix(calc=<abbreviation>).

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
dists = aln.distance_matrix(calc="tn93", show_progress=False)
dists
names         Chimpanzee    Galago    Gorilla    HowlerMon     Human    Orangutan    Rhesus
Chimpanzee        0.0000    0.1921     0.0054       0.0704    0.0089       0.0140    0.0396
Galago            0.1921    0.0000     0.1923       0.2157    0.1965       0.1944    0.1962
Gorilla           0.0054    0.1923     0.0000       0.0700    0.0086       0.0137    0.0393
HowlerMon         0.0704    0.2157     0.0700       0.0000    0.0736       0.0719    0.0736
Human             0.0089    0.1965     0.0086       0.0736    0.0000       0.0173    0.0423
Orangutan         0.0140    0.1944     0.0137       0.0719    0.0173       0.0000    0.0411
Rhesus            0.0396    0.1962     0.0393       0.0736    0.0423       0.0411    0.0000
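
The matrix is symmetric with zeros on the diagonal, so each row can be read as the distances from one sequence to all the others. As a small illustration of reading it, the sketch below ranks the other sequences by their distance from Human. It deliberately does not use the cogent3 distance matrix object; the values are copied by hand from the Human row of the table above into a plain dict, and the variable names are made up for this example.

```python
# Distances from Human, copied from the TN93 table above (4 decimal places).
human_dists = {
    "Chimpanzee": 0.0089,
    "Galago": 0.1965,
    "Gorilla": 0.0086,
    "HowlerMon": 0.0736,
    "Orangutan": 0.0173,
    "Rhesus": 0.0423,
}

# Sort the (name, distance) pairs from smallest to largest distance.
ranked = sorted(human_dists.items(), key=lambda item: item[1])

closest_name, closest_dist = ranked[0]
farthest_name, farthest_dist = ranked[-1]

for name, dist in ranked:
    print(f"{name:<12}{dist:.4f}")
```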

Using the distance calculator directly

from cogent3 import load_aligned_seqs, get_distance_calculator

aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
dist_calc = get_distance_calculator("tn93", alignment=aln)
dist_calc
<cogent3.evolve.fast_distance.TN93Pair at 0x11680e5b0>
dist_calc.run(show_progress=False)
dists = dist_calc.get_pairwise_distances()
dists
names         Chimpanzee    Galago    Gorilla    HowlerMon     Human    Orangutan    Rhesus
Chimpanzee        0.0000    0.1921     0.0054       0.0704    0.0089       0.0140    0.0396
Galago            0.1921    0.0000     0.1923       0.2157    0.1965       0.1944    0.1962
Gorilla           0.0054    0.1923     0.0000       0.0700    0.0086       0.0137    0.0393
HowlerMon         0.0704    0.2157     0.0700       0.0000    0.0736       0.0719    0.0736
Human             0.0089    0.1965     0.0086       0.0736    0.0000       0.0173    0.0423
Orangutan         0.0140    0.1944     0.0137       0.0719    0.0173       0.0000    0.0411
Rhesus            0.0396    0.1962     0.0393       0.0736    0.0423       0.0411    0.0000

The distance calculation object can provide more information, for instance the standard errors of the estimated distances.

dist_calc.stderr
Standard Error of Pairwise Distances
Seq1 \ Seq2  Galago                 HowlerMon              Rhesus                 Orangutan              Gorilla                Human                  Chimpanzee
Galago       0                      0.010274827058395896   0.009616307832648562   0.009535646532276787   0.009491382249540176   0.009615033091864917   0.009469268026590141
HowlerMon    0.010274827058395896   0                      0.005411811712554772   0.0053348584951611175  0.005265612474694246   0.005406760238748984   0.005273572620183854
Rhesus       0.009616307832648562   0.005411811712554772   0                      0.0039408549417865755  0.003852798161903095   0.004005045920100125   0.0038665597157698894
Orangutan    0.009535646532276787   0.0053348584951611175  0.0039408549417865755  0                      0.0022291124743011375  0.0025151838791803655  0.0022606571679022955
Gorilla      0.009491382249540176   0.005265612474694246   0.003852798161903095   0.0022291124743011375  0                      0.0017596919902326876  0.0013848543487237903
Human        0.009615033091864917   0.005406760238748984   0.004005045920100125   0.0025151838791803655  0.0017596919902326876  0                      0.0017949285088691988
Chimpanzee   0.009469268026590141   0.005273572620183854   0.0038665597157698894  0.0022606571679022955  0.0013848543487237903  0.0017949285088691988  0

7 rows x 8 columns
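
One common use of a standard error is to attach an approximate interval to each estimate. The sketch below does this for three pairs, using distance ± 1.96 × standard error. It assumes the estimates are approximately normally distributed, which the text above does not state, so treat the intervals as rough guides only. The distances are the rounded values from the table above, the standard errors are copied from the stderr output, and the function name is hypothetical rather than part of cogent3.

```python
def approx_confidence_interval(distance, stderr, z=1.96):
    """Return (lower, upper) for distance +/- z * stderr.

    Assumes approximate normality of the estimate (an assumption of this
    sketch). The lower bound is clipped at zero because a genetic distance
    cannot be negative.
    """
    half_width = z * stderr
    lower = max(0.0, distance - half_width)
    upper = distance + half_width
    return lower, upper


# (distance, standard error) pairs copied from the outputs above.
estimates = {
    ("Chimpanzee", "Gorilla"): (0.0054, 0.0013848543487237903),
    ("Human", "Chimpanzee"): (0.0089, 0.0017949285088691988),
    ("Galago", "HowlerMon"): (0.2157, 0.010274827058395896),
}

intervals = {
    pair: approx_confidence_interval(dist, se)
    for pair, (dist, se) in estimates.items()
}

for (name1, name2), (lower, upper) in intervals.items():
    print(f"{name1}-{name2}: {lower:.4f} to {upper:.4f}")
```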

Likelihood based pairwise distance estimation

The standard cogent3 likelihood function can also be used to estimate distances. Because this approach requires numerical optimisation, it can be significantly slower than the fast estimation approach above.

The following example uses the F81 nucleotide substitution model and performs numerical optimisation.

from cogent3 import load_aligned_seqs, get_model
from cogent3.evolve import distance

aln = load_aligned_seqs("data/primate_brca1.fasta", moltype="dna")
d = distance.EstimateDistances(aln, submodel=get_model("F81"))
d.run(show_progress=False)
dists = d.get_pairwise_distances()
dists
names         Chimpanzee    Galago    Gorilla    HowlerMon     Human    Orangutan    Rhesus
Chimpanzee        0.0000    0.1892     0.0054       0.0697    0.0089       0.0140    0.0395
Galago            0.1892    0.0000     0.1891       0.2112    0.1934       0.1915    0.1930
Gorilla           0.0054    0.1891     0.0000       0.0693    0.0086       0.0136    0.0391
HowlerMon         0.0697    0.2112     0.0693       0.0000    0.0729       0.0713    0.0729
Human             0.0089    0.1934     0.0086       0.0729    0.0000       0.0173    0.0421
Orangutan         0.0140    0.1915     0.0136       0.0713    0.0173       0.0000    0.0410
Rhesus            0.0395    0.1930     0.0391       0.0729    0.0421       0.0410    0.0000
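
The F81 estimates can be compared with the TN93 estimates shown earlier. The sketch below does so for three pairs, with the values copied by hand from the two tables (both rounded to 4 decimal places) into plain dicts. It does not use any cogent3 API, and the choice of pairs is arbitrary.

```python
# Values copied from the TN93 and F81 tables in this document.
tn93 = {
    ("Chimpanzee", "Gorilla"): 0.0054,
    ("Human", "Rhesus"): 0.0423,
    ("Galago", "HowlerMon"): 0.2157,
}
f81 = {
    ("Chimpanzee", "Gorilla"): 0.0054,
    ("Human", "Rhesus"): 0.0421,
    ("Galago", "HowlerMon"): 0.2112,
}

# Difference between the two estimates, rounded to the precision of the tables.
model_diffs = {pair: round(tn93[pair] - f81[pair], 4) for pair in tn93}

for (name1, name2), diff in model_diffs.items():
    print(f"{name1}-{name2}: TN93 - F81 = {diff:.4f}")
```

For these three pairs the two models agree to 4 decimal places for the most similar sequences, and the gap widens as the distance grows.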