Sequence Collections and Alignments#

Note

These docs now use the new_type core objects via the following setting.

import os

# using new types without requiring an explicit argument
os.environ["COGENT3_NEW_TYPE"] = "1"

For loading collections of unaligned or aligned sequences see Loading an alignment from a file or url.

Basic Collection objects#

Constructing a SequenceCollection or Alignment object from strings#

from cogent3 import make_aligned_seqs

data = {"seq1": "ATGACC", "seq2": "ATCGCC"}
# for an alignment, sequences must be the same length
seqs = make_aligned_seqs(data=data, moltype="dna")
type(seqs)
cogent3.core.new_alignment.Alignment
from cogent3 import make_unaligned_seqs

data = {"seq1": "ATGCC", "seq2": "ATCG"}
seqs = make_unaligned_seqs(data, moltype="dna")
type(seqs)
cogent3.core.new_alignment.SequenceCollection

Converting a SequenceCollection to FASTA format#

from cogent3 import load_unaligned_seqs

seqs = load_unaligned_seqs("data/test.paml", moltype="dna")
print(seqs.to_fasta())
>NineBande
GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
>Mouse
GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
>Human
GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
>HowlerMon
GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
>DogFaced
GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT

Adding new sequences to an existing collection or alignment#

More than one sequence can be added to a collection simultaneously. Note that add_seqs() does not modify the existing collection/alignment, it creates a new one.

The elements of a collection or alignment#

Accessing individual sequences from a collection or alignment by name#

We can get a sequence by name by indexing the .seqs attribute.

from cogent3 import make_unaligned_seqs

seqcoll = make_unaligned_seqs(
    [("seq1", "ATGAA"), ("seq2", "ATGAGTGATG"), ("seq3", "ATAGGATG")],
    moltype="dna",
)
seq = seqcoll.seqs["seq1"]
seq
0
seq1ATGAA

DnaSequence, length=5

For an alignment, the result is an Aligned instance.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [("seq1", "ATGAA------"), ("seq2", "ATG-AGTGATG"), ("seq3", "AT--AG-GATG")],
    moltype="dna",
)
aligned = aln.seqs["seq1"]
aligned
Aligned(name='seq1', seq='ATGAA--... 11', moltype='dna')

For the alignment case, you can get the ungapped sequence by accessing the .seq attribute of the aligned instance.

aligned.seq
0
seq1ATGAA

DnaSequence, length=5

seq = aln.get_seq("seq1")
seq.name
type(seq)
seq.is_gapped()
False

Alternatively, if you want to extract the aligned (i.e., gapped) sequence from an alignment, you can use get_gapped_seq.

seq = aln.get_gapped_seq("seq1")
seq.is_gapped()
seq
0
seq1ATGAA------

DnaSequence, length=11

To see the names of the sequences in a sequence collection, use the names attribute.

aln.names
('seq1', 'seq2', 'seq3')

Slice the sequences from a collection like a list#

Use the .seqs attribute. We can index a single sequence

from cogent3 import load_unaligned_seqs

fn = "data/long_testseqs.fasta"
seqs = load_unaligned_seqs(fn, moltype="dna")
my_seq = seqs.seqs[0]
my_seq
0
HumanTGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT

DnaSequence, length=2,532 (truncated to 60)

but you cannot index a slice (Use .take_seqs() for that).

seqs.seqs[:2]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 seqs.seqs[:2]

File /opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/functools.py:949, in singledispatchmethod.__get__.<locals>._method(*args, **kwargs)
    947 def _method(*args, **kwargs):
    948     method = self.dispatcher.dispatch(args[0].__class__)
--> 949     return method.__get__(obj, cls)(*args, **kwargs)

File ~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/new_alignment.py:4509, in _IndexableSeqs.__getitem__(self, key)
   4503 @singledispatchmethod
   4504 def __getitem__(
   4505     self,
   4506     key: str | int | slice,
   4507 ) -> new_sequence.Sequence | Aligned:
   4508     msg = f"indexing not supported for {type(key)}, try .take_seqs()"
-> 4509     raise TypeError(msg)

TypeError: indexing not supported for <class 'slice'>, try .take_seqs()

Getting a subset of sequences from the alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
aln.names
('NineBande', 'Mouse', 'Human', 'HowlerMon', 'DogFaced')
new = aln.take_seqs(["Human", "HowlerMon"])
new.names
('Human', 'HowlerMon')

Alternatively, you can extract only the sequences which are not specified by passing negate=True:

new = aln.take_seqs(["Human", "HowlerMon"], negate=True)
new.names
('NineBande', 'Mouse', 'DogFaced')

Note

The subset contains references to the original sequences, not copies.

Writing sequences to file#

Both collection and alignment objects have a write() method. The output format is inferred from the filename suffix,

from cogent3 import make_aligned_seqs

dna = {"seq1": "ATGACC", "seq2": "ATCGCC"}
aln = make_aligned_seqs(data=dna, moltype="dna")
aln.write("sample.fasta")

or by the format argument.

aln.write("sample", format="fasta")
from cogent3.util.io import remove_files

remove_files(["sample", "sample.fasta"], error_on_missing=False)

Alignments#

Creating an Alignment object from a SequenceCollection#

from cogent3 import load_unaligned_seqs
from cogent3.core.alignment import Alignment

seq = load_unaligned_seqs("data/test.paml", moltype="dna")
seq
0
NineBandeGCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
MouseGCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
HumanGCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
HowlerMonGCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
DogFacedGCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT

5 x {min=60, median=60.0, max=60} dna sequence collection

aln = Alignment(seq.seqs)
aln
0
DogFacedGCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT
NineBande......C....A....G........C.....G............................
Mouse...GT...........G........C..C..G...A.G.........C..C......GT.
Human...........A..T..........C..G..G.......................G....
HowlerMon...........A..T..........C.....G.G..........................

5 x 60 dna alignment

Convert alignment to DNA, RNA or PROTEIN moltypes#

This is useful if you’ve loaded a sequence alignment without specifying the moltype and later need to convert it using the dedicated method

from cogent3 import make_aligned_seqs

data = [("a", "ACG---"), ("b", "CCTGGG")]
aln = make_aligned_seqs(data=data, moltype="text")
dna = aln.to_dna()
dna
0
bCCTGGG
aA.G---

2 x 6 dna alignment

Or using the generic to_moltype() method

aln.to_moltype("dna")
0
bCCTGGG
aA.G---

2 x 6 dna alignment

To RNA

from cogent3 import make_aligned_seqs

data = [("a", "ACG---"), ("b", "CCUGGG")]
aln = make_aligned_seqs(data=data, moltype="text")
rna = aln.to_rna()
rna
0
bCCUGGG
aA.G---

2 x 6 rna alignment

To PROTEIN

from cogent3 import make_aligned_seqs

data = [("x", "TYV"), ("y", "TE-")]
aln = make_aligned_seqs(data=data, moltype="text")
prot = aln.to_moltype("protein")
prot
0
xTYV
y.E-

2 x 3 protein alignment

Handling gaps#

Remove all gaps from an alignment#

This necessarily returns a SequenceCollection.

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
degapped = aln.degap()
degapped
0
humanAGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCGCCGAGTCAGGCTGCCCAGCCCGCCGGG
macaqueAGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCACCGAGTCAGGCTGCCCAGCCCGCCGGG
chimpAGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCGCCGAGTCAGGCTGCCCAGCCCGCCGGG

3 x {min=1500, median=1503.0, max=1511} dna sequence collection

Removing all columns with gaps in a named sequence#

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [("seq1", "ATGAA---TG-"), ("seq2", "ATG-AGTGATG"), ("seq3", "AT--AG-GATG")],
    moltype="dna",
)
new_aln = aln.get_degapped_relative_to("seq1")
new_aln
0
seq1ATGAATG
seq2...-.AT
seq3..--.AT

3 x 7 dna alignment

Converting an alignment to FASTA format#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
fasta_align = aln.to_fasta()

Converting an alignment to Phylip format#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
got = aln.to_phylip()
print(got)
5  60
NineBande GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
Mouse     GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
Human     GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
HowlerMon GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
DogFaced  GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT

Converting an alignment to a list of strings#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
data = [str(s) for s in aln.seqs]

Converting an alignment to a list of arrays#

import numpy
from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
data = [numpy.array(s) for s in aln.seqs]
data
[array([3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8),
 array([3, 1, 2, 3, 0, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 1, 2, 2, 3, 0, 2, 2, 2, 3, 3, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 1, 3, 2, 1, 2, 3, 3, 1, 2, 3, 3, 0, 0], dtype=uint8),
 array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 3, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 3, 3, 2, 1, 0], dtype=uint8),
 array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8),
 array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 0, 0, 3, 2, 2, 2, 1, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8)]

Getting an alignment as an array#

The rows are sequences and their order corresponds to that of aln.names.

import numpy
from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
aln.array_seqs
array([[3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0],
       [3, 1, 2, 3, 0, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 1, 2, 2, 3, 0, 2, 2, 2, 3, 3, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 1, 3, 2, 1, 2, 3, 3, 1, 2, 3, 3, 0, 0],
       [3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 3, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 3, 3, 2, 1, 0],
       [3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0],
       [3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 0, 0, 3, 2, 2, 2, 1, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0]], dtype=uint8)

Slicing an alignment#

Alignments can be thought of as a matrix, with sequences along the rows and alignment positions as the columns. However, all slicing is only along positions.

from cogent3 import load_aligned_seqs

fn = "data/long_testseqs.fasta"
aln = load_aligned_seqs(fn, moltype="dna")
aln[:24]
0
DogFacedTGTGGCACAAATACTCATGCCAAC
Human......................G.
HowlerMon......................G.
Mouse.........G..G.........G.
NineBande........................

5 x 24 dna alignment

Warning

A SequenceCollection cannot be sliced!

Getting a single column from an alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
column_four = aln[3]
column_four
0
DogFacedA
NineBande.
MouseG
Human.
HowlerMon.

5 x 1 dna alignment

Getting a region of contiguous columns#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
region = aln[50:70]

Iterating over alignment positions#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
col = list(aln[113:115].iter_positions())
col
[['A', 'A', 'A'], ['T', '-', '-']]

Getting codon 3rd positions#

We can use conventional slice notation. Note, because Python counts from 0, the 3rd position starts at index 2.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data={"seq1": "ATGATGATG---", "seq2": "ATGATGATGATG"},
    moltype="dna",
)
aln[2::3]
0
seq2GGGG
seq1...-

2 x 4 dna alignment

Filtering positions#

Trim terminal stop codons#

For evolutionary analyses that use codon models we need to exclude terminating stop codons. For the case where the sequences are all of length divisible by 3.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data={"seq1": "ACGTAA---", "seq2": "ACGACA---", "seq3": "ACGCAATGA"},
    moltype="dna",
)
new = aln.trim_stop_codons()
new
0
seq2ACGACA---
seq1...---...
seq3...CA....

3 x 9 dna alignment

To detect if the alignment contains sequences not divisible by 3, use the strict argument. This argument covers both allowing partial terminating codons / not divisible by 3.

aln = make_aligned_seqs(
    data={
        "seq1": "ACGTAA---",
        "seq2": "ACGAC----",  # terminal codon incomplete
        "seq3": "ACGCAATGA",
    },
    moltype="dna",
)
new = aln.trim_stop_codons(strict=True)

Eliminating columns with non-nucleotide characters#

We sometimes want to eliminate ambiguous or gap data from our alignments. We demonstrate how to exclude alignment columns based on the characters they contain. In the first instance, we do this just for single nucleotide columns, then for trinucleotides (equivalent for handling codons). Both are done using the no_degenerates() method.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)

We apply to nucleotides,

nucs = aln.no_degenerates()
nucs
0
seq1ATGAAGGG
seq2........
seq3........

3 x 8 dna alignment

Applying the same filter to trinucleotides (specified by setting motif_length=3).

trinucs = aln.no_degenerates(motif_length=3)
trinucs
0
seq1ATGAAG
seq2......
seq3......

3 x 6 dna alignment

Getting all variable positions from an alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions()
just_variable_aln = aln.take_positions(pos)
just_variable_aln[:10]
0
DogFacedAAACAAAATA
Human..G.....CT
HowlerMon..G...G.CT
MouseGGG.CC.GCT
NineBande...T....CT

5 x 10 dna alignment

Getting all constant positions from an alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions()
just_constant_aln = aln.take_positions(pos, negate=True)
just_constant_aln[:10]
0
DogFacedTGTGGCACAA
Human..........
HowlerMon..........
Mouse..........
NineBande..........

5 x 10 dna alignment

Getting all variable codons from an alignment#

This is done using the filtered method using the motif_length argument.

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions(motif_length=3)
variable_codons = aln.take_positions(pos)
just_variable_aln[:9]
0
DogFacedAAACAAAAT
Human..G.....C
HowlerMon..G...G.C
MouseGGG.CC.GC
NineBande...T....C

5 x 9 dna alignment

Filtering sequences#

Extracting sequences using an arbitrary function into a new alignment object#

You can use take_seqs_if to extract sequences into a new alignment object based on whether an arbitrary function applied to the sequence evaluates to True. For example, to extract sequences which don’t contain any N bases you could do the following:

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)

def no_N_chars(s):
    return "N" not in str(s)

aln.take_seqs_if(no_N_chars)
0
seq2ATGAAGGTGATG
seq1.........---

2 x 12 dna alignment

You can additionally get the sequences where the provided function evaluates to False:

aln.take_seqs_if(no_N_chars, negate=True)
0
seq3ATGAAGGNGATG

1 x 12 dna alignment

Computing alignment statistics#

Getting motif counts#

We state the motif length we want and whether to allow gap or ambiguous characters. The latter only has meaning for IPUAC character sets (the DNA, RNA or PROTEIN moltypes). We illustrate this for the DNA moltype with motif lengths of 1 and 3.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
counts = aln.counts()
counts
ACGT
110147
counts = aln.counts(motif_length=3)
counts
AAAAACAAGAATACAACCACGACTAGAAGCAGGAGTATAATCATGATTCAACACCAGCATCCACCCCCGCCTCGACGCCGGCGTCTACTCCTGCTTGAAGACGAGGATGCAGCCGCGGCTGGAGGCGGGGGTGTAGTCGTGGTTTAATACTAGTATTCATCCTCGTCTTGATGCTGGTGTTTATTCTTGTTT
0030000000000050000000000000000000000000000000200000000000000000
counts = aln.counts(include_ambiguity=True)
counts
ACGNT
1101417

Note

Only the observed motifs are returned, rather than all defined by the alphabet.

Getting motif counts per sequence#

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
counts = aln.counts_per_seq()
counts
ACGT
seq13042
seq24053
seq34052

Note

There are also .probs_per_seq() and .entropy_per_seq() methods.

Getting motif counts per position#

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
counts = aln.counts_per_pos()
counts
TCAG
00030
13000
20003
30030
40030
50003
60003
72000
80003
90020
102000
110002

Note

There are also .probs_per_pos() and .entropy_per_pos() methods.

Computing motif probabilities from an alignment#

The method get_motif_probs of Alignment objects returns the probabilities for all motifs of a given length. For individual nucleotides:

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
motif_probs = aln.get_motif_probs()
motif_probs
{'T': 0.25520602569782896,
 'C': 0.25808595480726626,
 'A': 0.24390784226849802,
 'G': 0.24280017722640673}

For dinucleotides or longer, we need to pass in a KmerAlphabet with the appropriate word length. Here is an example with trinucleotides:

from cogent3 import load_aligned_seqs, get_moltype, make_table

trinuc_alphabet = get_moltype("dna").alphabet.get_kmer_alphabet(3)
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
motif_probs = aln.get_motif_probs(alphabet=trinuc_alphabet)
table = make_table(header=["motif", "freq"], data=list(motif_probs.items()))
table
motiffreq
TTT0.0201
TTC0.0241
TTA0.0114
TTG0.0167
TCT0.0288
......
GAG0.0241
GGT0.0194
GGC0.0140
GGA0.0207
GGG0.0261

Top 5 and bottom 5 rows from 64 rows x 2 columns

Some calculations in cogent3 require all non-zero values in the motif probabilities, in which case we use a pseudo-count. We illustrate that here with a simple example where T is missing. Without the pseudo-count, the frequency of T is 0.0, with the pseudo-count defined as 1e-6 then the frequency of T will be slightly less than 1e-6.

aln = make_aligned_seqs(data=[("a", "AACAAC"), ("b", "AAGAAG")], moltype="dna")
motif_probs = aln.get_motif_probs()
assert motif_probs["T"] == 0.0
motif_probs = aln.get_motif_probs(pseudocount=1e-6)
motif_probs
{'T': 8.333330555556482e-08,
 'C': 0.16666669444443521,
 'A': 0.6666665277778241,
 'G': 0.16666669444443521}

Note

For alignments, motif probabilities are computed by treating sequences as non-overlapping tuples. To get all possible k-mers, use the iter_kmers() method on the sequence classes.

Working with alignment gaps#

Filtering extracted columns for the gap character#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
col = aln[113:115].iter_positions()
c1, c2 = list(col)
c1, c2
list(filter(lambda x: x == "-", c1))
list(filter(lambda x: x == "-", c2))
['-', '-']

Calculating the gaps per position#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
gap_counts = aln.count_gaps_per_pos()
gap_counts # this is a DictArray
0123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589590591592593594595596597598599600601602603604605606607608609610611612613614615616617618619620621622623624625626627628629630631632633634635636637638639640641642643644645646647648649650651652653654655656657658659660661662663664665666667668669670671672673674675676677678679680681682683684685686687688689690691692693694695696697698699700701702703704705706707708709710711712713714715716717718719720721722723724725726727728729730731732733734735736737738739740741742743744745746747748749750751752753754755756757758759760761762763764765766767768769770771772773774775776777778779780781782783784785786787788789790791792793794795796797798799800801802803804805806807808809810811812813814815816817818819820821822823824825826827828829830831832833834835836837838839840841842843844845846847848849850851852853854855856857858859860861862863864865866867868869870871872873874875876877878879880881882883884885886887888889890891892893894895896897898899900901902903904905906907908909910911912913914915916917918919920921922923924925926927928929930931932933934935936937938939940941942943944945946947948949950951952953954955956957958959960961962963964965966967968969970971972973974975976977978979980981982983984985986987988989990991992993994995996997998999100010011002100310041005100610071008100910101011101210131014101510161017101810191020102110221023102410251026102710281029103010311032103310341035103610371038103910401041104210431044104510461047104810491050105110521053105410551056105710581059106010611062106310641065106610671068106910701071107210731074107510761077107810791080108110821083108410851086108710881089109010911092109310941095109610971098109911001101110211031104110511061107110811091110111111121113111411151116111711181119112011211122112311241125112611271128112911301131113211331134113511361137113811391140114111421143114411451146114711481149115011511152115311541155115611571158115911601161116211631164116511661167116811691170117111721173117411751176117711781179118011811182118311841185118611871188118911901191119211931194119511961197119811991200120112021203120412051206120712081209121012111212121312141215121612171218121912201221122212231224122512261227122812291230123112321233123412351236123712381239124012411242124312441245124612471248124912501251125212531254125512561257125812591260126112621263126412651266126712681269127012711272127312741275127612771278127912801281128212831284128512861287128812891290129112921293129412951296129712981299130013011302130313041305130613071308130913101311131213131314131513161317131813191320132113221323132413251326132713281329133013311332133313341335133613371338133913401341134213431344134513461347134813491350135113521353135413551356135713581359136013611362136313641365136613671368136913701371137213731374137513761377137813791380138113821383138413851386138713881389139013911392139313941395139613971398139914001401140214031404140514061407140814091410141114121413141414151416141714181419142014211422142314241425142614271428142914301431143214331434143514361437143814391440144114421443144414451446144714481449145014511452145314541455145614571458145914601461146214631464146514661467146814691470147114721473147414751476147714781479148014811482148314841485148614871488148914901491149214931494149514961497149814991500150115021503150415051506150715081509151015111512151315141515151615171518151915201521152215231524
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000222222222222222222201111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000020000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

To turn that into grap fraction

gap_frac = gap_counts.array / aln.num_seqs

Filtering alignments based on gaps#

If we want to remove positions from the alignment which are gaps in more than a certain percentage of the sequences, use the omit_gap_pos().

aln = make_aligned_seqs(
    data=[
        ("seq1", "ATGAA---TG-"),
        ("seq2", "ATG-AGTGATG"),
        ("seq3", "AT--AG-GATG"),
    ],
    moltype="dna",
)
filtered_aln = aln.omit_gap_pos(allowed_gap_frac=0.40)
filtered_aln
0
seq2ATGAGGATG
seq1....--TG-
seq3..-......

3 x 9 dna alignment

Note

The default for filtered_aln.omit_gap_pos() is to remove columns with gaps in all the sequences. This can occur after sequences have been removed from the alignment.