Sequence Collections and Alignments#

For loading collections of unaligned or aligned sequences see Loading sequence collections from a file or url.

Basic Collection objects#

Constructing a `SequenceCollection` or `Alignment` object from strings#

from cogent3 import make_aligned_seqs

data = {"seq1": "ATGACC", "seq2": "ATCGCC"}
# for an alignment, sequences must be the same length
seqs = make_aligned_seqs(data, moltype="dna")
type(seqs)

cogent3.core.alignment.Alignment

from cogent3 import make_unaligned_seqs

data = {"seq1": "ATGCC", "seq2": "ATCG"}
seqs = make_unaligned_seqs(data, moltype="dna")
type(seqs)

cogent3.core.alignment.SequenceCollection

Converting a `SequenceCollection` to FASTA format#

from cogent3 import load_unaligned_seqs

seqs = load_unaligned_seqs("data/test.paml", moltype="dna")
print(seqs.to_fasta())

>NineBande
GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
>Mouse
GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
>Human
GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
>HowlerMon
GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
>DogFaced
GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT

Adding new sequences to an existing collection or alignment#

More than one sequence can be added to a collection simultaneously. Note that add_seqs() does not modify the existing collection/alignment, it creates a new one.

The elements of a collection or alignment#

Accessing individual sequences from a collection or alignment by name#

We can get a sequence by name by indexing the .seqs attribute.

from cogent3 import make_unaligned_seqs

seqcoll = make_unaligned_seqs(
    [("seq1", "ATGAA"), ("seq2", "ATGAGTGATG"), ("seq3", "ATAGGATG")],
    moltype="dna",
)
seq = seqcoll.seqs["seq1"]
seq

	0
seq1	ATGAA

DnaSequence, length=5

For an alignment, the result is an Aligned instance.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [("seq1", "ATGAA------"), ("seq2", "ATG-AGTGATG"), ("seq3", "AT--AG-GATG")],
    moltype="dna",
)
aligned = aln.seqs["seq1"]
aligned

Aligned(name='seq1', seq='ATGAA--... 11', moltype='dna')

For the alignment case, you can get the ungapped sequence by accessing the .seq attribute of the aligned instance.

aligned.seq

	0
seq1	ATGAA

DnaSequence, length=5

seq = aln.get_seq("seq1")
seq.name
type(seq)
seq.is_gapped()

False

Alternatively, if you want to extract the aligned (i.e., gapped) sequence from an alignment, you can use get_gapped_seq.

seq = aln.get_gapped_seq("seq1")
seq.is_gapped()
seq

	0
seq1	ATGAA------

DnaSequence, length=11

To see the names of the sequences in a sequence collection, use the names attribute.

aln.names

('seq1', 'seq2', 'seq3')

Slice the sequences from a collection like a list#

Use the .seqs attribute. We can index a single sequence

from cogent3 import load_unaligned_seqs

fn = "data/long_testseqs.fasta"
seqs = load_unaligned_seqs(fn, moltype="dna")
my_seq = seqs.seqs[0]
my_seq

	0
Human	TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT

DnaSequence, length=2,532 (truncated to 60)

but you cannot index a slice (Use .take_seqs() for that).

seqs.seqs[:2]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 seqs.seqs[:2]

File ~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py:5216, in _IndexableSeqs.__getitem__(self, key)
   5213     return self._make_seq(key)
   5215 msg = f"indexing not supported for {type(key)}, try .take_seqs()"
-> 5216 raise TypeError(msg)

TypeError: indexing not supported for <class 'slice'>, try .take_seqs()

Getting a subset of sequences from the alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
aln.names

('NineBande', 'Mouse', 'Human', 'HowlerMon', 'DogFaced')

new = aln.take_seqs(["Human", "HowlerMon"])
new.names

('Human', 'HowlerMon')

Alternatively, you can extract only the sequences which are not specified by passing negate=True:

new = aln.take_seqs(["Human", "HowlerMon"], negate=True)
new.names

('NineBande', 'Mouse', 'DogFaced')

Note

The subset contains references to the original sequences, not copies.

Writing sequences to file#

Both collection and alignment objects have a write() method. The output format is inferred from the filename suffix,

from cogent3 import make_aligned_seqs

dna = {"seq1": "ATGACC", "seq2": "ATCGCC"}
aln = make_aligned_seqs(dna, moltype="dna")
aln.write("sample.fasta")

or by the format argument.

aln.write("sample", format_name="fasta")

from cogent3.util.io import remove_files

remove_files(["sample", "sample.fasta"], error_on_missing=False)

Alignments#

Creating an `Alignment` object from a `SequenceCollection`#

from cogent3 import load_unaligned_seqs, make_aligned_seqs

seq = load_unaligned_seqs("data/test.paml", moltype="dna")
seq

	0
NineBande	GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
Mouse	GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
Human	GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
HowlerMon	GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
DogFaced	GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT

5 x {min=60, median=60.0, max=60} dna sequence collection

aln = make_aligned_seqs(list(seq.seqs), moltype=seqs.moltype)
aln

	0
DogFaced	GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT
NineBande	......C....A....G........C.....G............................
Mouse	...GT...........G........C..C..G...A.G.........C..C......GT.
Human	...........A..T..........C..G..G.......................G....
HowlerMon	...........A..T..........C.....G.G..........................

5 x 60 dna alignment

Convert alignment to DNA, RNA or PROTEIN moltypes#

This is useful if you’ve loaded a sequence alignment without specifying the moltype and later need to convert it using the dedicated method

from cogent3 import make_aligned_seqs

data = [("a", "ACG---"), ("b", "CCTGGG")]
aln = make_aligned_seqs(data, moltype="text")
dna = aln.to_dna()
dna

	0
b	CCTGGG
a	A.G---

2 x 6 dna alignment

Or using the generic to_moltype() method

aln.to_moltype("dna")

	0
b	CCTGGG
a	A.G---

2 x 6 dna alignment

To RNA

from cogent3 import make_aligned_seqs

data = [("a", "ACG---"), ("b", "CCUGGG")]
aln = make_aligned_seqs(data, moltype="text")
rna = aln.to_rna()
rna

	0
b	CCUGGG
a	A.G---

2 x 6 rna alignment

To PROTEIN

from cogent3 import make_aligned_seqs

data = [("x", "TYV"), ("y", "TE-")]
aln = make_aligned_seqs(data, moltype="text")
prot = aln.to_moltype("protein")
prot

	0
x	TYV
y	.E-

2 x 3 protein alignment

Handling gaps#

Remove all gaps from an alignment#

This necessarily returns a SequenceCollection.

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
degapped = aln.degap()
degapped

	0
human	AGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCGCCGAGTCAGGCTGCCCAGCCCGCCGGG
macaque	AGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCACCGAGTCAGGCTGCCCAGCCCGCCGGG
chimp	AGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCGCCGAGTCAGGCTGCCCAGCCCGCCGGG

3 x {min=1500, median=1503.0, max=1511} dna sequence collection

Removing all columns with gaps in a named sequence#

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [("seq1", "ATGAA---TG-"), ("seq2", "ATG-AGTGATG"), ("seq3", "AT--AG-GATG")],
    moltype="dna",
)
new_aln = aln.get_degapped_relative_to("seq1")
new_aln

	0
seq1	ATGAATG
seq2	...-.AT
seq3	..--.AT

3 x 7 dna alignment

Converting an alignment to FASTA format#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
fasta_align = aln.to_fasta()

Converting an alignment to Phylip format#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
got = aln.to_phylip()
print(got)

5  60
NineBande GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
Mouse     GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
Human     GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
HowlerMon GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
DogFaced  GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT

Converting an alignment to a list of strings#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
data = [str(s) for s in aln.seqs]

Converting an alignment to a list of arrays#

import numpy
from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
data = [numpy.array(s) for s in aln.seqs]
data

[array([3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8),
 array([3, 1, 2, 3, 0, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 1, 2, 2, 3, 0, 2, 2, 2, 3, 3, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 1, 3, 2, 1, 2, 3, 3, 1, 2, 3, 3, 0, 0], dtype=uint8),
 array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 3, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 3, 3, 2, 1, 0], dtype=uint8),
 array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8),
 array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 0, 0, 3, 2, 2, 2, 1, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8)]

Getting an alignment as an array#

The rows are sequences and their order corresponds to that of aln.names.

import numpy
from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
aln.array_seqs

array([[3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0],
       [3, 1, 2, 3, 0, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 1, 2, 2, 3, 0, 2, 2, 2, 3, 3, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 1, 3, 2, 1, 2, 3, 3, 1, 2, 3, 3, 0, 0],
       [3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 3, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 3, 3, 2, 1, 0],
       [3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0],
       [3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1, 2, 3, 2, 0,
        3, 3, 3, 0, 0, 3, 2, 2, 2, 1, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
        0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0]], dtype=uint8)

Slicing an alignment#

Alignments can be thought of as a matrix, with sequences along the rows and alignment positions as the columns. However, all slicing is only along positions.

from cogent3 import load_aligned_seqs

fn = "data/long_testseqs.fasta"
aln = load_aligned_seqs(fn, moltype="dna")
aln[:24]

	0
DogFaced	TGTGGCACAAATACTCATGCCAAC
Human	......................G.
HowlerMon	......................G.
Mouse	.........G..G.........G.
NineBande	........................

5 x 24 dna alignment

Warning

A SequenceCollection cannot be sliced!

Getting a single column from an alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/test.paml", moltype="dna")
column_four = aln[3]
column_four

	0
DogFaced	A
NineBande	.
Mouse	G
Human	.
HowlerMon	.

5 x 1 dna alignment

Getting a region of contiguous columns#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
region = aln[50:70]

Iterating over alignment positions#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
col = list(aln[113:115].iter_positions())
col

[['A', 'A', 'A'], ['T', '-', '-']]

Getting codon 3rd positions#

We can use conventional slice notation. Note, because Python counts from 0, the 3rd position starts at index 2.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    {"seq1": "ATGATGATG---", "seq2": "ATGATGATGATG"},
    moltype="dna",
)
aln[2::3]

	0
seq2	GGGG
seq1	...-

2 x 4 dna alignment

Filtering positions#

Trim terminal stop codons#

For evolutionary analyses that use codon models we need to exclude terminating stop codons. For the case where the sequences are all of length divisible by 3.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    {"seq1": "ACGTAA---", "seq2": "ACGACA---", "seq3": "ACGCAATGA"},
    moltype="dna",
)
new = aln.trim_stop_codons()
new

	0
seq2	ACGACA---
seq1	...---...
seq3	...CA....

3 x 9 dna alignment

To detect if the alignment contains sequences not divisible by 3, use the strict argument. This argument covers both allowing partial terminating codons / not divisible by 3.

aln = make_aligned_seqs(
    {
        "seq1": "ACGTAA---",
        "seq2": "ACGAC----",  # terminal codon incomplete
        "seq3": "ACGCAATGA",
    },
    moltype="dna",
)
new = aln.trim_stop_codons(strict=True)

Eliminating columns with non-nucleotide characters#

We sometimes want to eliminate ambiguous or gap data from our alignments. We demonstrate how to exclude alignment columns based on the characters they contain. In the first instance, we do this just for single nucleotide columns, then for trinucleotides (equivalent for handling codons). Both are done using the no_degenerates() method.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)

We apply to nucleotides,

nucs = aln.no_degenerates()
nucs

	0
seq1	ATGAAGGG
seq2	........
seq3	........

3 x 8 dna alignment

Applying the same filter to trinucleotides (specified by setting motif_length=3).

trinucs = aln.no_degenerates(motif_length=3)
trinucs

	0
seq1	ATGAAG
seq2	......
seq3	......

3 x 6 dna alignment

Getting all variable positions from an alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions()
just_variable_aln = aln.take_positions(pos)
just_variable_aln[:10]

	0
DogFaced	AAACAAAATA
Human	..G.....CT
HowlerMon	..G...G.CT
Mouse	GGG.CC.GCT
NineBande	...T....CT

5 x 10 dna alignment

Getting all constant positions from an alignment#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions()
just_constant_aln = aln.take_positions(pos, negate=True)
just_constant_aln[:10]

	0
DogFaced	TGTGGCACAA
Human	..........
HowlerMon	..........
Mouse	..........
NineBande	..........

5 x 10 dna alignment

Getting all variable codons from an alignment#

This is done using the filtered method using the motif_length argument.

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions(motif_length=3)
variable_codons = aln.take_positions(pos)
just_variable_aln[:9]

	0
DogFaced	AAACAAAAT
Human	..G.....C
HowlerMon	..G...G.C
Mouse	GGG.CC.GC
NineBande	...T....C

5 x 9 dna alignment

Filtering sequences#

Extracting sequences using an arbitrary function into a new alignment object#

You can use take_seqs_if to extract sequences into a new alignment object based on whether an arbitrary function applied to the sequence evaluates to True. For example, to extract sequences which don’t contain any N bases you could do the following:

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)

def no_N_chars(s):
    return "N" not in str(s)

aln.take_seqs_if(no_N_chars)

	0
seq2	ATGAAGGTGATG
seq1	.........---

2 x 12 dna alignment

You can additionally get the sequences where the provided function evaluates to False:

aln.take_seqs_if(no_N_chars, negate=True)

	0
seq3	ATGAAGGNGATG

1 x 12 dna alignment

Computing alignment statistics#

Getting motif counts#

Motif counts are obtained from non-overlapping k-mers. (This is distinct from k-mer counting, in which they do overlap.)

We state the motif length we want and whether to allow gap or ambiguous characters. The latter only has meaning for IPUAC character sets (the DNA, RNA or PROTEIN moltypes). We illustrate this for the DNA moltype with motif lengths of 1 and 3.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
counts = aln.counts()
counts

A	C	G	T
11	0	14	7

counts = aln.counts(motif_length=3)
counts

AAA	AAC	AAG	AAT	ACA	ACC	ACG	ACT	AGA	AGC	AGG	AGT	ATA	ATC	ATG	ATT	CAA	CAC	CAG	CAT	CCA	CCC	CCG	CCT	CGA	CGC	CGG	CGT	CTA	CTC	CTG	CTT	GAA	GAC	GAG	GAT	GCA	GCC	GCG	GCT	GGA	GGC	GGG	GGT	GTA	GTC	GTG	GTT	TAA	TAC	TAG	TAT	TCA	TCC	TCG	TCT	TGA	TGC	TGG	TGT	TTA	TTC	TTG	TTT
0	0	3	0	0	0	0	0	0	0	0	0	0	0	5	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

counts = aln.counts(include_ambiguity=True)
counts

A	C	G	N	T
11	0	14	1	7

Note

Only the observed motifs are returned, rather than all defined by the alphabet.

Calculating GC% for a collection#

This is just a variant of the previous example but re-expressed for unaligned sequences. Note that the returned value from the counts() method acts like a dictionary, but also has an array attribute.

from cogent3 import make_unaligned_seqs

seqcoll = make_unaligned_seqs(
    [
        ("seq1", "ATGAAGGTG"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
counts = seqcoll.counts()
GC_frac = (counts["C"] + counts["G"]) / counts.array.sum()
GC_frac * 100

np.float64(43.75)

Getting k-mer counts#

This only applies to sequence collections. It returns the counts of k-mers per sequence as a numpy array with shape (number of sequences, number of k-mers).

seqcoll = make_unaligned_seqs(
    [
        ("seq1", "ATGAAGGTG"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
kcounts = seqcoll.count_kmers(k=2)
kcounts

array([[0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
       [0, 0, 0, 3, 0, 0, 0, 0, 2, 0, 1, 1, 1, 0, 2, 1],
       [0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 2, 1]])

The order of elements in the numpy array corresponds to the result of

seqcoll.moltype.alphabet.get_kmer_alphabet(k=2)

('TT',
 'TC',
 'TA',
 'TG',
 'CT',
 'CC',
 'CA',
 'CG',
 'AT',
 'AC',
 'AA',
 'AG',
 'GT',
 'GC',
 'GA',
 'GG')

Note

We support third-party plugins for k-mer counting. After installing one, they can be selected by specifying the package with the .count_kmers(k=2, use_hook="<package name>").

Getting motif counts per position#

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    [
        ("seq1", "ATGAAGGTG---"),
        ("seq2", "ATGAAGGTGATG"),
        ("seq3", "ATGAAGGNGATG"),
    ],
    moltype="dna",
)
counts = aln.counts_per_pos()
counts

	T	A	G
0	0	3	0
1	3	0	0
2	0	0	3
3	0	3	0
4	0	3	0
5	0	0	3
6	0	0	3
7	2	0	0
8	0	0	3
9	0	2	0
10	2	0	0
11	0	0	2

Note

There are also .probs_per_pos() and .entropy_per_pos() methods.

Computing motif probabilities from an alignment#

The method get_motif_probs of Alignment objects returns the probabilities for all motifs of a given length. For individual nucleotides:

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
motif_probs = aln.get_motif_probs()
motif_probs

{'T': 0.25520602569782896,
 'C': 0.25808595480726626,
 'A': 0.24390784226849802,
 'G': 0.24280017722640673}

For dinucleotides or longer, we need to pass in a KmerAlphabet with the appropriate word length. Here is an example with trinucleotides:

from cogent3 import get_moltype, load_aligned_seqs, make_table

trinuc_alphabet = get_moltype("dna").alphabet.get_kmer_alphabet(3)
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
motif_probs = aln.get_motif_probs(alphabet=trinuc_alphabet)
table = make_table(header=["motif", "freq"], data=list(motif_probs.items()))
table

motif	freq
TTT	0.0201
TTC	0.0241
TTA	0.0114
TTG	0.0167
TCT	0.0288
...	...
GAG	0.0241
GGT	0.0194
GGC	0.0140
GGA	0.0207
GGG	0.0261

Top 5 and bottom 5 rows from 64 rows x 2 columns

Some calculations in cogent3 require all non-zero values in the motif probabilities, in which case we use a pseudo-count. We illustrate that here with a simple example where T is missing. Without the pseudo-count, the frequency of T is 0.0, with the pseudo-count defined as 1e-6 then the frequency of T will be slightly less than 1e-6.

aln = make_aligned_seqs([("a", "AACAAC"), ("b", "AAGAAG")], moltype="dna")
motif_probs = aln.get_motif_probs()
assert motif_probs["T"] == 0.0
motif_probs = aln.get_motif_probs(pseudocount=1e-6)
motif_probs

{'T': 8.333330555556482e-08,
 'C': 0.16666669444443521,
 'A': 0.6666665277778241,
 'G': 0.16666669444443521}

Note

For alignments, motif probabilities are computed by treating sequences as non-overlapping tuples. To get all possible k-mers, use the iter_kmers() method on the sequence classes.

Working with alignment gaps#

Filtering extracted columns for the gap character#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
col = aln[113:115].iter_positions()
c1, c2 = list(col)
c1, c2
list(filter(lambda x: x == "-", c1))
list(filter(lambda x: x == "-", c2))

['-', '-']

Calculating the gaps per position#

from cogent3 import load_aligned_seqs

aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
gap_counts = aln.count_gaps_per_pos()
gap_counts  # this is a DictArray

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45	46	47	48	49	50	51	52	53	54	55	56	57	58	59	60	61	62	63	64	65	66	67	68	69	70	71	72	73	74	75	76	77	78	79	80	81	82	83	84	85	86	87	88	89	90	91	92	93	94	95	96	97	98	99	100	101	102	103	104	105	106	107	108	109	110	111	112	113	114	115	116	117	118	119	120	121	122	123	124	125	126	127	128	129	130	131	132	133	134	135	136	137	138	139	140	141	142	143	144	145	146	147	148	149	150	151	152	153	154	155	156	157	158	159	160	161	162	163	164	165	166	167	168	169	170	171	172	173	174	175	176	177	178	179	180	181	182	183	184	185	186	187	188	189	190	191	192	193	194	195	196	197	198	199	200	201	202	203	204	205	206	207	208	209	210	211	212	213	214	215	216	217	218	219	220	221	222	223	224	225	226	227	228	229	230	231	232	233	234	235	236	237	238	239	240	241	242	243	244	245	246	247	248	249	250	251	252	253	254	255	256	257	258	259	260	261	262	263	264	265	266	267	268	269	270	271	272	273	274	275	276	277	278	279	280	281	282	283	284	285	286	287	288	289	290	291	292	293	294	295	296	297	298	299	300	301	302	303	304	305	306	307	308	309	310	311	312	313	314	315	316	317	318	319	320	321	322	323	324	325	326	327	328	329	330	331	332	333	334	335	336	337	338	339	340	341	342	343	344	345	346	347	348	349	350	351	352	353	354	355	356	357	358	359	360	361	362	363	364	365	366	367	368	369	370	371	372	373	374	375	376	377	378	379	380	381	382	383	384	385	386	387	388	389	390	391	392	393	394	395	396	397	398	399	400	401	402	403	404	405	406	407	408	409	410	411	412	413	414	415	416	417	418	419	420	421	422	423	424	425	426	427	428	429	430	431	432	433	434	435	436	437	438	439	440	441	442	443	444	445	446	447	448	449	450	451	452	453	454	455	456	457	458	459	460	461	462	463	464	465	466	467	468	469	470	471	472	473	474	475	476	477	478	479	480	481	482	483	484	485	486	487	488	489	490	491	492	493	494	495	496	497	498	499	500	501	502	503	504	505	506	507	508	509	510	511	512	513	514	515	516	517	518	519	520	521	522	523	524	525	526	527	528	529	530	531	532	533	534	535	536	537	538	539	540	541	542	543	544	545	546	547	548	549	550	551	552	553	554	555	556	557	558	559	560	561	562	563	564	565	566	567	568	569	570	571	572	573	574	575	576	577	578	579	580	581	582	583	584	585	586	587	588	589	590	591	592	593	594	595	596	597	598	599	600	601	602	603	604	605	606	607	608	609	610	611	612	613	614	615	616	617	618	619	620	621	622	623	624	625	626	627	628	629	630	631	632	633	634	635	636	637	638	639	640	641	642	643	644	645	646	647	648	649	650	651	652	653	654	655	656	657	658	659	660	661	662	663	664	665	666	667	668	669	670	671	672	673	674	675	676	677	678	679	680	681	682	683	684	685	686	687	688	689	690	691	692	693	694	695	696	697	698	699	700	701	702	703	704	705	706	707	708	709	710	711	712	713	714	715	716	717	718	719	720	721	722	723	724	725	726	727	728	729	730	731	732	733	734	735	736	737	738	739	740	741	742	743	744	745	746	747	748	749	750	751	752	753	754	755	756	757	758	759	760	761	762	763	764	765	766	767	768	769	770	771	772	773	774	775	776	777	778	779	780	781	782	783	784	785	786	787	788	789	790	791	792	793	794	795	796	797	798	799	800	801	802	803	804	805	806	807	808	809	810	811	812	813	814	815	816	817	818	819	820	821	822	823	824	825	826	827	828	829	830	831	832	833	834	835	836	837	838	839	840	841	842	843	844	845	846	847	848	849	850	851	852	853	854	855	856	857	858	859	860	861	862	863	864	865	866	867	868	869	870	871	872	873	874	875	876	877	878	879	880	881	882	883	884	885	886	887	888	889	890	891	892	893	894	895	896	897	898	899	900	901	902	903	904	905	906	907	908	909	910	911	912	913	914	915	916	917	918	919	920	921	922	923	924	925	926	927	928	929	930	931	932	933	934	935	936	937	938	939	940	941	942	943	944	945	946	947	948	949	950	951	952	953	954	955	956	957	958	959	960	961	962	963	964	965	966	967	968	969	970	971	972	973	974	975	976	977	978	979	980	981	982	983	984	985	986	987	988	989	990	991	992	993	994	995	996	997	998	999	1000	1001	1002	1003	1004	1005	1006	1007	1008	1009	1010	1011	1012	1013	1014	1015	1016	1017	1018	1019	1020	1021	1022	1023	1024	1025	1026	1027	1028	1029	1030	1031	1032	1033	1034	1035	1036	1037	1038	1039	1040	1041	1042	1043	1044	1045	1046	1047	1048	1049	1050	1051	1052	1053	1054	1055	1056	1057	1058	1059	1060	1061	1062	1063	1064	1065	1066	1067	1068	1069	1070	1071	1072	1073	1074	1075	1076	1077	1078	1079	1080	1081	1082	1083	1084	1085	1086	1087	1088	1089	1090	1091	1092	1093	1094	1095	1096	1097	1098	1099	1100	1101	1102	1103	1104	1105	1106	1107	1108	1109	1110	1111	1112	1113	1114	1115	1116	1117	1118	1119	1120	1121	1122	1123	1124	1125	1126	1127	1128	1129	1130	1131	1132	1133	1134	1135	1136	1137	1138	1139	1140	1141	1142	1143	1144	1145	1146	1147	1148	1149	1150	1151	1152	1153	1154	1155	1156	1157	1158	1159	1160	1161	1162	1163	1164	1165	1166	1167	1168	1169	1170	1171	1172	1173	1174	1175	1176	1177	1178	1179	1180	1181	1182	1183	1184	1185	1186	1187	1188	1189	1190	1191	1192	1193	1194	1195	1196	1197	1198	1199	1200	1201	1202	1203	1204	1205	1206	1207	1208	1209	1210	1211	1212	1213	1214	1215	1216	1217	1218	1219	1220	1221	1222	1223	1224	1225	1226	1227	1228	1229	1230	1231	1232	1233	1234	1235	1236	1237	1238	1239	1240	1241	1242	1243	1244	1245	1246	1247	1248	1249	1250	1251	1252	1253	1254	1255	1256	1257	1258	1259	1260	1261	1262	1263	1264	1265	1266	1267	1268	1269	1270	1271	1272	1273	1274	1275	1276	1277	1278	1279	1280	1281	1282	1283	1284	1285	1286	1287	1288	1289	1290	1291	1292	1293	1294	1295	1296	1297	1298	1299	1300	1301	1302	1303	1304	1305	1306	1307	1308	1309	1310	1311	1312	1313	1314	1315	1316	1317	1318	1319	1320	1321	1322	1323	1324	1325	1326	1327	1328	1329	1330	1331	1332	1333	1334	1335	1336	1337	1338	1339	1340	1341	1342	1343	1344	1345	1346	1347	1348	1349	1350	1351	1352	1353	1354	1355	1356	1357	1358	1359	1360	1361	1362	1363	1364	1365	1366	1367	1368	1369	1370	1371	1372	1373	1374	1375	1376	1377	1378	1379	1380	1381	1382	1383	1384	1385	1386	1387	1388	1389	1390	1391	1392	1393	1394	1395	1396	1397	1398	1399	1400	1401	1402	1403	1404	1405	1406	1407	1408	1409	1410	1411	1412	1413	1414	1415	1416	1417	1418	1419	1420	1421	1422	1423	1424	1425	1426	1427	1428	1429	1430	1431	1432	1433	1434	1435	1436	1437	1438	1439	1440	1441	1442	1443	1444	1445	1446	1447	1448	1449	1450	1451	1452	1453	1454	1455	1456	1457	1458	1459	1460	1461	1462	1463	1464	1465	1466	1467	1468	1469	1470	1471	1472	1473	1474	1475	1476	1477	1478	1479	1480	1481	1482	1483	1484	1485	1486	1487	1488	1489	1490	1491	1492	1493	1494	1495	1496	1497	1498	1499	1500	1501	1502	1503	1504	1505	1506	1507	1508	1509	1510	1511	1512	1513	1514	1515	1516	1517	1518	1519	1520	1521	1522	1523	1524
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	0	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

To turn that into grap fraction

gap_frac = gap_counts.array / aln.num_seqs

Filtering alignments based on gaps#

If we want to remove positions from the alignment which are gaps in more than a certain percentage of the sequences, use the omit_gap_pos().

aln = make_aligned_seqs(
    [
        ("seq1", "ATGAA---TG-"),
        ("seq2", "ATG-AGTGATG"),
        ("seq3", "AT--AG-GATG"),
    ],
    moltype="dna",
)
filtered_aln = aln.omit_gap_pos(allowed_gap_frac=0.40)
filtered_aln

	0
seq2	ATGAGGATG
seq1	....--TG-
seq3	..-......

3 x 9 dna alignment

Note

The default for filtered_aln.omit_gap_pos() is to remove columns with gaps in all the sequences. This can occur after sequences have been removed from the alignment.

	T	A	G
0	0	3	0
1	3	0	0
2	0	0	3
3	0	3	0
4	0	3	0
5	0	0	3
6	0	0	3
7	2	0	0
8	0	0	3
9	0	2	0
10	2	0	0
11	0	0	2

	T	A	G
0	0	3	0
1	3	0	0
2	0	0	3
3	0	3	0
4	0	3	0
5	0	0	3
6	0	0	3
7	2	0	0
8	0	0	3
9	0	2	0
10	2	0	0
11	0	0	2

Sequence Collections and Alignments#

Basic Collection objects#

Constructing a SequenceCollection or Alignment object from strings#

Converting a SequenceCollection to FASTA format#

Adding new sequences to an existing collection or alignment#

The elements of a collection or alignment#

Accessing individual sequences from a collection or alignment by name#

Slice the sequences from a collection like a list#

Getting a subset of sequences from the alignment#

Writing sequences to file#

Alignments#

Creating an Alignment object from a SequenceCollection#

Convert alignment to DNA, RNA or PROTEIN moltypes#

Handling gaps#

Remove all gaps from an alignment#

Removing all columns with gaps in a named sequence#

Converting an alignment to FASTA format#

Converting an alignment to Phylip format#

Converting an alignment to a list of strings#

Converting an alignment to a list of arrays#

Getting an alignment as an array#

Slicing an alignment#

Getting a single column from an alignment#

Getting a region of contiguous columns#

Iterating over alignment positions#

Getting codon 3rd positions#

Filtering positions#

Trim terminal stop codons#

Eliminating columns with non-nucleotide characters#

Getting all variable positions from an alignment#

Getting all constant positions from an alignment#

Getting all variable codons from an alignment#

Filtering sequences#

Extracting sequences using an arbitrary function into a new alignment object#

Computing alignment statistics#

Getting motif counts#

Calculating GC% for a collection#

Getting k-mer counts#

Getting motif counts per position#

Computing motif probabilities from an alignment#

Working with alignment gaps#

Filtering extracted columns for the gap character#

Calculating the gaps per position#

Filtering alignments based on gaps#

Constructing a `SequenceCollection` or `Alignment` object from strings#

Converting a `SequenceCollection` to FASTA format#

Creating an `Alignment` object from a `SequenceCollection`#

	T	A	G
0	0	3	0
1	3	0	0
2	0	0	3
3	0	3	0
4	0	3	0
5	0	0	3
6	0	0	3
7	2	0	0
8	0	0	3
9	0	2	0
10	2	0	0
11	0	0	2