Sequence Collections and Alignments#
Note
These docs now use the new_type
core objects via the following setting.
import os
# using new types without requiring an explicit argument
os.environ["COGENT3_NEW_TYPE"] = "1"
For loading collections of unaligned or aligned sequences see Loading an alignment from a file or url.
Basic Collection objects#
Constructing a SequenceCollection
or Alignment
object from strings#
from cogent3 import make_aligned_seqs
data = {"seq1": "ATGACC", "seq2": "ATCGCC"}
# for an alignment, sequences must be the same length
seqs = make_aligned_seqs(data=data, moltype="dna")
type(seqs)
cogent3.core.new_alignment.Alignment
from cogent3 import make_unaligned_seqs
data = {"seq1": "ATGCC", "seq2": "ATCG"}
seqs = make_unaligned_seqs(data, moltype="dna")
type(seqs)
cogent3.core.new_alignment.SequenceCollection
Converting a SequenceCollection
to FASTA format#
from cogent3 import load_unaligned_seqs
seqs = load_unaligned_seqs("data/test.paml", moltype="dna")
print(seqs.to_fasta())
>NineBande
GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
>Mouse
GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
>Human
GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
>HowlerMon
GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
>DogFaced
GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT
Adding new sequences to an existing collection or alignment#
More than one sequence can be added to a collection simultaneously. Note that add_seqs()
does not modify the existing collection/alignment, it creates a new one.
The elements of a collection or alignment#
Accessing individual sequences from a collection or alignment by name#
We can get a sequence by name by indexing the .seqs
attribute.
from cogent3 import make_unaligned_seqs
seqcoll = make_unaligned_seqs(
[("seq1", "ATGAA"), ("seq2", "ATGAGTGATG"), ("seq3", "ATAGGATG")],
moltype="dna",
)
seq = seqcoll.seqs["seq1"]
seq
0 | |
seq1 | ATGAA |
DnaSequence, length=5
For an alignment, the result is an Aligned instance.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
[("seq1", "ATGAA------"), ("seq2", "ATG-AGTGATG"), ("seq3", "AT--AG-GATG")],
moltype="dna",
)
aligned = aln.seqs["seq1"]
aligned
Aligned(name='seq1', seq='ATGAA--... 11', moltype='dna')
For the alignment case, you can get the ungapped sequence by accessing the .seq
attribute of the aligned instance.
aligned.seq
0 | |
seq1 | ATGAA |
DnaSequence, length=5
seq = aln.get_seq("seq1")
seq.name
type(seq)
seq.is_gapped()
False
Alternatively, if you want to extract the aligned (i.e., gapped) sequence from an alignment, you can use get_gapped_seq
.
seq = aln.get_gapped_seq("seq1")
seq.is_gapped()
seq
0 | |
seq1 | ATGAA------ |
DnaSequence, length=11
To see the names of the sequences in a sequence collection, use the names
attribute.
aln.names
('seq1', 'seq2', 'seq3')
Slice the sequences from a collection like a list#
Use the .seqs
attribute. We can index a single sequence
from cogent3 import load_unaligned_seqs
fn = "data/long_testseqs.fasta"
seqs = load_unaligned_seqs(fn, moltype="dna")
my_seq = seqs.seqs[0]
my_seq
0 | |
Human | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTATTACTCACT |
DnaSequence, length=2,532 (truncated to 60)
but you cannot index a slice (Use .take_seqs()
for that).
seqs.seqs[:2]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 seqs.seqs[:2]
File /opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/functools.py:949, in singledispatchmethod.__get__.<locals>._method(*args, **kwargs)
947 def _method(*args, **kwargs):
948 method = self.dispatcher.dispatch(args[0].__class__)
--> 949 return method.__get__(obj, cls)(*args, **kwargs)
File ~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/new_alignment.py:4509, in _IndexableSeqs.__getitem__(self, key)
4503 @singledispatchmethod
4504 def __getitem__(
4505 self,
4506 key: str | int | slice,
4507 ) -> new_sequence.Sequence | Aligned:
4508 msg = f"indexing not supported for {type(key)}, try .take_seqs()"
-> 4509 raise TypeError(msg)
TypeError: indexing not supported for <class 'slice'>, try .take_seqs()
Getting a subset of sequences from the alignment#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/test.paml", moltype="dna")
aln.names
('NineBande', 'Mouse', 'Human', 'HowlerMon', 'DogFaced')
new = aln.take_seqs(["Human", "HowlerMon"])
new.names
('Human', 'HowlerMon')
Alternatively, you can extract only the sequences which are not specified by passing negate=True
:
new = aln.take_seqs(["Human", "HowlerMon"], negate=True)
new.names
('NineBande', 'Mouse', 'DogFaced')
Note
The subset contains references to the original sequences, not copies.
Writing sequences to file#
Both collection and alignment objects have a write()
method. The output format is inferred from the filename suffix,
from cogent3 import make_aligned_seqs
dna = {"seq1": "ATGACC", "seq2": "ATCGCC"}
aln = make_aligned_seqs(data=dna, moltype="dna")
aln.write("sample.fasta")
or by the format
argument.
aln.write("sample", format="fasta")
from cogent3.util.io import remove_files
remove_files(["sample", "sample.fasta"], error_on_missing=False)
Alignments#
Creating an Alignment
object from a SequenceCollection
#
from cogent3 import load_unaligned_seqs
from cogent3.core.alignment import Alignment
seq = load_unaligned_seqs("data/test.paml", moltype="dna")
seq
0 | |
NineBande | GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT |
Mouse | GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT |
Human | GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT |
HowlerMon | GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT |
DogFaced | GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT |
5 x {min=60, median=60.0, max=60} dna sequence collection
aln = Alignment(seq.seqs)
aln
0 | |
DogFaced | GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT |
NineBande | ......C....A....G........C.....G............................ |
Mouse | ...GT...........G........C..C..G...A.G.........C..C......GT. |
Human | ...........A..T..........C..G..G.......................G.... |
HowlerMon | ...........A..T..........C.....G.G.......................... |
5 x 60 dna alignment
Convert alignment to DNA, RNA or PROTEIN moltypes#
This is useful if you’ve loaded a sequence alignment without specifying the moltype and later need to convert it using the dedicated method
from cogent3 import make_aligned_seqs
data = [("a", "ACG---"), ("b", "CCTGGG")]
aln = make_aligned_seqs(data=data, moltype="text")
dna = aln.to_dna()
dna
0 | |
b | CCTGGG |
a | A.G--- |
2 x 6 dna alignment
Or using the generic to_moltype()
method
aln.to_moltype("dna")
0 | |
b | CCTGGG |
a | A.G--- |
2 x 6 dna alignment
To RNA
from cogent3 import make_aligned_seqs
data = [("a", "ACG---"), ("b", "CCUGGG")]
aln = make_aligned_seqs(data=data, moltype="text")
rna = aln.to_rna()
rna
0 | |
b | CCUGGG |
a | A.G--- |
2 x 6 rna alignment
To PROTEIN
from cogent3 import make_aligned_seqs
data = [("x", "TYV"), ("y", "TE-")]
aln = make_aligned_seqs(data=data, moltype="text")
prot = aln.to_moltype("protein")
prot
0 | |
x | TYV |
y | .E- |
2 x 3 protein alignment
Handling gaps#
Remove all gaps from an alignment#
This necessarily returns a SequenceCollection
.
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
degapped = aln.degap()
degapped
0 | |
human | AGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCGCCGAGTCAGGCTGCCCAGCCCGCCGGG |
macaque | AGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCACCGAGTCAGGCTGCCCAGCCCGCCGGG |
chimp | AGCGCCCGCGGGTTCTGAGAGCGCTCAAAGCCGCCGAGTCAGGCTGCCCAGCCCGCCGGG |
3 x {min=1500, median=1503.0, max=1511} dna sequence collection
Removing all columns with gaps in a named sequence#
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
[("seq1", "ATGAA---TG-"), ("seq2", "ATG-AGTGATG"), ("seq3", "AT--AG-GATG")],
moltype="dna",
)
new_aln = aln.get_degapped_relative_to("seq1")
new_aln
0 | |
seq1 | ATGAATG |
seq2 | ...-.AT |
seq3 | ..--.AT |
3 x 7 dna alignment
Converting an alignment to FASTA format#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
fasta_align = aln.to_fasta()
Converting an alignment to Phylip format#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/test.paml", moltype="dna")
got = aln.to_phylip()
print(got)
5 60
NineBande GCAAGGCGCCAACAGAGCAGATGGGCTGAAAGTAAGGAAACATGTAATGATAGGCAGACT
Mouse GCAGTGAGCCAGCAGAGCAGATGGGCTGCAAGTAAAGGAACATGTAACGACAGGCAGGTT
Human GCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAGGAAACATGTAATGATAGGCGGACT
HowlerMon GCAAGGAGCCAACATAACAGATGGGCTGAAAGTGAGGAAACATGTAATGATAGGCAGACT
DogFaced GCAAGGAGCCAGCAGAACAGATGGGTTGAAACTAAGGAAACATGTAATGATAGGCAGACT
Converting an alignment to a list of strings#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/test.paml", moltype="dna")
data = [str(s) for s in aln.seqs]
Converting an alignment to a list of arrays#
import numpy
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/test.paml", moltype="dna")
data = [numpy.array(s) for s in aln.seqs]
data
[array([3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8),
array([3, 1, 2, 3, 0, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 1, 2, 2, 3, 0, 2, 2, 2, 3, 3, 2, 2, 1, 2, 0, 3,
0, 2, 2, 1, 3, 2, 1, 2, 3, 3, 1, 2, 3, 3, 0, 0], dtype=uint8),
array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 3, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 3, 3, 2, 1, 0], dtype=uint8),
array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8),
array([3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1, 2, 3, 2, 0,
3, 3, 3, 0, 0, 3, 2, 2, 2, 1, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0], dtype=uint8)]
Getting an alignment as an array#
The rows are sequences and their order corresponds to that of aln.names
.
import numpy
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/test.paml", moltype="dna")
aln.array_seqs
array([[3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0],
[3, 1, 2, 3, 0, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 3, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 1, 2, 2, 3, 0, 2, 2, 2, 3, 3, 2, 2, 1, 2, 0, 3,
0, 2, 2, 1, 3, 2, 1, 2, 3, 3, 1, 2, 3, 3, 0, 0],
[3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 3, 2, 2, 3, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 3, 3, 2, 1, 0],
[3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1, 2, 3, 2, 0,
3, 3, 3, 1, 0, 3, 2, 2, 2, 3, 0, 3, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0],
[3, 1, 2, 2, 3, 3, 2, 3, 1, 1, 2, 3, 1, 2, 3, 2, 2, 1, 2, 3, 2, 0,
3, 3, 3, 0, 0, 3, 2, 2, 2, 1, 0, 2, 2, 3, 3, 2, 2, 2, 1, 2, 0, 3,
0, 2, 2, 0, 3, 2, 0, 2, 3, 3, 1, 2, 3, 2, 1, 0]], dtype=uint8)
Slicing an alignment#
Alignments can be thought of as a matrix, with sequences along the rows and alignment positions as the columns. However, all slicing is only along positions.
from cogent3 import load_aligned_seqs
fn = "data/long_testseqs.fasta"
aln = load_aligned_seqs(fn, moltype="dna")
aln[:24]
0 | |
DogFaced | TGTGGCACAAATACTCATGCCAAC |
Human | ......................G. |
HowlerMon | ......................G. |
Mouse | .........G..G.........G. |
NineBande | ........................ |
5 x 24 dna alignment
Warning
A SequenceCollection
cannot be sliced!
Getting a single column from an alignment#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/test.paml", moltype="dna")
column_four = aln[3]
column_four
0 | |
DogFaced | A |
NineBande | . |
Mouse | G |
Human | . |
HowlerMon | . |
5 x 1 dna alignment
Getting a region of contiguous columns#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
region = aln[50:70]
Iterating over alignment positions#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
col = list(aln[113:115].iter_positions())
col
[['A', 'A', 'A'], ['T', '-', '-']]
Getting codon 3rd positions#
We can use conventional slice notation. Note, because Python counts from 0, the 3rd position starts at index 2.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data={"seq1": "ATGATGATG---", "seq2": "ATGATGATGATG"},
moltype="dna",
)
aln[2::3]
0 | |
seq2 | GGGG |
seq1 | ...- |
2 x 4 dna alignment
Filtering positions#
Trim terminal stop codons#
For evolutionary analyses that use codon models we need to exclude terminating stop codons. For the case where the sequences are all of length divisible by 3.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data={"seq1": "ACGTAA---", "seq2": "ACGACA---", "seq3": "ACGCAATGA"},
moltype="dna",
)
new = aln.trim_stop_codons()
new
0 | |
seq2 | ACGACA--- |
seq1 | ...---... |
seq3 | ...CA.... |
3 x 9 dna alignment
To detect if the alignment contains sequences not divisible by 3, use the strict
argument. This argument covers both allowing partial terminating codons / not divisible by 3.
aln = make_aligned_seqs(
data={
"seq1": "ACGTAA---",
"seq2": "ACGAC----", # terminal codon incomplete
"seq3": "ACGCAATGA",
},
moltype="dna",
)
new = aln.trim_stop_codons(strict=True)
Eliminating columns with non-nucleotide characters#
We sometimes want to eliminate ambiguous or gap data from our alignments. We demonstrate how to exclude alignment columns based on the characters they contain. In the first instance, we do this just for single nucleotide columns, then for trinucleotides (equivalent for handling codons). Both are done using the no_degenerates()
method.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data=[
("seq1", "ATGAAGGTG---"),
("seq2", "ATGAAGGTGATG"),
("seq3", "ATGAAGGNGATG"),
],
moltype="dna",
)
We apply to nucleotides,
nucs = aln.no_degenerates()
nucs
0 | |
seq1 | ATGAAGGG |
seq2 | ........ |
seq3 | ........ |
3 x 8 dna alignment
Applying the same filter to trinucleotides (specified by setting motif_length=3
).
trinucs = aln.no_degenerates(motif_length=3)
trinucs
0 | |
seq1 | ATGAAG |
seq2 | ...... |
seq3 | ...... |
3 x 6 dna alignment
Getting all variable positions from an alignment#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions()
just_variable_aln = aln.take_positions(pos)
just_variable_aln[:10]
0 | |
DogFaced | AAACAAAATA |
Human | ..G.....CT |
HowlerMon | ..G...G.CT |
Mouse | GGG.CC.GCT |
NineBande | ...T....CT |
5 x 10 dna alignment
Getting all constant positions from an alignment#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions()
just_constant_aln = aln.take_positions(pos, negate=True)
just_constant_aln[:10]
0 | |
DogFaced | TGTGGCACAA |
Human | .......... |
HowlerMon | .......... |
Mouse | .......... |
NineBande | .......... |
5 x 10 dna alignment
Getting all variable codons from an alignment#
This is done using the filtered
method using the motif_length
argument.
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/long_testseqs.fasta", moltype="dna")
pos = aln.variable_positions(motif_length=3)
variable_codons = aln.take_positions(pos)
just_variable_aln[:9]
0 | |
DogFaced | AAACAAAAT |
Human | ..G.....C |
HowlerMon | ..G...G.C |
Mouse | GGG.CC.GC |
NineBande | ...T....C |
5 x 9 dna alignment
Filtering sequences#
Extracting sequences using an arbitrary function into a new alignment object#
You can use take_seqs_if
to extract sequences into a new alignment object based on whether an arbitrary function applied to the sequence evaluates to True. For example, to extract sequences which don’t contain any N bases you could do the following:
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data=[
("seq1", "ATGAAGGTG---"),
("seq2", "ATGAAGGTGATG"),
("seq3", "ATGAAGGNGATG"),
],
moltype="dna",
)
def no_N_chars(s):
return "N" not in str(s)
aln.take_seqs_if(no_N_chars)
0 | |
seq2 | ATGAAGGTGATG |
seq1 | .........--- |
2 x 12 dna alignment
You can additionally get the sequences where the provided function evaluates to False:
aln.take_seqs_if(no_N_chars, negate=True)
0 | |
seq3 | ATGAAGGNGATG |
1 x 12 dna alignment
Computing alignment statistics#
Getting motif counts#
We state the motif length we want and whether to allow gap or ambiguous characters. The latter only has meaning for IPUAC character sets (the DNA, RNA or PROTEIN moltypes). We illustrate this for the DNA moltype with motif lengths of 1 and 3.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data=[
("seq1", "ATGAAGGTG---"),
("seq2", "ATGAAGGTGATG"),
("seq3", "ATGAAGGNGATG"),
],
moltype="dna",
)
counts = aln.counts()
counts
A | C | G | T |
---|---|---|---|
11 | 0 | 14 | 7 |
counts = aln.counts(motif_length=3)
counts
AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | AGG | AGT | ATA | ATC | ATG | ATT | CAA | CAC | CAG | CAT | CCA | CCC | CCG | CCT | CGA | CGC | CGG | CGT | CTA | CTC | CTG | CTT | GAA | GAC | GAG | GAT | GCA | GCC | GCG | GCT | GGA | GGC | GGG | GGT | GTA | GTC | GTG | GTT | TAA | TAC | TAG | TAT | TCA | TCC | TCG | TCT | TGA | TGC | TGG | TGT | TTA | TTC | TTG | TTT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
counts = aln.counts(include_ambiguity=True)
counts
A | C | G | N | T |
---|---|---|---|---|
11 | 0 | 14 | 1 | 7 |
Note
Only the observed motifs are returned, rather than all defined by the alphabet.
Getting motif counts per sequence#
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data=[
("seq1", "ATGAAGGTG---"),
("seq2", "ATGAAGGTGATG"),
("seq3", "ATGAAGGNGATG"),
],
moltype="dna",
)
counts = aln.counts_per_seq()
counts
A | C | G | T | |
---|---|---|---|---|
seq1 | 3 | 0 | 4 | 2 |
seq2 | 4 | 0 | 5 | 3 |
seq3 | 4 | 0 | 5 | 2 |
Note
There are also .probs_per_seq()
and .entropy_per_seq()
methods.
Getting motif counts per position#
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
data=[
("seq1", "ATGAAGGTG---"),
("seq2", "ATGAAGGTGATG"),
("seq3", "ATGAAGGNGATG"),
],
moltype="dna",
)
counts = aln.counts_per_pos()
counts
T | C | A | G | |
---|---|---|---|---|
0 | 0 | 0 | 3 | 0 |
1 | 3 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 3 |
3 | 0 | 0 | 3 | 0 |
4 | 0 | 0 | 3 | 0 |
5 | 0 | 0 | 0 | 3 |
6 | 0 | 0 | 0 | 3 |
7 | 2 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 3 |
9 | 0 | 0 | 2 | 0 |
10 | 2 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 2 |
Note
There are also .probs_per_pos()
and .entropy_per_pos()
methods.
Computing motif probabilities from an alignment#
The method get_motif_probs
of Alignment
objects returns the probabilities for all motifs of a given length. For individual nucleotides:
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
motif_probs = aln.get_motif_probs()
motif_probs
{'T': 0.25520602569782896,
'C': 0.25808595480726626,
'A': 0.24390784226849802,
'G': 0.24280017722640673}
For dinucleotides or longer, we need to pass in a KmerAlphabet
with the appropriate word length. Here is an example with trinucleotides:
from cogent3 import load_aligned_seqs, get_moltype, make_table
trinuc_alphabet = get_moltype("dna").alphabet.get_kmer_alphabet(3)
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
motif_probs = aln.get_motif_probs(alphabet=trinuc_alphabet)
table = make_table(header=["motif", "freq"], data=list(motif_probs.items()))
table
motif | freq |
---|---|
TTT | 0.0201 |
TTC | 0.0241 |
TTA | 0.0114 |
TTG | 0.0167 |
TCT | 0.0288 |
... | ... |
GAG | 0.0241 |
GGT | 0.0194 |
GGC | 0.0140 |
GGA | 0.0207 |
GGG | 0.0261 |
Top 5 and bottom 5 rows from 64 rows x 2 columns
Some calculations in cogent3
require all non-zero values in the motif probabilities, in which case we use a pseudo-count. We illustrate that here with a simple example where T is missing. Without the pseudo-count, the frequency of T is 0.0, with the pseudo-count defined as 1e-6 then the frequency of T will be slightly less than 1e-6.
aln = make_aligned_seqs(data=[("a", "AACAAC"), ("b", "AAGAAG")], moltype="dna")
motif_probs = aln.get_motif_probs()
assert motif_probs["T"] == 0.0
motif_probs = aln.get_motif_probs(pseudocount=1e-6)
motif_probs
{'T': 8.333330555556482e-08,
'C': 0.16666669444443521,
'A': 0.6666665277778241,
'G': 0.16666669444443521}
Note
For alignments, motif probabilities are computed by treating sequences as non-overlapping tuples. To get all possible k-mers, use the iter_kmers()
method on the sequence classes.
Working with alignment gaps#
Filtering extracted columns for the gap character#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
col = aln[113:115].iter_positions()
c1, c2 = list(col)
c1, c2
list(filter(lambda x: x == "-", c1))
list(filter(lambda x: x == "-", c2))
['-', '-']
Calculating the gaps per position#
from cogent3 import load_aligned_seqs
aln = load_aligned_seqs("data/primate_cdx2_promoter.fasta", moltype="dna")
gap_counts = aln.count_gaps_per_pos()
gap_counts # this is a DictArray
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 | 250 | 251 | 252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | 368 | 369 | 370 | 371 | 372 | 373 | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 440 | 441 | 442 | 443 | 444 | 445 | 446 | 447 | 448 | 449 | 450 | 451 | 452 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 484 | 485 | 486 | 487 | 488 | 489 | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | 511 | 512 | 513 | 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | 524 | 525 | 526 | 527 | 528 | 529 | 530 | 531 | 532 | 533 | 534 | 535 | 536 | 537 | 538 | 539 | 540 | 541 | 542 | 543 | 544 | 545 | 546 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 554 | 555 | 556 | 557 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | 566 | 567 | 568 | 569 | 570 | 571 | 572 | 573 | 574 | 575 | 576 | 577 | 578 | 579 | 580 | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | 590 | 591 | 592 | 593 | 594 | 595 | 596 | 597 | 598 | 599 | 600 | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 | 614 | 615 | 616 | 617 | 618 | 619 | 620 | 621 | 622 | 623 | 624 | 625 | 626 | 627 | 628 | 629 | 630 | 631 | 632 | 633 | 634 | 635 | 636 | 637 | 638 | 639 | 640 | 641 | 642 | 643 | 644 | 645 | 646 | 647 | 648 | 649 | 650 | 651 | 652 | 653 | 654 | 655 | 656 | 657 | 658 | 659 | 660 | 661 | 662 | 663 | 664 | 665 | 666 | 667 | 668 | 669 | 670 | 671 | 672 | 673 | 674 | 675 | 676 | 677 | 678 | 679 | 680 | 681 | 682 | 683 | 684 | 685 | 686 | 687 | 688 | 689 | 690 | 691 | 692 | 693 | 694 | 695 | 696 | 697 | 698 | 699 | 700 | 701 | 702 | 703 | 704 | 705 | 706 | 707 | 708 | 709 | 710 | 711 | 712 | 713 | 714 | 715 | 716 | 717 | 718 | 719 | 720 | 721 | 722 | 723 | 724 | 725 | 726 | 727 | 728 | 729 | 730 | 731 | 732 | 733 | 734 | 735 | 736 | 737 | 738 | 739 | 740 | 741 | 742 | 743 | 744 | 745 | 746 | 747 | 748 | 749 | 750 | 751 | 752 | 753 | 754 | 755 | 756 | 757 | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 | 768 | 769 | 770 | 771 | 772 | 773 | 774 | 775 | 776 | 777 | 778 | 779 | 780 | 781 | 782 | 783 | 784 | 785 | 786 | 787 | 788 | 789 | 790 | 791 | 792 | 793 | 794 | 795 | 796 | 797 | 798 | 799 | 800 | 801 | 802 | 803 | 804 | 805 | 806 | 807 | 808 | 809 | 810 | 811 | 812 | 813 | 814 | 815 | 816 | 817 | 818 | 819 | 820 | 821 | 822 | 823 | 824 | 825 | 826 | 827 | 828 | 829 | 830 | 831 | 832 | 833 | 834 | 835 | 836 | 837 | 838 | 839 | 840 | 841 | 842 | 843 | 844 | 845 | 846 | 847 | 848 | 849 | 850 | 851 | 852 | 853 | 854 | 855 | 856 | 857 | 858 | 859 | 860 | 861 | 862 | 863 | 864 | 865 | 866 | 867 | 868 | 869 | 870 | 871 | 872 | 873 | 874 | 875 | 876 | 877 | 878 | 879 | 880 | 881 | 882 | 883 | 884 | 885 | 886 | 887 | 888 | 889 | 890 | 891 | 892 | 893 | 894 | 895 | 896 | 897 | 898 | 899 | 900 | 901 | 902 | 903 | 904 | 905 | 906 | 907 | 908 | 909 | 910 | 911 | 912 | 913 | 914 | 915 | 916 | 917 | 918 | 919 | 920 | 921 | 922 | 923 | 924 | 925 | 926 | 927 | 928 | 929 | 930 | 931 | 932 | 933 | 934 | 935 | 936 | 937 | 938 | 939 | 940 | 941 | 942 | 943 | 944 | 945 | 946 | 947 | 948 | 949 | 950 | 951 | 952 | 953 | 954 | 955 | 956 | 957 | 958 | 959 | 960 | 961 | 962 | 963 | 964 | 965 | 966 | 967 | 968 | 969 | 970 | 971 | 972 | 973 | 974 | 975 | 976 | 977 | 978 | 979 | 980 | 981 | 982 | 983 | 984 | 985 | 986 | 987 | 988 | 989 | 990 | 991 | 992 | 993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | 1011 | 1012 | 1013 | 1014 | 1015 | 1016 | 1017 | 1018 | 1019 | 1020 | 1021 | 1022 | 1023 | 1024 | 1025 | 1026 | 1027 | 1028 | 1029 | 1030 | 1031 | 1032 | 1033 | 1034 | 1035 | 1036 | 1037 | 1038 | 1039 | 1040 | 1041 | 1042 | 1043 | 1044 | 1045 | 1046 | 1047 | 1048 | 1049 | 1050 | 1051 | 1052 | 1053 | 1054 | 1055 | 1056 | 1057 | 1058 | 1059 | 1060 | 1061 | 1062 | 1063 | 1064 | 1065 | 1066 | 1067 | 1068 | 1069 | 1070 | 1071 | 1072 | 1073 | 1074 | 1075 | 1076 | 1077 | 1078 | 1079 | 1080 | 1081 | 1082 | 1083 | 1084 | 1085 | 1086 | 1087 | 1088 | 1089 | 1090 | 1091 | 1092 | 1093 | 1094 | 1095 | 1096 | 1097 | 1098 | 1099 | 1100 | 1101 | 1102 | 1103 | 1104 | 1105 | 1106 | 1107 | 1108 | 1109 | 1110 | 1111 | 1112 | 1113 | 1114 | 1115 | 1116 | 1117 | 1118 | 1119 | 1120 | 1121 | 1122 | 1123 | 1124 | 1125 | 1126 | 1127 | 1128 | 1129 | 1130 | 1131 | 1132 | 1133 | 1134 | 1135 | 1136 | 1137 | 1138 | 1139 | 1140 | 1141 | 1142 | 1143 | 1144 | 1145 | 1146 | 1147 | 1148 | 1149 | 1150 | 1151 | 1152 | 1153 | 1154 | 1155 | 1156 | 1157 | 1158 | 1159 | 1160 | 1161 | 1162 | 1163 | 1164 | 1165 | 1166 | 1167 | 1168 | 1169 | 1170 | 1171 | 1172 | 1173 | 1174 | 1175 | 1176 | 1177 | 1178 | 1179 | 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | 1187 | 1188 | 1189 | 1190 | 1191 | 1192 | 1193 | 1194 | 1195 | 1196 | 1197 | 1198 | 1199 | 1200 | 1201 | 1202 | 1203 | 1204 | 1205 | 1206 | 1207 | 1208 | 1209 | 1210 | 1211 | 1212 | 1213 | 1214 | 1215 | 1216 | 1217 | 1218 | 1219 | 1220 | 1221 | 1222 | 1223 | 1224 | 1225 | 1226 | 1227 | 1228 | 1229 | 1230 | 1231 | 1232 | 1233 | 1234 | 1235 | 1236 | 1237 | 1238 | 1239 | 1240 | 1241 | 1242 | 1243 | 1244 | 1245 | 1246 | 1247 | 1248 | 1249 | 1250 | 1251 | 1252 | 1253 | 1254 | 1255 | 1256 | 1257 | 1258 | 1259 | 1260 | 1261 | 1262 | 1263 | 1264 | 1265 | 1266 | 1267 | 1268 | 1269 | 1270 | 1271 | 1272 | 1273 | 1274 | 1275 | 1276 | 1277 | 1278 | 1279 | 1280 | 1281 | 1282 | 1283 | 1284 | 1285 | 1286 | 1287 | 1288 | 1289 | 1290 | 1291 | 1292 | 1293 | 1294 | 1295 | 1296 | 1297 | 1298 | 1299 | 1300 | 1301 | 1302 | 1303 | 1304 | 1305 | 1306 | 1307 | 1308 | 1309 | 1310 | 1311 | 1312 | 1313 | 1314 | 1315 | 1316 | 1317 | 1318 | 1319 | 1320 | 1321 | 1322 | 1323 | 1324 | 1325 | 1326 | 1327 | 1328 | 1329 | 1330 | 1331 | 1332 | 1333 | 1334 | 1335 | 1336 | 1337 | 1338 | 1339 | 1340 | 1341 | 1342 | 1343 | 1344 | 1345 | 1346 | 1347 | 1348 | 1349 | 1350 | 1351 | 1352 | 1353 | 1354 | 1355 | 1356 | 1357 | 1358 | 1359 | 1360 | 1361 | 1362 | 1363 | 1364 | 1365 | 1366 | 1367 | 1368 | 1369 | 1370 | 1371 | 1372 | 1373 | 1374 | 1375 | 1376 | 1377 | 1378 | 1379 | 1380 | 1381 | 1382 | 1383 | 1384 | 1385 | 1386 | 1387 | 1388 | 1389 | 1390 | 1391 | 1392 | 1393 | 1394 | 1395 | 1396 | 1397 | 1398 | 1399 | 1400 | 1401 | 1402 | 1403 | 1404 | 1405 | 1406 | 1407 | 1408 | 1409 | 1410 | 1411 | 1412 | 1413 | 1414 | 1415 | 1416 | 1417 | 1418 | 1419 | 1420 | 1421 | 1422 | 1423 | 1424 | 1425 | 1426 | 1427 | 1428 | 1429 | 1430 | 1431 | 1432 | 1433 | 1434 | 1435 | 1436 | 1437 | 1438 | 1439 | 1440 | 1441 | 1442 | 1443 | 1444 | 1445 | 1446 | 1447 | 1448 | 1449 | 1450 | 1451 | 1452 | 1453 | 1454 | 1455 | 1456 | 1457 | 1458 | 1459 | 1460 | 1461 | 1462 | 1463 | 1464 | 1465 | 1466 | 1467 | 1468 | 1469 | 1470 | 1471 | 1472 | 1473 | 1474 | 1475 | 1476 | 1477 | 1478 | 1479 | 1480 | 1481 | 1482 | 1483 | 1484 | 1485 | 1486 | 1487 | 1488 | 1489 | 1490 | 1491 | 1492 | 1493 | 1494 | 1495 | 1496 | 1497 | 1498 | 1499 | 1500 | 1501 | 1502 | 1503 | 1504 | 1505 | 1506 | 1507 | 1508 | 1509 | 1510 | 1511 | 1512 | 1513 | 1514 | 1515 | 1516 | 1517 | 1518 | 1519 | 1520 | 1521 | 1522 | 1523 | 1524 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
To turn that into grap fraction
gap_frac = gap_counts.array / aln.num_seqs
Filtering alignments based on gaps#
If we want to remove positions from the alignment which are gaps in more than a certain percentage of the sequences, use the omit_gap_pos()
.
aln = make_aligned_seqs(
data=[
("seq1", "ATGAA---TG-"),
("seq2", "ATG-AGTGATG"),
("seq3", "AT--AG-GATG"),
],
moltype="dna",
)
filtered_aln = aln.omit_gap_pos(allowed_gap_frac=0.40)
filtered_aln
0 | |
seq2 | ATGAGGATG |
seq1 | ....--TG- |
seq3 | ..-...... |
3 x 9 dna alignment
Note
The default for filtered_aln.omit_gap_pos()
is to remove columns with gaps in all the sequences. This can occur after sequences have been removed from the alignment.