Removing degenerate characters
A degenerate IUPAC symbol represents a site that could be any one of several characters. In DNA, for example, “Y” denotes a pyrimidine, meaning the site is either “C” or “T”.
Note
In many molecular evolutionary and phylogenetic analyses, the gap character “-” is treated as “N”, meaning any base.
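To make the idea of a degenerate symbol concrete, here is a minimal plain-Python sketch of the standard IUPAC DNA ambiguity codes. It does not use cogent3; the names IUPAC_DNA, CANONICAL, possible_bases and the gap_as_n flag are hypothetical, introduced only for illustration.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each may stand for.
IUPAC_DNA = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CANONICAL = {"A", "C", "G", "T"}


def possible_bases(symbol, gap_as_n=True):
    """Return the set of bases a single DNA symbol can represent.

    gap_as_n is an illustrative flag: when True, "-" is treated like "N".
    Unknown symbols (and "-" when gap_as_n is False) raise KeyError.
    """
    symbol = symbol.upper()
    if symbol == "-" and gap_as_n:
        symbol = "N"
    if symbol in CANONICAL:
        return {symbol}
    return set(IUPAC_DNA[symbol])


print(sorted(possible_bases("Y")))  # ['C', 'T']
print(sorted(possible_bases("-")))  # ['A', 'C', 'G', 'T']
```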
Let’s create a sample alignment that contains a gap and the degenerate character “Y”.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs({"s1": "ACGA-GACG", "s2": "GATGATGYT"}, moltype="dna")
aln
0 | |
s1 | ACGA-GACG |
s2 | GATGATGYT |
2 x 9 dna alignment
Omit aligned columns containing a degenerate character
from cogent3 import get_app
omit_degens = get_app("omit_degenerates", moltype="dna")
result = omit_degens(aln)
result
0 | |
s1 | ACGAGAG |
s2 | GATGTGT |
2 x 7 dna alignment
Omit all degenerate characters except gaps from an alignment
If we create the app with the argument gap_is_degen=False, we can omit degenerate characters but retain gaps.
from cogent3 import get_app
omit_degens_keep_gaps = get_app("omit_degenerates", moltype="dna", gap_is_degen=False)
result = omit_degens_keep_gaps(aln)
result
0 | |
s2 | GATGATGT |
s1 | ACGA-GAG |
2 x 8 dna alignment
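The column filtering shown in the two examples above can be sketched in plain Python. This is not how cogent3 implements the app; it is a minimal illustration that assumes sequences are equal-length strings held in a dict, and the function name omit_degenerate_columns is hypothetical. Only the gap_is_degen flag mirrors the app's argument.

```python
CANONICAL = set("ACGT")


def omit_degenerate_columns(seqs, gap_is_degen=True):
    """Drop every aligned column that contains a non-canonical character.

    seqs: dict mapping sequence name to an aligned string (equal lengths).
    gap_is_degen: when False, "-" is allowed and only IUPAC ambiguity
    codes cause a column to be dropped.
    """
    allowed = set(CANONICAL)
    if not gap_is_degen:
        allowed.add("-")
    names = list(seqs)
    # zip(*...) walks the alignment column by column.
    columns = zip(*(seqs[name] for name in names))
    kept = [col for col in columns if all(char in allowed for char in col)]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


aln = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(omit_degenerate_columns(aln))
# {'s1': 'ACGAGAG', 's2': 'GATGTGT'}
print(omit_degenerate_columns(aln, gap_is_degen=False))
# {'s1': 'ACGA-GAG', 's2': 'GATGATGT'}
```

With the default setting, both the gap column and the “Y” column are removed, leaving 7 columns. With gap_is_degen=False, only the “Y” column is removed, leaving 8.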
Omit k-mers which contain degenerate characters
If we create omit_degenerates with the argument motif_length, it will split sequences into non-overlapping tuples of the specified length and exclude any tuple that contains a degenerate character.
from cogent3 import get_app
omit_degenerates_app = get_app("omit_degenerates", moltype="dna", motif_length=2)
result = omit_degenerates_app(aln)
result
0 | |
s1 | ACGA |
s2 | GATG |
2 x 4 dna alignment
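The tuple-wise filtering can be sketched in plain Python as well. This is an illustration rather than cogent3's implementation: the function name omit_degenerate_motifs is hypothetical, and the sketch assumes that a trailing incomplete tuple is discarded, which is consistent with the 9-column input yielding 4 columns above.

```python
def omit_degenerate_motifs(seqs, motif_length=2, gap_is_degen=True):
    """Drop non-overlapping tuples (k-mers) containing a non-canonical character.

    seqs: dict mapping sequence name to an aligned string (equal lengths).
    A tuple is removed from every sequence if any sequence has a
    degenerate character within it.
    """
    allowed = set("ACGT")
    if not gap_is_degen:
        allowed.add("-")
    names = list(seqs)
    length = min(len(seq) for seq in seqs.values())
    # Assumption: positions that do not fill a whole tuple are dropped.
    usable = length - length % motif_length
    kept = {name: [] for name in names}
    for start in range(0, usable, motif_length):
        motifs = {name: seqs[name][start:start + motif_length] for name in names}
        if all(char in allowed for motif in motifs.values() for char in motif):
            for name in names:
                kept[name].append(motifs[name])
    return {name: "".join(parts) for name, parts in kept.items()}


aln = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(omit_degenerate_motifs(aln, motif_length=2))
# {'s1': 'ACGA', 's2': 'GATG'}
```

Here the tuples starting at positions 4 and 6 are excluded because they contain the gap and “Y” respectively, and the final unpaired column is not part of any tuple.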