Remove duplicated sequences from an alignment#
The omit_duplicated
app removes redundant sequences from a sequence collection (aligned or unaligned).
Let’s create sample data with duplicated sequences.
from cogent3 import make_aligned_seqs
data = {
"a": "ACGT",
"b": "ACG-", # identical to 'a' except has a gap
"c": "ACGG", # duplicate
"d": "ACGG", # duplicate
"e": "AGTC", # unique
}
aln = make_aligned_seqs(data, moltype="dna")
aln
0 | |
a | ACGT |
b | ...- |
c | ...G |
d | ...G |
e | .GTC |
5 x 4 dna alignment
Creating the omit_duplicated
app with the argument choose="longest"
selects the duplicated sequence with the least number of gaps and ambiguous characters. In the above example, only one of c
and d
will be retained.
from cogent3 import get_app
omit_shorter_duplicate = get_app("omit_duplicated", moltype="dna", choose="longest")
omit_shorter_duplicate(aln)
0 | |
a | ACGT |
b | ...- |
d | ...G |
e | .GTC |
4 x 4 dna alignment
Creating the omit_duplicated
app with the argument choose=None
means only unique sequences are retained.
from cogent3 import get_app
omit_all_duplicates = get_app("omit_duplicated", moltype="dna", choose=None)
omit_all_duplicates(aln)
0 | |
a | ACGT |
b | ...- |
e | .GTC |
3 x 4 dna alignment
The mask_degen
argument specifies how to treat matches between sequences with degenerate characters.
Let’s create sample data that has a DNA ambiguity code.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
{
"s1": "ATCG",
"s2": "ATYG", # matches s1 with ambiguity
"s3": "GGTA",
},
moltype="dna",
)
Since “Y” represents pyrimidines where the site can be either “C” or “T”, s1 indeed matches s2 and one of them will be removed.
from cogent3 import get_app
app_dna = get_app("omit_duplicated", mask_degen=True, choose="longest", moltype="dna")
app_dna(aln)
0 | |
s1 | ATCG |
s3 | GGTA |
2 x 4 dna alignment