Remove problem sequences from an alignment
Using omit_bad_seqs, we can eliminate sequences from an Alignment based on their gap fraction and/or the number of gaps they uniquely introduce.
Let’s create a sample alignment with some gaps.
from cogent3 import make_aligned_seqs
aln = make_aligned_seqs(
    {
        "s1": "---ACC---TT-",
        "s2": "---ACC---TT-",
        "s3": "---ACC---TT-",
        "s4": "--AACCG-GTT-",
        "s5": "--AACCGGGTTT",
        "s6": "AGAACCGGGTT-",
        "s7": "------------",
    },
    moltype="dna",
)
Removing sequences with more than X% gaps
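Creating the omit_bad_seqs app with the argument gap_fraction=0.5 will omit sequences that contain 50% or more gaps. In the sample alignment this removes s1, s2 and s3 (7 gaps in 12 positions each) and s7 (all gaps).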
from cogent3 import get_app
omit_frac_05 = get_app("omit_bad_seqs", gap_fraction=0.5)
omit_frac_05(aln)
0 | |
s6 | AGAACCGGGTT- |
s4 | --.....-.... |
s5 | --.........T |
3 x 12 dna alignment
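To see why only s4, s5 and s6 survive, the gap fraction of each sequence can be worked out by hand. The sketch below uses plain Python rather than cogent3. The helper gap_fraction and the strict less-than comparison are illustrative assumptions chosen to match the "50% or more" rule described above; they are not the app's actual implementation.

```python
# Plain-Python sketch; gap_fraction is a hypothetical helper, not a cogent3 function.
seqs = {
    "s1": "---ACC---TT-",
    "s2": "---ACC---TT-",
    "s3": "---ACC---TT-",
    "s4": "--AACCG-GTT-",
    "s5": "--AACCGGGTTT",
    "s6": "AGAACCGGGTT-",
    "s7": "------------",
}


def gap_fraction(seq, gap_char="-"):
    """Return the proportion of positions in seq that are gaps."""
    return seq.count(gap_char) / len(seq)


fractions = {name: gap_fraction(seq) for name, seq in seqs.items()}

# Assumption: a sequence is kept only if its gap fraction is below the threshold.
threshold = 0.5
kept = [name for name, frac in fractions.items() if frac < threshold]

for name, frac in fractions.items():
    status = "kept" if name in kept else "omitted"
    print(f"{name}: {frac:.2f} {status}")
```

The sequences s1, s2 and s3 each have a gap fraction of 7/12 and s7 has a gap fraction of 1, so all four meet the threshold and are omitted.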
Removing sequences that contribute many gaps
The quantile=0.8 argument omits sequences that rank above the specified quantile with respect to the number of gaps they uniquely introduce into the alignment. In the following example, sequence s6 is omitted because it is the only sequence with bases in the first two alignment columns, which forces gaps into every other sequence at those positions. The all-gap sequence s7 is also absent from the result.
from cogent3 import get_app
omit_quant_08 = get_app("omit_bad_seqs", quantile=0.8)
omit_quant_08(aln)
0 | |
s5 | --AACCGGGTTT |
s1 | ..-...---..- |
s2 | ..-...---..- |
s3 | ..-...---..- |
s4 | .......-...- |
5 x 12 dna alignment
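One way to reason about this result is to count, for each sequence, the columns in which it alone has a non-gap character, then compare those counts with a quantile cutoff. The sketch below is plain Python and only an approximation of the idea. The helpers unique_gap_columns and linear_quantile are hypothetical, the linear-interpolation quantile is an assumption, and so is the decision to drop the all-gap sequence s7 before counting (suggested by the output above). cogent3's own implementation may differ in detail.

```python
# Plain-Python sketch of "gaps uniquely introduced"; not cogent3's implementation.
# Assumption: the all-gap sequence s7 has already been removed.
seqs = {
    "s1": "---ACC---TT-",
    "s2": "---ACC---TT-",
    "s3": "---ACC---TT-",
    "s4": "--AACCG-GTT-",
    "s5": "--AACCGGGTTT",
    "s6": "AGAACCGGGTT-",
}


def unique_gap_columns(seqs, gap_char="-"):
    """Count, per sequence, the columns where it alone has a non-gap character."""
    counts = dict.fromkeys(seqs, 0)
    for column in zip(*seqs.values()):
        occupied = [name for name, char in zip(seqs, column) if char != gap_char]
        if len(occupied) == 1:
            counts[occupied[0]] += 1
    return counts


def linear_quantile(values, q):
    """Return the q-th quantile of values using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


counts = unique_gap_columns(seqs)
cutoff = linear_quantile(counts.values(), 0.8)

# Assumption: sequences whose count exceeds the cutoff are omitted.
kept = [name for name, count in counts.items() if count <= cutoff]
print(counts)
print(kept)
```

Under these assumptions s6 has the highest count (the two leading columns), s5 has one such column (the final T), and only s6 exceeds the cutoff, which agrees with the five sequences retained above.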