Annotations

Annotations with coordinates

For more extensive documentation about annotations see Advanced sequence handling.

Automated introduction from reading genbank files

We load a sample genbank file with plenty of features and grab the CDS features.

from cogent3.parse.genbank import RichGenbankParser

parser = RichGenbankParser(open("data/ST_genome_part.gb"))
for accession, seq in parser:
    print(accession)
AE006468
cds = seq.get_annotations_matching("CDS")
print(cds)
[CDS "thrL" at [189:255]/10020, CDS "thrA" at [336:2799]/10020, CDS "thrB" at [2800:3730]/10020, CDS "thrC" at [3733:5020]/10020, CDS "yaaA" at [5887:5113]/10020, CDS "yaaJ" at [7396:5965]/10020, CDS "talB" at [7664:8618]/10020, CDS "mog" at [8728:9319]/10020, CDS "yaaH" at [9942:9375]/10020]

Customising annotation construction from reading a genbank file

You can write your own code to construct annotation objects. One reason you might do this is some genbank files do not have a /gene tag on gene related features, instead only possessing a /locus_tag. For illustrating the approach we only create annotations for CDS features. We write a custom callback function that uses the locus_tag as the Feature name.

from cogent3.core.annotation import Feature

def add_annotation(seq, feature, spans):
    type_ = feature["type"]
    if type_ != "CDS":
        return
    name = feature.get("locus_tag", None)
    if name and not isinstance(name, str):
        name = " ".join(name)
    seq.add_annotation(Feature, type_, name, spans)

parser = RichGenbankParser(
    open("data/ST_genome_part.gb"), add_annotation=add_annotation
)
for accession, seq in parser:  # just reading one accession,sequence
    break
genes = seq.get_annotations_matching("CDS")
print(genes)
[CDS "STM0001" at [189:255]/10020, CDS "STM0002" at [336:2799]/10020, CDS "STM0003" at [2800:3730]/10020, CDS "STM0004" at [3733:5020]/10020, CDS "STM0005" at [5887:5113]/10020, CDS "STM0006" at [7396:5965]/10020, CDS "STM0007" at [7664:8618]/10020, CDS "STM0008" at [8728:9319]/10020, CDS "STM0009" at [9942:9375]/10020]

Creating directly on a sequence

from cogent3 import DNA
from cogent3.core.annotation import Feature

s1 = DNA.make_seq(
    "AAGAAGAAGACCCCCAAAAAAAAAA" "TTTTTTTTTTAAAAAGGGAACCCT", name="seq1"
)
print(s1[10:15])  # this will be exon 1
print(s1[30:40])  # this will be exon 2
print(s1[45:48])  # this will be exon 3
s2 = DNA.make_seq("CGAAACGTTT", name="seq2")
s3 = DNA.make_seq("CGAAACGTTT", name="seq3")
CCCCC
TTTTTAAAAA
CCC

Via

add_annotation

from cogent3 import DNA
from cogent3.core.annotation import Feature

s1 = DNA.make_seq(
    "AAGAAGAAGACCCCCAAAAAAAAAA" "TTTTTTTTTTAAAAAGGGAACCCT", name="seq1"
)
exon1 = s1.add_annotation(Feature, "exon", "A", [(10, 15)])
exon2 = s1.add_annotation(Feature, "exon", "B", [(30, 40)])

add_feature

from cogent3 import DNA

s1 = DNA.make_seq(
    "AAGAAGAAGACCCCCAAAAAAAAAA" "TTTTTTTTTTAAAAAGGGAACCCT", name="seq1"
)
exon3 = s1.add_feature("exon", "C", [(45, 48)])

There are other annotation types.

Adding as a series or item-wise

from cogent3 import DNA

s2 = DNA.make_seq("CGAAACGTTT", name="seq2")
cpgs_series = s2.add_feature("cpgsite", "cpg", [(0, 2), (5, 7)])
s3 = DNA.make_seq("CGAAACGTTT", name="seq3")
cpg1 = s3.add_feature("cpgsite", "cpg", [(0, 2)])
cpg2 = s3.add_feature("cpgsite", "cpg", [(5, 7)])

Taking the union of annotations

Construct a pseudo-feature (cds) that’s a union of other features (exon1, exon2, exon3).

from cogent3 import DNA

s1 = DNA.make_seq(
    "AAGAAGAAGACCCCCAAAAAAAAAA" "TTTTTTTTTTAAAAAGGGAACCCT", name="seq1"
)
exon1 = s1.add_feature("exon", "A", [(10, 15)])
exon2 = s1.add_feature("exon", "B", [(30, 40)])
exon3 = s1.add_feature("exon", "C", [(45, 48)])
cds = s1.get_region_covering_all([exon1, exon2, exon3])

Getting annotation coordinates

These are useful for doing custom things, e.g. you could construct intron features using the below.

cds.get_coordinates()
[(10, 15), (30, 40), (45, 48)]

Annotations have shadows

A shadow is a span representing everything but the annotation.

not_cds = cds.get_shadow()
not_cds
region "not exon" at [0:10, 15:30, 40:45, 48:49]/49

Compare to the coordinates of the original.

cds
region "exon" at [10:15, 30:40, 45:48]/49

Adding to a sequence member of an alignment

The following annotation is directly applied onto the sequence and so is in ungapped sequence coordinates.

from cogent3 import make_aligned_seqs

aln1 = make_aligned_seqs(
    data=[["x", "-AAACCCCCA"], ["y", "TTTT--TTTT"]], array_align=False
)
seq_exon = aln1.get_seq("x").add_feature("exon", "A", [(3, 8)])

Adding to an alignment

We add an annotation directly onto an alignment. In this example we add a Variable that can be displayed as a red line on a drawing. The resulting annotation (red_data here) is in alignment coordinates!

from cogent3.core.annotation import Variable

red_data = aln1.add_annotation(
    Variable, "redline", "align", [((0, 15), 1), ((15, 30), 2), ((30, 45), 3)]
)

Slicing sequences and alignments by annotations

By a feature or coordinates returns same sequence span

from cogent3 import DNA

s1 = DNA.make_seq(
    "AAGAAGAAGACCCCCAAAAAAAAAA" "TTTTTTTTTTAAAAAGGGAACCCT", name="seq1"
)
exon1 = s1.add_feature("exon", "A", [(10, 15)])
exon2 = s1.add_feature("exon", "B", [(30, 40)])
s1[exon1]
s1[10:15]
0
seq1CCCCC

5 DnaSequence

Using the annotation object get_slice method returns the same thing.

s1[exon2]
exon2.get_slice()
0
seq1TTTTTAAAAA

10 DnaSequence

Slicing by pseudo-feature or feature series

from cogent3 import DNA

s1 = DNA.make_seq(
    "AAGAAGAAGACCCCCAAAAAAAAAA" "TTTTTTTTTTAAAAAGGGAACCCT", name="seq1"
)
exon1 = s1.add_feature("exon", "A", [(10, 15)])
exon2 = s1.add_feature("exon", "B", [(30, 40)])
exon3 = s1.add_feature("exon", "C", [(45, 48)])
cds = s1.get_region_covering_all([exon1, exon2, exon3])
print(s1[cds])
print(s1[exon1, exon2, exon3])
CCCCCTTTTTAAAAACCC
CCCCCTTTTTAAAAACCC

Warning

Slices are applied in order!

print(s1)
print(s1[exon1, exon2, exon3])
print(s1[exon2])
print(s1[exon3])
print(s1[exon1, exon3, exon2])
AAGAAGAAGACCCCCAAAAAAAAAATTTTTTTTTTAAAAAGGGAACCCT
CCCCCTTTTTAAAAACCC
TTTTTAAAAA
CCC
CCCCCCCCTTTTTAAAAA

Slice series must not be overlapping

s1[1:10, 9:15]
s1[exon1, exon1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-973ffb8e58c5> in <module>
----> 1 s1[1:10, 9:15]
      2 s1[exon1, exon1]

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in __getitem__(self, index)
     89         map = self._as_map(index)
     90         new = self._mapped(map)
---> 91         sliced_annots = self._sliced_annotations(new, map)
     92         new.attach_annotations(sliced_annots)
     93         if hasattr(self, "_repr_policy"):

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in _sliced_annotations(self, new, slice)
     34             slicemap = self._as_map(slice)
     35             # try:
---> 36             newmap = slicemap.inverse()
     37             # except ValueError, detail:
     38             #    print "Annotations dropped because %s" % detail

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/location.py in inverse(self)
    745     def inverse(self):
    746         if self.__inverse is None:
--> 747             self.__inverse = self._inverse()
    748         return self.__inverse
    749 

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/location.py in _inverse(self)
    770                 new_spans.append(LostSpan(lo - last_hi))
    771             elif lo < last_hi:
--> 772                 raise ValueError("Uninvertable. Overlap: %s < %s" % (lo, last_hi))
    773             new_spans.append(Span(start, end, reverse=start > end))
    774             last_hi = hi

ValueError: Uninvertable. Overlap: 9 < 10

But get_region_covering_all resolves this, ensuring no overlaps.

print(s1.get_region_covering_all([exon3, exon3]).get_slice())
CCC

You can slice an annotation itself

print(s1[exon2])
ex2_start = exon2[0:3]
print(s1[ex2_start])
ex2_end = exon2[-3:]
print(s1[ex2_end])
TTTTTAAAAA
TTT
AAA

Sequence vs Alignment slicing

You can’t slice an alignment using an annotation from a sequence.

aln1[seq_exon]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-bbe05c8843eb> in <module>
----> 1 aln1[seq_exon]

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in __getitem__(self, index)
     87 
     88     def __getitem__(self, index):
---> 89         map = self._as_map(index)
     90         new = self._mapped(map)
     91         sliced_annots = self._sliced_annotations(new, map)

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in _as_map(self, index)
     76                 base = base.parent
     77             if base is not self:
---> 78                 raise ValueError(
     79                     "Can't map %s onto %s via %s" % (index, repr(self), containers)
     80                 )

ValueError: Can't map exon "A" at [3:8]/9 onto 2 x 10 text alignment: x[-AAACCCCCA], y[TTTT--TTTT] via []

Copying annotations

You can copy annotations onto sequences with the same name, even if the length differs

aln2 = make_aligned_seqs(
    data=[["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
)
seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCC", "x")
match_exon = seq.add_feature("exon", "A", [(3, 8)])
aln2.get_seq("x").copy_annotations(seq)
copied = list(aln2.get_annotations_from_seq("x", "exon"))
copied
[exon "A" at [4:9]/10]

but if the feature lies outside the sequence being copied to, you get a lost span

aln2 = make_aligned_seqs(data=[["x", "-AAAA"], ["y", "TTTTT"]], array_align=False)
seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCC", "x")
match_exon = seq.add_feature("exon", "A", [(5, 8)])
aln2.get_seq("x").copy_annotations(seq)
copied = list(aln2.get_annotations_from_seq("x", "exon"))
copied
copied[0].get_slice()
0
x----
y....

2 x 4 text alignment

You can copy to a sequence with a different name, in a different alignment if the feature lies within the length

# new test
aln2 = make_aligned_seqs(
    data=[["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
)
seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCC", "x")
match_exon = seq.add_feature("exon", "A", [(5, 8)])
aln2.get_seq("y").copy_annotations(seq)
copied = list(aln2.get_annotations_from_seq("y", "exon"))
copied
[exon "A" at [7:10]/10]

If the sequence is shorter, again you get a lost span.

aln2 = make_aligned_seqs(
    data=[["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
)
diff_len_seq = DNA.make_seq("CCCCCCCCCCCCCCCCCCCCCCCCCCCC", "x")
nonmatch = diff_len_seq.add_feature("repeat", "A", [(12, 14)])
aln2.get_seq("y").copy_annotations(diff_len_seq)
copied = list(aln2.get_annotations_from_seq("y", "repeat"))
copied
[repeat "A" at [10:10, -6-]/10]

Querying

You need to get a corresponding annotation projected into alignment coordinates via a query.

aln_exon = aln1.get_annotations_from_any_seq("exon")
print(aln1[aln_exon])
>x
CCCCC
>y
--TTT

Querying produces objects only valid for their source

cpgsite2 = s2.get_annotations_matching("cpgsite")
print(s2[cpgsite2])
cpgsite3 = s3.get_annotations_matching("cpgsite")
s2[cpgsite3]
CGCG
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-28-91f7e6a5ce69> in <module>
      2 print(s2[cpgsite2])
      3 cpgsite3 = s3.get_annotations_matching("cpgsite")
----> 4 s2[cpgsite3]

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in __getitem__(self, index)
     87 
     88     def __getitem__(self, index):
---> 89         map = self._as_map(index)
     90         new = self._mapped(map)
     91         sliced_annots = self._sliced_annotations(new, map)

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in _as_map(self, index)
     65             spans = []
     66             for i in index:
---> 67                 spans.extend(self._as_map(i).spans)
     68             map = Map(spans=spans, parent_length=len(self))
     69         elif isinstance(index, _Feature):

~/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.8/site-packages/cogent3/core/annotation.py in _as_map(self, index)
     76                 base = base.parent
     77             if base is not self:
---> 78                 raise ValueError(
     79                     "Can't map %s onto %s via %s" % (index, repr(self), containers)
     80                 )

ValueError: Can't map cpgsite "cpg" at [0:2]/10 onto DnaSequence(CGAAACGTTT) via []

Querying for absent annotation

You get back an empty list, and slicing with this returns an empty sequence.

# this test is new
dont_exist = s2.get_annotations_matching("dont_exist")
dont_exist
s2[dont_exist]

0 DnaSequence

Querying features that span gaps in alignments

If you query for a feature from a sequence, it’s alignment coordinates may be discontinuous.

aln3 = make_aligned_seqs(
    data=[["x", "C-CCCAAAAA"], ["y", "-T----TTTT"]], array_align=False
)
exon = aln3.get_seq("x").add_feature("exon", "ex1", [(0, 4)])
print(exon.get_slice())
aln_exons = list(aln3.get_annotations_from_seq("x", "exon"))
print(aln_exons)
print(aln3[aln_exons])
CCCC
[exon "ex1" at [0:1, 2:5]/10]
>x
CCCC
>y
----

Note

The T opposite the gap is missing since this approach only returns positions directly corresponding to the feature.

as_one_span unifies features with discontinuous alignment coordinates

To get positions spanned by a feature, including gaps, use as_one_span.

unified = aln_exons[0].as_one_span()
print(aln3[unified])
>x
C-CCC
>y
-T---

Behaviour of annotations on nucleic acid sequences

Reverse complementing a sequence does not reverse annotations, that is they retain the reference to the frame for which they were defined.

plus = DNA.make_seq("CCCCCAAAAAAAAAATTTTTTTTTTAAAGG")
plus_rpt = plus.add_feature("blah", "a", [(5, 15), (25, 28)])
print(plus[plus_rpt])
minus = plus.rc()
print(minus)
minus_rpt = minus.get_annotations_matching("blah")
print(minus[minus_rpt])
AAAAAAAAAAAAA
CCTTTAAAAAAAAAATTTTTTTTTTGGGGG
AAAAAAAAAAAAA

Masking annotated regions

We mask the CDS regions.

from cogent3.parse.genbank import RichGenbankParser

parser = RichGenbankParser(open("data/ST_genome_part.gb"))
seq = [seq for accession, seq in parser][0]
no_cds = seq.with_masked_annotations("CDS")
print(no_cds[150:400])
CAAGACAGACAAATAAAAATGACAGAGTACACAACATCC??????????????????????????????????????????????????????????????????CGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGAACAGTGCGGGCTTTTTTTTCGACCAGAGATCACGAGGTAACAACC????????????????????????????????????????????????????????????????

The above sequence could then have positions filtered so no position with the ambiguous character ‘?’ was present.

Masking annotated regions on alignments

We mask exon’s on an alignment.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[["x", "C-CCCAAAAAGGGAA"], ["y", "-T----TTTTG-GTT"]],
    moltype="dna",
    array_align=False,
)
exon = aln.get_seq("x").add_feature("exon", "norwegian", [(0, 4)])
print(aln.with_masked_annotations("exon", mask_char="?"))
>x
?-???AAAAAGGGAA
>y
-T----TTTTG-GTT

These also persist through reverse complement operations.

rc = aln.rc()
print(rc)
print(rc.with_masked_annotations("exon", mask_char="?"))
>x
TTCCCTTTTTGGG-G
>y
AAC-CAAAA----A-

>x
TTCCCTTTTT???-?
>y
AAC-CAAAA----A-

You can take mask of the shadow

from cogent3 import DNA

s = DNA.make_seq("CCCCAAAAAGGGAA", "x")
exon = s.add_feature("exon", "norwegian", [(0, 4)])
rpt = s.add_feature("repeat", "norwegian", [(9, 12)])
rc = s.rc()
print(s.with_masked_annotations("exon", shadow=True))
print(rc.with_masked_annotations("exon", shadow=True))
print(s.with_masked_annotations(["exon", "repeat"], shadow=True))
print(rc.with_masked_annotations(["exon", "repeat"], shadow=True))
CCCC??????????
??????????GGGG
CCCC?????GGG??
??CCC?????GGGG

What features of a certain type are available?

from cogent3 import DNA

s = DNA.make_seq("ATGACCCTGTAAAAAATGTGTTAACCC", name="a")
cds1 = s.add_feature("cds", "cds1", [(0, 12)])
cds2 = s.add_feature("cds", "cds2", [(15, 24)])
all_cds = s.get_annotations_matching("cds")
all_cds
[cds "cds1" at [0:12]/27, cds "cds2" at [15:24]/27]

Getting all features of a type, or everything but that type

The annotation methods get_region_covering_all and get_shadow can be used to grab all the coding sequences or non-coding sequences in a DnaSequence object.

from cogent3.parse.genbank import RichGenbankParser

parser = RichGenbankParser(open("data/ST_genome_part.gb"))
seq = [seq for accession, seq in parser][0]
all_cds = seq.get_annotations_matching("CDS")
coding_seqs = seq.get_region_covering_all(all_cds)
coding_seqs
coding_seqs.get_slice()
noncoding_seqs = coding_seqs.get_shadow()
noncoding_seqs
noncoding_seqs.get_slice()
0
AE006468AGAGATTACGTCTGGTTGCAAGAGATCATGACAGGGGGAATTGGTTGAAAATAAATATAT

957 (truncated to 60) DnaSequence

Getting sequence features when you have an alignment object

Sequence features can be accessed via a containing Alignment.

from cogent3 import make_aligned_seqs

aln = make_aligned_seqs(
    data=[["x", "-AAAAAAAAA"], ["y", "TTTT--TTTT"]], array_align=False
)
print(aln)
exon = aln.get_seq("x").add_feature("exon", "1", [(3, 8)])
aln_exons = aln.get_annotations_from_seq("x", "exon")
aln_exons = aln.get_annotations_from_any_seq("exon")
aln_exons
>x
-AAAAAAAAA
>y
TTTT--TTTT

[exon "1" at [4:9]/10]

Annotation display on sequences

We can display annotations on sequences, writing to file.

We first make a sequence and add some annotations.

from cogent3 import DNA

seq = DNA.make_seq("aaaccggttt" * 10)
v = seq.add_feature("exon", "exon", [(20, 35)])
v = seq.add_feature("repeat_unit", "repeat_unit", [(39, 49)])
v = seq.add_feature("repeat_unit", "rep2", [(49, 60)])