Annotation Databases#
This guide shows you how to use cogent3
’s annotation databases, which are in-memory SQLite databases, to store, query and manipulate the Features (also known as annotations) of one or more biological sequences.
What are the different types of AnnotationDb
?#
There are three types of databases that are available: BasicAnnotationDb
, GffAnnotationDb
, and GenbankAnnotationDb
. A BasicAnnotationDb
provides a “user” table where custom features can be added. The latter two, as hinted in their names, are distinguished by the type of data file used to generate them. Both GffAnnotationDb
and GenbankAnnotationDb
have a “user” table for custom features, which allows them to be merged with a BasicAnnotationDb
. However, they cannot be merged with each other!
How to create a standalone BasicAnnotationDb
#
Achieved by creating an BasicAnnotationDb
instance. This is an empty database to which we can add features.
from cogent3.core.annotation_db import BasicAnnotationDb
anno_db = BasicAnnotationDb()
anno_db
BasicAnnotationDb(source=':memory:', total_records=0)
How to load an standalone AnnotationDb
from a data file#
Typically, we want to load a collection of features from a genomic annotation file, such as a GFF or Genbank file. For the following examples, we will use data from the bacterium Mycoplasma genitalium.
Note
See the list of Data Files Used in the Documentation to download the data used in the following examples.
From a GFF file#
To load features from a GFF file, we can use the load_annotations
function and provide the path to the GFF file. The function automatically determines the file type based on the file extension, and it returns a GffAnnotationDb
object.
from cogent3 import load_annotations
gff_db = load_annotations(path="data/mycoplasma-genitalium.gff")
gff_db
GffAnnotationDb(source=':memory:', total_records=1169)
From a Genbank file#
To load features from a Genbank file, we can once again use the load_annotations
function and provide the path to the Genbank file. The function detects the file type based on the file extension, and it returns a GenbankAnnotationDb
object.
from cogent3 import load_annotations
gb_db = load_annotations(path="data/mycoplasma-genitalium.gb")
gb_db
GenbankAnnotationDb(seqid='NC_000908', source=':memory:', namer=None, total_records=1127)
How to generate a summary of an AnnotationDb
#
To generate a summary of an AnnotationDb
, we can access the describe
attribute of the database. This attribute returns a cogent3.util.table.Table
instance that shows the number of records for each seqid, the count for each biotype, and the number of rows in each table (in this example there is a “gff” table with 1,169 rows and an empty “user” table).
summary = gff_db.describe
summary
count | |
---|---|
seqid('NC_000908.2') | 1,169 |
biotype('CDS') | 521 |
biotype('RNase_P_RNA') | 1 |
biotype('SRP_RNA') | 1 |
biotype('exon') | 42 |
biotype('gene') | 546 |
biotype('pseudogene') | 17 |
biotype('rRNA') | 3 |
biotype('region') | 1 |
biotype('tRNA') | 36 |
biotype('tmRNA') | 1 |
num_rows('gff') | 1,169 |
num_rows('user') | 0 |
13 rows x 2 columns
How to add custom features to an AnnotationDb
#
This is achieved via the add_features
method and for all three types of AnnotationDb
it will be added to the “user” table. The method requires information about the feature, such as its biotype, name, genomic location (spans), and the seqid. The seqid is necessary when linking an AnnotationDb
to a Sequence
object, see How to assign an AnnotationDb to a sequence for more information.
We can add a feature to the empty BasicAnnotationDb
we created above. Now the database has one record!
anno_db.add_feature(
seqid="NC_000908",
biotype="gene",
name="interesting_gene",
spans=[(1, 4)],
strand="+",
)
anno_db.describe
count | |
---|---|
seqid('NC_000908') | 1 |
biotype('gene') | 1 |
num_rows('user') | 1 |
3 rows x 2 columns
We can also add a feature to our GffAnnotationDb
or GenbankAnnotationDb
. Below, the previously empty “user” table now has a row count of one, indicating that our feature has been successfully added to the database.
gff_db.add_feature(
seqid="seq1",
biotype="gene",
name="interesting_gene",
spans=[(1, 4)],
strand="+",
)
gff_db.describe[-2:, :] # showing just last two rows
count | |
---|---|
num_rows('gff') | 1,169 |
num_rows('user') | 1 |
2 rows x 2 columns
How to write an AnnotationDb
to disk for efficient re-loading#
In the above examples, all databases indicate that source=":memory:"
, i.e. they are in-memory databases. We can write any database to disk using the write()
method and providing an outpath.
# write to disk
gb_db.write("data/m-genitalium-database.gbdb")
# do something
# re-load from disk
quick_load_gb_db = GenbankAnnotationDb(source="data/m-genitalium-database.gbdb")
Note
The suffix of the outpath (“.gbdb” in the above example) can be arbitrarily chosen, however, this behaviour may change in the future to only accept registered suffixes! 👀
How to query an AnnotationDb
#
Note, there are two methods with the same interface available to query an AnnotationDb
:
get_features_matching()
. A generator that yields all features that matched the query. The minimal information required to create acogent3
Feature
object is provided in the returned dictionary. For more information on Features see Features.get_records_matching()
. A generator that yields all features that matched the query. The complete record for each matching feature is provided in the returned dictionary.
Put simply, a “feature” is a subset of a “record”.
Querying via Feature Name#
To query a database for a feature by its name, provide the name of the feature as an argument to either get_features_matching()
or get_records_matching()
. Since an AnnotationDb
can contain records for more than one sequence, it is best practice to also include the seqid of the sequence of interest.
For example, querying the GenbankAnnotationDb
for the 16s rRNA gene:
mg_16s = list(
gb_db.get_features_matching(
name="MG_RS00775", biotype="gene", seqid="NC_000908"
)
)
mg_16s
[{'seqid': 'NC_000908',
'biotype': 'gene',
'spans': [(np.int64(170011), np.int64(171529))],
'strand': '+',
'name': 'MG_RS00775',
'on_alignment': None}]
Querying via Feature Biotype#
Similarly, get_features_matching()
and get_records_matching()
can be used to query the database for all features that match a given biotype.
For example, querying the GffAnnotationDb
for all pseudogenes:
pseudogenes = list(gff_db.get_features_matching(biotype="pseudogene"))
pseudogenes[:2] # showing just the first two
[{'seqid': 'NC_000908.2',
'biotype': 'pseudogene',
'spans': [(np.int64(85561), np.int64(86589))],
'strand': '+',
'name': 'gene-MG_RS02910',
'on_alignment': None},
{'seqid': 'NC_000908.2',
'biotype': 'pseudogene',
'spans': [(np.int64(86784), np.int64(87549))],
'strand': '+',
'name': 'gene-MG_RS00385',
'on_alignment': None}]
Querying via region of interest#
We can provide start
and end
arguments to get_features_matching()
and get_records_matching()
and all features within the coordinates will be returned.
For example, the adhesin protein of M. genitalium is organised in an operon between positions 220600 to 229079, so we can query for genes in that region to return all operon genes:
operon_cds = list(
gff_db.get_features_matching(start=220600, end=229067, biotype="CDS")
)
operon_cds
[{'seqid': 'NC_000908.2',
'biotype': 'CDS',
'spans': [(np.int64(220606), np.int64(221563))],
'strand': '+',
'name': 'cds-WP_009886005.1',
'on_alignment': None},
{'seqid': 'NC_000908.2',
'biotype': 'CDS',
'spans': [(np.int64(221569), np.int64(225904))],
'strand': '+',
'name': 'cds-WP_010869366.1',
'on_alignment': None},
{'seqid': 'NC_000908.2',
'biotype': 'CDS',
'spans': [(np.int64(225914), np.int64(229067))],
'strand': '+',
'name': 'cds-WP_041593683.1',
'on_alignment': None}]
Querying via the extended attributes field#
A particularly useful functionality of a GffAnnotationDb
is the ability to search the extended attributes field. This allows querying for records that have matches to a specific string provided to the attributes
argument within their extended attributes field.
For example, we can query for all CDS related to replication:
replication_records = list(
gff_db.get_records_matching(attributes="replication", biotype="CDS")
)
replication_records[0] # showing just the first match
{'seqid': 'NC_000908.2',
'source': 'Protein Homology',
'biotype': 'CDS',
'start': 685,
'stop': 1828,
'score': '.',
'strand': '+',
'phase': '0',
'attributes': 'ID=cds-WP_009885562.1;Parent=gene-MG_RS00005;Dbxref=Genbank:WP_009885562.1;Name=WP_009885562.1;Ontology_term=GO:0006260,GO:0003887,GO:0009360;gbkey=CDS;gene=dnaN;go_component=DNA polymerase III complex|0009360||IEA;go_function=DNA-directed DNA polymerase activity|0003887||IEA;go_process=DNA replication|0006260||IEA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_010874358.1;locus_tag=MG_RS00005;product=DNA polymerase III subunit beta;protein_id=WP_009885562.1;transl_table=4',
'comments': None,
'spans': array([[ 685, 1828]]),
'name': 'cds-WP_009885562.1',
'parent_id': 'gene-MG_RS00005'}
Note
Extended attribute querying only works for GFF databases!
How to interrogate an AnnotationDb
#
An AnnotationDb
can be interrogated to explore the properties of a sequence without needing the sequence information.
How many unique genes are in a given genome?#
Mycoplasma genitalium has the smallest bacterial genome, so the number of genes in the loaded database represents the approximate minimal set of genes required for bacterial life! We can see the total number of genes by using the num_matches()
method and specifying the condition we want to be matched is that the biotype is “gene”.
gb_db.num_matches(biotype="gene")
563
The count is 563, however, this may include genes with more than one copy. To determine the number of distinct genes we can use the count_distinct()
method and specify biotype="gene"
and name=True
to indicate we are interested in genes with distinct names.
total_genes = gb_db.count_distinct(biotype="gene", name=True)
single_copy = total_genes[total_genes.columns["count"] == 1, :]
len(single_copy)
561
The count of unique genes is 561. This means that almost every gene is present only once in the genome, very little redundancy here!
Just for fun, let’s try this with the GFF database… (downloaded from the exact same source)
total_genes = gff_db.num_matches(biotype="gene")
print("total genes: ", total_genes)
genes = gff_db.count_distinct(biotype="gene", name=True)
single_copy = genes[genes.columns["count"] == 1, :]
print("single copy genes: ", len(single_copy))
total genes: 547
single copy genes: 547
What? 🤯
How to find the “children” of a Feature#
To find the “children” of a feature, we can use the get_feature_children()
method. A “child” refers to a feature that is nested within or contained by another “parent” feature. For example, a child feature could be an exon contained within a gene or a CDS contained within a transcript.
This method returns a generator that yields all the child features of the specified feature.
For example, let’s find the children of “gene-MG_RS00035”:
children = list(gff_db.get_feature_children(name="gene-MG_RS00035"))
children
[{'seqid': 'NC_000908.2',
'biotype': 'CDS',
'spans': [(np.int64(9155), np.int64(9920))],
'strand': '+',
'name': 'cds-WP_009885556.1',
'on_alignment': None}]
How to find the “parent” of a Feature#
To find the “parent” of a feature, we can use the get_feature_parent()
method, which achieves the inverse of the above method.
For example, we can use the “child” we returned above "cds-WP_009885556.1"
, to find the original parent gene!
parents = list(gff_db.get_feature_parent(name="cds-WP_009885556.1"))
parents
[{'seqid': 'NC_000908.2',
'biotype': 'gene',
'spans': [(np.int64(9155), np.int64(9920))],
'strand': '+',
'name': 'gene-MG_RS00035',
'on_alignment': None}]
How to combine two AnnotationDb
instances#
Checking the compatibility of two AnnotationDb
instances#
Combining data requires compatibility of the databases, this can be checked via the compatible()
method. Below we check whether a GffAnnotationDb
is compatible with a BasicAnnotationDb
.
gff_db.compatible(anno_db)
True
The method evaluates to True
, indicating that the data of the two databases can be merged.
What about merging a GffAnnotationDb
and GenbankAnnotationDb
?
gff_db.compatible(gb_db)
False
The method evaluates to False
. Merging a GffAnnotationDb
and GenbankAnnotationDb
is not possible.
Taking the union of two AnnotationDb
instances#
The union()
method will return a new instance with merged records.
union_db = gb_db.union(anno_db)
union_db.describe[-2:, :]
count | |
---|---|
num_rows('gb') | 1,127 |
num_rows('user') | 1 |
2 rows x 2 columns
In the new merged database, there is now content in both the “user” and “gff” table.
Updating an AnnotationDb
with the record from another database#
The update()
method will update records of a given database with another and return the same instance of the database.
gff_db.update(anno_db)
gff_db.describe[-2:, :]
count | |
---|---|
num_rows('gff') | 1,169 |
num_rows('user') | 2 |
2 rows x 2 columns
Initialise a AnnotationDb
with another database#
You can assign a compatible database to the db
argument in the AnnotationDb
constructor. If it’s the same class, its db will be bound to self and directly modified.
from cogent3.core.annotation_db import GenbankAnnotationDb
new_gb_db = GenbankAnnotationDb(source="m-genitalium-database.gbdb", db=anno_db)
new_gb_db
GenbankAnnotationDb(seqid='<multiple seqids>', source='m-genitalium-database.gbdb', namer=None, total_records=1)
How to get a subset of an AnnotationDb
#
If you want a subset of a db, use the same arguments as you would for db.get_records_matching()
.
from cogent3 import load_annotations
gff_db = load_annotations(path="data/mycoplasma-genitalium.gff")
just_cds = gff_db.subset(biotype="CDS")
just_cds.describe
count | |
---|---|
seqid('NC_000908.2') | 521 |
biotype('CDS') | 521 |
num_rows('gff') | 521 |
num_rows('user') | 0 |
4 rows x 2 columns
Note
The result is an in-memory database by default. To have this written to disk, assign a path to the source argument, e.g. gff_db.subset(source="som/path/subset.gff3db", biotype="CDS")
.
How to assign an AnnotationDb
to a sequence#
For more extensive documentation about annotating alignments and sequences see Features.
Directly assign an AnnotationDb
to a Sequence#
Assign the AnnotationDb to the annotation_db
attribute of a Sequence
from cogent3 import make_seq
seq1 = make_seq(
"AAGAAGAAGACCCCCAAAAAAAAAATTTTTTTTTTAAAAAGGGAACCCT",
name="NC_000908",
moltype="dna",
)
seq1.annotation_db = anno_db
seq1.annotation_db
BasicAnnotationDb(source=':memory:', total_records=1)
Loading an AnnotationDb
and Sequence
using the load_seq()
function#
For a single sequence Genbank file#
Loading a sequence from a Genbank file will automatically create a database instance containing all features present in the file. This database instance will be bound to the Sequence
instance via the .annotation_db
attribute, accessing this attribute displays a representation of the bound annotations.
from cogent3 import load_seq
gb_seq = load_seq("data/mycoplasma-genitalium.gb")
gb_seq.annotation_db
GenbankAnnotationDb(seqid='NC_000908', source=':memory:', namer=None, total_records=1127)
For a single sequence FASTA file and an associated GFF annotation file#
Data can be loaded by providing the path to the gff file to the annotation_path
argument of load_seq()
.
gff_seq = load_seq(
"data/mycoplasma-genitalium.fa",
annotation_path="data/mycoplasma-genitalium.gff",
)
gff_seq.annotation_db
GffAnnotationDb(source=':memory:', total_records=0)
Note
This assumes an exact match of the sequence name between files!
In the above example, the sequence name in the fasta file does not match any records in the gff3 file (it is "NC_000908.2 Mycoplasmoides genitalium G37, complete sequence"
in the former, and "NC_000908.2"
in the latter). However, if you are confident that they are related, then you can use the label_to_name
argument of load_seq()
to change the sequence name as follows:
seq = load_seq(
"data/mycoplasma-genitalium.fa",
annotation_path="data/mycoplasma-genitalium.gff",
label_to_name=lambda x: x.split()[0],
)
seq.annotation_db
GffAnnotationDb(source=':memory:', total_records=1169)