Data stores – collections of data records#

If you download raw.zip and unzip it, you will see it contains 1,035 files ending with a .fa filename suffix. (It also contains a tab delimited file and a log file, which we ignore for now.) The directory raw is a “data store” and the .fa files are “members” of it. In summary, a data store is a collection of members of the same “type”. This means we can apply the same application to every member.

How do I use a data store?#

A data store is just a “container”. To open a data store, use the open_data_store() function. To load the data for a member of a data store, you need a loader app appropriate for that member's data type.

Types of data store#

Class Name               Supported Operations   Supported Data Types  Identifying Suffix
DataStoreDirectory       read / write / append  text                  None
ReadOnlyDataStoreZipped  read                   text                  .zip
DataStoreSqlite          read / write / append  text or bytes         .sqlitedb

Note

The ReadOnlyDataStoreZipped is just a compressed DataStoreDirectory.

The structure of data stores#

If a directory was not created by cogent3 as a DataStoreDirectory then it has only the structure that existed previously.

If a data store was created by cogent3, either as a directory or as a sqlitedb, then it contains four types of data: completed records, not completed records, log files and md5 files. In a DataStoreDirectory, these are organised using the file system. The completed members are valid data records (as distinct from not completed) and are at the top level. The remaining types are in subdirectories.

demo_dstore
├── logs
├── md5
├── not_completed
└── ... <the completed members>

logs/ stores scitrack log files produced by cogent3.app writer apps. md5/ stores plain text files, each holding the md5 sum of the corresponding data member. These are used to check the integrity of the data store.

The DataStoreSqlite stores the same information, just in SQL tables.
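The kind of integrity check that the md5/ files make possible can be sketched with the standard library alone. This is not cogent3's implementation: the md5/<stem>.txt naming, the helper names and the sample records are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex md5 digest of a file's bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def check_member(store_dir, member_name):
    """Compare a member's md5 with the value recorded under md5/.

    Assumes (hypothetically) that the checksum lives at md5/<stem>.txt.
    Returns "correct", "incorrect" or "missing".
    """
    store_dir = Path(store_dir)
    md5_file = store_dir / "md5" / f"{Path(member_name).stem}.txt"
    if not md5_file.exists():
        return "missing"
    recorded = md5_file.read_text().strip()
    return "correct" if recorded == md5_of(store_dir / member_name) else "incorrect"


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny, made-up directory store with one member and its checksum.
    store = Path(tmp) / "demo_dstore"
    (store / "md5").mkdir(parents=True)
    member = store / "seq_a.fa"
    member.write_text(">Human\nATGGCG\n")
    (store / "md5" / "seq_a.txt").write_text(md5_of(member))

    status_ok = check_member(store, "seq_a.fa")
    member.write_text(">Human\nATGGCC\n")  # simulate a change after writing
    status_changed = check_member(store, "seq_a.fa")
    status_missing = check_member(store, "seq_b.fa")  # no checksum recorded

print(status_ok, status_changed, status_missing)
```

The three statuses mirror the correct, incorrect and missing counts that a data store validation reports.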

Supported operations on a data store#

All data store classes can be iterated over, indexed and checked for membership. Iteration and indexing return DataMember objects. In addition to providing access to members, the data store classes have convenience methods for describing their contents and for summarising the included log files and the NotCompleted members (see The NotCompleted object).
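To make these three operations concrete, here is a toy container written with plain Python. ToyStore and ToyMember are hypothetical names invented for this sketch. They only imitate the behaviour described above and are not the cogent3 classes.

```python
class ToyMember:
    """Hypothetical stand-in for a data member, identified by a unique_id."""

    def __init__(self, unique_id, data):
        self.unique_id = unique_id
        self._data = data

    def read(self):
        """Return the data held by this member."""
        return self._data

    def __repr__(self):
        return f"ToyMember(unique_id={self.unique_id})"


class ToyStore:
    """Hypothetical container supporting iteration, indexing and membership."""

    def __init__(self, records):
        self._members = [ToyMember(k, v) for k, v in records.items()]

    def __len__(self):
        return len(self._members)

    def __iter__(self):
        return iter(self._members)

    def __getitem__(self, index):
        return self._members[index]

    def __contains__(self, unique_id):
        return any(m.unique_id == unique_id for m in self._members)


# Made-up records, keyed by identifier.
store = ToyStore({"seq_a.fa": ">Human\nATG", "seq_b.fa": ">Mouse\nATG"})

first = store[0]                      # indexing returns a member
ids = [m.unique_id for m in store]    # iteration yields members
has_a = "seq_a.fa" in store           # membership returns a bool
has_c = "seq_c.fa" in store
print(first, ids, has_a, has_c)
```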

Opening a data store#

Use the open_data_store() function, illustrated below. Use the mode argument to specify whether to open the data store as read only (mode="r"), write (mode="w") or append (mode="a").

Opening a read only data store#

We open the zipped directory described above, defining the files whose names end in .fa as the data store members. All files with that suffix become members of the data store (unless we use the limit argument).

from cogent3 import open_data_store

dstore = open_data_store("data/raw.zip", suffix="fa", mode="r")
print(dstore)
1035x member ReadOnlyDataStoreZipped(source='/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip', members=[DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip, unique_id=ENSG00000157184.fa), DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip, unique_id=ENSG00000131791.fa)]...)
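The effect of choosing members by suffix can be illustrated without cogent3. The sketch below builds a small in-memory archive with a layout like the one described for raw.zip and selects the entries ending in .fa. The file names, contents and the members_with_suffix helper are made up for illustration, and this is not how cogent3 itself reads the archive.

```python
import io
import zipfile

# Build a small in-memory archive: sequence files plus a table and a log file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("raw/seq_a.fa", ">Human\nATGGCG\n")
    archive.writestr("raw/seq_b.fa", ">Mouse\nATGGCC\n")
    archive.writestr("raw/table.tsv", "name\tlength\n")
    archive.writestr("raw/run.log", "log text\n")


def members_with_suffix(zip_bytes, suffix):
    """Return the sorted archive entry names that end with '.<suffix>'."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(f".{suffix}"))


fa_members = members_with_suffix(buffer.getvalue(), "fa")
print(fa_members)
```

Only the two .fa entries are selected. The table and log file are left out, just as the tab delimited file and log file in raw.zip are not members of the data store.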

Summarising the data store#

The .describe property demonstrates that there are only completed members.

dstore.describe
Directory datastore
record type    number
completed      1035
not_completed  0
logs           0

3 rows x 2 columns

Data store “members”#

Get one member#

You can index a data store like other Python sequences. In the following case, we get the first member.

m = dstore[0]
m
DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip, unique_id=ENSG00000157184.fa)

Looping over a data store#

This gives you one member at a time.

for m in dstore[:5]:
    print(m)
ENSG00000157184.fa
ENSG00000131791.fa
ENSG00000127054.fa
ENSG00000067704.fa
ENSG00000182004.fa

Members can read their own data#

m.read()[:20] # truncating
'>Human\nATGGCGTACCGTG'

Note

For a DataStoreSqlite member, the default data storage format is bytes. So reading the content of an individual record is best done using the load_db app.

Making a writeable data store#

The creation of a writeable data store is specified with mode="w", or (to append) mode="a". In the former case, any existing records are overwritten. In the latter case, existing records are ignored.

DataStoreSqlite stores serialised data#

When you specify a Sqlitedb data store as your output (by using open_data_store()) you write multiple records into a single file making distribution easier.

One important issue to note is that the process which creates a Sqlitedb “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted) then the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.

This is represented in the display as shown below.

dstore = open_data_store("data/demo-locked.sqlitedb")
dstore.describe
Unlocked db store.
record type    number
completed      175
not_completed  0
logs           1

3 rows x 2 columns

To unlock, you execute the following:

dstore.unlock(force=True)

Interrogating run logs#

If you use the apply_to() method, a scitrack logfile will be stored in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.

dstore.summary_logs
summary of log files
time            2019-07-24 14:42:56
name            logs/load_unaligned-progressive_align-write_db-pid8650.log
python version  3.7.3
who             gavin
command         /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json
composable      load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')

1 rows x 6 columns

Log files can be accessed via a special attribute.

dstore.logs
[DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/demo-locked.sqlitedb, unique_id=logs/load_unaligned-progressive_align-write_db-pid8650.log)]

Each element in that list is a DataMember which you can use to get the data contents.

print(dstore.logs[0].read()[:225]) # truncated for clarity
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	python
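Once the log text has been read, individual lines can be picked apart with ordinary string handling. The sketch below assumes each line has four tab separated fields (timestamp, host:pid, level, message) and that the message has the form "key : value". That layout is inferred from the excerpt above rather than from a scitrack specification. The parse_log_line helper and its field names are hypothetical, and the sample line is a shortened copy of the first line shown.

```python
from datetime import datetime


def parse_log_line(line):
    """Split one log line into its timestamp, host, pid, level, key and value.

    Assumes four tab separated fields and a 'key : value' message.
    """
    stamp, origin, level, message = line.rstrip("\n").split("\t", 3)
    host, _, pid = origin.rpartition(":")
    key, _, value = message.partition(" : ")
    return {
        "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        "host": host,
        "pid": int(pid),
        "level": level,
        "key": key,
        "value": value,
    }


# Shortened copy of the first log line from the excerpt.
line = (
    "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\t"
    "system_details : system=Darwin Kernel Version 18.6.0"
)
record = parse_log_line(line)
print(record["host"], record["pid"], record["key"])
```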

Pulling it all together#

We will translate the DNA sequences in raw.zip into amino acid sequences and store them in a sqlite database. We will then interrogate the generated data store to get a synopsis of the results.

Defining the data stores for analysis#

Loading our input data

from cogent3 import open_data_store

in_dstore = open_data_store("data/raw.zip", suffix="fa")

Creating our output DataStoreSqlite

out_dstore = open_data_store("translated.sqlitedb", mode="w")

Create an app and apply it#

We need apps to load the data, translate it and then to write the translated sequences out. We define those and compose into a single app.

from cogent3 import get_app

load = get_app("load_unaligned", moltype="dna")
translate = get_app("translate_seqs")
write = get_app("write_db", data_store=out_dstore)
app = load + translate + write
app
load_unaligned(moltype='dna', format='fasta') + translate_seqs(moltype='dna',
gc=1, allow_rc=False, trim_terminal_stop=True) +
write_db(data_store=DataStoreSqlite(source=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/translated.sqlitedb,
mode=Mode.w, limit=None, verbose=False), id_from_source=<function get_unique_id
at 0x7f0eb09cb380>, serialiser=to_primitive(convertor=<function as_dict at
0x7f0ea97df100>) + pickle_it())

We apply the app to all members of in_dstore. The results will be written to out_dstore.

out_dstore = app.apply_to(in_dstore)

Inspecting the outcome#

The .describe property gives us an analysis level summary.

out_dstore.describe
Locked to the current process.
record type    number
completed      1025
not_completed  10
logs           1

3 rows x 2 columns

We confirm the data store integrity

out_dstore.validate()
validate status
Condition             Value
Num md5sum correct    1035
Num md5sum incorrect  0
Num md5sum missing    0
Has log               True

4 rows x 2 columns

We can examine why some input data could not be processed by looking at the summary of the not completed records.

out_dstore.summary_not_completed
not completed records
type   origin          message                                        num  source
ERROR  translate_seqs  "AlphabetError: 'Huma...h not divisible by 3"  10   ENSG00000198938.fa, ENSG00000183291.fa, ...

1 rows x 5 columns

We see they all came from the translate_seqs step. Some had a terminal stop codon while others had a length that was not divisible by 3.
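The two conditions mentioned here are easy to screen for before running an analysis. The following is a minimal sketch using plain strings. The screen function and the example sequences are made up for illustration, and the checks are not the logic used inside translate_seqs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code


def screen(seq):
    """Return a list of reasons why seq may fail to translate cleanly."""
    problems = []
    if len(seq) % 3:
        problems.append("length not divisible by 3")
    elif seq[-3:].upper() in STOP_CODONS:
        problems.append("terminal stop codon")
    return problems


# Made-up sequences, not taken from raw.zip.
clean = screen("ATGGCGTAC")
truncated = screen("ATGGCGTA")
with_stop = screen("ATGGCGTAA")
print(clean, truncated, with_stop)
```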

Note

The .completed and .not_completed attributes give access to the different types of members while the .members attribute gives them all. For example,

len(out_dstore.not_completed)
10

is the same as in the describe output and each element is a DataMember.

out_dstore.not_completed[:2]
[DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/translated.sqlitedb, unique_id=ENSG00000198938),
 DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/translated.sqlitedb, unique_id=ENSG00000183291)]