Data stores – collections of data records#
If you download raw.zip and unzip it, you will see it contains 1,035 files ending with a .fa filename suffix. (It also contains a tab delimited file and a log file, which we ignore for now.) The directory raw is a “data store” and the .fa files are “members” of it. In summary, a data store is a collection of members of the same “type”, which means we can apply the same application to every member.
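The idea can be sketched with the standard library alone: a plain directory of same-suffix files is the simplest "data store", and one function can be applied uniformly to every member. The file names and contents below are made up for illustration.

```python
import tempfile
from pathlib import Path

# build a tiny stand-in for the unzipped raw/ directory (hypothetical names)
store = Path(tempfile.mkdtemp())
for name in ("ENSG01.fa", "ENSG02.fa"):
    (store / name).write_text(">Human\nATGGCG\n")

# the members of this "data store" are the files matching the suffix
members = sorted(store.glob("*.fa"))

# because every member has the same type, the same operation applies to all
lengths = {m.name: len(m.read_text()) for m in members}
print(lengths)
```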
How do I use a data store?#
A data store is just a “container”. To open a data store, use the open_data_store() function. To load the data for a member of a data store, you need a loader app suited to the member's data type.
Types of data store#
Class Name | Supported Operations | Supported Data Types | Identifying Suffix |
---|---|---|---|
DataStoreDirectory | read, write, append | text | None |
ReadOnlyDataStoreZipped | read | text | .zip |
DataStoreSqlite | read, write, append | text or bytes | .sqlitedb |
Note
The ReadOnlyDataStoreZipped is just a compressed DataStoreDirectory.
The structure of data stores#
If a directory was not created by cogent3 as a DataStoreDirectory, then it has only the structure that existed previously. If a data store was created by cogent3, either as a directory or as a sqlitedb, then it contains four types of data: completed records, not completed records, log files and md5 files. In a DataStoreDirectory, these are organised using the file system. The completed members are valid data records (as distinct from not completed) and sit at the top level; the remaining types are in subdirectories.
demo_dstore
├── logs
├── md5
├── not_completed
└── ... <the completed members>
logs/ stores scitrack log files produced by cogent3.app writer apps.
md5/ stores plain text files with the md5 sum of a corresponding data member, used to check the integrity of the data store.
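The md5 integrity idea can be illustrated with the standard library: record a checksum alongside each member's data, then recompute it later to detect corruption. This is a sketch of the principle, not cogent3's implementation, and the example data is made up.

```python
import hashlib

data = b">Human\nATGGCGTACCGTG\n"
# what a file under md5/ would conceptually hold for this member
recorded = hashlib.md5(data).hexdigest()

# later, validation recomputes the checksum and compares
intact = hashlib.md5(data).hexdigest() == recorded
tampered = hashlib.md5(data + b"X").hexdigest() == recorded
print(intact, tampered)  # an unchanged member matches; a modified one does not
```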
The DataStoreSqlite stores the same information, just in SQL tables.
Supported operations on a data store#
All data store classes can be iterated over, indexed, and checked for membership. These operations return a DataMember object. In addition to providing access to members, the data store classes have convenience methods for describing their contents, and for summarising included log files and NotCompleted members (see The NotCompleted object).
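The container behaviour described above can be sketched with Python's standard protocols: `__getitem__` for indexing, `__iter__` for looping, and `__contains__` for membership tests. The class names here are illustrative, not the cogent3 implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Member:
    """Stand-in for a DataMember: just an identifier."""
    unique_id: str

class MiniDataStore:
    """Minimal container supporting indexing, iteration and membership."""

    def __init__(self, ids):
        self._members = [Member(i) for i in ids]

    def __getitem__(self, index):
        return self._members[index]

    def __iter__(self):
        return iter(self._members)

    def __contains__(self, name):
        return any(m.unique_id == name for m in self._members)

dstore = MiniDataStore(["ENSG01.fa", "ENSG02.fa"])
print(dstore[0])               # indexing returns a member
print("ENSG02.fa" in dstore)   # membership test by identifier
```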
Opening a data store#
Use the open_data_store() function, illustrated below. Use the mode argument to specify whether to open as read only (mode="r"), write (mode="w"), or append (mode="a").
Opening a read only data store#
We open the zipped directory described above, defining the filenames ending in .fa as the data store members. All files within the directory become members of the data store (unless we use the limit argument).
from cogent3 import open_data_store
dstore = open_data_store("data/raw.zip", suffix="fa", mode="r")
print(dstore)
1035x member ReadOnlyDataStoreZipped(source='/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip', members=[DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip, unique_id=ENSG00000157184.fa), DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip, unique_id=ENSG00000131791.fa)]...)
Summarising the data store#
The .describe property demonstrates that there are only completed members.
dstore.describe
record type | number |
---|---|
completed | 1035 |
not_completed | 0 |
logs | 0 |
3 rows x 2 columns
Data store “members”#
Get one member#
You can index a data store like other Python sequences; in the following case, we get the first member.
m = dstore[0]
m
DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip, unique_id=ENSG00000157184.fa)
Looping over a data store#
This gives you one member at a time.
for m in dstore[:5]:
print(m)
ENSG00000157184.fa
ENSG00000131791.fa
ENSG00000127054.fa
ENSG00000067704.fa
ENSG00000182004.fa
Members can read their own data#
m.read()[:20] # truncating
'>Human\nATGGCGTACCGTG'
Note
For a DataStoreSqlite member, the default data storage format is bytes, so reading the content of an individual record is best done using the load_db app.
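Why raw reads of a sqlite-backed record are less convenient can be shown with the standard library: serialised records come back as bytes, which must be deserialised before use. The table schema and serialisation below are illustrative assumptions, not cogent3's actual storage format.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id TEXT, data BLOB)")

# records are serialised before storage (pickle used here for illustration)
record = {"name": "Human", "seq": "ATGGCG"}
conn.execute(
    "INSERT INTO results VALUES (?, ?)", ("ENSG01", pickle.dumps(record))
)

# a raw read returns the serialised bytes, not usable data
raw = conn.execute("SELECT data FROM results").fetchone()[0]
print(type(raw))                 # bytes
print(pickle.loads(raw)["seq"])  # a deserialising loader recovers the record
```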
Making a writeable data store#
The creation of a writeable data store is specified with mode="w", or (to append) mode="a". In the former case, any existing records are overwritten; in the latter case, existing records are ignored.
DataStoreSqlite stores serialised data#
When you specify a sqlitedb data store as your output (by using open_data_store()), you write multiple records into a single file, making distribution easier.
One important issue to note: the process which creates a sqlitedb “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.
This is represented in the display as shown below.
dstore = open_data_store("data/demo-locked.sqlitedb")
dstore.describe
record type | number |
---|---|
completed | 175 |
not_completed | 0 |
logs | 1 |
3 rows x 2 columns
To unlock, you execute the following:
dstore.unlock(force=True)
Interrogating run logs#
If you use the apply_to()
method, a scitrack logfile will be stored in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.
dstore.summary_logs
time | name | python version | who | command | composable |
---|---|---|---|---|---|
2019-07-24 14:42:56 | logs/load_unaligned-progressive_align-write_db-pid8650.log | 3.7.3 | gavin | /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json | load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json') |
1 rows x 6 columns
Log files can be accessed via a special attribute.
dstore.logs
[DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/demo-locked.sqlitedb, unique_id=logs/load_unaligned-progressive_align-write_db-pid8650.log)]
Each element in that list is a DataMember, which you can use to get the data contents.
print(dstore.logs[0].read()[:225]) # truncated for clarity
2019-07-24 14:42:56 Eratosthenes.local:8650 INFO system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56 Eratosthenes.local:8650 INFO python
Pulling it all together#
We will translate the DNA sequences in raw.zip into amino acids and store them in a sqlitedb data store. We will then interrogate the generated data store to get a synopsis of the results.
Defining the data stores for analysis#
Loading our input data
from cogent3 import open_data_store
in_dstore = open_data_store("data/raw.zip", suffix="fa")
Creating our output DataStoreSqlite
out_dstore = open_data_store("translated.sqlitedb", mode="w")
Create an app and apply it#
We need apps to load the data, translate it, and write the translated sequences out. We define those and compose them into a single app.
from cogent3 import get_app
load = get_app("load_unaligned", moltype="dna")
translate = get_app("translate_seqs")
write = get_app("write_db", data_store=out_dstore)
app = load + translate + write
app
load_unaligned(moltype='dna', format='fasta') + translate_seqs(moltype='dna',
gc=1, allow_rc=False, trim_terminal_stop=True) +
write_db(data_store=DataStoreSqlite(source=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/translated.sqlitedb,
mode=Mode.w, limit=None, verbose=False), id_from_source=<function get_unique_id
at 0x7f0eb09cb380>, serialiser=to_primitive(convertor=<function as_dict at
0x7f0ea97df100>) + pickle_it())
We apply the app to all members of in_dstore. The results will be written to out_dstore.
out_dstore = app.apply_to(in_dstore)
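The `+` composition used above chains apps so that each member's data flows load → translate → write. A minimal pure-Python analogy of that composition pattern (the cogent3 apps add error handling via NotCompleted records, which is omitted here):

```python
class App:
    """Toy composable app: wraps a function and supports + chaining."""

    def __init__(self, func):
        self.func = func

    def __add__(self, other):
        # composed app feeds this app's output into the next app
        return App(lambda data: other.func(self.func(data)))

    def __call__(self, data):
        return self.func(data)

# hypothetical stand-ins for load_unaligned and translate_seqs
load = App(str.upper)
translate = App(lambda s: s[::-1])

app = load + translate
print(app("atg"))  # "GTA": uppercased, then reversed
```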
Inspecting the outcome#
The .describe property gives us an analysis level summary.
out_dstore.describe
record type | number |
---|---|
completed | 1025 |
not_completed | 10 |
logs | 1 |
3 rows x 2 columns
We confirm the data store's integrity.
out_dstore.validate()
Condition | Value |
---|---|
Num md5sum correct | 1035 |
Num md5sum incorrect | 0 |
Num md5sum missing | 0 |
Has log | True |
4 rows x 2 columns
We can examine why some input data could not be processed by looking at the summary of the not completed records.
out_dstore.summary_not_completed
type | origin | message | num | source |
---|---|---|---|---|
ERROR | translate_seqs | "AlphabetError: 'Huma...h not divisible by 3" | 10 | ENSG00000198938.fa, ENSG00000183291.fa, ... |
1 rows x 5 columns
We see they all came from the translate_seqs step. Some had a terminal stop codon while others had a length that was not divisible by 3.
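The "length not divisible by 3" failure can be sketched in plain Python: translation consumes codons of three nucleotides, so a sequence whose length is not a multiple of 3 cannot be cleanly translated. This is a simplified illustration of the check implied by the error messages, not cogent3's actual code.

```python
def check_translatable(seq: str) -> None:
    """Raise if the sequence length is not a whole number of codons."""
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} not divisible by 3")

check_translatable("ATGGCGTAA")      # 9 nt: a whole number of codons, passes
try:
    check_translatable("ATGGCGTA")   # 8 nt: would become a NotCompleted record
except ValueError as err:
    print(err)
```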
Note
The .completed and .not_completed attributes give access to the different types of members, while the .members attribute gives them all. For example,
len(out_dstore.not_completed)
10
is the same as in the describe output, and each element is a DataMember.
out_dstore.not_completed[:2]
[DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/translated.sqlitedb, unique_id=ENSG00000198938),
DataMember(data_store=/home/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/translated.sqlitedb, unique_id=ENSG00000183291)]