Specifying data for analysis

We introduce the concept of a “data store”. This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files or a single tinydb file containing multiple data records.

We represent this concept with a DataStore class, which comes in several flavours.

These can be read-only or writable. All of them support indexing, iteration, filtering, etc. The tinydb variants have some unique abilities (discussed below).

A read-only data store

To create one of these, you provide a path AND a suffix of the files within the directory / zip that you will be analysing. (If the path ends with .tinydb, no file suffix is required.)

from cogent3.app.io import get_data_store

dstore = get_data_store("data/raw.zip", suffix="fa*", limit=5)
dstore
5x member ReadOnlyZippedDataStore(source='/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw.zip', members=['/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000157184.fa', '/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000131791.fa', '/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000127054.fa'...)

Data store “members”

These are able to read their own raw data.

m = dstore[0]
m
'/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000157184.fa'
m.read()[:20]  # truncating
'>Human\nATGGTGCCCCGCC'

Showing the last few members

Use the tail() method to see the last few members (head() shows the first few).

dstore.tail()
['/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000157184.fa',
 '/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000131791.fa',
 '/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000127054.fa',
 '/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000067704.fa',
 '/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000182004.fa']

Filtering a data store for specific members

dstore.filtered("*ENSG00000067704*")
['/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000067704.fa']
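The pattern passed to filtered() is glob-style (note the * wildcards). The matching can be sketched with the standard library's fnmatch; the shortened member names here are illustrative:

```python
from fnmatch import fnmatch

# shortened, illustrative member names; filtered() applies a glob
# pattern across the store's members in this spirit
members = [
    "data/raw/ENSG00000157184.fa",
    "data/raw/ENSG00000131791.fa",
    "data/raw/ENSG00000067704.fa",
]
matches = [m for m in members if fnmatch(m, "*ENSG00000067704*")]
# matches == ["data/raw/ENSG00000067704.fa"]
```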

Looping over a data store

for m in dstore:
    print(m)
/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000157184.fa
/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000131791.fa
/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000127054.fa
/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000067704.fa
/Users/runner/work/cogent3.github.io/cogent3.github.io/c3org/doc/doc/data/raw/ENSG00000182004.fa

Making a writable data store

The creation of a writable data store is handled for you by the different writers we provide under cogent3.app.io.
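Conceptually, a writer owns a writable data store and saves each record under the store's path with a configured suffix. A minimal stdlib sketch of that idea (write_member and its parameters are illustrative, not cogent3's API):

```python
from pathlib import Path

def write_member(store: Path, name: str, data: str, suffix: str = "fa") -> Path:
    """sketch: save one record into a directory-backed data store"""
    store.mkdir(parents=True, exist_ok=True)  # the writer creates the store
    out = store / f"{name}.{suffix}"
    out.write_text(data)
    return out
```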

Warning

The WritableZippedDataStore is deprecated.

TinyDB data stores are special

When you specify a TinyDB data store as your output (by using io.write_db()), you get additional features that are useful for dissecting the results of an analysis.

One important issue to note is that the process which creates a TinyDB “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. While the db is in this state, cogent3 will not modify it unless you explicitly unlock it.

This is represented in the display as shown below.

dstore = get_data_store("data/demo-locked.tinydb")
dstore.describe
Unlocked db store.
record type    number
---------------------
completed         175
incomplete          0
logs                1

3 rows x 2 columns

To unlock, you execute the following:

dstore.unlock(force=True)
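The lock can be thought of as the creating process's identity recorded in the db; unlock(force=True) overrides it. A hedged sketch of that decision (the function and its logic are illustrative, not cogent3 internals):

```python
import os

def can_modify(recorded_pid, force=False):
    """sketch: a db is locked when it records a pid other than ours"""
    if recorded_pid is None or recorded_pid == os.getpid():
        return True  # not locked, or locked by this very process
    return force  # locked by another (possibly dead) process; force overrides
```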

Interrogating run logs

If you use the apply_to(logger=True) method, a scitrack logfile will be included in the data store. This records useful information about the run conditions that produced the data store's contents.

dstore.summary_logs
summary of log files

time           : 2019-07-24 14:42:56
name           : load_unaligned-progressive_align-write_db-pid8650.log
python version : 3.7.3
who            : gavin
command        : /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json
composable     : load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')

1 rows x 6 columns

Log files can be accessed via a special attribute.

dstore.logs
['load_unaligned-progressive_align-write_db-pid8650.log']

Each element in that list is a DataStoreMember which you can use to get the data contents.

print(dstore.logs[0].read()[:225])  # truncated for clarity
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	python
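As the output shows, each scitrack log line is tab-delimited, so splitting on tabs recovers the fields. The field names below are my own labels, not scitrack's:

```python
# a line in the shape of the log output above; fields are tab-separated
line = (
    "2019-07-24 14:42:56\tEratosthenes.local:8650\tINFO\t"
    "system_details : system=Darwin Kernel Version 18.6.0"
)
timestamp, origin, level, message = line.split("\t")
host, pid = origin.rsplit(":", 1)  # origin combines hostname and pid
```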