Loading unaligned sequence data#
Note
These docs now use the new_type core objects via the following setting.
import os
# using new types without requiring an explicit argument
os.environ["COGENT3_NEW_TYPE"] = "1"
We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.
Loading unaligned sequences from a single fasta file#
In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format="fasta").
from cogent3 import get_app
load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
0 | |
1091044_fragment | IPLDFDKEFRDKTVVIVAIPGAFTPT |
13541053_fragment | KKKNTEVISVSEDTVYVHKAWVQYD |
15605725_fragment | FEILAINMDPENLTGFLKNNP |
3 x {min=21, median=25.0, max=26} protein sequence collection
Loading unaligned DNA sequences from multiple fasta files#
To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of our apps of interest.
1. A data store that identifies the files we are interested in#
Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory, limiting the data store to two members as a minimal example.
from cogent3 import get_app, open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
2. A composed process that defines our workflow#
In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
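To make the idea of a kmer-based distance concrete, here is a minimal sketch in plain Python of a Jaccard distance computed on the sets of k-mers from two sequences. It is not the implementation behind jaccard_dist; the function names `kmers` and `jaccard_distance`, the default `k=3`, and the toy sequences are all assumptions made for illustration.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq1: str, seq2: str, k: int = 3) -> float:
    """Jaccard distance between the k-mer sets of two sequences."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    union = a | b
    if not union:
        # both sequences are shorter than k, so there is nothing to compare
        return 0.0
    return 1.0 - len(a & b) / len(union)


# toy sequences, not taken from the data files used in this document
print(jaccard_distance("ACGTACGT", "ACGTACGT"))  # identical k-mer sets -> 0.0
print(jaccard_distance("AAAAAA", "CCCCCC"))  # no shared k-mers -> 1.0
```

The distance is 0 when two sequences share every k-mer and 1 when they share none, so it needs no alignment, which is why this kind of measure suits unaligned collections.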
Note
Apps that are “writers” require a data store to write to. Learn more in the documentation on writers.
# path_to_dir is a placeholder for the output directory
out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")
load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format="tsv")
# chain the apps into a single process
process = load_unaligned_app + jdist + writer
Tip
When running this code on your machine, remember to replace path_to_dir with an actual directory path.
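As a rough illustration of how the + operator can chain steps into one process, and how a failure in an early step can be carried through to the end instead of raising, here is a toy sketch in plain Python. The `Step` and `Failed` classes and the three example steps are hypothetical; this is not how cogent3 implements its apps.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Failed:
    """Hypothetical stand-in for a record describing a failed step."""

    origin: str
    message: str


class Step:
    """Hypothetical composable step; not cogent3's app class."""

    def __init__(self, name: str, func: Callable[[Any], Any]) -> None:
        self.name = name
        self.func = func
        self.upstream: Optional["Step"] = None

    def __add__(self, other: "Step") -> "Step":
        # a + b returns b, configured to take its input from a
        other.upstream = self
        return other

    def __call__(self, value: Any) -> Any:
        if self.upstream is not None:
            value = self.upstream(value)
        if isinstance(value, Failed):
            # an earlier step failed, so pass the record along untouched
            return value
        try:
            return self.func(value)
        except Exception as err:
            return Failed(origin=self.name, message=str(err))


load = Step("load", lambda text: text.split())
count = Step("count", len)
report = Step("report", lambda n: f"{n} records")
process = load + count + report

print(process("seq1 seq2 seq3"))  # every step succeeds
print(process(None))  # the first step fails and the failure is returned
```

In this sketch the second call returns a `Failed` record whose `origin` is "load", so the caller can see which step went wrong without the whole run stopping.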
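Now we’re good to go! We can apply process to our data store of fasta sequences. result is a data store, which you can index to see individual data members. We can inspect a given data member by calling its .read() method. If a step failed for a given input, the data member holds a NotCompleted record describing the failure instead of a result, as in the output below.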
result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
{"type": "cogent3.app.composable.NotCompleted", "not_completed_construction": {"args": ["ERROR", "load_unaligned", "Traceback (most recent call last):\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/app/composable.py\", line 406, in _call\n result = self.main(val, *args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/app/io.py\", line 368, in main\n seqs = _load_seqs(path, cogent3.make_unaligned_seqs, self._parser, self.moltype)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/app/io.py\", line 299, in _load_seqs\n return coll_maker(data=data, moltype=moltype, source=unique_id)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py\", line 6028, in make_unaligned_seqs\n return new_alignment.make_unaligned_seqs(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/functools.py\", line 912, in wrapper\n return dispatch(args[0].__class__)(*args, **kw)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/new_alignment.py\", line 2869, in make_unaligned_seqs\n seqs_data = make_unaligned_storage(\n ^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/new_alignment.py\", line 2798, in make_unaligned_storage\n return klass.from_seqs(**sd_kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/new_alignment.py\", line 463, in from_seqs\n 
return cls(data=data, alphabet=alphabet, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/new_alignment.py\", line 426, in __init__\n raise new_alphabet.AlphabetError(\ncogent3.core.new_alphabet.AlphabetError: One or more sequences are invalid for alphabet ('T', 'C', 'A', 'G', '-', 'N', 'R', 'Y', 'W', 'S', 'K', 'M', 'B', 'D', 'H', 'V', '?')\n"], "kwargs": {"source": "primate_rodent.fasta"}}, "version": "2025.5.8a9"}
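The AlphabetError above reports that at least one sequence contains characters outside the listed DNA alphabet. One way to find the offending characters before loading is a simple set difference, sketched below in plain Python. The helper name `invalid_chars` is an assumption for illustration, the check is case-sensitive as written, and it is not part of cogent3.

```python
# characters listed in the AlphabetError message above
DNA_CHARS = frozenset("TCAG-NRYWSKMBDHV?")


def invalid_chars(seq: str, allowed: frozenset = DNA_CHARS) -> set:
    """Return the characters in seq that are not in the allowed set."""
    return set(seq) - allowed


# a sequence using only characters from the alphabet passes the check
print(sorted(invalid_chars("ACGTN-")))  # -> []

# a protein fragment from the first example contains non-DNA characters
print(sorted(invalid_chars("FEILAINMDPENLTGFLKNNP")))  # -> ['E', 'F', 'I', 'L', 'P']
```

If a file fails this kind of check for the DNA alphabet, the moltype passed to load_unaligned may not match the file's contents.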