Loading unaligned sequence data

We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.

Loading unaligned protein sequences from a single fasta file

In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format="fasta").

from cogent3 import get_app

load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
                     0
1091044_fragment     IPLDFDKEFRDKTVVIVAIPGAFTPT
13541053_fragment    KKKNTEVISVSEDTVYVHKAWVQYD
15605725_fragment    FEILAINMDPENLTGFLKNNP

3 x {min=21, median=25, max=26} protein sequence collection
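The summary line reports the number of sequences and their minimum, median and maximum lengths, which differ because the sequences are unaligned. The following pure-Python sketch reproduces those numbers from the three records shown above. It is an illustration only, not how cogent3 parses fasta files; parse_fasta and length_summary are hypothetical helper names.

```python
from statistics import median

# The three records shown above, written out as FASTA text.
FASTA_TEXT = """\
>1091044_fragment
IPLDFDKEFRDKTVVIVAIPGAFTPT
>13541053_fragment
KKKNTEVISVSEDTVYVHKAWVQYD
>15605725_fragment
FEILAINMDPENLTGFLKNNP
"""


def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # the label is the first word after ">"
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # sequence data may span several lines
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def length_summary(records):
    """Return (number of sequences, min, median, max length)."""
    lengths = [len(s) for s in records.values()]
    return len(lengths), min(lengths), median(lengths), max(lengths)


records = parse_fasta(FASTA_TEXT)
num, shortest, middle, longest = length_summary(records)
print(f"{num} x {{min={shortest}, median={middle}, max={longest}}}")
# prints: 3 x {min=21, median=25, max=26}
```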

Loading unaligned DNA sequences from multiple fasta files

To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of our apps of interest.

1. A data store that identifies the files we are interested in

Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory. We limit the data store to two members (limit=2) to keep the example minimal.

from cogent3 import get_app, open_data_store

fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
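Conceptually, the data store selects files in a directory by suffix and stops after limit members. The sketch below mimics that selection with the standard library only, so it runs without cogent3. list_members is a hypothetical helper, and sorting the paths is an assumption made here for reproducibility, not a statement about how open_data_store orders its members.

```python
import tempfile
from pathlib import Path


def list_members(directory, suffix, limit=None):
    """Return sorted paths in directory ending in .suffix, at most limit of them."""
    paths = sorted(Path(directory).glob(f"*.{suffix}"))
    return paths if limit is None else paths[:limit]


with tempfile.TemporaryDirectory() as tmp:
    # three fasta files plus one file that should be ignored
    for name in ("a.fasta", "b.fasta", "c.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">s1\nACGT\n")
    members = [p.name for p in list_members(tmp, suffix="fasta", limit=2)]

print(members)  # ['a.fasta', 'b.fasta']
```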

2. A composed process that defines our workflow

In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
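To make the idea of a kmer-based distance concrete, here is a minimal pure-Python sketch of a Jaccard distance between the kmer sets of two sequences. It is not the jaccard_dist implementation, whose defaults and details may differ; the function names and the choice of k=3 are assumptions for illustration.

```python
def kmers(seq, k):
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq1, seq2, k=3):
    """Return 1 - |A & B| / |A | B| for the kmer sets A and B of two sequences."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return 1 - len(a & b) / len(union)


print(jaccard_distance("ACGTAC", "ACGTAC"))  # identical kmer sets give 0.0
print(jaccard_distance("AAAA", "CCCC"))  # no shared kmers gives 1.0
# {ACG, CGT} vs {ACG, CGA}: 1 shared kmer out of 3, so the distance is 2/3
print(round(jaccard_distance("ACGT", "ACGA"), 3))
```

Because the distance depends only on shared kmers, it does not require the sequences to be aligned, which is why it can follow load_unaligned directly.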

Note

Apps that are “writers” require a data store to write to. See the cogent3 documentation on writers to learn more.

out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")

load_unaligned_app = get_app("load_unaligned", format="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format="tsv")

process = load_unaligned_app + jdist + writer

Tip

When running this code on your machine, remember to replace path_to_dir with an actual directory path.
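The + operator in process = load_unaligned_app + jdist + writer chains the apps so that each one receives the output of the previous one. The sketch below shows one way such composition can work, including stopping at the first step that fails and returning a record of the failure instead of raising. Pipeline, Failure and step are hypothetical names; this is not cogent3's composable machinery.

```python
from dataclasses import dataclass, field


@dataclass
class Failure:
    """Stand-in for a failure record: which step failed and why."""

    origin: str
    message: str


@dataclass
class Pipeline:
    steps: list = field(default_factory=list)

    def __add__(self, other):
        # a + b runs the steps of a, then the steps of b
        return Pipeline(self.steps + other.steps)

    def __call__(self, value):
        for name, func in self.steps:
            try:
                value = func(value)
            except Exception as err:
                # stop at the first failing step and report it
                return Failure(origin=name, message=f"{type(err).__name__}: {err}")
        return value


def step(name, func):
    """Wrap a plain function as a one-step pipeline."""
    return Pipeline([(name, func)])


def check_dna(seq):
    bad = set(seq) - set("ACGTN")
    if bad:
        raise ValueError(f"invalid DNA characters: {sorted(bad)}")
    return seq


process = step("load", check_dna) + step("length", len) + step("double", lambda n: n * 2)

print(process("ACGT"))  # 8
print(process("IPLD"))  # a Failure whose origin is "load"
```

Returning a failure record rather than raising lets a batch run continue over the remaining data store members and leaves an inspectable trace for each member that failed.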

Now we’re good to go! We can apply process to our data store of fasta sequences. result is a data store, which you can index to see individual data members. We can inspect a given data member by calling its .read() method.

result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
{"type": "cogent3.app.composable.NotCompleted", "not_completed_construction": {"args": ["ERROR", "load_unaligned", "Traceback (most recent call last):\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/app/composable.py\", line 401, in _call\n    result = self.main(val, *args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/app/io.py\", line 355, in main\n    seqs = _load_seqs(path, cogent3.make_unaligned_seqs, self._parser, self.moltype)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/app/io.py\", line 290, in _load_seqs\n    return coll_maker(data=data, moltype=moltype, source=unique_id)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/__init__.py\", line 201, in make_unaligned_seqs\n    return _make_seq_container(\n           ^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/__init__.py\", line 144, in _make_seq_container\n    return klass(\n           ^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py\", line 1855, in __init__\n    super().__init__(*args, **kwargs)\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py\", line 444, in __init__\n    seqs, names = conversion_f(\n                  ^^^^^^^^^^^^^\n  File \"/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/functools.py\", line 912, in wrapper\n    return dispatch(args[0].__class__)(*args, **kw)\n           
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py\", line 5585, in _\n    return _coerce_to_unaligned_seqs(\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/functools.py\", line 912, in wrapper\n    return dispatch(args[0].__class__)(*args, **kw)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py\", line 5566, in _coerce_to_unaligned_seqs\n    seq = _construct_unaligned_seq(data[name], name=name, moltype=moltype)\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/functools.py\", line 912, in wrapper\n    return dispatch(args[0].__class__)(*args, **kw)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/alignment.py\", line 5868, in _\n    return moltype.make_seq(seq=data, name=name, preserve_case=False)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/moltype.py\", line 766, in make_seq\n    return self._make_seq(seq=self.coerce_str(seq), name=name, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/sequence.py\", line 839, in __init__\n    self._seq = _coerce_to_seqview(\n                ^^^^^^^^^^^^^^^^^^^\n  File \"/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/functools.py\", line 912, in wrapper\n    return dispatch(args[0].__class__)(*args, **kw)\n           
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/sequence.py\", line 3179, in _\n    checker(data)\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/sequence.py\", line 834, in <lambda>\n    (lambda x: self.moltype.verify_sequence(x, gaps_allowed, wildcards_allowed))\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/home/runner/work/cogent3.github.io/cogent3.github.io/.venv/lib/python3.12/site-packages/cogent3/core/moltype.py\", line 811, in verify_sequence\n    raise AlphabetError(msg)\ncogent3.core.alphabet.AlphabetError: 'I'\n"], "kwargs": {"source": "inseqs_protein.fasta"}}, "version": "2025.3.22a2"}
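The member printed above is not a distance table. It is a serialised NotCompleted record of type ERROR, raised in load_unaligned for the source inseqs_protein.fasta. The final traceback line, AlphabetError: 'I', indicates that the file contains the character I (isoleucine, a valid amino acid code) which is not part of the DNA alphabet requested with moltype="dna". The following sketch pulls those key fields out of such a record using only the json module. The record text here is an abbreviated copy of the output above with the traceback shortened, and summarise_not_completed is a hypothetical helper.

```python
import json

# Abbreviated copy of the record printed above: same keys, traceback shortened.
record_text = json.dumps(
    {
        "type": "cogent3.app.composable.NotCompleted",
        "not_completed_construction": {
            "args": [
                "ERROR",
                "load_unaligned",
                "Traceback (most recent call last):\n  ...\n"
                "cogent3.core.alphabet.AlphabetError: 'I'\n",
            ],
            "kwargs": {"source": "inseqs_protein.fasta"},
        },
        "version": "2025.3.22a2",
    }
)


def summarise_not_completed(text):
    """Return the kind, origin, source and final traceback line of a record."""
    data = json.loads(text)
    construction = data["not_completed_construction"]
    kind, origin, message = construction["args"]
    # the last non-empty traceback line names the exception
    error = message.strip().splitlines()[-1]
    return {
        "kind": kind,
        "origin": origin,
        "source": construction["kwargs"]["source"],
        "error": error,
    }


summary = summarise_not_completed(record_text)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A summary like this makes it quick to see which file failed and at which step, without reading the full traceback.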