Loading unaligned sequence data#
We can load unaligned sequence data using the load_unaligned app, this will return a SequenceCollection.
Loading unaligned DNA sequences from a single fasta file#
In this example, we load unaligned DNA sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format_name="fasta").
from cogent3 import get_app
load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
| 0 | |
| 1091044_fragment | IPLDFDKEFRDKTVVIVAIPGAFTPT |
| 13541053_fragment | KKKNTEVISVSEDTVYVHKAWVQYD |
| 15605725_fragment | FEILAINMDPENLTGFLKNNP |
3 x {min=21, median=25.0, max=26} protein sequence collection
Loading unaligned DNA sequences from multiple fasta files#
To load unaligned DNA sequences from multiple fasta files, we need two things, a data store that identifies the files we are interested in and a process composed of our apps of interest.
1. A data store that identifies the files we are interested in#
Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory, limiting the data store to two members as a minimum example.
from cogent3 import get_app, open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
2. A composed process that defines our workflow#
In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
Note
Apps that are “writers” require a data store to write to, learn more about writers here!.
out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")
load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format_name="tsv")
process = load_unaligned_app + jdist + writer
Tip
When running this code on your machine, remember to replace path_to_dir with an actual directory path.
Now we’re good to go! We can apply process to our data store of fasta sequences. result is a data store, which you can index to see individual data members. We can inspect a given data member using the .read() on data members.
result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
dim-1 dim-2 value
Human RoundEare 0.9375736160188457
Human Wombat 0.9464285714285714
RoundEare Human 0.9375736160188457
RoundEare Wombat 0.9840909090909091
Wombat Human 0.9464285714285714
Wombat RoundEare 0.9840909090909091