Loading unaligned sequence data
We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.
Loading unaligned protein sequences from a single fasta file
In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format_name="fasta").
from cogent3 import get_app
load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
0 | |
1091044_fragment | IPLDFDKEFRDKTVVIVAIPGAFTPT |
13541053_fragment | KKKNTEVISVSEDTVYVHKAWVQYD |
15605725_fragment | FEILAINMDPENLTGFLKNNP |
3 x {min=21, median=25.0, max=26} protein sequence collection
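To make the summary line above concrete, here is a minimal pure-Python sketch that parses FASTA text and recomputes the minimum, median, and maximum sequence lengths. It is not how cogent3 loads data: the parse_fasta helper and the in-memory FASTA text are assumptions made for illustration, and only the three sequences are taken from the output shown above.

```python
import io
import statistics


def parse_fasta(handle):
    """Return {name: sequence} from FASTA-formatted lines (hypothetical helper)."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # the sequence name is the first word after ">"
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # sequence data may span several lines, so collect the pieces
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# in-memory stand-in for a fasta file, built from the sequences displayed above
fasta_text = """>1091044_fragment
IPLDFDKEFRDKTVVIVAIPGAFTPT
>13541053_fragment
KKKNTEVISVSEDTVYVHKAWVQYD
>15605725_fragment
FEILAINMDPENLTGFLKNNP
"""

records = parse_fasta(io.StringIO(fasta_text))
lengths = [len(seq) for seq in records.values()]
summary = (min(lengths), statistics.median(lengths), max(lengths))
print(summary)
```

Because the sequences are unaligned, their lengths differ, which is why the collection reports a range of lengths instead of a single alignment length.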
Loading unaligned DNA sequences from multiple fasta files
To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of the apps we want to run.
1. A data store that identifies the files we are interested in
Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory. We limit the data store to two members to keep the example minimal.
from cogent3 import get_app, open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
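As a rough mental model of what the suffix and limit arguments select, the sketch below lists matching files with the standard library. The find_members function and the temporary files are hypothetical and exist only for this illustration; a cogent3 data store does more than collect paths, so treat this as an analogy, not as its implementation.

```python
import tempfile
from pathlib import Path


def find_members(directory, suffix, limit=None):
    """List files in `directory` ending in `.suffix`, sorted, at most `limit` of them."""
    paths = sorted(Path(directory).glob(f"*.{suffix}"))
    return paths if limit is None else paths[:limit]


# build a throwaway directory holding three fasta files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.fasta", "b.fasta", "c.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">s0\nACGT\n")
    members = find_members(tmp, suffix="fasta", limit=2)
    member_names = [p.name for p in members]

print(member_names)
```

Only files with the requested suffix are considered, and the limit caps how many of them are kept.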
2. A composed process that defines our workflow
In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
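For readers unfamiliar with kmer-based distances, the following sketch shows the general idea of a Jaccard distance between two sequences: one minus the fraction of shared kmers among all kmers seen in either sequence. The function names, the choice of k=3, the toy sequences, and the handling of empty kmer sets are all assumptions for illustration; the jaccard_dist app may use different defaults and details.

```python
from itertools import permutations


def kmers(seq, k):
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """1 - |shared kmers| / |all kmers| for two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # assumption: two sequences with no kmers at all are treated as identical
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def pairwise_distances(seqs, k=3):
    """Distances for every ordered pair of distinct names, as (name1, name2) -> value."""
    return {
        (x, y): jaccard_distance(seqs[x], seqs[y], k)
        for x, y in permutations(seqs, 2)
    }


# toy sequences, not taken from the data files used in this document
toy = {"s0": "ACGTACGT", "s1": "ACGTACGT", "s2": "TTTTTTTT"}
dists = pairwise_distances(toy)
for (x, y), value in dists.items():
    print(x, y, value)
```

A distance of 0.0 means the two sequences contain exactly the same set of kmers, and 1.0 means they share none.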
Note
Apps that are “writers” require a data store to write to. See the documentation on writers to learn more.
out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")
load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format_name="tsv")
process = load_unaligned_app + jdist + writer
Tip
When running this code on your machine, remember to replace path_to_dir with an actual directory path.
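The + operator in the process above chains apps so that the output of one becomes the input of the next. The sketch below mimics that idea with a hypothetical Step class; it is not cogent3's app machinery, and the three toy steps are assumptions chosen only to show the order of execution.

```python
class Step:
    """Hypothetical stand-in for an app: wraps a one-argument function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __add__(self, other):
        # self + other gives a new Step that runs self first, then other
        return Step(lambda value: other(self(value)))


clean = Step(str.strip)                      # remove surrounding whitespace
measure = Step(len)                          # count the remaining characters
describe = Step(lambda n: f"length={n}")     # format the count as text

pipeline = clean + measure + describe
outcome = pipeline("  ACGT\n")
print(outcome)
```

Reading the composition left to right gives the order in which the steps run, just as load_unaligned_app + jdist + writer reads as load, compute distances, write.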
Now we’re good to go! We can apply process to our data store of fasta sequences. The result is a data store, which you can index to access individual data members. We can inspect a given data member by calling its .read() method.
result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
dim-1 dim-2 value
s0 s1 1.0
s0 s2 1.0
s0 s3 1.0
s0 s4 1.0
s0 s5 1.0
s0 s6 1.0
s0 s7 1.0
s0 s8 1.0
s1 s0 1.0
s1 s2 0.4642857142857143
s1 s3 0.0
s1 s4 1.0
s1 s5 1.0
s1 s6 1.0
s1 s7 1.0
s1 s8 1.0
s2 s0 1.0
s2 s1 0.4642857142857143
s2 s3 0.4642857142857143
s2 s4 1.0
s2 s5 1.0
s2 s6 1.0
s2 s7 1.0
s2 s8 1.0
s3 s0 1.0
s3 s1 0.0
s3 s2 0.4642857142857143
s3 s4 1.0
s3 s5 1.0
s3 s6 1.0
s3 s7 1.0
s3 s8 1.0
s4 s0 1.0
s4 s1 1.0
s4 s2 1.0
s4 s3 1.0
s4 s5 1.0
s4 s6 1.0
s4 s7 1.0
s4 s8 1.0
s5 s0 1.0
s5 s1 1.0
s5 s2 1.0
s5 s3 1.0
s5 s4 1.0
s5 s6 1.0
s5 s7 1.0
s5 s8 1.0
s6 s0 1.0
s6 s1 1.0
s6 s2 1.0
s6 s3 1.0
s6 s4 1.0
s6 s5 1.0
s6 s7 1.0
s6 s8 1.0
s7 s0 1.0
s7 s1 1.0
s7 s2 1.0
s7 s3 1.0
s7 s4 1.0
s7 s5 1.0
s7 s6 1.0
s7 s8 1.0
s8 s0 1.0
s8 s1 1.0
s8 s2 1.0
s8 s3 1.0
s8 s4 1.0
s8 s5 1.0
s8 s6 1.0
s8 s7 1.0
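The output above is in long form, with one row per ordered pair of sequences. As a minimal sketch of how such text could be turned into a lookup, the code below parses a few rows copied from that output and checks that each distance matches its reverse pair. The read_long_table helper is hypothetical, and splitting on whitespace is an assumption that works for both the display above and tab-separated text.

```python
# a subset of the rows shown above, including the header line
table_text = """dim-1 dim-2 value
s1 s2 0.4642857142857143
s1 s3 0.0
s2 s1 0.4642857142857143
s2 s3 0.4642857142857143
s3 s1 0.0
s3 s2 0.4642857142857143
"""


def read_long_table(text):
    """Return {(name1, name2): distance} from long-form rows, skipping the header."""
    distances = {}
    for line in text.strip().splitlines()[1:]:
        name1, name2, value = line.split()
        distances[(name1, name2)] = float(value)
    return distances


distances = read_long_table(table_text)
# a distance should be the same in both directions
is_symmetric = all(distances[(b, a)] == d for (a, b), d in distances.items())
print(distances[("s1", "s3")], is_symmetric)
```

In these rows, s1 and s3 have a distance of 0.0, meaning their kmer sets are identical.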