Loading unaligned sequence data#
We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.
Loading unaligned protein sequences from a single fasta file#
In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format_name="fasta").
from cogent3 import get_app
load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
0 | |
1091044_fragment | IPLDFDKEFRDKTVVIVAIPGAFTPT |
13541053_fragment | KKKNTEVISVSEDTVYVHKAWVQYD |
15605725_fragment | FEILAINMDPENLTGFLKNNP |
3 x {min=21, median=25.0, max=26} protein sequence collection
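To make the summary line above concrete, here is a minimal sketch of reading fasta-formatted text into name and sequence pairs and computing the same length statistics. It is not how load_unaligned is implemented: parse_fasta is a hypothetical helper, and FASTA_TEXT is an in-memory stand-in assembled from the three fragments displayed above rather than the actual contents of the data file.

```python
from statistics import median

# Hypothetical in-memory fasta text built from the three fragments shown above.
FASTA_TEXT = """\
>1091044_fragment
IPLDFDKEFRDKTVVIVAIPGAFTPT
>13541053_fragment
KKKNTEVISVSEDTVYVHKAWVQYD
>15605725_fragment
FEILAINMDPENLTGFLKNNP
"""


def parse_fasta(text):
    """Return a dict mapping each sequence name to its sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect them all.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


records = parse_fasta(FASTA_TEXT)
lengths = [len(seq) for seq in records.values()]
summary = {
    "min": min(lengths),
    "median": float(median(lengths)),
    "max": max(lengths),
}
print(len(records), summary)
```

The three lengths are 26, 25 and 21, which agrees with the min, median and max reported for the collection.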
Loading unaligned DNA sequences from multiple fasta files#
To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of the apps we want to run.
1. A data store that identifies the files we are interested in#
Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory. We limit the data store to two members to keep the example minimal.
from cogent3 import get_app, open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
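As a rough mental model of what the suffix and limit arguments select, the sketch below does something similar with the standard library only. It is an analogy, not the cogent3 implementation: select_members is a hypothetical helper, the file names are made up, and the sorted ordering is an assumption about which members come first.

```python
import tempfile
from pathlib import Path


def select_members(directory, suffix, limit=None):
    """Return sorted paths in `directory` ending in `.suffix`, capped at `limit`."""
    paths = sorted(Path(directory).glob(f"*.{suffix}"))
    return paths if limit is None else paths[:limit]


with tempfile.TemporaryDirectory() as tmp:
    # Create three hypothetical fasta files and one unrelated file.
    for fname in ("a.fasta", "b.fasta", "c.fasta", "notes.txt"):
        (Path(tmp) / fname).write_text(">seq1\nACGT\n")
    selected = [p.name for p in select_members(tmp, suffix="fasta", limit=2)]

print(selected)
```

Only files with the requested suffix are considered, and limit caps how many of them are kept.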
2. A composed process that defines our workflow#
In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
Note
Apps that are “writers” require a data store to write to. See the documentation on writers to learn more.
out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")
load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format_name="tsv")
process = load_unaligned_app + jdist + writer
Tip
When running this code on your machine, remember to replace path_to_dir with an actual directory path.
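The + operator used to build process chains the apps so that the output of one becomes the input of the next. The toy sketch below illustrates that idea in plain Python. Step is a hypothetical class written for this illustration and does not reflect how cogent3 apps are implemented.

```python
class Step:
    """Hypothetical stand-in for an app: wraps a function and supports `+`."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __add__(self, other):
        # The composed step runs self first, then passes the result to other.
        return Step(lambda value: other(self(value)))


load = Step(lambda text: text.split())   # "loader": text -> list of tokens
count = Step(len)                        # "analysis": list -> number
report = Step(lambda n: f"{n} items")    # "writer": number -> summary string

pipeline = load + count + report
result = pipeline("a b c")
print(result)
```

Reading left to right, the composed pipeline mirrors the loader, analysis, writer ordering used for process above.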
Now we’re good to go! We can apply process to our data store of fasta sequences. The result is a data store, which you can index to see individual data members. We can inspect a given data member by calling its .read() method.
result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
dim-1 dim-2 value
Chimpanzee Galago 0.8642478597635548
Chimpanzee Gorilla 0.10051282051282051
Chimpanzee HowlerMon 0.6403615929635964
Chimpanzee Human 0.1535145888594165
Chimpanzee Orangutan 0.24058246280468498
Chimpanzee Rhesus 0.49607152533188836
Galago Chimpanzee 0.8642478597635548
Galago Gorilla 0.8640717342571836
Galago HowlerMon 0.8869617224880383
Galago Human 0.868570271364925
Galago Orangutan 0.8606456885982836
Galago Rhesus 0.8625408496732025
Gorilla Chimpanzee 0.10051282051282051
Gorilla Galago 0.8640717342571836
Gorilla HowlerMon 0.6398729538236012
Gorilla Human 0.15346370566788203
Gorilla Orangutan 0.23883433639531204
Gorilla Rhesus 0.4930800542740841
HowlerMon Chimpanzee 0.6403615929635964
HowlerMon Galago 0.8869617224880383
HowlerMon Gorilla 0.6398729538236012
HowlerMon Human 0.6634730538922156
HowlerMon Orangutan 0.6410444119082479
HowlerMon Rhesus 0.6516363636363636
Human Chimpanzee 0.1535145888594165
Human Galago 0.868570271364925
Human Gorilla 0.15346370566788203
Human HowlerMon 0.6634730538922156
Human Orangutan 0.2772797527047913
Human Rhesus 0.5133547008547008
Orangutan Chimpanzee 0.24058246280468498
Orangutan Galago 0.8606456885982836
Orangutan Gorilla 0.23883433639531204
Orangutan HowlerMon 0.6410444119082479
Orangutan Human 0.2772797527047913
Orangutan Rhesus 0.502291722836344
Rhesus Chimpanzee 0.49607152533188836
Rhesus Galago 0.8625408496732025
Rhesus Gorilla 0.4930800542740841
Rhesus HowlerMon 0.6516363636363636
Rhesus Human 0.5133547008547008
Rhesus Orangutan 0.502291722836344
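The values in the table above are kmer-based distances. As a minimal sketch of the general idea, the code below computes a Jaccard distance between the kmer sets of two sequences. It is not the jaccard_dist implementation: kmers and jaccard_distance are hypothetical helpers, the two sequences are made up, and k=3 is an arbitrary choice for illustration.

```python
def kmers(seq, k):
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq1, seq2, k=3):
    """Return 1 - |A & B| / |A | B| for the kmer sets A and B."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    union = a | b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two made-up sequences differing at a single position.
s1 = "ACGTACGT"
s2 = "ACGTTCGT"
dist = jaccard_distance(s1, s2, k=3)
print(dist)
```

A distance of 0 means the two kmer sets are identical and a distance of 1 means they share no kmers. The measure is symmetric, which is why each pair in the table appears twice with the same value.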