Loading unaligned sequence data

We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.

Loading unaligned protein sequences from a single fasta file

In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format_name="fasta").

from cogent3 import get_app

load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs

	0
1091044_fragment	IPLDFDKEFRDKTVVIVAIPGAFTPT
13541053_fragment	KKKNTEVISVSEDTVYVHKAWVQYD
15605725_fragment	FEILAINMDPENLTGFLKNNP

3 x {min=21, median=25.0, max=26} protein sequence collection
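The summary line reports the number of sequences and the spread of their lengths. Because the sequences are unaligned, they are free to differ in length. As a rough cross-check that does not use cogent3, the same numbers can be derived with the standard library. The dictionary below is typed in by hand from the output above, purely for illustration.

```python
from statistics import median

# Sequences copied by hand from the displayed output (illustration only).
seqs = {
    "1091044_fragment": "IPLDFDKEFRDKTVVIVAIPGAFTPT",
    "13541053_fragment": "KKKNTEVISVSEDTVYVHKAWVQYD",
    "15605725_fragment": "FEILAINMDPENLTGFLKNNP",
}

# Length of each sequence, then the same three statistics as the summary line.
lengths = [len(s) for s in seqs.values()]
summary = {
    "min": min(lengths),
    "median": float(median(lengths)),
    "max": max(lengths),
}
print(len(seqs), "x", summary)
```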

Loading unaligned DNA sequences from multiple fasta files

To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of our apps of interest.

1. A data store that identifies the files we are interested in

Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory. We limit it to two members (limit=2) to keep the example small.

from cogent3 import get_app, open_data_store

fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)

2. A composed process that defines our workflow

In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
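To give a feel for what a k-mer-based distance measures, here is a minimal pure-Python sketch of a Jaccard distance between the k-mer sets of two sequences. It is not cogent3's implementation: the function names, the choice of k=3, and the toy sequences are all assumptions made for this illustration, and the jaccard_dist app may differ in details such as its default k.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq1: str, seq2: str, k: int = 3) -> float:
    """Compute 1 - |A & B| / |A | B| for the k-mer sets A and B."""
    a, b = kmers(seq1, k), kmers(seq2, k)
    union = a | b
    if not union:  # both sequences are shorter than k
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy sequences invented for this example; they differ at one position.
d = jaccard_distance("ACGTACGT", "ACGTTCGT", k=3)
print(d)
```

Identical sequences give a distance of 0, and sequences that share no k-mers give a distance of 1. No alignment is needed, which is why this kind of measure suits unaligned data.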

Note

Apps that are “writers” require a data store to write to. See the documentation on writers to learn more.

out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")

load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format_name="tsv")

process = load_unaligned_app + jdist + writer
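The idea of chaining steps with + can be sketched with a toy class. This is only a conceptual analogy, not how cogent3 implements apps: the Step class and the three string-handling steps are hypothetical, and real apps do more (for example, type checking and handling of failed records).

```python
class Step:
    """Hypothetical stand-in for an app: wraps a one-argument function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __add__(self, other):
        # Composition: run this step first, then pass its result to `other`.
        return Step(lambda value: other(self(value)))


# Three toy steps, chained left to right like the process above.
strip = Step(str.strip)
upper = Step(str.upper)
length = Step(len)

toy_process = strip + upper + length
toy_result = toy_process("  acgt \n")
print(toy_result)
```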

Tip

When running this code on your machine, remember to replace path_to_dir with an actual directory path.

Now we’re good to go! We can apply process to our data store of fasta sequences. The returned result is a data store, which you can index to access individual data members. Calling .read() on a data member returns its contents.

result = process.apply_to(fasta_seq_dstore)
print(result[1].read())

dim-1	dim-2	value
Chimpanzee	Galago	0.8642478597635548
Chimpanzee	Gorilla	0.10051282051282051
Chimpanzee	HowlerMon	0.6403615929635964
Chimpanzee	Human	0.1535145888594165
Chimpanzee	Orangutan	0.24058246280468498
Chimpanzee	Rhesus	0.49607152533188836
Galago	Chimpanzee	0.8642478597635548
Galago	Gorilla	0.8640717342571836
Galago	HowlerMon	0.8869617224880383
Galago	Human	0.868570271364925
Galago	Orangutan	0.8606456885982836
Galago	Rhesus	0.8625408496732025
Gorilla	Chimpanzee	0.10051282051282051
Gorilla	Galago	0.8640717342571836
Gorilla	HowlerMon	0.6398729538236012
Gorilla	Human	0.15346370566788203
Gorilla	Orangutan	0.23883433639531204
Gorilla	Rhesus	0.4930800542740841
HowlerMon	Chimpanzee	0.6403615929635964
HowlerMon	Galago	0.8869617224880383
HowlerMon	Gorilla	0.6398729538236012
HowlerMon	Human	0.6634730538922156
HowlerMon	Orangutan	0.6410444119082479
HowlerMon	Rhesus	0.6516363636363636
Human	Chimpanzee	0.1535145888594165
Human	Galago	0.868570271364925
Human	Gorilla	0.15346370566788203
Human	HowlerMon	0.6634730538922156
Human	Orangutan	0.2772797527047913
Human	Rhesus	0.5133547008547008
Orangutan	Chimpanzee	0.24058246280468498
Orangutan	Galago	0.8606456885982836
Orangutan	Gorilla	0.23883433639531204
Orangutan	HowlerMon	0.6410444119082479
Orangutan	Human	0.2772797527047913
Orangutan	Rhesus	0.502291722836344
Rhesus	Chimpanzee	0.49607152533188836
Rhesus	Galago	0.8625408496732025
Rhesus	Gorilla	0.4930800542740841
Rhesus	HowlerMon	0.6516363636363636
Rhesus	Human	0.5133547008547008
Rhesus	Orangutan	0.502291722836344
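The output is in long form, with every pair of sequences listed in both directions. As a sketch of how such a table could be consumed downstream, the standard library snippet below parses a subset of the rows above into a nested dictionary. It assumes the text is tab-separated with the column names dim-1, dim-2 and value, as displayed; the variable names are chosen for this example.

```python
import csv
import io

# A subset of the rows printed above, tab-separated as in the output.
tsv_text = (
    "dim-1\tdim-2\tvalue\n"
    "Chimpanzee\tGorilla\t0.10051282051282051\n"
    "Chimpanzee\tHuman\t0.1535145888594165\n"
    "Gorilla\tChimpanzee\t0.10051282051282051\n"
    "Gorilla\tHuman\t0.15346370566788203\n"
    "Human\tChimpanzee\t0.1535145888594165\n"
    "Human\tGorilla\t0.15346370566788203\n"
)

# Build dists[name1][name2] = distance from the long-form rows.
dists = {}
for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
    dists.setdefault(row["dim-1"], {})[row["dim-2"]] = float(row["value"])

# Each pair appears twice; both directions should carry the same value.
symmetric = all(
    dists[b][a] == d for a, inner in dists.items() for b, d in inner.items()
)

# Name with the smallest distance to Human within this subset.
closest = min(dists["Human"], key=dists["Human"].get)
print(symmetric, closest)
```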