Loading unaligned sequence data

We can load unaligned sequence data using the load_unaligned app, which returns a SequenceCollection.

Loading unaligned protein sequences from a single fasta file

In this example, we load unaligned protein sequences from a single fasta file using the load_unaligned app. We specify the molecular type (moltype="protein") and the file format (format_name="fasta").

from cogent3 import get_app

load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="protein")
seqs = load_unaligned_app("data/inseqs_protein.fasta")
seqs
                     0
1091044_fragment     IPLDFDKEFRDKTVVIVAIPGAFTPT
13541053_fragment    KKKNTEVISVSEDTVYVHKAWVQYD
15605725_fragment    FEILAINMDPENLTGFLKNNP

3 x {min=21, median=25.0, max=26} protein sequence collection
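The summary line reports the number of sequences and the spread of their lengths. As a quick sanity check, the following pure-Python sketch recomputes those figures from the three fragments displayed above. It does not use cogent3, and the `fragments` dict is simply the displayed output typed in by hand for illustration.

```python
from statistics import median

# The three fragments displayed above, copied by hand (illustrative only).
fragments = {
    "1091044_fragment": "IPLDFDKEFRDKTVVIVAIPGAFTPT",
    "13541053_fragment": "KKKNTEVISVSEDTVYVHKAWVQYD",
    "15605725_fragment": "FEILAINMDPENLTGFLKNNP",
}

# Unaligned sequences may differ in length, so summarise the spread.
lengths = [len(seq) for seq in fragments.values()]
summary = {
    "num_seqs": len(lengths),
    "min": min(lengths),
    "median": float(median(lengths)),
    "max": max(lengths),
}
print(summary)
```

The values agree with the summary line: 3 sequences with a minimum length of 21, a median of 25.0 and a maximum of 26.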

Loading unaligned DNA sequences from multiple fasta files

To load unaligned DNA sequences from multiple fasta files, we need two things: a data store that identifies the files we are interested in, and a process composed of our apps of interest.

1. A data store that identifies the files we are interested in

Here we open a read-only (mode="r") data store that identifies all fasta files in the data directory. We limit the data store to two members to keep the example minimal.

from cogent3 import get_app, open_data_store

fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
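To make the idea of "files matching a suffix, capped at a limit" concrete, here is a rough standard-library sketch. It is only an analogy: `find_members` is a hypothetical helper, not part of cogent3, and the real data store may order, filter, or represent its members differently.

```python
from itertools import islice
from pathlib import Path


def find_members(directory, suffix, limit=None):
    """Return up to `limit` paths in `directory` whose names end in `.suffix`.

    Hypothetical helper for illustration; not a cogent3 function.
    """
    # Sorting is a choice made for this sketch so the result is predictable.
    matches = sorted(Path(directory).glob(f"*.{suffix}"))
    # islice with limit=None yields every match.
    return list(islice(matches, limit))
```

Under these assumptions, `find_members("data", "fasta", limit=2)` would return at most two paths, which mirrors the role of `suffix` and `limit` in the call above.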

2. A composed process that defines our workflow

In this example, our process loads the unaligned sequences using load_unaligned, then applies jaccard_dist to estimate a kmer-based genetic distance, which we write out to a data store using write_tabular.
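For readers unfamiliar with kmer-based distances, the sketch below shows one common textbook definition of a Jaccard distance between two sequences: one minus the fraction of distinct k-mers they share. This is an assumption-laden illustration in plain Python. The choice of `k=3`, the handling of empty inputs, and the function names are all ours, and the jaccard_dist app may differ in its defaults and details.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """Return 1 - |A & B| / |A | B| for the k-mer sets A and B."""
    set_a, set_b = kmers(a, k), kmers(b, k)
    union = set_a | set_b
    if not union:
        # Both sequences shorter than k: treated as identical in this sketch.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Two of the four distinct 3-mers in each sequence are shared (ACG, CGT).
print(jaccard_distance("ACGTAC", "ACGTTT"))
```

With this definition, a distance of 0.0 means the two sequences have identical k-mer sets, and 1.0 means they share no k-mers at all.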

Note

Apps that are “writers” require a data store to write to. Learn more about writers here.

out_dstore = open_data_store(path_to_dir, suffix="tsv", mode="w")

load_unaligned_app = get_app("load_unaligned", format_name="fasta", moltype="dna")
jdist = get_app("jaccard_dist")
writer = get_app("write_tabular", out_dstore, format_name="tsv")

process = load_unaligned_app + jdist + writer

Tip

When running this code on your machine, remember to replace path_to_dir with an actual directory path.
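The `+` operator used to build `process` chains the apps so that the output of one becomes the input of the next. The toy example below illustrates that idea with ordinary Python operator overloading. The `Step` class is a hypothetical stand-in written for this sketch. It is not a cogent3 class and says nothing about how cogent3 implements composition internally.

```python
class Step:
    """Hypothetical stand-in for a composable app (not a cogent3 class)."""

    def __init__(self, func):
        self.func = func

    def __call__(self, value):
        return self.func(value)

    def __add__(self, other):
        # Composition: run self first, then feed its result to other.
        return Step(lambda value: other(self(value)))


strip = Step(str.strip)
upper = Step(str.upper)
length = Step(len)

# Reads left to right, like load_unaligned_app + jdist + writer.
pipeline = strip + upper + length
print(pipeline("  acgt \n"))
```

Here the input string is stripped, upper-cased, and measured in that order, so each step only needs to understand the output of the step before it.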

Now we’re good to go! We can apply process to our data store of fasta sequences. The result is a data store, which you can index to see individual data members. We can inspect a given data member by calling the .read() method on it.

result = process.apply_to(fasta_seq_dstore)
print(result[1].read())
dim-1	dim-2	value
s0	s1	1.0
s0	s2	1.0
s0	s3	1.0
s0	s4	1.0
s0	s5	1.0
s0	s6	1.0
s0	s7	1.0
s0	s8	1.0
s1	s0	1.0
s1	s2	0.4642857142857143
s1	s3	0.0
s1	s4	1.0
s1	s5	1.0
s1	s6	1.0
s1	s7	1.0
s1	s8	1.0
s2	s0	1.0
s2	s1	0.4642857142857143
s2	s3	0.4642857142857143
s2	s4	1.0
s2	s5	1.0
s2	s6	1.0
s2	s7	1.0
s2	s8	1.0
s3	s0	1.0
s3	s1	0.0
s3	s2	0.4642857142857143
s3	s4	1.0
s3	s5	1.0
s3	s6	1.0
s3	s7	1.0
s3	s8	1.0
s4	s0	1.0
s4	s1	1.0
s4	s2	1.0
s4	s3	1.0
s4	s5	1.0
s4	s6	1.0
s4	s7	1.0
s4	s8	1.0
s5	s0	1.0
s5	s1	1.0
s5	s2	1.0
s5	s3	1.0
s5	s4	1.0
s5	s6	1.0
s5	s7	1.0
s5	s8	1.0
s6	s0	1.0
s6	s1	1.0
s6	s2	1.0
s6	s3	1.0
s6	s4	1.0
s6	s5	1.0
s6	s7	1.0
s6	s8	1.0
s7	s0	1.0
s7	s1	1.0
s7	s2	1.0
s7	s3	1.0
s7	s4	1.0
s7	s5	1.0
s7	s6	1.0
s7	s8	1.0
s8	s0	1.0
s8	s1	1.0
s8	s2	1.0
s8	s3	1.0
s8	s4	1.0
s8	s5	1.0
s8	s6	1.0
s8	s7	1.0
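The text returned by .read() is a three-column table with one row per ordered pair of sequences. If you want to look values up by name, a small standard-library sketch like the one below can turn that text into a nested dict. It assumes the fields are tab-separated (as the tsv format implies) and uses a few rows copied from the output above. `read_pairwise` is a hypothetical helper, not a cogent3 function.

```python
import csv
import io

# A few rows copied from the output above; fields assumed tab-separated.
tsv_text = (
    "dim-1\tdim-2\tvalue\n"
    "s1\ts2\t0.4642857142857143\n"
    "s1\ts3\t0.0\n"
    "s2\ts1\t0.4642857142857143\n"
)


def read_pairwise(text):
    """Parse pairwise-distance TSV text into {name1: {name2: distance}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    dists = {}
    for row in reader:
        dists.setdefault(row["dim-1"], {})[row["dim-2"]] = float(row["value"])
    return dists


dists = read_pairwise(tsv_text)
print(dists["s1"]["s2"])
```

In practice you would pass the string from `result[1].read()` to `read_pairwise` instead of the hand-copied `tsv_text`.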