Loading aligned sequence data#
We can load aligned sequence data using the load_aligned
app. When making the app, you can optionally provide arguments for the molecular type of the sequence and the format of the data.
Loading aligned DNA sequences from a single fasta file#
Here we load the brca1 gene in bats, providing the molecular type (moltype="dna"
) and file format (format="fasta"
).
from cogent3 import get_app
load_aligned_app = get_app("load_aligned", moltype="dna", format="fasta")
aln = load_aligned_app("data/brca1-bats.fasta")
aln
0 | |
LittleBro | TGTGGCACAGATACTCATGCCAGCTCATTACAGCATGAGAACAGCAGTTTACTACTCACC |
FlyingFox | .........A..G.............T...............---......T..TA...T |
DogFaced | .........A............A............................T..TA...T |
FreeTaile | ...........................................................T |
TombBat | .........AG................G...............................T |
5 x 3009 (truncated to 5 x 60) dna alignment
Loading aligned protein sequences from a single phylip file#
Here we load a globin alignment, providing the molecular type (moltype="protein"
) and file format (format="phylip"
).
from cogent3 import get_app
load_aligned_app = get_app("load_aligned", moltype="protein", format="phylip")
aln = load_aligned_app("data/abglobin_aa.phylip")
aln
0 | |
goat-cow | VLSAADKSNVKAAWGKVGGNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGE |
human | ...P...T..........AH..E....................................K |
rabbit | ...P...T.I.T..E.I.SHG.E.....V.....G............FT...E.I.A..K |
rat | ....D..T.I.NC...I..HG.E..E...Q...AA........S.I.V.P......A..K |
marsupial | ...D...TH...I......H....A....A.T.................P....IQ...K |
5 x 285 (truncated to 5 x 60) protein alignment
Loading aligned DNA sequences from multiple fasta files#
In the above examples, the result is a single alignment, which could have been achieved using standard cogent3
(load_aligned_seqs()
). The real power of apps is for batch processing of a large number of files.
To apply apps to multiple files we need to set two things up:
1. A data store that identifies the files we are interested in#
Here, we create a data store containing all the files with the “.fasta” suffix in the data directory, limiting the data store to two members as a minimum example.
from cogent3 import open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r", limit=2)
2. A composed process that defines our workflow#
In this example, our process loads the sequences, filters the sequences to keep only the third codon position, and then writes the filtered sequences to a data store.
Note
Apps that are “writers” require a data store to write to, learn more about writers here!
from cogent3 import get_app, open_data_store
out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")
loader = get_app("load_aligned", format="fasta", moltype="dna")
cpos3 = get_app("take_codon_positions", 3)
writer = get_app("write_seqs", out_dstore, format="fasta")
process = loader + cpos3 + writer
Tip
When running this code on your machine, remember to replace path_to_dir
with an actual directory path.
Now we’re good to go, we can apply process
to our data store!#
result
is a data store, which you can index to see individual data members - which are our alignments. We can take a closer look using the .read()
method on data members (truncating to 50 characters).
result = process.apply_to(fasta_seq_dstore)
print(result[0].read()[:50])
>human
CCCGCAGCAGGGTGTCGCCGTCCTTTTCACGGCTAGTAAATAG