Writing sequences and sequence alignments#
Writing a sequence alignment to disk#
Writing out an alignment can be achieved easily with the write_seqs
app. First, let’s load the alignment we want to write.
from cogent3 import get_app
load_aligned_app = get_app("load_aligned", moltype="dna", format="fasta")
aln = load_aligned_app("data/primate_brca1.fasta")
aln
0 | |
Chimpanzee | TGTGGCACAAATACTCATGCCAGCTCATTACAGCATGAGAACAGTTTATTACTCACTAAA |
Galago | .......A................................G................... |
HowlerMon | ...............................................G............ |
Rhesus | ...............................................G............ |
Orangutan | ............................................................ |
Gorilla | ............................................................ |
Human | ............................................................ |
7 x 2814 (truncated to 7 x 60) dna alignment
When creating the write_seqs
app, we need to provide a data store to which we want the data to be written, and optionally, we can specify the format we want the sequences to be written in.
from cogent3 import get_app, open_data_store
seq_dstore = open_data_store(path_to_dir, suffix="phylip", mode="w")
write_seqs_app = get_app("write_seqs", data_store=seq_dstore, format="phylip")
result = write_seqs_app(aln)
result.read()[:50]
'7 2814\nGalago TGTGGCAAAAATACTCATGCCAGCTCATTACA'
Writing many sequence collections to a data store#
Typically, the final step of a data processing pipeline is writing out the filtered data. When write_seqs
is composed into a process, the process will write out multiple sequence collections to a data store.
We can create our input data store containing all the files with the “.fasta” suffix in the data directory using open_data_store
.
from cogent3 import open_data_store
fasta_seq_dstore = open_data_store("data", suffix="fasta", mode="r")
Let’s define a process. In this example, our process loads the sequences, filters the sequences to keep only those which are translatable, translates the sequences, and then writes the filtered sequences to a data store.
from cogent3 import get_app, open_data_store
out_dstore = open_data_store(path_to_dir, suffix="fa", mode="w")
loader = get_app("load_unaligned", format="fasta", moltype="dna")
keep_translatable = get_app("select_translatable")
translate = get_app("translate_seqs")
writer = get_app("write_seqs", out_dstore, format="fasta")
process = loader + keep_translatable + translate + writer
Tip
When running this code on your machine, remember to replace path_to_dir
with an actual directory path.
We apply process
to our input data store, and assign the resulting data store to result
.
result = process.apply_to(fasta_seq_dstore)
Accessing an overview of our process#
We can interrogate result
to see an overview of the process.
result.describe
record type | number |
---|---|
completed | 10 |
not_completed | 3 |
logs | 1 |
3 rows x 2 columns
There were 10 data files to which the process was successfully applied. However, there were three files for which the process did not complete. We can see a summary of the failures by accessing the summary_not_completed
property.
result.summary_not_completed
type | origin | message | num | source |
---|---|---|---|---|
ERROR | select_translatable | 'TypeError: sequence ...instance, list found' | 1 | primate_cdx2_promoter.fasta |
ERROR | load_unaligned | "AlphabetError: 'I', AlphabetError: 'F'" | 2 | inseqs_protein.fasta, refseqs_protein.fasta |
2 rows x 5 columns
Looks like the first two failed because they are protein sequences and load_unaligned
expected DNA sequences.
Interestingly, another file failed in the keep_translatable
step. By design, these failures did not stop the rest of the pipeline from being run. In fact, the data store collects the NotCompleted objects, which store traceback information, allowing you to interrogate any failings.