Map protein alignment gaps to DNA alignment gaps#

Section author: Gavin Huttley

Although Cogent3 provides a means for directly aligning codon sequences, you may want to use a different approach based on the translate-align-introduce gaps into the original paradigm. After you’ve translated your codon sequences, and aligned the resulting amino acid sequences, you want to introduce the gaps from the aligned protein sequences back into the original codon sequences. Here’s how.

from cogent3 import make_aligned_seqs, make_unaligned_seqs

First I’m going to construct an artificial example, using the seqs dict as a means to get the data into the Alignment object. The basic idea, however, is that you should already have a set of DNA sequences that are in frame (i.e. position 0 is the 1st codon position), you’ve translated those sequences and aligned these translated sequences. The result is an alignment of aa sequences and a set of unaligned DNA sequences from which the aa seqs were derived. If your sequences are not in frame you can adjust it by either slicing, or adding N’s to the beginning of the raw string.

seqs = {
    "hum": "AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCCCAGGTCAACCTCACT",
    "mus": "AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCCCAAGTCAACCTCACG",
    "rat": "CTGAACAAGCAGCCACTTTCAAACAAGAAA",
}
unaligned_DNA = make_unaligned_seqs(seqs, moltype="dna")
unaligned_DNA

	0
hum	AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCC
mus	AAGCAGATCCAGGAGAGCGGCGAGAGCGGCAGCCTGGCCGCGCGGCAGGAGAGGCAGGCC
rat	CTGAACAAGCAGCCACTTTCAAACAAGAAA

3 x {min=30, median=75, max=75} (truncated to 3 x 60) dna sequence collection

In order to ensure the alignment algorithm preserves the coding frame, we align the translation of the sequences. We need to translate them first, but note that because the seqs are unaligned they we have to set aligned=False, or we’ll get an error.

unaligned_aa = unaligned_DNA.get_translation()
unaligned_aa

	0
hum	KQIQESSENGSLAARQERQAQVNLT
mus	KQIQESGESGSLAARQERQAQVNLT
rat	LNKQPLSNKK

3 x {min=10, median=25, max=25} protein sequence collection

The translated seqs can then be written to file, using the method write. That file then serves as input for an alignment program. The resulting alignment file can be read back in. (We won’t write to file in this example.) For this example we will specify the aligned sequences in the dict, rather than from file.

aligned_aa_seqs = {
    "hum": "KQIQESSENGSLAARQERQAQVNLT",
    "mus": "KQIQESGESGSLAARQERQAQVNLT",
    "rat": "LNKQ------PLS---------NKK",
}
aligned_aa = make_aligned_seqs(aligned_aa_seqs, moltype="protein")
aligned_DNA = aligned_aa.replace_seqs(unaligned_DNA)
aligned_DNA

	0
hum	AAGCAGATCCAGGAAAGCAGCGAGAATGGCAGCCTGGCCGCGCGCCAGGAGAGGCAGGCC
mus	..............G...G......GC.................G...............
rat	CT.A.C.AG...------------------CCA..TT.A---------------------

3 x 75 (truncated to 3 x 60) dna alignment