Alphabets

Alphabet and MolType

Each MolType instance has an alphabet attribute that holds its Alphabet.

from cogent3 import DNA, PROTEIN

print(DNA.alphabet)
print(PROTEIN.alphabet)
('T', 'C', 'A', 'G')
('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y')

Conversely, each Alphabet instance records its MolType in the moltype attribute.

PROTEIN.alphabet.moltype == PROTEIN
True

Creating tuple alphabets

Use get_word_alphabet to create a tuple alphabet of fixed-length words, for example dinucleotides or trinucleotides.

dinuc_alphabet = DNA.alphabet.get_word_alphabet(2)
print(dinuc_alphabet)
trinuc_alphabet = DNA.alphabet.get_word_alphabet(3)
print(trinuc_alphabet)
('TT', 'CT', 'AT', 'GT', 'TC', 'CC', 'AC', 'GC', 'TA', 'CA', 'AA', 'GA', 'TG', 'CG', 'AG', 'GG')
('TTT', 'CTT', 'ATT', 'GTT', 'TCT', 'CCT', 'ACT', 'GCT', 'TAT', 'CAT', 'AAT', 'GAT', 'TGT', 'CGT', 'AGT', 'GGT', 'TTC', 'CTC', 'ATC', 'GTC', 'TCC', 'CCC', 'ACC', 'GCC', 'TAC', 'CAC', 'AAC', 'GAC', 'TGC', 'CGC', 'AGC', 'GGC', 'TTA', 'CTA', 'ATA', 'GTA', 'TCA', 'CCA', 'ACA', 'GCA', 'TAA', 'CAA', 'AAA', 'GAA', 'TGA', 'CGA', 'AGA', 'GGA', 'TTG', 'CTG', 'ATG', 'GTG', 'TCG', 'CCG', 'ACG', 'GCG', 'TAG', 'CAG', 'AAG', 'GAG', 'TGG', 'CGG', 'AGG', 'GGG')
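
In the output above, the first position of each word cycles fastest. The following is a minimal pure-Python sketch that reproduces the same ordering. It makes no claim about how cogent3 builds word alphabets internally, and the helper name word_alphabet is hypothetical.

```python
from itertools import product


def word_alphabet(letters, k):
    """Return all k-letter words over letters, first position varying fastest.

    Hypothetical helper for illustration; not part of cogent3.
    """
    # product() varies the LAST position fastest, so reverse each tuple
    # to make the FIRST position cycle fastest instead.
    return tuple("".join(reversed(word)) for word in product(letters, repeat=k))


dna = ("T", "C", "A", "G")  # same order as DNA.alphabet above
dinucs = word_alphabet(dna, 2)
trinucs = word_alphabet(dna, 3)

print(dinucs[:5])    # ('TT', 'CT', 'AT', 'GT', 'TC')
print(len(trinucs))  # 64
```

An alphabet of n letters yields n**k words of length k, which is why the dinucleotide alphabet has 16 entries and the trinucleotide alphabet has 64.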

Convert a sequence into integers

seq = "TAGT"
indices = DNA.alphabet.to_indices(seq)
indices
[0, 2, 3, 0]

Convert integers to a sequence

seq = DNA.alphabet.from_indices([0, 2, 3, 0])
seq
['T', 'A', 'G', 'T']

Alternatively, from_ordinals_to_seq returns a sequence object instead of a list of characters.

seq = DNA.alphabet.from_ordinals_to_seq([0, 2, 3, 0])
seq
     0
None TAGT

4 DnaSequence
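
Conceptually, both conversions are lookups against the ordered alphabet: a character maps to its position, and a position maps back to its character. The sketch below illustrates the idea in plain Python, assuming the T, C, A, G order printed earlier. The names encode and decode are hypothetical and are not cogent3 functions.

```python
letters = ("T", "C", "A", "G")  # assumed to match the order of DNA.alphabet

# Build the character -> position lookup once.
char_to_index = {char: i for i, char in enumerate(letters)}


def encode(seq):
    """Map each character of seq to its position in letters."""
    return [char_to_index[char] for char in seq]


def decode(indices):
    """Map each position back to its character."""
    return [letters[i] for i in indices]


indices = encode("TAGT")
print(indices)                   # [0, 2, 3, 0]
print("".join(decode(indices)))  # TAGT
```

In this sketch, a character outside the alphabet raises a KeyError, which makes invalid input visible early.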