Alphabets

Note

These docs now use the new_type core objects via the following setting.

import os

# using new types without requiring an explicit argument
os.environ["COGENT3_NEW_TYPE"] = "1"

CharAlphabet and MolType

MolType instances have a CharAlphabet.

from cogent3 import get_moltype

dna = get_moltype("dna")
print(dna.alphabet)
protein = get_moltype("protein")
print(protein.alphabet)
('T', 'C', 'A', 'G')
('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y')

CharAlphabet instances reference the MolType that created them.

dna.alphabet.moltype is dna
True

Creating tuple alphabets

You can create a tuple (k-mer) alphabet from a CharAlphabet, for example one of dinucleotides or trinucleotides.

dinuc_alphabet = dna.alphabet.get_kmer_alphabet(2)
trinuc_alphabet = dna.alphabet.get_kmer_alphabet(3)
print(dinuc_alphabet, trinuc_alphabet, sep="\n")
('TT', 'TC', 'TA', 'TG', 'CT', 'CC', 'CA', 'CG', 'AT', 'AC', 'AA', 'AG', 'GT', 'GC', 'GA', 'GG')
('TTT', 'TTC', 'TTA', 'TTG', 'TCT', 'TCC', 'TCA', 'TCG', 'TAT', 'TAC', 'TAA', 'TAG', 'TGT', 'TGC', 'TGA', 'TGG', 'CTT', 'CTC', 'CTA', 'CTG', 'CCT', 'CCC', 'CCA', 'CCG', 'CAT', 'CAC', 'CAA', 'CAG', 'CGT', 'CGC', 'CGA', 'CGG', 'ATT', 'ATC', 'ATA', 'ATG', 'ACT', 'ACC', 'ACA', 'ACG', 'AAT', 'AAC', 'AAA', 'AAG', 'AGT', 'AGC', 'AGA', 'AGG', 'GTT', 'GTC', 'GTA', 'GTG', 'GCT', 'GCC', 'GCA', 'GCG', 'GAT', 'GAC', 'GAA', 'GAG', 'GGT', 'GGC', 'GGA', 'GGG')
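
The ordering of the k-mers printed above matches the Cartesian product of the single-character alphabet with itself. The following is a minimal sketch in plain Python that reproduces that ordering. It does not use cogent3, and the helper name `kmer_order` is hypothetical; it illustrates the ordering only and is not a description of how the library builds its alphabets.

```python
from itertools import product


def kmer_order(chars, k):
    """Return all k-mers over chars in Cartesian-product order (hypothetical helper)."""
    return tuple("".join(p) for p in product(chars, repeat=k))


# same character order as the DNA alphabet printed earlier
base_chars = ("T", "C", "A", "G")
dinuc_strings = kmer_order(base_chars, 2)
trinuc_strings = kmer_order(base_chars, 3)

print(dinuc_strings[:4])  # ('TT', 'TC', 'TA', 'TG')
print(len(dinuc_strings), len(trinuc_strings))  # 16 64
```

An alphabet of n characters yields n**k k-mers, so the alphabet grows quickly with k.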

Convert a sequence into integers

seq = "TAGT"
indices = dna.alphabet.to_indices(seq)
indices
array([0, 2, 3, 0], dtype=uint8)

Convert integers to a sequence

import numpy

data = numpy.array([0, 2, 3, 0], dtype=numpy.uint8)
seq = dna.alphabet.from_indices(data)
seq
'TAGT'
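
The two conversions are inverses of each other: each character maps to its position in the alphabet, and each position maps back to its character. As a rough illustration, the round trip can be sketched in plain Python without cogent3. The names `encode` and `decode` are hypothetical, and the sketch returns a list rather than a numpy array.

```python
# same character order as the DNA alphabet printed earlier
base_chars = ("T", "C", "A", "G")
char_to_index = {char: i for i, char in enumerate(base_chars)}


def encode(seq):
    """Map each character to its position in base_chars (hypothetical helper)."""
    return [char_to_index[char] for char in seq]


def decode(indices):
    """Map each position back to its character (hypothetical helper)."""
    return "".join(base_chars[i] for i in indices)


encoded = encode("TAGT")
print(encoded)  # [0, 2, 3, 0]
print(decode(encoded))  # TAGT
```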

Converting a sequence into k-mer indices

You can use a KmerAlphabet to convert a standard sequence into a numpy array of integers, where each integer is the index of a k-mer (here, a dinucleotide) within the alphabet. Because CharAlphabet and KmerAlphabet both inherit from tuple, they have the built-in .index() method.

import numpy

bases = dna.alphabet
dinucs = bases.get_kmer_alphabet(2)
dinucs.index("TG")
3

The to_indices() method is faster than .index() and provides more flexibility. We use it first on the single dinucleotide

dinucs.to_indices("TG")
array([3], dtype=uint8)

and then on a longer sequence where we want the independent (non-overlapping) k-mers.

seq = "TGTGGCACAAATACTCATGCCAGCTCATTA"
dinuc_indices = dinucs.to_indices(seq, independent_kmer=True)
dinuc_indices
array([ 3,  3, 13,  9, 10,  8,  9,  1,  8, 13,  6, 13,  1,  8,  2],
      dtype=uint8)

We can also convert the sequence into all overlapping k-mers, that is, one k-mer starting at every position.

dinucs.to_indices(seq, independent_kmer=False)
array([ 3, 12,  3, 15, 13,  6,  9,  6, 10, 10,  8,  2,  9,  4,  1,  6,  8,
        3, 13,  5,  6, 11, 13,  4,  1,  6,  8,  0,  2], dtype=uint8)
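
For the DNA example, the dinucleotide indices shown above are consistent with reading each k-mer as a base-4 number whose digits are the single-character indices ("TG" gives 0 * 4 + 3 = 3). The sketch below reproduces both results in plain Python under that assumption. It does not use cogent3, the helper names `kmer_index` and `kmer_indices` are hypothetical, and it makes no attempt to handle ambiguous or unknown characters.

```python
# same character order as the DNA alphabet printed earlier
base_chars = ("T", "C", "A", "G")
char_to_index = {char: i for i, char in enumerate(base_chars)}


def kmer_index(kmer):
    """Treat the k-mer as a base-len(base_chars) number (hypothetical helper)."""
    index = 0
    for char in kmer:
        index = index * len(base_chars) + char_to_index[char]
    return index


def kmer_indices(seq, k, independent_kmer=True):
    """Index k-mers using a step of k (independent) or 1 (overlapping)."""
    step = k if independent_kmer else 1
    return [kmer_index(seq[i : i + k]) for i in range(0, len(seq) - k + 1, step)]


seq = "TGTGGCACAAATACTCATGCCAGCTCATTA"
independent = kmer_indices(seq, 2, independent_kmer=True)
overlapping = kmer_indices(seq, 2, independent_kmer=False)
print(len(independent), len(overlapping))  # 15 29
```

A sequence of length 30 gives 15 independent dinucleotides but 29 overlapping ones, because the overlapping case starts a new k-mer at every position that leaves room for k characters.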