Alphabets#
Note
These docs now use the new_type core objects via the following setting.
import os
# using new types without requiring an explicit argument
os.environ["COGENT3_NEW_TYPE"] = "1"
CharAlphabet and MolType#
Each MolType instance has a CharAlphabet, available as its alphabet attribute.
from cogent3 import get_moltype
dna = get_moltype("dna")
print(dna.alphabet)
protein = get_moltype("protein")
print(protein.alphabet)
('T', 'C', 'A', 'G')
('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y')
CharAlphabet instances reference the MolType that created them.
dna.alphabet.moltype is dna
True
Creating tuple alphabets#
You can create a tuple (k-mer) alphabet, for example of dinucleotides or trinucleotides, with the get_kmer_alphabet() method.
dinuc_alphabet = dna.alphabet.get_kmer_alphabet(2)
trinuc_alphabet = dna.alphabet.get_kmer_alphabet(3)
print(dinuc_alphabet, trinuc_alphabet, sep="\n")
('TT', 'TC', 'TA', 'TG', 'CT', 'CC', 'CA', 'CG', 'AT', 'AC', 'AA', 'AG', 'GT', 'GC', 'GA', 'GG')
('TTT', 'TTC', 'TTA', 'TTG', 'TCT', 'TCC', 'TCA', 'TCG', 'TAT', 'TAC', 'TAA', 'TAG', 'TGT', 'TGC', 'TGA', 'TGG', 'CTT', 'CTC', 'CTA', 'CTG', 'CCT', 'CCC', 'CCA', 'CCG', 'CAT', 'CAC', 'CAA', 'CAG', 'CGT', 'CGC', 'CGA', 'CGG', 'ATT', 'ATC', 'ATA', 'ATG', 'ACT', 'ACC', 'ACA', 'ACG', 'AAT', 'AAC', 'AAA', 'AAG', 'AGT', 'AGC', 'AGA', 'AGG', 'GTT', 'GTC', 'GTA', 'GTG', 'GCT', 'GCC', 'GCA', 'GCG', 'GAT', 'GAC', 'GAA', 'GAG', 'GGT', 'GGC', 'GGA', 'GGG')
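The ordering of the k-mer alphabets printed above is consistent with taking the Cartesian product of the base alphabet with itself. The following is a minimal sketch of that ordering in plain Python. It is not how cogent3 builds the alphabet, and the helper name make_kmers is hypothetical.

```python
from itertools import product

# Base order copied from the DNA alphabet printed above.
bases = ("T", "C", "A", "G")


def make_kmers(chars, k):
    """Return all k-mers over chars, in Cartesian-product order."""
    return tuple("".join(p) for p in product(chars, repeat=k))


dinucs = make_kmers(bases, 2)
trinucs = make_kmers(bases, 3)
print(dinucs[:4])
print(len(dinucs), len(trinucs))
```

A 4-letter alphabet gives 4**k k-mers, so 16 dinucleotides and 64 trinucleotides, which agrees with the output above.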
Convert a sequence into integers#
seq = "TAGT"
indices = dna.alphabet.to_indices(seq)
indices
array([0, 2, 3, 0], dtype=uint8)
Convert integers to a sequence#
import numpy
data = numpy.array([0, 2, 3, 0], dtype=numpy.uint8)
seq = dna.alphabet.from_indices(data)
seq
'TAGT'
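Conceptually, these two conversions are a lookup from character to position and back again. The following is a minimal sketch in plain Python, assuming the DNA ordering shown earlier. The names char_to_index, encode and decode are hypothetical and are not part of cogent3.

```python
# Base order copied from the DNA alphabet printed earlier.
bases = ("T", "C", "A", "G")

# Map each character to its position in the alphabet.
char_to_index = {char: i for i, char in enumerate(bases)}


def encode(seq):
    """Convert a string into a list of alphabet positions."""
    return [char_to_index[char] for char in seq]


def decode(indices):
    """Convert alphabet positions back into a string."""
    return "".join(bases[i] for i in indices)


indices = encode("TAGT")
print(indices)  # [0, 2, 3, 0]
print(decode(indices))  # TAGT
```

Unlike this sketch, the to_indices() method shown above returns a numpy array with dtype uint8 rather than a list.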
Converting a sequence into k-mer indices#
You can use a KmerAlphabet to convert a standard sequence into a numpy array of integers. Here, each integer is the index of the corresponding dinucleotide in the alphabet. Because CharAlphabet and KmerAlphabet both inherit from tuple, they have the built-in .index() method.
import numpy
bases = dna.alphabet
dinucs = bases.get_kmer_alphabet(2)
dinucs.index("TG")
3
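Given the ordering of the dinucleotide alphabet shown earlier, the position of a k-mer is the same as reading its character indices as digits of a base-4 number. The following sketch checks that relationship in plain Python. The helper kmer_position is hypothetical and is not a cogent3 function.

```python
from itertools import product

# Base order copied from the DNA alphabet printed earlier.
bases = ("T", "C", "A", "G")
dinucs = tuple("".join(p) for p in product(bases, repeat=2))


def kmer_position(kmer, chars=bases):
    """Read the character indices of kmer as digits in base len(chars)."""
    position = 0
    for char in kmer:
        position = position * len(chars) + chars.index(char)
    return position


print(kmer_position("TG"))  # 3

# The arithmetic agrees with a tuple lookup for every dinucleotide.
assert all(kmer_position(d) == dinucs.index(d) for d in dinucs)
```

Computing the position arithmetically avoids a linear scan through the tuple, which is one plausible reason a dedicated conversion method can outperform .index().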
The to_indices() method is faster and more flexible. We use it on a single dinucleotide
dinucs.to_indices("TG")
array([3], dtype=uint8)
and on a longer sequence where we want the independent (non-overlapping) k-mers.
seq = "TGTGGCACAAATACTCATGCCAGCTCATTA"
dinuc_indices = dinucs.to_indices(seq, independent_kmer=True)
dinuc_indices
array([ 3, 3, 13, 9, 10, 8, 9, 1, 8, 13, 6, 13, 1, 8, 2],
dtype=uint8)
We can also convert the sequence into all possible k-mers, that is, every overlapping window of length k.
dinucs.to_indices(seq, independent_kmer=False)
array([ 3, 12, 3, 15, 13, 6, 9, 6, 10, 10, 8, 2, 9, 4, 1, 6, 8,
3, 13, 5, 6, 11, 13, 4, 1, 6, 8, 0, 2], dtype=uint8)
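The difference between the two modes comes down to the step between consecutive windows: k for independent k-mers and 1 for all possible k-mers. The following is a minimal sketch in plain Python, assuming the base-4 encoding implied by the alphabet order above. The function kmer_indices and its overlapping parameter are hypothetical names, not the cogent3 API.

```python
# Base order copied from the DNA alphabet printed earlier.
bases = ("T", "C", "A", "G")
char_to_index = {char: i for i, char in enumerate(bases)}


def kmer_indices(seq, k, overlapping):
    """Return the k-mer index of each window of length k in seq."""
    # Overlapping windows advance one position at a time,
    # independent windows advance a whole k-mer at a time.
    step = 1 if overlapping else k
    result = []
    for start in range(0, len(seq) - k + 1, step):
        value = 0
        for char in seq[start : start + k]:
            value = value * len(bases) + char_to_index[char]
        result.append(value)
    return result


seq = "TGTGGCACAAATACTCATGCCAGCTCATTA"
independent = kmer_indices(seq, 2, overlapping=False)
overlapped = kmer_indices(seq, 2, overlapping=True)
print(len(independent), len(overlapped))  # 15 29
```

For this 30-character sequence, the sketch yields 15 independent dinucleotides and 29 overlapping ones, with the same values as the two arrays shown above.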