KmerAlphabet

class KmerAlphabet(words: tuple[str | bytes, ...], monomers: CharAlphabet, k: int, gap: str | None = None, missing: str | None = None)

k-mer alphabet represents complete non-monomer alphabets

Attributes:
gap_char
gap_index
missing_char
missing_index
moltype
motif_len
num_canonical

Methods

count(value, /)

Return number of occurrences of value.

from_index(kmer_index)

decodes an integer into a k-mer

from_rich_dict(data)

returns an instance from a serialised dictionary

index(value[, start, stop])

Return first index of value.

is_valid()

whether integers are within the valid range

to_index(seq)

encodes a k-mer as a single integer

to_indices(seq[, independent_kmer])

returns a sequence of k-mer indices

to_json()

returns a serialisable string

to_rich_dict()

returns a serialisable dictionary

with_gap_motif([include_missing])

returns a new KmerAlphabet with the gap motif added

from_indices(kmer_indices[, independent_kmer])

Notes

Differs from the SenseCodonAlphabet case by representing all possible k-length words over the provided monomer alphabet. More efficient mapping between

count(value, /)

Return number of occurrences of value.

from_index(kmer_index: int) -> ndarray

decodes an integer into a k-mer
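The decoding step can be pictured as reading the index as a base-n number, where n is the number of monomer states. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper name `decode_kmer`, the monomer ordering `"TCAG"`, and the convention that the first monomer is the most significant digit are all assumptions made for illustration.

```python
from __future__ import annotations


def decode_kmer(kmer_index: int, num_states: int, k: int) -> list[int]:
    """Hypothetical helper: split a k-mer index into k monomer indices.

    Assumes the first monomer of the k-mer is the most significant digit.
    """
    digits = []
    for _ in range(k):
        # peel off the least significant base-`num_states` digit
        kmer_index, digit = divmod(kmer_index, num_states)
        digits.append(digit)
    # digits were collected last-position-first, so reverse them
    return digits[::-1]


monomers = "TCAG"  # illustrative monomer ordering (an assumption)
coords = decode_kmer(9, num_states=len(monomers), k=2)
kmer = "".join(monomers[i] for i in coords)
print(coords, kmer)  # [2, 1] AC
```

Under these assumptions, index 9 with 4 states and k=2 is 2*4 + 1, which gives monomer indices [2, 1].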

from_indices(kmer_indices: ndarray, independent_kmer: bool = True) -> ndarray

classmethod from_rich_dict(data: dict) -> KmerAlphabet

returns an instance from a serialised dictionary

property gap_char: str | None

property gap_index: int | None

index(value, start=0, stop=sys.maxsize, /)

Return first index of value.

Raises ValueError if the value is not present.

is_valid(seq: ndarray) -> bool

whether integers are within the valid range

Parameters:
seq

a numpy array of integers

Notes

This will raise a TypeError for string or bytes. Using to_indices() to convert those ensures a valid result.

property missing_char: str | None

property missing_index: int | None

property moltype: MolType | None

abstract property motif_len: int

property num_canonical: int
to_index(seq) -> int
to_index(seq: str) -> int
to_index(seq: bytes) -> int
to_index(seq: ndarray) -> int

encodes a k-mer as a single integer

Parameters:
seq

sequence to be encoded, can be a string, bytes or numpy array

overlapping

if False, performs operation on sequential k-mers, e.g. codons

Notes

If self.gap_char is defined, then the following rules apply: returns num_states**k if a k-mer contains a gap character, otherwise returns num_states**k + 1 if a k-mer contains a non-canonical character. If self.gap_char is not defined, returns num_states**k for both cases.
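The rules in this note can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: `encode_kmer` is a hypothetical helper, `"TCAG"` is an illustrative monomer ordering, `num_states` is taken to be the number of canonical monomers, and the first monomer is assumed to be the most significant digit.

```python
from __future__ import annotations


def encode_kmer(kmer: str, monomers: str, gap_char=None) -> int:
    """Hypothetical helper: encode a k-mer as a single integer.

    `monomers` holds the canonical states only (an assumption).
    """
    num_states = len(monomers)
    k = len(kmer)
    gap_index = num_states**k
    # with a gap character defined, non-canonical k-mers get the next index;
    # without one, both cases share num_states**k
    missing_index = gap_index + 1 if gap_char is not None else gap_index

    if gap_char is not None and gap_char in kmer:
        return gap_index
    if any(c not in monomers for c in kmer):
        return missing_index

    index = 0
    for c in kmer:
        # accumulate base-`num_states` digits, first monomer most significant
        index = index * num_states + monomers.index(c)
    return index


monomers = "TCAG"  # illustrative monomer ordering (an assumption)
canonical = encode_kmer("AC", monomers, gap_char="-")
gapped = encode_kmer("A-", monomers, gap_char="-")
ambiguous = encode_kmer("AN", monomers, gap_char="-")
no_gap_defined = encode_kmer("AN", monomers)
print(canonical, gapped, ambiguous, no_gap_defined)  # 9 16 17 16
```

With 4 states and k=2, canonical k-mers occupy 0 to 15, so 16 and 17 are the first indices free for the gap and non-canonical cases.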

to_indices(seq, independent_kmer: bool = True) -> ndarray
to_indices(seq: str, independent_kmer: bool = True) -> ndarray
to_indices(seq: ndarray, independent_kmer: bool = True) -> ndarray

returns a sequence of k-mer indices

Parameters:
seq

a sequence of monomers

independent_kmer

if True, returns non-overlapping k-mers

Notes

If self.gap_char is not None, then the following rules apply: If a sequence k-mer contains a gap character it is assigned an index of (num. monomer states**k). If a k-mer contains a non-canonical and non-gap character, it is assigned an index of (num. monomer states**k) + 1. If self.gap_char is None, then both of the above cases are defined as (num. monomer states**k).
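The effect of `independent_kmer` can be sketched as a choice of step size when sliding a window of length k along the sequence. The snippet below is a pure-Python illustration under assumptions, not the library's implementation: `kmer_indices` is a hypothetical helper, `"TCAG"` is an illustrative monomer ordering, only canonical characters are handled, and `independent_kmer=False` is assumed to mean overlapping k-mers advancing one position at a time.

```python
from __future__ import annotations


def kmer_indices(seq: str, monomers: str, k: int, independent_kmer: bool = True) -> list[int]:
    """Hypothetical helper: indices of successive k-mers in `seq`.

    Handles canonical characters only; gap and missing states are omitted.
    """
    num_states = len(monomers)
    # non-overlapping k-mers advance by k, overlapping k-mers advance by 1
    step = k if independent_kmer else 1
    result = []
    for start in range(0, len(seq) - k + 1, step):
        index = 0
        for c in seq[start : start + k]:
            # first monomer of the k-mer is the most significant digit
            index = index * num_states + monomers.index(c)
        result.append(index)
    return result


monomers = "TCAG"  # illustrative monomer ordering (an assumption)
independent = kmer_indices("ACGTAC", monomers, k=2)
overlapping = kmer_indices("ACGTAC", monomers, k=2, independent_kmer=False)
print(independent)  # [9, 12, 9]
print(overlapping)  # [9, 7, 12, 2, 9]
```

In this sketch a sequence of length L yields L // k indices in the independent case and L - k + 1 indices in the overlapping case.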

to_json() -> str

returns a serialisable string

to_rich_dict() -> dict

returns a serialisable dictionary

with_gap_motif(include_missing: bool = False)

returns a new KmerAlphabet with the gap motif added

Notes

Adds gap state to monomers and recreates k-mer alphabet for self