KmerAlphabet#

class KmerAlphabet(words: tuple[StrOrBytes, ...], monomers: CharAlphabet[StrOrBytes], k: int, gap: StrOrBytes | None = None, missing: StrOrBytes | None = None)#

A k-mer alphabet, representing the complete set of k-length words over a monomer alphabet

Attributes:
gap_char
gap_index
missing_char
missing_index
moltype
motif_len
num_canonical

Methods

count(value, /)

Return number of occurrences of value.

from_index(kmer_index)

decodes an integer into a k-mer

from_indices(kmer_indices[, independent_kmer])

converts array of k-mer indices into an array of monomer indices

from_rich_dict(data)

returns an instance from a serialised dictionary

index(value[, start, stop])

Return first index of value.

is_valid(seq)

seq is valid for alphabet

to_index(seq)

encodes a k-mer as a single integer

to_indices(seq[, independent_kmer])

returns a sequence of k-mer indices

to_json()

returns a serialisable string

to_rich_dict([for_pickle])

returns a serialisable dictionary

with_gap_motif([include_missing])

returns a new KmerAlphabet with the gap motif added

Notes

Differs from SenseCodonAlphabet by representing all possible k-length words (permutations with repetition) of the provided monomer alphabet. This allows a more efficient mapping between integers and k-length strings.
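The integer mapping mentioned in the note can be illustrated with a minimal pure-Python sketch. It does not use cogent3. The helper names `all_kmers` and `kmer_to_int`, the monomer order "TCAG", and the convention that the first monomer is the most significant digit are all assumptions made for illustration.

```python
from itertools import product


def all_kmers(monomers: str, k: int) -> list[str]:
    """Every k-length word over monomers, in itertools.product order."""
    return ["".join(chars) for chars in product(monomers, repeat=k)]


def kmer_to_int(kmer: str, monomers: str) -> int:
    """Read a k-mer as a base-len(monomers) number.

    Assumption of this sketch: the first monomer is the most
    significant digit.
    """
    num_states = len(monomers)
    value = 0
    for char in kmer:
        value = value * num_states + monomers.index(char)
    return value


monomers = "TCAG"  # assumed monomer order, for illustration only
words = all_kmers(monomers, 3)

# Because the set of words is complete, the position of each word in the
# enumeration equals its arithmetic encoding, so no lookup table is needed.
assert all(kmer_to_int(word, monomers) == i for i, word in enumerate(words))
```

Because every k-length word is present, an index can be computed directly from the monomer positions. An alphabet that omits some words (stop codons, for example) cannot rely on this arithmetic alone.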

count(value, /)#

Return number of occurrences of value.

from_index(kmer_index: int) ndarray[tuple[int, ...], dtype[integer]]#

decodes an integer into a k-mer
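As a rough illustration of what decoding involves, the sketch below recovers monomer indices from a k-mer index by repeated division. It is not the library's implementation. The function name `index_to_monomer_indices` is hypothetical, it returns a plain list rather than a numpy array, and it assumes the first monomer is the most significant digit.

```python
def index_to_monomer_indices(kmer_index: int, num_states: int, k: int) -> list[int]:
    """Decode a k-mer index into k monomer indices (hypothetical helper).

    Assumes the first monomer is the most significant base-num_states digit.
    """
    if not 0 <= kmer_index < num_states**k:
        raise ValueError(f"{kmer_index} is not a canonical k-mer index")

    digits = [0] * k
    # Fill from the last (least significant) position backwards.
    for pos in range(k - 1, -1, -1):
        kmer_index, digits[pos] = divmod(kmer_index, num_states)
    return digits


# With 4 monomer states and k=3, index 6 is 0*16 + 1*4 + 2.
print(index_to_monomer_indices(6, num_states=4, k=3))  # [0, 1, 2]
```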

from_indices(kmer_indices: ndarray[tuple[int, ...], dtype[integer]], independent_kmer: bool = True) ndarray[tuple[int, ...], dtype[integer]]#

converts array of k-mer indices into an array of monomer indices

Parameters:
kmer_indices

a sequence of k-mer indices

independent_kmer

if True, the k-mers are treated as non-overlapping; if False, they are treated as overlapping
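The difference between the two settings of independent_kmer can be sketched in pure Python. This is an assumption-laden illustration, not the cogent3 code: the helper names are hypothetical, plain lists stand in for numpy arrays, overlapping k-mers are assumed to advance by one monomer, and special states (gap or missing) are not handled.

```python
def decode_kmer(kmer_index: int, num_states: int, k: int) -> list[int]:
    """Decode one k-mer index into k monomer indices (first monomer most significant)."""
    digits = [0] * k
    for pos in range(k - 1, -1, -1):
        kmer_index, digits[pos] = divmod(kmer_index, num_states)
    return digits


def kmer_indices_to_monomers(
    kmer_indices: list[int],
    num_states: int,
    k: int,
    independent_kmer: bool = True,
) -> list[int]:
    """Rebuild the monomer indices from a list of k-mer indices (hypothetical helper)."""
    if not kmer_indices:
        return []

    if independent_kmer:
        # Non-overlapping: each k-mer contributes k new monomers.
        monomers: list[int] = []
        for kmer_index in kmer_indices:
            monomers.extend(decode_kmer(kmer_index, num_states, k))
        return monomers

    # Overlapping (assumed step of 1): after the first k-mer, each k-mer
    # contributes only its last monomer.
    monomers = decode_kmer(kmer_indices[0], num_states, k)
    for kmer_index in kmer_indices[1:]:
        monomers.append(kmer_index % num_states)
    return monomers


# Both calls recover the same monomer indices, [0, 1, 2, 3, 0, 1].
print(kmer_indices_to_monomers([1, 11, 1], num_states=4, k=2))
print(kmer_indices_to_monomers([1, 6, 11, 12, 1], num_states=4, k=2, independent_kmer=False))
```

In this sketch the output has len(kmer_indices) * k elements in the non-overlapping case and len(kmer_indices) + k - 1 elements in the overlapping case.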

classmethod from_rich_dict(data: dict[str, Any]) KmerAlphabet[StrOrBytes]#

returns an instance from a serialised dictionary

property gap_char: StrOrBytes | None#
property gap_index: int | None#
index(value, start=0, stop=sys.maxsize, /)#

Return first index of value.

Raises ValueError if the value is not present.

is_valid(seq: str | bytes | ndarray[tuple[int, ...], dtype[integer]]) bool#

seq is valid for alphabet

Parameters:
seq

a numpy array of integers

Notes

This will raise a TypeError for string or bytes. Using to_indices() to convert those ensures a valid result.

property missing_char: StrOrBytes | None#
property missing_index: int | None#
property moltype: MolType[StrOrBytes] | None#
property motif_len: int#
property num_canonical: int#
to_index(seq: str | bytes | ndarray[tuple[int, ...], dtype[integer]]) int#

encodes a k-mer as a single integer

Parameters:
seq

sequence to be encoded, can be a string, bytes or a numpy array

overlapping

if False, performs operation on sequential k-mers, e.g. codons

Notes

If self.gap_char is defined, then the following rules apply: returns num_states**k if a k-mer contains a gap character, otherwise returns num_states**k + 1 if a k-mer contains a non-canonical character. If self.gap_char is not defined, returns num_states**k for both cases.

to_indices(seq: str | bytes | tuple[str | bytes, ...] | list[str | bytes] | ndarray[tuple[int, ...], dtype[integer]], independent_kmer: bool = True) ndarray[tuple[int, ...], dtype[integer]]#

returns a sequence of k-mer indices

Parameters:
seq

a sequence of monomers

independent_kmer

if True, returns non-overlapping k-mers

Notes

If self.gap_char is not None, then the following rules apply: If a sequence k-mer contains a gap character it is assigned an index of (num. monomer states**k). If a k-mer contains a non-canonical and non-gap character, it is assigned an index of (num. monomer states**k) + 1. If self.gap_char is None, then both of the above cases are defined as (num. monomer states**k).
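How independent_kmer changes the result can be shown with a short pure-Python sketch. It is not the library's implementation. The helper names are hypothetical, overlapping k-mers are assumed to advance one monomer at a time, an incomplete trailing k-mer is assumed to be dropped, and only canonical characters are handled.

```python
def split_kmers(seq: str, k: int, independent_kmer: bool = True) -> list[str]:
    """Split seq into k-mers (hypothetical helper).

    Assumptions: non-overlapping k-mers step by k, overlapping k-mers
    step by 1, and an incomplete trailing k-mer is dropped.
    """
    step = k if independent_kmer else 1
    return [seq[i : i + k] for i in range(0, len(seq) - k + 1, step)]


def to_kmer_indices(
    seq: str, monomers: str, k: int, independent_kmer: bool = True
) -> list[int]:
    """Encode each k-mer of seq as a base-len(monomers) integer."""
    num_states = len(monomers)
    indices = []
    for kmer in split_kmers(seq, k, independent_kmer):
        value = 0
        for char in kmer:  # canonical characters only in this sketch
            value = value * num_states + monomers.index(char)
        indices.append(value)
    return indices


print(to_kmer_indices("TCAGTC", "TCAG", k=2))                          # [1, 11, 1]
print(to_kmer_indices("TCAGTC", "TCAG", k=2, independent_kmer=False))  # [1, 6, 11, 12, 1]
```

The non-overlapping setting suits codon-like data, where each monomer belongs to exactly one k-mer. The overlapping setting yields len(seq) - k + 1 indices in this sketch.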

to_json() str#

returns a serialisable string

to_rich_dict(for_pickle: bool = False) dict[str, Any]#

returns a serialisable dictionary

with_gap_motif(include_missing: bool = False, **kwargs: Any) KmerAlphabet[StrOrBytes] | CharAlphabet[StrOrBytes]#

returns a new KmerAlphabet with the gap motif added

Notes

Adds gap state to monomers and recreates k-mer alphabet for self

kwargs is for compatibility with the CharAlphabet method