CharAlphabet

class CharAlphabet(chars: Sequence[str | bytes], gap: str | None = None, missing: str | None = None)

Represents fundamental monomer character sets.

Attributes:
gap_char
gap_index
missing_char
missing_index
moltype
motif_len
num_canonical

Methods

array_to_bytes(seq)

returns seq as a byte string

as_bytes()

returns self as a byte string

convert_seq_array_to(*, alphabet, seq[, ...])

converts a numpy array with indices from self to other

count(value, /)

Return number of occurrences of value.

from_indices(seq)

returns a string from a sequence of indices

from_rich_dict(data)

returns an instance from a serialised dictionary

get_kmer_alphabet(k[, include_gap])

returns kmer alphabet with words of size k

get_motif_len()

the size of each member of the alphabet

get_subset(motif_subset[, excluded])

Returns a new Alphabet object containing a subset of motifs in self.

index(value[, start, stop])

Return first index of value.

is_valid(seq)

returns True if seq is valid for the alphabet

to_indices(seq)

returns a sequence of indices for the characters in seq

to_json()

returns a serialisable string

to_rich_dict([for_pickle])

returns a serialisable dictionary

with_gap_motif([gap_char, missing_char, ...])

returns new monomer alphabet with gap and missing characters added

get_word_alphabet

Notes

Provides methods for efficient conversion between characters and integers from fundamental types of strings, bytes and numpy arrays.
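
The character-to-integer conversion described above can be illustrated without the library. The following is a minimal sketch, not cogent3's implementation: it uses only the Python standard library, and the helpers `make_tables`, `is_valid`, `to_indices` and `from_indices` are hypothetical stand-ins that merely echo the method names on this page. It assumes single-byte ASCII characters and returns indices as `bytes` rather than as a numpy array.

```python
def make_tables(chars: str) -> tuple[bytes, bytes]:
    """Build 256-byte lookup tables for char -> index and index -> char."""
    raw = chars.encode("ascii")
    indices = bytes(range(len(raw)))
    # bytes.maketrans maps each byte of its first argument to the
    # byte at the same position in its second argument
    return bytes.maketrans(raw, indices), bytes.maketrans(indices, raw)


def is_valid(chars: str, seq: str) -> bool:
    """True if every character of seq is a member of chars."""
    return set(seq) <= set(chars)


def to_indices(chars: str, seq: str) -> bytes:
    """Return the index of each character of seq within chars."""
    if not is_valid(chars, seq):
        raise ValueError(f"{seq!r} has characters not in {chars!r}")
    encode, _ = make_tables(chars)
    return seq.encode("ascii").translate(encode)


def from_indices(chars: str, indices: bytes) -> str:
    """Return the string whose characters sit at the given indices."""
    if any(i >= len(chars) for i in indices):
        raise ValueError("index out of range for alphabet")
    _, decode = make_tables(chars)
    return indices.translate(decode).decode("ascii")


dna = "TCAG"  # an arbitrary character order chosen for illustration
encoded = to_indices(dna, "GATTACA")
print(list(encoded))               # [3, 2, 0, 0, 2, 1, 2]
print(from_indices(dna, encoded))  # GATTACA
```

A table lookup of this kind converts a whole sequence in one call, which is the sort of efficiency the note above refers to. The class itself may achieve it differently.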

array_to_bytes(seq: ndarray[tuple[int, ...], dtype[integer]]) -> bytes

returns seq as a byte string

as_bytes() -> bytes

returns self as a byte string

convert_seq_array_to(*, alphabet: Self, seq: ndarray[tuple[int, ...], dtype[integer]], check_valid: bool = True) -> ndarray[tuple[int, ...], dtype[integer]]

converts a numpy array with indices from self to other

Parameters:
alphabet

alphabet to convert to

seq

ndarray of uint8 integers

check_valid

validates that the input and output sequences are valid for self and other respectively. Validation failure raises an AlphabetError.

Returns:
the indices of characters in common between self and other are swapped
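
The index conversion between two alphabets can be pictured as a lookup table from positions in one character order to positions in another. The sketch below is illustrative only and does not call the library: `convert_indices` is a hypothetical helper, alphabets are plain strings, and indices are held in `bytes` instead of a numpy array. It also assumes every character of the source alphabet exists in the target, and raises `ValueError` otherwise, which may differ from how the class treats unshared characters.

```python
def convert_indices(src: str, dst: str, indices: bytes) -> bytes:
    """Re-express indices into src as indices into dst."""
    absent = [c for c in src if c not in dst]
    if absent:
        # assumption: this sketch refuses unshared characters outright
        raise ValueError(f"characters {absent} are absent from {dst!r}")
    if any(i >= len(src) for i in indices):
        raise ValueError("index out of range for the source alphabet")
    # position i in src maps to the position of the same character in dst
    table = bytes.maketrans(
        bytes(range(len(src))),
        bytes(dst.index(c) for c in src),
    )
    return indices.translate(table)


src, dst = "TCAG", "ACGT"
seq = bytes([3, 2, 0, 0])  # spells "GATT" under src
converted = convert_indices(src, dst, seq)
print(list(converted))                     # [2, 0, 3, 3]
print("".join(dst[i] for i in converted))  # GATT
```
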

count(value, /)

Return number of occurrences of value.

from_indices(seq: str | bytes | ndarray[tuple[int, ...], dtype[integer]]) -> str
from_indices(seq: str) -> str
from_indices(seq: bytes) -> str
from_indices(seq: ndarray) -> str

returns a string from a sequence of indices

classmethod from_rich_dict(data: dict) -> Self

returns an instance from a serialised dictionary

property gap_char: str | None
property gap_index: int | None

get_kmer_alphabet(k: int, include_gap: bool = True) -> KmerAlphabet

returns kmer alphabet with words of size k

Parameters:
k

word size

include_gap

if True and self.gap_char is defined, KmerAlphabet.gap_char is set to self.gap_char * k

Notes

If self.missing_char is present, it is included in the new alphabet as missing_char * k
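
To make the word construction concrete, here is a small standard-library sketch of how words of size k and the repeated gap and missing motifs could be enumerated. It is not the library's code: `kmer_words` is a hypothetical helper, `chars` is assumed to hold only the canonical characters, and the ordering of words is simply whatever `itertools.product` yields.

```python
from itertools import product


def kmer_words(chars, k, gap_char=None, missing_char=None):
    """List every word of length k, then any gap and missing words."""
    words = ["".join(p) for p in product(chars, repeat=k)]
    if gap_char is not None:
        words.append(gap_char * k)  # e.g. "-" becomes "--" for k=2
    if missing_char is not None:
        words.append(missing_char * k)  # e.g. "?" becomes "??" for k=2
    return words


words = kmer_words("TCAG", 2, gap_char="-")
print(len(words))   # 17: 4**2 dinucleotides plus the gap word
print(words[:4])    # ['TT', 'TC', 'TA', 'TG']
print(words[-1])    # --
```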

get_motif_len() -> int

the size of each member of the alphabet

get_subset(motif_subset: Sequence[str | bytes], excluded: bool = False) -> Self

Returns a new Alphabet object containing a subset of motifs in self.

Raises an exception if any of the items in the subset are not already in self.
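
The subset behaviour can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the free function `get_subset` here is a hypothetical stand-in operating on tuples, it raises `ValueError` where the class raises its own exception, and it assumes that `excluded=True` inverts the selection, which this page does not spell out.

```python
def get_subset(chars, motif_subset, excluded=False):
    """Return members of chars selected (or rejected) by motif_subset."""
    unknown = [m for m in motif_subset if m not in chars]
    if unknown:
        raise ValueError(f"{unknown} not in alphabet {chars}")
    # keep the alphabet's own order; excluded=True is assumed to
    # return the members that are NOT listed in motif_subset
    return tuple(c for c in chars if (c in motif_subset) != excluded)


dna = ("T", "C", "A", "G")
print(get_subset(dna, ["A", "G"]))                 # ('A', 'G')
print(get_subset(dna, ["A", "G"], excluded=True))  # ('T', 'C')
```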

get_word_alphabet(k: int, include_gap: bool = True) -> KmerAlphabet

index(value, start=0, stop=sys.maxsize, /)

Return first index of value.

Raises ValueError if the value is not present.

is_valid(seq: str | bytes | ndarray[tuple[int, ...], dtype[integer]]) -> bool
is_valid(seq: str) -> bool
is_valid(seq: bytes) -> bool
is_valid(seq: ndarray) -> bool

returns True if seq is valid for the alphabet

property missing_char: str | None
property missing_index: int | None
property moltype: MolType | None
property motif_len: int
property num_canonical: int

to_indices(seq: str | bytes | ndarray[tuple[int, ...], dtype[integer]] | tuple) -> ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: tuple) -> ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: bytes) -> ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: str) -> ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: ndarray) -> ndarray[tuple[int, ...], dtype[integer]]

returns a sequence of indices for the characters in seq

to_json() -> str

returns a serialisable string

to_rich_dict(for_pickle: bool = False) -> dict[str, Any]

returns a serialisable dictionary

with_gap_motif(gap_char: str = '-', missing_char: str = '?', include_missing: bool = False, gap_as_state: bool = False) -> Self

returns new monomer alphabet with gap and missing characters added

Parameters:
gap_char

the IUPAC gap character “-”

missing_char

the IUPAC missing character “?”

include_missing

if True and self.missing_char is defined, it is included in the new alphabet

gap_as_state

include the gap character as a state in the alphabet; this drops the gap_char attribute in the resulting alphabet