CharAlphabet

class CharAlphabet(chars: Sequence[StrORBytes] | str | bytes, gap: str | None = None, missing: str | None = None)

Represents fundamental monomer character sets.

Attributes:

gap_char
gap_index
missing_char
missing_index
moltype
motif_len
num_canonical

Methods

`array_to_bytes`(seq)	returns seq as a byte string
`as_bytes`()	returns self as a byte string
`convert_seq_array_to`(*, alphabet, seq[, ...])	converts a numpy array with indices from self to other
`count`(value, /)	Return number of occurrences of value.
`from_indices`(seq)	returns a string from a sequence of indices
`from_rich_dict`(data)	returns an instance from a serialised dictionary
`get_kmer_alphabet`(k[, include_gap])	returns kmer alphabet with words of size k
`get_motif_len`()	the size of each member of the alphabet
`get_subset`(motif_subset[, excluded])	Returns a new Alphabet object containing a subset of motifs in self.
`index`(value[, start, stop])	Return first index of value.
`is_valid`(seq)	seq is valid for alphabet
`to_indices`(seq)	returns a sequence of indices for the characters in seq
`to_json`()	returns a serialisable string
`to_rich_dict`([for_pickle])	returns a serialisable dictionary
`with_gap_motif`([gap_char, missing_char, ...])	returns new monomer alphabet with gap and missing characters added

get_word_alphabet

Notes

Provides methods for efficient conversion between characters and integers from fundamental types of strings, bytes and numpy arrays.
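The conversion described above can be illustrated with a minimal stand-alone sketch. It uses only the standard library and does not reproduce the class itself: the names `CHARS`, `TO_INDEX`, `chars_to_indices` and `indices_to_chars` are hypothetical, and the character order "TCAG" is an assumption made for illustration.

```python
# Stand-alone sketch of character <-> index conversion; not cogent3 code.
CHARS = "TCAG"  # assumed character order, chosen for illustration

# 256-entry lookup table: byte value of a character -> its alphabet index.
# Bytes that are not in the alphabet map to the sentinel 255.
_table = bytearray([255] * 256)
for _index, _byte in enumerate(CHARS.encode("ascii")):
    _table[_byte] = _index
TO_INDEX = bytes(_table)


def chars_to_indices(seq: str) -> list[int]:
    """Translate each character into its integer index in one pass."""
    return list(seq.encode("ascii").translate(TO_INDEX))


def indices_to_chars(indices: list[int]) -> str:
    """Translate integer indices back into characters."""
    return "".join(CHARS[i] for i in indices)


indices = chars_to_indices("GATTACA")
print(indices)                    # [3, 2, 0, 0, 2, 1, 2]
print(indices_to_chars(indices))  # GATTACA
```

A fixed-size translation table keeps the per-character cost constant, which is one plausible reason for working with bytes rather than Python strings.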

array_to_bytes(seq: ndarray[tuple[int, ...], dtype[integer]]) → bytes: returns seq as a byte string

as_bytes() → bytes: returns self as a byte string

convert_seq_array_to(*, alphabet: Self, seq: ndarray[tuple[int, ...], dtype[integer]], check_valid: bool = True) → ndarray[tuple[int, ...], dtype[integer]]

converts a numpy array with indices from self to other

Parameters:

alphabet: alphabet to convert to
seq: ndarray of uint8 integers
check_valid: if True, validates that the input and output sequences are valid for self and other respectively. Validation failure raises an AlphabetError.

Returns:

the indices of characters in common between self and other are swapped
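As a rough illustration of remapping indices between two alphabets, the sketch below works on plain Python lists and strings instead of numpy arrays. The function `convert_indices` is hypothetical, and `ValueError` stands in for the library's AlphabetError; this is a sketch of the idea, not the library's implementation.

```python
def convert_indices(seq: list[int], source: str, target: str) -> list[int]:
    """Re-express indices into `source` as indices into `target`."""
    # For every character shared by both alphabets, record where it sits
    # in the target alphabet, keyed by its position in the source alphabet.
    lookup = {i: target.index(c) for i, c in enumerate(source) if c in target}
    try:
        return [lookup[i] for i in seq]
    except KeyError as err:
        # ValueError is a stand-in for AlphabetError in this sketch
        raise ValueError(f"index {err.args[0]} has no counterpart in target") from None


# "TCAG" expressed as indices into the source alphabet "TCAG" ...
converted = convert_indices([0, 1, 2, 3], source="TCAG", target="ACGT")
print(converted)  # [3, 1, 0, 2], the same characters as indices into "ACGT"
```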

count(value, /): Return number of occurrences of value.

from_indices(seq: StrORBytesORArray) → str
from_indices(seq: str) → str
from_indices(seq: bytes) → str
from_indices(seq: ndarray) → str: returns a string from a sequence of indices

classmethod from_rich_dict(data: dict) → Self: returns an instance from a serialised dictionary

property gap_char: str | None

property gap_index: int | None

get_kmer_alphabet(k: int, include_gap: bool = True) → KmerAlphabet

returns kmer alphabet with words of size k

Parameters:

k: word size
include_gap: if True and self.gap_char is defined, KmerAlphabet.gap_char is set to self.gap_char * k

Notes

If self.missing_char is present, it is included in the new alphabet as missing_char * k
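The construction of k-mer words, including the `gap_char * k` and `missing_char * k` rules, can be sketched with the standard library alone. The helper `kmer_words` is hypothetical, and the ordering of words (itertools.product order over the canonical characters) is an assumption of this sketch, not a documented guarantee.

```python
from itertools import product


def kmer_words(chars, k, gap_char=None, missing_char=None, include_gap=True):
    """Return (words, gap_word, missing_word) for words of size k."""
    # All length-k combinations of the canonical characters
    words = ["".join(p) for p in product(chars, repeat=k)]
    # The gap state for a k-mer is the gap character repeated k times
    gap_word = gap_char * k if include_gap and gap_char else None
    # Likewise, the missing state is the missing character repeated k times
    missing_word = missing_char * k if missing_char else None
    return words, gap_word, missing_word


words, gap_word, missing_word = kmer_words("TCAG", 2, gap_char="-", missing_char="?")
print(len(words))         # 16
print(words.index("CA"))  # 6, i.e. 1 * 4 + 2 with C at index 1 and A at index 2
print(gap_word)           # --
print(missing_word)       # ??
```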

get_motif_len() → int: the size of each member of the alphabet

get_subset(motif_subset: Sequence[str] | Sequence[bytes], excluded: bool = False) → Self

Returns a new Alphabet object containing a subset of motifs in self.

Raises an exception if any of the items in the subset are not already in self.
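A minimal sketch of subsetting follows. The semantics of `excluded` are not spelled out above, so this sketch assumes that `excluded=True` returns the members of the alphabet that are not in `motif_subset`. The helper `subset_of` is hypothetical, the result keeps the order of the original alphabet by assumption, and `ValueError` stands in for whichever exception the library raises.

```python
def subset_of(chars: str, motif_subset, excluded: bool = False) -> str:
    """Return the characters of `chars` selected by `motif_subset`."""
    unknown = [m for m in motif_subset if m not in chars]
    if unknown:
        # every requested motif must already be a member of the alphabet
        raise ValueError(f"not in alphabet: {unknown}")
    if excluded:
        # assumption: keep everything except the listed motifs
        return "".join(c for c in chars if c not in motif_subset)
    return "".join(c for c in chars if c in motif_subset)


print(subset_of("TCAG", ["A", "G"]))                 # AG
print(subset_of("TCAG", ["A", "G"], excluded=True))  # TC
```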

get_word_alphabet(k: int, include_gap: bool = True) → KmerAlphabet

index(value, start=0, stop=sys.maxsize, /)

Return first index of value.

Raises ValueError if the value is not present.

is_valid(seq: StrORBytesORArray) → bool
is_valid(seq: str) → bool
is_valid(seq: bytes) → bool
is_valid(seq: ndarray) → bool: seq is valid for alphabet
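One way to read "valid for alphabet" across the accepted input types is sketched below. The helper `is_valid_seq` is hypothetical, and it assumes that text input is valid when every character is a member of the alphabet and that index input is valid when every index falls within the alphabet's range.

```python
def is_valid_seq(seq, chars: str) -> bool:
    """Check a str, bytes or sequence of integer indices against `chars`."""
    if isinstance(seq, bytes):
        seq = seq.decode("ascii")
    if isinstance(seq, str):
        # text is valid when every character belongs to the alphabet
        return all(c in chars for c in seq)
    # anything else is treated as a sequence of integer indices
    return all(0 <= i < len(chars) for i in seq)


print(is_valid_seq("GATTACA", "TCAG"))  # True
print(is_valid_seq("GATNACA", "TCAG"))  # False, N is not a member
print(is_valid_seq([0, 3, 4], "TCAG"))  # False, index 4 is out of range
```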

property missing_char: str | None

property missing_index: int | None

property moltype: MolType | None

property motif_len: int

property num_canonical: int

to_indices(seq: StrORBytesORArray | tuple) → ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: tuple) → ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: bytes) → ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: str) → ndarray[tuple[int, ...], dtype[integer]]
to_indices(seq: ndarray) → ndarray[tuple[int, ...], dtype[integer]]: returns a sequence of indices for the characters in seq

to_json() → str: returns a serialisable string

to_rich_dict(for_pickle: bool = False) → dict[str, Any]: returns a serialisable dictionary

with_gap_motif(gap_char: str = '-', missing_char: str = '?', include_missing: bool = False, gap_as_state: bool = False) → Self

returns new monomer alphabet with gap and missing characters added

Parameters:

gap_char: the IUPAC gap character “-”
missing_char: the IUPAC missing character “?”
include_missing: if True and self.missing_char is defined, it is included in the new alphabet
gap_as_state: include the gap character as a state in the alphabet, drops gap_char attribute in resulting KmerAlphabet
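The effect of adding gap and missing characters can be sketched on a plain string. The helper `add_gap_and_missing` is hypothetical. Placing the gap character after the canonical characters, followed by the missing character, is an assumption of this sketch, and `gap_as_state` is not modelled.

```python
def add_gap_and_missing(chars, gap_char="-", missing_char="?", include_missing=False):
    """Return a new character string with gap (and optionally missing) added."""
    # Start from the characters that are neither gap nor missing
    new_chars = [c for c in chars if c not in (gap_char, missing_char)]
    # Assumption: the gap character follows the canonical characters
    new_chars.append(gap_char)
    if include_missing:
        new_chars.append(missing_char)
    return "".join(new_chars)


gapped = add_gap_and_missing("TCAG")
print(gapped)             # TCAG-
print(gapped.index("-"))  # 4, the position a gap_index would report here

full = add_gap_and_missing("TCAG", include_missing=True)
print(full)               # TCAG-?
```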