CharAlphabet

class CharAlphabet(chars: Sequence[str | bytes], gap: str | None = None, missing: str | None = None)

Represents fundamental monomer character sets.

Attributes:
gap_char
gap_index
missing_char
missing_index
moltype
motif_len
num_canonical

Methods

array_to_bytes(seq)

returns seq as a byte string

as_bytes()

returns self as a byte string

convert_seq_array_to(*, alphabet, seq[, ...])

converts a numpy array with indices from self to other

count(value, /)

Return number of occurrences of value.

from_rich_dict(data)

returns an instance from a serialised dictionary

get_kmer_alphabet(k[, include_gap])

returns kmer alphabet with words of size k

get_subset(motif_subset[, excluded])

Returns a new Alphabet object containing a subset of motifs in self.

index(value[, start, stop])

Return first index of value.

to_json()

returns a serialisable string

to_rich_dict([for_pickle])

returns a serialisable dictionary

with_gap_motif([gap_char, missing_char, ...])

returns new monomer alphabet with gap and missing characters added

from_indices

get_motif_len

get_word_alphabet

is_valid

to_indices

Notes

Provides methods for efficient conversion between characters and integers from fundamental types of strings, bytes and numpy arrays.
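
The idea of converting between characters and integers can be sketched with a lookup table. This is a minimal illustration that uses only the Python standard library; it is not the implementation used by this class. The character order `TCAG`, the sentinel value 255 and the two helper functions are assumptions made for the example.

```python
# Illustrative only: a toy character/index codec, not this class's implementation.
chars = "TCAG"  # assumed character order; a character's index is its position here

# 256-entry tables for bytes.translate; bytes outside the alphabet map to 255
INVALID = 255
_table = bytearray([INVALID] * 256)
for index, byte_value in enumerate(chars.encode("utf8")):
    _table[byte_value] = index
to_index_table = bytes(_table)
from_index_table = chars.encode("utf8") + b"?" * (256 - len(chars))


def to_indices(seq: str) -> list[int]:
    """Convert a string into a list of integer indices."""
    return list(seq.encode("utf8").translate(to_index_table))


def from_indices(indices: list[int]) -> str:
    """Convert integer indices back into a string."""
    return bytes(indices).translate(from_index_table).decode("utf8")


indices = to_indices("GATTACA")
print(indices)                # [3, 2, 0, 0, 2, 1, 2]
print(from_indices(indices))  # GATTACA
```

A single table lookup per byte avoids a Python-level loop over the sequence, which is the kind of saving a lookup-based conversion is meant to provide.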

array_to_bytes(seq: ndarray) -> bytes

returns seq as a byte string

as_bytes() -> bytes

returns self as a byte string

convert_seq_array_to(*, alphabet: Self, seq: ndarray, check_valid: bool = True) -> ndarray

converts a numpy array of indices from self into the corresponding indices of alphabet (referred to below as other)

Parameters:
alphabet

alphabet to convert to

seq

ndarray of uint8 integers

check_valid

if True, checks that the input sequence is valid for self and that the output sequence is valid for other. Validation failure raises an AlphabetError.

Returns:
the indices of characters in common between self and other are swapped
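
Remapping indices between two alphabets can be sketched as follows. This is a simplified pure-Python illustration, not the method's implementation: the two character orders and the `convert_indices` helper are hypothetical, validation is always on, and `ValueError` stands in for `AlphabetError`.

```python
# Illustrative only: remap indices between two toy alphabets via shared characters.
source_chars = "TCAG"  # hypothetical alphabet in the role of self
target_chars = "ACGT"  # hypothetical alphabet in the role of the alphabet argument


def convert_indices(indices, source, target):
    """Return indices into target for the characters that indices select in source."""
    if any(not 0 <= i < len(source) for i in indices):
        raise ValueError("input contains indices that are invalid for source")
    # source index -> target index, for characters present in both alphabets
    mapping = {i: target.index(c) for i, c in enumerate(source) if c in target}
    if any(i not in mapping for i in indices):
        raise ValueError("input selects characters that are absent from target")
    return [mapping[i] for i in indices]


# "GATT" is [3, 2, 0, 0] in source_chars and [2, 0, 3, 3] in target_chars
print(convert_indices([3, 2, 0, 0], source_chars, target_chars))  # [2, 0, 3, 3]
```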

count(value, /)

Return number of occurrences of value.

from_indices(seq: str | bytes | ndarray) -> str
from_indices(seq: str) -> str
from_indices(seq: bytes) -> str
from_indices(seq: ndarray) -> str

classmethod from_rich_dict(data: dict) -> Self

returns an instance from a serialised dictionary

property gap_char: str | None

property gap_index: int | None

get_kmer_alphabet(k: int, include_gap: bool = True) -> KmerAlphabet

returns kmer alphabet with words of size k

Parameters:
k

word size

include_gap

if True and self.gap_char is defined, the gap character of the resulting KmerAlphabet is set to self.gap_char * k

Notes

If self.missing_char is present, it is included in the new alphabet as missing_char * k
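
How words of size k and the repeated gap and missing motifs relate to a monomer alphabet can be sketched with the standard library. The character order, the enumeration order of the words and the `kmer_index` helper are assumptions of this example; they may not match what `KmerAlphabet` does internally.

```python
from itertools import product

chars = "TCAG"      # assumed canonical characters
gap_char = "-"      # assumed gap character
missing_char = "?"  # assumed missing character
k = 2

# every word of length k over the canonical characters
kmers = ["".join(word) for word in product(chars, repeat=k)]

# gap and missing motifs are the single character repeated k times
gap_kmer = gap_char * k
missing_kmer = missing_char * k


def kmer_index(kmer: str) -> int:
    """Index of a k-mer, reading it as a base-len(chars) number."""
    index = 0
    for char in kmer:
        index = index * len(chars) + chars.index(char)
    return index


print(len(kmers))              # 16
print(kmers[:4])               # ['TT', 'TC', 'TA', 'TG']
print(gap_kmer, missing_kmer)  # -- ??
print(kmer_index("CA"))        # 6
```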

get_motif_len() -> int

get_subset(motif_subset: Sequence[str | bytes], excluded: bool = False) -> Self

Returns a new Alphabet object containing a subset of motifs in self.

Raises an exception if any of the items in the subset are not already in self.
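
The subset behaviour can be sketched on a plain tuple. This is a hypothetical stand-in, not the method itself. It assumes that `excluded=True` inverts the selection and that the order of the original alphabet is preserved; neither is stated above. It also uses `ValueError` because the exception type is not specified.

```python
def get_subset(chars, motif_subset, excluded=False):
    """Return the members of chars that are in (or, if excluded, not in) motif_subset."""
    unknown = [m for m in motif_subset if m not in chars]
    if unknown:
        raise ValueError(f"not present in the alphabet: {unknown}")
    if excluded:
        return tuple(c for c in chars if c not in motif_subset)
    return tuple(c for c in chars if c in motif_subset)


dna = ("T", "C", "A", "G")
print(get_subset(dna, ["A", "G"]))                 # ('A', 'G')
print(get_subset(dna, ["A", "G"], excluded=True))  # ('T', 'C')
```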

get_word_alphabet(k: int, include_gap: bool = True) -> KmerAlphabet

index(value, start=0, stop=sys.maxsize, /)

Return first index of value.

Raises ValueError if the value is not present.
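
The signatures and descriptions of `count` and `index` match those of Python's built-in `tuple`. Assuming they behave the same way here, a plain tuple of characters shows what to expect; the character order is an assumption of the example.

```python
chars = ("T", "C", "A", "G")  # plain tuple standing in for an alphabet

print(chars.index("A"))  # 2
print(chars.count("A"))  # 1

# a value that is absent raises ValueError rather than returning -1
try:
    chars.index("N")
except ValueError:
    found = False
else:
    found = True
print(found)  # False
```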

is_valid(seq: str | bytes | ndarray) -> bool
is_valid(seq: str) -> bool
is_valid(seq: bytes) -> bool
is_valid(seq: ndarray) -> bool

property missing_char: str | None

property missing_index: int | None

property moltype: MolType | None

property motif_len: int

property num_canonical: int

to_indices(seq: str | bytes | ndarray | tuple) -> ndarray[int]
to_indices(seq: tuple) -> ndarray[int]
to_indices(seq: bytes) -> ndarray[int]
to_indices(seq: str) -> ndarray[int]
to_indices(seq: ndarray) -> ndarray[int]

to_json() -> str

returns a serialisable string

to_rich_dict(for_pickle: bool = False) -> dict[str, Any]

returns a serialisable dictionary
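
The relationship between `to_rich_dict`, `to_json` and `from_rich_dict` can be sketched as a round trip. `ToyAlphabet` and its dictionary keys are hypothetical; the actual keys produced by this class are not documented here and may differ.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyAlphabet:
    """Hypothetical stand-in; field and key names are assumptions of this sketch."""

    chars: tuple
    gap: Optional[str] = None
    missing: Optional[str] = None

    def to_rich_dict(self) -> dict:
        # only JSON-friendly types, so the result can be passed to json.dumps
        return {"chars": list(self.chars), "gap": self.gap, "missing": self.missing}

    def to_json(self) -> str:
        return json.dumps(self.to_rich_dict())

    @classmethod
    def from_rich_dict(cls, data: dict) -> "ToyAlphabet":
        return cls(chars=tuple(data["chars"]), gap=data["gap"], missing=data["missing"])


original = ToyAlphabet(chars=("T", "C", "A", "G", "-"), gap="-")
restored = ToyAlphabet.from_rich_dict(json.loads(original.to_json()))
print(restored == original)  # True
```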

with_gap_motif(gap_char: str = '-', missing_char: str = '?', include_missing: bool = False, gap_as_state: bool = False) -> Self

returns new monomer alphabet with gap and missing characters added

Parameters:
gap_char

the IUPAC gap character “-”

missing_char

the IUPAC missing character “?”

include_missing

if True and self.missing_char is defined, the missing character is included in the new alphabet

gap_as_state

include the gap character as a state in the alphabet; the gap_char attribute is dropped in the resulting alphabet
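
Adding gap and missing characters to a monomer alphabet can be sketched on a plain tuple. This simplified helper is hypothetical. It assumes the extra characters are appended after the existing ones, ignores whether the original alphabet already defines a missing character, and omits `gap_as_state`.

```python
def with_gap_motif(chars, gap_char="-", missing_char="?", include_missing=False):
    """Return chars with the gap (and optionally missing) character appended."""
    # drop any existing gap/missing characters so they are not duplicated
    canonical = tuple(c for c in chars if c not in (gap_char, missing_char))
    extra = (gap_char, missing_char) if include_missing else (gap_char,)
    return canonical + extra


print(with_gap_motif("TCAG"))                        # ('T', 'C', 'A', 'G', '-')
print(with_gap_motif("TCAG", include_missing=True))  # ('T', 'C', 'A', 'G', '-', '?')
```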