MolType

class MolType(name: str, monomers: dataclasses.InitVar[typing.Union[str, bytes]], make_seq: dataclasses.InitVar[type] = <function MolType.make_seq>, gap: str | None = '-', missing: str | None = '?', complements: dataclasses.InitVar[dict[str, frozenset[str]] | None] = None, ambiguities: dict[str, frozenset[str]] | None = None, colors: dataclasses.InitVar[dict[str, str] | None] = None, pairing_rules: dict[str, dict[frozenset[str], bool]] | None = None, mw_calculator: cogent3.data.molecular_weight.WeightCalculator | None = None, coerce_to: typing.Callable[[bytes], bytes] | None = None)

MolType handles operations that depend on the sequence type.

Attributes:
alphabet

monomers

ambiguities
coerce_to
colors
complements
degen_alphabet

monomers + ambiguous characters

degen_gapped_alphabet

monomers + gap + ambiguous characters

gapped_alphabet

monomers + gap

gaps
is_nucleic

is a nucleic acid moltype

label

synonym for name

matching_rules
mw_calculator
pairing_rules

Methods

can_match(first, second)

Returns True if every position in first could match the same position in second.

can_mispair(first, second)

Returns True if any position in first could mispair with second.

complement(seq[, validate])

converts a string or bytes into its nucleic acid complement

count_degenerate(seq)

returns the number of degenerate characters in a sequence

count_gaps(seq)

returns the number of gap characters in a sequence

count_variants(seq)

Counts number of possible sequences matching the sequence, given any ambiguous characters in the sequence.

degap(seq[, validate])

removes all gap and missing characters from a sequence

degenerate_from_seq(seq)

Returns least degenerate symbol that encompasses a set of characters

disambiguate(seq[, method])

Returns a non-degenerate sequence from a degenerate one.

get_css_style([colors, font_size, font_family])

returns string of CSS classes and {character: <CSS class name>, ...}

is_ambiguity(query_motif[, validate])

Returns True if query_motif is an ambiguity character in the alphabet.

is_compatible_alphabet(alphabet[, strict])

checks that characters in alphabet are equal to a bound alphabet

is_degenerate(seq[, validate])

checks if a sequence contains degenerate characters

is_gapped(seq[, validate])

checks if a sequence contains gaps

is_valid(seq)

checks against most degenerate alphabet

iter_alphabets()

yield alphabets in order of most to least degenerate

make_seq(*, seq[, name, check_seq])

creates a Sequence object corresponding to the molecular type of this instance.

most_degen_alphabet()

returns the most degenerate alphabet for this instance

mw(seq[, method, delta])

Returns the molecular weight of the sequence.

random_disambiguate(seq)

disambiguates a sequence by randomly selecting a non-degenerate character

rc(seq[, validate])

returns the reverse complement of a sequence

resolve_ambiguity(ambig_motif[, alphabet, ...])

Returns tuple of all possible canonical characters corresponding to ambig_motif

strand_symmetric_motifs([motif_length])

returns ordered pairs of strand complementary motifs

strip_bad(seq)

Removes any symbols not in the alphabet.

strip_bad_and_gaps(seq)

Removes any symbols not in the alphabet, and any gaps.

strip_degenerate(seq)

removes degenerate characters

to_json()

returns the result as a JSON formatted string

to_regex(seq)

returns a regex pattern with ambiguities expanded to a character set

can_pair

get_degenerate_positions

has_ambiguity

make_array_seq

to_rich_dict

Notes

The only way to create sequences is via a MolType instance. The instance defines different alphabets that are used for data conversions. Create a moltype using the get_moltype() function.

property alphabet

monomers

ambiguities: dict[str, frozenset[str]] | None = None
can_match(first: str, second: str) -> bool

Returns True if every position in first could match the same position in second.

Notes

Truncates at the length of the shorter sequence. Gaps are only allowed to match other gaps.
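The notes above can be illustrated with a minimal standalone sketch. It does not use cogent3: the AMBIGUITIES table is a small hypothetical subset of the IUPAC codes, and can_match_sketch and possible_states are illustrative names, not library functions.

```python
# Hypothetical subset of IUPAC ambiguity codes; a real DNA moltype defines more.
AMBIGUITIES = {
    "R": frozenset("AG"),
    "Y": frozenset("CT"),
    "N": frozenset("ACGT"),
}
GAP = "-"


def possible_states(char: str) -> frozenset:
    """Set of canonical characters that a symbol can stand for."""
    return AMBIGUITIES.get(char, frozenset(char))


def can_match_sketch(first: str, second: str) -> bool:
    """True if every aligned position could represent the same character."""
    # zip stops at the shorter input, which gives the truncation behaviour
    for a, b in zip(first, second):
        if GAP in (a, b):
            # a gap is only allowed to match another gap
            if a != b:
                return False
            continue
        if not possible_states(a) & possible_states(b):
            return False
    return True
```

Under these assumptions, "ACRT" can match "ACGT" because R covers G, while "ACRT" cannot match "ACCT".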

can_mispair(first: str, second: str) -> bool

Returns True if any position in first could mispair with second.

Notes

Pairing occurs in reverse order, i.e. the last position of second pairs with the first position of first, and so on.

Truncates at length of shorter sequence.

Gaps are always counted as possible mispairs, as are weak pairs like GU.
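A rough sketch of the pairing order described in these notes, written without cogent3. The PAIRS table is a hypothetical RNA example (True for a strong pair, False for a weak one), can_mispair_sketch is an illustrative name, and ambiguity codes are ignored for brevity.

```python
# Hypothetical pairing table: True marks a strong pair, False a weak one.
PAIRS = {
    frozenset("AU"): True,
    frozenset("GC"): True,
    frozenset("GU"): False,  # wobble pair, treated as a possible mispair
}


def can_mispair_sketch(first: str, second: str) -> bool:
    """True if any position of first could mispair with second."""
    # reversed(second) pairs the last position of second with the first
    # position of first; zip truncates at the shorter sequence
    for a, b in zip(first, reversed(second)):
        # anything absent from the table (gaps included) counts as a mispair,
        # as does any pair flagged as weak
        if not PAIRS.get(frozenset((a, b)), False):
            return True
    return False
```

With this table, "ACGU" paired against "ACGU" has no mispairs, since the reversed second sequence is "UGCA".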

can_pair(first: str, second: str) -> bool
coerce_to: Callable[[bytes], bytes] | None = None
colors: dataclasses.InitVar[dict[str, str] | None] = None
complement(seq: str | bytes | ndarray, validate: bool = True) -> str
complement(seq: str, validate: bool = True) -> str
complement(seq: bytes, validate: bool = True) -> str
complement(seq: ndarray, validate: bool = True) -> ndarray[int]

converts a string or bytes into its nucleic acid complement

Parameters:
seq

sequence to be complemented

validate

if True, validates the sequence against the most degenerate alphabet
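As an illustration of what complementing with validation involves, here is a standalone sketch for string input only. The complement table and the VALID set are hypothetical stand-ins for a DNA moltype's data, and complement_sketch is not the library's implementation.

```python
# Hypothetical DNA complement table, including a few IUPAC ambiguity codes.
VALID = frozenset("ACGTRYN-")
COMPLEMENT_TABLE = str.maketrans("ACGTRYN-", "TGCAYRN-")


def complement_sketch(seq: str, validate: bool = True) -> str:
    """Return the complement of seq, optionally checking its characters."""
    if validate:
        bad = set(seq) - VALID
        if bad:
            raise ValueError(f"invalid characters: {sorted(bad)}")
    # characters missing from the table pass through unchanged
    return seq.translate(COMPLEMENT_TABLE)
```

Note that the result is not reversed; reversing it as well would give the reverse complement.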

complements: dataclasses.InitVar[dict[str, frozenset[str]] | None] = None
count_degenerate(seq: str | bytes) -> int
count_degenerate(seq: bytes) -> int
count_degenerate(seq: str) -> int

returns the number of degenerate characters in a sequence

count_gaps(seq: str | bytes) -> int
count_gaps(seq: bytes) -> int
count_gaps(seq: str) -> int

returns the number of gap characters in a sequence

count_variants(seq: str | bytes) -> int
count_variants(seq: bytes) -> int
count_variants(seq: str) -> int

Counts number of possible sequences matching the sequence, given any ambiguous characters in the sequence.

Notes

Uses self.ambiguities to determine how many possibilities there are at each position in the sequence, then multiplies these counts to give the total number of possible sequences.
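The counting described in this note amounts to a product over positions. A minimal sketch, assuming a hypothetical AMBIGUITIES table shaped like the ambiguities attribute (character to frozenset of characters); count_variants_sketch is an illustrative name.

```python
from math import prod

# Hypothetical subset of IUPAC ambiguity codes.
AMBIGUITIES = {
    "R": frozenset("AG"),
    "Y": frozenset("CT"),
    "N": frozenset("ACGT"),
}


def count_variants_sketch(seq: str) -> int:
    """Number of non-degenerate sequences consistent with seq."""
    # a canonical character contributes a factor of 1 (len of a 1-char string),
    # an ambiguity code contributes the size of its character set
    return prod(len(AMBIGUITIES.get(char, char)) for char in seq)
```

For example, "RYN" gives 2 * 2 * 4 = 16 possible sequences under this table.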

degap(seq, validate: bool = True) -> str | bytes | ndarray
degap(seq: bytes, validate: bool = True) -> bytes
degap(seq: str, validate: bool = True) -> str
degap(seq: ndarray, validate: bool = True) -> ndarray

removes all gap and missing characters from a sequence
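A small sketch of the same idea for str and bytes input, using the default gap ('-') and missing ('?') characters from the class signature. degap_sketch is an illustrative standalone function; array input and validation are left out.

```python
def degap_sketch(seq, gap: str = "-", missing: str = "?"):
    """Remove gap and missing characters, preserving the input type."""
    chars = gap + missing
    if isinstance(seq, bytes):
        # bytes.translate with table=None only deletes the listed bytes
        return seq.translate(None, delete=chars.encode("utf8"))
    # str.maketrans with a third argument maps those characters to None
    return seq.translate(str.maketrans("", "", chars))
```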

property degen_alphabet

monomers + ambiguous characters

property degen_gapped_alphabet

monomers + gap + ambiguous characters

degenerate_from_seq(seq: str) -> str

Returns least degenerate symbol that encompasses a set of characters
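One way to read "least degenerate" is the ambiguity code with the smallest character set that still covers every character observed. The sketch below follows that reading with the standard IUPAC DNA codes; it is an assumption about the behaviour, not the library's code, and degenerate_from_chars is a hypothetical name.

```python
# Standard IUPAC DNA ambiguity codes.
AMBIGUITIES = {
    "R": frozenset("AG"), "Y": frozenset("CT"), "M": frozenset("AC"),
    "K": frozenset("GT"), "S": frozenset("CG"), "W": frozenset("AT"),
    "H": frozenset("ACT"), "B": frozenset("CGT"), "V": frozenset("ACG"),
    "D": frozenset("AGT"), "N": frozenset("ACGT"),
}


def degenerate_from_chars(seq: str) -> str:
    """Smallest symbol whose character set covers all characters in seq."""
    wanted = frozenset(seq)
    if not wanted:
        raise ValueError("empty sequence")
    if len(wanted) == 1:
        # a single distinct character needs no ambiguity code
        return next(iter(wanted))
    # this sketch assumes seq holds canonical characters only
    candidates = [
        (len(states), symbol)
        for symbol, states in AMBIGUITIES.items()
        if wanted <= states
    ]
    if not candidates:
        raise ValueError(f"no symbol covers {sorted(wanted)}")
    return min(candidates)[1]
```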

disambiguate(seq: str | bytes | ndarray, method: str = 'strip') -> str | bytes | ndarray

Returns a non-degenerate sequence from a degenerate one.

Parameters:
seq

the sequence to be disambiguated

method

how to disambiguate the sequence, one of “strip” or “random”

strip: removes degenerate characters

random: randomly selects a non-degenerate character
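The two methods can be sketched for string input as follows. This is a standalone illustration with a hypothetical AMBIGUITIES subset; the rng parameter is an addition for reproducibility and is not part of the documented signature.

```python
import random

# Hypothetical subset of IUPAC ambiguity codes.
AMBIGUITIES = {
    "R": frozenset("AG"),
    "Y": frozenset("CT"),
    "N": frozenset("ACGT"),
}


def disambiguate_sketch(seq: str, method: str = "strip", rng=None) -> str:
    """Return a non-degenerate version of seq."""
    if method == "strip":
        # drop every degenerate character, so the result may be shorter
        return "".join(c for c in seq if c not in AMBIGUITIES)
    if method == "random":
        rng = rng or random.Random()
        # replace each degenerate character by one of its possible states
        return "".join(
            rng.choice(sorted(AMBIGUITIES[c])) if c in AMBIGUITIES else c
            for c in seq
        )
    raise ValueError(f"unknown method {method!r}")
```

The practical difference is that "strip" changes the sequence length while "random" preserves it.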

gap: str | None = '-'
property gapped_alphabet

monomers + gap

property gaps: frozenset
get_css_style(colors: dict[str, str] | None = None, font_size: int = 12, font_family='Lucida Console')

returns string of CSS classes and {character: <CSS class name>, …}

Parameters:
colors

A dictionary mapping characters to CSS color values.

font_size

Font size in points.

font_family

Name of a monospace font.

get_degenerate_positions(seq: str | bytes | ndarray, include_gap: bool = True, validate: bool = True) -> list[int]
has_ambiguity(seq: str | bytes | ndarray) -> bool
has_ambiguity(seq: str) -> bool
has_ambiguity(seq: bytes) -> bool
has_ambiguity(seq: ndarray) -> bool
is_ambiguity(query_motif: str, validate: bool = True) -> bool

Returns True if query_motif is an ambiguity character in the alphabet.

Parameters:
query_motif

the motif being queried.

validate

if True, validates query_motif against the most degenerate alphabet

is_compatible_alphabet(alphabet: CharAlphabet, strict: bool = True) -> bool

checks that characters in alphabet are equal to a bound alphabet

Parameters:
alphabet

an Alphabet instance

strict

the order of elements must match

is_degenerate(seq: str | bytes | ndarray, validate: bool = True) -> bool
is_degenerate(seq: bytes, validate: bool = True) -> bool
is_degenerate(seq: str, validate: bool = True) -> bool
is_degenerate(seq: ndarray, validate: bool = True) -> bool

checks if a sequence contains degenerate characters

is_gapped(seq, validate: bool = True) -> bool
is_gapped(seq: str, validate: bool = True) -> bool
is_gapped(seq: bytes, validate: bool = True) -> bool
is_gapped(seq: ndarray, validate: bool = True) -> bool

checks if a sequence contains gaps

property is_nucleic: bool

is a nucleic acid moltype

Notes

Nucleic moltypes can be used for complementing and translating into amino acids.

is_valid(seq: str | ndarray) -> bool

checks against most degenerate alphabet

iter_alphabets()

yield alphabets in order of most to least degenerate

property label

synonym for name

make_array_seq(*args, **kwargs) -> Sequence
make_seq(*, seq: str | SeqViewABC, name: str | None = None, check_seq: bool = True, **kwargs) -> Sequence

creates a Sequence object corresponding to the molecular type of this instance.

Parameters:
seq

the raw sequence data

name

the name of the sequence

check_seq

whether to validate the sequence data against the molecular type. If True, performs validation checks; set to False if the sequence data is already known to be valid.

**kwargs

additional keyword arguments that may be required for creating the Sequence object

Notes

If seq is a string, and the moltype has a coerce_to attribute, the string will be converted via that callable into a character set compatible with the moltype. Only applies to nucleic acid moltypes.

property matching_rules: dict[frozenset[str], bool]
missing: str | None = '?'
monomers: dataclasses.InitVar[Union[str, bytes]]
most_degen_alphabet()

returns the most degenerate alphabet for this instance

mw(seq: str, method: str = 'random', delta: float | None = None) -> float

Returns the molecular weight of the sequence. If the sequence is ambiguous, uses method to disambiguate the sequence.

Parameters:
seq

the sequence whose molecular weight is to be calculated.

method

the method passed to disambiguate() to disambiguate the sequence, either “random” or “first”.

delta

if delta is present, uses it instead of the standard weight adjustment.

mw_calculator: WeightCalculator | None = None
name: str
pairing_rules: dict[str, dict[frozenset[str], bool]] | None = None
random_disambiguate(seq: str | bytes | ndarray) -> str | bytes | ndarray
random_disambiguate(seq: str) -> str
random_disambiguate(seq: bytes) -> bytes
random_disambiguate(seq: ndarray) -> ndarray

disambiguates a sequence by randomly selecting a non-degenerate character

rc(seq: str, validate: bool = True) -> str

returns the reverse complement of a sequence

resolve_ambiguity(ambig_motif: str, alphabet: CharAlphabet | None = None, allow_gap: bool = False, validate: bool = True) -> tuple[str]

Returns tuple of all possible canonical characters corresponding to ambig_motif

Parameters:
ambig_motif

the string to be expanded

alphabet

optional, disambiguated motifs not present in alphabet will be excluded. This could be a codon alphabet where stop codons are not present.

allow_gap

whether the gap character is allowed in output. Only applied when alphabet is None.

Notes

If ambig_motif is > 1 character long and alphabet is None, we construct a word alphabet with the same length.
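Expanding a multi-character motif is a Cartesian product over the states possible at each position, optionally filtered by an alphabet. The sketch below is standalone: AMBIGUITIES is a hypothetical subset, the allowed argument is a plain set standing in for a word or codon alphabet, and the output order is a choice of this sketch rather than a documented guarantee.

```python
from itertools import product

# Hypothetical subset of IUPAC ambiguity codes.
AMBIGUITIES = {
    "R": frozenset("AG"),
    "Y": frozenset("CT"),
    "N": frozenset("ACGT"),
}


def resolve_ambiguity_sketch(ambig_motif: str, allowed=None) -> tuple:
    """All canonical motifs that ambig_motif could represent."""
    # one sorted list of possible characters per position
    per_position = [sorted(AMBIGUITIES.get(c, c)) for c in ambig_motif]
    expanded = ("".join(chars) for chars in product(*per_position))
    if allowed is not None:
        # e.g. a set of sense codons, which would exclude stop codons
        expanded = (motif for motif in expanded if motif in allowed)
    return tuple(expanded)
```

For instance, "TAR" expands to TAA and TAG, and filtering "TAN" against a set that lacks those two leaves only the remaining motifs.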

strand_symmetric_motifs(motif_length: int = 1) -> set[tuple[str, str]]

returns ordered pairs of strand complementary motifs

strip_bad(seq: str | bytes) -> str | bytes
strip_bad(seq: bytes) -> bytes
strip_bad(seq: str) -> str

Removes any symbols not in the alphabet.

strip_bad_and_gaps(seq: str | bytes) -> str | bytes
strip_bad_and_gaps(seq: bytes) -> bytes
strip_bad_and_gaps(seq: str) -> str

Removes any symbols not in the alphabet, and any gaps. Since missing could be a gap, it is also removed.

strip_degenerate(seq: str | bytes | ndarray) -> str | bytes | ndarray
strip_degenerate(seq: bytes) -> bytes
strip_degenerate(seq: str) -> str
strip_degenerate(seq: ndarray) -> ndarray

removes degenerate characters

to_json()

returns the result as a JSON formatted string

to_regex(seq: str) -> str

returns a regex pattern with ambiguities expanded to a character set
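A sketch of how such a pattern could be built: each ambiguity code becomes a bracketed character set and every other character is kept literally. AMBIGUITIES is a hypothetical subset and to_regex_sketch is an illustrative name; the exact pattern text the library produces may differ.

```python
import re

# Hypothetical subset of IUPAC ambiguity codes.
AMBIGUITIES = {
    "R": frozenset("AG"),
    "Y": frozenset("CT"),
    "N": frozenset("ACGT"),
}


def to_regex_sketch(seq: str) -> str:
    """Regex pattern in which ambiguity codes become character sets."""
    parts = []
    for char in seq:
        if char in AMBIGUITIES:
            parts.append("[" + "".join(sorted(AMBIGUITIES[char])) + "]")
        else:
            # escape so characters such as '?' or '-' are matched literally
            parts.append(re.escape(char))
    return "".join(parts)
```

The resulting string can be passed to re.search or re.fullmatch to find sequences consistent with the degenerate motif.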

to_rich_dict(**kwargs)