MolType#

class MolType(motifset, gap='-', missing='?', gaps=None, seq_constructor=None, ambiguities=None, label=None, complements=None, pairs=None, mw_calculator=None, add_lower=False, preserve_existing_moltypes=False, make_alphabet_group=False, array_seq_constructor=None, colors=None, coerce_string=None)#

MolType: Handles operations that depend on the sequence type (e.g. DNA).

The MolType knows how to connect alphabets, sequences, alignments, and so forth, and how to disambiguate ambiguous symbols and perform base pairing (where appropriate).

WARNING: Objects passed to a MolType become associated with that MolType, i.e. if you pass ProteinSequence to a new MolType you make up, all ProteinSequences will now be associated with the new MolType. This may not be what you expect. Use preserve_existing_moltypes=True if you don’t want to reset the moltype.

Attributes:

is_nucleic: for forward compatibility

Methods

`can_match`(first, second)	Returns True if every pos in 1st could match same pos in 2nd.
`can_mismatch`(first, second)	Returns True if any position in 1st could cause a mismatch with 2nd.
`can_mispair`(first, second)	Returns True if any position in 1st could mispair with 2nd.
`can_pair`(first, second)	Returns True if first and second could pair.
`complement`(item)	Returns complement of item, using data from self.complements.
`count_degenerate`(sequence)	Counts the degenerate bases in the specified sequence.
`count_gaps`(sequence)	Counts the gaps in the specified sequence.
`degap`(sequence)	Deletes all gap characters from sequence.
`degenerate_from_seq`(sequence)	Returns least degenerate symbol corresponding to chars in sequence.
`disambiguate`(sequence[, method])	Returns a non-degenerate sequence from a degenerate one.
`first_degenerate`(sequence)	Returns the index of first degenerate symbol in sequence, or None.
`first_gap`(sequence)	Returns the index of the first gap in the sequence, or None.
`first_invalid`(sequence)	Returns the index of first invalid symbol in sequence, or None.
`first_non_strict`(sequence)	Returns the index of first non-strict symbol in sequence, or None.
`first_not_in_alphabet`(sequence[, alphabet])	Returns index of first item not in alphabet, or None.
`gap_indices`(sequence)	Returns list of indices of all gaps in the sequence, or [].
`gap_maps`(sequence)	Returns tuple containing dicts mapping between gapped and ungapped.
`gap_vector`(sequence)	Returns list of bool indicating gap or non-gap in sequence.
`get_css_style`([colors, font_size, font_family])	returns string of CSS classes and {character: <CSS class name>, ...}
`get_degenerate_positions`(sequence[, include_gap])	returns indices matching degenerate characters
`get_type`()	Return the moltype label
`is_ambiguity`(querymotif)	Return True if querymotif is an ambiguity character in alphabet.
`is_compatible_alphabet`(alphabet[, strict])	checks that characters in alphabet are equal to a bound alphabet
`is_degenerate`(sequence)	Returns True if sequence contains degenerate characters.
`is_gap`(char)	Returns True if char is a gap.
`is_gapped`(sequence)	Returns True if sequence contains gaps.
`is_strict`(sequence)	Returns True if sequence contains only items in self.alphabet.
`is_valid`(sequence)	Returns True if sequence contains no items that are not in self.
`make_array_seq`(seq[, name])	creates an array sequence
`make_seq`(seq[, name])	Returns sequence generated by seq_constructor argument to self.
`must_match`(first, second)	Returns True if all positions in 1st must match positions in second.
`must_pair`(first, second)	Returns True if all positions in 1st must pair with second.
`mw`(sequence[, method, delta])	Returns the molecular weight of the sequence.
`possibilities`(sequence)	Counts number of possible sequences matching the sequence.
`rc`(item)	Returns reverse complement of item w/ data from self.complements.
`resolve_ambiguity`(ambig_motif[, alphabet, ...])	Returns tuple of all possible canonical characters corresponding to ambig_motif
`strand_symmetric_motifs`([motif_length])	returns ordered pairs of strand complementary motifs
`to_json`()	returns a JSON-formatted string
`to_regex`(seq)	returns a regex pattern with ambiguities expanded to a character set
`valid_on_alphabet`(sequence[, alphabet])	Returns True if sequence contains only items in alphabet.
`verify_sequence`(seq[, gaps_allowed, ...])	Checks whether sequence is valid on the default alphabet.
`what_ambiguity`(motifs)	The code that represents all of 'motifs', and minimal others.

coerce_str
to_rich_dict

can_match(first, second) → bool#

Returns True if every pos in 1st could match same pos in 2nd.

Truncates at length of shorter sequence. Gaps are only allowed to match other gaps.
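
The rule above can be sketched in plain Python. This is only an illustration of the described behaviour, not cogent3's implementation: `IUPAC_DNA`, `GAPS` and `can_match_sketch` are hypothetical names, and only a few IUPAC codes are included.

```python
# Hypothetical subset of IUPAC DNA codes, each mapped to the bases it may represent.
IUPAC_DNA = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "N": {"A", "C", "G", "T"},
}
GAPS = {"-"}


def can_match_sketch(first, second):
    """True if every compared position could hold the same symbol."""
    for a, b in zip(first, second):  # zip stops at the shorter input
        if a in GAPS or b in GAPS:
            # a gap may only match another gap
            if not (a in GAPS and b in GAPS):
                return False
            continue
        # two symbols can match if their sets of possible bases overlap
        if not IUPAC_DNA[a] & IUPAC_DNA[b]:
            return False
    return True


print(can_match_sketch("ARN", "AGT"))  # True
print(can_match_sketch("A-", "AN"))    # False
```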

can_mismatch(first, second) → bool#

Returns True if any position in 1st could cause a mismatch with 2nd.

Truncates at length of shorter sequence. Gaps are always counted as matches.

can_mispair(first, second) → bool#

Returns True if any position in 1st could mispair with 2nd.

Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.

Truncates at length of shorter sequence. Gaps are always counted as possible mispairs, as are weak pairs like GU.

can_pair(first, second) → bool#

Returns True if first and second could pair.

Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.

Truncates at length of shorter sequence. Gaps are only allowed to pair with other gaps, and are counted as ‘weak’ (same category as GU and degenerate pairs).

NOTE: second must be able to be reversed.
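
To make the reverse-order pairing concrete, here is a minimal standalone sketch. It is an assumption-laden illustration rather than the library's code: `PAIRS`, `paired_positions` and `can_pair_sketch` are hypothetical names, and gaps and degenerate symbols are deliberately ignored.

```python
# Hypothetical RNA pairing table: Watson-Crick pairs plus the weak G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def paired_positions(first, second):
    """Return (char_from_first, char_from_second) tuples in pairing order."""
    # first is read left to right and second right to left;
    # zip stops at the shorter of the two.
    return list(zip(first, reversed(second)))


def can_pair_sketch(first, second):
    """True if every aligned pair of characters appears in PAIRS."""
    return all(pair in PAIRS for pair in paired_positions(first, second))


print(paired_positions("GAC", "GUC"))  # [('G', 'C'), ('A', 'U'), ('C', 'G')]
print(can_pair_sketch("GAC", "GUC"))   # True
```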

coerce_str(data: str) → str#

complement(item)#

Returns complement of item, using data from self.complements.

Always tries to return same type as item: if item looks like a dict, will return list of keys.

count_degenerate(sequence)#: Counts the degenerate bases in the specified sequence.

count_gaps(sequence)#: Counts the gaps in the specified sequence.

degap(sequence)#: Deletes all gap characters from sequence.

degenerate_from_seq(sequence)#

Returns least degenerate symbol corresponding to chars in sequence.

First tries to look up in self.inverse_degenerates. Then disambiguates and tries to look up in self.inverse_degenerates. Then tries converting the case (tries uppercase before lowercase). Raises TypeError if conversion fails.
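
The inverse lookup can be illustrated with a small standalone sketch. It assumes the standard IUPAC DNA ambiguity codes and does not reproduce the case-conversion fallbacks or the TypeError described above; `IUPAC_DNA`, `INVERSE` and `degenerate_from_seq_sketch` are hypothetical names.

```python
# Standard IUPAC DNA ambiguity codes and the bases each one covers.
IUPAC_DNA = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Inverse mapping: set of bases -> the single code that represents exactly that set.
INVERSE = {frozenset(bases): code for code, bases in IUPAC_DNA.items()}


def degenerate_from_seq_sketch(sequence):
    """Return the least degenerate symbol covering all chars in sequence."""
    bases = set()
    for char in sequence:
        # expand degenerate symbols, keep canonical bases as they are
        bases.update(IUPAC_DNA.get(char, char))
    if len(bases) == 1:
        return next(iter(bases))
    return INVERSE[frozenset(bases)]  # KeyError here if nothing matches


print(degenerate_from_seq_sketch("AG"))  # R
print(degenerate_from_seq_sketch("RC"))  # V
```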

disambiguate(sequence, method='strip')#

Returns a non-degenerate sequence from a degenerate one.

method can be ‘strip’ (deletes any characters not in monomers or gaps) or ‘random’ (assigns the possibilities at random, using equal frequencies).
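
The two methods can be sketched as follows. This is an illustration of the described behaviour under simplifying assumptions, not cogent3's code: the tables and `disambiguate_sketch` (including its `rng` parameter) are hypothetical.

```python
import random

# Hypothetical, partial tables for a DNA-like moltype.
DEGENERATES = {"R": "AG", "Y": "CT", "N": "ACGT"}
MONOMERS = set("ACGT")
GAPS = set("-")


def disambiguate_sketch(sequence, method="strip", rng=None):
    """Return a non-degenerate version of sequence."""
    if method == "strip":
        # drop anything that is neither a monomer nor a gap
        return "".join(c for c in sequence if c in MONOMERS or c in GAPS)
    if method == "random":
        rng = rng or random.Random()
        # replace each degenerate symbol with one of its possibilities,
        # chosen with equal probability
        return "".join(
            rng.choice(DEGENERATES[c]) if c in DEGENERATES else c
            for c in sequence
        )
    raise ValueError(f"unknown method {method!r}")


print(disambiguate_sketch("AR-NT"))  # A-T
```

With method="random" the length of the sequence is preserved, whereas method="strip" shortens it by the number of degenerate symbols removed.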

first_degenerate(sequence)#: Returns the index of first degenerate symbol in sequence, or None.

first_gap(sequence)#: Returns the index of the first gap in the sequence, or None.

first_invalid(sequence)#: Returns the index of first invalid symbol in sequence, or None.

first_non_strict(sequence)#: Returns the index of first non-strict symbol in sequence, or None.

first_not_in_alphabet(sequence, alphabet=None)#

Returns index of first item not in alphabet, or None.

Defaults to self.alphabet if alphabet not supplied.

gap_indices(sequence)#: Returns list of indices of all gaps in the sequence, or [].

gap_maps(sequence)#

Returns tuple containing dicts mapping between gapped and ungapped.

First element is a dict such that d[ungapped_coord] = gapped_coord. Second element is a dict such that d[gapped_coord] = ungapped_coord.

Note that the dicts will be invalid if the sequence changes after the dicts are made.

The gaps themselves are not in the dictionary, so use d.get() or test ‘if pos in d’ to avoid KeyErrors if looking up all elements in a gapped sequence.
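
A minimal sketch of how such a pair of dicts could be built, and why d.get() is useful, is shown here. `gap_maps_sketch` is a hypothetical helper that treats only ‘-’ as a gap; it is not the library's implementation.

```python
def gap_maps_sketch(sequence, gaps="-"):
    """Return (ungapped -> gapped, gapped -> ungapped) coordinate dicts."""
    ungapped_to_gapped = {}
    gapped_to_ungapped = {}
    ungapped_pos = 0
    for gapped_pos, char in enumerate(sequence):
        if char in gaps:
            continue  # gap positions get no entry in either dict
        ungapped_to_gapped[ungapped_pos] = gapped_pos
        gapped_to_ungapped[gapped_pos] = ungapped_pos
        ungapped_pos += 1
    return ungapped_to_gapped, gapped_to_ungapped


u2g, g2u = gap_maps_sketch("AC--GT")
print(u2g)         # {0: 0, 1: 1, 2: 4, 3: 5}
print(g2u.get(2))  # None, because position 2 is a gap
```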

gap_vector(sequence)#: Returns list of bool indicating gap or non-gap in sequence.

get_css_style(colors=None, font_size=12, font_family='Lucida Console')#

returns string of CSS classes and {character: <CSS class name>, …}

Parameters:

colors: {char
font_size: in points
font_family: name of a monospace font

get_degenerate_positions(sequence, include_gap=True)#: returns indices matching degenerate characters

get_type()#: Return the moltype label

is_ambiguity(querymotif)#

Return True if querymotif is an ambiguity character in alphabet.

Parameters:

querymotif: the motif being queried.

is_compatible_alphabet(alphabet: Alphabet, strict: bool = True) → bool#

checks that characters in alphabet are equal to a bound alphabet

Parameters:

alphabet: an Alphabet instance
strict: the order of elements must match

is_degenerate(sequence)#: Returns True if sequence contains degenerate characters.

is_gap(char)#: Returns True if char is a gap.

is_gapped(sequence)#: Returns True if sequence contains gaps.

property is_nucleic: bool#: for forward compatibility

is_strict(sequence)#: Returns True if sequence contains only items in self.alphabet.

is_valid(sequence)#: Returns True if sequence contains no items that are not in self.

make_array_seq(seq, name=None, **kwargs)#

creates an array sequence

Parameters:

seq: characters or array
name: str
kwargs: keyword arguments for the ArraySequence constructor.

Returns:

ArraySequence

make_seq(seq, name=None, **kwargs)#

Returns sequence generated by seq_constructor argument to self.

Parameters:

seq: valid input to the coerce_str function
name: value assigned to sequence.name attribute
kwargs: optional keyword arguments to bound sequence constructor class

Notes

If input seq has a moltype attribute that is self, seq is returned unmodified. If the moltype is not self, then the sequence is converted using seq.to_moltype(self).

must_match(first, second) → bool#: Returns True if all positions in 1st must match positions in second.

must_pair(first, second) → bool#

Returns True if all positions in 1st must pair with second.

Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.

mw(sequence, method='random', delta=None)#

Returns the molecular weight of the sequence.

If the sequence is ambiguous, uses method (random or strip) to disambiguate the sequence.

If delta is present, uses it instead of the standard weight adjustment.

possibilities(sequence)#

Counts number of possible sequences matching the sequence.

Uses self.degenerates to decide how many possibilities there are at each position in the sequence.
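
The count is the product of the number of options at each position. A standalone sketch, with a hypothetical partial `DEGENERATES` table standing in for self.degenerates:

```python
from math import prod

# Hypothetical, partial stand-in for self.degenerates.
DEGENERATES = {"R": "AG", "Y": "CT", "N": "ACGT"}


def possibilities_sketch(sequence):
    """Number of non-degenerate sequences consistent with sequence."""
    # a non-degenerate symbol contributes a factor of 1
    return prod(len(DEGENERATES.get(char, char)) for char in sequence)


print(possibilities_sketch("ARN"))  # 8, i.e. 1 * 2 * 4
```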

rc(item)#

Returns reverse complement of item w/ data from self.complements.

Always returns same type as input.
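
For plain strings, the operation amounts to complementing every character and reversing the result. The sketch below assumes a DNA complement table and handles only str input, so it does not show the type-preserving behaviour described above; `COMPLEMENTS` and `rc_sketch` are hypothetical names.

```python
# Hypothetical DNA complement table, built once as a translation map.
COMPLEMENTS = str.maketrans("ACGT", "TGCA")


def rc_sketch(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENTS)[::-1]


print(rc_sketch("AAGC"))  # GCTT
```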

resolve_ambiguity(ambig_motif: str, alphabet: Alphabet | None = None, allow_gap: bool = False) → tuple[str]#

Returns tuple of all possible canonical characters corresponding to ambig_motif

Parameters:

ambig_motif: the string to be expanded
alphabet: optional, disambiguated motifs not present in alphabet will be excluded. This could be a codon alphabet where stop codons are not present.
allow_gap: whether the gap character is allowed in output. Only applied when alphabet is None.

Notes

If ambig_motif is > 1 character long and alphabet is None, we construct a word alphabet with the same length.
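
The expansion of a multi-character motif can be pictured as a Cartesian product over the options at each position, optionally filtered by an alphabet. The following is a sketch under that assumption, not the library's code; `DEGENERATES`, `resolve_ambiguity_sketch` and its `allowed` parameter are hypothetical.

```python
from itertools import product

# Hypothetical, partial table of DNA ambiguity codes.
DEGENERATES = {"R": "AG", "Y": "CT", "N": "ACGT"}


def resolve_ambiguity_sketch(ambig_motif, allowed=None):
    """Return a tuple of all canonical motifs matching ambig_motif."""
    # one string of options per position; canonical characters stand for themselves
    choices = [DEGENERATES.get(char, char) for char in ambig_motif]
    resolved = tuple("".join(chars) for chars in product(*choices))
    if allowed is not None:
        # e.g. a codon alphabet from which stop codons are absent
        resolved = tuple(motif for motif in resolved if motif in allowed)
    return resolved


print(resolve_ambiguity_sketch("TAN"))
# ('TAA', 'TAC', 'TAG', 'TAT')
print(resolve_ambiguity_sketch("TAN", allowed={"TAC", "TAT"}))
# ('TAC', 'TAT')
```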

strand_symmetric_motifs(motif_length=1)#: returns ordered pairs of strand complementary motifs

to_json()#: returns a JSON-formatted string

to_regex(seq)#: returns a regex pattern with ambiguities expanded to a character set
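
As an illustration of what expanding ambiguities to character sets means, the sketch below builds such a pattern by hand. The exact pattern text produced by to_regex may differ; `DEGENERATES` and `to_regex_sketch` are hypothetical.

```python
import re

# Hypothetical, partial table of DNA ambiguity codes.
DEGENERATES = {"R": "AG", "Y": "CT", "N": "ACGT"}


def to_regex_sketch(seq):
    """Build a regex where each ambiguity code becomes a character set."""
    return "".join(
        f"[{DEGENERATES[char]}]" if char in DEGENERATES else re.escape(char)
        for char in seq
    )


pattern = to_regex_sketch("ARNT")
print(pattern)                               # A[AG][ACGT]T
print(bool(re.fullmatch(pattern, "AGCT")))   # True
print(bool(re.fullmatch(pattern, "ACCT")))   # False
```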

to_rich_dict(for_pickle=False)#

valid_on_alphabet(sequence, alphabet=None)#

Returns True if sequence contains only items in alphabet.

alphabet can actually be anything that implements __contains__. Defaults to self.alphabet if not supplied.
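
Because only membership testing is required, the check reduces to the following pattern. `valid_on_alphabet_sketch` is a hypothetical standalone helper shown for illustration; it has no default alphabet.

```python
def valid_on_alphabet_sketch(sequence, alphabet):
    """True if every item of sequence is contained in alphabet."""
    return all(item in alphabet for item in sequence)


# Any container supporting the `in` operator works: set, str, dict, ...
print(valid_on_alphabet_sketch("ACGT", {"A", "C", "G", "T"}))  # True
print(valid_on_alphabet_sketch("ACGU", "ACGT"))                # False
print(valid_on_alphabet_sketch("AC", {"A": 1, "C": 2}))        # True
```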

verify_sequence(seq, gaps_allowed=True, wildcards_allowed=True) → None#

Checks whether sequence is valid on the default alphabet.

Has special-case handling for gaps and wild-cards. This mechanism is probably useful to have in parallel with the validation routines that check specifically whether the sequence has gaps, degenerate symbols, etc., or that explicitly take an alphabet as input.

what_ambiguity(motifs)#

The code that represents all of ‘motifs’, and minimal others.

Does this duplicate DegenerateFromSequence directly?