MolType#
- class MolType(name: str, monomers: dataclasses.InitVar[typing.Union[str, bytes]], make_seq: dataclasses.InitVar[type] = <function MolType.make_seq>, gap: str | None = '-', missing: str | None = '?', complements: dataclasses.InitVar[dict[str, frozenset[str]] | None] = None, ambiguities: dict[str, frozenset[str]] | None = None, colors: dataclasses.InitVar[dict[str, str] | None] = None, pairing_rules: dict[str, dict[frozenset[str], bool]] | None = None, mw_calculator: ~cogent3.data.molecular_weight.WeightCalculator | None = None, coerce_to: ~typing.Callable[[bytes], bytes] | None = None)#
MolType handles operations that depend on the sequence type.
- Attributes:
alphabet
monomers
- ambiguities
- coerce_to
- colors
- complements
degen_alphabet
monomers + ambiguous characters
degen_gapped_alphabet
monomers + gap + ambiguous characters
gapped_alphabet
monomers + gap
- gaps
is_nucleic
is a nucleic acid moltype
label
synonym for name
- matching_rules
- mw_calculator
- pairing_rules
Methods
can_match
(first, second)Returns True if every pos in 1st could match same pos in 2nd.
can_mispair
(first, second)Returns True if any position in first could mispair with second.
complement
(-> str -> str)converts a string or bytes into its nucleic acid complement
count_degenerate
(-> int)returns the number of degenerate characters in a sequence
count_gaps
(-> int)returns the number of gap characters in a sequence
count_variants
(-> int)Counts number of possible sequences matching the sequence, given any ambiguous characters in the sequence.
degap
(-> bytes -> str)removes all gap and missing characters from a sequence
degenerate_from_seq
(seq)Returns least degenerate symbol that encompasses a set of characters
disambiguate
(seq[, method])Returns a non-degenerate sequence from a degenerate one.
get_css_style
([colors, font_size, font_family])returns string of CSS classes and {character: <CSS class name>, ...}
is_ambiguity
(query_motif[, validate])Return True if query_motif is an ambiguity character in alphabet.
is_compatible_alphabet
(alphabet[, strict])checks that characters in alphabet are equal to a bound alphabet
is_degenerate
(-> bool -> bool)checks if a sequence contains degenerate characters
is_gapped
(-> bool -> bool)checks if a sequence contains gaps
is_valid
(seq)checks against most degenerate alphabet
iter_alphabets
()yield alphabets in order of most to least degenerate
make_seq
(*, seq[, name, check_seq])creates a Sequence object corresponding to the molecular type of this instance.
most_degen_alphabet
()returns the most degenerate alphabet for this instance
mw
(seq[, method, delta])Returns the molecular weight of the sequence.
random_disambiguate
(-> str -> bytes)disambiguates a sequence by randomly selecting a non-degenerate character
rc
(seq[, validate])reverse complement of a sequence
resolve_ambiguity
(ambig_motif[, alphabet, ...])Returns tuple of all possible canonical characters corresponding to ambig_motif
strand_symmetric_motifs
([motif_length])returns ordered pairs of strand complementary motifs
strip_bad
(-> bytes)Removes any symbols not in the alphabet.
strip_bad_and_gaps
(-> bytes)Removes any symbols not in the alphabet, and any gaps.
strip_degenerate
(-> bytes -> str)removes degenerate characters
to_json
()returns a JSON-formatted string
to_regex
(seq)returns a regex pattern with ambiguities expanded to a character set
can_pair
get_degenerate_positions
has_ambiguity
make_array_seq
to_rich_dict
Notes
The only way to create sequences is via a MolType instance. The instance defines different alphabets that are used for data conversions. Create a moltype using the get_moltype() function.
- property alphabet#
monomers
- ambiguities: dict[str, frozenset[str]] | None = None#
- can_match(first: str, second: str) bool #
Returns True if every pos in 1st could match same pos in 2nd.
Notes
Truncates at length of shorter sequence. Gaps are only allowed to match other gaps.
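The matching rules in these notes can be sketched in plain Python. This is not the cogent3 implementation: `DNA_AMBIGUITIES`, `expand` and `can_match_sketch` are hypothetical names, and treating two ambiguity codes as matching when their expansions overlap is an assumption.

```python
# Standalone illustration; every name here is hypothetical, not cogent3 API.
DNA_AMBIGUITIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
GAP = "-"


def expand(char):
    """Return the set of canonical characters a symbol can stand for."""
    return set(DNA_AMBIGUITIES.get(char, char))


def can_match_sketch(first, second):
    # zip stops at the shorter input, which gives the truncation behaviour
    for a, b in zip(first, second):
        if GAP in (a, b):
            if a != b:  # a gap may only match another gap
                return False
            continue
        if not expand(a) & expand(b):  # assumed rule: expansions must overlap
            return False
    return True
```

For example, `can_match_sketch("AR", "AG")` is True because R covers G, while `can_match_sketch("A-", "AC")` is False because a gap faces a non-gap.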
- can_mispair(first: str, second: str) bool #
Returns True if any position in first could mispair with second.
Notes
Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.
Truncates at length of shorter sequence.
Gaps are always counted as possible mispairs, as are weak pairs like GU.
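A minimal sketch of the reverse-order pairing described in these notes, assuming an RNA alphabet in which only the Watson-Crick pairs count as strong. `STRONG_PAIRS` and `can_mispair_sketch` are hypothetical names, ambiguity codes are ignored, and which end is dropped when the lengths differ is an assumption of this sketch.

```python
# Standalone illustration; not the cogent3 implementation.
# Assumption: only Watson-Crick pairs are strong, so GU counts as a weak pair.
STRONG_PAIRS = {frozenset("AU"), frozenset("CG")}


def can_mispair_sketch(first, second):
    # Pair position 0 of `first` with the last position of `second`, and so on.
    # zip truncates at the shorter of the two inputs.
    for a, b in zip(first, reversed(second)):
        # Gaps, weak pairs and identical bases are all absent from
        # STRONG_PAIRS, so each of them counts as a possible mispair.
        if frozenset((a, b)) not in STRONG_PAIRS:
            return True
    return False
```

With these rules `can_mispair_sketch("ACGU", "ACGU")` is False, since the sequence pairs perfectly with itself in reverse order, whereas replacing the first A with G introduces a GU pair and the result becomes True.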
- can_pair(first: str, second: str) bool #
- coerce_to: Callable[[bytes], bytes] | None = None#
- colors: dataclasses.InitVar[dict[str, str] | None] = None#
- complement(seq: str | bytes | ndarray, validate: bool = True) str #
- complement(seq: str, validate: bool = True) str
- complement(seq: bytes, validate: bool = True) str
- complement(seq: ndarray, validate: bool = True) ndarray[int]
converts a string or bytes into its nucleic acid complement
- Parameters:
- seq
sequence to be complemented
- validate
if True, validates the sequence against the most degenerate alphabet
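To illustrate what complementing means for a DNA string that may contain IUPAC ambiguity codes, here is a small standalone sketch. It does not call cogent3; `DNA_COMPLEMENT` and `complement_sketch` are hypothetical names, and no validation is performed.

```python
# Standalone illustration; not the cogent3 implementation.
# Each IUPAC code maps to the code covering the complements of its bases,
# e.g. R (A or G) -> Y (T or C). Gaps and other symbols pass through unchanged.
DNA_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def complement_sketch(seq):
    """Complement a DNA string without reversing it."""
    return seq.translate(DNA_COMPLEMENT)
```

Under this mapping `complement_sketch("AAR-")` gives `"TTY-"`.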
- complements: dataclasses.InitVar[dict[str, frozenset[str]] | None] = None#
- count_degenerate(seq: str | bytes) int #
- count_degenerate(seq: bytes) int
- count_degenerate(seq: str) int
returns the number of degenerate characters in a sequence
- count_gaps(seq: str | bytes) int #
- count_gaps(seq: bytes) int
- count_gaps(seq: str) int
returns the number of gap characters in a sequence
- count_variants(seq: str | bytes) int #
- count_variants(seq: bytes) int
- count_variants(seq: str) int
Counts number of possible sequences matching the sequence, given any ambiguous characters in the sequence.
Notes
Uses self.ambiguities to decide how many possibilities there are at each position in the sequence and calculates the permutations.
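One way to read this note is as a product of per-position possibilities. The sketch below assumes the standard IUPAC DNA ambiguity codes; `DNA_AMBIGUITIES` and `count_variants_sketch` are hypothetical names and the code does not call cogent3.

```python
from math import prod

# Standalone illustration; not the cogent3 implementation.
DNA_AMBIGUITIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_variants_sketch(seq):
    # A canonical base contributes a factor of 1, an ambiguity code
    # contributes the number of bases it can represent.
    return prod(len(DNA_AMBIGUITIES.get(char, char)) for char in seq)
```

For `"ARN"` the factors are 1, 2 and 4, so the sketch returns 8.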
- degap(seq, validate: bool = True) str | bytes | ndarray #
- degap(seq: bytes, validate: bool = True) bytes
- degap(seq: str, validate: bool = True) str
- degap(seq: ndarray, validate: bool = True) ndarray
removes all gap and missing characters from a sequence
- property degen_alphabet#
monomers + ambiguous characters
- property degen_gapped_alphabet#
monomers + gap + ambiguous characters
- degenerate_from_seq(seq: str) str #
Returns least degenerate symbol that encompasses a set of characters
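As an illustration of "least degenerate symbol", the sketch below picks the IUPAC DNA code with the smallest expansion that still covers every character in the input. The names are hypothetical, the code does not call cogent3, and gaps are not handled.

```python
# Standalone illustration; not the cogent3 implementation.
DNA_AMBIGUITIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_from_seq_sketch(seq):
    # Collect the canonical bases implied by every character in seq.
    wanted = set()
    for char in seq:
        wanted |= set(DNA_AMBIGUITIES.get(char, char))
    if not wanted:
        raise ValueError("empty sequence")
    if len(wanted) == 1:
        return wanted.pop()  # a single base needs no ambiguity code
    # Among the codes that cover all wanted bases, take the smallest one.
    candidates = [s for s, bases in DNA_AMBIGUITIES.items() if wanted <= set(bases)]
    return min(candidates, key=lambda s: len(DNA_AMBIGUITIES[s]))
```

For instance, `"AG"` maps to `"R"` and `"ACGT"` maps to `"N"`.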
- disambiguate(seq: str | bytes | ndarray, method: str = 'strip') str | bytes | ndarray #
Returns a non-degenerate sequence from a degenerate one.
- Parameters:
- seq
the sequence to be disambiguated
- method
how to disambiguate the sequence, one of “strip” or “random”
strip: removes degenerate characters
random: randomly selects a non-degenerate character
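The difference between the two methods can be sketched as follows. This is a standalone illustration for DNA strings only; `disambiguate_sketch` and its `rng` argument are hypothetical and not part of the cogent3 signature.

```python
import random

# Standalone illustration; not the cogent3 implementation.
DNA_AMBIGUITIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def disambiguate_sketch(seq, method="strip", rng=None):
    if method == "strip":
        # Drop every ambiguity code, so the result may be shorter than seq.
        return "".join(c for c in seq if c not in DNA_AMBIGUITIES)
    if method == "random":
        # Replace each ambiguity code by one of the bases it can represent,
        # so the length of the sequence is preserved.
        rng = rng or random.Random()
        return "".join(
            rng.choice(DNA_AMBIGUITIES[c]) if c in DNA_AMBIGUITIES else c
            for c in seq
        )
    raise ValueError(f"unknown method {method!r}")
```

Here "strip" turns `"ARCN"` into `"AC"`, while "random" returns a four-character string whose second position is A or G.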
- gap: str | None = '-'#
- property gapped_alphabet#
monomers + gap
- property gaps: frozenset#
- get_css_style(colors: dict[str, str] | None = None, font_size: int = 12, font_family='Lucida Console')#
returns string of CSS classes and {character: <CSS class name>, …}
- Parameters:
- colors
A dictionary mapping characters to CSS color values.
- font_size
Font size in points.
- font_family
Name of a monospace font.
- get_degenerate_positions(seq: str | bytes | ndarray, include_gap: bool = True, validate: bool = True) list[int] #
- has_ambiguity(seq: str | bytes | ndarray) bool #
- has_ambiguity(seq: str) bool
- has_ambiguity(seq: bytes) bool
- has_ambiguity(seq: ndarray) bool
- is_ambiguity(query_motif: str, validate: bool = True) bool #
Return True if query_motif is an ambiguity character in alphabet.
- Parameters:
- query_motif
the motif being queried.
- validate
if True, validates query_motif against the most degenerate alphabet
- is_compatible_alphabet(alphabet: CharAlphabet, strict: bool = True) bool #
checks that characters in alphabet are equal to a bound alphabet
- Parameters:
- alphabet
an Alphabet instance
- strict
the order of elements must match
- is_degenerate(seq: str | bytes | ndarray, validate: bool = True) bool #
- is_degenerate(seq: bytes, validate: bool = True) bool
- is_degenerate(seq: str, validate: bool = True) bool
- is_degenerate(seq: ndarray, validate: bool = True) bool
checks if a sequence contains degenerate characters
- is_gapped(seq, validate: bool = True) bool #
- is_gapped(seq: str, validate: bool = True) bool
- is_gapped(seq: bytes, validate: bool = True) bool
- is_gapped(seq: ndarray, validate: bool = True) bool
checks if a sequence contains gaps
- property is_nucleic: bool#
is a nucleic acid moltype
Notes
nucleic moltypes can be used for complementing and translating into amino acids.
- is_valid(seq: str | ndarray) bool #
checks against most degenerate alphabet
- iter_alphabets()#
yield alphabets in order of most to least degenerate
- property label#
synonym for name
- make_array_seq(*args, **kwargs) Sequence #
- make_seq(*, seq: str | SeqViewABC, name: str | None = None, check_seq: bool = True, **kwargs) Sequence #
creates a Sequence object corresponding to the molecular type of this instance.
- Parameters:
- seq
the raw sequence data
- name
the name of the sequence
- check_seq
whether to validate the sequence data against the molecular type. If True, performs validation checks; set to False if the sequence data is already known to be valid.
- **kwargs
additional keyword arguments that may be required for creating the Sequence object
Notes
If seq is a string, and the moltype has a coerce_to attribute, the string will be converted via that callable into a character set compatible with the moltype. Only applies to nucleic acid moltypes.
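A callable with the documented `Callable[[bytes], bytes]` shape might look like the sketch below, which rewrites RNA-style U as T so the data fits a DNA character set. `coerce_to_dna_sketch` is a hypothetical name, and the encode/decode step around it is an assumption about how a string would be passed through a bytes-to-bytes callable.

```python
# Standalone illustration of a bytes -> bytes coercion callable.
# The name and the U -> T rule are hypothetical, not taken from cogent3.
_RNA_TO_DNA = bytes.maketrans(b"Uu", b"Tt")


def coerce_to_dna_sketch(data: bytes) -> bytes:
    """Rewrite U/u as T/t, leaving every other byte untouched."""
    return data.translate(_RNA_TO_DNA)


# Assumed usage: encode the string, coerce it, then decode the result.
coerced = coerce_to_dna_sketch("ACGU".encode("utf8")).decode("utf8")
```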
- property matching_rules: dict[frozenset[str], bool]#
- missing: str | None = '?'#
- monomers: dataclasses.InitVar[Union[str, bytes]]#
- most_degen_alphabet()#
returns the most degenerate alphabet for this instance
- mw(seq: str, method: str = 'random', delta: float | None = None) float #
Returns the molecular weight of the sequence. If the sequence is ambiguous, uses method to disambiguate the sequence.
- Parameters:
- seq
the sequence whose molecular weight is to be calculated.
- method
the method provided to .disambiguate() to disambiguate the sequence. either “random” or “first”.
- delta
if delta is present, uses it instead of the standard weight adjustment.
- mw_calculator: WeightCalculator | None = None#
- name: str#
- pairing_rules: dict[str, dict[frozenset[str], bool]] | None = None#
- random_disambiguate(seq: str | bytes | ndarray) str | bytes | ndarray #
- random_disambiguate(seq: str) str
- random_disambiguate(seq: bytes) bytes
- random_disambiguate(seq: ndarray) ndarray
disambiguates a sequence by randomly selecting a non-degenerate character
- rc(seq: str, validate: bool = True) str #
reverse complement of a sequence
- resolve_ambiguity(ambig_motif: str, alphabet: CharAlphabet | None = None, allow_gap: bool = False, validate: bool = True) tuple[str] #
Returns tuple of all possible canonical characters corresponding to ambig_motif
- Parameters:
- ambig_motif
the string to be expanded
- alphabet
optional, disambiguated motifs not present in alphabet will be excluded. This could be a codon alphabet where stop codons are not present.
- allow_gap
whether the gap character is allowed in output. Only applied when alphabet is None.
Notes
If ambig_motif is > 1 character long and alphabet is None, we construct a word alphabet with the same length.
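The expansion of a multi-character motif can be pictured as a Cartesian product over the per-position possibilities, optionally filtered by an allowed set such as sense codons. The sketch below is standalone; `resolve_ambiguity_sketch` and its `allowed` argument are hypothetical stand-ins for the documented `alphabet` parameter, and gaps are not handled.

```python
from itertools import product

# Standalone illustration; not the cogent3 implementation.
DNA_AMBIGUITIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def resolve_ambiguity_sketch(ambig_motif, allowed=None):
    # One string of candidate bases per position in the motif.
    per_position = (DNA_AMBIGUITIES.get(c, c) for c in ambig_motif)
    expanded = ("".join(combo) for combo in product(*per_position))
    # Drop motifs that are absent from the allowed set, if one is given.
    return tuple(m for m in expanded if allowed is None or m in allowed)
```

For example, `"TGR"` expands to `("TGA", "TGG")`; passing `allowed={"TGG"}` mimics a codon alphabet without the stop codon TGA and leaves `("TGG",)`.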
- strand_symmetric_motifs(motif_length: int = 1) set[tuple[str, str]] #
returns ordered pairs of strand complementary motifs
- strip_bad(seq: str | bytes) str | bytes #
- strip_bad(seq: bytes) bytes
- strip_bad(seq: str) str
Removes any symbols not in the alphabet.
- strip_bad_and_gaps(seq: str | bytes) str | bytes #
- strip_bad_and_gaps(seq: bytes) bytes
- strip_bad_and_gaps(seq: str) str
Removes any symbols not in the alphabet, and any gaps. Since missing could be a gap, it is also removed.
- strip_degenerate(seq: str | bytes | ndarray) str | bytes | ndarray #
- strip_degenerate(seq: bytes) bytes
- strip_degenerate(seq: str) str
- strip_degenerate(seq: ndarray) ndarray
removes degenerate characters
- to_json()#
returns a JSON-formatted string
- to_regex(seq: str) str #
returns a regex pattern with ambiguities expanded to a character set
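A minimal sketch of the idea, assuming IUPAC DNA codes: each ambiguity code becomes a character class and every other character is kept as a literal. `to_regex_sketch` is a hypothetical name, and the exact pattern text cogent3 produces may differ.

```python
import re

# Standalone illustration; not the cogent3 implementation.
DNA_AMBIGUITIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_regex_sketch(seq):
    # Ambiguity codes expand to a character class; other characters are
    # escaped so symbols such as "?" are matched literally.
    return "".join(
        f"[{DNA_AMBIGUITIES[c]}]" if c in DNA_AMBIGUITIES else re.escape(c)
        for c in seq
    )


pattern = to_regex_sketch("GAYN")
matches = re.fullmatch(pattern, "GATC") is not None
```

The resulting pattern can be passed straight to the `re` module to search unambiguous sequences for a degenerate motif.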
- to_rich_dict(**kwargs)#