ProteinSequence#
- class ProteinSequence(moltype: MolType, seq: str | bytes | ndarray | SeqViewABC, *, name: str | None = None, info: dict | Info | None = None, annotation_offset: int = 0, annotation_db: SupportsFeatures | None = None)#
Holds the standard Protein sequence.
- Attributes:
- annotation_db
annotation_offset
The offset between annotation coordinates and sequence coordinates.
- info
- moltype
- name
Methods
add_feature
(*, biotype, name, spans[, ...])add a feature to annotation_db
annotate_from_gff
(f[, offset])copies annotations from a gff file to self
annotate_matches_to
(pattern, biotype, name)Adds an annotation at sequence positions matching pattern.
can_match
(other)Returns True if every pos in self could match same pos in other.
copy
([exclude_annotations, sliced])returns a copy of self
copy_annotations
(seq_db)copy annotations into attached annotation db
count
(item)count() delegates to self._seq.
count_ambiguous
()Returns the number of ambiguous characters in the sequence.
count_degenerate
()Counts the degenerate bases in the specified sequence.
count_gaps
()Counts the gaps in the specified sequence.
count_variants
()Counts number of possible sequences matching the sequence, given any ambiguous characters in the sequence.
counts
([motif_length, include_ambiguity, ...])returns dict of counts of motifs
degap
()Deletes all gap characters from sequence.
diff
(other)Returns number of differences between self and other.
disambiguate
([method])Returns a non-degenerate sequence from a degenerate one.
distance
(other[, function])Returns distance between self and other using function(i,j).
frac_diff
(other)Returns fraction of positions where self and other differ.
frac_diff_gaps
(other)Returns fraction of positions where self and other’s gap states differ.
frac_diff_non_gaps
(other)Returns fraction of non-gap positions where self differs from other.
frac_same
(other)Returns fraction of positions where self and other are the same.
frac_same_gaps
(other)Returns fraction of positions where self and other share gap states.
frac_same_non_gaps
(other)Returns fraction of non-gap positions where self matches other.
frac_similar
(other, similar_pairs)Returns fraction of positions where self[i] is similar to other[i].
gap_indices
()Returns array of the indices of all gaps in the sequence
gap_vector
()Returns vector of True or False according to which pos are gaps or missing.
get_drawable
(*[, biotype, width, vertical])make a figure from sequence features
get_drawables
(*[, biotype])returns a dict of drawables, keyed by type
get_features
(*[, biotype, name, start, ...])yields Feature instances
get_in_motif_size
([motif_length, warn])returns sequence as list of non-overlapping motifs
get_kmers
(k[, strict])return all overlapping k-mers
get_name
()Return the sequence name -- should just use name instead.
get_type
()Return the sequence type as moltype label.
is_annotated
([biotype])returns True if sequence parent name has any annotations
is_degenerate
()Returns True if sequence contains degenerate characters.
is_gapped
()Returns True if sequence contains gaps.
is_strict
()Returns True if sequence contains only monomers.
is_valid
()Returns True if sequence contains no items absent from alphabet.
iter_kmers
(k[, strict])generates all overlapping k-mers.
make_feature
(feature, *args)return a Feature instance from feature data
matrix_distance
(other, matrix)Returns distance between self and other using a score matrix.
mw
([method, delta])Returns the molecular weight of (one strand of) the sequence.
parent_coordinates
()returns seqid, start, stop, strand of this sequence on its parent
parse_out_gaps
()returns Map corresponding to gap locations and ungapped Sequence
replace_annotation_db
(value[, check])public interface to assigning the annotation_db
resolved_ambiguities
()Returns a list of sets of strings.
shuffle
()returns a randomized copy of the Sequence object
sliding_windows
(window, step[, start, end])Generator function that yields new sequence objects of a given length at a given interval.
strip_bad
()Removes any symbols not in the alphabet.
strip_bad_and_gaps
()Removes any symbols not in the alphabet, and any gaps.
strip_degenerate
()Removes degenerate bases by stripping them out of the sequence.
to_array
([apply_transforms])returns the numpy array
to_fasta
([make_seqlabel, block_size])Return string of self in FASTA format, no trailing newline
to_html
([wrap, limit, colors, font_size, ...])returns html with embedded styles for sequence colouring
to_json
()returns a json formatted string
to_moltype
(moltype)returns copy of self with moltype seq
to_phylip
([name_len, label_len])Return string of self in one line for PHYLIP, no newline.
to_rich_dict
([exclude_annotations])returns {'name': name, 'seq': sequence, 'moltype': moltype.label}
with_masked_annotations
(annot_types[, ...])returns a sequence with annot_types regions replaced by mask_char if shadow is False, otherwise all other regions are masked.
with_termini_unknown
()Returns copy of sequence with terminal gaps remapped as missing.
from_rich_dict
gapped_by_map
gapped_by_map_motif_iter
gapped_by_map_segment_iter
- add_feature(*, biotype: str, name: str, spans: list[tuple[int, int]], parent_id: str | None = None, strand: str | None = None, on_alignment: bool = False, seqid: str | None = None) Feature #
add a feature to annotation_db
- Parameters:
- biotype
biological type
- name
name of the feature
- spans
coordinates for this sequence
- parent_id
name of the feature parent
- strand
‘+’ or ‘-’, defaults to ‘+’
- on_alignment
whether the feature spans are alignment coordinates
- seqid
ignored since the feature is added to this sequence
- Returns:
- Feature instance
- annotate_from_gff(f: os.PathLike, offset: int | None = None) None #
copies annotations from a gff file to self
- Parameters:
- f
path to gff annotation file.
- offset
Optional, the offset between annotation coordinates and sequence coordinates.
- annotate_matches_to(pattern: str, biotype: str, name: str, allow_multiple: bool = False)#
Adds an annotation at sequence positions matching pattern.
- Parameters:
- pattern
The search string for which annotations are made. IUPAC ambiguities are converted to regex on sequences with the appropriate MolType.
- biotype
The type of the annotation (e.g. “domain”).
- name
The name of the annotation.
- allow_multiple
If True, allows multiple occurrences of the input pattern. Otherwise, only the first match is used.
- Returns:
- Returns a list of Feature instances.
- property annotation_db#
- property annotation_offset#
The offset between annotation coordinates and sequence coordinates.
The offset can be used to adjust annotation coordinates to match the position of the given Sequence within a larger genomic context. For example, if the Annotations are with respect to a chromosome, and the sequence represents a gene that is 100 bp from the start of a chromosome, the offset can be set to 100 to ensure that the gene’s annotations are aligned with the appropriate genomic positions.
- Returns:
int: The offset between annotation coordinates and sequence coordinates.
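The offset arithmetic in the example above can be illustrated with a small standalone sketch. It does not use cogent3: the helper name `to_parent_coords` and the span values are hypothetical and only mirror the "gene 100 positions into a chromosome" case.

```python
def to_parent_coords(spans, annotation_offset):
    """Shift (start, stop) spans from sequence coordinates to parent coordinates."""
    return [
        (start + annotation_offset, stop + annotation_offset)
        for start, stop in spans
    ]


# Hypothetical feature spans, expressed relative to the start of the gene.
gene_spans = [(0, 12), (30, 45)]

# The gene begins 100 positions into its parent, so every span shifts by 100.
chrom_spans = to_parent_coords(gene_spans, annotation_offset=100)
print(chrom_spans)
```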
- can_match(other) bool #
Returns True if every pos in self could match same pos in other.
Truncates at length of shorter sequence. Gaps are only allowed to match other gaps.
- copy(exclude_annotations: bool = False, sliced: bool = True)#
returns a copy of self
- Parameters:
- sliced
Slices underlying sequence with start/end of self coordinates. The offset is retained.
- exclude_annotations
drops annotation_db when True
- copy_annotations(seq_db: SupportsFeatures) None #
copy annotations into attached annotation db
- Parameters:
- seq_db
compatible annotation db
Notes
Only copies annotations for records with seqid equal to self.name
- count(item: str)#
count() delegates to self._seq.
- count_ambiguous() int #
Returns the number of ambiguous characters in the sequence.
- count_degenerate() int #
Counts the degenerate bases in the specified sequence.
Notes
gap and missing characters are counted as degenerate.
- count_gaps() int #
Counts the gaps in the specified sequence.
- count_variants()#
Counts number of possible sequences matching the sequence, given any ambiguous characters in the sequence.
Notes
Uses self.ambiguities to decide how many possibilities there are at each position in the sequence and multiplies them to get the number of possible sequences.
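As a rough illustration of that calculation, the sketch below multiplies the number of possibilities at each position. It is standalone Python, not the cogent3 implementation, and the two-entry ambiguity table (B and Z) is an assumption made for the example; the table cogent3 uses for proteins may differ.

```python
from math import prod

# Assumed ambiguity table, for illustration only.
AMBIGUITIES = {"B": {"D", "N"}, "Z": {"E", "Q"}}


def count_variants(seq, ambiguities=AMBIGUITIES):
    """Multiply the number of possible residues at each position."""
    # A character without an entry in the table contributes exactly one option.
    return prod(len(ambiguities.get(char, {char})) for char in seq)


n_variants = count_variants("ABZK")  # 1 * 2 * 2 * 1
print(n_variants)
```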
- counts(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) CategoryCounter #
returns dict of counts of motifs
only non-overlapping motifs are counted.
- Parameters:
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
- warn
warns if motif_length > 1 and the sequence is trimmed to produce complete motifs
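The idea of counting non-overlapping motifs can be sketched in plain Python as follows. This is an illustration only, not the library's code: the function name `motif_counts` is hypothetical, and dropping an incomplete terminal motif is an assumption based on the `warn` description above.

```python
from collections import Counter


def motif_counts(seq, motif_length=1):
    """Count non-overlapping motifs, ignoring an incomplete terminal motif."""
    usable = len(seq) - len(seq) % motif_length
    return Counter(
        seq[i : i + motif_length] for i in range(0, usable, motif_length)
    )


single = motif_counts("MKVLMK")
# Length 7 is not a multiple of 2, so the trailing "A" is not counted.
pairs = motif_counts("MKVLMKA", motif_length=2)
print(single, pairs)
```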
- degap()#
Deletes all gap characters from sequence.
- diff(other) int #
Returns number of differences between self and other.
Notes
Truncates at the length of the shorter sequence.
- disambiguate(method: str = 'strip')#
Returns a non-degenerate sequence from a degenerate one.
- Parameters:
- seq
the sequence to be disambiguated
- method
how to disambiguate the sequence, one of “strip” or “random”
strip: deletes any characters not in monomers or gaps
random: assigns the possibilities at random, using equal frequencies
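The two methods can be sketched in standalone Python as below. This is not the cogent3 implementation: the monomer set, the gap character and the two-entry ambiguity table are assumptions chosen for the example, and the `rng` argument is a hypothetical addition that makes the random choice reproducible.

```python
import random

MONOMERS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
GAPS = frozenset("-")                          # assumed gap character
AMBIGUITIES = {"B": "DN", "Z": "EQ"}           # assumed table, illustration only


def disambiguate(seq, method="strip", rng=None):
    """Return seq with ambiguity codes stripped or randomly resolved."""
    if method == "strip":
        # Keep only monomers and gaps.
        return "".join(c for c in seq if c in MONOMERS or c in GAPS)
    if method == "random":
        rng = rng or random.Random()
        # Each possibility is chosen with equal frequency.
        return "".join(
            rng.choice(AMBIGUITIES[c]) if c in AMBIGUITIES else c for c in seq
        )
    raise ValueError(f"unknown method {method!r}")


stripped = disambiguate("MBK-ZL")
resolved = disambiguate("MBK-ZL", method="random", rng=random.Random(0))
```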
- distance(other, function: Callable[[str, str], int | float] | None = None) int | float #
Returns distance between self and other using function(i,j).
- Parameters:
- other
a sequence to compare to self
- function
takes two seq residues and returns a number. To turn a 2D matrix into a function, use cogent3.util.misc.DistanceFromMatrix(matrix).
Notes
Truncates at the length of the shorter sequence.
The function acts on two _elements_ of the sequences, not the two sequences themselves (i.e. the behavior will be the same for every position in the sequences, such as identity scoring or a function derived from a distance matrix as suggested above). One limitation of this approach is that the distance function cannot use properties of the sequences themselves: for example, it cannot use the lengths of the sequences to normalize the scores as percent similarities or percent differences.
If you want functions that act on the two sequences themselves, there is no particular advantage in making these functions methods of the first sequences by passing them in as parameters like the function in this method. It makes more sense to use them as standalone functions. The factory function cogent3.util.transform.for_seq is useful for converting per-element functions into per-sequence functions, since it takes as parameters a per-element scoring function, a score aggregation function, and a normalization function (which itself takes the two sequences as parameters), returning a single function that combines these functions and that acts on two complete sequences.
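A minimal sketch of the per-element approach described in these notes follows. It is standalone Python, not the library's code: summing the per-position scores, the mismatch default and the helper `distance_from_matrix` (a stand-in for the matrix-to-function idea) are all assumptions made for illustration.

```python
def mismatch(a, b):
    """Assumed default score: 1 for a mismatch, 0 for a match."""
    return int(a != b)


def distance(seq1, seq2, function=mismatch):
    """Sum a per-element score over paired positions.

    zip() stops at the shorter sequence, which gives the truncation behaviour.
    """
    return sum(function(a, b) for a, b in zip(seq1, seq2))


def distance_from_matrix(matrix):
    """Hypothetical helper: turn a dict-of-dicts matrix into a score function."""

    def score(a, b):
        return matrix[a][b]

    return score


scores = {"A": {"A": 0, "V": 3}, "V": {"A": 3, "V": 0}}
d_default = distance("AVAV", "AAVVAA")
d_matrix = distance("AVAV", "AAVVAA", distance_from_matrix(scores))
```

Because `function` only ever sees one pair of residues, it cannot normalise by sequence length, which is the limitation the notes describe.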
- frac_diff(other) float #
Returns fraction of positions where self and other differ.
Notes
Truncates at length of shorter sequence. Will return 0 if one sequence is empty.
- frac_diff_gaps(other)#
Returns fraction of positions where self and other’s gap states differ.
In other words, if self and other are both all gaps, or both all non-gaps, or both have gaps in the same places, frac_diff_gaps will return 0.0. If self is all gaps and other has no gaps, frac_diff_gaps will return 1.0.
Returns 0 if one sequence is empty.
Uses self’s gap characters for both sequences.
- frac_diff_non_gaps(other)#
Returns fraction of non-gap positions where self differs from other.
Doesn’t count any position where self or other has a gap. Truncates at the length of the shorter sequence.
Returns 0 if one sequence is empty. Note that this means that frac_diff_non_gaps is _not_ the same as 1 - frac_same_non_gaps, since both return 0 if one sequence is empty.
- frac_same(other) float #
Returns fraction of positions where self and other are the same.
Notes
Truncates at length of shorter sequence. Will return 0 if one sequence is empty.
- frac_same_gaps(other)#
Returns fraction of positions where self and other share gap states.
In other words, if self and other are both all gaps, or both all non-gaps, or both have gaps in the same places, frac_same_gaps will return 1.0. If self is all gaps and other has no gaps, frac_same_gaps will return 0.0. Returns 0 if one sequence is empty.
Uses self’s gap characters for both sequences.
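The gap-state comparison can be sketched in standalone Python as below. This is an illustration rather than the cogent3 code: the single gap character and the truncation to the shorter sequence are assumptions.

```python
def frac_same_gaps(seq1, seq2, gaps=frozenset("-")):
    """Fraction of positions where both sequences agree on gap vs non-gap."""
    paired = list(zip(seq1, seq2))
    if not paired:
        return 0.0  # empty input returns 0, as documented
    same_state = sum((a in gaps) == (b in gaps) for a, b in paired)
    return same_state / len(paired)


mixed = frac_same_gaps("M--K", "MK-K")     # positions 0, 2 and 3 agree
opposite = frac_same_gaps("----", "MKLV")  # all gaps vs no gaps
no_gaps = frac_same_gaps("MKLV", "AAAA")   # residues differ, gap states agree
```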
- frac_same_non_gaps(other)#
Returns fraction of non-gap positions where self matches other.
Doesn’t count any position where self or other has a gap. Truncates at the length of the shorter sequence.
Returns 0 if one sequence is empty.
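The sketch below illustrates the two non-gap fractions, frac_same_non_gaps and frac_diff_non_gaps, and why they do not always sum to 1. It is standalone Python under assumed gap characters, not the library's implementation.

```python
GAPS = frozenset("-?")  # assumed gap and missing characters


def _non_gap_pairs(seq1, seq2, gaps=GAPS):
    """Paired positions where neither sequence has a gap."""
    return [(a, b) for a, b in zip(seq1, seq2) if a not in gaps and b not in gaps]


def frac_same_non_gaps(seq1, seq2):
    pairs = _non_gap_pairs(seq1, seq2)
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def frac_diff_non_gaps(seq1, seq2):
    pairs = _non_gap_pairs(seq1, seq2)
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


same = frac_same_non_gaps("MK-LV", "MKALI")  # 4 comparable positions, 3 match
diff = frac_diff_non_gaps("MK-LV", "MKALI")
# With an empty sequence both fractions are 0, so they do not sum to 1.
empty_total = frac_same_non_gaps("", "MK") + frac_diff_non_gaps("", "MK")
```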
- frac_similar(other, similar_pairs: dict[tuple[str, str], Any])#
Returns fraction of positions where self[i] is similar to other[i].
similar_pairs must be a dict such that d[(i,j)] exists if i and j are to be counted as similar. Use PairsFromGroups in cogent3.util.misc to construct such a dict from a list of lists of similar residues.
Truncates at the length of the shorter sequence.
Note: current implementation re-creates the distance function each time, so may be expensive compared to creating the distance function using for_seq separately.
Returns 0 if one sequence is empty.
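To make the shape of similar_pairs concrete, here is a standalone sketch. The helper `pairs_from_groups` is a hypothetical stand-in written for this example (it is not the PairsFromGroups utility itself), and the residue groups are arbitrary.

```python
from itertools import product


def pairs_from_groups(groups):
    """Build a dict keyed by every (i, j) pair drawn from the same group."""
    pairs = {}
    for group in groups:
        for a, b in product(group, repeat=2):
            pairs[(a, b)] = None
    return pairs


def frac_similar(seq1, seq2, similar_pairs):
    """Fraction of paired positions whose (i, j) key exists in similar_pairs."""
    paired = list(zip(seq1, seq2))
    if not paired:
        return 0.0
    return sum((a, b) in similar_pairs for a, b in paired) / len(paired)


similar = pairs_from_groups(["ILV", "KR"])
# (I, L) and (K, R) are similar; (M, M) and (V, A) have no key in the dict.
frac = frac_similar("MIKV", "MLRA", similar)
```

In this sketch, identical residues only count as similar when they belong to one of the groups, because the test is simply whether the key exists.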
- classmethod from_rich_dict(data: dict)#
- gap_indices() ndarray #
Returns array of the indices of all gaps in the sequence
- gap_vector() list[bool] #
Returns vector of True or False according to which pos are gaps or missing.
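The relationship between gap_vector and gap_indices can be sketched as follows. This is plain Python on strings with assumed gap and missing characters; the library method returns a numpy array for gap_indices, whereas this sketch returns a list.

```python
def gap_vector(seq, gaps=frozenset("-?")):
    """One boolean per position: True where the character is a gap or missing."""
    return [char in gaps for char in seq]


def gap_indices(seq, gaps=frozenset("-?")):
    """Positions at which gap_vector is True."""
    return [i for i, is_gap in enumerate(gap_vector(seq, gaps)) if is_gap]


vector = gap_vector("M-K?V")
indices = gap_indices("M-K?V")
```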
- gapped_by_map(segment_map: IndelMap, recode_gaps: bool = False)#
- gapped_by_map_motif_iter(segment_map: IndelMap) Iterator[str, str, str] #
- gapped_by_map_segment_iter(segment_map: IndelMap, allow_gaps: bool = True, recode_gaps: bool = False) Iterator[str, str, str] #
- get_drawable(*, biotype: str | Iterable[str] | None = None, width: int = 600, vertical: bool = False)#
make a figure from sequence features
- Parameters:
- biotype
passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.
- width
width in pixels
- vertical
rotates the drawable
- Returns:
- a Drawable instance
Notes
If provided, the biotype is used for plot order.
- get_drawables(*, biotype: str | Iterable[str] | None = None) dict #
returns a dict of drawables, keyed by type
- Parameters:
- biotype
passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.
- get_features(*, biotype: str | None = None, name: str | None = None, start: int | None = None, stop: int | None = None, allow_partial: bool = False)#
yields Feature instances
- Parameters:
- biotype
biotype of the feature
- name
name of the feature
- start, stop
start, stop positions to search between, relative to offset of this sequence. If not provided, entire span of sequence is used.
Notes
When dealing with a nucleic acid moltype, the returned features will yield a sequence segment that is consistently oriented irrespective of strand of the current instance.
- get_in_motif_size(motif_length=1, warn=False)#
returns sequence as list of non-overlapping motifs
- Parameters:
- motif_length
length of the motifs
- warn
whether to notify of an incomplete terminal motif
- get_kmers(k: int, strict: bool = True) list[str] #
return all overlapping k-mers
- get_name()#
Return the sequence name – should just use name instead.
- get_type()#
Return the sequence type as moltype label.
- info#
- is_annotated(biotype: str | tuple[str] | None = None) bool #
returns True if sequence parent name has any annotations
- Parameters:
- biotype
amend condition to return True only if the sequence is annotated with one of provided biotypes.
- is_degenerate() bool #
Returns True if sequence contains degenerate characters.
- is_gapped() bool #
Returns True if sequence contains gaps.
- is_strict() bool #
Returns True if sequence contains only monomers.
- is_valid() bool #
Returns True if sequence contains no items absent from alphabet.
- iter_kmers(k: int, strict: bool = True) Iterator[str] #
generates all overlapping k-mers. When strict is True, the characters in the k-mer must be a subset of the canonical characters for the moltype.
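A standalone sketch of overlapping k-mer generation is shown below. It is not the cogent3 implementation: the canonical set is assumed to be the 20 standard amino acids, and silently skipping k-mers that fail the strict check is an assumption about how the restriction is applied.

```python
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")  # assumed canonical characters


def iter_kmers(seq, k, strict=True, canonical=CANONICAL):
    """Yield every overlapping k-mer, optionally only fully canonical ones."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if strict and not set(kmer) <= canonical:
            continue  # contains a non-canonical character such as X
        yield kmer


all_kmers = list(iter_kmers("MKXLV", 2, strict=False))
strict_kmers = list(iter_kmers("MKXLV", 2))
```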
- make_feature(feature: FeatureDataType, *args) Feature #
return a Feature instance from feature data
- Parameters:
- feature
dict of key data to make a Feature instance
Notes
Unlike add_feature(), this method does not add the feature to the database. We assume that spans represent the coordinates for this instance!
- matrix_distance(other, matrix) int | float #
Returns distance between self and other using a score matrix.
- moltype#
- mw(method: str = 'random', delta: float | None = None) float #
Returns the molecular weight of (one strand of) the sequence.
- Parameters:
- method
If the sequence is ambiguous, uses method (random or strip) to disambiguate the sequence.
- delta
If delta is passed in, adds delta per strand. Default is None, which uses the alphabet default. Typically, this adds 18 Da for terminal water. However, note that the default nucleic acid weight assumes 5’ monophosphate and 3’ OH: pass in delta=18.0 if you want 5’ OH as well.
Notes
this method only calculates the MW of the coding strand. If you want the MW of the reverse strand, add self.rc().mw(). DO NOT just multiply the MW by 2: the results may not be accurate due to strand bias, e.g. in mitochondrial genomes.
- name#
- parent_coordinates() tuple[str, int, int, int] #
returns seqid, start, stop, strand of this sequence on its parent
- Returns:
- seqid, start, end, strand of this sequence on the parent. strand is either -1 or 1.
Notes
seqid is the identifier of the parent. Returned coordinates are with respect to the plus strand, irrespective of whether the sequence has been reverse complemented or not.
- parse_out_gaps()#
returns Map corresponding to gap locations and ungapped Sequence
- replace_annotation_db(value: SupportsFeatures, check: bool = True) None #
public interface to assigning the annotation_db
- Parameters:
- value
the annotation db instance
- check
whether to check value supports the feature interface
Notes
The check can be very expensive, so if you’re confident set it to False
- resolved_ambiguities() list[set[str]] #
Returns a list of sets of strings.
- shuffle()#
returns a randomized copy of the Sequence object
- sliding_windows(window, step, start=None, end=None)#
Generator function that yields new sequence objects of a given length at a given interval.
- Parameters:
- window
The length of the returned sequence
- step
The interval between the start of the returned sequence objects
- start
first window start position
- end
last window start position
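How window, step, start and end interact can be sketched with a plain string generator. This is an illustration, not the library's code: treating end as an inclusive last start position and emitting only full-length windows are assumptions drawn from the parameter descriptions above.

```python
def sliding_windows(seq, window, step, start=None, end=None):
    """Yield slices of length `window`, starting every `step` positions."""
    start = 0 if start is None else start
    last_full = len(seq) - window  # last start that still gives a full window
    end = last_full if end is None else min(end, last_full)
    for pos in range(start, end + 1, step):
        yield seq[pos : pos + window]


windows = list(sliding_windows("MKVLAAGIT", window=3, step=2))
bounded = list(sliding_windows("MKVLAAGIT", window=3, step=2, start=1, end=3))
```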
- strip_bad()#
Removes any symbols not in the alphabet.
- strip_bad_and_gaps()#
Removes any symbols not in the alphabet, and any gaps. As the missing character could be a gap, this method will remove it as well.
- strip_degenerate()#
Removes degenerate bases by stripping them out of the sequence.
- to_array(apply_transforms: bool = True) ndarray[int] #
returns the numpy array
- Parameters:
- apply_transforms
if True, applies any reverse complement operation
Notes
Use this method with apply_transforms=False if you are creating data for storage in a SeqData instance.
- to_fasta(make_seqlabel=None, block_size=60) str #
Return string of self in FASTA format, no trailing newline
- Parameters:
- make_seqlabel
callback function that takes the seq object and returns a label str
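The general shape of the output can be sketched in standalone Python. The block_size parameter is not described above, so reading it as the number of characters per line is an assumption, as is the function signature used here.

```python
def to_fasta(name, seq, block_size=60):
    """Format one record: a '>' label line, then seq wrapped at block_size."""
    lines = [f">{name}"]
    lines.extend(seq[i : i + block_size] for i in range(0, len(seq), block_size))
    # Joining with "\n" leaves no trailing newline.
    return "\n".join(lines)


record = to_fasta("prot1", "MKVLAAGITK", block_size=4)
print(record)
```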
- to_html(wrap: int = 60, limit: int | None = None, colors: Mapping[str, str] | None = None, font_size: int = 12, font_family: str = 'Lucida Console')#
returns html with embedded styles for sequence colouring
- Parameters:
- wrap
maximum number of printed bases, defaults to alignment length
- limit
truncate alignment to this length
- colors
dict of {char: color} to use for coloring
- font_size
in points. Affects labels and sequence and line spacing (proportional to value)
- font_family
string denoting font family
- To display in jupyter notebook:
>>> from IPython.core.display import HTML
>>> HTML(aln.to_html())
- to_json() str #
returns a json formatted string
- to_moltype(moltype: str | MolType) Sequence #
returns copy of self with moltype seq
- Parameters:
- moltype
molecular type
Notes
This method cannot convert between nucleic acids and proteins. Use get_translation() for that.
When applied to a sequence in a SequenceCollection, the resulting sequence will no longer be part of the collection.
- to_phylip(name_len: int = 28, label_len: int = 30) str #
Return string of self in one line for PHYLIP, no newline.
Default: max name length is 28, label length is 30.
- to_rich_dict(exclude_annotations: bool = True) dict[str, str | dict[str, str]] #
returns {‘name’: name, ‘seq’: sequence, ‘moltype’: moltype.label}
Notes
Deserialisation of the sequence object will not include the annotation_db even if exclude_annotations=False.
- with_masked_annotations(annot_types: str | Iterable[str], mask_char: str | None = None, shadow: bool = False, extend_query: bool = False)#
returns a sequence with annot_types regions replaced by mask_char if shadow is False, otherwise all other regions are masked.
- Parameters:
- annot_types
annotation type(s)
- mask_char
must be a character valid for the seq MolType. The default value is the most ambiguous character, e.g. ‘?’ for DNA
- shadow
whether to mask the annotated regions, or everything but the annotated regions
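The effect of shadow can be sketched on a plain string. This is an illustration only: the function name `mask_spans` is hypothetical, explicit (start, stop) spans stand in for the regions that annot_types would select, and "?" is used as the mask character purely for the example.

```python
def mask_spans(seq, spans, mask_char="?", shadow=False):
    """Mask the given spans, or everything outside them when shadow is True."""
    in_span = [False] * len(seq)
    for start, stop in spans:
        for i in range(start, min(stop, len(seq))):
            in_span[i] = True
    # A position is masked when its membership differs from `shadow`.
    return "".join(
        mask_char if flag != shadow else char for char, flag in zip(seq, in_span)
    )


masked = mask_spans("MKVLAAGITK", [(2, 5)])
shadowed = mask_spans("MKVLAAGITK", [(2, 5)], shadow=True)
```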
- with_termini_unknown()#
Returns copy of sequence with terminal gaps remapped as missing.
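A standalone sketch of the remapping is given below. It operates on a plain string and assumes "-" as the gap character and "?" as the missing character; internal gaps are left untouched, and the handling of an all-gap sequence is an assumption.

```python
def with_termini_unknown(seq, gap="-", missing="?"):
    """Replace leading and trailing gaps with the missing character."""
    core = seq.strip(gap)
    if not core:
        return missing * len(seq)  # assumed behaviour for an all-gap sequence
    lead = len(seq) - len(seq.lstrip(gap))
    trail = len(seq) - len(seq.rstrip(gap))
    return missing * lead + core + missing * trail


remapped = with_termini_unknown("--MK-V---")
```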