Alignment#
- class Alignment(seqs_data: AlignedSeqsDataABC, slice_record: SliceRecord | None = None, **kwargs)#
A collection of aligned sequences.
- Attributes:
- annotation_db
array_positions
Returns a numpy array of positions, axis 0 is alignment positions, columns are in the order corresponding to names.
array_seqs
Returns a numpy array of sequences, axis 0 is seqs in order corresponding to names
- named_seqs
- names
- num_seqs
- positions
- seqs
Methods
add_feature
(*, biotype, name, spans[, ...])add feature on named sequence, or on the alignment itself
add_seqs
(seqs, **kwargs)Returns new collection with additional sequences.
alignment_quality
([app_name])Computes the alignment quality using the indicated app
apply_pssm
([pssm, path, background, ...])scores sequences using the specified pssm
apply_scaled_gaps
(other[, aa_to_codon])applies gaps in self to ungapped sequences
coevolution
([stat, segments, drawable, ...])performs pairwise coevolution measurement
copy
()creates new instance, only mutable attributes are copied
copy_annotations
(seq_db)copy annotations into attached annotation db
count_ambiguous_per_seq
()Return the counts of ambiguous characters per sequence as a DictArray.
count_gaps_per_pos
([include_ambiguity])return counts of gaps per position as a DictArray
count_gaps_per_seq
([induced_by, unique, ...])return counts of gaps per sequence as a DictArray
counts
([motif_length, include_ambiguity, ...])counts of motifs
counts_per_pos
([motif_length, ...])return DictArray of counts per position
counts_per_seq
([motif_length, ...])counts of non-overlapping motifs per sequence
deepcopy
(**kwargs)returns deep copy of self
degap
()Returns new SequenceCollection in which sequences have no gaps or missing characters.
distance_matrix
([calc, drop_invalid, parallel])Returns pairwise distances between sequences.
dotplot
([name1, name2, window, threshold, ...])make a dotplot between specified sequences.
entropy_per_pos
([motif_length, ...])returns Shannon entropy per position
entropy_per_seq
([motif_length, ...])returns the Shannon entropy per sequence
filtered
(predicate[, motif_length, ...])The alignment positions where predicate(column) is true.
from_rich_dict
(data)returns a new instance from a rich dict
get_ambiguous_positions
()Returns dict of seq:{position:char} for ambiguous chars.
get_degapped_relative_to
(name)Remove all columns with gaps in sequence with given name.
get_drawable
(*[, biotype, width, vertical, ...])make a figure from sequence features
get_drawables
(*[, biotype])returns a dict of drawables, keyed by type
get_features
(*[, seqid, biotype, name, ...])yields Feature instances
get_gap_array
([include_ambiguity])returns bool array with gap state True, False otherwise
get_gapped_seq
(seqname[, recode_gaps])Return a gapped Sequence object for the specified seqname.
get_identical_sets
([mask_degen])returns sets of names for sequences that are identical
get_lengths
([include_ambiguity, allow_gap])returns sequence lengths as a dict of {seqid: length}
get_motif_probs
([alphabet, ...])Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
get_position_indices
(f[, native, negate])Returns list of column indices for which f(col) is True.
get_projected_feature
(*, seqid, feature)returns an alignment feature projected onto the seqid sequence
get_projected_features
(*, seqid, **kwargs)projects all features from other sequences onto seqid
get_seq
(seqname[, copy_annotations])Return a Sequence object for the specified seqname.
get_seq_names_if
(f[, negate])Returns list of names of seqs where f(seq) is True.
get_similar
(target[, min_similarity, ...])Returns new SequenceCollection containing sequences similar to target.
get_translation
([gc, incomplete_ok, ...])translate sequences from nucleic acid to protein
has_terminal_stop
([gc, strict])Returns True if any sequence has a terminal stop codon.
information_plot
([width, height, window, ...])plot information per position
is_ragged
()by definition False for an Alignment
iter_positions
([pos_order])Iterates over positions in the alignment, in order.
iter_seqs
([seq_order])Iterates over sequences in the collection, in order.
iupac_consensus
([allow_gap])Returns string containing IUPAC consensus sequence of the alignment.
majority_consensus
()Returns consensus sequence containing most frequent item at each position.
make_feature
(*, feature[, on_alignment])create a feature on named sequence, or on the alignment itself
matching_ref
(ref_name, gap_fraction, gap_run)Returns new alignment with seqs well aligned with a reference.
no_degenerates
([motif_length, allow_gap])returns new alignment without degenerate characters
omit_bad_seqs
([quantile])Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
omit_gap_pos
([allowed_gap_frac, motif_length])Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
pad_seqs
([pad_length])Returns copy in which sequences are padded with the gap character to same length.
probs_per_pos
([motif_length, ...])returns MotifFreqsArray per position
probs_per_seq
([motif_length, ...])return MotifFreqsArray per sequence
quick_tree
([calc, bootstrap, drop_invalid, ...])Returns a phylogenetic tree.
rc
()Returns the reverse complement of all sequences in the alignment.
rename_seqs
(renamer)Returns new alignment with renamed sequences.
reverse_complement
()Returns the reverse complement of all sequences in the collection.
sample
(*[, n, with_replacement, ...])Returns random sample of positions from self, e.g. to bootstrap.
seqlogo
([width, height, wrap, vspace, colours])returns Drawable sequence logo using mutual information
set_repr_policy
([num_seqs, num_pos, ...])specify policy for repr(self)
sliding_windows
(window, step[, start, end])Generator yielding new alignments of given length and interval.
strand_symmetry
([motif_length])returns dict of strand symmetry test results per ungapped seq
take_positions
(cols[, negate])Returns new Alignment containing only specified positions.
take_positions_if
(f[, negate])Returns new Alignment containing cols where f(col) is True.
take_seqs
(names[, negate, copy_annotations])Returns new collection containing only specified seqs.
take_seqs_if
(f[, negate])Returns new collection containing seqs where f(seq) is True.
to_dict
([as_array])Return a dictionary of sequences.
to_dna
()returns copy of self as a collection of DNA moltype seqs
to_fasta
([block_size])Return collection in Fasta format.
to_html
([name_order, wrap, limit, ref_name, ...])returns html with embedded styles for sequence colouring
to_json
()returns json formatted string
to_moltype
(moltype)returns copy of self with changed moltype
to_phylip
()Return collection in PHYLIP format and mapping to sequence ids
to_pretty
([name_order, wrap])returns a string representation of the alignment in pretty print format
to_rich_dict
()returns a json serialisable dict
to_rna
()returns copy of self as a collection of RNA moltype seqs
trim_stop_codons
([gc, strict])Removes any terminal stop codons from the sequences
variable_positions
([include_gap_motif, ...])Return a list of variable position indexes.
with_masked_annotations
(biotypes[, ...])returns an alignment with regions replaced by mask_char
write
(filename[, file_format])Write the sequences to a file, preserving order of sequences.
gapped_by_map
to_type
Notes
Should be constructed using make_aligned_seqs().
- add_feature(*, biotype: str, name: str, spans: List[Tuple[int, int]], seqid: OptStr = None, parent_id: OptStr = None, strand: str = '+', on_alignment: OptBool = None) Feature #
add feature on named sequence, or on the alignment itself
- Parameters:
- seqid
sequence name, incompatible with on_alignment
- parent_id
name of the parent feature
- biotype
biological type, e.g. CDS
- name
name of the feature
- spans
plus strand coordinates of feature
- strand
‘+’ (default) or ‘-’
- on_alignment
the feature is in alignment coordinates, incompatible with setting seqid. Set to True if seqid not provided.
- Returns:
- Feature
- Raises:
- ValueError if seqid is not in the alignment, or if both seqid and on_alignment are specified.
- add_seqs(seqs: dict[str, str | bytes | ndarray[int]] | SeqsData | list, **kwargs) SequenceCollection #
Returns new collection with additional sequences.
- Parameters:
- seqs
sequences to add
- alignment_quality(app_name: str = 'ic_score', **kwargs)#
Computes the alignment quality using the indicated app
- Parameters:
- app_name
name of an alignment score calculating app, e.g. ‘ic_score’, ‘cogent3_score’, ‘sp_score’
- kwargs
keyword arguments to be passed to the app. Use cogent3.app_help(app_name) to see the available options.
- Returns:
- float or a NotCompleted instance if the score could not be computed
- property annotation_db#
- apply_pssm(pssm: PSSM = None, path: str | None = None, background: ndarray = None, pseudocount: int = 0, names: list | None = None, ui=None) array #
scores sequences using the specified pssm
- Parameters:
- pssm
A profile.PSSM instance, if not provided, will be loaded from path
- path
path to either a jaspar or cisbp matrix (path must have a suffix matching the format).
- background
background frequencies distribution
- pseudocount
adjustment for zero in matrix
- names
returns only scores for these sequences and in the name order
- Returns:
- numpy array of log2 based scores at every position
- apply_scaled_gaps(other: SequenceCollection, aa_to_codon: bool | None = None)#
applies gaps in self to ungapped sequences
- property array_positions: ndarray#
Returns a numpy array of positions, axis 0 is alignment positions, columns are in the order corresponding to names.
- property array_seqs: ndarray#
Returns a numpy array of sequences, axis 0 is seqs in order corresponding to names
- coevolution(stat: str = 'nmi', segments: list[tuple[int, int]] | None = None, drawable: str | None = None, show_progress: bool = False, parallel: bool = False, par_kw: dict | None = None)#
performs pairwise coevolution measurement
- Parameters:
- stat
coevolution metric, defaults to ‘nmi’ (Normalized Mutual Information). Other valid choices are ‘rmi’ (Resampled Mutual Information) and ‘mi’ (Mutual Information).
- segments
coordinates of the form [(start, end), …] where all possible pairs of alignment positions within and between segments are examined.
- drawable
if specified, the result object is capable of plotting the data as this plot type. Must be one of ‘box’, ‘heatmap’, ‘violin’.
- show_progress
shows a progress bar.
- parallel
run in parallel, according to arguments in par_kw.
- par_kw
dict of values for configuring parallel execution.
- Returns:
- DictArray of results with lower-triangular values. Upper triangular elements and estimates that could not be computed for numerical reasons are set as nan
- copy()#
creates new instance, only mutable attributes are copied
- copy_annotations(seq_db: SupportsFeatures) None #
copy annotations into attached annotation db
- Parameters:
- seq_db
compatible annotation db
Notes
Only copies annotations for records with seqid in self.names
- count_ambiguous_per_seq() DictArray #
Return the counts of ambiguous characters per sequence as a DictArray.
- count_gaps_per_pos(include_ambiguity: bool = True) DictArray #
return counts of gaps per position as a DictArray
- Parameters:
- include_ambiguity
if True, ambiguity characters that include the gap state are included
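As a rough illustration of what a per-position gap count means, the following pure-Python sketch tallies gap characters column by column. It does not use cogent3: the `aligned` dict and the `gap_counts_per_pos` helper are hypothetical names, and only `-` is treated as a gap, so the effect of include_ambiguity is not modelled.

```python
# Hypothetical toy alignment: every sequence has the same length.
aligned = {
    "s1": "AC-GT",
    "s2": "A--GT",
    "s3": "ACTG-",
}


def gap_counts_per_pos(seqs: dict[str, str], gap: str = "-") -> list[int]:
    """Count gap characters in each alignment column."""
    # zip(*values) transposes the rows (sequences) into columns (positions)
    return [column.count(gap) for column in zip(*seqs.values())]


counts = gap_counts_per_pos(aligned)
print(counts)
```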
- count_gaps_per_seq(induced_by: bool = False, unique: bool = False, include_ambiguity: bool = True, drawable: bool = False) DictArray #
return counts of gaps per sequence as a DictArray
- Parameters:
- induced_by
a gapped column is considered to be induced by a seq if the seq has a non-gap character in that column.
- unique
count is limited to gaps uniquely induced by each sequence
- include_ambiguity
if True, ambiguity characters that include the gap state are included
- drawable
if True, resulting object is capable of plotting data via specified plot type ‘bar’, ‘box’ or ‘violin’
- counts(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False) MotifCountsArray #
counts of motifs
- Parameters:
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
Notes
only non-overlapping motifs are counted
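The note on non-overlapping motifs can be illustrated with a small stand-alone sketch. It is not cogent3 code: `count_motifs` is a hypothetical helper, it handles a single plain string, it ignores the ambiguity and gap options, and dropping trailing characters that do not fill a whole motif is an assumption made for the illustration.

```python
from collections import Counter


def count_motifs(seq: str, motif_length: int = 1) -> Counter:
    """Count non-overlapping motifs of a fixed length in one sequence."""
    # Trailing characters that do not fill a whole motif are ignored (assumption).
    usable = len(seq) - len(seq) % motif_length
    # Step in strides of motif_length so that motifs never overlap.
    return Counter(
        seq[i : i + motif_length] for i in range(0, usable, motif_length)
    )


dinucs = count_motifs("ACACGTA", motif_length=2)
print(dict(dinucs))
```

With motif_length=2 the sequence is read as AC, AC, GT, so the overlapping CA and CG dinucleotides are never counted.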
- counts_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) DictArray #
return DictArray of counts per position
- Parameters:
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- warn
warns if motif_length > 1 and alignment trimmed to produce motif columns
- counts_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) MotifCountsArray #
counts of non-overlapping motifs per sequence
- Parameters:
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if False, all canonical states included
- warn
warns if motif_length > 1 and alignment trimmed to produce motif columns
- deepcopy(**kwargs)#
returns deep copy of self
Notes
Reduced to sliced sequences in self, kwargs are ignored. Annotation db is not copied if the alignment has been sliced.
- degap() SequenceCollection #
Returns new SequenceCollection in which sequences have no gaps or missing characters.
Notes
The returned collection will not retain an annotation_db if present.
- distance_matrix(calc: str = 'pdist', drop_invalid: bool = False, parallel: bool = False)#
Returns pairwise distances between sequences.
- Parameters:
- calc
a pairwise distance calculator name. Presently only ‘pdist’, ‘jc69’, ‘tn93’, ‘hamming’, ‘paralinear’ are supported.
- drop_invalid
If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.
- dotplot(name1: str | None = None, name2: str | None = None, window: int = 20, threshold: int | None = None, k: int | None = None, min_gap: int = 0, width: int = 500, title: str | None = None, rc: bool = False, biotype: str | tuple[str] = 'gene', show_progress: bool = False)#
make a dotplot between specified sequences. Random sequences chosen if names not provided.
- Parameters:
- name1, name2
names of sequences – if not provided, a random choice is made
- window
segment size for comparison between sequences
- threshold
windows where the sequences are identical >= threshold are a match
- k
size of k-mer to break sequences into. Larger values increase speed but reduce resolution. If not specified, and window == threshold, then k is set to window. Otherwise, it is computed as the maximum of {threshold // (window - threshold), 5}.
- min_gap
permitted gap for joining adjacent line segments, default is no gap joining
- width
figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)
- title
title for the plot
- rc
include dotplot of reverse complement also. Only applies to nucleic acid moltypes
- biotype
if selected sequences are annotated, display only these biotypes
- Returns:
- a Drawable or AnnotatedDrawable
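The rule stated for k when it is not specified can be written out as a small function. This is a sketch of the documented rule only, not the library's code: `default_kmer_size` is a hypothetical name, and it assumes both window and threshold are given with threshold no larger than window.

```python
def default_kmer_size(window: int, threshold: int) -> int:
    """Apply the documented rule for choosing k when k is not given."""
    if window == threshold:
        # exact matching over the whole window
        return window
    # otherwise the larger of threshold // (window - threshold) and 5
    return max(threshold // (window - threshold), 5)


print(default_kmer_size(20, 20))
print(default_kmer_size(20, 18))
print(default_kmer_size(20, 10))
```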
- entropy_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) ndarray #
returns Shannon entropy per position
- entropy_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = True, warn: bool = False) ndarray #
returns the Shannon entropy per sequence
- Parameters:
- motif_length
number of characters per tuple.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
- warn
warns if motif_length > 1 and alignment trimmed to produce motif columns
Notes
For motif_length > 1, it’s advisable to specify exclude_unobserved=True, this avoids unnecessary calculations.
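For reference, Shannon entropy over the observed character frequencies of one sequence can be computed as in the sketch below. It is independent of cogent3: `shannon_entropy` is a hypothetical helper, log base 2 is an assumption, and gaps or ambiguity codes are not filtered out.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits, of the character frequencies in seq."""
    counts = Counter(seq)
    total = sum(counts.values())
    # Only observed characters contribute: unobserved states have p = 0
    # and would add nothing to the sum.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("ACGT"))
print(shannon_entropy("AACC"))
```

Because unobserved states contribute zero, leaving them out does not change the result; it only avoids evaluating terms for motifs that never occur.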
- filtered(predicate: Callable[[Self], bool], motif_length: int = 1, drop_remainder: bool = True, **kwargs)#
The alignment positions where predicate(column) is true.
- Parameters:
- predicate
a callback function that takes a tuple of motifs and returns True/False
- motif_length
length of the motifs the sequences should be split into, eg. 3 for filtering aligned codons.
- drop_remainder
if the alignment length is not a multiple of motif_length, allow dropping the terminal remaining columns
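The interaction of predicate, motif_length and the dropped remainder can be sketched without cogent3 as follows. `filter_columns` and `no_gaps` are hypothetical names, the alignment is a plain dict of equal-length strings, and the remainder is always dropped in this sketch.

```python
aligned = {"s1": "ATGCC-GGAT", "s2": "ATG---GGCT"}


def filter_columns(seqs: dict[str, str], predicate, motif_length: int = 1):
    """Keep the motif columns for which predicate(column) is true."""
    length = len(next(iter(seqs.values())))
    usable = length - length % motif_length  # terminal remainder is dropped
    keep = []
    for start in range(0, usable, motif_length):
        # one motif per sequence, in sequence order
        column = tuple(s[start : start + motif_length] for s in seqs.values())
        if predicate(column):
            keep.append(start)
    return {
        name: "".join(s[i : i + motif_length] for i in keep)
        for name, s in seqs.items()
    }


def no_gaps(column) -> bool:
    """True if no motif in the column contains a gap."""
    return all("-" not in motif for motif in column)


kept = filter_columns(aligned, no_gaps, motif_length=3)
print(kept)
```

Here the alignment has 10 columns, so with motif_length=3 the final column is discarded and the gapped middle codon column is filtered out.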
- classmethod from_rich_dict(data: dict[str, str | dict[str, str]])#
returns a new instance from a rich dict
- gapped_by_map(keep: FeatureMap, **kwargs)#
- get_ambiguous_positions()#
Returns dict of seq:{position:char} for ambiguous chars.
Used in likelihood calculations.
- get_degapped_relative_to(name: str)#
Remove all columns with gaps in sequence with given name.
- Parameters:
- name
sequence name
Notes
The returned alignment will not retain an annotation_db if present.
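The column selection this describes can be sketched on plain strings. This is an illustration only: `degap_relative_to` is a hypothetical helper, not the library method, and only `-` is treated as a gap.

```python
aligned = {"ref": "AC--GT", "other": "ACTTG-"}


def degap_relative_to(seqs: dict[str, str], name: str, gap: str = "-"):
    """Drop every column in which the named sequence has a gap."""
    reference = seqs[name]
    keep = [i for i, char in enumerate(reference) if char != gap]
    return {n: "".join(s[i] for i in keep) for n, s in seqs.items()}


result = degap_relative_to(aligned, "ref")
print(result)
```

The named sequence ends up gap free, while the other sequences can still contain gaps in the columns that remain.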
- get_drawable(*, biotype: str | Iterable[str] | None = None, width: int = 600, vertical: int = False, title: OptStr = None)#
make a figure from sequence features
- Parameters:
- biotype
passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.
- width
width in pixels
- vertical
rotates the drawable
- title
title for the plot
- Returns:
- a Drawable instance
- get_drawables(*, biotype: str | Iterable[str] | None = None) dict #
returns a dict of drawables, keyed by type
- Parameters:
- biotype
passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.
- get_features(*, seqid: str | None = None, biotype: str | None = None, name: str | None = None, on_alignment: bool | None = None, allow_partial: bool = False) Iterator[Feature] #
yields Feature instances
- Parameters:
- seqid
limit search to features on this named sequence, defaults to search all
- biotype
biotype of the feature, e.g. CDS, gene
- name
name of the feature
- on_alignment
limit query to features on Alignment, ignores sequences. Ignored on SequenceCollection instances.
- allow_partial
allow features partially overlapping self
Notes
When dealing with a nucleic acid moltype, the returned features will yield a sequence segment that is consistently oriented irrespective of strand of the current instance.
- get_gap_array(include_ambiguity: bool = True) ndarray #
returns bool array with gap state True, False otherwise
- Parameters:
- include_ambiguity
if True, ambiguity characters that include the gap state are included
- get_gapped_seq(seqname: str, recode_gaps: bool = False) Sequence #
Return a gapped Sequence object for the specified seqname.
- Parameters:
- seqname
sequence name
- recode_gaps
if True, gap characters are replaced by the most general ambiguity code, e.g. N for DNA and RNA
Notes
This method breaks the connection to the annotation database.
- get_identical_sets(mask_degen: bool = False) list[set] #
returns sets of names for sequences that are identical
- Parameters:
- mask_degen
if True, degenerate characters are ignored
- get_lengths(include_ambiguity: bool = False, allow_gap: bool = False) dict[str, int] #
returns sequence lengths as a dict of {seqid: length}
- Parameters:
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- get_motif_probs(alphabet: AlphabetABC = None, include_ambiguity: bool = False, exclude_unobserved: bool = False, allow_gap: bool = False, pseudocount: int = 0) dict #
Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
- Parameters:
- alphabet
alphabet to use for motifs
- include_ambiguity
if True resolved ambiguous codes are included in estimation of frequencies.
- exclude_unobserved
if True, motifs that are not present in the alignment are excluded from the returned dictionary.
- allow_gap
allow gap motif
- pseudocount
value to add to each count
Notes
only non-overlapping motifs are counted
- get_position_indices(f: Callable, native: bool = False, negate: bool = False) list[int] #
Returns list of column indices for which f(col) is True.
- f
function that returns true/false given an alignment position
- native
if True, f is provided with slice of array, otherwise the string is used
- negate
if True, not f() is used
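A minimal stand-alone sketch of selecting column indices with a function, including the effect of negate. `position_indices` and `is_variable` are hypothetical names, and each column is passed to f as a tuple of characters, which is an assumption made for the illustration rather than the form cogent3 uses.

```python
aligned = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGT"}


def position_indices(seqs: dict[str, str], f, negate: bool = False) -> list[int]:
    """Indices of columns where f(column) is True, or False if negate is set."""
    return [
        index
        for index, column in enumerate(zip(*seqs.values()))
        if bool(f(column)) != negate
    ]


def is_variable(column) -> bool:
    """True if the column holds more than one distinct character."""
    return len(set(column)) > 1


variable = position_indices(aligned, is_variable)
constant = position_indices(aligned, is_variable, negate=True)
print(variable, constant)
```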
- get_projected_feature(*, seqid: str, feature: Feature) Feature #
returns an alignment feature projected onto the seqid sequence
- Parameters:
- seqid
name of the sequence to project the feature onto
- feature
a Feature, bound to self, that will be projected
- Returns:
- a new Feature bound to seqid
Notes
The alignment coordinates of feature are converted into the seqid sequence coordinates and the object is bound to that sequence.
The feature is added to the annotation_db.
- get_projected_features(*, seqid: str, **kwargs) list[Feature] #
projects all features from other sequences onto seqid
- get_seq(seqname: str, copy_annotations: bool = False) Sequence #
Return a Sequence object for the specified seqname.
- Parameters:
- seqname
name of the sequence to return
- copy_annotations
if True, only the annotations for the specified sequence are copied to the annotation database of the Sequence object. If False, all annotations are copied.
- get_seq_names_if(f: Callable[[Sequence], bool], negate: bool = False)#
Returns list of names of seqs where f(seq) is True.
- Parameters:
- f
function that takes a sequence object and returns True or False
- negate
select all sequences EXCEPT those where f(seq) is True
Notes
Sequence objects can be converted into strings or numpy arrays using str() and numpy.array() respectively.
- get_similar(target: Sequence, min_similarity: float = 0.0, max_similarity: float = 1.0, metric: Callable[[Sequence, Sequence], float] = <cogent3.util.transform.for_seq object>, transform: bool | None = None) SequenceCollection #
Returns new SequenceCollection containing sequences similar to target.
- Parameters:
- target
sequence object to compare to. Can be in the collection.
- min_similarity
minimum similarity that will be kept. Default 0.0.
- max_similarity
maximum similarity that will be kept. Default 1.0.
- metric
a similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence class has many methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.
- transform
transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.
Notes
both min_similarity and max_similarity are inclusive.
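The inclusive bounds can be seen in a small sketch that uses fraction identity as the metric. It operates on plain strings, not Sequence objects: `frac_same` and `similar_names` are hypothetical helpers and no transform is applied.

```python
aligned = {"s1": "ACGT", "s2": "ACGA", "s3": "TTTT"}


def frac_same(first: str, second: str) -> float:
    """Fraction of aligned positions with identical characters."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def similar_names(seqs, target, min_similarity=0.0, max_similarity=1.0):
    """Names of sequences whose similarity to target lies within the bounds."""
    return [
        name
        for name, seq in seqs.items()
        # both bounds are inclusive
        if min_similarity <= frac_same(seq, target) <= max_similarity
    ]


print(similar_names(aligned, "ACGT", min_similarity=0.75))
print(similar_names(aligned, "ACGT", max_similarity=0.75))
```

s2 is exactly 0.75 identical to the target, so it appears in both results.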
- get_translation(gc: int | None = None, incomplete_ok: bool = False, include_stop: bool = False, trim_stop: bool = True, **kwargs)#
translate sequences from nucleic acid to protein
- Parameters:
- gc
genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)
- incomplete_ok
if True, codons that are a mix of nucleotides and gaps are converted to ‘?’. If False, a ValueError is raised for such codons
- include_stop
whether to allow stops in the translated sequence
- trim_stop
exclude terminal stop codons if they exist
- kwargs
related to construction of the resulting object
- Returns:
- A new instance of self translated into protein
Notes
Translating will break the relationship to an annotation_db if present.
- has_terminal_stop(gc: Any = None, strict: bool = False) bool #
Returns True if any sequence has a terminal stop codon.
- Parameters:
- gc
valid input to cogent3.get_code(), a genetic code object, number or name
- strict
If True, raises an exception if a seq length not divisible by 3
- information_plot(width: int | None = None, height: int | None = None, window: int | None = None, stat: str = 'median', include_gap: bool = True)#
plot information per position
- Parameters:
- width
figure width in pixels
- height
figure height in pixels
- window
used for smoothing line, defaults to sqrt(length)
- stat
‘mean’ or ‘median’, used as the summary statistic for each window
- include_gap
whether to include gap counts, shown on right y-axis
- is_ragged() bool #
by definition False for an Alignment
- iter_positions(pos_order: list | None = None) Iterator[list, list, list] #
Iterates over positions in the alignment, in order.
- Parameters:
- pos_order
list of indices specifying the column order. If None, the positions are iterated in order.
- Returns:
- yields lists of elements for each position (column) in the alignment
- iter_seqs(seq_order: list | None = None) Iterator[Sequence | SeqViewABC] #
Iterates over sequences in the collection, in order.
- Parameters:
- seq_order
list of seqids giving the order in which seqs will be returned. Defaults to self.names
- iupac_consensus(allow_gap: bool = True) str #
Returns string containing IUPAC consensus sequence of the alignment.
- majority_consensus() Sequence #
Returns consensus sequence containing most frequent item at each position.
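A majority consensus picks the most frequent character in each column. The sketch below shows the idea on plain strings: `majority_consensus_str` is a hypothetical helper, and how ties are resolved here (first character encountered) is an artefact of `Counter`, not a statement about cogent3.

```python
from collections import Counter

aligned = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGT"}


def majority_consensus_str(seqs: dict[str, str]) -> str:
    """Most frequent character at each alignment position."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*seqs.values())
    )


consensus = majority_consensus_str(aligned)
print(consensus)
```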
- make_feature(*, feature: FeatureDataType, on_alignment: bool | None = None) Feature #
create a feature on named sequence, or on the alignment itself
- Parameters:
- feature
a dict with all the necessary data to construct a feature
- on_alignment
the feature is in alignment coordinates, incompatible with setting ‘seqid’. Set to True if ‘seqid’ not provided.
- Returns:
- Feature
- Raises:
- ValueError if ‘seqid’ is not in the alignment, or if both ‘seqid’ and on_alignment are specified.
Notes
To get a feature AND add it to annotation_db, use add_feature().
- matching_ref(ref_name: str, gap_fraction: float, gap_run: int)#
Returns new alignment with seqs well aligned with a reference.
- Parameters:
- ref_name
name of the sequence to use as the reference
- gap_fraction
fraction of positions that either have a gap in the template but not in the seq or in the seq but not in the template
- gap_run
number of consecutive gaps tolerated in query relative to sequence or sequence relative to query
- property named_seqs: SeqsDataABC#
- property names: list#
- no_degenerates(motif_length: int = 1, allow_gap: bool = False)#
returns new alignment without degenerate characters
- Parameters:
- motif_length
sequences are segmented into units of this size and the segments are excluded if they contain degenerate characters.
- allow_gap
whether gaps are allowed or whether they are treated as a degenerate character (latter is default, as most evolutionary modelling treats gaps as N).
- property num_seqs: int#
- omit_bad_seqs(quantile: float | None = None)#
Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
Uses count_gaps_per_seq(unique=True) to obtain the counts of gaps uniquely introduced by a sequence. The cutoff is the quantile of this distribution.
- Parameters:
- quantile
sequences whose unique gap count is in a quantile larger than this cutoff are excluded. The default quantile is (num_seqs - 1) / num_seqs
- omit_gap_pos(allowed_gap_frac: float | None = None, motif_length: int = 1)#
Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
- Parameters:
- allowed_gap_frac
specifies the proportion of gaps allowed in each column. Set to 0 to exclude columns with any gaps, 1 to include all columns. Default is None which is equivalent to (num_seqs-1)/num_seqs and leads to elimination of columns that are only gaps.
- motif_length
sets the “column” width, e.g. setting to 3 corresponds to codons. A motif that includes a gap at any position is included in the counting.
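The effect of allowed_gap_frac, including the default of (num_seqs-1)/num_seqs, can be sketched as follows. This is a pure-Python illustration for motif_length=1 only: `omit_gap_columns` is a hypothetical helper and only `-` counts as a gap.

```python
aligned = {"s1": "A-C-", "s2": "A-CG", "s3": "A--G"}


def omit_gap_columns(seqs: dict[str, str], allowed_gap_frac=None, gap: str = "-"):
    """Keep columns whose gap fraction is <= allowed_gap_frac."""
    num_seqs = len(seqs)
    if allowed_gap_frac is None:
        # documented default: only all-gap columns exceed this value
        allowed_gap_frac = (num_seqs - 1) / num_seqs
    keep = [
        index
        for index, column in enumerate(zip(*seqs.values()))
        if column.count(gap) / num_seqs <= allowed_gap_frac
    ]
    return {name: "".join(s[i] for i in keep) for name, s in seqs.items()}


print(omit_gap_columns(aligned))
print(omit_gap_columns(aligned, allowed_gap_frac=0))
```

With the default, only the second column (all gaps) is removed. With allowed_gap_frac=0, only the first column (no gaps) survives.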
- pad_seqs(pad_length: int | None = None)#
Returns copy in which sequences are padded with the gap character to same length.
- Parameters:
- pad_length
Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.
- property positions#
- probs_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) MotifFreqsArray #
returns MotifFreqsArray per position
- probs_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) MotifFreqsArray #
return MotifFreqsArray per sequence
- Parameters:
- motif_length
number of characters per tuple.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
- warn
warns if motif_length > 1 and alignment trimmed to produce motif columns
- quick_tree(calc: str = 'pdist', bootstrap: int | None = None, drop_invalid: bool = False, parallel: bool = False, show_progress: bool = False, ui=None)#
Returns a phylogenetic tree.
- Parameters:
- calc
a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances
- bootstrap
Number of non-parametric bootstrap replicates. Resamples alignment columns with replacement and builds a phylogeny for each such resampling.
- drop_invalid
If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.
- parallel
parallel execution of distance calculations
- show_progress
controls progress display for distance calculation
- Returns:
- a phylogenetic tree. If bootstrap is specified, returns the weighted majority consensus. Support for each node is stored as edge.params[‘params’].
Notes
Sequences in the observed alignment for which distances could not be computed are omitted. Bootstrap replicates are required to have distances for all seqs present in the observed data distance matrix.
- rc()#
Returns the reverse complement of all sequences in the alignment. A synonym for reverse_complement.
- rename_seqs(renamer: Callable[[str], str])#
Returns new alignment with renamed sequences.
- reverse_complement()#
Returns the reverse complement of all sequences in the collection. A synonym for rc.
- sample(*, n: int | None = None, with_replacement: bool = False, motif_length: int = 1, randint=<bound method RandomState.randint of RandomState(MT19937)>, permutation=<bound method RandomState.permutation of RandomState(MT19937)>)#
Returns random sample of positions from self, e.g. to bootstrap.
- Parameters:
- n
number of positions to sample. If None, all positions are sampled.
- with_replacement
if True, samples with replacement.
- motif_length
number of positions to sample as a single motif.
- randint
random number generator, default is numpy.random.randint
- permutation
function to generate a random permutation of positions, default is numpy.random.permutation
Notes
By default (resampling all positions without replacement), generates a permutation of the positions of the alignment.
Setting with_replacement to True and otherwise leaving parameters as defaults generates a standard bootstrap resampling of the alignment.
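The difference between the two modes described in the notes can be sketched with the standard library. This is not the cogent3 implementation: `sample_columns` is a hypothetical helper, it uses `random.Random` in place of the numpy functions, and motif_length is not modelled. The second sequence is a lowercase copy of the first so that column-wise sampling is easy to verify.

```python
import random

aligned = {"s1": "ACGTAC", "s2": "acgtac"}


def sample_columns(seqs, n=None, with_replacement=False, rng=None):
    """Sample alignment columns; the same columns are used for every sequence."""
    rng = rng or random.Random()
    length = len(next(iter(seqs.values())))
    n = length if n is None else n
    if with_replacement:
        # bootstrap-style: a column can be drawn more than once
        cols = [rng.randrange(length) for _ in range(n)]
    else:
        # with n == length this is a permutation of the columns
        cols = rng.sample(range(length), n)
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}


permuted = sample_columns(aligned, rng=random.Random(42))
bootstrap = sample_columns(aligned, with_replacement=True, rng=random.Random(42))
print(permuted, bootstrap)
```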
- seqlogo(width: float = 700, height: float = 100, wrap: int | None = None, vspace: float = 0.005, colours: dict | None = None)#
returns Drawable sequence logo using mutual information
- Parameters:
- width, height
plot dimensions in pixels
- wrap
number of alignment columns per row
- vspace
vertical separation between rows, as a proportion of total plot
- colours
mapping of characters to colours. If not provided, defaults to custom for everything except protein, which uses protein moltype colours.
Notes
Computes MI based on log2 and includes the gap state, so the maximum possible value is -log2(1/num_states)
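The upper bound quoted in the Notes is easy to evaluate. The sketch below assumes that num_states counts the canonical characters plus one gap state (for example 5 for DNA); that reading of num_states is an assumption, and `max_information` is a hypothetical helper.

```python
import math


def max_information(num_states: int) -> float:
    """Upper bound from the Notes: -log2(1/num_states), which equals log2(num_states)."""
    return -math.log2(1 / num_states)


dna_with_gap = max_information(5)       # assumed states: A, C, G, T and the gap
protein_with_gap = max_information(21)  # assumed states: 20 amino acids and the gap
print(round(dna_with_gap, 3), round(protein_with_gap, 3))
```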
- property seqs: _IndexableSeqs#
- set_repr_policy(num_seqs: int | None = None, num_pos: int | None = None, ref_name: int | None = None, wrap: int | None = None) None #
specify policy for repr(self)
- Parameters:
- num_seqs
number of sequences to include in represented display.
- num_pos
length of sequences to include in represented display.
- ref_name
name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.
- wrap
number of printed bases per row
- sliding_windows(window: int, step: int, start: int | None = None, end: int | None = None)#
Generator yielding new alignments of given length and interval.
- Parameters:
- window
The length of each returned alignment.
- step
The interval between the start of the successive windows.
- start
first window start position
- end
last window start position
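How window, step, start and end interact can be sketched on a plain string. The helper `window_slices` is hypothetical. Two assumptions are made here: end is treated as an exclusive bound on the window start (as with Python's `range`), and windows that would run past the end of the alignment are not produced.

```python
def window_slices(length, window, step, start=None, end=None):
    """Return (begin, stop) index pairs for successive full-length windows."""
    start = 0 if start is None else start
    # one past the last start position at which a full window still fits
    limit = length - window + 1
    end = limit if end is None else min(end, limit)
    return [(pos, pos + window) for pos in range(start, end, step)]


seq = "ACGTACGTAC"
windows = [seq[a:b] for a, b in window_slices(len(seq), window=4, step=3)]
print(windows)
```

With a step smaller than the window, consecutive windows overlap, as they do in this example.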
- strand_symmetry(motif_length: int = 1)#
returns dict of strand symmetry test results per ungapped seq
- take_positions(cols: list, negate: bool = False)#
Returns new Alignment containing only specified positions.
- Parameters:
- cols
list of column indices to keep
- negate
if True, all columns except those in cols are kept
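The effect of cols and negate can be sketched on a dict of equal-length strings. The helper `take_columns` is hypothetical and stands in for the method only to show the selection logic.

```python
def take_columns(aligned, cols, negate=False):
    """Keep the listed column indices, or every other column when negate is True."""
    length = len(next(iter(aligned.values())))
    if negate:
        excluded = set(cols)
        cols = [i for i in range(length) if i not in excluded]
    return {name: "".join(seq[i] for i in cols) for name, seq in aligned.items()}


aligned = {"s1": "ACGT-A", "s2": "AC-TTA"}
kept = take_columns(aligned, [0, 2, 4])
dropped = take_columns(aligned, [0, 2, 4], negate=True)
print(kept, dropped)
```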
- take_positions_if(f, negate=False)#
Returns new Alignment containing cols where f(col) is True.
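A predicate-based selection might look like the following sketch on plain strings. The helper `take_columns_if` and the predicate `no_gaps` are hypothetical, and representing a column as a tuple of characters is an assumption; the object actually passed to f by the library may differ.

```python
def take_columns_if(aligned, f, negate=False):
    """Keep columns for which f(column) is truthy (or falsy when negate is True)."""
    names = list(aligned)
    columns = zip(*(aligned[n] for n in names))  # one tuple of characters per column
    keep = [col for col in columns if bool(f(col)) != negate]
    return {n: "".join(col[i] for col in keep) for i, n in enumerate(names)}


def no_gaps(col):
    """True when no sequence has a gap in this column."""
    return "-" not in col


aligned = {"s1": "ACGT-A", "s2": "AC-TTA"}
ungapped = take_columns_if(aligned, no_gaps)
gapped_only = take_columns_if(aligned, no_gaps, negate=True)
print(ungapped, gapped_only)
```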
- take_seqs(names: str | Sequence[str], negate: bool = False, copy_annotations: bool = False, **kwargs)#
Returns new collection containing only specified seqs.
- Parameters:
- names
sequences to select (or exclude if negate=True)
- negate
select all sequences EXCEPT names
- kwargs
keyword arguments to be passed to the constructor of the new collection
- copy_annotations
if True, only annotations from selected seqs are copied to the annotation_db of the new collection
- take_seqs_if(f: Callable[[Sequence], bool], negate: bool = False)#
Returns new collection containing seqs where f(seq) is True.
- Parameters:
- f
function that takes a sequence object and returns True or False
- negate
select all sequences EXCEPT those where f(seq) is True
Notes
Sequence objects can be converted into strings or numpy arrays using str() and numpy.array() respectively.
- to_dict(as_array: bool = False) dict[str, str | ndarray] #
Return a dictionary of sequences.
- Parameters:
- as_array
if True, sequences are returned as numpy arrays, otherwise as strings
- to_dna()#
returns copy of self as a collection of DNA moltype seqs
- to_fasta(block_size: int = 60) str #
Return collection in Fasta format.
- Parameters:
- block_size
the sequence length to write to each line, by default 60
- Returns:
- The collection in Fasta format.
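The role of block_size can be shown with a small stand-alone formatter. `to_fasta_str` is a hypothetical helper, and details such as the trailing newline are assumptions rather than documented behaviour of the method.

```python
def to_fasta_str(seqs, block_size=60):
    """Format a name -> sequence mapping as FASTA, wrapping at block_size characters."""
    lines = []
    for name, seq in seqs.items():
        lines.append(f">{name}")
        lines.extend(seq[i : i + block_size] for i in range(0, len(seq), block_size))
    return "\n".join(lines) + "\n"


fasta = to_fasta_str({"s1": "ACGTACGTAC", "s2": "ACGT--GTAC"}, block_size=4)
print(fasta)
```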
- to_html(name_order: Sequence[str] | None = None, wrap: int = 60, limit: int | None = None, ref_name: str = 'longest', colors: Mapping[str, str] | None = None, font_size: int = 12, font_family: str = 'Lucida Console') str #
returns html with embedded styles for sequence colouring
- Parameters:
- name_order
order of names for display.
- wrap
number of alignment columns per row
- limit
truncate alignment to this length
- ref_name
Name of an existing sequence or ‘longest’. If the latter, the longest sequence (excluding gaps and ambiguities) is selected as the reference.
- colors
mapping of {character: colour}. If not provided, defaults to the colours of the moltype.
- font_size
in points. Affects labels and sequence and line spacing (proportional to value)
- font_family
string denoting font family
Examples
In a jupyter notebook, this code is used to provide the representation.
aln # is rendered by jupyter
You can directly use the result for display in a notebook as
from IPython.display import HTML
HTML(aln.to_html())
- to_json()#
returns json formatted string
- to_moltype(moltype: str | MolType) SequenceCollection #
returns copy of self with changed moltype
- Parameters:
- moltype
name of the new moltype, e.g. ‘dna’, ‘rna’.
Notes
Cannot convert from nucleic acids to proteins. Use get_translation() for that.
- to_phylip()#
Return collection in PHYLIP format and mapping to sequence ids
Notes
raises exception if sequences do not all have the same length
- to_pretty(name_order=None, wrap=None)#
returns a string representation of the alignment in pretty print format
- Parameters:
- name_order
order of names for display.
- wrap
maximum number of printed bases
- to_rich_dict() dict[str, str | dict[str, str]] #
returns a json serialisable dict
- to_rna()#
returns copy of self as a collection of RNA moltype seqs
- to_type(**kwargs)#
- trim_stop_codons(gc: Any = None, strict: bool = False, **kwargs)#
Removes any terminal stop codons from the sequences
- Parameters:
- gc
valid input to cogent3.get_code(), a genetic code object, number or name, defaults to standard code
- strict
If True, raises an exception if a sequence length is not divisible by 3
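The trimming logic and the strict flag can be sketched for a single ungapped string. This is only an illustration: `trim_terminal_stop` is a hypothetical helper, the stop codons are those of the standard genetic code, gapped sequences are not handled, and returning the sequence unchanged when its length is not a multiple of 3 (with strict=False) is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def trim_terminal_stop(seq, strict=False):
    """Remove one terminal stop codon from an ungapped DNA string, if present."""
    if len(seq) % 3:
        if strict:
            raise ValueError("sequence length is not divisible by 3")
        return seq  # assumption: an incomplete final codon means nothing is trimmed
    if seq[-3:].upper() in STOP_CODONS:
        return seq[:-3]
    return seq


print(trim_terminal_stop("ATGAAATAA"))
```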
- variable_positions(include_gap_motif: bool = True, include_ambiguity: bool = False, motif_length: int = 1) tuple[int] #
Return a tuple of variable position indices.
- Parameters:
- include_gap_motif
if False, sequences with a gap motif in a column are ignored.
- include_ambiguity
if True, all states are considered.
- motif_length
if any position within a motif is variable, the entire motif is considered variable.
- Returns:
- tuple of integers. If motif_length > 1, the returned positions are motif_length long sequential indices.
Notes
Truncates alignment to be modulo motif_length.
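The following sketch mirrors the parameter descriptions on a dict of equal-length strings. `variable_columns` is a hypothetical helper, include_ambiguity is not modelled, and treating "-" as the only gap character is an assumption.

```python
def variable_columns(aligned, include_gap_motif=True, motif_length=1):
    """Return indices of positions that fall in motifs differing between sequences."""
    seqs = list(aligned.values())
    length = len(seqs[0])
    length -= length % motif_length  # truncate to a multiple of motif_length
    result = []
    for start in range(0, length, motif_length):
        motifs = {s[start : start + motif_length] for s in seqs}
        if not include_gap_motif:
            # ignore sequences whose motif contains a gap in this column
            motifs = {m for m in motifs if "-" not in m}
        if len(motifs) > 1:
            # the whole motif is reported as variable
            result.extend(range(start, start + motif_length))
    return tuple(result)


aligned = {"s1": "ACGTAA", "s2": "ACGAA-", "s3": "ACGTAA"}
print(variable_columns(aligned))
print(variable_columns(aligned, motif_length=3))
```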
- with_masked_annotations(biotypes: Sequence[str], mask_char: str = '?', shadow: bool = False)#
returns an alignment with regions replaced by mask_char
- Parameters:
- biotypes
annotation type(s)
- mask_char
must be a character valid for the moltype. The default value is the most ambiguous character, e.g. ‘?’ for DNA
- shadow
If True, masks everything but the biotypes
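The interplay of mask_char and shadow can be illustrated on one string with explicit spans. Everything here is hypothetical: `mask_spans` is not a cogent3 function, and the (start, end) spans stand in for the regions that annotations of the requested biotypes would cover.

```python
def mask_spans(seq, spans, mask_char="?", shadow=False):
    """Replace covered positions with mask_char, or uncovered ones when shadow is True."""
    covered = [False] * len(seq)
    for start, end in spans:
        for i in range(start, min(end, len(seq))):
            covered[i] = True
    # mask covered positions normally; with shadow=True mask everything else instead
    return "".join(mask_char if is_cov != shadow else ch for ch, is_cov in zip(seq, covered))


masked = mask_spans("ACGTACGT", [(2, 5)])
shadowed = mask_spans("ACGTACGT", [(2, 5)], shadow=True)
print(masked, shadowed)
```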
- write(filename: str, file_format: str | None = None, **kwargs) None #
Write the sequences to a file, preserving order of sequences.
- Parameters:
- filename
name of the sequence file
- file_format
format of the sequence file
Notes
If file_format is None, will attempt to infer format from the filename suffix.