Alignment#

class Alignment(seqs_data: AlignedSeqsDataABC, slice_record: SliceRecord | None = None, **kwargs)#

A collection of aligned sequences.

Attributes:

annotation_db: the annotation database for the collection
array_positions: Returns a numpy array of positions; axis 0 is alignment positions, with columns in the order corresponding to names.
array_seqs: Returns a numpy array of sequences, axis 0 is seqs in order corresponding to names
modified: collection is a modification of underlying storage
name_map: returns mapping of seq names to parent seq names
names: returns the names of the sequences in the collection
num_seqs: the number of sequences in the collection
positions
seqs: iterable of sequences in the collection
storage: the aligned sequence storage instance of the collection

Methods

`add_feature`(*, biotype, name, spans[, ...])	add feature on named sequence, or on the alignment itself
`add_seqs`(seqs, **kwargs)	Returns new collection with additional sequences.
`alignment_quality`([app_name])	Computes the alignment quality using the indicated app
`apply_pssm`([pssm, path, background, ...])	scores sequences using the specified pssm
`apply_scaled_gaps`(other[, aa_to_codon])	applies gaps in self to ungapped sequences
`coevolution`([stat, segments, drawable, ...])	performs pairwise coevolution measurement
`copy`([copy_annotations])	creates new instance, only mutable attributes are copied
`copy_annotations`(seq_db)	copy annotations into attached annotation db
`count_ambiguous_per_seq`()	Return the counts of ambiguous characters per sequence as a DictArray.
`count_gaps_per_pos`([include_ambiguity])	return counts of gaps per position as a DictArray
`count_gaps_per_seq`([induced_by, unique, ...])	return counts of gaps per sequence as a DictArray
`counts`([motif_length, include_ambiguity, ...])	counts of motifs
`counts_per_pos`([motif_length, ...])	return DictArray of counts per position
`counts_per_seq`([motif_length, ...])	counts of non-overlapping motifs per sequence
`deepcopy`(**kwargs)	returns deep copy of self
`degap`([storage_backend])	returns collection sequences without gaps or missing characters.
`distance_matrix`([calc, drop_invalid, parallel])	Returns pairwise distances between sequences.
`dotplot`([name1, name2, window, threshold, ...])	make a dotplot between specified sequences.
`drop_duplicated_seqs`()	returns self without duplicated sequences
`duplicated_seqs`()	returns the names of duplicated sequences
`entropy_per_pos`([motif_length, ...])	returns shannon entropy per position
`entropy_per_seq`([motif_length, ...])	returns the Shannon entropy per sequence
`filtered`(predicate[, motif_length, ...])	The alignment positions where predicate(column) is true.
`from_rich_dict`(data)	returns a new instance from a rich dict
`get_ambiguous_positions`()	Returns dict of seq:{position:char} for ambiguous chars.
`get_degapped_relative_to`(name)	Remove all columns with gaps in sequence with given name.
`get_drawable`(*[, biotype, width, vertical, ...])	make a figure from sequence features
`get_drawables`(*[, biotype])	returns a dict of drawables, keyed by type
`get_features`(*[, seqid, biotype, name, ...])	yields Feature instances
`get_gap_array`([include_ambiguity])	returns bool array with gap state True, False otherwise
`get_gapped_seq`(seqname[, recode_gaps])	Return a gapped Sequence object for the specified seqname.
`get_identical_sets`([mask_degen])	returns sets of names for sequences that are identical
`get_lengths`([include_ambiguity, allow_gap])	returns sequence lengths as a dict of {seqid: length}
`get_motif_probs`([alphabet, ...])	Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
`get_position_indices`(f[, negate])	Returns list of column indices for which f(col) is True.
`get_projected_feature`(*, seqid, feature)	returns an alignment feature projected onto the seqid sequence
`get_projected_features`(*, seqid, **kwargs)	projects all features from other sequences onto seqid
`get_seq`(seqname[, copy_annotations])	Return a Sequence object for the specified seqname.
`get_seq_names_if`(f[, negate])	Returns list of names of seqs where f(seq) is True.
`get_similar`(target[, min_similarity, ...])	Returns new SequenceCollection containing sequences similar to target.
`get_translation`([gc, incomplete_ok, ...])	translate sequences from nucleic acid to protein
`has_terminal_stop`([gc, strict])	Returns True if any sequence has a terminal stop codon.
`information_plot`([width, height, window, ...])	plot information per position
`is_ragged`()	by definition False for an Alignment
`iter_positions`([pos_order])	Iterates over positions in the alignment, in order.
`iter_seqs`([seq_order])	Iterates over sequences in the collection, in order.
`iupac_consensus`([allow_gap])	Returns string containing IUPAC consensus sequence of the alignment.
`majority_consensus`()	Returns consensus sequence containing most frequent item at each position.
`make_feature`(*, feature[, on_alignment])	create a feature on named sequence, or on the alignment itself
`matching_ref`(ref_name, gap_fraction, gap_run)	Returns new alignment with seqs well aligned with a reference.
`no_degenerates`([motif_length, allow_gap])	returns new alignment without degenerate characters
`omit_bad_seqs`([quantile])	Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
`omit_gap_pos`([allowed_gap_frac, motif_length])	Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
`pad_seqs`([pad_length])	Returns copy in which sequences are padded with the gap character to same length.
`probs_per_pos`([motif_length, ...])	returns MotifFreqsArray per position
`probs_per_seq`([motif_length, ...])	return MotifFreqsArray per sequence
`quick_tree`([calc, drop_invalid, parallel, ...])	Returns a phylogenetic tree.
`rc`()	Returns the reverse complement of all sequences in the alignment.
`rename_seqs`(renamer)	Returns new alignment with renamed sequences.
`replace_annotation_db`(value[, check])	public interface to assigning the annotation_db
`reverse_complement`()	Returns the reverse complement of all sequences in the collection.
`sample`(*[, n, with_replacement, ...])	Returns random sample of positions from self, e.g. to bootstrap.
`seqlogo`([width, height, wrap, vspace, colours])	returns Drawable sequence logo using mutual information
`set_repr_policy`([num_seqs, num_pos, ...])	specify policy for repr(self)
`sliding_windows`(window, step[, start, end])	Generator yielding new alignments of given length and interval.
`strand_symmetry`([motif_length])	returns dict of strand symmetry test results per ungapped seq
`take_positions`(cols[, negate])	Returns new Alignment containing only specified positions.
`take_positions_if`(f[, negate])	Returns new Alignment containing cols where f(col) is True.
`take_seqs`(names[, negate, copy_annotations])	Returns new collection containing only specified seqs.
`take_seqs_if`(f[, negate])	Returns new collection containing seqs where f(seq) is True.
`to_dict`([as_array])	Return a dictionary of sequences.
`to_dna`()	returns copy of self as a collection of DNA moltype seqs
`to_fasta`([block_size])	Return collection in Fasta format.
`to_html`([name_order, wrap, limit, ref_name, ...])	returns html with embedded styles for sequence colouring
`to_json`()	returns json formatted string
`to_moltype`(moltype)	returns copy of self with changed moltype
`to_phylip`()	Return collection in PHYLIP format and mapping to sequence ids
`to_pretty`([name_order, wrap])	returns a string representation of the alignment in pretty print format
`to_rich_dict`()	returns a json serialisable dict
`to_rna`()	returns copy of self as a collection of RNA moltype seqs
`trim_stop_codons`([gc, strict])	Removes any terminal stop codons from the sequences
`variable_positions`([include_gap_motif, ...])	Return a list of variable position indexes.
`with_masked_annotations`(biotypes[, ...])	returns an alignment with regions replaced by mask_char
`write`(filename[, format_name])	Write the sequences to a file, preserving order of sequences.

gapped_by_map

Notes

Should be constructed using make_aligned_seqs().

add_feature(*, biotype: str, name: str, spans: list[tuple[int, int]], seqid: str | None = None, parent_id: str | None = None, strand: str = '+', on_alignment: bool | None = None) → Feature#

add feature on named sequence, or on the alignment itself

Parameters:

seqid: sequence name, incompatible with on_alignment
parent_id: name of the parent feature
biotype: biological type, e.g. CDS
name: name of the feature
spans: plus strand coordinates of feature
strand: ‘+’ (default) or ‘-’
on_alignment: the feature is in alignment coordinates, incompatible with setting seqid. Set to True if seqid not provided.

Returns:

Feature

Raises:

ValueError if seqid is not a sequence in the alignment, or if both seqid and
on_alignment are specified.

add_seqs(seqs: dict[str, str | bytes | ndarray[int]] | SeqsData | list, **kwargs) → Self#

Returns new collection with additional sequences.

Parameters:

seqs: sequences to add

alignment_quality(app_name: str = 'ic_score', **kwargs)#

Computes the alignment quality using the indicated app

Parameters:

app_name: name of an alignment score calculating app, e.g. ‘ic_score’, ‘cogent3_score’, ‘sp_score’
kwargs: keyword arguments to be passed to the app. Use cogent3.app_help(app_name) to see the available options.

Returns:

float or a NotCompleted instance if the score could not be computed

property annotation_db: SupportsFeatures#: the annotation database for the collection

apply_pssm(pssm: PSSM = None, path: str | None = None, background: ndarray = None, pseudocount: int = 0, names: list | None = None, ui=None) → array#

scores sequences using the specified pssm

Parameters:

pssm: A profile.PSSM instance, if not provided, will be loaded from path
path: path to either a jaspar or cisbp matrix (path must have a suffix matching the format).
background: background frequencies distribution
pseudocount: adjustment for zero in matrix
names: returns only scores for these sequences and in the name order

Returns:

numpy array of log2 based scores at every position

apply_scaled_gaps(other: SequenceCollection, aa_to_codon: bool = True) → Self#: applies gaps in self to ungapped sequences

property array_positions: ndarray#: Returns a numpy array of positions; axis 0 is alignment positions, with columns in the order corresponding to names.

property array_seqs: ndarray#: Returns a numpy array of sequences, axis 0 is seqs in order corresponding to names

coevolution(stat: str = 'nmi', segments: list[tuple[int, int]] | None = None, drawable: str | None = None, show_progress: bool = False, parallel: bool = False, par_kw: dict | None = None)#

performs pairwise coevolution measurement

Parameters:

stat: coevolution metric, defaults to ‘nmi’ (Normalized Mutual Information). Other valid choices are ‘rmi’ (Resampled Mutual Information) and ‘mi’ (Mutual Information).
segments: coordinates of the form [(start, end), …] where all possible pairs of alignment positions within and between segments are examined.
drawable: if specified, the result object is capable of plotting the data as this plot type. Must be one of ‘box’, ‘heatmap’, ‘violin’.
show_progress: shows a progress bar.
parallel: run in parallel, according to arguments in par_kw.
par_kw: dict of values for configuring parallel execution.

Returns:

DictArray of results with lower-triangular values. Upper triangular
elements and estimates that could not be computed for numerical reasons
are set as nan

copy(copy_annotations: bool = False) → Self#: creates new instance, only mutable attributes are copied

copy_annotations(seq_db: SupportsFeatures) → None#

copy annotations into attached annotation db

Parameters:

seq_db: compatible annotation db

Notes

Only copies annotations for records with seqid in self.names

count_ambiguous_per_seq() → DictArray#: Return the counts of ambiguous characters per sequence as a DictArray.

count_gaps_per_pos(include_ambiguity: bool = True) → DictArray#

return counts of gaps per position as a DictArray

Parameters:

include_ambiguity: if True, ambiguity characters that include the gap state are included
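
As a rough illustration of what a per-position gap count means, the following plain-Python sketch tallies gap characters in each column of a toy alignment. It does not use cogent3: the helper name `gaps_per_column` is hypothetical, "-" is assumed to be the only gap character, and ambiguity codes are not handled.

```python
def gaps_per_column(seqs: dict[str, str], gap: str = "-") -> list[int]:
    """Count gap characters in each column of equal-length sequences."""
    # zip(*rows) transposes the sequences into alignment columns
    return [column.count(gap) for column in zip(*seqs.values())]


toy = {"s1": "AC-GT", "s2": "A--GT", "s3": "ACCG-"}
print(gaps_per_column(toy))  # [0, 1, 2, 0, 1]
```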

count_gaps_per_seq(induced_by: bool = False, unique: bool = False, include_ambiguity: bool = True, drawable: bool = False) → DictArray#

return counts of gaps per sequence as a DictArray

Parameters:

induced_by: a gapped column is considered to be induced by a seq if the seq has a non-gap character in that column.
unique: count is limited to gaps uniquely induced by each sequence
include_ambiguity: if True, ambiguity characters that include the gap state are included
drawable: if True, resulting object is capable of plotting data via specified plot type ‘bar’, ‘box’ or ‘violin’

counts(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False) → MotifCountsArray#

counts of motifs

Parameters:

motif_length: number of elements per character.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.

Notes

only non-overlapping motifs are counted
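
To make "non-overlapping" concrete, here is a minimal plain-Python sketch that splits a string into consecutive motifs and counts them. It is not cogent3 code: `count_motifs` is a hypothetical helper, and dropping a trailing partial motif is an assumption of this sketch. Filtering of gaps and ambiguity codes is omitted.

```python
from collections import Counter


def count_motifs(seq: str, motif_length: int = 1) -> Counter:
    """Count non-overlapping motifs, ignoring any trailing partial motif."""
    usable = len(seq) - len(seq) % motif_length
    # step by motif_length so consecutive motifs never share a character
    return Counter(seq[i : i + motif_length] for i in range(0, usable, motif_length))


dinuc = count_motifs("ACACGTA", motif_length=2)
print(dinuc)  # Counter({'AC': 2, 'GT': 1})
```

The overlapping dinucleotide "CA" occurs in the string but is never counted, because motifs start only at positions 0, 2, 4.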

counts_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) → DictArray#

return DictArray of counts per position

Parameters:

motif_length: number of elements per character.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

counts_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) → MotifCountsArray#

counts of non-overlapping motifs per sequence

Parameters:

motif_length: number of elements per character.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if False, all canonical states included
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

deepcopy(**kwargs) → Self#

returns deep copy of self

Notes

Reduced to sliced sequences in self, kwargs are ignored. Annotation db is not copied if the alignment has been sliced.

degap(storage_backend: str | None = None, **kwargs) → SequenceCollection#

returns collection sequences without gaps or missing characters.

Parameters:

storage_backend: name of the storage backend to use for the SeqsData object, defaults to cogent3 builtin.
kwargs: keyword arguments for the storage driver

Notes

The returned collection will not retain an annotation_db if present.

distance_matrix(calc: str = 'pdist', drop_invalid: bool = False, parallel: bool = False)#

Returns pairwise distances between sequences.

Parameters:

calc: a pairwise distance calculator name. Presently only ‘pdist’, ‘jc69’, ‘tn93’, ‘hamming’, ‘paralinear’ are supported.
drop_invalid: If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.

dotplot([name1, name2, window, threshold, ...])#

make a dotplot between specified sequences. Random sequences chosen if names not provided.

Parameters:

name1, name2: names of sequences – if not provided, a random choice is made
window: segment size for comparison between sequences
threshold: windows where the sequences are identical >= threshold are a match
k: size of k-mer to break sequences into. Larger values increase speed but reduce resolution. If not specified, and window == threshold, then k is set to window. Otherwise, it is computed as the maximum of {threshold // (window - threshold), 5}.
min_gap: permitted gap for joining adjacent line segments, default is no gap joining
width: figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)
title: title for the plot
rc: include dotplot of reverse complement also. Only applies to nucleic acid moltypes
biotype: if selected sequences are annotated, display only these biotypes

Returns:

a Drawable or AnnotatedDrawable
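
The rule for choosing k when it is not specified can be written out directly. The sketch below is a literal transcription of the description of the k parameter above, not the library's own code; the function name `default_k` is hypothetical, and integer inputs with threshold <= window are assumed.

```python
def default_k(window: int, threshold: int) -> int:
    """Default k-mer size following the rule described for the k parameter."""
    if window == threshold:
        # exact matching over the whole window: one k-mer per window
        return window
    return max(threshold // (window - threshold), 5)


print(default_k(20, 20))   # 20
print(default_k(20, 15))   # 5
print(default_k(100, 90))  # 9
```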

drop_duplicated_seqs() → Self#

returns self without duplicated sequences

Notes

Retains the first sequence of each duplicate group.

duplicated_seqs() → list[list[str]]#

returns the names of duplicated sequences

Notes

The gapped sequence is used.

entropy_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) → ndarray#: returns shannon entropy per position

entropy_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = True, warn: bool = False) → ndarray#

returns the Shannon entropy per sequence

Parameters:

motif_length: number of characters per tuple.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

Notes

For motif_length > 1, it’s advisable to specify exclude_unobserved=True, as this avoids unnecessary calculations.
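
For reference, the Shannon entropy of a single sequence can be sketched in plain Python as below. This is an illustration rather than cogent3's implementation: `shannon_entropy` is a hypothetical helper, a base-2 logarithm is assumed, and gap or ambiguity handling is left out. Unobserved motifs contribute nothing to the sum, so only observed motifs are visited.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, motif_length: int = 1) -> float:
    """Shannon entropy (in bits) of the non-overlapping motifs in seq."""
    usable = len(seq) - len(seq) % motif_length
    counts = Counter(seq[i : i + motif_length] for i in range(0, usable, motif_length))
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over observed motifs only
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = shannon_entropy("ACGT")       # four equally frequent states
constant = shannon_entropy("AAAA")      # a single state
codons = shannon_entropy("ATGATGCCC", motif_length=3)
```

With four equally frequent states the entropy is 2 bits, and a constant sequence has an entropy of 0.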

filtered(predicate: Callable[[Self], bool], motif_length: int = 1, drop_remainder: bool = True, **kwargs) → Self#

The alignment positions where predicate(column) is true.

Parameters:

predicate: a callback function that takes a tuple of motifs and returns True/False
motif_length: length of the motifs the sequences should be split into, e.g. 3 for filtering aligned codons.
drop_remainder: if the alignment length is not a multiple of motif_length, allow dropping the terminal remaining columns
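
The idea of filtering columns with a predicate can be sketched without cogent3 as follows. The helper `filter_columns` is hypothetical, the predicate is assumed to receive a tuple holding one motif string per sequence, and trailing columns that do not fill a whole motif are dropped.

```python
def filter_columns(seqs: dict[str, str], predicate, motif_length: int = 1) -> dict[str, str]:
    """Keep only the motif columns for which predicate(column) is true."""
    names = list(seqs)
    length = len(seqs[names[0]])
    usable = length - length % motif_length
    kept = {name: [] for name in names}
    for start in range(0, usable, motif_length):
        # one motif per sequence, in the same order as names
        column = tuple(seqs[n][start : start + motif_length] for n in names)
        if predicate(column):
            for name, motif in zip(names, column):
                kept[name].append(motif)
    return {name: "".join(parts) for name, parts in kept.items()}


toy = {"s1": "ATGCC-TAA", "s2": "ATGCCATAA"}
# keep aligned codons that contain no gap in any sequence
no_gap_codons = filter_columns(toy, lambda col: all("-" not in m for m in col), motif_length=3)
print(no_gap_codons)  # {'s1': 'ATGTAA', 's2': 'ATGTAA'}
```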

classmethod from_rich_dict(data: dict[str, str | dict[str, str]]) → Alignment#: returns a new instance from a rich dict

gapped_by_map(keep: FeatureMap, **kwargs) → Self#

get_ambiguous_positions()#

Returns dict of seq:{position:char} for ambiguous chars.

Used in likelihood calculations.

get_degapped_relative_to(name: str) → Self#

Remove all columns with gaps in sequence with given name.

Parameters:

name: sequence name

Notes

The returned alignment will not retain an annotation_db if present.
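
Conceptually, degapping relative to one sequence means keeping only the columns where that sequence has a non-gap character. A minimal plain-Python sketch, with the hypothetical helper `degap_relative_to` and "-" assumed as the gap character:

```python
def degap_relative_to(seqs: dict[str, str], name: str, gap: str = "-") -> dict[str, str]:
    """Drop every column in which the named sequence has a gap."""
    keep = [i for i, char in enumerate(seqs[name]) if char != gap]
    return {n: "".join(s[i] for i in keep) for n, s in seqs.items()}


toy = {"ref": "AC--GT", "other": "ACTTG-"}
print(degap_relative_to(toy, "ref"))  # {'ref': 'ACGT', 'other': 'ACG-'}
```

Gaps in the other sequences are left in place, so the result is still an alignment in the coordinates of the reference sequence.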

get_drawable(*, biotype: str | Iterable[str] | None = None, width: int = 600, vertical: int = False, title: OptStr = None)#

make a figure from sequence features

Parameters:

biotype: passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.
width: width in pixels
vertical: rotates the drawable
title: title for the plot

Returns:

a Drawable instance

get_drawables(*, biotype: str | Iterable[str] | None = None) → dict#

returns a dict of drawables, keyed by type

Parameters:

biotype: passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.

get_features(*, seqid: str | None = None, biotype: str | None = None, name: str | None = None, on_alignment: bool | None = None, allow_partial: bool = False) → Iterator[Feature]#

yields Feature instances

Parameters:

seqid: limit search to features on this named sequence, defaults to search all
biotype: biotype of the feature, e.g. CDS, gene
name: name of the feature
on_alignment: limit query to features on Alignment, ignores sequences. Ignored on SequenceCollection instances.
allow_partial: allow features partially overlapping self

Notes

When dealing with a nucleic acid moltype, the returned features will yield a sequence segment that is consistently oriented irrespective of strand of the current instance.

get_gap_array(include_ambiguity: bool = True) → ndarray#

returns bool array with gap state True, False otherwise

Parameters:

include_ambiguity: if True, ambiguity characters that include the gap state are included

get_gapped_seq(seqname: str, recode_gaps: bool = False) → Sequence#

Return a gapped Sequence object for the specified seqname.

Parameters:

seqname: sequence name
recode_gaps: if True, gap characters are replaced by the most general ambiguity code, e.g. N for DNA and RNA

Notes

This method breaks the connection to the annotation database.

get_identical_sets(mask_degen: bool = False) → list[set]#

returns sets of names for sequences that are identical

Parameters:

mask_degen: if True, degenerate characters are ignored

get_lengths(include_ambiguity: bool = False, allow_gap: bool = False) → dict[str, int]#

returns sequence lengths as a dict of {seqid: length}

Parameters:

include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.

get_motif_probs(alphabet: AlphabetABC = None, include_ambiguity: bool = False, exclude_unobserved: bool = False, allow_gap: bool = False, pseudocount: int = 0) → dict#

Return a dictionary of motif probs, calculated as the averaged frequency across sequences.

Parameters:

alphabet: alphabet to use for motifs
include_ambiguity: if True resolved ambiguous codes are included in estimation of frequencies.
exclude_unobserved: if True, motifs that are not present in the alignment are excluded from the returned dictionary.
allow_gap: allow gap motif
pseudocount: value to add to each count

Notes

only non-overlapping motifs are counted

get_position_indices(f: Callable[[str], bool], negate: bool = False) → list[int]#

Returns list of column indices for which f(col) is True.

Parameters:

f: function that returns true/false given an alignment position
negate: if True, not f() is used

get_projected_feature(*, seqid: str, feature: Feature) → Feature#

returns an alignment feature projected onto the seqid sequence

Parameters:

seqid: name of the sequence to project the feature onto
feature: a Feature, bound to self, that will be projected

Returns:

a new Feature bound to seqid

Notes

The alignment coordinates of feature are converted into the seqid sequence coordinates and the object is bound to that sequence.

The feature is added to the annotation_db.

get_projected_features(*, seqid: str, **kwargs) → list[Feature]#: projects all features from other sequences onto seqid

get_seq(seqname: str, copy_annotations: bool = False) → Sequence#

Return a Sequence object for the specified seqname.

Parameters:

seqname: name of the sequence to return
copy_annotations: if True, only the annotations for the specified sequence are copied to the annotation database of the Sequence object, which is decoupled from this collection. If False, the connection to this collection’s db is retained.

get_seq_names_if(f: Callable[[Sequence], bool], negate: bool = False) → list[str]#

Returns list of names of seqs where f(seq) is True.

Parameters:

f: function that takes a sequence object and returns True or False
negate: select all sequences EXCEPT those where f(seq) is True

Notes

Sequence objects can be converted into strings or numpy arrays using str() and numpy.array() respectively.

get_similar(target: Sequence, min_similarity: float = 0.0, max_similarity: float = 1.0, metric: Callable[[Sequence, Sequence], float] = <cogent3.util.transform.for_seq object>, transform: bool | None = None) → SequenceCollection#

Returns new SequenceCollection containing sequences similar to target.

Parameters:

target: sequence object to compare to. Can be in the collection.
min_similarity: minimum similarity that will be kept. Default 0.0.
max_similarity: maximum similarity that will be kept. Default 1.0.
metric: a similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence class has many methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.
transform: transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.

Warning

if the transformation changes the type of the sequence (e.g. extracting a string from an RnaSequence object), distance metrics that depend on instance data of the original class may fail.

Notes

both min_similarity and max_similarity are inclusive.
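
The inclusive bounds can be illustrated with a small plain-Python sketch. It works on plain strings rather than Sequence objects, and both `fraction_identical` and `select_similar` are hypothetical helpers standing in for the default metric and the selection step.

```python
def fraction_identical(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length strings match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_similar(seqs: dict[str, str], target: str,
                   min_similarity: float = 0.0, max_similarity: float = 1.0) -> dict[str, str]:
    """Keep sequences whose similarity to target lies within the inclusive bounds."""
    return {
        name: seq
        for name, seq in seqs.items()
        if min_similarity <= fraction_identical(seq, target) <= max_similarity
    }


toy = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
close = select_similar(toy, "ACGT", min_similarity=0.75)
print(sorted(close))  # ['a', 'b']
```

Sequence "b" has a similarity of exactly 0.75 to the target and is kept, because the lower bound is inclusive.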

get_translation(gc: c3_genetic_code.GeneticCodeChoiceType = 1, incomplete_ok: bool = False, include_stop: bool = False, trim_stop: bool = True, **kwargs) → typing_extensions.Self#

translate sequences from nucleic acid to protein

Parameters:

gc: genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)
incomplete_ok: codons that are mixes of nucleotide and gaps are converted to ‘?’. Raises a ValueError if False
include_stop: whether to allow stops in the translated sequence
trim_stop: exclude terminal stop codons if they exist
kwargs: related to construction of the resulting object

Returns:

A new instance of self translated into protein

Notes

Translating will break the relationship to an annotation_db if present.

has_terminal_stop(gc: c3_genetic_code.GeneticCodeChoiceType = 1, strict: bool = False) → bool#

Returns True if any sequence has a terminal stop codon.

Parameters:

gc: valid input to cogent3.get_code(), a genetic code object, number or name
strict: If True, raises an exception if a seq length not divisible by 3

information_plot(width: int | None = None, height: int | None = None, window: int | None = None, stat: str = 'median', include_gap: bool = True)#

plot information per position

Parameters:

width: figure width in pixels
height: figure height in pixels
window: used for smoothing line, defaults to sqrt(length)
stat: ‘mean’ or ‘median’, used as the summary statistic for each window
include_gap: whether to include gap counts, shown on right y-axis

is_ragged() → bool#: by definition False for an Alignment

iter_positions(pos_order: list | None = None) → Iterator[list, list, list]#

Iterates over positions in the alignment, in order.

Parameters:

pos_order: list of indices specifying the column order. If None, the positions are iterated in order.

Returns:

yields lists of elements for each position (column) in the alignment

iter_seqs(seq_order: list | None = None) → Iterator[Sequence | SeqViewABC]#

Iterates over sequences in the collection, in order.

Parameters:

seq_order: list of seqids giving the order in which seqs will be returned. Defaults to self.names

iupac_consensus(allow_gap: bool = True) → str#: Returns string containing IUPAC consensus sequence of the alignment.

majority_consensus() → Sequence#: Returns consensus sequence containing most frequent item at each position.

make_feature(*, feature: FeatureDataType, on_alignment: bool | None = None) → Feature#

create a feature on named sequence, or on the alignment itself

Parameters:

feature: a dict with all the necessary data to construct a feature
on_alignment: the feature is in alignment coordinates, incompatible with setting ‘seqid’. Set to True if ‘seqid’ not provided.

Returns:

Feature

Raises:

ValueError if ‘seqid’ is not a sequence in the alignment, or if both ‘seqid’ and
on_alignment are specified.

Notes

To get a feature AND add it to annotation_db, use add_feature().

matching_ref(ref_name: str, gap_fraction: float, gap_run: int) → Self#

Returns new alignment with seqs well aligned with a reference.

Parameters:

ref_name: name of the sequence to use as the reference
gap_fraction: fraction of positions that either have a gap in the template but not in the seq or in the seq but not in the template
gap_run: number of consecutive gaps tolerated in query relative to sequence or sequence relative to query

property modified: bool#: collection is a modification of underlying storage

property name_map: MappingProxyType#

returns mapping of seq names to parent seq names

Notes

The underlying SeqsData may have different names for the same sequences. This object maps the names of sequences in self to the names of sequences in SeqsData. MappingProxyType is an immutable mapping, so it cannot be changed. Use self.rename_seqs() to do that.

property names: tuple[str, ...]#: returns the names of the sequences in the collection

no_degenerates(motif_length: int = 1, allow_gap: bool = False) → Self#

returns new alignment without degenerate characters

Parameters:

motif_length: sequences are segmented into units of this size and the segments are excluded if they contain degenerate characters.
allow_gap: whether gaps are allowed or whether they are treated as a degenerate character (latter is default, as most evolutionary modelling treats gaps as N).

property num_seqs: int#: the number of sequences in the collection

omit_bad_seqs(quantile: float | None = None)#

Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile

Uses count_gaps_per_seq(unique=True) to obtain the counts of gaps uniquely introduced by a sequence. The cutoff is the quantile of this distribution.

Parameters:

quantile: sequences whose unique gap count is in a quantile larger than this cutoff are excluded. The default quantile is (num_seqs - 1) / num_seqs

omit_gap_pos(allowed_gap_frac: float | None = None, motif_length: int = 1)#

Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.

Parameters:

allowed_gap_frac: the proportion of gaps allowed in each column. Set to 0 to exclude columns with any gaps, 1 to include all columns. Default is None which is equivalent to (num_seqs-1)/num_seqs and leads to elimination of columns that are only gaps.
motif_length: sets the “column” width, e.g. setting to 3 corresponds to codons. A motif that includes a gap at any position is included in the counting.

pad_seqs(pad_length: int | None = None)#

Returns copy in which sequences are padded with the gap character to same length.

Parameters:

pad_length: Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.

property positions: list[list[str]]#

probs_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) → MotifFreqsArray#: returns MotifFreqsArray per position

probs_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) → MotifFreqsArray#

return MotifFreqsArray per sequence

Parameters:

motif_length: number of characters per tuple.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

quick_tree(calc: str = 'pdist', drop_invalid: bool = False, parallel: bool = False, use_hook: str | None = None) → PhyloNode#

Returns a phylogenetic tree.

Parameters:

calc: a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances
drop_invalid: If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.
parallel: parallel execution of distance calculations
use_hook: name of a third-party package that implements the quick_tree hook. If not specified, defaults to the first available hook or the cogent3 quick_tree() app. To force default, set use_hook="cogent3".

Returns:

a phylogenetic tree
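
The default calc, 'pdist', refers to a proportion-of-differences distance. The sketch below shows that quantity for sequences held as plain strings. It assumes positions where either sequence has a gap or non-canonical character are skipped; the name `p_distance` is hypothetical and the tree-building step is not shown.

```python
from itertools import combinations


def p_distance(a: str, b: str, valid: str = "ACGT") -> float:
    """Proportion of comparable aligned positions at which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x in valid and y in valid]
    if not pairs:
        # No comparable positions, so no distance can be computed
        raise ArithmeticError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


aln = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "AC-TTCGA"}
dists = {
    (n1, n2): p_distance(aln[n1], aln[n2]) for n1, n2 in combinations(aln, 2)
}
```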

rc()#: Returns the reverse complement of all sequences in the alignment. A synonym for reverse_complement.
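
For gapped DNA held as plain strings, reverse complementing every sequence keeps the columns aligned because all rows are reversed together. A minimal sketch, in which the helper name and the choice to leave gaps and other unmapped characters unchanged are assumptions:

```python
# Translation table for the canonical DNA characters only
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    # Characters without a mapping (e.g. "-", "N", "?") pass through unchanged
    return seq.translate(COMPLEMENT)[::-1]


aln = {"s1": "ACG-T", "s2": "A-GGT"}
rc = {name: reverse_complement(s) for name, s in aln.items()}
```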

rename_seqs(renamer: Callable[[str], str])#: Returns new alignment with renamed sequences.

replace_annotation_db(value: SupportsFeatures | list[SupportsFeatures] | None, check: bool = True) → None#

public interface to assigning the annotation_db

Parameters:

value: the annotation db instance
check: whether to check value supports the feature interface

Notes

The check can be very expensive, so if you’re confident set it to False

reverse_complement() → Self#: Returns the reverse complement of all sequences in the collection. A synonym for rc.

sample(*, n: int | None = None, with_replacement: bool = False, motif_length: int = 1, randint: Callable[[int, int | None, int | None], ndarray] = <bound method RandomState.randint of RandomState(MT19937)>, permutation: Callable[[ndarray], ndarray] = <bound method RandomState.permutation of RandomState(MT19937)>) → Self#

Returns random sample of positions from self, e.g. to bootstrap.

Parameters:

n: number of positions to sample. If None, all positions are sampled.
with_replacement: if True, samples with replacement.
motif_length: number of positions to sample as a single motif.
randint: random number generator, default is numpy.random.randint
permutation: function to generate a random permutation of positions, default is numpy.random.permutation

Notes

By default (resampling all positions without replacement), generates a permutation of the positions of the alignment.

Setting with_replacement to True and otherwise leaving parameters as defaults generates a standard bootstrap resampling of the alignment.
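
The difference between the two modes described in the notes can be sketched with the standard library alone. The function `sample_columns` below is hypothetical, operates on a dict of equal-length strings and does not model motif_length; it also returns the chosen column indices so the result can be inspected.

```python
import random


def sample_columns(aln: dict, n=None, with_replacement: bool = False, seed=None):
    """Return (sampled_alignment, chosen_column_indices)."""
    rng = random.Random(seed)
    length = len(next(iter(aln.values())))
    n = length if n is None else n
    if with_replacement:
        # Bootstrap: each draw is independent, so columns can repeat
        cols = [rng.randrange(length) for _ in range(n)]
    else:
        # Without replacement: n == length yields a permutation of the columns
        cols = rng.sample(range(length), n)
    sampled = {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}
    return sampled, cols


aln = {"s1": "ACGTAC", "s2": "ACGAAC", "s3": "AC-TAC"}
permuted, perm_cols = sample_columns(aln, seed=1)
bootstrap, boot_cols = sample_columns(aln, with_replacement=True, seed=1)
```

In both cases the same column indices are applied to every sequence, so each sampled column is an intact column of the original alignment.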

seqlogo(width: float = 700, height: float = 100, wrap: int | None = None, vspace: float = 0.005, colours: dict | None = None)#

returns Drawable sequence logo using mutual information

Parameters:

width, height: plot dimensions in pixels
wrap: number of alignment columns per row
vspace: vertical separation between rows, as a proportion of total plot
colours: mapping of characters to colours. If not provided, defaults to custom for everything except protein, which uses protein moltype colours.

Notes

Computes MI based on log2 and includes the gap state, so the maximum possible value is -log2(1/num_states)
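
As a worked example of that upper bound, the snippet below evaluates -log2(1/num_states). The state counts are assumptions for illustration: four nucleotides plus the gap for DNA, and twenty amino acids plus the gap for protein.

```python
import math


def max_information(num_states: int) -> float:
    """Upper bound in bits for a column with num_states possible states."""
    return -math.log2(1 / num_states)


dna_max = max_information(5)  # 4 nucleotides + gap, roughly 2.32 bits
protein_max = max_information(21)  # 20 amino acids + gap, roughly 4.39 bits
```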

property seqs: _IndexableSeqs#

iterable of sequences in the collection

Returns:

Instance of MolType sequence or Aligned sequence if self is an Alignment.

Notes

Can be indexed by a sequence name or integer index. Cannot be sliced.

set_repr_policy(num_seqs: int | None = None, num_pos: int | None = None, ref_name: int | None = None, wrap: int | None = None) → None#

specify policy for repr(self)

Parameters:

num_seqs: number of sequences to include in represented display.
num_pos: length of sequences to include in represented display.
ref_name: name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.
wrap: number of printed bases per row

sliding_windows(window: int, step: int, start: int | None = None, end: int | None = None) → Generator[Self, None, None]#

Generator yielding new alignments of given length and interval.

Parameters:

window: The length of each returned alignment.
step: The interval between the start of the successive windows.
start: first window start position
end: last window start position
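
How window and step combine can be sketched for an alignment held as a dict of equal-length strings. The generator below is an illustration only, not the library's code; it omits the end parameter and only yields full-length windows.

```python
def sliding_windows(aln: dict, window: int, step: int, start: int = 0):
    """Yield sub-alignments of `window` columns, advancing `step` columns each time."""
    length = len(next(iter(aln.values())))
    # The last usable start position is the one that still leaves a full window
    for pos in range(start, length - window + 1, step):
        yield {name: seq[pos : pos + window] for name, seq in aln.items()}


aln = {"s1": "ACGTACGT", "s2": "ACGAACGA"}
windows = list(sliding_windows(aln, window=4, step=2))
```

With eight columns, a window of 4 and a step of 2, the windows start at columns 0, 2 and 4.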

property storage: AlignedSeqsDataABC#: the aligned sequence storage instance of the collection

strand_symmetry(motif_length: int = 1) → dict[str, TestResult]#: returns dict of strand symmetry test results per ungapped seq

take_positions(cols: list[int] | ndarray[int], negate: bool = False) → Self#

Returns new Alignment containing only specified positions.

Parameters:

cols: list of column indices to keep
negate: if True, all columns except those in cols are kept
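
The effect of negate can be shown with a small plain-Python sketch over a dict of equal-length strings. The function below is a stand-in written for illustration and is not the library's implementation.

```python
def take_positions(aln: dict, cols, negate: bool = False) -> dict:
    """Keep the columns in cols, or every other column when negate is True."""
    length = len(next(iter(aln.values())))
    if negate:
        excluded = set(cols)
        cols = [i for i in range(length) if i not in excluded]
    return {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}


aln = {"s1": "ACGTA", "s2": "A-GTC"}
kept = take_positions(aln, [0, 2, 4])
dropped = take_positions(aln, [0, 2, 4], negate=True)
```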

take_positions_if(f: Callable[[str], bool], negate: bool = False) → Self#: Returns new Alignment containing cols where f(col) is True.

take_seqs(names: str | Sequence[str], negate: bool = False, copy_annotations: bool = False, **kwargs) → Self#

Returns new collection containing only specified seqs.

Parameters:

names: sequences to select (or exclude if negate=True)
negate: select all sequences EXCEPT names
kwargs: keyword arguments to be passed to the constructor of the new collection
copy_annotations: if True, only annotations from selected seqs are copied to the annotation_db of the new collection

take_seqs_if(f: Callable[[Sequence], bool], negate: bool = False) → Self#

Returns new collection containing seqs where f(seq) is True.

Parameters:

f: function that takes a sequence object and returns True or False
negate: select all sequences EXCEPT those where f(seq) is True

Notes

Sequence objects can be converted into strings or numpy arrays using str() and numpy.array() respectively.
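
The selection logic can be sketched with sequences held as plain strings, which stands in for converting a sequence object with str(). The helper below is hypothetical and only illustrates how f and negate combine.

```python
def take_seqs_if(seqs: dict, f, negate: bool = False) -> dict:
    """Keep sequences where f(seq) is True, or False when negate is set."""
    return {name: s for name, s in seqs.items() if bool(f(s)) != negate}


def has_no_gaps(seq: str) -> bool:
    return "-" not in seq


seqs = {"s1": "ACGT", "s2": "AC-T", "s3": "A--T"}
ungapped = take_seqs_if(seqs, has_no_gaps)
gapped = take_seqs_if(seqs, has_no_gaps, negate=True)
```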

to_dict(as_array: bool = False) → dict[str, str | ndarray]#

Return a dictionary of sequences.

Parameters:

as_array: if True, sequences are returned as numpy arrays, otherwise as strings

to_dna() → Self#: returns copy of self as a collection of DNA moltype seqs

to_fasta(block_size: int = 60) → str#

Return collection in Fasta format.

Parameters:

block_size: the sequence length to write to each line, by default 60

Returns:

The collection in Fasta format.
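
The role of block_size can be seen in a minimal sketch of FASTA formatting. The function below is written for illustration and is not the library's writer; in particular the trailing newline is an assumption.

```python
def to_fasta(seqs: dict, block_size: int = 60) -> str:
    """Format name -> sequence pairs as FASTA, wrapping at block_size characters."""
    lines = []
    for name, seq in seqs.items():
        lines.append(f">{name}")
        # Split the sequence into consecutive chunks of at most block_size
        lines.extend(seq[i : i + block_size] for i in range(0, len(seq), block_size))
    return "\n".join(lines) + "\n"


fasta = to_fasta({"s1": "ACGTACGT", "s2": "AC--ACGT"}, block_size=5)
```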

to_html(name_order: Sequence[str] | None = None, wrap: int = 60, limit: int | None = None, ref_name: str = 'longest', colors: Mapping[str, str] | None = None, font_size: int = 12, font_family: str = 'Lucida Console') → str#

returns html with embedded styles for sequence colouring

Parameters:

name_order: order of names for display.
wrap: number of alignment columns per row
limit: truncate alignment to this length
ref_name: Name of an existing sequence or ‘longest’. If the latter, the longest sequence (excluding gaps and ambiguities) is selected as the reference.
colors: mapping of {character: colour}. Defaults are taken from the moltype.
font_size: in points. Affects labels and sequence and line spacing (proportional to value)
font_family: string denoting font family

Examples

In a jupyter notebook, this code is used to provide the representation.

aln  # is rendered by jupyter

You can directly use the result for display in a notebook as

from IPython.display import HTML

HTML(aln.to_html())

to_json() → str#: returns json formatted string

to_moltype(moltype: Literal['dna', 'rna', 'protein', 'protein_with_stop', 'text', 'bytes'] | MolType) → Self#

returns copy of self with changed moltype

Parameters:

moltype: name of the new moltype, e.g. ‘dna’, ‘rna’.

Notes

Cannot convert from nucleic acids to proteins. Use get_translation() for that.

to_phylip() → str#

Return collection in PHYLIP format and mapping to sequence ids

Notes

raises exception if sequences do not all have the same length

to_pretty(name_order: list[str] | None = None, wrap: int | None = None) → str#

returns a string representation of the alignment in pretty print format

Parameters:

name_order: order of names for display.
wrap: maximum number of printed bases

to_rich_dict() → dict[str, str | dict[str, str]]#: returns a json serialisable dict

to_rna() → Self#: returns copy of self as a collection of RNA moltype seqs

trim_stop_codons(gc: c3_genetic_code.GeneticCodeChoiceType = 1, strict: bool = False, **kwargs) → typing_extensions.Self#

Removes any terminal stop codons from the sequences

Parameters:

gc: valid input to cogent3.get_code(), a genetic code object, number or name, defaults to standard code
strict: If True, raises an exception if a sequence length is not divisible by 3
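
The interaction between the terminal stop codon and strict can be sketched for a single ungapped DNA string. This is an illustration under stated assumptions, not the library's code: the stop codons are those of the standard genetic code, a sequence whose length is not a multiple of 3 is returned unchanged when strict is False, and the handling of gapped alignment rows is not modelled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def trim_terminal_stop(seq: str, strict: bool = False) -> str:
    """Remove one terminal stop codon from an ungapped DNA string, if present."""
    if len(seq) % 3:
        if strict:
            raise ValueError("sequence length is not divisible by 3")
        # Incomplete final codon: nothing can be identified as a stop
        return seq
    if seq[-3:].upper() in STOP_CODONS:
        return seq[:-3]
    return seq


trimmed = trim_terminal_stop("ATGAAATAA")
unchanged = trim_terminal_stop("ATGAAA")
partial = trim_terminal_stop("ATGAAATA")
```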

variable_positions(include_gap_motif: bool = True, include_ambiguity: bool = False, motif_length: int = 1) → tuple[int]#

Returns a tuple of variable position indices.

Parameters:

include_gap_motif: if False, sequences with a gap motif in a column are ignored.
include_ambiguity: if True, all states are considered.
motif_length: if any position within a motif is variable, the entire motif is considered variable.

Returns:

tuple of integers. If motif_length > 1, the returned positions are runs of motif_length sequential indices.

Notes

Truncates alignment to be modulo motif_length.
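
The truncation and the motif-level reporting can be illustrated with a sketch over a dict of equal-length strings. The function below is hypothetical: it treats include_gap_motif=False as ignoring any motif containing "-", and it does not model include_ambiguity.

```python
def variable_positions(
    aln: dict, motif_length: int = 1, include_gap_motif: bool = True, gap: str = "-"
) -> tuple:
    """Indices of columns belonging to motifs that differ between sequences."""
    seqs = list(aln.values())
    # Truncate so the alignment length is a multiple of motif_length
    length = len(seqs[0]) - len(seqs[0]) % motif_length
    result = []
    for start in range(0, length, motif_length):
        motifs = {s[start : start + motif_length] for s in seqs}
        if not include_gap_motif:
            motifs = {m for m in motifs if gap not in m}
        if len(motifs) > 1:
            # Report every position of a variable motif
            result.extend(range(start, start + motif_length))
    return tuple(result)


aln = {"s1": "ACGTACG", "s2": "ACGTAAG", "s3": "AC-TACT"}
single = variable_positions(aln)
codons = variable_positions(aln, motif_length=3)
codons_no_gap = variable_positions(aln, motif_length=3, include_gap_motif=False)
```

With motif_length=3 the seventh column is dropped by truncation, so index 6 never appears in `codons`.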

with_masked_annotations(biotypes: Sequence[str], mask_char: str = '?', shadow: bool = False, seqid: str | None = None) → Self#

returns an alignment with regions replaced by mask_char

Parameters:

biotypes: annotation type(s)
mask_char: must be a character valid for the moltype. The default value is the most ambiguous character, e.g. ‘?’ for DNA
shadow: If True, masks everything but the biotypes
seqid: name of sequence to mask, defaults to all
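
The effect of shadow can be sketched on a single string. Representing annotated regions as half-open (start, end) tuples is an assumption made for this illustration, as is the helper name `mask_spans`; the library resolves regions from the annotation database instead.

```python
def mask_spans(seq: str, spans, mask_char: str = "?", shadow: bool = False) -> str:
    """Replace characters inside spans (or outside them if shadow) with mask_char."""
    masked = [False] * len(seq)
    for start, end in spans:
        for i in range(start, end):
            masked[i] = True
    if shadow:
        # Invert the selection: mask everything except the annotated regions
        masked = [not m for m in masked]
    return "".join(mask_char if m else c for c, m in zip(seq, masked))


inside = mask_spans("ACGTACGT", [(2, 5)])
outside = mask_spans("ACGTACGT", [(2, 5)], shadow=True)
```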

write(filename: str, format_name: str | None = None, **kwargs) → None#

Write the sequences to a file, preserving order of sequences.

Parameters:

filename: name of the sequence file
format_name: format of the sequence file

Notes

If format_name is None, will attempt to infer format from the filename suffix.