Alignment#

class Alignment(seqs_data: AlignedSeqsDataABC, slice_record: SliceRecord | None = None, **kwargs)#

A collection of aligned sequences.

Attributes:
annotation_db
array_positions

Returns a numpy array of positions, axis 0 is alignment positions columns in order corresponding to names.

array_seqs

Returns a numpy array of sequences, axis 0 is seqs in order

named_seqs
names
num_seqs
positions
seqs

Methods

add_feature(*, biotype, name, spans[, ...])

add feature on named sequence, or on the alignment itself

add_seqs(seqs, **kwargs)

Returns new collection with additional sequences.

alignment_quality([app_name])

Computes the alignment quality using the indicated app

apply_pssm([pssm, path, background, ...])

scores sequences using the specified pssm

apply_scaled_gaps(other[, aa_to_codon])

applies gaps in self to unagpped sequences

coevolution([stat, segments, drawable, ...])

performs pairwise coevolution measurement

copy()

creates new instance, only mutable attributes are copied

copy_annotations(seq_db)

copy annotations into attached annotation db

count_ambiguous_per_seq()

Return the counts of ambiguous characters per sequence as a DictArray.

count_gaps_per_pos([include_ambiguity])

return counts of gaps per position as a DictArray

count_gaps_per_seq([induced_by, unique, ...])

return counts of gaps per sequence as a DictArray

counts([motif_length, include_ambiguity, ...])

counts of motifs

counts_per_pos([motif_length, ...])

return DictArray of counts per position

counts_per_seq([motif_length, ...])

counts of non-overlapping motifs per sequence

deepcopy(**kwargs)

returns deep copy of self

degap()

Returns new SequenceCollection in which sequences have no gaps or missing characters.

distance_matrix([calc, drop_invalid, parallel])

Returns pairwise distances between sequences.

dotplot([name1, name2, window, threshold, ...])

make a dotplot between specified sequences.

entropy_per_pos([motif_length, ...])

returns shannon entropy per position

entropy_per_seq([motif_length, ...])

returns the Shannon entropy per sequence

filtered(predicate[, motif_length, ...])

The alignment positions where predicate(column) is true.

from_rich_dict(data)

returns a new instance from a rich dict

get_ambiguous_positions()

Returns dict of seq:{position:char} for ambiguous chars.

get_degapped_relative_to(name)

Remove all columns with gaps in sequence with given name.

get_drawable(*[, biotype, width, vertical, ...])

make a figure from sequence features

get_drawables(*[, biotype])

returns a dict of drawables, keyed by type

get_features(*[, seqid, biotype, name, ...])

yields Feature instances

get_gap_array([include_ambiguity])

returns bool array with gap state True, False otherwise

get_gapped_seq(seqname[, recode_gaps])

Return a gapped Sequence object for the specified seqname.

get_identical_sets([mask_degen])

returns sets of names for sequences that are identical

get_lengths([include_ambiguity, allow_gap])

returns sequence lengths as a dict of {seqid: length}

get_motif_probs([alphabet, ...])

Return a dictionary of motif probs, calculated as the averaged frequency across sequences.

get_position_indices(f[, native, negate])

Returns list of column indices for which f(col) is True.

get_projected_feature(*, seqid, feature)

returns an alignment feature projected onto the seqid sequence

get_projected_features(*, seqid, **kwargs)

projects all features from other sequences onto seqid

get_seq(seqname[, copy_annotations])

Return a Sequence object for the specified seqname.

get_seq_names_if(f[, negate])

Returns list of names of seqs where f(seq) is True.

get_similar(target[, min_similarity, ...])

Returns new SequenceCollection containing sequences similar to target.

get_translation([gc, incomplete_ok, ...])

translate sequences from nucleic acid to protein

has_terminal_stop([gc, strict])

Returns True if any sequence has a terminal stop codon.

information_plot([width, height, window, ...])

plot information per position

is_ragged()

by definition False for an Alignment

iter_positions([pos_order])

Iterates over positions in the alignment, in order.

iter_seqs([seq_order])

Iterates over sequences in the collection, in order.

iupac_consensus([allow_gap])

Returns string containing IUPAC consensus sequence of the alignment.

majority_consensus()

Returns consensus sequence containing most frequent item at each position.

make_feature(*, feature[, on_alignment])

create a feature on named sequence, or on the alignment itself

matching_ref(ref_name, gap_fraction, gap_run)

Returns new alignment with seqs well aligned with a reference.

no_degenerates([motif_length, allow_gap])

returns new alignment without degenerate characters

omit_bad_seqs([quantile])

Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile

omit_gap_pos([allowed_gap_frac, motif_length])

Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.

pad_seqs([pad_length])

Returns copy in which sequences are padded with the gap character to same length.

probs_per_pos([motif_length, ...])

returns MotifFreqsArray per position

probs_per_seq([motif_length, ...])

return MotifFreqsArray per sequence

quick_tree([calc, bootstrap, drop_invalid, ...])

Returns a phylogenetic tree.

rc()

Returns the reverse complement of all sequences in the alignment.

rename_seqs(renamer)

Returns new alignment with renamed sequences.

reverse_complement()

Returns the reverse complement of all sequences in the collection.

sample(*[, n, with_replacement, ...])

Returns random sample of positions from self, e.g. to bootstrap.

seqlogo([width, height, wrap, vspace, colours])

returns Drawable sequence logo using mutual information

set_repr_policy([num_seqs, num_pos, ...])

specify policy for repr(self)

sliding_windows(window, step[, start, end])

Generator yielding new alignments of given length and interval.

strand_symmetry([motif_length])

returns dict of strand symmetry test results per ungapped seq

take_positions(cols[, negate])

Returns new Alignment containing only specified positions.

take_positions_if(f[, negate])

Returns new Alignment containing cols where f(col) is True.

take_seqs(names[, negate, copy_annotations])

Returns new collection containing only specified seqs.

take_seqs_if(f[, negate])

Returns new collection containing seqs where f(seq) is True.

to_dict([as_array])

Return a dictionary of sequences.

to_dna()

returns copy of self as a collection of DNA moltype seqs

to_fasta([block_size])

Return collection in Fasta format.

to_html([name_order, wrap, limit, ref_name, ...])

returns html with embedded styles for sequence colouring

to_json()

returns json formatted string

to_moltype(moltype)

returns copy of self with changed moltype

to_phylip()

Return collection in PHYLIP format and mapping to sequence ids

to_pretty([name_order, wrap])

returns a string representation of the alignment in pretty print format

to_rich_dict()

returns a json serialisable dict

to_rna()

returns copy of self as a collection of RNA moltype seqs

trim_stop_codons([gc, strict])

Removes any terminal stop codons from the sequences

variable_positions([include_gap_motif, ...])

Return a list of variable position indexes.

with_masked_annotations(biotypes[, ...])

returns an alignment with regions replaced by mask_char

write(filename[, file_format])

Write the sequences to a file, preserving order of sequences.

gapped_by_map

to_type

Notes

Should be constructed using make_aligned_seqs().

add_feature(*, biotype: str, name: str, spans: List[Tuple[int, int]], seqid: OptStr = None, parent_id: OptStr = None, strand: str = '+', on_alignment: OptBool = None) Feature#

add feature on named sequence, or on the alignment itself

Parameters:
seqid

sequence name, incompatible with on_alignment

parent_id

name of the parent feature

biotype

biological type, e.g. CDS

name

name of the feature

spans

plus strand coordinates of feature

strand

‘+’ (default) or ‘-’

on_alignment

the feature is in alignment coordinates, incompatible with setting seqid. Set to True if seqid not provided.

Returns:
Feature
Raises:
ValueError if define a seqid not on alignment or use seqid and
on_alignment.
add_seqs(seqs: dict[str, str | bytes | ndarray[int]] | SeqsData | list, **kwargs) SequenceCollection#

Returns new collection with additional sequences.

Parameters:
seqs

sequences to add

alignment_quality(app_name: str = 'ic_score', **kwargs)#

Computes the alignment quality using the indicated app

Parameters:
app_name

name of an alignment score calculating app, e.g. ‘ic_score’, ‘cogent3_score’, ‘sp_score’

kwargs

keyword arguments to be passed to the app. Use cogent3.app_help(app_name) to see the available options.

Returns:
float or a NotCompleted instance if the score could not be computed
property annotation_db#
apply_pssm(pssm: PSSM = None, path: str | None = None, background: ndarray = None, pseudocount: int = 0, names: list | None = None, ui=None) array#

scores sequences using the specified pssm

Parameters:
pssm

A profile.PSSM instance, if not provided, will be loaded from path

path

path to either a jaspar or cisbp matrix (path must end have a suffix matching the format).

background

background frequencies distribution

pseudocount

adjustment for zero in matrix

names

returns only scores for these sequences and in the name order

Returns:
numpy array of log2 based scores at every position
apply_scaled_gaps(other: SequenceCollection, aa_to_codon: bool | None = None)#

applies gaps in self to unagpped sequences

property array_positions: ndarray#

Returns a numpy array of positions, axis 0 is alignment positions columns in order corresponding to names.

property array_seqs: ndarray#

Returns a numpy array of sequences, axis 0 is seqs in order corresponding to names

coevolution(stat: str = 'nmi', segments: list[tuple[int, int]] | None = None, drawable: str | None = None, show_progress: bool = False, parallel: bool = False, par_kw: dict | None = None)#

performs pairwise coevolution measurement

Parameters:
stat

coevolution metric, defaults to ‘nmi’ (Normalized Mutual Information). Valid choices are ‘rmi’ (Resampled Mutual Information) and ‘mi’, mutual information.

segments

coordinates of the form [(start, end), …] where all possible pairs of alignment positions within and between segments are examined.

drawable

Result object is capable of plotting data specified type. str value must be one of plot type ‘box’, ‘heatmap’, ‘violin’.

show_progress

shows a progress bar.

parallel

run in parallel, according to arguments in par_kwargs.

par_kw

dict of values for configuring parallel execution.

Returns:
DictArray of results with lower-triangular values. Upper triangular
elements and estimates that could not be computed for numerical reasons
are set as nan
copy()#

creates new instance, only mutable attributes are copied

copy_annotations(seq_db: SupportsFeatures) None#

copy annotations into attached annotation db

Parameters:
seq_db

compatible annotation db

Notes

Only copies annotations for records with seqid in self.names

count_ambiguous_per_seq() DictArray#

Return the counts of ambiguous characters per sequence as a DictArray.

count_gaps_per_pos(include_ambiguity: bool = True) DictArray#

return counts of gaps per position as a DictArray

Parameters:
include_ambiguity

if True, ambiguity characters that include the gap state are included

count_gaps_per_seq(induced_by: bool = False, unique: bool = False, include_ambiguity: bool = True, drawable: bool = False) DictArray#

return counts of gaps per sequence as a DictArray

Parameters:
induced_by

a gapped column is considered to be induced by a seq if the seq has a non-gap character in that column.

unique

count is limited to gaps uniquely induced by each sequence

include_ambiguity

if True, ambiguity characters that include the gap state are included

drawable

if True, resulting object is capable of plotting data via specified plot type ‘bar’, ‘box’ or ‘violin’

counts(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False) MotifCountsArray#

counts of motifs

Parameters:
motif_length

number of elements per character.

include_ambiguity

if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.

allow_gap

if True, motifs containing a gap character are included.

exclude_unobserved

if True, unobserved motif combinations are excluded.

Notes

only non-overlapping motifs are counted

counts_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) DictArray#

return DictArray of counts per position

Parameters:
motif_length

number of elements per character.

include_ambiguity

if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.

allow_gap

if True, motifs containing a gap character are included.

warn

warns if motif_length > 1 and alignment trimmed to produce motif columns

counts_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) MotifCountsArray#

counts of non-overlapping motifs per sequence

Parameters:
motif_length

number of elements per character.

include_ambiguity

if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.

allow_gap

if True, motifs containing a gap character are included.

exclude_unobserved

if False, all canonical states included

warn

warns if motif_length > 1 and alignment trimmed to produce motif columns

deepcopy(**kwargs)#

returns deep copy of self

Notes

Reduced to sliced sequences in self, kwargs are ignored. Annotation db is not copied if the alignment has been sliced.

degap() SequenceCollection#

Returns new SequenceCollection in which sequences have no gaps or missing characters.

Notes

The returned collection will not retain an annotation_db if present.

distance_matrix(calc: str = 'pdist', drop_invalid: bool = False, parallel: bool = False)#

Returns pairwise distances between sequences.

Parameters:
calc

a pairwise distance calculator name. Presently only ‘pdist’, ‘jc69’, ‘tn93’, ‘hamming’, ‘paralinear’ are supported.

drop_invalid

If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.

dotplot(name1: str | None = None, name2: str | None = None, window: int = 20, threshold: int | None = None, k: int | None = None, min_gap: int = 0, width: int = 500, title: str | None = None, rc: bool = False, biotype: str | tuple[str] = 'gene', show_progress: bool = False)#

make a dotplot between specified sequences. Random sequences chosen if names not provided.

Parameters:
name1, name2

names of sequences – if not provided, a random choice is made

window

segment size for comparison between sequences

threshold

windows where the sequences are identical >= threshold are a match

k

size of k-mer to break sequences into. Larger values increase speed but reduce resolution. If not specified, and window == threshold, then k is set to window. Otherwise, it is computed as the maximum of {threshold // (window - threshold), 5}.

min_gap

permitted gap for joining adjacent line segments, default is no gap joining

width

figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)

title

title for the plot

rc

include dotplot of reverse compliment also. Only applies to Nucleic acids moltypes

biotype

if selected sequences are annotated, display only these biotypes

Returns:
a Drawable or AnnotatedDrawable
entropy_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) ndarray#

returns shannon entropy per position

entropy_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = True, warn: bool = False) ndarray#

returns the Shannon entropy per sequence

Parameters:
motif_length

number of characters per tuple.

include_ambiguity

if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.

allow_gap

if True, motifs containing a gap character are included.

exclude_unobserved

if True, unobserved motif combinations are excluded.

warn

warns if motif_length > 1 and alignment trimmed to produce motif columns

Notes

For motif_length > 1, it’s advisable to specify exclude_unobserved=True, this avoids unnecessary calculations.

filtered(predicate: Callable[[Self], bool], motif_length: int = 1, drop_remainder: bool = True, **kwargs)#

The alignment positions where predicate(column) is true.

Parameters:
predicate

a callback function that takes an tuple of motifs and returns True/False

motif_length

length of the motifs the sequences should be split into, eg. 3 for filtering aligned codons.

drop_remainder

If length is not modulo motif_length, allow dropping the terminal remaining columns

classmethod from_rich_dict(data: dict[str, str | dict[str, str]])#

returns a new instance from a rich dict

gapped_by_map(keep: FeatureMap, **kwargs)#
get_ambiguous_positions()#

Returns dict of seq:{position:char} for ambiguous chars.

Used in likelihood calculations.

get_degapped_relative_to(name: str)#

Remove all columns with gaps in sequence with given name.

Parameters:
name

sequence name

Notes

The returned alignment will not retain an annotation_db if present.

get_drawable(*, biotype: str | Iterable[str] | None = None, width: int = 600, vertical: int = False, title: OptStr = None)#

make a figure from sequence features

Parameters:
biotype

passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.

width

width in pixels

vertical

rotates the drawable

title

title for the plot

Returns:
a Drawable instance
get_drawables(*, biotype: str | Iterable[str] | None = None) dict#

returns a dict of drawables, keyed by type

Parameters:
biotype

passed to get_features(biotype). Can be a single biotype or series. Only features matching this will be included.

get_features(*, seqid: str | None = None, biotype: str | None = None, name: str | None = None, on_alignment: bool | None = None, allow_partial: bool = False) Iterator[Feature]#

yields Feature instances

Parameters:
seqid

limit search to features on this named sequence, defaults to search all

biotype

biotype of the feature, e.g. CDS, gene

name

name of the feature

on_alignment

limit query to features on Alignment, ignores sequences. Ignored on SequenceCollection instances.

allow_partial

allow features partially overlaping self

Notes

When dealing with a nucleic acid moltype, the returned features will yield a sequence segment that is consistently oriented irrespective of strand of the current instance.

get_gap_array(include_ambiguity: bool = True) ndarray#

returns bool array with gap state True, False otherwise

Parameters:
include_ambiguity

if True, ambiguity characters that include the gap state are included

get_gapped_seq(seqname: str, recode_gaps: bool = False) Sequence#

Return a gapped Sequence object for the specified seqname.

Parameters:
seqname

sequence name

recode_gaps

if True, gap characters are replaced by the most general ambiguity code, e.g. N for DNA and RNA

Notes

This method breaks the connection to the annotation database.

get_identical_sets(mask_degen: bool = False) list[set]#

returns sets of names for sequences that are identical

Parameters:
mask_degen

if True, degenerate characters are ignored

get_lengths(include_ambiguity: bool = False, allow_gap: bool = False) dict[str, int]#

returns sequence lengths as a dict of {seqid: length}

Parameters:
include_ambiguity

if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.

allow_gap

if True, motifs containing a gap character are included.

get_motif_probs(alphabet: AlphabetABC = None, include_ambiguity: bool = False, exclude_unobserved: bool = False, allow_gap: bool = False, pseudocount: int = 0) dict#

Return a dictionary of motif probs, calculated as the averaged frequency across sequences.

Parameters:
alphabet

alphabet to use for motifs

include_ambiguity

if True resolved ambiguous codes are included in estimation of frequencies.

exclude_unobserved

if True, motifs that are not present in the alignment are excluded from the returned dictionary.

allow_gap

allow gap motif

pseudocount

value to add to each count

Notes

only non-overlapping motifs are counted

get_position_indices(f: Callable, native: bool = False, negate: bool = False) list[int]#

Returns list of column indices for which f(col) is True.

f

function that returns true/false given an alignment position

native

if True, f is provided with slice of array, otherwise the string is used

negate

if True, not f() is used

get_projected_feature(*, seqid: str, feature: Feature) Feature#

returns an alignment feature projected onto the seqid sequence

Parameters:
seqid

name of the sequence to project the feature onto

feature

a Feature, bound to self, that will be projected

Returns:
a new Feature bound to seqid

Notes

The alignment coordinates of feature are converted into the seqid sequence coordinates and the object is bound to that sequence.

The feature is added to the annotation_db.

get_projected_features(*, seqid: str, **kwargs) list[Feature]#

projects all features from other sequences onto seqid

get_seq(seqname: str, copy_annotations: bool = False) Sequence#

Return a Sequence object for the specified seqname.

Parameters:
seqname

name of the sequence to return

copy_annotations

if True, only the annotations for the specified sequence are copied to the annotation database of the Sequence object. If False, all annotations are copied.

get_seq_names_if(f: Callable[[Sequence], bool], negate: bool = False)#

Returns list of names of seqs where f(seq) is True.

Parameters:
f

function that takes a sequence object and returns True or False

negate

select all sequences EXCEPT those where f(seq) is True

Notes

Sequence objects can be converted into strings or numpy arrays using str() and numpy.array() respectively.

get_similar(target: ~cogent3.core.new_sequence.Sequence, min_similarity: float = 0.0, max_similarity: float = 1.0, metric: ~collections.abc.Callable[[~cogent3.core.new_sequence.Sequence, ~cogent3.core.new_sequence.Sequence], float] = <cogent3.util.transform.for_seq object>, transform: bool | None = None) SequenceCollection#

Returns new SequenceCollection containing sequences similar to target.

Parameters:
target

sequence object to compare to. Can be in the collection.

min_similarity

minimum similarity that will be kept. Default 0.0.

max_similarity

maximum similarity that will be kept. Default 1.0.

metric

a similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence class have lots of methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.

transform

transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.

Notes

both min_similarity and max_similarity are inclusive.

get_translation(gc: int | None = None, incomplete_ok: bool = False, include_stop: bool = False, trim_stop: bool = True, **kwargs)#

translate sequences from nucleic acid to protein

Parameters:
gc

genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)

incomplete_ok

codons that are mixes of nucleotide and gaps converted to ‘?’. raises a ValueError if False

include_stop

whether to allow a stops in the translated sequence

trim_stop

exclude terminal stop codons if they exist

kwargs

related to construction of the resulting object

Returns:
A new instance of self translated into protein

Notes

Translating will break the relationship to an annotation_db if present.

has_terminal_stop(gc: Any = None, strict: bool = False) bool#

Returns True if any sequence has a terminal stop codon.

Parameters:
gc

valid input to cogent3.get_code(), a genetic code object, number or name

strict

If True, raises an exception if a seq length not divisible by 3

information_plot(width: int | None = None, height: int | None = None, window: int | None = None, stat: str = 'median', include_gap: bool = True)#

plot information per position

Parameters:
width

figure width in pixels

height

figure height in pixels

window

used for smoothing line, defaults to sqrt(length)

stat

‘mean’ or ‘median, used as the summary statistic for each window

include_gap

whether to include gap counts, shown on right y-axis

is_ragged() bool#

by definition False for an Alignment

iter_positions(pos_order: list | None = None) Iterator[list, list, list]#

Iterates over positions in the alignment, in order.

Parameters:
pos_order

list of indices specifying the column order. If None, the positions are iterated in order.

Returns:
yields lists of elemenets for each position (column) in the alignment
iter_seqs(seq_order: list | None = None) Iterator[Sequence | SeqViewABC]#

Iterates over sequences in the collection, in order.

Parameters:
seq_order:

list of seqids giving the order in which seqs will be returned. Defaults to self.names

iupac_consensus(allow_gap: bool = True) str#

Returns string containing IUPAC consensus sequence of the alignment.

majority_consensus() Sequence#

Returns consensus sequence containing most frequent item at each position.

make_feature(*, feature: FeatureDataType, on_alignment: bool | None = None) Feature#

create a feature on named sequence, or on the alignment itself

Parameters:
feature

a dict with all the necessary data rto construct a feature

on_alignment

the feature is in alignment coordinates, incompatible with setting ‘seqid’. Set to True if ‘seqid’ not provided.

Returns:
Feature
Raises:
ValueError if define a ‘seqid’ not on alignment or use ‘seqid’ and
on_alignment.

Notes

To get a feature AND add it to annotation_db, use add_feature().

matching_ref(ref_name: str, gap_fraction: float, gap_run: int)#

Returns new alignment with seqs well aligned with a reference.

Parameters:
ref_name

name of the sequence to use as the reference

gap_fraction

fraction of positions that either have a gap in the template but not in the seq or in the seq but not in the template

gap_run

number of consecutive gaps tolerated in query relative to sequence or sequence relative to query

property named_seqs: SeqsDataABC#
property names: list#
no_degenerates(motif_length: int = 1, allow_gap: bool = False)#

returns new alignment without degenerate characters

Parameters:
motif_length

sequences are segmented into units of this size and the segments are excluded if they contain degenerate characters.

allow_gap

whether gaps are allowed or whether they are treated as a degenerate character (latter is default, as most evolutionary modelling treats gaps as N).

property num_seqs: int#
omit_bad_seqs(quantile: float | None = None)#

Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile

Uses count_gaps_per_seq(unique=True) to obtain the counts of gaps uniquely introduced by a sequence. The cutoff is the quantile of this distribution.

Parameters:
quantile

sequences whose unique gap count is in a quantile larger than this cutoff are excluded. The default quantile is (num_seqs - 1) / num_seqs

omit_gap_pos(allowed_gap_frac: float | None = None, motif_length: int = 1)#

Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.

Parameters:
allowed_gap_frac

specifies proportion of gaps is allowed in each column. Set to 0 to exclude columns with any gaps, 1 to include all columns. Default is None which is equivalent to (num_seqs-1)/num_seqs and leads to elimination of columns that are only gaps.

motif_length

set’s the “column” width, e.g. setting to 3 corresponds to codons. A motif that includes a gap at any position is included in the counting.

pad_seqs(pad_length: int | None = None)#

Returns copy in which sequences are padded with the gap character to same length.

Parameters:
pad_length

Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.

property positions#
probs_per_pos(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, warn: bool = False) MotifFreqsArray#

returns MotifFreqsArray per position

probs_per_seq(motif_length: int = 1, include_ambiguity: bool = False, allow_gap: bool = False, exclude_unobserved: bool = False, warn: bool = False) MotifFreqsArray#

return MotifFreqsArray per sequence

Parameters:
motif_length

number of characters per tuple.

include_ambiguity

if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.

allow_gap

if True, motifs containing a gap character are included.

exclude_unobserved

if True, unobserved motif combinations are excluded.

warn

warns if motif_length > 1 and alignment trimmed to produce motif columns

quick_tree(calc: str = 'pdist', bootstrap: int | None = None, drop_invalid: bool = False, parallel: bool = False, show_progress: bool = False, ui=None)#

Returns a phylogenetic tree.

Parameters:
calc

a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances

bootstrap

Number of non-parametric bootstrap replicates. Resamples alignment columns with replacement and builds a phylogeny for each such resampling.

drop_invalid

If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.

parallel

parallel execution of distance calculations

show_progress

controls progress display for distance calculation

Returns:
a phylogenetic tree. If bootstrap specified, returns the weighted
majority consensus. Support for each node is stored as
edge.params[‘params’].

Notes

Sequences in the observed alignment for which distances could not be computed are omitted. Bootstrap replicates are required to have distances for all seqs present in the observed data distance matrix.

rc()#

Returns the reverse complement of all sequences in the alignment. A synonym for reverse_complement.

rename_seqs(renamer: Callable[[str], str])#

Returns new alignment with renamed sequences.

reverse_complement()#

Returns the reverse complement of all sequences in the collection. A synonym for rc.

sample(*, n: int | None = None, with_replacement: bool = False, motif_length: int = 1, randint=<bound method RandomState.randint of RandomState(MT19937)>, permutation=<bound method RandomState.permutation of RandomState(MT19937)>)#

Returns random sample of positions from self, e.g. to bootstrap.

Parameters:
n

number of positions to sample. If None, all positions are sampled.

with_replacement

if True, samples with replacement.

motif_length

number of positions to sample as a single motif.

randint

random number generator, default is numpy.randint

permutation

function to generate a random permutation of positions, default is numpy.permutation

Notes

By default (resampling all positions without replacement), generates a permutation of the positions of the alignment.

Setting with_replacement to True and otherwise leaving parameters as defaults generates a standard bootstrap resampling of the alignment.

returns Drawable sequence logo using mutual information

Parameters:
width, height

plot dimensions in pixels

wrap

number of alignment columns per row

vspace

vertical separation between rows, as a proportion of total plot

colours

mapping of characters to colours. If note provided, defaults to custom for everything ecept protein, which uses protein moltype colours.

Notes

Computes MI based on log2 and includes the gap state, so the maximum possible value is -log2(1/num_states)

property seqs: _IndexableSeqs#
set_repr_policy(num_seqs: int | None = None, num_pos: int | None = None, ref_name: int | None = None, wrap: int | None = None) None#

specify policy for repr(self)

Parameters:
num_seqs

number of sequences to include in represented display.

num_pos

length of sequences to include in represented display.

ref_name

name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.

wrap

number of printed bases per row

sliding_windows(window: int, step: int, start: int | None = None, end: int | None = None)#

Generator yielding new alignments of given length and interval.

Parameters:
window

The length of each returned alignment.

step

The interval between the start of the successive windows.

start

first window start position

end

last window start position

strand_symmetry(motif_length: int = 1)#

returns dict of strand symmetry test results per ungapped seq

take_positions(cols: list, negate: bool = False)#

Returns new Alignment containing only specified positions.

Parameters:
cols

list of column indices to keep

negate

if True, all columns except those in cols are kept

take_positions_if(f, negate=False)#

Returns new Alignment containing cols where f(col) is True.

take_seqs(names: str | Sequence[str], negate: bool = False, copy_annotations: bool = False, **kwargs)#

Returns new collection containing only specified seqs.

Parameters:
names

sequences to select (or exclude if negate=True)

negate

select all sequences EXCEPT names

kwargs

keyword arguments to be passed to the constructor of the new collection

copy_annotations

if True, only annotations from selected seqs are copied to the annotation_db of the new collection

take_seqs_if(f: Callable[[Sequence], bool], negate: bool = False)#

Returns new collection containing seqs where f(seq) is True.

Parameters:
f

function that takes a sequence object and returns True or False

negate

select all sequences EXCEPT those where f(seq) is True

Notes

Sequence objects can be converted into strings or numpy arrays using str() and numpy.array() respectively.

to_dict(as_array: bool = False) dict[str, str | ndarray]#

Return a dictionary of sequences.

Parameters:
as_array

if True, sequences are returned as numpy arrays, otherwise as strings

to_dna()#

returns copy of self as a collection of DNA moltype seqs

to_fasta(block_size: int = 60) str#

Return collection in Fasta format.

Parameters:
block_size

the sequence length to write to each line, by default 60

Returns:
The collection in Fasta format.
to_html(name_order: Sequence[str] | None = None, wrap: int = 60, limit: int | None = None, ref_name: str = 'longest', colors: Mapping[str, str] | None = None, font_size: int = 12, font_family: str = 'Lucida Console') str#

returns html with embedded styles for sequence colouring

Parameters:
name_order

order of names for display.

wrap

number of alignment columns per row

limit

truncate alignment to this length

ref_name

Name of an existing sequence or ‘longest’. If the latter, the longest sequence (excluding gaps and ambiguities) is selected as the reference.

colors

{character moltype.

font_size

in points. Affects labels and sequence and line spacing (proportional to value)

font_family

string denoting font family

Examples

In a jupyter notebook, this code is used to provide the representation.

aln  # is rendered by jupyter

You can directly use the result for display in a notebook as

from IPython.core.display import HTML

HTML(aln.to_html())
to_json()#

returns json formatted string

to_moltype(moltype: str | MolType) SequenceCollection#

returns copy of self with changed moltype

Parameters:
moltype

name of the new moltype, e.g, ‘dna’, ‘rna’.

Notes

Cannot convert from nucleic acids to proteins. Use get_translation() for that.

to_phylip()#

Return collection in PHYLIP format and mapping to sequence ids

Notes

raises exception if sequences do not all have the same length

to_pretty(name_order=None, wrap=None)#

returns a string representation of the alignment in pretty print format

Parameters:
name_order

order of names for display.

wrap

maximum number of printed bases

to_rich_dict() dict[str, str | dict[str, str]]#

returns a json serialisable dict

to_rna()#

returns copy of self as a collection of RNA moltype seqs

to_type(**kwargs)#
trim_stop_codons(gc: Any = None, strict: bool = False, **kwargs)#

Removes any terminal stop codons from the sequences

Parameters:
gc

valid input to cogent3.get_code(), a genetic code object, number or name, defaults to standard code

strict

If True, raises an exception if a seq length not divisible by 3

variable_positions(include_gap_motif: bool = True, include_ambiguity: bool = False, motif_length: int = 1) tuple[int]#

Return a list of variable position indexes.

Parameters:
include_gap_motif

if False, sequences with a gap motif in a column are ignored.

include_ambiguity

if True, all states are considered.

motif_length

if any position within a motif is variable, the entire motif is considered variable.

Returns:
tuple of integers, if motif_length > 1, the returned positions are
motif_length long sequential indices.

Notes

Truncates alignment to be modulo motif_length.

with_masked_annotations(biotypes: Sequence[str], mask_char: str = '?', shadow: bool = False)#

returns an alignment with regions replaced by mask_char

Parameters:
biotypes

annotation type(s)

mask_char

must be a character valid for the moltype. The default value is the most ambiguous character, eg. ‘?’ for DNA

shadow

If True, masks everything but the biotypes

write(filename: str, file_format: str | None = None, **kwargs) None#

Write the sequences to a file, preserving order of sequences.

Parameters:
filename

name of the sequence file

file_format

format of the sequence file

Notes

If file_format is None, will attempt to infer format from the filename suffix.