ArrayAlignment#

class ArrayAlignment(*args, **kwargs)#

Holds a dense array representing a multiple sequence alignment.

An Alignment is _often_, but not necessarily, an array of chars. You might want to use some other data type for the alignment if you have a large number of symbols. For example, codons on an ungapped DNA alphabet have 4*4*4=64 entries, so they fit in a standard char data type, but tripeptides on the 20-letter ungapped protein alphabet have 20*20*20=8000 entries, so they do _not_ fit in a char and values will wrap around (i.e. you will silently get a wrong value for any item whose index is greater than the max value, e.g. 255 for uint8). In this case you would need to use uint16, which can hold 65536 values. DO NOT USE SIGNED DATA TYPES FOR YOUR ALIGNMENT ARRAY UNLESS YOU LOVE MISERY AND HARD-TO-DEBUG PROBLEMS.
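The arithmetic behind this choice can be sketched in plain Python. This is an illustration only, not part of the library: `smallest_unsigned_bits` is a hypothetical helper, and `ctypes.c_uint8` is used just to show what 8-bit wrap-around does to an index.

```python
import ctypes


def smallest_unsigned_bits(num_states):
    """Return the narrowest unsigned width (8, 16, 32 or 64 bits) that can index num_states items."""
    for bits in (8, 16, 32, 64):
        if num_states <= 2 ** bits:
            return bits
    raise ValueError("too many states for a 64-bit index")


codons = 4 ** 3        # 64 states on an ungapped DNA alphabet
tripeptides = 20 ** 3  # 8000 states on the ungapped protein alphabet

codon_bits = smallest_unsigned_bits(codons)            # fits in 8 bits
tripeptide_bits = smallest_unsigned_bits(tripeptides)  # needs 16 bits

# Storing the last tripeptide index (7999) in 8 bits wraps around modulo 256
wrapped = ctypes.c_uint8(tripeptides - 1).value
```

Here `wrapped` is no longer 7999, which is the silent corruption the paragraph above warns about.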

Implementation: aln[i] returns position i in the alignment.

aln.positions[i] returns the same as aln[i] – usually, users think of this as a ‘column’, because alignment editors such as Clustal typically display each sequence as a row so a position that cuts across sequences is a column.

aln.seqs[i] returns a sequence, or ‘row’ of the alignment in standard terminology.

WARNING: aln.seqs and aln.positions are different views of the same array, so if you change one you will change the other. This will no longer be true if you assign to seqs or positions directly, so don’t do it. If you want to change the data in the whole array, always assign to a slice so that both views update: aln.seqs[:] = x instead of aln.seqs = x. If you get the two views out of sync, you will get all sorts of exceptions. No validation is performed on aln.seqs and aln.positions for performance reasons, so this can really get you into trouble.
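The difference between assigning to a slice and rebinding an attribute can be shown with a toy stand-in. This is an analogy only: `TwoViews` is a hypothetical class, and plain lists replace the real array views (in the real class, positions is a transposed view rather than the same list).

```python
class TwoViews:
    """Toy stand-in: two attribute names that refer to one shared buffer."""

    def __init__(self, data):
        self.seqs = data
        self.positions = data  # same object, so edits are visible through both names


aln = TwoViews(["ACGT", "AC-T"])

# Slice assignment changes the shared buffer in place: both names see the new data
aln.seqs[:] = ["TTTT", "GGGG"]
in_sync = aln.positions == ["TTTT", "GGGG"]

# Rebinding replaces only one name: the two attributes now disagree
aln.seqs = ["CCCC", "AAAA"]
out_of_sync = aln.positions != aln.seqs
```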

Alignments are meant to be treated as immutable, though this is not enforced. If you change the data after the alignment is created, you may get errors or inconsistent results.

Class properties:

alphabet: should be an Alphabet object. Must provide mapping between items (possibly, but not necessarily, characters) in the alignment and indices of those characters in the resulting Alignment object.

SequenceType: Constructor to use when building sequences. Default: Sequence

Creating a new array will always result in a new object unless you use the force_same_object=True parameter.

WARNING: Rebinding the names attribute in an ArrayAlignment is not recommended because not all methods will use the updated name order. This is because the original sequence and name order are used to produce data structures that are cached for efficiency, and are not updated if you change the names attribute.

WARNING: ArrayAlignment strips off info objects from sequences that have them, primarily for efficiency.

Attributes:

annotation_db
named_seqs
num_seqs: Returns the number of sequences in the alignment.
positions: Override superclass positions to return positions as symbols.
seqs

Methods

`add_from_ref_aln`(ref_aln[, before_name, ...])	Insert sequence(s) to self based on their alignment to a reference sequence.
`add_seqs`(other[, before_name, after_name])	Returns new object of class self with sequences from other added.
`alignment_quality`([app_name])	Computes the alignment quality using the indicated app
`apply_pssm`([pssm, path, background, ...])	scores sequences using the specified pssm
`coevolution`([stat, segments, drawable, ...])	performs pairwise coevolution measurement
`copy`()	Returns deep copy of self.
`count_gaps_per_pos`([include_ambiguity])	return counts of gaps per position as a DictArray
`count_gaps_per_seq`([induced_by, unique, ...])	return counts of gaps per sequence as a DictArray
`counts`([motif_length, include_ambiguity, ...])	counts of motifs
`counts_per_pos`([motif_length, ...])	return DictArray of counts per position
`counts_per_seq`([motif_length, ...])	counts of non-overlapping motifs per sequence
`deepcopy`([sliced])	Returns deep copy of self.
`degap`(**kwargs)	Returns copy in which sequences have no gaps.
`distance_matrix`([calc, show_progress, ...])	Returns pairwise distances between sequences.
`dotplot`([name1, name2, window, threshold, ...])	make a dotplot between specified sequences.
`entropy_per_pos`([motif_length, ...])	returns shannon entropy per position
`entropy_per_seq`([motif_length, ...])	returns the Shannon entropy per sequence
`filtered`(predicate[, motif_length, ...])	The alignment positions where predicate(column) is true.
`get_ambiguous_positions`()	Returns dict of seq:{position:char} for ambiguous chars.
`get_degapped_relative_to`(name)	Remove all columns with gaps in sequence with given name.
`get_gap_array`([include_ambiguity])	returns bool array with gap state True, False otherwise
`get_gapped_seq`(seq_name[, recode_gaps])	Return a gapped Sequence object for the specified seqname.
`get_identical_sets`([mask_degen])	returns sets of names for sequences that are identical
`get_lengths`([include_ambiguity, allow_gap])	returns {name: seq length, ...}
`get_motif_probs`([alphabet, ...])	Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
`get_position_indices`(f[, native, negate])	Returns list of column indices for which f(col) is True.
`get_seq`(seqname)	Return a sequence object for the specified seqname.
`get_seq_indices`(f[, negate])	Returns list of keys of seqs where f(row) is True.
`get_similar`(target[, min_similarity, ...])	Returns new Alignment containing sequences similar to target.
`get_sub_alignment`([seqs, pos, negate_seqs, ...])	Returns subalignment of specified sequences and positions.
`get_translation`([gc, incomplete_ok, ...])	translate from nucleic acid to protein
`has_terminal_stop`([gc, strict])	Returns True if any sequence has a terminal stop codon.
`information_plot`([width, height, window, ...])	plot information per position
`is_ragged`()	Returns True if alignment has sequences of different lengths.
`iter_positions`([pos_order])	Iterates over positions in the alignment, in order.
`iter_selected`([seq_order, pos_order])	Iterates over elements in the alignment.
`iter_seqs`([seq_order])	Iterates over values (sequences) in the alignment, in order.
`iupac_consensus`([alphabet, allow_gap])	Returns string containing IUPAC consensus sequence of the alignment.
`majority_consensus`()	Returns list containing most frequent item at each position.
`matching_ref`(ref_name, gap_fraction, gap_run)	Returns new alignment with seqs well aligned with a reference.
`no_degenerates`([motif_length, allow_gap])	returns new alignment without degenerate characters
`omit_bad_seqs`([quantile])	Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile
`omit_gap_pos`([allowed_gap_frac, motif_length])	Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.
`omit_gap_runs`([allowed_run])	Returns new alignment where all seqs have runs of gaps <=allowed_run.
`omit_gap_seqs`([allowed_gap_frac])	Returns new alignment with seqs that have <= allowed_gap_frac.
`pad_seqs`([pad_length])	Returns copy in which sequences are padded to same length.
`probs_per_pos`([motif_length, ...])	returns MotifFreqsArray per position
`probs_per_seq`([motif_length, ...])	return MotifFreqsArray per sequence
`quick_tree`([calc, bootstrap, drop_invalid, ...])	Returns a phylogenetic tree estimated from pairwise distances between sequences.
`rc`()	Returns the reverse complement alignment
`rename_seqs`(renamer)	returns new instance with sequences renamed
`replace_seqs`(seqs[, aa_to_codon])	Returns new alignment with same shape but with data taken from seqs.
`reverse_complement`()	Returns the reverse complement alignment.
`sample`([n, with_replacement, motif_length, ...])	Returns random sample of positions from self, e.g. to bootstrap.
`seqlogo`([width, height, wrap, vspace, colours])	returns Drawable sequence logo using mutual information
`set_repr_policy`([num_seqs, num_pos, ...])	specify policy for repr(self)
`sliding_windows`(window, step[, start, end])	Generator yielding new alignments of given length and interval.
`strand_symmetry`([motif_length])	returns dict of strand symmetry test results per seq
`take_positions`(cols[, negate])	Returns new Alignment containing only specified positions.
`take_positions_if`(f[, negate])	Returns new Alignment containing cols where f(col) is True.
`take_seqs`(seqs[, negate])	Returns new Alignment containing only specified seqs.
`take_seqs_if`(f[, negate])	Returns new Alignment containing seqs where f(row) is True.
`to_dict`()	Returns the alignment as a dict of sequence names -> strings.
`to_dna`()	returns copy of self as an alignment of DNA moltype seqs
`to_fasta`([block_size])	Return alignment in Fasta format.
`to_html`([name_order, wrap, limit, ref_name, ...])	returns html with embedded styles for sequence colouring
`to_json`()	returns json formatted string
`to_moltype`(moltype)	returns copy of self with moltype seqs
`to_nexus`(seq_type[, wrap])	Return alignment in NEXUS format and mapping to sequence ids
`to_phylip`()	Return alignment in PHYLIP format and mapping to sequence ids
`to_pretty`([name_order, wrap])	returns a string representation of the alignment in pretty print format
`to_protein`()	returns copy of self as an alignment of PROTEIN moltype seqs
`to_rich_dict`()	returns detailed content including info and moltype attributes
`to_rna`()	returns copy of self as an alignment of RNA moltype seqs
`to_type`([array_align, moltype, alphabet])	returns alignment of type indicated by array_align
`trim_stop_codons`([gc, strict])	Removes any terminal stop codons from the sequences
`variable_positions`([include_gap_motif])	Return a list of variable position indexes.
`with_modified_termini`()	Changes the termini to include termini char instead of gapmotif.
`write`([filename, format])	Write the alignment to a file, preserving order of sequences.

add_from_ref_aln(ref_aln, before_name=None, after_name=None)#

Insert sequence(s) to self based on their alignment to a reference sequence. Assumes the first sequence in ref_aln (ref_aln.names[0]) is the reference.

By default the sequence is appended to the end of the alignment; this can be changed by using either the before_name or after_name argument.

Returns Alignment object of the same class.

Parameters:

ref_aln: reference alignment (Alignment object/series) of reference sequence and sequences to add. New sequences in ref_aln (ref_aln.names[1:]) are sequences to add. If series is used as ref_aln, it must have the structure [[‘ref_name’, SEQ], [‘name’, SEQ]]
before_name: name of the sequence before which sequence is added
after_name: name of the sequence after which sequence is added. If both before_name and after_name are specified, seqs will be inserted using before_name.

Examples

Aln1:
-AC-DEFGHI (name: seq1)
XXXXXX--XX (name: seq2)
YYYY-YYYYY (name: seq3)

Aln2:
ACDEFGHI (name: seq1)
KL--MNPR (name: seqX)
KLACMNPR (name: seqY)
KL--MNPR (name: seqZ)

Out:
-AC-DEFGHI (name: seq1)
XXXXXX--XX (name: seq2)
YYYY-YYYYY (name: seq3)
-KL---MNPR (name: seqX)
-KL-ACMNPR (name: seqY)
-KL---MNPR (name: seqZ)
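The projection in this example can be reproduced with a short plain-Python sketch. It is not the library's implementation: `project_onto_gapped_ref` is a hypothetical helper, and it assumes the reference carries no gaps of its own in the second alignment.

```python
def project_onto_gapped_ref(gapped_ref, ref_in_new, new_seq, gap="-"):
    """Place new_seq into the coordinates of gapped_ref.

    Assumes ref_in_new is the ungapped reference and is aligned
    column-for-column with new_seq.
    """
    if gapped_ref.replace(gap, "") != ref_in_new or len(ref_in_new) != len(new_seq):
        raise ValueError("reference sequences do not match")
    symbols = iter(new_seq)
    # take the next symbol wherever the reference has a residue, emit a gap elsewhere
    return "".join(gap if ref_char == gap else next(symbols) for ref_char in gapped_ref)


gapped_ref = "-AC-DEFGHI"  # seq1 as it appears in Aln1
ref_in_new = "ACDEFGHI"    # seq1 as it appears in Aln2
added = {
    name: project_onto_gapped_ref(gapped_ref, ref_in_new, seq)
    for name, seq in [("seqX", "KL--MNPR"), ("seqY", "KLACMNPR"), ("seqZ", "KL--MNPR")]
}
```

Each value in `added` has the same length as seq1 in Aln1, so the new rows can be appended to that alignment directly.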

add_seqs(other, before_name=None, after_name=None)#

Returns new object of class self with sequences from other added.

Parameters:

other: same class as self or coercible to that class
before_name (str): name of the sequence before which sequence is added
after_name (str): name of the sequence after which sequence is added

Notes

If both before_name and after_name are specified, the seqs will be inserted using before_name.

By default the sequence is appended to the end of the alignment; this can be changed by using either the before_name or after_name argument.

alignment_quality(app_name: str = 'ic_score', **kwargs)#

Computes the alignment quality using the indicated app

Parameters:

app_name: name of an alignment score calculating app, e.g. ‘ic_score’, ‘cogent3_score’, ‘sp_score’
kwargs: keyword arguments to be passed to the app. Use cogent3.app_help(app_name) to see the available options.

Returns:

float or a NotCompleted instance if the score could not be computed

alphabet = ('\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\r', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87', '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f', '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97', '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f', '\xa0', '¡', '¢', '£', '¤', '¥', '¦', '§', '¨', '©', 'ª', '«', '¬', '\xad', '®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '¼', '½', '¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ')#

property annotation_db#

apply_pssm(pssm=None, path=None, background=None, pseudocount=0, names=None, ui=None)#

scores sequences using the specified pssm

Parameters:

pssm (profile.PSSM): if not provided, will be loaded from path
path: path to either a jaspar or cisbp matrix (path must have a suffix matching the format).
pseudocount: adjustment for zero in matrix
names: returns only scores for these sequences and in the name order

Returns:

numpy array of log2 based scores at every position

coevolution(stat='nmi', segments=None, drawable=None, show_progress=False, parallel=False, par_kw=None)#

performs pairwise coevolution measurement

Parameters:

stat (str): coevolution metric, defaults to ‘nmi’ (Normalized Mutual Information). Other valid choices are ‘rmi’ (Resampled Mutual Information) and ‘mi’ (mutual information).
segments (coordinate series): coordinates of the form [(start, end), …] where all possible pairs of alignment positions within and between segments are examined.
drawable (None or str): if a str, the result object is capable of plotting the data as the specified plot type, which must be one of ‘box’, ‘heatmap’ or ‘violin’.
show_progress (bool): shows a progress bar

Returns:

DictArray of results with lower-triangular values. Upper triangular elements and estimates that could not be computed for numerical reasons are set as nan.
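For orientation, mutual information between two alignment columns can be computed as follows. This is a plain-Python sketch, not the library's code; in particular, dividing by the joint entropy in `normalised_mi` is an assumed normalisation and may differ from what ‘nmi’ does internally.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (bits) of the observed items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def mutual_information(col1, col2):
    """MI between two alignment columns: H(X) + H(Y) - H(X, Y)."""
    return entropy(col1) + entropy(col2) - entropy(list(zip(col1, col2)))


def normalised_mi(col1, col2):
    """MI divided by the joint entropy (an assumed normalisation); nan if undefined."""
    joint = entropy(list(zip(col1, col2)))
    return mutual_information(col1, col2) / joint if joint else float("nan")


# two perfectly covarying columns, and one invariant column
col_a = "AACC"
col_b = "GGTT"
col_c = "AAAA"
```

In this sketch the covarying pair `col_a`, `col_b` gets the maximum normalised score, while any pair of invariant columns has zero joint entropy and so yields nan, matching the kind of numerically undefined estimate mentioned above.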

copy()#

Returns deep copy of self.

count_gaps_per_pos(include_ambiguity=True)#

return counts of gaps per position as a DictArray

Parameters:

include_ambiguity (bool): if True, ambiguity characters that include the gap state are included
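A minimal plain-Python sketch of the per-position count is shown here. It returns a list rather than a DictArray, `gaps_per_position` is a hypothetical helper, and treating ‘?’ as the only ambiguity code that includes the gap state is an assumption.

```python
def gaps_per_position(seqs, include_ambiguity=True):
    """Count gap characters in each column of equal-length sequences."""
    # Assumption: '?' is an ambiguity code that may stand for a gap
    gap_chars = {"-", "?"} if include_ambiguity else {"-"}
    return [sum(symbol in gap_chars for symbol in column) for column in zip(*seqs)]


example = ["AC-T", "A--?", "ACGT"]
with_ambiguity = gaps_per_position(example)
strict_gaps = gaps_per_position(example, include_ambiguity=False)
```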

count_gaps_per_seq(induced_by=False, unique=False, include_ambiguity=True, drawable=False)#

return counts of gaps per sequence as a DictArray

Parameters:

induced_by (bool): a gapped column is considered to be induced by a seq if the seq has a non-gap character in that column.
unique (bool): count is limited to gaps uniquely induced by each sequence
include_ambiguity (bool): if True, ambiguity characters that include the gap state are included
drawable (bool or str): if True, resulting object is capable of plotting data via specified plot type ‘bar’, ‘box’ or ‘violin’

counts(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False)#

counts of motifs

Parameters:

motif_length: number of elements per character.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.

Notes

only non-overlapping motifs are counted
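What "non-overlapping" means here can be illustrated with a plain-Python sketch. `count_motifs` is a hypothetical helper for a single sequence, not the library method, and it assumes ‘-’ is the gap character.

```python
from collections import Counter


def count_motifs(seq, motif_length=1, allow_gap=False, gap="-"):
    """Count non-overlapping motifs of motif_length, reading from the first symbol."""
    usable = len(seq) - len(seq) % motif_length  # ignore an incomplete trailing motif
    motifs = (seq[i : i + motif_length] for i in range(0, usable, motif_length))
    return Counter(m for m in motifs if allow_gap or gap not in m)


dinucs = count_motifs("ACACGT-AC", motif_length=2)
dinucs_with_gaps = count_motifs("ACACGT-AC", motif_length=2, allow_gap=True)
```

The sequence is read as AC, AC, GT, -A, with the final C left over; the motif CA, which only appears in an overlapping reading frame, is never counted.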

counts_per_pos(motif_length=1, include_ambiguity=False, allow_gap=False, warn=False)#

return DictArray of counts per position

Parameters:

warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

counts_per_seq(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, warn=False)#

counts of non-overlapping motifs per sequence

Parameters:

motif_length: number of elements per character.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if False, all canonical states included
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

Returns:

MotifCountsArray

deepcopy(sliced=True)#

Returns deep copy of self.

default_gap = '-'#

degap(**kwargs)#

Returns copy in which sequences have no gaps.

Parameters:

kwargs: passed to class constructor

distance_matrix(calc='pdist', show_progress=False, drop_invalid=False)#

Returns pairwise distances between sequences.

Parameters:

calc (str): a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances
show_progress (bool): controls progress display for distance calculation
drop_invalid (bool): If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.

dotplot(name1: str | None = None, name2: str | None = None, window: int = 20, threshold: int | None = None, k: int | None = None, min_gap: int = 0, width: int = 500, title: str | None = None, rc: bool = False, show_progress: bool = False)#

make a dotplot between specified sequences. Random sequences chosen if names not provided.

Parameters:

name1, name2: names of sequences. If not provided, a random choice is made
window: segment size for comparison between sequences
threshold: windows in which the number of identical positions is >= threshold are a match
k: size of k-mer to break sequences into. Larger values increase speed but reduce resolution. If not specified, and window == threshold, then k is set to window. Otherwise, it is computed as the maximum of {threshold // (window - threshold), 5}.
min_gap: permitted gap for joining adjacent line segments, default is no gap joining
width: figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)
title: title for the plot
rc: include dotplot of reverse complement also. Only applies to nucleic acid moltypes

Returns:

a Drawable or AnnotatedDrawable
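The rule given for the k parameter can be written out as a small function. This is a sketch of the stated rule only: `default_k` is a hypothetical name, and how the library resolves threshold=None before applying the rule is not covered.

```python
def default_k(window, threshold, minimum=5):
    """k-mer size following the rule described for the k parameter."""
    if window == threshold:
        return window
    return max(threshold // (window - threshold), minimum)


k_exact = default_k(window=20, threshold=20)  # exact matching of whole windows
k_near = default_k(window=20, threshold=18)   # 18 // 2
k_loose = default_k(window=20, threshold=10)  # 10 // 10 is below the floor of 5
```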

entropy_per_pos(motif_length=1, include_ambiguity=False, allow_gap=False, warn=False)#

Returns Shannon entropy per position
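As a reminder of the quantity involved, Shannon entropy per column can be computed like this in plain Python. The sketch handles only motif_length=1, does no gap or ambiguity filtering, and `entropy_per_position` is a hypothetical helper rather than the library method.

```python
from collections import Counter
from math import log2


def entropy_per_position(seqs):
    """Shannon entropy (bits) of each alignment column."""
    result = []
    for column in zip(*seqs):
        total = len(column)
        probs = [n / total for n in Counter(column).values()]
        result.append(sum(-p * log2(p) for p in probs))
    return result


per_pos = entropy_per_position(["ACGT", "ACGA", "ATCA", "ATCT"])
```

The first column is invariant and so has zero entropy; each remaining column holds two states at equal frequency and so has one bit.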

entropy_per_seq(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=True, warn=False)#

returns the Shannon entropy per sequence

Parameters:

motif_length: number of characters per tuple.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns

Notes

For motif_length > 1, it’s advisable to specify exclude_unobserved=True; this avoids unnecessary calculations.

filtered(predicate, motif_length=1, drop_remainder=True, **kwargs)#

The alignment positions where predicate(column) is true.

Parameters:

predicate (callable): a callback function that takes a tuple of motifs and returns True/False
motif_length (int): length of the motifs the sequences should be split into, e.g. 3 for filtering aligned codons.
drop_remainder (bool): if the alignment length is not a multiple of motif_length, allow dropping the terminal remaining columns
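The interaction between predicate, motif_length and drop_remainder can be sketched in plain Python. `filter_columns` and `no_gaps` are hypothetical helpers working on a list of strings, and raising ValueError when drop_remainder is False is an assumption about the failure mode.

```python
def filter_columns(seqs, predicate, motif_length=1, drop_remainder=True):
    """Keep the motif columns for which predicate(column) is true.

    Each column is a tuple holding one motif (a string of motif_length symbols) per sequence.
    """
    length = len(seqs[0])
    remainder = length % motif_length
    if remainder and not drop_remainder:
        raise ValueError("alignment length is not a multiple of motif_length")
    kept = [[] for _ in seqs]
    for start in range(0, length - remainder, motif_length):
        column = tuple(seq[start : start + motif_length] for seq in seqs)
        if predicate(column):
            for row, motif in zip(kept, column):
                row.append(motif)
    return ["".join(row) for row in kept]


def no_gaps(column):
    """Predicate: true when no motif in the column contains a gap."""
    return all("-" not in motif for motif in column)


codon_filtered = filter_columns(["ATGA--TAAC", "ATGCCCTAAC"], no_gaps, motif_length=3)
```

The trailing tenth column is dropped because it does not complete a codon, and the second codon column is removed because one sequence has gaps there.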

gap_chars = {'-': None, '?': None}#

get_ambiguous_positions()#

Returns dict of seq:{position:char} for ambiguous chars.

Used in likelihood calculations.

get_degapped_relative_to(name)#

Remove all columns with gaps in sequence with given name.

Returns Alignment object of the same class. Note that the seqs in the new Alignment are always new objects.

Parameters:

name: sequence name

get_gap_array(include_ambiguity=True)#

returns bool array with gap state True, False otherwise

Parameters:

include_ambiguity (bool): if True, ambiguity characters that include the gap state are included

get_gapped_seq(seq_name, recode_gaps=False)#

Return a gapped Sequence object for the specified seqname.

Note: return type may depend on what data was loaded into the SequenceCollection or Alignment.

get_identical_sets(mask_degen=False)#

returns sets of names for sequences that are identical

Parameters:

mask_degen: if True, degenerate characters are ignored

get_lengths(include_ambiguity=False, allow_gap=False)#

returns {name: seq length, …}

Parameters:

include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.

get_motif_probs(alphabet=None, include_ambiguity=False, exclude_unobserved=False, allow_gap=False, pseudocount=0)#

Return a dictionary of motif probs, calculated as the averaged frequency across sequences.

Parameters:

include_ambiguity: if True resolved ambiguous codes are included in estimation of frequencies, default is False.
exclude_unobserved: if True, motifs that are not present in the alignment are excluded from the returned dictionary, default is False.
allow_gap: allow gap motif

Notes

only non-overlapping motifs are counted

get_position_indices(f, native=False, negate=False)#

Returns list of column indices for which f(col) is True.

f (callable): function that returns true/false given an alignment position
native (bool): if True, and self is an ArrayAlignment, f is provided with a slice of the array; otherwise the string is used
negate (bool): if True, not f() is used

get_seq(seqname)#

Return a sequence object for the specified seqname.

get_seq_indices(f, negate=False)#

Returns list of keys of seqs where f(row) is True.

List will be in the same order as self.names, if present.

get_similar(target, min_similarity=0.0, max_similarity=1.0, metric=<cogent3.util.transform.for_seq object>, transform=None)#

Returns new Alignment containing sequences similar to target.

Parameters:

target: sequence object to compare to. Can be in the alignment.
min_similarity: minimum similarity that will be kept. Default 0.0.
max_similarity: maximum similarity that will be kept. Default 1.0. (Note that both min_similarity and max_similarity are inclusive.)
metric: similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence classes have lots of methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.
transform: transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.

WARNING: if the transformation changes the type of the sequence (e.g. extracting a string from an RnaSequence object), distance metrics that depend on instance data of the original class may fail.

get_sub_alignment(seqs=None, pos=None, negate_seqs=False, negate_pos=False)#

Returns subalignment of specified sequences and positions.

seqs and pos can be passed in as lists of sequence indices to keep or positions to keep.

negate_seqs: if True (default False), gets everything _except_ the specified sequences.

negate_pos: if True (default False), gets everything _except_ the specified positions.

Unlike most of the other code that gets things out of an alignment, this method returns a new alignment that does NOT share data with the original alignment.

get_translation(gc=None, incomplete_ok=False, include_stop=False, trim_stop=True, **kwargs)#

translate from nucleic acid to protein

Parameters:

gc: genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)
incomplete_ok: if True, codons that are a mix of nucleotides and gaps are converted to ‘?’. If False, such codons raise a ValueError
include_stop: whether to allow stops in the translated sequence
trim_stop: exclude terminal stop codons if they exist
kwargs: related to construction of the resulting object

Returns:

A new instance of self translated into protein

has_terminal_stop(gc: Any = None, strict: bool = False) → bool#

Returns True if any sequence has a terminal stop codon.

Parameters:

gc: valid input to cogent3.get_code(), a genetic code object, number or name
strict: If True, raises an exception if a seq length is not divisible by 3
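The check can be pictured with a plain-Python sketch restricted to the standard genetic code. `any_terminal_stop` is a hypothetical helper; ignoring an incomplete trailing codon when strict is False, and raising ValueError when it is True, are both assumptions about the behaviour.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def any_terminal_stop(seqs, strict=False):
    """True if any sequence ends with a stop codon of the standard genetic code."""
    for seq in seqs:
        remainder = len(seq) % 3
        if remainder:
            if strict:
                raise ValueError("sequence length is not divisible by 3")
            seq = seq[: len(seq) - remainder]  # assumption: ignore the incomplete trailing codon
        if seq[-3:].upper() in STOP_CODONS:
            return True
    return False


found = any_terminal_stop(["ATGCCC", "ATGTAA"])
not_found = any_terminal_stop(["ATGCCC", "ATGAAA"])
```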

information_plot(width=None, height=None, window=None, stat='median', include_gap=True)#

plot information per position

Parameters:

width (int): figure width in pixels
height (int): figure height in pixels
window (int or None): used for smoothing line, defaults to sqrt(length)
stat (str): ‘mean’ or ‘median’, used as the summary statistic for each window
include_gap: whether to include gap counts, shown on right y-axis

is_array = {'array', 'array_seqs'}#

is_ragged() → bool#

Returns True if alignment has sequences of different lengths.

iter_positions(pos_order=None)#

Iterates over positions in the alignment, in order.

pos_order refers to a list of indices (ints) specifying the column order. This lets you rearrange positions if you want to (e.g. to pull out individual codon positions).

Note that self.iter_positions() always returns new objects, by default lists of elements. Use map(f, self.iter_positions()) to apply the constructor or function f to the resulting lists (f must take a single list as a parameter).

Will raise IndexError if one of the indices in pos_order exceeds the sequence length. This will always happen on ragged alignments: assign to self.seq_len to set all sequences to the same length.
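How pos_order can be used to pull out individual codon positions is sketched here in plain Python. `columns_in_order` is a hypothetical generator over a list of strings, not the library method.

```python
def columns_in_order(seqs, pos_order=None):
    """Yield each column as a new list, optionally in the order given by pos_order."""
    length = len(seqs[0])
    order = range(length) if pos_order is None else pos_order
    for index in order:
        # indexing past the end of any sequence raises IndexError
        yield [seq[index] for seq in seqs]


seqs = ["ATGCCC", "ATGAAA"]
third_codon_positions = list(columns_in_order(seqs, pos_order=range(2, 6, 3)))
as_strings = list(map("".join, columns_in_order(seqs)))
```

The second call shows the map(f, ...) pattern from the note above, with "".join as the function applied to each column list.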

iter_selected(seq_order=None, pos_order=None)#

Iterates over elements in the alignment.

seq_order (names) can be used to select a subset of seqs. pos_order (positions) can be used to select a subset of positions.

Always iterates along a seq first, then down a position (transposes normal order of a[i][j]; possibly, this should change).

WARNING: Alignment.iter_selected() is not the same as alignment.iteritems() (which is the built-in dict iteritems that iterates over key-value pairs).

iter_seqs(seq_order=None)#

Iterates over values (sequences) in the alignment, in order.

seq_order: list of keys giving the order in which seqs will be returned. Defaults to self.names. Note that only these sequences will be returned, and that KeyError will be raised if there are sequences in seq_order that have been deleted from the Alignment. If self.names is None, returns the sequences in the same order as self.named_seqs.values().

Use map(f, self.seqs()) to apply the constructor f to each seq. f must accept a single list as an argument.

Always returns references to the same objects that are values of the alignment.

iupac_consensus(alphabet=None, allow_gap=True)#

Returns string containing IUPAC consensus sequence of the alignment.

majority_consensus()#

Returns list containing most frequent item at each position.

Optional parameter transform gives constructor for type to which result will be converted (useful when consensus should be same type as originals).
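A majority consensus can be sketched in plain Python with collections.Counter. `most_frequent_per_column` is a hypothetical helper, and its tie-breaking (the item seen first in the column wins) is a property of this sketch, not a documented behaviour of the method.

```python
from collections import Counter


def most_frequent_per_column(seqs):
    """Return a list with the most frequent item at each alignment position."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*seqs)]


consensus = most_frequent_per_column(["ACGT", "ACGA", "TCGA"])
consensus_str = "".join(consensus)  # convert the list when a string is wanted
```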

matching_ref(ref_name, gap_fraction, gap_run)#

Returns new alignment with seqs well aligned with a reference.

gap_fraction: fraction of positions that either have a gap in the template but not in the seq or in the seq but not in the template
gap_run: number of consecutive gaps tolerated in query relative to sequence or sequence relative to query

moltype = MolType(('\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07', '\x08', '\t', '\n', '\x0b', '\x0c', '\r', '\x0e', '\x0f', '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17', '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\x7f', '\x80', '\x81', '\x82', '\x83', '\x84', '\x85', '\x86', '\x87', '\x88', '\x89', '\x8a', '\x8b', '\x8c', '\x8d', '\x8e', '\x8f', '\x90', '\x91', '\x92', '\x93', '\x94', '\x95', '\x96', '\x97', '\x98', '\x99', '\x9a', '\x9b', '\x9c', '\x9d', '\x9e', '\x9f', '\xa0', '¡', '¢', '£', '¤', '¥', '¦', '§', '¨', '©', 'ª', '«', '¬', '\xad', '®', '¯', '°', '±', '²', '³', '´', 'µ', '¶', '·', '¸', '¹', 'º', '»', '¼', '½', '¾', '¿', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', '×', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'Þ', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ð', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', '÷', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ'))#

property named_seqs#

no_degenerates(motif_length=1, allow_gap=False)#

returns new alignment without degenerate characters

Parameters:

motif_length: sequences are segmented into units of this size
allow_gap: whether gaps are to be treated as a degenerate character (default, most evolutionary modelling treats gaps as N) or not.

property num_seqs#

Returns the number of sequences in the alignment.

omit_bad_seqs(quantile=None)#

Returns new alignment without sequences with a number of uniquely introduced gaps exceeding quantile

Uses count_gaps_per_seq(unique=True) to obtain the counts of gaps uniquely introduced by a sequence. The cutoff is the quantile of this distribution.

Parameters:

quantile (float or None): sequences whose unique gap count is in a quantile larger than this cutoff are excluded. The default quantile is (num_seqs - 1) / num_seqs

omit_gap_pos(allowed_gap_frac=0.999999, motif_length=1)#

Returns new alignment where all cols (motifs) have <= allowed_gap_frac gaps.

Parameters:

allowed_gap_frac: the proportion of gaps allowed in each column (default is just < 1, i.e. only cols with at least one non-gap character are preserved). Set to 1 - 1e-6 to exclude strictly gapped columns.
motif_length: sets the “column” width, e.g. setting to 3 corresponds to codons. A motif that includes a gap at any position is included in the counting. Default is 1.
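The effect of allowed_gap_frac can be sketched in plain Python for the motif_length=1 case. `drop_gappy_columns` is a hypothetical helper over a list of strings, and it assumes ‘-’ is the only gap character.

```python
def drop_gappy_columns(seqs, allowed_gap_frac=1 - 1e-6, gap="-"):
    """Drop columns whose fraction of gaps exceeds allowed_gap_frac."""
    keep = [
        i
        for i, column in enumerate(zip(*seqs))
        if column.count(gap) / len(column) <= allowed_gap_frac
    ]
    return ["".join(seq[i] for i in keep) for seq in seqs]


example = ["A-C-", "A-CT", "--CT"]
default_filtered = drop_gappy_columns(example)        # only the all-gap column goes
gap_free = drop_gappy_columns(example, allowed_gap_frac=0)  # only gap-free columns stay
```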

omit_gap_runs(allowed_run=1)#

Returns new alignment where all seqs have runs of gaps <=allowed_run.

Note that seqs with exactly allowed_run gaps are not deleted. Default is for allowed_run to be 1 (i.e. no consecutive gaps allowed).

Because the test for whether the current gap run exceeds the maximum allowed gap run is only triggered when there is at least one gap, even negative values for allowed_run will still let sequences with no gaps through.
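The behaviour described above, including the negative allowed_run case, follows from where the comparison sits in the loop. The sketch below is plain Python under that reading: `exceeds_allowed_run` is a hypothetical helper and ‘-’ is assumed to be the gap character.

```python
def exceeds_allowed_run(seq, allowed_run=1, gap="-"):
    """True if seq contains a run of gaps longer than allowed_run."""
    current = 0
    for symbol in seq:
        if symbol == gap:
            current += 1
            if current > allowed_run:  # only evaluated when a gap has been seen
                return True
        else:
            current = 0
    return False


seqs = {"s1": "AC--GT", "s2": "AC-G-T", "s3": "ACGTAC"}
kept_default = [n for n, s in seqs.items() if not exceeds_allowed_run(s)]
kept_negative = [n for n, s in seqs.items() if not exceeds_allowed_run(s, allowed_run=-1)]
```

With a negative allowed_run the gapless sequence still passes, because the comparison is never reached for it.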

omit_gap_seqs(allowed_gap_frac=0)#

Returns new alignment with seqs that have a fraction of gaps <= allowed_gap_frac.

allowed_gap_frac should be a fraction between 0 and 1 inclusive. Default is 0.

pad_seqs(pad_length=None, **kwargs)#

Returns copy in which sequences are padded to same length.

Parameters:

pad_length: Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.
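The pad_length rule can be sketched in plain Python. `pad_to_same_length` is a hypothetical helper, and padding on the right with ‘-’ is an assumption about the pad character and side.

```python
def pad_to_same_length(seqs, pad_length=None, pad_char="-"):
    """Right-pad every sequence to a common length."""
    longest = max(len(s) for s in seqs)
    # a missing or too-small pad_length falls back to the longest sequence
    target = longest if pad_length is None or pad_length < longest else pad_length
    return [s.ljust(target, pad_char) for s in seqs]


padded_default = pad_to_same_length(["AC", "ACGT"])
padded_longer = pad_to_same_length(["AC", "ACGT"], pad_length=6)
padded_too_short = pad_to_same_length(["AC", "ACGT"], pad_length=1)
```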

property positions#

Override superclass positions to return positions as symbols.

probs_per_pos(motif_length=1, include_ambiguity=False, allow_gap=False, warn=False)#

returns MotifFreqsArray per position

probs_per_seq(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, warn=False)#

returns MotifFreqsArray per sequence

Parameters:

motif_length: number of characters per tuple.
include_ambiguity: if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
allow_gap: if True, motifs containing a gap character are included.
exclude_unobserved: if True, unobserved motif combinations are excluded.
warn: warns if motif_length > 1 and alignment trimmed to produce motif columns
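
To make the motif_length and allow_gap parameters concrete, here is a pure-Python sketch of motif probabilities for a single sequence. It assumes non-overlapping motifs read from the start, trailing characters trimmed when the length is not a multiple of motif_length, and "-" as the gap character. It returns a plain dict rather than a MotifFreqsArray, and the function name is hypothetical.

```python
from collections import Counter


def motif_probs_sketch(seq, motif_length=1, allow_gap=False, gap="-"):
    """Relative frequencies of non-overlapping motifs in seq."""
    usable = len(seq) - len(seq) % motif_length  # trim incomplete final motif
    motifs = [seq[i : i + motif_length] for i in range(0, usable, motif_length)]
    if not allow_gap:
        motifs = [m for m in motifs if gap not in m]
    counts = Counter(motifs)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


single = motif_probs_sketch("AACG")
codons_no_gap = motif_probs_sketch("AT-ATGC", motif_length=3)
codons_with_gap = motif_probs_sketch("AT-ATGC", motif_length=3, allow_gap=True)
```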

quick_tree(calc='pdist', bootstrap=None, drop_invalid=False, show_progress=False, ui=None)#

Returns a phylogenetic tree estimated from pairwise distances between sequences.

Parameters:

calc (str): a pairwise distance calculator or name of one. For options see cogent3.evolve.fast_distance.available_distances
show_progress (bool): controls progress display for distance calculation
drop_invalid (bool): If True, sequences for which a pairwise distance could not be calculated are excluded. If False, an ArithmeticError is raised if a distance could not be computed on observed data.
bootstrap (int or None): Number of non-parametric bootstrap replicates. Resamples alignment columns with replacement and builds a phylogeny for each such resampling.

Returns:

a phylogenetic tree. If bootstrap specified, returns the weighted majority consensus. Support for each node is stored as edge.params['params'].

Notes

Sequences in the observed alignment for which distances could not be computed are omitted. Bootstrap replicates are required to have distances for all seqs present in the observed data distance matrix.

rc()#: Returns the reverse complement alignment

rename_seqs(renamer)#

returns new instance with sequences renamed

Parameters:

renamer (callable): function that takes the current sequence name and returns the new one

replace_seqs(seqs, aa_to_codon=True)#

Returns new alignment with same shape but with data taken from seqs.

Parameters:

aa_to_codon: If True (default), aligns codons from a protein alignment or, more generally, substitutes in codons from a set of protein sequences (not necessarily aligned). For this reason, characters are taken from seqs three at a time rather than one at a time (i.e. 3 characters in seqs are put in place of 1 character in self). If False, seqs must be the same lengths.
If seqs is an alignment, any gaps in it will be ignored.
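
The three-for-one substitution used when aa_to_codon is True can be sketched for a single sequence pair. This is an illustration only: the function name is hypothetical, "-" is assumed to be the gap character, and writing "---" for each gap in the aligned protein is an assumption about how the shape is preserved.

```python
def thread_codons_sketch(aligned_protein, dna, gap="-"):
    """Place codons from dna at the non-gap positions of aligned_protein."""
    dna = dna.replace(gap, "")  # gaps in the replacement data are ignored
    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)  # one protein gap becomes a codon-width gap
        else:
            out.append(dna[pos : pos + 3])  # 3 characters replace 1
            pos += 3
    return "".join(out)


codon_aligned = thread_codons_sketch("M-K", "ATGAAA")
```

The result is three times the length of the aligned protein sequence.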

reverse_complement()#: Returns the reverse complement alignment. A synonym for rc.

sample(n=None, with_replacement=False, motif_length=1, randint=<bound method RandomState.randint of RandomState(MT19937)>, permutation=<bound method RandomState.permutation of RandomState(MT19937)>)#

Returns random sample of positions from self, e.g. to bootstrap.

Parameters:

n: the number of positions to sample from the alignment. Default is alignment length
with_replacement: boolean flag for determining if positions are sampled with replacement
randint and permutation: functions for random integer in a specified range, and permutation, respectively.

Notes:

By default (resampling all positions without replacement), generates a permutation of the positions of the alignment.

Setting with_replacement to True and otherwise leaving parameters as defaults generates a standard bootstrap resampling of the alignment.
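
The difference between the default permutation and a bootstrap resample can be sketched with the standard library. The helper below is hypothetical and uses random.Random in place of the randint and permutation arguments; motif_length is fixed at 1.

```python
import random


def sample_positions_sketch(seqs, n=None, with_replacement=False, rng=None):
    """Sample n columns from a dict of aligned strings."""
    rng = rng or random.Random()
    length = len(next(iter(seqs.values())))
    n = length if n is None else n
    if with_replacement:
        # bootstrap: each draw is independent, so columns can repeat
        cols = [rng.randrange(length) for _ in range(n)]
    else:
        # without replacement: with n == length this is a permutation
        cols = rng.sample(range(length), n)
    # the same column indices are applied to every sequence
    return {name: "".join(seq[i] for i in cols) for name, seq in seqs.items()}


aln = {"a": "ACGT", "b": "TGCA"}
permuted = sample_positions_sketch(aln, rng=random.Random(1))
bootstrap = sample_positions_sketch(aln, with_replacement=True, rng=random.Random(2))
```

Because whole columns are sampled, the pairing of characters across sequences is preserved in both cases.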

seqlogo(width=700, height=100, wrap=None, vspace=0.005, colours=None)#

returns Drawable sequence logo using mutual information

Parameters:

width, height (float): plot dimensions in pixels
wrap (int): number of alignment columns per row
vspace (float): vertical separation between rows, as a proportion of total plot
colours (dict): mapping of characters to colours. If not provided, defaults to custom for everything except protein, which uses protein moltype colours.

Notes

Computes MI based on log2 and includes the gap state, so the maximum possible value is -log2(1/num_states)
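
As a quick check of the bound in the note above, -log2(1/num_states) is simply log2(num_states). The snippet assumes num_states counts the gap as a state, e.g. 5 for DNA; the function name is hypothetical.

```python
import math


def max_information_sketch(num_states):
    """Upper bound on per-position information, in bits."""
    return -math.log2(1 / num_states)


dna_with_gap = max_information_sketch(5)       # 4 nucleotides + gap
protein_with_gap = max_information_sketch(21)  # 20 amino acids + gap
```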

property seqs#

set_repr_policy(num_seqs=None, num_pos=None, ref_name=None, wrap=None) → None#

specify policy for repr(self)

Parameters:

num_seqs (int or None): number of sequences to include in represented display.
num_pos (int or None): length of sequences to include in represented display.
ref_name (str or None): name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.
wrap (int or None): number of printed bases per row

sliding_windows(window, step, start=None, end=None)#

Generator yielding new alignments of given length and interval.

Parameters:

window: The length of each returned alignment.
step: The interval between the start of the successive alignment objects returned.
start: first window start position
end: last window start position
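
A sketch of how window, step, start and end interact, written for a single string. Following the wording above, end is treated here as the last (inclusive) window start position and is capped so that every window is full length; both choices are assumptions and the library may handle the boundary differently.

```python
def sliding_windows_sketch(seq, window, step, start=None, end=None):
    """Yield successive slices of length window, step positions apart."""
    last = len(seq) - window  # last index at which a full window still fits
    start = 0 if start is None else start
    end = last if end is None else min(end, last)
    for pos in range(start, end + 1, step):
        yield seq[pos : pos + window]


every_second = list(sliding_windows_sketch("ACGTAC", window=3, step=2))
bounded = list(sliding_windows_sketch("ACGTAC", window=3, step=1, start=1, end=2))
```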

strand_symmetry(motif_length=1)#: returns dict of strand symmetry test results per seq

take_positions(cols, negate=False)#

Returns new Alignment containing only specified positions.

By default, the seqs will be lists, but an alternative constructor can be specified.

Note that take_positions will fail on ragged positions.

take_positions_if(f, negate=False)#: Returns new Alignment containing cols where f(col) is True.
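
The effect of negate in take_positions and take_positions_if can be sketched as below. The helpers are hypothetical and operate on a dict of equal-length strings; here f receives a column as a list of characters, which is an assumption about the column representation.

```python
def take_positions_sketch(seqs, cols, negate=False):
    """Return seqs restricted to cols, or to all other columns if negate."""
    length = len(next(iter(seqs.values())))
    if negate:
        excluded = set(cols)
        cols = [i for i in range(length) if i not in excluded]
    return {name: "".join(seq[i] for i in cols) for name, seq in seqs.items()}


def take_positions_if_sketch(seqs, f, negate=False):
    """Keep columns where f(col) is True (or False when negate is set)."""
    names = list(seqs)
    length = len(seqs[names[0]])
    cols = [
        i
        for i in range(length)
        if bool(f([seqs[n][i] for n in names])) != negate
    ]
    return take_positions_sketch(seqs, cols)


aln = {"a": "ACGT", "b": "AC-T"}
ends = take_positions_sketch(aln, [0, 3])
middle = take_positions_sketch(aln, [0, 3], negate=True)
ungapped = take_positions_if_sketch(aln, lambda col: "-" not in col)
```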

take_seqs(seqs: str | Sequence[str], negate=False, **kwargs)#

Returns new Alignment containing only specified seqs.

Note that the seqs in the new alignment will be references to the same objects as the seqs in the old alignment.

take_seqs_if(f, negate=False, **kwargs)#

Returns new Alignment containing seqs where f(row) is True.

Note that the seqs in the new Alignment are the same objects as the seqs in the old Alignment, not copies.

to_dict() → dict[str, str]#

Returns the alignment as a dict of sequence names -> strings.

Note the mapping goes to strings, not Sequence objects.

Returns:

a dict mapping sequence names to a string representation of their sequences.

to_dna()#: returns copy of self as an alignment of DNA moltype seqs

to_fasta(block_size: int = 60) → str#

Return alignment in Fasta format.

Parameters:

block_size: the sequence length to write to each line, by default 60

Returns:

The Fasta formatted alignment.
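
The role of block_size can be seen in a small stand-alone formatter. This is a sketch rather than the library routine: the function name is hypothetical and the trailing newline is an assumption.

```python
def to_fasta_sketch(seqs, block_size=60):
    """Format a dict of name -> sequence as FASTA text."""
    lines = []
    for name, seq in seqs.items():
        lines.append(f">{name}")
        # write at most block_size characters of sequence per line
        lines.extend(seq[i : i + block_size] for i in range(0, len(seq), block_size))
    return "\n".join(lines) + "\n"


fasta_text = to_fasta_sketch({"a": "ACGTAC", "b": "AC-TAC"}, block_size=4)
```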

to_html(name_order: Sequence[str] | None = None, wrap: int = 60, limit: int | None = None, ref_name: str = 'longest', colors: Mapping[str, str] | None = None, font_size: int = 12, font_family: str = 'Lucida Console') → str#

returns html with embedded styles for sequence colouring

Parameters:

name_order: order of names for display.
wrap: number of alignment columns per row
limit: truncate alignment to this length
ref_name: Name of an existing sequence or ‘longest’. If the latter, the longest sequence (excluding gaps and ambiguities) is selected as the reference.
colors: mapping of characters to colours ({character: colour}); the default appears to come from the moltype.
font_size: in points. Affects labels and sequence and line spacing (proportional to value)
font_family: string denoting font family

Examples

In a jupyter notebook, this code is used to provide the representation.

aln  # is rendered by jupyter

You can directly use the result for display in a notebook as

from IPython.display import HTML

HTML(aln.to_html())

to_json()#: returns json formatted string

to_moltype(moltype)#: returns copy of self with moltype seqs

to_nexus(seq_type, wrap=50)#

Return alignment in NEXUS format and mapping to sequence ids

NOTE: every sequence in the alignment MUST come from a different species! (You can concatenate multiple sequences from the same species together before building a tree.)

seq_type: dna, rna, or protein

Raises exception if invalid alignment

to_phylip()#

Return alignment in PHYLIP format and mapping to sequence ids

raises exception if invalid alignment

to_pretty(name_order=None, wrap=None)#

returns a string representation of the alignment in pretty print format

Parameters:

name_order: order of names for display.
wrap: maximum number of printed bases

to_protein()#: returns copy of self as an alignment of PROTEIN moltype seqs

to_rich_dict()#: returns detailed content including info and moltype attributes

to_rna()#: returns copy of self as an alignment of RNA moltype seqs

to_type(array_align=False, moltype=None, alphabet=None)#

returns alignment of type indicated by array_align

Parameters:

array_align (bool): if True, returns as ArrayAlignment. Otherwise as “standard” Alignment class. Conversion to ArrayAlignment loses annotations.
moltype (MolType instance): overrides self.moltype
alphabet (Alphabet instance): overrides self.alphabet

If array_align would result in no change (class is same as self), returns self.

trim_stop_codons(gc: Any = None, strict: bool = False, **kwargs)#

Removes any terminal stop codons from the sequences

Parameters:

gc: valid input to cogent3.get_code(), a genetic code object, number or name
strict: If True, raises an exception if a seq length is not divisible by 3
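
A sketch of terminal stop codon trimming for a single ungapped DNA string. Several things here are assumptions: the stop codons are those of the standard genetic code, only the final codon is examined, and in non-strict mode a sequence whose length is not a multiple of 3 is returned unchanged. The function name is hypothetical and no genetic code object is used.

```python
STANDARD_STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def trim_stop_codon_sketch(seq, strict=False):
    """Remove a terminal stop codon from seq, if present."""
    if len(seq) % 3:
        if strict:
            raise ValueError("sequence length is not divisible by 3")
        return seq  # assumption: leave incomplete sequences untouched
    if seq[-3:].upper() in STANDARD_STOP_CODONS:
        return seq[:-3]
    return seq


trimmed = trim_stop_codon_sketch("ATGAAATAA")
unchanged = trim_stop_codon_sketch("ATGAAA")
```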

variable_positions(include_gap_motif=True)#

Return a list of variable position indexes.

Parameters:

include_gap_motif: if False, sequences with a gap motif in a column are ignored.
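
The effect of include_gap_motif can be illustrated with a short sketch. It assumes a column is variable when it contains more than one distinct state, with "-" as the gap character; the helper name is hypothetical.

```python
def variable_positions_sketch(seqs, include_gap_motif=True, gap="-"):
    """Indexes of columns containing more than one distinct state."""
    names = list(seqs)
    length = len(seqs[names[0]])
    variable = []
    for i in range(length):
        states = {seqs[n][i] for n in names}
        if not include_gap_motif:
            states.discard(gap)  # gapped sequences do not count for this column
        if len(states) > 1:
            variable.append(i)
    return variable


aln = {"a": "ACGT", "b": "AC-T", "c": "ATGA"}
with_gaps = variable_positions_sketch(aln)
without_gaps = variable_positions_sketch(aln, include_gap_motif=False)
```

Column 2 differs only by a gap, so it is reported only when include_gap_motif is True.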

with_modified_termini()#

Changes the termini to include termini char instead of gapmotif.

Useful to correct the standard gap char output by most alignment programs when aligned sequences have different ends.
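
For a single sequence, the replacement of terminal gaps can be sketched as follows. The termini character "?" and the gap character "-" are assumptions, as is the treatment of an all-gap sequence; the function name is hypothetical.

```python
def modify_termini_sketch(seq, gap="-", terminus="?"):
    """Replace leading and trailing gap runs with the terminus character."""
    if not seq.strip(gap):
        return terminus * len(seq)  # assumption: all-gap sequence is all termini
    lead = len(seq) - len(seq.lstrip(gap))
    trail = len(seq) - len(seq.rstrip(gap))
    # internal gaps are left as they are
    return terminus * lead + seq[lead : len(seq) - trail] + terminus * trail


modified = modify_termini_sketch("--AC-GT-")
```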

write(filename=None, format=None, **kwargs) → None#

Write the alignment to a file, preserving order of sequences.

Parameters:

filename: name of the sequence file
format: format of the sequence file

Notes

If format is None, will attempt to infer format from the filename suffix.