SequenceCollection#
- class SequenceCollection(*args, **kwargs)#
Container for unaligned sequences
- Attributes:
- annotation_db
- named_seqs
num_seqs
Returns the number of sequences in the alignment.
- seqs
Methods
add_feature
(*, seqid, biotype, name, spans)add feature on named sequence
add_seqs
(other[, before_name, after_name])Returns new object of class self with sequences from other added.
annotate_from_gff
(f[, seq_ids])copies annotations from a gff file to a sequence in self
apply_pssm
([pssm, path, background, ...])scores sequences using the specified pssm
copy
()Returns deep copy of self.
copy_annotations
(seq_db)copy annotations into attached annotation db
counts
([motif_length, include_ambiguity, ...])counts of motifs
counts_per_seq
([motif_length, ...])counts of motifs per sequence
deepcopy
([sliced])returns deep copy of self.
degap
(**kwargs)Returns copy in which sequences have no gaps.
distance_matrix
([calc])Estimated pairwise distance between sequences
dotplot
([name1, name2, window, threshold, ...])make a dotplot between specified sequences.
entropy_per_seq
([motif_length, ...])Returns the Shannon entropy per sequence.
get_ambiguous_positions
()Returns dict of seq:{position:char} for ambiguous chars.
get_features
(*[, seqid, biotype, name, ...])yields Feature instances
get_identical_sets
([mask_degen])returns sets of names for sequences that are identical
get_lengths
([include_ambiguity, allow_gap])returns {name: seq length, ...}
get_motif_probs
([alphabet, ...])Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
get_seq
(seqname)Return a sequence object for the specified seqname.
get_seq_indices
(f[, negate])Returns list of keys of seqs where f(row) is True.
get_similar
(target[, min_similarity, ...])Returns new Alignment containing sequences similar to target.
get_translation
([gc, incomplete_ok, ...])translate from nucleic acid to protein
has_terminal_stop
([gc, strict])Returns True if any sequence has a terminal stop codon.
is_ragged
()Returns True if alignment has sequences of different lengths.
iter_selected
([seq_order, pos_order])Iterates over elements in the alignment.
iter_seqs
([seq_order])Iterates over values (sequences) in the alignment, in order.
make_feature
(*, feature)create a feature on named sequence, or on the alignment itself
omit_gap_runs
([allowed_run])Returns new alignment where all seqs have runs of gaps <=allowed_run.
omit_gap_seqs
([allowed_gap_frac])Returns new alignment with seqs that have <= allowed_gap_frac.
pad_seqs
([pad_length])Returns copy in which sequences are padded to same length.
probs_per_seq
([motif_length, ...])return MotifFreqsArray per sequence
rc
()Returns the reverse complement alignment
rename_seqs
(renamer)returns new instance with sequences renamed
reverse_complement
()Returns the reverse complement alignment.
set_repr_policy
([num_seqs, num_pos, ...])specify policy for repr(self)
strand_symmetry
([motif_length])returns dict of strand symmetry test results per seq
take_seqs
(seqs[, negate])Returns new Alignment containing only specified seqs.
take_seqs_if
(f[, negate])Returns new Alignment containing seqs where f(row) is True.
to_dict
()Returns the alignment as a dict of sequence names -> strings.
to_dna
()returns copy of self as an alignment of DNA moltype seqs
to_fasta
([block_size])Return alignment in Fasta format.
to_html
([name_order, wrap, limit, colors, ...])returns html with embedded styles for sequence colouring
to_json
()returns json formatted string
to_moltype
(moltype)returns copy of self with moltype seqs
to_nexus
(seq_type[, wrap])Return alignment in NEXUS format and mapping to sequence ids
to_phylip
()Return alignment in PHYLIP format and mapping to sequence ids
to_protein
()returns copy of self as an alignment of PROTEIN moltype seqs
to_rich_dict
()returns detailed content including info and moltype attributes
to_rna
()returns copy of self as an alignment of RNA moltype seqs
trim_stop_codons
([gc, strict])Removes any terminal stop codons from the sequences
with_modified_termini
()Changes the termini to include termini char instead of gapmotif.
write
([filename, format])Write the alignment to a file, preserving order of sequences.
- add_feature(*, seqid: str, biotype: str, name: str, spans: list[tuple[int, int]], parent_id: str | None = None, strand: str = '+') Feature #
add feature on named sequence
- Parameters:
- seqid
seq name to associate with
- parent_id
name of the parent feature
- biotype
biological type
- name
feature name
- spans
plus strand coordinates
- strand
either ‘+’ or ‘-’
- Returns:
- Feature
- add_seqs(other, before_name=None, after_name=None)#
Returns new object of class self with sequences from other added.
- Parameters:
- other
same class as self or coerceable to that class
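- before_name: str
name of the existing sequence before which the sequences from other are inserted
- after_name: str
name of the existing sequence after which the sequences from other are inserted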
Notes
By default the sequences are appended to the end of the collection. Use either the before_name or after_name argument to change the insertion point.
If both before_name and after_name are specified, the seqs will be inserted using before_name.
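The insertion rules described in the Notes can be sketched on a plain list of names. This is an illustration only, not the cogent3 implementation; the function `insert_names` and its list-based interface are hypothetical.

```python
def insert_names(names, new_names, before_name=None, after_name=None):
    """Return a new list with new_names inserted into names.

    Hypothetical helper: mirrors the documented precedence, where
    before_name wins if both arguments are given, and the default
    is to append at the end.
    """
    if before_name is not None:
        index = names.index(before_name)
    elif after_name is not None:
        index = names.index(after_name) + 1
    else:
        index = len(names)
    return names[:index] + new_names + names[index:]


appended = insert_names(["a", "b", "c"], ["x"])
before_b = insert_names(["a", "b", "c"], ["x"], before_name="b")
after_b = insert_names(["a", "b", "c"], ["x"], after_name="b")
```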
- annotate_from_gff(f: PathLike, seq_ids: list[str] | str | None = None) None #
copies annotations from a gff file to a sequence in self
- Parameters:
- f
path to gff annotation file.
- seq_ids
names of seqs to be annotated. Setting an offset is not supported here; set it directly on a sequence with seq.annotation_offset = offset
- property annotation_db#
- apply_pssm(pssm=None, path=None, background=None, pseudocount=0, names=None, ui=None)#
scores sequences using the specified pssm
- Parameters:
- pssm: profile.PSSM
if not provided, will be loaded from path
- path
path to either a jaspar or cisbp matrix (path must have a suffix matching the format).
- pseudocount
adjustment for zero in matrix
- names
returns only scores for these sequences and in the name order
- Returns:
- numpy array of log2 based scores at every position
- copy()#
Returns deep copy of self.
- copy_annotations(seq_db: SupportsFeatures) None #
copy annotations into attached annotation db
- Parameters:
- seq_db
compatible annotation db
Notes
Only copies annotations for records with seqid in self.names
- counts(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False)#
counts of motifs
- Parameters:
- motif_length
number of elements per character.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- exclude_unobserved
if True, unobserved motif combinations are excluded.
Notes
only non-overlapping motifs are counted
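To make "non-overlapping" concrete, here is a minimal sketch using only the standard library. It is not the cogent3 implementation; the function `count_motifs` is hypothetical, and dropping a trailing incomplete motif is an assumption.

```python
from collections import Counter


def count_motifs(seq, motif_length=1):
    """Count motifs read in consecutive, non-overlapping steps.

    Assumption: trailing characters that do not fill a complete
    motif are dropped.
    """
    usable = len(seq) - len(seq) % motif_length
    return Counter(
        seq[i : i + motif_length] for i in range(0, usable, motif_length)
    )


dinuc_counts = count_motifs("ACGTAC", motif_length=2)
```

With motif_length=2 the sequence is read as AC, GT, AC, so the overlapping dinucleotides CG and TA are never counted.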
- counts_per_seq(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, warn=False)#
counts of motifs per sequence
- Parameters:
- motif_length
number of characters per tuple.
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- warn
warns if motif_length > 1 and alignment trimmed to produce motif columns
- Returns:
- MotifCountsArray
Notes
only non-overlapping motifs are counted
- deepcopy(sliced: bool = True)#
returns deep copy of self.
- Parameters:
- sliced
if True, reduces the sequence to the current interval. This also drops annotations.
- degap(**kwargs)#
Returns copy in which sequences have no gaps.
- Parameters:
- kwargs
passed to class constructor
- distance_matrix(calc='pdist')#
Estimated pairwise distance between sequences
- Parameters:
- calc: str
The distance calculation method to use, either “pdist” or “jc69”. “pdist” is an approximation of the proportion of sites that differ. “jc69” is an approximation of the Jukes Cantor distance.
- Returns:
- DistanceMatrix
Estimated pairwise distances between sequences in the collection
Notes
pdist approximates the proportion of sites that differ from the Jaccard distance. Coefficients for the approximation were derived from a polynomial fit between the Jaccard distance of k-mers with k=10 and the proportion of sites different, using DNA sequence alignments of 106 mammalian protein coding genes.
jc69 approximates the Jukes Cantor distance using the approximated proportion sites different, i.e., a transformation of the above.
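The transformation mentioned for jc69 can be illustrated with the standard Jukes-Cantor formula, d = -3/4 ln(1 - 4p/3), where p is the proportion of sites that differ. The sketch below is not taken from cogent3; the function name `jc69_from_pdist` is hypothetical, and whether the library applies exactly this form to its approximated p is an assumption.

```python
import math


def jc69_from_pdist(p):
    """Jukes-Cantor (1969) distance from a proportion of differing sites.

    Hypothetical helper using the textbook formula; the distance is
    undefined once p reaches 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


distance = jc69_from_pdist(0.3)
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.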
- dotplot(name1: str | None = None, name2: str | None = None, window: int = 20, threshold: int | None = None, k: int | None = None, min_gap: int = 0, width: int = 500, title: str | None = None, rc: bool = False, show_progress: bool = False)#
make a dotplot between specified sequences. Random sequences chosen if names not provided.
- Parameters:
- name1, name2
names of sequences. If not provided, a random choice is made
- window
segment size for comparison between sequences
- threshold
windows where the number of identical positions is >= threshold are a match
- k
size of k-mer to break sequences into. Larger values increase speed but reduce resolution. If not specified, and window == threshold, then k is set to window. Otherwise, it is computed as the maximum of {threshold // (window - threshold), 5}.
- min_gap
permitted gap for joining adjacent line segments, default is no gap joining
- width
figure width. Figure height is computed based on the ratio of len(seq1) / len(seq2)
- title
title for the plot
- rc
include dotplot of reverse complement also. Only applies to nucleic acid moltypes
- Returns:
- a Drawable or AnnotatedDrawable
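The default rule for k given under Parameters can be written out as a small function. This is a sketch of the documented rule only; the name `default_k` is hypothetical, and it assumes threshold has already been given a value no larger than window.

```python
def default_k(window, threshold):
    """Apply the documented default for k.

    Assumption: threshold is an int with threshold <= window.
    """
    if window == threshold:
        # exact matches over the whole window: one k-mer per window
        return window
    return max(threshold // (window - threshold), 5)


k_exact = default_k(20, 20)
k_loose = default_k(20, 18)
k_floor = default_k(20, 10)
```

With window=20 and threshold=18 the rule gives 18 // 2 = 9, while threshold=10 gives 10 // 10 = 1, which is raised to the minimum of 5.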
- entropy_per_seq(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=True, warn=False)#
Returns the Shannon entropy per sequence.
- Parameters:
- motif_length: int
number of characters per tuple.
- include_ambiguity: bool
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap: bool
if True, motifs containing a gap character are included.
- exclude_unobserved: bool
if True, unobserved motif combinations are excluded.
- warn
warns if motif_length > 1 and alignment trimmed to produce motif columns
Notes
For motif_length > 1, it’s advisable to specify exclude_unobserved=True, as this avoids unnecessary calculations.
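As an illustration of the quantity being reported, the following sketch computes Shannon entropy for a single sequence from non-overlapping motif frequencies. It is not the cogent3 implementation; the function `shannon_entropy` is hypothetical, and the use of log base 2 is an assumption.

```python
import math
from collections import Counter


def shannon_entropy(seq, motif_length=1):
    """Shannon entropy (assumed base 2) of non-overlapping motifs in seq.

    Only observed motifs are tallied. Unobserved motifs have frequency
    zero and contribute nothing to the sum, which is why excluding
    them saves work without changing the result.
    """
    usable = len(seq) - len(seq) % motif_length
    counts = Counter(
        seq[i : i + motif_length] for i in range(0, usable, motif_length)
    )
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = shannon_entropy("ACGT")
constant = shannon_entropy("AAAA")
```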
- get_ambiguous_positions()#
Returns dict of seq:{position:char} for ambiguous chars.
Used in likelihood calculations.
- get_features(*, seqid: str | Iterable[str] | None = None, biotype: str | None = None, name: str | None = None, allow_partial: bool = False) Iterator[Feature] #
yields Feature instances
- Parameters:
- seqid
limit search to features on this named sequence, defaults to search all
- biotype
biotype of the feature, e.g. CDS, gene
- name
name of the feature
- allow_partial
allow features partially overlapping self
Notes
When dealing with a nucleic acid moltype, the returned features will yield a sequence segment that is consistently oriented irrespective of strand of the current instance.
- get_identical_sets(mask_degen=False)#
returns sets of names for sequences that are identical
- Parameters:
- mask_degen
if True, degenerate characters are ignored
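The grouping idea can be sketched on a plain dict of name to sequence string. This is not the cogent3 implementation; the function `identical_name_sets` is hypothetical, it ignores mask_degen, and reporting only groups with more than one member is an assumption.

```python
from collections import defaultdict


def identical_name_sets(seqs):
    """Group names whose sequence strings are exactly equal.

    Assumption: sequences with no identical partner are not reported.
    """
    groups = defaultdict(set)
    for name, seq in seqs.items():
        groups[seq].add(name)
    return [names for names in groups.values() if len(names) > 1]


identical = identical_name_sets({"a": "ACGT", "b": "ACGA", "c": "ACGT"})
```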
- get_lengths(include_ambiguity=False, allow_gap=False)#
returns {name: seq length, …}
- Parameters:
- include_ambiguity
if True, motifs containing ambiguous characters from the seq moltype are included. No expansion of those is attempted.
- allow_gap
if True, motifs containing a gap character are included.
- get_motif_probs(alphabet=None, include_ambiguity=False, exclude_unobserved=False, allow_gap=False, pseudocount=0)#
Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
- Parameters:
- include_ambiguity
if True resolved ambiguous codes are included in estimation of frequencies, default is False.
- exclude_unobserved
if True, motifs that are not present in the alignment are excluded from the returned dictionary, default is False.
- allow_gap
allow gap motif
Notes
only non-overlapping motifs are counted
- get_seq(seqname)#
Return a sequence object for the specified seqname.
- get_seq_indices(f, negate=False)#
Returns list of keys of seqs where f(row) is True.
List will be in the same order as self.names, if present.
- get_similar(target, min_similarity=0.0, max_similarity=1.0, metric=<cogent3.util.transform.for_seq object>, transform=None)#
Returns new Alignment containing sequences similar to target.
- Parameters:
- target
sequence object to compare to. Can be in the alignment.
- min_similarity
minimum similarity that will be kept. Default 0.0.
- max_similarity
maximum similarity that will be kept. Default 1.0. (Note that both min_similarity and max_similarity are inclusive.)
- metric
similarity function to use. Must be f(first_seq, second_seq). The default metric is fraction similarity, ranging from 0.0 (0% identical) to 1.0 (100% identical). The Sequence classes have lots of methods that can be passed in as unbound methods to act as the metric, e.g. frac_same_gaps.
- transform
transformation function to use on the sequences before the metric is calculated. If None, uses the whole sequences in each case. A frequent transformation is a function that returns a specified range of a sequence, e.g. eliminating the ends. Note that the transform applies to both the real sequence and the target sequence.
WARNING: if the transformation changes the type of the sequence (e.g. extracting a string from an RnaSequence object), distance metrics that depend on instance data of the original class may fail.
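How min_similarity, max_similarity, metric and transform interact can be sketched with plain strings. This is an illustration, not the cogent3 implementation; the functions `frac_same` and `select_similar` are hypothetical stand-ins for the default metric and the method.

```python
def frac_same(first, second):
    """Fraction of compared positions that are identical (hypothetical metric)."""
    compared = min(len(first), len(second))
    if compared == 0:
        return 0.0
    return sum(x == y for x, y in zip(first, second)) / compared


def select_similar(seqs, target, min_similarity=0.0, max_similarity=1.0,
                   metric=frac_same, transform=None):
    """Keep sequences whose similarity to target lies in the inclusive range."""
    if transform is not None:
        # the transform applies to the target as well as to each sequence
        target = transform(target)
    kept = {}
    for name, seq in seqs.items():
        candidate = transform(seq) if transform is not None else seq
        if min_similarity <= metric(candidate, target) <= max_similarity:
            kept[name] = seq  # the untransformed sequence is what is returned
    return kept


seqs = {"a": "ACGT", "b": "ACGA", "c": "TTTT"}
close = select_similar(seqs, "ACGT", min_similarity=0.75)
prefix_match = select_similar(seqs, "ACGT", min_similarity=1.0,
                              transform=lambda s: s[:2])
```

Sequence "b" matches the target at 3 of 4 positions, so it sits exactly on the min_similarity=0.75 boundary and is kept because the bounds are inclusive.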
- get_translation(gc=None, incomplete_ok=False, include_stop=False, trim_stop=True, **kwargs)#
translate from nucleic acid to protein
- Parameters:
- gc
genetic code, either the number or name (use cogent3.core.genetic_code.available_codes)
- incomplete_ok
if True, codons that are mixes of nucleotides and gaps are converted to ‘?’. Raises a ValueError if False
- include_stop
whether to allow stops in the translated sequence
- trim_stop
exclude terminal stop codons if they exist
- kwargs
related to construction of the resulting object
- Returns:
- A new instance of self translated into protein
- has_terminal_stop(gc: Any = None, strict: bool = False) bool #
Returns True if any sequence has a terminal stop codon.
- Parameters:
- gc
valid input to cogent3.get_code(), a genetic code object, number or name
- strict
If True, raises an exception if a seq length is not divisible by 3
- is_array = {'array', 'array_seqs'}#
- is_ragged() bool #
Returns True if alignment has sequences of different lengths.
- iter_selected(seq_order=None, pos_order=None)#
Iterates over elements in the alignment.
seq_order (names) can be used to select a subset of seqs. pos_order (positions) can be used to select a subset of positions.
Always iterates along a seq first, then down a position (transposes normal order of a[i][j]; possibly, this should change).
WARNING: Alignment.iter_selected() is not the same as alignment.iteritems() (which is the built-in dict iteritems that iterates over key-value pairs).
- iter_seqs(seq_order=None)#
Iterates over values (sequences) in the alignment, in order.
seq_order: list of keys giving the order in which seqs will be returned. Defaults to self.Names. Note that only these sequences will be returned, and that KeyError will be raised if there are sequences in order that have been deleted from the Alignment. If self.Names is None, returns the sequences in the same order as self.named_seqs.values().
Use map(f, self.seqs()) to apply the constructor f to each seq. f must accept a single list as an argument.
Always returns references to the same objects that are values of the alignment.
- make_feature(*, feature: FeatureDataType) Feature #
create a feature on named sequence, or on the alignment itself
- Parameters:
- feature
a dict with all the necessary data to construct a feature
- Returns:
- Feature
Notes
To get a feature AND add it to annotation_db, use add_feature().
- moltype = MolType(('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'))#
- property named_seqs#
- property num_seqs#
Returns the number of sequences in the alignment.
- omit_gap_runs(allowed_run=1)#
Returns new alignment where all seqs have runs of gaps <=allowed_run.
Note that seqs with exactly allowed_run gaps are not deleted. Default is for allowed_run to be 1 (i.e. no consecutive gaps allowed).
Because the test for whether the current gap run exceeds the maximum allowed gap run is only triggered when there is at least one gap, even negative values for allowed_run will still let sequences with no gaps through.
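The behaviour described above, including the edge case for negative allowed_run, can be sketched with plain strings. This is not the cogent3 implementation; the functions `within_allowed_run` and `filter_gap_runs` and the "-" gap character are assumptions.

```python
def within_allowed_run(seq, allowed_run=1, gap="-"):
    """Return True if no run of gap characters exceeds allowed_run.

    The comparison only happens when a gap is seen, so a gap-free
    sequence passes even for negative allowed_run.
    """
    run = 0
    for char in seq:
        if char == gap:
            run += 1
            if run > allowed_run:
                return False
        else:
            run = 0
    return True


def filter_gap_runs(seqs, allowed_run=1):
    """Keep only the sequences that pass within_allowed_run."""
    return {n: s for n, s in seqs.items() if within_allowed_run(s, allowed_run)}


data = {"a": "AC-GT", "b": "A--GT", "c": "ACGTA"}
default_kept = filter_gap_runs(data)
negative_kept = filter_gap_runs(data, allowed_run=-1)
```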
- omit_gap_seqs(allowed_gap_frac=0)#
Returns new alignment with seqs that have <= allowed_gap_frac.
allowed_gap_frac should be a fraction between 0 and 1 inclusive. Default is 0.
- pad_seqs(pad_length=None, **kwargs)#
Returns copy in which sequences are padded to same length.
- Parameters:
- pad_length
Length all sequences are to be padded to. Will pad to max sequence length if pad_length is None or less than max length.
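The pad_length rule can be sketched on a dict of name to sequence string. This is an illustration only; the function `pad_to_same_length`, the "-" pad character and padding on the right are assumptions, not details confirmed by the text above.

```python
def pad_to_same_length(seqs, pad_length=None, pad_char="-"):
    """Right-pad every sequence to a common length.

    As documented, pad_length is ignored when it is None or shorter
    than the longest sequence.
    """
    max_length = max(len(s) for s in seqs.values())
    if pad_length is None or pad_length < max_length:
        pad_length = max_length
    return {name: s.ljust(pad_length, pad_char) for name, s in seqs.items()}


ragged = {"a": "ACGT", "b": "AC"}
padded = pad_to_same_length(ragged)
padded_longer = pad_to_same_length(ragged, pad_length=6)
```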
- probs_per_seq(motif_length=1, include_ambiguity=False, allow_gap=False, exclude_unobserved=False, warn=False)#
return MotifFreqsArray per sequence
- rc()#
Returns the reverse complement alignment
- rename_seqs(renamer)#
returns new instance with sequences renamed
- Parameters:
- renamer: callable
function that takes the current sequence name and returns the new one
- reverse_complement()#
Returns the reverse complement alignment. A synonym for rc.
- property seqs#
- set_repr_policy(num_seqs=None, num_pos=None, ref_name=None, wrap=None) None #
specify policy for repr(self)
- Parameters:
- num_seqs: int or None
number of sequences to include in represented display.
- num_pos: int or None
length of sequences to include in represented display.
- ref_name: str or None
name of sequence to be placed first, or “longest” (default). If latter, indicates longest sequence will be chosen.
- wrap: int or None
number of printed bases per row
- strand_symmetry(motif_length=1)#
returns dict of strand symmetry test results per seq
- take_seqs(seqs: str | Sequence[str], negate=False, **kwargs)#
Returns new Alignment containing only specified seqs.
Note that the seqs in the new alignment will be references to the same objects as the seqs in the old alignment.
- take_seqs_if(f, negate=False, **kwargs)#
Returns new Alignment containing seqs where f(row) is True.
Note that the seqs in the new Alignment are the same objects as the seqs in the old Alignment, not copies.
- to_dict() dict[str, str] #
Returns the alignment as a dict of sequence names -> strings.
Note the mapping goes to strings, not Sequence objects.
- Returns:
- a dict mapping sequence names to a string representation of their sequences.
- to_dna()#
returns copy of self as an alignment of DNA moltype seqs
- to_fasta(block_size: int = 60) str #
Return alignment in Fasta format.
- Parameters:
- block_size
the sequence length to write to each line, by default 60
- Returns:
- The Fasta formatted alignment.
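The effect of block_size can be sketched as follows. This is not the cogent3 writer; the function `format_fasta` is hypothetical, and the trailing newline is an assumption.

```python
def format_fasta(seqs, block_size=60):
    """Format name -> sequence pairs as FASTA, wrapping at block_size."""
    lines = []
    for name, seq in seqs.items():
        lines.append(f">{name}")
        # split the sequence into consecutive chunks of block_size characters
        lines.extend(
            seq[i : i + block_size] for i in range(0, len(seq), block_size)
        )
    return "\n".join(lines) + "\n"


fasta_text = format_fasta({"s1": "ACGTAC", "s2": "GG"}, block_size=4)
```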
- to_html(name_order: Sequence[str] | None = None, wrap: int = 60, limit: int | None = None, colors: Mapping[str, str] | None = None, font_size: int = 12, font_family: str = 'Lucida Console') str #
returns html with embedded styles for sequence colouring
- Parameters:
- name_order
order of names for display.
- wrap
number of alignment columns per row
- limit
truncate alignment to this length
- colors
{character moltype.
- font_size
in points. Affects labels and sequence and line spacing (proportional to value)
- font_family
string denoting font family
Examples
In a jupyter notebook, this code is used to provide the representation.
seq_col # is rendered by jupyter
You can directly use the result for display in a notebook as
from IPython.core.display import HTML
HTML(seq_col.to_html())
- to_json()#
returns json formatted string
- to_moltype(moltype)#
returns copy of self with moltype seqs
- to_nexus(seq_type, wrap=50)#
Return alignment in NEXUS format and mapping to sequence ids
Note that every sequence in the alignment MUST come from a different species! (You can concatenate multiple sequences from the same species together before building a tree.)
seq_type: dna, rna, or protein
Raises exception if invalid alignment
- to_phylip()#
Return alignment in PHYLIP format and mapping to sequence ids
raises exception if invalid alignment
- to_protein()#
returns copy of self as an alignment of PROTEIN moltype seqs
- to_rich_dict()#
returns detailed content including info and moltype attributes
- to_rna()#
returns copy of self as an alignment of RNA moltype seqs
- trim_stop_codons(gc: Any = None, strict: bool = False, **kwargs)#
Removes any terminal stop codons from the sequences
- Parameters:
- gc
valid input to cogent3.get_code(), a genetic code object, number or name
- strict
If True, raises an exception if a seq length is not divisible by 3
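A minimal sketch of terminal stop trimming for a single DNA string is given below. It is not the cogent3 implementation: the function `trim_terminal_stop` is hypothetical, it hard-codes the stop codons of the standard genetic code rather than taking a gc argument, and returning the sequence unchanged when its length is not divisible by 3 and strict is False is an assumption.

```python
STANDARD_STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA


def trim_terminal_stop(seq, strict=False):
    """Remove one terminal stop codon from seq if present."""
    if len(seq) % 3:
        if strict:
            raise ValueError("sequence length is not divisible by 3")
        # assumption: an incomplete final codon means nothing is trimmed
        return seq
    if seq[-3:].upper() in STANDARD_STOP_CODONS:
        return seq[:-3]
    return seq


trimmed = trim_terminal_stop("ATGAAATAA")
untouched = trim_terminal_stop("ATGAAA")
```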
- with_modified_termini()#
Changes the termini to include termini char instead of gapmotif.
Useful to correct the standard gap char output by most alignment programs when aligned sequences have different ends.
- write(filename=None, format=None, **kwargs) None #
Write the alignment to a file, preserving order of sequences.
- Parameters:
- filename
name of the sequence file
- format
format of the sequence file
Notes
If format is None, will attempt to infer format from the filename suffix.