AlignedSeqsData

class AlignedSeqsData(*, gapped_seqs: ndarray, names: tuple[str], alphabet: CharAlphabet, ungapped_seqs: dict[str, ndarray] | None = None, gaps: dict[str, ndarray] | None = None, offset: dict[str, int] | None = None, align_len: int | None = None, check: bool = True, reversed_seqs: set[str] | None = None)

The built-in cogent3 implementation of aligned sequence storage underlying an Alignment. Indexing this object returns an AlignedDataView, which can realise the corresponding slice as a string, bytes, or numpy array, gapped or ungapped.

Attributes:
align_len

Return the length of the alignment.

alphabet

the character alphabet for validating, encoding, decoding sequences

names

returns the names of the sequences in the storage

offset

returns the offset of each sequence in the Alignment

reversed_seqs

names of sequences that are reverse complemented

Methods

add_seqs(seqs[, force_unique_keys, offset])

Returns a new AlignedSeqsData object with added sequences.

copy(**kwargs)

shallow copy of self

from_names_and_array(*, names, data, alphabet)

Construct an AlignedSeqsData object from a list of names and a numpy array of aligned sequence data.

from_seqs(*, data, alphabet, **kwargs)

Construct an AlignedSeqsData object from a dict of aligned sequences

from_seqs_and_gaps(*, seqs, gaps, alphabet, ...)

Construct an AlignedSeqsData object from a dict of ungapped sequences and a corresponding dict of gap data.

get_gapped_seq_array(*, seqid[, start, ...])

Return sequence data corresponding to seqid as an array of indices.

get_gapped_seq_bytes(*, seqid[, start, ...])

Return sequence corresponding to seqid as a bytes string.

get_gapped_seq_str(*, seqid[, start, stop, step])

Return sequence corresponding to seqid as a string.

get_gaps(seqid)

returns the gap data for seqid

get_positions(names[, start, stop, step])

returns an array of the selected positions for names.

get_seq_array(*, seqid[, start, stop, step])

Return ungapped sequence corresponding to seqid as an array of indices.

get_seq_bytes(*, seqid[, start, stop, step])

Return ungapped sequence corresponding to seqid as a bytes string.

get_seq_length(seqid)

return length of the unaligned seq for seqid

get_seq_str(*, seqid[, start, stop, step])

Return ungapped sequence corresponding to seqid as a string.

get_ungapped(name_map[, start, stop, step])

Returns a dictionary of sequence data with no gaps or missing characters and a dictionary with information to construct a new SequenceCollection via make_unaligned_seqs.

get_view()

returns view of aligned sequence data for seqid

to_alphabet(alphabet[, check_valid])

Returns a new AlignedSeqsData object with the same underlying data with a new alphabet.

Notes

Methods on this object only accept plus strand start, stop and step indices for selecting segments of data. It can return the gap coordinates for a sequence as used by IndelMap.

add_seqs(seqs: dict[str, str | ndarray[int]], force_unique_keys: bool = True, offset: dict[str, int] | None = None) -> AlignedSeqsData

Returns a new AlignedSeqsData object with added sequences.

Parameters:
seqs

dict of sequences to add {name: seq, …}

force_unique_keys

if True, raises ValueError if any sequence names already exist in the collection

offset

dict of offsets for the new sequences.

property align_len: int

Return the length of the alignment.

property alphabet: CharAlphabet

the character alphabet for validating, encoding, decoding sequences

copy(**kwargs) -> Self

shallow copy of self

Notes

kwargs are passed to the constructor and will override existing values

classmethod from_names_and_array(*, names: Sequence[str], data: ndarray, alphabet: AlphabetABC) -> Self

Construct an AlignedSeqsData object from a list of names and a numpy array of aligned sequence data.

Parameters:
names

list of sequence names

data

numpy array of aligned sequence data

alphabet

alphabet object for the sequences
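As a rough illustration of the kind of input the names and data arguments describe, the sketch below encodes gapped strings as rows of alphabet indices. It is not cogent3 code: `encode_alignment` and the character ordering `"TCAG-"` are hypothetical, plain lists stand in for a numpy array, and the assumption is that data holds one row per name with one column per alignment position.

```python
def encode_alignment(seqs, chars):
    """Encode equal-length gapped strings as rows of indices into chars.

    Hypothetical helper for illustration only; cogent3 performs the
    encoding through its own alphabet objects.
    """
    index = {c: i for i, c in enumerate(chars)}
    names = tuple(seqs)
    # aligned sequences must all share the same length
    if len({len(s) for s in seqs.values()}) > 1:
        raise ValueError("aligned sequences must all be the same length")
    rows = [[index[c] for c in seqs[name]] for name in names]
    return names, rows


# assumed character order: T=0, C=1, A=2, G=3, gap=4
names, rows = encode_alignment({"s1": "AC-GT", "s2": "A-TGT"}, "TCAG-")
print(names)  # ('s1', 's2')
print(rows)   # [[2, 1, 4, 3, 0], [2, 4, 0, 3, 0]]
```

Row order follows the order of names, so the two values can be kept in step when passed on together.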

classmethod from_seqs(*, data: dict[str, str | ndarray[int]], alphabet: AlphabetABC, **kwargs) -> Self

Construct an AlignedSeqsData object from a dict of aligned sequences

Parameters:
data

dict of gapped sequences {name: seq, …}. sequences must all be the same length

alphabet

alphabet object for the sequences

classmethod from_seqs_and_gaps(*, seqs: dict[str, str | bytes | ndarray[int]], gaps: dict[str, ndarray], alphabet: AlphabetABC, **kwargs) -> Self

Construct an AlignedSeqsData object from a dict of ungapped sequences and a corresponding dict of gap data.

Parameters:
seqs

dict of ungapped sequences {name: seq, …}

gaps

gap data {name: [[seq gap position, cumulative gap length], …], …}

alphabet

alphabet object for the sequences
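The gap data layout can be easier to follow with a worked case. The sketch below splits a gapped string into an ungapped sequence plus [seq gap position, cumulative gap length] pairs. It reads "seq gap position" as the index in the ungapped sequence where a run of gaps is inserted, and "cumulative gap length" as the total number of gap characters up to and including that run. `split_gaps` is a hypothetical helper, not part of cogent3, and it treats only `-` as a gap.

```python
def split_gaps(gapped, gap_char="-"):
    """Return (ungapped string, [[seq gap position, cumulative gap length], ...]).

    Hypothetical helper for illustration only.
    """
    ungapped = []
    gaps = []
    cum = 0          # running total of gap characters seen so far
    in_gap = False   # True while inside a run of gap characters
    for ch in gapped:
        if ch == gap_char:
            cum += 1
            in_gap = True
        else:
            if in_gap:
                # the run ended just before this ungapped position
                gaps.append([len(ungapped), cum])
                in_gap = False
            ungapped.append(ch)
    if in_gap:
        # trailing gap run sits after the last ungapped character
        gaps.append([len(ungapped), cum])
    return "".join(ungapped), gaps


ungapped, gaps = split_gaps("A--CG-T")
print(ungapped)  # ACGT
print(gaps)      # [[1, 2], [3, 3]]
```

In this reading, the second column never decreases, and the aligned length equals the ungapped length plus the last cumulative value (4 + 3 = 7 here).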

get_gapped_seq_array(*, seqid: str, start: int | None = None, stop: int | None = None, step: int | None = None) -> ndarray

Return sequence data corresponding to seqid as an array of indices. start/stop are in alignment coordinates. Includes gaps.

get_gapped_seq_bytes(*, seqid: str, start: int | None = None, stop: int | None = None, step: int | None = None) -> bytes

Return sequence corresponding to seqid as a bytes string. start/stop are in alignment coordinates. Includes gaps.

get_gapped_seq_str(*, seqid: str, start: int | None = None, stop: int | None = None, step: int | None = None) -> str

Return sequence corresponding to seqid as a string. start/stop are in alignment coordinates. Includes gaps.

get_gaps(seqid: str) -> ndarray

returns the gap data for seqid

get_positions(names: Sequence[str], start: int | None = None, stop: int | None = None, step: int | None = None) -> ndarray

returns an array of the selected positions for names.

get_seq_array(*, seqid: str, start: int | None = None, stop: int | None = None, step: int | None = None) -> ndarray

Return ungapped sequence corresponding to seqid as an array of indices.

Notes

Assumes start/stop are in sequence coordinates. If seqid is in reversed_seqs, that sequence will be in plus strand orientation. It is client code's responsibility to ensure the coordinates are consistent with that.
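One way client code might keep coordinates consistent is to convert a minus strand interval to its plus strand equivalent before calling these methods. The sketch below shows the standard arithmetic for half-open intervals. `to_plus_strand` and `reverse_complement` are hypothetical helpers written for this illustration, not cogent3 functions.

```python
def to_plus_strand(start, stop, seq_len):
    """Convert a half-open minus strand interval to plus strand coordinates."""
    return seq_len - stop, seq_len - start


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


plus = "AACCGGTTAC"
minus = reverse_complement(plus)

# positions 2..4 on the minus strand
plus_start, plus_stop = to_plus_strand(2, 5, len(plus))
print(plus_start, plus_stop)  # 5 8

# the same bases are selected on either strand
assert minus[2:5] == reverse_complement(plus[plus_start:plus_stop])
```

The final assertion checks that the converted interval picks out the same bases, which is the consistency the note above asks the caller to maintain.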

get_seq_bytes(*, seqid: str, start: int | None = None, stop: int | None = None, step: int | None = None) -> bytes

Return ungapped sequence corresponding to seqid as a bytes string. start/stop are in sequence coordinates. Excludes gaps.

get_seq_length(seqid: str) -> int

return length of the unaligned seq for seqid

get_seq_str(*, seqid: str, start: int | None = None, stop: int | None = None, step: int | None = None) -> str

Return ungapped sequence corresponding to seqid as a string. start/stop are in sequence coordinates. Excludes gaps.
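The gapped getters take alignment coordinates while the ungapped getters take sequence coordinates, so the same start/stop values select different characters. The sketch below makes that concrete with plain strings. `rebuild_gapped` is a hypothetical helper, not cogent3 code, and it assumes gap data in the [seq gap position, cumulative gap length] form described for from_seqs_and_gaps.

```python
def rebuild_gapped(ungapped, gaps, gap_char="-"):
    """Reinsert gap runs into an ungapped string.

    Hypothetical helper for illustration only.
    """
    parts = []
    prev_pos = 0
    prev_cum = 0
    for pos, cum in gaps:
        parts.append(ungapped[prev_pos:pos])       # residues before this run
        parts.append(gap_char * (cum - prev_cum))  # length of this run alone
        prev_pos, prev_cum = pos, cum
    parts.append(ungapped[prev_pos:])              # residues after the last run
    return "".join(parts)


ungapped = "ACGT"
gapped = rebuild_gapped(ungapped, [[1, 2], [3, 3]])
print(gapped)         # A--CG-T

# identical start/stop, different coordinate systems
print(gapped[1:5])    # --CG  (alignment coordinates, gaps included)
print(ungapped[1:5])  # CGT   (sequence coordinates, gaps excluded)
```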

get_ungapped(name_map: dict[str, str], start: int | None = None, stop: int | None = None, step: int | None = None) -> tuple[dict, dict]

Returns a dictionary of sequence data with no gaps or missing characters and a dictionary with information to construct a new SequenceCollection via make_unaligned_seqs.

Parameters:
name_map

A dict of {aln_name: data_name, …} indicating the mapping between names in the encompassing Alignment (aln_name) and the names in self (data_name).

start

The alignment starting position.

stop

The alignment stopping position.

step

The step size.

Returns:
tuple

A tuple containing the following:

  • seqs (dict): A dictionary of {name: seq, …} where the sequences have no gaps or missing characters.

  • kwargs (dict): A dictionary of keyword arguments for make_unaligned_seqs, e.g., {"offset": self.offset, "name_map": name_map}.

get_view(seqid: str, slice_record: SliceRecord | None = None) -> AlignedDataView
get_view(seqid: int)

returns view of aligned sequence data for seqid

Parameters:
seqid

sequence name

slice_record

slice record to use for slicing the data. If None, uses the default slice record for the entire sequence.

property names: tuple[str, ...]

returns the names of the sequences in the storage

property offset: dict[str, int]

returns the offset of each sequence in the Alignment

property reversed_seqs: frozenset[str]

names of sequences that are reverse complemented

to_alphabet(alphabet: AlphabetABC, check_valid: bool = True) -> Self

Returns a new AlignedSeqsData object with the same underlying data with a new alphabet.