cogent3.core.alignment.SequenceCollection
- class SequenceCollection(*, seqs_data: SeqsDataABC, moltype: c3_moltype.MolType[Any], info: dict[str, Any] | InfoClass | None = None, source: PathType | None = None, annotation_db: AnnotationDbABC | list[AnnotationDbABC] | None = None, name_map: Mapping[str, str] | None = None, is_reversed: bool = False)
A container of unaligned sequences.
- Attributes:
annotation_db: the annotation database for the collection
modified: collection is a modification of underlying storage
name_map: returns mapping of seq names to parent seq names
names: returns the names of the sequences in the collection
num_seqs: the number of sequences in the collection
seqs: iterable of sequences in the collection
storage: the unaligned sequence storage instance of the collection
Methods
add_feature(*[, seqid, parent_id, strand]): add feature on named sequence
add_seqs(seqs, **kwargs): Returns new collection with additional sequences.
apply_pssm([pssm, path, background, ...]): scores sequences using the specified pssm
copy_annotations(seq_db): copy annotations into attached annotation db
count_ambiguous_per_seq(): Counts of ambiguous characters per sequence.
count_kmers([k, use_hook]): return kmer counts for each sequence
counts([motif_length, include_ambiguity, ...]): counts of motifs
counts_per_seq([motif_length, ...]): counts of motifs per sequence
degap([storage_backend]): returns collection sequences without gaps or missing characters.
distance_matrix([calc]): Estimated pairwise distance between sequences
dotplot([name1, name2, window, threshold, ...]): make a dotplot between two sequences.
drop_duplicated_seqs(): returns self without duplicated sequences
duplicated_seqs(): returns the names of duplicated sequences
entropy_per_seq([motif_length, ...]): Returns the Shannon entropy per sequence.
from_rich_dict(data): returns a new instance from a rich dict
get_ambiguous_positions(): Returns dict of seq:{position:char} for ambiguous chars.
get_features(*[, seqid, biotype, name, ...]): yields Feature instances
get_identical_sets([mask_degen]): returns sets of names for sequences that are identical
get_lengths([include_ambiguity, allow_gap]): returns sequence lengths as a dict of {seqid: length}
get_motif_probs([alphabet, ...]): Return a dictionary of motif probs, calculated as the averaged frequency across sequences.
get_seq(seqname[, copy_annotations]): Return a Sequence object for the specified seqname.
get_seq_names_if(f[, negate]): Returns list of names of seqs where f(seq) is True.
get_similar(target, min_similarity, ...): Returns new SequenceCollection containing sequences similar to target.
get_translation([gc, incomplete_ok, ...]): translate sequences from nucleic acid to protein
has_annotation_db(): returns True if self has annotation db
has_terminal_stop([gc, strict]): Returns True if any sequence has a terminal stop codon.
is_ragged(): returns True if sequences are of different lengths
iter_seqs([seq_order]): Iterates over sequences in the collection, in order.
make_feature(*, feature, **kwargs): create a feature on named sequence, or on the collection itself
pad_seqs([pad_length]): Returns copy in which sequences are padded with the gap character to same length.
probs_per_seq([motif_length, ...]): return frequency array of motifs per sequence
rc(): Returns the reverse complement of all sequences in the collection.
renamed_seqs(renamer): Returns new collection with renamed sequences.
replace_annotation_db(value[, check]): public interface to assigning the annotation_db
reverse_complement(): Returns the reverse complement of all sequences in the collection.
set_repr_policy([num_seqs, num_pos, ...]): specify policy for repr(self)
strand_symmetry([motif_length]): returns dict of strand symmetry test results per seq
take_seqs(names[, negate, copy_annotations]): Returns new collection containing only specified seqs.
take_seqs_if(f[, negate]): Returns new collection containing seqs where f(seq) is True.
to_dict(...) -> dict[str, str]: Return a dictionary of sequences.
to_dna(): returns copy of self as a collection of DNA moltype seqs
to_fasta([block_size]): Return collection in Fasta format.
to_html([name_order, wrap, limit, colors, ...]): returns html with embedded styles for sequence colouring
to_json(): returns json formatted string
to_moltype(moltype): returns copy of self with changed moltype
to_rich_dict(): returns a json serialisable dict
to_rna(): returns copy of self as a collection of RNA moltype seqs
trim_stop_codons([gc, strict]): Removes any terminal stop codons from the sequences
write(filename[, format_name]): Write the sequences to a file, preserving order of sequences.
Notes
Should be constructed using make_unaligned_seqs().