Making Sense from Sequence

cogent3 is a mature Python library for the analysis of biological sequence data. We endeavour to provide a first-class experience within Jupyter notebooks, but the algorithms also support parallel execution on compute systems with thousands of processors.

cogent3 is released under the BSD-3 license; links to the documentation and code are above. If you would like to contribute (and we hope you do!), we have created a companion c3dev repo, which provides details on how to contribute and some useful tools for doing so.

Who is it for?

Anyone who wants to analyse sequence divergence using robust statistical models

cogent3 is unique in providing numerous non-stationary Markov models [1] for modelling sequence evolution, including novel codon models [2]. It also includes an extensive collection of time-reversible models, again including novel codon models [3]. (See using non-stationary models.) We have done more than just invent these new methods: we have also established the most robust algorithms [4] for their implementation and their suitability for real data [5]. Additionally, there are novel signal processing methods focussed on statistical estimation of integer period signals [6, 7].

Anyone who wants to undertake exploratory genomic data analysis

Beyond our novel methods, cogent3 provides an extensive suite of capabilities for manipulating and analysing sequence data. These include reading standard biological data formats, manipulating sequences by their annotations, multiple sequence alignment (app docs) using any of our substitution models, phylogenetic reconstruction and tree manipulation, manipulation of tabular data, visualisation of phylogenies (image gallery) and much more.

Anyone looking for a functional programming style approach to genomic data analysis

Our cogent3.app module (app docs) provides a very different approach to using the library's capabilities. Notably, a functional programming style interface lowers the barrier to entry for using cogent3's advanced capabilities. It also supports building pipelines suitable for large-scale analysis. Individuals comfortable with R should find this interface pretty easy to use.


Citations

[1]

Benjamin D Kaehler, Von Bing Yap, Rongli Zhang, and Gavin A Huttley. Genetic distance for a general non-stationary Markov substitution process. Systematic Biology, 64:281–93, 2015. URL: https://www.ncbi.nlm.nih.gov/pubmed/25503772.

[2]

Benjamin D Kaehler, Von Bing Yap, and Gavin A Huttley. Standard codon substitution models overestimate purifying selection for non-stationary data. Genome Biology and Evolution, 9:134–149, 2017. URL: https://www.ncbi.nlm.nih.gov/pubmed/28175284.

[3]

Von Bing Yap, Helen Lindsay, Simon Easteal, and Gavin Huttley. Estimates of the effect of natural selection on protein-coding content. Molecular Biology and Evolution, 27:726–734, 2010. URL: https://www.ncbi.nlm.nih.gov/pubmed/19815689.

[4]

Harold W Schranz, Von Bing Yap, Simon Easteal, Rob Knight, and Gavin A Huttley. Pathological rate matrices: from primates to pathogens. BMC Bioinformatics, 9:550, 2008. URL: https://www.ncbi.nlm.nih.gov/pubmed/19099591.

[5]

Klara L Verbyla, Von Bing Yap, Anuj Pahwa, Yunli Shao, and Gavin A Huttley. The embedding problem for Markov models of nucleotide substitution. PLoS ONE, 8:e69187, 2013. URL: https://pubmed.ncbi.nlm.nih.gov/23935949/.

[6]

Julien Epps, Hua Ying, and Gavin Huttley. Statistical methods for detecting periodic fragments in DNA sequence data. Biology Direct, 6:21, 2011. URL: https://www.ncbi.nlm.nih.gov/pubmed/21527008.

[7]

M Bellani, J Epps, and G Huttley. A Comparison of Periodicity Profile Methods for Sequence Analysis. Proc. IEEE Workshop on Genomic Signal Processing and Statistics, 2012. URL: https://ieeexplore.ieee.org/document/6507731.