Abstract: Motivation: Reconstructing evolutionary histories of biological entities, such as genes, cells, organisms, populations, and species, from phenotypic and molecular sequencing data is central to many biological, palaeontological, and biomedical disciplines. Typically, due to uncertainties and incompleteness in data, the true evolutionary history (phylogeny) is challenging to estimate. Statistical modelling approaches address this problem by introducing and studying probability distributions over all possible evolutionary histories, but can also introduce uncertainties due to misspecification. In practice, computational methods are deployed to learn those distributions, typically by sampling them. This approach, however, is fundamentally challenging as it requires designing and implementing various statistical methods over a space of phylogenetic trees (or treespace).

Although the problem of developing statistics over a treespace has received substantial attention in the literature and numerous breakthroughs have been made, it remains largely unsolved. The challenge of solving this problem is two-fold: a treespace has non-trivial, often counter-intuitive geometry, implying that much of classical Euclidean statistics does not immediately apply; and many parametrisations of treespace with promising statistical properties are computationally hard, so they cannot be used in data analyses. As a result, there is no single conventional method for estimating even the most fundamental statistics over any treespace, such as mean and variance, and various heuristics are used in practice. Despite the existence of numerous tree summary methods to approximate means of probability distributions over a treespace based on its geometry, and the theoretical promise of this idea, none of these attempts has resulted in a practical method for summarising tree samples.
Results: In this paper, we present a tree summary method along with useful properties of our chosen treespace, focusing on its impact on phylogenetic analyses of real datasets. We perform an extensive benchmark study and demonstrate that our method outperforms the currently most popular methods with respect to a number of important "quality" statistics. Further, we apply our method to three empirical datasets ranging from cancer evolution to linguistics and find novel insights into the corresponding evolutionary problems in all of them. We hence conclude that this treespace is a promising candidate to serve as a foundation for developing statistics over phylogenetic trees analytically, as well as new computational tools for evolutionary data analyses.

Availability and implementation: An implementation is available at https://github.com/bioDS/Centroid-Code.

Supplementary information: Supplementary data are available at Bioinformatics online.
