Statistical Phylogenetic Tree Analysis Using Differences of Means

Elissaveta Arnaoudova, David Haws, Peter Huggins, Jerzy W. Jaromczyk, Neil Moore, Chris Schardl, Ruriko Yoshida

DOI: https://doi.org/10.48550/arXiv.1004.2101

2010-04-13

Abstract: We propose a statistical method to test whether two phylogenetic trees with given alignments are significantly incongruent. Our method compares the two distributions of phylogenetic trees given by the input alignments, instead of comparing point estimates of trees. This statistical approach can be applied to gene tree analysis, for example to detect unusual events in genome evolution such as horizontal gene transfer and reshuffling. Our method uses the difference of means to compare two distributions of trees, after embedding the trees in a vector space. Bootstrapping alignment columns can then be applied to obtain p-values. To compute distances between means, we employ a "kernel trick" which speeds up distance calculations when trees are embedded in a high-dimensional feature space, e.g. a splits or quartets feature space. In this pilot study, we first test our statistical method's ability to distinguish between sets of gene trees generated under coalescence models with species trees of varying dissimilarity. We follow our simulation results with applications to various data sets of gophers and lice, grasses and their endophytes, and different fungal genes from the same genome. A companion toolkit, Phylotree, is provided to facilitate computational experiments.

Populations and Evolution, Genomics

What problem does this paper attempt to address?

The main problem this paper addresses is how to decide whether two phylogenetic trees differ significantly. Specifically, the authors propose a statistical test of whether two phylogenetic trees, each associated with a given alignment, are significantly incongruent. Rather than comparing single point estimates of the trees, the method compares the two distributions of trees given by the input alignments. By embedding the trees in a vector space and comparing the two distributions through their difference of means, the method can detect unusual events in genome evolution, such as horizontal gene transfer and reshuffling.

### Core Problems of the Paper

- **Main Problem**: How to determine whether two phylogenetic trees with given alignments are significantly incongruent.
- **Method**: Assess incongruence by comparing the means of the two tree distributions.
- **Application Scenarios**: Gene tree analysis, for example detecting unusual events in genome evolution such as horizontal gene transfer and reshuffling.

### Method Overview

1. **Tree Embedding**: Embed the phylogenetic trees in a vector space so that each tree corresponds to a feature vector.
2. **Mean Difference**: Compute the difference of means of the two tree distributions, \(\hat{\Delta} = \frac{1}{N_1} \sum_{i = 1}^{N_1}v(t_i)-\frac{1}{N_2} \sum_{i = 1}^{N_2}v(t'_i)\), where \(v(t_i)\) and \(v(t'_i)\) are the feature vectors of the trees \(t_i\) and \(t'_i\), and \(N_1\) and \(N_2\) are the numbers of trees in the two samples.
3. **Bootstrap Method**: Generate new alignments by bootstrapping alignment columns and re-estimate the tree distributions to assess the significance of the mean difference and obtain p-values.
4. **Distance Calculation**: Use the kernel trick to compute distances between means efficiently in a high-dimensional feature space (e.g. splits or quartets), without writing out the feature vectors explicitly.
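
The distance calculation in step 4 can be illustrated with a minimal Python sketch. It relies on the standard identity that the squared norm of a difference of means expands into averages of pairwise inner products, so only kernel values \(k(t, t') = \langle v(t), v(t') \rangle\) are needed. The sketch assumes a splits feature space with 0/1 indicator vectors, in which the kernel is simply the number of splits two trees share. All function names and the toy trees are hypothetical illustrations; they are not taken from the paper or from the Phylotree toolkit.

```python
from itertools import product


def split_kernel(tree_a, tree_b):
    """Inner product of 0/1 split-indicator vectors: the number of shared splits.

    Assumption: a tree is a frozenset of splits, and each split is a frozenset
    holding the taxa on the side that excludes a fixed reference taxon.
    """
    return len(tree_a & tree_b)


def mean_kernel(trees_a, trees_b, kernel=split_kernel):
    """Average kernel value over all pairs (a, b) with a in trees_a, b in trees_b."""
    total = sum(kernel(a, b) for a, b in product(trees_a, trees_b))
    return total / (len(trees_a) * len(trees_b))


def squared_mean_distance(sample_1, sample_2, kernel=split_kernel):
    """Squared norm of the difference of means, using kernel values only.

    ||m1 - m2||^2 = mean k(t, t) over sample_1 pairs
                  - 2 * mean k(t, t') over cross pairs
                  + mean k(t', t') over sample_2 pairs
    """
    return (mean_kernel(sample_1, sample_1, kernel)
            - 2.0 * mean_kernel(sample_1, sample_2, kernel)
            + mean_kernel(sample_2, sample_2, kernel))


def split(*taxa):
    """Hypothetical helper: one split, named by the side without reference taxon E."""
    return frozenset(taxa)


# Toy unrooted trees on taxa A..E, each with two nontrivial splits.
tree_1 = frozenset({split("A", "B"), split("A", "B", "C")})
tree_2 = frozenset({split("A", "C"), split("A", "B", "C")})
tree_3 = frozenset({split("A", "B"), split("A", "B", "D")})

sample_1 = [tree_1, tree_1, tree_2]  # stands in for trees from the first alignment
sample_2 = [tree_3, tree_3]          # stands in for trees from the second alignment

# Equals 20/9 for this toy data, matching the explicit-vector computation.
print(squared_mean_distance(sample_1, sample_2))
```

The cost here depends on the number of tree pairs and the size of each tree's split set, not on the dimension of the feature space, which is the point of the kernel trick. In a bootstrap setting, a statistic like this would be recomputed on trees re-estimated from resampled alignment columns; that step is omitted because it depends on the tree-inference software used.
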
### Experimental Results

- **Simulation Experiments**: On simulated data, the method was able to distinguish sets of gene trees generated under coalescence models with different species trees.
- **Application to Actual Data**: Applied to the gopher-louse data set and the grass-endophyte data set, the method detected significant incongruence between trees.

### Conclusion

The paper provides a statistical test for significant incongruence between two phylogenetic trees, based on the difference of means of their tree distributions. The method applies to gene tree analysis and can be used to detect unusual events in genome evolution.
