Uncertainty in phylogenetic tree estimates

Amy D. Willis, Rayna C. Bell
DOI: https://doi.org/10.48550/arXiv.1611.03456
2017-10-13
Abstract: Estimating phylogenetic trees is an important problem in evolutionary biology, environmental policy and medicine. Although trees are estimated, their uncertainties are discarded by mathematicians working in tree space. Here we explicitly model the multivariate uncertainty of tree estimates. We consider both the cases where uncertainty information arises extrinsically (through covariate information) and intrinsically (through the tree estimates themselves). The importance of accounting for tree uncertainty in tree space is demonstrated in two case studies. In the first instance, differences between gene trees are small relative to their uncertainties, while in the second, the differences are relatively large. Our main goal is visualization of tree uncertainty, and we demonstrate advantages of our method with respect to reproducibility, speed and preservation of topological differences compared to visualization based on multidimensional scaling. The proposal highlights that phylogenetic trees are estimated in an extremely high-dimensional space, resulting in uncertainty information that cannot be discarded. Most importantly, it is a method that allows biologists to diagnose whether differences between gene trees are biologically meaningful, or due to uncertainty in estimation.
Methodology, Populations and Evolution
What problem does this paper attempt to address?
The main problem this paper addresses is **how to model and visualize the uncertainty of phylogenetic tree estimates**. Specifically, the authors focus on incorporating the multivariate uncertainty of tree estimates into statistical models, which is often overlooked in current phylogenetic research. The paper proposes using the log map to map trees from their metric space to Euclidean space, so that tree estimates and their uncertainties can be modeled and visualized in Euclidean space. This approach reflects the topological differences between trees while retaining the multivariate uncertainty information in the tree estimates.

### Main Objectives

1. **Modeling the Uncertainty of Trees**: Use the log map to convert trees from the metric space to Euclidean space, where the multivariate uncertainty in tree estimates can be modeled explicitly.
2. **Visualizing the Uncertainty of Trees**: Propose a new method to visualize the uncertainty of tree estimates, which live in a high-dimensional space. Compared to visualization based on multidimensional scaling, this method has advantages in reproducibility, speed, and preservation of topological differences.
3. **Evaluating the Differences between Trees**: Provide a method that enables biologists to diagnose whether the differences between gene trees are biologically meaningful or merely due to estimation uncertainty.

### Method Overview

- **Log Map**: Map trees from the metric space \(T_{m + 3}\) to the Euclidean space \(\mathbb{R}^m\), retaining the topological structure and branch length information of the trees.
- **Weighted Fréchet Mean**: Used to determine the base tree of the log map, taking into account the precision information of each tree estimate.
- **Maximum Likelihood Estimation**: Used to estimate model parameters, including the mean and variance of the log map.
- **Visualization**: Visualize the uncertainty of trees by projecting the high-dimensional uncertainty set onto the first two principal components.

### Case Studies

The paper demonstrates the method through two case studies:

1. **Small Differences between Gene Trees**: The differences between gene trees are small relative to their uncertainties.
2. **Large Differences between Gene Trees**: The differences between gene trees are large relative to their uncertainties.

### Conclusions

The paper emphasizes the importance of modeling and visualizing the uncertainty of trees in high-dimensional space and proposes a method to achieve this. The method reflects the topological differences between trees and helps biologists determine whether the differences between gene trees are biologically meaningful.
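
The estimation and visualization steps in the method overview can be illustrated with a minimal sketch. It assumes the trees have already been mapped to vectors in \(\mathbb{R}^m\) by the log map (that step needs geodesics in tree space and is not shown), and it uses an ordinary Gaussian maximum likelihood fit followed by a principal component projection. The function names `gaussian_mle` and `project_to_two_pcs` and the randomly generated coordinates are hypothetical and are not taken from the paper or its software.

```python
import numpy as np


def gaussian_mle(coords):
    """Gaussian MLE of mean and covariance; rows of `coords` are trees."""
    coords = np.asarray(coords, dtype=float)
    mean = coords.mean(axis=0)
    centered = coords - mean
    # The Gaussian MLE of the covariance divides by n, not n - 1.
    cov = centered.T @ centered / coords.shape[0]
    return mean, cov


def project_to_two_pcs(coords):
    """Project rows of `coords` onto the first two principal components."""
    coords = np.asarray(coords, dtype=float)
    mean, cov = gaussian_mle(coords)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(cov)
    top2 = eigvecs[:, ::-1][:, :2]  # two directions of largest variance
    return (coords - mean) @ top2


# Hypothetical log-map coordinates: 5 tree estimates with m = 4 (assumed data).
rng = np.random.default_rng(0)
log_coords = rng.normal(size=(5, 4))
projected = project_to_two_pcs(log_coords)
```

Each row of `projected` gives the two-dimensional plotting coordinates of one tree estimate. Under the same assumptions, a per-tree covariance matrix could be carried into the plot by projecting it with the same two principal directions.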