Lars Berling, Jonathan Klawitter, Remco Bouckaert, Dong Xie, Alex Gavryushkin, Alexei Drummond
Abstract: Bayesian phylogenetic analysis with MCMC algorithms generates an estimate of the posterior distribution of phylogenetic trees in the form of a sample of phylogenetic trees and related parameters. The high dimensionality and non-Euclidean nature of tree space complicates summarizing the central tendency and variance of the posterior distribution in tree space. Here we introduce a new tractable tree distribution and associated point estimator that can be constructed from a posterior sample of trees. Through simulation studies we show that this point estimator performs at least as well and often better than standard methods of producing Bayesian posterior summary trees. We also show that the method of summary that performs best depends on the sample size and dimensionality of the problem in non-trivial ways.
What problem does this paper attempt to address?
The paper addresses how to summarize the posterior distribution of trees in Bayesian phylogenetic analysis, and in particular how to obtain an accurate point estimate in tree space, where high dimensionality and non-Euclidean structure make central tendency and variance hard to characterize. It introduces a new tractable tree distribution and an associated point estimator, both constructed from a posterior sample of trees. Simulation studies show that this estimator performs at least as well as, and often better than, the standard methods of producing Bayesian posterior summary trees. The paper also shows that which summary method performs best depends on the sample size and the dimensionality of the problem in non-trivial ways.
### Main Contributions
1. **New Tree Distribution Model**: The paper defines a new tree distribution model based on observed clade frequencies and studies a closely related model based on observed clade split frequencies. These models are tractable, and experiments show that they can provide excellent estimates of the true posterior distribution.
2. **Point Estimator**: The paper proposes a new point estimator, the CCD-MAP tree: the tree with the highest probability under a conditional clade distribution (CCD) built from the posterior sample. This tree serves as the summary tree, or point estimate, of the distribution.
3. **Performance Evaluation**: Through simulation studies, the paper shows the performance of the new method under different sample sizes and problem dimensions. The results indicate that the new method is at least as good as, if not better than, the existing methods.
4. **Challenges and Prospects**: Although the new method performs well in many cases, choosing the best summary method remains difficult because it depends on the sample size and the dimensionality of the problem in non-trivial ways. This work has the potential to improve the accuracy of phylogenetic research.
### Method Overview
- **Tree Distribution Model**: The paper discusses the properties of tractable tree distributions and defines three differently parameterized conditional clade distributions (CCDs), namely CCD0, CCD1, and CCD2.
- **Point Estimator**: The paper reviews two commonly used point estimators, the maximum clade credibility (MCC) tree and the greedy consensus tree, and proposes a new CCD-based point estimator, the CCD-MAP tree.
- **Dataset**: The paper describes the datasets used for the experiments and uses dynamic programming to efficiently compute quantities of a CCD such as the number of trees it contains and its entropy.
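The dynamic-programming idea in the last item can be illustrated with a minimal sketch. It assumes a CCD is stored as a mapping from each clade to its observed splits with their conditional probabilities; the `ccd` table, the recursions, and the function names `num_trees` and `entropy` are illustrative assumptions, not the paper's implementation. The number of trees below a clade is the sum over its splits of the product of the counts of the two child clades, and the entropy follows the analogous recursion.

```python
from functools import lru_cache
from math import log

# Hypothetical CCD on four taxa:
# clade -> list of (child clade 1, child clade 2, conditional probability)
ccd = {
    frozenset("ABCD"): [(frozenset("AB"), frozenset("CD"), 0.5),
                        (frozenset("ABC"), frozenset("D"), 0.5)],
    frozenset("ABC"): [(frozenset("AB"), frozenset("C"), 0.5),
                       (frozenset("AC"), frozenset("B"), 0.5)],
    frozenset("AB"): [(frozenset("A"), frozenset("B"), 1.0)],
    frozenset("AC"): [(frozenset("A"), frozenset("C"), 1.0)],
    frozenset("CD"): [(frozenset("C"), frozenset("D"), 1.0)],
}

@lru_cache(maxsize=None)
def num_trees(clade):
    """Number of distinct subtrees the CCD can build on `clade`."""
    if len(clade) == 1:          # a single taxon has exactly one subtree
        return 1
    return sum(num_trees(c1) * num_trees(c2) for c1, c2, _ in ccd[clade])

@lru_cache(maxsize=None)
def entropy(clade):
    """Entropy (in nats) of the distribution over subtrees on `clade`."""
    if len(clade) == 1:
        return 0.0
    return sum(p * (-log(p) + entropy(c1) + entropy(c2))
               for c1, c2, p in ccd[clade])

root = frozenset("ABCD")
print(num_trees(root), entropy(root))
```

Memoizing on the clade means each clade is processed once, so the cost grows with the number of clades and splits in the CCD rather than with the number of trees it contains.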
### Formula Summary
- **Conditional Clade Probability (CCP)**:
\[
\text{Pr}(S)=\frac{f(S)}{f(C)}
\]
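where \(S\) is a split of the clade \(C\) into two child clades \(C_{1}\) and \(C_{2}\), \(f(S)\) is the frequency of the clade split \(S\) in the sample, and \(f(C)\) is the frequency of the parent clade \(C\). \(\text{Pr}(S)\) is therefore conditional on \(C\).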
- **Tree Probability**:
\[
\text{Pr}(T)=\prod_{S\in S(T)}\text{Pr}(S)
\]
where \(S(T)\) is the set of all clade splits in the tree \(T\).
- **Maximum Probability Subtree**:
\[
\text{Pr}^{\star}(C)=\max_{\{C_{1},C_{2}\}\in S(C)}\left\{\text{Pr}(C_{1},C_{2}|C)\cdot\text{Pr}^{\star}(C_{1})\cdot\text{Pr}^{\star}(C_{2})\right\}
\]
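where \(S(C)\) is the set of observed splits of the clade \(C\) and \(\text{Pr}^{\star}(C)\) is the probability of the maximum-probability subtree rooted at the clade \(C\), with \(\text{Pr}^{\star}(C)=1\) when \(C\) is a single taxon.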
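The three formulas can be tied together in a small sketch: estimate conditional clade probabilities from a sample, multiply them to get a tree probability, and apply the recursion for \(\text{Pr}^{\star}\) to find the highest-probability tree. The tree encoding (nested pairs of taxon names), the toy sample, and the function names `estimate_ccp`, `tree_probability`, and `map_tree` are assumptions made for illustration; this is not the authors' code.

```python
from collections import Counter
from functools import lru_cache
from math import prod

# Assumed encoding: a leaf is a taxon name (str); an internal node is a
# pair (left_subtree, right_subtree).

def collect(tree, clade_counts, split_counts):
    """Count every clade and clade split of `tree`; return its root clade."""
    if isinstance(tree, str):
        return frozenset([tree])
    c1 = collect(tree[0], clade_counts, split_counts)
    c2 = collect(tree[1], clade_counts, split_counts)
    clade = c1 | c2
    clade_counts[clade] += 1
    split_counts[(clade, frozenset([c1, c2]))] += 1
    return clade

def estimate_ccp(trees):
    """Pr(S) = f(S) / f(C) for every observed split S of a clade C."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in trees:
        collect(tree, clade_counts, split_counts)
    return {(c, s): n / clade_counts[c] for (c, s), n in split_counts.items()}

def tree_probability(tree, ccp):
    """Pr(T): product of Pr(S) over the splits of T (0 if a split is unseen)."""
    splits = Counter()
    collect(tree, Counter(), splits)
    return prod(ccp.get(key, 0.0) for key in splits)

def map_tree(ccp, root):
    """Return (Pr*(root), tree) via the maximum-probability recursion."""
    splits_of = {}
    for (clade, split), p in ccp.items():
        splits_of.setdefault(clade, []).append((tuple(split), p))

    @lru_cache(maxsize=None)
    def best(clade):
        if len(clade) == 1:                  # single taxon: Pr* = 1
            return 1.0, next(iter(clade))
        candidates = []
        for (c1, c2), p in splits_of[clade]:
            (p1, t1), (p2, t2) = best(c1), best(c2)
            candidates.append((p * p1 * p2, (t1, t2)))
        return max(candidates, key=lambda cand: cand[0])

    return best(root)

# Toy sample of 10 trees on taxa A-D (made up for this example).
sample = ([(("A", "B"), ("C", "D"))] * 4
          + [((("A", "B"), "C"), "D")] * 3
          + [((("A", "C"), "B"), "D")] * 3)
ccp = estimate_ccp(sample)
prob, tree = map_tree(ccp, frozenset("ABCD"))
print(prob, tree)
```

In this toy sample the root split ABC|D is the most frequent one (6 of 10 trees), yet the recursion selects the tree with the root split AB|CD, because the probability below ABC is divided between two resolutions. This shows why the maximization runs over whole subtrees instead of picking the most frequent split at each clade.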
Through these methods and models, the paper aims to improve the accuracy and robustness of point estimation in Bayesian phylogenetic analysis.