Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies

Chris Whidden, Brian C. Claywell, Thayer Fisher, Andrew F. Magee, Mathieu Fourment, Frederick A. Matsen IV
DOI: https://doi.org/10.48550/arXiv.1811.11007
2018-11-27
Abstract: Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this paper, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies. Here 'likelihood' of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method 'phylogenetic topographer' (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a non-blocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.
Populations and Evolution, Data Structures and Algorithms
What problem does this paper attempt to address?
This paper addresses the inefficiency of exploring phylogenetic tree space within the Bayesian framework. Bayesian Markov chain Monte Carlo (MCMC) explores tree space slowly, partly because it frequently returns to tree topologies it has already visited. The problem is most acute when the data set is highly informative: the posterior probability is then concentrated on a few topologies, and most random modifications propose topologies with low posterior probability. To address this, the paper proposes an efficient parallelized method, the phylogenetic topographer (PT), which maps out the set of high-likelihood tree topologies by systematic search and never visits the same topology twice. Here the likelihood of a topology is the tree likelihood with optimized branch lengths. PT starts from multiple local topology maxima, obtained by hill-climbing from random starting points, and explores outward using local tree rearrangements such as Nearest-Neighbor Interchange (NNI). It continues only through topologies whose likelihood lies within a chosen threshold below that of the best topology observed so far. A non-blocking hash table keyed on unique representations of tree topologies ensures that no topology is visited more than once across all concurrent threads. In this way PT quickly identifies a high-likelihood set that contains the credible set. The normalized topology likelihoods serve as a proxy for the Bayesian posterior probabilities of those topologies, so PT can be used directly to approximate a Bayesian consensus tree topology. Combined with an accurate means of evaluating per-topology marginal likelihoods, PT provides an alternative procedure for obtaining Bayesian posterior distributions on tree topologies. The paper supports the method with experiments, reporting that its results on standard test data sets are highly consistent with those of MrBayes.
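The search strategy described above can be sketched in a few lines of Python. This is a single-threaded illustration under stated assumptions, not the authors' implementation: the function names (`explore`, `normalize`, `toy_neighbors`, `toy_log_lik`) are hypothetical, a plain dictionary stands in for the non-blocking hash table, and integers with a made-up log-likelihood stand in for tree topologies and their optimized-branch-length likelihoods. In a real setting, `neighbors` would enumerate local rearrangements such as NNI and `log_lik` would optimize branch lengths for each topology.

```python
import math
from collections import deque


def explore(starts, neighbors, log_lik, threshold):
    """Visit each topology at most once, continuing only through those
    within `threshold` log-likelihood units of the best seen so far."""
    visited = {}  # topology key -> log-likelihood (stand-in for the hash table)
    queue = deque()
    best = float("-inf")
    for t in starts:
        if t not in visited:
            visited[t] = log_lik(t)
            best = max(best, visited[t])
            queue.append(t)
    while queue:
        t = queue.popleft()
        if visited[t] < best - threshold:
            continue  # below the threshold: do not explore past this topology
        for n in neighbors(t):
            if n in visited:
                continue  # already evaluated, never revisit
            visited[n] = log_lik(n)
            best = max(best, visited[n])
            queue.append(n)
    # Keep only the high-likelihood set relative to the final best value.
    return {t: ll for t, ll in visited.items() if ll >= best - threshold}


def normalize(log_liks):
    """Normalize likelihoods to sum to 1, subtracting the maximum
    log-likelihood first to avoid floating-point underflow."""
    m = max(log_liks.values())
    weights = {t: math.exp(ll - m) for t, ll in log_liks.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Toy stand-ins (assumptions for illustration only): topologies are the
# integers 0..20, neighbors differ by 1, and the likelihood peaks at 10.
def toy_neighbors(t):
    return [n for n in (t - 1, t + 1) if 0 <= n <= 20]


def toy_log_lik(t):
    return -float(abs(t - 10))


high = explore([10], toy_neighbors, toy_log_lik, threshold=2.5)
print(sorted(high))  # [8, 9, 10, 11, 12]
probs = normalize(high)
```

Because the best observed value can rise during the search, the sketch applies the threshold twice: once when deciding whether to continue through a topology, and once at the end against the final best value. Whether the original method handles a moving best value in exactly this way is an assumption of this sketch.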