An independent base composition of each rate class for improved likelihood-based phylogeny estimation: The 5rf model

Peter J Waddell, Remco R Bouckaert
DOI: https://doi.org/10.1101/2024.09.03.610719
2024-09-08
Abstract: The combination of a time reversible Markov process with a 'hidden' mixture of gamma distributed relative site rates plus invariant sites has become the most favoured option for likelihood and other probabilistic models of nucleotide evolution (e.g., tr4gi, which approximates a gamma with four rate classes). However, these models assume a homogeneous and stationary distribution of nucleotide (character or base) frequencies. Here, we explore the potential benefits and pitfalls of allowing each rate category (rate class) of a 4gi mixture model to have its own base frequencies. This is achieved by starting each of the five rate classes, at the tree's root, with its own free choice of nucleotide frequencies to create a 4gi5rf model, or a 5rf model in shorthand. We assess the practical identifiability of this approach with a BEAST 2 implementation, aiming to determine if it can accurately estimate credibility intervals and expected values for a wide range of plausible parameter values. Practical identifiability, as distinguished from mathematical identifiability, gauges the model's ability to identify parameters in real-world scenarios, as opposed to theoretically with infinite data. One of the most common types of phylogenetic data is mitochondrial DNA (mtDNA) protein coding sequence. It is often assumed that current models analyse such data robustly and that higher likelihood/posterior probability models do better. However, this work shows that vertebrate mtDNA remains a very difficult type of data to fully model, and that dramatically higher likelihoods do not mean a model is measurably more accurate with respect to recovering key parameters of biological interest (e.g., monophyletic groups, their support and their ages). The 4gi5rf model considerably improves marginal likelihoods and seems to reverse some apparent errors exacerbated by the 4gi model, while introducing others.
Problems appear to be linked to non-stationary DNA repair processes that alter the mutation/substitution spectra across lineages and time. We also show that such problems are not unique to mtDNA and are also encountered when analysing nuclear sequences. Non-stationarity of DNA repair processes, and of the mutation/substitution spectra they shape, thus poses an active challenge to obtaining reliable inferences of relationships and divergence times, for example near the root of placental mammals. An open source implementation is available under the LGPL 3.0 license in the beastbooster package for BEAST 2, available from https://github.com/rbouckaert/beastbooster.
Evolutionary Biology
What problem does this paper attempt to address?
The main problem this paper addresses is the limitation of existing nucleotide evolution models when applied to non-stationary data. Specifically, the authors explore how allowing each rate class to have its own base frequencies affects likelihood-based inference. Traditional models assume that nucleotide frequencies are homogeneous and stationary throughout the evolutionary process, but this assumption may not hold for certain types of data, such as mitochondrial DNA (mtDNA) protein-coding sequences. Because DNA repair processes differ across lineages and change over time, the mutation/substitution spectra of such data also change, which undermines the accuracy of the model. To address this, the authors propose the 4gi5rf model, which allows each of the five rate classes to start at the root of the tree with its own freely chosen nucleotide frequencies. The aim is to let the model accommodate compositional differences between rate classes and so improve estimates of key biological parameters such as monophyletic groups, their support, and their ages. The paper also evaluates the practical identifiability of the model, that is, its ability to identify parameters in real-world scenarios rather than only in the theoretical limit of infinite data. The 4gi5rf model is evaluated through simulation studies and applications to real data sets, with particular attention to its behaviour under long-branch attraction and long-branch repulsion. These problems arise from non-stationary evolution and are especially relevant to the phylogenetic relationships and divergence times of higher-level mammalian taxa (such as eutherian mammals).
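To make the idea of "one set of root base frequencies per rate class" concrete, the following is a minimal sketch of a site likelihood under a five-class mixture (one invariant class plus four variable-rate classes). It is not the beastbooster implementation and uses no BEAST 2 API. Several simplifying assumptions are made for illustration only: the tree has just two taxa joined at the root, the substitution process is Jukes-Cantor rather than a general time reversible process, the four variable rates are hand-picked values with mean 1 rather than quantities derived from a gamma distribution, and all names (`jc_transition`, `site_likelihood`, `WEIGHTS`, `RATES`, `ROOT_FREQS_5RF`, `ROOT_FREQS_SHARED`) are hypothetical.

```python
import math

BASES = "ACGT"


def jc_transition(rate, branch_length):
    """Jukes-Cantor transition matrix for one branch, scaled by a class rate.

    Assumption: JC stands in for the time reversible process of the paper.
    A rate of 0 yields the identity matrix, i.e. an invariant class.
    """
    e = math.exp(-4.0 * rate * branch_length / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def site_likelihood(x, y, t1, t2, weights, rates, root_freqs):
    """Likelihood of bases x and y at the tips of a two-taxon tree.

    Class k has mixture weight weights[k], relative rate rates[k], and its
    own root base frequencies root_freqs[k] (order A, C, G, T).
    """
    i, j = BASES.index(x), BASES.index(y)
    total = 0.0
    for w, r, pi in zip(weights, rates, root_freqs):
        p1 = jc_transition(r, t1)
        p2 = jc_transition(r, t2)
        # Sum over the unobserved root state, weighted by this class's
        # root frequencies.
        class_lik = sum(pi[a] * p1[a][i] * p2[a][j] for a in range(4))
        total += w * class_lik
    return total


# Illustrative values only: class 0 is invariant, classes 1-4 are variable.
WEIGHTS = [0.2, 0.2, 0.2, 0.2, 0.2]
RATES = [0.0, 0.1, 0.5, 1.0, 2.4]

# One frequency vector per rate class (the "5rf" idea).
ROOT_FREQS_5RF = [
    [0.40, 0.30, 0.20, 0.10],
    [0.30, 0.30, 0.20, 0.20],
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.30, 0.10, 0.40],
    [0.10, 0.40, 0.10, 0.40],
]

# A single shared vector for all classes, as in a homogeneous 4gi-style model.
ROOT_FREQS_SHARED = [[0.25, 0.25, 0.25, 0.25]] * 5

lik_5rf = site_likelihood("A", "G", 0.1, 0.3, WEIGHTS, RATES, ROOT_FREQS_5RF)
lik_shared = site_likelihood("A", "G", 0.1, 0.3, WEIGHTS, RATES, ROOT_FREQS_SHARED)
```

Because Jukes-Cantor has a uniform equilibrium distribution, the only class-specific composition in this sketch comes from the root frequencies, so a class that starts away from uniform is non-stationary along its branches. How the published model ties the substitution process to the per-class frequencies is not specified in the abstract, so this choice should be read as an assumption.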