Unidentifiable divergence times in rates-across-sites models

Steven N. Evans, Tandy Warnow
DOI: https://doi.org/10.48550/arXiv.q-bio/0408011
2004-11-22
Abstract: The rates-across-sites assumption in phylogenetic inference posits that the rate matrix governing the Markovian evolution of a character on an edge of the putative phylogenetic tree is the product of a character-specific scale factor and a rate matrix that is particular to that edge. Thus, evolution follows basically the same process for all characters, except that it occurs faster for some characters than others. To allow estimation of tree topologies and edge lengths for such models, it is commonly assumed that the scale factors are not arbitrary unknown constants, but rather unobserved, independent, identically distributed draws from a member of some parametric family of distributions. A popular choice is the gamma family. We consider an example of a clock-like tree with three taxa, one unknown edge length, and a parametric family of scale factor distributions that contain the gamma family. This model has the property that, for a generic choice of unknown edge length and scale factor distribution, there is another edge length and scale factor distribution which generates data with exactly the same distribution, so that even with infinitely many data it will be typically impossible to make correct inferences about the unknown edge length.
Populations and Evolution, Genomics
What problem does this paper attempt to address?
The paper addresses the unidentifiability of branch lengths, and hence of divergence times, when rates-across-sites models are used for phylogenetic tree inference in molecular systematics. Specifically, the authors ask whether an unknown branch length can be uniquely estimated from the data when the evolutionary rate of each site is treated as a random draw from a distribution in some parametric family (such as the gamma family). They construct a clock-like three-taxon tree with one unknown branch length, let sites evolve under the Neyman two-state model, and prove that, for a generic choice of branch length and site-rate distribution, a different branch length paired with a different site-rate distribution generates data with exactly the same distribution. Even with an infinite amount of data the two cannot be distinguished, so the random-effects method does not by itself remove the fundamental difficulty of estimating branch lengths.

### Core problems of the paper

1. **Unidentifiability problem**: The paper focuses on the unidentifiability of branch lengths in rates-across-sites models, that is, whether different branch lengths combined with different site-rate distributions can generate data with the same probability distribution, making it impossible to infer the true branch lengths even with an infinite amount of data.
2. **Effectiveness of the random-effects method**: The paper explores whether the random-effects method (assuming that the evolutionary rate of each site is randomly drawn from a certain distribution) can solve the unidentifiability problem. In particular, the authors investigate whether the assumption that site rates follow a gamma distribution is reasonable.

### Main findings

- **Existence of unidentifiability**: Through mathematical analysis of an explicit example, the authors prove that branch lengths can remain unidentifiable even when the random-effects method is used. Two different branch lengths, each paired with its own site-rate distribution from the same parametric family, can produce the same data distribution.
- **Specialty of the gamma distribution**: Although the gamma distribution is often used to describe the variation of site rates, the paper points out that this choice rests on mathematical convenience rather than biological justification. Other equally reasonable distributions exist, and admitting them leads to unidentifiable branch lengths.

### Significance

- **Impact on phylogenetic tree inference**: This finding matters for phylogenetic tree inference based on statistical methods. If branch lengths are unidentifiable, then time estimates based on these models (such as the times of internal nodes) also become unreliable.
- **Future research directions**: The paper calls on researchers to reconsider the assumption that site rates follow a gamma distribution and suggests conducting more simulation studies to test the impact of different distributions on inference results.

### Conclusion

Through rigorous mathematical analysis and a specific example, the paper reveals the unidentifiability of branch lengths in rates-across-sites models. The result is significant for theory and is also a relevant consideration for model selection and parameter estimation in practice.
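
### Illustrative sketch

The kind of non-identifiability described above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's own construction. It assumes a clock-like tree on taxa a, b, c with root height fixed at 1, where a and b split at an unknown time `tau`; a Neyman two-state model with unit flip rate, states coded as +1/-1 and a uniform root state; and site rates drawn from a gamma distribution whose shape and scale are both free. Under these assumptions the site-pattern distribution depends on the rate distribution only through its Laplace transform evaluated at `4 * tau` and `4 * height`, so two numbers cannot pin down three unknowns. All function and parameter names are hypothetical.

```python
import itertools


def gamma_laplace(s, shape, scale):
    """Laplace transform E[exp(-s * R)] of a Gamma(shape, scale) rate R."""
    return (1.0 + scale * s) ** (-shape)


def pattern_probs(tau, shape, scale, height=1.0):
    """Site-pattern distribution for taxa (a, b, c), states coded +1/-1.

    Assumed tree: a and b split at time tau; their ancestor and c meet at
    the root at time `height`. Path lengths are 2*tau (a-b) and 2*height
    (a-c, b-c), and a path of length d under rate r gives correlation
    exp(-2 * r * d).
    """
    e_ab = gamma_laplace(4.0 * tau, shape, scale)     # E[X_a X_b]
    e_c = gamma_laplace(4.0 * height, shape, scale)   # E[X_a X_c] = E[X_b X_c]
    return {
        (xa, xb, xc): (1 + xa * xb * e_ab + (xa + xb) * xc * e_c) / 8
        for xa, xb, xc in itertools.product((1, -1), repeat=3)
    }


def matching_model(tau, shape, scale, new_shape, height=1.0):
    """Return (tau', scale') that, with `new_shape`, give the same patterns."""
    e_ab = gamma_laplace(4.0 * tau, shape, scale)
    e_c = gamma_laplace(4.0 * height, shape, scale)
    # Invert (1 + scale' * 4 * height) ** (-new_shape) = e_c for scale'.
    new_scale = (e_c ** (-1.0 / new_shape) - 1.0) / (4.0 * height)
    # Invert (1 + scale' * 4 * tau') ** (-new_shape) = e_ab for tau'.
    new_tau = (e_ab ** (-1.0 / new_shape) - 1.0) / (4.0 * new_scale)
    return new_tau, new_scale


if __name__ == "__main__":
    tau, shape, scale = 0.3, 0.5, 2.0
    new_tau, new_scale = matching_model(tau, shape, scale, new_shape=2.0)
    original = pattern_probs(tau, shape, scale)
    alternative = pattern_probs(new_tau, 2.0, new_scale)
    print("original tau:", tau, "alternative tau:", new_tau)
    for pattern in original:
        print(pattern, original[pattern], alternative[pattern])
```

Running the script prints two different split times whose site-pattern probabilities agree, which is the sense in which no amount of data could separate them in this toy setting. Fixing the mean of the gamma distribution would remove one free parameter from this particular sketch; the paper's result concerns a larger parametric family that contains the gamma family.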