Anomaly zones for uniformly sampled gene trees under the gene duplication and loss model

Brandon Legried
2024-03-29
Abstract: Recently, there has been interest in extending long-known results about the multispecies coalescent tree to other models of gene trees. Results about the gene duplication and loss (GDL) tree have mathematical proofs, including species tree identifiability, estimability, and sample complexity of popular algorithms like ASTRAL. Here, this work is continued by characterizing the anomaly zones of uniformly sampled gene trees. The anomaly zone for species trees is the set of parameters where some discordant gene tree occurs with the maximal probability. The detection of anomalous gene trees is an important problem in phylogenomics, as their presence renders otherwise effective estimation methods positively misleading. Under the multispecies coalescent, anomaly zones are known to exist for rooted species trees with as few as four species. The gene duplication and loss process is a generalization of the generalized linear birth-death process to the rooted species tree, where each edge is treated as a single timeline with exponential-rate duplication and loss. The methods and results come from a detailed probabilistic analysis of trajectories observed from this stochastic process. It is shown that anomaly zones do not exist for rooted GDL balanced trees on four species, but do exist for rooted caterpillar trees, as with the multispecies coalescent.
Populations and Evolution
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper asks whether anomaly zones exist for uniformly sampled gene trees under the gene duplication and loss (GDL) model. Specifically, the author determines whether anomaly zones exist for rooted balanced and rooted caterpillar species trees on three or four species under the GDL model.

#### Background and motivation

1. **Anomaly zones in the multispecies coalescent model**
   - Under the multispecies coalescent (MSC) model, anomaly zones are known to exist for species trees with as few as four species. For parameters inside these zones, some discordant gene tree is the most probable one, so methods based on majority voting fail.
2. **Gene duplication and loss model**
   - The gene duplication and loss process can be regarded as a generalization of the generalized linear birth-death process to the rooted species tree. Each edge is treated as a single timeline along which gene duplication and loss occur at exponential rates.
3. **Research significance**
   - Detecting anomalous gene trees is an important problem in phylogenomics because their presence can make otherwise effective estimation methods positively misleading. Knowing whether anomaly zones exist under the GDL model therefore matters for improving species tree estimation methods.

#### Main contributions

1. **Definition of anomaly zones**
   - An anomaly zone is the set of parameters for which some discordant gene tree has the highest probability. Under the GDL model, the author analyzes the distribution of gene trees in detail and determines where anomaly zones exist.
2. **Main results**
   - The author proves that for the rooted balanced quartet there are no anomaly zones under the GDL model.
   - For the rooted caterpillar quartet, anomaly zones exist, as under the MSC, but only balanced gene trees can be anomalous.
3. **Methods and techniques**
   - The author carries out a detailed probabilistic analysis of the trajectories of the GDL process. The conclusions above follow from computing the probability distribution of gene trees.

#### Conclusion

- Through mathematical proof and probabilistic analysis, this paper establishes whether anomaly zones exist for specific species trees (the rooted balanced quartet and the rooted caterpillar quartet) under the GDL model. These results clarify how gene duplication and loss affect species tree estimation and provide a theoretical basis for developing more accurate species tree reconstruction methods.
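To make the two central notions concrete, the following is a minimal Python sketch and not the paper's method. It assumes a single species-tree edge on which every gene copy duplicates at rate `dup_rate` and is lost at rate `loss_rate` (a plain linear birth-death process), and it reads "anomalous" as "some discordant topology is strictly more probable than the one matching the species tree". The function names, parameter names, and example probabilities are hypothetical and chosen only for illustration. The sketch does not model uniform sampling of copies or gene tree topologies.

```python
import math
import random


def simulate_copy_count(dup_rate, loss_rate, length, rng, start=1):
    """Simulate the number of gene copies at the end of one edge.

    Assumption: each copy independently duplicates at rate dup_rate and is
    lost at rate loss_rate, so with n copies the waiting time to the next
    event is Exponential(n * (dup_rate + loss_rate)).
    """
    n, t = start, 0.0
    while n > 0:
        t += rng.expovariate(n * (dup_rate + loss_rate))
        if t > length:
            break  # the next event would fall beyond the end of the edge
        # The event is a duplication with probability dup / (dup + loss).
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1
        else:
            n -= 1
    return n


def is_anomalous(topology_probs, species_topology):
    """Return True if a discordant topology beats the matching one.

    topology_probs maps a topology label to its probability; the label
    equal to species_topology is treated as the concordant gene tree.
    """
    p_match = topology_probs[species_topology]
    return any(
        p > p_match
        for topo, p in topology_probs.items()
        if topo != species_topology
    )


if __name__ == "__main__":
    rng = random.Random(0)
    lam, mu, t = 0.6, 0.4, 1.0
    counts = [simulate_copy_count(lam, mu, t, rng) for _ in range(20000)]
    # For a linear birth-death process started from one copy, the expected
    # copy count at time t is exp((lam - mu) * t).
    print("simulated mean:", sum(counts) / len(counts))
    print("theoretical mean:", math.exp((lam - mu) * t))

    # Made-up probabilities, only to exercise the check.
    probs = {"(((A,B),C),D)": 0.20, "((A,B),(C,D))": 0.30}
    print(is_anomalous(probs, "(((A,B),C),D)"))
```

The simulated mean should land close to the theoretical value, which is a quick sanity check on the event rates. In the paper's setting the topology probabilities come from an exact analysis of the GDL process rather than from hand-picked numbers like those above.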