The distributions under two species-tree models of the total number of ancestral configurations for matching gene trees and species trees

Filippo Disanto, Michael Fuchs, Chun-Yen Huang, Ariel R. Paningbatan, Noah A. Rosenberg
2023-05-07
Abstract: Given a gene-tree labeled topology $G$ and a species tree $S$, the "ancestral configurations" at an internal node $k$ of $S$ represent the combinatorially different sets of gene lineages that can be present at $k$ when all possible realizations of $G$ in $S$ are considered. Ancestral configurations have been introduced as a data structure for evaluating the conditional probability of a gene-tree labeled topology given a species tree, and their enumeration assists in describing the complexity of this computation. In the case that the gene-tree labeled topology $G=t$ matches that of the species tree $S$, by techniques of analytic combinatorics, we study distributional properties of the "total" number of ancestral configurations measured across the different nodes of a random labeled topology $t$ selected under the uniform and the Yule probability models. Under both of these probabilistic scenarios, we show that the total number $T_n$ of ancestral configurations of a random labeled topology of $n$ taxa asymptotically follows a lognormal distribution. Over uniformly distributed labeled topologies, the asymptotic growth of the mean and the variance of $T_n$ are found to satisfy $\mathbb{E}_{\rm U}[T_n] \sim 2.449 \cdot 1.333^n$ and $\mathbb{V}_{\rm U}[T_n] \sim 5.050 \cdot 1.822^n$, respectively. Under the Yule model, which assigns higher probabilities to more balanced labeled topologies, we obtain the mean $\mathbb{E}_{\rm Y}[T_n] \sim 1.425^n$ and the variance $\mathbb{V}_{\rm Y}[T_n] \sim 2.045^n$.
Probability, Combinatorics, Populations and Evolution
What problem does this paper attempt to address?
This paper studies the distribution of the total number of ancestral configurations in the case where the gene-tree labeled topology matches the species tree. Specifically, the authors analyze the total number of ancestral configurations \(T_n\) of a random labeled topology with \(n\) taxa, selected under either the uniform or the Yule probability model. The main objectives of the paper are:

1. **Determine the asymptotic distribution of the total number of ancestral configurations**: Using generating functions, the authors show that, under both the uniform and the Yule models, \(T_n\) asymptotically follows a lognormal distribution.
2. **Analyze the growth rates of the mean and variance**: Under the uniform model, the mean \(E_U[T_n]\) and the variance \(V_U[T_n]\) of the total number of ancestral configurations satisfy
   \[ E_U[T_n] \sim 2.449 \cdot 1.333^n \]
   \[ V_U[T_n] \sim 5.050 \cdot 1.822^n \]
   Under the Yule model, which assigns higher probabilities to more balanced labeled topologies, the mean \(E_Y[T_n]\) and the variance \(V_Y[T_n]\) satisfy
   \[ E_Y[T_n] \sim 1.425^n \]
   \[ V_Y[T_n] \sim 2.045^n \]
3. **Study the correlation between the total number of ancestral configurations and the number of root ancestral configurations**: The authors also examine how the total number of ancestral configurations relates to the number of root ancestral configurations, in particular the correlation between the two quantities over random labeled topologies.

The paper relies on techniques of analytic combinatorics, especially generating functions, to obtain these distributional results. The results help describe the complexity of computing the probability of a gene-tree labeled topology given a species tree, and they provide theoretical support for algorithms that perform this computation.
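To get a feel for the growth rates quoted above, the following minimal Python sketch evaluates the leading-order expressions for the mean and variance of \(T_n\) and the resulting ratio of standard deviation to mean. It is only an illustration, not code from the paper: the function names `approx_moments` and `approx_cv` are hypothetical, the constants are the three-decimal roundings given in the summary, and the Yule expressions are taken as written, with no leading constant. Because these are asymptotic approximations, the values say little for small \(n\).

```python
import math

# Leading-order (constant, base) pairs copied from the summary above.
# Assumption: the Yule expressions are used as written, i.e. with constant 1.
ASYMPTOTICS = {
    "uniform": {"mean": (2.449, 1.333), "var": (5.050, 1.822)},
    "yule": {"mean": (1.0, 1.425), "var": (1.0, 2.045)},
}


def approx_moments(n, model="uniform"):
    """Return the leading-order approximations (mean, variance) of T_n."""
    mean_const, mean_base = ASYMPTOTICS[model]["mean"]
    var_const, var_base = ASYMPTOTICS[model]["var"]
    return mean_const * mean_base ** n, var_const * var_base ** n


def approx_cv(n, model="uniform"):
    """Return the approximate coefficient of variation sqrt(Var) / mean."""
    mean, var = approx_moments(n, model)
    return math.sqrt(var) / mean


if __name__ == "__main__":
    for model in ("uniform", "yule"):
        for n in (10, 50, 100):
            mean, var = approx_moments(n, model)
            cv = approx_cv(n, model)
            print(f"{model:8s} n={n:4d} mean~{mean:.4g} var~{var:.4g} cv~{cv:.4g}")
```

In both models the variance base exceeds the square of the mean base (\(1.822 > 1.333^2\) and \(2.045 > 1.425^2\)), so under these approximations the standard deviation grows slightly faster than the mean as \(n\) increases.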