Estimating Genome-wide Phylogenies Using Probabilistic Topic Modeling

Marzieh Khodaei, Scott V. Edwards, Peter Beerli
DOI: https://doi.org/10.1101/2023.12.20.572577
2024-02-15
Abstract: Inferring the evolutionary history of species or populations with genome-wide data is gaining ground, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer the genome-wide species tree and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation, or LDA) to extract ‘topic’ frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of our method with biological and simulated data sets: a data set with 14 DNA sequence loci from 78-92 haplotypes from two Australian bird species distributed in 9 populations; a second data set of 5162 loci from 80 mammal species; a third data set of 67317 autosomal loci and 4157 X-chromosome loci of 6 species in the Anopheles gambiae complex; and several simulated data sets. Our results on empirical and simulated data suggest that our method is efficient and statistically accurate. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure for aligned sequence data and for k-mer data.
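The first step the abstract describes, turning DNA sequences into k-mers, can be illustrated with a minimal sketch. This is not the TopicContml implementation: the function names `kmers` and `build_documents`, the toy sequences, and the choice to drop gap characters and skip k-mers containing ambiguous bases are all assumptions made for illustration. The LDA step and the call to the PHYLIP program are omitted; the per-group k-mer counts built here are the kind of "document" a topic model would take as input.

```python
from collections import Counter

def kmers(seq, k):
    """Return overlapping k-mers of seq.

    Assumption for this sketch: gap characters are removed first, and any
    k-mer containing a character other than A, C, G, T is skipped.
    """
    seq = seq.upper().replace("-", "")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= set("ACGT")]

def build_documents(seqs_by_group, k):
    """Pool the k-mer counts of all sequences in each group (hypothetical helper).

    Each group (e.g. a population or species) becomes one bag-of-k-mers
    'document'.
    """
    docs = {}
    for group, seqs in seqs_by_group.items():
        counts = Counter()
        for s in seqs:
            counts.update(kmers(s, k))
        docs[group] = counts
    return docs

# Toy input, invented for illustration only.
toy = {"popA": ["ACGTAC", "ACG-TN"], "popB": ["TTGACA"]}
docs = build_documents(toy, k=3)
```

Because the k-mers are counted per sequence without reference to any alignment, sequences of different lengths can be pooled directly, which is the sense in which such a pipeline is alignment-free.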
Evolutionary Biology