Large-scale Species Tree Estimation

Erin Molloy, Tandy Warnow
DOI: https://doi.org/10.48550/arXiv.1904.02600
2019-04-06
Abstract: Species tree estimation is a complex problem, due to the fact that different parts of the genome can have different evolutionary histories than the genome itself. One of the causes for this discord is incomplete lineage sorting (also called deep coalescence), which is a population-level process that produces gene trees that differ from the species tree. The last decade has seen a large number of new methods developed to estimate species trees from multi-locus datasets, specifically addressing this cause of gene tree heterogeneity. In this paper, we review these methods, focusing mainly on issues that relate to analyses of datasets containing large numbers of species or loci (or both). We also discuss divide-and-conquer strategies for enabling species tree estimation methods to run on large datasets, including new approaches that are based on algorithms (such as TreeMerge) for the Disjoint Tree Merger problem.
Populations and Evolution
What problem does this paper attempt to address?
This paper addresses the challenge of estimating species trees from large-scale multi-locus datasets. Because different parts of the genome can have different evolutionary histories, gene trees can disagree with one another and with the species tree. One major cause is incomplete lineage sorting (ILS), a population-level process that produces gene trees that differ from the species tree. The paper reviews methods developed over the last decade to estimate species trees from multi-locus datasets, specifically those that address gene tree heterogeneity caused by ILS. It also discusses divide-and-conquer strategies that enable species tree estimation methods to run on large datasets, including new approaches based on algorithms (such as TreeMerge) for the Disjoint Tree Merger problem. The core of the paper is how these methods behave on datasets with many species, many loci, or both, including their advantages and limitations in computational performance. By comparing the methods, the paper aims to help researchers select an appropriate method for such datasets.
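
To make the effect of ILS concrete, the following is a minimal sketch of the standard three-species result under the multispecies coalescent. It is not taken from the paper. It assumes a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units; the function name gene_tree_probabilities and the example branch lengths are hypothetical choices made for illustration. Under that model, a gene tree matches the species tree with probability 1 - (2/3)e^(-t), and each of the two discordant topologies has probability (1/3)e^(-t).

```python
import math


def gene_tree_probabilities(t):
    """Gene tree topology probabilities for a three-species tree ((A,B),C).

    t is the internal branch length in coalescent units (an assumption of
    this sketch). Returns the probability of the concordant topology and
    the probability of each of the two discordant topologies.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # With probability exp(-t) the A and B lineages fail to coalesce on the
    # internal branch; the three topologies are then equally likely.
    discordant_each = math.exp(-t) / 3.0
    concordant = 1.0 - 2.0 * discordant_each
    return {"concordant": concordant, "discordant_each": discordant_each}


if __name__ == "__main__":
    # Illustrative branch lengths only; shorter branches mean more ILS.
    for t in (0.1, 1.0, 5.0):
        p = gene_tree_probabilities(t)
        print(f"t={t}: concordant={p['concordant']:.4f}, "
              f"each discordant={p['discordant_each']:.4f}")
```

As t approaches zero the three topologies become nearly equally likely, which is why short internal branches make gene trees an unreliable guide to the species tree when each is considered alone.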