Abstract: Whole-genome duplication (WGD) occurs broadly and repeatedly across the history of eukaryotes and is recognized as a prominent evolutionary force, especially in plants. Immediately following WGD, most genes are present in two copies as paralogs. Due to this redundancy, one copy of a paralog pair commonly undergoes pseudogenization and is eventually lost. When speciation occurs shortly after WGD, however, differential loss of paralogs may lead to spurious phylogenetic inference resulting from the inclusion of pseudoorthologs: paralogous genes mistakenly identified as orthologs because they are present in single copies within each sampled species. The impact of including pseudoorthologs rather than true orthologs, as a result of gene extinction (or incomplete laboratory sampling), has only recently gained empirical attention in the phylogenomics community. Moreover, few studies have investigated this phenomenon in an explicit coalescent framework. Here, using mathematical models, numerous simulated data sets, and two newly assembled empirical data sets, we assess the effect of pseudoorthologs on species tree estimation under varying degrees of incomplete lineage sorting (ILS) and differential gene loss scenarios following WGD. When gene loss occurs along the terminal branches of the species tree, alignment-based (BPP) and gene-tree-based (ASTRAL, MP-EST, and STAR) coalescent methods are adversely affected as the degree of ILS increases. Their accuracy can be greatly improved by sampling a sufficiently large number of genes. Under the same circumstances, however, concatenation methods consistently estimate incorrect species trees as the number of genes increases. Additionally, pseudoorthologs can greatly mislead species tree inference when gene loss occurs along the internal branches of the species tree. In this case, both coalescent and concatenation methods yield inconsistent results.
These results underscore the importance of understanding the influence of pseudoorthologs in the phylogenomics era. [Coalescent method; concatenation method; incomplete lineage sorting; pseudoorthologs; single-copy gene; whole-genome duplication.]
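As a toy illustration of the differential paralog loss described above (not the mathematical models or simulations used in the study), the sketch below assumes that the WGD precedes all speciation events and that each sampled species independently retains exactly one of two paralogs, labeled A or B, with equal probability. Under these assumptions it enumerates every retention pattern and reports the fraction in which the single-copy genes are pseudoorthologs. The function names and the A/B labels are hypothetical.

```python
from itertools import product


def classify_retention(pattern):
    """Label one retention pattern (one retained paralog per species).

    Assumption: 'A' and 'B' are the two paralogs created by the WGD.
    If every species kept the same paralog, the single-copy genes are
    true orthologs; otherwise the set mixes paralogs (pseudoorthologs).
    """
    return "ortholog" if len(set(pattern)) == 1 else "pseudoortholog"


def pseudoortholog_fraction(n_species):
    """Fraction of equally likely retention patterns that mix paralogs.

    Assumption: each species loses one copy independently and with equal
    probability, so all 2**n_species patterns are equally likely.
    """
    patterns = list(product("AB", repeat=n_species))
    mixed = sum(classify_retention(p) == "pseudoortholog" for p in patterns)
    return mixed / len(patterns)


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, pseudoortholog_fraction(n))
```

Under these simplifying assumptions, only 2 of the 2^n patterns retain the same paralog in every species, so the pseudoortholog fraction is 1 - 2^(1-n) and grows with the number of sampled species. Real loss processes need not be independent or equiprobable.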
