Categorical edge-based analyses of phylogenomic data reveal conflicting signals for difficult relationships in the avian tree
Ning Wang, Edward L. Braun, Bin Liang, Joel Cracraft, Stephen A. Smith
DOI: https://doi.org/10.1101/2021.05.17.444565
2021-05-18
Abstract: Phylogenetic analyses of large-scale datasets sometimes fail to yield a satisfactory resolution of the relationships among taxa for a number of nodes in the tree of life. This has even been true for genome-scale datasets, where the failure to resolve relationships is unlikely to reflect limitations in the amount of data. Gene tree conflicts are particularly notable in studies focused on these contentious nodes in the tree of life, and taxon sampling, different analytical methods, and/or data-type effects are thought to further confound analyses. Observed conflicts among gene trees arise from both biological processes and artefactual sources of noise in analyses. Although many efforts have been made to incorporate biological conflicts, few studies have curated individual genes for their efficiency in phylogenomic studies. Here, we conduct an edge-based analysis of Neoavian evolution, examining the phylogenetic efficacy of two recent phylogenomic bird datasets and three datatypes (ultraconserved elements [UCEs], introns, and coding regions). We assess the potential causes for biases in signal-resolution for three difficult nodes: the earliest divergence of Neoaves, the position of the enigmatic Hoatzin (Opisthocomus hoazin), and the position of owls (Strigiformes). We observed extensive conflict among genes for all data types and datasets even after we removed potentially problematic loci. Edge-based analyses increased congruence and examined the impact of data type, GC content variation (GCCV), and outlier genes on analyses. These factors had different impacts on each of the nodes we examined. First, outlier gene signals appeared to drive different patterns of support for the relationships among the earliest diverging Neoaves. Second, the position of Hoatzin was highly variable, but we found that data type was correlated with the signals that support different placements of the Hoatzin. However, the resolution with the most support in our analyses was Hoatzin + shorebirds. Finally, GCCV, rather than data type (i.e., coding vs non-coding) per se, was correlated with an owl + Accipitriformes signal. Eliminating high GCCV loci increased the signal for an owl + mousebird relationship. Difficult edges (i.e., characterized by deep coalescence and high gene-tree estimation error) are hard to recover with all methods (including concatenation, multispecies coalescent, and edge-based analyses), whereas "easy" edges (e.g., flamingos + grebes) can be recovered without ambiguity. Thus, the nature of the edges, rather than the methods, is the limiting factor. Categorical edge-based analyses can reveal the nature of each edge and provide a way to highlight especially problematic branches that warrant further examination in future phylogenomic studies. We suggest that edge-based analyses provide a tool that can increase our understanding of the parts of the avian tree that remain unclear, even with large-scale data. In fact, our results emphasize that the conflicts associated with edges that remain contentious in the bird tree may be even greater than appreciated based on previous studies.