A global test of hybrid ancestry from genome-scale data

Md Rejuan Haque, Laura Kubatko
DOI: https://doi.org/10.1515/sagmb-2022-0061
2024-02-18
Statistical Applications in Genetics and Molecular Biology
Abstract: Methods based on the multi-species coalescent have been widely used in phylogenetic tree estimation using genome-scale DNA sequence data to understand the underlying evolutionary relationship between the sampled species. Evolutionary processes such as hybridization, which creates new species through interbreeding between two different species, necessitate inferring a species network instead of a species tree. A species tree is strictly bifurcating and thus fails to incorporate hybridization events which require an internal node of degree three. Hence, it is crucial to decide whether a tree or network analysis should be performed given a DNA sequence data set, a decision that is based on the presence of hybrid species in the sampled species. Although many methods have been proposed for hybridization detection, it is rare to find a technique that does so globally while considering a data generation mechanism that allows both hybridization and incomplete lineage sorting. In this paper, we consider hybridization and coalescence in a unified framework and propose a new test that can detect whether there are any hybrid species in a set of species of arbitrary size. Based on this global test of hybridization, one can decide whether a tree or network analysis is appropriate for a given data set.
Keywords: statistics & probability, mathematical & computational biology, biochemistry & molecular biology
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses how to detect hybridization from genome-scale data. Specifically, the authors propose a new global test that determines whether a given set of species contains any hybrid species. The problem matters for three reasons:

1. **Differences between phylogenetic trees and networks**: Traditional phylogenetic tree analysis assumes that species evolve through bifurcating splits (i.e., each ancestor has exactly two descendants), so it cannot capture hybridization events. Species networks can represent evolutionary relationships that include hybridization. An evolutionary analysis therefore has to begin by deciding whether a phylogenetic tree or a species network is the appropriate model.
2. **Limitations of existing methods**: Existing hybridization detection methods are usually based on site-pattern frequencies or likelihood-ratio tests, and they often overlook the influence of incomplete lineage sorting (ILS). ILS is the phenomenon in which gene trees at individual loci can disagree with the species tree, and this disagreement can reduce the accuracy of hybridization detection.
3. **The need for global testing**: Many existing methods can only test whether a specific species is a hybrid; they cannot test an entire data set for hybridization. A global test that handles any number of species is needed to decide whether a network analysis is required.

### Method overview

The authors propose a new global test that uses the Cauchy combination test (CCT) and the MinP-CCT-MinP (MCM) test to improve the statistical power of detecting hybridization. The steps are as follows:

1. **Individual testing**: For each subset of four species, assume that one of the species is a hybrid, construct a four-species network, and use the method of Kubatko and Chifman (2019) to calculate the H-statistic and its corresponding p-value.
2. **Combination testing**: Combine the p-values of all individual tests into a global test statistic. The authors use two combination methods:
   - **Cauchy combination test (CCT)**: Calculate the global test statistic as a weighted sum of the transformed p-values.
   - **MinP-CCT-MinP (MCM) test**: Combine the minimum p-value method (MinP) with CCT to improve detection efficiency.
3. **Simulation study**: Evaluate the performance of the proposed method by simulating genomic data under different evolutionary scenarios. The simulations vary the number of species, the timing of hybridization events, and the proportion of hybridization to assess the robustness and effectiveness of the method.

### Conclusion

The method proposed in this paper can detect the presence of hybrid species in large-scale genomic data, and so provides a basis for choosing between a tree and a network analysis. The simulation studies indicate that the method detects hybridization effectively across multiple evolutionary scenarios, including in the presence of incomplete lineage sorting.
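
To make the combination step from the method overview concrete, here is a minimal Python sketch of the standard Cauchy combination test: each p-value is mapped to a Cauchy-distributed quantity with `tan((0.5 - p) * pi)`, the weighted sum is formed, and the sum is converted back to a global p-value. The function names `quartets` and `cauchy_combination`, the equal default weights, and the example p-values are illustrative assumptions, not the authors' code. The sketch does not compute the H-statistic and does not implement the MCM test.

```python
import math
from itertools import combinations


def quartets(species):
    """Enumerate every four-species subset (hypothetical helper)."""
    return list(combinations(species, 4))


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination test.

    Weights default to equal values (an assumption made here) and are
    normalized to sum to one, so the statistic is standard Cauchy under
    the null hypothesis.
    """
    p = list(p_values)
    if not p:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0] * len(p)
    if len(weights) != len(p):
        raise ValueError("weights and p-values must have the same length")
    total = float(sum(weights))

    eps = 1e-15  # keep tan() finite when a p-value is exactly 0 or 1
    statistic = 0.0
    for w, p_i in zip(weights, p):
        p_i = min(max(p_i, eps), 1.0 - eps)
        statistic += (w / total) * math.tan((0.5 - p_i) * math.pi)

    # Upper-tail probability of a standard Cauchy variable.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    subsets = quartets(taxa)
    # Placeholder p-values, one per quartet; real values would come from
    # the individual H-statistic tests.
    example_p = [0.40, 0.75, 0.0004, 0.62, 0.18]
    print(len(subsets), cauchy_combination(example_p))
```

Because the tangent transform grows very quickly as a p-value approaches zero, a single strongly significant quartet dominates the sum. This is the property that lets a global test respond to a hybrid that appears in only a few of the four-species subsets.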