Inference of continuous gene flow between species under misspecified models

Yuttapong Thawornwattana, Tomas Flouri, James Mallet, Ziheng Yang
DOI: https://doi.org/10.1101/2024.05.13.593926
2024-05-15
Abstract: Gene flow between species is increasingly recognized as an important evolutionary process with potential adaptive consequences. Recent methodological advances make it possible to infer different modes of gene flow from genome-scale data, including pulse introgression at a specific time and continuous gene flow over an extended time period. However, it remains challenging to infer the history of species divergence and between-species gene flow from genomic sequence data. As a result, models used in real data analysis may often be misspecified, potentially leading to incorrect biological interpretations. Here, we characterize biases in parameter estimation under continuous migration models using a combination of asymptotic analysis and posterior inference from simulated datasets. When sequence data are generated under a pulse introgression model, isolation-with-initial-migration models assuming no recent gene flow are able to better recover gene flow with less bias than models that assume recent gene flow. When gene flow is assigned to an incorrect branch in the phylogeny, there may be large biases associated with the migration rate and species divergence times. When the direction of gene flow is incorrectly assumed, we may still detect gene flow if it is recent and between non-sister species but not when it is ancestral and between sister species. Overall, the impact of model misspecification is local in the species phylogeny. The pulse introgression model appears to be more robust to model misspecification and is preferable in real data analysis over the continuous migration model unless there is substantive evidence for continuous gene flow.
Evolutionary Biology
What problem does this paper attempt to address?
The main problem this paper addresses is how misspecification of gene-flow models affects the accuracy of estimates of species divergence history and gene-flow parameters. Specifically, the study focuses on three kinds of model misspecification:

1. **Incorrect mode of gene flow**: For example, gene flow actually occurs as a single pulse, but the analysis model assumes continuous gene flow.
2. **Misspecified lineages involved in gene flow**: In large phylogenetic trees, gene flow may be wrongly assigned to ancestral branches.
3. **Misspecified direction of gene flow**: Most summary methods cannot determine the direction of gene flow, so the source and recipient populations may be specified incorrectly.

To explore these issues, the authors combine asymptotic analysis with Bayesian inference on simulated multilocus datasets to characterize the bias and variance of parameter estimates. The study focuses on the effects on estimates of species divergence times, population sizes, and the migration rate (which corresponds to the introgression probability in the data-generating model).

### Main Findings

1. **Incorrect mode of gene flow in the two-species model**:
   - When the data are generated under the pulse introgression model (MSC-I) but analyzed under the continuous migration model (MSC-M), the isolation-with-initial-migration (IIM) model recovers gene flow better and with less bias.
   - The estimated time at which migration stops (τT) is usually slightly later than the actual introgression time (τX) because of the mismatch between the two modes of gene flow.
2. **Incorrect mode of gene flow and misassigned lineages in the four-species model**:
   - When gene flow is assigned to the wrong branch, the root divergence time (τR) and the outgroup population size (θD) can still be well estimated.
   - However, other parameters, such as the divergence times and population sizes of internal nodes (τT, τS, θS), are severely underestimated or overestimated.
   - The estimates of the migration rate (M) and the recipient population size (θT) may be unreasonably large, but their ratio (M/θT) can be well estimated.

### Conclusions

Overall, the IIM model performs best when the mode of gene flow is misspecified. It can detect more gene flow at shorter sequence lengths and provides more precise and accurate estimates of population sizes and divergence times. In contrast, the isolation-with-migration (IM) and secondary-contact (SC) models require longer sequences to detect a reasonable amount of gene flow. The IIM model also outperforms the other models when both the mode of gene flow and the lineages involved are misspecified. These results emphasize the importance of choosing an appropriate gene-flow model in real data analysis. In particular, in the absence of substantive evidence for continuous gene flow, the pulse introgression model may be the better choice.
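
The finding that the ratio M/θT can be well estimated even when M and θT individually are not has a simple intuition, sketched below in Python. The sketch assumes the standard coalescent parametrization, which is not spelled out in the summary above: M = Nm is the expected number of migrants per generation, θ = 4Nμ is the population size parameter of the recipient population, and time is measured in expected mutations per site. Under those assumptions a lineage migrates at rate 4M/θ per unit of time, so the probability that it migrates at least once during a window of length Δτ is 1 − exp(−4MΔτ/θ), which depends on M and θ only through their ratio. The function name and the numerical values are hypothetical and chosen purely for illustration.

```python
import math


def migration_probability(M: float, theta: float, dtau: float) -> float:
    """Probability that one lineage migrates at least once in a window.

    Assumed parametrization (not taken from the summary above):
      M     = N*m, expected number of migrants per generation
      theta = 4*N*mu, population size parameter of the recipient population
      dtau  = window length in expected mutations per site
    The per-lineage migration rate per unit of time is then 4*M/theta.
    """
    if M < 0 or theta <= 0 or dtau < 0:
        raise ValueError("require M >= 0, theta > 0 and dtau >= 0")
    expected_events = 4.0 * M * dtau / theta
    # 1 - exp(-x), computed with expm1 for accuracy when x is small
    return -math.expm1(-expected_events)


if __name__ == "__main__":
    dtau = 0.005  # hypothetical duration of the migration window
    # Two hypothetical parameter sets sharing the same ratio M/theta = 10
    for M, theta in [(0.1, 0.01), (1.0, 0.1)]:
        p = migration_probability(M, theta, dtau)
        print(f"M={M}, theta={theta}, M/theta={M / theta:.1f}: p={p:.4f}")
```

Both parameter sets give the same migration probability, so data that mainly constrain the total amount of gene flow can pin down M/θ while leaving M and θ individually poorly determined.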