Minus the Error: Estimating dN/dS and Testing for Natural Selection in the Presence of Residual Alignment Errors

Avery Selberg, Maria Chikina, Timothy B Sackton, Spencer Muse, Alex Lucaci, Steven Weaver, Anton Nekrutenko, Nathaniel Clark, Sergei L Kosakovsky Pond
DOI: https://doi.org/10.1101/2024.11.13.620707
2024-11-15
Abstract: Errors in multiple sequence alignments (MSAs) are known to bias many comparative evolutionary methods. In the context of natural selection analyses, specifically codon evolutionary models, excessive rates of false positives result. A characteristic signature of error-driven findings is unrealistically high estimates of dN/dS (e.g., >100), affecting only a small fraction (e.g., ~0.1%) of the alignment. Despite the widespread use of codon models to assess alignment quality, their potential for error correction remains unexplored. We present BUSTED-E: a novel method designed to detect positive selection while concurrently identifying alignment errors. This method is a straightforward adaptation of the BUSTED flexible branch-site random effects model used to fit distributions of dN/dS, with an important modification: it integrates an "error-sink" component representing an abiological evolutionary regime (dN/dS > 100), and provides the option for masking errors in the MSA for downstream analyses. Statistical performance of BUSTED-E on data simulated without errors shows that there is a small loss of power, which can be mitigated by model averaged inference. Using four published empirical datasets, we show BUSTED-E reduces unrealistic rates of positive selection detection, often by an order of magnitude, and improves biological interpretability of results. BUSTED-E also detects errors that are largely distinct from other popular alignment cleaning tools (HMMCleaner and BMGE). Overall, BUSTED-E is a robust and scalable solution for improving the accuracy of evolutionary analyses in the presence of residual alignment errors, contributing to a more nuanced understanding of natural selection and adaptive evolution.
Evolutionary Biology
What problem does this paper attempt to address?
The paper addresses how to estimate the ratio of nonsynonymous to synonymous substitution rates, \(dN/dS\), more accurately and how to test for natural selection when multiple sequence alignments (MSAs) contain residual errors. Specifically, the authors aim to reduce the impact of these errors on codon-based evolutionary analyses, especially on the detection of positive selection.

### Background problems

1. **Impact of alignment errors**: Errors in multiple sequence alignments (such as local misalignments) bias many comparative evolutionary methods, especially tests for natural selection. These errors can inflate the false-positive rate, that is, cause positive selection to be inferred where none exists.
2. **Limitations of existing methods**: Although methods for identifying and filtering alignment errors already exist, their use on large-scale genomic datasets is still limited, and they are not sensitive enough to certain types of errors.

### Goals of the paper

1. **Propose a new method**: The paper proposes BUSTED-E, a method that detects positive selection while concurrently identifying and handling errors in the alignment.
2. **Improve detection accuracy**: By introducing an "error-sink" component, BUSTED-E can identify the small fraction of the alignment that produces unrealistically high \(dN/dS\) values (for example, > 100) and mark it as error.
3. **Improve biological interpretation**: BUSTED-E substantially reduces unrealistic rates of positive selection detection and improves the biological interpretability of the results.

### Method innovation

1. **BUSTED-E model**: BUSTED-E extends the existing BUSTED model with the ability to identify and handle errors. It introduces an additional evolutionary rate class, \(\omega_E \geq 100\), whose weight is constrained to at most 1%, to capture spurious evolutionary signal generated by local alignment errors.
2. **Statistical performance**: On data simulated without errors, BUSTED-E shows a small loss of power, which can be mitigated by model-averaged inference.

### Experimental verification

1. **Simulated data**: Tests on simulated data show that BUSTED-E does not increase the false-positive rate in scenarios without positive selection, and that it effectively identifies and handles errors in scenarios simulated with an error class.
2. **Actual datasets**: In a re-analysis of four published large-scale datasets, BUSTED-E reduces unrealistic rates of positive selection detection, often by an order of magnitude, and improves the functional enrichment of the genes detected as positively selected.

### Conclusion

BUSTED-E is a robust and scalable solution that improves the accuracy of evolutionary analyses in the presence of residual alignment errors and contributes to a more nuanced understanding of natural selection and adaptive evolution.
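
To make the error-sink idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the per-class conditional likelihoods for each codon are already available (in practice they would come from a phylogenetic likelihood calculation), and it applies Bayes' rule to obtain the posterior probability that a codon belongs to the error class. The names `RateClass`, `check_error_sink`, `error_posterior`, and `mask_codons`, the 0.5 masking threshold, the `---` mask symbol, and all numeric values are hypothetical choices made for illustration. Only the constraints \(\omega_E \geq 100\) and weight at most 1% come from the summary above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RateClass:
    """One component of the dN/dS mixture: a rate (omega) and its weight."""
    omega: float
    weight: float


def check_error_sink(classes: Sequence[RateClass], error_index: int,
                     min_omega: float = 100.0, max_weight: float = 0.01) -> None:
    """Raise ValueError if the mixture violates the error-sink constraints."""
    if abs(sum(c.weight for c in classes) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    sink = classes[error_index]
    if sink.omega < min_omega or sink.weight > max_weight:
        raise ValueError("error sink needs omega >= 100 and weight <= 1%")


def error_posterior(cond_likelihoods: Sequence[float],
                    classes: Sequence[RateClass], error_index: int) -> float:
    """Bayes' rule: P(error class | data) for a single codon."""
    joint = [c.weight * lik for c, lik in zip(classes, cond_likelihoods)]
    return joint[error_index] / sum(joint)


def mask_codons(codons: Sequence[str], posteriors: Sequence[float],
                threshold: float = 0.5, mask: str = "---") -> List[str]:
    """Replace codons whose error posterior reaches the (assumed) threshold."""
    return [mask if p >= threshold else codon
            for codon, p in zip(codons, posteriors)]


# Hypothetical fitted mixture: purifying, neutral, and error-sink classes.
classes = [RateClass(0.1, 0.70), RateClass(1.0, 0.295), RateClass(150.0, 0.005)]
check_error_sink(classes, error_index=2)

# Hypothetical conditional likelihoods of three codons under each class.
likelihoods = [
    [1e-3, 1e-3, 1e-3],  # uninformative: posterior stays at the prior weight
    [1e-6, 1e-5, 1e-2],  # far better explained by the error class
    [1e-3, 1e-3, 2e-3],  # mild preference for the error class
]
posteriors = [error_posterior(lik, classes, error_index=2) for lik in likelihoods]
masked = mask_codons(["ATG", "CCC", "GGA"], posteriors)
print(masked)
```

Because the error class carries a prior weight of at most 1%, a codon is masked only when its likelihood under the error class exceeds the likelihood under the biological classes by a wide margin. In this toy input that holds only for the second codon.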