Abstract: Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models and tests. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Previous work established that failing to accommodate multinucleotide (or multihit, MH) substitutions strongly biases dN/dS-based inference towards false-positive inferences of episodic diversifying selection, as does failing to model variation in the rate of synonymous substitution (SRV) among sites. Here, we develop an integrated analytical framework and software tools to simultaneously incorporate these sources of evolutionary complexity into selection analyses. We found that both MH and SRV are ubiquitous in empirical alignments, and incorporating them has a strong effect on whether or not positive selection is detected (1.4-fold reduction) and on the distributions of inferred evolutionary rates. With simulation studies, we show that this effect is not attributable to reduced statistical power caused by using a more complex model. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we show that MH substitutions occurring along shorter branches in the tree explain a significant fraction of discrepant results in selection detection. Our results add to the growing body of literature that examines decades-old modeling assumptions (including MH) and finds them to be problematic for comparative genomic data analysis.
Because multinucleotide substitutions have a significant impact on the detection of natural selection even at the level of an entire gene, we recommend that selection analyses of this type include them as a matter of routine. To facilitate this, we developed, implemented, and benchmarked a simple, well-performing model-testing framework for selection detection that screens an alignment for positive selection while accounting for two biologically important confounding processes: site-to-site synonymous rate variation and instantaneous multinucleotide substitutions.
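
To make the term concrete, a multinucleotide (MH) change is one in which two or three positions of a codon differ between the source and target codons. The sketch below only counts those differences for a pair of codons. It is an illustration of the definition, not the substitution model or software described above, and the function names `nucleotide_differences` and `classify_change` are hypothetical.

```python
def nucleotide_differences(codon_a: str, codon_b: str) -> int:
    """Count the positions (0 to 3) at which two codons differ."""
    if len(codon_a) != 3 or len(codon_b) != 3:
        raise ValueError("codons must be exactly three nucleotides long")
    return sum(a != b for a, b in zip(codon_a.upper(), codon_b.upper()))


def classify_change(codon_a: str, codon_b: str) -> str:
    """Label a codon pair by how many nucleotides would have to change.

    Labels are illustrative only; a real codon model assigns rates to
    these classes rather than just naming them.
    """
    labels = {
        0: "identical",
        1: "single-nucleotide",
        2: "two-nucleotide (MH)",
        3: "three-nucleotide (MH)",
    }
    return labels[nucleotide_differences(codon_a, codon_b)]


if __name__ == "__main__":
    # AGC and TCC both encode serine, yet differ at two positions.
    print(classify_change("AGC", "TCC"))
    print(classify_change("ACG", "ACT"))
```

Models that allow only single-nucleotide instantaneous changes must explain a pair such as AGC and TCC through an intermediate codon, which is the kind of assumption the analyses above revisit.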

Influence of Multiple Sequence Alignment Depth on Potts Statistical Models of Protein Covariation

Selection of sequence motifs and generative Hopfield-Potts models for protein families

Neural Potts Model

Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment

Protein stability models fail to capture epistatic interactions of double point mutations

Protein Language Model Fitness Is a Matter of Preference

mmCSM-PPI: predicting the effects of multiple point mutations on protein–protein interactions

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction

Selective Constraints on Amino Acids Estimated by a Mechanistic Codon Substitution Model with Multiple Nucleotide Changes

How pairwise coevolutionary models capture the collective residue variability in proteins

Potts Hamiltonian Models and Molecular Dynamics Free Energy Simulations for Predicting the Impact of Mutations on Protein Kinase Stability

Exploring evolution to uncover insights into protein mutational stability

Quantification of the effect of mutations using a global probability model of natural sequence variation

Predicting Pathology of Missense Mutations through Protein-Specific Evolutionary Pattern

Direct Coupling Analysis of Epistasis in Allosteric Materials

Evolutionary Shortcuts via Multinucleotide Substitutions and Their Impact on Natural Selection Analyses

A new comparative framework for estimating selection on synonymous substitutions

Deep generative models of genetic variation capture mutation effects

Learning protein fitness models from evolutionary and assay-labeled data

Structure-informed protein language models are robust predictors for variant effects