Abstract: Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models and tests. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Previous work established that failing to accommodate multinucleotide (or multihit, MH) substitutions strongly biases dN/dS-based inference towards false-positive inferences of episodic diversifying selection, as does failing to model variation in the rate of synonymous substitution (SRV) among sites. Here, we develop an integrated analytical framework and software tools to simultaneously incorporate these sources of evolutionary complexity into selection analyses. We found that both MH and SRV are ubiquitous in empirical alignments, and incorporating them has a strong effect on whether positive selection is detected (a 1.4-fold reduction) and on the distributions of inferred evolutionary rates. With simulation studies, we show that this effect is not attributable to reduced statistical power caused by using a more complex model. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we show that MH substitutions occurring along shorter branches in the tree explain a significant fraction of discrepant results in selection detection. Our results add to the growing body of literature that examines decades-old modeling assumptions (including the absence of MH substitutions) and finds them to be problematic for comparative genomic data analysis.
Because multinucleotide substitutions have a significant impact on the detection of natural selection, even at the level of an entire gene, we recommend that selection analyses of this type include them as a matter of routine. To facilitate this, we developed, implemented, and benchmarked a simple, well-performing framework for model testing and selection detection. It screens an alignment for positive selection while accounting for two biologically important confounding processes: site-to-site synonymous rate variation and instantaneous multinucleotide substitutions.
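To make the notion of a multinucleotide (multihit) substitution concrete, the following minimal Python sketch classifies a change between two codons by how many nucleotide positions differ and by whether the encoded amino acid changes. It is only an illustration of the terminology, not the framework described above: the function name `classify_change` and the use of the standard genetic code are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption of this
# sketch; other genetic codes would need a different amino-acid string).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_change(codon_a: str, codon_b: str) -> tuple[int, str]:
    """Hypothetical helper: return (number of differing positions, change type).

    A change with 2 or 3 differing positions is a multinucleotide (multihit)
    change; standard codon models that only allow single-nucleotide
    instantaneous substitutions give such direct changes a rate of zero.
    """
    codon_a, codon_b = codon_a.upper(), codon_b.upper()
    hits = sum(x != y for x, y in zip(codon_a, codon_b))
    if hits == 0:
        return 0, "identical"
    same_aa = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    return hits, "synonymous" if same_aa else "nonsynonymous"


if __name__ == "__main__":
    for a, b in [("ACG", "ACA"), ("AAA", "GGA"), ("AGC", "TCG")]:
        hits, kind = classify_change(a, b)
        label = "multihit" if hits > 1 else "single-hit"
        print(f"{a} -> {b}: {hits} difference(s), {kind}, {label}")
```

The serine pair `AGC`/`TCG` is a useful edge case: the two codons differ at all three positions yet encode the same amino acid, so the change is both synonymous and multihit.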
