Abstract: Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models and tests. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Previous work established that failing to accommodate multinucleotide (or multihit, MH) substitutions strongly biases dN/dS-based inference towards false-positive inferences of diversifying episodic selection, as does failing to model variation in the rate of synonymous substitution (SRV) among sites. Here, we develop an integrated analytical framework and software tools to simultaneously incorporate these sources of evolutionary complexity into selection analyses. We found that both MH and SRV are ubiquitous in empirical alignments, and incorporating them has a strong effect on whether or not positive selection is detected (1.4-fold reduction) and on the distributions of inferred evolutionary rates. With simulation studies, we show that this effect is not attributable to reduced statistical power caused by using a more complex model. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we show that MH substitutions occurring along shorter branches in the tree explain a significant fraction of discrepant results in selection detection. Our results add to the growing body of literature that examines decades-old modeling assumptions (including MH) and finds them to be problematic for comparative genomic data analysis.
Because multinucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that selection analyses of this type consider their inclusion as a matter of routine. To facilitate this procedure, we developed, implemented, and benchmarked a simple, well-performing model-testing framework for selection detection that can screen an alignment for positive selection while accounting for two biologically important confounding processes: site-to-site synonymous rate variation and multinucleotide instantaneous substitutions.


Evolutionary Shortcuts via Multinucleotide Substitutions and Their Impact on Natural Selection Analyses
