Abstract: There is strong interest in accurate methods for predicting changes in protein stability resulting from amino acid mutations to the protein sequence. Recombinant proteins must often be stabilized to be used as therapeutics or reagents, and destabilizing mutations are implicated in a variety of diseases. Due to increased data availability and improved modeling techniques, recent studies have shown advancements in predicting changes in protein stability when a single point mutation is made. Less focus has been directed toward predicting changes in protein stability when there are two or more mutations, despite the significance of mutation clusters for disease pathways and protein design studies. Here, we analyze the largest available dataset of double point mutation stability and benchmark several widely used protein stability models on this and other datasets. We identify a blind spot in how predictors are typically evaluated on multiple mutations, finding that, contrary to assumptions in the field, current stability models are unable to consistently capture epistatic interactions between double mutations. We observe one notable deviation from this trend, which is that epistasis-aware models provide marginally better predictions on stabilizing double point mutations. We develop an extension of the ThermoMPNN framework for double mutant modeling as well as a novel data augmentation scheme which mitigates some of the limitations in available datasets. Collectively, our findings indicate that current protein stability models fail to capture the nuanced epistatic interactions between concurrent mutations due to several factors, including training dataset limitations and insufficient model sensitivity.

Significance: Protein stability is governed in part by epistatic interactions between energetically coupled residues. Prediction of these couplings represents the next frontier in protein stability modeling. In this work, we benchmark protein stability models on a large dataset of double point mutations and identify previously overlooked limitations in model design and evaluation. We also introduce several new strategies to improve modeling of epistatic couplings between protein point mutations.

Protein stability models fail to capture epistatic interactions of double point mutations
