Enzyme structure correlates with variant effect predictability

Floris Julian van der Flier, Dave Estell, Sina Pricelius, Lydia Dankmeyer, Sander van Stigt Thans, Harm Mulder, Rei Otsuka, Frits Goedegebuur, Laurens Lammerts, Diego Staphorst, Aalt D.J. van Dijk, Dick de Ridder, Henning Redestig

DOI: https://doi.org/10.1101/2023.09.25.559319

2024-06-12

Abstract: Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. In order to establish whether structural characteristics influence predictability, we created a combinatorial variant dataset for an enzyme that can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested: buriedness, number of contact residues, proximity to the active site and presence of secondary structure elements. These same dependencies were found in various single-mutation enzyme variant datasets, with effect directions being specific to the assay. Most importantly, we found that these dependencies are highly alike for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by popular existing approaches. Overall, our findings suggest that significant improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning guided protein engineering.

Bioinformatics

What problem does this paper attempt to address?

This paper investigates how enzyme structure relates to the predictability of variant effects. The authors created a combinatorial variant dataset for an enzyme that can be partitioned into subsets whose mutations fall exclusively at positions of a single structural class. The classes are defined by four characteristics: buriedness, number of contact residues, proximity to the active site, and presence of secondary structure elements. They trained four different variant effect prediction (VEP) models on these structurally partitioned subsets and found that predictive performance depends strongly on all four characteristics. The paper notes that although machine learning models have proven effective in protein engineering, their prediction accuracy on prospective screening data leaves room for improvement. The study found that variants with mutations at buried positions, at positions with many contact residues, near the active site, or within secondary structure elements are more challenging to predict. The same dependencies appear in several single-mutation enzyme variant datasets, with the direction of each effect depending on the assay. Importantly, the dependencies are highly consistent across all four models, indicating that popular existing approaches do not sufficiently account for certain structural and functional determinants. The paper suggests that VEP models can be improved significantly by exploring new inductive biases and by leveraging different data modalities of protein variants. It also argues that stratified dataset design can highlight areas of improvement for machine learning guided protein engineering. In conclusion, the paper shows that enzyme structural characteristics strongly influence how well variant effects can be predicted, and it proposes strategies for improving the models and making protein engineering more efficient.
