Enzyme structure correlates with variant effect predictability
Floris Julian van der Flier,Dave Estell,Sina Pricelius,Lydia Dankmeyer,Sander van Stigt Thans,Harm Mulder,Rei Otsuka,Frits Goedegebuur,Laurens Lammerts,Diego Staphorst,Aalt D.J. van Dijk,Dick de Ridder,Henning Redestig
DOI: https://doi.org/10.1101/2023.09.25.559319
2024-06-12
Abstract:Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. In order to establish whether structural characteristics influence predictability, we created a combinatorial variant dataset for an enzyme, that can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested; buriedness, number of contact residues, proximity to the active site and presence of secondary structure elements. These same dependencies were found in various single mutation enzyme variant datasets, with effect directions being specific to the assay. Most importantly, we found that these dependencies are highly alike for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by popular existing approaches. Overall, our findings suggest that significant improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning guided protein engineering.
Bioinformatics