VIPER: A General Model for Prediction of Enzyme Substrates

Max James Campbell
DOI: https://doi.org/10.1101/2024.06.21.599972
2024-06-28
Abstract: Enzymes, nature's catalysts, possess remarkable properties such as high stereo-, regio-, and chemo-specificity. These properties allow enzymes to greatly simplify complex synthetic processes, resulting in improved yields and reduced manufacturing costs compared to traditional chemical methods. However, the lack of experimental characterization of enzyme substrates, with only a few thousand out of tens of millions of known enzymes in UniProt having annotated substrates, severely limits the ability of chemists to repurpose enzymes for industrial applications. Previous machine learning models aimed at predicting enzyme substrates have been hampered by poor generalization to new substrates. Here, we introduce VIPER (Virtual Interaction Predictor for Enzyme Reactivity), a model that achieves an average 34% improvement over the previous state-of-the-art model (ProSmith) in reaction prediction for unseen substrates. Furthermore, we present a novel benchmarking methodology for assessing the out-of-distribution generalization capabilities of enzyme-substrate prediction models. VIPER represents a significant advance towards the in silico prediction of enzyme-substrate compatibility, paving the way for the discovery of novel biocatalytic routes for the sustainable synthesis of high-value chemicals.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses two main issues in enzyme-substrate prediction:

1. **Poor generalization of existing models to new substrates**: Existing models (such as ESP, ProSmith, and Ridge Regression) perform poorly on substrates that were not seen during training.
2. **Lack of high-quality datasets**: Currently available enzyme-substrate datasets contain a large number of misannotations, which leads to suboptimal model training.

To address these issues, the paper proposes VIPER (Virtual Interaction Predictor for Enzyme Reactivity), a machine learning model designed to generalize better to unseen substrates, together with improved data preprocessing to raise data quality. Compared to the previous best model, ProSmith, VIPER improves reaction prediction performance on unseen substrates by an average of 34%. The paper also proposes a new benchmarking methodology for measuring how well enzyme-substrate prediction models generalize to out-of-distribution data.
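Evaluating generalization to unseen substrates requires that no test substrate also appears in the training data. The following is a minimal sketch of such a substrate-held-out split, assuming each record is an (enzyme, substrate, label) tuple. The function name `split_by_substrate`, the default test fraction, and the toy records are illustrative assumptions, not the paper's actual benchmarking procedure.

```python
import random


def split_by_substrate(pairs, test_fraction=0.2, seed=0):
    """Split (enzyme, substrate, label) records by substrate.

    Every record of a given substrate lands on the same side of the split,
    so the test set contains only substrates absent from training.
    Hypothetical helper for illustration; not taken from the VIPER paper.
    """
    # Collect the unique substrates in a deterministic order before shuffling.
    substrates = sorted({substrate for _, substrate, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(substrates)

    # Hold out a fraction of the substrates (at least one) for testing.
    n_test = max(1, round(len(substrates) * test_fraction))
    test_substrates = set(substrates[:n_test])

    train = [p for p in pairs if p[1] not in test_substrates]
    test = [p for p in pairs if p[1] in test_substrates]
    return train, test


if __name__ == "__main__":
    # Toy records: enzyme IDs and substrate SMILES strings are made up.
    records = [
        ("enzA", "CCO", 1),
        ("enzB", "CCO", 0),
        ("enzA", "CC(=O)O", 1),
        ("enzC", "c1ccccc1", 0),
        ("enzB", "CCN", 1),
        ("enzC", "CCCC", 0),
    ]
    train, test = split_by_substrate(records, test_fraction=0.4, seed=1)
    print("train substrates:", sorted({s for _, s, _ in train}))
    print("test substrates:", sorted({s for _, s, _ in test}))
```

A random split over records, by contrast, could place the same substrate in both training and test sets, which would overstate performance on new substrates.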