Abstract: Descriptors play a pivotal role in enzyme design for the greener synthesis of biochemicals because they characterize enzymes and chemicals from physicochemical and evolutionary perspectives. This study examined how various descriptors affect the performance of a Random Forest model used to predict enzyme-chemical relationships. We curated activity data for seven specific enzyme families from the literature and developed a pipeline for evaluating machine learning model performance using 10-fold cross-validation. The influence of protein and chemical descriptors was assessed in three scenarios: predicting the activity of unknown relationships between known enzymes and known chemicals (new relationship evaluation), predicting the activity of novel enzymes on known chemicals (new enzyme evaluation), and predicting the activity of new chemicals on known enzymes (new chemical evaluation). In the new enzyme evaluation, protein descriptors significantly enhanced the classification performance of the model in three of the seven datasets, those with the greatest number of enzymes, whereas chemical descriptors appeared to have no effect. A variety of sequence-based and structure-based protein descriptors were constructed, among which the ESM-2 descriptor achieved the best results. When enzyme families were used as labels, the descriptors clustered proteins well, which may explain their contribution to the machine learning model. Conversely, in the new chemical evaluation, chemical descriptors yielded significant improvements in four of the seven datasets, while protein descriptors appeared to have no effect. We also attempted to evaluate the generalization ability of the model by correlating dataset statistics with model performance. The results showed that datasets with higher sequence similarity were more likely to achieve better results in the new enzyme evaluation, and that datasets with more enzymes were more likely to benefit from the protein descriptor strategy. This work provides guidance for the development of machine learning models for specific enzyme families.
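The three evaluation scenarios differ only in what is held out of training. The following is a minimal sketch of one way such splits could be built, not the pipeline used in the study: the function names `kfold_indices` and `scenario_splits`, and the toy identifiers `E0`, `C0`, and so on, are hypothetical. It assumes each data point is an (enzyme, chemical) pair and that whole enzymes or whole chemicals are withheld in the new enzyme and new chemical evaluations. Note that the simple pair-level split used here for the new relationship evaluation does not by itself guarantee that every test enzyme and chemical also appears in training.

```python
import random


def kfold_indices(n_items, k, seed=0):
    """Shuffle item positions and deal them into k roughly equal folds."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def scenario_splits(pairs, scenario, k=10, seed=0):
    """Yield (train_idx, test_idx) over a list of (enzyme, chemical) pairs.

    scenario = "relationship": individual pairs are held out.
    scenario = "enzyme":       all pairs of the held-out enzymes are held out.
    scenario = "chemical":     all pairs of the held-out chemicals are held out.
    """
    if scenario == "relationship":
        groups = list(range(len(pairs)))  # every pair is its own group
    elif scenario == "enzyme":
        groups = [enzyme for enzyme, _ in pairs]
    elif scenario == "chemical":
        groups = [chemical for _, chemical in pairs]
    else:
        raise ValueError(f"unknown scenario: {scenario}")

    unique_groups = list(dict.fromkeys(groups))  # de-duplicate, keep order
    if len(unique_groups) < k:
        raise ValueError("fewer groups than folds")

    for fold in kfold_indices(len(unique_groups), k, seed):
        held_out = {unique_groups[i] for i in fold}
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx


# Toy data: 5 hypothetical enzymes crossed with 4 hypothetical chemicals.
pairs = [(f"E{e}", f"C{c}") for e in range(5) for c in range(4)]
for train_idx, test_idx in scenario_splits(pairs, "enzyme", k=5):
    test_enzymes = {pairs[i][0] for i in test_idx}
    print(sorted(test_enzymes), len(train_idx), len(test_idx))
```

Grouping by enzyme or by chemical before forming folds is what keeps a held-out enzyme or chemical from appearing in the training data, which is the property the new enzyme and new chemical evaluations are meant to test.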

Descriptor-augmented machine learning for enzyme-chemical interaction predictions
