CatPred: A comprehensive framework for deep learning in vitro enzyme kinetic parameters \( k_{\text{cat}} \), \( K_m \) and \( K_i \)

Veda Sheersh Boorla, Costas D. Maranas
DOI: https://doi.org/10.1101/2024.03.10.584340
2024-03-26
Abstract: Quantification of enzymatic activities still relies heavily on experimental assays, which can be expensive and time-consuming. Therefore, methods that enable accurate predictions of enzyme activity can serve as effective digital twins. A few recent studies have shown the possibility of training machine learning (ML) models to predict enzyme turnover numbers (\( k_{\text{cat}} \)) and Michaelis constants (\( K_m \)) using only features derived from enzyme sequences and substrate chemical topologies, by training on in vitro measurements. However, several challenges remain, such as the lack of standardized training datasets, evaluation of predictive performance on out-of-distribution examples, and model uncertainty quantification. Here, we introduce CatPred, a comprehensive framework for ML prediction of in vitro enzyme kinetics. We explored different learning architectures and feature representations for enzymes, including those utilizing pretrained protein language model features and pretrained three-dimensional structural features. We systematically evaluate the performance of trained models for predicting \( k_{\text{cat}} \), \( K_m \), and inhibition constants (\( K_i \)) of enzymatic reactions on held-out test sets, with a special emphasis on out-of-distribution test samples (corresponding to enzyme sequences dissimilar from those encountered during training). CatPred adopts a probabilistic regression approach, offering query-specific mean and standard deviation predictions. Results on unseen data confirm that the accuracy of enzyme parameter predictions made by CatPred correlates positively with lower predicted variances. Incorporating pretrained language model features is found to enable robust performance on out-of-distribution samples. Test evaluations on both held-out and out-of-distribution test datasets confirm that CatPred performs at least competitively with existing methods while simultaneously offering robust uncertainty quantification.
CatPred offers wider scope and larger data coverage (∼23k, 41k, and 12k data points for \( k_{\text{cat}} \), \( K_m \), and \( K_i \), respectively). A web-resource to use the trained models is made available at:
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the quantitative prediction of enzyme activity, with the goal of reducing reliance on expensive and time-consuming experimental assays. Specifically, it introduces CatPred, a comprehensive framework for machine learning (ML) prediction of in vitro enzyme kinetic parameters: the turnover number \( k_{\text{cat}} \), the Michaelis constant \( K_m \), and the inhibition constant \( K_i \). CatPred uses pretrained protein language model features and three-dimensional structural features, is evaluated on both held-out and out-of-distribution test samples (enzyme sequences dissimilar from those seen during training), and quantifies the uncertainty of its predictions. The paper highlights several challenges for current methods: the lack of standardized training datasets, the evaluation of predictive performance on out-of-distribution samples, and the quantification of model uncertainty. To address these, the authors constructed CatPred-DB, a large-scale dataset of enzyme kinetic parameter measurements drawn from the BRENDA and SABIO-RK databases. CatPred also employs a probabilistic regression approach that outputs a standard deviation alongside each predicted mean, so the reliability of an individual prediction can be assessed. Overall, the study aims to predict enzyme kinetic parameters efficiently in order to accelerate enzyme engineering, metabolic pathway design, and metabolic model parameterization.
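To make the probabilistic regression idea above concrete, the following minimal Python sketch scores a predicted mean and standard deviation against a measured value using a Gaussian negative log-likelihood. It is an illustration only, not CatPred's implementation: the Gaussian form of the likelihood, the use of log10-scaled targets, and the helper names `gaussian_nll`, `mean_nll`, and `interval_95` are all assumptions introduced here.

```python
import math


def gaussian_nll(y_true, mean, std):
    """Negative log-likelihood of y_true under a normal N(mean, std^2).

    Assumption (not from the paper): targets are log10-transformed kinetic
    values, and the model emits one mean and one standard deviation per query.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    var = std * std
    return 0.5 * (math.log(2.0 * math.pi * var) + (y_true - mean) ** 2 / var)


def mean_nll(targets, means, stds):
    """Average Gaussian NLL over a batch of (target, mean, std) triples."""
    losses = [gaussian_nll(y, m, s) for y, m, s in zip(targets, means, stds)]
    return sum(losses) / len(losses)


def interval_95(mean, std):
    """Approximate 95% interval (mean +/- 1.96 * std) under the Gaussian assumption."""
    half_width = 1.96 * std
    return mean - half_width, mean + half_width


# Hypothetical example: the measured value is 2.0 (log10 units), the model predicts 0.0.
confident_but_wrong = gaussian_nll(2.0, mean=0.0, std=0.1)
uncertain_and_wrong = gaussian_nll(2.0, mean=0.0, std=2.0)
```

Under this loss, a prediction that is wrong but reports a small standard deviation is penalized far more heavily than an equally wrong prediction that reports a large one. This is one way a model can be pushed toward predicted variances that track its actual errors.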