FusionESP: Improved enzyme-substrate pair prediction by fusing protein and chemical knowledge

Zhenjiao Du, Weiming Fu, Xiaolong Guo, Doina Caragea, Yonghui Li
DOI: https://doi.org/10.1101/2024.08.13.607829
2024-10-14
Abstract: Machine learning prediction models offer an alternative to costly experimental characterization of the potential substrates of enzymes. Pretrained language models, as powerful approaches for protein and molecule representation, have been employed in the development of enzyme-substrate prediction models, achieving promising performance. In addition to continuing improvements in language models, effectively fusing encoders to handle multimodal prediction tasks is critical for further enhancing model performance using available representation methods. Here, we present FusionESP, a multimodal architecture that integrates protein and chemistry language models with a newly designed contrastive learning strategy for predicting enzyme-substrate pairs. Our best model achieved state-of-the-art performance, with an accuracy of 94.77% on independent test data, and exhibited better generalization capacity while requiring fewer computational resources and less training data than previous studies that fine-tuned an encoder or employed more encoders. The results also confirmed our hypothesis that embeddings of positive pairs are closer to each other in high-dimensional space, while negative pairs exhibit the opposite trend. The proposed architecture is expected to be applied to further multimodal prediction tasks in biology. A user-friendly web server for FusionESP has been established and is freely accessible at https://rqkjkgpsyu.us-east-1.awsapprunner.com/.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the cost of experimentally characterizing enzyme-substrate pairs by developing an improved machine learning prediction model. It starts from the observation that pretrained language models represent proteins and molecules well, but that effectively fusing such encoders for multimodal prediction tasks remains a key challenge. To address this, it introduces FusionESP, a multimodal architecture that integrates protein and chemistry language models with a newly designed contrastive learning strategy. The strategy brings the embeddings of positive enzyme-substrate pairs closer together in high-dimensional space while pushing those of negative pairs further apart, which is reported to improve both accuracy and generalization while requiring fewer computational resources and less training data than earlier approaches. The authors expect the architecture to carry over to other multimodal prediction tasks in biology. The study also provides a user-friendly web server so that researchers can run FusionESP enzyme-substrate pair predictions conveniently.
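The contrastive idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `margin` and `threshold` parameters, and the toy vectors are all assumptions made for illustration, and the loss shown is a generic cosine-embedding style loss, not the paper's exact formulation. It assumes that the enzyme and substrate have already been mapped to vectors of equal length by some upstream encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_contrastive_loss(enzyme_vec, substrate_vec, label, margin=0.0):
    """Hypothetical per-pair loss (an assumption, not the paper's formula).

    label == 1: positive pair, loss shrinks as similarity approaches 1.
    label == 0: negative pair, loss is zero once similarity <= margin.
    """
    sim = cosine_similarity(enzyme_vec, substrate_vec)
    if label == 1:
        return 1.0 - sim
    return max(0.0, sim - margin)


def predict_pair(enzyme_vec, substrate_vec, threshold=0.5):
    """Call a pair positive when its similarity reaches an assumed threshold."""
    return cosine_similarity(enzyme_vec, substrate_vec) >= threshold


# Toy embeddings (made-up values, not real model outputs).
enzyme = [1.0, 0.0]
aligned_substrate = [2.0, 0.0]     # points the same way as the enzyme
orthogonal_substrate = [0.0, 3.0]  # points in an unrelated direction

print(cosine_contrastive_loss(enzyme, aligned_substrate, label=1))
print(cosine_contrastive_loss(enzyme, aligned_substrate, label=0))
print(predict_pair(enzyme, orthogonal_substrate))
```

Minimizing a loss of this shape rewards positive pairs for aligning and penalizes negative pairs only while they remain more similar than the margin, which matches the stated hypothesis that positive pairs sit closer together in embedding space than negative ones.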