FusionESP: Improved enzyme-substrate pair prediction by fusing protein and chemical knowledge

Zhenjiao Du, Weiming Fu, Xiaolong Guo, Doina Caragea, Yonghui Li

DOI: https://doi.org/10.1101/2024.08.13.607829

2024-10-14

Abstract: To reduce the cost of experimentally characterizing the potential substrates of enzymes, machine learning prediction models offer an alternative solution. Pretrained language models, as powerful approaches for protein and molecule representation, have been employed in the development of enzyme-substrate prediction models, achieving promising performance. In addition to continuing improvements in language models, effectively fusing encoders to handle multimodal prediction tasks is critical for further enhancing model performance using available representation methods. Here, we present FusionESP, a multimodal architecture that integrates protein and chemistry language models with a newly designed contrastive learning strategy for predicting enzyme-substrate pairs. Our best model achieved state-of-the-art performance with an accuracy of 94.77% on independent test data and exhibited better generalization capacity while requiring fewer computational resources and less training data than previous studies that fine-tuned the encoder or employed more encoders. The results also confirmed our hypothesis that embeddings of positive pairs are closer to each other in high-dimensional space, while negative pairs exhibit the opposite trend. The proposed architecture is expected to be applied to enhance performance in additional multimodal prediction tasks in biology. A user-friendly web server for FusionESP has been established and is freely accessible at https://rqkjkgpsyu.us-east-1.awsapprunner.com/.

Bioinformatics

What problem does this paper attempt to address?

This paper addresses the cost of experimentally characterizing enzyme-substrate pairs by developing an improved machine learning prediction model. Specifically, it introduces FusionESP, a multimodal architecture that integrates protein and chemistry language models and uses a newly designed contrastive learning strategy to predict enzyme-substrate pairs. The approach aims to improve model performance while requiring fewer computational resources and less training data, and the authors expect the architecture to carry over to other multimodal prediction tasks in biology. The paper points out that although pretrained language models perform well at representing proteins and molecules, effectively fusing these encoders for multimodal prediction tasks remains a key challenge. The contrastive learning strategy brings the embeddings of positive enzyme-substrate pairs closer together in high-dimensional space while pushing the embeddings of negative pairs further apart, which improves the model's generalization ability and accuracy. The study also establishes a user-friendly web server so that researchers can conveniently use FusionESP for enzyme-substrate pair prediction.
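The idea of pulling positive pairs together and pushing negative pairs apart can be illustrated with a small sketch. This is not the authors' implementation: the exact FusionESP loss is not given in this summary, so the code below uses the standard cosine embedding loss as a stand-in. The function names, the `margin` and `threshold` values, and the toy 2-dimensional vectors are all assumptions made for illustration. It also assumes the enzyme and substrate embeddings have already been projected into a shared space of equal dimension.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_pair_loss(enzyme_emb, substrate_emb, is_positive, margin=0.0):
    """Cosine embedding loss for one enzyme-substrate pair (illustrative stand-in).

    Positive pairs are penalized for low similarity (loss = 1 - cos).
    Negative pairs are penalized only when similarity exceeds the margin.
    """
    sim = cosine_similarity(enzyme_emb, substrate_emb)
    if is_positive:
        return 1.0 - sim
    return max(0.0, sim - margin)


def predict_pair(enzyme_emb, substrate_emb, threshold=0.5):
    """Hypothetical decision rule: call a pair positive if similarity is high enough."""
    return cosine_similarity(enzyme_emb, substrate_emb) >= threshold


# Toy embeddings (made up; real embeddings would come from the language models).
enzyme = [1.0, 0.0]
true_substrate = [0.9, 0.1]
non_substrate = [-1.0, 0.2]

print(contrastive_pair_loss(enzyme, true_substrate, is_positive=True))
print(contrastive_pair_loss(enzyme, non_substrate, is_positive=False))
print(predict_pair(enzyme, true_substrate), predict_pair(enzyme, non_substrate))
```

Minimizing a loss of this shape over many labeled pairs would drive positive pairs toward high cosine similarity and negative pairs toward low similarity, which is the geometric behavior the summary describes.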
