CLEAN-Contact: Contrastive Learning-enabled Enzyme Functional Annotation Prediction with Structural Inference

Yuxin Yang,Abby Jerger,Song Feng,Zixu Wang,Christina Brasfield,Margaret S Cheung,Jeremy Zucker,Qiang Guan

DOI: https://doi.org/10.1101/2024.05.14.594148

2024-10-08

Abstract: Recent years have witnessed remarkable progress of deep learning across scientific disciplines, yielding a wealth of promising outcomes. A prominent challenge within this domain has been the task of predicting enzyme function, a complex problem that has seen the development of numerous computational methods, particularly those rooted in deep learning techniques. However, the majority of these methods have primarily focused on either amino acid sequence data or protein structure data, neglecting the potential synergy of combining both modalities. To address this gap, we propose a novel Contrastive Learning framework for Enzyme functional ANnotation prediction combined with protein amino acid sequences and Contact maps (CLEAN-Contact). We rigorously evaluated the performance of our CLEAN-Contact framework against the state-of-the-art enzyme function prediction model using multiple benchmark datasets. Using CLEAN-Contact, we predicted novel enzyme functions within the proteome of Prochlorococcus marinus MED4. Our findings convincingly demonstrate the substantial superiority of our CLEAN-Contact framework, marking a significant step forward in enzyme function prediction accuracy.

Bioinformatics

What problem does this paper attempt to address?

This paper addresses enzyme function prediction. Functional annotation of enzymes is crucial for understanding the mechanisms of biological processes, and an enzyme's function is usually represented by its Enzyme Commission (EC) number. Traditional prediction methods rely mainly on sequence similarity analysis, using tools such as BLASTP and HH-suite, but these methods have limitations. In recent years, with the development of deep learning, and in particular with progress on the protein structure prediction problem, researchers have begun to explore how these techniques can improve enzyme function prediction. However, most existing deep-learning-based methods focus on either amino acid sequence data or protein structure data, ignoring the potential synergy of combining the two.

To close this gap, the paper proposes a new framework named CLEAN-Contact, which combines contrastive learning with protein contact maps and aims to improve the accuracy of enzyme function prediction by integrating amino acid sequence and protein structure information.

### Main contributions:

1. **Combining sequence and structure information**: The CLEAN-Contact framework uses both amino acid sequence data and protein contact maps, and combines the two modalities through contrastive learning, thereby improving enzyme function prediction performance.
2. **Contrastive learning**: Through contrastive learning, the framework minimizes the embedding distance between enzymes with the same EC number while maximizing the embedding distance between enzymes with different EC numbers.
3. **Performance evaluation**: The researchers evaluated CLEAN-Contact on multiple benchmark datasets and compared it with the current state-of-the-art enzyme function prediction models. The results show that CLEAN-Contact exhibits significant advantages in multiple metrics such as precision, recall, F1-score, and AUC.
4. **Practical application**: The researchers used the CLEAN-Contact framework to predict the functions of unannotated enzymes in the proteome of the cyanobacterium Prochlorococcus marinus MED4, and discovered many new enzyme functions with high confidence.

### Conclusion:

The CLEAN-Contact framework significantly improves the accuracy of enzyme function prediction by combining amino acid sequence and protein structure information. The method not only performs well in benchmark tests but also shows its potential in practical applications, especially in discovering the functions of unannotated enzymes. Future research could extend the framework to a wider range of protein function annotations, such as Gene Ontology (GO) terms and FunCat categories.
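
The contrastive objective described under "Contrastive learning" above (pull enzymes with the same EC number together in embedding space, push enzymes with different EC numbers apart) can be illustrated with a triplet margin loss. This is a minimal sketch only: the summary does not say which contrastive loss CLEAN-Contact actually uses, and the function names, the margin value, and the toy 2-D embeddings below are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Illustrative triplet margin loss (an assumed stand-in, not the paper's loss).

    anchor, positive: embeddings of two enzymes sharing an EC number.
    negative: embedding of an enzyme with a different EC number.
    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical toy embeddings (real embeddings would be high-dimensional).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # same EC number as the anchor
negative = [3.0, 4.0]   # different EC number

# Well-separated case: the positive is close, the negative is far.
loss_separated = triplet_margin_loss(anchor, positive, negative)

# Violating case: roles swapped, so the "same EC" enzyme is the far one.
loss_violating = triplet_margin_loss(anchor, negative, positive)

print(loss_separated, loss_violating)
```

Minimizing such a loss over many triplets drives same-EC embeddings closer and different-EC embeddings farther apart, which is the behavior the summary attributes to the framework.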

Enzyme function prediction using contrastive learning

Predicting Enzyme Functions Using Contrastive Learning with Hierarchical Enzyme Structure Information

Parallel convolutional contrastive learning method for enzyme function prediction

BioStructNet: Structure-Based Network with Transfer Learning for Predicting Biocatalyst Functions

GELKcat: An Integration Learning of Substrate Graph with Enzyme Embedding for Kcat prediction.

DEEPre: sequence-based enzyme EC number prediction by deep learning.

An Improved Prediction Of Catalytic Residues In Enzyme Structures

FEDKEA: Enzyme function prediction with a large pretrained protein language model and distance-weighted k-nearest neighbor

Enzyme Activity Prediction of Sequence Variants on Novel Substrates using Improved Substrate Encodings and Convolutional Pooling

Leveraging conformal prediction to annotate enzyme function space with limited false positives

Recursive Cleaning for Large-scale Protein Data via Multimodal Learning

Autoregressive Enzyme Function Prediction with Multi-scale Multi-modality Fusion

DeepEnzyme: a robust deep learning model for improved enzyme turnover number prediction by utilizing features of protein 3D-structures

Protein Functional Annotation of Simultaneously Improved Stability, Accuracy and False Discovery Rate Achieved by a Sequence-Based Deep Learning

ConPep: Prediction of peptide contact maps with pre-trained biological language model and multi-view feature extracting strategy

Mldeepre: Multi-Functional Enzyme Function Prediction with Hierarchical Multi-Label Deep Learning

ECRECer: Enzyme Commission Number Recommendation and Benchmarking based on Multiagent Dual-core Learning

Adapt-Kcr: a Novel Deep Learning Framework for Accurate Prediction of Lysine Crotonylation Sites Based on Learning Embedding Features and Attention Architecture.

Evidential deep learning for trustworthy prediction of enzyme commission number

Enzyme Commission Number Prediction and Benchmarking with Hierarchical Dual-core Multitask Learning Framework