CLEAN-Contact: Contrastive Learning-enabled Enzyme Functional Annotation Prediction with Structural Inference

Yuxin Yang, Abby Jerger, Song Feng, Zixu Wang, Christina Brasfield, Margaret S Cheung, Jeremy Zucker, Qiang Guan
DOI: https://doi.org/10.1101/2024.05.14.594148
2024-10-08
Abstract: Recent years have witnessed the remarkable progress of deep learning within the realm of scientific disciplines, yielding a wealth of promising outcomes. A prominent challenge within this domain has been the task of predicting enzyme function, a complex problem that has seen the development of numerous computational methods, particularly those rooted in deep learning techniques. However, the majority of these methods have primarily focused on either amino acid sequence data or protein structure data, neglecting the potential synergy of combining both modalities. To address this gap, we propose a novel Contrastive Learning framework for Enzyme functional ANnotation prediction combined with protein amino acid sequences and Contact maps (CLEAN-Contact). We rigorously evaluated the performance of our CLEAN-Contact framework against the state-of-the-art enzyme function prediction model using multiple benchmark datasets. Using CLEAN-Contact, we predicted novel enzyme functions within the proteome of Prochlorococcus marinus MED4. Our findings convincingly demonstrate the substantial superiority of our CLEAN-Contact framework, marking a significant step forward in enzyme function prediction accuracy.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses enzyme function prediction. Functional annotation of enzymes is crucial for understanding the mechanisms of biological processes, and an enzyme's function is usually represented by its Enzyme Commission (EC) number. Traditional prediction methods rely mainly on sequence similarity analysis, using tools such as BLASTP and HH-suite, but these methods have limitations. With the rise of deep learning, and in particular the recent advances in protein structure prediction, researchers have begun to explore how these techniques can improve enzyme function prediction. However, most existing deep-learning-based methods focus on either amino acid sequence data or protein structure data and ignore the potential synergy of combining the two. To close this gap, the paper proposes a new framework named CLEAN-Contact, which combines contrastive learning with protein contact maps and aims to improve the accuracy of enzyme function prediction by integrating amino acid sequence and protein structure information.

### Main contributions:

1. **Combining sequence and structure information**: The CLEAN-Contact framework uses both amino acid sequence data and protein contact maps, and combines the two modalities through contrastive learning to improve enzyme function prediction performance.
2. **Contrastive learning**: Through contrastive learning, the framework minimizes the embedding distance between enzymes with the same EC number while maximizing the embedding distance between enzymes with different EC numbers.
3. **Performance evaluation**: The researchers evaluated CLEAN-Contact on multiple benchmark datasets and compared it with the current state-of-the-art enzyme function prediction models. The results show that CLEAN-Contact has significant advantages on multiple metrics, including precision, recall, F1-score, and AUC.
4. **Practical application**: The researchers used the CLEAN-Contact framework to predict the functions of unannotated enzymes in the proteome of the cyanobacterium Prochlorococcus marinus MED4, and discovered many new enzyme functions with high confidence.

### Conclusion:

The CLEAN-Contact framework significantly improves the accuracy of enzyme function prediction by combining amino acid sequence and protein structure information. The method performs well in benchmark tests and also shows potential in practical applications, especially in discovering the functions of unannotated enzymes. Future research could extend the framework to a wider range of protein function annotations, such as Gene Ontology (GO) terms and FunCat categories.
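
To make the contrastive learning contribution above more concrete, the following is a minimal sketch of a triplet margin loss, one common way to express "pull same-EC embeddings together, push different-EC embeddings apart". It is an illustration under stated assumptions, not the paper's implementation: the summary does not say which contrastive loss CLEAN-Contact uses, and the function names, the margin value, and the 2-D toy embeddings are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the negative (different EC number) is at least
    `margin` farther from the anchor than the positive (same EC number).
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D embeddings; real embeddings would be high-dimensional.
anchor = [0.0, 0.0]    # enzyme annotated with some EC number
positive = [0.0, 1.0]  # another enzyme with the same EC number
negative = [3.0, 4.0]  # enzyme with a different EC number

# Same-EC enzyme is close and different-EC enzyme is far: no penalty.
well_separated = triplet_margin_loss(anchor, positive, negative)

# Roles swapped: the "same-EC" enzyme is far and the "different-EC"
# enzyme is close, so the loss is large.
poorly_separated = triplet_margin_loss(anchor, negative, positive)

print(well_separated, poorly_separated)
```

Minimizing a loss of this kind over many (anchor, positive, negative) triplets pushes the embedding space toward the arrangement described in contribution 2, where enzymes sharing an EC number cluster together.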