Abstract: The Gene Ontology (GO) is a formal, axiomatic theory with over 100,000 axioms that describe the molecular functions, biological processes and cellular locations of proteins in three subontologies. Predicting the functions of proteins using the GO requires both learning and reasoning capabilities in order to maintain consistency and exploit the background knowledge in the GO. Many methods have been developed to automatically predict protein functions, but effectively exploiting all the axioms in the GO for knowledge-enhanced learning has remained a challenge. We have developed DeepGO-SE, a method that predicts GO functions from protein sequences using a pretrained large language model. DeepGO-SE generates multiple approximate models of GO, and a neural network predicts the truth values of statements about protein functions in these approximate models. We aggregate the truth values over multiple models so that DeepGO-SE approximates semantic entailment when predicting protein functions. We show, using several benchmarks, that the approach effectively exploits background knowledge in the GO and improves protein function prediction compared to state-of-the-art methods.

What problem does this paper attempt to address?

The paper attempts to address several key issues in protein function prediction:

1. **Challenges in Protein Function Prediction**: Despite the increasing accuracy of protein structure prediction in recent years, protein function prediction remains challenging. This is mainly because the number of known protein functions is relatively small, and these functions are complex and interact in diverse ways.
2. **Limitations of Existing Methods**: Many existing protein function prediction methods rely on sequence similarity to predict functions. This approach works well for proteins that are highly similar to known functional domains but is less reliable for proteins with little or no sequence similarity. Additionally, existing methods often fail to fully utilize all the axioms in the Gene Ontology (GO) to enhance knowledge-driven learning.
3. **Prediction of Complex Biological Processes and Cellular Components**: Predicting the biological processes and cellular components that proteins participate in requires considering the presence and interactions of multiple proteins, rather than just the sequence or structural information of a single protein. Therefore, existing methods perform poorly in predicting these complex annotations.

To address these issues, the authors developed a new method called DeepGO-SE, which combines protein sequence features generated by pre-trained large language models (such as ESM2), background knowledge from GO, and protein-protein interaction (PPI) information to predict protein functions through approximate semantic entailment. Specifically, DeepGO-SE achieves knowledge-enhanced learning through the following steps:

1. **Generating Approximate Models**: Based on background knowledge from GO (i.e., axioms) and assertions about proteins (e.g., "protein has function C"), an approximate model is generated.
2. **Representing Proteins**: Proteins are represented using ESM2 embeddings and treated as instances in the approximate model, maximizing the truth of statements like "protein has function C" within the model.
3. **Multiple Model Generation**: The above process is repeated to generate multiple approximate models, and the truth values of statements in these models are aggregated to perform approximate semantic entailment.

Through this method, DeepGO-SE effectively utilizes background knowledge from GO to improve the accuracy of protein function prediction, particularly excelling in predicting complex biological processes and cellular component annotations. Experimental results show that DeepGO-SE significantly outperforms existing state-of-the-art methods in multiple benchmark tests.
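The aggregation step described above can be illustrated with a minimal sketch. It assumes each approximate model has already produced a truth value in [0, 1] for every statement "protein has function C", and it combines them with a minimum, since a statement is semantically entailed only if it holds in every model. The function name `aggregate_entailment`, the choice of minimum as the aggregation, and the example scores are assumptions made for illustration and are not taken from the DeepGO-SE code.

```python
from typing import Dict, List

# Truth values in [0, 1] for "protein has function C", keyed by GO term.
# One dictionary per approximate model; the numbers are invented.
Scores = Dict[str, float]


def aggregate_entailment(per_model_scores: List[Scores]) -> Scores:
    """Combine truth values across approximate models with a minimum.

    A statement is entailed only if it is true in all models, so the
    minimum over a finite set of models is one pessimistic approximation.
    The aggregation function is an assumption of this sketch.
    """
    if not per_model_scores:
        raise ValueError("need at least one model")
    # Only keep terms that every model has scored.
    terms = set(per_model_scores[0])
    for scores in per_model_scores[1:]:
        terms &= set(scores)
    return {t: min(s[t] for s in per_model_scores) for t in sorted(terms)}


# Hypothetical outputs of three approximate models for a single protein.
models = [
    {"GO:0003824": 0.91, "GO:0005515": 0.40},
    {"GO:0003824": 0.85, "GO:0005515": 0.72},
    {"GO:0003824": 0.88, "GO:0005515": 0.15},
]
aggregated = aggregate_entailment(models)
print(aggregated)
```

A term that scores high in every model keeps a high aggregated value, while a term that any single model rejects is pulled down. Swapping the minimum for a mean would give a softer, less entailment-like aggregation.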

Protein function prediction as approximate semantic entailment
