Protein function prediction as approximate semantic entailment

Maxat Kulmanov, Francisco J. Guzmán-Vega, Paula Duek Roggli, Lydie Lane, Stefan T. Arold, Robert Hoehndorf
DOI: https://doi.org/10.1038/s42256-024-00795-w
IF: 23.8
2024-02-15
Nature Machine Intelligence
Abstract: The Gene Ontology (GO) is a formal, axiomatic theory with over 100,000 axioms that describe the molecular functions, biological processes and cellular locations of proteins in three subontologies. Predicting the functions of proteins using the GO requires both learning and reasoning capabilities in order to maintain consistency and exploit the background knowledge in the GO. Many methods have been developed to automatically predict protein functions, but effectively exploiting all the axioms in the GO for knowledge-enhanced learning has remained a challenge. We have developed DeepGO-SE, a method that predicts GO functions from protein sequences using a pretrained large language model. DeepGO-SE generates multiple approximate models of GO, and a neural network predicts the truth values of statements about protein functions in these approximate models. We aggregate the truth values over multiple models so that DeepGO-SE approximates semantic entailment when predicting protein functions. We show, using several benchmarks, that the approach effectively exploits background knowledge in the GO and improves protein function prediction compared to state-of-the-art methods.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
The paper addresses three issues in protein function prediction:

1. **Challenges in Protein Function Prediction**: Although protein structure prediction has become much more accurate in recent years, protein function prediction remains challenging. The main reasons are that relatively few protein functions are known, and that these functions are complex and interact in diverse ways.
2. **Limitations of Existing Methods**: Many existing methods rely on sequence similarity to predict functions. This works well for proteins that are highly similar to known functional domains, but is less reliable for proteins with little or no sequence similarity. Existing methods also often fail to use all the axioms in the Gene Ontology (GO) for knowledge-enhanced learning.
3. **Prediction of Complex Biological Processes and Cellular Components**: Predicting the biological processes and cellular components a protein participates in requires considering the presence and interactions of multiple proteins, not just the sequence or structure of a single protein. Existing methods therefore perform poorly on these more complex annotations.

To address these issues, the authors developed DeepGO-SE, which combines protein sequence features from a pretrained large language model (ESM2), background knowledge from GO, and protein-protein interaction (PPI) information, and predicts protein functions through approximate semantic entailment. DeepGO-SE achieves knowledge-enhanced learning in three steps:

1. **Generating Approximate Models**: An approximate model is generated from the background knowledge in GO (its axioms) and from assertions about proteins (e.g., "protein has function C").
2. **Representing Proteins**: Proteins are represented by ESM2 embeddings and treated as instances in the approximate model; training maximizes the truth of statements such as "protein has function C" within the model.
3. **Multiple Model Generation**: The process is repeated to generate multiple approximate models, and the truth values of statements across these models are aggregated to perform approximate semantic entailment.

In this way, DeepGO-SE uses background knowledge from GO to improve the accuracy of protein function prediction, with the largest gains on complex biological process and cellular component annotations. The experimental results show that DeepGO-SE outperforms existing state-of-the-art methods on several benchmarks.
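The aggregation over multiple approximate models can be illustrated with a minimal sketch. Semantic entailment requires a statement to hold in every model, so with graded truth values one natural choice is to take the minimum across models. The summary above only says the values are "aggregated", so the use of the minimum, the function name `aggregate_truth_values`, and the scores below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def aggregate_truth_values(per_model_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Approximate entailment for one protein (hypothetical helper).

    Each dict maps a GO class ID to the truth value that one approximate
    model assigns to the statement "protein has function <GO class>".
    Assumption: a statement is only as true as its weakest model,
    so the aggregate is the minimum over all models.
    """
    if not per_model_scores:
        raise ValueError("at least one approximate model is required")

    # Keep only the GO classes that every model scored.
    shared = set(per_model_scores[0])
    for scores in per_model_scores[1:]:
        shared &= set(scores)

    return {
        go_id: min(scores[go_id] for scores in per_model_scores)
        for go_id in sorted(shared)
    }


# Illustrative scores from three hypothetical approximate models.
per_model_scores = [
    {"GO:0005515": 0.90, "GO:0003824": 0.2},
    {"GO:0005515": 0.80, "GO:0003824": 0.6},
    {"GO:0005515": 0.95, "GO:0003824": 0.1},
]

entailed = aggregate_truth_values(per_model_scores)
print(entailed)
```

In this example the statement for GO:0005515 keeps a high aggregate value because all three models agree on it, while the statement for GO:0003824 is pulled down to its lowest score because one model assigns it a low truth value. Other aggregations, such as the mean, would be less strict than entailment in the logical sense.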