GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning

Dan Kalifa, Uriel Singer, Kira Radinsky
2024-08-01
Abstract: Proteins play a vital role in biological processes and are indispensable for living organisms. Accurate representation of proteins is crucial, especially in drug development. Recently, there has been a notable increase in interest in utilizing machine learning and deep learning techniques for unsupervised learning of protein representations. However, these approaches often focus solely on the amino acid sequence of proteins and lack factual knowledge about proteins and their interactions, thus limiting their performance. In this study, we present GOProteinGNN, a novel architecture that enhances protein language models by integrating protein knowledge graph information during the creation of amino acid level representations. Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process through graph-based learning. By doing so, we can capture complex relationships and dependencies between proteins and their functional annotations, resulting in more robust and contextually enriched protein representations. Unlike previous fusion methods, GOProteinGNN uniquely learns the entire protein knowledge graph during training, which allows it to capture broader relational nuances and dependencies beyond mere triplets as done in previous work. We perform a comprehensive evaluation on several downstream tasks demonstrating that GOProteinGNN consistently outperforms previous methods, showcasing its effectiveness and establishing it as a state-of-the-art solution for protein representation learning.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
The paper addresses a central question in protein representation learning: how to integrate rich knowledge graph information into protein representations to improve their quality and biological relevance. It proposes a new architecture, GOProteinGNN, which addresses the limitations of existing methods in three ways:

1. **Integrating information at both the amino acid level and the protein level**: Most existing protein representation learning methods either focus solely on amino acid sequences (e.g., ProtBert) or build protein-level representations directly while integrating external information (e.g., KeAP). GOProteinGNN handles both levels within a unified framework.
2. **Comprehensive utilization of protein knowledge graphs**: Many existing methods reduce the protein knowledge graph to isolated triplets, which can lose relational details and contextual dependencies. GOProteinGNN learns over the structure of the entire protein knowledge graph throughout pre-training, so that it can capture more complex interactions.
3. **Introducing a novel Graph Neural Network (GNN) knowledge injection layer**: To combine protein knowledge graph information with a protein language model, GOProteinGNN introduces a GNN Knowledge Injection (GKI) layer. This layer uses a Relational Graph Convolutional Network (RGCN) to propagate information from the knowledge graph and incorporates it into the protein representation via the [CLS] token.

With these contributions, GOProteinGNN outperforms previous methods on several downstream tasks and is presented as a state-of-the-art solution for protein representation learning. Experimental results show significant performance improvements in tasks such as contact prediction, remote homology detection, and protein-protein interaction identification.
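The knowledge injection idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes the standard RGCN update rule (a self-loop term plus per-relation messages averaged over incoming neighbours) and assumes that the protein's [CLS] vector serves as its graph node feature and is overwritten with the updated node vector. All function names, relation names, weights, and feature values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Minimal sketch: one RGCN message-passing step followed by [CLS] injection.
# Every name, shape, and value here is hypothetical, not taken from the paper.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def vec_add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]

def rgcn_layer(node_feats, edges, W_rel, W_self):
    """h_i' = ReLU(W_self h_i + sum_r sum_{j in N_r(i)} W_r h_j / |N_r(i)|).

    node_feats: dict mapping node name -> feature vector
    edges: list of (source, relation, target); messages flow source -> target
    W_rel: dict mapping relation name -> weight matrix
    W_self: self-loop weight matrix
    """
    # Group incoming neighbours by (target, relation) for normalisation.
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)

    updated = {}
    for node, h in node_feats.items():
        out = matvec(W_self, h)
        for (dst, rel), sources in incoming.items():
            if dst != node:
                continue
            for src in sources:
                msg = matvec(W_rel[rel], node_feats[src])
                out = vec_add(out, [m / len(sources) for m in msg])
        updated[node] = relu(out)
    return updated

def inject_into_cls(token_states, protein_node_vec):
    """Return a copy of the token states with position 0 ([CLS]) replaced."""
    return [list(protein_node_vec)] + [list(t) for t in token_states[1:]]

# Toy protein: position 0 is [CLS], the rest are amino acid token states.
token_states = [[1.0, 0.0], [0.2, 0.3], [0.4, 0.1]]

# Toy knowledge graph: one protein node and two GO term nodes.
node_feats = {"P1": token_states[0], "GO_a": [0.0, 2.0], "GO_b": [2.0, 0.0]}
edges = [
    ("GO_a", "annotates", "P1"),
    ("GO_b", "annotates", "P1"),
    ("GO_b", "is_a", "GO_a"),
]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_rel = {"annotates": [[0.5, 0.0], [0.0, 0.5]], "is_a": identity}

updated = rgcn_layer(node_feats, edges, W_rel, identity)
new_states = inject_into_cls(token_states, updated["P1"])
print(new_states[0])  # [1.5, 0.5]
```

Because every edge in the graph takes part in the update, the `is_a` edge between the two GO terms also changes the `GO_a` representation, which is the kind of context a triplet-only objective would not propagate. Stacking further layers would let that information reach the protein node. The amino acid token states are left untouched here; in a full model, later Transformer layers would presumably spread the injected [CLS] information to them through attention.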