GOProteinGNN: Leveraging Protein Knowledge Graphs for Protein Representation Learning

Dan Kalifa, Uriel Singer, Kira Radinsky

2024-08-01

Abstract: Proteins play a vital role in biological processes and are indispensable for living organisms. Accurate representation of proteins is crucial, especially in drug development. Recently, there has been a notable increase in interest in utilizing machine learning and deep learning techniques for unsupervised learning of protein representations. However, these approaches often focus solely on the amino acid sequence of proteins and lack factual knowledge about proteins and their interactions, thus limiting their performance. In this study, we present GOProteinGNN, a novel architecture that enhances protein language models by integrating protein knowledge graph information during the creation of amino acid level representations. Our approach allows for the integration of information at both the individual amino acid level and the entire protein level, enabling a comprehensive and effective learning process through graph-based learning. By doing so, we can capture complex relationships and dependencies between proteins and their functional annotations, resulting in more robust and contextually enriched protein representations. Unlike previous fusion methods, GOProteinGNN uniquely learns the entire protein knowledge graph during training, which allows it to capture broader relational nuances and dependencies beyond mere triplets as done in previous work. We perform a comprehensive evaluation on several downstream tasks demonstrating that GOProteinGNN consistently outperforms previous methods, showcasing its effectiveness and establishing it as a state-of-the-art solution for protein representation learning.

Biomolecules, Machine Learning

What problem does this paper attempt to address?

The paper addresses a central problem in protein representation learning: how to integrate rich knowledge graph information into protein representations to improve their quality and biological relevance. It proposes a new architecture, GOProteinGNN, which targets the limitations of existing methods in three ways:

1. **Integrating information at both the amino acid level and the protein level**: Most existing protein representation learning methods either focus solely on amino acid sequences (e.g., ProtBert) or build protein-level representations directly while integrating external information (e.g., KeAP). GOProteinGNN handles both levels of information within a unified framework.
2. **Comprehensive utilization of protein knowledge graphs**: Many existing methods simplify protein knowledge graphs by treating them as a collection of isolated triplets, which can lose important relational details and contextual dependencies. GOProteinGNN uses the structure of the entire protein knowledge graph throughout pre-training, so that complex interactions are captured.
3. **Introducing a novel Graph Neural Network (GNN) knowledge injection layer**: To integrate protein knowledge graph information with protein language models, GOProteinGNN introduces a GNN Knowledge Injection (GKI) layer. This layer uses a Relational Graph Convolutional Network (RGCN) to propagate information from the knowledge graph and incorporates this information into protein representations via the [CLS] token.

With these contributions, GOProteinGNN outperforms previous methods on a range of downstream tasks, establishing it as a state-of-the-art approach to protein representation learning. Experimental results show significant performance improvements in tasks such as contact prediction, remote homology detection, and protein-protein interaction identification.
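The knowledge injection idea described above (RGCN message passing over the knowledge graph, with the result written back through the [CLS] token) can be illustrated with a small sketch. This is not the authors' implementation: the two-node graph, the `enables` relation, the weight matrices, and the helper names `matvec`, `rgcn_layer`, and `inject_cls` are all assumptions chosen for illustration, and the sketch assumes the protein node is initialised from the [CLS] state and that the updated node vector simply replaces it.

```python
# Minimal, dependency-free sketch of RGCN-style message passing followed by
# [CLS] injection. All names, sizes, and weights are illustrative assumptions.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rgcn_layer(h, edges, W_rel, W_self):
    """One relational graph convolution step.

    h      : dict mapping node name -> feature vector
    edges  : list of (source, relation, target) triples
    W_rel  : dict mapping relation name -> weight matrix
    W_self : weight matrix applied to each node's own features
    """
    out = {}
    for node, feat in h.items():
        total = matvec(W_self, feat)  # self-connection term
        for rel, W in W_rel.items():
            # Incoming neighbours of `node` under this relation type.
            sources = [s for (s, r, t) in edges if r == rel and t == node]
            for s in sources:
                msg = matvec(W, h[s])
                # Normalise by the number of neighbours for this relation.
                total = [a + m / len(sources) for a, m in zip(total, msg)]
        out[node] = [max(0.0, v) for v in total]  # ReLU
    return out

def inject_cls(token_states, protein_vec):
    """Return a copy of the token states with position 0 ([CLS]) replaced."""
    return [list(protein_vec)] + [list(t) for t in token_states[1:]]

# Toy example: a [CLS] state plus two amino acid states from a language model.
token_states = [[1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]

# Hypothetical graph: one protein node (initialised from [CLS]) and one
# Gene Ontology term node, linked by a single typed edge.
h = {"protein": token_states[0], "go_term": [0.0, 2.0]}
edges = [("go_term", "enables", "protein")]
W_self = [[1.0, 0.0], [0.0, 1.0]]
W_rel = {"enables": [[0.5, 0.0], [0.0, 0.5]]}

updated = rgcn_layer(h, edges, W_rel, W_self)
new_states = inject_cls(token_states, updated["protein"])
print(new_states[0])  # [1.0, 1.0]
```

In this toy setting the protein vector picks up the second feature dimension from the GO term, and the amino acid states are left untouched at the injection step. In a full model, later encoder layers would presumably let those amino acid tokens attend to the enriched [CLS] state, which is how graph-level knowledge could reach the amino acid level.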
