Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering

Shujian Jiao, Bingxuan Li, Lei Wang, Xiaojin Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei
2024-04-24
Abstract: Proteins are essential to life's processes, underpinning evolution and diversity. Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development. Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy. Yet it falls short in delivering functional protein insights, signaling an opportunity for enhancing representation quality. Our study addresses this gap by incorporating protein family classification into ESM2's training. This approach, augmented with a Community Propagation-Based Clustering Algorithm, improves global protein representations, while a contextual prediction task fine-tunes local amino acid accuracy. Significantly, our model achieved state-of-the-art results in several downstream experiments, demonstrating the power of combining global and local methodologies to substantially boost protein representation quality.
Biomolecules, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address the shortcomings of existing protein language models (such as ESM2) in protein function prediction. Specifically, while ESM2 has high biochemical accuracy in amino acid representation, it has limitations in providing insights into protein functions. Therefore, this paper proposes an improved method by integrating protein family classification into the training process of ESM2 and enhancing global protein representation with a community propagation clustering algorithm, while optimizing local amino acid accuracy through contextual prediction tasks.

The main contributions are as follows:

1. **Fusion of Graph Pre-training and Masked Language Modeling**: By introducing graph enhancement techniques, the ESM2 model is improved, achieving performance that surpasses ESM2 in protein-related tasks.
2. **Community Propagation Clustering Algorithm**: A novel and resource-efficient graph neural network training method is proposed for the pre-training tasks of protein sequences.
3. **Detailed Demonstration of the Role of Asynchronous Information Propagation Algorithm**: The role of the asynchronous information propagation algorithm in graph networks for protein sequence pre-training tasks is demonstrated.

Through these improvements, the model achieves state-of-the-art results in multiple downstream experiments, showcasing the powerful capability of combining global and local methods to significantly enhance the quality of protein representation.
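
To make the idea of combining a global (protein family) objective with a local (masked amino acid) objective more concrete, here is a minimal sketch of one way such a joint loss could be computed. It is not the authors' implementation: the weighted-sum form, the `family_weight` parameter, and the function names `cross_entropy` and `joint_loss` are all assumptions introduced for illustration. The abstract does not say how the two objectives are actually combined, and the graph and clustering components are left out entirely.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one softmax prediction against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def joint_loss(token_logits, token_targets, masked_positions,
               family_logits, family_target, family_weight=1.0):
    """Hypothetical joint objective: masked-token loss plus weighted family loss.

    token_logits     -- one list of per-residue logits over the amino acid vocabulary
    token_targets    -- true amino acid index at every position
    masked_positions -- positions that were masked and must be predicted
    family_logits    -- sequence-level logits over protein families
    family_target    -- true protein family index
    family_weight    -- assumed trade-off between the two terms (not from the paper)
    """
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    # Local term: average cross-entropy over the masked residues only.
    mlm = sum(cross_entropy(token_logits[i], token_targets[i])
              for i in masked_positions) / len(masked_positions)
    # Global term: a single classification loss for the whole sequence.
    fam = cross_entropy(family_logits, family_target)
    return mlm + family_weight * fam, mlm, fam


# Toy example: a 5-residue sequence, a 4-letter vocabulary, and 2 families.
# All logits are zero, so every prediction is uniform.
token_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
token_targets = [0, 1, 2, 3, 0]
masked_positions = [1, 3]
family_logits = [0.0, 0.0]

total, mlm, fam = joint_loss(token_logits, token_targets, masked_positions,
                             family_logits, family_target=1, family_weight=0.5)
print(f"masked-token loss={mlm:.4f}  family loss={fam:.4f}  total={total:.4f}")
```

With uniform predictions, the masked-token term equals the log of the vocabulary size and the family term equals the log of the number of families, which gives a quick sanity check. In a real model both sets of logits would come from a shared encoder, so gradients from the family term would also shape the per-residue representations.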