Endowing Protein Language Models with Structural Knowledge

Dexiong Chen, Philip Hartout, Paolo Pellizzoni, Carlos Oliver, Karsten Borgwardt
2024-01-26
Abstract: Understanding the relationships between protein sequence, structure and function is a long-standing biological challenge with manifold implications from drug design to our understanding of evolution. Recently, protein language models have emerged as the preferred method for this challenge, thanks to their ability to harness large sequence databases. Yet, their reliance on expansive sequence data and parameter sets limits their flexibility and practicality in real-world scenarios. Concurrently, the recent surge in computationally predicted protein structures unlocks new opportunities in protein representation learning. While promising, the computational burden carried by such complex data still hinders widely-adopted practical applications. To address these limitations, we introduce a novel framework that enhances protein language models by integrating protein structural data. Drawing from recent advances in graph transformers, our approach refines the self-attention mechanisms of pretrained language transformers by integrating structural information with structure extractor modules. This refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database, using the same masked language modeling objective as traditional protein language models. Empirical evaluations of PST demonstrate its superior parameter efficiency relative to protein language models, despite being pretrained on a dataset comprising only 542K structures. Notably, PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction. Our findings underscore the potential of integrating structural information into protein language models, paving the way for more effective and efficient protein modeling. Code and pretrained models are available at
Quantitative Methods, Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper addresses the long-standing challenge of understanding the relationships among protein sequence, structure, and function. Protein language models (PLMs) have made remarkable progress by leveraging large-scale sequence databases, but their reliance on large amounts of sequence data and large parameter sets limits their flexibility and practicality in real-world scenarios. Meanwhile, the number of computationally predicted protein structures has surged in recent years, opening up new opportunities for protein representation learning, yet the computational burden of such complex data still hinders wide application. The paper therefore proposes a framework that enhances protein language models by integrating protein structure data, aiming to improve their parameter efficiency and their performance in practical applications.

### Main Contributions of the Paper

1. **Innovation in Methodology**: The paper introduces a new model, the Protein Structure Transformer (PST). PST refines the self-attention mechanism of a pre-trained language transformer by integrating structure extraction modules, thereby fusing structural information into the model.
2. **Performance Improvement**: Experiments show that PST performs strongly on multiple function and structure prediction tasks. In particular, it outperforms the existing state-of-the-art model, ESM-2, even though it is pre-trained on a dataset of only 542,000 structures.
3. **Parameter Efficiency**: PST not only outperforms ESM-2 in accuracy but is also more parameter-efficient, meaning it achieves better performance at a smaller model size.
4. **Wide Applicability**: The protein representations learned by PST can be applied to different downstream tasks without task-specific fine-tuning, which further improves the practicality and adaptability of the model.

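
The first contribution above, a structure extraction module feeding the self-attention of a pre-trained transformer, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes that the extractor is a single GIN-style layer over a residue contact graph and that its output is simply added to the residue embeddings before the query, key, and value projections. The function names, the `params` dictionary, and the toy chain-shaped contact graph are all hypothetical.

```python
import numpy as np


def gin_layer(h, adj, w, eps=0.0):
    """One GIN-style update: each residue is combined with the sum of its neighbours.

    A single linear map followed by ReLU stands in for the MLP of a full GIN layer.
    """
    aggregated = (1.0 + eps) * h + adj @ h
    return np.maximum(aggregated @ w, 0.0)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def structure_aware_attention(h, adj, params):
    """Single-head self-attention whose inputs are enriched with structural features.

    Assumption: the extractor output is added to the embeddings before projection.
    """
    structural = gin_layer(h, adj, params["w_gnn"])  # local structure features
    h_struct = h + structural
    q = h_struct @ params["w_q"]
    k = h_struct @ params["w_k"]
    v = h_struct @ params["w_v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn


# Toy input: 5 residues with 8-dimensional embeddings (sizes chosen for illustration).
rng = np.random.default_rng(0)
n_res, d = 5, 8
h = rng.normal(size=(n_res, d))

# Hypothetical contact graph: residue i is in contact with residue i + 1.
adj = np.zeros((n_res, n_res))
for i in range(n_res - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

params = {
    name: rng.normal(scale=0.1, size=(d, d))
    for name in ("w_gnn", "w_q", "w_k", "w_v")
}
out, attn = structure_aware_attention(h, adj, params)
```

In this sketch, setting `w_gnn` to zeros makes the extractor output vanish, so the layer reduces to plain self-attention over the sequence embeddings; the structural signal enters only through the extractor.
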
### Key Technical Points

- **Structure Extraction Module**: PST adds a structure extraction module to each self-attention layer so that the model can capture local structural features. This module can be a shallow graph neural network (GNN), such as GIN.
- **Pre-training Strategy**: PST is pre-trained with the same masked language modeling objective as traditional protein language models, but it updates only the parameters of the structure extractors and keeps the backbone transformer frozen to improve parameter efficiency.
- **Experimental Verification**: The paper evaluates PST on multiple benchmarks (such as Gene Ontology classification, enzyme classification, and protein family classification) and compares it with existing models, on which it reports superior results.

### Conclusion

The paper demonstrates the potential of integrating structural information into protein language models, providing a new method for understanding and predicting protein function more effectively. PST not only outperforms existing state-of-the-art models but also shows clear advantages in parameter efficiency and applicability across tasks.
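
The pre-training strategy described under Key Technical Points (a masked language modeling objective with a frozen backbone) can be illustrated with a small pure-Python sketch. It is not the authors' training code. The 15% masking rate, the `<mask>` token string, the parameter names, and the plain SGD update are all assumptions made for illustration, and parameters are represented as single floats to keep the example short.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues; return the masked tokens and the targets.

    The 15% rate is an assumed value, not one taken from the summary above.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(seq)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # residue the model must recover
        tokens[pos] = MASK
    return tokens, targets


def trainable_parameters(param_names, prefix="structure_extractor."):
    """Select only the structure-extractor parameters; the backbone stays frozen."""
    return [name for name in param_names if name.startswith(prefix)]


def sgd_step(params, grads, trainable, lr=0.1):
    """Apply a gradient step to trainable parameters and leave the rest untouched."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }


# Hypothetical parameter names, with scalar values standing in for weight tensors.
params = {
    "backbone.layer0.w_q": 1.0,
    "backbone.layer0.w_k": -0.5,
    "structure_extractor.layer0.w_gnn": 0.2,
}
grads = {name: 1.0 for name in params}

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
trainable = trainable_parameters(params)
new_params = sgd_step(params, grads, trainable, lr=0.1)
```

Only the entry whose name starts with `structure_extractor.` changes after the step, which mirrors the idea that the pre-trained backbone weights are reused as they are while the added modules learn from structure data.
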