Structure-Informed Protein Language Model

Zuobai Zhang, Jiarui Lu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang
2024-02-07
Abstract: Protein language models are a powerful tool for learning protein representations through pre-training on vast protein sequence datasets. However, traditional protein language models lack explicit structural supervision, despite its relevance to protein function. To address this issue, we introduce the integration of remote homology detection to distill structural information into protein language models without requiring explicit protein structures as input. We evaluate the impact of this structure-informed training on downstream protein function prediction tasks. Experimental results reveal consistent improvements in function annotation accuracy for EC number and GO term prediction. Performance on mutant datasets, however, varies based on the relationship between targeted properties and protein structures. This underscores the importance of considering this relationship when applying structure-aware training to protein function prediction tasks. Code and model weights are available at
Biomolecules, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to incorporate protein structure information into protein language models to improve function prediction. Traditional protein language models receive no explicit structural supervision during pre-training, even though structure is closely tied to protein function. The authors propose distilling structural information into the model through remote homology detection, a task that identifies proteins with similar structures but low sequence similarity, so that no explicit protein structures are required as input. Models trained this way (such as ESM-2-650M-S) achieve more accurate function annotation on downstream tasks, namely Enzyme Commission (EC) number and Gene Ontology (GO) term prediction. On mutant datasets, however, the gains vary with how closely the targeted property is related to protein structure. The paper therefore stresses that this relationship should be considered when applying structure-aware training to protein function prediction, and it highlights the need for further exploration of the role of structural information in protein language models to optimize protein representation learning.
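To make the idea of a structural training signal concrete, the sketch below treats remote homology detection as classification over structural fold classes: per-residue embeddings are averaged into one protein-level vector, a linear head scores each fold, and softmax cross-entropy against the annotated fold gives the loss. This formulation (mean pooling, a linear head, fold labels) is an assumption made for illustration and is not taken from the summary above; all function names, the toy embeddings, and the weights are hypothetical stand-ins for a real encoder's output.

```python
import math

def mean_pool(residue_embeddings):
    """Average per-residue vectors into a single protein-level vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]

def fold_logits(protein_vec, weights, biases):
    """Hypothetical linear head: one score per structural fold class."""
    return [
        sum(w * x for w, x in zip(row, protein_vec)) + b
        for row, b in zip(weights, biases)
    ]

def cross_entropy(logits, true_fold):
    """Numerically stable softmax cross-entropy for one labeled protein."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[true_fold]

# Toy protein: 3 residues with 2-dimensional embeddings
# (stand-in for language-model encoder output; values are made up).
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy head with 3 fold classes (assumed values, not learned).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

protein_vec = mean_pool(residues)
logits = fold_logits(protein_vec, weights, biases)
loss = cross_entropy(logits, true_fold=0)
print(f"fold-classification loss: {loss:.4f}")
```

In a real setup, minimizing this loss would update the encoder as well as the head, which is how structural knowledge could reach the sequence representations without any structure being supplied as input.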