S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

Duolin Wang, Mahdi Pourmirzaei, Usman L Abbas, Shuai Zeng, Negin Manshour, Farzaneh Esmaili, Biplab Poudel, Yuexu Jiang, Qing Shao, Jin Chen, Dong Xu
DOI: https://doi.org/10.1101/2023.08.06.552203
2024-05-13
Abstract: Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) present excellent potential to reshape protein research by accelerating the determination of protein function and the design of proteins with the desired functions. The prediction and design capacity of PLMs relies on the representation gained from the protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts the prediction capacity of PLMs in various applications, especially those heavily dependent on 3D structures. To address this issue, we introduce S-PLM, a 3D structure-aware PLM that utilizes multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into sequence-based embedding from ESM2. Additionally, we provide a library of lightweight tuning tools to adapt S-PLM for diverse protein property prediction tasks. Our results demonstrate superior performance of S-PLM over sequence-only PLMs on all protein clustering and classification tasks, achieving competitiveness comparable to state-of-the-art methods requiring both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses how to improve protein language models (PLMs) by incorporating the 3D structure information of proteins. Current PLMs rely mainly on the amino acid sequence to predict protein function and to design proteins with specific functions, and the lack of 3D structure information limits their predictive ability in structure-dependent applications. To address this problem, the researchers propose S-PLM (Structure-aware Protein Language Model), which uses multi-view contrastive learning to align the sequence and structure of a protein in a coordinated latent space. S-PLM applies a Swin-Transformer to protein structures predicted by AlphaFold to embed the structural information, which is then fused with the sequence-based embeddings provided by ESM2. Experimental results show that S-PLM outperforms sequence-only PLMs on all protein clustering and classification tasks and is competitive with state-of-the-art methods that require both sequence and structure inputs. It achieves these gains over sequence-only models without requiring a structure prediction step, thereby avoiding the additional computational and time costs. The paper also provides a lightweight tuning toolkit for adapting S-PLM to specific protein property prediction tasks, which mitigates the catastrophic forgetting and computational resource demands associated with full fine-tuning of large PLMs. With these strategies, S-PLM achieves the best or near-best performance on tasks such as gene ontology and protein secondary structure prediction. In summary, the paper aims to enhance the predictive ability of protein language models by leveraging 3D structure information through S-PLM, without sacrificing efficiency.
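To make the idea of aligning sequence and structure in a shared latent space more concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive (InfoNCE) objective. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact loss, the function names are hypothetical, the temperature of 0.07 is an arbitrary choice, and the toy 2-dimensional vectors stand in for the ESM2 sequence embeddings and Swin-Transformer structure embeddings. Row i of each batch is assumed to describe the same protein.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(seq_emb, struct_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    seq_emb[i] and struct_emb[i] are assumed to come from the same protein;
    every other pairing in the batch is treated as a negative.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    # Cosine similarity between every sequence and every structure embedding.
    logits = [[dot(s, t) / temperature for t in struct] for s in seq]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the sequence-to-structure and structure-to-sequence directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_transposed))


if __name__ == "__main__":
    # Toy embeddings (hypothetical values, not real model outputs).
    seq_batch = [[1.0, 0.0], [0.0, 1.0]]
    struct_matched = [[0.9, 0.1], [0.1, 0.9]]
    struct_shuffled = [[0.1, 0.9], [0.9, 0.1]]
    print("matched pairs: ", contrastive_loss(seq_batch, struct_matched))
    print("shuffled pairs:", contrastive_loss(seq_batch, struct_shuffled))
```

In this toy setup the loss is lower when each sequence embedding sits close to its own structure embedding than when the pairs are shuffled, which is the behavior such an objective rewards during training.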