Abstract: Proteins play an essential role in various biological and engineering processes. Large protein language models (PLMs) present excellent potential to reshape protein research by accelerating the determination of protein function and the design of proteins with the desired functions. The prediction and design capacity of PLMs relies on the representation gained from the protein sequences. However, the lack of crucial 3D structure information in most PLMs restricts the prediction capacity of PLMs in various applications, especially those heavily dependent on 3D structures. To address this issue, we introduce S-PLM, a 3D structure-aware PLM that utilizes multi-view contrastive learning to align the sequence and 3D structure of a protein in a coordinated latent space. S-PLM applies Swin-Transformer on AlphaFold-predicted protein structures to embed the structural information and fuses it into sequence-based embedding from ESM2. Additionally, we provide a library of lightweight tuning tools to adapt S-PLM for diverse protein property prediction tasks. Our results demonstrate superior performance of S-PLM over sequence-only PLMs on all protein clustering and classification tasks, achieving competitiveness comparable to state-of-the-art methods requiring both sequence and structure inputs. S-PLM and its lightweight tuning tools are available at https://github.com/duolinwang/S-PLM/.

What problem does this paper attempt to address?

This paper asks how to improve protein language models (PLMs) by incorporating the 3D structure information of proteins. Current PLMs rely mainly on the amino acid sequence to predict protein function and to design proteins with specific functions, and the lack of 3D structure information limits their predictive ability in structure-dependent applications. To address this, the researchers propose S-PLM (Structure-aware Protein Language Model), which uses multi-view contrastive learning to align the sequence and the 3D structure of a protein in a coordinated latent space. S-PLM applies a Swin-Transformer to AlphaFold-predicted protein structures to embed the structural information, which is then fused with the sequence-based embeddings provided by ESM2. The paper also provides a lightweight tuning toolkit for adapting S-PLM to various protein property prediction tasks.

Experimental results show that S-PLM outperforms sequence-only PLMs on all protein clustering and classification tasks and is competitive with state-of-the-art methods that require both sequence and structure inputs. S-PLM achieves this gain over sequence-only models without requiring a structure prediction step, thereby avoiding the additional computational and time costs.

The paper also explores lightweight fine-tuning strategies for adapting the model to specific protein prediction tasks. These strategies mitigate the catastrophic forgetting and the computational resource requirements associated with full fine-tuning of large PLMs. With them, S-PLM achieves optimal or near-optimal performance in tasks such as gene ontology and protein secondary structure prediction. In summary, the paper aims to enhance the predictive ability of protein language models by leveraging 3D structure information through S-PLM, without sacrificing efficiency.
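To make the idea of aligning sequence and structure embeddings concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired embeddings. It is an illustration only, not S-PLM's actual objective: the summary above does not give the loss, so the function names, the temperature value, and the toy 2-dimensional embeddings are all assumptions. In practice the two inputs would be batches of sequence embeddings and structure embeddings for the same proteins.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(seq_emb, struct_emb, temperature=0.1):
    """Hypothetical alignment loss: row i of seq_emb and row i of struct_emb
    describe the same protein (positive pair); all other rows are negatives."""
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    # Similarity matrix: entry [i][j] compares sequence i with structure j.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in struct]
        for s in seq
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the sequence-to-structure and structure-to-sequence directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy data (assumed): two proteins with 2-dimensional embeddings.
seq_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # structure embeddings match their sequences
swapped = [[0.1, 0.9], [0.9, 0.1]]   # structure embeddings paired with the wrong sequence

loss_aligned = symmetric_contrastive_loss(seq_emb, aligned)
loss_swapped = symmetric_contrastive_loss(seq_emb, swapped)
print(loss_aligned < loss_swapped)
```

The loss is low when each sequence embedding is closest to its own structure embedding and high when the pairs are mismatched, which is the behavior a contrastive alignment objective of this kind relies on.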

S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

MULAN: Multimodal Protein Language Model for Sequence and Structure Encoding

Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs

Structure-Infused Protein Language Models

SaProt: Protein Language Modeling with Structure-aware Vocabulary

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

Simple, Efficient and Scalable Structure-aware Adapter Boosts Protein Language Models

Structure Language Models for Protein Conformation Generation

Bilingual Language Model for Protein Sequence and Structure

InterPLM: Discovering Interpretable Features in Protein Language Models via Sparse Autoencoders

THPLM: a sequence-based deep learning framework for protein stability changes prediction upon point variations using pretrained protein language model

A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

Exploring evolution-aware & -free protein language models as protein function predictors

Long-context Protein Language Model

PLM-interact: extending protein language models to predict protein-protein interactions

DPLM-2: A Multimodal Diffusion Protein Language Model

PLMC: Language Model of Protein Sequences Enhances Protein Crystallization Prediction

CPE-Pro: A Structure-Sensitive Deep Learning Model for Protein Representation and Origin Evaluation

Multimodal Protein-Ligand Contrastive Pretraining for Effective and Efficient Drug Discovery

CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation

ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing