Abstract: Protein language models (PLMs) have shown remarkable capabilities in various protein function prediction tasks. However, while protein function is intricately tied to structure, most existing PLMs do not incorporate protein structure information. To address this issue, we introduce ProSST, a Transformer-based protein language model that seamlessly integrates both protein sequences and structures. ProSST incorporates a structure quantization module and a Transformer architecture with disentangled attention. The structure quantization module translates a 3D protein structure into a sequence of discrete tokens by first serializing the protein structure into residue-level local structures and then embedding them into a dense vector space. These vectors are then quantized into discrete structure tokens by a pre-trained clustering model. These tokens serve as an effective protein structure representation. Furthermore, ProSST explicitly learns the relationship between protein residue token sequences and structure token sequences through sequence-structure disentangled attention. We pre-train ProSST on millions of protein structures using a masked language model objective, enabling it to learn comprehensive contextual representations of proteins. To evaluate ProSST, we conduct extensive experiments on zero-shot mutation effect prediction and several supervised downstream tasks, where ProSST achieves state-of-the-art performance among all baselines. Our code and pretrained models are publicly available.
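The quantization step described in the abstract can be illustrated with a minimal sketch. It assumes the pre-trained clustering model exposes a fixed set of centroids and that each residue-level embedding is assigned to its nearest centroid, as a k-means model would do. The abstract does not name the clustering algorithm, so this choice, the function name `quantize_structure`, and the toy codebook are all hypothetical.

```python
import numpy as np

def quantize_structure(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each residue-level embedding (L, d) to the index of its nearest centroid (K, d).

    Assumption: nearest-centroid assignment under squared Euclidean distance.
    """
    # Pairwise differences between every embedding and every centroid: (L, K, d)
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    # Squared distances: (L, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # One discrete structure token per residue: (L,)
    return np.argmin(dists, axis=1)

# Toy codebook (hypothetical): 8 structure tokens in a 4-dimensional space,
# with centroid k located at (k, k, k, k).
centroids = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Four residue embeddings, each a slightly shifted copy of a known centroid.
embeddings = centroids[[3, 0, 7, 3]] + 0.1

tokens = quantize_structure(embeddings, centroids)
print(tokens.tolist())  # [3, 0, 7, 3]
```

The output is a sequence of integer structure tokens, one per residue, which can then be fed to the Transformer alongside the residue tokens.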

What problem does this paper attempt to address?

The paper aims to address the issue that Protein Language Models (PLMs) fail to effectively utilize protein structural information when predicting protein functions. Specifically, although the function of a protein is closely related to its structure, most existing protein language models primarily focus on modeling protein sequences while neglecting the importance of structural information.

To solve this problem, the paper proposes a new model called ProSST (Protein Sequence-Structure Transformer). ProSST is a protein language model based on the Transformer architecture that can seamlessly integrate protein sequence and structural information. To achieve this, ProSST employs the following key methods:

1. **Structure Quantization Module**: This module converts the 3D protein structure into a series of discrete structural tokens. It first extracts residue-level local structural features by serializing the protein structure and embedding them into a dense vector space; then, it quantizes these vectors using a pre-trained clustering model to obtain effective protein structural representations.
2. **Sequence-Structure Disentangled Attention**: ProSST introduces a new attention mechanism in the Transformer model, namely sequence-structure disentangled attention, to explicitly learn the relationship between the protein residue sequence and the structural token sequence, thereby better capturing the complex features of protein sequences and structures.

The paper pre-trains ProSST on a large dataset of protein structures, enabling it to learn comprehensive protein contextual representations. Experimental results show that ProSST achieves state-of-the-art performance in zero-shot mutation effect prediction and multiple supervised downstream tasks, demonstrating its strong capability in protein function prediction. Additionally, the paper provides detailed ablation studies to further validate the effectiveness of each design in ProSST.
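As a rough illustration of the sequence-structure disentangled attention idea, the sketch below sums separate attention-score terms for residue-to-residue, residue-to-structure, and structure-to-residue interactions. This is a simplified, single-head sketch modeled on DeBERTa-style disentangled attention, not the paper's implementation: the choice of terms, the `sqrt(3 * d)` scaling, the omission of positional terms, and all parameter names are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_attention(res, struct, params):
    """Single-head attention over residue and structure token embeddings.

    res:    (L, d) residue token embeddings
    struct: (L, d) structure token embeddings
    params: dict of (d, d) projection matrices (hypothetical names)
    """
    q_r, k_r, v_r = res @ params["Wq_r"], res @ params["Wk_r"], res @ params["Wv_r"]
    q_s, k_s = struct @ params["Wq_s"], struct @ params["Wk_s"]

    # Sum of separately parameterized score terms (assumed set of terms).
    scores = (
        q_r @ k_r.T      # residue-to-residue
        + q_r @ k_s.T    # residue-to-structure
        + q_s @ k_r.T    # structure-to-residue
    )
    d = q_r.shape[-1]
    # Scale by sqrt(3 * d) because three score terms are summed (assumption).
    weights = softmax(scores / np.sqrt(3 * d), axis=-1)
    return weights @ v_r, weights

rng = np.random.default_rng(0)
L, d = 5, 8
names = ["Wq_r", "Wk_r", "Wv_r", "Wq_s", "Wk_s"]
params = {name: rng.normal(scale=0.1, size=(d, d)) for name in names}
res = rng.normal(size=(L, d))      # stand-in residue embeddings
struct = rng.normal(size=(L, d))   # stand-in structure token embeddings

out, weights = disentangled_attention(res, struct, params)
print(out.shape)             # one d-dimensional output per residue
print(weights.sum(axis=1))   # each row of attention weights sums to 1
```

Keeping the residue and structure projections separate lets each interaction type contribute its own score, rather than mixing both token types into one embedding before attention.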

ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention

DeProt: A protein language model with quantizied structure and disentangled attention

Endowing Protein Language Models with Structural Knowledge

SaProt: Protein Language Modeling with Structure-aware Vocabulary

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Structure-Informed Protein Language Model

S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Protein Language Models and Structure Prediction: Connection and Progression

PSTP: Decoding Latent Sequence Grammar for Protein Phase Separation through Transfer Learning and Attention

Structure Language Models for Protein Conformation Generation

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

PSTP: Decoding Latent Sequence Grammar Through Transfer Learning and Attention for Protein Phase Separation

ProTokens: A Machine-Learned Language for Compact and Informative Encoding of Protein 3D Structures

Multi-level Protein Structure Pre-training via Prompt Learning

Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding

Simple, Efficient and Scalable Structure-aware Adapter Boosts Protein Language Models

Structure-Infused Protein Language Models