ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention

Mingchen Li, Yang Tan, Xinzhu Ma, Bozitao Zhong, Ziyi Zhou, Huiqun Yu, Wanli Ouyang, Liang Hong, Bingxin Zhou, Pan Tan
DOI: https://doi.org/10.1101/2024.04.15.589672
2024-05-17
Abstract: Protein language models (PLMs) have shown remarkable capabilities in various protein function prediction tasks. However, while protein function is intricately tied to structure, most existing PLMs do not incorporate protein structure information. To address this issue, we introduce ProSST, a Transformer-based protein language model that seamlessly integrates both protein sequences and structures. ProSST incorporates a structure quantization module and a Transformer architecture with disentangled attention. The structure quantization module translates a 3D protein structure into a sequence of discrete tokens by first serializing the protein structure into residue-level local structures and then embedding them into a dense vector space. These vectors are then quantized into discrete structure tokens by a pre-trained clustering model. These tokens serve as an effective protein structure representation. Furthermore, ProSST explicitly learns the relationship between protein residue token sequences and structure token sequences through sequence-structure disentangled attention. We pre-train ProSST on millions of protein structures using a masked language model objective, enabling it to learn comprehensive contextual representations of proteins. To evaluate ProSST, we conduct extensive experiments on zero-shot mutation effect prediction and several supervised downstream tasks, where ProSST achieves state-of-the-art performance among all baselines. Our code and pretrained models are publicly available.
Biology
What problem does this paper attempt to address?
The paper aims to address the issue that Protein Language Models (PLMs) fail to effectively utilize protein structural information when predicting protein functions. Specifically, although the function of a protein is closely related to its structure, most existing protein language models primarily focus on modeling protein sequences while neglecting the importance of structural information.

To solve this problem, the paper proposes a new model called ProSST (Protein Sequence-Structure Transformer). ProSST is a protein language model based on the Transformer architecture that can seamlessly integrate protein sequence and structural information. To achieve this, ProSST employs the following key methods:

1. **Structure Quantization Module**: This module converts the 3D protein structure into a series of discrete structural tokens. It first extracts residue-level local structural features by serializing the protein structure and embedding them into a dense vector space; then, it quantizes these vectors using a pre-trained clustering model to obtain effective protein structural representations.
2. **Sequence-Structure Disentangled Attention**: ProSST introduces a new attention mechanism in the Transformer model, namely sequence-structure disentangled attention, to explicitly learn the relationship between the protein residue sequence and the structural token sequence, thereby better capturing the complex features of protein sequences and structures.

The paper pre-trains ProSST on a large dataset of protein structures, enabling it to learn comprehensive protein contextual representations. Experimental results show that ProSST achieves state-of-the-art performance in zero-shot mutation effect prediction and multiple supervised downstream tasks, demonstrating its strong capability in protein function prediction. Additionally, the paper provides detailed ablation studies to further validate the effectiveness of each design in ProSST.
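
The two methods described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the clustering model can be stood in for by nearest-centroid assignment against a fixed codebook, and that the disentangled attention score is a sum of residue-to-residue, residue-to-structure, and structure-to-residue terms scaled by `sqrt(3 * d)`. The summary does not spell out either detail, and all names (`quantize_structure`, `disentangled_attention`, the weight keys) are hypothetical.

```python
import numpy as np


def quantize_structure(embeddings, centroids):
    """Map each residue-level embedding (L, D) to the index of its nearest centroid (K, D)."""
    # Pairwise Euclidean distances between embeddings and centroids: shape (L, K).
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def disentangled_attention(res_h, struct_h, w):
    """Single-head attention whose scores sum separate residue and structure terms.

    Assumption: three score terms (residue-residue, residue-structure,
    structure-residue); the actual ProSST formulation may differ.
    """
    d = res_h.shape[-1]
    q_r, k_r, v_r = res_h @ w["q_r"], res_h @ w["k_r"], res_h @ w["v_r"]
    q_s, k_s = struct_h @ w["q_s"], struct_h @ w["k_s"]
    scores = q_r @ k_r.T + q_r @ k_s.T + q_s @ k_r.T
    attn = softmax(scores / np.sqrt(3 * d), axis=-1)
    return attn @ v_r, attn


rng = np.random.default_rng(0)
L, D, K = 6, 8, 4  # sequence length, hidden size, codebook size (toy values)

# Toy codebook and local-structure embeddings placed close to known centroids.
centroids = rng.normal(size=(K, D))
local_embeddings = centroids[[0, 2, 2, 1, 3, 0]] + 0.01 * rng.normal(size=(L, D))
structure_tokens = quantize_structure(local_embeddings, centroids)

# Look up structure-token embeddings and pair them with random residue embeddings.
struct_table = rng.normal(size=(K, D))
struct_h = struct_table[structure_tokens]
res_h = rng.normal(size=(L, D))
weights = {name: rng.normal(size=(D, D)) / np.sqrt(D)
           for name in ("q_r", "k_r", "v_r", "q_s", "k_s")}

out, attn = disentangled_attention(res_h, struct_h, weights)
print(structure_tokens, out.shape)
```

In this sketch the structure tokens influence only the attention weights, while the values come from the residue stream; that is one plausible design choice for keeping the two signals separate, not a claim about how ProSST itself is built.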