Abstract: Most protein language models (PLMs), which are used to produce high-quality protein representations, use only protein sequences during training. However, the known protein structure is crucial in many protein property prediction tasks, so there is a growing interest in incorporating the knowledge about the protein structure into a PLM. In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN has a pre-trained sequence encoder and an introduced Structure Adapter, which are then fused and trained together. According to the evaluation on 7 downstream tasks of various nature, both small and medium-sized MULAN models show consistent improvement in quality compared to both sequence-only ESM-2 and structure-aware SaProt. Importantly, our model offers a cheap increase in the structural awareness of the protein representations due to finetuning of existing PLMs instead of training from scratch. We perform a detailed analysis of the proposed model and demonstrate its awareness of the protein structure. The implementation, training data and model checkpoints are available at https://github.com/DFrolova/MULAN.

What problem does this paper attempt to address?

The main objective of this paper is to propose a new multimodal protein language model (MULAN), aimed at improving the learning of protein representations by integrating protein sequence and structural information. Specifically, the MULAN model combines a sequence encoder with a newly introduced Structure Adapter, which can handle angle-based protein structural information. The key contributions of the paper include:

1. **MULAN Model**: This is a new multimodal protein language model that can simultaneously handle protein sequence and structural data. It includes a pre-trained sequence encoder and a Structure Adapter to process backbone torsion angles and side-chain torsion angles of protein structures.
2. **Structure Adapter**: This is an important component of MULAN, which uses torsion angles in protein structures to represent structural information. It can work on top of existing protein language models without the need to train the entire model from scratch, thus improving the model's understanding of protein structures at a lower cost.
3. **Performance Evaluation**: The paper evaluates the performance of MULAN on various downstream tasks, including protein stability prediction, fluorescence prediction, metal ion binding prediction, human protein-protein interaction prediction, and gene ontology classification. The results show that MULAN outperforms sequence-based models (such as ESM-2) and models that already consider some structural information (such as SaProt) in these tasks.
4. **Detailed Analysis**: The authors also conducted extensive ablation experiments to verify the effectiveness and structure-awareness of the model. For example, they demonstrated the importance of the Structure Adapter by comparing the performance of models under different settings and showed how masking uncertain structural predictions can reduce noise.
5. **Secondary Structure Prediction**: To further validate MULAN's ability to perceive protein structures, the paper also evaluates the model's performance on the protein secondary structure prediction task. The results indicate that MULAN also achieves significant improvements in this task.

In summary, this study successfully enhances the utilization of structural information in protein language models by introducing the Structure Adapter and confirms the effectiveness of this approach in multiple downstream tasks.
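The idea behind an angle-based Structure Adapter, together with masking of uncertain structure predictions, can be illustrated with a small sketch. This is not the authors' implementation: the sin/cos encoding of angles, the additive fusion, the use of a pLDDT-style confidence score, the cutoff of 70, and all function names and weights below are assumptions chosen only for illustration.

```python
import math

PLDDT_THRESHOLD = 70.0  # assumed cutoff, not taken from the paper


def angle_features(angles, plddt, threshold=PLDDT_THRESHOLD):
    """Encode each residue's torsion angles (radians) as [sin, cos] pairs.

    Residues whose confidence is below the threshold get all-zero features,
    so an uncertain structure prediction contributes nothing downstream.
    """
    features = []
    for residue_angles, confidence in zip(angles, plddt):
        row = []
        for angle in residue_angles:
            if confidence >= threshold:
                row.extend([math.sin(angle), math.cos(angle)])
            else:
                row.extend([0.0, 0.0])
        features.append(row)
    return features


def project(features, weight):
    """Bias-free linear layer: (L x d_in) times (d_in x d_model)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in features]


def fuse(seq_emb, struct_emb):
    """Add structure embeddings to sequence embeddings, residue by residue."""
    return [[s + t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(seq_emb, struct_emb)]


# Toy protein of 3 residues with (phi, psi) backbone angles in radians.
angles = [(-1.05, -0.79), (-2.09, 2.36), (-1.13, -0.70)]
plddt = [92.0, 41.0, 88.0]  # the middle residue is low-confidence

# Stand-ins for learned parameters: d_in = 4 (2 angles x sin/cos), d_model = 3.
weight = [[0.1, 0.0, -0.2],
          [0.0, 0.3, 0.1],
          [0.2, -0.1, 0.0],
          [-0.3, 0.2, 0.4]]
# Stand-in for the output of a pre-trained sequence encoder's embedding layer.
seq_emb = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

feats = angle_features(angles, plddt)
fused = fuse(seq_emb, project(feats, weight))
print(fused[1])  # the masked residue keeps its sequence embedding unchanged
```

Because the adapter output is added to embeddings that a pre-trained sequence model already produces, a design of this kind can be trained by finetuning rather than from scratch, which is the cost advantage the summary describes. In a real model the projection would be a learned layer and the fused embeddings would feed the Transformer blocks.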

MULAN: Multimodal Protein Language Model for Sequence and Structure Encoding
