ProteinAligner: A Multi-modal Pretraining Framework for Protein Foundation Models

Li Zhang, Han Guo, Leah V Schaffer, Young Su Ko, Digvijay Singh, Hamid Rahmani, Danielle Grotjahn, Elizabeth Villa, Michael Gilson, Wei Wang, Trey Ideker, Eric Xing, Pengtao Xie

DOI: https://doi.org/10.1101/2024.10.06.616870

2024-10-06

Abstract: Protein foundation models, particularly protein language models, have demonstrated strong success in learning meaningful representations of proteins using transformer architectures pretrained on large-scale protein datasets with self-supervised learning. These representations have been highly effective for downstream tasks such as predicting protein functions and properties. However, most current protein foundation models focus on pretraining with amino acid sequences, often neglecting additional modalities like protein structures and related literature, both of which provide valuable insights. To address this gap, we propose a multi-modal pretraining approach that integrates three key modalities: protein sequences, structures, and literature text. In our framework, the protein sequence modality serves as the anchor, with the other two modalities aligned to it, enhancing the model's capacity to capture more comprehensive protein information. ProteinAligner outperformed state-of-the-art protein foundation models in predicting protein functions and properties across diverse downstream tasks.

Bioinformatics

What problem does this paper attempt to address?

The paper addresses a limitation of current protein foundation models: most are pretrained on amino acid sequences alone and ignore other informative modalities, notably protein structures and the related literature. These modalities carry biological information that supports a more complete understanding of protein functions and properties. Specifically, the paper points out:

1. **Protein structure**: Provides three-dimensional information that helps explain how proteins fold and interact with other molecules, which directly affects their biological function.
2. **Literature text**: Contains experimentally verified context about protein mechanisms, behaviors, and interactions that is difficult to infer from sequences or structures alone.

To close this gap, the paper proposes ProteinAligner, a multi-modal pretraining framework that integrates three modalities: protein sequences, structures, and literature text. The sequence modality serves as the anchor, and the structure and text modalities are aligned to it, so the model learns richer protein representations. According to the abstract, this improves downstream prediction of protein functions and properties relative to state-of-the-art protein foundation models.
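The anchor-based alignment described in the abstract can be illustrated with a small sketch. The summary here does not state which objective ProteinAligner uses, so the code below assumes a symmetric InfoNCE contrastive loss, a common choice for aligning modalities, and is not the authors' implementation. The function names, the temperature value, and the toy embeddings are all hypothetical. In practice the embeddings would come from separate sequence, structure, and text encoders.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; guard against the all-zero vector.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, others, temperature=0.07):
    # One-directional InfoNCE (assumed objective): row i of `anchors`
    # should score highest against row i of `others`.
    a = [l2_normalize(v) for v in anchors]
    b = [l2_normalize(v) for v in others]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)

def symmetric_info_nce(anchors, others, temperature=0.07):
    # Average the anchor-to-other and other-to-anchor directions.
    return 0.5 * (info_nce(anchors, others, temperature)
                  + info_nce(others, anchors, temperature))

def anchor_alignment_loss(seq_emb, struct_emb, text_emb, temperature=0.07):
    # The sequence embedding is the anchor; structure and text are each
    # aligned to it. Structure and text are not aligned to each other directly.
    return (symmetric_info_nce(seq_emb, struct_emb, temperature)
            + symmetric_info_nce(seq_emb, text_emb, temperature))

# Toy batch of three proteins with hypothetical 3-dimensional embeddings.
seq = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = seq[1:] + seq[:1]  # every protein paired with the wrong partner

aligned_loss = anchor_alignment_loss(seq, seq, seq)
misaligned_loss = anchor_alignment_loss(seq, shuffled, shuffled)
print("aligned:", aligned_loss)
print("misaligned:", misaligned_loss)
```

When the structure and text embeddings of each protein match its sequence embedding, the loss is close to zero. When the pairing is shuffled, the loss is large. Minimizing such a loss therefore pulls the other two modalities toward the sequence anchor.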


SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation

OneProt: Towards Multi-Modal Protein Foundation Models

Endowing Protein Language Models with Structural Knowledge

Multimodal pretraining for unsupervised protein representation learning

Multi-level Protein Structure Pre-training via Prompt Learning

Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Multi-level protein pre-training with Vabs-Net

Multimodal Protein-Ligand Contrastive Pretraining for Effective and Efficient Drug Discovery

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

SPRoBERTa: protein embedding learning with local fragment modeling

ProtPlat: an efficient pre-training platform for protein classification based on FastText

Modeling Protein Using Large-scale Pretrain Language Model

Protein Representation Learning by Geometric Structure Pretraining

Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX

MolBind: Multimodal Alignment of Language, Molecules, and Proteins

Progressive Multi-Modality Learning for Inverse Protein Folding

Multimodal Pre-Training Model for Sequence-based Prediction of Protein-Protein Interaction

Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation

ProteinCLIP: enhancing protein language models with natural language