xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song

2024-01-11

Abstract: Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to an advanced 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.

Quantitative Methods, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses a limitation of existing protein language models: most are pre-trained with either an autoencoding or an autoregressive objective, so they struggle to handle protein understanding tasks (such as structure and function prediction) and generation tasks (such as designing new protein sequences) at the same time. To tackle this, the authors propose a unified protein language model, xTrimoPGLM, which combines bidirectional attention with an autoregressive objective to serve both understanding and generation. Its key innovation is integrating the Masked Language Model (MLM) objective and the General Language Model (GLM) objective into a single framework. The model has 100 billion parameters and is trained on 1 trillion tokens drawn from a large-scale pre-training dataset of approximately 940 million unique protein sequences. In the experiments, xTrimoPGLM outperforms other advanced baselines on 15 of 18 protein understanding benchmarks and, in 3D structure prediction, surpasses existing language model-based tools. xTrimoPGLM can also generate de novo protein sequences that follow the principles of natural ones, and it supports programmable generation after supervised fine-tuning (SFT) on curated sequences. The paper also discusses the model's limitations in practical applications, including adaptability to different protein tasks, accuracy of structure prediction, and reduction of generation errors, and emphasizes the need for future improvements.
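To make the difference between the two objectives concrete, the toy sketch below builds one training example of each kind from a short amino-acid string. It is only an illustration under stated assumptions, not the authors' implementation: the function names and the special tokens `[MASK]`, `[sMASK]`, `<sop>`, and `<eop>` are hypothetical, and the mask positions are fixed by hand, where real pre-training would sample them at random.

```python
# Toy illustration of MLM-style and GLM-style training examples.
# All token names and helper names are assumptions made for this sketch.

MASK = "[MASK]"    # in-place mask token for the MLM-style objective
SPAN = "[sMASK]"   # span placeholder for the GLM-style objective
SOP = "<sop>"      # start-of-piece token fed to the autoregressive part
EOP = "<eop>"      # end-of-piece token the model must finally predict


def mlm_example(seq, positions):
    """Mask individual residues in place; each is predicted from both sides."""
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]   # remember the residue to be recovered
        tokens[i] = MASK         # hide it in the input
    return tokens, targets


def glm_example(seq, start, length):
    """Cut out one contiguous span; it is regenerated left to right."""
    tokens = list(seq)
    span = tokens[start:start + length]
    # The context keeps a single placeholder where the span used to be.
    context = tokens[:start] + [SPAN] + tokens[start + length:]
    # Teacher forcing: input is shifted right by one relative to the target.
    decoder_input = [SOP] + span
    decoder_target = span + [EOP]
    return context, decoder_input, decoder_target


if __name__ == "__main__":
    seq = "MKTAYIAK"
    print(mlm_example(seq, [1, 4]))
    print(glm_example(seq, 2, 3))
```

In the MLM-style example every hidden residue is predicted independently using context on both sides, which suits understanding tasks. In the GLM-style example the removed span is produced token by token, which is the behavior that generation relies on. A unified model of the kind the summary describes trains on both kinds of examples.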
