Diffusion Language Models Are Versatile Protein Learners

Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu

2024-10-17

Abstract: This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.

Machine Learning, Biomolecules

What problem does this paper attempt to address?

The main problem this paper addresses is that existing protein language models (protein LMs) do not offer strong generative and strong predictive capabilities at the same time. Specifically:

1. **Masked language models (Masked-LMs)**: Models such as the ESM series perform well on sequence understanding tasks but cannot generate protein sequences, because they lack an explicit generative modeling formulation. This may also limit their predictive capabilities, since powerful generative models typically gain a deep understanding of the data by learning its distribution while creating new samples.
2. **Autoregressive language models (AR-LMs)**: Models such as ProGen perform well on generative tasks but often fall short in understanding sequence data, especially for structured biological macromolecules like proteins. Their unidirectional receptive field can access only one side of the sequence context, which limits both their generative and predictive capabilities.

To address these issues, the paper proposes the Diffusion Protein Language Model (DPLM), which combines the advantages of generative and predictive models in a single versatile protein language model. DPLM is built on a discrete diffusion probabilistic framework and is pre-trained generatively on evolutionary-scale protein sequences, after which it can generate structurally plausible, novel, and diverse protein sequences. DPLM also provides effective representations for downstream predictive tasks, and it supports several conditional generation strategies (conditioning on partial sequences, on other modalities such as structure, and on desired properties through classifier guidance), which makes it more flexible in practical applications.
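To make the idea of discrete diffusion over protein sequences more concrete, here is a minimal Python sketch. It assumes an absorbing-state (masking) corruption process, which is one common form of discrete diffusion, and it is not DPLM's actual network, noise schedule, or sampling code. The names `MASK`, `corrupt`, `toy_predict`, `denoise_step`, and `generate` are hypothetical, and the predictor is a random stand-in for a trained bidirectional model. The `fixed` argument illustrates conditioning on a partial sequence, such as a motif that must be preserved.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical symbol for the absorbing (masked) state


def corrupt(sequence, t, rng):
    """Forward process (assumed): mask each residue independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return "".join(MASK if rng.random() < t else aa for aa in sequence)


def toy_predict(noisy, position, rng):
    """Stand-in for a trained denoiser: returns a (residue, confidence) pair.

    A real model would condition on the full, bidirectional context in `noisy`;
    this toy version ignores it and samples uniformly.
    """
    return rng.choice(AMINO_ACIDS), rng.random()


def denoise_step(noisy, rng, num_to_unmask):
    """Reverse step: fill in the most confident predictions at masked positions."""
    masked = [i for i, c in enumerate(noisy) if c == MASK]
    scored = [(toy_predict(noisy, i, rng), i) for i in masked]
    scored.sort(key=lambda item: -item[0][1])  # highest confidence first
    out = list(noisy)
    for (residue, _confidence), i in scored[:num_to_unmask]:
        out[i] = residue
    return "".join(out)


def generate(length, steps, rng, fixed=None):
    """Start from an all-mask sequence and iteratively unmask it.

    `fixed` maps positions to residues that are given up front and never changed.
    """
    seq = [MASK] * length
    for i, aa in (fixed or {}).items():
        seq[i] = aa
    seq = "".join(seq)
    for step in range(steps):
        remaining = seq.count(MASK)
        if remaining == 0:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        k = -(-remaining // (steps - step))  # ceiling division
        seq = denoise_step(seq, rng, k)
    return seq


if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt("MKTAYIAKQR", 0.5, rng))              # partially masked sequence
    print(generate(12, 4, rng))                         # unconditional sample
    print(generate(12, 4, rng, fixed={3: "H", 4: "H"}))  # keeps residues 3 and 4
```

Unlike an autoregressive model, the reverse step can fill any masked position in any order, so each prediction can in principle use context from both sides of the sequence. That property is what the summary above contrasts with the unidirectional receptive field of AR-LMs.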

DPLM-2: A Multimodal Diffusion Protein Language Model

AMP-Diffusion: Integrating Latent Diffusion with Protein Language Models for Antimicrobial Peptide Generation

PRO-LDM: Protein Sequence Generation with a Conditional Latent Diffusion Model

Long-context Protein Language Model

Exploring evolution-aware & -free protein language models as protein function predictors

PLMC: Language Model of Protein Sequences Enhances Protein Crystallization Prediction

Protein generation with evolutionary diffusion: sequence is all you need

MeMDLM: De Novo Membrane Protein Design with Masked Discrete Diffusion Protein Language Models

Structure Language Models for Protein Conformation Generation

Knowledge-aware Reinforced Language Models for Protein Directed Evolution

Interpretable improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Learning immune receptor representations with protein language models

From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences

Modeling Protein Using Large-scale Pretrain Language Model

Efficient Inference, Training, and Fine-tuning of Protein Language Models

From PSSM to Pre-Trained Language Models

xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein

Protein A-like Peptide Design Based on Diffusion and ESM2 Models