Diffusion Language Models Are Versatile Protein Learners

Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
2024-10-17
Abstract: This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance. Code is released at https://github.com/bytedance/dplm.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
The paper addresses the gap between generative and predictive capability in existing protein language models (protein LMs). Specifically:

1. **Masked language models (Masked-LMs)**: Models such as the ESM series perform well on sequence understanding tasks but cannot generate protein sequences, because they lack an explicit generative formulation. This may also limit their predictive capability, since strong generative models typically acquire a deep understanding of the data by learning its distribution well enough to create new samples.
2. **Autoregressive language models (AR-LMs)**: Models such as ProGen perform well on generative tasks but often fall short in understanding sequence data, especially for structured biological macromolecules like proteins. Their unidirectional receptive field can access context on only one side of each position, which limits both their generative and their predictive capability.

To address these issues, the paper proposes the Diffusion Protein Language Model (DPLM), which combines the strengths of generative and predictive models in a single versatile protein language model. DPLM is built on a discrete diffusion probabilistic framework and, through generative pre-training on evolutionary-scale protein sequences, can generate structurally plausible, novel, and diverse protein sequences. It also provides effective representations for downstream predictive tasks, and it supports several conditional generation strategies (conditioning on partial sequences, conditioning on other modalities such as structure, and plug-and-play classifier guidance), which makes it more flexible in practical applications.
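To make the idea of a discrete diffusion framework more concrete, the following minimal Python sketch shows one common variant: an absorbing-state (mask-based) corruption process and an iterative unmasking loop. It is an illustration under stated assumptions, not the authors' implementation. The summary above does not specify the corruption process, and the names `MASK`, `corrupt`, `toy_denoiser`, and `generate` are hypothetical. In particular, `toy_denoiser` proposes residues uniformly at random and only stands in for a trained network that would see the full bidirectional context.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol, chosen for this sketch


def corrupt(sequence, t, rng):
    """Forward process: mask each residue independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return "".join(MASK if rng.random() < t else aa for aa in sequence)


def toy_denoiser(noisy, rng):
    """Stand-in for a trained network (assumption): proposes a residue for
    every masked position. A real model would condition on all unmasked
    residues on both sides of each position."""
    return {i: rng.choice(AMINO_ACIDS) for i, ch in enumerate(noisy) if ch == MASK}


def generate(length, steps, rng):
    """Reverse process: start fully masked, reveal a share of positions per step."""
    seq = [MASK] * length
    for step in range(steps):
        proposals = toy_denoiser(seq, rng)
        if not proposals:
            break
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        # Simplification: positions to reveal are picked at random here.
        for i in rng.sample(sorted(proposals), k):
            seq[i] = proposals[i]
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt("MKTAYIAKQRQISFVK", 0.5, rng))  # partially masked sequence
    print(generate(length=30, steps=5, rng=rng))  # fully unmasked sample
```

The contrast with the two model families listed above is visible in the loop: unlike an autoregressive model, positions are filled in any order rather than left to right, and unlike a plain masked-prediction model, the procedure defines a complete sampling path from an all-mask sequence to a full sequence.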