PRO-LDM: Protein Sequence Generation with a Conditional Latent Diffusion Model

Sitao Zhang, Zixuan Jiang, Rundong Huang, Shaoxun Mo, Letao Zhu, Peiheng Li, Ziyi Zhang, Emily Pan, Xi Chen, Yunfei Long, Qi Liang, Jin Tang, Renjing Xu, Rui Qing
DOI: https://doi.org/10.1101/2023.08.22.554145
2024-01-15
Abstract: Deep learning-driven protein design holds enormous potential despite the complexities in sequences and structures. Recent developments in diffusion models yielded success in structure design, but awaits progress in sequence design and are computationally demanding. Here we present PRO-LDM: an efficient framework combining design fidelity and computational efficiency, utilizing the diffusion model in latent space to design proteins with property tuning. The model employs a joint autoencoder to capture latent variable distributions and generate meaningful embeddings from sequences. PRO-LDM (1) learns representations from biological features in natural proteins at both amino-acid and sequence level; (2) generates native-like new sequences with enhanced diversity; and (3) conditionally designs new proteins with tailored properties or functions. The out-of-distribution design enables sampling notably different sequences by adjusting classifier guidance strength. Our model presents a feasible pathway and an integratable tool to extract physicochemical and evolutionary information embedded within primary sequences, for protein design and optimization.
Bioengineering
What problem does this paper attempt to address?
This paper proposes PRO-LDM (Protein Sequence Generation with a Conditional Latent Diffusion Model), a new protein sequence generation framework that addresses several key challenges in current protein design. Specifically, the study aims to tackle the following issues:

1. **Complexity of Protein Sequence Design**: Deep learning-driven protein design has great potential, but the complex relationship between sequence and structure remains a challenge.
2. **Improving Sequence Diversity and Fidelity**: Existing methods have succeeded in protein structure design, but the diversity and fidelity of newly generated protein sequences still need improvement.
3. **Reducing Computational Cost**: Diffusion models are effective in protein structure design, but they are computationally expensive because sampling requires many denoising iterations.
4. **Achieving Conditional Protein Design**: Researchers want methods that can design new proteins conditioned on specific properties or functions.

To address these issues, the paper proposes the PRO-LDM framework, which combines design fidelity with computational efficiency by running a diffusion model in latent space to design proteins with tunable properties. The core contributions of PRO-LDM include:

- **Multi-task Learning**: Integrates sequence generation and fitness prediction within a unified framework.
- **Efficient Sequence Generation**: Running the diffusion process in latent space reduces data dimensionality, which speeds up sequence generation and improves model efficiency.
- **Preservation of Key Evolutionary Information**: Retains key positions and residues in the generated sequences, maintaining the protein scaffold and function.
- **Replication of Global Amino Acid Relationships**: Learns the relationships between amino acids at the level of individual residues and at the level of the entire protein sequence.

Through these contributions, PRO-LDM aims to become a modular tool for extracting biological information from protein sequences and generating new sequences with target features.
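
The two mechanisms this summary leans on, diffusion in a low-dimensional latent space and a tunable guidance strength, can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration and not the authors' implementation: `toy_encode` is a hypothetical stand-in for the trained autoencoder (a random linear projection), the noise schedule is a generic linear DDPM schedule, and `guided_eps` uses one common classifier-free-style blend of conditional and unconditional noise predictions, which may differ from the exact form in the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Generic linear DDPM noise schedule (an assumption, not the paper's values).
BETAS = np.linspace(1e-4, 0.02, 1000)
ALPHA_BAR = np.cumprod(1.0 - BETAS)


def one_hot(seq):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AMINO_ACIDS.index(a) for a in seq]] = 1.0
    return x


def toy_encode(seq, latent_dim, rng):
    """Hypothetical encoder: random linear map from one-hot to a small latent."""
    x = one_hot(seq).ravel()
    w = rng.standard_normal((latent_dim, x.size)) / np.sqrt(x.size)
    return w @ x


def forward_noise(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_bar_t) z_0, (1 - a_bar_t) I)."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps


def guided_eps(eps_uncond, eps_cond, w):
    """Blend noise predictions; w=0 is unconditional, w=1 conditional,
    and w>1 extrapolates further toward the condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
z0 = toy_encode("ACDEFGHIKL", latent_dim=16, rng=rng)   # 200 dims -> 16 dims
zt, eps = forward_noise(z0, t=500, alpha_bar=ALPHA_BAR, rng=rng)
print(z0.shape, zt.shape)
```

The point of the sketch is the dimensionality: each denoising step operates on a 16-dimensional vector instead of the 200-dimensional one-hot input, which is the source of the efficiency gain described above. Raising `w` past 1 pushes samples further from the unconditional distribution, which corresponds to the out-of-distribution sampling the abstract attributes to adjusting guidance strength.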