Diffusion on language model embeddings for protein sequence generation

Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov
2024-03-06
Abstract: Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived from the protein language model, ESM-2, to generate amino acid sequences. DiMA surpasses leading solutions, including autoregressive transformer-based and discrete diffusion models, and we quantitatively illustrate the impact of the design choices that lead to its superior performance. We extensively evaluate the quality, diversity, distribution similarity, and biological relevance of the generated sequences using multiple metrics across various modalities. Our approach consistently produces novel, diverse protein sequences that accurately reflect the inherent structural and functional diversity of the protein space. This work advances the field of protein design and sets the stage for conditional models by providing a robust framework for scalable and high-quality protein sequence generation.
Machine Learning, Artificial Intelligence, Biomolecules
What problem does this paper attempt to address?
The paper addresses unconditional generation of protein sequences, a fundamental but underexplored problem in protein design. Existing methods mostly target conditional generation or specific protein families, leaving general unconditional generation comparatively neglected. The paper proposes DiMA, a model that applies continuous diffusion to embeddings produced by the protein language model ESM-2 in order to generate amino acid sequences. According to the paper, DiMA surpasses existing solutions, including autoregressive Transformer models and discrete diffusion models, in quality and diversity, and the authors quantify the contribution of its design choices using multiple evaluation metrics. DiMA works in two stages. During training, amino acid sequences are encoded into continuous representations with ESM-2, and a diffusion model is trained to reconstruct the clean embeddings from noise-corrupted ones. During inference, protein embeddings are generated by iteratively refining random Gaussian noise and are then decoded into amino acid sequences. The generated sequences are reported to be high quality and diverse, and to reflect the structural and functional diversity of proteins. The evaluation covers sequence quality, diversity, distribution similarity, and biological relevance, and indicates that sequences generated by DiMA capture the characteristics of the training data. The paper presents this as a framework for protein design and a foundation for conditional models. It also examines the impact of different design choices and training strategies.
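
The encode, noise, refine, decode pipeline described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the random lookup table `TABLE` stands in for ESM-2, `toy_denoiser` stands in for the trained diffusion network (and is assumed to predict the clean embedding), the linear noise schedule is arbitrary, and the predict-then-re-noise loop is a simple sampler chosen for brevity rather than the one used in the paper.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIM = 16                           # toy embedding size (assumption)

# Hypothetical stand-in for ESM-2: one fixed random vector per residue.
TABLE = np.random.default_rng(0).normal(size=(len(ALPHABET), DIM))


def encode(seq):
    """Map a sequence to a (length, DIM) array of continuous embeddings."""
    return TABLE[[ALPHABET.index(a) for a in seq]]


def decode(emb):
    """Map each embedding to the nearest residue in the toy table."""
    dist = ((emb[:, None, :] - TABLE[None, :, :]) ** 2).sum(axis=-1)
    return "".join(ALPHABET[i] for i in dist.argmin(axis=1))


def add_noise(x0, alpha_bar, rng):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


def toy_denoiser(x_t, alpha_bar):
    """Placeholder for a trained network that predicts the clean embedding.

    It simply snaps each position to the nearest table entry and ignores
    the noise level; a real model would be learned from data.
    """
    return encode(decode(x_t))


def sample(denoiser, length, steps, rng):
    """Start from Gaussian noise and iteratively refine toward clean embeddings."""
    # alpha_bar rises from 0 (pure noise) to 1 (no noise) along the loop.
    alpha_bars = np.linspace(0.0, 1.0, steps + 1)
    x = rng.normal(size=(length, DIM))
    for i in range(steps):
        x0_hat = denoiser(x, alpha_bars[i])             # predict clean embeddings
        x = add_noise(x0_hat, alpha_bars[i + 1], rng)   # re-noise at a lower level
    return x


if __name__ == "__main__":
    embeddings = sample(toy_denoiser, length=12, steps=10, rng=np.random.default_rng(2))
    print(decode(embeddings))
```

Because the toy denoiser has learned nothing, the printed sequence is arbitrary; the sketch only shows how the stages connect. In a real system the lookup table would be replaced by a pretrained encoder, the denoiser by a trained network, and nearest-neighbour decoding by a learned decoder.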