DiscDiff: Latent Diffusion Model for DNA Sequence Generation

Zehui Li, Yuhao Ni, William A V Beardall, Guoxuan Xia, Akashaditya Das, Guy-Bart Stan, Yiren Zhao
2024-04-18
Abstract: This paper introduces a novel framework for DNA sequence generation, comprising two key components: DiscDiff, a Latent Diffusion Model (LDM) tailored for generating discrete DNA sequences, and Absorb-Escape, a post-training algorithm designed to refine these sequences. Absorb-Escape enhances the realism of the generated sequences by correcting 'round errors' inherent in the conversion process between latent and input spaces. Our approach not only sets new standards in DNA sequence generation but also demonstrates superior performance over existing diffusion models in generating both short and long DNA sequences. Additionally, we introduce EPD-GenDNA, the first comprehensive, multi-species dataset for DNA generation, encompassing 160,000 unique sequences from 15 species. We hope this study will advance the generative modelling of DNA, with potential implications for gene therapy and protein production.
Genomics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper proposes a framework for DNA sequence generation that addresses two main issues: 1) existing diffusion models have limitations in handling discrete DNA sequences, which restricts how realistic their generated sequences are; 2) converting from the latent space back to the input space introduces "rounding errors" that reduce the realism of the sequences. To address these, the paper introduces DiscDiff, a latent diffusion model for discrete sequences, and Absorb-Escape, a post-training algorithm that refines the generated sequences by correcting these errors. The paper also introduces EPD-GenDNA, described as the first comprehensive multi-species dataset for DNA generation, with 160,000 unique sequences from 15 species, to support the evaluation and development of DNA generation models.
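To make the "rounding error" idea concrete, the following is a minimal sketch of one plausible reading: a continuous decoder output is rounded to the highest-scoring nucleotide at each position, so a small perturbation near a tie changes the discrete base. The score values, the function names `round_to_sequence` and `count_mismatches`, and the argmax rounding rule are all assumptions for illustration. This is not the paper's decoder and it does not implement Absorb-Escape.

```python
# Illustrative only: hypothetical scores, not output from DiscDiff.
NUCLEOTIDES = "ACGT"


def round_to_sequence(scores):
    """Round each position's four continuous scores to the top-scoring base."""
    return "".join(
        NUCLEOTIDES[max(range(4), key=lambda i: position[i])]
        for position in scores
    )


def count_mismatches(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Assumed continuous output for a 4-base sequence.
# Position 2 is nearly tied between G (0.46) and T (0.44).
clean = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.46, 0.44],
    [0.02, 0.03, 0.05, 0.90],
]

# A small shift in the continuous values at the near-tied position.
noisy = [row[:] for row in clean]
noisy[2][2] -= 0.02
noisy[2][3] += 0.02

original = round_to_sequence(clean)
perturbed = round_to_sequence(noisy)

print(original, perturbed, count_mismatches(original, perturbed))
```

In this toy case a shift of 0.02 in the continuous values is enough to change one base after rounding, which is the kind of discrepancy a refinement step applied after generation would aim to correct.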