scDiffusion: conditional generation of high-quality single-cell data using diffusion model

Erpai Luo, Minsheng Hao, Lei Wei, Xuegong Zhang
2024-03-05
Abstract: Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at the single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not yet very realistic, especially when data must be generated under controlled conditions. Meanwhile, diffusion models have shown their power in generating data at high fidelity, providing a new opportunity for scRNA-seq generation.
Quantitative Methods, Machine Learning, Genomics
What problem does this paper attempt to address?
The paper addresses the challenges of generating single-cell RNA sequencing (scRNA-seq) data. Specifically:

- **Data acquisition difficulties**: Despite significant advances in scRNA-seq technology, obtaining enough high-quality data remains challenging. Certain biological samples are difficult to obtain, and some cell types within a sample may be too rare to analyze.
- **Insufficiencies of existing generative models**: Although existing generative models can produce synthetic scRNA-seq data, the generated data are not realistic enough, especially when data under controlled conditions are needed.

To address these issues, the research team developed scDiffusion, a model that combines diffusion models and foundation models to generate high-quality scRNA-seq data under controlled conditions. Multiple classifiers guide the diffusion process, and a new control strategy, Gradient Interpolation, lets the model generate continuous cell developmental trajectories.

Experimental results show that scDiffusion can generate single-cell gene expression data highly similar to real scRNA-seq data and can generate data for rare cell types under specific conditions, even for cell types beyond the training data. Additionally, using the Gradient Interpolation strategy, the researchers generated continuous developmental trajectories of mouse embryonic cells, demonstrating the potential of scDiffusion for augmenting real scRNA-seq data and studying cell fate.
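
The idea of steering a diffusion step with classifier gradients, and of blending the gradients of two conditions, can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not give the exact form of Gradient Interpolation, so the linear blend below is an assumption, the linear softmax classifier stands in for the paper's trained classifiers, and all names (`class_log_prob_grad`, `interpolated_grad`, `guided_mean`) are hypothetical. The mean shift follows the standard classifier-guidance rule, in which the denoising mean is moved by the scaled variance times the gradient of the log class probability.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D vector of logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def class_log_prob_grad(x, W, b, y):
    """Gradient of log p(y | x) w.r.t. x for a toy linear softmax classifier.

    Assumption: a real model would use a trained neural classifier and
    automatic differentiation; the linear form keeps the gradient analytic.
    """
    probs = np.exp(log_softmax(W @ x + b))
    return W[y] - probs @ W


def interpolated_grad(x, W, b, y_start, y_end, alpha):
    """Assumed form of gradient interpolation: a linear blend of the guidance
    gradients for two conditions, with alpha in [0, 1]."""
    g_start = class_log_prob_grad(x, W, b, y_start)
    g_end = class_log_prob_grad(x, W, b, y_end)
    return (1.0 - alpha) * g_start + alpha * g_end


def guided_mean(mean, variance, grad, scale=1.0):
    """Standard classifier guidance: shift the reverse-step mean along the
    classifier gradient, scaled by the step variance."""
    return mean + scale * variance * grad


# Toy setup: 3 hypothetical conditions over a 4-dimensional latent vector.
rng = np.random.default_rng(0)
n_classes, n_latent = 3, 4
W = rng.normal(size=(n_classes, n_latent))
b = np.zeros(n_classes)
x = rng.normal(size=n_latent)

# Sweep alpha from condition 0 to condition 2 to get intermediate targets.
trajectory = [
    guided_mean(x, 0.1, interpolated_grad(x, W, b, 0, 2, alpha))
    for alpha in np.linspace(0.0, 1.0, 5)
]
```

With `alpha = 0` the step is guided purely toward the first condition and with `alpha = 1` purely toward the second; intermediate values give the in-between states that, in the paper's setting, would correspond to points along a developmental trajectory.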