stMCDI: Masked Conditional Diffusion Model with Graph Neural Network for Spatial Transcriptomics Data Imputation

Xiaoyu Li, Wenwen Min, Shunfang Wang, Changmiao Wang, Taosheng Xu
2024-03-16
Abstract: Spatially resolved transcriptomics represents a significant advancement in single-cell analysis by offering both gene expression data and their corresponding physical locations. However, this high degree of spatial resolution entails a drawback, as the resulting spatial transcriptomic data at the cellular level is notably plagued by a high incidence of missing values. Furthermore, most existing imputation methods either overlook the spatial information between spots or compromise the overall gene expression data distribution. To address these challenges, our primary focus is on effectively utilizing the spatial location information within spatial transcriptomic data to impute missing values, while preserving the overall data distribution. We introduce **stMCDI**, a novel conditional diffusion model for spatial transcriptomics data imputation, which employs a denoising network trained using randomly masked data portions as guidance, with the unmasked data serving as conditions. Additionally, it utilizes a GNN encoder to integrate the spatial position information, thereby enhancing model performance. The results obtained from spatial transcriptomics datasets elucidate the performance of our methods relative to existing approaches.
Genomics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses missing-value imputation in spatial transcriptomics data. Existing imputation methods either ignore the spatial information between spots or distort the overall distribution of the gene expression data. To solve these problems, the authors propose stMCDI, a conditional diffusion model that uses spatial location information to impute missing values while preserving the overall data distribution. The main contributions of stMCDI are as follows:

1. **Integration of Spatial Location Information**: A Graph Neural Network (GNN) encoder combines the gene expression matrix with spatial location information to construct a graph structure.
2. **Masking Strategy**: Part of the known data is masked so that the model learns to predict unknown data segments from known ones, which improves imputation performance. The masked values also act as labels, so training is self-supervised.
3. **Conditional Diffusion Model**: The known data segments are fed to the diffusion model as prior conditions, which improves the model's fit to the data distribution and, in turn, the quality of the imputation.
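The data preparation implied by these three contributions can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation. The function names, the k-nearest-neighbour graph construction, the values of `k` and `mask_ratio`, and the single noise level `alpha_bar_t` are all assumptions made here. The GNN encoder and the denoising network themselves are not implemented. The sketch only shows how a spot graph, a condition/target split, and DDPM-style noising of the target entries could fit together.

```python
import numpy as np


def knn_adjacency(coords, k=4):
    """Symmetric k-nearest-neighbour adjacency over spot coordinates.

    Assumption: the summary only says a graph is built from spatial
    locations; kNN on Euclidean distance is one common choice.
    """
    n = coords.shape[0]
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]  # k closest spots per row
    adj = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    adj[rows, nbrs.ravel()] = True
    return adj | adj.T                      # make the graph undirected


def random_mask(observed, mask_ratio, rng):
    """Split observed entries into conditions and self-supervised targets."""
    target = observed & (rng.random(observed.shape) < mask_ratio)
    condition = observed & ~target
    return condition, target


def noise_targets(x, target, alpha_bar_t, rng):
    """DDPM-style forward noising, kept only on the target entries."""
    eps = rng.standard_normal(x.shape)
    x_t = np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
    return np.where(target, x_t, 0.0), eps


def masked_mse(eps_pred, eps, target):
    """Noise-prediction loss evaluated on the masked (target) entries only."""
    return float(((eps_pred - eps) ** 2)[target].mean())


# Toy data: 50 spots, 20 genes, roughly 30% of entries missing (hypothetical).
rng = np.random.default_rng(0)
coords = rng.random((50, 2))
expr = rng.random((50, 20))
observed = rng.random(expr.shape) > 0.3

adj = knn_adjacency(coords, k=4)
condition, target = random_mask(observed, mask_ratio=0.2, rng=rng)
cond_values = np.where(condition, expr, 0.0)   # what the model may see
noisy, eps = noise_targets(expr, target, alpha_bar_t=0.5, rng=rng)

# A real denoiser would take (noisy, cond_values, adj, t); zeros stand in here.
loss = masked_mse(np.zeros_like(eps), eps, target)
```

Because the target entries are drawn from values that were actually observed, the training loss has ground truth available, which is what makes the masking step self-supervised. At inference time the genuinely missing entries would take the place of `target`.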