stDiff: a diffusion model for imputing spatial transcriptomics through single-cell transcriptomics

Kongming Li, Jiahao Li, Yuhao Tao, Fei Wang
DOI: https://doi.org/10.1093/bib/bbae171
IF: 9.5
2024-03-27
Briefings in Bioinformatics
Abstract: Spatial transcriptomics (ST) has become a powerful tool for exploring the spatial organization of gene expression in tissues. Imaging-based methods, though offering superior spatial resolutions at the single-cell level, are limited in either the number of imaged genes or the sensitivity of gene detection. Existing approaches for enhancing ST rely on the similarity between ST cells and reference single-cell RNA sequencing (scRNA-seq) cells. In contrast, we introduce stDiff, which leverages relationships between gene expression abundance in scRNA-seq data to enhance ST. stDiff employs a conditional diffusion model, capturing gene expression abundance relationships in scRNA-seq data through two Markov processes: one introducing noise to transcriptomics data and the other denoising to recover them. The missing portion of ST is predicted by incorporating the original ST data into the denoising process. In our comprehensive performance evaluation across 16 datasets, utilizing multiple clustering and similarity metrics, stDiff stands out for its exceptional ability to preserve topological structures among cells, positioning itself as a robust solution for cell population identification. Moreover, stDiff’s enhancement outcomes closely mirror the actual ST data within the batch space. Across diverse spatial expression patterns, our model accurately reconstructs them, delineating distinct spatial boundaries. This highlights stDiff’s capability to unify the observed and predicted segments of ST data for subsequent analysis. We anticipate that stDiff, with its innovative approach, will contribute to advancing ST imputation methodologies.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
The paper asks how single-cell RNA sequencing (scRNA-seq) data can be used to enhance spatial transcriptomics (ST) data, filling in the gene expression that ST fails to measure. Specifically, it proposes **stDiff**, a method that uses a diffusion model to learn relationships among gene expression levels from scRNA-seq data and applies them to impute ST data.

### Core of the problem:

1. **Limitations of ST data**: ST techniques retain the spatial position of cells within tissues, but they are limited in either the sensitivity of gene detection or the number of detectable genes. Imaging-based methods reach single-cell resolution but usually detect only hundreds of pre-selected genes. Sequencing-based methods cover the whole transcriptome, but their spatial resolution is coarser than a single cell and their capture rate is limited.
2. **Deficiencies of existing methods**: Current methods for enhancing ST data mainly rely on the similarity between scRNA-seq cells and ST cells, and fill in the unmeasured genes by matching expression patterns of the shared genes. These methods face the following challenges:
   - The sparsity of scRNA-seq and ST data makes accurate alignment difficult.
   - Batch effects make it even harder to establish an accurate alignment through shared genes.
   - Using scRNA-seq as the reference for imputation easily introduces batch bias, so the predicted gene expression lies in a different batch space from the actual ST data, which complicates downstream analysis.
3. **Objective**: The paper aims to develop a new method, **stDiff**, that learns gene expression relationships in scRNA-seq data with a diffusion model and uses them to impute the missing genes in ST data, while avoiding batch bias and keeping the predictions as consistent as possible with the real ST data.

---

### Key points of the solution:

- **Application of the diffusion model**: stDiff adopts a conditional diffusion model that captures gene expression relationships in scRNA-seq data through two Markov processes (forward diffusion and reverse denoising).
  - Forward diffusion process: random noise is gradually added to the original expression data.
  - Reverse denoising process: the original data are gradually recovered through the learned conditional denoising distribution.
- **Avoiding the influence of batch effects**: stDiff improves robustness by perturbing the scRNA-seq data, so the model depends less on absolute expression values and more on the relationships among gene expression levels.
- **Completion strategy**: stDiff does not rely on the similarity between scRNA-seq and ST cells. Instead, it learns the regulatory rules present in scRNA-seq data and combines them with the information in the ST data itself. This is analogous to treating each scRNA-seq cell as a complete image and each ST cell as a masked version of that image; the task is to fill in the masked part.

---

### Summary:

The paper learns gene expression relationships from scRNA-seq data with a diffusion model and applies them to ST imputation, in order to overcome the limitations of existing methods in similarity calculation, batch effect handling, and prediction accuracy. With this approach, stDiff predicts the missing gene expression more accurately while retaining the topological structure among cells in the ST data, providing higher-quality input for subsequent biological analysis.
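The forward-noising and masked-completion ideas summarized above can be illustrated with a small NumPy sketch. It is not stDiff's actual implementation: the linear noise schedule, the DDPM-style mean update, the re-injection of measured genes at every reverse step (an inpainting-style conditioning scheme), and all function names are assumptions made for illustration. `toy_denoiser` is a hypothetical placeholder for a noise-prediction network that would be trained on scRNA-seq data.

```python
import numpy as np

def make_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed); returns betas and cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Forward process: sample x_t from q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def toy_denoiser(x_t, t):
    """Hypothetical stand-in for a trained noise predictor; always predicts zero noise."""
    return np.zeros_like(x_t)

def impute(x_obs, mask, denoiser, betas, alpha_bars, rng):
    """Reverse process conditioned on measured genes.

    mask == 1 marks genes measured in ST; mask == 0 marks genes to impute.
    """
    x = rng.standard_normal(x_obs.shape)  # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        # Overwrite measured genes with the observed values noised to level t.
        x_known, _ = forward_noise(x_obs, t, alpha_bars, rng)
        x = mask * x_known + (1 - mask) * x
        # DDPM-style mean update from the predicted noise.
        eps_hat = denoiser(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(1.0 - betas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    # Measured genes are returned unchanged; only masked genes are generated.
    return mask * x_obs + (1 - mask) * x

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x_obs = np.array([[1.0, 2.0, 0.0, 0.0]])  # one cell, last two genes unmeasured
mask = np.array([[1.0, 1.0, 0.0, 0.0]])
x_hat = impute(x_obs, mask, toy_denoiser, betas, alpha_bars, rng)
```

With the placeholder denoiser the imputed values carry no biological meaning; the sketch only shows where the observed ST values enter the denoising loop and that the measured genes pass through unchanged.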