stDyer enables spatial domain clustering with dynamic graph embedding

Ke XU,Yu XU,Zirui Wang,Xin Maizie Zhou,Lu Zhang
DOI: https://doi.org/10.1101/2024.05.08.593252
2024-06-25
Abstract:Spatially resolved transcriptomics (SRT) data provide critical insights into gene expression patterns within tissue contexts, necessitating effective methods for identifying spatial domains. Traditional clustering techniques often overlook spatial information, leading to disjointed domains. Current computational approaches integrate spatial information but still face challenges in recognizing domain boundaries, scalability, and the need of independent clustering steps. We introduce stDyer, an end-to-end deep learning framework designed for spatial domain clustering in SRT data. stDyer combines a Gaussian Mixture Variational AutoEncoder (GMVAE) with graph attention networks (GATs) to simultaneously learn deep representations and perform clustering for units. A unique feature of stDyer is the dynamic graphs it adopts, which adaptively links units based on Gaussian Mixture assignments in the latent space, thereby improving spatial domain clustering and producing smoother domain boundaries. Additionally, stDyer's mini-batch neighbor sampling strategy facilitates scalability to large datasets and enables multi-GPU training. Benchmarking against state-of-the-art tools across various SRT technologies, stDyer demonstrates superior performance in spatial domain clustering, multi-slice analysis, and large-scale dataset handling.
Bioinformatics
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is the deficiencies in spatial domain clustering methods in existing Spatially Resolved Transcriptomics (SRT) data. Specifically, traditional clustering techniques often overlook spatial information, resulting in the identified spatial domains lacking spatial continuity. Although existing computational methods incorporate spatial information, they still face challenges in identifying domain boundaries, scalability, and the need for independent clustering steps. ### Summary of Main Problems: 1. **Limitations of Traditional Clustering Methods**: - Traditional clustering methods (such as K - means and Leiden algorithms) rely solely on gene expression data and ignore spatial information, leading to the identified spatial domains lacking spatial continuity. 2. **Challenges of Existing Methods**: - **Identification of Domain Boundaries**: Existing methods have difficulties in identifying domain boundaries. - **Scalability**: Most methods struggle to handle large - scale datasets. - **Independent Clustering Steps**: Many existing models require independent clustering steps, which may lead to sub - optimal performance. 3. **Application in Multi - slice and Large - scale Datasets**: - Existing tools face scalability challenges when applied to multi - slice and large - scale datasets. ### stDyer's Solutions: To address the above problems, the author introduced stDyer, an end - to - end deep - learning framework for spatial domain clustering of SRT data. The main features of stDyer include: - **Gaussian Mixture Variational AutoEncoder (GMVAE)**: Combined with Graph Attention Networks (GATs), it simultaneously learns deep representations and performs clustering. - **Dynamic Graphs**: Adaptively links units according to Gaussian mixture assignments, thereby improving spatial domain clustering and generating smoother domain boundaries. - **Mini - batch Neighbor Sampling Strategy**: Enhances the ability to handle large - scale datasets and supports multi - GPU training. Through these improvements, stDyer has demonstrated excellent performance on multiple SRT technology platforms, especially in spatial domain clustering, multi - slice analysis, and large - scale dataset processing. ### Example of Formula: In stDyer, GMVAE assumes that units are embedded in the latent space following a Gaussian Mixture Model (GMM). Its probability distribution can be expressed as: \[ p(\mathbf{z}|\theta)=\sum_{k = 1}^{K}\pi_k\mathcal{N}(\mathbf{z}|\mu_k,\Sigma_k) \] where: - \(\mathbf{z}\) is the latent variable, - \(\pi_k\) is the weight of the \(k\)-th Gaussian component, - \(\mathcal{N}(\mathbf{z}|\mu_k,\Sigma_k)\) is a Gaussian distribution with mean \(\mu_k\) and covariance matrix \(\Sigma_k\). The parameters \(\mu\) and \(\sigma\) of the GMM are estimated by maximizing the log - likelihood of the marginal distribution of all units. In conclusion, stDyer significantly improves the performance of spatial domain clustering by combining deep learning and dynamic graph structures, solving the key problems in existing methods.