scASDC: Attention Enhanced Structural Deep Clustering for Single-cell RNA-seq Data

Wenwen Min, Zhen Wang, Fangfang Zhu, Taosheng Xu, Shunfang Wang
DOI: https://doi.org/10.48550/arXiv.2408.05258
2024-08-09
Abstract: Single-cell RNA sequencing (scRNA-seq) data analysis is pivotal for understanding cellular heterogeneity. However, the high sparsity and complex noise patterns inherent in scRNA-seq data present significant challenges for traditional clustering methods. To address these issues, we propose a deep clustering method, Attention-Enhanced Structural Deep Embedding Graph Clustering (scASDC), which integrates multiple advanced modules to improve clustering accuracy and robustness. Our approach employs a multi-layer graph convolutional network (GCN) to capture high-order structural relationships between cells, termed as the graph autoencoder module. To mitigate the oversmoothing issue in GCNs, we introduce a ZINB-based autoencoder module that extracts content information from the data and learns latent representations of gene expression. These modules are further integrated through an attention fusion mechanism, ensuring effective combination of gene expression and structural information at each layer of the GCN. Additionally, a self-supervised learning module is incorporated to enhance the robustness of the learned embeddings. Extensive experiments demonstrate that scASDC outperforms existing state-of-the-art methods, providing a robust and effective solution for single-cell clustering tasks. Our method paves the way for more accurate and meaningful analysis of single-cell RNA sequencing data, contributing to better understanding of cellular heterogeneity and biological processes. All code and public datasets used in this paper are available at https://github.com/wenwenmin/scASDC and https://zenodo.org/records/12814320.
Genomics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses key problems in single-cell RNA sequencing (scRNA-seq) data analysis, especially the challenges that high sparsity and complex noise patterns pose for traditional clustering methods. Specifically:

1. **High Sparsity and Complex Noise Patterns**: scRNA-seq data is typically highly sparse (most gene expression values are zero) and carries complex noise patterns, which traditional clustering methods (such as k-means and hierarchical clustering) struggle to handle.
2. **Limitations of Relying Only on Gene Expression Information**: Existing deep-learning clustering algorithms mainly focus on gene expression information and ignore the structural information between cells. Relying on this single information source can lead to unsatisfactory clustering results.
3. **Over-smoothing Problem in GCN**: Although methods based on graph convolutional networks (GCN) can capture the structural information between cells, they are prone to over-smoothing when dealing with large-scale data, which loses key features embedded in the gene expression matrix.

To address these problems, the paper proposes a method named Attention-Enhanced Structural Deep Embedding Graph Clustering (scASDC). The method improves single-cell clustering through the following innovations:

- **Multi-layer GCN**: Uses a multi-layer GCN to capture high-order structural relationships between cells.
- **ZINB-based Autoencoder**: Introduces an autoencoder module based on the zero-inflated negative binomial (ZINB) distribution to extract content information from the data and learn a latent representation of gene expression, which alleviates the over-smoothing problem in the GCN.
- **Attention Fusion Mechanism**: Combines gene expression information and structural information at each layer of the GCN through an attention fusion mechanism.
- **Self-supervised Learning Module**: Adds a self-supervised learning module to enhance the robustness and accuracy of the learned embeddings.

In summary, the paper aims to improve clustering performance on single-cell RNA sequencing data by combining structural information with gene expression information through attention fusion and self-supervised learning, so as to better understand cell heterogeneity and biological processes.
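The ZINB-based autoencoder mentioned above rests on the zero-inflated negative binomial likelihood, which models the excess zeros in scRNA-seq counts. The following is a minimal sketch of that likelihood in plain Python, not the authors' implementation. The function names and the parameter names `mu` (mean), `theta` (dispersion), and `pi` (zero-inflation probability) are assumptions chosen for illustration. In an autoencoder these parameters would be predicted by the decoder, which is omitted here.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    With probability pi (0 < pi < 1) the count is a structural zero;
    otherwise it is drawn from the negative binomial.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from the inflation component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_nll(counts, mu, theta, pi):
    """Mean negative log-likelihood of a list of counts.

    Illustrative simplification: one shared (mu, theta, pi) for all counts.
    """
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts) / len(counts)


if __name__ == "__main__":
    sparse_counts = [0, 0, 0, 1, 0, 4, 0, 0, 2, 0]
    print(zinb_nll(sparse_counts, mu=2.0, theta=1.5, pi=0.3))
```

Minimizing a loss of this form as the reconstruction objective encourages the decoder to explain the many zeros through `pi` rather than by distorting the mean `mu`, which is the usual motivation for choosing ZINB over a mean-squared-error loss on sparse count data.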