HyGAnno: hybrid graph neural network–based cell type annotation for single-cell ATAC sequencing data

Weihang Zhang, Yang Cui, Bowen Liu, Martin Loza, Sung-Joon Park, Kenta Nakai
DOI: https://doi.org/10.1093/bib/bbae152
IF: 9.5
2024-04-08
Briefings in Bioinformatics
Abstract: Reliable cell type annotations are crucial for investigating cellular heterogeneity in single-cell omics data. Although various computational approaches have been proposed for single-cell RNA sequencing (scRNA-seq) annotation, high-quality cell labels are still lacking in single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) data, because of extreme sparsity and inconsistent chromatin accessibility between datasets. Here, we present a novel automated cell annotation method that transfers cell type information from a well-labeled scRNA-seq reference to an unlabeled scATAC-seq target, via a parallel graph neural network, in a semi-supervised manner. Unlike existing methods that utilize only gene expression or gene activity features, HyGAnno leverages genome-wide accessibility peak features to facilitate the training process. In addition, HyGAnno reconstructs a reference–target cell graph to detect cells with low prediction reliability, according to their specific graph connectivity patterns. HyGAnno was assessed across various datasets, showcasing its strengths in precise cell annotation, generating interpretable cell embeddings, robustness to noisy reference data and adaptability to tumor tissues.
biochemical research methods, mathematical & computational biology
### What problem does this paper attempt to solve?

This paper aims to solve the problem of insufficient cell-type annotation in single-cell ATAC-sequencing (scATAC-seq) data. Specifically, there are already multiple computational methods for cell-type annotation in single-cell RNA-sequencing (scRNA-seq) data, but high-quality cell labels are still lacking in scATAC-seq data. This is mainly due to:

1. **Data Sparsity**: scATAC-seq data is very sparse, and only about 10% of the peaks in each cell can be detected.
2. **Inconsistent Chromatin Accessibility**: There are significant differences in chromatin accessibility between different datasets.
3. **Different Feature Sets**: The feature set of scATAC-seq is open chromatin regions (peaks), while the feature set of scRNA-seq is gene expression, and there are large differences between them.

To solve these problems, the authors propose a new hybrid model based on graph neural networks, HyGAnno, which is used to transfer cell-type information from scRNA-seq reference data to unannotated scATAC-seq target data. The main contributions of HyGAnno are:

- **Combining Gene-level and Peak-level Features**: Utilize gene expression features and chromatin accessibility peak features simultaneously for training to ensure accurate cell annotation and provide a potential method for detecting cell-type-specific peaks.
- **Semi-supervised Learning Framework**: Transfer cell-type information in a semi-supervised manner through parallel graph neural networks.
- **Prediction Reliability Assessment**: Detect cells with low prediction reliability by reconstructing the reference-target cell graph.

### How HyGAnno Works

The workflow of HyGAnno is as follows:

1. **Graph Construction and Anchor Cell Detection**:
   - Use PCA and LSI to perform dimensionality reduction on scRNA-seq and scATAC-seq data respectively.
   - Construct an RNA-cell graph and an ATAC-cell graph.
   - Use CCA to project the standardized gene-expression matrix and gene-activity matrix into a shared space and detect pairs of cells from different modalities as anchor cells.
2. **Graph Embedding and Label Transfer**:
   - Use parallel variational graph auto-encoders (VGAE) to embed the hybrid graph and the ATAC graph.
   - Transfer the labels of RNA cells to ATAC anchor cells and spread label knowledge.
3. **Graph Reconstruction and Prediction Reliability Assessment**:
   - Reconstruct a new graph by combining the hybrid graph and the ATAC graph to better describe the correlation between RNA cells and ATAC cells.
   - Calculate edge density and weight to assess the reliability of the prediction.
4. **Loss Function Optimization**:
   - Use the cross-entropy loss function to optimize the label prediction of reference cells.
   - Combine other loss functions (such as graph-reconstruction loss, alignment loss, etc.) for joint training.

### Experimental Results

HyGAnno outperforms existing benchmark methods on multiple datasets, especially showing higher prediction resolution when dealing with unbalanced datasets. In addition, HyGAnno can also effectively identify cell-type-specific peaks, further improving the accuracy and interpretability of cell annotation. Through these improvements, HyGAnno not only improves the accuracy of cell-type annotation in scATAC-seq data but also provides a powerful tool for studying cell-type-specific gene-regulation mechanisms.
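
To make the graph construction and anchor steps of the workflow more concrete, here is a minimal sketch in plain Python. It is not HyGAnno's implementation. It assumes the cells have already been embedded (for example by PCA, LSI, or CCA), and it assumes that anchors are pairs of cells that are mutual nearest neighbours in the shared space, a rule the summary above does not specify. The function names `knn_graph`, `find_anchors`, and `transfer_labels` and the toy coordinates are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_indices(queries, references, k, skip_self=False):
    """For each query cell, return the indices of its k nearest reference cells."""
    result = []
    for qi, q in enumerate(queries):
        candidates = [
            ri for ri in range(len(references)) if not (skip_self and ri == qi)
        ]
        candidates.sort(key=lambda ri: euclidean(q, references[ri]))
        result.append(candidates[:k])
    return result


def knn_graph(embedding, k):
    """Undirected kNN cell graph, returned as a set of (i, j) edges with i < j."""
    edges = set()
    neighbours = knn_indices(embedding, embedding, k, skip_self=True)
    for i, nbrs in enumerate(neighbours):
        for j in nbrs:
            edges.add((min(i, j), max(i, j)))
    return edges


def find_anchors(rna_shared, atac_shared, k):
    """Anchor pairs (rna_idx, atac_idx) that are mutual k-nearest neighbours.

    Assumption: the mutual-nearest-neighbour rule is illustrative only.
    """
    rna_to_atac = knn_indices(rna_shared, atac_shared, k)
    atac_to_rna = knn_indices(atac_shared, rna_shared, k)
    return sorted(
        (r, a)
        for r, nbrs in enumerate(rna_to_atac)
        for a in nbrs
        if r in atac_to_rna[a]
    )


def transfer_labels(anchors, rna_labels):
    """Give each ATAC anchor cell the majority label of its RNA anchor partners."""
    votes = {}
    for r, a in anchors:
        votes.setdefault(a, []).append(rna_labels[r])
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return {a: max(sorted(set(v)), key=v.count) for a, v in votes.items()}


# Toy data: coordinates of cells in a shared 2-D space (made up for illustration).
rna_shared = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
rna_labels = ["T cell", "T cell", "B cell"]
atac_shared = [(0.02, 0.0), (5.1, 4.9), (9.0, 0.0)]

atac_graph = knn_graph(atac_shared, k=1)
anchors = find_anchors(rna_shared, atac_shared, k=1)
anchor_labels = transfer_labels(anchors, rna_labels)

print(atac_graph)
print(anchors)
print(anchor_labels)
```

In this toy example the third ATAC cell has no anchor partner and therefore stays unlabeled. In the workflow described above, such cells would receive labels only later, when label knowledge is spread through the graph embedding. For real datasets, the brute-force distance search would normally be replaced by a library nearest-neighbour routine.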