Distortion-free PCA on sample space for highly variable gene detection from single-cell RNA-seq data

Momo Matsuda, Yasunori Futamura, Xiucai Ye, Tetsuya Sakurai
DOI: https://doi.org/10.1007/s11704-022-1172-z
IF: 2.6688
2022-08-09
Frontiers of Computer Science
Abstract: Single-cell RNA-seq (scRNA-seq) allows the analysis of gene expression in each cell, which enables the detection of highly variable genes (HVGs) that contribute to cell-to-cell variation within a homogeneous cell population. HVG detection is a necessary step before clustering analysis because it improves the clustering result. scRNA-seq data include some genes that are expressed with a certain probability in all cells, which makes the cells indistinguishable. These genes are referred to as background noise. To remove the background noise and select the informative genes for clustering analysis, we propose an effective HVG detection method based on principal component analysis (PCA). The proposed method uses PCA to evaluate the genes (features) on the sample space. The distortion-free principal components are selected, and the distance from the origin to each gene in the space they span is used as the weight of that gene. The genes with the greatest distances from the origin are selected for clustering analysis. Experimental results on both synthetic and gene expression datasets show that the proposed method not only removes the background noise and selects the informative genes for clustering analysis, but also outperforms existing HVG detection methods.
computer science, information systems, theory & methods, software engineering
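
The weighting step described in the abstract (project genes with PCA, measure each gene's distance from the origin on selected components, keep the farthest genes) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a cells-by-genes matrix, uses a plain SVD-based PCA, and takes the indices of the components to keep as a caller-supplied argument, because the abstract does not state how the distortion-free components are identified. The function names `gene_weights` and `select_top_genes` are hypothetical.

```python
import numpy as np


def gene_weights(X, components):
    """Weight each gene by its distance from the origin on chosen PCs.

    X          : array of shape (n_cells, n_genes), expression values.
    components : indices of the principal components to use. How to pick
                 the "distortion-free" ones is NOT specified in the
                 abstract, so the caller must supply them (assumption).
    """
    X = np.asarray(X, dtype=float)
    # Center each gene (column) across cells.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Thin SVD: Xc = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Row j holds gene j's coordinates on every principal component,
    # scaled by the corresponding singular value.
    coords = Vt.T * s
    # Euclidean distance from the origin, restricted to the chosen PCs.
    return np.linalg.norm(coords[:, list(components)], axis=1)


def select_top_genes(X, components, k):
    """Return the indices of the k genes with the largest weights."""
    w = gene_weights(X, components)
    return np.argsort(w)[::-1][:k]
```

A call such as `select_top_genes(X, components=[0, 1], k=500)` would return the column indices of the 500 highest-weighted genes. The choice of components and of `k` here is arbitrary and would need to follow the criteria given in the paper itself.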