Abstract: Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell–cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because the data are sparse and involve a large number of genes. Dimensionality reduction and feature selection are therefore important for removing spurious signals and enhancing downstream analysis. Traditional principal component analysis (PCA), a workhorse of dimensionality reduction, cannot capture the geometric structure embedded in the data, and previous graph Laplacian regularizations are limited to analyzing a single scale. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with $L_{2,1}$ norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, we introduce kNN-tPCA, in which the filtration is obtained by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of the proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and show that our methods outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Non-Negative Matrix Factorization (NMF), by significant margins.
For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification as measured by the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering as measured by the ARI metric.
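
The abstract names two ingredients without giving formulas: a filtration obtained by increasing the number of neighbors in a kNN network, and an $L_{2,1}$ norm. The Python sketch below illustrates both under stated assumptions and is not the authors' implementation. The function names, the unnormalized graph Laplacian, the symmetrization rule, and the weighted sum across filtration steps are all hypothetical choices made for illustration. It also covers only ordinary graph Laplacians at each step, not the full persistent Laplacian construction.

```python
import numpy as np


def l21_norm(M):
    """L2,1 norm: sum of the Euclidean norms of the rows of M."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


def knn_laplacian(X, k):
    """Unnormalized Laplacian of a symmetrized kNN graph on the rows of X.

    Assumption: an edge i-j exists if j is among the k nearest
    neighbors of i, or i is among the k nearest neighbors of j.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances (fine for small n).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    A = np.maximum(A, A.T)  # symmetrize the adjacency matrix
    return np.diag(A.sum(axis=1)) - A


def knn_filtration_laplacian(X, ks, weights):
    """Weighted sum of kNN graph Laplacians over increasing k.

    Assumption: the steps of the filtration are combined by a plain
    weighted sum; the weights are hypothetical hyper-parameters.
    """
    return sum(w * knn_laplacian(X, k) for w, k in zip(weights, ks))


# Toy data standing in for a cells-by-genes matrix (random, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
L = knn_filtration_laplacian(X, ks=[2, 4, 6], weights=[1.0, 0.5, 0.25])
```

Because the neighbor count is an integer with a natural upper bound, the filtration steps in this sketch are enumerated directly rather than chosen from a continuous range of distance thresholds, which is one way to read the abstract's remark on hyper-parameter tuning.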

nPCA: a linear dimensionality reduction method using a multilayer perceptron

K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis

Nonlinear Dimensionality Reduction for Discriminative Analytics of Multiple Datasets

GraphPCA: a fast and interpretable dimension reduction algorithm for spatial transcriptomics data

Supervised Discriminative Sparse PCA with Adaptive Neighbors for Dimensionality Reduction

Supervised Linear Dimension-Reduction Methods: Review, Extensions, and Comparisons

PCA-Boosted Autoencoders for Nonlinear Dimensionality Reduction in Low Data Regimes

Deep Residual Principal Component Analysis As Feature Engineering for Industrial Data Analytics

PLPCA: Persistent Laplacian-Enhanced PCA for Microarray Data Analysis

Intrinsic dimension estimation of data by principal component analysis

Adaptive dimensionality reduction for neural network-based online principal component analysis

Neighborhood Preserving Projections (NPP): A Novel Linear Dimension Reduction Method

Nonlinear Functional Principal Component Analysis Using Neural Networks

Sparse Unsupervised Dimensionality Reduction Algorithms

Capturing the Denoising Effect of PCA via Compression Ratio

Improved Dimensionality Reduction of various Datasets using Novel Multiplicative Factoring Principal Component Analysis (MPCA)

PCA-KL: a parametric dimensionality reduction approach for unsupervised metric learning

Nonlinear dimensionality reduction based visualization of single-cell RNA sequencing data

Demixed principal component analysis of neural population data