Abstract: Cell clustering based on single-cell RNA sequencing (scRNA-seq) is critical for revealing cell heterogeneity and identifying new cell subtypes, but it remains challenging. Due to the high noise, sparsity, and poor annotation of scRNA-seq data, existing state-of-the-art cell clustering methods usually ignore gene functions and gene interactions. In this study, we propose a feature extraction method, named FEGFS, that analyzes scRNA-seq data by taking advantage of known gene functions. Specifically, we first derive functional gene sets based on Gene Ontology (GO) terms and reduce their redundancy by semantic similarity analysis and gene repetitive rate reduction. Then, we apply kernel principal component analysis to select features on each non-redundant functional gene set, and we combine the features selected for each functional gene set for subsequent clustering analysis. To test the performance of FEGFS, we apply agglomerative hierarchical clustering to the FEGFS features and compare it with seven state-of-the-art clustering methods on six real scRNA-seq datasets. For small datasets like Pollen and Goolam, FEGFS outperforms all methods on all four evaluation metrics: adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity score (HOM), and completeness score (COM). For example, the ARIs of FEGFS are 0.955 and 0.910, respectively, on Pollen and Goolam, and those of the second-best method are only 0.938 and 0.910, respectively. For large datasets, FEGFS also outperforms most methods. For example, the ARIs of FEGFS are 0.781 on both Klein and Zeisel, which are higher than those of all other methods except SC3, whose ARIs are slightly higher (0.798 and 0.807, respectively). Moreover, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation and in inferring cell lineage trajectories. As an application, taking glioma as an example, we demonstrate that our clustering method can identify important cell clusters related to glioma and infer key marker genes related to these cell clusters.
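The feature extraction and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes scikit-learn's `KernelPCA`, `AgglomerativeClustering`, and clustering metrics, uses an RBF kernel, two components per gene set, and default Ward linkage (none of which the abstract specifies), and runs on synthetic data with two made-up gene sets instead of GO-derived ones. The function names `extract_features` and `cluster_and_score` are hypothetical, and the redundancy-reduction step (semantic similarity and gene repetitive rate) is omitted.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import KernelPCA
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    normalized_mutual_info_score,
)


def extract_features(expr, gene_sets, n_components=2, kernel="rbf"):
    """Run kernel PCA on each gene set and concatenate the components.

    expr: (n_cells, n_genes) array of normalized expression values.
    gene_sets: dict mapping a gene-set name to a list of column indices.
    The kernel and n_components are assumptions, not values from the paper.
    """
    blocks = []
    for cols in gene_sets.values():
        sub = expr[:, cols]  # expression restricted to one functional gene set
        kpca = KernelPCA(n_components=n_components, kernel=kernel)
        blocks.append(kpca.fit_transform(sub))
    # One feature block per gene set, stacked side by side.
    return np.hstack(blocks)


def cluster_and_score(features, true_labels, n_clusters):
    """Agglomerative clustering followed by the four metrics named above."""
    pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(features)
    scores = {
        "ARI": adjusted_rand_score(true_labels, pred),
        "NMI": normalized_mutual_info_score(true_labels, pred),
        "HOM": homogeneity_score(true_labels, pred),
        "COM": completeness_score(true_labels, pred),
    }
    return pred, scores


# Synthetic toy data: 60 cells, 12 genes, two cell groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
counts = rng.poisson(lam=1.0, size=(60, 12)).astype(float)
counts[labels == 1, :6] += 20.0  # group 1 over-expresses the first six genes
expr = np.log1p(counts)  # simple log normalization

# Hypothetical gene sets (placeholders for GO-derived functional gene sets).
gene_sets = {"set_a": list(range(0, 6)), "set_b": list(range(6, 12))}

features = extract_features(expr, gene_sets)
pred, scores = cluster_and_score(features, labels, n_clusters=2)
print(features.shape, scores)
```

Running kernel PCA separately on each gene set keeps every feature block tied to one functional category, which is the property the abstract relies on. In a real analysis the gene sets would come from GO annotations after redundancy reduction, and the number of clusters would be chosen per dataset.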

Gene selection for single cell RNA-seq data via fuzzy rough iterative computation model

Gene selection and clustering of single-cell data based on Fisher score and genetic algorithm

Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization

RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest

A Cancer Gene Selection Algorithm Based on the K-S Test and CFS

Interpretable scRNA-seq Analysis with Intelligent Gene Selection

scTSSR: gene expression recovery for single-cell RNA sequencing using two-side sparse self-representation

Accurate and interpretable gene expression imputation on scRNA-seq data using IGSimpute

scCGImpute: An Imputation Method for Single-Cell RNA Sequencing Data Based on Similarities between Cells and Relationships among Genes

Optimal Gene Filtering for Single-Cell Data (Ogfsc)-a Gene Filtering Algorithm for Single-Cell RNA-seq Data

A Novel Single-Cell RNA Sequencing Data Feature Extraction Method Based on Gene Function Analysis and Its Applications in Glioma Study

Feature Genes Selection Using Fuzzy Rough Uncertainty Metric for Tumor Diagnosis

scGIR: deciphering cellular heterogeneity via gene ranking in single-cell weighted gene correlation networks

A Novel Approach for Single Gene Selection Using Clustering and Dimensionality Reduction

Highly Regional Genes: graph-based gene selection for single-cell RNA-seq data

CellBRF: a feature selection method for single-cell clustering using cell balance and random forest

The fuzzy gene filter: A classifier performance assesment

Feature Genes Selection Based on Fuzzy Neighborhood Conditional Entropy

FEED: a feature selection method based on gene expression decomposition for single cell clustering

Optimal marker gene selection for cell type discrimination in single cell analyses

Gene selection for optimal prediction of cell position in tissues from single-cell transcriptomics data