Abstract: Key areas of interest in the analysis of single-cell RNA sequencing (scRNA-seq) data include identifying known and novel cell types, representing cells, predicting cell fates, classifying tumor types, and studying cellular heterogeneity. Because scRNA-seq data are high-dimensional, cluster identification presents several difficulties. In this paper, we introduce a new framework that combines matrix imputation, minimum redundancy maximum relevance (MRMR) feature selection, and shrinkage clustering to discover gene signatures from scRNA-seq data. First, we pre-filtered the data to identify "drop-out" values and imputed only those identified values. Next, we applied MRMR feature selection to the imputed data and retained the top 100 features, ranked by their MRMR optimization scores, for downstream analysis. We then applied shrinkage clustering to the selected feature matrix to identify cell clusters using a global optimization approach. Finally, we used the Limma-Voom R tool, with voom normalization and an empirical Bayes test, to detect differentially expressed features at a false discovery rate (FDR) < 0.001. In addition, we performed KEGG pathway and gene ontology enrichment analysis of the identified biomarkers using DAVID 6.8. We also conducted miRNA target detection for the top gene markers and performed miRNA target gene interaction network analysis using the Cytoscape online tool. We then compared our 100 detected markers with our previously detected top 100 cluster-specific markers, ranked by FDR, from the latest published article, and found three markers in common (Cyp2b10, Mt1, and Alpi) along with 97 novel markers. Gene Set Enrichment Analysis (GSEA) of the two marker sets also yielded similar outcomes. We further performed a comparative study against another published method, which showed that our model detects more significant markers than that method. To assess the efficiency of our framework, we applied it to a second dataset and identified 20 strongly significant up-regulated markers. We also compared different imputation methods and included an ablation study showing that every key phase of our framework is essential. In summary, our proposed integrated framework efficiently discovers strong differentially expressed gene signatures as well as up-regulated markers in single-cell RNA sequencing data.
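
To make the MRMR step named in the abstract concrete, the following is a minimal sketch of greedy MRMR selection in its difference form (relevance minus mean redundancy), written in plain Python. The abstract does not say which MRMR variant, which discretization, or which target labels the authors used, so everything here is an assumption for illustration: the function names `mutual_information` and `mrmr_select`, the toy gene names, the pre-discretized expression values, and the use of known class labels as the relevance target.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum over observed (a, b) pairs: p(a,b) * log(p(a,b) / (p(a) * p(b)))
    return sum(
        (c / n) * log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k features maximizing relevance minus mean redundancy.

    features: dict mapping a gene name to its discretized values, one per cell
    labels:   class label per cell (assumed known here; an illustrative choice)
    """
    relevance = {g: mutual_information(v, labels) for g, v in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(g):
            if not selected:
                return relevance[g]
            redundancy = sum(
                mutual_information(features[g], features[s]) for s in selected
            ) / len(selected)
            return relevance[g] - redundancy

        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 cells, 4 classes, 4 binarized genes.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy = {
    "geneA": [0, 0, 0, 0, 1, 1, 1, 1],  # informative
    "geneB": [0, 0, 0, 0, 1, 1, 1, 1],  # exact duplicate of geneA (redundant)
    "geneC": [0, 0, 1, 1, 0, 0, 1, 1],  # informative and independent of geneA
    "geneD": [0, 1, 0, 1, 0, 1, 0, 1],  # uninformative about the labels
}
selected = mrmr_select(toy, labels, k=2)
print(selected)
```

On this toy input, geneA is picked first by tie-breaking, and geneC is picked second because the duplicate geneB is penalized for its redundancy with geneA. A real analysis would select the top 100 features from an imputed expression matrix, as the abstract describes, and would need a principled discretization or a continuous mutual-information estimator.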

GeneCover: A Combinatorial Approach for Label-free Marker Gene Selection

Optimal marker gene selection for cell type discrimination in single cell analyses

Gene panel selection for targeted spatial transcriptomics

A comparison of marker gene selection methods for single-cell RNA sequencing data

CellCover Captures Neural Stem Cell Progression in Mammalian Neocortical Development

MarkerMap: nonlinear marker selection for single-cell studies

SciGeneX: Enhancing transcriptional analysis through gene module detection in single-cell and spatial transcriptomics data

Hierarchical marker genes selection in scRNA-seq analysis

AutoGeneS: Automatic Gene Selection Using Multi-Objective Optimization for RNA-seq Deconvolution

CIARA: a Cluster-Independent Algorithm for Identifying Markers of Rare Cell Types from Single-Cell Sequencing Data

Spanve: an Statistical Method to Detect Clustering-friendly Spatially Variable Genes in Large-scale Spatial Transcriptomics Data

Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization

Expanding the coverage of spatial proteomics: a machine learning approach

CORTADO: Hill Climbing Optimization for Cell-Type Specific Marker Gene Discovery

Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing

SpatialMap: Spatial Mapping of Unmeasured Gene Expression Profiles in Spatial Transcriptomic Data Using Generalized Linear Spatial Models

scPanel: a tool for automatic identification of sparse gene panels for generalizable patient classification using scRNA-seq datasets

CellMapper: Rapid and Accurate Inference of Gene Expression in Difficult-to-isolate Cell Types

Stabilized marker gene identification and functional annotation from single-cell transcriptomic data

SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics