DUBStepR: correlation-based feature selection for clustering single-cell RNA sequencing data

Bobby Ranjan, Wenjie Sun, Jinyu Park, Kunal Mishra, Ronald Xie, Fatemeh Alipour, Vipul Singhal, Florian Schmidt, Ignasius Joanito, Nirmala Arul Rayan, Michelle Gek Liang Lim, Shyam Prabhakar
DOI: https://doi.org/10.1101/2020.10.07.330563
2020-10-08
Abstract: Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single cell clustering pipelines. However, we found that the performance of existing feature selection methods was inconsistent across benchmark datasets, and occasionally even worse than without feature selection. Moreover, existing methods ignored information contained in gene-gene correlations. We therefore developed DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. In a published scRNA-seq dataset from sorted monocytes, DUBStepR sensitively detected a rare and previously invisible population of contaminating basophils. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.