Abstract: Single-cell RNA sequencing (scRNA-seq) has become a popular experimental method to study variation of gene expression within a population of cells. However, obtaining an accurate picture of the diversity of distinct gene expression states that are present in a given dataset is highly challenging because of the sparsity of scRNA-seq data and its inhomogeneous measurement noise properties. Although a vast number of different methods are applied in the literature for clustering cells into subsets with 'similar' expression profiles, these methods generally lack rigorously specified objectives, involve multiple complex layers of normalization, filtering, feature selection, and dimensionality reduction, employ ad hoc measures of distance or similarity between cells, often ignore the known measurement noise properties of scRNA-seq measurements, and include a large number of tunable parameters. Consequently, it is virtually impossible to assign concrete biophysical meaning to the clusterings that result from these methods. Here we address the following problem: given the raw unique molecular identifier (UMI) counts of an scRNA-seq dataset, partition the cells into subsets such that the gene expression states of the cells in each subset are statistically indistinguishable, and each subset corresponds to a distinct gene expression state. That is, we aim to partition cells so as to maximally reduce the complexity of the dataset without removing any of its meaningful structure. We show that, given the known measurement noise structure of scRNA-seq data, this problem is mathematically well-defined, and we derive its unique solution from first principles. We have implemented this solution in a tool called Cellstates, which operates directly on the raw data and automatically determines the optimal partition and cluster number, with zero tunable parameters. We show that, on synthetic datasets, Cellstates almost perfectly recovers optimal partitions. On real data, Cellstates robustly identifies subtle substructure within groups of cells that are traditionally annotated as a common cell type. Moreover, we show that the diversity of gene expression states that Cellstates identifies depends systematically on the tissue of origin and not on technical features of the experiments such as the total number of cells and the total UMI count per cell. In addition to the Cellstates tool, we provide a small software toolbox to place the identified cellstates into a hierarchical tree of higher-order clusters, to identify the most important differentially expressed genes at each branch of this hierarchy, and to visualize these results.

Single-cell RNA sequencing (scRNA-seq) promises fundamental new insights into how gene expression is regulated, but analyzing such data is very challenging because of its sparsity and the heterogeneity of its measurement noise. For example, one common component of scRNA-seq analysis procedures is the clustering of cells with similar gene expression, but current methods use ad hoc schemes that involve multiple layers of complex transformations of the data, making their results hard to interpret. Here we present a method that clusters cells so as to maximally reduce the complexity of the data without removing any of its meaningful structure, by grouping only cells whose expression profiles are statistically indistinguishable. Importantly, we show that, given the known measurement noise structure of scRNA-seq data, this problem has a unique solution, which we derive from first principles. We implemented this method in a tool called Cellstates, which operates directly on the raw data with zero tunable parameters. We validate the power of Cellstates by showing that it performs almost perfectly on synthetic data and outperforms existing methods on real data consisting of known mixtures of cells.

Identifying cell states in single-cell RNA-seq data at statistically maximal resolution
