Abstract: Single-cell RNA sequencing (scRNA-seq) has become a popular experimental method to study variation of gene expression within a population of cells. However, obtaining an accurate picture of the diversity of distinct gene expression states present in a given dataset is highly challenging because of the sparsity of scRNA-seq data and its inhomogeneous measurement noise properties. Although a vast number of different methods are applied in the literature for clustering cells into subsets with 'similar' expression profiles, these methods generally lack rigorously specified objectives, involve multiple complex layers of normalization, filtering, feature selection, and dimensionality reduction, employ ad hoc measures of distance or similarity between cells, often ignore the known measurement noise properties of scRNA-seq measurements, and include a large number of tunable parameters. Consequently, it is virtually impossible to assign concrete biophysical meaning to the clusterings that result from these methods. Here we address the following problem: given the raw unique molecular identifier (UMI) counts of an scRNA-seq dataset, partition the cells into subsets such that the gene expression states of the cells in each subset are statistically indistinguishable, and each subset corresponds to a distinct gene expression state. That is, we aim to partition cells so as to maximally reduce the complexity of the dataset without removing any of its meaningful structure. We show that, given the known measurement noise structure of scRNA-seq data, this problem is mathematically well-defined, and we derive its unique solution from first principles. We have implemented this solution in a tool called Cellstates, which operates directly on the raw data and automatically determines the optimal partition and cluster number, with zero tunable parameters. We show that, on synthetic datasets, Cellstates almost perfectly recovers optimal partitions.
On real data, Cellstates robustly identifies subtle substructure within groups of cells that are traditionally annotated as a common cell type. Moreover, we show that the diversity of gene expression states that Cellstates identifies depends systematically on the tissue of origin and not on technical features of the experiments, such as the total number of cells and the total UMI count per cell. In addition to the Cellstates tool, we provide a small software toolbox to place the identified cellstates into a hierarchical tree of higher-order clusters, to identify the most important differentially expressed genes at each branch of this hierarchy, and to visualize these results.

Single-cell RNA sequencing (scRNA-seq) promises to offer fundamental new insights into how gene expression is regulated, but analyzing such data is very challenging because of its sparsity and heterogeneous measurement noise. For example, one common component of scRNA-seq analysis procedures is the clustering of groups of cells with similar gene expression, but current methods use ad hoc schemes that involve multiple layers of complex transformations of the data, making their results hard to interpret. Here we present a method that clusters cells so as to maximally reduce the complexity of the data without removing any of its meaningful structure, by grouping only cells whose expression profiles are statistically indistinguishable. Importantly, we show that, given the known measurement noise structure of scRNA-seq data, this problem has a unique solution, which we derive from first principles. We implemented this method in a tool called Cellstates, which operates directly on the raw data with zero tunable parameters. We validate the power of Cellstates by showing that it performs almost perfectly on synthetic data and outcompetes existing methods on real data consisting of known mixtures of cells.
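
To make the notion of "statistically indistinguishable" concrete, the following is a minimal sketch, not the Cellstates implementation or its API. It assumes, purely for illustration, that UMI counts follow multinomial sampling noise and that each cluster's expression state has a symmetric Dirichlet prior; the function names `cluster_log_evidence` and `merge_log_ratio` and the `prior` parameter are hypothetical. Under these assumptions, two cells are better described by a single shared expression state when pooling their counts increases the marginal likelihood.

```python
from math import lgamma


def cluster_log_evidence(pooled_counts, prior=1.0):
    """Log marginal likelihood of one cluster's pooled UMI counts.

    Assumes multinomial sampling with a symmetric Dirichlet(prior) on the
    cluster's expression fractions. Multinomial coefficients are omitted
    because they are identical for every partition of the same cells.
    """
    total_prior = prior * len(pooled_counts)
    total_count = sum(pooled_counts)
    log_ev = lgamma(total_prior) - lgamma(total_prior + total_count)
    for n in pooled_counts:
        log_ev += lgamma(prior + n) - lgamma(prior)
    return log_ev


def merge_log_ratio(cell_a, cell_b, prior=1.0):
    """Log evidence of one shared state minus that of two separate states."""
    pooled = [a + b for a, b in zip(cell_a, cell_b)]
    return (cluster_log_evidence(pooled, prior)
            - cluster_log_evidence(cell_a, prior)
            - cluster_log_evidence(cell_b, prior))


# Toy UMI counts over three genes (made-up numbers for illustration).
similar = merge_log_ratio([10, 0, 5], [12, 1, 4])    # positive: favors merging
different = merge_log_ratio([10, 0, 5], [0, 11, 4])  # negative: favors splitting
```

In this sketch a positive ratio means the data give no support for distinguishing the two cells, while a negative ratio means merging them would discard structure. A full method would have to search over partitions of all cells rather than compare a single pair.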

Identifying Cell Subpopulations and Their Genetic Drivers from Single-Cell RNA-Seq Data Using a Biclustering Approach

Joint CC and Bimax: A Biclustering Method for Single-Cell RNA-Seq Data Analysis

An Effective Biclustering-Based Framework for Identifying Cell Subpopulations From scRNA-seq Data

Analysis of Single-Cell RNA-seq Data by Clustering Approaches

Cell Type Differentiation Using Network Clustering Algorithms

QUBIC2: A novel biclustering algorithm for large-scale bulk RNA-sequencing and single-cell RNA-sequencing data analysis

Spectral clustering of single cells using Siamese neural network combined with improved affinity matrix

CellBIC: bimodality-based top-down clustering of single-cell RNA sequencing data reveals hierarchical structure of the cell type

Robust scRNA-seq Cell Types Identification by Self-Guided Deep Clustering Network

CAbiNet: joint clustering and visualization of cells and genes for single-cell transcriptomics

A critical assessment of clustering algorithms to improve cell clustering and identification in single-cell transcriptome study

QUBIC2: a novel and robust biclustering algorithm for analyses and interpretation of large-scale RNA-Seq data

Identifying cell states in single-cell RNA-seq data at statistically maximal resolution

Single-cell RNA-seq clustering: datasets, models, and algorithms

Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering

Statistical significance of cluster membership for determination of cell identities in single cell genomics

A Hybrid Deep Clustering Approach for Robust Cell Type Profiling Using Single-cell RNA-seq Data

Identification of cell types from single cell data using stable clustering

A Cell Marker-Based Clustering Strategy (cmcluster) for Precise Cell Type Identification of scRNA-seq Data

A new and effective two-step clustering approach for single cell RNA sequencing data