Identifying cell states in single-cell RNA-seq data at statistically maximal resolution
Pascal Grobecker,Thomas Sakoparnig,Erik van Nimwegen
DOI: https://doi.org/10.1371/journal.pcbi.1012224
2024-07-13
PLoS Computational Biology
Abstract: Single-cell RNA sequencing (scRNA-seq) has become a popular experimental method to study variation of gene expression within a population of cells. However, obtaining an accurate picture of the diversity of distinct gene expression states that are present in a given dataset is highly challenging because of the sparsity of the scRNA-seq data and its inhomogeneous measurement noise properties. Although a vast number of different methods are applied in the literature for clustering cells into subsets with 'similar' expression profiles, these methods generally lack rigorously specified objectives, involve multiple complex layers of normalization, filtering, feature selection, and dimensionality reduction, employ ad hoc measures of distance or similarity between cells, often ignore the known measurement noise properties of scRNA-seq measurements, and include a large number of tunable parameters. Consequently, it is virtually impossible to assign concrete biophysical meaning to the clusterings that result from these methods. Here we address the following problem: Given raw unique molecule identifier (UMI) counts of an scRNA-seq dataset, partition the cells into subsets such that the gene expression states of the cells in each subset are statistically indistinguishable, and each subset corresponds to a distinct gene expression state. That is, we aim to partition cells so as to maximally reduce the complexity of the dataset without removing any of its meaningful structure. We show that, given the known measurement noise structure of scRNA-seq data, this problem is mathematically well-defined and derive its unique solution from first principles. We have implemented this solution in a tool called Cellstates, which operates directly on the raw data and automatically determines the optimal partition and cluster number, with zero tunable parameters. We show that, on synthetic datasets, Cellstates almost perfectly recovers optimal partitions.
On real data, Cellstates robustly identifies subtle substructure within groups of cells that are traditionally annotated as a common cell type. Moreover, we show that the diversity of gene expression states that Cellstates identifies systematically depends on the tissue of origin and not on technical features of the experiments such as the total number of cells and total UMI count per cell. In addition to the Cellstates tool, we also provide a small toolbox of software to place the identified cellstates into a hierarchical tree of higher-order clusters, to identify the most important differentially expressed genes at each branch of this hierarchy, and to visualize these results.
Single-cell RNA sequencing (scRNA-seq) promises to offer fundamental new insights into how gene expression is regulated, but analyzing such data is very challenging because of its sparsity and the heterogeneity of its measurement noise. For example, one common component of scRNA-seq analysis procedures is the clustering of groups of cells with similar gene expression, but current methods use complex ad hoc schemes that involve multiple layers of complex transformations of the data, making it hard to interpret their results. Here we present a method that clusters cells so as to maximally reduce the complexity of the data without removing any of its meaningful structure, by grouping only cells whose expression profiles are statistically indistinguishable. Importantly, we show that, given the known measurement noise structure of scRNA-seq data, this problem has a unique solution, which we derive from first principles. We implemented this method in a tool called Cellstates, which operates directly on the raw data with zero tunable parameters. We validate the power of Cellstates by showing that it performs almost perfectly on synthetic data and outcompetes existing methods on real data consisting of known mixtures of cells.
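To make the notion of "statistically indistinguishable" concrete, the following is a minimal sketch of how one could score a partition of cells directly from raw UMI counts. It is not the Cellstates implementation or its API. It assumes, beyond what the text above states, that the counts of each cell are multinomial samples from expression fractions shared by all cells in its cluster, and that those fractions have a symmetric Dirichlet prior with a hypothetical pseudocount `alpha`. All function names are illustrative.

```python
from math import lgamma


def cluster_log_evidence(cells, alpha=1.0):
    """Log marginal likelihood of one cluster, up to partition-independent terms.

    `cells` is a list of per-cell UMI count vectors (one count per gene).
    Assumption: all cells in the cluster share the same expression fractions,
    which are integrated out under a symmetric Dirichlet(alpha) prior.
    Per-cell multinomial coefficients are dropped because they do not depend
    on how the cells are partitioned.
    """
    n_genes = len(cells[0])
    # Pool the counts of all cells in the cluster, gene by gene.
    pooled = [sum(cell[g] for cell in cells) for g in range(n_genes)]
    total = sum(pooled)
    alpha_sum = alpha * n_genes
    score = lgamma(alpha_sum) - lgamma(alpha_sum + total)
    for n in pooled:
        score += lgamma(alpha + n) - lgamma(alpha)
    return score


def partition_log_evidence(partition, alpha=1.0):
    """Score a partition (list of clusters) as the sum over its clusters."""
    return sum(cluster_log_evidence(cluster, alpha) for cluster in partition)


def merge_gain(cluster_a, cluster_b, alpha=1.0):
    """Change in log evidence when two clusters are merged into one.

    A positive value means the data are better explained by a single shared
    expression state; a negative value favors keeping the clusters apart.
    """
    merged = cluster_log_evidence(cluster_a + cluster_b, alpha)
    separate = (cluster_log_evidence(cluster_a, alpha)
                + cluster_log_evidence(cluster_b, alpha))
    return merged - separate


if __name__ == "__main__":
    # Two toy cells with the same expression proportions over two genes.
    print(merge_gain([[30, 10]], [[30, 10]]) > 0)
    # Two toy cells with opposite expression profiles.
    print(merge_gain([[40, 0]], [[0, 40]]) < 0)
```

Under these assumptions, merging is favored only when the pooled counts are as well explained by one expression state as by two. A search over partitions that maximizes such a score would then determine the number of clusters from the data itself, without a tunable resolution parameter.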
biochemical research methods,mathematical & computational biology