Abstract

Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing.

Results: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth.

Conclusions: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
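The abstract does not spell out the analytic solution, so the following is only a minimal sketch under stated assumptions. It assumes a cells-by-genes UMI count matrix, a rank-one expected-count model (row sum times column sum divided by the grand total), a single negative binomial overdispersion parameter shared by all genes, and clipping of residuals to the square root of the number of cells. The function name `analytic_pearson_residuals`, the default `theta=100.0`, and the clipping rule are illustrative choices here, not values taken from the text above.

```python
import numpy as np


def analytic_pearson_residuals(counts, theta=100.0, clip=None):
    """Sketch of analytic Pearson residuals for a cells x genes count matrix.

    Assumptions (not stated in the abstract): a rank-one mean model,
    a shared overdispersion `theta`, and clipping to +/- sqrt(n_cells).
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    # Rank-one expected counts: mu[c, g] = depth[c] * gene_sum[g] / total
    depth = counts.sum(axis=1, keepdims=True)     # shape (cells, 1)
    gene_sum = counts.sum(axis=0, keepdims=True)  # shape (1, genes)
    total = counts.sum()
    mu = depth * gene_sum / total

    # Negative binomial variance; theta = np.inf reduces this to Poisson
    var = mu + mu**2 / theta

    # Leave residuals at zero where the expected count (and variance) is zero
    residuals = np.zeros_like(counts)
    np.divide(counts - mu, np.sqrt(var), out=residuals, where=var > 0)

    # Clip extreme residuals (assumed default: sqrt of the number of cells)
    if clip is None:
        clip = np.sqrt(n_cells)
    return np.clip(residuals, -clip, clip)


# Usage on a small synthetic Poisson matrix (hypothetical data)
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=2.0, size=(50, 20))
toy_residuals = analytic_pearson_residuals(toy_counts)
```

Under these assumptions, genes whose residual variance across cells is much larger than one are candidates for biologically variable genes, and the residual matrix can be passed directly to PCA. Passing `theta=np.inf` gives the purely Poisson version.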

Compound models and Pearson residuals for single-cell RNA-seq data without UMIs

Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

The Poisson distribution model fits UMI-based single-cell RNA-sequencing data

Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression

A mechanistic model for the negative binomial distribution of single-cell mRNA counts

Multi-Experiment Nonlinear Mixed Effect Modeling of Single-Cell Translation Kinetics after Transfection

UMI-count modeling and differential expression analysis for single-cell RNA sequencing

A deep generative model for single-cell RNA sequencing with application to detecting differentially expressed genes

Non-linear Normalization for Non-UMI Single Cell RNA-Seq

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

A deep generative model for gene expression profiles from single-cell RNA sequencing

Comparison and evaluation of statistical error models for scRNA-seq

Modeling non-uniformity in short-read rates in RNA-Seq data

Negative binomial count splitting for single-cell RNA sequencing data

UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy

Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing

Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers

Joint Estimation of Isoform Expression and Isoform-Specific Read Distribution Using Multisample RNA-Seq Data

Differences in molecular sampling and data processing explain variation among single-cell and single-nucleus RNA-seq experiments

Robust estimation of isoform expression with RNA-Seq data