Abstract: Background: Although single-cell RNA sequencing (scRNA-seq) has become a powerful technology, the large number of zero counts it produces poses a challenge for both wet-lab processing and data analysis. Imputation of these dropouts can now be performed by three categories of algorithms: model- or smoothing-based, matrix-theory-based, and deep-learning-based. However, two fundamental questions remain unsettled: (1) whether imputation should be performed at all; and (2) which imputation algorithm to use for a given downstream application. Notably, imputation is not commonly used in real scRNA-seq applications because of its uncertain benefits, concerns about false inferences in downstream applications, and the lack of in-depth benchmarks. Methods: Here, we performed two tasks. First, we developed an algorithm using adaptive low-rank full matrix factorization (afMF), building on ALRA, a previous, more limited implementation confined to low-rank matrix decomposition. Second, to evaluate the impact of various imputation algorithms on downstream analyses, we developed a new benchmark framework incorporating commonly used downstream applications. This framework emphasizes real datasets with ground truth or matched bulk data, so that algorithm performance is compared against more convincing data rather than against less realistic simulated parameters. Results: Our results indicate that afMF and ALRA (both matrix-based) provided good imputation and outperformed raw log-normalization in various downstream applications. afMF outperformed ALRA in several evaluations (cell-level differential expression analysis, GSEA, classification, biomarker prediction, clustering, and SC-bulk profile similarity). In addition, afMF ranked among the top methods in automatic cell type annotation, trajectory inference by DPT, and AUCell & SCENIC. Both afMF and ALRA showed acceptable scalability, although afMF had a longer running time. MAGIC (smoothing-based) and AutoClass (deep-learning-based) also performed well but may produce false positives.
In contrast, more complicated methods (other deep-learning-based or model-based methods) were prone to overfitting and data distortion. We also found that certain downstream analyses are not compatible with imputation, including trajectory inference with Slingshot and cell-cell communication analysis. For these applications, prior imputation either showed no improvement or generated false-positive findings. Conclusions: We hope that this in-depth evaluation and the algorithm developed in this study will help researchers select an appropriate imputation algorithm for specific scRNA-seq downstream analyses.
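
To make the matrix-based category concrete, the sketch below shows a generic truncated-SVD fill of zero entries in a cells-by-genes matrix. It is not the afMF or ALRA procedure, whose details are not given in this abstract; the function name `low_rank_impute`, the fixed `rank` argument, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def low_rank_impute(log_expr, rank):
    """Fill zeros in a cells x genes matrix from a rank-`rank` SVD reconstruction.

    Hypothetical helper for illustration only; observed (non-zero) values
    are left exactly as they were.
    """
    u, s, vt = np.linalg.svd(log_expr, full_matrices=False)
    # Keep only the leading `rank` singular triplets.
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Negative expression is not meaningful, so clip at zero.
    approx = np.clip(approx, 0.0, None)
    imputed = log_expr.copy()
    zero_mask = log_expr == 0
    imputed[zero_mask] = approx[zero_mask]
    return imputed


# Synthetic example (assumed data): a rank-2 positive matrix with ~30% dropouts.
rng = np.random.default_rng(0)
truth = rng.gamma(2.0, 1.0, size=(50, 2)) @ rng.gamma(2.0, 1.0, size=(2, 30))
dropout = rng.random(truth.shape) < 0.3
observed = np.where(dropout, 0.0, truth)

imputed = low_rank_impute(observed, rank=2)
```

In this sketch the rank is fixed by hand and every zero is treated as a dropout; a practical method would need to choose the rank from the data and distinguish technical zeros from true biological zeros.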

Issues arising from benchmarking single-cell RNA sequencing imputation methods

Reply to "Issues arising from benchmarking single-cell RNA sequencing imputation methods"

A systematic evaluation of single-cell RNA-sequencing imputation methods

Evaluating Imputation Methods for Single-Cell RNA-seq Data

Are dropout imputation methods for scRNA-seq effective for scATAC-seq data?

Scimc: a Platform for Benchmarking Comparison and Visualization Analysis of Scrna-Seq Data Imputation Methods.

Benchmarking imputation methods for network inference using a novel method of synthetic scRNA-seq data generation

Dropout Imputation and Batch Effect Correction for Single-Cell RNA Sequencing Data

SAE-Impute: imputation for single-cell data via subspace regression and auto-encoders

SCRIBE: a new approach to dropout imputation and batch effects correction for single-cell RNA-seq data

scRNA-seq mixology: towards better benchmarking of single cell RNA-seq analysis methods

A Posterior Probability Based Bayesian Method for Single-Cell RNA-seq Data Imputation.

The shaky foundations of simulating single-cell RNA sequencing data

Cellular Similarity based Imputation for Single cell RNA Sequencing Data

Benchmarking deep learning methods for biologically conserved single-cell integration

SAVER: gene expression recovery for single-cell RNA sequencing

Low-Rank Full Matrix Factorization for dropout imputation in single cell RNA-seq and benchmarking with imputation algorithms for downstream applications

Schinter: Imputing Dropout Events for Single-Cell RNA-seq Data with Limited Sample Size

A benchmark of batch-effect correction methods for single-cell RNA sequencing data

Scsagan: A Scrna-Seq Data Imputation Method Based on Semi-Supervised Learning and Probabilistic Latent Semantic Analysis

A comparison of methods accounting for batch effects in differential expression analysis of UMI count based single cell RNA sequencing