Abstract: Background: Although single-cell RNA sequencing (scRNA-seq) has become a powerful technology, the large number of zero counts it produces poses a challenge for both wet-lab processing and data analysis. These dropouts can now be imputed by three categories of algorithms: model- or smoothing-based, matrix-theory-based, and deep-learning-based. However, two fundamental questions remain unsettled: (1) whether imputation should be performed at all; and (2) which imputation algorithm to use for a given downstream application. Notably, imputation is not commonly used in real scRNA-seq applications because of its uncertain benefits, concerns about false inferences in downstream analyses, and the lack of in-depth benchmarks. Methods: Here, we performed two tasks. First, we developed an algorithm based on adaptive low-rank full matrix factorization (afMF), building on ALRA, an earlier and more limited implementation confined to low-rank matrix decomposition. Second, to evaluate the impact of various imputation algorithms on downstream analyses, we developed a new benchmark framework incorporating commonly used downstream applications. This framework emphasizes real datasets with ground truth or matched bulk data, so that algorithm performance is compared against more convincing references rather than against less realistic simulated parameters. Results: Our results indicated that afMF and ALRA (both matrix-based) provided good imputation and outperformed raw log-normalization in various downstream applications. afMF outperformed ALRA in several evaluations (cell-level differential expression analysis, GSEA, classification, biomarker prediction, clustering, and single-cell-to-bulk profile similarity). In addition, afMF ranked among the top performers in automatic cell type annotation, trajectory inference with DPT, and AUCell and SCENIC. Both methods showed acceptable scalability, although afMF had a longer running time. MAGIC (smoothing-based) and AutoClass (deep-learning-based) also performed well but may produce false positives. In contrast, more complicated methods (other deep-learning-based or model-based methods) were prone to overfitting and data distortion. We also found that certain downstream algorithms are not compatible with imputation, including trajectory inference with Slingshot and cell-cell communication analysis. For these applications, prior imputation either showed no improvement or generated false-positive findings. Conclusions: We hope that this in-depth evaluation and the algorithm developed in this study will help guide the selection of an appropriate imputation algorithm for specific scRNA-seq downstream analyses.
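
To make the matrix-based idea in the Methods concrete, the following is a minimal sketch of generic low-rank imputation, not the afMF or ALRA algorithm itself. It assumes a dense cells-by-genes matrix of non-negative, log-normalized values and a rank chosen by the user; the function name `low_rank_impute` and the toy data are hypothetical, and the adaptive rank selection and thresholding steps that the published methods rely on are omitted.

```python
import numpy as np


def low_rank_impute(log_expr, rank):
    """Fill zero entries of a cells x genes matrix from a truncated-SVD approximation.

    Illustrative only: a plain rank-`rank` reconstruction, with observed
    (non-zero) values left untouched.
    """
    # Truncated SVD: keep the leading `rank` singular triplets.
    u, s, vt = np.linalg.svd(log_expr, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]

    # Expression cannot be negative, so clip the reconstruction at zero.
    approx = np.clip(approx, 0.0, None)

    # Replace only the zero entries; keep every measured value as-is.
    imputed = log_expr.copy()
    zero_mask = log_expr == 0
    imputed[zero_mask] = approx[zero_mask]
    return imputed


# Toy data (assumption): a strictly positive rank-3 signal with 30% random dropout.
rng = np.random.default_rng(0)
cells, genes, true_rank = 60, 40, 3
signal = rng.gamma(2.0, 1.0, size=(cells, true_rank)) @ rng.gamma(
    2.0, 1.0, size=(true_rank, genes)
)
dropout = rng.random(signal.shape) < 0.3
observed = np.where(dropout, 0.0, signal)

imputed = low_rank_impute(observed, rank=true_rank)
```

In this toy setting every zero is a dropout by construction. In real data some zeros are biological, which is one reason naive low-rank filling can create false positives and why the benchmark described above matters.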

Low-Rank Full Matrix Factorization for dropout imputation in single cell RNA-seq and benchmarking with imputation algorithms for downstream applications
