Abstract:

Background: Although single-cell RNA sequencing (scRNA-seq) has become a powerful technology, the large number of zero counts it produces is a challenge for both wet-lab processing and data analysis. These dropouts can now be imputed by three categories of algorithms: model- or smoothing-based, matrix-theory-based, and deep-learning-based. However, two fundamental questions remain unsettled: (1) whether imputation should be performed at all; and (2) which imputation algorithm to use with which downstream application. Notably, imputation is not commonly used in real scRNA-seq applications because of its uncertain benefits, concerns about false inferences in downstream applications, and the lack of in-depth benchmarks.

Methods: Here, we performed two tasks. First, we developed an algorithm using adaptive low-rank full matrix factorization (afMF), building on ALRA, an earlier and more limited implementation confined to low-rank matrix decomposition. Second, to evaluate the impact of various imputation algorithms on downstream analyses, we developed a new benchmark framework incorporating commonly used downstream applications. This framework emphasizes real datasets that have ground truth or matched bulk data, so that algorithm performance is compared against more convincing reference data rather than less realistic simulations.

Results: Our results indicated that afMF and ALRA (both matrix-based) provided good imputation and outperformed raw log-normalization in various downstream applications. afMF outperformed ALRA in several evaluations (cell-level differential expression analysis, GSEA, classification, biomarker prediction, clustering, and SC-bulk profiling similarity). In addition, afMF ranked among the top methods in automatic cell type annotation, trajectory inference by DPT, and AUCell & SCENIC. Both afMF and ALRA showed acceptable scalability, although afMF had a longer running time. MAGIC (smoothing-based) and AutoClass (deep-learning-based) also performed well but may produce false positives. In contrast, more complicated methods (other deep-learning-based or model-based methods) were prone to overfitting and data distortion. We also found that certain downstream analyses are not compatible with imputation, including trajectory inference with Slingshot and cell-cell communication analysis. For these applications, prior imputation either showed no improvement or generated false positive findings.

Conclusions: We hope that this in-depth evaluation and the algorithm developed in this study can guide the selection of an appropriate imputation algorithm for specific scRNA-seq downstream analyses.
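
To illustrate the general idea behind matrix-based imputation, the following is a minimal sketch of filling zero entries from a low-rank approximation of a log-normalized expression matrix. It is not the afMF or ALRA algorithm: the function name `low_rank_impute`, the fixed `rank` argument, the plain truncated SVD, and the toy data are all assumptions chosen for illustration, and the adaptive steps described in the abstract are not reproduced.

```python
import numpy as np


def low_rank_impute(log_expr: np.ndarray, rank: int) -> np.ndarray:
    """Fill zero entries of a cells x genes matrix from a rank-`rank` SVD.

    Hypothetical helper for illustration only; not afMF or ALRA.
    """
    # Truncated SVD gives the best rank-`rank` approximation in the
    # least-squares sense.
    u, s, vt = np.linalg.svd(log_expr, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Expression values cannot be negative, so clip the reconstruction.
    approx = np.clip(approx, 0.0, None)
    # Keep observed (non-zero) values as they are and fill only the zeros.
    return np.where(log_expr > 0, log_expr, approx)


# Toy data: a strictly positive matrix built from two latent factors,
# log-transformed, with about 30% of entries set to zero as simulated dropouts.
rng = np.random.default_rng(0)
true_counts = rng.gamma(2.0, 1.0, size=(50, 2)) @ rng.gamma(2.0, 1.0, size=(2, 30))
dropout = rng.random(true_counts.shape) < 0.3
observed = np.where(dropout, 0.0, np.log1p(true_counts))

imputed = low_rank_impute(observed, rank=2)
```

In this sketch the rank is supplied by the caller and every zero is treated as a candidate dropout; a practical method would need to choose the rank from the data and distinguish technical dropouts from genes that are truly not expressed.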

Benchmarking scRNA-seq imputation tools with respect to network inference highlights deficits in performance at high levels of sparsity

Identifying strengths and weaknesses of methods for computational network inference from single-cell RNA-seq data

Evaluating Imputation Methods for Single-Cell RNA-seq Data

scCGImpute: An Imputation Method for Single-Cell RNA Sequencing Data Based on Similarities between Cells and Relationships among Genes

scINRB: single-cell gene expression imputation with network regularization and bulk RNA-seq data

Benchmarking imputation methods for network inference using a novel method of synthetic scRNA-seq data generation

Low-Rank Full Matrix Factorization for dropout imputation in single cell RNA-seq and benchmarking with imputation algorithms for downstream applications

GE-Impute: graph embedding-based imputation for single-cell RNA-seq data

Are dropout imputation methods for scRNA-seq effective for scATAC-seq data?

NISC: Neural Network-Imputation for Single-Cell RNA Sequencing and Cell Type Clustering

A systematic evaluation of single-cell RNA-sequencing imputation methods

scGCL: an imputation method for scRNA-seq data based on graph contrastive learning

SmartImpute: A Targeted Imputation Framework for Single-cell Transcriptome Data

Evaluating the performance of dropout imputation and clustering methods for single-cell RNA sequencing data

scCAN: Clustering With Adaptive Neighbor-Based Imputation Method for Single-Cell RNA-Seq Data

Collaborative Structure-Preserved Missing Data Imputation for Single-Cell RNA-Seq Clustering

A novel method for single-cell data imputation using subspace regression

Imputation method for single-cell RNA-seq data using neural topic model

Cellular Similarity based Imputation for Single cell RNA Sequencing Data

ccImpute: an accurate and scalable consensus clustering based algorithm to impute dropout events in the single-cell RNA-seq data

scIDPMs: single-cell RNA-seq imputation using diffusion probabilistic models