Low-Rank Full Matrix Factorization for dropout imputation in single cell RNA-seq and benchmarking with imputation algorithms for downstream applications

Jinghan Huang, Anson Chun Man Chow, Nelson Leung-sang Tang, Sheung Chi Yam
DOI: https://doi.org/10.1101/2024.10.21.619343
2024-10-23
Abstract:
Background: Although single-cell RNA sequencing (scRNA-seq) has become a powerful technology, the large number of zero counts poses a challenge for both wet-lab processing and data analysis. These dropouts can now be imputed by three categories of algorithms: model- or smoothing-based, matrix-theory-based, and deep-learning-based. However, two fundamental questions remain unsettled: (1) whether imputation should be performed at all; and (2) which imputation algorithm to use with which downstream application. Notably, imputation is not commonly used in real scRNA-seq applications because of its uncertain benefits, concerns about false inferences in downstream analyses, and the lack of in-depth benchmarks.
Methods: We performed two tasks. First, we developed an algorithm using adaptive low-rank full matrix factorization (afMF), building on a previous, more limited implementation confined to low-rank matrix decomposition (ALRA). Second, to evaluate the impact of various imputation algorithms on downstream analyses, we developed a new benchmark framework incorporating commonly used downstream applications. The framework emphasizes real datasets with ground truth or matched bulk data, so that algorithm performance is compared against more convincing data rather than against less realistic simulated parameters.
Results: Our results indicate that afMF and ALRA (both matrix-based) provided good imputation and outperformed raw log-normalization in various downstream applications. afMF outperformed ALRA in several evaluations (cell-level differential expression analysis, GSEA, classification, biomarker prediction, clustering, and SC-bulk profiling similarity). In addition, afMF ranked among the top performers in automatic cell type annotation, trajectory inference by DPT, and AUCell & SCENIC. Both showed acceptable scalability, although afMF had a longer running time. MAGIC (smoothing-based) and AutoClass (deep-learning-based) also performed well but may produce false positives. In contrast, more complicated methods (other deep-learning-based or model-based methods) were prone to overfitting and data distortion. We also found that certain downstream analyses are not compatible with imputation, including trajectory inference with Slingshot and cell-cell communication analysis: prior imputation either showed no improvement or generated false-positive findings in these applications.
Conclusions: We hope that this in-depth evaluation and the algorithm developed in this study will help researchers select an appropriate imputation algorithm for specific scRNA-seq downstream analyses.
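To make the idea of matrix-based imputation concrete, the following is a minimal sketch of a generic low-rank fill-in using a truncated SVD. It is not the afMF or ALRA algorithm described in the paper: the function name `low_rank_impute`, the fixed `rank` argument, the rule of filling only zero entries, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def low_rank_impute(expr: np.ndarray, rank: int) -> np.ndarray:
    """Fill zero entries of a cells x genes matrix from a rank-k SVD approximation.

    Illustrative only; not the afMF or ALRA procedure.
    """
    # Truncated SVD: keep only the top `rank` singular triplets.
    u, s, vt = np.linalg.svd(expr, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Expression values cannot be negative, so clip the reconstruction at zero.
    approx = np.clip(approx, 0.0, None)
    # Keep observed (non-zero) entries as measured; fill only the zeros.
    return np.where(expr > 0, expr, approx)

rng = np.random.default_rng(0)
# Hypothetical toy data: a rank-2 positive signal with about 40% of entries zeroed.
truth = rng.gamma(2.0, 1.0, size=(50, 2)) @ rng.gamma(2.0, 1.0, size=(2, 30))
mask = rng.random(truth.shape) < 0.4
observed = np.where(mask, 0.0, truth)
imputed = low_rank_impute(observed, rank=2)
```

In this sketch the rank is supplied by hand, and every zero is treated as a candidate dropout. Telling technical dropouts apart from true biological zeros is one of the concerns the abstract raises about false inferences, and this toy example does not attempt it.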
Bioinformatics