Benchmarking scRNA-seq imputation tools with respect to network inference highlights deficits in performance at high levels of sparsity

Lisa Maria Steinheuer, Sebastian Canzler, Jörg Hackermüller
DOI: https://doi.org/10.1101/2021.04.02.438193
2021-04-02
Abstract: Gene correlation network inference from single-cell transcriptomics data has the potential to provide unprecedented insights into cell type-specific regulatory programs. ScRNA-seq data are severely affected by dropout, which significantly hampers and restricts current downstream analysis. Although newly developed tools are capable of dealing with sparse data, no appropriate single-cell network inference workflow has been established. A potential way to end this deadlock is the application of data imputation methods, which have already proved useful in specific contexts of single-cell data analysis, e.g., recovering cell clusters. In order to infer cell type-specific networks, two prerequisites must be met: the identification of cluster-specific cell types and the network inference itself. Here, we propose a benchmarking framework to investigate both objectives. By using suitable reference data with inherent correlation structure, six representative imputation tools, and appropriate evaluation measures, we were able to systematically assess the impact of data imputation on network inference. Major network structures were found to be preserved in low dropout data sets. For moderately sparse data sets, DCA was able to recover gene correlation structures, although it systematically introduced higher correlation values. No imputation tool was able to recover true signals from high dropout data. However, by using an additional biological data set, we could show that cell-cell correlation by means of specific marker gene expression was not compromised by data imputation. Our analysis showed that network inference is feasible for low and moderately sparse data sets by using the unimputed and DCA-prepared data, respectively. High sparsity data, on the other hand, still pose a major problem, since current imputation techniques are not able to facilitate network inference. The annotation of cluster-specific cell types as a prerequisite is not hampered by data imputation, but the power of imputation tools to restore the deeply hidden correlation structures is still not sufficient.
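To illustrate why dropout is a problem for correlation-based network inference, the following is a minimal sketch in Python. It is not the benchmarking framework described in the abstract. It assumes a toy model in which two hypothetical genes share a latent driver, and it models dropout as independent random zeroing of values, which is a simplification of real scRNA-seq dropout. The function names, parameters, and expression scale are all illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_under_dropout(dropout_rate, n_cells=5000, seed=0):
    """Correlation of two co-regulated toy genes after simulated dropout.

    Assumption: both genes follow the same latent factor plus independent
    noise, and each measurement is zeroed independently with probability
    `dropout_rate`. Real dropout also depends on expression level.
    """
    rng = random.Random(seed)
    gene_a, gene_b = [], []
    for _ in range(n_cells):
        shared = rng.gauss(0.0, 1.0)  # latent regulatory signal
        a = max(0.0, 5.0 + 2.0 * shared + 0.5 * rng.gauss(0.0, 1.0))
        b = max(0.0, 5.0 + 2.0 * shared + 0.5 * rng.gauss(0.0, 1.0))
        # Simulated dropout: replace the observed value with zero.
        if rng.random() < dropout_rate:
            a = 0.0
        if rng.random() < dropout_rate:
            b = 0.0
        gene_a.append(a)
        gene_b.append(b)
    return pearson(gene_a, gene_b)


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.5, 0.9):
        r = correlation_under_dropout(rate)
        print(f"dropout={rate:.1f}  r={r:.3f}")
```

In this toy setting the observed correlation shrinks toward zero as the dropout rate grows, which is the kind of signal loss an imputation tool would have to reverse before a correlation network could be inferred.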