DAE-TPGM: A deep autoencoder network based on a two-part-gamma model for analyzing single-cell RNA-seq data

Shuchang Zhao, Li Zhang, Xuejun Liu
DOI: https://doi.org/10.1016/j.compbiomed.2022.105578
Abstract: Single-cell RNA sequencing (scRNA-seq) can reveal differences in genetic material at the single-cell level and is widely used in biomedical studies. However, the minute RNA content within individual cells often results in a high number of dropouts and introduces random noise into scRNA-seq data, concealing the original gene expression pattern. Data normalization is therefore a critical step in the analysis pipeline to adjust for unexpected biological and technical effects, and it leads to a particular bimodal expression pattern in the semi-continuous normalized data. We further find that the positive continuous expression follows a right-skewed distribution, which is still under-explored by mainstream dimensionality reduction and imputation methods. We introduce a deep autoencoder network based on a two-part-gamma model (DAE-TPGM) for joint dimensionality reduction and imputation of scRNA-seq data. DAE-TPGM uses a two-part-gamma model to capture the statistical characteristics of semi-continuous normalized data, and uses a deep autoencoder to adaptively explore the potential relationships between genes to improve imputation. As in classic applications of autoencoders to dimensionality reduction, our customized autoencoder better captures phenotypic information on peripheral blood mononuclear cells (PBMC) and clearly infers continuous phenotype information for hematopoiesis in mice. Compared with mainstream imputation methods such as MAGIC, SAVER, scImpute and DCA, the new model achieved substantial improvement in the recognition of cellular phenotypes in two real datasets, and comprehensive analyses on synthetic "ground truth" data demonstrated that our method has competitive advantages over other imputation methods in discovering underlying gene expression patterns in time-course data.
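To make the idea of a two-part-gamma model concrete, the following is a minimal sketch of a per-value negative log-likelihood for semi-continuous data: a point mass at zero combined with a gamma density for positive values, which can accommodate the right skew noted in the abstract. The abstract does not give the parameterization, so the function names and the parameters `pi` (probability of a positive value), `shape`, and `rate` are assumptions made for illustration and are not taken from the paper.

```python
import math


def two_part_gamma_nll(x, pi, shape, rate):
    """Negative log-likelihood of one semi-continuous value x >= 0.

    Assumed (hypothetical) parameterization, not taken from the paper:
      pi    -- probability that the value is positive, 0 < pi < 1
      shape -- gamma shape parameter, > 0
      rate  -- gamma rate parameter, > 0
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        # Part 1: point mass at zero with probability 1 - pi.
        return -math.log1p(-pi)
    # Part 2: gamma log-density for positive values, weighted by pi.
    log_gamma_pdf = (
        shape * math.log(rate)
        - math.lgamma(shape)
        + (shape - 1.0) * math.log(x)
        - rate * x
    )
    return -(math.log(pi) + log_gamma_pdf)


def mean_nll(values, pi, shape, rate):
    """Average negative log-likelihood over a list of values."""
    return sum(two_part_gamma_nll(v, pi, shape, rate) for v in values) / len(values)
```

In an autoencoder setting, one plausible design (again an assumption, by analogy with likelihood-based autoencoders such as DCA) is for the decoder to output `pi`, `shape`, and `rate` for each gene in each cell and to use the mean of this quantity as the training loss.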
What problem does this paper attempt to address?