Identification of cell types, states and programs by learning gene set representations

Soroor Hediyeh-zadeh,Holly J. Whitfield,Malvika Kharbanda,Fabiola Curion,Dharmesh D. Bhuva,Fabian J. Theis,Melissa J. Davis
DOI: https://doi.org/10.1101/2023.09.08.556842
2023-01-01
Abstract:As single cell molecular data expand, there is an increasing need for algorithms that efficiently query and prioritize gene programs, cell types and states in single-cell sequencing data, particularly in cell atlases. Here we present scDECAF, a statistical learning algorithm to identify cell types, states and programs in single-cell gene expression data using vector representation of gene sets, which improves biological interpretation by selecting a subset of most biologically relevant programs. We applied scDECAF to scRNAseq data from PBMC, Lung, Pancreas, Brain and slide-tags snRNA of human prefrontal cortex for automatic cell type annotation. We demonstrate that scDECAF can recover perturbed gene programs in Lupus PBMC cells stimulated with IFNbeta and TGFBeta-induced cells undergoing epithelial-to-mesenchymal transition. scDECAF delineates patient-specific heterogeneity in cellular programs in Ovarian Cancer data. Using a healthy PBMC reference, we apply scDECAF to a mapped query PBMC COVID-19 case-control dataset and identify multicellular programs associated with severe COVID-19. scDECAF can improve biological interpretation and complement reference mapping analysis, and provides a method for gene set and pathway analysis in single cell gene expression data. ### Competing Interest Statement The authors have declared no competing interest.
What problem does this paper attempt to address?