Penalised regression improves imputation of cell-type specific expression using RNA-seq data from mixed cell populations compared to domain-specific methods

Wei-Yu Lin, Melissa Kartawinata, Bethany R Jebson, Restuadi Restuadi, CLUSTER Consortium, Lucy R Wedderburn, Chris Wallace
DOI: https://doi.org/10.1101/2023.09.11.556650
2024-03-22
Abstract: Differential gene expression (DGE) studies often use bulk RNA sequencing of mixed cell populations because single-cell or sorted-cell sequencing may be prohibitively expensive. However, mixed-cell studies may miss differential expression that is restricted to specific cell populations. Computational deconvolution can be used to estimate cell fractions from bulk expression data and to infer average cell-type expression in a set of samples (e.g. cases or controls), but quantitative traits require imputing cell-type expression at the level of individual samples, a task that is less commonly addressed. Here, we assessed the accuracy of imputing sample-level cell-type expression using a real dataset in which RNA sequencing data from mixed peripheral blood mononuclear cells (PBMC) and from sorted cells (CD4, CD8, CD14, CD19) were generated from the same subjects (N=158). We compared three domain-specific methods, CIBERSORTx, bMIND and debCAM/swCAM, with two cross-domain machine learning methods, multiple response LASSO and RIDGE, that had not been used for this task before. Compared with the deconvolution methods, LASSO/RIDGE showed higher sensitivity but lower specificity for recovering DGE signals seen in the observed data, and a higher area under the curve (median=0.84-0.87 across cell types, versus 0.62-0.77 for the deconvolution methods). Machine learning methods have the potential to outperform domain-specific methods when suitable training data are available.
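To illustrate the idea behind the multiple response penalised regression named in the abstract, the following is a minimal sketch, not the authors' pipeline. It fits a closed-form multi-response ridge model that predicts the expression of several genes in one sorted cell type from bulk expression, then scores imputation on held-out samples. Everything here is an assumption made for illustration: the data are synthetic, the sample and gene counts are arbitrary, the penalty `lam` is fixed rather than tuned by cross-validation, and the helper names (`fit_multiresponse_ridge`, `predict`, `per_gene_correlation`) are hypothetical. The LASSO variant has no closed form and is not shown.

```python
import numpy as np


def fit_multiresponse_ridge(X, Y, lam):
    """Fit ridge regression with several response columns at once.

    X: (samples, bulk genes), Y: (samples, cell-type genes).
    One penalty `lam` is shared by all responses (an illustrative choice).
    """
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean  # centre so the intercept is not penalised
    Yc = Y - y_mean
    n_features = X.shape[1]
    # Solve (Xc'Xc + lam * I) B = Xc'Yc for the coefficient matrix B
    B = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ Yc)
    intercept = y_mean - x_mean @ B
    return B, intercept


def predict(X, B, intercept):
    """Impute cell-type expression for new bulk samples."""
    return X @ B + intercept


def per_gene_correlation(Y_true, Y_pred):
    """Pearson correlation between observed and imputed values, per gene."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-in data (hypothetical sizes, not the study's dataset)
rng = np.random.default_rng(0)
n_train, n_test, n_genes = 100, 40, 20
n_total = n_train + n_test

# "Sorted" expression for one cell type, plus signal from other cell types
cell_type_expr = rng.normal(size=(n_total, n_genes))
other_cells = rng.normal(size=(n_total, n_genes))
noise = rng.normal(size=(n_total, n_genes))

# Bulk expression as a noisy mixture of the cell type of interest and the rest
bulk_expr = 0.6 * cell_type_expr + 0.3 * other_cells + 0.1 * noise

X_train, X_test = bulk_expr[:n_train], bulk_expr[n_train:]
Y_train, Y_test = cell_type_expr[:n_train], cell_type_expr[n_train:]

B, intercept = fit_multiresponse_ridge(X_train, Y_train, lam=1.0)
Y_imputed = predict(X_test, B, intercept)
gene_corr = per_gene_correlation(Y_test, Y_imputed)

print("mean per-gene correlation on held-out samples:", gene_corr.mean())
```

In practice the penalty would be chosen by cross-validation on subjects with both bulk and sorted data, and the fitted model would then be applied to subjects with bulk data only, which is why the approach depends on suitable training data being available.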
Bioinformatics