Abstract: Dissecting large bulk RNA-seq datasets into cell proportions and cell-type-specific expression profiles could significantly enhance our understanding of disease mechanisms at the cellular level and facilitate the identification of novel drug targets and the development of efficient intervention strategies. In this study, we present a scRNA-seq marker-gene-informed cell deconvolution and expression inference framework (CausalCellInfer). CausalCellInfer employs causal inference principles to automatically identify a small set of critical marker genes from the reference scRNA-seq dataset. It then integrates deep neural networks with regularized matrix completion algorithms to deconvolute cell proportions and estimate cell-type-specific (CTS) expression profiles. Most importantly, we pioneer the application of the proposed framework to imputed expression data from large-scale genome-wide association studies (GWAS). We verified the efficacy of our method by comparing it against existing state-of-the-art cell deconvolution methods, including CIBERSORTx, DWLS, Scaden, and MuSiC, across various real and pseudo-bulk samples. Furthermore, we propose the use of a wide range of enrichment analyses to demonstrate the reliability of CausalCellInfer in estimating CTS profiles. Our framework consistently outperformed existing methods, with significantly higher concordance correlation coefficient (CCC) and lower mean absolute error (MAE) and root mean square error (RMSE) across all real and pseudo-bulk test samples. Importantly, it also demonstrated superior computational efficiency compared to all benchmarked methods except MuSiC. We also applied our trained models to deconvolute cell proportions and estimate the corresponding CTS expression profiles for 4 tissues, leveraging UK Biobank (UKBB) data.
We conducted a series of cellular-level analyses, including cell proportion association analysis, causal gene detection, and transcriptome-wide association analysis (TWAS), for 24 phenotypes in UKBB based on the estimated cell compositions and CTS expression profiles. Of note, the estimated proportions of various cell types were indicative of disease onset. For example, T2DM patients showed a significant decrease in the proportions of alpha and beta cells compared to controls. We also tested their associations. Our method exhibited satisfactory positive predictive values (PPV) in uncovering differentially expressed (DE) genes for the majority of cell types. Encouragingly, most identified CTS causally relevant genes were significantly enriched in the target diseases or related pathophysiology. In conclusion, we present a novel framework for inferring cell-type proportions and CTS expression, with novel applications to GWAS-imputed expression data from a large-scale biobank program. Our work also sheds light on how differences in cell-type proportions and CTS expression may be associated with susceptibility to different diseases and their prognoses, bridging scRNA-seq and clinical phenotypes in large-scale biobank studies.
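
The three benchmark metrics named in the abstract (CCC, MAE, RMSE) can be illustrated with a minimal sketch. It assumes the estimated and true cell proportions are available as NumPy arrays of the same shape and uses Lin's standard definition of the concordance correlation coefficient; the function names and the toy proportions are hypothetical and are not taken from the CausalCellInfer code.

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient over all flattened entries."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population (biased) variances and covariance, as in Lin's definition.
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error over all entries."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_true, y_pred):
    """Root mean square error over all entries."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example (made-up numbers): 2 samples x 3 cell types.
true_props = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3]])
est_props = np.array([[0.5, 0.35, 0.15],
                      [0.25, 0.45, 0.3]])

print("CCC :", concordance_cc(true_props, est_props))
print("MAE :", mae(true_props, est_props))
print("RMSE:", rmse(true_props, est_props))
```

Unlike the Pearson correlation, CCC also penalizes shifts in location and scale between estimated and true proportions, which is one reason it is commonly reported alongside MAE and RMSE in deconvolution benchmarks.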
