Root Causal Inference from Single Cell RNA Sequencing with the Negative Binomial

Eric V. Strobl
2023-07-11
Abstract: Accurately inferring the root causes of disease from sequencing data can improve the discovery of novel therapeutic targets. However, existing root causal inference algorithms require perfectly measured continuous random variables. Single cell RNA sequencing (scRNA-seq) datasets contain large numbers of cells but non-negative counts measured by an error-prone process. We therefore introduce an algorithm called Root Causal Inference with Negative Binomials (RCI-NB) that accounts for count-based measurement error by separating negative binomial distributions into their gamma and Poisson components; the gamma distributions form a fully identifiable but latent post non-linear causal model representing the true RNA expression levels, which we only observe with Poisson corruption. RCI-NB identifies patient-specific root causal contributions from scRNA-seq datasets by integrating novel sparse regression and goodness of fit testing procedures that bypass Poisson measurement error. Experiments demonstrate significant improvements over existing alternatives.
Genomics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper "Root Causal Inference from Single Cell RNA Sequencing with the Negative Binomial" addresses the problem of accurately inferring the root causes of disease from single-cell RNA sequencing (scRNA-seq) data. Specifically, the author proposes a new algorithm, Root Causal Inference with Negative Binomials (RCI-NB), to overcome the limitations of existing root causal inference algorithms when applied to scRNA-seq data.

#### Background and Challenges

1. **Limitations of Existing Methods**:
   - Most existing root causal inference algorithms assume that the data are perfectly measured continuous random variables.
   - scRNA-seq datasets contain a large number of cells, but the measurements are non-negative counts produced by an error-prone sequencing process.
   - Existing methods therefore cannot properly handle the count-valued data or the measurement error.
2. **Research Objectives**:
   - Develop a root causal inference algorithm that can handle non-negative counts and measurement error in scRNA-seq data.
   - Construct a model that can identify patient-specific root causes by separating the gamma and Poisson components of the negative binomial distribution.
   - Improve the discovery of novel therapeutic targets from scRNA-seq data.

#### Method Overview

1. **Model Construction**:
   - Model scRNA-seq counts with the negative binomial distribution, where the gamma component represents the true (latent) RNA expression levels and the Poisson component represents measurement error.
   - Bypass the Poisson measurement error and recover the model parameters directly from count data through negative binomial regression and goodness-of-fit tests.
2. **Algorithm Steps**:
   - **Parameter Estimation**: Systematically identify the parameters of the negative binomial distributions through sparse regression and goodness-of-fit testing.
   - **Root Cause Contribution**: Use the recovered parameters to quantify the impact of each error term on the diagnostic label, which determines the patient-specific root causes.
3. **Experimental Validation**:
   - Compare RCI-NB against existing methods in experiments on simulated and real scRNA-seq datasets.

#### Main Contributions

- **Innovation**: Applies the negative binomial distribution to root causal inference, addressing measurement error in scRNA-seq data.
- **Accuracy**: Experiments show that RCI-NB significantly outperforms existing alternatives in identifying patient-specific root causes.
- **Practicality**: Provides a tool for discovering novel therapeutic targets from scRNA-seq data, which can aid the understanding of disease mechanisms and drug development.

In summary, the paper introduces the RCI-NB algorithm to address the challenge of accurately inferring the root causes of disease from single-cell RNA sequencing data.
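
The gamma and Poisson split described under Method Overview can be illustrated with a small simulation. The sketch below is not the RCI-NB algorithm: it only shows, for a single gene with no causal parents, that Poisson counts drawn around gamma-distributed latent expression levels have negative binomial moments, and that the latent gamma parameters can be recovered from the counts alone. The function name `sample_gamma_poisson`, the parameter values, and the method-of-moments estimator are illustrative assumptions, standing in for the paper's sparse regression and goodness-of-fit procedures.

```python
import numpy as np


def sample_gamma_poisson(shape, mean, n_cells, rng):
    """Hypothetical helper: latent gamma expression observed through Poisson noise."""
    # Latent "true" expression level: gamma with the given shape and mean,
    # so the scale parameter is mean / shape.
    latent = rng.gamma(shape, mean / shape, size=n_cells)
    # Observed count: Poisson draw whose rate is the latent expression level.
    counts = rng.poisson(latent)
    return latent, counts


rng = np.random.default_rng(0)
shape, mean = 2.0, 5.0  # assumed values, chosen only for illustration
latent, counts = sample_gamma_poisson(shape, mean, 200_000, rng)

# A gamma-Poisson mixture is negative binomial with
# E[count] = mean and Var[count] = mean + mean**2 / shape.
expected_var = mean + mean**2 / shape

# Method-of-moments estimates computed from the counts only; the latent
# values are never used, which is the sense in which the Poisson layer
# is "bypassed" in this toy setting.
mean_hat = counts.mean()
var_hat = counts.var()
shape_hat = mean_hat**2 / (var_hat - mean_hat)

print(f"count mean     : {mean_hat:.3f} (latent mean {mean})")
print(f"count variance : {var_hat:.3f} (negative binomial value {expected_var})")
print(f"recovered shape: {shape_hat:.3f} (latent shape {shape})")
```

The count variance exceeds the count mean by `mean**2 / shape`, which is the overdispersion contributed by the latent gamma layer; a pure Poisson model would set that term to zero.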