Identifying Relevant Covariates in RNA-seq Analysis by Pseudo-Variable Augmentation

Yet Nguyen, Dan Nettleton
DOI: https://doi.org/10.1007/s13253-024-00665-3
2024-11-04
Journal of Agricultural, Biological, and Environmental Statistics
Abstract: RNA-sequencing (RNA-seq) technology allows for the identification of differentially expressed genes, which are genes whose mean transcript abundance levels vary across conditions. In practice, RNA-seq datasets often include covariates that are of primary interest in addition to a set of covariates that are subject to selection. Some of these covariates may be relevant to gene expression levels, while others may be irrelevant. Ignoring relevant covariates or attempting to adjust for the effect of irrelevant covariates can compromise the identification of differentially expressed genes. To address this issue, we propose a variable selection method that uses pseudo-variables to control the expected proportion of selected covariates that are irrelevant. Our method accurately selects relevant covariates while keeping the false selection rate below a specified level. We demonstrate that our method outperforms existing methods for detecting differentially expressed genes when working with available covariates. Our method is implemented in the FSRAnalysisBS function of the R package csrnaseq, which is available at www.github.com/ntyet/csrnaseq. The analysis and simulation are available at www.github.com/ntyet/csrnaseq/tree/main/analysis.
statistics & probability, mathematical & computational biology, biology