ULV: A robust statistical method for clustered data, with applications to multisubject, single-cell omics data

Mingyu Du, Kevin Johnston, Veronica Berrocal, Wei Li, Xiangmin Xu, Zhaoxia Yu
2024-06-11
Abstract: Molecular and genomic technological advancements have greatly enhanced our understanding of biological processes by allowing us to quantify key biological variables such as gene expression, protein levels, and microbiome compositions. These breakthroughs have enabled us to achieve increasingly higher levels of resolution in our measurements, exemplified by our ability to comprehensively profile biological information at the single-cell level. However, the analysis of such data faces several critical challenges: limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. In this article, we propose a novel method, which we call U-statistic based latent variable (ULV). Our proposed method takes advantage of the robustness of rank-based statistics and exploits the statistical efficiency of parametric methods for small sample sizes. It is a computationally feasible framework that addresses all the issues mentioned above simultaneously. An additional advantage of ULV is its flexibility in modeling various types of single-cell data, including both RNA and protein abundance. The usefulness of our method is demonstrated in two studies: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. In the AML study, ULV successfully identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, and genes that would be missed without adjusting for covariates. The differentially expressed genes identified by our method are less biased toward genes with high expression levels. Furthermore, ULV identified additional gene pathways likely contributing to the mechanisms of COVID-19 severity.
Methodology, Quantitative Methods, Computation
What problem does this paper attempt to address?
This paper proposes a new method, the U-statistic based latent variable (ULV) method, for analyzing clustered data, with a focus on the challenges of single-cell omics data. These challenges include a limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. ULV combines the robustness of rank-based statistics with the small-sample efficiency of parametric methods to address these issues simultaneously. The paper notes that advances in molecular and genomic technologies make it possible to quantify key biological variables, such as gene expression and protein levels, at the single-cell level, but that analyzing such data remains difficult. The authors emphasize that ignoring the clustering structure of the data can lead to incorrect conclusions, and that applying methods designed for bulk sequencing data directly to single-cell sequencing data can produce a large number of false positives.
To address these issues, ULV builds on a generalized version of the Mann-Whitney U test and evaluates differential expression between cases and controls in a two-stage framework. In the first stage, a non-parametric, rank-based method quantifies the differences between individuals from different groups. In the second stage, a parametric model with latent variables accounts for the dependence induced by clustering. The approach is flexible across types of single-cell data and can adjust for covariates to mitigate potential biases. The paper evaluates ULV through simulation studies and two real-data applications: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. The results indicate that, compared with existing methods, ULV controls the type I error rate well while achieving comparable power.
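To make the first stage of the two-stage framework more concrete, here is a minimal sketch of a rank-based comparison between every case subject and every control subject. It is an illustration under stated assumptions, not the authors' implementation: the function names `prob_index` and `stage_one_matrix` are hypothetical, the toy data are simulated, and counting ties as one half is the usual Mann-Whitney convention, assumed here rather than taken from the paper. The second stage (the parametric latent variable model) is not shown because the summary does not specify its form.

```python
import random


def prob_index(case_cells, control_cells):
    """Mann-Whitney-style estimate of P(X > Y) + 0.5 * P(X = Y).

    X is a cell from one case subject, Y a cell from one control subject.
    Ties (common with dropouts, i.e. many zeros) count as one half.
    """
    total = 0.0
    for x in case_cells:
        for y in control_cells:
            if x > y:
                total += 1.0
            elif x == y:
                total += 0.5
    return total / (len(case_cells) * len(control_cells))


def stage_one_matrix(cases, controls):
    """One entry per (case subject, control subject) pair."""
    return [[prob_index(c, k) for k in controls] for c in cases]


def simulate_subject(n_cells, max_count, dropout=0.3):
    """Toy zero-inflated counts for one subject (an assumption, not real data)."""
    return [
        0 if random.random() < dropout else random.randint(1, max_count)
        for _ in range(n_cells)
    ]


random.seed(0)
cases = [simulate_subject(n, 8) for n in (40, 55, 30)]         # 3 case subjects
controls = [simulate_subject(n, 5) for n in (50, 35, 45, 60)]  # 4 control subjects

D = stage_one_matrix(cases, controls)
for row in D:
    print([round(d, 2) for d in row])
```

Entries near 0.5 suggest no difference between a given pair of subjects, while entries near 0 or 1 suggest a shift. Because each subject appears in several entries, the entries are not independent, which is the dependence that a second-stage model would need to handle.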
In the AML study, ULV identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, as well as genes that would be missed without adjusting for covariates, and the genes it identified were less biased toward highly expressed genes. In summary, ULV provides a new statistical tool for single-cell data that aims to improve the robustness and accuracy of the analysis, particularly for small samples, non-normal distributions, and clustered data.
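For comparison, the pseudobulk Wilcoxon rank-sum baseline mentioned above can be sketched as follows: each subject's cells are averaged into one value, and the subject-level values are compared with an exact rank-sum test. This is a hedged illustration only. The helper names `midranks` and `pseudobulk_rank_sum` are hypothetical, the numbers are made up, and averaging is just one possible way to form a pseudobulk value.

```python
from itertools import combinations
from statistics import mean


def midranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pseudobulk_rank_sum(cases, controls):
    """Exact two-sided rank-sum test on per-subject means."""
    case_means = [mean(c) for c in cases]
    control_means = [mean(c) for c in controls]
    ranks = midranks(case_means + control_means)
    n_case = len(case_means)
    observed = sum(ranks[:n_case])
    expected = n_case * (len(ranks) + 1) / 2
    # Enumerate every way of labelling n_case subjects as cases.
    sums = [sum(combo) for combo in combinations(ranks, n_case)]
    extreme = sum(
        abs(s - expected) >= abs(observed - expected) - 1e-12 for s in sums
    )
    return observed, extreme / len(sums)


# Made-up toy data: every case subject has a higher mean than every control.
cases = [[5, 6, 7], [8, 9], [7, 7, 8, 9]]
controls = [[1, 2], [2, 3, 3], [0, 1, 1], [2, 2]]

rank_sum, p_value = pseudobulk_rank_sum(cases, controls)
print(rank_sum, round(p_value, 4))
```

In this toy setting the two groups are perfectly separated, yet with only 3 case and 4 control subjects the smallest two-sided p-value the exact test can produce is 2/35 (about 0.057). That illustrates how a purely rank-based test at the subject level can lack power when the number of individuals is small.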