A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data

Michael B. Sohn, Kristin Scheible, Steven R. Gill
DOI: https://doi.org/10.1101/2024.09.05.611521
2024-09-10
Abstract: High sparsity (i.e., excessive zeros) in microbiome data, which are high-dimensional and compositional, is unavoidable and can significantly alter analysis results. However, efforts to address this high sparsity have been very limited because, in part, it is impossible to justify the validity of any such methods, as zeros in microbiome data arise from multiple sources (e.g., true absence, stochastic nature of sampling). The most common approach is to treat all zeros as structural zeros (i.e., true absence) or rounded zeros (i.e., undetected due to detection limit). However, this approach can underestimate the mean abundance while overestimating its variance because many zeros can arise from the stochastic nature of sampling and/or functional redundancy (i.e., different microbes can perform the same functions), thus losing power. In this manuscript, we argue that treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and we propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data. We demonstrate the merits of the proposed method and its beneficial effects on downstream analyses in extensive simulation studies. We reanalyzed a type II diabetes (T2D) dataset to determine differentially abundant species between T2D patients and non-diabetic controls.
Bioinformatics
What problem does this paper attempt to address?
This paper addresses the problem of high sparsity (i.e., a large number of zeros) in microbiome data. Microbiome data are usually high-dimensional and compositional, and their zeros can arise from several sources: true absence, non-detection due to the detection limit, the stochastic nature of sampling, or functional redundancy. These zeros can significantly affect analysis results, especially in log-ratio analysis, because a log-ratio is undefined when one of its components is zero. Most current methods treat all zeros as either structural zeros (i.e., true absence) or rounded zeros (i.e., undetected due to the detection limit), but these assumptions may not hold and are difficult to verify.

The paper proposes a semi-parametric multiple imputation method (Multiple Imputation for Compositional data, MIC) for high-sparse, high-dimensional, compositional microbiome data. The method treats all zeros as missing values (sampling zeros) and argues that, as long as the proportion of structural zeros is similar across taxa, this assumption has a limited effect on analysis results. The authors aim to reduce the bias caused by incorrect handling of zeros and to improve the accuracy of downstream analyses.

Specifically, the main contributions of the paper include:

1. **Proposing a new imputation method**: Handling zeros in high-sparse, high-dimensional, compositional microbiome data through semi-parametric multiple imputation.
2. **Verifying the effectiveness of the method**: Showing, through extensive simulation studies, that the method performs well in estimating mean proportions and ratios and is robust across different proportions of zeros and correlation structures.
3. **Applying the method to real data**: Reanalyzing a type II diabetes (T2D) dataset to identify differentially abundant species between T2D patients and non-diabetic controls.

Through these efforts, the paper offers a new method for microbiome data analysis that is intended to give a more accurate picture of the relationship between the microbiome and health and disease.
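Two points from the summary can be illustrated with a small sketch: a log-ratio transform breaks as soon as a composition contains a zero, and sampling alone can produce many zeros for a taxon that is truly present. This is not the MIC method and not code from the paper. The taxon proportions, read depth, and sample count are made-up values, the centered log-ratio is used here only as one common log-ratio transform, and the helper `clr` is a hypothetical name.

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform of a single composition (1-D array)."""
    # Suppress warnings from log(0) and (-inf) - (-inf); the non-finite
    # results are exactly what this sketch is meant to expose.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_x = np.log(x)
        return log_x - log_x.mean()


# Assumed true proportions for five taxa; every taxon is truly present.
true_p = np.array([0.50, 0.30, 0.15, 0.04, 0.01])

# Without zeros the transform is finite and its components sum to zero.
clr_no_zero = clr(true_p)

# With zeros, log(0) = -inf, so every entry of the result is inf or nan.
clr_with_zero = clr(np.array([0.55, 0.30, 0.15, 0.00, 0.00]))

# Sampling zeros: draw shallow samples (30 reads each) from the true proportions.
rng = np.random.default_rng(0)
depth, n_samples = 30, 2000
counts = rng.multinomial(depth, true_p, size=n_samples)

# Fraction of samples in which the rarest taxon is unobserved although present.
observed_zero_frac = np.mean(counts[:, -1] == 0)
# Probability of a zero count for that taxon under multinomial sampling.
expected_zero_frac = (1 - true_p[-1]) ** depth

print("CLR without zeros:", clr_no_zero)
print("CLR with zeros:   ", clr_with_zero)
print("Zero fraction (observed, expected):", observed_zero_frac, expected_zero_frac)
```

In this toy setting the rarest taxon is missing from most samples purely by chance, which is the kind of zero that would be mislabeled if every zero were treated as a true absence.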