It's All Relative: New Regression Paradigm for Microbiome Compositional Data

Gen Li, Yan Li, Kun Chen
DOI: https://doi.org/10.48550/arXiv.2011.05951
2020-11-12
Abstract: Microbiome data are complex in nature, involving high dimensionality, compositionality, zero inflation, and taxonomic hierarchy. Compositional data reside in a simplex that does not admit the standard Euclidean geometry. Most existing compositional regression methods rely on transformations that are inadequate or even inappropriate in modeling data with excessive zeros and taxonomic structure. We develop a novel relative-shift regression framework that directly uses compositions as predictors. The new framework provides a paradigm shift for compositional regression and offers a superior biological interpretation. New equi-sparsity and taxonomy-guided regularization methods and an efficient smoothing proximal gradient algorithm are developed to facilitate feature aggregation and dimension reduction in regression. As a result, the framework can automatically identify clinically relevant microbes even if they are important at different taxonomic levels. A unified finite-sample prediction error bound is developed for the proposed regularized estimators. We demonstrate the efficacy of the proposed methods in extensive simulation studies. The application to a preterm infant study reveals novel insights into the association between the gut microbiome and neurodevelopment.
Methodology, Applications
What problem does this paper attempt to address?
This paper addresses the difficulties that microbiome data pose for regression analysis, in particular high dimensionality, compositionality, zero inflation, and taxonomic hierarchical structure. Specifically:

1. **Special properties of compositional data**: Microbiome data are usually expressed as relative abundances (compositions). Such data lie in a simplex, where standard Euclidean geometry does not apply. Traditional regression methods therefore transform the data first (for example, with a log-ratio transformation), but these transformations are often inadequate or inappropriate when the data contain excessive zeros and taxonomic structure.
2. **Limitations of existing methods**:
   - **Zero-value handling**: The commonly used log transformation cannot handle zeros directly. The usual workaround is to replace zeros with a small positive number, which may introduce bias.
   - **Poor biological interpretability**: The biological meaning of the transformed data is hard to interpret intuitively.
   - **Insufficient use of the taxonomic tree structure**: Existing methods struggle to incorporate the taxonomic tree into regularization, so results at different taxonomic levels can be inconsistent.
3. **Proposed new framework**: To solve these problems, the authors develop a relative-shift regression framework that uses compositional data directly as predictors, with no transformation required. The framework represents a paradigm shift in compositional regression and offers better biological interpretability.
4. **Model features**:
   - **Intercept-free linear regression**: Because the components of a composition sum to one, an intercept would be redundant with the predictors. Dropping it makes the model fully identifiable on compositional data.
   - **Direct handling of zero values**: No additional steps are required to handle zeros.
   - **Feature aggregation**: Equi-sparsity and taxonomy-guided regularization aggregate features, which reduces the dimension and improves interpretability.
5. **Theoretical contributions**: The authors also establish a unified finite-sample prediction error bound for the proposed regularized estimators, which supports the use of the method in high-dimensional settings.

In summary, the paper aims to provide a regression method that handles the complex characteristics of microbiome data better while retaining good biological interpretability and statistical performance.
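The intercept-free model and the feature-aggregation idea summarized above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: it fits ordinary least squares rather than the regularized estimator, the data are simulated, and the helper names (`simulate_compositions`, `fit_no_intercept`), the coefficient values, and the grouping of taxa are all hypothetical choices made for this example.

```python
import numpy as np


def simulate_compositions(n, p, zero_prob, rng):
    """Draw n compositions over p taxa with exact zeros; each row sums to 1."""
    counts = rng.gamma(shape=1.0, scale=1.0, size=(n, p))
    mask = rng.random((n, p)) < zero_prob
    mask[:, 0] = False  # keep taxon 0 present so that no row is all zeros
    counts[mask] = 0.0
    return counts / counts.sum(axis=1, keepdims=True)


def fit_no_intercept(X, y):
    """Plain least squares without an intercept (assumed stand-in estimator)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


rng = np.random.default_rng(0)
n, p = 200, 6
X = simulate_compositions(n, p, zero_prob=0.3, rng=rng)

# A log transform breaks on exact zeros; the raw compositions do not.
with np.errstate(divide="ignore"):
    has_inf = bool(np.isinf(np.log(X)).any())

# Hypothetical coefficients: taxa 0-2 share one value, taxa 3-4 another.
beta_true = np.array([2.0, 2.0, 2.0, -1.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(n)
beta_hat = fit_no_intercept(X, y)

# Why no intercept: rows sum to 1, so an intercept c gives the same
# predictions as adding c to every coefficient.
c = 3.0
pred_with_intercept = c + X @ beta_true
pred_shifted_coefs = X @ (beta_true + c)

# Equal coefficients imply aggregation: taxa with the same coefficient
# can be summed into one higher-level feature without changing the fit.
X_agg = np.column_stack([
    X[:, :3].sum(axis=1),   # taxa 0-2 merged
    X[:, 3:5].sum(axis=1),  # taxa 3-4 merged
    X[:, 5],                # taxon 5 kept on its own
])
beta_agg = np.array([2.0, -1.0, 0.5])

# Shifting abundance delta from taxon 0 to taxon 3 changes the
# prediction by delta * (beta_3 - beta_0).
x = X[0].copy()
delta = 0.5 * x[0]
x_shifted = x.copy()
x_shifted[0] -= delta
x_shifted[3] += delta
change = (x_shifted - x) @ beta_true

print("log transform produces infinities:", has_inf)
print("estimated coefficients:", np.round(beta_hat, 2))
print("aggregated fit matches:", np.allclose(X @ beta_true, X_agg @ beta_agg))
```

In this sketch only differences between coefficients carry meaning, which is one way to read the "relative-shift" name: moving abundance from one taxon to another changes the response in proportion to the gap between their coefficients.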