Bayesian Variable Selection for Multivariate Zero-Inflated Models: Application to Microbiome Count Data

Kyu Ha Lee, Brent A. Coull, Anna-Barbara Moscicki, Bruce J. Paster, Jacqueline R. Starr
DOI: https://doi.org/10.48550/arXiv.1711.00157
2018-05-21
Abstract: Microorganisms play critical roles in human health and disease. It is well known that microbes live in diverse communities in which they interact synergistically or antagonistically. Thus, for estimating microbial associations with clinical covariates, multivariate statistical models are preferred. Multivariate models allow one to estimate and exploit complex interdependencies among multiple taxa, yielding more powerful tests of exposure or treatment effects than application of taxon-specific univariate analyses. In addition, the analysis of microbial count data requires special attention because data commonly exhibit zero inflation. To meet these needs, we developed a Bayesian variable selection model for multivariate count data with excess zeros that incorporates information on the covariance structure of the outcomes (counts for multiple taxa), while estimating associations with the mean levels of these outcomes. Although there has been a great deal of effort in zero-inflated models for longitudinal data, little attention has been given to high-dimensional multivariate zero-inflated data modeled via a general correlation structure. Through simulation, we compared performance of the proposed method to that of existing univariate approaches, for both the binary and count parts of the model. When outcomes were correlated, the proposed variable selection method maintained type I error while boosting the ability to identify true associations in the binary component of the model. For the count part of the model, in some scenarios the univariate method had higher power than the multivariate approach. This higher power was at a cost of a highly inflated false discovery rate not observed with the proposed multivariate method. We applied the approach to oral microbiome data from the Pediatric HIV/AIDS Cohort Oral Health Study and identified five species (of 44) associated with HIV infection.
Applications
What problem does this paper attempt to address?
This paper addresses the statistical analysis of microbiome data, in particular the modeling of high-dimensional multivariate zero-inflated data. Specifically, it focuses on:

1. **Handling a large number of zero values in microbiome data**: Microbiome data often contain many zeros (that is, some microorganisms are not detected in many samples), and these zeros exceed the number expected under standard count distributions (such as the Poisson, negative binomial, or Dirichlet multinomial distribution). A statistical method that can handle such zero-inflated data is therefore required.
2. **Considering the interdependent relationships among multiple microorganisms**: Microbiome data are usually multivariate, and different microorganisms interact in complex ways (synergistically or antagonistically). Traditional univariate analysis methods ignore these interdependencies and may lose test power. A multivariate method that accounts for the interdependencies among multiple microorganisms is therefore required.
3. **Performing outcome-specific covariate selection**: Because microbiome data usually contain multiple response variables (that is, multiple microorganisms), a method is needed that selects covariates separately for each response variable in order to identify exposure or treatment factors related to specific microorganisms.

To solve these problems, the paper proposes a multivariate zero-inflated model (MZIP model) based on Bayesian variable selection, which can:

- Handle a large number of zero values in multivariate count data.
- Consider the complex interdependent relationships among multiple microorganisms.
- Perform outcome-specific covariate selection and identify exposure or treatment factors related to specific microorganisms.

Through simulation studies and an application to real data, the paper reports that the model handles zero-inflated data and multivariate correlation well, especially in identifying true associations in the binary component of the model when outcomes are correlated.
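
To make the idea of "more zeros than a standard count distribution expects" concrete, the following is a minimal sketch of a univariate zero-inflated Poisson probability mass function. It is an illustration only, not the paper's multivariate model: the Poisson count component, the function names, and the parameter values `pi` (probability of a structural zero) and `lam` (Poisson rate) are all assumptions chosen for this example.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing count k under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson pmf (illustrative, univariate).

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can also produce a zero.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Hypothetical parameter values, chosen only for illustration.
pi, lam = 0.4, 3.0

# The mean of the zero-inflated distribution is (1 - pi) * lam.
zip_mean = (1.0 - pi) * lam

# Compare the zero probability with that of a plain Poisson with the same mean.
p0_zip = zip_pmf(0, pi, lam)
p0_poisson = poisson_pmf(0, zip_mean)

print(f"P(Y = 0) under zero-inflated Poisson: {p0_zip:.4f}")
print(f"P(Y = 0) under Poisson, same mean:    {p0_poisson:.4f}")
```

For any `pi` strictly between 0 and 1 and any positive `lam`, the zero-inflated distribution places more mass at zero than a plain Poisson with the same mean, which is why fitting a standard count model to such data tends to underestimate the number of zeros.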