BAMITA: Bayesian Multiple Imputation for Tensor Arrays

Ziren Jiang, Gen Li, Eric F. Lock
2024-10-31
Abstract: Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework, that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, for scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints and used to infer trends in species diversity for the population. Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation.
Methodology, Quantitative Methods, Machine Learning
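The model named in the abstract, a CP factorization with a separable residual covariance, can be written out for a three-way tensor roughly as follows. This is a generic sketch of that model class, not the paper's exact specification: the symbols $\mathbf{u}^{(k)}_r$, $R$, and $\Sigma_k$ are notation assumed here, and the paper may restrict which modes carry a non-identity covariance.

```latex
% Rank-R CP structure plus a residual tensor (assumed notation)
\mathcal{X} \;=\; \sum_{r=1}^{R} \mathbf{u}^{(1)}_r \circ \mathbf{u}^{(2)}_r \circ \mathbf{u}^{(3)}_r \;+\; \mathcal{E},
\qquad
x_{ijk} \;=\; \sum_{r=1}^{R} u^{(1)}_{ir}\, u^{(2)}_{jr}\, u^{(3)}_{kr} \;+\; e_{ijk}

% Separable (Kronecker-structured) residual covariance
\operatorname{vec}(\mathcal{E}) \;\sim\; \mathcal{N}\!\left(\mathbf{0},\; \Sigma_3 \otimes \Sigma_2 \otimes \Sigma_1\right),
\qquad
\operatorname{Cov}\!\left(e_{ijk},\, e_{i'j'k'}\right) \;=\; [\Sigma_1]_{ii'}\,[\Sigma_2]_{jj'}\,[\Sigma_3]_{kk'}
```

Under this form, a missing fiber (for example, every feature for one subject at one timepoint) is imputed from the low-rank term together with whatever residual correlation the $\Sigma_k$ carry along the other modes.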
### What problem does this paper attempt to solve?

This paper addresses multiple imputation of missing values in multi-way arrays (tensors), particularly in biomedical applications. Specifically, the authors focus on how to go beyond point estimates and accurately capture uncertainty when data are missing. The main problems and objectives of the study are as follows:

1. **Limitations of existing methods**:
   - Most existing tensor completion methods provide only a single point estimate for each missing value and do not quantify its uncertainty.
   - Treating such point estimates as observed data leads to underestimated uncertainty in subsequent analyses, which undermines the validity of inference.
2. **Research objectives**:
   - Propose a Bayesian multiple imputation method (BAMITA) that generates multiple simulated values to reflect the uncertainty in missing entries.
   - Draw the imputations from the posterior predictive distribution so that uncertainty is correctly propagated into subsequent analyses.
   - Use a CANDECOMP/PARAFAC (CP) factorization with efficient and widely applicable conjugate priors to improve computational efficiency and applicability.
   - Adopt a separable covariance structure for the residuals to better capture correlations along different modes.
3. **Specific application scenarios**:
   - In longitudinal microbiome studies, the microbial abundance data for some subjects are entirely missing at some timepoints, so whole fibers of the tensor are unobserved.
   - The method can be used to capture uncertainty in the full microbiome profile at the missing timepoints and to infer trends in species diversity for the population.

### Main contributions

- **Multiple imputation**: A new Bayesian multiple imputation method is proposed that captures uncertainty while imputing missing values.
- **Uncertainty propagation**: Uncertainty is correctly propagated through subsequent analyses, improving the validity and reliability of inference.
- **Efficient algorithm**: An efficient MCMC sampling algorithm, combined with the CP factorization and a separable covariance structure, makes the model suitable for large-scale, high-dimensional data.
- **Empirical verification**: Simulation experiments and analyses of real microbiome data show that the method performs well in both imputation accuracy and uncertainty calibration.

In conclusion, BAMITA addresses a key gap in existing tensor completion methods, which do not capture uncertainty in the imputed values, and provides a more reliable approach to data analysis in the biomedical field.
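
To make the uncertainty propagation described above concrete, the sketch below pools one downstream estimate across several imputed datasets using Rubin's rules, the standard combining rules for multiple imputation. The summary does not say which pooling procedure the authors use, so this is an illustration under that assumption; the function name `pool_rubin` and the numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool M completed-data analyses with Rubin's rules.

    estimates[m] is the point estimate computed on imputed dataset m;
    within_variances[m] is that estimate's squared standard error.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(within_variances)   # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical example: a diversity trend estimated on five imputed tensors.
estimates = [2.0, 2.4, 1.8, 2.2, 2.1]
within = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total_var = pool_rubin(estimates, within)
print(pooled, total_var)
```

The between-imputation term is what a single point-estimate imputation omits: with only one completed dataset it cannot be estimated, so the reported variance reduces to the within-imputation part and understates the true uncertainty.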