Abstract: In metabolomics, the study of small molecules in biological samples, data are often acquired through mass spectrometry. The resulting data contain highly correlated variables, typically with more variables than observations. Missing data are prevalent, and imputation is critical because data acquisition can be difficult and expensive and many analysis methods require complete data. In such data, missing at random (MAR) missingness occurs due to acquisition or processing error, while missing not at random (MNAR) missingness occurs when true values lie below the threshold for detection. Existing imputation methods generally assume a single missingness type, or impute values that fall outside the physical constraints of the data and therefore lack utility. A truncated factor analysis model with an infinite number of factors (tIFA) is proposed to facilitate imputation in metabolomics data in a statistically and physically principled manner. Truncated distributional assumptions underpin tIFA, ensuring cognisance of the data's physical constraints when imputing. Further, tIFA allows for both MAR and MNAR missingness, and a Bayesian inferential approach provides uncertainty quantification for imputed values and missingness types. The infinite factor model parsimoniously models the high-dimensional, multicollinear data, with nonparametric shrinkage priors obviating the need for model selection tools to infer the number of latent factors. A simulation study is performed to assess the performance of tIFA, and an application to a urinary metabolomics dataset yields a complete dataset with practically useful imputed values and associated uncertainty, ready for use in metabolomics analyses. Open-source R code accompanies tIFA, facilitating its widespread use.
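The distinction between MAR and MNAR missingness, and the idea of imputing within physical constraints, can be illustrated with a minimal Python sketch. It is not the tIFA model: it uses independent per-variable truncated normal draws rather than a factor model, it treats the missingness type of each entry as known by construction (whereas tIFA infers it), and the toy data, the limit of detection `lod`, the 5% MAR rate, the bounds `(0, lod)`, and the helper `sample_truncated_normal` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 samples x 5 variables of positive intensities.
X = rng.lognormal(mean=2.0, sigma=0.5, size=(50, 5))

# Assumed limit of detection (LOD): values below it are not recorded (MNAR).
lod = np.quantile(X, 0.10)
mnar_mask = X < lod

# MAR: a random 5% of the remaining entries are lost regardless of their value.
mar_mask = (~mnar_mask) & (rng.random(X.shape) < 0.05)

X_obs = X.copy()
X_obs[mnar_mask | mar_mask] = np.nan


def sample_truncated_normal(mean, sd, lower, upper, size, rng):
    """Draw from N(mean, sd^2) restricted to (lower, upper) by rejection."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draws = rng.normal(mean, sd, size=size)
        keep = draws[(draws > lower) & (draws < upper)]
        take = min(keep.size, size - filled)
        out[filled:filled + take] = keep[:take]
        filled += take
    return out


# Impute each variable from a normal fitted to its observed values,
# truncated so that imputed values respect the assumed physical bounds.
X_imp = X_obs.copy()
for j in range(X.shape[1]):
    col = X_obs[:, j]
    mu, sd = np.nanmean(col), np.nanstd(col)
    n_mnar = int(mnar_mask[:, j].sum())
    n_mar = int(mar_mask[:, j].sum())
    # MNAR entries: positive and below the limit of detection.
    X_imp[mnar_mask[:, j], j] = sample_truncated_normal(mu, sd, 0.0, lod, n_mnar, rng)
    # MAR entries: positive, with no upper bound.
    X_imp[mar_mask[:, j], j] = sample_truncated_normal(mu, sd, 0.0, np.inf, n_mar, rng)
```

Repeating the draws many times would give a crude picture of imputation uncertainty; in the Bayesian approach described in the abstract, that uncertainty would instead come from the posterior distribution.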
