Abstract

Background: In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.

Results: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of "imputability measure" (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package, "phenomeImpute", is made publicly available.

Conclusions: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers, although no method universally performed the best. Imputing missing values with low imputability measures greatly increased imputation errors and could degrade downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author's publication website.
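
The idea of selecting a method through a second layer of missingness simulation can be sketched roughly as follows. This is a minimal illustration in Python, not the phenomeImpute R package and not the authors' implementation: the function names (`mean_impute`, `knn_subject_impute`, `self_training_select`), the masking fraction, the number of repetitions and the scaled RMSE criterion are all assumptions made for the example, and only continuous variables are handled. Observed entries are masked at random, each candidate method imputes them, and the method with the lowest error on the masked entries is selected.

```python
import numpy as np

def mean_impute(X):
    """Baseline: fill each missing entry with its column (variable) mean."""
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = col_means[cols]
    return filled

def knn_subject_impute(X, k=3):
    """Illustrative KNN by subjects: average the k nearest rows that observe the variable."""
    filled = X.copy()
    fallback = np.nanmean(X, axis=0)  # used when no comparable neighbor exists
    for i in np.unique(np.where(np.isnan(X))[0]):
        diff = X - X[i]
        n_shared = (~np.isnan(diff)).sum(axis=1)
        # Mean squared difference over variables observed in both subjects.
        dist = np.where(n_shared > 0,
                        np.nansum(diff ** 2, axis=1) / np.maximum(n_shared, 1),
                        np.inf)
        dist[i] = np.inf  # a subject is not its own neighbor
        order = np.argsort(dist)
        for j in np.where(np.isnan(X[i]))[0]:
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not np.isnan(X[r, j])][:k]
            filled[i, j] = X[donors, j].mean() if donors else fallback[j]
    return filled

def self_training_select(X, methods, mask_frac=0.1, n_rep=5, seed=0):
    """Mask observed entries (second layer of missingness) and score each method."""
    rng = np.random.default_rng(seed)
    obs_rows, obs_cols = np.where(~np.isnan(X))
    n_mask = max(1, int(mask_frac * obs_rows.size))
    scale = np.nanstd(X, axis=0)
    scale[scale == 0] = 1.0  # avoid division by zero for constant variables
    errors = {name: [] for name in methods}
    for _ in range(n_rep):
        pick = rng.choice(obs_rows.size, size=n_mask, replace=False)
        r, c = obs_rows[pick], obs_cols[pick]
        X2 = X.copy()
        X2[r, c] = np.nan  # hide values whose truth is known
        for name, impute in methods.items():
            est = impute(X2)[r, c]
            rmse = np.sqrt(np.mean(((est - X[r, c]) / scale[c]) ** 2))
            errors[name].append(rmse)
    mean_err = {name: float(np.mean(v)) for name, v in errors.items()}
    return min(mean_err, key=mean_err.get), mean_err

# Usage on synthetic data (hypothetical example, not one of the paper's datasets).
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 1))
loadings = np.array([[1.0, -0.8, 1.2, 0.9, -1.1, 0.7]])
data = latent @ loadings + 0.1 * rng.normal(size=(100, 6))
data[rng.random(data.shape) < 0.1] = np.nan  # first layer of missingness
candidates = {"mean": mean_impute, "knn_subjects": knn_subject_impute}
best, scores = self_training_select(data, candidates)
print(best, scores)
```

Because the masked entries are drawn from values that were actually observed, the error of each candidate can be measured directly on the dataset at hand, which is why the selection can differ from one dataset to another.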


Missing value imputation in high-dimensional phenomic data: imputable or not, and how?
