Abstract

Background: In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic datasets, however, contain continuous, nominal, binary and ordinal data types, which rules out most of these methods. Although several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.

Results: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept, the "imputability measure" (IM), to identify missing values that are fundamentally inadequate to impute. We also developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package, "phenomeImpute", is made publicly available.

Conclusions: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers, although no method was universally the best. Imputing missing values with low imputability measures greatly increased imputation errors and could potentially degrade downstream analyses. The STS scheme accurately selected the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author's publication website.
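The "second layer of missingness" idea behind the STS scheme can be illustrated with a minimal Python sketch. It rests on several assumptions that are not from the paper: the data are purely numeric, error is measured by root mean squared error, and the two candidate imputers (column means and a simple nearest-subjects average) are simplified stand-ins for the methods compared in the study. All function names are hypothetical and are not the phenomeImpute API.

```python
import numpy as np

def impute_mean(X):
    """Fill each missing entry with the mean of its column (variable)."""
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = col_means[cols]
    return filled

def impute_knn_subjects(X, k=3):
    """Fill missing entries with the average of the k nearest subjects (rows).

    Distance is the root mean squared difference over variables observed in
    both rows. This is an illustrative stand-in, not the paper's KNN-S.
    """
    filled = impute_mean(X)  # fallback when no donor has the value
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        diff = X - X[i]                      # NaN where either row is missing
        counts = (~np.isnan(diff)).sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = np.sqrt(np.nansum(diff ** 2, axis=1) / counts)
        dist[counts == 0] = np.inf           # no shared variables
        dist[i] = np.inf                     # a row is not its own neighbor
        order = np.argsort(dist)
        for j in np.where(np.isnan(X[i]))[0]:
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not np.isnan(X[r, j])][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled

def select_by_self_training(X, methods, mask_frac=0.1, n_rounds=5, seed=0):
    """Hide some observed entries, impute them, and rank methods by RMSE."""
    rng = np.random.default_rng(seed)
    obs_rows, obs_cols = np.where(~np.isnan(X))
    n_mask = max(1, int(mask_frac * obs_rows.size))
    errors = {name: [] for name in methods}
    for _ in range(n_rounds):
        pick = rng.choice(obs_rows.size, size=n_mask, replace=False)
        r, c = obs_rows[pick], obs_cols[pick]
        X_masked = X.copy()
        X_masked[r, c] = np.nan              # second layer of missingness
        for name, impute in methods.items():
            est = impute(X_masked)
            errors[name].append(np.sqrt(np.mean((est[r, c] - X[r, c]) ** 2)))
    mean_err = {name: float(np.mean(v)) for name, v in errors.items()}
    return min(mean_err, key=mean_err.get), mean_err

# Toy data: two well-separated groups of subjects, 10% of entries missing.
rng = np.random.default_rng(1)
centers = np.repeat([0.0, 10.0], 20)[:, None]
X = centers + rng.normal(scale=0.5, size=(40, 6))
X[rng.random(X.shape) < 0.1] = np.nan        # first layer: the "real" MVs
methods = {"mean": impute_mean, "knn_subjects": impute_knn_subjects}
best, mean_err = select_by_self_training(X, methods)
print(best, mean_err)
```

The selected method would then be applied to the original matrix `X` to impute the real missing values. Extending the sketch to nominal, binary and ordinal variables would require type-appropriate distances and error measures, which are omitted here.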

Imputing Missing Data by Fully Conditional Models: Some Cautionary Examples and Guidelines

Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence

Multiple Imputation with Multivariate Imputation by Chained Equation (mice) Package

Missing Data Imputation: Focusing on Single Imputation

Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation

Integrating multi-source block-wise missing data in model selection

What Is a Good Imputation Under MAR Missingness?

Incomplete Data in Epidemiology and Medical Statistics

On Missing Data Imputation for IRB Models

Multiple Imputation by Ordered Monotone Blocks with Application to the Anthrax Vaccine Research Program

Systematically missing data in distributed data networks: multiple imputation when data cannot be pooled

Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches

Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

Nonparametric Statistical Inference and Imputation for Incomplete Categorical Data

Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison

A Realistic Evaluation of Methods for Handling Missing Data When There is a Mixture of MCAR, MAR, and MNAR Mechanisms in the Same Dataset

Multiple Imputation: A Review of Practical and Theoretical Findings

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

A stacked approach for chained equations multiple imputation incorporating the substantive model

Multiple Imputation When Variables Exceed Observations: An Overview of Challenges and Solutions

A Comparative Study of Imputation Methods for Multivariate Ordinal Data