Abstract:

Background: In modern biomedical research on complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which preclude the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation.

Results: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method, and provided a practical guideline for general applications. We introduced a novel concept of "imputability measure" (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied the different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package "phenomeImpute" is made publicly available.

Conclusions: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers, although no method universally performed the best. Imputing missing values with low imputability measures greatly increased imputation errors and could potentially degrade downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author's publication website.
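The self-training selection idea in the abstract (hide a second layer of values that were actually observed, impute them with each candidate method, and keep the method with the lowest error) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the phenomeImpute package: the function names, the two candidate methods (column-mean imputation and a simple nearest-subject KNN), the RMSE criterion, and the restriction to continuous variables are all assumptions made for illustration.

```python
import numpy as np

def impute_mean(X):
    """Fill each missing entry with the mean of its column (variable)."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)  # assumes no column is entirely missing
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_knn_subjects(X, k=5):
    """Fill missing entries with the average over the k nearest subjects (rows)."""
    filled = impute_mean(X)  # provisional fill, used only to compute distances
    out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        dist = np.sqrt(((filled - filled[i]) ** 2).sum(axis=1))
        dist[i] = np.inf  # a subject is not its own neighbor
        neighbors = np.argsort(dist)[:k]
        for j in np.where(np.isnan(X[i]))[0]:
            vals = X[neighbors, j]
            vals = vals[~np.isnan(vals)]
            # fall back to the column mean if no neighbor observed variable j
            out[i, j] = vals.mean() if vals.size else filled[i, j]
    return out

def self_training_select(X, methods, mask_frac=0.1, n_rounds=5, seed=0):
    """Pick the method with the lowest RMSE on artificially hidden observed values."""
    rng = np.random.default_rng(seed)
    obs_rows, obs_cols = np.where(~np.isnan(X))
    n_mask = max(1, int(mask_frac * obs_rows.size))
    errors = {name: [] for name in methods}
    for _ in range(n_rounds):
        # second layer of missingness: hide a random subset of observed entries
        pick = rng.choice(obs_rows.size, size=n_mask, replace=False)
        r, c = obs_rows[pick], obs_cols[pick]
        X_masked = X.copy()
        X_masked[r, c] = np.nan
        for name, fn in methods.items():
            imputed = fn(X_masked)
            rmse = np.sqrt(np.mean((imputed[r, c] - X[r, c]) ** 2))
            errors[name].append(rmse)
    mean_err = {name: float(np.mean(e)) for name, e in errors.items()}
    best = min(mean_err, key=mean_err.get)
    return best, mean_err
```

A call such as `self_training_select(X, {"mean": impute_mean, "knn_subjects": impute_knn_subjects})` returns the name of the lowest-error method together with the per-method average RMSE. Handling nominal, binary and ordinal variables, as the paper's setting requires, would need type-specific distances and error measures that this sketch omits.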

Microarray Missing Data Imputation Based on A Set Theoretic Framework and Biological Constraints

A meta-data based method for DNA microarray imputation

Microarray Missing Value Imputation

Missing Value Estimation Algorithms on Cluster and Representativeness Preservation of Gene Expression Microarray Data

An efficient ensemble method for missing value imputation in microarray gene expression data

Missing Microarray Data Estimation Based on Projection Onto Convex Sets Method

The theoretic framework of local weighted approximation for microarray missing value estimation

Usage of Clustering and Weighted Nearest Neighbors for Efficient Missing Data Imputation of Microarray Gene Expression Dataset

A Global Learning with Local Preservation Method for Microarray Data Imputation

Microarray Missing Value Imputation: A Regularized Local Learning Method

A hybrid imputation approach for microarray missing value estimation

Missing Value Estimation for DNA Microarray Gene Expression Data by Support Vector Regression Imputation and Orthogonal Coding Scheme

Missing value estimation for microarray data based on fuzzy C-means clustering

DNA Microarray Data Imputation and Significance Analysis of Differential Expression

A Weighted Local Least Squares Imputation Method for Missing Value Estimation in Microarray Gene Expression Data

Evaluations on Several Imputation Approaches of Integrated Omics Data

Gaussian Mixture Clustering and Imputation of Microarray Data

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

An imputation approach for oligonucleotide microarrays

A joint optimization framework integrated with biological knowledge for clustering incomplete gene expression data

Effects of Replacing the Unreliable cDNA Microarray Measurements on the Disease Classification Based on Gene Expression Profiles and Functional Modules