Abstract: In many application settings, data have missing entries, which makes subsequent analyses challenging. An abundant literature addresses missing values in an inferential framework, aiming to estimate parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and test data. We first rewrite classic missing-values results for this setting. We then show the consistency of two approaches in prediction: test-time multiple imputation and single imputation. A striking result is that the widely used method of imputing with a constant prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is frowned upon because it distorts the distribution of the data. The consistency of such a popular, simple approach is important in practice. Finally, to contrast procedures based on imputation prior to learning with procedures that optimize the missing-value handling for prediction, we consider decision trees. Indeed, decision trees are among the few methods that can tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After empirically comparing different missing-values strategies in trees, we recommend using the "missing incorporated in attribute" method, as it can handle both non-informative and informative missing values.
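As a rough illustration of two strategies the abstract names, the sketch below shows constant imputation applied before learning, and a feature-duplication encoding that is one way the "missing incorporated in attribute" idea is sometimes emulated for tree learners without native missing-value support. The function names, the fill value 0.0, the sentinel magnitude `big`, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
import math


def impute_constant(rows, fill_value=0.0):
    """Replace every missing entry (NaN) with one fixed constant.

    The same constant would be applied to training and test data.
    """
    return [[fill_value if math.isnan(v) else v for v in row] for row in rows]


def mia_encode(rows, big=1e9):
    """Illustrative MIA-style encoding (an assumption, not the paper's code).

    Each feature is duplicated. In the first copy a missing entry becomes
    -big, in the second copy it becomes +big. A threshold split on the
    first copy sends missing values left; a split on the second copy sends
    them right, so the tree can pick the side for missing values.
    """
    encoded = []
    for row in rows:
        low = [-big if math.isnan(v) else v for v in row]
        high = [big if math.isnan(v) else v for v in row]
        encoded.append(low + high)
    return encoded


# Toy incomplete table: NaN marks a missing entry.
X = [[1.0, math.nan], [math.nan, 4.0], [3.0, 5.0]]

X_const = impute_constant(X, fill_value=0.0)
X_mia = mia_encode(X)
```

Either encoded table could then be passed to an ordinary learner. The constant-imputed table keeps the original number of columns, while the duplicated encoding doubles it in exchange for letting splits treat missingness itself as a signal.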

Efficient missing data imputation for supervised learning

Missing Values Imputation Based on Iterative Learning

Missing Data Imputation: Focusing on Single Imputation

Iterative missing value imputation based on feature importance

Missing Data Imputation by Utilizing Information Within Incomplete Instances

Missing value imputation using unsupervised machine learning techniques

Missing Value Estimation for Mixed-Attribute Data Sets

On the consistency of supervised learning with missing values

Data Imputation by Pursuing Better Classification: A Supervised Kernel-Based Method

Missing Value Imputation With Unsupervised Backpropagation

Imputations for High Missing Rate Data in Covariates Via Semi-supervised Learning Approach

Locally linear reconstruction based missing value imputation for supervised learning

Missing values imputation hypothesis: An experimental evaluation

An Experimental Survey of Missing Data Imputation Algorithms

A web-based approach to data imputation

Method for Incomplete and Imbalanced Data Based on Multivariate Imputation by Chained Equations and Ensemble Learning

Discrete Missing Data Imputation Using Multilayer Perceptron and Momentum Gradient Descent

Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

Missing Data Imputation for Classification Problems

Efficient and effective data imputation with influence functions