Abstract: By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance. And in a second follow-up experiment, we evaluate numerical imputation of one-hot encoded categorical attributes. We reach the following conclusions. Firstly, missing-indicators generally increase classification performance. Secondly, with missing-indicators, nearest neighbour and iterative imputation do not lead to better performance than simple mean/mode imputation. Thirdly, for decision trees, pruning is necessary to prevent overfitting. Fourthly, the thresholds above which missing-indicators are more likely than not to improve performance are lower for categorical attributes than for numerical attributes. Lastly, mean imputation of numerical attributes preserves some of the information from missing values. Consequently, when not using missing-indicators it can be advantageous to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
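The combination of mean imputation and missing-indicators described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: the function name `impute_with_indicators` and the toy array are hypothetical, and the sketch assumes numerical attributes with missing values encoded as NaN and at least one observed value per column. For brevity the column means are computed on the same array that is imputed; in a predictive setting they would normally be estimated on the training data only.

```python
import numpy as np

def impute_with_indicators(X):
    """Mean-impute NaNs column-wise and append 0/1 missing-indicator columns.

    Hypothetical helper for illustration. An indicator column is added only
    for attributes that contain at least one missing value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                    # boolean mask of missing entries
    col_means = np.nanmean(X, axis=0)        # per-column mean of observed values
    X_imputed = np.where(missing, col_means, X)
    has_missing = missing.any(axis=0)        # attributes that need an indicator
    indicators = missing[:, has_missing].astype(float)
    return np.hstack([X_imputed, indicators])

# Toy data: the first attribute has one missing value, the second has none.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
])
X_aug = impute_with_indicators(X)
print(X_aug)
```

The augmented array keeps the two imputed attributes and gains one indicator column for the first attribute, so a downstream classifier can still see which values were originally missing.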
