Abstract: Missing value imputation (MVI) is a key task in data science, in which learning models are built from incomplete data. In contrast to externally driven MVI algorithms, this study proposes a novel risk-minimisation-based MVI algorithm (RM-MVI) that considers both the internal characteristics of missing data and the external performance for specific classification applications. RM-MVI is designed for labelled data and is applied in two stages: filling with structural risk minimisation (SRM) and refining with empirical risk minimisation (ERM). In the filling stage, an autoencoder with a single hidden layer is trained on the original dataset without missing values. Missing values are first initialised with random numbers, and the imputation values are then preliminarily optimised with a derived updating rule that minimises the structural-risk-oriented objective function. In the refining stage, a neural-network-based classifier is trained to further optimise these imputation values by reducing the empirical risk. Experiments were conducted on several benchmark datasets to validate the feasibility, rationality, and effectiveness of the proposed RM-MVI algorithm. The results show that (1) the optimisation processes of the imputation values under SRM and ERM both converge, so optimised imputation values can be obtained; (2) SRM ensures the distribution consistency of the imputation values that are preliminarily optimised in the filling stage, while ERM further optimises them in the refining stage, which is more helpful for classifier training; and (3) RM-MVI yields considerably better MVI performance on benchmark datasets than 11 well-known MVI algorithms, for example a 26% higher distribution consistency ratio and 2% to 5% higher testing accuracies for 6 classifiers on average. This demonstrates that RM-MVI is a viable approach for addressing MVI problems.
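
The two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the objective functions or the derived updating rule, so several things here are assumptions. The filling stage uses plain gradient descent on the autoencoder's squared reconstruction error with no regularisation term, "the original dataset without missing values" is read as the rows that have no missing entry, and a softmax regression stands in for the neural-network-based classifier. All function names, hyperparameters, and the synthetic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train_autoencoder(X, hidden=4, lr=0.05, epochs=500):
    """Fit a single-hidden-layer autoencoder (tanh hidden, linear output)."""
    n, d = X.shape
    W1, b1 = rng.normal(0, 0.1, (d, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.1, (hidden, d)), np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        R = H @ W2 + b2 - X                  # reconstruction residual
        dH = (R @ W2.T) * (1 - H**2)         # backprop through tanh
        W2 -= lr * H.T @ R / n
        b2 -= lr * R.mean(axis=0)
        W1 -= lr * X.T @ dH / n
        b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2

def fill(X, mask, ae, lr=0.05, steps=200):
    """Stage 1 (assumed form): update only missing entries to cut reconstruction error."""
    W1, b1, W2, b2 = ae
    X, losses = X.copy(), []
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        R = H @ W2 + b2 - X
        losses.append(0.5 * np.sum(R**2))
        grad = ((R @ W2.T) * (1 - H**2)) @ W1.T - R   # d(loss)/dX
        X[mask] -= lr * grad[mask]           # observed entries stay fixed
    return X, losses

def train_classifier(X, y, n_classes, lr=0.2, epochs=200):
    """Softmax regression, a stand-in for the neural-network-based classifier."""
    W, b = np.zeros((X.shape[1], n_classes)), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]
    for _ in range(epochs):
        G = softmax(X @ W + b) - Y           # cross-entropy gradient w.r.t. logits
        W -= lr * X.T @ G / len(X)
        b -= lr * G.mean(axis=0)
    return W, b

def refine(X, y, mask, clf, lr=0.05, steps=100):
    """Stage 2 (assumed form): update only missing entries to cut classification loss."""
    W, b = clf
    X, losses = X.copy(), []
    Y = np.eye(W.shape[1])[y]
    for _ in range(steps):
        P = softmax(X @ W + b)
        losses.append(-np.sum(np.log(P[np.arange(len(y)), y])))
        X[mask] -= lr * ((P - Y) @ W.T)[mask]
    return X, losses

# Synthetic labelled data with roughly 20% of entries missing (illustrative only).
n, d = 200, 5
y = rng.integers(0, 2, n)
X_true = rng.normal(size=(n, d)) + 1.5 * y[:, None]
mask = rng.random((n, d)) < 0.2              # True marks a missing entry
X_init = np.where(mask, rng.normal(size=(n, d)), X_true)   # random initialisation

complete = ~mask.any(axis=1)                 # rows with no missing value
ae = train_autoencoder(X_init[complete])
X_fill, fill_losses = fill(X_init, mask, ae)
clf = train_classifier(X_fill, y, n_classes=2)
X_ref, ref_losses = refine(X_fill, y, mask, clf)
print(fill_losses[0], fill_losses[-1], ref_losses[0], ref_losses[-1])
```

In this sketch both stages leave the observed entries untouched and change only the masked positions, which is what lets the two loss curves be read as convergence traces of the imputation values themselves. Whether the paper alternates classifier training with imputation updates in the refining stage is not stated in the abstract; the sketch trains the classifier once for simplicity.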

MIDIA: exploring denoising autoencoders for missing data imputation

Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback

The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning

Missing Value Imputation on Multidimensional Time Series

Multiview data fusion technique for missing value imputation in multisensory air pollution dataset

mDAE : modified Denoising AutoEncoder for missing data imputation

Siamese autoencoder architecture for the imputation of data missing not at random

Discrete Missing Data Imputation Using Multilayer Perceptron and Momentum Gradient Descent

An Intelligent Missing Data Imputation Techniques: A Review

A novel and efficient risk minimization-based missing value imputation algorithm

DBT-DMAE: An Effective Multivariate Time Series Pre-Train Model under Missing Data

TriD-MAE: A Generic Pre-trained Model for Multivariate Time Series with Missing Values

Machine Learning for Missing Value Imputation

Do we really need imputation in AutoML predictive modeling?

A matrix completion-based multiview learning method for imputing missing values in buoy monitoring data

Conditional expectation with regularization for missing data imputation

Missing Value Estimation for Mixed-Attribute Data Sets

A Missing Value Filling Model Based on Feature Fusion Enhanced Autoencoder

DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model

Missing Features Reconstruction Using a Wasserstein Generative Adversarial Imputation Network

M$^3$-Impute: Mask-guided Representation Learning for Missing Value Imputation