A novel and efficient risk minimization-based missing value imputation algorithm

Yu-Lin He, Jia-Yin Yu, Xu Li, Philippe Fournier-Viger, Joshua Zhexue Huang
DOI: https://doi.org/10.1016/j.knosys.2024.112435
IF: 8.139
2024-09-01
Knowledge-Based Systems
Abstract: Missing value imputation (MVI) is a key task in data science, in which learning models are built from incomplete data. In contrast to externally driven MVI algorithms, this study proposes a novel risk-minimization-based MVI algorithm (RM-MVI) that considers both the internal characteristics of missing data and the external performance for specific classification applications. RM-MVI is designed for labelled data and is applied in two stages: filling with structural risk minimization (SRM) and refining with empirical risk minimization (ERM). In the filling stage, an autoencoder with a single hidden layer is trained on the original dataset without missing values. Missing values are first initialized with random numbers, and the imputation values are then preliminarily optimized using the derived updating rule to minimize the structural-risk-oriented objective function. In the refining stage, a neural-network-based classifier is trained to fine-tune these preliminary imputation values by reducing the empirical risk. Experiments were conducted on several benchmark datasets to validate the feasibility, rationality, and effectiveness of the proposed RM-MVI algorithm. The results show that (1) the optimization processes of the imputation values under both SRM and ERM converge, so that optimized imputation values can be obtained; (2) SRM ensures distribution consistency of the imputation values preliminarily optimized in the filling stage, while ERM fine-tunes them in the refining stage, which is more helpful for classifier training; and (3) RM-MVI yields considerably better MVI performance on benchmark datasets than 11 well-known MVI algorithms, including a 26% higher distribution consistency ratio and 2% to 5% higher testing accuracies for 6 classifiers on average. These results demonstrate that RM-MVI is a viable approach for addressing MVI problems.
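The two-stage procedure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the structural-risk objective or the derived updating rules, so the sketch assumes a squared reconstruction error with L2 weight decay for the autoencoder, plain gradient descent on the missing entries, and a softmax-regression classifier standing in for the neural-network-based classifier. All function names, hyperparameters, and the assumption that features are scaled to [0, 1] are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, hidden=4, lr=0.1, epochs=500, weight_decay=1e-3, seed=0):
    """Train a single-hidden-layer autoencoder on complete rows X of shape (n, d).

    Assumed objective: mean squared reconstruction error plus L2 weight decay.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)           # hidden activations
        R = H @ W2 + b2 - X                # reconstruction residual
        dH = (R @ W2.T) * H * (1 - H)      # backpropagate through the sigmoid
        dW2 = H.T @ R / n + weight_decay * W2
        dW1 = X.T @ dH / n + weight_decay * W1
        W2 -= lr * dW2; b2 -= lr * R.mean(axis=0)
        W1 -= lr * dW1; b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2

def fill_missing(X_obs, mask, params, lr=0.05, steps=200, seed=0):
    """Filling stage (assumed form): start from random values, then run gradient
    descent on the missing entries only, with the autoencoder weights frozen."""
    rng = np.random.default_rng(seed)
    W1, b1, W2, b2 = params
    X = X_obs.copy()
    X[mask] = rng.uniform(0, 1, mask.sum())    # random initialization
    losses = []
    for _ in range(steps):
        H = sigmoid(X @ W1 + b1)
        R = H @ W2 + b2 - X
        losses.append(0.5 * np.sum(R ** 2))
        # Gradient of 0.5 * ||f(x) - x||^2 with respect to the input x.
        grad = ((R @ W2.T) * H * (1 - H)) @ W1.T - R
        X[mask] -= lr * grad[mask]             # observed entries stay fixed
    return X, losses

def refine_missing(X_fill, mask, y, n_classes, lr=0.1, steps=200, seed=0):
    """Refining stage (assumed form): jointly update a softmax classifier and
    the imputed entries by descending the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    X = X_fill.copy()
    n, d = X.shape
    W = rng.normal(0, 0.1, (d, n_classes)); b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                   # one-hot labels
    for _ in range(steps):
        Z = X @ W + b
        Z -= Z.max(axis=1, keepdims=True)      # numerically stable softmax
        P = np.exp(Z); P /= P.sum(axis=1, keepdims=True)
        G = P - Y                              # d(cross-entropy) / d(logits)
        grad_X = G @ W.T                       # gradient with respect to inputs
        W -= lr * (X.T @ G) / n; b -= lr * G.mean(axis=0)
        X[mask] -= lr * grad_X[mask]           # only imputed entries move
    return X
```

In this sketch, `mask` is a boolean array marking missing cells, the autoencoder would be trained on the rows with no missing cells, and `fill_missing` followed by `refine_missing` yields the final imputed matrix. The recorded `losses` offer a simple check on whether the filling-stage optimization is converging.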
computer science, artificial intelligence