Imputation of Missing Values in Training Data Using Variational Autoencoder

Xuerui Hong, Shuang Hao
DOI: https://doi.org/10.1109/icdew58674.2023.00013
2023-01-01
Abstract: Missing values in the training data must be handled before training a machine learning (ML) model. One widely used practice is to delete the records that contain missing values, but this discards information and thus affects the model’s accuracy. Missing values can also be estimated using extensively studied statistical and ML-based imputation algorithms. The emergence of deep generative models opens up new opportunities, especially when a large number of values are missing. However, none of these methods aims to further improve the utility of the imputed training data for the downstream target task. In this paper, we propose VAIM, a variational autoencoder (VAE) based framework for imputing missing values in training data. VAIM contains three modules: a VAE works as a generator that estimates the missing values; a discriminator tries to distinguish the originally observed data from the data imputed by the VAE; and a learner is trained to guarantee the utility of the imputed training data in the downstream classification task. We also introduce a self-attention mechanism into the VAE, which allows the model to focus on more relevant information when imputing the missing values. Experiments on multiple datasets with different missing rates and different missing types show that our method performs better than state-of-the-art approaches.
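The abstract names three modules but does not say how their outputs or objectives are combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a binary mask marks observed entries, that generated values replace only the missing entries, and that the three objectives are summed with weights. The names `merge_imputed`, `combined_loss`, `alpha`, and `beta` are hypothetical and do not come from the paper.

```python
from typing import List


def merge_imputed(x: List[float], mask: List[int], generated: List[float]) -> List[float]:
    """Keep observed entries (mask == 1); use the generator output where mask == 0.

    Assumption: this mask-based merge is a common convention in
    generative imputation, not something the abstract specifies.
    """
    return [xi if m == 1 else gi for xi, m, gi in zip(x, mask, generated)]


def combined_loss(recon: float, adversarial: float, task: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical weighted sum of the three module objectives:
    VAE reconstruction, discriminator feedback, and downstream learner loss.
    """
    return recon + alpha * adversarial + beta * task


# One record with a missing second feature (placeholder values, not real data).
row = [0.5, float("nan"), 1.2]
mask = [1, 0, 1]                  # 1 = observed, 0 = missing
generated = [0.4, 0.9, 1.0]       # stand-in for the VAE generator output
imputed = merge_imputed(row, mask, generated)
```

Here only the second entry of `row` is taken from `generated`; the observed entries pass through unchanged, which is what lets a discriminator compare observed and imputed positions.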