Abstract: Missing value (MV) imputation has become a research hotspot in the field of data quality, since missing values (MVs) are prevalent in real-world datasets and pose challenges to advanced data analytics algorithms. To impute the MVs, most existing approaches directly derive one estimate for each MV, which is categorized as single imputation (SI). However, SI ignores the uncertainty of the MVs and therefore usually yields less satisfactory imputation results than multiple imputation (MI). To capture the uncertainty of the MVs, MI algorithms derive multiple candidate estimates for each MV. Nevertheless, existing MI approaches are few because of the complicated data-handling process. Accordingly, in this paper, by exploring the Variational Auto-Encoder (VAE) model, we propose a new MI approach, namely MIVAE (Multiple Imputation based on Variational Auto-Encoder), to impute MVs in tabular data. In MIVAE, we first add a corrupted input layer (where synthetic MVs are introduced) adjacent to the original input layer so that the model can handle MVs. Then, we obtain multiple rather than single candidate estimates for each data sample from the posterior distribution of the latent variables learned by our model. In this way, multiple imputation is implemented effectively and the uncertainty of the MVs is captured. Next, to obtain satisfactory imputation results, we add a data analysis layer at the end of the network to integrate the multiple candidate estimates. Finally, experimental results over four real-world datasets demonstrate that MIVAE achieves significantly higher imputation accuracy than existing solutions, and that MIVAE is capable of handling both numerical and categorical tabular data.
For example, over the CropMapping dataset, the imputation accuracy of MIVAE improves by up to about 40% and 30% compared with PMM and MIWAE (state-of-the-art MI approaches), respectively. Moreover, we train a MIVAE model on each of three datasets containing MVs. With the trained MIVAE models, the classification performance over the imputed data is similar to that over the complete data.
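The pipeline described in the abstract (corrupt the input, draw several latent samples from the posterior, decode each, then integrate the candidates) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear `encode` and `decode` functions, the random untrained weights in `params`, the zero-filling of missing cells, and the use of a plain mean in place of the data analysis layer are all assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, observed, drop_rate, rng):
    """Introduce synthetic MVs by hiding a random subset of observed cells."""
    keep = rng.random(x.shape) >= drop_rate
    corrupted_mask = observed & keep
    # Hidden and originally missing cells are zero-filled (an assumption).
    return np.where(corrupted_mask, x, 0.0), corrupted_mask

def encode(x, W_mu, W_logvar):
    """Hypothetical linear encoder: posterior mean and log-variance of z."""
    return x @ W_mu, x @ W_logvar

def decode(z, W_dec):
    """Hypothetical linear decoder mapping latent samples back to data space."""
    return z @ W_dec

def multiple_impute(x, observed, params, n_draws, rng):
    """Draw n_draws candidate imputations and pool them with a plain mean."""
    W_mu, W_logvar, W_dec = params
    x_in = np.where(observed, x, 0.0)          # zero-fill missing cells
    mu, logvar = encode(x_in, W_mu, W_logvar)
    std = np.exp(0.5 * logvar)
    draws = []
    for _ in range(n_draws):
        z = mu + std * rng.standard_normal(mu.shape)  # reparameterized sample
        x_hat = decode(z, W_dec)
        # Keep observed values; fill only the missing cells with the candidate.
        draws.append(np.where(observed, x, x_hat))
    draws = np.stack(draws)
    # Stand-in for the data analysis layer: average the candidates.
    return draws, draws.mean(axis=0)

# Toy data: 5 samples, 4 numerical features, roughly 30% missing.
x = rng.normal(size=(5, 4))
observed = rng.random(x.shape) > 0.3
x_missing = np.where(observed, x, np.nan)

# Untrained placeholder weights (latent dimension 2).
params = (
    rng.normal(scale=0.1, size=(4, 2)),
    rng.normal(scale=0.1, size=(4, 2)),
    rng.normal(scale=0.1, size=(2, 4)),
)

x_corrupt, corrupt_mask = corrupt(x_missing, observed, 0.2, rng)
draws, pooled = multiple_impute(x_missing, observed, params, n_draws=10, rng=rng)
```

The spread of `draws` along the first axis at a missing cell gives a rough picture of the imputation uncertainty that a single imputation would discard. In an actual model the weights would be learned by reconstructing the original input from the corrupted one.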

Imputation of Missing Values in Training Data Using Variational Autoencoder

Variational Autoencoding with Conditional Iterative Sampling for Missing Data Imputation

Variational Auto-Encoders Based on the Shift Correction for Imputation of Specific Missing in Multivariate Time Series

Variational Auto-Decoder: A Method for Neural Generative Modeling from Incomplete Data

Continuous Imputation of Missing Values in Time Series Via Wasserstein Generative Adversarial Imputation Networks and Variational Auto-Encoders Model

MIVAE: Multiple Imputation Based on Variational Auto-Encoder

VAEs in the Presence of Missing Data

Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback

Missing Features Reconstruction Using a Wasserstein Generative Adversarial Imputation Network

Missing value imputation in multivariate time series with end-to-end generative adversarial networks

MIDIA: exploring denoising autoencoders for missing data imputation

A Missing Value Filling Model Based on Feature Fusion Enhanced Autoencoder

GP-VAE: Deep Probabilistic Time Series Imputation

Proposition of a Theoretical Model for Missing Data Imputation using Deep Learning and Evolutionary Algorithms

Deep Generative Imputation Model for Missing Not At Random Data

Posterior Consistency for Missing Data in Variational Autoencoders

Missing Values Imputation Based on Iterative Learning

Siamese autoencoder architecture for the imputation of data missing not at random

Imputation of Continuous Missing Values in Profile Data.

Do we really need imputation in AutoML predictive modeling?

Imputation of Missing Values in Time Series Using an Adaptive-Learned Median-Filled Deep Autoencoder