Abstract: We study the problem of imputing missing values in a dataset, which has important applications in many domains. The key to missing value imputation is to capture the data distribution with incomplete samples and impute the missing values accordingly. In this paper, by leveraging the fact that any two batches of data with missing values come from the same data distribution, we propose to impute the missing values of two batches of samples by transforming them into a latent space through deep invertible functions and matching them distributionally. To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed. Our algorithm has fewer hyperparameters to fine-tune and generates high-quality imputations regardless of how missing values are generated. Extensive experiments over a large number of datasets and competing benchmark algorithms show that our method achieves state-of-the-art performance.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the imputation of missing values in datasets, which has important applications in many fields. Specifically, the article focuses on how to capture the data distribution and impute missing values accordingly when dealing with large amounts of data that contain missing values.

### Key Challenges in Missing Value Imputation

1. **Data Distribution Modeling**: Effectively modeling the data distribution in the presence of a large number of missing values is very challenging.
2. **Complex Geometric Structures**: Real-world data usually exhibits complex geometric structures, and simple distance metrics (such as Euclidean distance) may not capture these structures well, resulting in incorrect imputations.

### Limitations of Existing Methods

- **Conditional Distribution Modeling**: Some methods model the conditional data distribution (i.e., the distribution of one feature given the other features), but this requires specifying a different model for each feature, which is cumbersome in practice.
- **Deep Generative Models**: Although using deep generative models to capture the data distribution is effective, it usually requires a large amount of hyperparameter tuning and is sensitive to the missing-value generation mechanism.

### Proposed New Method

This paper proposes a new method, Transformed Distribution Matching (TDM), which improves on existing methods through the following steps:

1. **Latent Space Transformation**: Transform data samples into a latent space through a deep invertible function. In this space, the distances between samples can better reflect their similarities and differences in the original data space.
2. **Optimal Transport Matching**: Use optimal transport (OT) in the latent space for distribution matching, so that the empirical distributions of two batches of data are as close as possible after imputation.
3. **Joint Learning**: Design a simple and effective algorithm that learns the transformation function and imputes missing values simultaneously, reducing the number of hyperparameters that need to be adjusted.

### Advantages of the Method

- **Simplified Model Tuning**: Compared with existing methods, the TDM algorithm has fewer hyperparameters, reducing the difficulty of tuning.
- **High-Quality Imputation**: Regardless of how the missing values are generated, TDM can generate high-quality imputation results.
- **Wide Applicability**: Verified by a large number of experiments, TDM performs well on multiple datasets and under different missing-value mechanisms, achieving state-of-the-art performance.

### Conclusion

By introducing the ideas of latent-space transformation and distribution matching, TDM overcomes the limitations of existing methods on data with complex geometric structures and provides a more general and effective method for missing-value imputation.
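
The three steps described above (invertible transformation, OT matching, joint learning) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the elementwise affine map `affine_flow` is a toy stand-in for the deep invertible function, the entropy-regularised Sinkhorn cost is one common way to approximate an OT distance, and every function name, the regularisation strength `eps`, and the column-mean initial imputation are assumptions made for illustration. The sketch only evaluates the matching loss between two imputed batches; it does not perform the optimisation.

```python
import numpy as np

def affine_flow(x, log_scale, shift):
    """Toy invertible map z = x * exp(log_scale) + shift (stand-in for a deep flow)."""
    return x * np.exp(log_scale) + shift

def affine_flow_inverse(z, log_scale, shift):
    """Exact inverse of affine_flow."""
    return (z - shift) * np.exp(-log_scale)

def _logsumexp(m, axis):
    """Numerically stable log(sum(exp(m))) along one axis."""
    mx = m.max(axis=axis, keepdims=True)
    return np.squeeze(mx + np.log(np.exp(m - mx).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_cost(x, y, eps=1.0, n_iter=200):
    """Entropy-regularised OT cost between two uniform empirical distributions.

    Uses log-domain Sinkhorn iterations with a squared Euclidean ground cost.
    """
    n, m = x.shape[0], y.shape[0]
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    log_a = np.full(n, -np.log(n))  # uniform weights on the first batch
    log_b = np.full(m, -np.log(m))  # uniform weights on the second batch
    f, g = np.zeros(n), np.zeros(m)  # dual potentials
    for _ in range(n_iter):
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    log_plan = (f[:, None] + g[None, :] - cost) / eps + log_a[:, None] + log_b[None, :]
    return float((np.exp(log_plan) * cost).sum())

def fill_missing(batch, values):
    """Replace NaN entries of `batch` with the matching entries of `values`."""
    return np.where(np.isnan(batch), values, batch)

def latent_ot_loss(batch_a, batch_b, imp_a, imp_b, log_scale, shift):
    """OT cost between two imputed batches, measured in the latent space."""
    z_a = affine_flow(fill_missing(batch_a, imp_a), log_scale, shift)
    z_b = affine_flow(fill_missing(batch_b, imp_b), log_scale, shift)
    return sinkhorn_cost(z_a, z_b)

# Synthetic data with roughly 20% of entries removed completely at random.
rng = np.random.default_rng(0)
data = rng.normal(size=(64, 3))
data[rng.random(data.shape) < 0.2] = np.nan
batch_a, batch_b = data[:32], data[32:]

# Assumed initial imputation: per-column mean of the observed entries.
col_means = np.nanmean(data, axis=0)
imp_a = np.broadcast_to(col_means, batch_a.shape)
imp_b = np.broadcast_to(col_means, batch_b.shape)

# Identity transformation as a starting point.
log_scale, shift = np.zeros(3), np.zeros(3)
loss = latent_ot_loss(batch_a, batch_b, imp_a, imp_b, log_scale, shift)
print(f"latent OT loss with mean imputation: {loss:.4f}")
```

In a joint-learning setup of the kind the summary describes, the imputed entries (`imp_a`, `imp_b`) and the transformation parameters would both be updated to reduce this loss, typically with an automatic-differentiation framework; that optimisation loop is deliberately left out here.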

Transformed Distribution Matching for Missing Value Imputation

LSPT-D: Local Similarity Preserved Transport for Direct Industrial Data Imputation

Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

Missing Values Imputation Based on Iterative Learning

Best Fit Missing Value Imputation (BFMVI) Algorithm for Incomplete Data in the Internet of Things.

Online Missing Value Imputation for High-Dimensional Mixed-Type Data via Generalized Factor Models

Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback

Iterative missing value imputation based on feature importance

Missing Value Estimation for Mixed-Attribute Data Sets

Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series

Probabilistic Imputation for Time-series Classification with Missing Data

Conditional expectation with regularization for missing data imputation

Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation

Missing value imputation using unsupervised machine learning techniques

A Benchmark for Data Imputation Methods

Missing Value Imputation on Multidimensional Time Series

Imputing Various Incomplete Attributes Via Distance Likelihood Maximization

Online Missing Value Imputation and Change Point Detection with the Gaussian Copula

Method for Incomplete and Imbalanced Data Based on Multivariate Imputation by Chained Equations and Ensemble Learning

MCFlow: Monte Carlo Flow Models for Data Imputation

ITI-IQA: a Toolbox for Heterogeneous Univariate and Multivariate Missing Data Imputation Quality Assessment