Transformed Distribution Matching for Missing Value Imputation

He Zhao, Ke Sun, Amir Dezfouli, Edwin Bonilla
2023-06-23
Abstract: We study the problem of imputing missing values in a dataset, which has important applications in many domains. The key to missing value imputation is to capture the data distribution with incomplete samples and impute the missing values accordingly. In this paper, by leveraging the fact that any two batches of data with missing values come from the same data distribution, we propose to impute the missing values of two batches of samples by transforming them into a latent space through deep invertible functions and matching them distributionally. To learn the transformations and impute the missing values simultaneously, a simple and well-motivated algorithm is proposed. Our algorithm has fewer hyperparameters to fine-tune and generates high-quality imputations regardless of how missing values are generated. Extensive experiments over a large number of datasets and competing benchmark algorithms show that our method achieves state-of-the-art performance.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the imputation of missing values in datasets, a problem with important applications in many fields. Specifically, it focuses on how to capture the data distribution from incomplete samples and impute the missing values accordingly.

### Key Challenges in Missing Value Imputation

1. **Data Distribution Modeling**: Modeling the data distribution effectively when many values are missing is very challenging.
2. **Complex Geometric Structures**: Real-world data usually exhibits complex geometric structures, and simple distance metrics (such as Euclidean distance) may not capture these structures well, leading to incorrect imputations.

### Limitations of Existing Methods

- **Conditional Distribution Modeling**: Some methods model the conditional data distribution (i.e., the distribution of one feature given the other features), but this requires specifying a separate model for each feature, which is cumbersome in practice.
- **Deep Generative Models**: Capturing the data distribution with deep generative models is effective, but it usually requires extensive hyperparameter tuning and is sensitive to the mechanism that generates the missing values.

### Proposed New Method

This paper proposes a new method, Transformed Distribution Matching (TDM), which improves on existing methods through the following steps:

1. **Latent Space Transformation**: Transform data samples into a latent space through a deep invertible function. In this space, the distances between samples can better reflect their similarities and differences in the original data space.
2. **Optimal Transport Matching**: Use optimal transport (OT) in the latent space for distribution matching, so that the empirical distributions of two batches of data are as close as possible after imputation.
3. **Joint Learning**: Use a simple and effective algorithm to learn the transformation function and impute the missing values simultaneously, reducing the number of hyperparameters that need to be tuned.

### Advantages of the Method

- **Simplified Model Tuning**: Compared with existing methods, the TDM algorithm has fewer hyperparameters, which makes tuning easier.
- **High-Quality Imputation**: TDM generates high-quality imputations regardless of how the missing values are generated.
- **Wide Applicability**: In extensive experiments, TDM performs well on many datasets and under different missing-value mechanisms, achieving state-of-the-art performance.

### Conclusion

By combining latent-space transformation with distribution matching, TDM overcomes the limitations of existing methods on data with complex geometric structures and provides a more general and effective method for missing-value imputation.
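
The idea of matching two batches in a latent space can be illustrated with a small, self-contained sketch. It is not the authors' implementation and rests on several simplifying assumptions: a single fixed affine coupling layer stands in for the learned deep invertible function (so the joint learning of the transformation is omitted), the OT cost is computed exactly by brute force over permutations (feasible only for tiny, equal-size batches), and the missing entries are updated by finite-difference gradient descent with a simple backtracking rule. The function names (`forward`, `inverse`, `ot_cost`, `latent_loss`, `impute`) and the toy data are hypothetical.

```python
import itertools
import math


def forward(x, a=0.5, b=0.3):
    """Affine coupling layer on a 2-D point (assumed stand-in for a deep invertible map)."""
    x1, x2 = x
    return (x1, x2 * math.exp(math.tanh(a * x1)) + b * x1)


def inverse(y, a=0.5, b=0.3):
    """Exact inverse of `forward`."""
    y1, y2 = y
    return (y1, (y2 - b * y1) * math.exp(-math.tanh(a * y1)))


def ot_cost(batch_a, batch_b):
    """Exact OT cost (squared Euclidean ground cost) between two equal-size
    batches with uniform weights, found by brute force over all pairings."""
    n = len(batch_a)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        total = sum(
            sum((p - q) ** 2 for p, q in zip(batch_a[i], batch_b[j]))
            for i, j in enumerate(perm)
        )
        best = min(best, total / n)
    return best


def latent_loss(batch_a, batch_b):
    """OT cost between the two batches after mapping them to the latent space."""
    return ot_cost([forward(x) for x in batch_a], [forward(x) for x in batch_b])


def impute(batch_a, mask_a, batch_b, mask_b, steps=50, lr=0.5, eps=1e-4):
    """Update only the missing entries (mask value True) to reduce the latent OT cost."""
    a = [list(row) for row in batch_a]
    b = [list(row) for row in batch_b]
    free = [(a, i, j) for i, row in enumerate(mask_a) for j, m in enumerate(row) if m]
    free += [(b, i, j) for i, row in enumerate(mask_b) for j, m in enumerate(row) if m]
    loss = latent_loss(a, b)
    for _ in range(steps):
        # Forward-difference estimate of the gradient w.r.t. each missing entry.
        grads = []
        for batch, i, j in free:
            old = batch[i][j]
            batch[i][j] = old + eps
            grads.append((latent_loss(a, b) - loss) / eps)
            batch[i][j] = old
        old_vals = [batch[i][j] for batch, i, j in free]
        for (batch, i, j), g in zip(free, grads):
            batch[i][j] -= lr * g
        new_loss = latent_loss(a, b)
        if new_loss < loss:
            loss = new_loss  # accept the step
        else:
            # Reject the step: restore the previous values and shrink the step size.
            for (batch, i, j), v in zip(free, old_vals):
                batch[i][j] = v
            lr *= 0.5
    return a, b, loss


# Toy data: masked entries hold a placeholder initial guess of 0.0.
batch_a = [[0.0, 0.1], [1.0, 1.2], [2.0, 0.0]]
mask_a = [[False, False], [False, False], [False, True]]
batch_b = [[0.1, 0.0], [1.1, 0.0], [1.9, 2.1]]
mask_b = [[False, False], [False, True], [False, False]]

before = latent_loss(batch_a, batch_b)
a_imp, b_imp, after = impute(batch_a, mask_a, batch_b, mask_b)
print(f"latent OT cost: {before:.4f} -> {after:.4f}")
```

Because a step is accepted only when it lowers the loss, the latent OT cost never increases, and observed entries are never modified. In a fuller implementation one would presumably replace the brute-force OT with a scalable solver, use automatic differentiation instead of finite differences, and train the invertible transformation together with the imputations, as the summary describes.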