Abstract: Missing value imputation in machine learning is the task of estimating the missing values in the dataset accurately using available information. In this task, several deep generative modeling methods have been proposed and demonstrated their usefulness, e.g., generative adversarial imputation networks. Recently, diffusion models have gained popularity because of their effectiveness in the generative modeling task in images, texts, audio, etc. To our knowledge, less attention has been paid to the investigation of the effectiveness of diffusion models for missing value imputation in tabular data. Based on recent development of diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (TabCSDI). To effectively handle categorical variables and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization. Experimental results on benchmark datasets demonstrated the effectiveness of TabCSDI compared with well-known existing methods, and also emphasized the importance of the categorical embedding techniques.
What problem does this paper attempt to address?
The problem this paper addresses is **the imputation of missing values in tabular data**. Specifically, the authors develop a diffusion-model-based method, TabCSDI, for filling in missing values in tabular data, which usually contains both numerical and categorical variables. Existing diffusion models perform well on time-series imputation, but they have rarely been applied to tabular data.
### Problem Background
In real-world applications, datasets used to train prediction models often contain missing values. These can arise for various reasons, such as human error, privacy issues, or difficulties in data collection. Missing values degrade model performance, so effective imputation methods are needed to estimate them.
### Existing Methods
Currently, missing-value imputation methods fall into two main categories:
1. **Iterative methods**: Estimate the conditional distribution of each feature through multiple iterations until convergence.
2. **Deep generative model methods**: Use generative models to generate values for the missing parts based on the observed data.
Although existing methods such as MICE (Multiple Imputation by Chained Equations) and GAIN (Generative Adversarial Imputation Network) perform well in some cases, they have limitations on tabular data, especially data that mixes numerical and categorical variables.
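To make the iterative category concrete, the following is a minimal sketch of chained regressions on numerical data. It is a simplified, deterministic illustration and not MICE itself, which draws from a predictive distribution and produces multiple imputations. The names `solve` and `iterative_impute`, the linear model, the fixed number of rounds, and the small ridge term are all assumptions made for this example.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def iterative_impute(rows, n_rounds=10, ridge=1e-6):
    """Fill None entries by repeatedly regressing each column on the others."""
    n_cols = len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    filled = [row[:] for row in rows]

    # Initial guess: the mean of the observed entries in each column.
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = mean

    for _ in range(n_rounds):
        for j in range(n_cols):
            # Design matrix: intercept plus every column except j.
            feats = [[1.0] + [v for k, v in enumerate(row) if k != j]
                     for row in filled]
            train = [i for i in range(len(rows)) if not missing[i][j]]
            p = n_cols  # intercept + (n_cols - 1) predictors
            # Normal equations with a tiny ridge term for numerical stability.
            xtx = [[sum(feats[i][a] * feats[i][b] for i in train)
                    + (ridge if a == b else 0.0)
                    for b in range(p)] for a in range(p)]
            xty = [sum(feats[i][a] * filled[i][j] for i in train)
                   for a in range(p)]
            w = solve(xtx, xty)
            # Overwrite only the entries that were originally missing.
            for i in range(len(rows)):
                if missing[i][j]:
                    filled[i][j] = sum(wa * fa for wa, fa in zip(w, feats[i]))
    return filled


# Toy data where the second column equals 2 * first + 1.
data = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, None], [5.0, 11.0]]
completed = iterative_impute(data)
```

On this toy data the missing entry is recovered almost exactly because the relationship is linear. A linear model of this kind does not apply directly to categorical columns, which is the limitation that motivates the encoding techniques discussed in this summary.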
### Contributions of the Paper
To fill this gap, the authors propose **TabCSDI (Conditional Score-based Diffusion Models for Tabular data)**, a new method based on diffusion models. Its main innovations are:
- **Support for numerical and categorical variables**: Categorical variables are handled with one of three encoding techniques (one-hot encoding, analog bits encoding, and feature tokenization).
- **Adapted model architecture**: Starting from CSDI (Conditional Score-based Diffusion Model), a method for time-series imputation, the authors remove the temporal transformer layer and use simple residual connections and multi-layer perceptrons to suit the characteristics of tabular data.
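The three encodings named above can be illustrated with a short sketch. It shows the general idea of each technique and is not the paper's implementation: the function names, the {-1, +1} scaling of the analog bits, and the randomly initialized (rather than learned) tokenizer weights are assumptions made for this example.

```python
import random


def one_hot(index, n_categories):
    """One-hot: a vector with a single 1 at the category index."""
    return [1.0 if k == index else 0.0 for k in range(n_categories)]


def analog_bits(index, n_categories):
    """Analog bits: binary code of the index, mapped from {0, 1} to {-1, +1}."""
    n_bits = max(1, (n_categories - 1).bit_length())
    bits = [(index >> shift) & 1 for shift in reversed(range(n_bits))]
    return [2.0 * b - 1.0 for b in bits]


def decode_analog_bits(values):
    """Threshold real-valued bits at zero and read them back as an integer."""
    index = 0
    for v in values:
        index = (index << 1) | (1 if v > 0 else 0)
    return index


def make_tokenizer(n_numeric, cat_sizes, dim, seed=0):
    """Feature tokenization: map every feature to a vector of length `dim`.

    Numerical feature j becomes x * w_j + b_j; categorical feature j becomes
    a row of a lookup table. Weights are random here; in a real model they
    would be learned together with the rest of the network.
    """
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    num_w = [vec() for _ in range(n_numeric)]
    num_b = [vec() for _ in range(n_numeric)]
    cat_tables = [[vec() for _ in range(size)] for size in cat_sizes]

    def tokenize(numeric, categorical):
        tokens = [[x * w + b for w, b in zip(num_w[j], num_b[j])]
                  for j, x in enumerate(numeric)]
        tokens += [cat_tables[j][c][:] for j, c in enumerate(categorical)]
        return tokens

    return tokenize


hot = one_hot(2, 4)                              # 4 values for 4 categories
bits = analog_bits(5, 8)                         # 3 values for 8 categories
recovered = decode_analog_bits([0.7, -0.2, 0.9]) # noisy bits, thresholded
tokenize = make_tokenizer(n_numeric=2, cat_sizes=[3, 4], dim=4)
tokens = tokenize([0.5, -1.2], [2, 0])           # one vector per feature
```

The trade-off is visible in the output sizes: one-hot needs one dimension per category, analog bits need roughly the base-2 logarithm of the category count, and feature tokenization gives every feature, numerical or categorical, a vector of the same length.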
### Experimental Results
The authors ran experiments on multiple benchmark datasets. The results show that TabCSDI performs well at imputing numerical variables. On mixed-type datasets (such as Diabetes and Census), the FT (feature tokenization) encoding scheme is particularly effective.
### Summary
This paper proposes TabCSDI, a diffusion-model method for imputing missing values in tabular data. By adding support for categorical variables and adapting the model architecture, TabCSDI performs comparably to or better than existing methods on multiple datasets, especially those that mix numerical and categorical variables.
### Formula Summary
The formulas in the paper are mainly used to evaluate imputation quality: the root-mean-square error (RMSE) for numerical features and the error rate (Err) for categorical features:
\[
\text{RMSE}(j)=\sqrt{\frac{\sum_{i\in M_j}(\hat{x}_{ij}-y_{ij})^2}{N_j^{\text{miss}}}}
\]
\[
\text{Err}(j)=\frac{1}{N_j^{\text{miss}}}\sum_{i\in M_j}1[\hat{x}_{ij}\neq y_{ij}]
\]
where \(M_j\) is the set of missing-value indices of feature \(j\), \(N_j^{\text{miss}}\) is the number of missing values of feature \(j\), \(\hat{x}_{ij}\) is the imputed value, \(y_{ij}\) is the true value, and \(1[\cdot]\) is the indicator function.
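As a small worked example, both metrics can be computed for a single feature \(j\) as follows. This is a sketch with hypothetical function names and toy values; `missing_idx` plays the role of \(M_j\), and its length is \(N_j^{\text{miss}}\).

```python
import math


def rmse(imputed, truth, missing_idx):
    """Root-mean-square error over the missing entries of one numerical feature."""
    squared = sum((imputed[i] - truth[i]) ** 2 for i in missing_idx)
    return math.sqrt(squared / len(missing_idx))


def error_rate(imputed, truth, missing_idx):
    """Fraction of missing entries of one categorical feature imputed wrongly."""
    wrong = sum(1 for i in missing_idx if imputed[i] != truth[i])
    return wrong / len(missing_idx)


# Numerical feature: rows 1 and 3 were missing and have been imputed.
num_truth = [1.0, 2.0, 3.0, 4.0]
num_imputed = [1.0, 2.5, 3.0, 3.5]
num_score = rmse(num_imputed, num_truth, [1, 3])       # sqrt((0.25 + 0.25) / 2)

# Categorical feature: rows 1 and 3 were missing; row 1 was imputed wrongly.
cat_truth = ["a", "b", "c", "a"]
cat_imputed = ["a", "c", "c", "a"]
cat_score = error_rate(cat_imputed, cat_truth, [1, 3])  # 1 wrong out of 2
```

Both functions look only at the indices in `missing_idx`, so observed entries never affect the score.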