Abstract: Missing value imputation in machine learning is the task of estimating the missing values in the dataset accurately using available information. In this task, several deep generative modeling methods have been proposed and demonstrated their usefulness, e.g., generative adversarial imputation networks. Recently, diffusion models have gained popularity because of their effectiveness in the generative modeling task in images, texts, audio, etc. To our knowledge, less attention has been paid to the investigation of the effectiveness of diffusion models for missing value imputation in tabular data. Based on recent development of diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (TabCSDI). To effectively handle categorical variables and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization. Experimental results on benchmark datasets demonstrated the effectiveness of TabCSDI compared with well-known existing methods, and also emphasized the importance of the categorical embedding techniques.

What problem does this paper attempt to address?

The problem this paper attempts to solve is **the imputation of missing values in tabular data**. Specifically, the authors develop a diffusion-model-based method called "TabCSDI" for handling missing values in tabular data. Tabular data usually contains both numerical and categorical variables. While existing diffusion models perform well on time-series data, they have been applied less often to tabular data.

### Problem Background

In real-world applications, the datasets used to train prediction models often contain missing values. These may arise for various reasons, such as human error, privacy issues, or data collection difficulties. Missing values degrade model performance, so effective imputation methods are needed to estimate them.

### Existing Methods

Missing-value imputation methods fall mainly into two categories:

1. **Iterative methods**: Estimate the conditional distribution of each feature through multiple iterations until convergence.
2. **Deep generative model methods**: Use generative models to generate values for the missing parts based on the observed data.

Although existing methods such as MICE (Multiple Imputation by Chained Equations) and GAIN (Generative Adversarial Imputation Network) perform well in some cases, they have limitations on tabular data, especially data that contains both numerical and categorical variables.

### Contributions of the Paper

To fill this gap, the authors propose **TabCSDI (Conditional Score-based Diffusion Models for Tabular data)**, a new method based on the diffusion model. The main innovations of this method include:

- **Support for numerical and categorical variables**: Categorical variables are handled through three encoding techniques (one-hot encoding, analog bits encoding, and feature tokenization).
- **Improved model architecture**: Starting from CSDI (Conditional Score-based Diffusion Model), a time-series imputation method, the authors remove the time-transformation layer and use simple residual connections and multi-layer perceptrons to suit the characteristics of tabular data.

### Experimental Results

The authors conducted experiments on multiple benchmark datasets. The results show that TabCSDI performs well at imputing numerical variables. On mixed-type datasets (such as Diabetes and Census), the FT (Feature Tokenization) encoding scheme is particularly effective.

### Summary

This paper proposes a new diffusion-model method, TabCSDI, for imputing missing values in tabular data. By adding support for categorical variables and adapting the model architecture, TabCSDI achieves performance comparable to or better than existing methods on multiple datasets, especially on data that mixes numerical and categorical variables.

### Formula Summary

The formulas in the paper are mainly used to evaluate imputation quality: the root-mean-square error (RMSE) for numerical features and the error rate (Err) for categorical features:

\[ \text{RMSE}(j)=\sqrt{\frac{\sum_{i\in M_j}(\hat{x}_{ij}-y_{ij})^2}{N_j^{\text{miss}}}} \]

\[ \text{Err}(j)=\frac{1}{N_j^{\text{miss}}}\sum_{i\in M_j}1[\hat{x}_{ij}\neq y_{ij}] \]

where \(M_j\) is the set of missing-value indices of feature \(j\), \(N_j^{\text{miss}}\) is the number of missing values of feature \(j\), \(\hat{x}_{ij}\) is the imputed value, \(y_{ij}\) is the true value, and \(1[\cdot]\) is the indicator function.
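To make the two evaluation metrics concrete, here is a minimal Python sketch that computes RMSE and Err for a single feature. It is an illustration only, not the authors' evaluation code: the function names `rmse` and `error_rate`, the list-based data layout, and the toy values are all assumptions made for this example. `missing_rows` plays the role of \(M_j\), and its length the role of \(N_j^{\text{miss}}\).

```python
import math


def rmse(imputed, truth, missing_rows):
    """RMSE over the originally missing rows of one numerical feature.

    Hypothetical helper: `missing_rows` corresponds to M_j in the formula.
    """
    if not missing_rows:
        raise ValueError("feature has no missing entries to evaluate")
    squared_error = sum((imputed[i] - truth[i]) ** 2 for i in missing_rows)
    return math.sqrt(squared_error / len(missing_rows))


def error_rate(imputed, truth, missing_rows):
    """Fraction of originally missing rows of one categorical feature
    whose imputed category differs from the true category."""
    if not missing_rows:
        raise ValueError("feature has no missing entries to evaluate")
    wrong = sum(1 for i in missing_rows if imputed[i] != truth[i])
    return wrong / len(missing_rows)


# Toy data (made up): rows 1 and 3 were masked out and then imputed.
missing = [1, 3]

truth_num = [1.0, 2.0, 3.0, 4.0]
imputed_num = [1.0, 2.5, 3.0, 3.5]
rmse_value = rmse(imputed_num, truth_num, missing)

truth_cat = ["a", "b", "a", "c"]
imputed_cat = ["a", "b", "a", "a"]
err_value = error_rate(imputed_cat, truth_cat, missing)

print(rmse_value, err_value)
```

Both metrics are computed only over the entries that were missing, so values at observed positions do not affect the score. Lower is better in both cases.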

Diffusion models for missing value imputation in tabular data

Diffusion Models for Tabular Data Imputation and Synthetic Data Generation

MissDiff: Training Diffusion Models on Tabular Data with Missing Values

CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation

Denoising Diffusion Straightforward Models for Energy Conversion Monitoring Data Imputation

Self-Supervision Improves Diffusion Models for Tabular Data Imputation

TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation

Diffusion Models for Multivariate Time Series Generation with Missing Values

An Observed Value Consistent Diffusion Model for Imputing Missing Values in Multivariate Time Series

DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model

Continuous Diffusion for Mixed-Type Tabular Data

Temporal Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation

Latent Space Score-based Diffusion Model for Probabilistic Multivariate Time Series Imputation

Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

Rethinking the Diffusion Models for Numerical Tabular Data Imputation from the Perspective of Wasserstein Gradient Flow

Score-CDM: Score-Weighted Convolutional Diffusion Model for Multivariate Time Series Imputation

MTSCI: A Conditional Diffusion Model for Multivariate Time Series Consistent Imputation

TabDDPM: Modelling Tabular Data with Diffusion Models

ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection

Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space