DiffImpute: Tabular Data Imputation With Denoising Diffusion Probabilistic Model

Yizhu Wen, Kai Yi, Jing Ke, Yiqing Shen
2024-03-20
Abstract: Tabular data plays a crucial role in various domains but often suffers from missing values, thereby curtailing its potential utility. Traditional imputation techniques frequently yield suboptimal results and impose substantial computational burdens, leading to inaccuracies in subsequent modeling tasks. To address these challenges, we propose DiffImpute, a novel Denoising Diffusion Probabilistic Model (DDPM). Specifically, DiffImpute is trained on complete tabular datasets, ensuring that it can produce credible imputations for missing entries without undermining the authenticity of the existing data. Innovatively, it can be applied to various settings of Missing Completely At Random (MCAR) and Missing At Random (MAR). To effectively handle the tabular features in DDPM, we tailor four tabular denoising networks, spanning MLP, ResNet, Transformer, and U-Net. We also propose Harmonization to enhance coherence between observed and imputed data by infusing the data back and denoising them multiple times during the sampling stage. To enable efficient inference while maintaining imputation performance, we propose a refined non-Markovian sampling process that works along with Harmonization. Empirical evaluations on seven diverse datasets underscore the prowess of DiffImpute. Specifically, when paired with the Transformer as the denoising network, it consistently outperforms its competitors, boasting an average ranking of 1.7 and the most minimal standard deviation. In contrast, the next best method lags with a ranking of 2.8 and a standard deviation of 0.9. The code is available at
Machine Learning, Artificial Intelligence, Databases
What problem does this paper attempt to address?
The paper addresses the imputation of missing values in tabular data. Tabular data plays a crucial role in many fields, but missing values often limit its usefulness. According to the authors, traditional imputation methods (such as mean or median imputation) frequently produce suboptimal results and impose a substantial computational burden, which leads to inaccuracies in downstream modeling tasks.

To address this, the authors propose **DiffImpute**, an imputation method based on the Denoising Diffusion Probabilistic Model (DDPM). DiffImpute is trained on complete tabular datasets so that it can generate credible values for missing entries without compromising the authenticity of the existing data. It can be applied under both Missing Completely At Random (MCAR) and Missing At Random (MAR) settings.

To handle tabular features effectively, the authors design four denoising network architectures: MLP, ResNet, Transformer, and U-Net. They also introduce the **Harmonization** technique, which improves the consistency between observed and imputed data by infusing the observed data back and denoising it multiple times during sampling. To speed up inference, they propose a refined non-Markovian sampling process that is used together with Harmonization.

### Main contributions:

1. **DiffImpute**: A denoising diffusion model for tabular data imputation that can be trained under MCAR and MAR missing mechanisms, providing a more stable and simplified training and inference process.
2. **Time Step Tokenizer**: Embeds time-step information into the denoising network and is adapted to four tabular denoising network architectures (MLP, ResNet, Transformer, and U-Net).
3. **Harmonization**: Enhances the consistency between the imputed data and the observed data during the sampling stage.
4. **Impute-DDIM**: Accelerates the sampling process while maintaining imputation quality.

Experiments on seven datasets show that DiffImpute outperforms competing methods, especially with the Transformer as the denoising network: in that configuration it achieves an average ranking of 1.7, compared with 2.8 for the next best method.
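To make the idea of "infusing the data back and denoising multiple times" more concrete, the following is a minimal NumPy sketch of masked DDPM sampling with a repeated re-noise/denoise loop. It is an illustration under stated assumptions, not the authors' implementation: the function names (`make_schedule`, `mcar_mask`, `impute`), the linear noise schedule, the `n_harmonize` parameter, and the RePaint-style resampling step are all hypothetical choices, and `toy_eps` is an analytic stand-in for a trained denoising network that is only valid for independent standard-normal features.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice): betas, alphas, cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def mcar_mask(shape, missing_rate, rng):
    """MCAR mask: every cell is dropped independently. True means observed."""
    return rng.random(shape) >= missing_rate

def impute(x_obs, observed, eps_model, schedule, n_harmonize=3, seed=0):
    """Masked reverse diffusion with a hypothetical harmonization loop.

    eps_model(x_t, t) is assumed to predict the noise contained in x_t.
    """
    betas, alphas, alpha_bars = schedule
    rng = np.random.default_rng(seed)
    x_obs = np.where(observed, x_obs, 0.0)   # neutralize NaN placeholders
    x = rng.standard_normal(x_obs.shape)     # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        repeats = n_harmonize if t > 0 else 1  # no resampling at the final step
        for j in range(repeats):
            # Observed cells: forward-noise the known values to level t-1.
            noise = rng.standard_normal(x.shape) if t > 0 else 0.0
            known = np.sqrt(abar_prev) * x_obs + np.sqrt(1.0 - abar_prev) * noise
            # Missing cells: one standard DDPM reverse step from level t.
            z = rng.standard_normal(x.shape) if t > 0 else 0.0
            mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_model(x, t)) / np.sqrt(alphas[t])
            unknown = mean + np.sqrt(betas[t]) * z
            x_prev = np.where(observed, known, unknown)
            if j < repeats - 1:
                # Infuse back: re-noise the merged sample to level t and denoise again.
                x = np.sqrt(alphas[t]) * x_prev + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        x = x_prev
    return x

# Toy usage on synthetic standard-normal data.
rng = np.random.default_rng(42)
x_true = rng.standard_normal((8, 4))
observed = mcar_mask(x_true.shape, missing_rate=0.3, rng=rng)
x_missing = np.where(observed, x_true, np.nan)

schedule = make_schedule(100)
alpha_bars = schedule[2]
# Analytic noise predictor for N(0, 1) features; a real model would be a trained network.
toy_eps = lambda x_t, t: np.sqrt(1.0 - alpha_bars[t]) * x_t

x_imputed = impute(x_missing, observed, toy_eps, schedule)
```

In this sketch the observed cells are overwritten at every step with a suitably noised copy of the known values, so the final output reproduces them exactly, while the missing cells are filled by the reverse process. A MAR mask would instead make the drop probability of a cell depend on other observed columns, and an accelerated sampler in the spirit of Impute-DDIM would visit only a subsequence of the time steps.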