CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis

Chaejeong Lee, Jayoung Kim, Noseong Park
2023-09-21
Abstract: With growing attention to tabular data these days, the attempt to apply a synthetic table to various tasks has been expanded toward various scenarios. Owing to the recent advances in generative modeling, fake data generated by tabular data synthesis models become sophisticated and realistic. However, there still exists a difficulty in modeling discrete variables (columns) of tabular data. In this work, we propose to process continuous and discrete variables separately (but being conditioned on each other) by two diffusion models. The two diffusion models are co-evolved during training by reading conditions from each other. In order to further bind the diffusion models, moreover, we introduce a contrastive learning method with a negative sampling method. In our experiments with 11 real-world tabular datasets and 8 baseline methods, we prove the efficacy of the proposed method, called CoDi.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the synthesis of mixed-type tabular data, that is, tables containing both continuous and discrete variables (columns). Existing tabular data synthesis methods struggle with discrete variables, particularly in preserving the correlation between continuous and discrete variables. The authors therefore propose CoDi (Co-evolving Contrastive Diffusion Models), which handles continuous and discrete variables with two separate diffusion models and improves the quality of the generated data through co-evolving conditional diffusion and contrastive learning.

### Main problem description in the paper

1. **Limitations of existing methods**:
   - Tabular data usually contains mixed-type data (continuous and discrete variables).
   - Existing methods usually map discrete variables into a continuous space for processing, which may lead to sub-optimal results, especially in maintaining the correlation between variables.
   - Existing generative models have difficulty ensuring quality, diversity, and short generation time all at once when dealing with mixed-type data.
2. **Proposed solutions**:
   - **Separate processing of continuous and discrete variables**: Two diffusion models are used, one for the continuous variables and one for the discrete variables, so that each can better learn its own distribution.
   - **Co-evolving conditional diffusion model**: The two diffusion models read conditions from each other during training and thus co-evolve, so that the generated data keeps a reasonable correlation between continuous and discrete variables.
   - **Contrastive learning**: A contrastive learning method with negative sampling further binds the two diffusion models and pushes the generated data closer to the real data.
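As a rough illustration of the co-evolving conditioning described above, the following Python sketch shows how each denoiser's input could be assembled so that it reads the other model's noisy sample as its condition, and how a mismatched (negative) condition could be drawn for contrastive learning. The function names, the scalar timestep feature, and the cyclic-shift negative sampling are assumptions made for illustration; this summary does not specify how CoDi encodes the timestep or draws its negatives.

```python
import random


def one_hot(index, num_categories):
    """One-hot encode a category index as a list of floats."""
    return [1.0 if k == index else 0.0 for k in range(num_categories)]


def build_conditioned_inputs(x_cont_t, x_disc_t, t, num_steps):
    """Per row: own noisy sample, timestep feature, then the other model's noisy sample."""
    t_feat = [t / num_steps]  # scalar timestep feature (assumed encoding)
    cont_inputs = [c + t_feat + d for c, d in zip(x_cont_t, x_disc_t)]
    disc_inputs = [d + t_feat + c for c, d in zip(x_cont_t, x_disc_t)]
    return cont_inputs, disc_inputs


def mismatched_conditions(conds, rng):
    """Negative conditions: row i receives the condition of another row (needs >= 2 rows)."""
    n = len(conds)
    shift = rng.randrange(1, n)  # offset in [1, n-1], so no row keeps its own index
    return [conds[(i - shift) % n] for i in range(n)]


rng = random.Random(0)
# Toy batch standing in for samples already noised to step t:
# 4 rows, 3 continuous columns, one discrete column with 5 categories.
x_cont_t = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]
x_disc_t = [one_hot(rng.randrange(5), 5) for _ in range(4)]

cont_inputs, disc_inputs = build_conditioned_inputs(x_cont_t, x_disc_t, t=10, num_steps=50)
neg_disc = mismatched_conditions(x_disc_t, rng)  # mismatched conditions for the continuous model
```

Here `cont_inputs` would feed the continuous denoiser and `disc_inputs` the discrete one; `neg_disc` pairs every continuous row with the discrete part of a different row, which is one simple way to obtain a condition that should not match.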
### Formula summary

- **Diffusion process** (both expressions are the closed-form marginals given the clean sample \(x_0\)):
  - Forward diffusion process in continuous space:
    \[ q(x_t|x_0)=\mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0,(1 - \bar{\alpha}_t)I) \]
  - Forward diffusion process in discrete space, where \(C\) is a categorical distribution over \(K\) categories:
    \[ q(x_t|x_0) = C(x_t; \bar{\alpha}_t x_0+(1 - \bar{\alpha}_t)/K) \]
- **Loss function** (\(x_D^t\) and \(x_C^t\) denote the discrete and continuous samples at step \(t\) that serve as the condition for the other model):
  - Loss function of the continuous diffusion model:
    \[ L_{\text{DiffC}}(\theta_C):=\mathbb{E}_{t,x_0,\epsilon}\left[\left\|\epsilon - \epsilon_{\theta_C}(x_t,t|x_D^t)\right\|^2\right] \]
  - Loss function of the discrete diffusion model:
    \[ L_{\text{DiffD}}(\theta_D)=\mathbb{E}_q\left[D_{\text{KL}}[q(x_T|x_0)\|p(x_T)]-\log p_{\theta_D}(x_0|x_1,x_C^1)+\sum_{t = 2}^T D_{\text{KL}}[q(x_{t - 1}|x_t,x_0)\|p_{\theta_D}(x_{t - 1}|x_t,x_C^t)]\right] \]
- **Contrastive learning loss function**, a triplet loss over anchors \(A\), positives \(P\), and negatives \(N\) with distance \(d\) and margin \(m\):
  \[ L_{\text{CL}}(A,P,N)=\sum_{i = 0}^S\max\{d(A_i,P_i)-d(A_i,N_i)+m,0\} \]

With these components, CoDi handles mixed-type tabular data, and the authors report superior performance in experiments with 11 real-world tabular datasets and 8 baseline methods.
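To make the formulas above more concrete, here is a minimal Python sketch of the two forward-process expressions, read as sampling \(x_t\) directly from the clean sample \(x_0\), together with the triplet-style contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the Euclidean distance for \(d\), the default margin, and all function names are choices made here and are not given in the summary.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_bar_1, ..., alpha_bar_T] for a linear beta schedule (assumed)."""
    abar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        abar.append(prod)
    return abar


def q_sample_continuous(x0, abar_t, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) I); also return the noise epsilon."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]
    return x_t, eps


def q_probs_discrete(x0_onehot, abar_t):
    """Category probabilities abar_t * x0 + (1 - abar_t) / K for a one-hot x0."""
    num_categories = len(x0_onehot)
    return [abar_t * x + (1.0 - abar_t) / num_categories for x in x0_onehot]


def q_sample_discrete(x0_onehot, abar_t, rng):
    """Draw a one-hot x_t from the categorical distribution above."""
    probs = q_probs_discrete(x0_onehot, abar_t)
    k = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if j == k else 0.0 for j in range(len(probs))]


def triplet_loss(anchors, positives, negatives, margin=1.0):
    """Sum over samples of max(d(A_i, P_i) - d(A_i, N_i) + m, 0), Euclidean d (assumed)."""
    return sum(
        max(math.dist(a, p) - math.dist(a, n) + margin, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    )


rng = random.Random(0)
abar = alpha_bar_schedule(50)
x_t, eps = q_sample_continuous([0.5, -1.2], abar[-1], rng)
probs = q_probs_discrete([0.0, 1.0, 0.0], abar[-1])
x_t_disc = q_sample_discrete([0.0, 1.0, 0.0], abar[-1], rng)
# d(A, P) = 1 and d(A, N) = 5, so the hinge is max(1 - 5 + 1, 0) = 0.
loss = triplet_loss([[0.0, 0.0]], [[0.0, 1.0]], [[3.0, 4.0]], margin=1.0)
```

As \(\bar{\alpha}_t\) shrinks toward zero, the continuous sample approaches pure Gaussian noise and the discrete probabilities approach the uniform distribution over the \(K\) categories. The loss is zero whenever the negative is at least a margin farther from the anchor than the positive is.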