Abstract: With growing attention to tabular data, attempts to apply synthetic tables to various tasks have expanded to a wide range of scenarios. Owing to recent advances in generative modeling, fake data generated by tabular data synthesis models has become sophisticated and realistic. However, modeling the discrete variables (columns) of tabular data remains difficult. In this work, we propose to process continuous and discrete variables separately (while conditioning each on the other) with two diffusion models. The two diffusion models are co-evolved during training by reading conditions from each other. To further bind the diffusion models, we introduce a contrastive learning method with negative sampling. In our experiments with 11 real-world tabular datasets and 8 baseline methods, we demonstrate the efficacy of the proposed method, called CoDi.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the synthesis of mixed-type tabular data (data containing both continuous and discrete variables). Specifically, existing tabular data synthesis methods struggle with discrete variables, especially in preserving the correlation between continuous and discrete variables. The authors therefore propose CoDi (Co-evolving Contrastive Diffusion Models), which handles continuous and discrete variables with two separate diffusion models and improves the quality of the generated data through co-evolving conditional diffusion and contrastive learning.

### Main problem description in the paper

1. **Limitations of existing methods**:
   - Tabular data usually contains mixed-type data (continuous and discrete variables).
   - Existing methods usually map discrete variables into a continuous space for processing, which may lead to sub-optimal results, especially in maintaining the correlation between variables.
   - Existing generative models have difficulty balancing the quality, diversity, and generation time of the generated data when dealing with mixed-type data.
2. **Proposed solutions**:
   - **Separate processing of continuous and discrete variables**: Use two diffusion models, one for continuous variables and one for discrete variables, so that each can better learn its own distribution.
   - **Co-evolving conditional diffusion model**: The two diffusion models read conditions from each other during training, co-evolving so that the generated data keeps a reasonable correlation between continuous and discrete variables.
   - **Contrastive learning**: Introduce a contrastive learning method to further strengthen the connection between the two diffusion models and bring the generated data closer to the real data.
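The cross-conditioning idea above can be illustrated with a minimal sketch. It assumes that each model receives the other model's sample by simple concatenation; the summary only states that the models "read conditions from each other", so the concatenation, the helper names `one_hot` and `build_denoiser_inputs`, and the toy batch are all hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(labels, num_classes):
    """Encode integer category labels of shape (B,) as one-hot rows (B, K)."""
    return np.eye(num_classes)[labels]

def build_denoiser_inputs(x_cont_t, x_disc_t):
    """Pair each model's sample with the other model's sample as a condition.

    Assumption: the condition is attached by concatenation along the
    feature axis. The real architecture may inject it differently.
    """
    # The continuous model sees its own sample plus the discrete one.
    cont_input = np.concatenate([x_cont_t, x_disc_t], axis=1)
    # The discrete model sees its own sample plus the continuous one.
    disc_input = np.concatenate([x_disc_t, x_cont_t], axis=1)
    return cont_input, disc_input

# Toy batch: 4 rows, 2 continuous columns, 1 categorical column with 3 classes.
# These arrays stand in for the noised samples at some shared timestep t.
x_cont = rng.normal(size=(4, 2))
x_disc = one_hot(np.array([0, 2, 1, 2]), num_classes=3)

cont_input, disc_input = build_denoiser_inputs(x_cont, x_disc)
print(cont_input.shape, disc_input.shape)
```

Both inputs have the same width here because each one holds the full row, only ordered differently; what differs between the two models is which part they are asked to denoise.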
### Formula summary

- **Diffusion process** (closed form given the clean sample $x_0$):
  - Forward diffusion process in continuous space: \[ q(x_t|x_0)=\mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0,(1 - \bar{\alpha}_t)I) \]
  - Forward diffusion process in discrete space, where $C$ denotes a categorical distribution over $K$ categories: \[ q(x_t|x_0) = C(x_t; \bar{\alpha}_t x_0+(1 - \bar{\alpha}_t)/K) \]
- **Loss function**:
  - Loss function of the continuous diffusion model: \[ L_{\text{DiffC}}(\theta_C):=\mathbb{E}_{t,x_0,\epsilon}\left[\left\|\epsilon - \epsilon_{\theta_C}(x_t,t|x_D^t)\right\|^2\right] \]
  - Loss function of the discrete diffusion model: \[ L_{\text{DiffD}}(\theta_D):=\mathbb{E}_q\left[D_{\text{KL}}(q(x_T|x_0)\|p(x_T))-\log p_{\theta_D}(x_0|x_1,x_C^1)+\sum_{t = 2}^T D_{\text{KL}}(q(x_{t - 1}|x_t,x_0)\|p_{\theta_D}(x_{t - 1}|x_t,x_C^t))\right] \]
- **Contrastive learning loss function** (a triplet loss over anchors $A$, positives $P$, and negatives $N$, with distance $d$ and margin $m$): \[ L_{\text{CL}}(A,P,N)=\sum_{i = 0}^S\max\{d(A_i,P_i)-d(A_i,N_i)+m,0\} \]

With these components, CoDi handles mixed-type tabular data, and the paper reports strong performance across 11 real-world datasets against 8 baseline methods.
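To make the formulas more concrete, here is a minimal NumPy sketch of sampling from the two forward-process distributions parameterized by $\bar{\alpha}_t$, together with the triplet-style contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the Euclidean choice for the distance $d$, and all function names are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a linear beta schedule, as commonly used for DDPM-style models.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample_continuous(x0, alpha_bar_t, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I); also return the noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def q_probs_discrete(x0_onehot, alpha_bar_t):
    """Category probabilities abar_t * x0 + (1 - abar_t) / K for one-hot x0."""
    K = x0_onehot.shape[-1]
    return alpha_bar_t * x0_onehot + (1.0 - alpha_bar_t) / K

def q_sample_discrete(x0_onehot, alpha_bar_t, rng):
    """Sample a one-hot x_t from the categorical distribution above."""
    probs = q_probs_discrete(x0_onehot, alpha_bar_t)
    K = probs.shape[-1]
    labels = np.array([rng.choice(K, p=p) for p in probs])
    return np.eye(K)[labels]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Sum over samples of max(d(A, P) - d(A, N) + m, 0).

    Assumption: d is the Euclidean distance between rows.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.sum(np.maximum(d_pos - d_neg + margin, 0.0))

# Toy usage: 4 rows with 2 continuous columns and one 3-class discrete column.
x0_cont = rng.normal(size=(4, 2))
x0_disc = np.eye(3)[np.array([0, 2, 1, 2])]
t = 25
x_t_cont, eps = q_sample_continuous(x0_cont, alpha_bars[t], rng)
x_t_disc = q_sample_discrete(x0_disc, alpha_bars[t], rng)
```

As $\bar{\alpha}_t$ shrinks toward 0, the continuous sample approaches pure Gaussian noise and the discrete probabilities approach the uniform distribution $1/K$, which is why both processes can start generation from a simple prior.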

CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis
