Abstract: Diffusion models have emerged as a robust framework for various generative tasks, such as image and audio synthesis, and have also demonstrated a remarkable ability to generate mixed-type tabular data comprising both continuous and discrete variables. However, current approaches to training diffusion models on mixed-type tabular data tend to inherit the imbalanced distributions of features present in the training dataset, which can result in biased sampling. In this research, we introduce a fair diffusion model designed to generate balanced data on sensitive attributes. We present empirical evidence demonstrating that our method effectively mitigates the class imbalance in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data in terms of performance and fairness.

What problem does this paper attempt to address?

This paper addresses how to generate balanced mixed-type tabular data with diffusion models, in particular when sensitive attributes are imbalanced in the training data. Current diffusion-based methods for mixed-type tabular data tend to inherit the imbalanced feature distributions of the training dataset, which leads to biased sampling. The paper proposes a fair diffusion model that generates data balanced on sensitive attributes while maintaining the quality of the generated samples. It also reports experimental evidence that the method outperforms existing tabular data synthesis methods in terms of performance and fairness.

The main contributions of the paper are:

1. A diffusion-based framework for learning the distribution of mixed-type tabular data given multiple attributes.
2. Generation of tabular data that is balanced on a predefined set of sensitive attributes, which addresses the inherent bias in the data.
3. A comprehensive evaluation showing that the model surpasses existing models in performance and fairness.

The diffusion model described in the paper is a probabilistic generative model that uses Markov processes to gradually transform noise into the target data distribution. It consists of a forward process and a reverse process. The forward process gradually transforms the given data $x_0$ into noise $x_T$; the reverse process starts from $x_T$ and denoises it step by step back to $x_0$. Continuous features use a Gaussian diffusion kernel, and discrete features use a multinomial diffusion kernel. The paper describes these processes in detail and proposes methods for optimizing the parameters so that the model handles mixed-type data containing both continuous and discrete features.
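
The forward process described above can be illustrated with a minimal sketch. It assumes the standard closed-form marginals, $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t) I)$ for continuous features and $q(x_t \mid x_0) = \mathrm{Cat}(\bar\alpha_t x_0 + (1-\bar\alpha_t)/K)$ for a one-hot feature with $K$ categories. The function names, the linear noise schedule, and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_forward(x0, alpha_bar_t, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

def multinomial_forward(x0_onehot, alpha_bar_t, rng):
    """Sample x_t from Cat(alpha_bar_t * x0 + (1 - alpha_bar_t) / K), row by row."""
    k = x0_onehot.shape[-1]
    probs = alpha_bar_t * x0_onehot + (1.0 - alpha_bar_t) / k
    probs = probs / probs.sum(axis=-1, keepdims=True)  # guard against rounding
    idx = np.array([rng.choice(k, p=p) for p in probs])
    return np.eye(k)[idx]

# Assumed linear beta schedule with T = 1000 steps (hypothetical choice).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x_num = np.array([[0.5, -1.2], [1.0, 0.3]])  # two rows, two continuous features
x_cat = np.eye(3)[[0, 2]]                    # two rows, one 3-category feature

t = 999  # last step: both feature types are close to pure noise
x_num_t = gaussian_forward(x_num, alpha_bars[t], rng)
x_cat_t = multinomial_forward(x_cat, alpha_bars[t], rng)
```

At small `t` the samples stay close to `x_num` and `x_cat`; at `t = 999` the continuous part is approximately standard normal and the categorical part is approximately uniform over the categories.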

To generate balanced mixed-type tabular data, the paper proposes a multivariate latent guidance method that reduces disparities in the representation of sensitive attributes in synthetic data through conditional generation and balanced sampling. It also describes the backbone deep neural network architecture used for the posterior estimator and the balanced sampling procedure applied after training.

In the experiments, the paper evaluates the effectiveness and fairness of the method on tabular datasets from seven classification tasks. It compares against multiple baselines, including recent diffusion models, generative adversarial networks, variational autoencoders, and the traditional SMOTE method. The evaluation metrics are machine learning efficiency, classifier fairness scores (such as demographic parity ratio and equal opportunity ratio), data originality, and the degree of category imbalance within sensitive features.

Overall, the paper aims to solve the distribution imbalance problem in mixed-type tabular data generation by improving the diffusion model, thereby improving the quality and fairness of the generated data.
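
The two fairness scores named above can be sketched as follows. This assumes the common definitions: demographic parity ratio is the smallest group selection rate divided by the largest, and equal opportunity ratio is the smallest group true-positive rate divided by the largest. The paper's exact computation may differ. The function names and the toy labels are hypothetical.

```python
import numpy as np

def demographic_parity_ratio(y_pred, groups):
    """Min over max of per-group positive prediction rates (1.0 means parity)."""
    rates = [float(np.mean(y_pred[groups == g])) for g in np.unique(groups)]
    return min(rates) / max(rates) if max(rates) > 0 else float("nan")

def equal_opportunity_ratio(y_true, y_pred, groups):
    """Min over max of per-group true-positive rates (1.0 means parity)."""
    tprs = []
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == 1)
        if positives.any():  # skip groups with no positive ground-truth rows
            tprs.append(float(np.mean(y_pred[positives])))
    return min(tprs) / max(tprs) if tprs and max(tprs) > 0 else float("nan")

# Toy binary classification output with one sensitive attribute (made-up data).
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
groups = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

dpr = demographic_parity_ratio(y_pred, groups)         # 0.25 / 0.75
eor = equal_opportunity_ratio(y_true, y_pred, groups)  # 0.5 / 1.0
```

Both ratios lie in [0, 1], and values closer to 1 indicate that a classifier trained on the synthetic data treats the sensitive groups more evenly.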

Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models

Controllable Tabular Data Synthesis Using Diffusion Models

Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space

Continuous Diffusion for Mixed-Type Tabular Data

Synthetic Tabular Data Generation for Class Imbalance and Fairness: A Comparative Study

Diffusion Models for Tabular Data Imputation and Synthetic Data Generation

TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation

Tabular Data Generation using Binary Diffusion

Class-Balancing Diffusion Models

Synthesizing Mixed-type Electronic Health Records using Diffusion Models

Data Augmentation via Diffusion Model to Enhance AI Fairness

BioDiffusion: A Versatile Diffusion Model for Biomedical Signal Synthesis

FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation

AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing

MissDiff: Training Diffusion Models on Tabular Data with Missing Values

CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis

Diffusion-nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

TabDDPM: Modelling Tabular Data with Diffusion Models

CTSyn: A Foundational Model for Cross Tabular Data Generation

Training Class-Imbalanced Diffusion Model Via Overlap Optimization