Balanced Mixed-Type Tabular Data Synthesis with Diffusion Models

Zeyu Yang, Peikun Guo, Khadija Zanna, Akane Sano
2024-04-12
Abstract: Diffusion models have emerged as a robust framework for various generative tasks, such as image and audio synthesis, and have also demonstrated a remarkable ability to generate mixed-type tabular data comprising both continuous and discrete variables. However, current approaches to training diffusion models on mixed-type tabular data tend to inherit the imbalanced distributions of features present in the training dataset, which can result in biased sampling. In this research, we introduce a fair diffusion model designed to generate balanced data on sensitive attributes. We present empirical evidence demonstrating that our method effectively mitigates the class imbalance in training data while maintaining the quality of the generated samples. Furthermore, we provide evidence that our approach outperforms existing methods for synthesizing tabular data in terms of performance and fairness.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to generate balanced mixed-type tabular data with diffusion models, in particular when sensitive attributes are unevenly distributed. Current diffusion-based methods for mixed-type tabular data tend to inherit the imbalanced feature distributions of the training dataset, which leads to biased sampling. The paper proposes a fair diffusion model that generates data balanced on sensitive attributes while maintaining the quality of the generated samples, and it reports experimental evidence that the method outperforms existing tabular data synthesis methods in both performance and fairness.
The main contributions of the paper are:
1. A framework based on diffusion models for learning the distribution of mixed-type tabular data given multiple attributes.
2. Generation of tabular data that is balanced on a predefined set of sensitive attributes, which addresses the bias inherent in the data.
3. A comprehensive evaluation showing that the model surpasses existing models on the reported performance and fairness measures.
The diffusion model used in the paper is a probabilistic generative model that uses Markov processes to transform noise step by step into the target data distribution. It consists of two parts: a forward process and a reverse process. In the forward process, the given data $x_0$ is gradually corrupted into noise $x_T$; in the reverse process, $x_T$ is denoised step by step to recover $x_0$. Continuous features use a Gaussian diffusion kernel, and discrete features use a multinomial diffusion kernel. The paper describes these processes in detail and proposes a way to optimize the model parameters for mixed-type data that contains both continuous and discrete features.
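The forward process described above can be illustrated with a minimal sketch. It assumes the standard closed-form kernels from the diffusion literature: a Gaussian kernel $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ for continuous features, and a categorical (multinomial) kernel that mixes the one-hot value with the uniform distribution for discrete features. The linear noise schedule, the function names, and the default parameters are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Return abar_t = prod_{s<=t} (1 - beta_s) for an assumed linear beta schedule."""
    abar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        prod *= 1.0 - beta
        abar.append(prod)
    return abar


def noise_continuous(x0, abar_t, rng):
    """Gaussian kernel: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    scale, sigma = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def categorical_probs(k, num_classes, abar_t):
    """Multinomial kernel: q(x_t | x_0 = k) = abar_t * onehot(k) + (1 - abar_t) / K."""
    uniform = (1.0 - abar_t) / num_classes
    return [abar_t * (1.0 if j == k else 0.0) + uniform for j in range(num_classes)]


def noise_categorical(k, num_classes, abar_t, rng):
    """Draw a noisy category index from the multinomial kernel."""
    probs = categorical_probs(k, num_classes, abar_t)
    return rng.choices(range(num_classes), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    abar = alpha_bar_schedule(1000)
    # Early step: mostly signal. Final step: close to pure noise.
    print(noise_continuous([0.5, -1.2], abar[10], rng))
    print(categorical_probs(1, 4, abar[-1]))
```

At the final step the categorical probabilities are nearly uniform and the continuous features are nearly standard normal, which is what allows the reverse process to start from pure noise.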
To generate balanced mixed-type tabular data, the paper proposes a multivariate latent guidance method, which reduces disparities in how sensitive attributes are represented in the synthetic data by combining conditional generation with balanced sampling. The paper also describes the backbone deep neural network architecture of the posterior estimator and the balanced sampling procedure applied after training.
In the experiments, the paper evaluates the effectiveness and fairness of the proposed method on tabular datasets from seven classification tasks. It compares against multiple baseline models, including recent diffusion models, generative adversarial networks, variational autoencoders, and the traditional SMOTE method. The evaluation metrics include machine learning efficiency, classifier fairness scores (such as the demographic parity ratio and the equal opportunity ratio), data originality, and the degree of class imbalance within the sensitive features.
Overall, the paper aims to solve the distribution imbalance problem in mixed-type tabular data generation by improving the diffusion model, thereby improving the quality and fairness of the generated data.
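The two fairness scores named above, and the idea of requesting sensitive-attribute combinations evenly at sampling time, can be sketched as follows. This assumes the common definitions: the demographic parity ratio is the smallest group-wise positive-prediction rate divided by the largest, and the equal opportunity ratio is the same ratio computed on true positive rates. The `balanced_conditions` helper is a hypothetical round-robin scheme for illustration only; the paper's actual balanced sampling procedure may differ.

```python
import itertools
from collections import defaultdict


def _min_max_ratio(flags_by_group):
    """Ratio of the smallest to the largest group-wise rate."""
    rates = [sum(flags) / len(flags) for flags in flags_by_group.values() if flags]
    highest = max(rates)
    # Assumption: if every group has rate 0, the groups are treated identically.
    return min(rates) / highest if highest > 0 else 1.0


def demographic_parity_ratio(y_pred, groups):
    """min/max of P(y_pred = 1 | group) over the sensitive groups."""
    by_group = defaultdict(list)
    for pred, group in zip(y_pred, groups):
        by_group[group].append(pred == 1)
    return _min_max_ratio(by_group)


def equal_opportunity_ratio(y_true, y_pred, groups):
    """min/max of the true positive rate over the sensitive groups."""
    by_group = defaultdict(list)
    for true, pred, group in zip(y_true, y_pred, groups):
        if true == 1:  # only actual positives count toward the TPR
            by_group[group].append(pred == 1)
    return _min_max_ratio(by_group)


def balanced_conditions(levels, n):
    """Hypothetical helper: cycle through all sensitive-attribute combinations
    so that each one is requested (almost) equally often."""
    combos = list(itertools.product(*levels.values()))
    return [dict(zip(levels, combos[i % len(combos)])) for i in range(n)]


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(demographic_parity_ratio(y_pred, groups))
    print(equal_opportunity_ratio(y_true, y_pred, groups))
    print(balanced_conditions({"sex": ["F", "M"], "race": ["A", "B"]}, 4))
```

A ratio of 1.0 means all groups receive positive predictions (or true positives) at the same rate; values closer to 0 indicate larger disparity between groups.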