Abstract: The generation of synthetic data is a state-of-the-art approach to leverage when access to real data is limited or privacy regulations limit the usability of sensitive data. A fair amount of research has been conducted on synthetic data generation for single-tabular datasets, but only a limited amount of research has been conducted on multi-tabular datasets with complex table relationships. In this paper we propose the algorithm HCTGAN to synthesize multi-tabular data from complex multi-tabular datasets. We compare our results to the probabilistic model HMA1. Our findings show that our proposed algorithm can more efficiently sample large amounts of synthetic data for deep and complex multi-tabular datasets, whilst achieving adequate data quality and always guaranteeing referential integrity. We conclude that the HCTGAN algorithm is suitable for generating large amounts of synthetic data efficiently for deep multi-tabular datasets with complex relationships. We additionally suggest that the HMA1 model should be used on smaller datasets when emphasis is on data quality.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses the challenge of generating synthetic multi-tabular data, especially when the tables are linked by complex relationships. Specifically, the authors propose a new algorithm, HCTGAN (Hierarchical Conditional Tabular GAN), for synthesizing multi-tabular data with complex relationships.

#### Main problems include:

1. **Privacy protection and data access limitations**:
   - In many application scenarios, privacy regulations (such as GDPR, PIPA, etc.) make it difficult or impossible to access real data. An effective method is therefore needed to generate synthetic data that can stand in for real data in research and development.
2. **Limitations of existing methods**:
   - Most research on synthetic data generation focuses on single-table data, while relatively little has been done on multi-table data, especially multi-table data with complex relationships.
   - Existing probabilistic models (such as HMA1) perform poorly on large and complex multi-table data: training and sampling are inefficient, and referential integrity between the generated tables is not guaranteed.
3. **Generating high-quality and structurally correct synthetic data**:
   - For synthetic data to be usable, it must not only resemble the real data but also keep the relationships between the generated tables correct and consistent.

#### Main contributions of HCTGAN:

- **Efficient multi-table data generation**: HCTGAN can efficiently generate large amounts of synthetic data while maintaining data quality, and is especially suitable for deep and complex multi-table datasets.
- **Guarantee of referential integrity**: The synthetic data generated by HCTGAN always guarantees referential integrity, which is difficult to achieve with other methods (such as HMA1).
- **Scalability and flexibility**: HCTGAN can handle arbitrarily complex relational datasets, and its training and sampling algorithms are suitable for large-scale data generation tasks.

Through these improvements, HCTGAN provides a new solution to the problem of synthesizing complex multi-table data, especially for application scenarios that require privacy protection or are limited by data access.
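To make the notion of referential integrity concrete, the following is a minimal sketch of how one might check it on a generated parent/child table pair. It is not part of HCTGAN or HMA1; the function name `orphaned_foreign_keys` and the `customers`/`orders` tables are hypothetical and chosen only for illustration.

```python
from typing import Any, Iterable, List, Mapping


def orphaned_foreign_keys(
    parent_rows: Iterable[Mapping[str, Any]],
    child_rows: Iterable[Mapping[str, Any]],
    parent_key: str,
    foreign_key: str,
) -> List[Any]:
    """Return the sorted foreign-key values in child_rows that match no parent row.

    An empty result means the pair of tables satisfies referential integrity
    for this one relationship.
    """
    # Collect every primary-key value present in the parent table.
    parent_ids = {row[parent_key] for row in parent_rows}
    # Any child reference outside that set points at a non-existent parent.
    child_refs = {row[foreign_key] for row in child_rows}
    return sorted(child_refs - parent_ids)


# Hypothetical synthetic tables: customers (parent) and orders (child).
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders_ok = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
]
orders_bad = [{"order_id": 12, "customer_id": 3}]  # references a missing customer

print(orphaned_foreign_keys(customers, orders_ok, "customer_id", "customer_id"))
print(orphaned_foreign_keys(customers, orders_bad, "customer_id", "customer_id"))
```

For a dataset with several relationships, the same check would be repeated once per parent/child link; a generator that "always guarantees referential integrity" should return an empty list for every link.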

Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation

Effective and Privacy preserving Tabular Data Synthesizing

Generating Synthetic Mixed-Type Longitudinal Electronic Health Records for Artificial Intelligent Applications

Synthesizing Tabular Data using Generative Adversarial Networks

Multi-objective evolutionary GAN for tabular data synthesis

Enhanced Conditional GAN for High-Quality Synthetic Tabular Data Generation in Mobile-Based Cardiovascular Healthcare

CTAB-GAN+: enhancing tabular data synthesis

CasTGAN: Cascaded Generative Adversarial Network for Realistic Tabular Data Synthesis

Causal-TGAN: Generating Tabular Data Using Causal Generative Adversarial Networks

TAEGAN: Generating Synthetic Tabular Data For Data Augmentation

Row Conditional-TGAN for generating synthetic relational databases

Tabular Data Synthesis with Generative Adversarial Networks: Design Space and Optimizations

OCT-GAN: Neural ODE-based Conditional Tabular GANs

Distributed Conditional GAN (discGAN) For Synthetic Healthcare Data Generation

An improved tabular data generator with VAE-GMM integration

MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data

Comparative Analysis of Generative AI Techniques for Addressing the Tabular Data Generation Problem in Medical Records

Permutation-Invariant Tabular Data Synthesis

TableGAN-MCA: Evaluating Membership Collisions of GAN-Synthesized Tabular Data Releasing

Invertible Tabular GANs: Killing Two Birds with One Stone for Tabular Data Synthesis