Hierarchical Conditional Tabular GAN for Multi-Tabular Synthetic Data Generation

Wilhelm Ågren, Victorio Úbeda Sosa
2024-11-11
Abstract: The generation of synthetic data is a state-of-the-art approach to leverage when access to real data is limited or privacy regulations limit the usability of sensitive data. A fair amount of research has been conducted on synthetic data generation for single-tabular datasets, but only a limited amount of research has been conducted on multi-tabular datasets with complex table relationships. In this paper we propose the algorithm HCTGAN to synthesize multi-tabular data from complex multi-tabular datasets. We compare our results to the probabilistic model HMA1. Our findings show that our proposed algorithm can more efficiently sample large amounts of synthetic data for deep and complex multi-tabular datasets, whilst achieving adequate data quality and always guaranteeing referential integrity. We conclude that the HCTGAN algorithm is suitable for generating large amounts of synthetic data efficiently for deep multi-tabular datasets with complex relationships. We additionally suggest that the HMA1 model should be used on smaller datasets when emphasis is on data quality.
Machine Learning, Databases
What problem does this paper attempt to address?
This paper addresses the challenge of generating synthetic multi-tabular data, particularly when the tables are linked by complex relationships. The authors propose HCTGAN (Hierarchical Conditional Tabular GAN), an algorithm for synthesizing such data.

#### Main problems include:

1. **Privacy protection and data access limitations**:
   - Privacy regulations (such as GDPR and PIPA) can make real data difficult or impossible to access, so synthetic data is needed to stand in for real data in research and development.
2. **Limitations of existing methods**:
   - Most research on synthetic data generation focuses on single-table data. Comparatively little addresses multi-table data, especially multi-table data with complex relationships.
   - Existing probabilistic models (such as HMA1) handle large and complex multi-table data poorly: training and sampling are inefficient, and referential integrity between the generated tables is not guaranteed.
3. **Generating high-quality and structurally correct synthetic data**:
   - For synthetic data to be usable, it must not only resemble the real data but also preserve correct and consistent relationships between the tables.

#### Main contributions of HCTGAN:

- **Efficient multi-table data generation**: HCTGAN can efficiently sample large amounts of synthetic data while achieving adequate data quality, and is especially suited to deep and complex multi-table datasets.
- **Guarantee of referential integrity**: The synthetic data generated by HCTGAN always satisfies referential integrity, which is difficult to achieve with other methods (such as HMA1).
- **Scalability and flexibility**: HCTGAN can handle arbitrarily complex relational datasets, and its training and sampling algorithms are suited to large-scale data generation tasks.

With these improvements, HCTGAN offers a new approach to synthesizing complex multi-table data, especially in settings that require privacy protection or where access to real data is restricted.
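Referential integrity, as used above, means that every foreign-key value in a child table matches a primary key in its parent table. The following is a minimal sketch of how that property can be checked on synthetic tables, and of why sampling parents before children satisfies it by construction. It is not the HCTGAN implementation: the table names, column names, and both functions are hypothetical, and the toy sampler draws random values instead of using a GAN.

```python
import random


def find_dangling_rows(child_rows, fk_column, parent_rows, pk_column):
    """Return the child rows whose foreign key matches no parent primary key."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row for row in child_rows if row[fk_column] not in parent_keys]


def sample_parents_then_children(n_parents, max_children, seed=0):
    """Toy hierarchical sampler (hypothetical, not the paper's algorithm).

    Parent rows are created first. Each child row copies the key of an
    already-sampled parent, so no foreign key can point at a missing row.
    """
    rng = random.Random(seed)
    customers = [
        {"customer_id": i, "segment": rng.choice(["retail", "business"])}
        for i in range(n_parents)
    ]
    orders = []
    for customer in customers:
        for _ in range(rng.randint(0, max_children)):
            orders.append(
                {
                    "order_id": len(orders),
                    "customer_id": customer["customer_id"],
                    "amount": round(rng.uniform(1.0, 100.0), 2),
                }
            )
    return customers, orders


customers, orders = sample_parents_then_children(n_parents=5, max_children=3)

# Children were derived from sampled parents, so nothing dangles.
print(len(find_dangling_rows(orders, "customer_id", customers, "customer_id")))

# A child row sampled independently of the parent table can violate integrity.
broken = orders + [{"order_id": -1, "customer_id": 999, "amount": 0.0}]
print(len(find_dangling_rows(broken, "customer_id", customers, "customer_id")))
```

The same check extends to deeper schemas by running it once per parent-child relationship.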