Abstract: Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed across multiple silos, necessitating on-premise data storage. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To ensure privacy, SiloFuse utilizes a distributed latent tabular diffusion architecture. Through autoencoders, latent representations are learned for each client's features, masking their actual values. We employ stacked distributed training to improve communication efficiency, reducing the number of rounds to a single step. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets showcase SiloFuse's competence against centralized diffusion-based synthesizers. Notably, SiloFuse achieves 43.8 and 29.8 higher percentage points over GANs in resemblance and utility. Experiments on communication show stacked training's fixed cost compared to the growing costs of end-to-end training as the number of training iterations increases. Additionally, SiloFuse proves robust to feature permutations and varying numbers of clients.

What problem does this paper attempt to address?

This paper aims to generate high-quality synthetic tabular data in a cross-silo setting while preserving privacy. Specifically:

1. **Limitations of existing methods**: Most existing synthesizers are designed for centrally stored data and cannot handle features distributed across multiple data silos, which limits their real-world applicability. In addition, directly sharing raw data may violate privacy regulations (such as GDPR), so a method is needed that keeps data private while still enabling collaboration.
2. **Research question**: How can high-quality tabular diffusion models be designed and trained to generate synthetic data without centralizing the original data, while preserving the quality and utility of the synthetic data?

### Main challenges

- **Feature encoding**: Tabular data contains both discrete and continuous features and needs an encoding suited to training. Traditional one-hot encoding increases the feature dimension and leads to sparsity.
- **Feature correlation**: For the synthetic data to resemble the original distribution, feature correlations across data silos must be captured, but the ability to learn global feature correlations is lacking when features stay in separate silos.
- **Communication cost**: In distributed training, frequent data exchanges lead to high communication costs and reduce training efficiency.

### Solutions

The paper proposes the SiloFuse framework to address these challenges. Its main features are:

- **Latent representation learning**: Sensitive features are encoded into continuous latent features by an autoencoder, and a Gaussian diffusion model generates new synthetic data in the latent space. In this way, global feature correlations can be learned without exposing the original data.
- **Stacked training paradigm**: The autoencoders are trained at the clients and the diffusion model at the coordinator, so the entire training process needs only a single communication round, which greatly reduces the communication cost.
- **Privacy protection**: The paper proves that the original data cannot be reconstructed under vertically partitioned synthesis, and it quantifies privacy risks under three attacks using a benchmark framework.

### Experimental results

Experiments on nine datasets show that SiloFuse is competitive with centralized diffusion-based synthesizers. Compared with GAN-based methods, SiloFuse scores 43.8 and 29.8 percentage points higher in resemblance and utility, respectively. SiloFuse also has an advantage in communication efficiency: stacked training has a fixed communication cost, whereas the cost of end-to-end training grows with the number of training iterations.

### Summary

SiloFuse efficiently generates high-quality synthetic tabular data while preserving privacy, and it is suited to cross-silo scenarios in which features are distributed across multiple silos.
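The data flow of the stacked training paradigm described above can be illustrated with a toy sketch. This is not the paper's implementation: the `LinearAutoencoder` class (a PCA-style stand-in for each client's autoencoder), the noise schedule, the latent sizes, and all variable names are assumptions chosen for illustration, and no denoiser is trained. The sketch only shows that clients upload latents once, that the coordinator applies Gaussian noising to the concatenated latents, and that latents are split per client for decoding.

```python
import numpy as np

rng = np.random.default_rng(0)


class LinearAutoencoder:
    """Hypothetical stand-in for a client's autoencoder (PCA-style linear map)."""

    def __init__(self, latent_dim):
        self.latent_dim = latent_dim

    def fit(self, x):
        # Principal directions of the centered local features.
        self.mean = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean, full_matrices=False)
        self.components = vt[: self.latent_dim]  # (latent_dim, n_features)
        return self

    def encode(self, x):
        return (x - self.mean) @ self.components.T

    def decode(self, z):
        return z @ self.components + self.mean


def forward_diffuse(z0, t, alpha_bar, rng):
    """Gaussian forward noising: z_t = sqrt(a_bar_t) z_0 + sqrt(1 - a_bar_t) eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps


# Vertically partitioned toy table: same 200 rows, different columns per client.
n_rows = 200
client_features = [rng.normal(size=(n_rows, 4)), rng.normal(size=(n_rows, 6))]

# Step 1 (clients): each client fits its autoencoder on local features only.
encoders = [LinearAutoencoder(latent_dim=2).fit(x) for x in client_features]

# Step 2 (single upload): clients send latents, never raw feature values.
uploads = [enc.encode(x) for enc, x in zip(encoders, client_features)]
communication_rounds = 1
latents = np.concatenate(uploads, axis=1)  # coordinator's joint latent table

# Step 3 (coordinator): noise the joint latents; a real system would train a
# denoiser to predict eps from (zt, t) and later sample new latents from it.
betas = np.linspace(1e-4, 0.02, 100)  # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)
zt, eps = forward_diffuse(latents, t=50, alpha_bar=alpha_bar, rng=rng)

# Step 4 (return path): split latents by client and decode locally. The
# original latents are reused here because no sampler is trained in this sketch.
boundaries = np.cumsum([enc.latent_dim for enc in encoders])[:-1]
splits = np.split(latents, boundaries, axis=1)
decoded = [enc.decode(z) for enc, z in zip(encoders, splits)]
```

In this sketch the coordinator sees only the concatenated latent table, and the number of uploads does not depend on how many diffusion training steps the coordinator later runs, which is the property the summary attributes to stacked training.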

SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models

CTSyn: A Foundational Model for Cross Tabular Data Generation

SynDiffix: More accurate synthetic structured data

Generating Synthetic Fair Syntax-agnostic Data by Learning and Distilling Fair Representation

CollaFuse: Collaborative Diffusion Models

FedSyn: Synthetic Data Generation using Federated Learning

CollaFuse: Navigating Limited Resources and Privacy in Collaborative Generative AI

Data synthesis based on generative adversarial networks

Unlocking the Potential of Federated Learning: The Symphony of Dataset Distillation via Deep Generative Latents

Tabular Data Synthesis with Generative Adversarial Networks: Design Space and Optimizations

Federated Learning with GAN-based Data Synthesis for Non-IID Clients.

Quantifying and Mitigating Privacy Risks for Tabular Generative Models

Permutation-Invariant Tabular Data Synthesis

A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis

End to End Collaborative Synthetic Data Generation

AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing

SYNC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources

FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation

TabSAL: Synthesizing Tabular Data with Small Agent Assisted Language Models