SiloFuse: Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models

Aditya Shankar, Hans Brouwer, Rihan Hai, Lydia Chen
2024-04-04
Abstract: Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed across multiple silos, necessitating on-premise data storage. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To ensure privacy, SiloFuse utilizes a distributed latent tabular diffusion architecture. Through autoencoders, latent representations are learned for each client's features, masking their actual values. We employ stacked distributed training to improve communication efficiency, reducing the number of rounds to a single step. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets showcase SiloFuse's competence against centralized diffusion-based synthesizers. Notably, SiloFuse achieves 43.8 and 29.8 higher percentage points over GANs in resemblance and utility. Experiments on communication show stacked training's fixed cost compared to the growing costs of end-to-end training as the number of training iterations increases. Additionally, SiloFuse proves robust to feature permutations and varying numbers of clients.
Machine Learning; Cryptography and Security; Databases; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses how to generate high-quality synthetic tabular data in a cross-silo setting while preserving privacy. Specifically:

1. **Limitations of existing methods**: Most existing synthesizers are designed for centrally stored data and cannot handle features distributed across multiple data silos, which limits their real-world use. Sharing raw data directly may also violate privacy regulations (such as GDPR), so a method is needed that keeps data private while still enabling collaboration.
2. **Research question**: How can high-quality tabular diffusion models be designed and trained to generate synthetic data without centralizing the original data, while preserving the quality and utility of that data?

### Main challenges

- **Feature encoding**: Tabular data mixes discrete and continuous features and needs an encoding suited to training. Traditional one-hot encoding inflates the feature dimension and leads to sparsity.
- **Feature correlation**: For synthetic data to resemble the original distribution, the model must capture feature correlations across silos, yet existing approaches lack a way to learn global feature correlations when features are split across silos.
- **Communication cost**: In distributed training, frequent data exchange leads to high communication costs and hurts training efficiency.

### Solutions

The paper proposes the SiloFuse framework to address these problems. Its main features are:

- **Latent representation learning**: Each client encodes its sensitive features into continuous latent features with an autoencoder, and a Gaussian diffusion model generates new synthetic data in the latent space. Global feature correlations can thus be learned without exposing the original data.
- **Stacked training paradigm**: The autoencoders are trained at the clients and the diffusion model at the coordinator, so the entire training process needs only a single communication round, which greatly reduces communication cost.
- **Privacy protection**: The paper proves that data reconstruction is impossible for vertically partitioned synthesis, and it quantifies privacy risk under three attacks using a benchmark framework.

### Experimental results

Across nine datasets, SiloFuse is competitive with centralized diffusion-based synthesizers. It improves over GANs by 43.8 and 29.8 percentage points in resemblance and utility, respectively. SiloFuse is also more communication-efficient: its communication cost stays fixed as the number of training iterations increases, whereas the cost of end-to-end training keeps growing.

### Summary

SiloFuse generates high-quality synthetic tabular data efficiently while preserving privacy, and it is suited to cross-silo scenarios.
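
The communication claim (a fixed cost for stacked training versus a cost that grows with the number of iterations for end-to-end training) can be illustrated with a back-of-the-envelope cost model. This is a minimal sketch, not the paper's measurement code: the names `Setup`, `stacked_cost`, and `end_to_end_cost` are hypothetical, latents are assumed to be float32, and end-to-end training is assumed to exchange one batch of latents and one batch of matching gradients with every client at each iteration.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT = 4  # assumption: latents are stored as float32


@dataclass
class Setup:
    """Hypothetical description of a cross-silo training run."""
    n_clients: int
    n_rows: int       # rows shared by all vertical partitions
    latent_dim: int   # latent features produced per client
    batch_size: int


def stacked_cost(s: Setup) -> int:
    """Bytes sent when each client uploads all of its latents exactly once."""
    return s.n_clients * s.n_rows * s.latent_dim * BYTES_PER_FLOAT


def end_to_end_cost(s: Setup, iterations: int) -> int:
    """Bytes sent when every iteration moves a batch of latents forward
    and a batch of gradients backward for each client (assumed model)."""
    per_iteration = 2 * s.n_clients * s.batch_size * s.latent_dim * BYTES_PER_FLOAT
    return iterations * per_iteration


setup = Setup(n_clients=4, n_rows=10_000, latent_dim=8, batch_size=512)
for iters in (1_000, 10_000, 100_000):
    # The stacked cost does not depend on `iters`; the end-to-end cost scales with it.
    print(iters, stacked_cost(setup), end_to_end_cost(setup, iters))
```

Under these assumptions the stacked cost depends only on the dataset and latent size, while the end-to-end cost is linear in the iteration count, which matches the qualitative trend the paper reports. The actual byte counts in the paper depend on its datasets and architectures and are not reproduced here.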