Synthesizing High-Utility Tabular Data with Enhanced Privacy Via Split-and-Discard Pre-Training

Liwei Luo, Heyuan Huang, Bingbing Zhang, Yankai Xie, Chi Zhang, Lingbo Wei
DOI: https://doi.org/10.1109/globecom54140.2023.10437831
2023-01-01
Abstract: Data sharing has led to the emergence of deep generative models (DGMs) with differential privacy for synthesizing tabular data. However, existing methods struggle to synthesize high-utility tabular data with enhanced privacy. One challenge is degraded data utility due to the limited number of training iterations available under strong privacy guarantees. The other challenge is that widely used encoding schemes may leak the sensitive distribution of continuous features. To this end, we propose a novel pipeline incorporating split-and-discard pre-training and an embedding module to synthesize data. To reduce the impact of limited iterations, we employ the split-and-discard pre-training method. This method leverages the intrinsic structure of a DGM, which can be split into discriminative and generative sub-models. By conducting pre-training and discarding specific sub-models of the DGM on private data, we address these challenges while training models with differential privacy. To preserve the privacy of continuous features, we propose a piecewise linear one-hot encoding scheme followed by an embedding layer. We instantiate this pipeline using variational autoencoders and generative adversarial networks, respectively, and compare them against popular models and variants. Results show that our pipeline on private data effectively balances privacy and utility.
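The abstract names the encoding scheme but does not define it. As a rough illustration only, the sketch below shows one plausible reading: a piecewise linear encoding over fixed bin edges, followed by a linear embedding. This is an assumption, not the authors' method. The function names `piecewise_linear_encode` and `embed`, the bin edges, and the embedding width are all hypothetical. The edges are taken here from a publicly known value range rather than fitted to the private data, which is one way an encoding could avoid exposing the distribution of a continuous feature.

```python
import numpy as np

def piecewise_linear_encode(x, edges):
    """Encode scalars against ascending bin edges (length T+1) into T components.

    Each component is 1 for bins entirely below the value, 0 for bins entirely
    above it, and the fractional position inside the bin that contains it.
    Hypothetical helper: an assumed reading of the scheme, not the paper's code.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    edges = np.asarray(edges, dtype=float)
    lower, upper = edges[:-1], edges[1:]
    # Fraction of each bin covered by x, clipped to [0, 1].
    return np.clip((x - lower) / (upper - lower), 0.0, 1.0)

def embed(encoded, weight):
    """Map the T-dimensional encoding to a dense vector (a bias-free linear layer)."""
    return encoded @ weight

# Assumed public range for an "age" column: 0 to 100, split into 4 equal bins.
edges = np.linspace(0.0, 100.0, 5)
encoded = piecewise_linear_encode([18.0, 40.0], edges)

# Randomly initialised embedding weights; in a real model these would be trained.
rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8))
vectors = embed(encoded, weight)
```

In this sketch, values outside the assumed range saturate at all zeros or all ones, so the choice of edges determines how much resolution the encoding keeps.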