Inference With Combining Rules From Multiple Differentially Private Synthetic Datasets

Leila Nombo, Anne-Sophie Charest
2024-05-08
Abstract: Differential privacy (DP) has been accepted as a rigorous criterion for measuring the privacy protection offered by random mechanisms used to obtain statistics or, as we will study here, synthetic datasets from confidential data. Methods to generate such datasets are increasingly numerous, using varied tools including Bayesian models, deep neural networks and copulas. However, little is still known about how to properly perform statistical inference with these differentially private synthetic (DIPS) datasets. The challenge is for the analyses to take into account the variability from the synthetic data generation in addition to the usual sampling variability. A similar challenge also occurs when missing data is imputed before analysis, and statisticians have developed appropriate inference procedures for this case, which have since been extended to the case of synthetic datasets for privacy. In this work, we study the applicability of these procedures, based on combining rules, to the analysis of DIPS datasets. Our empirical experiments show that the proposed combining rules may offer accurate inference in certain contexts, but not in all cases.
Methodology, Cryptography and Security, Machine Learning, Applications
What problem does this paper attempt to address?
This paper addresses how to account for the additional variability introduced by synthetic data generation when performing statistical inference with differentially private (DP) synthetic datasets. Although many methods exist for generating such datasets, little is known about how to analyze them properly. The paper examines whether combining rules, originally developed for multiply imputed data, are applicable: several synthetic datasets are generated and compared in order to estimate the variability introduced by the synthesis process. The experiments show that combining rules can provide accurate inference in some settings, but not in all. The study covers several DP data generation mechanisms, including statistical and deep learning-based methods, and tests different variance estimators.
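To make the idea of combining rules concrete, here is a minimal Python sketch. It assumes the analyst has already computed a point estimate and a variance estimate on each of m synthetic datasets. The function name `combine_estimates` and the dictionary keys are hypothetical. The three variance formulas are the standard ones from the multiple imputation and synthetic data literature (Rubin's rule, the partially synthetic rule, and the fully synthetic rule); the abstract does not say which estimators the paper evaluates, so this is an illustration rather than the authors' procedure.

```python
from statistics import mean, variance


def combine_estimates(estimates, variances):
    """Combine per-dataset results from m synthetic datasets.

    estimates: point estimates, one per synthetic dataset
    variances: matching variance estimates, one per synthetic dataset
    (hypothetical helper; the variance rules come from the general
    literature and are not confirmed as the paper's exact choices)
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = mean(estimates)   # combined point estimate
    u_bar = mean(variances)   # average within-dataset variance
    b = variance(estimates)   # between-dataset variance (divisor m - 1)

    return {
        "estimate": q_bar,
        "within": u_bar,
        "between": b,
        # Rubin's rule for multiply imputed data
        "var_multiple_imputation": u_bar + (1 + 1 / m) * b,
        # rule commonly used for partially synthetic data
        "var_partially_synthetic": u_bar + b / m,
        # rule commonly used for fully synthetic data; can be negative
        "var_fully_synthetic": (1 + 1 / m) * b - u_bar,
    }
```

For example, `combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` averages the three estimates and returns one candidate variance per rule. The between-dataset term is what captures the variability added by the synthesis step, which a single synthetic dataset cannot reveal.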