Utility Assessment of Synthetic Data Generation Methods

Md Sakib Nizam Khan, Niklas Reje, Sonja Buchegger
DOI: https://doi.org/10.48550/arXiv.2211.14428
2022-11-23
Abstract: Big data analysis poses the dual problem of privacy preservation and utility, i.e., how accurate data analyses remain after transforming original data in order to protect the privacy of the individuals that the data is about - and whether they are accurate enough to be meaningful. In this paper, we thus investigate across several datasets whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not known yet), how closely their results conform to analyses on original data a posteriori, and whether these two effects are correlated. We find some methods (decision-tree based) to perform better than others across the board, sizeable effects of some choices of imputation parameters (notably the number of released datasets), no correlation between broad utility metrics and analysis accuracy, and varying correlations for narrow metrics. We did get promising findings for classification tasks when using synthetic data for training machine learning models, which we consider worth exploring further also in terms of mitigating privacy attacks against ML models such as membership inference and model inversion.
Machine Learning, Cryptography and Security
What problem does this paper attempt to address?
The paper asks how much utility fully synthetic data retains when it is generated to protect individual privacy in big-data analysis: whether the synthetic data reflects the characteristics of the original data, and whether agreement between analyses on synthetic and original data can be predicted before the specific analysis tasks are known. It focuses on the following aspects:
1. **Comparison of the utility of different synthetic data generation methods**: The authors evaluated several generation methods (parametric, decision-tree, and saturated-model methods, among others) on multiple public datasets (Adult, Polish, and Avila). They examined both how closely each method reproduces the overall distribution (using broad utility measures such as KL divergence) and how the synthetic data performs in specific analysis tasks (such as classification accuracy).
2. **The influence of variable order in the synthesis process**: The authors explored how the order in which variables are synthesized affects the quality of the final synthetic data. Different orders may cause the synthetic data to retain or lose important relationships between variables, which in turn affects its utility.
3. **Comparison between proper and improper synthesis**: Two synthesis modes were compared: one draws new values from the posterior predictive distribution (proper synthesis), and the other does not (improper synthesis). In most cases, improper synthesis performed better than proper synthesis.
4. **The correlation between utility measures and analysis accuracy**: The authors studied how commonly used utility measures (such as confidence interval overlap, CIO, and KL divergence) correlate with the performance of synthetic data in specific analysis tasks. Broad utility measures showed no clear correlation with the accuracy of specific analyses, which suggests that the factors that actually determine the practical usefulness of synthetic data need closer study.
5. **The influence of the number of released datasets**: When multiple synthetic datasets are released, the authors analyzed how utility changes as the number of datasets grows. Increasing the number of datasets improves utility, but with diminishing returns.
In summary, the paper evaluates the utility of different types of synthetic data generation methods and explores the key factors that affect that utility, providing a reference point for future research.
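To make the two kinds of utility measure named in the summary concrete, the following is a minimal Python sketch, not the authors' code. It assumes a single numeric variable, estimates KL divergence from shared histogram bins (a broad measure), and computes the overlap of two confidence intervals for a mean (a narrow measure). The function names `kl_divergence`, `mean_ci`, and `ci_overlap`, the bin count, the normal-approximation interval, and the clamping of non-overlapping intervals to zero are all illustrative assumptions, and the data in the demo is simulated.

```python
import numpy as np


def kl_divergence(original, synthetic, bins=20, eps=1e-9):
    """Histogram estimate of KL(original || synthetic) for one numeric variable.

    Assumption: both samples are binned on a shared grid, and a small
    constant eps is added so empty bins do not produce log(0).
    """
    original = np.asarray(original, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    lo = min(original.min(), synthetic.min())
    hi = max(original.max(), synthetic.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(original, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


def mean_ci(x, z=1.96):
    """Normal-approximation 95% confidence interval for the mean of x."""
    x = np.asarray(x, dtype=float)
    half_width = z * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half_width, x.mean() + half_width


def ci_overlap(ci_orig, ci_syn):
    """Average share of each interval covered by their intersection.

    Simplification: disjoint intervals return 0.0 rather than a
    negative value.
    """
    lo_o, hi_o = ci_orig
    lo_s, hi_s = ci_syn
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    if inter <= 0:
        return 0.0
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


if __name__ == "__main__":
    # Simulated stand-ins for an original column and its synthetic copy.
    rng = np.random.default_rng(0)
    original = rng.normal(40.0, 10.0, size=5000)
    synthetic = rng.normal(41.0, 12.0, size=5000)
    print("KL divergence:", kl_divergence(original, synthetic))
    print("CI overlap:", ci_overlap(mean_ci(original), mean_ci(synthetic)))
```

A lower KL divergence and a CI overlap closer to 1 both indicate that the synthetic column is closer to the original. The two can disagree, which is consistent with the summary's point that broad measures need not track the accuracy of a specific analysis.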