Abstract: Big data analysis poses the dual problem of privacy preservation and utility, i.e., how accurate data analyses remain after transforming original data in order to protect the privacy of the individuals that the data is about - and whether they are accurate enough to be meaningful. In this paper, we thus investigate across several datasets whether different methods of generating fully synthetic data vary in their utility a priori (when the specific analyses to be performed on the data are not known yet), how closely their results conform to analyses on original data a posteriori, and whether these two effects are correlated. We find some methods (decision-tree based) to perform better than others across the board, sizeable effects of some choices of imputation parameters (notably the number of released datasets), no correlation between broad utility metrics and analysis accuracy, and varying correlations for narrow metrics. We did get promising findings for classification tasks when using synthetic data for training machine learning models, which we consider worth exploring further also in terms of mitigating privacy attacks against ML models such as membership inference and model inversion.

What problem does this paper attempt to address?

The paper asks how much utility fully synthetic data retains when it is generated to protect personal privacy in big-data analysis. It examines whether synthetic data reflects the characteristics of the original data, and whether agreement between analyses on synthetic and original data can be predicted before the specific analysis tasks are known. Specifically, the paper focuses on the following aspects:

1. **Comparison of the utility of different synthetic data generation methods**: The authors evaluated how utility differs across generation methods (such as parametric methods, decision-tree methods, and saturated-model methods) on several public datasets (such as Adult, Polish, and Avila). They examined both how closely each method reproduces the overall distribution (using broad utility measures such as KL divergence) and how it performs on specific analysis tasks (such as classification accuracy).
2. **The influence of variable order in the synthesis process**: The authors explored how the order in which variables are synthesized affects the quality of the final synthetic data. Different orders may cause the synthetic data to retain or lose important relationships between variables, which in turn affects utility.
3. **Comparison between proper and improper synthesis**: Two synthesis approaches were compared: one draws new values from the posterior predictive distribution (proper synthesis), and the other does not (improper synthesis). In most cases, improper synthesis performed better than proper synthesis.
4. **The correlation between utility measures and analysis accuracy**: The authors studied how commonly used utility measures (such as confidence-interval overlap, CIO, and KL divergence) correlate with the performance of synthetic data on specific analysis tasks. They found no correlation between broad utility measures and the accuracy of specific analyses, and varying correlations for narrow measures, which suggests that the factors determining the practical usefulness of synthetic data are not yet well understood.
5. **The influence of the number of released datasets**: For releases consisting of multiple synthetic datasets, the authors analyzed how utility changes as the number of datasets increases. Releasing more datasets improves utility, but with diminishing returns.

In summary, the paper evaluates the utility of several types of synthetic data generation methods and explores the key factors affecting that utility, providing a reference for future research.
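To make the two utility measures named above more concrete, here is a minimal Python sketch of how KL divergence (a broad measure) and confidence-interval overlap (a narrow measure) are commonly computed. It is an illustration under stated assumptions, not the paper's implementation: the function names `kl_divergence` and `ci_overlap`, the `eps` floor for categories missing from the synthetic data, and the toy inputs are all hypothetical. The overlap formula is the usual average of the two relative overlaps, which becomes negative when the intervals are disjoint.

```python
import math
from collections import Counter


def kl_divergence(original, synthetic, eps=1e-9):
    """KL(P_original || P_synthetic) for one categorical variable.

    `eps` is an assumed floor that avoids division by zero when a
    category seen in the original data never appears in the synthetic data.
    """
    p_counts, q_counts = Counter(original), Counter(synthetic)
    kl = 0.0
    for category, count in p_counts.items():
        p = count / len(original)
        q = max(q_counts[category] / len(synthetic), eps)
        kl += p * math.log(p / q)
    return kl


def ci_overlap(orig_ci, synth_ci):
    """Average relative overlap of two confidence intervals (lower, upper)."""
    lo_o, hi_o = orig_ci
    lo_s, hi_s = synth_ci
    # Width of the intersection; negative if the intervals do not overlap.
    width = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (width / (hi_o - lo_o) + width / (hi_s - lo_s))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    orig = ["a", "a", "b", "b"]
    synth = ["a", "a", "a", "b"]
    print(kl_divergence(orig, synth))
    print(ci_overlap((0.0, 2.0), (1.0, 3.0)))  # 0.5
```

A KL divergence of 0 and an overlap of 1 both indicate perfect agreement on the quantity being compared. The first summarizes a whole distribution, while the second is tied to one estimated parameter, which is one reason the two kinds of measure need not move together.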

Utility Assessment of Synthetic Data Generation Methods

On Utility and Privacy in Synthetic Genomic Data

Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data

Synthetic Data: Revisiting the Privacy-Utility Trade-off

Post-processing Private Synthetic Data for Improving Utility on Selected Measures

Utility Theory of Synthetic Data Generation

Synthetic Census Microdata Generation: A Comparative Study of Synthesis Methods Examining the Trade-Off Between Disclosure Risk and Utility

A Scoping Review of Privacy and Utility Metrics in Medical Synthetic Data

Boosting Data Analytics With Synthetic Volume Expansion

Evaluating utility in synthetic banking microdata applications

Advancing microdata privacy protection: A review of synthetic data methods

On the Trade-Off between Fidelity, Utility and Privacy of Synthetic Patient Data

A density ratio framework for evaluating the utility of synthetic data

An evaluation of the replicability of analyses using synthetic health data

Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data

Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics

Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains

Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method