Synthetic Data, Similarity-based Privacy Metrics, and Regulatory (Non-)Compliance

Georgi Ganev
2024-07-26
Abstract: In this paper, we argue that similarity-based privacy metrics cannot ensure regulatory compliance of synthetic data. Our analysis and counter-examples show that they do not protect against singling out and linkability and, among other fundamental issues, completely ignore the motivated intruder test.
Cryptography and Security, Artificial Intelligence, Computers and Society
What problem does this paper attempt to address?
The main question this paper addresses is: **Are similarity-based privacy metrics (SBPMs) sufficient to ensure that synthetic data complies with regulatory requirements?** Specifically, the paper explores the following aspects:

1. **Background and Motivation**:
   - Synthetic data (data generated by machine-learning generative models) is increasingly used outside academia, for example in releasing public census data and sharing sensitive financial and health data.
   - Although these applications satisfy formal privacy definitions (such as Differential Privacy, DP), many research papers and companies rely on empirical similarity-based privacy metrics (SBPMs) rather than strict theoretical guarantees.
2. **Main Problem**:
   - The paper questions whether similarity-based privacy metrics are sufficient to ensure that synthetic data complies with regulatory requirements. The author argues that, because SBPMs have fundamental flaws and give unreliable and inconsistent results, they cannot ensure compliance.
3. **Specific Problems**:
   - **Lack of Theoretical Guarantee**: SBPMs have no clear threat model or strategic adversary, and so ignore important security and regulatory principles.
   - **Privacy Treated as Binary Property**: SBPMs regard privacy leakage as a binary property, assuming that any synthetic dataset that passes the test is safe, even though the training data needs to be queried for each release.
   - **Privacy Treated as Data Property**: SBPMs consider privacy an attribute of the data rather than of the generative model or process, which leads to inconsistent results and increases the risk of privacy leakage.
   - **Non-Comparative Process**: SBPMs do not compare the situations with and without an individual's participation, which leaves the system vulnerable to attack.
   - **Misinterpretation**: Test results may be misread: failure to reject the null hypothesis does not mean that privacy is actually protected.
   - **Practical Problems**: Most SBPM implementations require discretizing the data, which makes the calculations imprecise and overstates privacy protection.
4. **Counter-example Demonstration**:
   - The paper demonstrates the unreliability and inconsistency of SBPMs through three counter-examples, including completely leaking test data and leaking outliers in the training data.

In summary, this paper aims to reveal the deficiencies of similarity-based privacy metrics in ensuring the regulatory compliance of synthetic data and calls for the adoption of more stringent theoretical guarantees and evaluation methods.
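
The counter-example about completely leaking test data can be illustrated with a small sketch. It is not the paper's exact construction. It assumes one common style of similarity check, a distance-to-closest-record statistic with a median-based pass/fail rule. The function names `dcr` and `passes_similarity_test`, the threshold rule, and the toy Gaussian data are all hypothetical choices made for illustration.

```python
import math
import random
import statistics

def dcr(queries, reference):
    """Distance from each query record to its closest record in `reference`."""
    return [min(math.dist(q, r) for r in reference) for q in queries]

def passes_similarity_test(train, holdout, synthetic):
    """Hypothetical pass/fail rule (an assumption, not the paper's definition):
    the synthetic data 'passes' if it is, in the median, no closer to the
    training data than a real holdout set is."""
    return statistics.median(dcr(synthetic, train)) >= statistics.median(dcr(holdout, train))

# Toy real data: two disjoint samples drawn from the same distribution.
rng = random.Random(0)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
holdout = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# "Synthetic" data that is a verbatim copy of the real holdout records.
leaky_synthetic = list(holdout)

# The check compares distances to `train` only, so the copied records
# have exactly the same distances as the holdout set and the test passes.
result = passes_similarity_test(train, holdout, leaky_synthetic)
print(result)
```

Under these assumptions the check passes even though every released record is a real individual's record, which is the sense in which a passing similarity test says little about what the release actually discloses.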