Practical privacy metrics for synthetic data

Gillian M Raab,Beata Nowok,Chris Dibben
DOI: https://doi.org/10.48550/arXiv.2406.16826
2024-06-25
Abstract:This paper explains how the synthpop package for R has been extended to include functions to calculate measures of identity and attribute disclosure risk for synthetic data that measure risks for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable specified as a target from the same set of keys. The second function, <a class="link-external link-http" href="http://disclosure.summary" rel="external noopener nofollow">this http URL</a>, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: $RepU$ (replicated uniques) for identity disclosure and $DiSCO$ (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed a \% of the original records and each can be compared to similar measures calculated from the original data. Experience with using the functions on real data found that some apparent disclosures could be identified as coming from relationships in the data that would be expected to be known to anyone familiar with its features. We flag cases when this seems to have occurred and provide means of excluding them.
Applications
What problem does this paper attempt to address?