Privacy risk from synthetic data: practical proposals

Gillian M Raab
DOI: https://doi.org/10.48550/arXiv.2409.04257
2024-09-06
Abstract: This paper proposes and compares measures of identity and attribute disclosure risk for synthetic data. Data custodians can use the methods proposed here to inform the decision as to whether to release synthetic versions of confidential data. Different measures are evaluated on two data sets. Insight into the measures is obtained by examining the details of the records identified as posing a disclosure risk. This leads to methods to identify, and possibly exclude, apparently risky records where the identification or attribution would be expected by someone with background knowledge of the data. The methods described are available as part of the **synthpop** package for **R**.
### What problems does this paper attempt to solve?

The paper, "Privacy risk from synthetic data: practical proposals", proposes and compares measures of identity disclosure risk and attribute disclosure risk for synthetic data (SD). Specifically, the paper aims to:

1. **Provide practical tools**: Give data custodians practical tools for evaluating the privacy risk that releasing synthetic data may carry, and so help them decide whether to release synthetic versions of confidential data.
2. **Define and evaluate disclosure risks**: Quantify the privacy risk of synthetic data by introducing and evaluating disclosure risk metrics such as "replicated uniques" (repU) for identity disclosure and "Disclosive in Synthetic, Correct in Original" (DiSCO) for attribute disclosure.
3. **Exclude apparently risky records**: Propose methods to identify, and possibly exclude, records that appear risky but whose identification or attribution would be expected by anyone with background knowledge of the data, so that they do not distort the risk assessment.
4. **Develop open-source tools**: Make these methods available in the `synthpop` package for R, so that researchers can conveniently evaluate the privacy risk of synthetic data.

### Core content of the paper

- **Background and motivation**: The paper reviews the development of synthetic data as a privacy-enhancing technology over the past three decades and notes that synthetic data has recently been used in a wide range of settings, such as anonymizing images and geographical data. However, the lack of effective methods for evaluating privacy risk has become the main obstacle to its wider use.
- **Research methods**: The author proposes several specific measures of identity disclosure risk and attribute disclosure risk and evaluates them on two datasets. Examining the details of the records flagged as posing a disclosure risk gives further insight into how the measures behave.
- **Application scenarios**: The paper works through a simple example that uses the Polish Quality of Life dataset (SD2011) to calculate the attribute disclosure risk for a target variable, the depression score. The results show that, across five sets of synthetic data, only about 6% of the records have the same sensitive information in both the synthetic data and the original data.
- **Exclude non-risk records**: To make the assessment more realistic, the paper also discusses how to exclude records that do not actually pose a privacy risk. For example, specific target levels or key-value combinations can be excluded so that they do not inflate the reported disclosure risk.

### Conclusions and recommendations

The paper closes with recommendations for practical use and points to directions for future research. The author emphasizes that, although current evaluation methods have made progress, more research is needed to improve the framework for evaluating privacy risk, especially its adaptability and effectiveness across different application scenarios. Overall, the paper aims to develop and promote practical tools for evaluating privacy risk, so that synthetic data can be used more widely while privacy is protected.
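
To make the two headline measures concrete, here is a minimal Python sketch of one common reading of repU and DiSCO. It is not the `synthpop` implementation (which is in R) and may differ from the paper in details such as the treatment of missing values or the choice of denominator. The function names `rep_u` and `disco`, the column names, and the toy records are all made up for illustration. The reading assumed here is:

- repU: the percentage of original records whose combination of key variables is unique in the original data and also unique in the synthetic data.
- DiSCO: the percentage of original records whose key combination appears in the synthetic data, where every synthetic record with that key combination has the same target value, and that value equals the original record's target value.

```python
from collections import Counter, defaultdict


def _key_of(row, keys):
    """Return the tuple of key-variable values (quasi-identifiers) for one record."""
    return tuple(row[k] for k in keys)


def rep_u(original, synthetic, keys):
    """Percent of original records that are unique on `keys` in the original
    data and whose key combination is also unique in the synthetic data."""
    orig_counts = Counter(_key_of(r, keys) for r in original)
    synth_counts = Counter(_key_of(r, keys) for r in synthetic)
    replicated = sum(
        1 for q, n in orig_counts.items() if n == 1 and synth_counts.get(q) == 1
    )
    return 100.0 * replicated / len(original)


def disco(original, synthetic, keys, target):
    """Percent of original records for which the synthetic data is disclosive
    (one single target value for that key combination) and that value is
    correct for the original record."""
    synth_targets = defaultdict(set)
    for r in synthetic:
        synth_targets[_key_of(r, keys)].add(r[target])
    hits = sum(
        1 for r in original
        if synth_targets.get(_key_of(r, keys)) == {r[target]}
    )
    return 100.0 * hits / len(original)


# Toy records, invented for this example (not taken from SD2011).
KEYS = ["age", "sex", "region"]
original = [
    {"age": "30-39", "sex": "F", "region": "N", "depress": 2},
    {"age": "30-39", "sex": "M", "region": "N", "depress": 0},
    {"age": "30-39", "sex": "M", "region": "N", "depress": 1},
    {"age": "60-69", "sex": "F", "region": "S", "depress": 5},
]
synthetic = [
    {"age": "30-39", "sex": "F", "region": "N", "depress": 2},
    {"age": "30-39", "sex": "M", "region": "N", "depress": 0},
    {"age": "30-39", "sex": "M", "region": "N", "depress": 0},
    {"age": "60-69", "sex": "M", "region": "S", "depress": 5},
]

print(rep_u(original, synthetic, KEYS))             # 25.0
print(disco(original, synthetic, KEYS, "depress"))  # 50.0
```

In the toy data, only the first original record is unique on the keys in both files, so repU is 25%. For DiSCO, the first two original records are matched by synthetic key groups that carry a single, correct target value; the third has the same keys but a different target value, and the fourth has no matching key combination in the synthetic data, giving 50%.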