Abstract: We investigate whether generating synthetic data is a viable strategy for giving external researchers access to detailed geocoding information without compromising the confidentiality of the units included in the database. Our work was motivated by a recent project at the Institute for Employment Research (IAB) in Germany that linked exact geocodes to the Integrated Employment Biographies (IEB), a large administrative database containing several million records. We evaluate how three synthesizers handle the trade-off between preserving analytical validity and limiting disclosure risks: one synthesizer employs Dirichlet Process mixtures of products of multinomials (DPMPM), while the other two use different versions of Classification and Regression Trees (CART). In terms of preserving analytical validity, our proposed synthesis strategy for geocodes based on categorical CART models outperforms the other two. We demonstrate that, if the disclosure risks of the synthetic data generated by the categorical CART synthesizer are deemed too high, synthesizing additional variables is the preferred way to address the risk-utility trade-off in practice, compared to limiting the size of the regression trees or providing geographical information only at an aggregated level. We also propose strategies for making the synthesizers scalable to large files, present analytical validity measures and disclosure risk measures for the generated data, and provide general recommendations for statistical agencies considering the synthetic data approach for disseminating detailed geographical information.

Using saturated count models for user-friendly synthesis of categorical data

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

On integrating the number of synthetic data sets $m$ into the 'a priori' synthesis approach

Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data

Synthetic Census Microdata Generation: A Comparative Study of Synthesis Methods Examining the Trade-Off Between Disclosure Risk and Utility

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

Data Privacy Protection and Utility Preservation through Bayesian Data Synthesis: A Case Study on Airbnb Listings

Practical privacy metrics for synthetic data

PrivSyn: Differentially Private Data Synthesis

Bayesian Estimation of Attribute Disclosure Risks in Synthetic Data with the $\texttt{AttributeRiskCalculation}$ R Package

Private Tabular Survey Data Products Through Synthetic Microdata Generation

Multiply-Imputed Synthetic Data: Advice to the Imputer

A density ratio framework for evaluating the utility of synthetic data

Synthetic Census Data Generation via Multidimensional Multiset Sum

Differentially Private Verification of Survey-Weighted Estimates

"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity

Privacy risk from synthetic data: practical proposals

Tabular Data Synthesis with Differential Privacy: A Survey

Statistical disclosure control for numeric microdata via sequential joint probability preserving data shuffling

One Step to Efficient Synthetic Data

Advancing microdata privacy protection: A review of synthetic data methods