Synthetic Census Data Generation via Multidimensional Multiset Sum

Cynthia Dwork, Kristjan Greenewald, Manish Raghavan

2024-04-16

Abstract: The US Decennial Census provides valuable data for both research and policy purposes. Census data are subject to a variety of disclosure avoidance techniques prior to release in order to preserve respondent confidentiality. While many are interested in studying the impacts of disclosure avoidance methods on downstream analyses, particularly with the introduction of differential privacy in the 2020 Decennial Census, these efforts are limited by a critical lack of data: The underlying "microdata," which serve as necessary input to disclosure avoidance methods, are kept confidential.

Computers and Society, Cryptography and Security, Data Structures and Algorithms

What problem does this paper attempt to address?

This paper addresses the following problem. Before release, US Decennial Census data are processed with a variety of disclosure-avoidance techniques to protect respondent privacy, and the underlying "microdata" that serve as input to those techniques are kept confidential for 72 years. As a result, researchers cannot obtain the original data to evaluate how different disclosure-avoidance techniques (such as differential privacy) affect downstream analyses. To address this, the authors propose a method for generating synthetic microdata from published census statistics, so that different disclosure-avoidance algorithms can be studied and compared without violating privacy. Specifically, the goals of the paper are:

1. **Provide tools to generate synthetic microdata**: Generate synthetic microdata that can serve as input to any disclosure-avoidance algorithm, using only publicly released census statistics as input.
2. **Define a reasonable distribution and design a sampling algorithm**: Based on the published census statistics, define a reasonable distribution over microdata and design an algorithm for sampling from it.
3. **Solve the combinatorial optimization problem**: Formulate synthetic data generation as a knapsack-style combinatorial optimization problem and develop new algorithms that solve it effectively.

Through these methods, the authors aim to verify that the generated data are "close" to the real data and can be used to evaluate the effectiveness of different privacy-protection techniques.
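To make the knapsack-style formulation more concrete, the following is a minimal sketch of a multidimensional multiset sum instance. It is not the authors' algorithm: it is a plain backtracking search over a toy example, and the names `multiset_sum_solutions` and `HOUSEHOLD_TYPES`, the two-attribute household vectors, and the block totals are all assumptions made for illustration. The idea is that each candidate household is a vector of counts, and a valid synthetic block is a multiset of households whose vectors sum exactly to the published totals.

```python
import random

# Hypothetical household types, encoded as (adults, children).
# These are illustrative only, not categories from the census tables.
HOUSEHOLD_TYPES = [(1, 0), (2, 0), (1, 1), (2, 1), (2, 2)]


def multiset_sum_solutions(items, target, count):
    """Enumerate every multiset of `count` vectors from `items`
    (repetition allowed) whose component-wise sum equals `target`.

    Assumes all vector entries are non-negative integers, which
    justifies pruning a branch as soon as any coordinate overshoots.
    """
    items = sorted(set(items))
    solutions = []

    def search(start, remaining, left, chosen):
        if left == 0:
            if all(r == 0 for r in remaining):
                solutions.append(tuple(chosen))
            return
        # Only consider items at index >= start so each multiset
        # is generated once, in non-decreasing item order.
        for i in range(start, len(items)):
            nxt = tuple(r - x for r, x in zip(remaining, items[i]))
            if any(v < 0 for v in nxt):
                continue  # overshoots a published total
            chosen.append(items[i])
            search(i, nxt, left - 1, chosen)
            chosen.pop()

    search(0, tuple(target), count, [])
    return solutions


# Toy "published statistics" for one block (assumed values):
# 3 households, 5 adults, 2 children.
solutions = multiset_sum_solutions(HOUSEHOLD_TYPES, target=(5, 2), count=3)

# One simple choice of distribution: uniform over all consistent multisets.
sampled_block = random.choice(solutions)
print(len(solutions), "consistent blocks; sampled:", sampled_block)
```

Exhaustive enumeration like this is only feasible for very small instances, since the number of candidate multisets grows quickly with the number of household types and households. Sampling uniformly over solutions is likewise just one possible choice of distribution, used here to keep the sketch short.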
