Synthetic Census Data Generation via Multidimensional Multiset Sum

Cynthia Dwork, Kristjan Greenewald, Manish Raghavan
2024-04-16
Abstract: The US Decennial Census provides valuable data for both research and policy purposes. Census data are subject to a variety of disclosure avoidance techniques prior to release in order to preserve respondent confidentiality. While many are interested in studying the impacts of disclosure avoidance methods on downstream analyses, particularly with the introduction of differential privacy in the 2020 Decennial Census, these efforts are limited by a critical lack of data: The underlying "microdata," which serve as necessary input to disclosure avoidance methods, are kept confidential.
Computers and Society, Cryptography and Security, Data Structures and Algorithms
What problem does this paper attempt to address?
The paper addresses the following problem: US Decennial Census data pass through various disclosure-avoidance techniques before release in order to protect respondents' privacy, and the underlying "microdata" that serve as input to those techniques are kept confidential for 72 years. As a result, researchers cannot obtain the original data needed to evaluate how different disclosure-avoidance techniques (such as differential privacy) affect downstream analyses. To address this, the authors propose a method for generating synthetic microdata from published census statistics, so that different disclosure-avoidance algorithms can be studied and compared without violating privacy. Specifically, the goals of the paper are:

1. **Provide tools to generate synthetic microdata**: Generate synthetic microdata that can serve as input to any disclosure-avoidance algorithm, using only publicly released census statistics as input.
2. **Define a reasonable distribution and design a sampling algorithm**: Based on the published census statistics, define a reasonable distribution over microdata and design an algorithm for sampling from it.
3. **Solve the combinatorial optimization problem**: Formulate synthetic data generation as a knapsack-style combinatorial optimization problem and develop new algorithms that solve it effectively.

Through these methods, the authors aim to verify that the generated data are "close" to the real data and can be used to evaluate different privacy-protection techniques.
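To make the knapsack-style formulation concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes each candidate household type is encoded as a small non-negative integer vector (here, hypothetically, adults, children, and a household count of 1) and that the published statistics for a block give the target sum. It enumerates, by brute force, every multiset of household types whose vectors add up to the target, then draws one uniformly at random. The names `enumerate_multisets` and `sample_households`, the vector encoding, and the uniform distribution are all assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]


def enumerate_multisets(items: Sequence[Vector], target: Vector) -> List[Tuple[int, ...]]:
    """Return every multiset of `items` whose element-wise sum equals `target`.

    Each solution is a tuple of multiplicities, one per item.
    Items must be non-negative and non-zero so the search terminates.
    """
    if any(min(v) < 0 or not any(v) for v in items):
        raise ValueError("items must be non-negative, non-zero vectors")

    solutions: List[Tuple[int, ...]] = []
    counts = [0] * len(items)

    def search(i: int, remaining: Vector) -> None:
        if not any(remaining):          # target reached exactly
            solutions.append(tuple(counts))
            return
        if i == len(items):             # no item types left to use
            return
        cur, k = remaining, 0
        while min(cur) >= 0:            # use item i exactly k times
            counts[i] = k
            search(i + 1, cur)
            cur = tuple(c - x for c, x in zip(cur, items[i]))
            k += 1
        counts[i] = 0                   # reset before backtracking

    search(0, tuple(target))
    return solutions


def sample_households(items: Sequence[Vector], target: Vector, seed: int = 0) -> List[Vector]:
    """Draw one feasible multiset uniformly at random (an assumed distribution)."""
    solutions = enumerate_multisets(items, target)
    if not solutions:
        raise ValueError("no multiset of items sums to the target")
    chosen = random.Random(seed).choice(solutions)
    return [item for item, k in zip(items, chosen) for _ in range(k)]


# Hypothetical household types: (adults, children, households).
HOUSEHOLD_TYPES = [(1, 0, 1), (2, 0, 1), (1, 1, 1), (2, 1, 1), (2, 2, 1), (1, 2, 1)]
# Hypothetical block totals: 4 adults, 2 children, 3 households.
BLOCK_TOTALS = (4, 2, 3)

synthetic_block = sample_households(HOUSEHOLD_TYPES, BLOCK_TOTALS, seed=1)
```

Exhaustive enumeration like this grows exponentially with the number of household types and the size of the totals, which is why the summary highlights the need for purpose-built algorithms for this problem.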