Noisy Measurements Are Important, the Design of Census Products Is Much More Important

John M. Abowd
DOI: https://doi.org/10.1162/99608f92.79d4660d
2024-05-01
Abstract: McCartan et al. (2023) call for "making differential privacy work for census data users." This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea. The August 2021 letter from 62 prominent researchers asking for production of the direct output of the differential privacy system deployed for the 2020 Census signaled the engagement of the scholarly community in the design of decennial census data products. NMFs, the raw statistics produced by the 2020 Census Disclosure Avoidance System before any post-processing, are one component of that design: the query strategy output. The more important component is the query workload output, that is, the statistics released to the public. Optimizing the query workload, specifically the Redistricting Data (P.L. 94-171) Summary File, could allow the privacy-loss budget to be more effectively managed. There could be fewer noisy measurements, no post-processing bias, and direct estimates of the uncertainty from disclosure avoidance for each published statistic.
Cryptography and Security, Econometrics, Applications
### What problems does this paper attempt to solve?

This paper mainly explores the **differential privacy (DP) mechanism** introduced in the 2020 US Census and its impact on data product design. Specifically, the author, John M. Abowd, discusses the following issues:

1. **Why the Noisy Measurement Files (NMFs) of the 2020 Census are not the right focus for improvement**:
   - NMFs are the noisy statistics generated when the differential privacy system processes the raw data. Although these files contain a large amount of information, they are not the final data products released to the public.
   - A more important issue is how to optimize the query workload, that is, the final released statistics. Optimizing the query workload would allow the privacy-loss budget to be managed more effectively, reduce the number of noisy measurements, avoid post-processing bias, and make it possible to estimate the uncertainty of each released statistic directly.

2. **Why the query strategy of the 2020 differential privacy system is much larger than the query workload**:
   - The query strategy is the set of statistics across which the privacy-loss budget is allocated. The query strategy of the 2020 Census contains 16 billion independent statistics, while the query workload contains only 1.5 billion.
   - This gap results from design constraints, such as the need to preserve the non-negativity of all cells and to keep the table hierarchy consistent. These constraints force the system to handle more interaction terms, which increases the number of noisy measurements.

3. **How to improve the format and content of official data products**:
   - The redistricting community (including academia and practitioners) needs to reach a consensus on future geographic atomic units and release formats.
   - Given a fixed privacy-loss budget, researchers can explore different release table formats to better meet the needs of redistricting. One open question, for example, is whether the entire privacy-loss budget should be allocated to the release queries (such as the 262 cells in the 2020 redistricting data table) or whether some compromise should be adopted.

4. **Applications and challenges of the differential privacy mechanism**:
   - The differential privacy mechanism protects privacy, but it also introduces noise that affects the accuracy of the data. The author points out that the NMFs contain so much noise that using them directly may be meaningless; modeling is required to reduce measurement error.
   - Post-processing steps (such as the TopDown algorithm) can reduce noise and produce data with accuracy similar to that of the swapping method. However, this also means that a balance must be found between privacy protection and data accuracy.

### Summary

The core question of this paper is: **How can we design data products that meet the needs of redistricting and have good statistical properties while protecting privacy?** The author emphasizes that future research should focus on optimizing the design of the query workload rather than only on the noisy measurement files themselves.
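
To make the budget and post-processing arguments concrete, here is a minimal simulation sketch. It is not the 2020 Census Disclosure Avoidance System: it assumes Laplace noise, sensitivity-1 count queries, and an even split of the budget under basic sequential composition. The cell value, the budget of 1.0, and the query counts are hypothetical numbers chosen for illustration. It shows two effects discussed above: spreading a fixed budget over more queries makes each measurement noisier, and forcing small noisy counts to be non-negative biases them upward.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(true_count, total_epsilon, n_queries, n_draws, rng):
    """Simulate repeated noisy measurements of one count.

    Assumption: the total budget is split evenly across n_queries
    sensitivity-1 queries, so each query gets epsilon / n_queries
    and Laplace scale n_queries / epsilon.
    """
    scale = n_queries / total_epsilon
    return [true_count + laplace_noise(scale, rng) for _ in range(n_draws)]


rng = random.Random(0)   # fixed seed so the run is repeatable
true_count = 2           # hypothetical small cell
n_draws = 20_000

# Same total budget, spread over 10 queries versus spent on 1 query.
wide = noisy_counts(true_count, 1.0, 10, n_draws, rng)
narrow = noisy_counts(true_count, 1.0, 1, n_draws, rng)

# Crude stand-in for a non-negativity constraint: clamp at zero.
clamped = [max(0.0, x) for x in wide]

print("sd, budget over 10 queries:", round(statistics.pstdev(wide), 2))
print("sd, budget over 1 query:  ", round(statistics.pstdev(narrow), 2))
print("mean of raw noisy counts: ", round(statistics.mean(wide), 2))
print("mean after clamping at 0: ", round(statistics.mean(clamped), 2))
```

The raw noisy counts stay centered on the true value, and their error distribution is known, which is what makes direct uncertainty estimates possible. The clamped counts no longer average to the true value, a simplified picture of the post-processing bias that a smaller, better-targeted query workload could avoid.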