John M. Abowd, Ian M. Schmutte, William Sexton, Lars Vilhuber
Abstract: With vast databases at their disposal, private tech companies can compete with public statistical agencies to provide population statistics. However, private companies face different incentives to provide high-quality statistics and to protect the privacy of the people whose data are used. When both privacy protection and statistical accuracy are public goods, private providers tend to produce at least one suboptimally, but it is not clear which. We model a firm that publishes statistics under a guarantee of differential privacy. We prove that provision by the private firm results in inefficiently low data quality in this framework.
What problem does this paper attempt to address?
The paper addresses the inefficiency that arises when private firms supply population statistics and both privacy protection and statistical accuracy are public goods. Specifically, the authors focus on the result that, in this setting, a private provider tends to over-provide privacy protection and under-provide data accuracy. The provider cannot capture the full external benefit that accurate data bring to consumers, while its choice of privacy protection is driven mainly by its own cost-minimization problem.
### Background of the Paper and Problem Definition
With the advent of the big data era, private technology companies can use their large databases to compete with public statistical agencies in providing population statistics. However, these companies face different incentives, both to provide high-quality statistics and to protect the privacy of data subjects. When privacy protection and statistical accuracy are both public goods, private providers tend to supply at least one of them suboptimally, but it is not obvious which. By building a model, the authors prove that, in this framework, private provision results in inefficiently low data quality.
### Model Overview
The authors construct a model of a private data custodian that publishes statistics under a guarantee of differential privacy. The key points of the model are:
1. **Differential Privacy**: The authors use differential privacy to quantify the privacy loss from a data release. Differential privacy is a property of the release mechanism that bounds how much any one individual's record can change the distribution of published outputs; it is typically achieved by adding random noise to the statistics.
2. **Trade-off between Data Accuracy and Privacy Protection**: The model assumes that the data custodian must balance data accuracy against privacy protection when releasing statistics. Increasing data accuracy reduces privacy protection, and vice versa.
3. **Characteristics of Public Goods**: Both data accuracy and privacy protection are treated as public goods, that is, they are non-excludable and non-rivalrous. All consumers benefit from high data quality and privacy protection, and one consumer's use does not diminish another's.
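
The trade-off in point 2 can be illustrated with a minimal sketch. It assumes the Laplace mechanism, a standard way to achieve differential privacy that this summary does not attribute to the paper; the function names, the query sensitivity of 1, and the population count of 1000 are all hypothetical choices for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def release(true_answer, epsilon, sensitivity=1.0, rng=random):
    # Laplace mechanism (assumed here, not taken from the paper):
    # noise scale = sensitivity / epsilon.
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

def mean_abs_error(epsilon, trials=10_000, seed=0):
    # Average absolute error of the released statistic over many draws.
    rng = random.Random(seed)
    true_answer = 1000.0  # hypothetical population count
    total = 0.0
    for _ in range(trials):
        total += abs(release(true_answer, epsilon, rng=rng) - true_answer)
    return total / trials

if __name__ == "__main__":
    for eps in (0.1, 0.5, 1.0):
        print(f"epsilon={eps}: mean |error| ~ {mean_abs_error(eps):.2f}")
```

Under this assumed mechanism the expected absolute error equals `sensitivity / epsilon`, so a smaller privacy-loss parameter (stronger protection) directly means a noisier, less accurate statistic.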
### Main Conclusions
The authors prove that when a private provider supplies population statistics, data quality is inefficiently low and privacy protection is inefficiently high. The external benefits of data accuracy are not fully captured by any single consumer's willingness to pay, while the level of privacy protection is determined mainly by the provider's cost-minimization problem. As a result, the private market does not efficiently balance society's demands for privacy protection and data quality.
### Formula Explanation
- **Definition of Differential Privacy**: The query release mechanism \(M\) satisfies \(\epsilon\)-differential privacy if, for any pair of adjacent databases \(D\) and \(D'\), any query \(Q \in \mathcal{Q}\), and any \(B \in \mathcal{B}\):
\[
\Pr[M(D, Q) \in B \mid D, Q] \leq e^{\epsilon} \Pr[M(D', Q) \in B \mid D', Q]
\]
- **Definition of Data Accuracy**: The query release mechanism \(M\) satisfies \((\alpha, \beta)\)-accuracy if, for any query \(Q \in \mathcal{Q}\) and released output \(a\):
\[
\Pr[|a - Q(D)| \leq \alpha \mid D, Q] \geq 1 - \beta
\]
- **Production Cost Function**: The total production cost is
\[
C_{VCG}(I) = Q\left(\frac{H(I)}{N}\right) H(I) \, \epsilon(I)
\]
where
\[
H(I) = N - (1 - I) N \left(\frac{1}{2} + \ln\left(\frac{1}{\beta}\right)\right)
\]