Tao Zhang, Rajagopal Venkatesaraman, Rajat K. De, Bradley A. Malin, Yevgeniy Vorobeychik
Abstract: An ability to share data, even in aggregated form, is critical to advancing both conventional and data science. However, insofar as such datasets are comprised of individuals, their membership in these datasets is often viewed as sensitive, with membership inference attacks (MIAs) threatening to violate their privacy. We propose a Bayesian game model for privacy-preserving publishing of data-sharing mechanism outputs (for example, summary statistics for sharing genomic data). In this game, the defender minimizes a combination of expected utility and privacy loss, with the latter being maximized by a Bayes-rational attacker. We propose a GAN-style algorithm to approximate a Bayes-Nash equilibrium of this game, and introduce the notions of Bayes-Nash generative privacy (BNGP) and Bayes generative privacy (BGP) risk that aims to optimally balance the defender's privacy and utility in a way that is robust to the attacker's heterogeneous preferences with respect to true and false positives. We demonstrate the properties of composition and post-processing for BGP risk and establish conditions under which BNGP and pure differential privacy (PDP) are equivalent. We apply our method to sharing summary statistics, where MIAs can re-identify individuals even from aggregated data. Theoretical analysis and empirical results demonstrate that our Bayesian game-theoretic method outperforms state-of-the-art approaches for privacy-preserving sharing of summary statistics.
What problem does this paper attempt to address?
The paper asks how to protect the privacy of individuals against membership inference attacks (MIAs) when data is shared, even in aggregated form such as summary statistics. It proposes a method based on a Bayesian game model to balance the trade-off between privacy protection and data utility.
### Problem Background
Membership inference attacks (MIAs) exploit vulnerabilities in data analysis and machine learning to determine whether an individual's data is included in a given dataset (such as a training dataset). They pose a serious privacy risk in sensitive domains such as medicine, biometrics, location-based services, social media, and finance. MIAs can also serve as an auditing tool for assessing privacy risk.
### Core Problem of the Paper
Existing strategies for addressing the privacy risks posed by MIAs include noise perturbation and differential privacy (DP). These methods reduce information leakage by introducing randomness, but the added randomness also degrades data utility. The key question is therefore how to retain as much data utility as possible while protecting privacy.
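To make the trade-off concrete, here is a minimal sketch of the standard Laplace mechanism applied to the mean of a binary attribute. It illustrates the DP baseline mentioned above, not the paper's method; the data, the function names, and the choice of statistic are all assumptions made for illustration.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(bits, epsilon, rng):
    """Release the mean of 0/1 values with Laplace noise.

    Replacing one record changes the mean by at most 1/n, so the
    noise scale is (1/n) / epsilon.
    """
    n = len(bits)
    sensitivity = 1.0 / n
    return sum(bits) / n + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(bits, epsilon, trials=2000, seed=0):
    """Average absolute error of the noisy release over repeated trials."""
    rng = random.Random(seed)
    true_mean = sum(bits) / len(bits)
    return statistics.fmean(
        abs(private_mean(bits, epsilon, rng) - true_mean) for _ in range(trials)
    )


# Hypothetical binary attribute for ten individuals.
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
errors = {eps: mean_abs_error(bits, eps) for eps in (0.1, 1.0, 10.0)}
for eps, err in errors.items():
    print(f"epsilon={eps}: mean absolute error ~ {err:.4f}")
```

A smaller `epsilon` gives stronger privacy but a larger error in the released statistic, which is the utility cost described above.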
### Solution
The paper proposes a method based on a Bayesian game model, which casts the privacy-utility trade-off as a game between a defender and an attacker. Specifically:
1. **Bayesian Game Model**:
- **Defender**: The goal is to minimize a combination of expected utility loss and privacy loss, where the privacy loss is the quantity the attacker maximizes.
- **Attacker**: The goal is to maximize the success rate of membership inference based on its subjective beliefs and preferences.
2. **Generative Adversarial Network (GAN) Algorithm**:
- A GAN-like algorithm is proposed to approximate the Bayes-Nash equilibrium. The defender's strategy is represented by a neural network generator, which takes the real membership vector and an auxiliary random vector as input and generates a noise vector. The attacker's strategy is represented by a neural network discriminator, which processes the perturbed output and attempts to infer membership information.
3. **Bayes Generative Privacy Risk (BGP Risk)**:
- The paper introduces Bayes-Nash generative privacy (BNGP) and Bayes generative privacy (BGP) risk, which aim to optimally balance the defender's privacy and utility while remaining robust to the attacker's heterogeneous preferences over true positives and false positives.
4. **Theoretical Analysis and Empirical Results**:
- Theoretical analysis and empirical results indicate that the proposed Bayesian game-theoretic method outperforms existing state-of-the-art approaches for privacy-preserving sharing of summary statistics. The experiments focus on sharing genomic summary statistics, a setting where individuals can be re-identified even from aggregated data.
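The defender-attacker interaction can be illustrated with a toy game. The sketch below is not the paper's GAN-style algorithm: it swaps the neural networks for a grid search over pure strategies, with a single membership bit, Gaussian noise, and a threshold attack. The noise model, the `utility_weight` parameter, and the squared-noise utility loss are all assumptions made for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def advantage(threshold, sigma):
    """TPR - FPR of the attack 'guess member if x > threshold'.

    The released value is x = b + N(0, sigma^2), with b the membership bit.
    """
    tpr = 1.0 - normal_cdf((threshold - 1.0) / sigma)
    fpr = 1.0 - normal_cdf(threshold / sigma)
    return tpr - fpr


def attacker_best_response(sigma, thresholds):
    """Attacker picks the threshold that maximizes its advantage."""
    return max(thresholds, key=lambda t: advantage(t, sigma))


def defender_loss(sigma, thresholds, utility_weight):
    """Privacy loss under the attacker's best response plus a utility penalty."""
    t = attacker_best_response(sigma, thresholds)
    return advantage(t, sigma) + utility_weight * sigma ** 2


thresholds = [i / 100 for i in range(-100, 201)]  # candidate attack thresholds
sigmas = [i / 20 for i in range(1, 61)]           # candidate noise levels 0.05..3.0
utility_weight = 0.1                              # hypothetical trade-off weight

best_sigma = min(sigmas, key=lambda s: defender_loss(s, thresholds, utility_weight))
best_threshold = attacker_best_response(best_sigma, thresholds)
print("defender noise level:", best_sigma)
print("attacker threshold:", best_threshold)
print("attacker advantage:", round(advantage(best_threshold, best_sigma), 4))
```

In this toy setting the defender settles on an intermediate noise level: too little noise hands the attacker a large advantage, while too much noise is penalized by the utility term. The paper's approach replaces both grid searches with trained generator and discriminator networks.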
### Formula Examples
- **Membership Advantage**:
\[
\text{Adv}_k(A)=\Pr[A(d_k, x) = 1\mid b_k = 1]-\Pr[A(d_k, x) = 1\mid b_k = 0]
\]
- **Bayes-weighted Membership Advantage**:
\[
\text{Adv}(h_A,\sigma,\theta,\gamma;g_D)=(1 - \gamma)\sum_{k\in U, b_{-k}}\Pr[A(d_k, x; h_A,\sigma)=1\mid b_k = 1; g_D]\theta(b_k = 1, b_{-k})-\gamma\sum_{k\in U, b_{-k}}\Pr[A(d_k, x; h_A,\sigma)=1\mid b_k = 0; g_D]\theta(b_k = 0, b_{-k})
\]
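As a worked example of the two quantities above, the sketch below evaluates them for given true-positive and false-positive rates. It rests on a simplifying assumption not stated in the source: that the attacker's success probabilities for individual k do not depend on the other members' bits, so the sum over the remaining bits collapses to the marginal prior Pr[b_k = 1]. The function names and numbers are hypothetical.

```python
import math


def membership_advantage(tpr, fpr):
    """Adv_k(A) = Pr[A = 1 | b_k = 1] - Pr[A = 1 | b_k = 0]."""
    return tpr - fpr


def bayes_weighted_advantage(tprs, fprs, member_priors, gamma):
    """Bayes-weighted advantage under the independence assumption.

    tprs[k], fprs[k]: attacker's true/false positive rates for individual k.
    member_priors[k]: marginal prior Pr[b_k = 1] (assumed to summarize theta).
    gamma: weight on false positives; (1 - gamma) weights true positives.
    """
    gain = sum(tpr * p for tpr, p in zip(tprs, member_priors))
    cost = sum(fpr * (1.0 - p) for fpr, p in zip(fprs, member_priors))
    return (1.0 - gamma) * gain - gamma * cost


# Hypothetical rates and priors for two individuals.
tprs = [0.8, 0.6]
fprs = [0.2, 0.1]
member_priors = [0.5, 0.25]

plain = [membership_advantage(t, f) for t, f in zip(tprs, fprs)]
weighted = bayes_weighted_advantage(tprs, fprs, member_priors, gamma=0.5)
print("per-individual advantage:", plain)
print("Bayes-weighted advantage:", weighted)
```

Raising `gamma` makes false positives more costly to the attacker, which lowers the weighted advantage for the same rates.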
With this method, the paper aims to protect privacy robustly while preserving as much data utility as possible.