On Rényi Differential Privacy in Statistics-Based Synthetic Data Generation

Takayuki Miura, Toshiki Shibahara, Masanobu Kii, Atsunori Ichikawa, Juko Yamamoto, Koji Chida
2023-03-31
Abstract: Privacy protection with synthetic data generation often uses differentially private statistics and model parameters to quantitatively express theoretical security. However, these methods do not take into account privacy protection due to the randomness of data generation. In this paper, we theoretically evaluate Rényi differential privacy of the randomness in data generation of a synthetic data generation method that uses the mean vector and the covariance matrix of an original dataset. Specifically, for a fixed $\alpha > 1$, we show the condition of $\varepsilon$ such that the synthetic data generation satisfies $(\alpha, \varepsilon)$-Rényi differential privacy under a bounded neighboring condition and an unbounded neighboring condition, respectively. In particular, under the unbounded condition, when the sizes of the original dataset and the synthetic dataset are both 10 million, the mechanism satisfies $(4, 0.576)$-Rényi differential privacy. We also show that when we translate it into the traditional $(\varepsilon, \delta)$-differential privacy, the mechanism satisfies $(4.00, 10^{-10})$-differential privacy.
Cryptography and Security, Information Theory
### What problem does this paper attempt to solve?

This paper addresses privacy protection when synthetic data is generated from statistics of an original dataset. Specifically, the authors evaluate the privacy protection that comes from the randomness already present in the data generation process, without adding intentional randomness. The core problems and goals of the paper are as follows:

1. **Trade-off between privacy protection and data utility**:
   - In traditional methods, differential privacy is usually applied to the generation parameters (such as the mean vector and covariance matrix) to protect privacy. However, this can reduce the utility of the generated synthetic data.
   - This paper proposes a different idea: even without applying differential privacy to the generation parameters, privacy protection can be established by evaluating the randomness in the data generation process, which provides a theoretical privacy guarantee without significantly reducing data utility.
2. **Theoretical evaluation of Rényi differential privacy**:
   - The authors evaluate the Rényi differential privacy ($(\alpha,\varepsilon)$-RDP) of the synthetic data generation method based on the mean vector and covariance matrix, under both the unbounded and the bounded neighboring conditions. For a fixed $\alpha > 1$, they derive the conditions on $\varepsilon$ under which the synthetic data generation mechanism satisfies $(\alpha,\varepsilon)$-RDP.
   - In particular, under the unbounded neighboring condition, when the sizes of the original dataset and the synthetic dataset are both 10 million, the mechanism satisfies $(4, 0.576)$-Rényi differential privacy; under the bounded neighboring condition, it satisfies $(4, 2.307)$-Rényi differential privacy.
3. **Numerical evaluation**:
   - The authors use the Adult dataset for numerical evaluation and calculate concrete $\varepsilon$ values.
The results show that when the dataset is large (for example, $n = 10^{7}$), the mechanism reaches a small $\varepsilon$ (such as 0.576) under the unbounded neighboring condition, and the guarantee remains strong after conversion to traditional $(\varepsilon,\delta)$-differential privacy.

### Formula summary

- **Rényi divergence**:
  \[ D_{\alpha}(P\|Q) := \frac{1}{\alpha - 1}\log\left(\int_{\mathbb{R}^{d}} P(x)^{\alpha} Q(x)^{1 - \alpha}\,dx\right) \]
- **Rényi differential privacy conditions**:
  - $\varepsilon_{\alpha}$ under the unbounded neighboring condition:
    \[ \varepsilon_{\alpha} := \max\{\varepsilon_{\alpha 1}, \varepsilon_{\alpha 2}\} \]
    where
    \[ \varepsilon_{\alpha 1} = \frac{\alpha}{2}\cdot\frac{\tau}{(n + 1)(n + 1 - \alpha)} + \frac{\alpha d}{2(\alpha - 1)}\log\left(\frac{n}{n + 1}\right) - \frac{d}{2(\alpha - 1)}\log\left(1 - \frac{\alpha}{n + 1}\right) - \frac{1}{2(\alpha - 1)}\log\min\left\{1, \left(1 + \frac{\alpha n\tau}{(n + 1)(n + 1 - \alpha)}\right)\left(1 + \frac{\tau}{n + 1}\right)^{-\alpha}\right\} \]
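
To make the formula summary concrete, here is a minimal Python sketch. It evaluates $\varepsilon_{\alpha 1}$ with the logarithm arguments written as $1 - \frac{\alpha}{n+1}$ and $\left(1 + \frac{\alpha n\tau}{(n+1)(n+1-\alpha)}\right)\left(1 + \frac{\tau}{n+1}\right)^{-\alpha}$, the forms that follow from the closed-form Rényi divergence between two Gaussians. It also applies the standard RDP-to-DP conversion $\varepsilon' = \varepsilon + \log(1/\delta)/(\alpha - 1)$. The function names and the input values for `n`, `d`, and `tau` are hypothetical and chosen only for illustration; the summary does not say that the paper uses this particular conversion, and $\varepsilon_{\alpha 2}$ is not covered.

```python
import math


def eps_alpha1(alpha: float, n: int, d: int, tau: float) -> float:
    """Evaluate eps_{alpha 1} as written in the formula summary.

    Assumes 1 < alpha < n + 1 so that every logarithm is defined.
    """
    if not (1.0 < alpha < n + 1):
        raise ValueError("need 1 < alpha < n + 1")
    c = 1.0 / (2.0 * (alpha - 1.0))
    # First term: contribution of the shift in the mean vector.
    mean_term = 0.5 * alpha * tau / ((n + 1) * (n + 1 - alpha))
    # Second and third terms: contribution of the covariance rescaling.
    scale_term = alpha * d * c * math.log(n / (n + 1))
    shrink_term = -d * c * math.log1p(-alpha / (n + 1))
    # Last term: the min{1, ...} factor (always contributes >= 0).
    ratio = (1.0 + alpha * n * tau / ((n + 1) * (n + 1 - alpha))) \
        * (1.0 + tau / (n + 1)) ** (-alpha)
    det_term = -c * math.log(min(1.0, ratio))
    return mean_term + scale_term + shrink_term + det_term


def rdp_to_dp(alpha: float, eps: float, delta: float) -> float:
    """Standard conversion: (alpha, eps)-RDP implies (eps', delta)-DP."""
    return eps + math.log(1.0 / delta) / (alpha - 1.0)


if __name__ == "__main__":
    # Hypothetical inputs, not values taken from the paper.
    print(eps_alpha1(alpha=4.0, n=1000, d=5, tau=10.0))
    # Conversion of the reported (4, 0.576)-RDP at delta = 1e-10.
    print(rdp_to_dp(4.0, 0.576, 1e-10))  # about 8.25
```

Note that converting $(4, 0.576)$-RDP at $\alpha = 4$ with this standard bound gives $\varepsilon' \approx 8.25$, which is looser than the $(4.00, 10^{-10})$ figure in the abstract. The reported value therefore presumably comes from a different choice of $\alpha$ or a tighter conversion; this is an inference, not something the summary states.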