Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

Yiyun He, Thomas Strohmer, Roman Vershynin, Yizhe Zhu
2024-02-13
Abstract: Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix.
Machine Learning, Cryptography and Security, Data Structures and Algorithms, Probability, Statistics Theory
What problem does this paper attempt to address?
The main problem this paper addresses is how to preserve the utility of data while protecting individual privacy when generating low-dimensional synthetic data from high-dimensional datasets. Specifically, the paper focuses on the fact that when data lie in a high-dimensional space, the accuracy of synthetic data suffers from the curse of dimensionality.

### Problem Background

With the increasing demand for data sharing and the growing concern for privacy protection, how to conduct data analysis without revealing sensitive information has become an important research topic. Differential Privacy (DP), as a technical means of protecting privacy, has been widely used in many fields. However, most existing research on private synthetic data focuses on low-dimensional data, and for high-dimensional datasets, efficiently generating useful low-dimensional synthetic data remains a challenge.

### Core Problems of the Paper

The paper proposes a new algorithm that efficiently generates low-dimensional synthetic data from high-dimensional datasets with a utility guarantee with respect to the Wasserstein distance. Specifically, the paper addresses the following key problems:

1. **Impact of the Curse of Dimensionality**: When data lie in a high-dimensional space, traditional differential privacy methods lead to a significant decline in the accuracy of synthetic data. The paper circumvents the curse of dimensionality by introducing a private principal component analysis (private PCA) procedure.
2. **Analysis without the Spectral Gap Assumption**: Standard perturbation analysis of PCA usually depends on a spectral gap assumption for the covariance matrix. The analysis in this paper does not require this assumption, so it applies to a wider range of situations.
3. **Efficient Computational Complexity**: The algorithm proposed in the paper not only improves the accuracy of synthetic data but is also computationally efficient, completing the task in polynomial time.

### Formula Representation

The key formulas involved in the paper include:

- **Centered Covariance Matrix**:
  \[ M=\frac{1}{n - 1}\sum_{i = 1}^n(X_i-\bar{X})(X_i-\bar{X})^T \]
  where \(X_1,\dots,X_n\) are the data points and \(\bar{X}\) is the mean vector of the data.
- **Expected Bound of Wasserstein Distance**:
  \[ \mathbb{E}[W_1(\mu_X,\mu_Y)]\lesssim d\sqrt{\sum_{i > d'}\sigma_i(M)}+(\varepsilon n)^{-1/d'} \]
  where \(\sigma_i(M)\) represents the \(i\)-th eigenvalue of the covariance matrix \(M\), \(\varepsilon\) is the privacy parameter, and \(d'\) is the dimension of the low-dimensional synthetic data.

Together, these formulas show how the paper generates low-dimensional synthetic data from high-dimensional datasets with a utility guarantee in the Wasserstein distance while protecting individual privacy.
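
To make the two formulas above concrete, here is a minimal NumPy sketch that computes the centered covariance matrix \(M\) and then evaluates the two terms on the right-hand side of the Wasserstein bound. It is an illustration under stated assumptions, not the paper's algorithm: it adds no privacy noise, it ignores the constants and prefactors hidden by \(\lesssim\), and the function names (`centered_covariance`, `bound_terms`) and the toy dataset are hypothetical.

```python
import numpy as np


def centered_covariance(X):
    """Return M = 1/(n-1) * sum_i (X_i - mean)(X_i - mean)^T, with data points as rows of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (n - 1)


def bound_terms(M, d_prime, eps, n):
    """Return the two terms of the bound, without constants or prefactors.

    tail    = sqrt(sum_{i > d'} sigma_i(M)): variance lost by keeping d' directions
    privacy = (eps * n) ** (-1 / d'):        cost of privacy in d' dimensions
    """
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals = np.sort(np.linalg.eigvalsh(M))[::-1]
    # Clamp at zero to guard against tiny negative values from round-off.
    tail = float(np.sqrt(max(eigvals[d_prime:].sum(), 0.0)))
    privacy = float((eps * n) ** (-1.0 / d_prime))
    return tail, privacy


# Hypothetical toy data: 20 ambient dimensions, but almost all of the
# variance sits in the first 2 coordinates.
rng = np.random.default_rng(0)
n, d, d_prime = 2000, 20, 2
scales = np.array([1.0, 0.8] + [0.01] * (d - 2))
X = rng.normal(size=(n, d)) * scales

M = centered_covariance(X)
tail, privacy = bound_terms(M, d_prime, eps=1.0, n=n)
print(f"tail term: {tail:.4f}, privacy term: {privacy:.4f}")
```

In this toy setting the tail term is small because the data are close to a 2-dimensional subspace, and the privacy term depends on \(d'\) rather than on the ambient dimension, which is the sense in which the bound avoids the curse of dimensionality.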