Abstract: Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix.

What problem does this paper attempt to address?

The main problem this paper addresses is how to preserve data utility while protecting individual privacy when generating low-dimensional synthetic data from high-dimensional datasets. Specifically, the paper focuses on the fact that when data lie in a high-dimensional space, the accuracy of synthetic data suffers from the curse of dimensionality.

### Problem Background

With the increasing demand for data sharing and the growing concern for privacy protection, how to conduct data analysis without revealing sensitive information has become an important research topic. Differential Privacy (DP), as a technical means of protecting privacy, has been widely used in many fields. However, most existing research on private synthetic data focuses on low-dimensional data, and for high-dimensional datasets, efficiently generating useful low-dimensional synthetic data remains a challenge.

### Core Problems of the Paper

The paper proposes a new algorithm that efficiently generates low-dimensional synthetic data from high-dimensional datasets and guarantees its utility with respect to the Wasserstein distance. Specifically, the paper addresses the following key problems:

1. **Impact of the Curse of Dimensionality**: When data lie in a high-dimensional space, traditional differential privacy methods lead to a significant decline in the accuracy of synthetic data. The paper circumvents the curse of dimensionality by introducing a private principal component analysis (Private PCA) method.
2. **Analysis without the Spectral Gap Assumption**: Traditional PCA perturbation analysis usually depends on a spectral gap assumption for the covariance matrix, while the analysis in this paper does not require this assumption and therefore applies to a wider range of situations.
3. **Efficient Computational Complexity**: The algorithm proposed in the paper not only improves the accuracy of synthetic data but is also computationally efficient and completes the task in polynomial time.

### Formula Representation

The key formulas involved in the paper include:

- **Centered Covariance Matrix**:
  \[ M=\frac{1}{n - 1}\sum_{i = 1}^n(X_i-\bar{X})(X_i-\bar{X})^T \]
  where \(\bar{X}\) is the mean vector of the data.
- **Expected Bound of Wasserstein Distance**:
  \[ \mathbb{E}[W_1(\mu_X,\mu_Y)]\lesssim d\sqrt{\sum_{i > d'}\sigma_i(M)}+(\varepsilon n)^{-1/d'} \]
  where \(\sigma_i(M)\) represents the \(i\)-th eigenvalue of the covariance matrix \(M\).

Through these formulas, the paper shows how to generate low-dimensional synthetic data from high-dimensional datasets, guarantee its utility with respect to the Wasserstein distance, and protect individual privacy at the same time.
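
To make the two formulas above concrete, here is a minimal NumPy sketch that computes the centered covariance matrix \(M\), projects the data onto a \(d'\)-dimensional subspace obtained from a noise-perturbed copy of \(M\), and evaluates the tail sum \(\sum_{i > d'}\sigma_i(M)\) that appears in the bound. This is an illustration under stated assumptions, not the paper's algorithm: the function names, the toy data, and the `noise_scale` parameter are hypothetical, the Gaussian noise is not calibrated to any privacy budget \(\varepsilon\), and the mean and the projected points are not privatized here.

```python
import numpy as np


def centered_covariance(X):
    """M = 1/(n-1) * sum_i (X_i - X_bar)(X_i - X_bar)^T, with rows X_i of X."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (n - 1)


def noisy_top_subspace(M, d_prime, noise_scale, rng):
    """Top-d' eigenvectors of M plus a symmetric Gaussian perturbation.

    ASSUMPTION: noise_scale is an illustrative knob. It is NOT calibrated
    to any (epsilon, delta) guarantee, so this function is not private.
    """
    d = M.shape[0]
    G = rng.normal(scale=noise_scale, size=(d, d))
    A = np.triu(G) + np.triu(G, 1).T  # symmetrize the noise matrix
    _, eigvecs = np.linalg.eigh(M + A)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d_prime]  # columns for the d' largest


def tail_eigenvalue_sum(M, d_prime):
    """sum_{i > d'} sigma_i(M), eigenvalues sorted in decreasing order."""
    eigvals = np.linalg.eigvalsh(M)[::-1]
    return float(np.sum(eigvals[d_prime:]))


# Toy data: n points in d dimensions lying close to a d'-dimensional subspace.
rng = np.random.default_rng(0)
n, d, d_prime = 500, 20, 3
latent = rng.normal(size=(n, d_prime))
basis = np.linalg.qr(rng.normal(size=(d, d_prime)))[0]
X = latent @ basis.T + 0.01 * rng.normal(size=(n, d))

M = centered_covariance(X)
V = noisy_top_subspace(M, d_prime, noise_scale=0.01, rng=rng)
X_low = (X - X.mean(axis=0)) @ V  # low-dimensional coordinates, shape (n, d')
tail = tail_eigenvalue_sum(M, d_prime)

print("projected shape:", X_low.shape)
print("tail eigenvalue sum:", tail)
```

When the data concentrate near a \(d'\)-dimensional subspace, as in this toy example, the tail sum is small, so the first term of the bound is small and the remaining error is governed by \((\varepsilon n)^{-1/d'}\), which depends on \(d'\) rather than on the ambient dimension.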

Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

PrivSyn: Differentially Private Data Synthesis

Differentially Private Synthetic Data with Private Density Estimation

Online Differentially Private Synthetic Data Generation

Locally differentially private high-dimensional data synthesis

Differentially Private Synthetic Data Using KD-Trees

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Differentially Private Synthetic High-dimensional Tabular Stream

Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders

Differentially Private Data Generation with Missing Data

Differentially Private Synthetic Heavy-tailed Data

Assessment of differentially private synthetic data for utility and fairness in end-to-end machine learning pipelines for tabular data

Private measures, random walks, and synthetic data

Differentially Private Non Parametric Copulas: Generating synthetic data with non parametric copulas under privacy guarantees

DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms

Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains

Differentially Private Algorithms for Synthetic Power System Datasets