Abstract: Differential privacy is a mathematical concept that provides an information-theoretic security guarantee. While differential privacy has emerged as a de facto standard for guaranteeing privacy in data sharing, the known mechanisms to achieve it come with some serious limitations. Utility guarantees are usually provided only for a fixed, a priori specified set of queries. Moreover, there are no utility guarantees for more complex, but very common, machine learning tasks such as clustering or classification. In this paper we overcome some of these limitations. Working with metric privacy, a powerful generalization of differential privacy, we develop a polynomial-time algorithm that creates a private measure from a data set. This private measure allows us to efficiently construct private synthetic data that are accurate for a wide range of statistical analysis tools. Moreover, we prove an asymptotically sharp min-max result for private measures and synthetic data in general compact metric spaces, for any fixed privacy budget bounded away from zero. A key ingredient in our construction is a new superregular random walk, whose joint distribution of steps is as regular as that of independent random variables, yet which deviates from the origin logarithmically slowly.

What problem does this paper attempt to address?

The paper addresses how to efficiently generate synthetic data that stays accurate for a wide range of statistical analysis tools while preserving privacy. Specifically, the authors aim to overcome the following limitations of existing methods under the Differential Privacy (DP) framework:

1. **Query limitations**: Existing differential privacy mechanisms usually provide utility guarantees only for a predefined set of queries, so either differential privacy must be used interactively or the queries must be specified in advance.
2. **Utility for complex tasks**: For more complex machine learning tasks (such as clustering or classification), existing differential privacy mechanisms offer no utility guarantees.
3. **Trade-off between privacy and utility**: Differential privacy can impose an unfavorable trade-off between privacy protection and data utility, leaving the released data with too little utility for many applications.

To address these problems, the authors work with Metric Privacy, a powerful generalization of differential privacy. They develop a polynomial-time algorithm that creates a private measure from a data set. This private measure can be used to efficiently construct synthetic data that is accurate for a wide range of statistical analysis tools. In addition, the authors prove an asymptotically sharp minimax result for private measures and synthetic data in general compact metric spaces, valid for any fixed privacy budget \(\epsilon\) bounded away from zero. A key ingredient is a new superregular random walk, whose joint distribution of steps is as regular as that of independent random variables, yet which deviates from the origin only logarithmically slowly.
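For reference, the two privacy notions named above can be written side by side. This is a sketch in standard notation rather than a quotation from the paper: the symbols \(\mathcal{M}\) (the randomized mechanism), \(\rho\) (a metric on the space of inputs), \(\alpha\) (the privacy parameter of metric privacy), and \(d_H\) (the Hamming distance between data sets) are assumptions introduced here for illustration.

```latex
% epsilon-differential privacy: for all data sets X, X' that differ in a
% single record, and all measurable sets S of outputs,
\[
  \Pr[\mathcal{M}(X) \in S] \;\le\; e^{\epsilon}\, \Pr[\mathcal{M}(X') \in S].
\]

% alpha-metric privacy with respect to a metric \rho: for ALL pairs of
% inputs X, X' and all measurable sets S of outputs,
\[
  \Pr[\mathcal{M}(X) \in S] \;\le\; e^{\alpha\, \rho(X, X')}\, \Pr[\mathcal{M}(X') \in S].
\]

% Taking \rho = d_H (the number of records in which X and X' differ)
% and \alpha = \epsilon recovers differential privacy:
% d_H(X, X') = 1 gives the first inequality, and chaining it along a
% path of single-record changes gives the second.
```

In this form the sense in which metric privacy generalizes differential privacy is visible directly: the guarantee degrades smoothly with the distance between inputs instead of being stated only for neighboring data sets.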
With these tools, the authors aim to generate synthetic data with high utility while maintaining privacy, striking a better balance between data sharing and privacy protection.
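The pipeline described above (data set, then private measure, then synthetic data) can be illustrated with a much simpler baseline. The sketch below is not the paper's algorithm: it assumes data in the interval [0, 1], builds a histogram, and perturbs the bin counts with independent Laplace noise instead of the superregular random walk. All function names and parameters (`private_histogram_measure`, `sample_synthetic`, `wasserstein1`, `bins`, `epsilon`) are hypothetical and chosen only for this example.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram_measure(data, bins, epsilon, rng):
    """Return a noisy probability vector over `bins` equal cells of [0, 1]."""
    counts = [0] * bins
    for x in data:
        counts[min(int(x * bins), bins - 1)] += 1
    # Replacing one record changes two counts by 1 each (L1 sensitivity 2),
    # so Laplace noise of scale 2/epsilon per bin gives epsilon-DP.
    noisy = [max(c + laplace(2.0 / epsilon, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        # Degenerate case: fall back to the uniform measure.
        return [1.0 / bins] * bins
    # Clipping and normalizing are post-processing and keep the guarantee.
    return [w / total for w in noisy]


def sample_synthetic(weights, m, rng):
    """Draw m synthetic points: pick a cell by weight, then a uniform point in it."""
    bins = len(weights)
    cells = rng.choices(range(bins), weights=weights, k=m)
    return [(i + rng.random()) / bins for i in cells]


def wasserstein1(xs, ys):
    """1-Wasserstein distance between two equal-size samples on the line."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


rng = random.Random(0)
data = [rng.betavariate(2, 5) for _ in range(2000)]
weights = private_histogram_measure(data, bins=20, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(weights, len(data), rng)
print("W1(data, synthetic) =", wasserstein1(data, synthetic))
```

Because the synthetic points are close to the original data in Wasserstein distance, any Lipschitz statistic computed on them is close to its value on the real data, which is the kind of query-independent utility the summary refers to. With independent noise, the accumulated error across cells grows with the number of cells; the role of the superregular random walk in the paper is to keep such partial sums of noise small.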

Private measures, random walks, and synthetic data

Metric geometry of the privacy-utility tradeoff

PrivSyn: Differentially Private Data Synthesis

Differentially Private Synthetic Data with Private Density Estimation

Online Differentially Private Synthetic Data Generation

Differential Privacy on Finite Computers

User-Driven Synthetic Dataset Generation with Quantifiable Differential Privacy

pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity

Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

Differentially Private Synthetic Heavy-tailed Data

Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms

Additive Noise Mechanisms for Making Randomized Approximation Algorithms Differentially Private

Statistical Inference in the Differential Privacy Model

Minimax Rates of Estimating Approximate Differential Privacy

Identification and Formal Privacy Guarantees

Not All Attributes are Created Equal: $d_{\mathcal{X}}$-Private Mechanisms for Linear Queries

Optimal error of query sets under the differentially-private matrix mechanism

Differentially Private Synthetic Data Using KD-Trees

A Statistical Framework for Differential Privacy

Constrained Differential Privacy for Count Data