Abstract: Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS): a method that ensures element-level differential privacy while retaining, to the extent possible, the utility of a respective non-private weighted sample. PWS maximizes the reporting probabilities of keys and estimation quality of a broad family of statistics. PWS improves over the state of the art also for the well-studied special case of private histograms, when no sampling is performed. We empirically demonstrate significant performance gains compared with prior baselines: 20%-300% increase in key reporting for common Zipfian frequency distributions and accuracy for $2\times$-$8\times$ lower frequencies in estimation tasks. Moreover, PWS is applied as a simple post-processing of a non-private sample, without requiring the original data. This allows for seamless integration with existing implementations of non-private schemes and retaining the efficiency of schemes designed for resource-constrained settings such as massive distributed or streamed data. We believe that due to practicality and performance, PWS may become a method of choice in applications where privacy is desired.
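
For reference, the privacy notion named in the abstract can be written out as follows. This is the standard $(\varepsilon,\delta)$ definition of differential privacy, stated under the assumption that "element-level" means two datasets $D$ and $D'$ are neighbors when they differ in a single element; the symbols $M$, $D$, $D'$, $S$, $\varepsilon$ and $\delta$ are notation chosen here, not taken from the abstract. A randomized mechanism $M$ satisfies the definition if, for all neighboring $D, D'$ and every set of outputs $S$,

$$\Pr[M(D) \in S] \le e^{\varepsilon}\,\Pr[M(D') \in S] + \delta .$$

Because one element changes the frequency of exactly one key by one, the guarantee limits how much any single unit of frequency can influence which keys are reported.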

What problem does this paper attempt to address?

This paper addresses how to perform weighted sampling efficiently under a privacy guarantee while preserving the key features and statistical information of the dataset. Specifically, it proposes a new method, Private Weighted Sampling (PWS), that aims to maximize the reporting probability of keys and the quality of estimates while ensuring element-level differential privacy.

### Problem Background

In big-data analysis, datasets often consist of elements with keys, which are aggregated into pairs of a key and its frequency. Examples include products and their number of transactions in transaction data, or query strings and their number of requests in search logs. Weighted sampling is a commonly used technique: by selecting representative keys, it produces a sparse representation of the dataset that supports approximate evaluation of various statistical queries.

### Privacy Challenges

With growing awareness of data privacy, conducting data analysis effectively while protecting individual data has become an important issue. Differential Privacy (DP) is a powerful privacy-protection technology and is widely regarded as the gold standard for privacy-preserving data analysis. However, traditional differential privacy methods often cause a significant loss of utility when applied to weighted sampling.

### Paper Goals

The goal of the paper is to design a private weighted sampling method (PWS) that preserves as much of the utility of non-private weighted sampling as possible while ensuring differential privacy. Specifically, PWS needs to meet two main goals:

1. **Maximizing Reporting Probability**: The reporting probability of each key should be as close as possible to that of non-private weighted sampling.
2. **Maximizing Utility**: The privatized samples should preserve the statistical utility of non-private samples as much as possible and support accurate estimation of various statistical queries.

### Main Contributions

1. **Practicality**: PWS can be used as a post-processing step for existing non-private weighted sampling algorithms without revisiting the original dataset, and is suitable for large-scale distributed or streaming data processing scenarios.
2. **End-to-End Privacy Analysis**: PWS fully utilizes the privacy gain of random sampling by formulating precise end-to-end privacy constraints, thus improving utility.
3. **Optimal Reporting Probability**: PWS maximizes the reporting probability of each key; this probability depends on the privacy parameters, the key's frequency, and the sampling rate.
4. **Estimation of Linear Statistics**: PWS provides a biased but low-variance estimator for frequency-based linear statistics.
5. **Estimation of Ordinal Statistics**: Among all differentially private sanitization methods, PWS is optimal for a wide range of ordinal statistics (such as approximate quantiles and top-k sets), maximizing the consistency probability of key pairs and the expected Kendall-τ rank correlation.
6. **Performance Improvement**: Compared with existing baseline methods, PWS shows significant advantages in reporting probability and in estimation tasks in the low-frequency region, especially on datasets with long-tail distributions.

### Related Work

The paper discusses existing work related to differential privacy, including the sub-optimality of the Laplace mechanism, the optimal estimator under pure differential privacy, and the optimization of reporting probability at different frequencies. PWS improves on prior work in these respects.

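The non-private workflow that PWS is described as post-processing (aggregate elements into key and frequency pairs, draw a weighted sample of keys, then estimate a linear statistic from the sample) can be sketched as follows. This is a minimal illustration, assuming one common scheme (independent threshold sampling with inclusion probability `min(1, freq / tau)` and inverse-probability estimation); it does not implement PWS or any privacy mechanism, and the names `aggregate`, `inclusion_probability`, `weighted_sample`, `estimate_sum`, and the parameter `tau` are hypothetical.

```python
import random
from collections import Counter


def aggregate(elements):
    """Collapse a sequence of keyed elements into {key: frequency}."""
    return Counter(elements)


def inclusion_probability(freq, tau):
    """Assumed scheme: include a key with probability min(1, freq / tau)."""
    return min(1.0, freq / tau)


def weighted_sample(freqs, tau, rng):
    """Independently keep each key with its frequency-dependent probability."""
    sample = {}
    for key, freq in freqs.items():
        if rng.random() < inclusion_probability(freq, tau):
            sample[key] = freq
    return sample


def estimate_sum(sample, tau, predicate=lambda key: True):
    """Inverse-probability estimate of the total frequency of matching keys."""
    return sum(
        freq / inclusion_probability(freq, tau)
        for key, freq in sample.items()
        if predicate(key)
    )


# Toy data: one frequent key and two rare ones (illustrative values only).
elements = ["apple"] * 50 + ["pear"] * 5 + ["kiwi"] * 1
freqs = aggregate(elements)
sample = weighted_sample(freqs, tau=10, rng=random.Random(0))
print(sample, estimate_sum(sample, tau=10))
```

In this sketch a key whose frequency is at least `tau` is always kept, while rarer keys are kept only sometimes and are scaled up when they do appear. Those rare keys are the low-frequency region where, per the summary above, a private method has the most to gain or lose in reporting probability.
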
### Summary

In conclusion, this paper proposes a private weighted sampling method (PWS) that preserves as much as possible of the key features and statistical information of the dataset while ensuring differential privacy. PWS is practical to deploy and offers performance advantages, especially when dealing with large-scale distributed or streaming data.

Differentially Private Weighted Sampling

Differentially Private Histogram Publication for Dynamic Datasets: an Adaptive Sampling Approach.

UPA: an Automated, Accurate and Efficient Differentially Private Big-Data Mining System

Differentially Private Sampling from Distributions

Differential Privacy Via Weighted Sampling Set Cover

Differentially Private Finite Population Estimation via Survey Weight Regularization

Private sampling: a noiseless approach for generating differentially private synthetic data

DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows (Technical Report)

Differentially Private Synthetic Data with Private Density Estimation

Privately Answering Queries on Skewed Data via Per-Record Differential Privacy

Differentially Private Verification of Survey-Weighted Estimates

Personalized Privacy Amplification via Importance Sampling

Differentially private anonymized histograms

Federated Heavy Hitters Discovery with Differential Privacy

Private measures, random walks, and synthetic data

Improved Pan-Private Stream Density Estimation

Efficient and Privacy-Preserving Weighted Range Set Sampling in Cloud

Private Synthetic Data Generation in Small Memory

Fair and Differentially Private Distributed Frequency Estimation

Distributed Differential Privacy By Sampling