Private Synthetic Data Generation in Small Memory

Rayne Holland, Seyit Camtepe, Chandra Thapa, Jason Xue
2024-12-13
Abstract: Protecting sensitive information on data streams is a critical challenge for modern systems. Current approaches to privacy in data streams follow two strategies. The first transforms the stream into a private sequence, enabling the use of non-private analyses but incurring high memory costs. The second uses compact data structures to create private summaries but restricts flexibility to predefined queries. To address these limitations, we propose $\textsf{PrivHP}$, a lightweight synthetic data generator that ensures differential privacy while being resource-efficient. $\textsf{PrivHP}$ generates private synthetic data that preserves the input stream's distribution, allowing flexible downstream analyses without additional privacy costs. It leverages a hierarchical decomposition of the domain, pruning low-frequency subdomains while preserving high-frequency ones in a privacy-preserving manner. To achieve memory efficiency in streaming contexts, $\textsf{PrivHP}$ uses private sketches to estimate subdomain frequencies without accessing the full dataset. $\textsf{PrivHP}$ is parameterized by a privacy budget $\varepsilon$, a pruning parameter $k$ and the sketch width $w$. It can process a dataset of size $n$ in $\mathcal{O}((w+k)\log (\varepsilon n))$ space, $\mathcal{O}(\log (\varepsilon n))$ update time, and outputs a private synthetic data generator in $\mathcal{O}(k\log k\log (\varepsilon n))$ time. Prior methods require $\Omega(n)$ space and construction time. Our evaluation uses the expected 1-Wasserstein distance between the sampler and the empirical distribution. Compared to state-of-the-art methods, we demonstrate that the additional cost in utility is inversely proportional to $k$ and $w$. This represents the first meaningful trade-off between performance and utility for private synthetic data generation.
Cryptography and Security, Data Structures and Algorithms
### What problem does this paper attempt to solve?

This paper addresses the problem of generating privacy-protected synthetic data in resource-constrained environments, in particular with small memory. Specifically, it asks how to guarantee data privacy and keep memory usage low while preserving the flexibility of data-stream analysis.

#### 1. **Limitations of existing methods**

- **High memory cost**: One class of methods provides privacy by transforming the data stream into a sequence of private values. This permits non-private downstream analyses but requires a large amount of memory, usually proportional to the size of the dataset.
- **Limited query flexibility**: The other class uses compact data structures to produce a private summary of the data stream, but these structures only support predefined queries.

#### 2. **The proposed new method**

To overcome these limitations, the authors propose a lightweight synthetic data generator, **PrivHP (Private Hot Partition)**, which provides differential privacy guarantees in a resource-efficient manner. Its main features are:

- **Differential privacy**: The generated synthetic data retains the distribution of the original data stream and can be used for flexible downstream analysis without incurring additional privacy costs.
- **Hierarchical decomposition**: PrivHP uses a hierarchical decomposition of the input domain, pruning low-frequency subdomains while retaining high-frequency ones in a privacy-preserving manner.
- **Memory efficiency**: Private sketches estimate subdomain frequencies without access to the entire dataset, which keeps memory usage low in the streaming setting.

#### 3. **Performance and utility trade-offs**

PrivHP is parameterized by a privacy budget \(\varepsilon\), a pruning parameter \(k\), and a sketch width \(w\). It processes a dataset of size \(n\) in \(O((w + k)\log(\varepsilon n))\) space with \(O(\log(\varepsilon n))\) update time, and outputs an \(\varepsilon\)-differentially private synthetic data generator in \(O(k\log k\log(\varepsilon n))\) time. Prior methods require \(\Omega(n)\) space and construction time. The authors describe this as the first meaningful trade-off between performance and utility for private synthetic data generation.

#### 4. **Utility measurement**

The utility of PrivHP is evaluated by the expected 1-Wasserstein distance between the sampler and the empirical distribution of the input. Compared to state-of-the-art methods, the authors show that the additional cost in utility is inversely proportional to \(k\) and \(w\), so utility can be traded against memory in low-memory environments.

### Summary

This paper proposes PrivHP, a method that efficiently generates synthetic data with differential privacy guarantees in resource-constrained environments. By combining hierarchical decomposition with private sketches, PrivHP addresses the shortcomings of existing methods in memory usage and query flexibility while balancing privacy protection against utility.
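The hierarchical decomposition and pruning idea described above can be illustrated with a small toy sketch. This is not the authors' algorithm: it assumes a one-dimensional domain \([0, 1)\) split into dyadic cells, uses exact per-level counts where PrivHP would use private sketches, splits the privacy budget evenly across levels, and simply discards pruned mass. The names `laplace`, `build_pruned_partition`, and `sample_synthetic` are hypothetical and chosen only for illustration.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def build_pruned_partition(data, depth, k, epsilon, rng):
    """Return {leaf_index: noisy_count} for at most k dyadic cells of [0, 1).

    Assumptions (not from the paper): depth >= 1, values lie in [0, 1),
    exact counts replace the private sketches, and each level receives
    Laplace noise of scale depth / epsilon (even budget split).
    """
    scale = depth / epsilon
    frontier = [0]  # level 0: the whole domain is a single cell
    noisy = {}
    for level in range(1, depth + 1):
        cells = 2 ** level
        # Only children of retained cells are tracked, so at most 2k counters.
        candidates = [c for p in frontier for c in (2 * p, 2 * p + 1)]
        counts = dict.fromkeys(candidates, 0)
        for x in data:
            idx = min(int(x * cells), cells - 1)
            if idx in counts:
                counts[idx] += 1
        noisy = {c: counts[c] + laplace(scale, rng) for c in candidates}
        # Pruning: keep the k cells with the largest noisy counts.
        frontier = sorted(candidates, key=noisy.get, reverse=True)[:k]
    return {c: noisy[c] for c in frontier}


def sample_synthetic(leaves, depth, n_samples, rng):
    """Pick a retained leaf in proportion to its noisy count, then a uniform point in it."""
    cells = list(leaves)
    weights = [max(leaves[c], 0.0) for c in cells]
    if sum(weights) == 0:
        weights = [1.0] * len(cells)  # fall back to uniform over retained leaves
    width = 1.0 / 2 ** depth
    picks = rng.choices(cells, weights=weights, k=n_samples)
    return [(c + rng.random()) * width for c in picks]


if __name__ == "__main__":
    rng = random.Random(0)
    stream = [rng.betavariate(2, 8) for _ in range(5000)]
    leaves = build_pruned_partition(stream, depth=6, k=4, epsilon=1.0, rng=rng)
    synthetic = sample_synthetic(leaves, depth=6, n_samples=10, rng=rng)
    print(sorted(leaves), synthetic[:3])
```

In this toy, a larger `k` retains more cells per level and so loses less mass to pruning, at the cost of more counters. That mirrors, in spirit only, the memory-versus-utility trade-off that the summary attributes to \(k\) and \(w\).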