Abstract: Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information about individuals, collecting them directly can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-of-the-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data, not only because of increased computation and communication cost, but also because of poor data utility. In this paper, we aim to address the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the ideas of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. By combining a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high-utility synthetic data on the server side without revealing users' private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and in generating high-utility synthetic data. With a local privacy guarantee of ε = 8, machine learning models trained on the synthetic data generated by the baseline algorithm suffer an accuracy loss of 10% to 30%, whereas with our framework the accuracy loss is reduced to less than 3%, and at best to less than 1%. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.
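
To make the utility problem of LDP in high dimensions concrete, the following is a minimal sketch of k-ary (generalized) randomized response, a standard LDP mechanism for a single categorical attribute. It is an illustration only and is not claimed to be the baseline evaluated in the paper. The even split of the privacy budget across attributes, the attribute count `d = 100`, and the category count `k = 5` are assumptions chosen for the example; all function names are hypothetical.

```python
import math
import random


def grr_keep_probability(epsilon: float, k: int) -> float:
    """Probability that k-ary randomized response reports the true category."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def grr_perturb(value: int, k: int, epsilon: float, rng: random.Random) -> int:
    """Report the true category with probability p, otherwise one of the
    other k - 1 categories chosen uniformly at random."""
    if rng.random() < grr_keep_probability(epsilon, k):
        return value
    other = rng.randrange(k - 1)          # index among the k - 1 other categories
    return other if other < value else other + 1


def per_attribute_budget(total_epsilon: float, d: int) -> float:
    """Assumed strategy: split the total budget evenly over d attributes."""
    return total_epsilon / d


if __name__ == "__main__":
    total_epsilon, d, k = 8.0, 100, 5     # d and k are illustrative assumptions
    eps_each = per_attribute_budget(total_epsilon, d)
    print("keep probability with the full budget:", grr_keep_probability(total_epsilon, k))
    print("keep probability with budget / d:     ", grr_keep_probability(eps_each, k))
    rng = random.Random(0)
    print("one perturbed report:", grr_perturb(2, k, eps_each, rng))
```

Under these assumptions, each attribute receives a budget of 0.08, and the probability of reporting the true category falls to roughly 0.213, barely above the 0.2 of a uniformly random guess over five categories. This is one way to see why per-attribute perturbation loses most of its utility as the number of attributes grows.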

Towards Breaking the Curse of Dimensionality for High-Dimensional Privacy: An Extended Version

Preserving Privacy of High-Dimensional Data by l-Diverse Constrained Slicing

A divide-and-conquer approach to privacy-preserving high-dimensional big data release

A condensation approach to privacy preserving data mining

Non-linear Dimensionality Reduction for Privacy-Preserving Data Classification

A Privacy Preservation Method for High Dimensional Data Mining

Statistical Inference from High Dimensional Data

Dual Query: Practical Private Query Release for High Dimensional Data

$d_X$-Privacy for Text and the Curse of Dimensionality

Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect

Differentially Private Low-dimensional Synthetic Data from High-dimensional Datasets

Utility Analysis and Enhancement of LDP Mechanisms in High-Dimensional Space

Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Selecting Features by their Resilience to the Curse of Dimensionality

How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

The Opportunity in Difficulty: A Dynamic Privacy Budget Allocation Mechanism for Privacy-Preserving Multi-dimensional Data Collection

Breaking the Curse of Dimensionality with Isolation Kernel

LoHDP: Adaptive local differential privacy for high‐dimensional data publishing

A Dynamic Anonymization Privacy-Preserving Model Based on Hierarchical Sequential Three-Way Decisions

Investigating Privacy Leakage in Dimensionality Reduction Methods via Reconstruction Attack

Locally Differentially Private Multi-Dimensional Data Collection Via Haar Transform