Differential Privacy Releasing of Hierarchical Origin/Destination Data with a TopDown Approach

Fabrizio Boninsegna, Francesco Silvestri
2024-12-12
Abstract: This paper presents a novel method to generate differentially private tabular datasets for hierarchical data, with a specific focus on origin-destination (O/D) trips. The approach builds upon the TopDown algorithm, a constraint-based mechanism designed to incorporate invariant queries into tabular data, developed by the US Census. O/D hierarchical data refers to datasets representing trips between geographical areas organized in a hierarchical structure (e.g., region $\rightarrow$ province $\rightarrow$ city). The developed method is crafted to improve accuracy on queries spanning wider geographical areas that can be obtained by aggregation. Maintaining high accuracy for aggregated geographical queries is a crucial attribute of the differentially private dataset, particularly for practitioners. Furthermore, the approach is designed to minimize false-positive detections and to replicate the sparsity of the sensitive data. The key technical contributions of this paper include a novel TopDown algorithm that employs constrained optimization with Chebyshev distance minimization, with theoretical guarantees based on the maximum absolute error. Additionally, we propose a new integer optimization algorithm that significantly reduces the incidence of false positives. The effectiveness of the proposed approach is validated using both real-world and synthetic O/D datasets, demonstrating its ability to generate private data with high utility and a reduced number of false positives. We emphasize that the proposed algorithm is applicable to any tabular data with a hierarchical structure.
Data Structures and Algorithms
What problem does this paper attempt to address?
The main problem this paper addresses is how to generate hierarchical origin/destination (O/D) datasets with high accuracy and a low false-positive rate while protecting privacy. Specifically, the researchers propose a new method for processing hierarchical O/D data that preserves query accuracy at different geographical levels and minimizes the occurrence of false positives.

### Main problems:

1. **Protecting privacy**: O/D data contains personal movement patterns, which can be very sensitive. If leaked, it exposes personal habits and frequently visited places, so differential privacy (DP) is needed to protect individuals.
2. **Improving accuracy**: For aggregate queries over large geographical areas, traditional DP methods lose accuracy. This paper aims to improve the accuracy of queries over larger areas through new algorithms.
3. **Reducing false positives**: The noise added by traditional DP methods can produce spurious non-zero data points (i.e., false positives), which can mislead decision-making. The proposed method aims to significantly reduce their occurrence.

### Solution overview:

The authors introduce a new method based on the TopDown algorithm, InfTDA (Infinite TopDown Algorithm). It uses Chebyshev distance minimization for constrained optimization, so that the generated DP datasets have high accuracy and a low false-positive rate at different geographical levels. They also develop a fast integer-constrained optimization algorithm, IntOpt, to further reduce false positives.

### Key contributions:

1. **New TopDown algorithm**: Uses Chebyshev distance minimization and provides a theoretical guarantee on the maximum absolute error.
2. **Reducing false positives**: Significantly reduces the occurrence of false positives through an integer optimization algorithm.
3. **Wide applicability**: The method is not limited to O/D data; it applies to any hierarchical tabular data that can be represented by a tree structure.

### Theoretical and empirical analysis:

- The authors analyze InfTDA theoretically and prove accuracy guarantees at different geographical levels.
- Experiments on real data (such as commuting flow data in Italy) and synthetic data demonstrate the effectiveness of the method, especially in reducing false positives.

### Conclusion:

This paper proposes a novel framework based on the TopDown algorithm for generating hierarchical O/D datasets with high accuracy and a low false-positive rate. The method comes with theoretical guarantees and performs well in experiments, which makes it relevant to fields such as traffic planning and epidemiology.
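To make the idea of constrained optimization with Chebyshev distance minimization more concrete, the following is a minimal sketch of a single top-down step: the counts of the child areas are perturbed with integer noise and then mapped to non-negative integers that sum to the parent's total while minimizing the maximum absolute deviation from the noisy values. This is not the paper's InfTDA or IntOpt. The function names (`chebyshev_project`, `two_sided_geometric`, `release_children`), the binary search over the radius, the two-sided geometric noise, and the tie-breaking rule that prefers keeping zeros at zero are all assumptions made for illustration, and the privacy accounting is out of scope here.

```python
import numpy as np


def chebyshev_project(noisy, total):
    """Non-negative integers summing to `total` that minimise
    max_i |x_i - noisy_i| for an integer vector `noisy`.

    Illustrative stand-in only, not the paper's IntOpt.
    Returns (x, r) where r is the optimal Chebyshev distance.
    """
    noisy = np.asarray(noisy, dtype=np.int64)
    if noisy.size == 0 or total < 0:
        raise ValueError("need at least one child and a non-negative total")

    def bounds(r):
        # Integer values within distance r of noisy, clipped at zero.
        return np.maximum(noisy - r, 0), noisy + r

    def feasible(r):
        lo, hi = bounds(r)
        return bool(np.all(hi >= lo) and lo.sum() <= total <= hi.sum())

    # Feasibility is monotone in r, so binary search the smallest radius.
    low, high = 0, int(np.abs(noisy).max()) + int(total)
    while low < high:
        mid = (low + high) // 2
        if feasible(mid):
            high = mid
        else:
            low = mid + 1
    r = low
    lo, hi = bounds(r)

    # Start from the clipped noisy counts, then repair the sum.
    x = np.maximum(noisy, 0)
    gap = int(total - x.sum())
    if gap > 0:
        # Add mass to the largest entries first, so zeros tend to stay zero.
        for i in np.argsort(-noisy, kind="stable"):
            step = min(gap, int(hi[i] - x[i]))
            x[i] += step
            gap -= step
    elif gap < 0:
        # Remove mass from the smallest entries first.
        for i in np.argsort(noisy, kind="stable"):
            step = min(-gap, int(x[i] - lo[i]))
            x[i] -= step
            gap += step
    return x, r


def two_sided_geometric(rng, eps, size):
    """Integer noise as the difference of two geometric draws (assumed mechanism)."""
    p = 1.0 - np.exp(-eps)
    return rng.geometric(p, size) - rng.geometric(p, size)


def release_children(true_counts, parent_total, eps, rng):
    """One top-down step: perturb the children, then enforce the parent's total.

    `parent_total` is assumed to be already released (or an invariant).
    """
    noisy = np.asarray(true_counts) + two_sided_geometric(rng, eps, len(true_counts))
    return chebyshev_project(noisy, parent_total)


rng = np.random.default_rng(0)
children = [120, 0, 0, 35, 0, 5]  # made-up sparse trip counts for six child areas
released, radius = release_children(children, sum(children), eps=1.0, rng=rng)
print(released, int(released.sum()), radius)
```

In a full hierarchy, a step like this would be applied level by level, with each released child count acting as the total for its own children. Because the children always sum to the parent's value, queries over wider areas obtained by aggregation remain consistent.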