Abstract: Data sharing in today's information society poses a threat to individual privacy and organisational confidentiality. k-anonymity is a widely adopted model for preventing the owner of a record from being re-identified. By generalising and/or suppressing certain portions of the released dataset, it guarantees that every record is indistinguishable from at least k-1 other records. A key requirement of the k-anonymity problem is to minimise the information loss resulting from these data modifications. This article proposes a top-down approach to the problem. It first constructs a complete weighted graph in which each record is a vertex and the similarity between two records is the edge weight. An edge-cutting algorithm then divides the complete graph into multiple trees/components. Large components, those with more than 2k-1 vertices, are subsequently split so that every resulting component contains between k and 2k-1 vertices. Finally, generalisation is applied to the vertices in each component (i.e. equivalence class) so that all the records inside it have identical quasi-identifier values. We prove that the proposed approach runs in polynomial time and has a theoretical performance guarantee of O(k). Empirical experiments show that our approach yields substantial improvements over the baseline heuristic algorithms, as well as over the bottom-up approach with the same approximation bound O(k). Compared with the baseline bottom-up O(log k)-approximation algorithm, when the required k is smaller than 50, the top-down strategy lets our approach achieve similar information loss while using much less computing time. These results indicate that our approach is a strong choice for the k-anonymity problem when both data utility and runtime matter, especially when k is smaller than 50 and the record set is large enough that runtime must be taken into account.
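The last two steps described in the abstract (equivalence classes of between k and 2k-1 records, then generalisation so every record in a class shares the same quasi-identifier values) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the graph construction and edge cutting are replaced here by a plain sort, which is an assumption made only to keep the example short, and the names `split_sizes`, `generalise`, `anonymise` and `is_k_anonymous` are hypothetical. Quasi-identifiers are assumed to be numeric and are generalised to (min, max) ranges.

```python
from collections import Counter


def split_sizes(n, k):
    """Return group sizes that sum to n, each between k and 2k-1 (needs n >= k)."""
    if n < k:
        raise ValueError("need at least k records")
    groups = n // k                      # as many groups as possible
    base, extra = divmod(n, groups)      # spread the remainder evenly
    return [base + 1 if i < extra else base for i in range(groups)]


def generalise(records):
    """Replace each record of one equivalence class by per-attribute (min, max) ranges."""
    columns = list(zip(*records))
    ranges = tuple((min(col), max(col)) for col in columns)
    return [ranges for _ in records]


def anonymise(records, k):
    """Naive stand-in for the partitioning step: sort, chunk, then generalise."""
    ordered = sorted(records)            # assumption: sorting replaces edge cutting
    released, start = [], 0
    for size in split_sizes(len(ordered), k):
        released.extend(generalise(ordered[start:start + size]))
        start += size
    return released


def is_k_anonymous(released, k):
    """True if every released record is shared by at least k records."""
    return all(count >= k for count in Counter(released).values())


# Hypothetical records: (age, postcode) pairs used as quasi-identifiers.
records = [(25, 1000), (27, 1002), (26, 1001),
           (60, 2000), (62, 2003), (61, 2001), (63, 2002)]
released = anonymise(records, k=3)
print(is_k_anonymous(released, k=3))
```

With seven records and k = 3, the sketch forms one class of four records and one of three, both within the k to 2k-1 bound. How well the classes group similar records, and therefore how much information is lost, depends entirely on the partitioning step, which is where the approach described in the abstract does its work.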

Combining Top-Down and Bottom-Up: Scalable Sub-tree Anonymization over Big Data Using MapReduce on Cloud

A MapReduce Based Approach of Scalable Multidimensional Anonymization for Big Data Privacy Preservation on Cloud

Scalable Iterative Implementation of Mondrian for Big Data Multidimensional Anonymisation

SCAN: A Smart Application Platform for Empowering Parallelizations of Big Genomic Data Analysis in Clouds

Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud

A divide-and-conquer approach to privacy-preserving high-dimensional big data release

Parallel Fuzzy C-Means Clustering Based Big Data Anonymization Using Hadoop MapReduce

SaC-FRAPP: a scalable and cost-effective framework for privacy preservation over big data on cloud

A Top-Down Approach For Approximate Data Anonymisation

Privacy-Preserving Layer over MapReduce on Cloud

K-Anonymity for Crowdsourcing Database

MtMR: Ensuring MapReduce Computation Integrity with Merkle Tree-Based Verifications

A Dynamic Anonymization Privacy-Preserving Model Based on Hierarchical Sequential Three-Way Decisions

T-PriDO: A Tree-based Privacy-Preserving and Contextual Collaborative Online Big Data Processing System

Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering

Semi-Homogenous Generalization: Improving Homogenous Generalization for Privacy Preservation in Cloud Computing

Privacy-Preserving Hierarchical Anonymization Framework over Encrypted Data

A distributed computing model for big data anonymization in the networks

UPA: an Automated, Accurate and Efficient Differentially Private Big-Data Mining System

Privacy Preserving Subgraph Matching on Large Graphs in Cloud

Security and Privacy Aspects in MapReduce on Clouds: A Survey