Scalable Local-Recoding Anonymization using Locality Sensitive Hashing for Big Data Privacy Preservation

Xuyun Zhang, Christopher Leckie, Wanchun Dou, Jinjun Chen, Ramamohanarao Kotagiri, Zoran Salcic
DOI: https://doi.org/10.1145/2983323.2983841
2016-01-01
Abstract: While cloud computing has become an attractive platform for supporting data-intensive applications, a major obstacle to its adoption in sectors such as health and defense is the privacy risk associated with releasing data sets to third parties in the cloud for analysis. A widely adopted technique for data privacy preservation is to anonymize data via local recoding. However, most existing local-recoding techniques are either serial, or distributed without directly optimising scalability, which renders them unsuitable for data-intensive applications. In this paper, we propose a highly scalable approach to local-recoding anonymization in cloud computing, based on Locality Sensitive Hashing (LSH). Specifically, a novel semantic distance metric is presented for use with LSH to measure the similarity between two data records. LSH with the MinHash function family is then employed to divide data sets into multiple partitions while preserving similarity, so that computation can be parallelized with MapReduce. Each partition is then anonymized with a recursive agglomerative $k$-member clustering algorithm. Extensive experiments on real-life data sets show that our approach improves the scalability and time-efficiency of local-recoding anonymization by orders of magnitude over existing approaches.
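The partitioning step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each record is represented as a non-empty set of quasi-identifier tokens, it uses plain MinHash over those sets rather than the paper's semantic distance metric, and the names `minhash_signature`, `lsh_partition`, `num_hashes`, and `band_size` are hypothetical. It shows only how records with similar token sets tend to fall into the same partition, which could then be anonymized independently.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=4):
    """Return one minimum hash value per seeded hash function.

    Assumes `tokens` is a non-empty collection of strings.
    """
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_partition(records, num_hashes=4, band_size=2):
    """Group record ids by the first `band_size` entries of their signature.

    `records` maps a record id to its set of tokens (an assumed encoding,
    not the one used in the paper).
    """
    partitions = defaultdict(list)
    for rec_id, tokens in records.items():
        signature = minhash_signature(tokens, num_hashes)
        partitions[signature[:band_size]].append(rec_id)
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical records with generalized quasi-identifier values
    records = {
        "r1": {"age:30-39", "zip:3000", "job:nurse"},
        "r2": {"age:30-39", "zip:3000", "job:nurse"},
        "r3": {"age:60-69", "zip:9999", "job:pilot"},
    }
    for key, members in lsh_partition(records).items():
        print(key, members)
```

Records with identical token sets always share a signature and therefore a partition; records with overlapping sets share one only with some probability, which grows with their Jaccard similarity. A smaller `band_size` yields fewer, larger partitions, while a larger one yields more, smaller partitions.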