Scalable Iterative Implementation of Mondrian for Big Data Multidimensional Anonymisation

Xuyun Zhang, Lianyong Qi, Qiang He, Wanchun Dou
DOI: https://doi.org/10.1007/978-3-319-49145-5_31
2016-01-01
Abstract: Scalable data processing platforms built on cloud computing are becoming increasingly attractive as infrastructure for big data mining and analytics applications. However, privacy concerns remain one of the major obstacles to using public cloud platforms. In practice, data generalisation is a widely adopted anonymisation technique for preserving privacy in data publishing and sharing scenarios. Multidimensional anonymisation, a global-recoding generalisation scheme, has been a recent focus because it can balance data obfuscation against data usability. Existing approaches handle the scalability problem of multidimensional anonymisation for data sets much larger than main memory by storing data on disk at runtime, which incurs an impractical serial I/O cost. In this paper, we propose a scalable iterative multidimensional anonymisation approach for big data sets based on MapReduce, a state-of-the-art large-scale data processing paradigm. The basic idea is to partition a large data set recursively into smaller partitions using MapReduce until every partition fits in the memory of a single computing node. A tree indexing structure is proposed to support this recursive computation on MapReduce for data partitioning in multidimensional anonymisation. Experimental results on real-life data sets demonstrate that the proposed approach significantly improves the scalability and time-efficiency of multidimensional anonymisation over existing approaches, and is therefore applicable to big data applications.
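The partitioning idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: a plain Python work list stands in for the MapReduce rounds, the tree indexing structure is omitted, and the names `split`, `mondrian`, `anonymise`, `generalise`, and `memory_limit` are hypothetical. It assumes numeric quasi-identifiers, k-anonymity as the privacy model, and a Mondrian-style median cut on the attribute with the widest value range.

```python
from statistics import median_low


def split(records, k):
    """Try one Mondrian-style median cut; return (left, right) or None."""
    dims = range(len(records[0]))
    # Try quasi-identifier dimensions from widest to narrowest value range.
    by_range = sorted(
        dims,
        key=lambda d: max(r[d] for r in records) - min(r[d] for r in records),
        reverse=True,
    )
    for d in by_range:
        m = median_low(r[d] for r in records)
        left = [r for r in records if r[d] <= m]
        right = [r for r in records if r[d] > m]
        # A cut is allowed only if both sides keep at least k records.
        if len(left) >= k and len(right) >= k:
            return left, right
    return None


def mondrian(records, k):
    """In-memory recursive partitioning of one small partition."""
    cut = split(records, k)
    if cut is None:
        return [records]
    left, right = cut
    return mondrian(left, k) + mondrian(right, k)


def anonymise(records, k, memory_limit):
    """Cut oversized partitions iteratively, then finish each one in memory.

    Assumption: `memory_limit` is the number of records one node can hold.
    In a distributed setting each pass over `pending` would be one job.
    """
    pending, small = [records], []
    while pending:
        part = pending.pop()
        cut = split(part, k) if len(part) > memory_limit else None
        if cut is None:
            small.append(part)
        else:
            pending.extend(cut)
    return [group for part in small for group in mondrian(part, k)]


def generalise(group):
    """Replace each attribute by the (min, max) range of its group."""
    dims = range(len(group[0]))
    return tuple((min(r[d] for r in group), max(r[d] for r in group)) for d in dims)
```

Calling `anonymise(records, k, memory_limit)` on a list of equal-length numeric tuples returns groups of at least `k` records each (provided the input itself has at least `k` records), and `generalise` turns a group into the attribute ranges that would be published in place of the original values.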