Abstract: Big data and its applications have recently grown sharply in fields such as IoT, bioinformatics, e-commerce, and social media. The sheer volume of data poses enormous challenges to the architecture, infrastructure, and computing capacity of IT systems, so the scientific and industrial communities urgently need large-scale, robust computing systems. Because value is one of the characteristics of big data, data should be published so that analysts can extract useful patterns from it. However, data publishing may disclose individuals’ private information. Among modern parallel computing platforms, Apache Spark is a fast, in-memory computing framework for large-scale data processing that achieves high scalability through resilient distributed datasets (RDDs). Owing to its in-memory computation, it can be up to 100 times faster than Hadoop. Apache Spark is therefore one of the essential frameworks for implementing distributed methods for privacy-preserving big data publishing (PPBDP). This paper uses the RDD programming model of Apache Spark to propose an efficient parallel implementation of a new computing model for big data anonymization. The computing model has three phases of in-memory computation that address the runtime, scalability, and performance of large-scale data anonymization. It supports partition-based data clustering algorithms that preserve the λ-diversity privacy model using transformations and actions on RDDs. The authors accordingly investigate a Spark-based implementation that preserves the λ-diversity privacy model with two purpose-designed distance functions, City block and Pearson. The results provide a comprehensive guideline for researchers who wish to apply Apache Spark in their own research.
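The abstract names two distance functions and a diversity requirement but gives no formulas. The sketch below illustrates them under common textbook definitions, not the paper's own: City block distance as the sum of absolute coordinate differences, Pearson distance as one minus the Pearson correlation coefficient, and diversity as a minimum number of distinct sensitive values per cluster. All function names, the zero-variance fallback, and the distinct-value reading of the privacy model are assumptions made for illustration. It is plain Python rather than Spark code.

```python
from math import sqrt


def city_block_distance(a, b):
    """City block (Manhattan) distance between two numeric records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a, b):
    """Pearson distance, assumed here to be 1 - r (r = Pearson correlation)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Correlation is undefined for a constant record; treating the pair
        # as uncorrelated is an assumption of this sketch.
        return 1.0
    return 1.0 - cov / sqrt(var_a * var_b)


def satisfies_distinct_diversity(sensitive_values, threshold):
    """Check that a cluster holds at least `threshold` distinct sensitive values.

    This is the simple distinct-value reading of a diversity model,
    assumed here for illustration.
    """
    return len(set(sensitive_values)) >= threshold
```

In a Spark setting, functions like these would typically be applied inside RDD transformations such as `map`, with the per-cluster diversity check run after records are grouped by cluster.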
