Anonymizing Big Data Streams Using In-memory Processing: A Novel Model Based on One-time Clustering
Elham Shamsinejad, Touraj Banirostam, Mir Mohsen Pedram, Amir Masoud Rahmani
DOI: https://doi.org/10.1007/s11265-024-01920-z
2024-05-26
Journal of Signal Processing Systems
Abstract: Preserving privacy in big data is a critical challenge for data mining and data analysis. Existing methods that anonymize big data streams with k-anonymity algorithms may cause high data loss, low data quality, and identity disclosure. In this paper, we propose a novel model for anonymizing big data streams using in-memory processing. The model uses the Spark framework to parallelize the anonymization process and a one-time clustering algorithm that avoids multiple iterations and allocates the data to optimal clusters. We evaluate the performance and effectiveness of the model on a real-world dataset and compare it with three popular k-anonymity algorithms: CRUE, Mean-Shift, and DBSCAN. The results show that the model has the lowest data loss and the highest data quality across different data sizes and k-values. The model is scalable, robust, adaptable, and flexible, and it can provide better data for data mining and data analysis while protecting data privacy and preventing data disclosure.
Computer Science, Information Systems; Engineering, Electrical & Electronic
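
To make the k-anonymity idea in the abstract concrete, the following is a minimal sketch of a single-pass grouping over one window of a stream. It is not the authors' model: the function names `one_pass_k_anonymize` and `is_k_anonymous`, the use of a single numeric quasi-identifier (age), and the sort-then-chunk strategy are all assumptions chosen for illustration, and the sketch runs in plain Python rather than on Spark. Each record is assigned to a group exactly once, and every group's values are generalized to a shared range covering at least k records.

```python
from collections import Counter
from typing import List, Tuple


def one_pass_k_anonymize(ages: List[int], k: int) -> List[Tuple[int, int]]:
    """Generalize each age to a (low, high) range shared by at least k records.

    Illustrative only: records are sorted once, cut into consecutive groups
    of k, and the last group absorbs any leftover records.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if len(ages) < k:
        raise ValueError("need at least k records to form one group")

    # Indices of the records, ordered by their quasi-identifier value.
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    n_groups = len(ages) // k
    result: List[Tuple[int, int]] = [(0, 0)] * len(ages)

    for g in range(n_groups):
        start = g * k
        # The final group takes the remainder so no group is smaller than k.
        end = (g + 1) * k if g < n_groups - 1 else len(ages)
        members = order[start:end]
        low, high = ages[members[0]], ages[members[-1]]
        for i in members:
            result[i] = (low, high)
    return result


def is_k_anonymous(generalized: List[Tuple[int, int]], k: int) -> bool:
    """Check that every generalized value occurs at least k times."""
    return all(count >= k for count in Counter(generalized).values())


window = [23, 45, 31, 52, 29, 47, 38]
anonymized = one_pass_k_anonymize(window, k=3)
print(anonymized)
print(is_k_anonymous(anonymized, k=3))
```

In this sketch, wider ranges mean more information loss, which is the trade-off the abstract refers to when it compares data loss and data quality across k-values.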