SecDedoop: Secure Deduplication with Access Control of Big Data in the HDFS/Hadoop Environment

P Ramya, C Sundar
DOI: https://doi.org/10.1089/big.2019.0120
IF: 4.426
Big Data
Abstract: With the rapid growth of storage providers, data deduplication is an essential storage optimization technique that greatly reduces data storage costs by storing a single copy of duplicate data. However, deduplication introduces new challenges, such as security and insufficient storage space. Hence, in this article, we propose a secure data deduplication scheme with access control of big data over the HDFS (Hadoop Distributed File System)/Hadoop environment, called SecDedoop. First, the system ensures data confidentiality through a third-party vendor using elliptic curve cryptography. Two keys (a public key and a private key) are generated for data retrieval. Second, we consider data deduplication. The user's original file is divided into a number of equal chunks. Then, each chunk (e.g., 1.txt) is tokenized into words, and the weight of each word is computed using TF-IDF. The SHA-3 hash is computed over the user's original file. If the hash value is not a duplicate, the data are stored in HDFS. A PSO (particle swarm optimization)-based MapReduce model is proposed for best data node selection. Initially, the MapReduce process is run on the user's original file, yielding the best set of data nodes; then, PSO is applied to compute the fitness value for best data node selection. Further, we use MongoDB for fast indexing of the user's original files and apply FCM (fuzzy C-means clustering) to cluster the user's files. In this article, we use modified versions of PSO and FCM to eliminate the open issues in conventional PSO and FCM. The performance of the proposed SecDedoop has been evaluated using various performance metrics, and the results show that it outperforms previous approaches.
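The hash-based duplicate check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which SHA-3 variant is used, so SHA3-256 is an assumption here, and the in-memory dictionaries are hypothetical stand-ins for HDFS storage and the MongoDB index. The names `DedupStore` and `split_into_chunks` are invented for illustration only.

```python
import hashlib
from typing import Dict, List


def split_into_chunks(data: bytes, num_chunks: int) -> List[bytes]:
    """Split data into at most num_chunks pieces of equal size
    (the last piece may be shorter)."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    size = max(1, -(-len(data) // num_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy file-level deduplicating store keyed by a SHA-3 digest."""

    def __init__(self) -> None:
        # Assumption: a dict stands in for HDFS block storage.
        self._blobs: Dict[str, bytes] = {}
        # Assumption: a dict stands in for the file index (MongoDB in the paper).
        self._index: Dict[str, str] = {}

    def put(self, name: str, data: bytes) -> bool:
        """Store a file; return True if new content was written,
        False if the content was already present (a duplicate)."""
        digest = hashlib.sha3_256(data).hexdigest()  # SHA3-256 is assumed
        is_new = digest not in self._blobs
        if is_new:
            self._blobs[digest] = data
        # Either way, the file name now points at the single stored copy.
        self._index[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Return the content registered under the given file name."""
        return self._blobs[self._index[name]]

    def stored_copies(self) -> int:
        """Number of unique contents physically stored."""
        return len(self._blobs)
```

In this sketch, uploading the same content under two names stores it once and records two index entries, which is the storage saving that deduplication targets. Encryption, access control, TF-IDF weighting, and PSO-based data node selection are deliberately left out.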