Abstract: With the rapid growth of storage providers, data deduplication has become an essential storage optimization technique that greatly reduces storage costs by keeping a single copy of duplicate data. However, deduplication introduces new challenges, such as security and insufficient storage space. In this article, we therefore propose SecDedoop, a secure data deduplication scheme with access control for big data in the HDFS (Hadoop Distributed File System)/Hadoop environment. First, the system ensures data confidentiality through a third-party vendor using elliptic curve cryptography; two keys (a public key and a private key) are generated for data retrieval. Second, we consider data deduplication. The user's original file is divided into a number of equal chunks. Each chunk (e.g., 1.txt) is then tokenized into words, and the weight of each word is computed using TF-IDF. A SHA-3 hash is computed over the user's original file; if the hash value is not a duplicate, the data is stored in HDFS. A PSO (particle swarm optimization)-based MapReduce model is proposed for selecting the best data node: the MapReduce process is first run for the user's original file and yields the best set of data nodes, and PSO is then applied to compute the fitness value for best data node selection. Further, we use MongoDB for fast indexing of the user's original files and apply FCM (fuzzy C-means clustering) to cluster the user's files. We use modified versions of PSO and FCM to address the open issues in conventional PSO and FCM. The performance of the proposed SecDedoop has been evaluated using various performance metrics, and it is shown to outperform previous approaches.
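
The deduplication steps named in the abstract (equal chunking, TF-IDF word weighting, and a SHA-3 duplicate check before storage) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `split_into_chunks`, `tfidf_weights`, and `DedupStore` are hypothetical, the SHA3-256 variant is an assumption (the abstract only says SHA-3), whitespace tokenization is an assumption, and an in-memory dictionary stands in for HDFS and the MongoDB index.

```python
import hashlib
import math
from collections import Counter


def split_into_chunks(data: bytes, num_chunks: int) -> list[bytes]:
    """Split data into at most num_chunks pieces of equal size (last may be shorter)."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    size = -(-len(data) // num_chunks) or 1  # ceiling division, at least 1 byte
    return [data[i:i + size] for i in range(0, len(data), size)]


def tfidf_weights(chunk_texts: list[str]) -> list[dict[str, float]]:
    """Compute a basic TF-IDF weight per word, treating each chunk as a document."""
    docs = [text.lower().split() for text in chunk_texts]  # assumed tokenizer
    doc_freq = Counter(word for doc in docs for word in set(doc))
    n_docs = len(docs)
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


class DedupStore:
    """In-memory stand-in for HDFS: stores a file only if its hash is new."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> tuple[str, bool]:
        """Return (digest, stored); stored is False when data is a duplicate."""
        digest = hashlib.sha3_256(data).hexdigest()  # SHA3-256 is an assumption
        if digest in self._blobs:
            return digest, False
        self._blobs[digest] = data
        return digest, True

    def __len__(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    store = DedupStore()
    original = b"big data in hadoop needs deduplication"
    print(store.put(original))   # first upload is stored
    print(store.put(original))   # second upload is detected as a duplicate
    chunks = split_into_chunks(original, 4)
    print(tfidf_weights([c.decode() for c in chunks]))
```

In this sketch the hash is taken over the whole file, as the abstract describes, so two uploads of the same file map to one stored copy; the chunk-level TF-IDF weights are computed separately and do not affect the duplicate check.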

SecDedoop: Secure Deduplication with Access Control of Big Data in the HDFS/Hadoop Environment