Abstract: Data deduplication, a data reduction technique that efficiently detects and eliminates redundant data chunks and files, has been widely applied in large-scale storage systems. Most existing deduplication-based storage systems employ content-defined chunking (CDC) and secure-hash-based fingerprinting (e.g., SHA1) to remove redundant data at the chunk level (e.g., 4 KB/8 KB chunks); both operations are extremely compute-intensive and thus time-consuming for storage systems. We therefore present P-Dedupe, a pipelined and parallelized data deduplication system that accelerates the deduplication process by dividing it into four stages (i.e., chunking, fingerprinting, indexing, and writing), pipelining these four stages with chunks and files (the processing data units for deduplication), and then parallelizing the CDC and secure-hash-based fingerprinting stages to further alleviate the computation bottleneck. More importantly, to efficiently parallelize CDC under the requirements of both maximal and minimal chunk sizes, and inspired by the MapReduce model, we first split the data stream into several segments (i.e., “Map”), each of which runs CDC in parallel in an independent thread, and then re-chunk and join the boundaries of these segments (i.e., “Reduce”) to ensure the chunking effectiveness of parallelized CDC. Experimental results of P-Dedupe with eight datasets on a quad-core Intel i7 processor suggest that P-Dedupe is able to accelerate the deduplication throughput near-linearly by exploiting parallelism in the CDC-based deduplication process, at the cost of only a 0.02% decrease in the deduplication ratio. Our work contributes to big data science by ensuring that all files pass through the deduplication process quickly and thoroughly, so that the same file is processed and analyzed only once rather than multiple times.
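The "Map"/"Reduce" idea for parallel CDC described above can be illustrated with a minimal Python sketch. It is not P-Dedupe's implementation. The chunk-size limits, the cut mask, the simple Gear-style rolling hash, and the names `cdc` and `parallel_cdc` are all assumptions chosen for brevity, and the boundary step here simply re-chunks the last chunk of one segment together with the first chunk of the next. Threads are used only to show the structure; under CPython's global interpreter lock they will not give the speedup that native threads would.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List

MIN_SIZE, MAX_SIZE = 64, 1024   # illustrative chunk-size limits (assumed, not from the paper)
CUT_MASK = (1 << 8) - 1         # a cut is declared when the low 8 hash bits are zero (assumed)
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # fixed table for the rolling hash


def cdc(data: bytes) -> List[bytes]:
    """Sequential content-defined chunking with minimal and maximal chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style rolling hash: a stand-in, not necessarily the hash P-Dedupe uses.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & CUT_MASK) == 0) or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than MIN_SIZE
    return chunks


def parallel_cdc(data: bytes, workers: int = 4) -> List[bytes]:
    """Split into segments, chunk each in its own thread, then repair the boundaries."""
    seg_len = max(MAX_SIZE, -(-len(data) // workers))  # ceiling division
    segments = [data[i:i + seg_len] for i in range(0, len(data), seg_len)]

    # "Map": every segment is chunked independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_segment = list(pool.map(cdc, segments))

    # "Reduce": re-chunk the region that straddles each segment boundary.
    merged: List[bytes] = []
    for seg_chunks in per_segment:
        if merged and seg_chunks:
            boundary = merged.pop() + seg_chunks[0]
            merged.extend(cdc(boundary))
            merged.extend(seg_chunks[1:])
        else:
            merged.extend(seg_chunks)
    return merged


if __name__ == "__main__":
    rng = random.Random(1)
    stream = bytes(rng.getrandbits(8) for _ in range(20000))
    print(len(cdc(stream)), "sequential chunks;", len(parallel_cdc(stream)), "parallel chunks")
```

In this sketch the parallel result always reassembles to the original stream and never exceeds `MAX_SIZE`, but the last chunk of a re-chunked boundary region can fall below `MIN_SIZE`, and cut points near boundaries may differ from the sequential ones. That kind of difference is what the small loss in deduplication ratio reported above refers to.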
