Abstract: Post-deduplication delta compression is a data reduction technique that calculates and stores the differences between very similar but non-duplicate chunks in storage systems, and it can achieve a very high compression ratio. However, widely used resemblance detection approaches (e.g., N-Transform) have low throughput because of their high computational overhead, so they usually become the bottleneck of delta compression systems. This overhead consists mainly of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms to all of the calculated rolling hash values. In this paper, we propose Odess, a fast and lightweight resemblance detection approach that greatly reduces this computational overhead while achieving high detection accuracy and a high compression ratio. Specifically, Odess first uses a novel Subwindow-based Parallel Rolling hash method (called SWPR), built on SIMD (i.e., Single Instruction Multiple Data [1]), to accelerate the calculation of rolling hashes (the first part of the overhead). Moreover, Odess uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set, and then applies the transforms only to this small hash set for resemblance detection (the second part of the overhead). Evaluation results show that, in the resemblance detection stage, Odess is about 31.4× and 7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. In an end-to-end data reduction storage system, the throughput of the Odess-based system is about 3.20× and 1.41× higher than that of the N-Transform-based and Finesse-based systems, respectively, while maintaining the high compression ratio of N-Transform and achieving a roughly 1.22× higher compression ratio than Finesse.
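To make the second idea concrete, the following is a minimal sketch of content-defined sampling followed by transforms on the sampled hashes only. It is not the authors' implementation. The abstract does not specify the hash function or the parameters, so everything here is an assumption for illustration: the Gear-style rolling hash, the random table `GEAR`, the sampling mask `SAMPLE_MASK` (about 1 in 128 positions), the number of features `NUM_FEATURES`, the linear transforms `MULTS`/`ADDS`, and the use of the maximum transformed value as a feature. SIMD acceleration (SWPR) is not shown.

```python
import random

MASK64 = (1 << 64) - 1

# All constants below are hypothetical values chosen for illustration.
_rng = random.Random(42)
GEAR = [_rng.getrandbits(64) for _ in range(256)]   # per-byte random table
NUM_FEATURES = 12                                   # number of transforms
MULTS = [_rng.getrandbits(64) | 1 for _ in range(NUM_FEATURES)]
ADDS = [_rng.getrandbits(64) for _ in range(NUM_FEATURES)]
SAMPLE_MASK = 0x7F << 57   # 7 high bits: keeps roughly 1 in 128 hashes


def rolling_hashes(chunk: bytes):
    """Yield a Gear-style rolling hash after each byte of the chunk.

    Each step shifts the hash left by one bit, so a byte's contribution
    disappears from the 64-bit value after 64 further bytes.
    """
    fp = 0
    for b in chunk:
        fp = ((fp << 1) + GEAR[b]) & MASK64
        yield fp


def features(chunk: bytes, sample_mask: int = SAMPLE_MASK):
    """Return (feature list, number of sampled hashes) for one chunk.

    A hash is sampled only when its masked bits are all zero, so the
    choice depends on content rather than on position. The linear
    transforms run only on sampled hashes, and each feature is the
    maximum transformed value seen for that transform.
    """
    feats = [0] * NUM_FEATURES
    sampled = 0
    for fp in rolling_hashes(chunk):
        if fp & sample_mask:
            continue                      # not in the proxy hash set
        sampled += 1
        for i in range(NUM_FEATURES):
            t = (MULTS[i] * fp + ADDS[i]) & MASK64
            if t > feats[i]:
                feats[i] = t
    return feats, sampled
```

Under these assumptions, two chunks that share most of their content tend to sample the same hashes and therefore produce many identical features, while the per-byte transform cost drops in proportion to the sampling rate. Passing `sample_mask=0` disables sampling and applies the transforms to every rolling hash, which corresponds to the costlier approach the abstract describes as the second part of the overhead.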

DARE: A Deduplication-Aware Resemblance Detection and Elimination Scheme for Data Reduction with Low Overheads

Combining Deduplication and Delta Compression to Achieve Low-Overhead Data Reduction on Backup Datasets

A Delayed Container Organization Approach to Improve Restore Speed for Deduplication Systems

The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression

DARM: A Deduplication-Aware Redundancy Management Approach for Reliable-Enhanced Storage Systems

Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression

AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment

A Comprehensive Study of the Past, Present, and Future of Data Deduplication

A Novel Data Redundancy Scheme for De-Duplication Storage System

DEC: An Efficient Deduplication-Enhanced Compression Approach

P-Dedupe: Exploiting Parallelism in Data Deduplication System

SAFE: A Source Deduplication Framework for Efficient Cloud Backup Services

A Multi‐feature‐based Intelligent Redundancy Elimination Scheme for Cloud‐assisted Health Systems

Odess: Speeding up Resemblance Detection for Redundancy Elimination by Fast Content-Defined Sampling

AR-dedupe: an Efficient Deduplication Approach for Cluster Deduplication System

Accelerating Content-Defined-Chunking Based Data Deduplication by Exploiting Parallelism

Ddelta: A Deduplication-Inspired Fast Delta Compression Approach

Application-Aware Big Data Deduplication in Cloud Environment

SAUD: Semantics-Aware and Utility-Driven Deduplication Framework for Primary Storage

CDAC: Content-Driven Deduplication-Aware Storage Cache

An Intelligent Data De-duplication Based Backup System